Classification using Logistic Regression in Apache Spark MLlib with Java

Classification is the task of identifying the features of an entity and assigning the entity to one of a set of predefined classes (categories).

Logistic Regression is a model that captures the relation between a categorical variable and the corresponding features of an experiment.

Logistic refers to the logistic (sigmoid) function, which maps any real number to a value between 0 and 1 that can be read as a class probability. To apply it, identify the common features of all examples/experiments and transform every example into a feature vector.

Regression is a measure of the relation between the mean value of the output variable and the independent variables. The output is the label a problem instance is classified to, and the independent variables are the feature values.
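
To make the idea concrete, here is a minimal sketch in plain Java (no Spark required) of how a logistic regression model could turn a feature vector into a class probability and then into a label. The class name LogisticSketch, the weights, the intercept and the 0.5 threshold are illustrative assumptions, not values taken from Spark MLlib or from the example program.

```java
class LogisticSketch {

    // Logistic (sigmoid) function: maps any real number into the interval (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Weighted sum of the feature values plus an intercept, passed through the sigmoid.
    static double probability(double[] weights, double intercept, double[] features) {
        if (weights.length != features.length) {
            throw new IllegalArgumentException("weights and features must have the same length");
        }
        double z = intercept;
        for (int i = 0; i < features.length; i++) {
            z += weights[i] * features[i];
        }
        return sigmoid(z);
    }

    // Binary decision: class 1 when the probability reaches the threshold, else class 0.
    static int predict(double[] weights, double intercept, double[] features, double threshold) {
        return probability(weights, intercept, features) >= threshold ? 1 : 0;
    }

    public static void main(String[] args) {
        // Hypothetical model parameters and feature vector, chosen only for illustration.
        double[] weights = {0.8, -0.4};
        double intercept = -0.2;
        double[] features = {1.5, 2.0};

        System.out.println("probability of class 1 = " + probability(weights, intercept, features));
        System.out.println("predicted class = " + predict(weights, intercept, features, 0.5));
    }
}
```

Training a model amounts to finding the weights and intercept that best fit the labelled examples. The sketch only shows the prediction side, with the parameters fixed by hand.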

An Example for Classification using Logistic Regression in Apache Spark MLlib with Java

In this Apache Spark Tutorial, we shall look into an example, with a step-by-step explanation, of generating a Logistic Regression model for classification using Spark MLlib.

  1. Configure Spark.
  2. Start a Spark context.
  3. Load the data and split it into training and test sets. The data file used in this example is in the “data” folder of the Apache Spark distribution downloaded from the official website.
  4. Train a Logistic Regression model.
  5. Use the model to predict on the test data, and calculate accuracy.
  6. Save the trained classifier model to the local file system for future use.
  7. Stop the Spark context.
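
The accuracy in step 5 is the fraction of test examples whose predicted label equals the true label. A minimal sketch of that calculation in plain Java follows. The class name AccuracySketch and the sample arrays are hypothetical, and labels are stored as double values because MLlib represents labels as doubles.

```java
class AccuracySketch {

    // Fraction of positions where the predicted label equals the true label.
    static double accuracy(double[] predictions, double[] labels) {
        if (predictions.length != labels.length) {
            throw new IllegalArgumentException("predictions and labels must have the same length");
        }
        if (labels.length == 0) {
            throw new IllegalArgumentException("at least one example is required");
        }
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            // Class labels are whole numbers stored as doubles, so exact comparison is safe here.
            if (predictions[i] == labels[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        // Hypothetical predictions and true labels for four test examples.
        double[] predictions = {1.0, 0.0, 1.0, 1.0};
        double[] labels = {1.0, 0.0, 0.0, 1.0};
        System.out.println("accuracy = " + accuracy(predictions, labels));
    }
}
```

In a Spark program the same ratio would typically be computed over an RDD of (prediction, label) pairs rather than over arrays.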

The complete example program is given below: