Classification using Naive Bayes in Apache Spark MLlib with Java

Classification is the task of identifying the features of an entity and assigning the entity to one of a set of predefined classes (categories) based on previous knowledge.

Naive Bayes is one of the simplest methods used for classification. A Naive Bayes classifier can be built whenever the problem instances (the examples that make up the training data) can be represented as feature vectors. Its distinctive trait is that it assumes each feature contributes independently to deciding the category of a problem instance, i.e., Naive Bayes ignores any correlation between features. Although many other classifiers outperform Naive Bayes, it remains popular in the machine learning community because it requires a relatively small amount of training data to estimate the parameters needed for classification.
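
To make the independence assumption concrete, here is a minimal sketch in plain Java. It is not Spark code and not how MLlib is implemented internally. It assumes the multinomial variant of Naive Bayes with additive (Laplace) smoothing, chosen here only for illustration. The class name NaiveBayesSketch, its fields, and the toy data are hypothetical. Note how the score of a class is a sum of per-feature terms, with no term that couples two features.

```java
import java.util.Arrays;

// Hypothetical illustration class; it is not part of Spark MLlib.
class NaiveBayesSketch {
    final double[] logPrior;        // log P(class)
    final double[][] logLikelihood; // log P(feature | class)

    // Trains a multinomial Naive Bayes model with additive smoothing (lambda).
    // labels[i] is the class index (0..numClasses-1) of example i, and
    // counts[i][j] is the non-negative count of feature j in example i.
    NaiveBayesSketch(int[] labels, double[][] counts, int numClasses, double lambda) {
        int numFeatures = counts[0].length;
        double[] classCount = new double[numClasses];
        double[][] featureSum = new double[numClasses][numFeatures];
        for (int i = 0; i < labels.length; i++) {
            classCount[labels[i]]++;
            for (int j = 0; j < numFeatures; j++) {
                featureSum[labels[i]][j] += counts[i][j];
            }
        }
        logPrior = new double[numClasses];
        logLikelihood = new double[numClasses][numFeatures];
        for (int c = 0; c < numClasses; c++) {
            logPrior[c] = Math.log((classCount[c] + lambda)
                    / (labels.length + numClasses * lambda));
            double total = Arrays.stream(featureSum[c]).sum() + numFeatures * lambda;
            for (int j = 0; j < numFeatures; j++) {
                logLikelihood[c][j] = Math.log((featureSum[c][j] + lambda) / total);
            }
        }
    }

    // Scores each class as log P(class) + sum over j of x[j] * log P(feature j | class).
    // Every feature adds its own term; correlations between features are not modeled.
    int predict(double[] x) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < logPrior.length; c++) {
            double score = logPrior[c];
            for (int j = 0; j < x.length; j++) {
                score += x[j] * logLikelihood[c][j];
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up toy data: class 0 favors feature 0, class 1 favors feature 1.
        int[] labels = {0, 0, 1, 1};
        double[][] counts = {{3, 0, 1}, {2, 1, 0}, {0, 3, 1}, {1, 2, 0}};
        NaiveBayesSketch model = new NaiveBayesSketch(labels, counts, 2, 1.0);
        System.out.println(model.predict(new double[] {4, 0, 0})); // prints 0
        System.out.println(model.predict(new double[] {0, 4, 0})); // prints 1
    }
}
```

Only one prior per class and one likelihood per (class, feature) pair are estimated, which is why a small training set can be enough.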

In this Apache Spark tutorial, we shall learn to classify items using the Naive Bayes algorithm of Apache Spark MLlib in the Java programming language.

Following is a step-by-step process to build a classifier using the Naive Bayes algorithm of MLlib. You may set up a Java project with Apache Spark and follow the steps.

  1. Configure Spark.
  2. Start a Spark context.
  3. Load the data and split it into training and test sets. The data file used in this example is in the “data” folder of the Apache Spark distribution downloaded from the official website.
  4. Train a Naive Bayes model.
  5. Use the model to predict on the test data, and calculate the accuracy.
  6. Save the trained classifier model to the local file system for future use.
  7. Stop the Spark context.

Following is the complete Java program: