Apache Spark MLlib Naive Bayes Classification with Java
Classification is a task of identifying the features of an entity and assigning the entity to one of the predefined classes/categories based on previous knowledge. In Apache Spark MLlib, Naive Bayes classification works with labeled feature vectors, learns class probabilities from training data, and predicts the most likely label for new records.
Naive Bayes is one of the simplest methods used for classification. Naive Bayes Classifier could be built in scenarios where problem instances (/ examples / data set / training data) could be represented as feature vectors. And the distinctive feature of Naive Bayes is : it considers that features independently play a part in deciding the category of the problem instance i.e., Naive Bayes does not care about the correlation between features if present any. Despite the fact that many other classifiers beat out Naive Bayes, it is still sustaining in the machine learning community because it requires relatively small number of training data for estimating the parameters required for classification.
This tutorial uses the RDD-based org.apache.spark.mllib API with JavaRDD, LabeledPoint, and NaiveBayesModel. Spark also provides the DataFrame-based org.apache.spark.ml API for pipeline-style machine learning. If you are maintaining an older MLlib Java example, the code below is useful; for a new project, also check the current Spark ML Naive Bayes API for your Spark version.
In this Apache Spark Tutorial, we shall learn to classify items using Naive Bayes Algorithm of Apache Spark MLlib in Java Programming Language.
When Naive Bayes in Apache Spark MLlib Is a Good Fit
Naive Bayes is commonly used when the input can be represented as non-negative feature vectors. It is often a good baseline for text classification, document categorization, and sparse count-based features. The independence assumption is simple, but it can still give useful results when features are counts, frequencies, or binary indicators.
| Use Naive Bayes when | Be careful when |
|---|---|
| Features are non-negative counts, frequencies, or indicators | Raw features contain negative numeric values |
| You need a simple classification baseline | Feature dependencies are central to the problem |
| The data is sparse, such as document-term features | Accuracy alone hides class imbalance problems |
Apache Spark MLlib Naive Bayes Input Data Format
The example uses Spark’s sample LIBSVM file. LIBSVM is a sparse feature-vector format where each row contains a label followed by feature index-value pairs.
<label> <feature-index>:<feature-value> <feature-index>:<feature-value>
For Naive Bayes, feature values should be non-negative. If your source data contains negative values, transform the features or choose a different classifier before training the model.
Steps for Classification using Naive Bayes in Apache Spark MLlib
Following is a step by step process to build a classifier using Naive Bayes algorithm of MLLib. You may setup Java Project with Apache Spark and follow the steps.
1. Configure Spark for the Naive Bayes Java Application
Create a Spark configuration object and set the application name.
SparkConf sparkConf = new SparkConf().setAppName("NaiveBayesClassifierExample");
2. Start JavaSparkContext for MLlib RDD Operations
The RDD-based MLlib API works with Spark context and RDD objects.
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
3. Load LIBSVM Data and Split Training and Test RDDs
Load the sample data and split it into training and test RDDs. The data file used in this example is present in the folder “data” in “apache spark“, downloaded from official website.
// provide path to data transformed as [feature vectors]
String path = "data/mllib/sample_libsvm_data.txt";
JavaRDD inputData = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
// split the data for training (60%) and testing (40%)
JavaRDD[] tmp = inputData.randomSplit(new double[]{0.6, 0.4});
JavaRDD training = tmp[0]; // training set
JavaRDD test = tmp[1]; // test set
The training RDD is used to fit the model. The test RDD is kept aside to check model performance on records not used during training. If you need repeatable results, set a random seed in randomSplit.
4. Train the Apache Spark MLlib Naive Bayes Model
The NaiveBayes.train() call trains the model from the training RDD. The second argument, 1.0, is the smoothing parameter, which reduces zero-probability issues for unseen feature/class combinations.
NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
5. Predict Test Labels and Calculate Naive Bayes Accuracy
Predict a label for every test record and compare the prediction with the actual label.
JavaPairRDD<Double, Double> predictionAndLabel =
test.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
// calculate the accuracy
double accuracy =
predictionAndLabel.filter(pl -> pl._1().equals(pl._2())).count() / (double) test.count();
The accuracy formula used here is:
accuracy = number of correct predictions / number of test records
Accuracy is easy to read, but for imbalanced classes you should also check precision, recall, F1 score, and a confusion matrix.
6. Save the Trained Naive Bayes Classifier Model
Save the model if you want to reuse it later. Make sure the target directory does not already contain an incompatible saved model.
model.save(jsc.sc(), "target/tmp/myNaiveBayesModel");
7. Stop the Spark Context After the MLlib Job
Stop the Spark context to release resources after the application finishes.
jsc.stop();
Complete Java Program for Naive Bayes Classification in Spark MLlib
Following is the complete Java program.
NaiveBayesClassifierExample.java
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.SparkConf;
public class NaiveBayesClassifierExample {
public static void main(String[] args) {
// configure spark
SparkConf sparkConf = new SparkConf().setAppName("JavaNaiveBayesExample")
.setMaster("local[2]").set("spark.executor.memory","2g");
// start a spark context
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// provide path to data transformed as [feature vectors]
String path = "data/mllib/sample_libsvm_data.txt";
JavaRDD inputData = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
// split the data for training (60%) and testing (40%)
JavaRDD[] tmp = inputData.randomSplit(new double[]{0.6, 0.4});
JavaRDD training = tmp[0]; // training set
JavaRDD test = tmp[1]; // test set
// Train a Naive Bayes model
NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
// Predict for the test data using the model trained
JavaPairRDD<Double, Double> predictionAndLabel =
test.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
// calculate the accuracy
double accuracy =
predictionAndLabel.filter(pl -> pl._1().equals(pl._2())).count() / (double) test.count();
System.out.println("Accuracy is : "+accuracy);
// Save model to local for future use
model.save(jsc.sc(), "target/tmp/myNaiveBayesModel");
// stop the spark context
jsc.stop();
}
}
Output
Accuracy is : 0.975609756097561
How the Spark MLlib Naive Bayes Java Example Works
The program follows a normal supervised learning flow. It creates Spark configuration and context, loads labeled feature vectors from the LIBSVM file, splits the data into training and test RDDs, trains the model, predicts labels for the test RDD, compares predicted labels with actual labels, prints accuracy, saves the model, and stops the context.
The sample accuracy can vary if the random split changes. The displayed output is one possible result for the sample data. In a real project, evaluate the model on a test set that represents the data the model will see later.
Common Errors in Spark MLlib Naive Bayes Java Programs
- Sample data path not found: Check that
data/mllib/sample_libsvm_data.txtexists relative to the directory from which the application runs. - Negative feature values: Naive Bayes expects non-negative feature values. Transform the features or choose another classifier if the raw values can be negative.
- Model save path already exists: Spark may fail when the output path already exists. Delete the old path or use a new target path.
- Different accuracy value: Random splitting can produce a different test set. Use a seed if the same split is required.
- RDD API versus DataFrame API confusion: This example imports
org.apache.spark.mllib. Do not mix it withorg.apache.spark.mlclasses unless you are rewriting the example.
Apache Spark MLlib Naive Bayes FAQs
What is Naive Bayes in Apache Spark MLlib?
Naive Bayes in Apache Spark MLlib is a supervised classification algorithm that learns from labeled feature vectors and predicts a class label for new records. It is often used for sparse, non-negative feature data such as text counts.
Why does this Spark Naive Bayes Java example use LIBSVM data?
LIBSVM is a compact format for labeled sparse feature vectors. Spark MLlib provides MLUtils.loadLibSVMFile, so it is convenient for small classification examples and tutorials.
Can Apache Spark MLlib Naive Bayes use negative feature values?
No. Naive Bayes feature values should be non-negative. If the source data has negative values, transform the features or use a different classification algorithm.
What does the smoothing value 1.0 mean in NaiveBayes.train?
The value 1.0 is the smoothing parameter. Smoothing reduces the effect of zero counts in the training data and helps the model handle features or classes that may be missing in a subset of examples.
Should I use Spark MLlib or Spark ML for Naive Bayes?
This tutorial uses the older RDD-based Spark MLlib API. For new pipeline-style machine learning work, the DataFrame-based Spark ML API is often more convenient. Use the API that matches your Spark version and project structure.
Apache Spark MLlib Naive Bayes QA Checklist
- Confirm that the tutorial clearly identifies this as an RDD-based
org.apache.spark.mllibJava example. - Check that the LIBSVM explanation mentions labels and index-value feature pairs.
- Verify that the tutorial warns about non-negative feature values for Naive Bayes.
- Make sure the train-test split, smoothing parameter, prediction step, and accuracy formula are explained.
- Ensure new code blocks use PrismJS-compatible classes such as
language-plaintext syntaxandoutput.
Conclusion: Naive Bayes Classification in Apache Spark MLlib with Java
In this Apache Spark Tutorial, we learned how to solve Classification problem using Naive Bayes algorithm in Apache Spark MLlib. The example configured Spark, loaded LIBSVM training data, split data into training and test RDDs, trained a Naive Bayes model, predicted labels, calculated accuracy, saved the model, and stopped the Spark context.
TutorialKart.com