RandomForest Classification Example using Spark MLlib

In this tutorial, we shall see how to train a model using the RandomForest classifier, and then use the generated model on test data to predict the categories and calculate the test error and accuracy of the model.

Training using Random Forest classifier

Spark MLlib understands only numbers, so the training data should be prepared in a form that MLlib understands. Preparing the training data is the most important step, as it decides the accuracy of the model. It includes the following:

  1. Identify the categories and index them.
  2. Identify the features and index them.
  3. Transform the experiments/observations/examples using the indexes of the categories and features.
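
The indexing steps above can be sketched in plain Java as follows. This is only an illustration, not the code from the attachment: the class name LabelIndexer and the sample category names are hypothetical, and it assumes that indexes are assigned consecutively from 0 in order of first appearance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Spark MLlib): assigns a numeric index
// to each distinct category name or discrete feature value.
public class LabelIndexer {
    private final Map<String, Integer> indexByName = new LinkedHashMap<>();

    // Returns the existing index for a name, or assigns the next free one.
    public int indexOf(String name) {
        Integer existing = indexByName.get(name);
        if (existing != null) {
            return existing;
        }
        int next = indexByName.size();
        indexByName.put(name, next);
        return next;
    }

    // Number of distinct names seen so far.
    public int size() {
        return indexByName.size();
    }

    public static void main(String[] args) {
        LabelIndexer categories = new LabelIndexer();
        // Made-up category names, for illustration only.
        for (String c : new String[] {"spam", "ham", "spam", "promo"}) {
            System.out.println(c + " -> " + categories.indexOf(c));
        }
        System.out.println("distinct categories: " + categories.size());
    }
}
```

The same helper could be used once for the categories and once per discrete feature, so that every value written to the training file is a number.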

Note: Feature values can be discrete or continuous. Comments in the program show how to mark some features as discrete and others as continuous. Using this as a reference, configure the features as per your requirement.

Download the source code of the ongoing example here: RandomForestExampleAttachment. To set up a Java project to work with Spark MLlib, please refer to Create Java Project with Apache Spark.

Sample Training Data for Random Forest

Below is a sample of the transformed data that is ready to be fed to the RandomForest for training. Each row represents an experiment/observation/example. The format of each row is [category feature1:value feature2:value ..]
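
To make the row format concrete, here is a minimal sketch that splits one such row into its category and its feature index/value pairs. The class name RowParser and the sample row are hypothetical (the row is not taken from trainingValues.txt), and the sketch assumes that tokens are separated by whitespace.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical parser for one row of the form
// "category index:value index:value ..."
public class RowParser {
    public final double label;
    public final Map<Integer, Double> features = new TreeMap<>();

    public RowParser(String row) {
        String[] tokens = row.trim().split("\\s+");
        // The first token is the (already indexed) category.
        label = Double.parseDouble(tokens[0]);
        // Every remaining token is a feature index and its value.
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException(
                        "Expected index:value but got: " + tokens[i]);
            }
            features.put(Integer.parseInt(pair[0]), Double.parseDouble(pair[1]));
        }
    }

    public static void main(String[] args) {
        // Made-up row, for illustration only.
        RowParser parsed = new RowParser("1 1:2 2:0 3:4.5");
        System.out.println("label=" + parsed.label + " features=" + parsed.features);
    }
}
```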

Training data: trainingValues.txt

Below is the Java class, RandomForestTrainerExample.java, that trains a model and saves it to the local file system.

Trainer class: RandomForestTrainerExample.java

When the above Java class is run, a model with three decision trees is generated, as shown in the output below:

From the above random forest, the following observations can be made:

  1. Features 0, 1, 2 and 4 are considered discrete, as seen in conditions such as [feature 2 not in {5.0,6.0}].
  2. Features 3 and 5 are considered continuous, as seen in conditions such as [feature 5 > 6.0].

Possible exceptions during training:

You might come across some of the exceptions below, which have to be taken care of.

java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins

When maxBins = 2 and the maximum number of discrete values for a feature in the training data is 10:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (=2) to be at least as large as the number of values in each categorical feature, but categorical feature 2 has 10 values. Considering remove this and other categorical features with a large number of values, or add more training examples.

Solution: Set maxBins to at least the largest number of distinct values among all the discrete features. When discrete values are indexed from 0, this is the maximum discrete value + 1.
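
This requirement can be checked before training. The sketch below assumes the discrete features are described by a map from feature index to number of distinct values, the same shape as the categoricalFeaturesInfo argument of RandomForest.trainClassifier. The class name, method name and the arity of feature 4 are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-training check (not part of Spark MLlib).
public class MaxBinsCheck {

    // Smallest maxBins that satisfies the requirement in the exception:
    // the largest number of values among all discrete features.
    // Returns 0 when there are no discrete features.
    public static int minRequiredMaxBins(Map<Integer, Integer> categoricalFeaturesInfo) {
        int required = 0;
        for (int arity : categoricalFeaturesInfo.values()) {
            required = Math.max(required, arity);
        }
        return required;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
        categoricalFeaturesInfo.put(2, 10); // feature 2 has 10 values, as in the exception
        categoricalFeaturesInfo.put(4, 3);  // made-up arity, for illustration only

        int maxBins = 2;
        int required = minRequiredMaxBins(categoricalFeaturesInfo);
        if (maxBins < required) {
            System.out.println("maxBins=" + maxBins
                    + " is too small; use at least " + required);
        }
    }
}
```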

java.lang.IllegalArgumentException: GiniAggregator given label

When numClasses = 2 and the training data has three categories [0,1,2]:
Caused by: java.lang.IllegalArgumentException: GiniAggregator given label 2.0 but requires label < numClasses (= 2).

Solution: Set numClasses to at least the number of categories in the training data. Since categories are indexed from 0, every label must be less than numClasses.
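
As with maxBins, a suitable value can be derived from the data. The sketch below assumes the categories are indexed 0, 1, 2, ... as in the exception message, so the smallest valid numClasses is the largest label plus one. The class and method names are hypothetical.

```java
// Hypothetical pre-training check (not part of Spark MLlib).
public class NumClassesCheck {

    // Every label must satisfy label < numClasses, so the smallest valid
    // numClasses is the largest label plus one (0 for an empty array).
    public static int minRequiredNumClasses(double[] labels) {
        double maxLabel = -1;
        for (double label : labels) {
            maxLabel = Math.max(maxLabel, label);
        }
        return (int) maxLabel + 1;
    }

    public static void main(String[] args) {
        // Three categories [0,1,2], as in the exception above.
        double[] labels = {0, 1, 2, 1, 0};
        System.out.println("numClasses should be at least "
                + minRequiredNumClasses(labels));
    }
}
```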

Prediction using the model saved in the training part:

A sample of the test data is shown below. Note that the format of the test data is the same as that of the training data.

Prediction using the model generated during training:

Predictor class: RandomForestPredictor.java

For the test data we provided, the test error is 0.047, which means 4.7% of the predictions are wrong. The model therefore predicts with 100% - 4.7% = 95.3% accuracy.
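
The relation between test error and accuracy can be sketched as follows. This is a plain Java illustration with made-up labels, not the code of RandomForestPredictor.java. It assumes the test error is the fraction of test rows whose predicted category differs from the actual category.

```java
// Hypothetical illustration of test error and accuracy.
public class TestErrorExample {

    // Fraction of rows whose predicted category differs from the actual one.
    public static double testError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException(
                    "Arrays must be non-empty and of equal length");
        }
        int wrong = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] != actual[i]) {
                wrong++;
            }
        }
        return (double) wrong / actual.length;
    }

    public static void main(String[] args) {
        // Made-up labels, for illustration only.
        double[] predicted = {0, 1, 2, 1};
        double[] actual = {0, 1, 1, 1};
        double error = testError(predicted, actual);
        System.out.println("Test Error = " + error);
        System.out.println("Accuracy   = " + (1.0 - error));
    }
}
```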

Conclusion:

In this Apache Spark tutorial, we have learned how to train a RandomForest model for a classification problem and use it for prediction in Apache Spark MLlib.