Classification using Decision Trees in Apache Spark MLlib with Java

Classification is the task of identifying which of a set of predefined classes/categories an entity belongs to, based on its features and on knowledge learned from previously labelled examples.

A decision tree has a tree-like structure. Its root is a decision node and the starting point for classifying a problem instance. A decision node branches out, with each branch representing one possible outcome of the decision. Each branch ends either in another decision node or in a class label, which terminates the classification and yields the result: the class label.
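To make the structure concrete, here is a minimal plain-Java sketch of such a tree. It assumes a single numeric feature, and the thresholds and class labels are hypothetical values chosen purely for illustration:

```java
public class TreeSketch {
    interface Node { String classify(double featureValue); }

    // Leaf: terminates the classification with a class label
    static class Leaf implements Node {
        final String label;
        Leaf(String label) { this.label = label; }
        public String classify(double featureValue) { return label; }
    }

    // Decision node: each branch is one possible outcome of the test
    static class Decision implements Node {
        final double threshold;
        final Node left, right;
        Decision(double threshold, Node left, Node right) {
            this.threshold = threshold; this.left = left; this.right = right;
        }
        public String classify(double v) {
            return v <= threshold ? left.classify(v) : right.classify(v);
        }
    }

    public static void main(String[] args) {
        // Root decision node; each branch ends in another node or a class label
        Node root = new Decision(0.5,
                new Leaf("class A"),
                new Decision(0.8, new Leaf("class B"), new Leaf("class C")));
        System.out.println(root.classify(0.3)); // class A
        System.out.println(root.classify(0.9)); // class C
    }
}
```

Classifying an instance is simply a walk from the root, following the branch matching each decision's outcome, until a leaf (class label) is reached.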

In this Apache Spark tutorial, we shall build such a decision tree from training data using the Decision Tree algorithm in Apache Spark MLlib.


Following is a step-by-step process to build a classifier using the Decision Tree algorithm of MLlib (prerequisite: Setup Java Project with Apache Spark):

  1. Configure Spark.
  2. Start a Spark context.
  3. Load the data and split it into training and test sets. The data file used in this example is in the “data” folder of the Apache Spark distribution, downloaded from the official website.
  4. Set the hyperparameters required by the Decision Tree algorithm:
    impurity : the measure (such as “gini” or “entropy”) used to evaluate the quality of a candidate split; at each node, the split that most reduces impurity is chosen.
    maxDepth : maximum number of node levels the Decision Tree algorithm may create below the root node during training; limiting depth helps avoid over-fitting the training data.
    maxBins : before training starts, continuous feature values are discretized into bins of candidate split points; maxBins sets an upper limit on the number of bins that may be created per feature.
  5. Train a Decision Tree model.
  6. Use the model to predict on the test data, and calculate accuracy. The generated Decision Tree can be visualized by converting the tree to a readable string.
  7. Save the trained classifier model locally for future use.
  8. Stop the Spark context.
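The binning behind maxBins can be pictured with a small plain-Java sketch. MLlib itself uses an approximate quantile computation; the hypothetical `candidateSplits` helper below is a simplified, exact-quantile illustration of the same idea:

```java
import java.util.Arrays;

public class MaxBinsSketch {
    // Simplified illustration: pick up to (maxBins - 1) candidate split
    // thresholds at evenly spaced quantiles of the sorted feature values.
    static double[] candidateSplits(double[] values, int maxBins) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int numSplits = Math.min(maxBins - 1, sorted.length - 1);
        double[] splits = new double[numSplits];
        for (int i = 0; i < numSplits; i++) {
            int idx = (i + 1) * sorted.length / (numSplits + 1);
            splits[i] = sorted[Math.min(idx, sorted.length - 1)];
        }
        return splits;
    }

    public static void main(String[] args) {
        double[] feature = {0.1, 0.4, 0.35, 0.8, 0.95, 0.2, 0.6, 0.75};
        // With maxBins = 4, at most 3 thresholds partition the feature range
        System.out.println(Arrays.toString(candidateSplits(feature, 4)));
        // → [0.35, 0.6, 0.8]
    }
}
```

A larger maxBins allows finer-grained split decisions at the cost of more computation per node.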

The complete Java program is given below :
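The steps above can be sketched as the following program, using the RDD-based MLlib API. It assumes the sample LIBSVM dataset shipped with Spark; the file path, app name, and model output directory are illustrative and should be adapted to your setup:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.util.MLUtils;
import scala.Tuple2;

public class DecisionTreeClassifierExample {
    public static void main(String[] args) {
        // 1. Configure Spark
        SparkConf conf = new SparkConf()
                .setAppName("DecisionTreeClassification")
                .setMaster("local[2]");
        // 2. Start a Spark context
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 3. Load data in LIBSVM format and split 70% / 30%
        String path = "data/mllib/sample_libsvm_data.txt";
        JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), path).toJavaRDD();
        JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
        JavaRDD<LabeledPoint> trainingData = splits[0];
        JavaRDD<LabeledPoint> testData = splits[1];

        // 4. Hyperparameters
        int numClasses = 2;
        Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>(); // all continuous
        String impurity = "gini";
        int maxDepth = 5;
        int maxBins = 32;

        // 5. Train a Decision Tree model
        DecisionTreeModel model = DecisionTree.trainClassifier(trainingData,
                numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins);

        // 6. Predict on the test data and compute accuracy
        JavaRDD<Tuple2<Double, Double>> predictionAndLabel = testData.map(
                p -> new Tuple2<>(model.predict(p.features()), p.label()));
        double accuracy = predictionAndLabel
                .filter(pl -> pl._1().equals(pl._2())).count()
                / (double) testData.count();
        System.out.println("Test accuracy: " + accuracy);
        // Visualize the learned tree as a readable string
        System.out.println("Learned tree:\n" + model.toDebugString());

        // 7. Save the trained model for future use
        model.save(sc.sc(), "target/DecisionTreeClassificationModel");

        // 8. Stop the Spark context
        sc.stop();
    }
}
```

A saved model can later be restored with `DecisionTreeModel.load(sc.sc(), path)` instead of retraining.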

The hyperparameters set only upper limits. The Decision Tree algorithm may optimize the tree by reducing the number of nodes and branches; in this example, despite maxDepth=5, the tree has been optimized down to a depth of 1.