KMeans Classification using Spark MLlib in Java

Example Program of KMeans Classification using Spark MLlib in Java

KMeans is a clustering algorithm that is used here for classification. It assigns each given observation/experiment/vector to one of the clusters. The cluster chosen is the one whose mean vector (its center) is nearest to the observation/experiment/vector.
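The assignment rule described above can be sketched in plain Java, independent of Spark. This is only an illustration under stated assumptions: the class name `NearestCenter` and its methods are hypothetical, and squared Euclidean distance is assumed as the distance measure.

```java
// Hypothetical helper (not part of Spark MLlib): picks the nearest cluster center.
final class NearestCenter {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the index of the center closest to the point (-1 if there are no centers).
    static int nearest(double[][] centers, double[] point) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = squaredDistance(centers[c], point);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

With made-up centers (0, 0, 0), (5, 5, 5) and (10, 10, 10), the point (4.5, 5.2, 4.9) would be assigned to the center at index 1, because that center is the least distant.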


Clustering :

Training data is a text file in which each row contains the space-separated values of the features (dimensions). Example training data is given below:

Each row in the above sample training data is :

  • an observation/experiment/vector which is three dimensional (or)
  • an observation that has three features (whose values are continuous), (or)
  • the experiment that has three state variables.

The number of dimensions/features/state-variables can be any positive integer, as long as every row has the same number of values.
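Reading one row of such a file amounts to splitting it on whitespace and parsing each token as a number. A minimal sketch in plain Java follows; the class name `RowParser` is hypothetical and the sample values in the usage note are made up, not taken from the training file.

```java
// Hypothetical helper: turns one space-separated row into its feature values.
final class RowParser {

    // Splits the row on runs of whitespace and parses every token as a double.
    // Throws NumberFormatException for blank rows or non-numeric tokens.
    static double[] parseRow(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }
}
```

For instance, `RowParser.parseRow("0.1 0.2 0.3")` yields a three-element array. In a Spark program, such an array would typically be wrapped with `Vectors.dense(values)` from `org.apache.spark.mllib.linalg` before training.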

The content of the training data file is shown below :

Program :

The Java program that demonstrates the KMeans classification machine learning algorithm using Spark MLlib is given below.

When the program is run:

Let us see in detail what has been generated during training :

The cluster generation part :

Based on the training data and the hyperparameter number of clusters = 3, the algorithm has found three clusters: Cluster 0, Cluster 1 and Cluster 2. The centers of these clusters have been calculated and are shown in the above block.
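A cluster center is the mean vector of the members of that cluster. The sketch below shows that calculation in plain Java; the class name `Centroid` is hypothetical, and this is not how Spark MLlib computes centers internally, only the underlying arithmetic.

```java
// Hypothetical helper: computes the mean vector (center) of a cluster's members.
final class Centroid {

    // Averages the members component by component.
    static double[] mean(double[][] members) {
        if (members.length == 0) {
            throw new IllegalArgumentException("empty cluster");
        }
        int dim = members[0].length;
        double[] center = new double[dim];
        for (double[] member : members) {
            if (member.length != dim) {
                throw new IllegalArgumentException("dimension mismatch");
            }
            for (int i = 0; i < dim; i++) {
                center[i] += member[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            center[i] /= members.length;
        }
        return center;
    }
}
```

For two made-up members (1, 2, 3) and (3, 4, 5), the center is (2, 3, 4).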

The cost :

In this algorithm, cost is a metric that shows the price paid for the chosen cluster centers. It is the sum of squared distances from the center of each cluster to every member of that cluster, added up over all clusters. A lower cost means the members lie closer to their centers, so the cost should be kept low for better prediction accuracy.

Cost : 1.0733333.. =

[squared distances from center of Cluster 0 to the members of Cluster 0] +
[squared distances from center of Cluster 1 to the members of Cluster 1] +
[squared distances from center of Cluster 2 to the members of Cluster 2]
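The sum described above can be sketched in plain Java as follows. The class name `ClusterCost` is hypothetical, each point is assumed to belong to its nearest center, and the numbers in the usage note are made up rather than taken from the training run, so they do not reproduce the cost reported above.

```java
// Hypothetical helper: sum of squared distances from each point to its nearest center.
final class ClusterCost {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Adds, for every point, the squared distance to the closest center.
    // Summing per point is equivalent to summing per cluster and then adding the clusters.
    static double cost(double[][] centers, double[][] points) {
        double total = 0.0;
        for (double[] point : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] center : centers) {
                best = Math.min(best, squaredDistance(center, point));
            }
            total += best;
        }
        return total;
    }
}
```

With made-up centers (0, 0) and (10, 10) and points (0, 1), (1, 0) and (10, 11), every point is at squared distance 1 from its nearest center, so the cost is 3.0.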

Save KMeans model to local :

The KMeans classification model generated during training can be saved to the local file system and loaded later for prediction.

Prediction :

For the specified input rows, the KMeans model generated during training has predicted the clusters they belong to, as shown below:

Conclusion :

In this Spark MLlib tutorial, we have seen how to train a classification model using the KMeans algorithm, save the model as a local file, and use the model for prediction.