In this tutorial, you will learn how to run KMeans clustering with Spark MLlib in Java, save the trained KMeansModel, and use it to assign new vectors to clusters.

KMeans clustering in Spark MLlib Java: what this example does

KMeans Classification using spark MLlib in Java – KMeans is more accurately described as a clustering algorithm, not a supervised classification algorithm. It groups unlabeled observations into a fixed number of clusters. After training, the model can predict the nearest cluster ID for a new vector, but that cluster ID is not the same as a class label unless you map it to a business meaning separately.

The Java program in this page uses the older RDD-based org.apache.spark.mllib.clustering.KMeans API. Apache Spark documentation now treats the DataFrame-based spark.ml API as the primary machine learning API, while the RDD-based spark.mllib package is maintained mainly for compatibility. This tutorial keeps the MLlib Java RDD example because it is useful for projects that still use JavaSparkContext, JavaRDD, and KMeansModel.

For reference, you may compare this example with the official Spark documentation for RDD-based MLlib clustering, DataFrame-based Spark ML clustering, and the Spark MLlib guide.

KMeans Classification using spark MLlib in Java - Apache Spark Tutorial - www.tutorialkart.com
KMeans Classification using spark MLlib in Java

Spark MLlib KMeans input data format for Java RDD API

Training data is a text file with each row containing space seperated values of features or dimensional values. Example training data is given below:

2.9 1.4 6.8
9.0 0.0 9.0

Each row in the above sample training data is:

  • an observation/experiment/vector which is three dimensional (or)
  • an observation has three features(whose values are continuous), (or)
  • the experiment that has three state variables.

The number of dimensions/features/state-variables could be any number that is real.

For KMeans, every row should have the same number of numeric features. In this tutorial, every vector has three values. The first row, for example, becomes the dense vector [0.0, 0.6, 0.0]. If one row has missing values, extra delimiters, or a different number of dimensions, the parsing or training step can fail or produce misleading clusters.

Create kMeansTrainingData.txt for the Spark MLlib Java KMeans example

The content of the training data file is shown below:

0.0 0.6 0.0
0.1 0.1 0.1
0.5 0.2 0.2
2.5 1.2 6.3
2.1 1.8 6.2
2.9 1.4 6.8
9.0 0.0 9.0
9.1 0.1 9.1
9.2 0.2 9.2

Save the file as data/kMeansTrainingData.txt relative to the project working directory, because the Java code reads the path data/kMeansTrainingData.txt. The sample data has three visible groups: small values near zero, values around [2.x, 1.x, 6.x], and values around [9.x, 0.x, 9.x]. That is why the example uses numClusters = 3.

Maven dependencies for Spark MLlib KMeans in a Java project

If you run this example in a Maven project, include Spark Core and Spark MLlib dependencies that match your installed Spark and Scala binary version. The artifact suffix, such as _2.12 or _2.13, must match the Spark build you use.

</>
Copy
<properties>
    <spark.version>your-spark-version</spark.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>

When using Gradle or another build tool, apply the same rule: keep Spark Core, Spark MLlib, Scala binary version, and the Spark runtime version consistent. Version mismatches are a common reason for NoSuchMethodError, class loading errors, and runtime failures in Spark examples.

Java KMeans program using Spark MLlib RDD API

The java program to demonstrate KMeans classification machine learning algorithm using spark mllib is given below.

JavaKMeansExample.java

</>
Copy
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

/**
 * KMeans Classification using spark MLlib in Java
 * @author arjun
 */
public class JavaKMeansExample {
	public static void main(String[] args) {

		System.out.println("KMeans Classification using spark MLlib in Java . . .");

		// hadoop home dir [path to bin folder containing winutils.exe]
		System.setProperty("hadoop.home.dir", "D:\\hadoop\\");

		SparkConf conf = new SparkConf().setAppName("JavaKMeansExample")
				.setMaster("local[2]")
				.set("spark.executor.memory","3g")
				.set("spark.driver.memory", "3g");

		JavaSparkContext jsc = new JavaSparkContext(conf);

		// Load and parse data
		String path = "data/kMeansTrainingData.txt";
		JavaRDD data = jsc.textFile(path);
		JavaRDD parsedData = data.map(s -> {
			String[] sarray = s.split(" ");
			double[] values = new double[sarray.length];
			for (int i = 0; i < sarray.length; i++) {
				values[i] = Double.parseDouble(sarray[i]);
			}
			return Vectors.dense(values);
		});
		parsedData.cache();

		// Cluster the data into three classes using KMeans
		int numClusters = 3;
		int numIterations = 20;
		KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);

		System.out.println("\n*****Training*****");
		int clusterNumber = 0;
		for (Vector center: clusters.clusterCenters()) {
			System.out.println("Cluster center for Clsuter "+ (clusterNumber++) + " : " + center);
		}
		double cost = clusters.computeCost(parsedData.rdd());
		System.out.println("\nCost: " + cost);

		// Evaluate clustering by computing Within Set Sum of Squared Errors
		double WSSSE = clusters.computeCost(parsedData.rdd());
		System.out.println("Within Set Sum of Squared Errors = " + WSSSE);

		try {
			FileUtils.forceDelete(new File("KMeansModel"));
			System.out.println("\nDeleting old model completed.");
		} catch (FileNotFoundException e1) {
		} catch (IOException e) {
		}

		// Save and load model
		clusters.save(jsc.sc(), "KMeansModel");
		System.out.println("\rModel saved to KMeansModel/");
		KMeansModel sameModel = KMeansModel.load(jsc.sc(),
				"KMeansModel");

		// prediction for test vectors
		System.out.println("\n*****Prediction*****");
		System.out.println("[9.0, 0.6, 9.0] belongs to cluster "+sameModel.predict(Vectors.dense(9.0, 0.6, 9.0)));
		System.out.println("[0.2, 0.5, 0.4] belongs to cluster "+sameModel.predict(Vectors.dense(0.2, 0.5, 0.4)));
		System.out.println("[2.8, 1.6, 6.0] belongs to cluster "+sameModel.predict(Vectors.dense(2.8, 1.6, 6.0)));

		jsc.stop();
	}
}

The code loads the text file as an RDD, converts each line into a dense vector, caches the parsed data, trains a KMeans model with three clusters, prints cluster centers, computes the clustering cost, saves the model, loads it again, and predicts the cluster for three new vectors.

For new Java projects, it is better to use typed RDD declarations so the compiler can catch mistakes earlier. The parsing part can be written like this while keeping the same MLlib API:

</>
Copy
JavaRDD<String> data = jsc.textFile(path);

JavaRDD<Vector> parsedData = data.map(s -> {
    String[] sarray = s.trim().split("\\s+");
    double[] values = new double[sarray.length];

    for (int i = 0; i < sarray.length; i++) {
        values[i] = Double.parseDouble(sarray[i]);
    }

    return Vectors.dense(values);
});

The trim().split("\s+") form is safer than splitting only on one space because it handles multiple spaces or tabs between values. The complete program above is kept unchanged so that it matches the original example flow.

Output from the Spark MLlib Java KMeans program

Output

KMeans Classification using spark MLlib in Java . . .
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

*****Training*****
Cluster center for Clsuter 0 : [2.5,1.4666666666666668,6.433333333333334]
Cluster center for Clsuter 1 : [0.19999999999999998,0.29999999999999993,0.1]
Cluster center for Clsuter 2 : [9.1,0.1,9.1]

Cost: 1.0733333333333688
Within Set Sum of Squared Errors = 1.0733333333333688

Deleting old model completed.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Model saved to KMeansModel/

*****Prediction*****
[9.0, 0.6, 9.0] belongs to cluster 2
[0.2, 0.5, 0.4] belongs to cluster 1
[2.8, 1.6, 6.0] belongs to cluster 0

The exact cluster numbers can change if the initialization, Spark version, or input order changes. KMeans cluster IDs are identifiers assigned by the algorithm; they should not be treated as stable category names. The important part is whether nearby vectors are assigned to the expected cluster centers.

Let see what has been generated during training in detail :

Cluster centers generated by Spark MLlib KMeans

*****Training*****
Cluster center for Clsuter 0 : [2.5,1.4666666666666668,6.433333333333334]
Cluster center for Clsuter 1 : [0.19999999999999998,0.29999999999999993,0.1]
Cluster center for Clsuter 2 : [9.1,0.1,9.1]

Based on the training data and the hyper parameter, number of clusters = 3,  the algorithm has found three clusters Cluster 0, Cluster 1 and Cluster 2. The centers for these clusters have been calculated and are as shown in the above block.

A cluster center is the mean position of the vectors assigned to that cluster. In this example, Cluster 1 is centered near the low-value vectors, Cluster 0 is centered near the vectors around [2.x, 1.x, 6.x], and Cluster 2 is centered near the vectors around [9.x, 0.x, 9.x]. When a new vector is passed to predict(), Spark returns the cluster whose center is closest to that vector.

WSSSE cost in Spark MLlib KMeans

In this algorithm, cost is a metric that shows the price to be paid for choosing a center for cluster. The cost is the sum of squared distances from center of the cluster to each member of the cluster. The cost has to be kept low for better prediction accuracy.

Cost: 1.0733333333333688
Within Set Sum of Squared Errors = 1.0733333333333688

Cost : 1.0733333.. =

[squared distances from center of Cluster 0 to the members of Cluster 0] +
[squared distances from center of Cluster 1 to the members of Cluster 1] +
[squared distances from center of Cluster 2 to the members of Cluster 2]

This metric is also called Within Set Sum of Squared Errors, or WSSSE. A lower WSSSE usually means the points are closer to their assigned centers for the chosen value of k. However, WSSSE alone should not be used blindly to choose k, because it generally decreases as you add more clusters.

Choosing numClusters for KMeans in Spark MLlib Java

The example uses numClusters = 3 because the sample data was created with three clear groups. In real data, choosing k requires inspection and validation. Common practical checks include comparing WSSSE for multiple values of k, looking for an elbow point, checking whether cluster centers make domain sense, and verifying that clusters are not created only because feature scales are different.

  • Use the same feature scale. KMeans is distance-based, so large-valued features can dominate smaller-valued features.
  • Try more than one value of k. A single run with one value can hide poor cluster separation.
  • Use a seed when reproducibility matters. KMeans initialization can affect final centers.
  • Review the cluster centers. Centers should be interpretable for the problem you are solving.

Save and load KMeansModel in Spark MLlib Java

The KMeans classification model generated during training could be saved to local, and be used for prediction.

In the program, clusters.save(jsc.sc(), "KMeansModel") writes the trained model to the KMeansModel directory. Then KMeansModel.load(jsc.sc(), "KMeansModel") loads the same model back. This pattern is useful when training and prediction are not performed in the same run.

The example deletes the old KMeansModel directory before saving. Without that step, Spark may fail if the output path already exists. In production code, handle this carefully so that you do not accidentally delete a model you still need.

Predict cluster IDs with KMeansModel in Java

For the specified input rows to be predicted, the generated KMeans model during training, has predicted the cluster they belong to as shown below:

*****Prediction*****
[9.0, 0.6, 9.0] belongs to cluster 2
[0.2, 0.5, 0.4] belongs to cluster 1
[2.8, 1.6, 6.0] belongs to cluster 0

The vector [9.0, 0.6, 9.0] is closest to the center near [9.1, 0.1, 9.1], so the model assigns it to cluster 2 in this run. The vector [0.2, 0.5, 0.4] is closest to the low-value cluster, and [2.8, 1.6, 6.0] is closest to the middle group.

KMeans classification vs clustering in Spark MLlib

Many tutorials and older examples use the phrase KMeans classification, but KMeans is an unsupervised clustering algorithm. A classifier learns from labeled examples such as spam and not spam. KMeans does not require labels. It only uses feature distances to divide data into clusters.

You can use KMeans results inside a classification workflow, but the cluster ID itself is not a supervised class. For example, you may cluster customers first and then label each cluster manually as low activity, medium activity, or high activity. That label is your interpretation, not something KMeans learned from labeled training data.

spark.ml vs spark.mllib for Java KMeans examples

Spark documentation uses MLlib as the broader machine learning library name, but there are two APIs you will see in code examples:

API packageMain data structureUse in new code
org.apache.spark.mlDataFrame / DatasetPreferred for new Spark ML pipelines
org.apache.spark.mllibRDDUseful for older RDD-based projects and compatibility

The program on this page uses org.apache.spark.mllib. If you are starting a new application and already use Spark SQL DataFrames, the DataFrame-based org.apache.spark.ml.clustering.KMeans API is usually the better choice. If your project already works with RDDs and old MLlib models, this Java RDD example is still useful.

Common Spark MLlib Java KMeans errors and checks

  • Input path not found: confirm that data/kMeansTrainingData.txt exists relative to the working directory from which the program runs.
  • NumberFormatException: check for blank lines, commas, headers, or non-numeric values in the training data file.
  • Different vector sizes: every row must have the same number of feature values.
  • Windows winutils issue: the hadoop.home.dir line is only for local Windows setups that require Hadoop native utilities.
  • SLF4J warning: the warning in the sample output is a logging binding warning. It does not necessarily mean KMeans training failed.
  • Existing model path: Spark save operations may fail if the output directory already exists.

FAQs on KMeans using Spark MLlib in Java

Is Spark MLlib deprecated for KMeans clustering?

MLlib as Spark’s machine learning library is not deprecated. The older RDD-based spark.mllib API is in maintenance mode, while the DataFrame-based spark.ml API is the primary API for new machine learning pipelines.

Can KMeans be used for classification in Java Spark applications?

KMeans can assign a new vector to a cluster ID, but it is not supervised classification. You may interpret clusters as categories after analysis, but KMeans itself does not learn from labeled class values.

What is the difference between Spark ml and Spark mllib for KMeans?

spark.ml uses DataFrames and pipeline components such as estimators and transformers. spark.mllib uses RDDs and older model classes such as KMeansModel. This tutorial uses spark.mllib.

What does WSSSE mean in Spark MLlib KMeans output?

WSSSE means Within Set Sum of Squared Errors. It is the sum of squared distances between each point and the center of the cluster assigned to that point.

Which type of clustering is KMeans in Spark MLlib?

KMeans is a partition-based clustering method. It divides observations into a predefined number of clusters and assigns each observation to the nearest cluster center.

Editorial QA checklist for this Spark MLlib Java KMeans tutorial

  • The tutorial clearly says that KMeans is clustering, even though the post title uses the older phrase KMeans Classification.
  • The sample input file path matches the Java code path data/kMeansTrainingData.txt.
  • The explanation of WSSSE refers to all three clusters, not only Cluster 0.
  • The Spark API distinction between spark.ml and spark.mllib is included for readers using newer Spark versions.
  • New code snippets use PrismJS-compatible language classes.

KMeans Spark MLlib Java tutorial summary

In this Apache Spark Tutorial, we have seen how to train a KMeans clustering model using Spark MLlib in Java, inspect cluster centers, compute WSSSE cost, save the trained model, load it again, and predict cluster IDs for new vectors. For modern Spark projects, also review the DataFrame-based spark.ml KMeans API before choosing the older RDD-based spark.mllib API.