What is topic modelling in Apache Spark MLlib?

Topic modelling is a natural language processing technique used to discover hidden themes in a collection of documents. Instead of manually assigning a category to each document, a topic model studies word patterns and groups words that frequently occur together.

For example, a set of documents about forests may frequently contain words such as trees, animals, ecosystem, rainfall, canopy, and biodiversity. A set of documents about Apache Spark may contain words such as RDD, DataFrame, cluster, executor, MLlib, and pipeline. Topic modelling converts these word patterns into a mathematical model so that each topic can be represented as a probability distribution over words.

In Spark MLlib, the commonly used topic modelling algorithm is Latent Dirichlet Allocation, usually abbreviated as LDA. The phrase is sometimes mistyped as Latent Dirichlet Condition, but the correct algorithm name is Latent Dirichlet Allocation.

Topic modelling using Latent Dirichlet Allocation in Apache Spark MLlib

Latent Dirichlet Allocation is an unsupervised learning algorithm. It does not need labelled documents. Given a corpus represented as word-count vectors, LDA tries to learn two useful patterns:

  • Topics as distributions over words: each topic gives higher weight to words that are strongly associated with that topic.
  • Documents as mixtures of topics: each document can contain more than one topic with different probabilities.

The Java example in this tutorial uses Spark MLlib’s RDD-based LDA API. For new Spark machine learning pipelines, also check the DataFrame-based LDA API in the official Spark documentation because Spark’s newer machine learning examples generally use spark.ml pipelines. This page keeps the Java MLlib example because it explains the original RDD-based approach step by step.

Official references: PySpark LDA API reference and Spark ML clustering guide for LDA.

Input format required for Spark MLlib LDA topic modelling

Spark MLlib LDA does not directly accept raw text paragraphs. The input corpus must already be converted into vectors. Each document is represented by a numeric vector, usually a term-frequency vector or another count-based representation.

The sample file used below, data/mllib/sample_lda_data.txt, contains one document per line. Each number in the line represents the count or weight for one vocabulary position.

</>
Copy
0 1 2 0 0 3 0 1 0 0 2
1 0 0 4 0 0 1 0 0 2 0

In a real application, you normally prepare the corpus before training LDA. A typical text-preparation flow is: collect documents, clean text, tokenize words, remove stop words, build a vocabulary, and convert each document into a word-count vector. The example below starts after this vectorization step.

Java MLlib LDA example workflow in Apache Spark

The following Java workflow trains an LDA topic model in Apache Spark MLlib, prints the learned topic-word distributions, saves the model, and stops the Spark context.

Step 1: start the Spark context for the LDA Java application

Configure the Spark application to run locally and start the Spark context. In production, the master setting is usually supplied by the cluster environment instead of being hard-coded.

</>
Copy
// Configure spark application
SparkConf conf = new SparkConf().setAppName("TopicModellingLDAExampleApp")
        .setMaster("local[2]");
// start Spark Context
JavaSparkContext jsc = new JavaSparkContext(conf);
 

Step 2: load the LDA sample data into a Spark RDD

Load and parse the sample data from data/mllib/sample_lda_data.txt. Each line in the file represents one document vector. The code also gives each document a unique Long id because MLlib LDA expects a corpus in the form of document id and vector pairs.

</>
Copy
// Load and parse the sample data
String path = "data/mllib/sample_lda_data.txt";
JavaRDD<String> data = jsc.textFile(path);
JavaRDD<Vector> parsedData = data.map(s -> {
    String[] sarray = s.trim().split(" ");
    double[] values = new double[sarray.length];
    for (int i = 0; i < sarray.length; i++) {
        values[i] = Double.parseDouble(sarray[i]);
    }
    return Vectors.dense(values);
});
 
// Index documents with unique IDs : Long - document id, Vector - Transformed document contents
JavaPairRDD<Long, Vector> corpus =
        JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(Tuple2::swap));
corpus.cache();

The corpus.cache() call is useful because LDA is iterative. Caching avoids recomputing the parsed corpus repeatedly during model training.

Step 3: train the Spark MLlib LDA topic model

Set the number of topics with setK(3) and run LDA against the corpus. Here, Spark is asked to learn three topics from the input documents.

</>
Copy
LDAModel ldaModel = new LDA().setK(3).run(corpus);

The value of K is not discovered automatically in this basic example. You choose it based on the problem, inspect the results, and adjust it if the learned topics are too broad or too fragmented.

Step 4: print LDA topic distributions over the vocabulary

Once the model is generated, print the topics’ distribution over the vocabulary. Each row position in the vocabulary has a weight for each topic.

</>
Copy
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 3; topic++) {
    System.out.print("Topic " + topic + ":");
    for (int word = 0; word < ldaModel.vocabSize(); word++) {
        System.out.print(" " + topics.apply(word, topic));
    }
    System.out.println();
}

The printed values are easier to understand when you also keep the vocabulary mapping. For example, if vocabulary index 4 maps to the word forest, then the value at word index 4 for a topic tells how strongly that word belongs to that topic.

Step 5: save the trained LDA model from Spark MLlib

Save the generated model if it must be reused later. Make sure the output path does not already contain an incompatible saved model from an earlier run.

</>
Copy
ldaModel.save(jsc.sc(),
        "TopicModellingLDAExampleApp");

Step 6: stop the Spark context after LDA training

Stop the Spark context after the job completes so that local resources and cluster resources are released cleanly.

</>
Copy
jsc.stop();

Complete Java example for topic modelling using LDA in Spark MLlib

In the following example program, we shall perform Topic Modelling using Latent Dirichlet Condition.

TopicModellingLDAExample.java

</>
Copy
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
 
import scala.Tuple2;
 
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
 
/**
* Topic modelling using Latent Dirichlet Condition in Apache Spark MLlib
*/
public class TopicModellingLDAExample {
    public static void main(String[] args) {
 
        // Configure spark application
        SparkConf conf = new SparkConf().setAppName("TopicModellingLDAExampleApp")
                .setMaster("local[2]");
        // start Spark Context
        JavaSparkContext jsc = new JavaSparkContext(conf);
 
        // Load and parse the sample data
        String path = "data/mllib/sample_lda_data.txt";
        JavaRDD<String> data = jsc.textFile(path);
        JavaRDD<Vector> parsedData = data.map(s -> {
            String[] sarray = s.trim().split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
                values[i] = Double.parseDouble(sarray[i]);
            }
            return Vectors.dense(values);
        });
        
        // Index documents with unique IDs : Long - document id, Vector - Transformed document contents
        JavaPairRDD<Long, Vector> corpus =
                JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(Tuple2::swap));
        corpus.cache();
 
        // Cluster the documents into three topics using LDA
        LDAModel ldaModel = new LDA().setK(3).run(corpus);
 
        // Output topics. Each is a distribution over words (matching word count vectors)
        System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
        + " words):");
        Matrix topics = ldaModel.topicsMatrix();
        for (int topic = 0; topic < 3; topic++) {
            System.out.print("Topic " + topic + ":");
            for (int word = 0; word < ldaModel.vocabSize(); word++) {
                System.out.print(" " + topics.apply(word, topic));
            }
            System.out.println();
        }
 
        // save the model
        ldaModel.save(jsc.sc(),
                "TopicModellingLDAExampleApp");
        
        // test if the model loads
        DistributedLDAModel sameModel = DistributedLDAModel.load(jsc.sc(),
                "TopicModellingLDAExampleApp");
 
        // stop the Spark Context
        jsc.stop();
    }
}

Output

Learned topics (as distributions over vocab of 11 words):
Topic 0: 7.576226952795377 5.816129763687888 3.443659463610819 13.523621733565031 5.564859588817557 6.605326794930297 14.782903558924001 3.063190448611529 2.8630735297090064 6.5170961047598635 17.015304210597
Topic 1: 8.966136190838393 7.450099807005361 4.338884068311933 18.900255115448275 9.235601145516164 7.2157902467479875 9.709434717615075 3.0358356116955343 2.2640073474546254 5.074403401553405 9.89058872292561
Topic 2: 9.457636856366229 15.73377042930675 4.2174564680772475 7.576123150986693 10.199539265666282 8.178882958321715 6.507661723460921 3.9009739396929377 2.8729191228363677 12.40850049368673 6.094107066477388

How to interpret the LDA output from topicsMatrix()

The output shows three topics because the example uses setK(3). Each topic line contains 11 numeric values because the sample data has a vocabulary size of 11. The larger a value is for a vocabulary position, the more strongly that word position is associated with that topic.

The output does not show actual words because the sample data is already numeric. In a real topic modelling project, keep a separate vocabulary array so that topic weights can be converted into readable top words. Without that mapping, the topic numbers are mathematically valid but difficult to explain to users.

Choosing the number of LDA topics in Apache Spark

The setK() value controls how many topics the model learns. There is no single correct value for every corpus. A small K may merge unrelated themes into one topic, while a large K may split one meaningful theme into several narrow topics.

  • Start with a small number of topics and inspect the top words for each topic.
  • Increase K if topics are too broad or mix unrelated terms.
  • Decrease K if many topics look nearly identical or contain mostly noise.
  • Use domain knowledge to decide whether the discovered topics are meaningful.
  • Keep preprocessing consistent when comparing different K values.

Text preprocessing checklist before Spark LDA training

The quality of LDA results depends strongly on preprocessing. If the input vectors contain noisy words, formatting tokens, or inconsistent vocabulary entries, the learned topics will also be noisy.

  • Normalize text: convert text to a consistent case and remove unnecessary punctuation where appropriate.
  • Tokenize carefully: split text into meaningful terms for the language and domain.
  • Remove stop words: common words such as the, is, and are often add little topic meaning.
  • Handle rare and very frequent terms: terms that appear in nearly every document or only once may reduce topic quality.
  • Build a vocabulary mapping: store the index-to-word mapping so the final topics can be interpreted.
  • Use count-based vectors: LDA is commonly trained on term-count style vectors rather than raw strings.

Spark MLlib LDA vs spark.ml LDA for topic modelling

The example above uses the older RDD-based org.apache.spark.mllib API. It is useful for understanding the basic mechanics of LDA in Spark: build an RDD corpus, train the model, inspect topic-word distributions, and save the model.

For newer Spark projects, the DataFrame-based org.apache.spark.ml API is usually easier to combine with tokenizers, stop-word removers, count vectorizers, and machine learning pipelines. If your project already uses Spark SQL or DataFrames, consider using the DataFrame-based LDA implementation for a more pipeline-friendly design.

Common issues in Apache Spark LDA topic modelling

  • Empty or low-quality topics: check whether the corpus is too small, too sparse, or poorly preprocessed.
  • Unclear topic names: LDA returns word distributions, not human-readable labels. Assign labels after inspecting the top words.
  • Hard-to-read numeric output: keep the vocabulary mapping and print the top terms for each topic.
  • Out-of-memory errors: reduce vocabulary size, filter noisy terms, cache carefully, and test on a smaller sample first.
  • Wrong input expectation: raw text must be converted into feature vectors before using the MLlib RDD-based LDA API.

Apache Spark LDA topic modelling FAQs

Is the algorithm called Latent Dirichlet Condition or Latent Dirichlet Allocation?

The correct algorithm name is Latent Dirichlet Allocation. LDA is a probabilistic topic modelling algorithm that represents documents as mixtures of topics and topics as distributions over words.

Can Spark MLlib LDA train directly on raw text documents?

No. The RDD-based MLlib LDA example expects each document as a numeric vector. Raw text must first be cleaned, tokenized, converted into a vocabulary, and transformed into count vectors or similar feature vectors.

What does setK(3) mean in the Spark LDA example?

setK(3) tells Spark to learn three topics from the corpus. You can change the value based on your dataset and inspect the resulting top words to decide whether the number of topics is useful.

Why does the Spark LDA output show numbers instead of topic names?

LDA learns numeric distributions. It does not automatically create descriptive topic names. To make results readable, map vocabulary indexes back to words, list the top words for each topic, and then assign a human-readable label if needed.

Should new Spark projects use MLlib LDA or spark.ml LDA?

If a project already uses RDD-based MLlib code, the MLlib example is still useful. For new pipeline-based projects, the DataFrame-based spark.ml LDA API is usually the better starting point because it works naturally with Spark ML transformers and estimators.

Editorial QA checklist for this Spark LDA tutorial

  • Verify that the tutorial identifies LDA as Latent Dirichlet Allocation, while preserving older code blocks unchanged.
  • Confirm that all added code blocks use PrismJS-compatible classes such as language-plaintext syntax.
  • Check that the explanation makes clear that Spark MLlib LDA needs numeric document vectors, not raw text.
  • Confirm that the Java example keeps the original RDD-based MLlib approach and does not mix it with DataFrame-based code in the same program.
  • Review official Spark documentation links when updating version-specific API details.
  • Ensure that any real-world extension includes vocabulary mapping so topic outputs can be interpreted as words.

Conclusion

In this Spark Tutorial, we learned how topic modelling works with Latent Dirichlet Allocation in Apache Spark MLlib. We also looked at the required vector input format, trained an LDA model with a Java example, printed topic distributions, saved the model, and reviewed practical points for interpreting and improving LDA results.