Topic Modelling using Latent Dirichlet Allocation in Apache Spark MLlib

What is Topic Modelling?

Topic Modelling is a natural language processing task of identifying the probable topic represented by the text in a document.

We come across articles or documents whose text usually belongs to a topic. For example, consider news articles, research papers or internet pages. Each of these describes or explains a topic. In fact, one usually starts writing with a topic in mind.

An example is right here: in this tutorial, we are discussing Topic Modelling, so our topic is “Topic Modelling”. You might come across the following words more frequently than others:

  • document
  • natural language processing
  • task
  • topic
  • model
  • probability

As another example, if a document belongs to the topic “forest”, it might contain frequent words like forest, trees, animals, ecosystem, life cycle, etc.

To capture this kind of information in a mathematical model, Apache Spark MLlib provides topic modelling using Latent Dirichlet Allocation (LDA).
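Before LDA can be applied, each document is typically reduced to a vector of word counts over a fixed vocabulary, which is the numeric form a topic model consumes. A minimal plain-Python sketch of that preprocessing (the documents and vocabulary here are invented purely for illustration):

```python
# Turn raw documents into term-count vectors, the input format LDA expects.
# The two documents below are made up for illustration only.
docs = [
    "topic model topic probability",
    "forest trees animals forest ecosystem",
]

# Derive a fixed vocabulary: one index per distinct word across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def to_count_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = [0] * len(vocab)
    for word in doc.split():
        counts[index[word]] += 1
    return counts

vectors = [to_count_vector(doc) for doc in docs]
# Each row of `vectors` is one document; each column is one vocabulary word.
```

A corpus of such vectors (one per document) is exactly what the LDA trainer in MLlib operates on.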

Topic modelling using Latent Dirichlet Allocation in Apache Spark MLlib

Now, we shall walk through generating the topic model and using it for prediction, step by step.

  • Step 1 : Start Spark Context

    Configure Spark to run in local mode and start the Spark Context.

  • Step 2 : Load Data into Spark RDD

    Load and parse the sample data from data/mllib/sample_lda_data.txt (we are using the sample data provided with the Apache Spark MLlib examples on GitHub). Each line in the file represents a document, so index each document with a unique id.

  • Step 3 : Run LDA Topic Modeller

    Set the number of topics and run the LDA topic modeller against the data corpus.

  • Step 4 : Output Topics Distribution over vocabulary

    Once the model is generated, we may print each topic's distribution over the vocabulary.

  • Step 5 : Model Persistence

    Save the generated model so it can be used to predict the topics of further documents.

  • Step 6 : Stop Spark Context
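The six steps above can be sketched with the PySpark MLlib API. This is a sketch under the assumptions that pyspark is installed, the sample data file is available at the path shown, and the output path target/LDAModel is writable; it closely follows the structure of the official Spark MLlib LDA example.

```python
# Sketch of the six steps using the PySpark MLlib API.
# Assumes pyspark is installed and data/mllib/sample_lda_data.txt exists locally.
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

# Step 1: start the Spark Context in local mode.
sc = SparkContext("local", "LDAExample")

# Step 2: load the data; each line is a document of space-separated word counts.
data = sc.textFile("data/mllib/sample_lda_data.txt")
parsed = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
# Index each document with a unique id (id first, then the count vector).
corpus = parsed.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Step 3: set the number of topics and run the LDA topic modeller.
ldaModel = LDA.train(corpus, k=3)

# Step 4: print each topic's distribution over the vocabulary.
topics = ldaModel.topicsMatrix()  # vocabSize x k matrix
for topic in range(3):
    print("Topic " + str(topic) + ":")
    for word in range(ldaModel.vocabSize()):
        print("  " + str(topics[word][topic]))

# Step 5: persist the model so it can be reloaded for later predictions.
ldaModel.save(sc, "target/LDAModel")
sameModel = LDAModel.load(sc, "target/LDAModel")

# Step 6: stop the Spark Context.
sc.stop()
```

Note that the document id must come first in each corpus element; LDA.train expects an RDD of (id, vector) pairs.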

Example program: