Training of Document Categorizer using Maximum Entropy Model in OpenNLP

In this Apache OpenNLP tutorial, we shall learn how to train a Document Categorizer using the Maximum Entropy model in OpenNLP.

Document categorization is a requirement-specific task: the categories depend on the application. Hence Apache OpenNLP provides no pre-built model for this natural language processing problem.

In this tutorial, we shall train the Document Categorizer to classify documents into two categories: Thriller and Romantic. The categories chosen are movie genres. The data for each document is the plot of the movie.

Following are the steps to train a Document Categorizer that uses the Maxent (Maximum Entropy) algorithm to create a model:

  • Step 1 : Prepare the training data.
    The training data file should contain one example per line for each observation or document, in the format: category followed by the text of the document, separated by a space.
    For example, consider the following line from the training file:

    Thriller John Hannibal Smith Liam Neeson is held captive in Mexico

    Here,
    Category is “Thriller”
    Data of the document is “John Hannibal Smith Liam Neeson is held captive in Mexico”.

    The complete training file used in this example is en-movie-category.

  • Step 2 : Read the training data file.

  • Step 3 : Define the training parameters.

  • Step 4 : Train and create a model from the read training data and defined training parameters.

  • Step 5 : Save the newly trained model to a local file, which can be used later for predicting the movie genre.

  • Step 6 : Test the model on a sample string and print the probability of the string belonging to each category. The method DocumentCategorizer.categorize(String[] wordsOfDoc) takes as its argument an array of Strings, which are the words of the document.
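
The training-line format from Step 1 and the word array expected in Step 6 can be sketched in plain Java, without OpenNLP on the classpath. This is only an illustration under stated assumptions: the class TrainingLineSketch and its helpers parseTrainingLine, toWords and indexOfBest are hypothetical names, not part of the OpenNLP API, and splitting on whitespace is an assumed stand-in for whatever tokenization the real program uses.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of OpenNLP.
public class TrainingLineSketch {

    /** One training example: a category label plus the words of the document. */
    static final class Sample {
        final String category;
        final String[] words;

        Sample(String category, String[] words) {
            this.category = category;
            this.words = words;
        }
    }

    /** Splits "Category word1 word2 ..." on whitespace; the first token is the category. */
    static Sample parseTrainingLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                    "Expected a category followed by document text: " + line);
        }
        return new Sample(tokens[0], Arrays.copyOfRange(tokens, 1, tokens.length));
    }

    /** Splits a test document into a word array (assumption: whitespace tokenization). */
    static String[] toWords(String document) {
        return document.trim().split("\\s+");
    }

    /** Returns the index of the largest probability, i.e. the most likely category. */
    static int indexOfBest(double[] probabilities) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The example line from Step 1.
        Sample sample = parseTrainingLine(
                "Thriller John Hannibal Smith Liam Neeson is held captive in Mexico");
        System.out.println("Category: " + sample.category);
        System.out.println("Words: " + String.join(" ", sample.words));
    }
}
```

With a trained model, the array returned by toWords would be passed to DocumentCategorizer.categorize(String[] wordsOfDoc), and indexOfBest could then be applied to the probabilities that call returns.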

The complete program is shown below:

When the above program is run, the output to the console is as shown below:

The location of the training file and the locally saved model file are shown in the following picture:

Location of Training file and Model file

Conclusion:

In this OpenNLP tutorial, we have briefly covered the training input requirements of the Document Categorizer API of OpenNLP, and walked through an example program for training a Document Categorizer using the Maximum Entropy model.