Document Classification using NGram Features in OpenNLP

In this tutorial, we shall learn how to use NGram features for Document Classification in OpenNLP using an example.

This topic continues from document classification using the Maxent model and document classification using the Naive Bayes model, which explain in detail how to train a model for document classification (categorization) with the default features built into DoccatFactory.

The following Java snippet defines and initializes N-gram feature generators that can be used with the Document Categorizer.

FeatureGenerator[] featureGenerators = { new NGramFeatureGenerator(1,1),
					new NGramFeatureGenerator(2,3) };
DoccatFactory factory = new DoccatFactory(featureGenerators);

featureGenerators is an array in which the feature generators (classes that implement the FeatureGenerator interface) are provided. You may also write your own feature generator class that implements FeatureGenerator and use it with the document categorizer simply by adding it to this array.

The arguments passed in “new NGramFeatureGenerator(2,3)”, i.e., 2 and 3, are the minimum and maximum number of words, respectively, that are combined into a single feature. For more information on NGramFeatureGenerator, please refer to the Java documentation of NGramFeatureGenerator [http://opennlp.apache.org/docs/1.7.2/apidocs/opennlp-tools/opennlp/tools/doccat/NGramFeatureGenerator.html#NGramFeatureGenerator-int-int-].
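To make the minimum and maximum arguments concrete, here is a minimal plain-Java sketch of how word n-grams of a given size range can be enumerated. It is only an illustration: the class NGramSketch and the method ngrams are hypothetical names, it does not call OpenNLP, and the space-joined strings it produces are an assumption for readability, not the feature format OpenNLP uses internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a plain-Java sketch of word n-gram windows.
// NGramSketch and ngrams(...) are hypothetical names, not OpenNLP API,
// and the space-joined strings are not OpenNLP's internal feature format.
class NGramSketch {

    // Returns every run of n consecutive tokens, for minGram <= n <= maxGram.
    static List<String> ngrams(String[] tokens, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            // Grow the window from minGram up to maxGram words,
            // stopping when it would run past the end of the text.
            for (int n = minGram; n <= maxGram && start + n <= tokens.length; n++) {
                grams.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        String[] tokens = "Kate rejects their overtures".split(" ");

        // Same range as new NGramFeatureGenerator(1,1): single words only.
        System.out.println(ngrams(tokens, 1, 1));
        // prints [Kate, rejects, their, overtures]

        // Same range as new NGramFeatureGenerator(2,3): two- and three-word runs.
        System.out.println(ngrams(tokens, 2, 3));
        // prints [Kate rejects, Kate rejects their, rejects their, rejects their overtures, their overtures]
    }
}
```

With both generators in the array, the categorizer sees single words as well as two- and three-word sequences, so word order within short phrases also contributes to the classification.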

Example – Document Classification using NGram Features in OpenNLP

The complete program below takes a training file, incorporates NGramFeatureGenerator, and generates a model.

DocClassificationNGramFeaturesDemo.java

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizer;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.doccat.NGramFeatureGenerator;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

/**
 * OpenNLP version 1.7.2
 * Usage of NGram features for Document Classification in OpenNLP
 * @author www.tutorialkart.com
 */
public class DocClassificationNGramFeaturesDemo {

	public static void main(String[] args) {

		try {
			// read the training data
			InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("train"+File.separator+"en-movie-category.train"));
			ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
			ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

			// define the training parameters
			TrainingParameters params = new TrainingParameters();
			params.put(TrainingParameters.ITERATIONS_PARAM, 10+"");
			params.put(TrainingParameters.CUTOFF_PARAM, 0+"");
			
			// feature generators - N-gram feature generators
			FeatureGenerator[] featureGenerators = { new NGramFeatureGenerator(1,1),
					new NGramFeatureGenerator(2,3) };
			DoccatFactory factory = new DoccatFactory(featureGenerators);

			// create a model from training data
			DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);
			System.out.println("\nModel is successfully trained.");

			// save the model to local
			// try-with-resources closes the stream once the model is written
			try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream("model"+File.separator+"en-movie-classifier-maxent.bin"))) {
				model.serialize(modelOut);
			}
			System.out.println("\nTrained Model is saved locally at : "+"model"+File.separator+"en-movie-classifier-maxent.bin");

			// test the model file by subjecting it to prediction
			DocumentCategorizer doccat = new DocumentCategorizerME(model);
			String[] docWords = "Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold".replaceAll("[^A-Za-z]", " ").split(" ");
			double[] aProbs = doccat.categorize(docWords);

			// print the probabilities of the categories
			System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
			for(int i=0;i<doccat.getNumberOfCategories();i++){
				System.out.println(doccat.getCategory(i)+" : "+aProbs[i]);
			}
			System.out.println("---------------------------------");

			System.out.println("\n"+doccat.getBestCategory(aProbs)+" : is the predicted category for the given sentence.");
		}
		catch (IOException e) {
			System.out.println("An exception in reading the training file. Please check.");
			e.printStackTrace();
		}
	}
}

The training file could be downloaded from here.

Output

Indexing events using cutoff of 0

	Computing event counts...  done. 66 events
	Indexing...  done.
Sorting and merging events... done. Reduced 66 events to 66.
Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 66
	    Number of Outcomes: 2
	  Number of Predicates: 74080
...done.
Computing model parameters ...
Performing 10 iterations.
  1:  ... loglikelihood=-45.747713916956386	0.4090909090909091
  2:  ... loglikelihood=-39.482448265755195	1.0
  3:  ... loglikelihood=-34.73809942995604	1.0
  4:  ... loglikelihood=-31.01773543201995	1.0
  5:  ... loglikelihood=-28.021100513571625	1.0
  6:  ... loglikelihood=-25.55532624366708	1.0
  7:  ... loglikelihood=-23.490627352875972	1.0
  8:  ... loglikelihood=-21.736377961873213	1.0
  9:  ... loglikelihood=-20.227350308507212	1.0
 10:  ... loglikelihood=-18.915391558485368	1.0

Model is successfully trained.

Trained Model is saved locally at : model/en-movie-classifier-maxent.bin

---------------------------------
Category : Probability
---------------------------------
Thriller : 0.4912161738321056
Romantic : 0.5087838261678945
---------------------------------

Romantic : is the predicted category for the given sentence.
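
The predicted category in the last line is the one with the highest probability in the table above it. A minimal plain-Java sketch of that selection follows; the class BestCategorySketch and the method bestCategory are hypothetical names used for illustration (not OpenNLP API), and the probabilities are rounded from the output above.

```java
// Illustrative only: picks the label with the largest probability.
// BestCategorySketch and bestCategory(...) are hypothetical names, not OpenNLP API.
class BestCategorySketch {

    // Returns the category whose probability is highest (first one wins on ties).
    static String bestCategory(String[] categories, double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        String[] categories = {"Thriller", "Romantic"};
        double[] probs = {0.49, 0.51}; // rounded from the program output
        System.out.println(bestCategory(categories, probs));
        // prints Romantic
    }
}
```

Note that the two probabilities are close, so the prediction for this sample is not a confident one.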

Conclusion

In this Apache OpenNLP Tutorial, we have learnt how to use N-gram feature generators with the Document Categorizer for document classification.