How to use NGram features for Document Classification in OpenNLP


In this OpenNLP Tutorial, we shall learn how to use N-gram features for Document Classification in OpenNLP, with the help of an example.

This topic is a continuation of document classification using the Maxent model and document classification using the Naive Bayes model. Those tutorials explain in detail how to train a model for document classification (categorization) with the default features incorporated in DoccatFactory.

The following Java snippet defines and initializes the N-gram feature generators that can be used with the Document Categorizer.

featureGenerators is an array in which a list of feature generators (classes that implement the FeatureGenerator interface) can be provided. You may also build your own feature generator class that implements FeatureGenerator and use it with the document categorizer simply by adding it to this list.
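To give a feel for what a custom feature generator looks like, here is a minimal sketch. So that it compiles without OpenNLP on the classpath, it uses a hypothetical stand-in interface, FeatureGeneratorLike, whose single method is assumed to mirror extractFeatures in opennlp.tools.doccat.FeatureGenerator as found in recent OpenNLP releases; please verify the exact signature against the Javadoc of your OpenNLP version. The class name DigitTokenFeatureSketch and the "num=" feature prefix are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;

// Hypothetical stand-in for opennlp.tools.doccat.FeatureGenerator.
// Assumption: the real interface declares a method with this shape.
interface FeatureGeneratorLike {
    Collection<String> extractFeatures(String[] text, Map<String, Object> extraInformation);
}

// Illustrative custom generator: emits one feature per token that contains a digit.
class DigitTokenFeatureSketch implements FeatureGeneratorLike {

    @Override
    public Collection<String> extractFeatures(String[] text, Map<String, Object> extraInformation) {
        Collection<String> features = new ArrayList<>();
        for (String token : text) {
            // Character::isDigit resolves to the int overload for an IntStream.
            if (token.chars().anyMatch(Character::isDigit)) {
                features.add("num=" + token);
            }
        }
        return features;
    }
}
```

In a real project, the class would implement OpenNLP's own FeatureGenerator interface instead of the stand-in, and an instance of it would be placed in the feature generator array alongside the N-gram generators.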

The arguments passed in “new NGramFeatureGenerator(2,3)”, i.e., 2 and 3, are respectively the minimum and maximum number of consecutive words that should be considered as a single feature. For more information on NGramFeatureGenerator, please refer to the Java documentation of NGramFeatureGenerator.
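To make the (2, 3) window concrete, the plain-Java sketch below lists every run of two or three consecutive words in a token array. It is an illustrative re-implementation of the idea only, not OpenNLP's code: the class name NGramSketch and the space-joined output are assumptions, and the feature strings that NGramFeatureGenerator actually emits may be formatted differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in, not OpenNLP's NGramFeatureGenerator.
class NGramSketch {

    // Collects every run of minGram..maxGram consecutive tokens, joined by a space.
    static List<String> ngrams(String[] tokens, int minGram, int maxGram) {
        List<String> features = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            // Stop growing the window once it would run past the last token.
            for (int n = minGram; n <= maxGram && start + n <= tokens.length; n++) {
                features.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "quick", "brown", "fox"};
        // Prints the 2-word and 3-word features, one per line.
        for (String feature : ngrams(tokens, 2, 3)) {
            System.out.println(feature);
        }
    }
}
```

For the four tokens above, the sketch yields five features: "the quick", "the quick brown", "quick brown", "quick brown fox" and "brown fox". Single words are not included because the minimum is 2; a generator created with (1, 1) would cover those.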

The complete program, which takes in a training file, incorporates NGramFeatureGenerator, and generates a model, is provided below:

The training file could be downloaded from here.

Output:

Conclusion

In this Apache OpenNLP Tutorial, we have learnt how to use an N-gram feature generator with the Document Categorizer to help in document classification.