Language Detector Example in Apache OpenNLP

In this tutorial, we shall learn how to use the Language Detector in Apache OpenNLP with the help of an example.

At the time of writing this tutorial, “langdetect” is a package that was merged into opennlp-master on GitHub very recently (two days ago). You may therefore not find it in the standard binary package of OpenNLP, but you can build the project by cloning the master branch from GitHub.

To build the project with Maven after cloning opennlp-master from GitHub, follow the instructions in README.md.

Once the project is built, import it into the IDE of your choice, such as Eclipse or IntelliJ IDEA.

The training file and the code for the different methods have been taken from the opennlp-tools test folder and put together for this example. Feel free to explore more methods at https://github.com/apache/opennlp/tree/master/opennlp-tools/src/test/java/opennlp/tools/langdetect.

Steps to Use Language Detector

Following are the steps to use LanguageDetector from Apache OpenNLP.

Step 1: Load the training data

Load the training data into LanguageDetectorSampleStream.

LanguageDetectorSampleStream sampleStream = null;
try {
    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("training-data" + File.separator + "DoccatSample.txt"));
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    sampleStream = new LanguageDetectorSampleStream(lineStream);
} catch (FileNotFoundException e){
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

The structure of the training data is similar to that of the document categorizer. Each line in the training file belongs to one language: the first word in the line is the language label (for example pob, ita, spa or fra in the sample file), and it is separated from the text of the line by a whitespace character.

Refer to DoccatSample.txt for the training file.
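To make the line layout concrete, here is a minimal sketch in plain Java that splits a training line into its language label and its text. It does not use OpenNLP and is not OpenNLP's own parser; the class name TrainingLineSketch and the second sample sentence are made up for illustration. A tab is used as the separator here because it is a whitespace character that cannot be confused with the spaces inside the sentence.

```java
import java.util.Arrays;
import java.util.List;

public class TrainingLineSketch {

    // Splits one training line into { language label, text } at the first run of whitespace.
    // Illustrative only: this mimics the layout described above, it is not OpenNLP's parser.
    static String[] split(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected '<language> <text>' but got: " + line);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines; the labels follow the codes seen in this tutorial's output.
        List<String> lines = Arrays.asList(
                "pob\testava em uma marcenaria na Rua Bruno",
                "spa\tel gato duerme en la casa");
        for (String line : lines) {
            String[] parts = split(line);
            System.out.println("language=" + parts[0] + " text=" + parts[1]);
        }
    }
}
```

A line that contains only a label and no text is rejected with an exception, which is a design choice of this sketch rather than documented OpenNLP behaviour.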

Step 2: Define the training parameters

TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 100);
params.put(TrainingParameters.CUTOFF_PARAM, 5);
params.put("DataIndexer", "TwoPass");
params.put(TrainingParameters.ALGORITHM_PARAM, "NAIVEBAYES");

Training parameters configure the training algorithm. You can also specify which algorithm is used to train the language detector.

Some of the training parameters are the number of iterations, the cutoff and the algorithm.
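The cutoff parameter is easiest to understand with a small illustration. The sketch below is plain Java, not OpenNLP code: it counts how often each feature occurs and keeps only those seen at least cutoff times, which is the general idea behind a frequency cutoff. The class name, the feature strings and the exact comparison (>=) are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CutoffSketch {

    // Returns the features that occur at least `cutoff` times in the input.
    static Set<String> keepFrequent(List<String> features, int cutoff) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String feature : features) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical feature occurrences collected from training text.
        List<String> features = Arrays.asList("em", "na", "em", "ua", "em", "na", "em", "em");
        System.out.println(keepFrequent(features, 5)); // only "em" occurs 5 times
        System.out.println(keepFrequent(features, 2)); // "em" (5 times) and "na" (2 times)
    }
}
```

With a higher cutoff, rare features are dropped, which typically makes the model smaller at the risk of discarding useful evidence when the training file is small.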

Step 3: Train the model

LanguageDetectorModel model = LanguageDetectorME.train(sampleStream, params, new LanguageDetectorFactory());

Step 4: Predict using the model

Once the model is trained, we can wrap it in a LanguageDetectorME and use it for prediction. We shall print the confidence scores that the model assigns to each possible language for the test sentence.

LanguageDetector ld = new LanguageDetectorME(model);
Language[] languages = ld.predictLanguages("estava em uma marcenaria na Rua Bruno");
System.out.println("Predicted languages..");
for(Language language:languages){
    System.out.println(language.getLang()+"  confidence:"+language.getConfidence());
}
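In practice you often want a single answer rather than the full list. The following plain-Java sketch shows one way to do that: pick the entry with the highest confidence and fall back to "unknown" when that confidence is below a threshold. The Prediction class is a hypothetical stand-in for the language code and confidence returned by Language.getLang() and Language.getConfidence(), and the 0.5 threshold is an arbitrary choice for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class BestLanguageSketch {

    // Hypothetical stand-in for a (language code, confidence) pair.
    static final class Prediction {
        final String lang;
        final double confidence;

        Prediction(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    // Returns the code of the most confident prediction,
    // or "unknown" if the list is empty or no entry reaches minConfidence.
    static String best(List<Prediction> predictions, double minConfidence) {
        Prediction top = null;
        for (Prediction p : predictions) {
            if (top == null || p.confidence > top.confidence) {
                top = p;
            }
        }
        return (top != null && top.confidence >= minConfidence) ? top.lang : "unknown";
    }

    public static void main(String[] args) {
        // Values taken from the output shown in this tutorial, rounded for readability.
        List<Prediction> predictions = Arrays.asList(
                new Prediction("pob", 0.9999),
                new Prediction("ita", 1.0e-4),
                new Prediction("spa", 6.9e-7),
                new Prediction("fra", 1.0e-12));
        System.out.println(best(predictions, 0.5)); // prints pob
    }
}
```

The LanguageDetector interface also has a predictLanguage method that returns only the best match; the threshold logic above is an addition of this sketch and is worth tuning on your own data.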

Example – Language Detector

LanguageDetectorMEExample.java

import java.io.*;
 
import opennlp.tools.langdetect.*;
import opennlp.tools.util.*;
 
/**
* Language Detector Example in Apache OpenNLP
*/
public class LanguageDetectorMEExample {
 
    private static LanguageDetectorModel model;
 
    public static void main(String[] args){
 
        // loading the training data to LanguageDetectorSampleStream
        LanguageDetectorSampleStream sampleStream = null;
        try {
            InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("training-data" + File.separator + "DoccatSample.txt"));
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            sampleStream = new LanguageDetectorSampleStream(lineStream);
        } catch (FileNotFoundException e){
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
 
        // training parameters
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 100);
        params.put(TrainingParameters.CUTOFF_PARAM, 5);
        params.put("DataIndexer", "TwoPass");
        params.put(TrainingParameters.ALGORITHM_PARAM, "NAIVEBAYES");
 
        // train the model
        try {
            model = LanguageDetectorME.train(sampleStream, params, new LanguageDetectorFactory());
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("Completed");
 
        // create the language detector from the trained model
        LanguageDetector ld = new LanguageDetectorME(model);
        // use model for predicting the language
        Language[] languages = ld.predictLanguages("estava em uma marcenaria na Rua Bruno");
        System.out.println("Predicted languages..");
        for(Language language:languages){
            // printing the language and the confidence score for the test data to belong to the language
            System.out.println(language.getLang()+"  confidence:"+language.getConfidence());
        }
    }
}

Output:

Indexing events with TwoPass using cutoff of 5

	Computing event counts...  done. 99 events
	Indexing...  done.
Collecting events... Done indexing in 1.35 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 99
	    Number of Outcomes: 4
	  Number of Predicates: 4849
Computing model parameters...
Stats: (25/99) 0.25252525252525254
...done.
Completed
Predicted languages..
pob  confidence:0.9998990013343246
ita  confidence:1.0030518375770318E-4
spa  confidence:6.934808895132994E-7
fra  confidence:1.0283097500463277E-12

Process finished with exit code 0
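A quick sanity check on the scores above: the four confidences add up to roughly 1, which suggests they can be read as the model's belief spread across the languages it was trained on. The plain-Java sketch below simply sums the values printed in the output; the class name is made up for this example.

```java
public class ConfidenceSumSketch {

    // Adds up a list of confidence scores.
    static double sum(double[] confidences) {
        double total = 0.0;
        for (double c : confidences) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Confidence values copied from the output above, in the order pob, ita, spa, fra.
        double[] confidences = {
            0.9998990013343246,
            1.0030518375770318E-4,
            6.934808895132994E-7,
            1.0283097500463277E-12
        };
        double total = sum(confidences);
        System.out.println("sum = " + total);
        System.out.println("close to 1: " + (Math.abs(total - 1.0) < 1e-9));
    }
}
```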

Conclusion

In this Apache OpenNLP Tutorial, we have learnt how to use the Language Detector in Apache OpenNLP, an NLP library.