Command Line Tools in Apache OpenNLP

In this OpenNLP tutorial, we shall learn how to use the command line tools that Apache OpenNLP provides for natural language processing tasks such as Named Entity Recognition (NER), Part-of-Speech tagging, Chunking, Sentence Detection, Document Classification (Categorization) and Tokenization.

The following steps set up the command line tools in Apache OpenNLP.

Step 1: Download Apache OpenNLP

Open the mirror at [http://redrockdigimark.com/apachemirror/opennlp/] and click on the latest build of Apache OpenNLP.

Click on the bin package (zip). We are not going to build OpenNLP from source; we are going to use the pre-built version.

Step 2: Unzip Contents

Unzip the package and navigate into the bin folder. It contains the opennlp shell script and the opennlp.bat batch file.
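
On Ubuntu, the unzip and navigate step can be sketched in the terminal as below. The archive name apache-opennlp-1.8.0-bin.zip is an assumption for illustration; use the name of the file you actually downloaded. The folder name apache-opennlp-1.8.0 is the one that appears in the terminal prompt later in this tutorial.

```shell
# Assumed archive name; replace it with the file you downloaded.
ARCHIVE="apache-opennlp-1.8.0-bin.zip"

if [ -f "$ARCHIVE" ]; then
  # Extract quietly into the current directory.
  unzip -q "$ARCHIVE"
  # Move into the bin folder and list the launcher scripts.
  cd apache-opennlp-1.8.0/bin
  ls
else
  echo "Archive $ARCHIVE not found in $(pwd)" >&2
fi
```

If the archive is found, the listing should include the opennlp and opennlp.bat files used in the next commands.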

For Ubuntu: open the terminal and run the following command.

./opennlp

For Windows: open the command prompt and run the following command.

opennlp.bat

The following usage message should be printed to the terminal or command prompt.

arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp
OpenNLP 1.8.0. Usage: opennlp TOOL
where TOOL is one of:
  Doccat                            learned document categorizer
  DoccatTrainer                     trainer for the learnable document categorizer
  DoccatEvaluator                   Measures the performance of the Doccat model with the reference data
  DoccatCrossValidator              K-fold cross validator for the learnable Document Categorizer
  DoccatConverter                   converts leipzig data format to native OpenNLP format
  DictionaryBuilder                 builds a new dictionary
  SimpleTokenizer                   character class tokenizer
  TokenizerME                       learnable tokenizer
  TokenizerTrainer                  trainer for the learnable tokenizer
  TokenizerMEEvaluator              evaluator for the learnable tokenizer
  TokenizerCrossValidator           K-fold cross validator for the learnable tokenizer
  TokenizerConverter                converts foreign data formats (ad,pos,conllx,namefinder,parse) to native OpenNLP format
  DictionaryDetokenizer             
  SentenceDetector                  learnable sentence detector
  SentenceDetectorTrainer           trainer for the learnable sentence detector
  SentenceDetectorEvaluator         evaluator for the learnable sentence detector
  SentenceDetectorCrossValidator    K-fold cross validator for the learnable sentence detector
  SentenceDetectorConverter         converts foreign data formats (ad,pos,conllx,namefinder,parse,moses,letsmt) to native OpenNLP format
  TokenNameFinder                   learnable name finder
  TokenNameFinderTrainer            trainer for the learnable name finder
  TokenNameFinderEvaluator          Measures the performance of the NameFinder model with the reference data
  TokenNameFinderCrossValidator     K-fold cross validator for the learnable Name Finder
  TokenNameFinderConverter          converts foreign data formats (evalita,ad,conll03,bionlp2004,conll02,muc6,ontonotes,brat) to native OpenNLP format
  CensusDictionaryCreator           Converts 1990 US Census names into a dictionary
  POSTagger                         learnable part of speech tagger
  POSTaggerTrainer                  trains a model for the part-of-speech tagger
  POSTaggerEvaluator                Measures the performance of the POS tagger model with the reference data
  POSTaggerCrossValidator           K-fold cross validator for the learnable POS tagger
  POSTaggerConverter                converts foreign data formats (ad,conllx,parse,ontonotes,conllu) to native OpenNLP format
  LemmatizerME                      learnable lemmatizer
  LemmatizerTrainerME               trainer for the learnable lemmatizer
  LemmatizerEvaluator               Measures the performance of the Lemmatizer model with the reference data
  ChunkerME                         learnable chunker
  ChunkerTrainerME                  trainer for the learnable chunker
  ChunkerEvaluator                  Measures the performance of the Chunker model with the reference data
  ChunkerCrossValidator             K-fold cross validator for the chunker
  ChunkerConverter                  converts ad data format to native OpenNLP format
  Parser                            performs full syntactic parsing
  ParserTrainer                     trains the learnable parser
  ParserEvaluator                   Measures the performance of the Parser model with the reference data
  ParserConverter                   converts foreign data formats (ontonotes,frenchtreebank) to native OpenNLP format
  BuildModelUpdater                 trains and updates the build model in a parser model
  CheckModelUpdater                 trains and updates the check model in a parser model
  TaggerModelReplacer               replaces the tagger model in a parser model
  EntityLinker                      links an entity to an external data set
  NGramLanguageModel                gives the probability and most probable next token(s) of a sequence of tokens in a language model
All tools print help when invoked with help parameter
Example: opennlp SimpleTokenizer help
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$
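
Since the tool list is long, it can help to filter it by keyword. The following is a minimal sketch for the Ubuntu terminal; the OPENNLP variable and the find_tools function are hypothetical helpers introduced here, not part of OpenNLP. Standard error is merged into standard output so that the filter works regardless of which stream the usage message is written to.

```shell
# Path to the opennlp launcher; defaults to the script in the current (bin) folder.
OPENNLP="${OPENNLP:-./opennlp}"

# Print only the usage lines that mention the given keyword (case-insensitive).
find_tools() {
  "$OPENNLP" 2>&1 | grep -i -- "$1"
}

if [ -x "$OPENNLP" ]; then
  find_tools tokenizer
else
  echo "opennlp launcher not found at $OPENNLP" >&2
fi
```

With the listing above, the keyword tokenizer should select the tokenizer tools as well as the example line at the end of the usage message.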

Step 3: Run OpenNLP Command

Run the opennlp command with a tool name followed by help to get usage information for any of the tools listed in the previous step.

The example given at the end of the usage message shows how to do this for SimpleTokenizer.

./opennlp SimpleTokenizer help

The response to the above command is shown below.

arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp SimpleTokenizer help
Usage: opennlp SimpleTokenizer < sentences
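
The same help parameter works for the other tools in the list. As a small sketch, the loop below prints the help text for a few of them in one go. The tool names are taken from the usage listing; the OPENNLP variable and the show_help function are hypothetical helpers introduced for this example.

```shell
# Path to the opennlp launcher; defaults to the script in the current (bin) folder.
OPENNLP="${OPENNLP:-./opennlp}"

# Print a header and the help text for each tool name passed as an argument.
show_help() {
  for tool in "$@"; do
    echo "== $tool =="
    # Keep going even if a tool returns a non-zero exit status for help.
    "$OPENNLP" "$tool" help 2>&1 || true
  done
}

if [ -x "$OPENNLP" ]; then
  show_help SimpleTokenizer SentenceDetector POSTagger
else
  echo "opennlp launcher not found at $OPENNLP" >&2
fi
```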

Step 4: Verify

As an example, let us actually use SimpleTokenizer.

Create a text file named “sentences.txt” in the bin folder with a few sentences in it, one per line, as shown below.

I am Joey.
And I don't share food.
Welcome to friends.
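
If you prefer to create the file from the Ubuntu terminal rather than a text editor, a here-document is one way to do it. This is only a convenience sketch; any editor produces the same file.

```shell
# Write the three example sentences to sentences.txt in the current folder.
# The quoted 'EOF' delimiter stops the shell from expanding anything in the text.
cat > sentences.txt <<'EOF'
I am Joey.
And I don't share food.
Welcome to friends.
EOF

# Show the file contents and the number of lines written.
cat sentences.txt
wc -l < sentences.txt
```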

Run the following command.

./opennlp SimpleTokenizer < sentences.txt

The output of SimpleTokenizer for sentences.txt is printed to the terminal or command prompt, as shown below.

arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp SimpleTokenizer < sentences.txt
I am Joey .
And I don ' t share food .
Welcome to friends .


Average: 750.0 sent/s 
Total: 3 sent
Runtime: 0.004s
Execution time: 0.033 seconds
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$

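SimpleTokenizer has split each sentence into tokens and printed them to the terminal, separated by spaces. Punctuation marks become separate tokens, which is why “don't” appears as “don ' t”. It also reported that it processed three sentences (one per line) from the file “sentences.txt”.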
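
The usage listing describes SimpleTokenizer as a "character class tokenizer". To build some intuition for that, the sketch below approximates the behaviour with standard shell tools: a token is taken to be a run of letters, a run of digits, or a single other non-space character. This is not OpenNLP's implementation, and the simple_tokenize function is a hypothetical name; it only reproduces the output shown above for simple ASCII input and may differ on other text.

```shell
# Rough approximation of a character-class tokenizer using grep and paste.
# NOT the OpenNLP implementation; for illustration on simple ASCII input only.
simple_tokenize() {
  while IFS= read -r line || [ -n "$line" ]; do
    # grep -o prints each match on its own line; paste joins them with spaces.
    printf '%s\n' "$line" \
      | grep -oE '[[:alpha:]]+|[[:digit:]]+|[^[:alnum:][:space:]]' \
      | paste -sd' ' -
  done
}

printf '%s\n' "I am Joey." "And I don't share food." | simple_tokenize
# prints:
# I am Joey .
# And I don ' t share food .
```

The apostrophe is neither a letter nor a digit, so it ends the token "don" and starts a token of its own, which matches what SimpleTokenizer printed.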

Conclusion

In this OpenNLP tutorial, we have learned how to set up and use the command line tools in Apache OpenNLP. In further tutorials, we shall see how to do other natural language processing tasks using the Apache OpenNLP command line tools.