Command Line Tools in Apache OpenNLP
In this OpenNLP tutorial, we shall learn how to use the command line tools that Apache OpenNLP provides for natural language processing tasks such as Named Entity Recognition (NER), Part-of-Speech tagging, Chunking, Sentence Detection, Document Categorization, and Tokenization.
Following are the steps to set up the command line tools in Apache OpenNLP.
Step 1: Download Apache OpenNLP
Open [http://redrockdigimark.com/apachemirror/opennlp/] and click on the latest build of Apache OpenNLP.
Click on the bin package (zip). We are not going to build it from source; we are going to use the pre-built version.
Step 2: Unzip Contents
Unzip the package and navigate into the bin folder.
For Ubuntu: Open the terminal and run the following command.
./opennlp
For Windows: Open the command prompt and run the following command.
opennlp.bat
The following usage message of OpenNLP should be echoed to the terminal or prompt.
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp
OpenNLP 1.8.0. Usage: opennlp TOOL
where TOOL is one of:
  Doccat                          learned document categorizer
  DoccatTrainer                   trainer for the learnable document categorizer
  DoccatEvaluator                 Measures the performance of the Doccat model with the reference data
  DoccatCrossValidator            K-fold cross validator for the learnable Document Categorizer
  DoccatConverter                 converts leipzig data format to native OpenNLP format
  DictionaryBuilder               builds a new dictionary
  SimpleTokenizer                 character class tokenizer
  TokenizerME                     learnable tokenizer
  TokenizerTrainer                trainer for the learnable tokenizer
  TokenizerMEEvaluator            evaluator for the learnable tokenizer
  TokenizerCrossValidator         K-fold cross validator for the learnable tokenizer
  TokenizerConverter              converts foreign data formats (ad,pos,conllx,namefinder,parse) to native OpenNLP format
  DictionaryDetokenizer
  SentenceDetector                learnable sentence detector
  SentenceDetectorTrainer         trainer for the learnable sentence detector
  SentenceDetectorEvaluator       evaluator for the learnable sentence detector
  SentenceDetectorCrossValidator  K-fold cross validator for the learnable sentence detector
  SentenceDetectorConverter       converts foreign data formats (ad,pos,conllx,namefinder,parse,moses,letsmt) to native OpenNLP format
  TokenNameFinder                 learnable name finder
  TokenNameFinderTrainer          trainer for the learnable name finder
  TokenNameFinderEvaluator        Measures the performance of the NameFinder model with the reference data
  TokenNameFinderCrossValidator   K-fold cross validator for the learnable Name Finder
  TokenNameFinderConverter        converts foreign data formats (evalita,ad,conll03,bionlp2004,conll02,muc6,ontonotes,brat) to native OpenNLP format
  CensusDictionaryCreator         Converts 1990 US Census names into a dictionary
  POSTagger                       learnable part of speech tagger
  POSTaggerTrainer                trains a model for the part-of-speech tagger
  POSTaggerEvaluator              Measures the performance of the POS tagger model with the reference data
  POSTaggerCrossValidator         K-fold cross validator for the learnable POS tagger
  POSTaggerConverter              converts foreign data formats (ad,conllx,parse,ontonotes,conllu) to native OpenNLP format
  LemmatizerME                    learnable lemmatizer
  LemmatizerTrainerME             trainer for the learnable lemmatizer
  LemmatizerEvaluator             Measures the performance of the Lemmatizer model with the reference data
  ChunkerME                       learnable chunker
  ChunkerTrainerME                trainer for the learnable chunker
  ChunkerEvaluator                Measures the performance of the Chunker model with the reference data
  ChunkerCrossValidator           K-fold cross validator for the chunker
  ChunkerConverter                converts ad data format to native OpenNLP format
  Parser                          performs full syntactic parsing
  ParserTrainer                   trains the learnable parser
  ParserEvaluator                 Measures the performance of the Parser model with the reference data
  ParserConverter                 converts foreign data formats (ontonotes,frenchtreebank) to native OpenNLP format
  BuildModelUpdater               trains and updates the build model in a parser model
  CheckModelUpdater               trains and updates the check model in a parser model
  TaggerModelReplacer             replaces the tagger model in a parser model
  EntityLinker                    links an entity to an external data set
  NGramLanguageModel              gives the probability and most probable next token(s) of a sequence of tokens in a language model
All tools print help when invoked with help parameter
Example: opennlp SimpleTokenizer help
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$
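The commands in this tutorial are run as ./opennlp from inside the bin folder. If you would rather call opennlp from any directory on Ubuntu, one option is to add the bin folder to PATH. The sketch below assumes a POSIX shell and the install location seen in the prompt of the sample output (~/apache-opennlp-1.8.0/bin); the variable name OPENNLP_BIN is only illustrative, so adjust both to your setup.

```shell
# Assumed install location, taken from the prompt in the sample output; adjust as needed.
OPENNLP_BIN="$HOME/apache-opennlp-1.8.0/bin"

# Append the bin folder to PATH for the current shell session only.
export PATH="$PATH:$OPENNLP_BIN"

# Print where the shell now finds the launcher, or a hint if it is not there.
command -v opennlp || echo "opennlp not found in $OPENNLP_BIN" >&2
```

The change lasts only for the current terminal session. To keep it, the export line could be added to a shell startup file such as ~/.bashrc.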
Step 3: Run OpenNLP Command
To get help on any of the tools listed in the above step, run the opennlp command with the tool name followed by help, as in the example at the end of the usage message.
$ ./opennlp SimpleTokenizer help
The response to the above command is shown below.
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp SimpleTokenizer help
Usage: opennlp SimpleTokenizer < sentences
Step 4: Verify
As an example, let's run SimpleTokenizer on a small text file.
Create a text file named sentences.txt in the bin folder, with one sentence per line as shown below. The tool treats each input line as one sentence.
I am Joey.
And I don't share food.
Welcome to friends.
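If you prefer to create the file from the terminal, a minimal sketch for a POSIX shell follows. It writes each sentence on its own line, which is an assumption that matches the count of three sentences the tool reports for this file.

```shell
# Write the three example sentences to sentences.txt, one per line.
# printf repeats the '%s\n' format once for every argument that follows.
printf '%s\n' \
  "I am Joey." \
  "And I don't share food." \
  "Welcome to friends." > sentences.txt

# Show the file to confirm its contents.
cat sentences.txt
```

Run it from inside the bin folder so that the file lands next to the opennlp script.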
Run the command
./opennlp SimpleTokenizer < sentences.txt
The following output of SimpleTokenizer on sentences.txt is echoed to the terminal or prompt.
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp SimpleTokenizer < sentences.txt
I am Joey .
And I don ' t share food .
Welcome to friends .
Average: 750.0 sent/s
Total: 3 sent
Runtime: 0.004s
Execution time: 0.033 seconds
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$
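SimpleTokenizer has split the sentences into tokens and echoed them to the terminal, separated by spaces. Note that punctuation marks become tokens of their own, so don't is split into don, ' and t. It also reported that it processed three sentences (Total: 3 sent) from the file sentences.txt.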
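To get a feel for why the output looks this way, the splitting can be imitated with standard shell tools. The usage message describes SimpleTokenizer as a character class tokenizer; the sketch below is only a rough approximation of that idea, not OpenNLP's implementation. It assumes that a token is a run of letters, a run of digits, or any other single non-space character, and the function name tokenize_line is hypothetical.

```shell
# Rough approximation of character-class tokenization (not OpenNLP's own code).
# A token is: a run of letters, a run of digits, or one other non-space character.
tokenize_line() {
  # grep -o prints each match on its own line; paste joins the lines with spaces.
  printf '%s\n' "$1" \
    | grep -oE '[[:alpha:]]+|[[:digit:]]+|[^[:alnum:][:space:]]' \
    | paste -sd ' ' -
}

# Try it on the second example sentence.
tokenize_line "And I don't share food."
```

For the example sentences this gives the same tokens as the output above. Other inputs, such as repeated punctuation, may be split differently by OpenNLP.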
Conclusion
In this OpenNLP Tutorial, we have learned how to set up and use the Command Line Tools in Apache OpenNLP. In further tutorials, we shall see how to do other Natural Language Processing tasks using the Apache OpenNLP Command Line Tools.