Command line tools in Apache OpenNLP
In this OpenNLP tutorial, we shall learn how to use the command line tools that Apache OpenNLP provides for natural language processing tasks such as Named Entity Recognition (NER), Parts of Speech tagging, Chunking, Sentence Detection, Document Classification (Categorization) and Tokenization.
Following are the steps to set up the command line tools in Apache OpenNLP :
Step 1 : Download Apache OpenNLP.
Open http://redrockdigimark.com/apachemirror/opennlp/ and click on the latest build of Apache OpenNLP.
Click on the bin package (zip). We are not going to build it from source; we are just going to use the pre-built version.
Step 2 : Unzip the package and navigate into bin folder.
For Ubuntu : Open the terminal and run the command ./opennlp
For Windows : Open the command prompt and run the command opennlp.bat
The following usage of OpenNLP should be echoed on to the terminal or prompt :
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp
OpenNLP 1.8.0. Usage: opennlp TOOL
where TOOL is one of:
  Doccat                          learned document categorizer
  DoccatTrainer                   trainer for the learnable document categorizer
  DoccatEvaluator                 Measures the performance of the Doccat model with the reference data
  DoccatCrossValidator            K-fold cross validator for the learnable Document Categorizer
  DoccatConverter                 converts leipzig data format to native OpenNLP format
  DictionaryBuilder               builds a new dictionary
  SimpleTokenizer                 character class tokenizer
  TokenizerME                     learnable tokenizer
  TokenizerTrainer                trainer for the learnable tokenizer
  TokenizerMEEvaluator            evaluator for the learnable tokenizer
  TokenizerCrossValidator         K-fold cross validator for the learnable tokenizer
  TokenizerConverter              converts foreign data formats (ad,pos,conllx,namefinder,parse) to native OpenNLP format
  DictionaryDetokenizer
  SentenceDetector                learnable sentence detector
  SentenceDetectorTrainer         trainer for the learnable sentence detector
  SentenceDetectorEvaluator       evaluator for the learnable sentence detector
  SentenceDetectorCrossValidator  K-fold cross validator for the learnable sentence detector
  SentenceDetectorConverter       converts foreign data formats (ad,pos,conllx,namefinder,parse,moses,letsmt) to native OpenNLP format
  TokenNameFinder                 learnable name finder
  TokenNameFinderTrainer          trainer for the learnable name finder
  TokenNameFinderEvaluator        Measures the performance of the NameFinder model with the reference data
  TokenNameFinderCrossValidator   K-fold cross validator for the learnable Name Finder
  TokenNameFinderConverter        converts foreign data formats (evalita,ad,conll03,bionlp2004,conll02,muc6,ontonotes,brat) to native OpenNLP format
  CensusDictionaryCreator         Converts 1990 US Census names into a dictionary
  POSTagger                       learnable part of speech tagger
  POSTaggerTrainer                trains a model for the part-of-speech tagger
  POSTaggerEvaluator              Measures the performance of the POS tagger model with the reference data
  POSTaggerCrossValidator         K-fold cross validator for the learnable POS tagger
  POSTaggerConverter              converts foreign data formats (ad,conllx,parse,ontonotes,conllu) to native OpenNLP format
  LemmatizerME                    learnable lemmatizer
  LemmatizerTrainerME             trainer for the learnable lemmatizer
  LemmatizerEvaluator             Measures the performance of the Lemmatizer model with the reference data
  ChunkerME                       learnable chunker
  ChunkerTrainerME                trainer for the learnable chunker
  ChunkerEvaluator                Measures the performance of the Chunker model with the reference data
  ChunkerCrossValidator           K-fold cross validator for the chunker
  ChunkerConverter                converts ad data format to native OpenNLP format
  Parser                          performs full syntactic parsing
  ParserTrainer                   trains the learnable parser
  ParserEvaluator                 Measures the performance of the Parser model with the reference data
  ParserConverter                 converts foreign data formats (ontonotes,frenchtreebank) to native OpenNLP format
  BuildModelUpdater               trains and updates the build model in a parser model
  CheckModelUpdater               trains and updates the check model in a parser model
  TaggerModelReplacer             replaces the tagger model in a parser model
  EntityLinker                    links an entity to an external data set
  NGramLanguageModel              gives the probability and most probable next token(s) of a sequence of tokens in a language model
All tools print help when invoked with help parameter
Example: opennlp SimpleTokenizer help
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$
Step 3 : Run the opennlp command for help on any of the tools listed in the above step.
Help for any of the available tools can be displayed by following the example mentioned in the response to the opennlp command :
$ ./opennlp SimpleTokenizer help
The response to the above command is shown below :
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp SimpleTokenizer help
Usage: opennlp SimpleTokenizer < sentences
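Since every tool prints its usage when invoked with the help parameter, the usage of several tools can be checked in one go. The following is only a sketch, assuming a POSIX shell started in the bin folder; the function name show_help and the OPENNLP variable are our own helpers and are not part of OpenNLP.

```shell
# Path to the OpenNLP launcher; override OPENNLP if it lives elsewhere.
OPENNLP="${OPENNLP:-./opennlp}"

# Print the usage text of each tool named on the command line.
show_help() {
  for tool in "$@"; do
    echo "== $tool =="
    "$OPENNLP" "$tool" help
  done
}
```

For example, show_help SimpleTokenizer TokenizerME SentenceDetector would print the usage of three of the tools listed in Step 2, one after the other.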
Step 4 : Let's try to actually use SimpleTokenizer.
Create a text file, “sentences.txt”, in the bin folder with sentences in it like below:
I am Joey.
And I don’t share food.
Welcome to friends.
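The file can also be created straight from the terminal. The following is a minimal sketch, assuming a POSIX shell whose current directory is the bin folder; a straight apostrophe is used in place of the typographic one so that the file stays plain ASCII.

```shell
# Write the three example sentences to sentences.txt, one per line.
# The quoted 'EOF' delimiter stops the shell from expanding anything in the text.
cat > sentences.txt <<'EOF'
I am Joey.
And I don't share food.
Welcome to friends.
EOF

# Show the contents to confirm the file before passing it to opennlp.
cat sentences.txt
```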
Run the command
./opennlp SimpleTokenizer < sentences.txt
The following output of SimpleTokenizer on sentences.txt is echoed to the terminal or prompt :
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$ ./opennlp SimpleTokenizer < sentences.txt
I am Joey .
And I don ' t share food .
Welcome to friends .

Average: 750.0 sent/s
Total: 3 sent
Runtime: 0.004s
Execution time: 0.033 seconds
arjun@arjun-VPCEH26EN:~/apache-opennlp-1.8.0/bin$
SimpleTokenizer has found the tokens in the sentences and echoed them on to the terminal, separated by spaces. It also reported that there are three sentences in the file, “sentences.txt”.
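To get a feel for what a character class tokenizer does, its behaviour on this input can be imitated with sed. This is only a rough approximation and not OpenNLP's implementation: it separates runs of letters and digits from neighbouring punctuation, whereas the real SimpleTokenizer, as far as we understand it, also splits letters from digits and treats differing punctuation characters as separate tokens. The function name simple_tokenize is our own.

```shell
# Rough imitation of character class tokenization (NOT the OpenNLP code):
# insert a space wherever an alphanumeric character touches a punctuation character.
simple_tokenize() {
  sed -E \
    -e 's/([[:alnum:]])([^[:alnum:][:space:]])/\1 \2/g' \
    -e 's/([^[:alnum:][:space:]])([[:alnum:]])/\1 \2/g'
}

# Feed it the sample sentences (straight apostrophe used for plain ASCII).
printf '%s\n' 'I am Joey.' "And I don't share food." 'Welcome to friends.' | simple_tokenize
```

For these three sentences the result should match the token lines printed by SimpleTokenizer above; on other input, for example text that mixes letters and digits, the two can differ.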
We have successfully learned how to set up and use the command line tools in Apache OpenNLP. In our further tutorials, we shall see how to do other Natural Language Processing tasks using Apache OpenNLP's command line tools.