Tokenizer Example in Apache OpenNLP

In this OpenNLP tutorial, we shall look at a Tokenizer example in Apache OpenNLP, along with a brief introduction to the tokenization process.

What is tokenization ?

Tokenization is the process of segmenting a string into smaller parts called tokens (sub-strings). These tokens are usually words, punctuation marks, sequences of digits, and the like. An example is shown in the following table :

Input to Tokenizer : John is 26 years old.
Output of Tokenizer : John | is | 26 | years | old | .
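
To make the idea concrete without any library, here is a minimal plain-Java sketch that splits a string into runs of letters, runs of digits, and single punctuation characters. It is only an illustration of segmentation in general, not the algorithm OpenNLP uses; the class name NaiveTokenizerSketch and the regular expression are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
public class NaiveTokenizerSketch {

    // A token is a run of letters, a run of digits,
    // or one other non-whitespace character (e.g. punctuation).
    private static final Pattern TOKEN =
            Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}\\s]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [John, is, 26, years, old, .]
        System.out.println(tokenize("John is 26 years old."));
    }
}
```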

Tokenization in OpenNLP

The Tokenizer API in OpenNLP provides the following three ways of tokenization :

  • TokenizerME class loaded with a token model
  • WhitespaceTokenizer
  • SimpleTokenizer

Note : The OpenNLP version used is 1.7.2.

Please observe the differences in the output from these three ways of tokenization in the examples provided below.

TokenizerME class loaded with a token model

  • Step 1 : Read the pretrained model into a stream.
  • Step 2 : Read the stream to a Tokenizer model.
  • Step 3 : Initialize the tokenizer with the model.
  • Step 4 : Use TokenizerME.tokenize() method to extract the tokens to a String Array.
  • Step 5 : Use TokenizerME.getTokenProbabilities() to get the probabilities for the segments to be tokens.
  • Step 6 : Finally, print the results.

Putting everything together gives the following program :
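
The six steps can be sketched roughly as follows. This is a minimal sketch rather than the original listing: the class name TokenizerMEExample, the sample sentence, and the model file name en-token.bin (a pretrained English token model that has to be downloaded separately and placed in the working directory) are assumptions. The OpenNLP jar must be on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizerMEExample {

    public static String[] tokenize(String modelPath, String sentence) throws IOException {
        // Step 1 : read the pretrained model into a stream
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            // Step 2 : read the stream into a tokenizer model
            TokenizerModel model = new TokenizerModel(modelIn);
            // Step 3 : initialize the tokenizer with the model
            TokenizerME tokenizer = new TokenizerME(model);
            // Step 4 : extract the tokens into a String array
            String[] tokens = tokenizer.tokenize(sentence);
            // Step 5 : probabilities for the tokens of the last tokenize() call
            double[] probabilities = tokenizer.getTokenProbabilities();
            // Step 6 : print each token with its probability
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + " : " + probabilities[i]);
            }
            return tokens;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the model path is passed as the first argument,
        // falling back to en-token.bin in the working directory.
        String modelPath = args.length > 0 ? args[0] : "en-token.bin";
        tokenize(modelPath, "John is 26 years old.");
    }
}
```

The tokens and their probabilities depend on the model that is loaded, so the exact numbers will differ from one model to another.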

When the above program is run, the output to the console is as shown below :

WhitespaceTokenizer

The following example demonstrates the WhitespaceTokenizer of the OpenNLP Tokenizer API, which splits the input on whitespace only :
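
A minimal sketch of such a program is given here, assuming the OpenNLP jar is on the classpath; the class name WhitespaceTokenizerExample and the sample sentence are chosen only for illustration. No model is needed, because the tokenizer is available as the shared instance WhitespaceTokenizer.INSTANCE.

```java
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class WhitespaceTokenizerExample {

    public static String[] tokenize(String sentence) {
        // Non-whitespace sequences are returned as tokens, so punctuation
        // stays attached to the neighbouring word ("old." is one token).
        return WhitespaceTokenizer.INSTANCE.tokenize(sentence);
    }

    public static void main(String[] args) {
        for (String token : tokenize("John is 26 years old.")) {
            System.out.println(token);
        }
    }
}
```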

When the above program is run, the output to the console is as shown below :

SimpleTokenizer

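The following example demonstrates the SimpleTokenizer of the OpenNLP Tokenizer API, which splits the input by character class, so that letters, digits and punctuation end up in separate tokens :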
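
A minimal sketch of such a program is given here, again assuming the OpenNLP jar is on the classpath; the class name SimpleTokenizerExample and the sample sentence are chosen only for illustration. As with the whitespace tokenizer, no model is required.

```java
import opennlp.tools.tokenize.SimpleTokenizer;

public class SimpleTokenizerExample {

    public static String[] tokenize(String sentence) {
        // Consecutive characters of the same class form one token, so the
        // trailing period is separated from the word before it.
        return SimpleTokenizer.INSTANCE.tokenize(sentence);
    }

    public static void main(String[] args) {
        for (String token : tokenize("John is 26 years old.")) {
            System.out.println(token);
        }
    }
}
```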

When the above program is run, the output to the console is as shown below :

Conclusion :

In this Apache OpenNLP tutorial, we have seen the different ways of tokenization that the OpenNLP Tokenizer API provides.