Tokenizer Example in Apache openNLP
In this tutorial, we shall look into Tokenizer Example in Apache OpenNLP. Also, a little understanding of the Tokenizaion process.
What is tokenization?
Tokenization is a process of segmenting strings into smaller parts called tokens(say sub-strings). These tokens are usually words, punctuation marks, sequence of digits, and like.
An example is shown in the following table.
Input to Tokenizer | John is 26 years old. |
Output of Tokenizer | [John, is, 26, years, old] |
Tokenization in OpenNLP
Tokenizer API in OpenNLP provides following three ways for tokenization:
Note : OpenNLP version used is 1.7.2.
Please observe the differences in the output from these three ways of tokenization in the examples provided below.
TokenizerME class Loaded with a Token Model
Step 1: Read the pretrained model into a stream.
InputStream modelIn = new FileInputStream("en-token.bin");
Step 2: Read the stream to a Tokenizer model.
TokenizerModel model = new TokenizerModel(modelIn);
Step 3: Initialize the tokenizer with the model.
TokenizerME tokenizer = new TokenizerME(model);
Step 4: Use TokenizerME.tokenize() method to extract the tokens to a String Array.
String tokens[] = tokenizer.tokenize("John is 26 years old.");
Step 5: Use TokenizerME.getTokenProbabilities() to get the probabilities for the segments to be tokens.
double tokenProbs[] = tokenizer.getTokenProbabilities();
Step 6: Finally, print the results.
Everything put together, is the below below program :
import java.io.FileInputStream; import java.io.IOException; import java.io.InputStream; import opennlp.tools.tokenize.TokenizerME; import opennlp.tools.tokenize.TokenizerModel; /** * www.tutorialkart.com * Tokenizer Example in Apache openNLP using TokenizerME class loaded with pre-trained token model */ public class TokenizerModelExample { public static void main(String[] args) { InputStream modelIn = null; try { modelIn = new FileInputStream("en-token.bin"); TokenizerModel model = new TokenizerModel(modelIn); TokenizerME tokenizer = new TokenizerME(model); String tokens[] = tokenizer.tokenize("John is 26 years old."); double tokenProbs[] = tokenizer.getTokenProbabilities(); System.out.println("Token\t: Probability\n-------------------------------"); for(int i=0;i<tokens.length;i++){ System.out.println(tokens[i]+"\t: "+tokenProbs[i]); } } catch (IOException e) { e.printStackTrace(); } finally { if (modelIn != null) { try { modelIn.close(); } catch (IOException e) { } } } } }
When the above program is run, the output to the console is as shown below :
Token : Probability ------------------------------- John : 1.0 is : 1.0 26 : 1.0 years : 1.0 old : 0.9954218897531331 . : 1.0
WhitespaceTokenizer
Following is the example to demonstrate WhitespaceTokenizer of OpenNLP Tokenizer API.
WhiteSpaceTokenizerExample.java
import opennlp.tools.tokenize.Tokenizer; import opennlp.tools.tokenize.WhitespaceTokenizer; /** * www.tutorialkart.com * Tokenizer Example in Apache openNLP using WhitespaceTokenizer */ public class WhiteSpaceTokenizerExample { public static void main(String[] args) { Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE; String tokens[] = tokenizer.tokenize("John is 26 years old."); System.out.println("Token\n----------------"); for(int i=0;i<tokens.length;i++){ System.out.println(tokens[i]); } } }
When the above program is run, the output to the console is as shown in the following.
Output
Token ---------------- John is 26 years old.
SimpleTokenizer
Following is the example to demostrateSimpleTokenizer of OpenNLP Tokenizer API.
SimpleTokenizerExample.java
import opennlp.tools.tokenize.SimpleTokenizer; import opennlp.tools.tokenize.Tokenizer; /** * www.tutorialkart.com * Tokenizer Example in Apache openNLP using SimpleTokenizer */ public class SimpleTokenizerExample { public static void main(String[] args) { Tokenizer tokenizer = SimpleTokenizer.INSTANCE; String tokens[] = tokenizer.tokenize("John is 26 years old."); System.out.println("Token\n----------------"); for(int i=0;i<tokens.length;i++){ System.out.println(tokens[i]); } } }
When the above program is run, the output to the console is as shown in the following.
Output
Token ---------------- John is 26 years old .
Conclusion
In this Apache OpenNLP Tutorial, we have seen different ways of tokenization the OpenNLP Tokenizer API provides.
Following are some of the other examples of openNLP :