A lemmatizer is a Natural Language Processing tool that strips inflectional changes from a word, such as tense, gender and mood, and returns the dictionary or base form of the word.

In Apache OpenNLP, the Lemmatizer returns the base or dictionary form of a word (usually called the lemma) when it is given the word and its Part-Of-Speech (POS) tag. A word on its own may have several possible lemmas, but once the POS tag is supplied the candidates usually narrow down to one. That lemma is also the more accurate choice, because the postag carries information about the context in which the word is used.
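To illustrate how the postag narrows the choice, here is a minimal sketch in plain Java. It does not use the OpenNLP API; the class name PosDisambiguation, the two sample words and their Penn Treebank style tags are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PosDisambiguation {
    // Hypothetical toy entries: surface form -> (POS tag -> lemma).
    static final Map<String, Map<String, String>> ENTRIES = Map.of(
        "saw", Map.of("VBD", "see", "NN", "saw", "VB", "saw"),
        "leaves", Map.of("NNS", "leaf", "VBZ", "leave"));

    // Without a tag, every lemma recorded for the word is a candidate.
    static Set<String> candidates(String word) {
        return new TreeSet<>(ENTRIES.getOrDefault(word, Map.of()).values());
    }

    // With a tag, at most one lemma remains; null means "no entry".
    static String lemma(String word, String posTag) {
        return ENTRIES.getOrDefault(word, Map.of()).get(posTag);
    }

    public static void main(String[] args) {
        System.out.println(candidates("leaves"));     // [leaf, leave]
        System.out.println(lemma("leaves", "NNS"));   // leaf
        System.out.println(lemma("leaves", "VBZ"));   // leave
    }
}
```

The word "leaves" alone has two candidate lemmas, while each (word, postag) pair maps to exactly one.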

Apache OpenNLP provides two ways to perform lemmatization:

  • Statistical Lemmatization
  • Dictionary-based Lemmatization

The Statistical Lemmatizer needs a lemmatizer model (built from training data) to find the lemma of a given word, while the Dictionary-based Lemmatizer needs a dictionary that contains all valid combinations of {word, postag, lemma}.
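As a rough sketch of what such a dictionary looks like once loaded, the plain Java snippet below turns lines into a lookup table keyed by word and postag. It assumes one entry per line with the three fields separated by tabs (word, postag, lemma). The class DictionaryLoader and the sample entries are hypothetical, and this is not how OpenNLP itself is implemented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DictionaryLoader {
    // Builds a table keyed by "word<TAB>postag" from lines of the assumed
    // form "word<TAB>postag<TAB>lemma". Lines without exactly three
    // fields are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                continue;
            }
            table.put(fields[0] + "\t" + fields[1], fields[2]);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(   // hypothetical entries
            "was\tVBD\tbe",
            "children\tNNS\tchild",
            "not a valid entry");
        Map<String, String> table = parse(sample);
        System.out.println(table.get("was\tVBD"));   // be
        System.out.println(table.size());            // 2
    }
}
```

To try it on a real file, the lines could be read with Files.readAllLines(Path.of("en-lemmatizer.dict")) and passed to parse.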

The input to the Lemmatizer is a sequence of tokens and their corresponding postags. So, to find the lemmas for the words in a sentence, the sentence first has to be tokenized using a Tokenizer and then POS tagged using a POS Tagger.
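The order of these steps can be sketched in plain Java as below. The tokenizer, tagger and lemmatizer here are toy stand-ins backed by small hand-written tables (all hypothetical), not the OpenNLP components. The point is only the data flow: tokens first, then one postag per token, then both arrays together into the lemmatizer.

```java
import java.util.Arrays;
import java.util.Map;

class LemmaPipeline {
    // Hypothetical toy data; a real setup loads a trained POS model and a dictionary.
    static final Map<String, String> TOY_TAGS =
        Map.of("dogs", "NNS", "were", "VBD", "barking", "VBG");
    static final Map<String, String> TOY_LEMMAS =
        Map.of("dogs/NNS", "dog", "were/VBD", "be", "barking/VBG", "bark");

    // Step 1: stand-in tokenizer that splits on whitespace only.
    static String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }

    // Step 2: stand-in tagger, one tag per token; unknown tokens default to NN.
    static String[] tag(String[] tokens) {
        return Arrays.stream(tokens)
            .map(t -> TOY_TAGS.getOrDefault(t, "NN"))
            .toArray(String[]::new);
    }

    // Step 3: the lemmatizer receives tokens and tags as parallel arrays.
    // In this toy table a missing (token, tag) pair yields null.
    static String[] lemmatize(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        String[] lemmas = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            lemmas[i] = TOY_LEMMAS.get(tokens[i] + "/" + tags[i]);
        }
        return lemmas;
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("dogs were barking");
        String[] tags = tag(tokens);
        String[] lemmas = lemmatize(tokens, tags);
        System.out.println(Arrays.toString(lemmas));   // [dog, be, bark]
    }
}
```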

Dictionary Lemmatizer Example in Apache OpenNLP

You may download the dictionary from here[https://raw.githubusercontent.com/richardwilly98/elasticsearch-opennlp-auto-tagging/master/src/main/resources/models/en-lemmatizer.dict] and the POS tagger model en-pos-maxent.bin from here[http://opennlp.sourceforge.net/models-1.5/].

Output :

Note : If a combination of word and postag is not found in the dictionary, the lemma is returned as ‘O’ (the capital letter O, not the digit zero). In the above example the combinations US_NNP, morning_NN, afternoon_NN and ._. are not found in the dictionary, hence the corresponding lemmas are ‘O’.
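If later processing should not see this placeholder, one possible approach is to fall back to the original token whenever the lemma was not found. The sketch below is plain Java and is only one way to handle it. The class LemmaFallback and the sample arrays are hypothetical, and the placeholder is passed in as a parameter; the example uses the capital letter "O", assumed here to be the value the dictionary lemmatizer returns.

```java
import java.util.Arrays;

class LemmaFallback {
    // Replaces every placeholder lemma with the original token, so that
    // later processing always receives a usable string.
    static String[] withFallback(String[] tokens, String[] lemmas, String placeholder) {
        if (tokens.length != lemmas.length) {
            throw new IllegalArgumentException("tokens and lemmas must have the same length");
        }
        String[] result = new String[lemmas.length];
        for (int i = 0; i < lemmas.length; i++) {
            result[i] = placeholder.equals(lemmas[i]) ? tokens[i] : lemmas[i];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"She", "visited", "Paris"};   // hypothetical input
        String[] lemmas = {"she", "visit", "O"};         // "O" marks a missing entry
        System.out.println(Arrays.toString(withFallback(tokens, lemmas, "O")));
        // [she, visit, Paris]
    }
}
```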

Conclusion :

We have learnt what lemmatization is and how to implement it, with the help of a Lemmatizer example in Apache OpenNLP.