OpenNLP Overview

In this OpenNLP tutorial, we list some of the common tasks in Natural Language Processing and the solutions that Apache OpenNLP provides for them through its APIs.

What is Natural Language Processing?

Natural Language Processing (NLP) is about the interaction between computers and humans through human language. Humans communicate with each other using a vocabulary, and the language they use (say English, Spanish or Hindi) has a set of rules, its grammar. Not everyone who speaks a language applies its grammar in the same way, and different people may use different words to convey the same information. Listeners who know the speaker, or who are used to that way of speaking, can still understand what is said and draw a summary or inference from it.

Humans perceive information such as context and inference from sentences formed using vocabulary and grammar. When a machine is expected to extract the context, inference, summary or other useful information from the data it receives from a human, there are gaps that need to be filled. These gaps are the tasks that Natural Language Processing deals with, so that a machine can understand a human language or respond to a human in natural language.

Apache OpenNLP for Natural Language Processing

Apache OpenNLP is an open-source library that provides solutions to some Natural Language Processing tasks through its APIs and command line tools. It takes a machine learning approach to processing natural language. The following are some of the tasks for which Apache OpenNLP provides APIs [http://opennlp.apache.org/docs/1.7.2/manual/opennlp.html], and which we cover with examples in this OpenNLP tutorial:

Note: To set up a Java project with Eclipse, refer to how to set up OpenNLP in Java with Eclipse.

Named Entity Recognition (NER)

Named Entity Recognition finds named entities such as a person, place, organisation or thing in a given sentence.

OpenNLP provides pre-built models for NER that can be used directly, and it also helps in training a model on our own custom data.
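To make the input and output of this task concrete, here is a minimal sketch in plain Java. It is not the OpenNLP API: the OpenNLP name finder (NameFinderME in the manual) relies on a trained model file, whereas this stand-in simply looks phrases up in a hand-written dictionary. The class NameLookupSketch, its Mention type and the dictionary entries are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not an OpenNLP class: dictionary lookup over tokens.
final class NameLookupSketch {

    /** One found entity: it covers tokens [start, end) and has an entity type. */
    static final class Mention {
        final int start;
        final int end;
        final String type;

        Mention(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + ".." + end + ")";
        }
    }

    /** Scans left to right and prefers the longest dictionary phrase at each position. */
    static List<Mention> find(String[] tokens, Map<String, String> dictionary, int maxLen) {
        List<Mention> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            boolean matched = false;
            for (int len = Math.min(maxLen, tokens.length - i); len >= 1 && !matched; len--) {
                String phrase = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                String type = dictionary.get(phrase);
                if (type != null) {
                    found.add(new Mention(i, i + len, type));
                    i += len;       // continue after the matched phrase
                    matched = true;
                }
            }
            if (!matched) {
                i++;                // no entity starts at this token
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("Alice Smith", "person");
        dictionary.put("Paris", "location");
        String[] tokens = {"Alice", "Smith", "lives", "in", "Paris"};
        System.out.println(find(tokens, dictionary, 3)); // [person[0..2), location[4..5)]
    }
}
```

A fixed dictionary cannot recognise names it has never seen, which is the gap a trained model is meant to close.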

Document Categorizer

A Document Categorizer classifies a given document into one of a set of pre-defined categories.

OpenNLP provides an API that helps in categorizing or classifying documents. Because document categories depend on the application and cannot be generalized the way named entities can, there are no pre-built models available, but anyone can train a model to suit their own requirements.
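The idea of training a categorizer on our own labelled documents can be sketched in plain Java as follows. This is not the OpenNLP API and not necessarily the algorithm OpenNLP uses: it is a small naive Bayes classifier with add-one smoothing, swapped in because it fits in a few lines and needs no library. The class name DoccatSketch, the categories and the training sentences are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in, not an OpenNLP class: a tiny naive Bayes categorizer.
final class DoccatSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labelled training document. */
    void train(String category, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    /** Returns the most probable category, or null if nothing has been trained. */
    String categorize(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : wordCounts.keySet()) {
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            // Start from the log prior: the share of training documents in this category.
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            for (String word : text.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // Add-one smoothing keeps unseen words from zeroing the probability.
                score += Math.log((counts.getOrDefault(word, 0) + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        DoccatSketch categorizer = new DoccatSketch();
        categorizer.train("sports", "the team won the football match");
        categorizer.train("tech", "the new phone has a fast processor");
        System.out.println(categorizer.categorize("football match")); // sports
    }
}
```

The shape is the same as described above: the categories exist only because the training data names them, which is why no general pre-built model can be shipped.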

Sentence Detection

The process of identifying sentences in a paragraph or a document or a text file is called Sentence Detection.

OpenNLP supports Sentence Detection through its API. It provides pre-built models for sentence detection, and also a means to train a model on requirement-specific data.
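As a rough illustration of what sentence detection produces, the following sketch splits a paragraph using the JDK's own rule-based java.text.BreakIterator. This is a stand-in so that the example runs without any model file; the OpenNLP detector (SentenceDetectorME in the manual) is model-based and may split text differently. The class name SentenceSplitSketch is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in, not an OpenNLP class: rule-based splitting with the JDK.
final class SentenceSplitSketch {

    /** Returns the sentences of the text, trimmed, in their original order. */
    static List<String> split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String paragraph = "OpenNLP is a Java library. It ships command line tools too.";
        // Prints each detected sentence on its own line.
        for (String sentence : split(paragraph)) {
            System.out.println(sentence);
        }
    }
}
```

Fixed rules like these tend to struggle with abbreviations and unusual punctuation, which is one reason to train a detector on data that resembles our own.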

Parts of Speech Tagging

Understanding grammar is an important task in NLP. Identifying the Parts of Speech in a given sentence is a stepping stone towards understanding grammar.

Apache OpenNLP provides APIs to train a model that identifies Parts of Speech, or to use a pre-built model to identify Parts of Speech in a sentence.
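The train-then-tag workflow can be sketched with a very simple baseline in plain Java: remember, for every word seen in training, the tag it carried most often. This is not the OpenNLP API (the manual's tagger class is POSTaggerME) and it is a much weaker technique than a trained statistical model. The class name PosBaselineSketch, the word_TAG layout of the training sentences and the fallback tag are assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not an OpenNLP class: a most-frequent-tag baseline.
final class PosBaselineSketch {
    // word (lower-cased) -> tag -> how often the word carried that tag in training
    private final Map<String, Map<String, Integer>> tagCounts = new HashMap<>();

    /** Learns from one sentence written as word_TAG items, e.g. "The_DT dog_NN barks_VBZ". */
    void train(String taggedSentence) {
        for (String item : taggedSentence.trim().split("\\s+")) {
            int sep = item.lastIndexOf('_');
            if (sep <= 0 || sep == item.length() - 1) {
                continue; // skip items that are not in word_TAG form
            }
            String word = item.substring(0, sep).toLowerCase();
            String tag = item.substring(sep + 1);
            tagCounts.computeIfAbsent(word, k -> new HashMap<>()).merge(tag, 1, Integer::sum);
        }
    }

    /** Tags each token with its most frequent training tag, or fallbackTag if unseen. */
    String[] tag(String[] tokens, String fallbackTag) {
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            Map<String, Integer> counts = tagCounts.get(tokens[i].toLowerCase());
            String best = fallbackTag;
            int bestCount = 0;
            if (counts != null) {
                for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                    if (entry.getValue() > bestCount) {
                        bestCount = entry.getValue();
                        best = entry.getKey();
                    }
                }
            }
            tags[i] = best;
        }
        return tags;
    }

    public static void main(String[] args) {
        PosBaselineSketch tagger = new PosBaselineSketch();
        tagger.train("The_DT dog_NN barks_VBZ ._.");
        String[] tags = tagger.tag(new String[] {"The", "dog", "barks"}, "NN");
        System.out.println(String.join(" ", tags)); // DT NN VBZ
    }
}
```

Because the baseline ignores neighbouring words, it cannot tell a noun use of a word from a verb use; a trained model that looks at context is meant to handle such cases.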

Tokenization

Tokenization is the process of breaking a given sentence down into smaller pieces such as words, punctuation marks and numbers.

Apache OpenNLP provides APIs to train a model or use a pre-built model and break a sentence into smaller pieces.
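To show what the smaller pieces look like, here is a minimal rule-based sketch in plain Java. It is a stand-in rather than the OpenNLP API: the learnable tokenizer in the manual (TokenizerME) needs a model file, while this version uses a regular expression that we chose for illustration. The class name TokenizeSketch and the exact token rule are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not an OpenNLP class: regex-based tokenization.
final class TokenizeSketch {
    // A token is a run of letters, a run of digits, or one other non-space character.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\p{N}+|[^\\s\\p{L}\\p{N}]");

    /** Returns the tokens of the sentence in order; whitespace is dropped. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Tom has 2 cats, right?")); // [Tom, has, 2, cats, ,, right, ?]
    }
}
```

A rule this simple also splits things we might want to keep together, such as "don't" or "3.14", which is where a tokenizer trained on suitable data can do better.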

Command Line Interface of Apache OpenNLP

All the tools included in OpenNLP can also be accessed through its command line interface.

Conclusion

In this OpenNLP tutorial, we have seen an overview of Apache OpenNLP. Let's get hands-on with OpenNLP by setting up a Java project with OpenNLP in Eclipse and trying out the APIs that it provides.