Apache OpenNLP Tutorial

Apache OpenNLP is an open-source, cross-platform project written in Java. It is a machine-learning-based toolkit for Natural Language Processing (NLP).

In this Apache OpenNLP Tutorial, we shall learn the tools it provides, through its Java API and Command Line Interface, to solve Natural Language Processing tasks such as Named Entity Recognition, Sentence Detection, Chunking, Tokenization, Parts-of-Speech Tagging, and Document Classification or Categorization.

Prerequisites

To understand the usage of the Command Line Interface of Apache OpenNLP, no programming skill is required. A basic understanding of Natural Language Processing tasks and Machine Learning parameters would suffice.

To understand the usage of Apache OpenNLP’s Java API, basic Java programming skills are required, along with some idea of Natural Language Processing tasks and of Machine Learning parameters like the number of epochs and the cut-off. Appropriate intuition is provided in the corresponding tutorials for the Natural Language Processing tasks.
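As a little intuition for the cut-off parameter: it is the minimum number of times a feature must be seen in the training data for it to be kept. The sketch below illustrates that idea with plain Java collections only. It is not OpenNLP code, and the class name CutoffSketch and the feature names are made up for illustration. The number of epochs (iterations) is, in the same spirit, the number of passes the trainer makes over the training data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CutoffSketch {

    // Keep only the features that were observed at least `cutoff` times.
    static Set<String> keptFeatures(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical feature names, invented for this illustration.
        List<String> observed = Arrays.asList(
                "prev=Dr", "next=Capitalized", "prev=Dr",
                "next=Capitalized", "next=Capitalized", "prev=etc");
        // With a cutoff of 2, the feature seen only once is dropped.
        System.out.println(keptFeatures(observed, 2));
    }
}
```

A higher cut-off discards rare features, which usually gives a smaller model at the risk of ignoring useful but infrequent evidence.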

What is Natural Language Processing and the tasks it deals with

Natural Language Processing deals with the interaction between computers and humans. Humans generally communicate with each other in a language (say English, Spanish, or Hindi) that has a vocabulary and a set of grammar rules. Not everyone who speaks a language uses its grammar in the same way, and different people may use different words to convey the same information. Listeners who know the speaker, or who are used to such variation, can still understand, summarize, or draw inferences from what is being said.

Humans perceive information such as context and inference from sentences formed using vocabulary and grammar. When a machine is expected to extract the context, an inference, a summary, or other useful information from the data it gets from a human, there are gaps that need to be filled. These gaps are the tasks that Natural Language Processing deals with, so that a machine can understand a human language or respond to a human in natural language.

Apache OpenNLP is an open-source library that provides solutions to some Natural Language Processing tasks through its APIs and command line tools. It takes a machine learning approach to processing natural language, and it also provides pre-built models for some of the tasks. The tasks for which Apache OpenNLP provides APIs, and which this OpenNLP Tutorial covers with examples, are listed below.

Note: To set up a Java project with Eclipse, refer to how to set up OpenNLP in Java with Eclipse.

Apache OpenNLP Tutorial – APIs

  • Named Entity Recognition (NER)

    Named Entity Recognition finds named entities such as a person, place, organisation, or thing in a given sentence.
    OpenNLP has pre-built models for NER that can be used directly, and it also helps in training a model on our own custom data.
    Named Entity Recognition Example with existing model
    Named Entity Recognition (NER) Training Example

  • Document Categorizer

    A Document Categorizer classifies a given document into one of a set of pre-defined categories.
    OpenNLP provides an API that helps in categorizing or classifying documents. As document categories cannot be generalized the way named entities can, no pre-built models are available, but anyone can build a model to suit their own requirements.
    Document classification using Maximum Entropy (Maxent)
    Document classification using Naive Bayes
    Example to demonstrate the usage of NGram feature for document classification.

  • Sentence Detection

    The process of identifying sentences in a paragraph, a document, or a text file is called Sentence Detection.
    OpenNLP supports Sentence Detection through its API. It provides pre-built models for sentence detection, and also a means to build a model for requirement-specific data.
    Sentence Detection Example in Apache OpenNLP using Java
    Sentence Detection Training Example in Apache OpenNLP using Java

  • Parts of Speech Tagging

    Understanding grammar is an important task in NLP. Identifying the Parts of Speech in a given sentence is a stepping stone to understanding grammar.
    Apache OpenNLP provides APIs to train a model that identifies Parts of Speech, or to use a pre-built model to identify Parts of Speech in a sentence.
    Parts of Speech Tagger Example in Apache OpenNLP using Java

  • Tokenization

    Tokenization is the process of breaking a given sentence down into smaller pieces such as words, punctuation marks, and numbers.
    Apache OpenNLP provides APIs to train a model, or to use a pre-built model, to break a sentence into such pieces.
    Tokenizer Example in Apache OpenNLP using Java

  • Lemmatization

    Lemmatization is the process of removing inflections of a word, such as tense, gender, and mood, and returning its dictionary or base form.
    Lemmatization Example

  • Language Detection

    Language Detection is the task of finding the natural language to which a given sample of text belongs.
    Language Detection Example
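To get a feel for what the Tokenization task in the list above produces, here is a minimal sketch that uses only a regular expression from the JDK. It is not OpenNLP's tokenizer: the pattern and the class name TokenizeSketch are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizeSketch {

    // A run of letters or digits is one token; any other non-space character is a token by itself.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("John is 26 years old, isn't he?"));
    }
}
```

A fixed rule like this also splits isn't into three pieces and a decimal such as 3.5 into 3, ., and 5, which is one reason a trained tokenizer model can be preferable.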

Command Line Interface of Apache OpenNLP

All the tools included in OpenNLP can be accessed through the command line interface. Following are some of the examples :

Conclusion :

With this Apache OpenNLP tutorial, we have had an overview of OpenNLP and the APIs it provides. Let's get hands-on with OpenNLP by setting up a Java Project with OpenNLP in Eclipse and trying out the APIs that it provides.

How to train a model for Sentence Detection in openNLP

In this tutorial, we shall understand how to train a model from input training data for Sentence Detection in openNLP using Java.

Why train a model for Sentence Detection

Pre-built models do not cover every sentence detection requirement. Your use case could involve sentence structures the model has not seen, or sentence detection may have to be done for a language other than English, or for some other case where no model is readily available. These scenarios call for building a model of our own, from our own training data, for our own purpose.

Train a model for Sentence Detection

Now let us see how to train a model for Sentence Detection in openNLP. Follow the steps below:

  1. Create a text file with one sentence on each line.
  2. Create an InputStreamFactory from the input file.
  3. Set the machine learning hyperparameters, such as the number of iterations and the cutoff.
  4. Generate a model with the help of the train() method in SentenceDetectorME.
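Step 1 can be sketched with plain Java as below. The helper class TrainingDataWriter is hypothetical and is not part of the downloadable example. It only makes sure that each sentence ends up on a single line of trainingDataSentences.txt, written here to the current working directory as an assumption, before the file is handed to the trainer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrainingDataWriter {

    // Collapse line breaks and repeated whitespace so that each sentence occupies exactly one line.
    static List<String> toTrainingLines(List<String> sentences) {
        List<String> lines = new ArrayList<>();
        for (String sentence : sentences) {
            String oneLine = sentence.replaceAll("\\s+", " ").trim();
            if (!oneLine.isEmpty()) {
                lines.add(oneLine);
            }
        }
        return lines;
    }

    // Write the normalized sentences to the given file, one sentence per line.
    static void write(Path file, List<String> sentences) {
        try {
            Files.write(file, toTrainingLines(sentences), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "The first sentence goes on its own line.",
                "A sentence that was wrapped\nacross two lines is joined again.",
                "Abbreviations such as Dr. Smith stay inside one line.");
        write(Paths.get("trainingDataSentences.txt"), sentences);
    }
}
```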

Let us generate a model file by training, as shown in the example below:

Sentence Detector Training Example in openNLP

The following example, SentenceDetectorTrainingExample.java, shows how to train a model on your own training data. If you would like to know how to set up a Java project to use openNLP in Eclipse, refer to the setup of a Java project with openNLP libraries in Eclipse. The process should be the same for a different IDE (adding the required jars to the build path should do the magic).

Download: SentenceDetectorTrainingExample.java & trainingDataSentences.txt

When SentenceDetectorTrainingExample.java is run, the output to console is :

The project structure, training input file location, model file generation location, etc., for the example are shown below:

Java Project Structure in Eclipse

 

 

Conclusion :

In this openNLP tutorial, we have learned how to train a model for Sentence Detection in openNLP.

Sentence Detection Example in openNLP using Java

What is Sentence Detection

Sentence Detection, or Sentence Segmentation, is the process of finding the start and end of each sentence in a paragraph. It is one of the problems in Natural Language Processing, and it is often a pre-processing step in use cases that are solved using Natural Language Processing techniques.

Sentence detection is quite challenging for many reasons. One of them is that the period symbol (.), which usually denotes the end of a sentence, can also appear in email addresses, abbreviations, decimals, etc.
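The difficulty with the period can be seen with a deliberately naive splitter. The sketch below is plain Java and not OpenNLP. The class name NaiveSentenceSplitter and its rule (split after ., ! or ? when whitespace follows) are assumptions chosen to show the failure.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSentenceSplitter {

    // Split after '.', '!' or '?' whenever whitespace follows. Deliberately naive.
    static List<String> split(String paragraph) {
        return Arrays.asList(paragraph.trim().split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        String paragraph =
                "Dr. Rao paid 3.50 dollars. Write to a.b@example.com for a receipt.";
        // The paragraph has two sentences, but the abbreviation "Dr." is cut off as a third piece.
        for (String piece : split(paragraph)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The decimal and the email address survive here only because no whitespace follows their periods. The abbreviation is still split incorrectly, and patching the rule for every such case quickly becomes unmanageable, which is why a model trained on examples is used instead.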

Sentence Detection Example in openNLP

The following example, SentenceDetectExample.java, shows how to use the SentenceDetectorME class to detect sentences in a paragraph or string. If you would like to know how to set up the Eclipse project, refer to the setup of a Java project with openNLP libraries in Eclipse. The process should be the same for a different IDE (adding the required jars to the build path should do the magic).

When SentenceDetectExample.java is run, the console output is :

The project structure, model file location, etc., for the example are shown below:

Example Project – Structure

Model File:

The model file en-sent.bin is available at http://opennlp.sourceforge.net/models-1.5/. Stay updated on the latest releases of openNLP and its model files at https://opennlp.apache.org/download.html

Java Documentation

Find the Java documentation for SentenceDetectorME at the official site and play with the other methods, like getSentenceProbabilities(), sentPosDetect(String s), etc., for a better understanding.
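A method like sentPosDetect reports where each sentence sits in the input as character offsets rather than as strings. To illustrate working with such offsets without a model file, the sketch below uses the JDK's rule-based java.text.BreakIterator as a stand-in. It is not OpenNLP, the class name SentenceOffsetsSketch is made up, and its boundaries may differ from what SentenceDetectorME finds.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceOffsetsSketch {

    // Returns {start, end} character offsets (end exclusive) for each sentence, without trailing whitespace.
    static List<int[]> sentenceOffsets(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(text);
        List<int[]> offsets = new ArrayList<>();
        int start = boundaries.first();
        int end = boundaries.next();
        while (end != BreakIterator.DONE) {
            int trimmedEnd = end;
            while (trimmedEnd > start && Character.isWhitespace(text.charAt(trimmedEnd - 1))) {
                trimmedEnd--;
            }
            if (trimmedEnd > start) {
                offsets.add(new int[] {start, trimmedEnd});
            }
            start = end;
            end = boundaries.next();
        }
        return offsets;
    }

    // Turns the offsets back into the sentence strings they cover.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        for (int[] offset : sentenceOffsets(text)) {
            result.add(text.substring(offset[0], offset[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you?";
        for (int[] offset : sentenceOffsets(text)) {
            System.out.println(offset[0] + ".." + offset[1] + " : " + text.substring(offset[0], offset[1]));
        }
    }
}
```

Offsets are handy when the detected sentences have to be highlighted in, or mapped back to, the original text.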

Custom model for Sentence Detection from user defined training data

If you are interested in knowing how to train and generate a model yourself for Sentence Detection, refer to training a model for Sentence Detection in openNLP.

Conclusion :

In this openNLP tutorial, we have seen Sentence Detection Example in openNLP using Java.

Named Entity Extraction Example in openNLP using Java

In this openNLP tutorial, we shall try entity extraction from a sentence using openNLP pre-built models that were already trained to find named entities.

What is Named Entity Recognition/Extraction (NER)?

Named Entity Recognition is the task of finding named entities that could belong to categories like persons, organizations, dates, percentages, etc., and assigning each identified entity to one of these categories.
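The result of this task can be pictured as a list of spans over the tokens of a sentence, each span carrying a category. The sketch below is plain Java with hand-written spans. The EntitySpan class is a hypothetical stand-in, the type labels are illustrative, and no model is involved. The range is taken as half-open, i.e. the start index is included and the end index is excluded.

```java
import java.util.Arrays;
import java.util.List;

class EntitySpanSketch {

    // A half-open range of token indices [start, end) plus an entity category. Hypothetical stand-in.
    static final class EntitySpan {
        final int start;
        final int end;
        final String type;

        EntitySpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Joins the tokens covered by the span back into the entity text.
    static String coveredText(EntitySpan span, String[] tokens) {
        return String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
    }

    public static void main(String[] args) {
        String[] tokens = {"John", "Smith", "moved", "to", "New", "York", "."};
        // Hand-written spans for illustration; a real name finder would predict them.
        List<EntitySpan> spans = Arrays.asList(
                new EntitySpan(0, 2, "person"),
                new EntitySpan(4, 6, "location"));
        for (EntitySpan span : spans) {
            System.out.println(span.type + ": " + coveredText(span, tokens));
        }
    }
}
```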

How is Named Entity Extraction done in openNLP?

In openNLP, Named Entity Extraction is done using statistical models, i.e., machine learning techniques. Specifically, Maxent modeling is used. To get an intuition on how Maxent modeling works, refer to the motivating example of Maxent modeling.
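As a rough intuition, a Maxent model scores each possible outcome by summing the weights of the features that are active, and then normalizes the exponentiated scores into probabilities. The sketch below shows only that final computation in plain Java. The weights and feature names are invented, the class name MaxentSketch is made up, and nothing here is taken from openNLP's implementation or from a trained model.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MaxentSketch {

    // p(outcome | features) = exp(sum of weights of the active features) / normalizer
    static Map<String, Double> probabilities(Map<String, Map<String, Double>> weights,
                                             List<String> activeFeatures) {
        Map<String, Double> result = new LinkedHashMap<>();
        double normalizer = 0.0;
        for (Map.Entry<String, Map<String, Double>> outcome : weights.entrySet()) {
            double sum = 0.0;
            for (String feature : activeFeatures) {
                sum += outcome.getValue().getOrDefault(feature, 0.0);
            }
            double score = Math.exp(sum);
            result.put(outcome.getKey(), score);
            normalizer += score;
        }
        for (Map.Entry<String, Double> entry : result.entrySet()) {
            entry.setValue(entry.getValue() / normalizer);
        }
        return result;
    }

    // Made-up weights for two outcomes of a single token.
    static Map<String, Double> demo() {
        Map<String, Double> personWeights = new HashMap<>();
        personWeights.put("capitalized", 1.2);
        personWeights.put("previousTokenIsTitle", 2.0);
        Map<String, Double> otherWeights = new HashMap<>();
        otherWeights.put("capitalized", -0.3);
        otherWeights.put("previousTokenIsTitle", -1.0);
        Map<String, Map<String, Double>> weights = new LinkedHashMap<>();
        weights.put("person", personWeights);
        weights.put("other", otherWeights);
        return probabilities(weights, Arrays.asList("capitalized", "previousTokenIsTitle"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Training is the process of finding such weights from annotated data. The pre-built models already contain them.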

Example: Named Entity Extraction Example in openNLP

The following example, NameFinderExample.java, shows how to use the NameFinderME class to extract named entities of the types person and place.

When the example program, NameFinderExample.java is run, the output to console is:

The project structure, the model file location, etc., are shown below:

Example Project – Structure

Model File:

The model files en-ner-person.bin, en-ner-location.bin, and other NER models are available at http://opennlp.sourceforge.net/models-1.5/. Stay updated on the latest releases of openNLP and its model files at https://opennlp.apache.org/download.html

 

Conclusion :

In this openNLP tutorial, we have seen how to use Named Entity Extraction API of openNLP to extract named entities from a paragraph or sentence.

How to setup openNLP Java Project in Eclipse

In this openNLP tutorial, we shall see how to set up an openNLP Java project to use the openNLP API with Eclipse (the process should be the same for other IDEs as well).

Follow these steps:

  1. Create a Java Project in Eclipse. (Open Eclipse -> File (in Menu) -> New -> Project -> Java -> Java Project)
  2. Provide a project name (Ex : OpenNLPJavaTutorial) and click on “Finish”.
  3. Download jar files of openNLP from http://redrockdigimark.com/apachemirror/opennlp/
    At the time of writing this tutorial, opennlp-1.7.1 is the latest version, and the list looks like the picture below.

    opennlp version links

    Click on opennlp-1.7.1/ . We need the bin package, because it contains the library (.jar) files.

    openNLP bin package

    Click on apache-opennlp-1.7.1-bin.zip to download.

  4. Once the zip file is downloaded, extract the contents, copy the lib folder, and paste it into the project as shown in the picture below.

    opennlp-java-project-lib folder

    The lib folder should contain the following jar files:
    aopalliance-repackaged-2.5.0-b30.jar
    grizzly-framework-2.3.28.jar
    grizzly-http-2.3.28.jar
    grizzly-http-server-2.3.28.jar
    hk2-api-2.5.0-b30.jar
    hk2-locator-2.5.0-b30.jar
    hk2-utils-2.5.0-b30.jar
    hppc-0.7.1.jar
    jackson-annotations-2.8.4.jar
    jackson-core-2.8.4.jar
    jackson-databind-2.8.4.jar
    jackson-jaxrs-base-2.8.4.jar
    jackson-jaxrs-json-provider-2.8.4.jar
    jackson-module-jaxb-annotations-2.8.4.jar
    javassist-3.20.0-GA.jar
    javax.annotation-api-1.2.jar
    javax.inject-2.5.0-b30.jar
    javax.ws.rs-api-2.0.1.jar
    jcommander-1.48.jar
    jersey-client-2.25.jar
    jersey-common-2.25.jar
    jersey-container-grizzly2-http-2.25.jar
    jersey-entity-filtering-2.25.jar
    jersey-guava-2.25.jar
    jersey-media-jaxb-2.25.jar
    jersey-media-json-jackson-2.25.jar
    jersey-server-2.25.jar
    morfologik-fsa-2.1.0.jar
    morfologik-fsa-builders-2.1.0.jar
    morfologik-stemming-2.1.0.jar
    morfologik-tools-2.1.0.jar
    opennlp-brat-annotator-1.7.1.jar
    opennlp-morfologik-addon-1.7.1.jar
    opennlp-tools-1.7.1.jar
    opennlp-uima-1.7.1.jar
    osgi-resource-locator-1.0.1.jar
    validation-api-1.1.0.Final.jar

  5. Add these jars to the build path (Project -> Properties -> Java Build Path -> Libraries -> Add Jars -> Select all the jars in lib folder -> Click “Apply” -> Click “OK”)
  6. Apache has already trained models for several Natural Language Processing problems, and these models are available at http://opennlp.sourceforge.net/models-1.5/ . In the subsequent tutorials, we refer to model files available at this location. Do bookmark the link for quick access.
  7. We are ready with the openNLP Java Project setup. Let's try sentence detection using SentenceDetectExample.java.
  8. Download the “en-sent.bin” model file and place it in the project. The final project structure should match the structure shown in the picture below.

    opennlp java project structure
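Before running any example, the steps above can be sanity-checked with a few lines of plain Java. The sketch below is not SentenceDetectExample.java. The class name SetupCheck is made up, and the location of en-sent.bin (the project root, i.e. the working directory) is an assumption that should be adjusted to wherever the model file was actually placed.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

class SetupCheck {

    // True if the named class can be located on the classpath, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, SetupCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // SentenceDetectorME lives in the opennlp-tools jar added to the build path in step 5.
        String detectorClass = "opennlp.tools.sentdetect.SentenceDetectorME";
        System.out.println("OpenNLP on classpath: " + isOnClasspath(detectorClass));

        // Assumed location of the model file; change the path if it was placed elsewhere.
        System.out.println("Model file found: " + Files.exists(Paths.get("en-sent.bin")));
    }
}
```

If the first line prints false, the jars were most likely not added to the build path. If the second prints false, check the location of the model file.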

Example : We shall try out the example SentenceDetectExample.java to check that the setup is good.

When SentenceDetectExample.java is run, the console output is:

We have successfully set up the openNLP Java Project in Eclipse.

Conclusion :

In this openNLP tutorial, we have seen the setup of openNLP Java Project in Eclipse. In our next openNLP tutorials, we shall see :