OpenNLP Tools UIMA Integration Documentation

Introduction

The OpenNLP project is now the home of a set of Java-based NLP tools which perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and coreference resolution.

The sentence detector, tokenizer, named-entity detector and pos tagger are integrated into the UIMA framework. The integration supports both tagging with and training of these tools.

The OpenNLP Tools UIMA Integration binary distribution contains all annotators in a single jar file, sample descriptors with a type system, and a sample PEAR. The sample PEAR is intended for demonstration and includes all annotators together with models for English. The annotators themselves are intended for inclusion in a custom analysis engine packaged by the user. To ease this inclusion, all types used by the annotators can be mapped to the user's type system inside the descriptors.

What follows covers:

  1. Running the PEAR sample in CVD
  2. Downloading Models
  3. Sentence Detector
  4. Tokenizer
  5. Name Finder
  6. Part of Speech Tagger
  7. Chunker
  8. Doccat and Parser
  9. Bug Reports

Running the PEAR sample in CVD

The CAS Visual Debugger (CVD) is a tool shipped with UIMA which can run the opennlp.uima annotators and display their analysis results. The binary distribution comes with a sample UIMA application which includes the sentence detector, tokenizer, pos tagger, chunker and name finders for English. This sample application is packaged in the PEAR format and must be installed with the PEAR installer before it can be run in CVD. Please consult the UIMA documentation for further information about the PEAR installer.

After the PEAR is installed, start the CAS Visual Debugger shipped with the UIMA framework and click on Tools -> Load AE. Then select the opennlp.uima.OpenNlpTextAnalyzer_pear.xml file in the file dialog. Now enter some text and start the analysis engine via Run -> Run OpenNLPTextAnalyzer. The results are then displayed: you should see sentences, tokens, chunks, pos tags and possibly some names. Remember that the input text must be written in English.

Downloading Models

Models have been trained for several of the components and are required unless you wish to create your own models exclusively from your own annotated data. The models can be downloaded via the "Models" link at opennlp.sourceforge.net. The models are large, so you may want to fetch only the specific ones you need. Models for the individual components can be found in correspondingly named directories of the download area.

Sentence Detector

The Sentence Detector segments the document into sentences. It takes the document text as its only input and outputs sentence annotations. The pre-trained OpenNLP sentence detector models assume that sentence detection is the first analysis step.
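
What the annotator does per document roughly corresponds to the following minimal sketch, written against the plain OpenNLP Tools API rather than the UIMA glue code; the class name and the model file name en-sent.bin are assumptions and depend on the model you downloaded.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.util.Span;

    public class SentenceDetectorSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained sentence model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-sent.bin");
            SentenceModel model = new SentenceModel(modelIn);
            modelIn.close();

            SentenceDetectorME detector = new SentenceDetectorME(model);

            // The annotator does the equivalent of this on the document text and
            // creates one sentence annotation per returned span.
            Span[] sentences = detector.sentPosDetect("First sentence. Second sentence.");
            for (Span s : sentences) {
                System.out.println(s.getStart() + ".." + s.getEnd());
            }
        }
    }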

Tokenizer

The Tokenizer segments text into tokens. Tokenization is performed sentence-wise and the Tokenizer outputs annotations of the token type. The pre-trained OpenNLP tokenizer models assume that sentence detection has already been done, but tokenization also works reasonably well on the document level without sentence annotations.
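
A minimal sketch of the underlying tokenizer call, again using the plain OpenNLP Tools API; the model file name en-token.bin is an assumption.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    public class TokenizerSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained tokenizer model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-token.bin");
            TokenizerModel model = new TokenizerModel(modelIn);
            modelIn.close();

            TokenizerME tokenizer = new TokenizerME(model);

            // The annotator tokenizes the covered text of each sentence annotation
            // and creates one token annotation per returned span.
            Span[] tokens = tokenizer.tokenizePos("The driver got badly injured.");
            for (Span t : tokens) {
                System.out.println(t.getStart() + ".." + t.getEnd());
            }
        }
    }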

Name Finder

The named-entity detector can detect various kinds of entities; which kind depends on the model it is loaded with. To detect entities, the Name Finder annotator needs sentences and tokens as input. It outputs name annotations of the specified type.
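
A minimal sketch of the underlying name finder call using the plain OpenNLP Tools API; en-ner-person.bin is an assumed model file name, and each such model detects a single entity type.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class NameFinderSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained name finder model; the file name is an assumption
            // and each model detects only one entity type.
            InputStream modelIn = new FileInputStream("en-ner-person.bin");
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            modelIn.close();

            NameFinderME nameFinder = new NameFinderME(model);

            // The annotator passes in the tokens of each sentence and creates a name
            // annotation for every returned span (offsets are token indexes).
            String[] tokens = { "Mike", "Smith", "lives", "in", "Boston", "." };
            Span[] names = nameFinder.find(tokens);
            System.out.println(names.length + " names found");

            // Clear the adaptive data after each document.
            nameFinder.clearAdaptiveData();
        }
    }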

Part of Speech Tagger

The pos tagger detects the part of speech of the individual tokens. It needs sentences and tokens as input and writes the pos tags to the pos feature of the input tokens.
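
A minimal sketch of the underlying tagger call using the plain OpenNLP Tools API; en-pos-maxent.bin is an assumed model file name.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class PosTaggerSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained pos model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
            POSModel model = new POSModel(modelIn);
            modelIn.close();

            POSTaggerME tagger = new POSTaggerME(model);

            // The annotator tags the tokens of each sentence; tags[i] is written
            // to the pos feature of the i-th token annotation.
            String[] tokens = { "The", "driver", "got", "badly", "injured", "." };
            String[] tags = tagger.tag(tokens);
            System.out.println(Arrays.toString(tags));
        }
    }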

Chunker

The Chunker groups the tokens of a sentence into syntactically correlated chunks, such as noun and verb phrases. It needs sentences, tokens and pos tags as input and outputs chunk annotations.
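
A minimal sketch of the underlying chunker call using the plain OpenNLP Tools API; en-chunker.bin is an assumed model file name.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;

    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;

    public class ChunkerSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained chunker model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-chunker.bin");
            ChunkerModel model = new ChunkerModel(modelIn);
            modelIn.close();

            ChunkerME chunker = new ChunkerME(model);

            // The annotator chunks each sentence from its tokens and pos tags and
            // builds chunk annotations from the resulting B-/I-/O tag sequence.
            String[] tokens = { "He", "reckons", "the", "deficit", "will", "narrow", "." };
            String[] posTags = { "PRP", "VBZ", "DT", "NN", "MD", "VB", "." };
            String[] chunkTags = chunker.chunk(tokens, posTags);
            System.out.println(Arrays.toString(chunkTags));
        }
    }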

Doccat and Parser

Both integrations are still experimental and it is not recommended to use them in a production environment. Please see the source code for further details.

Bug Reports

Please report bugs in the bug tracking section of the OpenNLP SourceForge site:

sourceforge.net/tracker/?group_id=3368&atid=103368

Note: Incorrect automatic annotation of a specific piece of text does not constitute a bug. The best way to address such errors is to provide annotated data on which the annotator/tagger can be trained, so that it might learn not to make these mistakes in the future.