OpenNLP Tools UIMA Integration Documentation

Introduction

The OpenNLP project is now the home of a set of Java-based NLP tools which perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and coreference resolution.

The sentence detector, tokenizer, named-entity detector and pos tagger are integrated into the UIMA framework. The integration supports both tagging with and training of these tools.

The OpenNLP Tools UIMA Integration binary distribution contains all annotators in a single jar file, sample descriptors with a type system, and a sample PEAR. The sample PEAR is intended for demonstration and includes all annotators together with models for English. The annotators themselves are intended for inclusion in a custom analysis engine packaged by the user. To ease this inclusion, all types used by the annotators can be mapped to the user's type system inside the descriptors.

What follows covers:

  1. Running the PEAR sample in CVD
  2. Downloading Models
  3. Sentence Detector
  4. Tokenizer
  5. Name Finder
  6. Part of Speech Tagger
  7. Chunker
  8. Doccat and Parser
  9. Bug Reports

Running the PEAR sample in CVD

The CAS Visual Debugger (CVD) is a tool shipped with UIMA which can run the opennlp.uima annotators and display their analysis results. The binary distribution comes with a sample UIMA application which includes the sentence detector, tokenizer, pos tagger, chunker and name finders for English. This sample application is packaged in the PEAR format and must be installed with the PEAR installer before it can be run in CVD. Please consult the UIMA documentation for further information about the PEAR installer.

After the PEAR is installed, start the CAS Visual Debugger shipped with the UIMA framework and click on Tools -> Load AE. Then select the opennlp.uima.OpenNlpTextAnalyzer_pear.xml file in the file dialog. Now enter some text and start the analysis engine via Run -> Run OpenNLPTextAnalyzer. The results are then displayed: you should see sentences, tokens, chunks, pos tags and possibly some names. Remember that the input text must be written in English.

Downloading Models

Models have been trained for several of the components and are required unless you wish to create your own models exclusively from your own annotated data. The models can be downloaded via the "Models" link at opennlp.sourceforge.net. The models are large, so you may want to fetch only the specific ones you need. Models for the individual components can be found in correspondingly named directories of the download area.

Sentence Detector

The Sentence Detector segments the document into sentences. It takes the document text as its only input and outputs sentence annotations. The pre-trained OpenNLP sentence detector models assume that sentence detection is the first analysis step.
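
What the annotator does per document roughly corresponds to the following minimal sketch, written against the plain OpenNLP Tools API rather than the UIMA glue code; the class name and the model file name en-sent.bin are assumptions and depend on the model you downloaded.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.util.Span;

    public class SentenceDetectorSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained sentence model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-sent.bin");
            SentenceModel model = new SentenceModel(modelIn);
            modelIn.close();

            SentenceDetectorME detector = new SentenceDetectorME(model);

            // The annotator does the equivalent of this on the document text and
            // creates one sentence annotation per returned span.
            Span[] sentences = detector.sentPosDetect("First sentence. Second sentence.");
            for (Span s : sentences) {
                System.out.println(s.getStart() + ".." + s.getEnd());
            }
        }
    }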

Tokenizer

The Tokenizer segments text into tokens. Tokenization is performed sentence-wise and the Tokenizer outputs annotations of the token type. The pre-trained OpenNLP tokenizer models assume that sentence detection has already been done, but tokenization also works reasonably well on the document level without sentence annotations.
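
A minimal sketch of the underlying tokenizer call, again using the plain OpenNLP Tools API; the model file name en-token.bin is an assumption.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    public class TokenizerSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained tokenizer model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-token.bin");
            TokenizerModel model = new TokenizerModel(modelIn);
            modelIn.close();

            TokenizerME tokenizer = new TokenizerME(model);

            // The annotator tokenizes the covered text of each sentence annotation
            // and creates one token annotation per returned span.
            Span[] tokens = tokenizer.tokenizePos("The driver got badly injured.");
            for (Span t : tokens) {
                System.out.println(t.getStart() + ".." + t.getEnd());
            }
        }
    }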

Name Finder

The named-entity detector can detect various kinds of entities; which kind depends on the model it is loaded with. To detect entities, the Name Finder annotator needs sentences and tokens as input. It outputs name annotations of the specified type.
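
A minimal sketch of the underlying name finder call using the plain OpenNLP Tools API; en-ner-person.bin is an assumed model file name, and each such model detects a single entity type.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class NameFinderSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained name finder model; the file name is an assumption
            // and each model detects only one entity type.
            InputStream modelIn = new FileInputStream("en-ner-person.bin");
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            modelIn.close();

            NameFinderME nameFinder = new NameFinderME(model);

            // The annotator passes in the tokens of each sentence and creates a name
            // annotation for every returned span (offsets are token indexes).
            String[] tokens = { "Mike", "Smith", "lives", "in", "Boston", "." };
            Span[] names = nameFinder.find(tokens);
            System.out.println(names.length + " names found");

            // Clear the adaptive data after each document.
            nameFinder.clearAdaptiveData();
        }
    }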

Part of Speech Tagger

The pos tagger detects the part of speech of the individual tokens. It needs sentences and tokens as input and writes the pos tags to the pos feature of the input tokens.
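
A minimal sketch of the underlying tagger call using the plain OpenNLP Tools API; en-pos-maxent.bin is an assumed model file name.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class PosTaggerSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained pos model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
            POSModel model = new POSModel(modelIn);
            modelIn.close();

            POSTaggerME tagger = new POSTaggerME(model);

            // The annotator tags the tokens of each sentence; tags[i] is written
            // to the pos feature of the i-th token annotation.
            String[] tokens = { "The", "driver", "got", "badly", "injured", "." };
            String[] tags = tagger.tag(tokens);
            System.out.println(Arrays.toString(tags));
        }
    }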

Chunker

The Chunker groups the tokens of a sentence into syntactically correlated chunks, such as noun and verb phrases. It needs sentences, tokens and pos tags as input and outputs chunk annotations.
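
A minimal sketch of the underlying chunker call using the plain OpenNLP Tools API; en-chunker.bin is an assumed model file name.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;

    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;

    public class ChunkerSketch {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained chunker model; the file name is an assumption.
            InputStream modelIn = new FileInputStream("en-chunker.bin");
            ChunkerModel model = new ChunkerModel(modelIn);
            modelIn.close();

            ChunkerME chunker = new ChunkerME(model);

            // The annotator chunks each sentence from its tokens and pos tags and
            // builds chunk annotations from the resulting B-/I-/O tag sequence.
            String[] tokens = { "He", "reckons", "the", "deficit", "will", "narrow", "." };
            String[] posTags = { "PRP", "VBZ", "DT", "NN", "MD", "VB", "." };
            String[] chunkTags = chunker.chunk(tokens, posTags);
            System.out.println(Arrays.toString(chunkTags));
        }
    }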

Doccat and Parser

Both integrations are still experimental and it is not recommended to use them in a production environment. Please see the source code for further details.

Bug Reports

Please report bugs in the bug tracking section of the OpenNLP SourceForge site:

sourceforge.net/tracker/?group_id=3368&atid=103368

Note: Incorrect automatic annotation of a specific piece of text does not constitute a bug. The best way to address such errors is to provide annotated data on which the annotator/tagger can be trained, so that it might learn not to make these mistakes in the future.