The OpenNLP project is home to a set of Java-based NLP tools that perform sentence detection, tokenization, POS tagging, chunking, parsing, named-entity detection, and coreference resolution.
The sentence detector, tokenizer, named-entity detector, and POS tagger are integrated into the UIMA framework. The integration supports both tagging with these tools and training them.
The OpenNLP Tools UIMA Integration binary distribution contains all annotators in a jar file, sample descriptors with a type system, and a sample PEAR. The sample PEAR is intended for demonstration and includes all annotators with models for English. The annotators themselves are intended to be included in a custom analysis engine packaged by the user. To ease this, all types used by the annotators can be mapped to the user's type system inside the descriptors.
After the PEAR is installed, start the CAS Visual Debugger shipped with the UIMA framework and click Tools -> Load AE. Then select the opennlp.uima.OpenNlpTextAnalyzer_pear.xml file in the file dialog. Now enter some text and start the analysis engine with "Run -> Run OpenNLPTextAnalyzer". The results are displayed afterwards: you should see sentences, tokens, chunks, POS tags, and possibly some names. Remember that the input text must be written in English.
Models have been trained for several of the components and are required unless you want to train your own models exclusively from your own annotated data. These models can be downloaded via the "Models" link at opennlp.sourceforge.net. The models are large, so you may want to fetch only the ones you need. Models for the corresponding components can be found in the following directories:
The Sentence Detector segments a document into sentences. It takes the document text as its only input and outputs sentence annotations. The pre-trained OpenNLP sentence detector models assume that sentence detection is the first analysis step.
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ModelName | String | Path to the OpenNLP sentence detector model file | yes |
opennlp.uima.SentenceType | String | The full name of the sentence type | yes |
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ProbabilityFeature | String | The name of the double probability feature (not set by default) | no |
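The parameters above are set in the analysis engine descriptor of the annotator. As a rough sketch, the following Java snippet renders a parameter map as a `configurationParameterSettings` fragment of the kind used in UIMA descriptors. The class and method names, the model path `models/en-sent.bin`, the type name `opennlp.uima.Sentence`, and the feature name `prob` are assumptions made for illustration, not values taken from the distribution.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: renders annotator parameters as a UIMA
// <configurationParameterSettings> fragment. All names of this class
// and its methods are hypothetical helpers, not part of OpenNLP or UIMA.
class SentenceDetectorSettings {

    // Escape the characters that are unsafe in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Pick the descriptor value element that matches the Java value type.
    static String valueTag(Object v) {
        if (v instanceof Boolean) return "boolean";
        if (v instanceof Integer) return "integer";
        return "string";
    }

    // Build one <nameValuePair> per parameter, in insertion order.
    static String toSettingsXml(Map<String, Object> params) {
        StringBuilder sb = new StringBuilder("<configurationParameterSettings>\n");
        for (Map.Entry<String, Object> e : params.entrySet()) {
            String tag = valueTag(e.getValue());
            sb.append("  <nameValuePair>\n")
              .append("    <name>").append(escape(e.getKey())).append("</name>\n")
              .append("    <value><").append(tag).append('>')
              .append(escape(String.valueOf(e.getValue())))
              .append("</").append(tag).append("></value>\n")
              .append("  </nameValuePair>\n");
        }
        return sb.append("</configurationParameterSettings>\n").toString();
    }

    // Example settings for the sentence detector.
    static Map<String, Object> example() {
        Map<String, Object> p = new LinkedHashMap<>();
        p.put("opennlp.uima.ModelName", "models/en-sent.bin");       // hypothetical path
        p.put("opennlp.uima.SentenceType", "opennlp.uima.Sentence"); // assumed type name
        p.put("opennlp.uima.ProbabilityFeature", "prob");            // hypothetical feature
        return p;
    }

    public static void main(String[] args) {
        System.out.print(toSettingsXml(example()));
    }
}
```

A real descriptor also has to declare each parameter in its `configurationParameters` section; the sketch covers the settings only.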
The Tokenizer segments text into tokens. Tokenization is performed sentence by sentence, and the Tokenizer outputs annotations of the token type. The pre-trained OpenNLP tokenizer models assume that sentence detection has already been done, but tokenization also works satisfactorily on a whole document without sentence annotations.
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ModelName | String | Path to the OpenNLP token model file | yes |
opennlp.uima.SentenceType | String | The full name of the sentence type | yes |
opennlp.uima.TokenType | String | The full name of the token type | yes |
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ProbabilityFeature | String | The name of the double probability feature (not set by default) | no |
opennlp.uima.tokenizer.IsAlphaNumericOptimization | Boolean | If true, use alphanumeric optimization. The default is false. | no |
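The sentence-by-sentence processing order described for the Tokenizer can be illustrated with a small sketch. It does not use the OpenNLP models: `java.text.BreakIterator` from the Java standard library is swapped in as a stand-in for both the sentence detector and the tokenizer, so the boundaries it finds will differ from those of the trained models. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: sentence-wise tokenization with BreakIterator as a
// stand-in for the OpenNLP sentence detector and tokenizer models.
class SentenceWiseTokenization {

    // Split the document text into trimmed sentence strings.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    // Split one sentence into tokens, dropping whitespace-only pieces.
    static List<String> tokens(String sentence) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(sentence);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = sentence.substring(start, end);
            if (!piece.trim().isEmpty()) out.add(piece);
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokenize each sentence separately, mirroring the annotator order.
        for (String s : sentences("Hello world. This is a test.")) {
            System.out.println(s + " -> " + tokens(s));
        }
    }
}
```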
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ModelName | String | Path to the OpenNLP Name Finder model file | yes |
opennlp.uima.SentenceType | String | The full name of the sentence type | yes |
opennlp.uima.TokenType | String | The full name of the token type | yes |
opennlp.uima.NameType | String | The full name of the name type | yes |
opennlp.uima.Dictionary | String | Path to the dictionary file | no |
opennlp.uima.TokenPatternOptimization | Boolean | | no |
opennlp.uima.namefinder.TokenFeature | Boolean | (default=true) | no |
opennlp.uima.namefinder.TokenFeature.previousWindowSize | Integer | (default=3) | no |
opennlp.uima.namefinder.TokenFeature.nextWindowSize | Integer | (default=3) | no |
opennlp.uima.namefinder.TokenClassFeature | Boolean | (default=true) | no |
opennlp.uima.namefinder.TokenClassFeature.previousWindowSize | Integer | (default=3) | no |
opennlp.uima.namefinder.TokenClassFeature.nextWindowSize | Integer | (default=3) | no |
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ProbabilityFeature | String | The name of the double probability feature (not set by default) | no |
opennlp.uima.BeamSize | Integer | Search beam size. | no |
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ModelName | String | Path to the OpenNLP POS Tagger model file | yes |
opennlp.uima.SentenceType | String | The full name of the sentence type | yes |
opennlp.uima.TokenType | String | The full name of the token type | yes |
opennlp.uima.POSFeature | String | The name of the token POS feature; the feature must be of type String | yes |
opennlp.uima.Dictionary | String | Path to the dictionary file | no |
opennlp.uima.TagDictionaryName | String | Path to the tag dictionary file | no |
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ProbabilityFeature | String | The name of the double probability feature (not set by default) | no |
opennlp.uima.BeamSize | Integer | Search beam size. | no |
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.ModelName | String | Path to the OpenNLP Chunker model file | yes |
opennlp.uima.SentenceType | String | The full name of the sentence type | yes |
opennlp.uima.TokenType | String | The full name of the token type | yes |
opennlp.uima.POSFeature | String | The name of the token POS feature; the feature must be of type String | yes |
opennlp.uima.ChunkType | String | The full name of the chunk type | yes |
opennlp.uima.ChunkTagFeature | String | Name of the chunk feature | yes |
Name | Type | Description | Mandatory |
---|---|---|---|
opennlp.uima.BeamSize | Integer | Search beam size. | no |
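Since each annotator has its own set of mandatory parameters, it can help to check a configuration before building a descriptor. The following is a minimal sketch under stated assumptions: the parameter names come from the tables above, while the component labels (`SentenceDetector`, `Tokenizer`, `NameFinder`, `POSTagger`, `Chunker`), the class, and the `missing` helper are hypothetical and not part of the OpenNLP UIMA integration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: reports which mandatory parameters are not yet configured.
class MandatoryParameters {

    // Mandatory parameter names per component, as listed in the tables.
    // The component labels used as keys are hypothetical.
    static final Map<String, List<String>> MANDATORY = new LinkedHashMap<>();
    static {
        MANDATORY.put("SentenceDetector", Arrays.asList(
            "opennlp.uima.ModelName", "opennlp.uima.SentenceType"));
        MANDATORY.put("Tokenizer", Arrays.asList(
            "opennlp.uima.ModelName", "opennlp.uima.SentenceType",
            "opennlp.uima.TokenType"));
        MANDATORY.put("NameFinder", Arrays.asList(
            "opennlp.uima.ModelName", "opennlp.uima.SentenceType",
            "opennlp.uima.TokenType", "opennlp.uima.NameType"));
        MANDATORY.put("POSTagger", Arrays.asList(
            "opennlp.uima.ModelName", "opennlp.uima.SentenceType",
            "opennlp.uima.TokenType", "opennlp.uima.POSFeature"));
        MANDATORY.put("Chunker", Arrays.asList(
            "opennlp.uima.ModelName", "opennlp.uima.SentenceType",
            "opennlp.uima.TokenType", "opennlp.uima.POSFeature",
            "opennlp.uima.ChunkType", "opennlp.uima.ChunkTagFeature"));
    }

    // Return the mandatory parameters of a component that are not configured.
    static List<String> missing(String component, Set<String> configured) {
        List<String> required = MANDATORY.get(component);
        if (required == null) {
            throw new IllegalArgumentException("Unknown component: " + component);
        }
        List<String> out = new ArrayList<>();
        for (String name : required) {
            if (!configured.contains(name)) out.add(name);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> configured = new HashSet<>(Arrays.asList(
            "opennlp.uima.ModelName", "opennlp.uima.SentenceType"));
        System.out.println("Tokenizer is missing: " + missing("Tokenizer", configured));
    }
}
```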
Please report bugs in the bug section of the OpenNLP SourceForge site:
sourceforge.net/tracker/?group_id=3368&atid=103368
Note: Incorrect automatic annotation of a specific piece of text does not constitute a bug. The best way to address such errors is to provide annotated data on which the annotator or tagger can be trained, so that it can learn to avoid these mistakes in the future.