Part-of-Speech Tagger
Tagging
The Part of Speech Tagger marks tokens with their corresponding word type
based on the token itself and its context. A token might have multiple
possible POS tags depending on the context. The OpenNLP POS Tagger
uses a probability model to predict the correct POS tag from the tag set.
To limit the possible tags for a token, a tag dictionary can be used, which improves
both the tagging accuracy and the runtime performance of the tagger.
POS Tagger Tool
The easiest way to try out the POS Tagger is the command line tool. The tool is
only intended for demonstration and testing.
Download the English maxent POS model and start the POS Tagger Tool with this command:
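Assuming the downloaded model file is named en-pos-maxent.bin (the exact name depends on the model you downloaded), the tool can be started like this:

```
$ opennlp POSTagger en-pos-maxent.bin
```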
The POS Tagger now reads a tokenized sentence per line from stdin.
Copy these two sentences to the console:
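For example, two tokenized sentences like these (the sentences are illustrative; any tokenized input works):

```
The quick brown fox jumps over the lazy dog .
Mr. Smith spoke at the meeting yesterday .
```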
The POS Tagger will then echo the sentences with POS tags to the console:
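With the English maxent model the output looks similar to this (the exact tags depend on the model; this output is illustrative):

```
The_DT quick_JJ brown_JJ fox_NN jumps_VBZ over_IN the_DT lazy_JJ dog_NN ._.
Mr._NNP Smith_NNP spoke_VBD at_IN the_DT meeting_NN yesterday_NN ._.
```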
The tag set used by the English POS model is the Penn Treebank tag set.
POS Tagger API
The POS Tagger can be embedded into an application via its API.
First the POS model must be loaded into memory from disk or another source.
In the sample below it is loaded from disk.
After the model is loaded the POSTaggerME can be instantiated.
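A sketch of these two steps (the model file name is an assumption; use the path of the model you downloaded):

```java
InputStream modelIn = null;
POSModel model = null;

try {
  // Model file name is illustrative
  modelIn = new FileInputStream("en-pos-maxent.bin");
  model = new POSModel(modelIn);
}
catch (IOException e) {
  // The model file could not be read or is invalid
  e.printStackTrace();
}
finally {
  if (modelIn != null) {
    try {
      modelIn.close();
    }
    catch (IOException e) {
      // Closing failed, nothing to do
    }
  }
}

POSTaggerME tagger = new POSTaggerME(model);
```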
The POS Tagger instance is now ready to tag data. It expects a tokenized sentence
as input, represented as a String array; each String object in the array
is one token.
The following code shows how to determine the most likely pos tag sequence for a sentence.
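For example, with the POSTaggerME instance from above (here called tagger; the sentence is illustrative):

```java
String sent[] = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
    "morning", "and", "afternoon", "newspapers", "."};

String tags[] = tagger.tag(sent);
```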
The tags array contains one part-of-speech tag for each token in the input array.
Each tag is found at the same index as its token in the input array.
The confidence scores for the returned tags can be easily retrieved from
a POSTaggerME with the following method call:
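Assuming tagger is the POSTaggerME instance that was just used to tag a sentence:

```java
// probs[i] is the confidence score for the tag at tags[i]
double probs[] = tagger.probs();
```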
The call to probs is stateful and always returns the probabilities of the last
tagged sentence. The probs method should only be called after the tag method;
otherwise the behavior is undefined.
Some applications need to retrieve the n-best POS tag sequences and not
only the best sequence.
The topKSequences method is capable of returning the top sequences.
It can be called in a similar way as tag.
Each Sequence object contains one sequence. The sequence can be retrieved
via Sequence.getOutcomes() which returns a tags array
and Sequence.getProbs() returns the probability array for this sequence.
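A sketch, reusing tagger and sent from the examples above:

```java
Sequence topSequences[] = tagger.topKSequences(sent);

// The best sequence comes first; getOutcomes() returns the tags as a List
String[] tags = topSequences[0].getOutcomes().toArray(new String[0]);
double[] probs = topSequences[0].getProbs();
```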
Training
The POS Tagger can be trained on annotated training material. The training material
is a collection of tokenized sentences where each token is annotated with its part-of-speech tag.
The native POS Tagger training material looks like this:
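For example (the sentences and tags below are illustrative):

```
About_IN 10_CD years_NNS ago_RB ,_, the_DT company_NN bought_VBD the_DT plant_NN ._.
He_PRP reckons_VBZ the_DT current_JJ account_NN deficit_NN will_MD narrow_VB ._.
```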
Each sentence must be on one line. The token/tag pairs are joined with "_"
and separated by whitespace. The data format does not
define a document boundary; if document boundaries should be included in the
training material, it is suggested to mark them with an empty line.
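To illustrate the format, the following self-contained sketch splits one such training line into its token/tag pairs (the sample line is illustrative, not taken from a real corpus):

```java
public class WordTagFormatDemo {

    public static void main(String[] args) {
        // One training sentence in the native format:
        // token_TAG pairs separated by whitespace.
        String line = "About_IN 10_CD years_NNS ago_RB ._.";

        for (String pair : line.split("\\s+")) {
            // The tag follows the last underscore, so tokens that
            // themselves contain an underscore are still split correctly.
            int sep = pair.lastIndexOf('_');
            String token = pair.substring(0, sep);
            String tag = pair.substring(sep + 1);
            System.out.println(token + " -> " + tag);
        }
        // Prints: About -> IN, 10 -> CD, years -> NNS, ago -> RB, . -> .
    }
}
```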
The Part-of-Speech Tagger can be trained either with a command line tool
or via the training API.
Training Tool
OpenNLP has a command line tool which is used to train the models available from the model
download page on various corpora.
Usage of the tool:
The following command illustrates how an English part-of-speech model can be trained:
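Assuming the training data is in a file named en-pos.train (the file names are placeholders), the invocation looks like this:

```
$ opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \
    -lang en -data en-pos.train -encoding UTF-8
```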
Training API
The Part-of-Speech Tagger training API supports the training of a new POS model.
Basically three steps are necessary to train it:
The application must open a sample data stream
Call the POSTagger.train method
Save the POSModel to a file or database
The following code illustrates that:
POSModel model = null;

InputStream dataIn = null;
try {
  // Training data file name is illustrative
  dataIn = new FileInputStream("en-pos.train");
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

  model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}
finally {
  if (dataIn != null) {
    try {
      dataIn.close();
    }
    catch (IOException e) {
      // Not an issue, training already finished.
      // The exception should be logged and investigated
      // if part of a production system.
      e.printStackTrace();
    }
  }
}
The above code performs the first two steps: opening the data and training
the model. The trained model must still be saved into an OutputStream; in
the sample below it is written to a file.
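A sketch of the serialization step (the output file name is an assumption):

```java
OutputStream modelOut = null;

try {
  // Output file name is illustrative
  modelOut = new BufferedOutputStream(new FileOutputStream("en-pos-maxent.bin"));
  model.serialize(modelOut);
}
catch (IOException e) {
  // Failed to save the model
  e.printStackTrace();
}
finally {
  if (modelOut != null) {
    try {
      modelOut.close();
    }
    catch (IOException e) {
      // Failed to correctly save the model.
      // The written model file might be invalid.
      e.printStackTrace();
    }
  }
}
```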
Tag Dictionary
The tag dictionary is a word dictionary which specifies which tags a specific token can have. Using a tag
dictionary has two advantages: inappropriate tags cannot be assigned to tokens contained in the dictionary, and the
beam search algorithm has to consider fewer possibilities and can search faster.
The dictionary is defined in an XML format and can be created and stored with the POSDictionary class.
For now, please check out the Javadoc and source code of that class.
Note: The format should be documented and sample code should show how to use the dictionary.
Any contributions are very welcome. If you want to contribute, please contact us on the mailing list
or comment on the Jira issue OPENNLP-287.
Evaluation
The built-in evaluation can measure the accuracy of the POS tagger.
The accuracy can be measured on a test data set or via cross validation.
Evaluation Tool
There is a command line tool to evaluate a given model on a test data set.
The following command shows how the tool can be run:
This will display the resulting accuracy score, e.g.:
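A sketch of the invocation and its output (file names are placeholders; the accuracy value is illustrative):

```
$ opennlp POSTaggerEvaluator -model en-pos-maxent.bin -data en-pos.test -encoding UTF-8
...
Accuracy: 0.96
```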
There is a command line tool to cross validate a data set.
The following command shows how the tool can be run:
This will display the resulting accuracy score, e.g.:
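A sketch of the invocation and its output (file names are placeholders; the accuracy value is illustrative):

```
$ opennlp POSTaggerCrossValidator -lang en -data en-pos.train -encoding UTF-8
...
Accuracy: 0.96
```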