Part-of-Speech Tagger
Tagging
The Part of Speech Tagger marks tokens with their corresponding word type
based on the token itself and its context. A token might have multiple
possible POS tags depending on the context. The OpenNLP POS Tagger
uses a probability model to predict the correct POS tag from the tag set.
To limit the possible tags for a token, a tag dictionary can be used, which improves
both the tagging accuracy and the runtime performance of the tagger.
POS Tagger Tool
The easiest way to try out the POS Tagger is the command line tool. The tool is
only intended for demonstration and testing.
Download the English maxent POS model and start the POS Tagger Tool with this command:
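Assuming the downloaded model file is named en-pos-maxent.bin (the exact name depends on the model you downloaded), the tool can be started like this:

```
$ opennlp POSTagger en-pos-maxent.bin
```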
The POS Tagger now reads a tokenized sentence per line from stdin.
Copy these two sentences to the console:
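For example, two tokenized sentences like these (the sentences are illustrative; any tokenized input works):

```
The quick brown fox jumps over the lazy dog .
Mr. Smith spoke at the meeting yesterday .
```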
The POS Tagger will then echo the sentences with POS tags to the console:
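With the English maxent model the output looks similar to this (the exact tags depend on the model; this output is illustrative):

```
The_DT quick_JJ brown_JJ fox_NN jumps_VBZ over_IN the_DT lazy_JJ dog_NN ._.
Mr._NNP Smith_NNP spoke_VBD at_IN the_DT meeting_NN yesterday_NN ._.
```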
The tag set used by the English POS model is the Penn Treebank tag set.
POS Tagger API
The POS Tagger can be embedded into an application via its API.
First the POS model must be loaded into memory from disk or another source.
In the sample below it is loaded from disk.
After the model is loaded the POSTaggerME can be instantiated.
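A sketch of these two steps (the model file name is an assumption; use the path of the model you downloaded):

```java
InputStream modelIn = null;
POSModel model = null;

try {
  // Model file name is illustrative
  modelIn = new FileInputStream("en-pos-maxent.bin");
  model = new POSModel(modelIn);
}
catch (IOException e) {
  // The model file could not be read or is invalid
  e.printStackTrace();
}
finally {
  if (modelIn != null) {
    try {
      modelIn.close();
    }
    catch (IOException e) {
      // Closing failed, nothing to do
    }
  }
}

POSTaggerME tagger = new POSTaggerME(model);
```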
The POS Tagger instance is now ready to tag data. It expects a tokenized sentence
as input, represented as a String array; each String object in the array
is one token.
The following code shows how to determine the most likely pos tag sequence for a sentence.
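For example, with the POSTaggerME instance from above (here called tagger; the sentence is illustrative):

```java
String sent[] = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
    "morning", "and", "afternoon", "newspapers", "."};

String tags[] = tagger.tag(sent);
```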
The tags array contains one part-of-speech tag for each token in the input array.
Each tag is found at the same index as its token in the input array.
The confidence scores for the returned tags can be easily retrieved from
a POSTaggerME with the following method call:
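Assuming tagger is the POSTaggerME instance that was just used to tag a sentence:

```java
// probs[i] is the confidence score for the tag at tags[i]
double probs[] = tagger.probs();
```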
The call to probs is stateful and always returns the probabilities of the last
tagged sentence. The probs method should only be called after the tag method;
otherwise the behavior is undefined.
Some applications need to retrieve the n-best POS tag sequences and not
only the best sequence.
The topKSequences method is capable of returning the top sequences.
It can be called in a similar way as tag.
Each Sequence object contains one sequence. The sequence can be retrieved
via Sequence.getOutcomes() which returns a tags array
and Sequence.getProbs() returns the probability array for this sequence.
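A sketch, reusing tagger and sent from the examples above:

```java
Sequence topSequences[] = tagger.topKSequences(sent);

// The best sequence comes first; getOutcomes() returns the tags as a List
String[] tags = topSequences[0].getOutcomes().toArray(new String[0]);
double[] probs = topSequences[0].getProbs();
```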
Training
The POS Tagger can be trained on annotated training material. The training material
is a collection of tokenized sentences where each token is annotated with its part-of-speech tag.
The native POS Tagger training material looks like this:
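For example (the sentences and tags below are illustrative):

```
About_IN 10_CD years_NNS ago_RB ,_, the_DT company_NN bought_VBD the_DT plant_NN ._.
He_PRP reckons_VBZ the_DT current_JJ account_NN deficit_NN will_MD narrow_VB ._.
```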
Each sentence must be on one line. The token/tag pairs are joined with "_"
and separated by whitespace. The data format does not
define a document boundary; if document boundaries should be included in the
training material, it is suggested to mark them with an empty line.
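To illustrate the format, the following self-contained sketch splits one such training line into its token/tag pairs (the sample line is illustrative, not taken from a real corpus):

```java
public class WordTagFormatDemo {

    public static void main(String[] args) {
        // One training sentence in the native format:
        // token_TAG pairs separated by whitespace.
        String line = "About_IN 10_CD years_NNS ago_RB ._.";

        for (String pair : line.split("\\s+")) {
            // The tag follows the last underscore, so tokens that
            // themselves contain an underscore are still split correctly.
            int sep = pair.lastIndexOf('_');
            String token = pair.substring(0, sep);
            String tag = pair.substring(sep + 1);
            System.out.println(token + " -> " + tag);
        }
        // Prints: About -> IN, 10 -> CD, years -> NNS, ago -> RB, . -> .
    }
}
```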
The Part-of-Speech Tagger can be trained either with a command line tool
or via the training API.
Training Tool
OpenNLP has a command line tool which is used to train the models available from the model
download page on various corpora.
Usage of the tool:
The following command illustrates how an English part-of-speech model can be trained:
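Assuming the training data is in a file named en-pos.train (the file names are placeholders), the invocation looks like this:

```
$ opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \
    -lang en -data en-pos.train -encoding UTF-8
```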
Training API
The Part-of-Speech Tagger training API supports the training of a new POS model.
Basically three steps are necessary to train it:
The application must open a sample data stream
Call the POSTagger.train method
Save the POSModel to a file or database
The following code illustrates that:
POSModel model = null;

InputStream dataIn = null;
try {
  // Training data file name is illustrative
  dataIn = new FileInputStream("en-pos.train");
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

  model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}
finally {
  if (dataIn != null) {
    try {
      dataIn.close();
    }
    catch (IOException e) {
      // Not an issue, training already finished.
      // The exception should be logged and investigated
      // if part of a production system.
      e.printStackTrace();
    }
  }
}
The above code performs the first two steps: opening the data and training
the model. The trained model must still be saved into an OutputStream; in
the sample below it is written to a file.
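A sketch of the serialization step (the output file name is an assumption):

```java
OutputStream modelOut = null;

try {
  // Output file name is illustrative
  modelOut = new BufferedOutputStream(new FileOutputStream("en-pos-maxent.bin"));
  model.serialize(modelOut);
}
catch (IOException e) {
  // Failed to save the model
  e.printStackTrace();
}
finally {
  if (modelOut != null) {
    try {
      modelOut.close();
    }
    catch (IOException e) {
      // Failed to correctly save the model.
      // The written model file might be invalid.
      e.printStackTrace();
    }
  }
}
```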
Tag Dictionary
The tag dictionary is a word dictionary which specifies which tags a specific token can have. Using a tag
dictionary has two advantages: inappropriate tags cannot be assigned to tokens contained in the dictionary, and the
beam search algorithm has to consider fewer possibilities and can search faster.
The dictionary is defined in an XML format and can be created and stored with the POSDictionary class.
For now, please check out the Javadoc and source code of that class.
Note: The format should be documented and sample code should show how to use the dictionary.
Any contributions are very welcome. If you want to contribute, please contact us on the mailing list
or comment on the Jira issue OPENNLP-287.
Evaluation
The built-in evaluation can measure the accuracy of the POS tagger.
The accuracy can be measured on a test data set or via cross validation.
Evaluation Tool
There is a command line tool to evaluate a given model on a test data set.
The following command shows how the tool can be run:
This will display the resulting accuracy score, e.g.:
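A sketch of the invocation and its output (file names are placeholders; the accuracy value is illustrative):

```
$ opennlp POSTaggerEvaluator -model en-pos-maxent.bin -data en-pos.test -encoding UTF-8
...
Accuracy: 0.96
```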
There is a command line tool to cross validate a data set.
The following command shows how the tool can be run:
This will display the resulting accuracy score, e.g.:
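A sketch of the invocation and its output (file names are placeholders; the accuracy value is illustrative):

```
$ opennlp POSTaggerCrossValidator -lang en -data en-pos.train -encoding UTF-8
...
Accuracy: 0.96
```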