Tokenizer
Tokenization

The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc. OpenNLP offers multiple tokenizer implementations:

Whitespace Tokenizer - A whitespace tokenizer; non-whitespace sequences are identified as tokens.
Simple Tokenizer - A character class tokenizer; sequences of the same character class are tokens.
Learnable Tokenizer - A maximum entropy tokenizer which detects token boundaries based on a probability model.

Most part-of-speech taggers, parsers, and so on work with text tokenized in this manner. It is important to ensure that your tokenizer produces tokens of the type expected by your later text processing components. With OpenNLP (as with many systems), tokenization is a two-stage process: first, sentence boundaries are identified, then the tokens within each sentence are identified. The example below shows the individual tokens of a sample text in a whitespace separated representation.
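The sample text is the same text that is used as training data later in this chapter; tokenized, it looks like this (the exact token boundaries produced by the learnable tokenizer depend on its model):

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate .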
Tokenizer Tools

The easiest way to try out the tokenizers is via the command line tools. The tools are only intended for demonstration and testing. There are two tools, one for the Simple Tokenizer and one for the learnable tokenizer. A command line tool for the Whitespace Tokenizer does not exist, because the whitespace separated output would be identical to the input. The following command shows how to use the Simple Tokenizer Tool:

$ opennlp SimpleTokenizer

To use the learnable tokenizer, download the English token model from our website and pass it to the tool:

$ opennlp TokenizerME en-token.bin

To test the tokenizer copy the sample from above to the console. The whitespace separated tokens will be written back to the console. Usually the input is read from a file and the output is written to a file:

$ opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt

It can be done in the same way for the Simple Tokenizer. Since most text comes truly raw and doesn't have sentence boundaries and such, it's possible to create a pipe which first performs sentence boundary detection and then tokenization. The following sample illustrates that:

$ opennlp SentenceDetector en-sent.bin < article.txt | opennlp TokenizerME en-token.bin > article-tokenized.txt

Of course this is all on the command line. Many people use the models directly in their Java code by creating SentenceDetector and Tokenizer objects and calling their methods as appropriate. The following section will explain how the Tokenizers can be used directly from Java.
Tokenizer API

The Tokenizers can be integrated into an application through the defined API. The shared instance of the WhitespaceTokenizer can be retrieved from the static field WhitespaceTokenizer.INSTANCE. The shared instance of the SimpleTokenizer can be retrieved in the same way from SimpleTokenizer.INSTANCE. To instantiate the TokenizerME (the learnable tokenizer) a token model must be loaded first; after the model is loaded the TokenizerME can be instantiated. The tokenizer offers two tokenize methods; both expect an input String object which contains the untokenized text. If possible it should be a sentence, but depending on the training of the learnable tokenizer this is not required. The first method, tokenize, returns an array of Strings, where each String is one token. The second method, tokenizePos, returns an array of Spans; each Span contains the begin and end character offsets of the token in the input String. To get the text for one span, call Span.getCoveredText on the span and pass in the input text. The TokenizerME is also able to output the probabilities for the detected tokens: the getTokenProbabilities method must be called directly after one of the tokenize methods was called. It returns one double value per token; the value is between 0 and 1, where 1 is the highest possible probability and 0 the lowest possible probability. The code sample below illustrates these steps.
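The following is a minimal, self-contained sketch of these steps. The model file name en-token.bin matches the training output shown later in this chapter; the sample sentence, class name, and variable names are only illustrative:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class TokenizerExample {

  public static void main(String[] args) throws IOException {

    // The rule based tokenizers are shared singletons
    Tokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
    Tokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;

    // Load the token model for the learnable tokenizer
    TokenizerModel model;
    try (InputStream modelIn = new FileInputStream("en-token.bin")) {
      model = new TokenizerModel(modelIn);
    }
    TokenizerME tokenizer = new TokenizerME(model);

    String sentence = "An input sample sentence.";

    // First method: one String per token,
    // expected output: "An", "input", "sample", "sentence", "."
    String[] tokens = tokenizer.tokenize(sentence);
    System.out.println(String.join(" ", tokens));

    // Second method: begin and end character offsets of each token;
    // tokenSpans then contains 5 elements, one Span per token
    Span[] tokenSpans = tokenizer.tokenizePos(sentence);
    for (Span span : tokenSpans) {
      CharSequence coveredText = span.getCoveredText(sentence);
      System.out.println(span + " -> " + coveredText);
    }

    // Probabilities for the tokens of the last tokenize/tokenizePos call,
    // one double per token, between 0 (lowest) and 1 (highest)
    double[] tokenProbs = tokenizer.getTokenProbabilities();
  }
}

The shared instances WhitespaceTokenizer.INSTANCE and SimpleTokenizer.INSTANCE can be used anywhere a Tokenizer is expected, without loading a model.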
Tokenizer Training
Training Tool

OpenNLP has a command line tool which is used to train the models available from the model download page on various corpora. The data can be converted to the OpenNLP Tokenizer training format or used directly. The OpenNLP format contains one sentence per line. Tokens are either separated by whitespace or by a special <SPLIT> tag. The following sample shows the sample from above in the correct format:

Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>.
Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British industrial conglomerate<SPLIT>.

To train the English tokenizer use the following command:

$ opennlp TokenizerTrainer -model en-token.bin -lang en -data en-token.train -encoding UTF-8

...
95: .. loglikelihood=-769.2107474529454 0.999511955191386
96: .. loglikelihood=-763.8891914534009 0.999511955191386
97: .. loglikelihood=-758.6685383254891 0.9995157680414533
98: .. loglikelihood=-753.5458314695236 0.9995157680414533
99: .. loglikelihood=-748.5182305519613 0.9995157680414533
100: .. loglikelihood=-743.5830058068038 0.9995157680414533
Wrote tokenizer model.
Path: en-token.bin
Training API

The Tokenizer offers an API to train a new tokenization model. Basically three steps are necessary to train it:

The application must open a sample data stream
Call the TokenizerME.train method
Save the TokenizerModel to a file or directly use it

The following sample code illustrates these steps:

Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new FileInputStream("en-token.train"), charset);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

TokenizerModel model;

try {
  model = TokenizerME.train("en", sampleStream, true, TrainingParameters.defaultParams());
}
finally {
  sampleStream.close();
}

OutputStream modelOut = null;
try {
  modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
  model.serialize(modelOut);
}
finally {
  if (modelOut != null)
    modelOut.close();
}
Detokenizing

Detokenizing is simply the opposite of tokenization: the original non-tokenized string should be reconstructed out of a token sequence. The OpenNLP implementation was created to undo the tokenization of training data for the tokenizer. It can also be used to undo the tokenization produced by such a trained tokenizer. The implementation is strictly rule based and defines how tokens should be attached to a sentence-wise character sequence. The rule dictionary assigns to every token an operation which describes how it should be attached to one continuous character sequence. The following rules can be assigned to a token:

MERGE_TO_LEFT - Merges the token to the left side.
MERGE_TO_RIGHT - Merges the token to the right side.
RIGHT_LEFT_MATCHING - Merges the token to the right side on first occurrence and to the left side on second occurrence.

The following sample illustrates the detokenizer with a small rule dictionary (illustration format, not the xml data format):

. -> MERGE_TO_LEFT
" -> RIGHT_LEFT_MATCHING

The dictionary should be used to de-tokenize the following whitespace tokenized sentence:

He said " This is a test " .

The tokens would get these tags based on the dictionary:

He -> NO_OPERATION
said -> NO_OPERATION
" -> MERGE_TO_RIGHT
This -> NO_OPERATION
is -> NO_OPERATION
a -> NO_OPERATION
test -> NO_OPERATION
" -> MERGE_TO_LEFT
. -> MERGE_TO_LEFT

That will result in the following character sequence:

He said "This is a test".

TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome.
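To make the merge operations concrete, the following sketch applies a small rule dictionary to a token array exactly as described above. It is an illustration of the rules only, not the OpenNLP Detokenizer API; the class, method, and enum names are invented for this example:

import java.util.Map;

public class DetokenizerSketch {

  enum Operation { MERGE_TO_LEFT, MERGE_TO_RIGHT, RIGHT_LEFT_MATCHING }

  public static String detokenize(String[] tokens, Map<String, Operation> rules) {
    StringBuilder text = new StringBuilder();
    boolean openMatching = false;    // tracks first/second occurrence for RIGHT_LEFT_MATCHING
    boolean mergeNextToLeft = false; // set when the previous token was merged to the right

    for (String token : tokens) {
      Operation op = rules.get(token); // null means NO_OPERATION

      boolean mergeLeft = mergeNextToLeft;
      boolean mergeRight = false;

      if (op == Operation.MERGE_TO_LEFT) {
        mergeLeft = true;
      } else if (op == Operation.MERGE_TO_RIGHT) {
        mergeRight = true;
      } else if (op == Operation.RIGHT_LEFT_MATCHING) {
        if (openMatching) {
          mergeLeft = true;   // second occurrence: attach to the left
        } else {
          mergeRight = true;  // first occurrence: attach to the right
        }
        openMatching = !openMatching;
      }

      if (text.length() > 0 && !mergeLeft) {
        text.append(' ');
      }
      text.append(token);
      mergeNextToLeft = mergeRight;
    }
    return text.toString();
  }

  public static void main(String[] args) {
    Map<String, Operation> rules = Map.of(
        ".", Operation.MERGE_TO_LEFT,
        "\"", Operation.RIGHT_LEFT_MATCHING);

    String[] tokens = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};
    // prints: He said "This is a test".
    System.out.println(detokenize(tokens, rules));
  }
}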
Detokenizing API

TODO: Write documentation about the detokenizer API. Any contributions are very welcome. If you want to contribute, please contact us on the mailing list or comment on the Jira issue OPENNLP-216.
Detokenizer Dictionary

TODO: Write documentation about the detokenizer dictionary. Any contributions are very welcome. If you want to contribute, please contact us on the mailing list or comment on the Jira issue OPENNLP-217.