Chunker
Chunking Text chunking consists of dividing a text into syntactically correlated groups of words, such as noun groups and verb groups. It does not specify their internal structure, nor their role in the main sentence.
Chunker Tool The easiest way to try out the Chunker is the command line tool. The tool is intended for demonstration and testing only. Download the English maxent chunker model from the website and start the Chunker tool with it. The Chunker reads one POS-tagged sentence per line from stdin and echoes the sentence's grouped tokens back to the console. The tag set used by the English POS model is the Penn Treebank tag set.
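A minimal invocation could look like the following. The model file name en-chunker.bin, the sample sentence and the bracketed output are illustrative; the exact output formatting may differ between OpenNLP versions:

```shell
# Start the chunker tool with a downloaded model; it then reads
# POS-tagged sentences (token_TAG pairs) from stdin.
$ opennlp ChunkerME en-chunker.bin
He_PRP reckons_VBZ the_DT current_JJ account_NN deficit_NN will_MD narrow_VB ._.
# The tool prints the chunks in bracket notation, roughly:
#   [NP He_PRP ] [VP reckons_VBZ ] [NP the_DT current_JJ account_NN deficit_NN ] [VP will_MD narrow_VB ] ._.
```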
Chunking API The Chunker can be embedded into an application via its API. First the chunker model must be loaded into memory from disk or another source. In the sample below it is loaded from disk. After the model is loaded a Chunker can be instantiated. The Chunker instance is then ready to tag data. It expects a tokenized sentence as input, represented as a String array in which each String object is one token, together with the POS tags associated with each token. The chunk method determines the most likely chunk tag sequence for a sentence. The tags array contains one chunk tag for each token in the input array; the corresponding tag can be found at the same index as the token has in the input array. The confidence scores for the returned tags can be retrieved from a ChunkerME via the probs method. The call to probs is stateful and will always return the probabilities of the last tagged sentence. The probs method should only be called after the chunk method has been called; otherwise the behavior is undefined. Some applications need to retrieve the n-best chunk tag sequences rather than only the best sequence. The topKSequences method is capable of returning the top sequences. It can be called in a similar way as chunk. Each Sequence object contains one sequence. The sequence can be retrieved via Sequence.getOutcomes(), which returns a tags array, and Sequence.getProbs() returns the probability array for this sequence.
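Putting these calls together, a sketch of the API usage might look like this. The model path and the example sentence and POS arrays are placeholders, and this assumes the opennlp-tools jar is on the classpath:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.util.Sequence;

// Load the chunker model from disk (file name is a placeholder).
InputStream modelIn = new FileInputStream("en-chunker.bin");
ChunkerModel model;
try {
  model = new ChunkerModel(modelIn);
} finally {
  modelIn.close();
}

ChunkerME chunker = new ChunkerME(model);

// One token per array element, plus the POS tag for each token.
String[] sent = { "Rockwell", "said", "the", "agreement", "calls", "for", "it" };
String[] pos  = { "NNP", "VBD", "DT", "NN", "VBZ", "IN", "PRP" };

// Most likely chunk tag for each token, aligned by index.
String[] tags = chunker.chunk(sent, pos);

// Confidence score for each tag of the last chunked sentence.
double[] probs = chunker.probs();

// The n-best chunk tag sequences instead of only the best one.
Sequence[] topSequences = chunker.topKSequences(sent, pos);
```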
Chunker Training The pre-trained models might not be available for a desired language, might not detect important phrases, or the performance may not be good enough outside the news domain. These are the typical reasons to do custom training of the chunker on a new corpus or on a corpus which is extended with private training data taken from the data which should be analyzed. The training data can be converted to the OpenNLP chunker training format, which is based on CoNLL-2000; other formats may also be available. The training data consists of three columns separated by spaces. Each word is put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag and the third its chunk tag. The chunk tag contains the name of the chunk type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two kinds of chunk tags: B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk.
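The column layout and the B-/I- tagging scheme can be illustrated with a small self-contained sketch. In the training file, each row of the sample below would appear as one line such as "He PRP B-NP"; the sentence and the bracket helper are purely illustrative and not part of the OpenNLP API:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkTagDemo {

    // Groups tokens into bracketed chunks: a B-XX tag starts a chunk of
    // type XX, following I-XX tags extend it, and "O" marks tokens that
    // are outside any chunk.
    static String bracket(String[] tokens, String[] chunkTags) {
        List<String> parts = new ArrayList<>();
        List<String> current = null;
        String currentType = null;
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            boolean continues = tag.startsWith("I-") && currentType != null
                    && currentType.equals(tag.substring(2));
            if (!continues) {
                if (current != null) {
                    parts.add("[" + currentType + " " + String.join(" ", current) + "]");
                    current = null;
                    currentType = null;
                }
                if ("O".equals(tag)) {
                    parts.add(tokens[i]);
                } else {
                    currentType = tag.substring(2);
                    current = new ArrayList<>();
                    current.add(tokens[i]);
                }
            } else {
                current.add(tokens[i]);
            }
        }
        if (current != null) {
            parts.add("[" + currentType + " " + String.join(" ", current) + "]");
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Training-format columns: word, POS tag, chunk tag.
        String[] tokens = { "He", "reckons", "the", "current", "account", "deficit", "." };
        String[] pos    = { "PRP", "VBZ", "DT", "JJ", "NN", "NN", "." };
        String[] chunks = { "B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O" };
        System.out.println(bracket(tokens, chunks));
        // Prints: [NP He] [VP reckons] [NP the current account deficit] .
    }
}
```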
Training Tool OpenNLP has a command line tool which is used to train the models available from the model download page on various corpora. Assume the English chunker model should be trained from a file called en-chunker.train which is encoded as UTF-8; the training tool reads this file and writes the trained model to en-chunker.bin. Additionally it is possible to specify the number of iterations and the cutoff.
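Assuming a standard OpenNLP installation, the training invocation could look like the following. The tool and parameter names follow the 1.5-era CLI; consult the help output of your distribution for the exact options:

```shell
# Train a chunker model from CoNLL-2000-style training data.
$ opennlp ChunkerTrainerME -model en-chunker.bin -lang en \
    -data en-chunker.train -encoding UTF-8
```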
Training API The Chunker offers an API to train a new chunker model. The following sample reads the training data, trains a model and serializes it to disk:

```java
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new FileInputStream("en-chunker.train"), charset);
ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(lineStream);

ChunkerModel model;
try {
  model = ChunkerME.train("en", sampleStream,
      new DefaultChunkerContextGenerator(), TrainingParameters.defaultParams());
} finally {
  sampleStream.close();
}

OutputStream modelOut = null;
try {
  modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
  model.serialize(modelOut);
} finally {
  if (modelOut != null)
    modelOut.close();
}
```
Chunker Evaluation The built-in evaluation can measure the chunker performance. The performance is measured either on a test dataset or via cross validation.
Chunker Evaluation Tool The evaluation tool measures performance against a test dataset: given a data sample named en-chunker.eval and a trained model called en-chunker.bin, it reports precision, recall and F-measure. The evaluation can also be run as a 10-fold cross validation of the Chunker. In that case it is not necessary to pass a model; the tool automatically splits the data into training and evaluation parts.
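Under the assumption of the standard OpenNLP CLI, the two evaluation modes could be invoked as follows. The tool names follow the 1.5-era distribution, and the output lines only indicate the report format; the actual figures depend on your model and data:

```shell
# Evaluate a trained model against a held-out dataset.
$ opennlp ChunkerEvaluator -model en-chunker.bin -data en-chunker.eval -encoding UTF-8
# The tool reports scores in the form:
#   Precision: 0.9...
#   Recall: 0.9...
#   F-Measure: 0.9...

# 10-fold cross validation; no model is passed, the data is split automatically.
$ opennlp ChunkerCrossValidator -lang en -data en-chunker.eval -encoding UTF-8
```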