// Define some global attributes
include::_globattr.adoc[]

[[cd_chunker]]
Chunker
~~~~~~~

////////////////////////////////////////////
- data/chunk/genia/README - how to prepare Genia chunk training data
- data/chunk/ptb/README - how to prepare PTB chunk training data
- data/treebank/genia/README - how to prepare Genia Treebank data for ChunkLink
- resources/models/README - how to build a chunker model
- scripts/perl/README - information on obtaining the chunklink script
- test/data/README - a description of the files used to unit test the Chunker annotator
////////////////////////////////////////////

In {osp-short}, when we refer to a ``chunker'' we usually mean a shallow parser -- i.e. a component that tags noun phrases, verb phrases, etc. This project supports three tasks:

- building a model from training data;
- tagging text, using a trained model;
- adjusting the end offsets of certain chunks so that they envelop other chunks, for certain patterns of chunks.

This project provides a UIMA wrapper around the popular OpenNLP chunker. The UIMA examples project provides default wrappers for several of the components in OpenNLP, but not for the chunker. We have borrowed liberally from the UIMA examples project. Our wrapper works with our type system, and we have added features and supporting components. A chunker model footnoteref:[model_disclaimer] is included with this project.

[[build_chunker_model]]
Building a model
^^^^^^^^^^^^^^^^

[[prepare_genia_chunk]]
Prepare GENIA training data
+++++++++++++++++++++++++++

You need to download a copy of GENIA's Treebank corpus from http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/GTB.html. The version we used is called ``beta''. This version is distributed as a set of two files, one dated Sept. 22, 2004, with 200 ``abstracts'', and the other dated July 11, 2005, with 300 ``abstracts''. Please download both. After extraction, place all the `.tree` files from the two downloads into one directory, which we'll refer to as +<genia-trees-dir>+.

Please also download ``chunklink'' from http://ilk.uvt.nl/team/sabine/homepage/software.html. The version we used is `chunklink_2-2-2000_for_conll.pl`. This tool, from the link:{ilk-home}[Induction of Linguistic Knowledge] (ILK) group of Tilburg University, The Netherlands, converts Penn Treebank II files into a one-word-per-line format.

Next, we'll use data.chunk.genia.Genia2PTB footnote:[This Java class a) renames the `.tree` files to files that look like `wsj_0001.mrg`, puts them in the directory structure expected by chunklink, and creates a mapping of the new names to the original names; b) reformats the POS tags; c) adds an extra set of parentheses to each line of the data.] to convert the GENIA Treebank corpus to Penn Treebank II format, then use chunklink to convert the trees to chunk data, and finally use data.chunk.Chunklink2OpenNLP to convert the chunk data to OpenNLP format.

. Run data.chunk.genia.Genia2PTB:
+
--
______________________________________________________________________________
+*java -cp <classpath> data.chunk.genia.Genia2PTB \
  <genia-trees-dir> \ <1>
  <ptb-trees-dir> \ <2>
  1 \
  <file-name-mapping>*+ <3>
______________________________________________________________________________
<1> the directory which holds the GENIA corpus files;
<2> the directory where the converted PTB trees will be written;
<3> a file that will be created by Genia2PTB to save the file name mappings.

.Problematic sentences
[IMPORTANT]
========================================================
There are a number of problematic sentences in the second set of 300 treebanked abstracts (in +<ptb-trees-dir>+ after processing by data.chunk.genia.Genia2PTB) that caused the chunklink script to fail. We removed them when building our model. The original GENIA file names are listed below for your reference. You need to remove these lines from the output of Genia2PTB before running chunklink; a sketch of one way to do this follows this list of steps. To find the converted file names, please look at +<file-name-mapping>+. Line numbers are separated by commas.

- `93123257.tree` - 6
- `93172387.tree` - 3
- `93186809.tree` - 5
- `93280865.tree` - 7
- `94085904.tree` - 6
- `94193110.tree` - 2
- `96247631.tree` - 3, 5
- `96353916.tree` - 10
- `96357043.tree` - 4
- `97031819.tree` - 3, 4
- `97054651.tree` - 7
- `97074532.tree` - 6, 7
========================================================
--
+
. Run chunklink:
+
--
___________________________________________________________________________
+*perl chunklink_2-2-2000_for_conll.pl -NHhftc <ptb-trees-dir>/wsj_????.mrg > \
  <chunklink-output>*+ <1>
___________________________________________________________________________
<1> the redirected standard output from chunklink.

NOTE: The chunklink script doesn't seem to work on Windows, but we did manage to run it in a Cygwin session.
--
+
. Run data.chunk.Chunklink2OpenNLP:
+
--
_______________________________________________________________
+*java -cp <classpath> data.chunk.Chunklink2OpenNLP \
  <chunklink-output> \ <1>
  <training-data-file>*+ <2>
_______________________________________________________________
<1> the output of chunklink from the previous step.
<2> the resulting training data file.
--
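As noted in the ``Problematic sentences'' box above, the listed lines have to be deleted from the Genia2PTB output before chunklink is run. How you delete them is up to you; the following is a minimal sketch of one way to do it with GNU sed (available, for example, in the same Cygwin session used to run chunklink). The converted file name +wsj_0213.mrg+ is purely hypothetical -- look up the real name of each listed `.tree` file in +<file-name-mapping>+ and use the line numbers given above.

______________________________________________________________________________
+*# hypothetical: suppose <file-name-mapping> says 96247631.tree became
# wsj_0213.mrg; delete lines 3 and 5 from it in a single pass
sed -i '3d;5d' <ptb-trees-dir>/wsj_0213.mrg*+
______________________________________________________________________________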
[[prepare_ptb_training_data]]
Prepare Penn Treebank training data
+++++++++++++++++++++++++++++++++++

Please refer to <> on <>.

//////////////////////////
The version of Penn Treebank (PTB) that I ran contains ~2300 treebanked files from the Wall Street Journal (WSJ) and I believe it is version 2. The folder contains some of the following file names:
wsj/mrg/00/wsj_0001.mrg
wsj/mrg/12/wsj_1231.mrg
wsj/mrg/24/wsj_2454.mrg
The size of the folder is:
32.4MB (34,017,332 bytes)
2,312 Files, 27 Folders
//////////////////////////

Preparing Penn Treebank data is similar to preparing GENIA data, as described in <<prepare_genia_chunk>>, except that the first step is not necessary.

. Run chunklink:
+
--
____________________________________________________________________________
+*perl chunklink_2-2-2000_for_conll.pl -NHhftc <ptb-dir>/wsj_????.mrg > \ <1>
  <chunklink-output>*+ <2>
____________________________________________________________________________
<1> +<ptb-dir>+ is your Penn Treebank corpus directory.
<2> the redirected standard output.
--
+
. Run Chunklink2OpenNLP:
+
--
____________________________________________________
+*java -cp <classpath> data.chunk.Chunklink2OpenNLP \
  <chunklink-output> \ <1>
  <training-data-file>*+ <2>
____________________________________________________
<1> the output of chunklink from the previous step.
<2> the resulting training data file.
--

////////////////////
The system output after the script finished was:
2447 2448 2449 2450 2451 2452 2453 2454
1173766 words processed
////////////////////

Build a model from your training data
+++++++++++++++++++++++++++++++++++++

Building a chunker model is much easier than preparing the training data. After you have obtained training data, run the OpenNLP tool:

__________________________________________________________
+*java -cp <classpath> opennlp.tools.chunker.ChunkerME \
  <training-data-file> \ <1>
  <model-file> \ <2>
  iterations \ <3>
  cutoff*+ <4>
__________________________________________________________
<1> an OpenNLP training data file.
<2> the file name of the resulting model. The name should end with either +.txt+ (for a plain text model) or +.bin.gz+ (for a compressed binary model).
<3> determines how many training iterations will be performed. The default is 100.
<4> determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.

The iterations and cutoff arguments are, taken together, optional -- i.e. you should provide both or neither.
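For example, to train a model on the GENIA training data for 150 iterations with a cutoff of 3, the invocation might look like the following (the file names and the two numbers are illustrative, not recommendations):

__________________________________________________________
+*java -cp <classpath> opennlp.tools.chunker.ChunkerME \
  genia.chunk genia_chunker.bin.gz 150 3*+
__________________________________________________________

Omitting the last two arguments trains with the defaults (100 iterations, cutoff of 5).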
Analysis engines (annotators)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Chunker
+++++++

- [[chunker_xml]] *Chunker.xml*
+
The file +desc/Chunker.xml+ provides a descriptor for the Chunker analysis engine, the UIMA component we have written that wraps the OpenNLP chunker. It calls ``edu.mayo.bmi.uima.chunker.Chunker'', whose Javadoc provides information on how to customize this descriptor.
+
*Parameters*::
ModelFile;; the file that contains the chunker tagging model
ChunkerCreatorClass;; the full class name of an implementation of the interface edu.mayo.bmi.uima.chunker.ChunkerCreator
+
- *ChunkerAggregate.xml*
+
The file +desc/ChunkerAggregate.xml+ provides a descriptor that defines a pipeline for shallow parsing, so that all the necessary inputs (e.g. tokens, sentences, and POS tags) have been added to the CAS before the chunker runs. It inherits two parameters from +<<chunker_xml,Chunker.xml>>+ and three from +<>+.
+
- *ChunkerCPE.xml*
+
--
The file +desc/ChunkerCPE.xml+ provides an XML specification of a collection processing engine (CPE). To run it:

. Start the UIMA CPE GUI:
+
___________________________________________________________
+*java -cp {osp-cp} org.apache.uima.tools.cpm.CpmFrame*+
___________________________________________________________

. Open this file.
. Set the parameters of the collection reader to point to a local collection of files that you want shallow parsed.
. Set the parameters of the Chunker as appropriate for your environment.
. Set the output directory of the XCAS Writer CAS Consumer.

The results of running the pipeline are written to the output directory as XCAS files. These files can be viewed in the CAS Visual Debugger.
--

Chunk adjuster
++++++++++++++

- *ChunkAdjuster.xml*
+
An example descriptor for the ChunkAdjuster.
+
- *AdjustNounPhraseToIncludeFollowingNP.xml*
+
A descriptor for the ChunkAdjuster, with parameters set so that consecutive noun phrase (NP) chunks are pseudo-merged -- the end offset of the first NP is changed to match the end offset of the last NP in a consecutive run of NPs.
+
- *AdjustNounPhraseToIncludeFollowingPPNP.xml*
+
A descriptor for the ChunkAdjuster, with parameters set so that a sequence of NP PP NP chunks is pseudo-merged -- the end offset of the first NP is changed to match the end offset of the last NP in the NP PP NP sequence. The adjustment is applied repeatedly, so for a pattern of NP PP NP PP NP, the end offset of the first NP is set to match the end offset of the last NP. A worked example follows this list.
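To make the pseudo-merge concrete, here is a hypothetical example (the sentence fragment is ours, not taken from the project's data). Suppose the chunker produced:

______________________________________________________________________________
[NP the level] [PP of] [NP IL-2 gene expression]
______________________________________________________________________________

AdjustNounPhraseToIncludeFollowingPPNP changes the end offset of the first NP so that it spans ``the level of IL-2 gene expression''. The PP and the second NP annotations are left in place -- only the first NP's end offset moves, which is why this is a pseudo-merge rather than a true merge.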