Contents - Listing of README's in this project - Introduction - Building a model - Tagging text - Running the Chunker analysis engine - Chunker.xml - ChunkerAggregate.xml - ChunkerCPE.xml - Running the Chunk Adjuster analysis engine - ChunkAdjuster.xml - AdjustNounPhraseToIncludeFollowingNP.xml - AdjustNounPhraseToIncludeFollowingPPNP.xml ################################### Listing of README's in this project ################################### - data/chunk/genia/README - how to prepare Genia chunk training data - data/chunk/ptb/README - how to prepare PTB chunk training data - data/treebank/genia/README - how to prepare Genia Treebank data for ChunkLink - resources/models/README - how to build a chunker model - scripts/perl/README - information on obtaining the chunklink script - target/test-classes/data/README - a description of the files used to unit test the Chunker annotator ############ Introduction ############ Throughout this document when we refer to a "chunker" we often mean a shallow parser - i.e. a component that tags noun phrases, verb phrases, etc. This project supports three tasks: 1) Building a model from training data 2) Tagging text, using a trained model 3) Adjusting the end offset of certain chunks so they envelop other chunks, for certain patterns of chunks. This project provides a UIMA wrapper around the popular OpenNLP chunker. The UIMA examples project provides default wrappers for several of the components in OpenNLP, but not for the chunker. We have borrowed from the UIMA examples project liberally. Our wrapper works with our type system. Additionally, we added features and supporting components. We also documented how to generate training data and how to build a chunker model. A chunker model is included with this project. The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal) and anonymized clinical data per Safe Harbor HIPAA guidelines. Prior to model building, the clinical data was deidentified for patient names to preserve patient confidentiality. Any person name in the model will originate from non-patient data sources. ################ Building a model ################ If you wish to build your own mode you will need to follow these steps: 1) obtain training data - see data/chunk/genia/README and data/chunk/ptb/README 2) build a model from your training data - see resources/models/README ################################################## Tagging text - Running the Chunker analysis engine ################################################## %%%%%%%%%%%%% Chunker.xml The file desc/Chunker.xml provides a descriptor for the Chunker analysis engine which is the UIMA component we have written that wraps the OpenNLP chunker / shallow parser. Open this file using the Component Descriptor Editor as described in the tutorial. Click on the tab labeled "Overview" to observe that the class called by this descriptor is "org.apache.ctakes.chunker.ae.Chunker". Click on the tab labeled "Parameter Settings" to view the parameters required by the POSTagger component. The descriptor file does not document the parameters because they are documented in the api javadocs for the class org.apache.ctakes.chunker.ae.Chunker. Please consult that documentation for additional details. The parameters are: - ModelFile - the file that contains the chunker tagging model - ChunkerCreatorClass - the full class name of an implementation of the interface org.apache.ctakes.chunker.ae.ChunkerCreator %%%%%%%%%%%%%%%%%%%%%% ChunkerAggregate.xml The file desc/ChunkerAggregate.xml provides a descriptor that defines a pipeline for shallow parsing so that all the necessary inputs (e.g. tokens, sentences, and pos tags) have been added to the CAS. Open this file using the Component Descriptor Editor as described in the tutorial. Click on the tab labeled "Overview" to observe that the engine type is "Aggregate". Click on the tab labeled "Aggregate" to see the components that need to be run before the Chunker can run. Click on the tab labeled "Parameter Settings" to see that the same two parameters need to be set from the Chunker.xml file in addition to the three parameters required by the POSTagger (see "POS tagger/README"). If you assign these parameters acceptable values, you can run open and run desc/ChunkerAggregate.xml using the CAS Visual Debugger as described in the tutorial. %%%%%%%%%%%%%%%% ChunkerCPE.xml The file desc/ChunkerCPE.xml provides an xml-specification of a component processing engine (CPE) which can be opened, edited, and run using the UIMA CPE GUI as described in the tutorial. Open this file using the UIMA CPE GUI and set the parameters for the collection reader to point to a local collection of files that you want shallow parsed. Set the parameters for the Chunker as appropriate for your environment and, finally, set the output directory of the XCAS Writer CAS Consumer. The results of running the pipeline are written to the output directory as XCAS files. These files can be viewed in the CAS Visual Debugger as described in the tutorial. ################################################## Running the Chunk Adjuster analysis engine ################################################## %%%%%%%%%%%%%%%%% ChunkAdjuster.xml Example of descriptor for the ChunkAdjuster. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% AdjustNounPhraseToIncludeFollowingNP.xml Descriptor for the ChunkAdjuster, with parameters set so that consecutive noun phrase (NP) chunks are pseudo-merged -- the end-offset of the first NP is changed to match the end-offset of the last NP in a consecutive list of NPs. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% AdjustNounPhraseToIncludeFollowingPPNP.xml Descriptor for the ChunkAdjuster, with parameters set so that a sequence of NP PP NP chunks are pseudo-merged -- the end-offset of the first NP is changed to match the end-offset of the last NP in NP PP NP. This adjustment is applied repeatedly, so for a pattern of NP PP NP PP NP, the end offset for the first NP is set to match the end offset of the last NP.