This README describes how to generate training data for the part-of-speech (POS) tagger. Contents - Training data format - Sources of training data - Genia - Penn Treebank #################### Training data format #################### The format of a training data file should have one sentence per line, with each 'word' immediately followed by "_" and the word's part-of-speech tag, which is then followed by a space. Here is an example snippet from one line of training data: the_DT stories_NNS about_IN well-heeled_JJ communities_NNS and_CC - What if my text contains underscores? No problem. OpenNLP splits the word from the tag using the last underscore. However, there will be difficulties if your data uses an underscore as a part-of-speech tag. - What if I have a "token" that contains a space? This is a problem. OpenNLP will not be able to handle a token that contains a space in it. GENIA, for example, contains 108 occurrences of spaces inside tokens. The whitespace must be removed from these tokens or ignored (see below). ######################## Sources of training data ######################## There are a variety of sources of part-of-speech data that may be useful for training a part-of-speech tagger. We have used the following three sources for training a part-of-speech tagger for clinical data: - Mayo part-of-speech corpus - this is a corpus owned and maintained by the Mayo Clinic. Unfortunately, because of legal and privacy issues it is not currently available for distribution. However, a part-of-speech model based on this data will likely be available. - GENIA - see below - Penn Treebank - see below ##### Genia ##### We have provided a simple script to convert GENIA data to OpenNLP part-of-speech data. To create a training data file from the GENIA corpus, do the following: 1) obtain GENIAcorpus3.02.pos.xml from http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/corpus/GENIAcorpus3.02p.tgz 2) run scripts/java/data.pos.training.GeniaPosTrainingDataExtractor with the following command: java data.pos.training.GeniaPosTrainingDataExtractor GENIAcorpus3.02.pos.xml data/pos/training/genia-pos-training.txt For the few cases in Genia where tokens contain white space - these are simply ignored and not added to the training data file. ############# Penn Treebank ############# We do not have scripts that we can share for converting Penn Treebank version 2 into OpenNLP-formatted training data. However, there are many libraries that are available that can be used to parse the Penn Treebank. Two suggestions are: - OpenNLP library - see opennlp.tools.parser.ParserEventStream - Stanford parser - see http://nlp.stanford.edu/software/lex-parser.shtml Another strategy is to take the output of the chunker training data as detailed in chunker/data/chunk/ptb/README and convert it to the correct format.