This file describes how to prepare OpenNLP chunker training data from the GENIA treebank data. #################################### Step 0 - prepare Genia Treebank data #################################### Before you can execute the steps below, you must first prepare the Genia treebank data as described in data/treebank/genia/README You must also obtain the chunklink script as described in scripts/perl/README ############################################ Step 1 - run chunklink_2-2-2000_for_conll.pl ############################################ Run the chunklink script to convert Genia data to chunk data. Please see scripts/perl/README for information on obtaining the chunklink script. I can only get this script to work correctly using cygwin. So, open a cygwin session and cd to data/treebank/genia to run the script. I have not run this script under mac or linux - but I assume that it should work fine on these platforms. >cd data/treebank/genia >perl ../../../scripts/perl/chunklink_2-2-2000_for_conll.pl -NHhftc ??/wsj_????.mrg > ../../chunk/genia/genia.chunklink.chunks ############################################### Step 2 - convert chunklink data to OpenNLP data ############################################### The output of chunklink needs to be converted to OpenNLP format. This can be done using the script scripts/java/data/chunk/Chunklink2OpenNLP.java: java data.chunk.Chunklink2OpenNLP data/chunk/genia/genia.chunklink.chunks data/chunk/genia/genia.opennlp.chunks