This file describes how to prepare OpenNLP chunker training data from the PTB treebank data. ####################################### Step 0 - obtain version of WSJ PTB data ####################################### The version of Penn Treebank (PTB) that I ran contains ~2300 treebanked files from the Wall Street Journal (WSJ) and I believe that is it is version 2. The folder contains some of the following file names: wsj/mrg/00/wsj_0001.mrg wsj/mrg/12/wsj_1231.mrg wsj/mrg/24/wsj_2454.mrg The size of the folder is: 32.4MB (34,017,332 bytes) 2,312 Files, 27 Folders ############################################ Step 1 - run chunklink_2-2-2000_for_conll.pl ############################################ Run the chunklink script to convert PTB data to chunk data. Please see scripts/perl/README for information on obtaining the chunklink script. I can only get this script to work correctly using cygwin. Open a cygwin session and cd to data/treebank/ptb to run the script. I have not run this script under mac or linux - but I assume that it should work fine on these platforms. Use commands similar to the following: >perl scripts/perl/chunklink_2-2-2000_for_conll.pl -NHhftc ??/wsj_????.mrg > data/chunk/ptb/ptb.chunklink.chunks The system output after the script finished was: 2447 2448 2449 2450 2451 2452 2453 2454 1173766 words processed ############################################### Step 2 - convert chunklink data to OpenNLP data ############################################### The output of chunklink needs to be converted to OpenNLP format. This can be done using the script scripts/java/data/chunk/Chunklink2OpenNLP.java: java data.chunk.Chunklink2OpenNLP data/chunk/ptb/ptb.chunklink.chunks data/chunk/ptb/ptb.opennlp.chunks