This file describes how to generate treebank data that can be consumed by
the Chunklink Perl script in order to create chunk data for the OpenNLP
chunker.

#####################################
Step 1 - Download GENIA treebank data
#####################################

Download the GENIA treebank data from the GENIA website. The version is
called "Beta" and consists of 500 treebanked files distributed in two tar.gz
files, available at:

  http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/corpus/GTB.tar.gz
  http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/corpus/GTB-B2.tar.gz

After gunzipping, the resulting tar files have the following md5sums:

  80a9d8ad7e5a7f6f4d01b71427f2059b
  79a71617ca86ad8595a61dbf6c78c917

- gunzip and untar GTB.tar.gz to a directory named treebank/genia/GTB200/
    - there should be 401 objects in the directory when you are done
    - verify that the file treebank/genia/GTB200/91079577.tree exists
- gunzip and untar GTB-B2.tar.gz to a directory named treebank/genia/GTB300/
    - the files will be extracted into two subdirectories, 300 and 300tree
    - delete the directory '300'
    - move all files in 300tree to treebank/genia/GTB300
    - verify that the file treebank/genia/GTB300/93123257.tree exists

######################
Step 2 - Run Genia2PTB
######################

Run the class data.chunk.genia.Genia2PTB on the data:

  java Genia2PTB "data/treebank/genia/GTB200" data/treebank/genia/00 1 data/treebank/genia/names-00.txt
  java Genia2PTB "data/treebank/genia/GTB300" data/treebank/genia/01 201 data/treebank/genia/names-01.txt

There should now be 200 .mrg files in data/treebank/genia/00 and 300 .mrg
files in data/treebank/genia/01.

Genia2PTB does three things:

1) renames the .tree files to names that look like wsj_0001.mrg, puts them
   in the directory structure expected by Chunklink (e.g. 00), and writes a
   mapping from the new names to the original names (e.g. names-00.txt)
2) changes the way POS tags are formatted
3) adds an extra set of parentheses to each line of the data

#################################
Step 3 - Remove problem sentences
#################################

There are a number of problem sentences in the second set of 300 treebanked
abstracts that I removed because they caused the Chunklink script to fail.
Simply open the following files in the directory data/treebank/genia/01 and
remove the lines indicated (e.g. remove line 6 from wsj_0201.mrg):

  wsj_0201.mrg - 6
  wsj_0217.mrg - 3
  wsj_0222.mrg - 5
  wsj_0268.mrg - 7
  wsj_0301.mrg - 6
  wsj_0322.mrg - 2
  wsj_0350.mrg - 3, 5
  wsj_0383.mrg - 10
  wsj_0388.mrg - 4
  wsj_0429.mrg - 3, 4
  wsj_0442.mrg - 7
  wsj_0455.mrg - 6, 7

#####
Done!
#####

You should now be able to run the Chunklink script as described in
data/chunk/genia/README
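
The manual edits in Step 3 could also be scripted. The sketch below is not
part of the original procedure; it is a minimal Python illustration under two
assumptions: the line numbers listed in Step 3 are 1-indexed, and when a file
lists two numbers (e.g. 3, 5) both refer to the file as it was before any line
was removed. The names PROBLEM_LINES, remove_lines and clean_directory are
hypothetical helpers introduced only for this example.

```python
from pathlib import Path

# File name -> line numbers to remove, copied from the Step 3 list.
# Assumption: numbers are 1-indexed and refer to the unmodified file.
PROBLEM_LINES = {
    "wsj_0201.mrg": [6],
    "wsj_0217.mrg": [3],
    "wsj_0222.mrg": [5],
    "wsj_0268.mrg": [7],
    "wsj_0301.mrg": [6],
    "wsj_0322.mrg": [2],
    "wsj_0350.mrg": [3, 5],
    "wsj_0383.mrg": [10],
    "wsj_0388.mrg": [4],
    "wsj_0429.mrg": [3, 4],
    "wsj_0442.mrg": [7],
    "wsj_0455.mrg": [6, 7],
}


def remove_lines(lines, line_numbers):
    """Return a new list without the lines at the given 1-indexed positions."""
    drop = set(line_numbers)
    return [line for number, line in enumerate(lines, start=1)
            if number not in drop]


def clean_directory(directory, problem_lines=PROBLEM_LINES):
    """Rewrite each listed file in `directory` with its problem lines removed."""
    for name, numbers in problem_lines.items():
        path = Path(directory) / name
        # Work on raw bytes so that line endings and encoding are left untouched.
        lines = path.read_bytes().splitlines(keepends=True)
        path.write_bytes(b"".join(remove_lines(lines, numbers)))
```

Calling clean_directory("data/treebank/genia/01") once on fresh Step 2 output
would apply the whole list. Running it a second time would delete further
lines, so it should only be run against unmodified .mrg files.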