Corpora
OpenNLP has built-in support for converting various corpora into its native training format, or for using them directly with the different trainable components.
CONLL
CoNLL stands for the Conference on Computational Natural Language Learning. It is not a
single project but an annual conference series, each edition of which hosts a shared task.
More information about the entire conference series can be found on the CoNLL website.
CONLL 2000
The shared task of CoNLL-2000 is Chunking.
Getting the data
CoNLL-2000 made available training and test data for the Chunk task in English.
The data consists of the same partitions of the Wall Street Journal corpus (WSJ)
as the widely used data for noun phrase chunking: sections 15-18 as training data
(211727 tokens) and section 20 as test data (47377 tokens). The annotation of the
data has been derived from the WSJ corpus by a program written by Sabine Buchholz
from Tilburg University, The Netherlands. Both training and test data can be
obtained from http://www.cnts.ua.ac.be/conll2000/chunking.
Converting the data
The data does not need to be transformed, because the Apache OpenNLP Chunker supports
the CoNLL-2000 format for training directly. See the Chunker Training section to learn more.
Training
We can train the model for the Chunker using the train.txt available at CONLL 2000:
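A sketch of the training invocation, assuming the opennlp script is on the path; the model file name en-chunker.bin is an illustrative choice:

```shell
$ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data train.txt -encoding UTF-8
```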
Evaluating
We evaluate the model using the file test.txt available at CONLL 2000:
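The evaluation invocation might look like the following, assuming the model was trained as above and saved as en-chunker.bin:

```shell
$ opennlp ChunkerEvaluator -model en-chunker.bin -data test.txt -encoding UTF-8
```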
CONLL 2002
The shared task of CoNLL-2002 is language independent named entity recognition for Spanish and Dutch.
Getting the data
The data consists of three files per language: one training file and two test files, testa and testb.
The first test file will be used in the development phase for finding good parameters for the learning system.
The second test file will be used for the final evaluation. Currently there are data files available for two languages:
Spanish and Dutch.
The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are
from May 2000. The annotation was carried out by the TALP Research Center of the Technical University of Catalonia (UPC)
and the Center of Language and Computation (CLiC) of the University of Barcelona (UB), and funded by the European Commission
through the NAMIC project (IST-1999-12392).
The Dutch data consists of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1).
The data was annotated as a part of the Atranos project at the University of Antwerp.
You can find the Spanish files here:
http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html
Download esp.train.gz and unzip it to obtain the file esp.train.
You can find the Dutch files here:
http://www.cnts.ua.ac.be/conll2002/ner.tgz
Unpack the archive, then unzip the file /ner/data/ned.train.gz it contains to obtain the file ned.train.
Converting the data
The Spanish data is used as the reference here, but the same operations apply to Dutch: simply change "-lang es" to "-lang nl" and use
the corresponding training files. To convert the data to the OpenNLP format:
$ opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt
Optionally, you can convert the test samples as well.
$ opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt
$ opennlp TokenNameFinderConverter conll02 -data esp.testb -lang es -types per > corpus_testb.txt
Training with Spanish data
To train the model for the name finder:
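A sketch of the training invocation on the converted Spanish data; the model file name es_ner_person.bin is an illustrative choice:

```shell
$ opennlp TokenNameFinderTrainer -model es_ner_person.bin -lang es -data es_corpus_train_persons.txt -encoding UTF-8
```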
CONLL 2003
The shared task of CoNLL-2003 is language independent named entity recognition
for English and German.
Getting the data
The English data is the Reuters Corpus, which is a collection of news wire articles.
The Reuters Corpus can be obtained free of charge from NIST for research
purposes: http://trec.nist.gov/data/reuters/reuters.html
The German data is a collection of articles from the German newspaper Frankfurter
Rundschau. The articles are part of the ECI Multilingual Text Corpus which
can be obtained for $75 (as of 2010) from the Linguistic Data Consortium:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5
After one of the corpora is available, the data must be transformed to the CoNLL format as explained in the README file.
The transformed data can be read by the OpenNLP CONLL03 converter.
Converting the data (optional)
To convert the information to the OpenNLP format:
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.train > corpus_train.txt
Optionally, you can convert the test samples as well.
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testa > corpus_testa.txt
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testb > corpus_testb.txt
Training with English data
You can train the model for the name finder this way:
If you have converted the data, then you can train the model for the name finder this way:
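The two invocations above might look like the following; the model file name en_ner_person.bin is illustrative, and the format-qualified tool name TokenNameFinderTrainer.conll03 is assumed from OpenNLP's CLI conventions:

```shell
# Train directly on the CoNLL 03 formatted data:
$ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin -lang en -types per -data eng.train -encoding UTF-8

# Or train on the previously converted data:
$ opennlp TokenNameFinderTrainer -model en_ner_person.bin -lang en -data corpus_train.txt -encoding UTF-8
```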
Either way you should see the following output during the training process:
Evaluating with English data
You can evaluate the model for the name finder this way:
If you converted the test A and B files above, you can use them to evaluate the
model.
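The two evaluation invocations might look like the following; the model file name en_ner_person.bin is illustrative, and the format-qualified tool name TokenNameFinderEvaluator.conll03 is assumed from OpenNLP's CLI conventions:

```shell
# Evaluate directly on the CoNLL 03 formatted data:
$ opennlp TokenNameFinderEvaluator.conll03 -model en_ner_person.bin -lang en -types per -data eng.testa -encoding UTF-8

# Or evaluate using the converted test files:
$ opennlp TokenNameFinderEvaluator -model en_ner_person.bin -data corpus_testa.txt -encoding UTF-8
```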
Either way you should see the following output:
Arvores Deitadas
The Portuguese corpora available from the Floresta Sintá(c)tica project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from the AD format to its native format.
Getting the data
The Corpus can be downloaded from here: http://www.linguateca.pt/floresta/corpus.html
The Name Finder models were trained using the Amazonia corpus: amazonia.ad.
The Chunker models were trained using the Bosque_CF_8.0.ad.
Converting the data (optional)
To extract NameFinder training data from the Amazonia corpus:
$ opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt
To extract Chunker training data from the Bosque_CF_8.0.ad corpus:
$ opennlp ChunkerConverter ad -encoding ISO-8859-1 -data Bosque_CF_8.0.ad -lang pt > bosque-chunk
Training and Evaluation
To perform the evaluation the corpus was split into a training and a test part.
$ sed '1,55172d' corpus.txt > corpus_train.txt
$ sed '55172,100000000d' corpus.txt > corpus_test.txt
Leipzig Corpora
The Leipzig Corpora collection provides corpora in many languages. Each corpus is a collection of individual sentences collected
from the web and from newspapers. The corpora are available as plain text and as MySQL database tables. The OpenNLP integration can only
use the plain text version.
The corpora in the different languages can be used to train a document categorizer model which can detect the document language.
The individual plain text packages can be downloaded here:
http://corpora.uni-leipzig.de/download.html
After all packages have been downloaded, unzip them and use the following commands to
produce a training file which can be processed by the Document Categorizer:
$ opennlp DoccatConverter leipzig -lang cat -data Leipzig/cat100k/sentences.txt > lang.train
$ opennlp DoccatConverter leipzig -lang de -data Leipzig/de100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang dk -data Leipzig/dk100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang ee -data Leipzig/ee100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang en -data Leipzig/en100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang fi -data Leipzig/fi100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang fr -data Leipzig/fr100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang it -data Leipzig/it100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang jp -data Leipzig/jp100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang kr -data Leipzig/kr100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang nl -data Leipzig/nl100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang no -data Leipzig/no100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang se -data Leipzig/se100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang sorb -data Leipzig/sorb100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang tr -data Leipzig/tr100k/sentences.txt >> lang.train
Depending on your platform's default locale encoding, it might be problematic to output characters which are not supported by that encoding;
we suggest running these commands on a platform which has a Unicode default encoding, e.g. Linux with UTF-8.
After the lang.train file is created, the actual language detection document categorizer model
can be created with the following command.
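A sketch of the training invocation; the model file name lang.model is an illustrative choice:

```shell
$ opennlp DoccatTrainer -model lang.model -data lang.train -encoding UTF-8
```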
In the sample above the language detection model was trained to distinguish two languages, Danish and English.
After the model is created, it can be used to detect the two languages:
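The Doccat tool reads input from standard input, one document per line, and prints the predicted category. A sketch of its use with the model trained above (the example sentence is arbitrary):

```shell
$ echo "Dette er en dansk saetning." | opennlp Doccat lang.model
```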