Corpora
OpenNLP has built-in support for converting various corpora into its native training format, or for using them directly with the different trainable components.
CONLL
CoNLL stands for the Conference on Computational Natural Language Learning. It is not a
single project but an annual conference series, each edition of which hosts a shared task.
More information about the entire conference series can be found on the CoNLL website.
CONLL 2000
The shared task of CoNLL-2000 is Chunking.
Getting the data
CoNLL-2000 made available training and test data for the Chunk task in English.
The data consists of the same partitions of the Wall Street Journal corpus (WSJ)
as the widely used data for noun phrase chunking: sections 15-18 as training data
(211727 tokens) and section 20 as test data (47377 tokens). The annotation of the
data has been derived from the WSJ corpus by a program written by Sabine Buchholz
from Tilburg University, The Netherlands. Both training and test data can be
obtained from http://www.cnts.ua.ac.be/conll2000/chunking.
Converting the data
The data does not need to be transformed, because the Apache OpenNLP Chunker supports
the CoNLL-2000 format for training directly. See the Chunker Training section to learn more.
Training
We can train the model for the Chunker using the train.txt available at CONLL 2000:
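A sketch of the training invocation, assuming the opennlp script is on the path; the model file name en-chunker.bin is an illustrative choice:

```shell
$ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data train.txt -encoding UTF-8
```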
Evaluating
We evaluate the model using the file test.txt available at CONLL 2000:
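The evaluation invocation might look like the following, assuming the model was trained as above and saved as en-chunker.bin:

```shell
$ opennlp ChunkerEvaluator -model en-chunker.bin -data test.txt -encoding UTF-8
```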
CONLL 2002
The shared task of CoNLL-2002 is language independent named entity recognition for Spanish and Dutch.
Getting the data
The data consists of three files per language: one training file and two test files, testa and testb.
The first test file will be used in the development phase for finding good parameters for the learning system.
The second test file will be used for the final evaluation. Currently there are data files available for two languages:
Spanish and Dutch.
The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are
from May 2000. The annotation was carried out by the TALP Research Center of the Technical University of Catalonia (UPC)
and the Center of Language and Computation (CLiC) of the University of Barcelona (UB), and funded by the European Commission
through the NAMIC project (IST-1999-12392).
The Dutch data consists of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1).
The data was annotated as a part of the Atranos project at the University of Antwerp.
You can find the Spanish files here:
http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html
Download esp.train.gz and unzip it to obtain the file esp.train.
You can find the Dutch files here:
http://www.cnts.ua.ac.be/conll2002/ner.tgz
Unpack the archive, then unzip the file /ner/data/ned.train.gz it contains to obtain the file ned.train.
Converting the data
The Spanish data is used as the reference here, but the same operations apply to Dutch: simply change "-lang es" to "-lang nl" and use
the corresponding training files. To convert the data to the OpenNLP format:
$ opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt
Optionally, you can convert the test samples as well.
$ opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt
$ opennlp TokenNameFinderConverter conll02 -data esp.testb -lang es -types per > corpus_testb.txt
Training with Spanish data
To train the model for the name finder:
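A sketch of the training invocation on the converted Spanish data; the model file name es_ner_person.bin is an illustrative choice:

```shell
$ opennlp TokenNameFinderTrainer -model es_ner_person.bin -lang es -data es_corpus_train_persons.txt -encoding UTF-8
```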
CONLL 2003
The shared task of CoNLL-2003 is language independent named entity recognition
for English and German.
Getting the data
The English data is the Reuters Corpus, which is a collection of news wire articles.
The Reuters Corpus can be obtained free of charge from NIST for research
purposes: http://trec.nist.gov/data/reuters/reuters.html
The German data is a collection of articles from the German newspaper Frankfurter
Rundschau. The articles are part of the ECI Multilingual Text Corpus which
can be obtained for $75 (as of 2010) from the Linguistic Data Consortium:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5
After one of the corpora is available, the data must be transformed to the CoNLL format as explained in the README file.
The transformed data can be read by the OpenNLP CONLL03 converter.
Converting the data (optional)
To convert the information to the OpenNLP format:
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.train > corpus_train.txt
Optionally, you can convert the test samples as well.
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testa > corpus_testa.txt
$ opennlp TokenNameFinderConverter conll03 -lang en -types per -data eng.testb > corpus_testb.txt
Training with English data
You can train the model for the name finder this way:
If you have converted the data, then you can train the model for the name finder this way:
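The two invocations above might look like the following; the model file name en_ner_person.bin is illustrative, and the format-qualified tool name TokenNameFinderTrainer.conll03 is assumed from OpenNLP's CLI conventions:

```shell
# Train directly on the CoNLL 03 formatted data:
$ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin -lang en -types per -data eng.train -encoding UTF-8

# Or train on the previously converted data:
$ opennlp TokenNameFinderTrainer -model en_ner_person.bin -lang en -data corpus_train.txt -encoding UTF-8
```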
Either way you should see the following output during the training process:
Evaluating with English data
You can evaluate the model for the name finder this way:
If you converted the test A and B files above, you can use them to evaluate the
model.
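The two evaluation invocations might look like the following; the model file name en_ner_person.bin is illustrative, and the format-qualified tool name TokenNameFinderEvaluator.conll03 is assumed from OpenNLP's CLI conventions:

```shell
# Evaluate directly on the CoNLL 03 formatted data:
$ opennlp TokenNameFinderEvaluator.conll03 -model en_ner_person.bin -lang en -types per -data eng.testa -encoding UTF-8

# Or evaluate using the converted test files:
$ opennlp TokenNameFinderEvaluator -model en_ner_person.bin -data corpus_testa.txt -encoding UTF-8
```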
Either way you should see the following output:
Arvores Deitadas
The Portuguese corpora available from the Floresta Sintá(c)tica project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from the AD format to its native format.
Getting the data
The Corpus can be downloaded from here: http://www.linguateca.pt/floresta/corpus.html
The Name Finder models were trained using the Amazonia corpus: amazonia.ad.
The Chunker models were trained using the Bosque_CF_8.0.ad.
Converting the data (optional)
To extract NameFinder training data from the Amazonia corpus:
$ opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt
To extract Chunker training data from the Bosque_CF_8.0.ad corpus:
$ opennlp ChunkerConverter ad -encoding ISO-8859-1 -data Bosque_CF_8.0.ad -lang pt > bosque-chunk
Training and Evaluation
To perform the evaluation the corpus was split into a training and a test part.
$ sed '1,55172d' corpus.txt > corpus_train.txt
$ sed '55172,100000000d' corpus.txt > corpus_test.txt
Leipzig Corpora
The Leipzig Corpora collection provides corpora in many languages. Each corpus is a collection of individual sentences collected
from the web and from newspapers. The corpora are available as plain text and as MySQL database tables. The OpenNLP integration can only
use the plain text version.
The corpora in the different languages can be used to train a document categorizer model which can detect the document language.
The individual plain text packages can be downloaded here:
http://corpora.uni-leipzig.de/download.html
After all packages have been downloaded, unzip them and use the following commands to
produce a training file which can be processed by the Document Categorizer:
$ opennlp DoccatConverter leipzig -lang cat -data Leipzig/cat100k/sentences.txt > lang.train
$ opennlp DoccatConverter leipzig -lang de -data Leipzig/de100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang dk -data Leipzig/dk100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang ee -data Leipzig/ee100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang en -data Leipzig/en100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang fi -data Leipzig/fi100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang fr -data Leipzig/fr100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang it -data Leipzig/it100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang jp -data Leipzig/jp100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang kr -data Leipzig/kr100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang nl -data Leipzig/nl100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang no -data Leipzig/no100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang se -data Leipzig/se100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang sorb -data Leipzig/sorb100k/sentences.txt >> lang.train
$ opennlp DoccatConverter leipzig -lang tr -data Leipzig/tr100k/sentences.txt >> lang.train
Depending on your platform's default locale encoding, it might be problematic to output characters which are not supported by that encoding;
we suggest running these commands on a platform which has a Unicode default encoding, e.g. Linux with UTF-8.
After the lang.train file is created, the actual language detection document categorizer model
can be created with the following command.
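A sketch of the training invocation; the model file name lang.model is an illustrative choice:

```shell
$ opennlp DoccatTrainer -model lang.model -data lang.train -encoding UTF-8
```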
In the sample above the language detection model was trained to distinguish two languages, Danish and English.
After the model is created, it can be used to detect the two languages:
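The Doccat tool reads input from standard input, one document per line, and prints the predicted category. A sketch of its use with the model trained above (the example sentence is arbitrary):

```shell
$ echo "Dette er en dansk saetning." | opennlp Doccat lang.model
```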