Name Finder
Named Entity Recognition
The Name Finder can detect named entities and numbers in text. To be able to
detect entities the Name Finder needs a model. The model is dependent on the
language and entity type it was trained for. The OpenNLP project offers a number
of pre-trained name finder models which are trained on various freely available corpora.
They can be downloaded at our model download page. To find names in raw text the text
must be segmented into tokens and sentences. A detailed description is given in the
sentence detector and tokenizer tutorial. It is important that the tokenization for
the training data and the input text is identical.
Name Finder Tool
The easiest way to try out the Name Finder is the command line tool.
The tool is only intended for demonstration and testing. Download the English
person model and start the Name Finder Tool with this command:
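The name finder now reads one tokenized sentence per line from stdin; an empty
line indicates a document boundary and resets the adaptive feature generators.
Just copy this text to the terminal:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC ,
was named a director of this British industrial conglomerate .
The name finder will now output the text with markup for person names:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC ,
was named a director of this British industrial conglomerate .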
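The input convention described above (one tokenized sentence per line, an empty line as document boundary) can be sketched in plain Java. This is only an illustration of the format, not code from OpenNLP; the class and method names are hypothetical, and it assumes tokens are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenizedInput {
    /** Groups lines into documents; an empty line closes the current document. */
    static List<List<String[]>> readDocuments(List<String> lines) {
        List<List<String[]>> documents = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // Empty line: document boundary.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new ArrayList<>();
                }
            } else {
                // The line is already tokenized, so splitting on whitespace is enough.
                current.add(line.trim().split("\\s+"));
            }
        }
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Pierre Vinken , 61 years old , will join the board .",
            "Mr . Vinken is chairman of Elsevier N.V. .",
            "",
            "Rudolph Agnew was named a director .");
        List<List<String[]>> docs = readDocuments(lines);
        System.out.println(docs.size() + " documents, first has "
            + docs.get(0).size() + " sentences");
    }
}
```

Each inner `String[]` is one sentence in the form the find method expects, and each outer list entry is one document.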
Name Finder API
To use the Name Finder in a production system it is strongly recommended to embed it
directly into the application instead of using the command line interface.
First the name finder model must be loaded into memory from disk or another source.
In the sample below it is loaded from disk.
There are a number of reasons why the model loading can fail:
Issues with the underlying I/O
The version of the model is not compatible with the OpenNLP version
The model is loaded into the wrong component, for example a tokenizer model is loaded with the TokenNameFinderModel class
The model content is not valid for some other reason
After the model is loaded the NameFinderME can be instantiated.
The initialization is now finished and the Name Finder can be used. The NameFinderME
class is not thread safe; it must only be called from one thread. To use multiple threads,
create multiple NameFinderME instances that share the same model instance.
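One way to arrange this is to keep a single shared model and give every thread its own finder. The sketch below uses hypothetical stand-in classes (`Model`, `Finder`) rather than the OpenNLP types, so it only demonstrates the sharing pattern, under the assumption that the model object is safe to share as the text states.

```java
class PerThreadFinder {
    /** Stand-in for an immutable, shareable model (hypothetical). */
    static final class Model {
        final String name;
        Model(String name) { this.name = name; }
    }

    /** Stand-in for a stateful finder that must stay on one thread (hypothetical). */
    static final class Finder {
        final Model model;
        Finder(Model model) { this.model = model; }
    }

    static final Model SHARED_MODEL = new Model("en-ner-person");

    // Each thread lazily creates its own Finder; all of them reference the same Model.
    static final ThreadLocal<Finder> FINDER =
        ThreadLocal.withInitial(() -> new Finder(SHARED_MODEL));

    /** Starts a new thread, lets it obtain its Finder, and returns that instance. */
    static Finder finderFromNewThread() {
        Finder[] holder = new Finder[1];
        Thread t = new Thread(() -> holder[0] = FINDER.get());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder[0];
    }

    public static void main(String[] args) {
        Finder a = finderFromNewThread();
        Finder b = finderFromNewThread();
        System.out.println("distinct finders: " + (a != b));
        System.out.println("shared model: " + (a.model == b.model));
    }
}
```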
The input text should be segmented into documents, sentences and tokens.
To perform entity detection an application calls the find method for every sentence in the
document. After every document clearAdaptiveData must be called to clear the adaptive data in
the feature generators. Not calling clearAdaptiveData can lead to a sharp drop in the detection
rate after a few documents.
The following code illustrates that:
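As a minimal sketch of this loop, the example below uses a local stand-in interface with the two method names from the text (find and clearAdaptiveData). The `Finder` interface, the `ToyFinder` class, and the `{begin, end}` array representation of a span are assumptions made to keep the example self-contained; the toy implementation simply marks capitalized tokens and is not how a statistical name finder works.

```java
import java.util.ArrayList;
import java.util.List;

class DocumentLoop {
    /** Stand-in mirroring the two methods named in the text (hypothetical type). */
    interface Finder {
        int[][] find(String[] tokens);   // each result is {begin, end}, end exclusive
        void clearAdaptiveData();
    }

    /** Toy finder: every capitalized token becomes a one-token span. */
    static final class ToyFinder implements Finder {
        int clears = 0;

        public int[][] find(String[] tokens) {
            List<int[]> spans = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].isEmpty() && Character.isUpperCase(tokens[i].charAt(0))) {
                    spans.add(new int[] {i, i + 1});
                }
            }
            return spans.toArray(new int[0][]);
        }

        public void clearAdaptiveData() {
            clears++;
        }
    }

    /** Calls find for every sentence and clears adaptive data after every document. */
    static int process(String[][][] documents, Finder finder) {
        int total = 0;
        for (String[][] document : documents) {
            for (String[] sentence : document) {
                total += finder.find(sentence).length;
            }
            // Reset per-document state before the next document starts.
            finder.clearAdaptiveData();
        }
        return total;
    }

    public static void main(String[] args) {
        String[][][] documents = {
            {{"Pierre", "Vinken", "is", "61", "years", "old", "."},
             {"Mr", ".", "Vinken", "is", "chairman", "."}},
            {{"Rudolph", "Agnew", "was", "named", "a", "director", "."}}
        };
        ToyFinder finder = new ToyFinder();
        System.out.println("spans found: " + process(documents, finder));
        System.out.println("clearAdaptiveData calls: " + finder.clears);
    }
}
```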
The following snippet shows a call to find:
The nameSpans array now contains exactly one Span, which marks the name Pierre Vinken.
The elements between the begin and end offsets are the name tokens. In this case the begin
offset is 0 and the end offset is 2. The Span object also knows the type of the entity.
In this case it is person (defined by the model). It can be retrieved with a call to Span.getType().
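To make the offsets concrete, the following sketch recovers the covered text from a token array, treating the end offset as exclusive (which is what begin 0 and end 2 for a two-token name imply). The token array is an illustrative sentence and the `covered` helper is hypothetical, not part of the OpenNLP API.

```java
import java.util.Arrays;

class SpanText {
    /** Joins tokens[begin..end) with single spaces; end is exclusive. */
    static String covered(String[] tokens, int begin, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, begin, end));
    }

    public static void main(String[] args) {
        String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};
        // Span with begin offset 0 and end offset 2.
        System.out.println(covered(sentence, 0, 2)); // prints "Pierre Vinken"
    }
}
```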
In addition to the statistical Name Finder, OpenNLP also offers a dictionary and a regular
expression name finder implementation.
TODO: Explain how to retrieve probs from the name finder for names and for non recognized names
Name Finder Training
The pre-trained models might not be available for the desired language, might not detect
important entities, or might not perform well enough outside the news domain.
These are the typical reasons to train the name finder on a new corpus
or on a corpus extended with private training data taken from the data to be analyzed.
Training Tool
OpenNLP has a command line tool which is used to train the models available from the model
download page on various corpora.
The data can be converted to the OpenNLP name finder training format, which is one
sentence per line. Some other formats are available as well.
The sentence must be tokenized and contain spans which mark the entities. Documents are separated by
empty lines which trigger the reset of the adaptive feature generators. A training file can contain
multiple types. If the training file contains multiple types the created model will also be able to
detect these multiple types.
Sample sentence of the data:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
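A small parser can illustrate how one line of this format maps to tokens plus typed spans. The sketch assumes entities are marked with whitespace-separated `<START:type>` and `<END>` tags; the `TrainingLine` class is hypothetical and is not the parser OpenNLP itself uses.

```java
import java.util.ArrayList;
import java.util.List;

class TrainingLine {
    /** One marked entity: type plus token offsets, end exclusive. */
    static final class Name {
        final String type;
        final int begin;
        final int end;
        Name(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    final List<String> tokens = new ArrayList<>();
    final List<Name> names = new ArrayList<>();

    /** Splits a training line into plain tokens and the spans marked by the tags. */
    static TrainingLine parse(String line) {
        TrainingLine result = new TrainingLine();
        String type = null;
        int begin = -1;
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (part.startsWith("<START:") && part.endsWith(">")) {
                type = part.substring("<START:".length(), part.length() - 1);
                begin = result.tokens.size();
            } else if (part.equals("<END>")) {
                if (begin < 0) {
                    throw new IllegalArgumentException("<END> without <START>");
                }
                result.names.add(new Name(type, begin, result.tokens.size()));
                begin = -1;
            } else {
                result.tokens.add(part);
            }
        }
        if (begin >= 0) {
            throw new IllegalArgumentException("span is not closed");
        }
        return result;
    }

    public static void main(String[] args) {
        TrainingLine parsed = parse(
            "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. .");
        for (Name name : parsed.names) {
            System.out.println(name.type + " [" + name.begin + ", " + name.end + ") "
                + String.join(" ", parsed.tokens.subList(name.begin, name.end)));
        }
    }
}
```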
The training data should contain at least 15000 sentences to create a model which performs well.
Usage of the tool:
It is now assumed that the English person name finder model should be trained from a file
called en-ner-person.train which is encoded as UTF-8. The following command will train
the name finder and write the model to en-ner-person.bin:
The example above will train models with a pre-defined feature set. It is also possible to use the -resources parameter to generate features based on external knowledge, such as word representation (clustering) features. The external resources must all be placed in a resource directory which is then passed as a parameter. If this option is used, it is required to pass, via the -featuregen parameter, an XML custom feature generator which includes some of the clustering features shipped with the TokenNameFinder. Currently three formats of clustering lexicons are accepted:
Space separated two column file specifying the token and the cluster class, as generated by toolkits such as word2vec.
Space separated three column file specifying the token, clustering class and weight, such as Clark's clusters.
Tab separated three column Brown clusters, as generated by Liang's toolkit.
Additionally it is possible to specify the number of iterations and the cutoff, and to overwrite all types in the training data with a single type. Finally, the -sequenceCodec parameter allows specifying a BIO (Begin, Inside, Out) or BILOU (Begin, Inside, Last, Out, Unit) encoding to represent the named entities. An example of one such command would be as follows:
Training API
To train the name finder from within an application it is recommended to use the training
API instead of the command line tool.
Basically three steps are necessary to train it:
The application must open a sample data stream
Call the NameFinderME.train method
Save the TokenNameFinderModel to a file or database
The three steps are illustrated by the following sample code:
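// Step 1: open the training data as a stream of NameSample objects.
Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream =
    new PlainTextByLineStream(new FileInputStream("en-ner-person.train"), charset);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

TokenNameFinderModel model;

// Step 2: train the model; the factory with default settings is used here.
try {
  model = NameFinderME.train("en", "person", sampleStream, TrainingParameters.defaultParams(),
      new TokenNameFinderFactory());
}
finally {
  sampleStream.close();
}

// Step 3: save the model; modelFile is the destination file, defined elsewhere.
OutputStream modelOut = null;
try {
  modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
  model.serialize(modelOut);
} finally {
  if (modelOut != null)
    modelOut.close();
}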
Custom Feature Generation
OpenNLP defines a default feature generation which is used when no custom feature
generation is specified. Users who want to experiment with the feature generation
can provide a custom feature generator, either via the API or via an XML descriptor file.
Feature Generation defined by API
The custom generator must be used for training
and for detecting the names. If the feature generation during training time and detection
time is different the name finder might not be able to detect names.
The following lines show how to construct a custom feature generator
which is similar to the default feature generator but with a BrownTokenFeature added.
The javadoc of the feature generator classes explains what the individual feature generators do.
To write a custom feature generator, implement the AdaptiveFeatureGenerator interface or,
if it does not need to be adaptive, extend the FeatureGeneratorAdapter.
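As a rough sketch of what a non-adaptive generator could look like, the example below declares a local stand-in interface modeled on AdaptiveFeatureGenerator; the three method signatures are an assumption based on the library's javadoc and should be checked against the OpenNLP version in use. The `SuffixGenerator` class and its `suf=` feature prefix are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SuffixFeatures {
    /** Local stand-in for the interface named in the text; signatures are assumed. */
    interface AdaptiveFeatureGenerator {
        void createFeatures(List<String> features, String[] tokens, int index,
                            String[] previousOutcomes);
        void updateAdaptiveData(String[] tokens, String[] outcomes);
        void clearAdaptiveData();
    }

    /** Emits one feature holding the lowercased last three characters of the token. */
    static final class SuffixGenerator implements AdaptiveFeatureGenerator {
        public void createFeatures(List<String> features, String[] tokens, int index,
                                   String[] previousOutcomes) {
            String token = tokens[index];
            String suffix = token.substring(Math.max(0, token.length() - 3));
            features.add("suf=" + suffix.toLowerCase());
        }

        // Not adaptive: nothing is learned from earlier sentences, so both are no-ops.
        public void updateAdaptiveData(String[] tokens, String[] outcomes) { }

        public void clearAdaptiveData() { }
    }

    /** Convenience helper: collects the features generated for one token. */
    static List<String> featuresFor(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        new SuffixGenerator().createFeatures(features, tokens, index, new String[0]);
        return features;
    }

    public static void main(String[] args) {
        String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};
        System.out.println(featuresFor(sentence, 1)); // prints "[suf=ken]"
    }
}
```

Because the generator keeps no state between calls, the same feature is produced for a token at training time and at detection time, which is the consistency the text asks for.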
The train method which should be used is defined as
public static TokenNameFinderModel train(String languageCode, String type,
    ObjectStream<NameSample> samples, TrainingParameters trainParams,
    TokenNameFinderFactory factory) throws IOException
where the TokenNameFinderFactory allows a custom feature generator to be specified.
To detect names the model which was returned from the train method must be passed to the NameFinderME constructor.
Feature Generation defined by XML Descriptor
OpenNLP can also use an XML descriptor file to configure the feature generation. The
descriptor file is stored inside the model after training and the feature generators are
configured correctly when the name finder is instantiated.
The following sample shows an XML descriptor which contains the default feature generator plus several types of clustering features:
The root element must be generators; each sub-element adds a feature generator to the configuration.
The sample XML contains additional feature generators with respect to the API defined above.
The following table shows the supported elements:
Generator elements
Element                 Aggregated  Attributes
generators              yes         none
cache                   yes         none
charngram               no          min and max specify the length of the generated character ngrams
definition              no          none
dictionary              no          dict is the key of the dictionary resource to use, and prefix is a feature prefix string
prevmap                 no          none
sentence                no          begin and end to generate begin or end features, both are optional and are boolean values
tokenclass              no          none
token                   no          none
bigram                  no          none
tokenpattern            no          none
wordcluster             no          dict is the key of the clustering resource to use
brownclustertoken       no          dict is the key of the clustering resource to use
brownclustertokenclass  no          dict is the key of the clustering resource to use
brownclusterbigram      no          dict is the key of the clustering resource to use
window                  yes         prevLength and nextLength must be integers and specify the window size
custom                  no          class is the name of the feature generator class which will be loaded
Aggregated feature generators can contain other generators, like the cache or the window feature
generator in the sample.
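To show the nesting of aggregated generators, here is a hedged sketch of a descriptor built only from elements and attributes listed in the table. The particular combination and the window sizes are illustrative assumptions, not the shipped default. The Java wrapper merely checks with the JDK XML parser that the descriptor is well formed and has generators as its root; it does not load it into OpenNLP.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

class DescriptorSketch {
    // Illustrative descriptor: cache and window aggregate other generators.
    static final String DESCRIPTOR =
          "<generators>\n"
        + "  <cache>\n"
        + "    <window prevLength=\"2\" nextLength=\"2\">\n"
        + "      <tokenclass/>\n"
        + "    </window>\n"
        + "    <window prevLength=\"2\" nextLength=\"2\">\n"
        + "      <token/>\n"
        + "    </window>\n"
        + "    <definition/>\n"
        + "    <prevmap/>\n"
        + "    <bigram/>\n"
        + "    <sentence begin=\"true\" end=\"false\"/>\n"
        + "  </cache>\n"
        + "</generators>\n";

    /** Parses the XML and returns its root element; fails if it is not well formed. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("descriptor is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(DESCRIPTOR);
        System.out.println("root element: " + root.getTagName());
        System.out.println("window generators: "
            + root.getElementsByTagName("window").getLength());
    }
}
```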
Evaluation
The built-in evaluation can measure the named entity recognition performance of the name finder.
The performance is either measured on a test dataset or via cross validation.
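For orientation, span-level performance is commonly reported as precision, recall and F-measure. The sketch below computes these under the assumption of exact-match scoring, where a predicted span counts as correct only if sentence, offsets and type all agree with a reference span. The `SpanFMeasure` class and the string keys used to identify spans are hypothetical and independent of OpenNLP's own evaluator classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SpanFMeasure {
    /** Returns {precision, recall, f1} for exact-match comparison of span keys. */
    static double[] score(Set<String> reference, Set<String> predicted) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(reference);
        double precision = predicted.isEmpty()
            ? 0.0 : (double) correct.size() / predicted.size();
        double recall = reference.isEmpty()
            ? 0.0 : (double) correct.size() / reference.size();
        double f1 = (precision + recall == 0.0)
            ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        // Keys encode sentence index, token offsets and entity type.
        Set<String> reference = new HashSet<>(Arrays.asList(
            "0:0-2:person", "1:2-3:person", "2:0-2:person"));
        Set<String> predicted = new HashSet<>(Arrays.asList(
            "0:0-2:person", "1:2-3:person", "2:5-7:person", "2:9-10:person"));
        double[] s = score(reference, predicted);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```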
Evaluation Tool
The following command shows how the tool can be run:
Note: The command line interface does not support cross validation in the current version.
Evaluation API
The evaluation can be performed on a pre-trained model and a test dataset or via cross validation.
In the first case the model must be loaded and a NameSample ObjectStream must be created (see code samples above).
Assuming these two objects exist, the following code shows how to perform the evaluation:
In the cross validation case all the training arguments must be
provided (see the Training API section above).
To perform cross validation the ObjectStream must be resettable.
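// Open the training data (file name assumed from the training example above);
// a stream backed by a FileChannel can be reset.
FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train");
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
    new PlainTextByLineStream(sampleDataIn.getChannel(), "UTF-8"));
TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator("en", 100, 5);
// Run the cross validation with 10 folds.
evaluator.evaluate(sampleStream, 10);
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());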
Named Entity Annotation Guidelines
Annotation guidelines define what should be labeled as an entity. To build
a private corpus it is important to know these guidelines and, if needed, to write
custom ones.
Here is a list of publicly available annotation guidelines:
MUC6
MUC7
ACE
CoNLL 2002
CoNLL 2003