Document Categorizer
Classifying
The OpenNLP Document Categorizer can classify text into pre-defined categories.
It is based on the maximum entropy framework. For someone interested in gross margin, for example,
a text describing a fall in gross margin could be classified as GMDecrease,
and a text describing a rise in gross margin could be classified as GMIncrease.
To be able to classify a text, the document categorizer needs a model.
The categories depend on the requirements of each application,
so the OpenNLP project does not provide a pre-built model for the document categorizer.
Document Categorizer Tool
The easiest way to try out the document categorizer is the command line tool. The tool is only
intended for demonstration and testing. The following command shows how to use the document categorizer tool,
where model is the path to a trained document categorizer model:
$ opennlp Doccat model
The input is read from standard input and the output is written to standard output, unless they are redirected
or piped. As with most components in OpenNLP, the document categorizer expects input that is segmented into sentences.
Document Categorizer API
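To perform classification you will need a maxent model -
these are encapsulated in the DoccatModel class of OpenNLP Tools.
First you need to grab the bytes from the serialized model on an InputStream -
we'll leave it to you to do that, since you were the one who serialized it to begin with. Now for the easy part:
pass that InputStream to the DoccatModel constructor.
With the DoccatModel in hand we are just about there: create a DocumentCategorizerME from the model,
call its categorize method on the input text to get a score for each category,
and pass those scores to getBestCategory to obtain the most likely category.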
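As a rough illustration of the last step of classification, a categorizer of this kind produces one score per category, and the answer is the category with the highest score. The sketch below shows only that selection step in plain Java. It assumes the scores are aligned with the category names by index; the helper bestCategory is hypothetical (it is not an OpenNLP method) and the scores are invented for illustration.

```java
/** Illustration only: picks the highest-scoring category from a score array. */
class BestCategoryDemo {

    // Hypothetical helper, not part of OpenNLP. Assumes categories[i]
    // is the name of the category that scores[i] belongs to.
    static String bestCategory(String[] categories, double[] scores) {
        if (categories.length == 0 || categories.length != scores.length) {
            throw new IllegalArgumentException("need exactly one score per category");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        String[] categories = {"GMDecrease", "GMIncrease"};
        double[] scores = {0.27, 0.73}; // invented values, for illustration only
        System.out.println(bestCategory(categories, scores));
    }
}
```

On ties the first category with the top score wins, which is an arbitrary choice made for this sketch.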
Training
The Document Categorizer can be trained on annotated training material. The data
can be in the OpenNLP Document Categorizer training format: one document per line,
containing the category and the text separated by whitespace. Other formats may also be
available.
The following sample shows the sample from above in the required format. Here GMDecrease and GMIncrease
are the categories.
Note: The line breaks marked with a backslash are just inserted for formatting purposes and must not be
included in the training data.
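To make the line format concrete, here is a minimal plain-Java sketch that splits one training line into its category and its document tokens. It assumes the first whitespace-separated token is the category and everything after it is the text; the helper names and the sample line are made up for illustration, and this is not how OpenNLP itself is implemented.

```java
import java.util.Arrays;

/** Illustration only: reads one line in "category text..." format. */
class TrainingLineDemo {

    // Splits the line on runs of whitespace.
    static String[] tokens(String line) {
        return line.trim().split("\\s+");
    }

    // Hypothetical helper: the first token is taken to be the category.
    static String category(String line) {
        return tokens(line)[0];
    }

    // Hypothetical helper: the remaining tokens are the document text.
    static String[] documentTokens(String line) {
        String[] all = tokens(line);
        return Arrays.copyOfRange(all, 1, all.length);
    }

    public static void main(String[] args) {
        // Invented sample line, not taken from real training data.
        String line = "GMDecrease Gross margin fell compared with the previous quarter.";
        System.out.println(category(line));
        System.out.println(Arrays.toString(documentTokens(line)));
    }
}
```

Because the category is a single whitespace-delimited token, a category name in this format cannot itself contain spaces.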
Training Tool
The following command will train the document categorizer and write the model to en-doccat.bin:
$ opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8
Additionally, it is possible to specify the number of iterations and the cutoff.
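The cutoff is usually understood as the minimum number of times a feature must be seen in the training data before it is used. The plain-Java sketch below illustrates that idea using word tokens as features. It is an assumption-laden illustration rather than OpenNLP's implementation: the helper featuresMeetingCutoff and the sample documents are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only: keeps features seen at least `cutoff` times. */
class CutoffDemo {

    // Hypothetical helper: counts every token across all documents and
    // returns those whose total count is greater than or equal to cutoff.
    static Set<String> featuresMeetingCutoff(List<String[]> documents, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] doc : documents) {
            for (String token : doc) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented, already tokenized documents.
        List<String[]> docs = Arrays.asList(
                new String[] {"margin", "fell"},
                new String[] {"margin", "rose"},
                new String[] {"margin", "fell"});
        System.out.println(featuresMeetingCutoff(docs, 2));
    }
}
```

A higher cutoff discards rare features, which tends to shrink the model at the cost of ignoring evidence that appears only a few times.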
Training API
To train a model you will naturally need access to many pre-classified documents.
The class opennlp.tools.doccat.DocumentSample encapsulates a text document and its classification.
DocumentSample has two constructors. Each takes the text's category as one argument. The other argument can be either raw
text or an array of tokens. By default, the raw text is split into tokens by whitespace. So, let's say
your training data is contained in a text file, in the format described above.
Then you might want to write something like this to read it as a stream of DocumentSamples and train a model:
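import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// The statements below belong inside a method body.
// The constructor and train() signatures are the ones used in this text;
// other OpenNLP releases may differ.
DoccatModel model = null;
InputStream dataIn = null;
try {
  // "en-doccat.train" is a placeholder; use the path to your own training file.
  dataIn = new FileInputStream("en-doccat.train");
  // Read the file line by line; each line is one training document.
  ObjectStream<String> lineStream =
      new PlainTextByLineStream(dataIn, "UTF-8");
  // Turn each line into a DocumentSample (category plus tokens).
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
  model = DocumentCategorizerME.train("en", sampleStream);
}
catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}
finally {
  if (dataIn != null) {
    try {
      dataIn.close();
    }
    catch (IOException e) {
      // Not an issue, training already finished.
      // The exception should be logged and investigated
      // if part of a production system.
      e.printStackTrace();
    }
  }
}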
Now might be a good time to cruise over to Hulu or something, because this could take a while if you've got a large training set.
You may see a lot of output as well. Once you're done, you can pretty quickly step to classification directly,
but first we'll cover serialization. Feel free to skim.