Document Categorizer

Classifying The OpenNLP Document Categorizer can classify text into pre-defined categories. It is based on maximum entropy framework. For someone interested in Gross Margin, the sample text given below could be classified as GMDecrease and the text below could be classified as GMIncrease To be able to classify a text, the document categorizer needs a model. The classifications are requirements-specific and hence there is no pre-built model for document categorizer under OpenNLP project.

Document Categorizer Tool The easiest way to try out the document categorizer is the command line tool. The tool is only intended for demonstration and testing. The following command shows how to use the document categorizer tool. The input is read from standard input and output is written to standard output, unless they are redirected or piped. As with most components in OpenNLP, document categorizer expects input which is segmented into sentences.

Document Categorizer API To perform classification you will need a maxent model - these are encapsulated in the DoccatModel class of OpenNLP tools. First you need to grab the bytes from the serialized model on an InputStream - we'll leave it you to do that, since you were the one who serialized it to begin with. Now for the easy part: With the DoccatModel in hand we are just about there:

Training The Document Categorizer can be trained on annotated training material. The data can be in OpenNLP Document Categorizer training format. This is one document per line, containing category and text separated by a whitespace. Other formats can also be available. The following sample shows the sample from above in the required format. Here GMDecrease and GMIncrease are the categories. Note: The line breaks marked with a backslash are just inserted for formatting purposes and must not be included in the training data.

Training Tool The following command will train the document categorizer and write the model to en-doccat.bin: Additionally it is possible to specify the number of iterations, and the cutoff.

Training API So, naturally you will need some access to many pre-classified events to train your model. The class opennlp.tools.doccat.DocumentSample encapsulates a text document and its classification. DocumentSample has two constructors. Each take the text's category as one argument. The other argument can either be raw text, or an array of tokens. By default, the raw text will be split into tokens by whitespace. So, let's say your training data was contained in a text file, where the format is as described above. Then you might want to write something like this to create a collection of DocumentSamples: lineStream = new PlainTextByLineStream(dataIn, "UTF-8"); ObjectStream sampleStream = new DocumentSampleStream(lineStream); model = DocumentCategorizerME.train("en", sampleStream); } catch (IOException e) { // Failed to read or parse training data, training failed e.printStackTrace(); } finally { if (dataIn != null) { try { dataIn.close(); } catch (IOException e) { // Not an issue, training already finished. // The exception should be logged and investigated // if part of a production system. e.printStackTrace(); } } }]]> Now might be a good time to cruise over to Hulu or something, because this could take a while if you've got a large training set. You may see a lot of output as well. Once you're done, you can pretty quickly step to classification directly, but first we'll cover serialization. Feel free to skim.