Sentence Detector
Sentence Detection
The OpenNLP Sentence Detector can detect whether or not a punctuation character
marks the end of a sentence. In this sense a sentence is defined
as the longest whitespace-trimmed character sequence between two punctuation
marks. The first and last sentences are exceptions to this rule: the first
non-whitespace character is assumed to be the beginning of a sentence, and the
last non-whitespace character is assumed to be the end of a sentence.
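The definition above can be illustrated with a small sketch. It is not the OpenNLP implementation: it assumes the decision "this punctuation character ends a sentence" has already been made and is passed in as a list of character indexes. The class name, method name, and sample text are hypothetical and use only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

class BoundarySplitSketch {

    /**
     * Splits text at the given end-of-sentence positions. Each position is the
     * index of a punctuation character that was judged to end a sentence.
     * How that judgement is made is outside the scope of this sketch.
     */
    public static List<String> split(String text, int[] sentenceEndIndexes) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int end : sentenceEndIndexes) {
            // Keep the punctuation character itself, then trim surrounding whitespace.
            String candidate = text.substring(start, end + 1).trim();
            if (!candidate.isEmpty()) {
                sentences.add(candidate);
            }
            start = end + 1;
        }
        // The last sentence runs up to the last non-whitespace character,
        // even if no punctuation mark closes it.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "  Mr. Smith arrived. He sat down  ";
        // Assumed decision: only the period at index 19 ends a sentence;
        // the period after "Mr" at index 4 does not.
        for (String sentence : split(text, new int[] {19})) {
            System.out.println(sentence);
        }
    }
}
```

With the assumed boundary, the sketch prints "Mr. Smith arrived." and "He sat down" on separate lines, showing both the trimming and the special handling of the first and last sentence.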
The sample text below should be segmented into its sentences.
After detecting the sentence boundaries, each sentence is written on its own line.
Usually Sentence Detection is done before the text is tokenized, and that is how the pre-trained models on the web site were trained. It is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.
The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence of an article, where the title is mistakenly identified as the first part of the first sentence.
Most components in OpenNLP expect input which is segmented into sentences.
Sentence Detection Tool
The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
Download the English sentence detector model and start the Sentence Detector Tool with this command:
Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console.
Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command.
output.txt
For the English sentence model from the website, the input text should not be tokenized.
Sentence Detection API
The Sentence Detector can be easily integrated into an application via its API.
To instantiate the Sentence Detector the sentence model must be loaded first.
After the model is loaded the SentenceDetectorME can be instantiated.
The Sentence Detector can output an array of Strings, where each String is one sentence.
The result array now contains two entries. The first String is "First sentence." and the
second String is "Second sentence." Whitespace before, between, and after the sentences in the input String is removed.
The API also offers a method which simply returns the spans of the sentences in the input string.
The result array again contains two entries. The first span begins at index 2 and ends at
17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which covers only the characters in the span.
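The span indexes can be checked with plain string operations. The sketch below assumes the input string was "  First sentence. Second sentence. " (two leading spaces, one trailing space); the input is not shown in this section, so that string is an assumption chosen to match the quoted indexes. SimpleSpan is a hypothetical stand-in written for this example, not the OpenNLP Span class, and it treats the end index as exclusive, which is consistent with the numbers above.

```java
class SpanSketch {

    /** Minimal stand-in for a span: start inclusive, end exclusive. Not the OpenNLP class. */
    static final class SimpleSpan {
        final int start;
        final int end;

        public SimpleSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Returns the characters of the given text that this span covers. */
        public CharSequence coveredText(CharSequence text) {
            return text.subSequence(start, end);
        }
    }

    public static void main(String[] args) {
        // Assumed input: two leading spaces and one trailing space.
        String input = "  First sentence. Second sentence. ";
        SimpleSpan[] spans = { new SimpleSpan(2, 17), new SimpleSpan(18, 34) };
        for (SimpleSpan span : spans) {
            // Brackets make it visible that no surrounding whitespace is included.
            System.out.println("[" + span.coveredText(input) + "]");
        }
    }
}
```

Under that assumption the program prints "[First sentence.]" and "[Second sentence.]", so the spans address the sentences inside the untrimmed input rather than a cleaned-up copy.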
Sentence Detector Training
Training Tool
OpenNLP has a command line tool which is used to train the models available from the model
download page on various corpora. The data must be converted to the OpenNLP Sentence Detector
training format, which is one sentence per line. An empty line indicates a document boundary.
If the document boundaries are unknown, it is recommended to insert an empty line every few tens of
sentences. The format is exactly like the output in the sample above.
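As an illustration of the training format, the following sketch renders sentences as one per line with an empty line between documents. It uses only the Java standard library; the class and method names are hypothetical, and the chunk size passed to chunk is an arbitrary choice for the case where real document boundaries are not known.

```java
import java.util.ArrayList;
import java.util.List;

class TrainingFormatSketch {

    /** Renders documents as one sentence per line, with an empty line between documents. */
    public static String format(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < documents.size(); i++) {
            if (i > 0) {
                out.append('\n'); // the empty line marks a document boundary
            }
            for (String sentence : documents.get(i)) {
                out.append(sentence.trim()).append('\n');
            }
        }
        return out.toString();
    }

    /** Groups a flat list of sentences into artificial documents of at most size sentences. */
    public static List<List<String>> chunk(List<String> sentences, int size) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i += size) {
            groups.add(sentences.subList(i, Math.min(i + size, sentences.size())));
        }
        return groups;
    }
}
```

For a corpus without document boundaries, a call such as format(chunk(sentences, 30)) would insert an empty line after every 30 sentences; 30 is only an example value.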
Usage of the tool:
To train an English sentence detector use the following command:
It should produce the following output:
...
95: .. loglikelihood=-288.25556805874436 0.9834118369854598
96: .. loglikelihood=-287.2283680343481 0.9834118369854598
97: .. loglikelihood=-286.2174830344526 0.9834118369854598
98: .. loglikelihood=-285.222486981048 0.9834118369854598
99: .. loglikelihood=-284.24296917223916 0.9834118369854598
100: .. loglikelihood=-283.2785335773966 0.9834118369854598
Wrote sentence detector model.
Path: en-sent.bin
Training API
The Sentence Detector also offers an API to train a new sentence detection model.
Basically three steps are necessary to train it:
- The application must open a sample data stream
- Call the SentenceDetectorME.train method
- Save the SentenceModel to a file or directly use it
The following sample code illustrates these steps:
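Charset charset = Charset.forName("UTF-8");
// Read the training file line by line (one sentence per line).
ObjectStream<String> lineStream =
    new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
// Convert each line into a SentenceSample.
ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

SentenceModel model;
try {
  model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
}
finally {
  sampleStream.close();
}

// Write the trained model to modelFile (the target file, defined by the application).
OutputStream modelOut = null;
try {
  modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
  model.serialize(modelOut);
} finally {
  if (modelOut != null)
    modelOut.close();
}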
Evaluation
Evaluation Tool
The following command shows how the evaluator tool can be run:
The en-sent.eval file has the same format as the training data.