This page describes how to use the OpenNLP Sentence Detector.

= Detecting Sentence Boundaries =
The OpenNLP Sentence Detector can detect whether a punctuation character marks the end of a sentence. In this sense
a sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. The first and last
sentences are exceptions to this rule: the first non-whitespace character is assumed to be the beginning
of a sentence, and the last non-whitespace character is assumed to be the end of a sentence.
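
The definition above can be illustrated with a small, self-contained sketch. It is not the OpenNLP implementation: the class name BoundarySketch and the method split are hypothetical, and the end-of-sentence offsets are supplied by hand, whereas in OpenNLP that decision would come from the trained model. The sketch only shows how boundary decisions turn into whitespace-trimmed sentences, including the special handling of the first and last sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class BoundarySketch {

    // endOffsets holds the index of each punctuation character that was
    // judged to end a sentence (assumed to be sorted in ascending order).
    static List<String> split(String text, int[] endOffsets) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int end : endOffsets) {
            // Take everything up to and including the punctuation mark,
            // then trim the surrounding whitespace.
            String sentence = text.substring(start, end + 1).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end + 1;
        }
        // Whatever follows the last boundary forms the last sentence; its
        // last non-whitespace character is treated as the sentence end.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "  Mr. Vinken is chairman. He joins Nov. 29  ";
        // Only the period after "chairman" is marked as a sentence end;
        // the periods in "Mr." and "Nov." are not.
        int end = text.indexOf("chairman.") + "chairman".length();
        for (String sentence : split(text, new int[] { end })) {
            System.out.println(sentence);
        }
    }
}
```

The hard part, deciding which punctuation characters are real boundaries, is exactly what the sketch leaves out and what the Sentence Detector model provides.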

The sample text below should be segmented into its sentences.
<blockquote>
''Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.  Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.  Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate. ''
</blockquote>

After the sentence boundaries are detected, each sentence is written on its own line.
<blockquote>
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.<br>
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.<br>
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.<br>
</blockquote>

Usually sentence detection is done before the text is tokenized, and that is how the pre-trained models on the web site were trained,
but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.

The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is
the first sentence in an article, where the title is mistakenly identified as the first part of the first sentence.

Most components in OpenNLP expect input which is segmented into sentences.
== Sentence Detector Tool ==
The easiest way to try out the Sentence Detector is the command line tool. The tool
is only intended for demonstration and testing.

Download the [http://opennlp.sourceforge.net/models-1.5/en-sent.bin English sentence detector model] and start
the Sentence Detector Tool with this command:
<pre>
bin/opennlp SentenceDetector en-sent.bin
</pre>
Just copy the sample text from above to the console. The Sentence Detector will read it and
echo one sentence per line to the console.

Usually the input is read from a file and the output is redirected to another file.
This can be achieved with the following command.
<pre>
bin/opennlp SentenceDetector en-sent.bin < input.txt > output.txt
</pre>
For the English sentence model from the website, the input text should not be tokenized.

== Sentence Detector API ==
The Sentence Detector can be easily integrated into an application via its API.

To instantiate the Sentence Detector the sentence model must be loaded
first.
<pre>
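import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceModel;

// Declared outside the try block so the model is still in scope afterwards.
SentenceModel model = null;
InputStream modelIn = new FileInputStream("en-sent.bin");

try {
  model = new SentenceModel(modelIn);
}
catch (IOException e) {
  e.printStackTrace();
}
finally {
  if (modelIn != null) {
    try {
      modelIn.close();
    }
    catch (IOException e) {
      // Nothing sensible can be done if closing the stream fails.
    }
  }
}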
</pre>

After the model is loaded the SentenceDetectorME can be instantiated.
<pre>
SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
</pre>

The Sentence Detector can output an array of Strings, where each String is one 
sentence.
<pre>
String sentences[] = sentenceDetector.sentDetect("  First sentence. Second sentence. ");
</pre>
The result array now contains two entries. The first String is "First sentence." and the second
String is "Second sentence." The whitespace before, between and after the sentences in the input String is removed.

The API also offers a method, sentPosDetect, which returns the spans of the sentences in the input
string instead of the sentences themselves.
<pre>
Span sentences[] = sentenceDetector.sentPosDetect("  First sentence. Second sentence. ");
</pre>
The result array again contains two entries. The first span begins at index 2 and ends at 17.
The second span begins at 18 and ends at 34. The utility method Span.getCoveredText
can be used to create a substring which covers only the characters in the span.
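
The indices quoted above suggest that a span's begin index is inclusive and its end index is exclusive, since "First sentence." is 15 characters long and runs from 2 to 17. The following sketch checks that reading with plain java.lang.String.substring, which uses the same convention. The class SpanSketch and the method coveredText are hypothetical stand-ins, not the OpenNLP Span class.

```java
// Hypothetical stand-in for illustration; it does not use opennlp.tools.util.Span.
class SpanSketch {

    // Returns the characters from begin (inclusive) to end (exclusive).
    static String coveredText(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String input = "  First sentence. Second sentence. ";
        // The two spans given in the text above.
        System.out.println(coveredText(input, 2, 17));
        System.out.println(coveredText(input, 18, 34));
    }
}
```

With these offsets the two calls yield the same two sentences that sentDetect returns as Strings.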

= Training =
== Training Tool ==
OpenNLP has a command line tool which is used to train the models available from the model download page on various corpora.
The data must be converted to the OpenNLP Sentence Detector training format, which is one sentence per line,
exactly like the output in the sample above. An empty line indicates a document boundary.
If the document boundaries are unknown, it is recommended to insert an empty line every
few tens of sentences.
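
As an illustration of the training format, the sketch below turns a list of documents, each given as a list of sentences, into text with one sentence per line and an empty line between documents. The class TrainingFormatSketch and the method format are hypothetical and not part of OpenNLP, and the second document is made-up sample data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class TrainingFormatSketch {

    // Writes one sentence per line and separates documents with an empty line.
    static String format(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        boolean firstDocument = true;
        for (List<String> document : documents) {
            if (!firstDocument) {
                out.append('\n'); // empty line marks the document boundary
            }
            firstDocument = false;
            for (String sentence : document) {
                out.append(sentence.trim()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
            Arrays.asList(
                "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.",
                "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group."),
            Arrays.asList(
                "First sentence.",
                "Second sentence."));
        // Print to standard output; redirect it to a file such as en-sent.train.
        System.out.print(format(documents));
    }
}
```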

Usage of the tool:
<pre>
bin/opennlp SentenceDetectorTrainer
Usage: opennlp SentenceDetectorTrainer -lang language -encoding charset [-iterations num] [-cutoff num] -data trainingData -model model
-lang language     specifies the language which is being processed.
-encoding charset  specifies the encoding which should be used for reading and writing text.
-iterations num    specified the number of training iterations
-cutoff num        specifies the min number of times a feature must be seen
</pre>

To train an English sentence detector use the following command:
<pre>
bin/opennlp SentenceDetectorTrainer -encoding UTF-8 -lang en -data en-sent.train -model en-sent.bin

Indexing events using cutoff of 5

	Computing event counts...  done. 4883 events
	Indexing...  done.
Sorting and merging events... done. Reduced 4883 events to 2945.
Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 2945
	    Number of Outcomes: 2
	  Number of Predicates: 467
...done.
Computing model parameters...
Performing 100 iterations.
  1:  .. loglikelihood=-3384.6376826743144	0.38951464263772273
  2:  .. loglikelihood=-2191.9266688597672	0.9397911120212984
  3:  .. loglikelihood=-1645.8640771555981	0.9643661683391358
  4:  .. loglikelihood=-1340.386303774519	0.9739913987302887
  5:  .. loglikelihood=-1148.4141548519624	0.9748105672742167

 ...<skipping a bunch of iterations>...

 95:  .. loglikelihood=-288.25556805874436	0.9834118369854598
 96:  .. loglikelihood=-287.2283680343481	0.9834118369854598
 97:  .. loglikelihood=-286.2174830344526	0.9834118369854598
 98:  .. loglikelihood=-285.222486981048	0.9834118369854598
 99:  .. loglikelihood=-284.24296917223916	0.9834118369854598
100:  .. loglikelihood=-283.2785335773966	0.9834118369854598
Wrote sentence detector model.
Path: en-sent.bin
</pre>

== Training API ==

The Sentence Detector also offers an API to train a new sentence detection model.

Three steps are necessary to train it:
* Open a sample data stream
* Call the SentenceDetectorME.train method
* Save the SentenceModel to a file or use it directly

The following sample code illustrates these steps:
<pre>
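import java.io.*;
import opennlp.tools.sentdetect.*;
import opennlp.tools.util.*;

// Read the training data, one sentence per line.
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), "UTF-8");
ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

// Train for language "en" with a cutoff of 5 and 100 iterations.
SentenceModel model = SentenceDetectorME.train("en", sampleStream, true, null, 5, 100);

// modelFile names the output file; the path used here is only an example.
File modelFile = new File("en-sent.bin");
OutputStream modelOut = null;

try {
  modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
  model.serialize(modelOut);
} finally {
  if (modelOut != null)
     modelOut.close();
}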
</pre>

== Custom Feature Generation ==
TBD

= Evaluation =
== Evaluator Tool ==
The following command shows how the evaluator tool can be run:
<pre>
$ bin/opennlp SentenceDetectorEvaluator -encoding UTF-8 -model en-sent.bin -data en-sent.eval  

Loading model ... done
Evaluating ... done

Precision: 0.9465737514518002
Recall: 0.9095982142857143
F-Measure: 0.9277177006260672
</pre>
The en-sent.eval file has the same format as the training data.
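
The three numbers in the sample output are related. Assuming the reported F-Measure is the balanced F1 score, the harmonic mean of precision and recall, it can be recomputed from the other two values. The sketch below does this with a hypothetical helper class FMeasureSketch; it is not the OpenNLP evaluator code.

```java
// Hypothetical helper, not part of OpenNLP.
class FMeasureSketch {

    // Balanced F-measure: harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0; // avoid dividing by zero when both values are zero
        }
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall taken from the sample output above.
        double precision = 0.9465737514518002;
        double recall = 0.9095982142857143;
        System.out.println("F-Measure: " + fMeasure(precision, recall));
    }
}
```

For the precision and recall shown above this gives roughly 0.9277, which agrees with the F-Measure line in the sample output.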