// Define some global attributes
include::_globattr.adoc[]

Core
~~~~
This project contains several annotators, including:

- a sentence detector annotator
- a tokenizer
- an annotator that does not update the CAS in any way
- an annotator that creates a single ((Segment)) annotation
  encompassing the entire document text

[NOTE]
=======================================================================
- End-of-line characters are considered end-of-sentence markers.
- Hyphenated words that appear in the hyphenated words list with
frequency values greater than the ((FreqCutoff)) will be considered one
token (see <<tokenizer_annot, tokenizer>> below).
=======================================================================
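
The hyphenation rule in the note above can be illustrated with a minimal sketch. This is not the project's Tokenizer code: the class name `HyphenCutoffSketch`, the in-memory map standing in for the hyphenated words list, and the fallback of splitting a word around each hyphen are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the project's Tokenizer implementation.
class HyphenCutoffSketch {

    // Returns the word as a single token when its frequency in the
    // hyphenated words list is strictly greater than freqCutoff.
    // Otherwise splits it, emitting each hyphen as its own token
    // (the splitting behavior is an assumption).
    static List<String> tokenize(String word, Map<String, Integer> hyphFreq, int freqCutoff) {
        List<String> tokens = new ArrayList<>();
        Integer freq = hyphFreq.get(word);
        if (freq != null && freq > freqCutoff) {
            tokens.add(word);
            return tokens;
        }
        StringBuilder part = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (c == '-') {
                if (part.length() > 0) {
                    tokens.add(part.toString());
                    part.setLength(0);
                }
                tokens.add("-");
            } else {
                part.append(c);
            }
        }
        if (part.length() > 0) {
            tokens.add(part.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up frequencies, for illustration only.
        Map<String, Integer> hyphFreq = Map.of("x-ray", 120, "pre-op", 3);
        System.out.println(tokenize("x-ray", hyphFreq, 10));  // prints [x-ray]
        System.out.println(tokenize("pre-op", hyphFreq, 10)); // prints [pre, -, op]
    }
}
```

With a cutoff of 10, only the entry whose frequency exceeds the cutoff stays a single token.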

A sentence detector model
footnoteref:["model_disclaimer",{model-disclaimer}]
is included with this project.


Analysis engines (annotators)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- *AggregateAE.xml*
+
The file +desc/analysis_engine/AggregateAE.xml+ defines a
``pipeline'' that UIMA's PEAR installer uses to verify an installation
when the project is installed from a PEAR file. This descriptor is
typically not used in a more complete pipeline; instead, one or more of
the individual analysis engines is normally included.
+
- *CopyAnnotator.xml*
+
This is a utility annotator that copies data from an existing JCas
object into a new JCas object.
+
- *NullAnnotator.xml*
+
As its name implies, this annotator does nothing. It can be useful if
you are using the UIMA CPE GUI and you are required to specify an
analysis engine but you don't actually want to specify one.
+
- *OverlapAnnotator.xml*
+
--
An annotator that modifies one annotation (begin and end offsets) or
deletes one (or both) of the annotations, when two annotations
overlap. The action taken depends on the configuration parameters. It
can extend an annotation to encompass overlapping annotations. It can
also be configured to delete annotations of type A that are subsumed
by other annotations of type A if you only want the longest
annotations of the given type to be kept.

See the Javadoc for edu.mayo.bmi.uima.core.ae.OverlapAnnotator for
more details.
--
+
- [[sentdetect_annot]] *SentenceDetectorAnnotator.xml*
+
A wrapper around the link:{opennlp-home}[OpenNLP] sentence detector
that creates Sentence annotations based on the location of end-of-line
characters and on the output of the OpenNLP sentence detector. This
annotator considers an end-of-line character as an end-of-sentence
marker. Optionally it can skip certain sections of the document. See
<<run_sentdetect_token_annot>> for more details.
+
*Parameters*::
  SegmentsToSkip;; (optional) the list of sections not to create Sentence annotations for.
*Resources*::
  +MaxentModelFile+;; the Maxent model used by the sentence detector.
+
- *SimpleSegmentAnnotator.xml*
+
Creates a single ((Segment)) annotation encompassing the entire
document. Use it before annotators that require a Segment
annotation when the pipeline contains no other annotator
that creates Segment annotations. This annotator is used for plain text
files, which don't have section (aka segment) tags, but not for CDA
documents, because the CdaCasInitializer annotator already creates
Segment annotations.
+
*Parameters*::
  SegmentID;; (optional) the identifier to use for the Segment annotation created.
+
- [[tokenizer_annot]] *TokenizerAnnotator.xml*
+
Tokenizes text. Hyphenated words that appear in the hyphenated words
list (`HyphFreqFile`) with frequency values greater than the
FreqCutoff will be considered one token. See
<<run_sentdetect_token_annot>> for more details. See classes
edu.mayo.bmi.uima.core.ae.TokenizerAnnotator and
edu.mayo.bmi.nlp.tokenizer.Tokenizer for implementation details.
+
*Parameters*::
  SegmentsToSkip;; (optional) the list of sections not to create token annotations for.
  FreqCutoff;; frequency cutoff; only entries in the hyphenated words list (+HyphFreqFile+) with a frequency greater than this value are treated as one token.
*Resources*::
  +HyphFreqFile+;; a file containing a list of hyphenated words and their frequency within some corpus.
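
One of the behaviors described for the OverlapAnnotator above, deleting annotations that are subsumed by other annotations of the same type so that only the longest remain, can be sketched on plain offsets. This is a simplified illustration, not the OverlapAnnotator implementation: the `Span` class stands in for an annotation, and keeping only the first of several identical spans is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the OverlapAnnotator implementation.
class LongestSpanSketch {

    // Stand-in for an annotation: begin and end character offsets.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    // True when inner lies entirely within outer.
    static boolean subsumes(Span outer, Span inner) {
        return outer.begin <= inner.begin && inner.end <= outer.end;
    }

    // Keeps only the spans that are not subsumed by another span.
    // Of several identical spans, only the first is kept (assumption).
    static List<Span> keepLongest(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (int i = 0; i < spans.size(); i++) {
            Span candidate = spans.get(i);
            boolean subsumed = false;
            for (int j = 0; j < spans.size() && !subsumed; j++) {
                if (i == j) {
                    continue;
                }
                Span other = spans.get(j);
                boolean identical = other.begin == candidate.begin && other.end == candidate.end;
                // An identical span only removes the later duplicate.
                subsumed = identical ? j < i : subsumes(other, candidate);
            }
            if (!subsumed) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span(0, 10), new Span(2, 5), new Span(8, 15), new Span(0, 10));
        System.out.println(keepLongest(spans)); // prints [[0,10), [8,15)]
    }
}
```

Spans that merely overlap, such as `[0,10)` and `[8,15)`, are both kept, because neither contains the other.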

///////////////////////////////////////
// MOVE THIS SECTION TO DEVELOPER GUIDE
///////////////////////////////////////
Tools
^^^^^
[[train_sentdetect_model]]
Training a sentence detector model
++++++++++++++++++++++++++++++++++
To train a sentence detector that recognizes the same set of candidate
end-of-sentence characters that the
<<sentdetect_annot,SentenceDetectorAnnotator>> uses:

___________________________________________________
+*java -cp <classpath> edu.mayo.bmi.uima.core.ae.SentenceDetector \
                     <sents_file> \ <1>
                     <model> \ <2>
                     iters \ <3>
                     cut*+ <4>
___________________________________________________

<1> your sentence training data file, one sentence per line (see the example in <<sents_file_eg>>).
<2> name of the model file to be created.
<3> (optional) number of iterations for training.
<4> (optional) cutoff value.

TIP: Eclipse users may run the ``SentenceDetector\--train_a_new_model''
launch.

[[sents_file_eg]]
.Sentence detector training data file sample
============================
One sentence per line.
-----
The boy ran.
Did the girl run too?
Yes, she did.
Where did she go?
-----
============================
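
A file in this format can be produced programmatically. The sketch below is only an illustration: the class name `TrainingFileSketch`, the output file name, the UTF-8 encoding, and the choices to trim surrounding whitespace and reject sentences that contain a line break are assumptions, not requirements stated by the training tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper; not part of the project.
class TrainingFileSketch {

    // Builds the file content: one sentence per line, each line
    // terminated by a newline. A sentence containing a line break is
    // rejected because it would be split across lines.
    static String toTrainingText(List<String> sentences) {
        StringBuilder text = new StringBuilder();
        for (String sentence : sentences) {
            String trimmed = sentence.trim();
            if (trimmed.contains("\n") || trimmed.contains("\r")) {
                throw new IllegalArgumentException("Line break inside sentence: " + trimmed);
            }
            if (!trimmed.isEmpty()) {
                text.append(trimmed).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        List<String> sentences = List.of(
                "The boy ran.", "Did the girl run too?", "Yes, she did.", "Where did she go?");
        // The file name and encoding are assumptions for this example.
        Files.write(Paths.get("my_training_sentences.txt"),
                toTrainingText(sentences).getBytes(StandardCharsets.UTF_8));
    }
}
```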

.Verify you can train a sentence detector model successfully
[TIP]
=====================================================================
The sample model `resources/sentdetect/sample_sd_included.mod` was
trained from `data/test/sample_sd_training_sentences.txt`, using the
default values for ``iters'' and ``cut'' (that is, without specifying
them on the command line). To check your setup, compare the model you
train with the sample model, using a tool of your choice.
=====================================================================

.Using OpenNLP directly to train sentence detector model
*******************************************************************
You can train a sentence detector directly using the OpenNLP
sentence detector (SentenceDetectorME) with the default set of
candidate end-of-sentence characters, using

______________________________________________________
+*java -cp <classpath> opennlp.tools.sentdetect.SentenceDetectorME \
                     <infile> \
                     <outfile> \
                     iters \
                     cut*+
_______________________________________________________
The four parameters have the same meaning as in the tool provided with
this project; ``infile'' uses the same format as shown in <<sents_file_eg>>.
*******************************************************************

[[run_sentdetect_token_annot]]
Running the sentence detector and tokenizer
+++++++++++++++++++++++++++++++++++++++++++

This project provides a sentence detector CPE descriptor and a tokenizer
CPE descriptor. To run either CPE:

. Run
+
___________________________________________________________
+*java -cp {osp-cp} org.apache.uima.tools.cpm.CpmFrame*+
___________________________________________________________
+
. Open
+
--
- +desc/collection_processing_engine/SentenceDetecorCPE.xml+ to run a sentence detector; or
- +desc/collection_processing_engine/SentencesAndTokensCPE.xml+ to run a tokenizer
--

The sentence detector CPE uses the analysis engines listed in
+desc/analysis_engine/SentenceDetectorAggregate.xml+, and the
tokenizer CPE uses those listed in
+desc/analysis_engine/SentencesAndTokensAggregate.xml+. The two CPEs
are defined to read from plain text file(s) in
+data/test/sample_notes_plaintext+ using the
FilesInDirectoryCollectionReader.

TIP: Eclipse users may use the ``SentenceDetector_annotator'' and the
``Tokenizer annotator'' launches.

.How do the CPEs work?
**********************************************************************
Since the sentence annotator processes the text one section at a time,
there must be at least one section (segment) annotation for the
SentenceDetectorAnnotator to add Sentence annotations. Therefore the
first analysis engine is the SimpleSegmentAnnotator, which creates a
single Segment annotation that covers the entire text. Next, the
SentenceDetectorAnnotator analysis engine adds Sentence
annotations. Finally, if you are running the tokenizer, the
TokenizerAnnotator analysis engine adds annotations for tokens, such
as PunctuationToken, WordToken, and NewlineToken.

Strictly speaking, it would not be necessary to run the
SentenceDetectorAnnotator in order to test the TokenizerAnnotator. The
TokenizerAnnotator does not require the presence of Sentence
annotations.
**********************************************************************
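
The ordering described above can be sketched in plain Java. This is not UIMA or project code: the class and method names are hypothetical, and the sentence step only applies the end-of-line rule, leaving out the model-based OpenNLP detection. It shows why a Segment annotation must exist before the sentence step produces anything.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the processing order; not UIMA code.
class PipelineOrderSketch {

    // Character offsets of a span in the document text.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Stand-in for SimpleSegmentAnnotator: one segment covering the text.
    static List<Span> wholeDocumentSegment(String text) {
        return List.of(new Span(0, text.length()));
    }

    // Stand-in for the sentence step: processes one segment at a time and
    // treats each end-of-line character as an end-of-sentence marker.
    static List<Span> sentences(String text, List<Span> segments) {
        List<Span> result = new ArrayList<>();
        for (Span segment : segments) {
            int start = segment.begin;
            for (int i = segment.begin; i <= segment.end; i++) {
                boolean atSegmentEnd = i == segment.end;
                if (atSegmentEnd || text.charAt(i) == '\n' || text.charAt(i) == '\r') {
                    if (i > start) {
                        result.add(new Span(start, i));
                    }
                    start = i + 1;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The boy ran.\nDid the girl run too?";
        for (Span s : sentences(text, wholeDocumentSegment(text))) {
            System.out.println(text.substring(s.begin, s.end));
        }
        // Without any segment, no sentences are produced.
        System.out.println(sentences(text, List.of()).size()); // prints 0
    }
}
```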

///////////////////////////////////////
// MOVE THIS SECTION TO DEVELOPER GUIDE
///////////////////////////////////////