This class illustrates the pipeline needed to run the ClearNLP dependency parser and SRL systems.
Note: This uses small, highly inaccurate model files to keep the cost of running it down.
There is a fixed list of hyphenated words that are not to be split (hyphenatedWordsLookup).
Here are some made-up examples of words whose affixes are to be kept together:
chronic-itis 1 suffix
mega-huge 1 prefix
e-game-fest 1 prefix and 1 suffix
salon-o-torium 1 suffix that contains 2 hyphens
urban-esque-wise 2 suffixes
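The affix handling described above can be sketched roughly as follows. This is a minimal illustration, not the tokenizer's actual code: the class name, the method keepTogether, and the contents of FIXED, PREFIXES and SUFFIXES are all hypothetical stand-ins (the real hyphenatedWordsLookup and affix lists are not shown here). A word is kept whole if it is in the fixed list, or if stripping known prefixes and suffixes leaves a stem with no hyphen.

```java
import java.util.Set;

class HyphenAffixSketch {
    // Hypothetical stand-in for the fixed lookup of words that are never split.
    static final Set<String> FIXED = Set.of("x-ray", "follow-up");
    // Hypothetical affix lists built from the made-up examples above.
    static final Set<String> PREFIXES = Set.of("mega", "e");
    // A suffix may itself contain a hyphen, as in "o-torium".
    static final Set<String> SUFFIXES = Set.of("itis", "fest", "esque", "wise", "o-torium");

    /** Returns true if the hyphenated word should stay a single token. */
    static boolean keepTogether(String word) {
        String w = word.toLowerCase();
        if (FIXED.contains(w)) {
            return true;
        }
        String rest = w;
        boolean changed = true;
        // Strip known prefixes from the front and suffixes from the back
        // until nothing more can be removed.
        while (changed) {
            changed = false;
            for (String p : PREFIXES) {
                if (rest.startsWith(p + "-")) {
                    rest = rest.substring(p.length() + 1);
                    changed = true;
                    break;
                }
            }
            for (String s : SUFFIXES) {
                if (rest.endsWith("-" + s)) {
                    rest = rest.substring(0, rest.length() - s.length() - 1);
                    changed = true;
                    break;
                }
            }
        }
        // Keep together only if every hyphen was accounted for by an affix.
        return w.indexOf('-') >= 0 && !rest.isEmpty() && rest.indexOf('-') < 0;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"chronic-itis", "e-game-fest", "blue-green"}) {
            System.out.println(w + " keepTogether=" + keepTogether(w));
        }
    }
}
```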
For a word like 80's or P'yongyang or James' or Sean's or 80's-like or 80's-esque
(or can't or haven't, which are to be split),
determine whether the single quote (apostrophe)
needs to be kept with the surrounding letters/numbers,
and what to do about the hyphenated part afterwards if there is a hyphen after it.
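As a partial illustration of the apostrophe decision, the sketch below covers only the one case the description states outright: words such as can't and haven't are split. It assumes a Penn Treebank style split point just before "n't", which is an assumption and may not match the actual rule. The class and method names are hypothetical, and every other apostrophe word (80's, P'yongyang, James', Sean's) is simply reported as "no split found by this sketch", leaving the real decision to rules not shown here.

```java
class ApostropheSketch {
    /**
     * Returns the index at which to split an "n't" contraction,
     * or -1 if this sketch finds no such contraction.
     * Assumption: the split point is just before "n't" (ca + n't).
     */
    static int contractionSplitIndex(String word) {
        int n = word.length();
        // Require at least one character before the "n't" ending.
        if (n > 3 && word.regionMatches(true, n - 3, "n't", 0, 3)) {
            return n - 3;
        }
        return -1;
    }

    /** Splits the word in two at the contraction, or returns it unchanged. */
    static String[] split(String word) {
        int i = contractionSplitIndex(word);
        if (i < 0) {
            return new String[] {word};
        }
        return new String[] {word.substring(0, i), word.substring(i)};
    }
}
```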
type -
Variable in class org.apache.ctakes.utils.xcas_comparison.XcasAnnotation
TYPE_CONTRACTION -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token
Contains contractions and possessives (since they cannot be
differentiated without context).
TYPE_EOL -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token
An EOL token is defined as a line feed or carriage return character.
TYPE_NUMBER -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token
A number token is defined as a consecutive series of digits.
TYPE_PUNCT -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token
A punctuation token is defined as one character that can be either a
period, double quote, single quote, question mark, exclamation point,
hyphen (if not surrounded by word characters), etc.
TYPE_SYMBOL -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token