This class illustrates the pipeline needed to run the ClearParser dependency parser and SRL systems.
Note: This uses small, highly inaccurate model files to keep the cost of running the pipeline low.
Equivalent to cTAKES: edu.mayo.bmi.uima.cdt.type.TimeAnnotation
Updated by JCasGen Tue Apr 09 11:44:44 EDT 2013
XML source: apache-ctakes-3-0-0/ctakes-pad-term-spotter/src/main/resources/org/apache/ctakes/padtermspotter/types/TypeSystem.xml
There is a fixed list of hyphenated words that are not to be split (hyphenatedWordsLookup).
Here are some made-up examples of words with affixes that are to be kept together:
chronic-itis 1 suffix
mega-huge 1 prefix
e-game-fest 1 prefix and 1 suffix
salon-o-torium 1 suffix that contains 2 hyphens
urban-esque-wise 2 suffixes
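The made-up examples above can be checked with a minimal sketch like the following. It is not the cTAKES implementation: the class name, the method name, the affix lists, and the single lookup entry are all hypothetical and chosen only to cover the examples. The sketch assumes that a word is kept together when it is in the fixed list, or when nothing hyphenated remains after its known prefixes and suffixes are stripped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; names and word lists are illustrative assumptions.
class HyphenAffixSketch {
    // Stand-in for the fixed list of hyphenated words (entry is made up).
    static final Set<String> HYPHENATED_WORDS = new HashSet<>(Arrays.asList("x-ray"));
    // Made-up affixes taken from the examples in the comment.
    static final Set<String> PREFIXES = new HashSet<>(Arrays.asList("mega-", "e-"));
    static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("-itis", "-fest", "-o-torium", "-esque", "-wise"));

    // Returns true if the word should stay one token under the assumptions above.
    static boolean keepTogether(String word) {
        String rest = word.toLowerCase(Locale.ROOT);
        if (HYPHENATED_WORDS.contains(rest)) {
            return true;
        }
        boolean changed = true;
        while (changed) { // keep stripping until no known affix matches
            changed = false;
            for (String p : PREFIXES) {
                if (rest.length() > p.length() && rest.startsWith(p)) {
                    rest = rest.substring(p.length());
                    changed = true;
                }
            }
            for (String s : SUFFIXES) {
                if (rest.length() > s.length() && rest.endsWith(s)) {
                    rest = rest.substring(0, rest.length() - s.length());
                    changed = true;
                }
            }
        }
        // Any hyphen left over belongs to neither the lookup nor an affix.
        return rest.indexOf('-') < 0;
    }

    public static void main(String[] args) {
        String[] words = {"chronic-itis", "mega-huge", "e-game-fest",
                          "salon-o-torium", "urban-esque-wise", "blue-green"};
        for (String w : words) {
            System.out.println(w + " keepTogether=" + keepTogether(w));
        }
    }
}
```

Stripping affixes repeatedly is a simplification that handles the two-suffix case (urban-esque-wise) and the two-hyphen suffix (salon-o-torium) with the same loop; a real tokenizer may bound the number of affixes it accepts.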
For a word like 80's or P'yongyang or James' or Sean's or 80's-like or 80's-esque
(or can't or haven't, which are to be split),
determine whether the single quote (apostrophe)
needs to be kept with the surrounding letters/numbers,
and what to do about the hyphenated part if a hyphen follows the apostrophe.
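As a rough illustration of the cases listed above, the following sketch only labels where the apostrophe sits in a word; it does not decide whether to split, except that the n't case corresponds to the words the comment says are to be split. The class, enum, and method names are hypothetical and not part of cTAKES. For words such as 80's-like, the sketch assumes that only the part before the hyphen following the apostrophe matters for the label.

```java
// Hypothetical sketch; the categories and names are illustrative assumptions.
class ApostropheSketch {
    enum Kind { NONE, NT_CONTRACTION, BEFORE_S, TRAILING, INTERNAL }

    // Case-insensitive suffix check built on String.regionMatches.
    static boolean endsWithIgnoreCase(String s, String suffix) {
        return s.regionMatches(true, s.length() - suffix.length(), suffix, 0, suffix.length());
    }

    // Labels the position of the first apostrophe in the word.
    static Kind classify(String word) {
        int apos = word.indexOf('\'');
        if (apos < 0) {
            return Kind.NONE;
        }
        // Ignore a hyphenated tail such as "-like" or "-esque" after the apostrophe.
        int hyphen = word.indexOf('-', apos);
        String head = hyphen < 0 ? word : word.substring(0, hyphen);
        if (head.length() > 3 && endsWithIgnoreCase(head, "n't")) {
            return Kind.NT_CONTRACTION; // can't, haven't
        }
        if (apos == head.length() - 1) {
            return Kind.TRAILING;       // James'
        }
        if (apos == head.length() - 2 && endsWithIgnoreCase(head, "'s")) {
            return Kind.BEFORE_S;       // Sean's, 80's, 80's-like
        }
        return Kind.INTERNAL;           // P'yongyang
    }

    public static void main(String[] args) {
        String[] words = {"80's", "P'yongyang", "James'", "Sean's",
                          "80's-like", "80's-esque", "can't", "haven't"};
        for (String w : words) {
            System.out.println(w + " -> " + classify(w));
        }
    }
}
```

A tokenizer could then map each label to a keep-or-split decision; that mapping is left out here because the comment states it only for can't and haven't.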
type -
Variable in class org.apache.ctakes.utils.xcas_comparison.XcasAnnotation
TYPE_CONTRACTION -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token
Contains contractions and possessives (since they cannot be
differentiated without context).
TYPE_EOL -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token
An EOL token is defined as a line feed or carriage return character.
TYPE_NUMBER -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token
A number token is defined as a consecutive series of digits.
TYPE_PUNCT -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token
A punctuation token is defined as one character that can be either a
period, double quote, single quote, question mark, exclamation point,
hyphen (if not surrounded by word characters), etc.
TYPE_SYMBOL -
Static variable in class org.apache.ctakes.core.nlp.tokenizer.Token