org.apache.ctakes.core.nlp.tokenizer
Class HyphenatedPTB
java.lang.Object
org.apache.ctakes.core.nlp.tokenizer.HyphenatedPTB
public class HyphenatedPTB
- extends java.lang.Object
- Author:
- Mayo Clinic
Method Summary |
(package private) static boolean |
isContractionThatStartsWithApostrophe(int currentPosition,
java.lang.String textSegment)
|
(package private) static int |
lenIfHyphenatedSuffix(java.lang.String lowerCasedString,
int position)
|
(package private) static int |
lenOfFirstTokenInContraction(java.lang.String s)
|
static void |
main(java.lang.String[] args)
|
static int |
tokenLengthCheckingForHyphenatedTerms(java.lang.String lowerCasedString)
Returns the number of leading characters to keep together as one token when the input contains a hyphen, based on a fixed list of hyphenated words that are never split (hyphenatedWordsLookup) and on known hyphenated prefixes and suffixes. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
MultiTokenWords
static java.lang.String[] MultiTokenWords
MultiTokenWordLenToken1
static int[] MultiTokenWordLenToken1
MultiTokenWordLenToken2
static int[] MultiTokenWordLenToken2
MultiTokenWordsLookup
static java.util.HashMap<java.lang.String,java.lang.Integer> MultiTokenWordsLookup
possibleContractionEndings
static java.lang.String[] possibleContractionEndings
lettersAfterApostropheForMiddleOfContraction
static java.lang.String lettersAfterApostropheForMiddleOfContraction
contractionsStartingWithApostrophe
static java.lang.String[] contractionsStartingWithApostrophe
hyphenatedPrefixes
static java.lang.String[] hyphenatedPrefixes
hyphenatedPrefixesLookup
static java.util.HashSet<java.lang.String> hyphenatedPrefixesLookup
hyphenatedSuffixes
static java.lang.String[] hyphenatedSuffixes
hyphenatedSuffixesLookup
static java.util.HashSet<java.lang.String> hyphenatedSuffixesLookup
hyphenatedWords
static java.lang.String[] hyphenatedWords
hyphenatedWordsLookup
static java.util.HashSet<java.lang.String> hyphenatedWordsLookup
MINUS_OR_HYPHEN
static char MINUS_OR_HYPHEN
HyphenatedPTB
public HyphenatedPTB()
lenOfFirstTokenInContraction
static int lenOfFirstTokenInContraction(java.lang.String s)
- Parameters:
s
-
- Returns:
- See Also:
isMiddleOfContraction
isContractionThatStartsWithApostrophe
static boolean isContractionThatStartsWithApostrophe(int currentPosition,
java.lang.String textSegment)
main
public static void main(java.lang.String[] args)
tokenLengthCheckingForHyphenatedTerms
public static int tokenLengthCheckingForHyphenatedTerms(java.lang.String lowerCasedString)
- Hyphenated words on a fixed list (hyphenatedWordsLookup) are never split.
Words built with known hyphenated affixes are also kept together. Some made-up examples:
chronic-itis 1 suffix
mega-huge 1 prefix
e-game-fest 1 prefix and 1 suffix
salon-o-torium 1 suffix that contains 2 hyphens
urban-esque-wise 2 suffixes
- Parameters:
lowerCasedString
- the lower-cased text to check. It may contain more than one hyphen, because a suffix such as "-o-torium" itself contains hyphens.
- Returns:
- the number of characters to keep together, as far as can be determined (see the "hyphen hyphen hyphen" case below).
Throws an exception if the input contains no hyphen.
A result of n does not necessarily mean the text should be split at the next (n+1) hyphen; that one needs to be rechecked.
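The affix rules described above can be illustrated with a small, self-contained sketch. This is not the cTAKES implementation: the class name HyphenKeepSketch, the method names keepLength and suffixLengthAt, the contents of the three lookup sets, and the choice of IllegalArgumentException are all assumptions made for illustration. It only shows one plausible way to obtain the "number of characters to keep" for the made-up examples listed in the description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names and list contents are assumptions, not the cTAKES source.
final class HyphenKeepSketch {
    // Stand-ins for hyphenatedWordsLookup, hyphenatedPrefixesLookup, hyphenatedSuffixesLookup.
    static final Set<String> WORDS = new HashSet<>(Arrays.asList("x-ray", "follow-up"));
    static final Set<String> PREFIXES = new HashSet<>(Arrays.asList("mega-", "e-"));
    static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("-itis", "-fest", "-esque", "-wise", "-o-torium"));

    // Returns how many leading characters of the lower-cased input should stay together.
    static int keepLength(String lowerCased) {
        int firstHyphen = lowerCased.indexOf('-');
        if (firstHyphen < 0) {
            // Exception type is an assumption; the Javadoc only says an exception is thrown.
            throw new IllegalArgumentException("no hyphen in: " + lowerCased);
        }
        if (WORDS.contains(lowerCased)) {
            return lowerCased.length(); // whole word is on the fixed list
        }
        int keep = firstHyphen; // by default, keep only the text before the first hyphen
        int pos = firstHyphen;  // index of the hyphen currently being examined
        // A known prefix (text up to and including the first hyphen) pulls in the next segment.
        if (PREFIXES.contains(lowerCased.substring(0, firstHyphen + 1))) {
            int next = lowerCased.indexOf('-', firstHyphen + 1);
            keep = next < 0 ? lowerCased.length() : next;
            pos = next;
        }
        // Each known suffix starting at the current hyphen extends the kept span.
        while (pos >= 0 && pos < lowerCased.length()) {
            int len = suffixLengthAt(lowerCased, pos);
            if (len == 0) {
                break;
            }
            keep = pos + len;
            pos = keep;
        }
        return keep;
    }

    // Length of the longest known suffix starting at position, or 0 if none matches.
    // A match must end at the end of the string or right before another hyphen.
    static int suffixLengthAt(String s, int position) {
        int best = 0;
        for (String suffix : SUFFIXES) {
            if (suffix.length() > best && s.startsWith(suffix, position)) {
                int end = position + suffix.length();
                if (end == s.length() || s.charAt(end) == '-') {
                    best = suffix.length();
                }
            }
        }
        return best;
    }
}
```

Under these assumed lists, every example from the description is kept whole (for instance, keepLength("salon-o-torium") covers the full 14 characters), while an input with no known affix, such as "urban-sprawl", is kept only up to its first hyphen.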
lenIfHyphenatedSuffix
static int lenIfHyphenatedSuffix(java.lang.String lowerCasedString,
int position)