org.apache.ctakes.core.nlp.tokenizer
Class HyphenatedPTB

java.lang.Object
  extended by org.apache.ctakes.core.nlp.tokenizer.HyphenatedPTB

public class HyphenatedPTB
extends java.lang.Object

Author:
Mayo Clinic

Field Summary
(package private) static java.lang.String[] contractionsStartingWithApostrophe
           
(package private) static java.lang.String[] hyphenatedPrefixes
           
(package private) static java.util.HashSet<java.lang.String> hyphenatedPrefixesLookup
           
(package private) static java.lang.String[] hyphenatedSuffixes
           
(package private) static java.util.HashSet<java.lang.String> hyphenatedSuffixesLookup
           
(package private) static java.lang.String[] hyphenatedWords
           
(package private) static java.util.HashSet<java.lang.String> hyphenatedWordsLookup
           
(package private) static java.lang.String lettersAfterApostropheForMiddleOfContraction
           
(package private) static char MINUS_OR_HYPHEN
           
(package private) static int[] MultiTokenWordLenToken1
           
(package private) static int[] MultiTokenWordLenToken2
           
(package private) static java.lang.String[] MultiTokenWords
           
(package private) static java.util.HashMap<java.lang.String,java.lang.Integer> MultiTokenWordsLookup
           
(package private) static java.lang.String[] possibleContractionEndings
           
 
Constructor Summary
HyphenatedPTB()
           
 
Method Summary
(package private) static boolean isContractionThatStartsWithApostrophe(int currentPosition, java.lang.String textSegment)
           
(package private) static int lenIfHyphenatedSuffix(java.lang.String lowerCasedString, int position)
           
(package private) static int lenOfFirstTokenInContraction(java.lang.String s)
           
static void main(java.lang.String[] args)
           
static int tokenLengthCheckingForHyphenatedTerms(java.lang.String lowerCasedString)
          Determines how much of a hyphenated term to keep together as one token. A fixed list of hyphenated words is never split (hyphenatedWordsLookup), and words built with affixes are also kept together. Made-up examples: chronic-itis (1 suffix), mega-huge (1 prefix), e-game-fest (1 prefix and 1 suffix), salon-o-torium (1 suffix that contains 2 hyphens), urban-esque-wise (2 suffixes).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MultiTokenWords

static java.lang.String[] MultiTokenWords

MultiTokenWordLenToken1

static int[] MultiTokenWordLenToken1

MultiTokenWordLenToken2

static int[] MultiTokenWordLenToken2

MultiTokenWordsLookup

static java.util.HashMap<java.lang.String,java.lang.Integer> MultiTokenWordsLookup

possibleContractionEndings

static java.lang.String[] possibleContractionEndings

lettersAfterApostropheForMiddleOfContraction

static java.lang.String lettersAfterApostropheForMiddleOfContraction

contractionsStartingWithApostrophe

static java.lang.String[] contractionsStartingWithApostrophe

hyphenatedPrefixes

static java.lang.String[] hyphenatedPrefixes

hyphenatedPrefixesLookup

static java.util.HashSet<java.lang.String> hyphenatedPrefixesLookup

hyphenatedSuffixes

static java.lang.String[] hyphenatedSuffixes

hyphenatedSuffixesLookup

static java.util.HashSet<java.lang.String> hyphenatedSuffixesLookup

hyphenatedWords

static java.lang.String[] hyphenatedWords

hyphenatedWordsLookup

static java.util.HashSet<java.lang.String> hyphenatedWordsLookup

MINUS_OR_HYPHEN

static char MINUS_OR_HYPHEN
Constructor Detail

HyphenatedPTB

public HyphenatedPTB()
Method Detail

lenOfFirstTokenInContraction

static int lenOfFirstTokenInContraction(java.lang.String s)
Parameters:
s -
Returns:
See Also:
isMiddleOfContraction

isContractionThatStartsWithApostrophe

static boolean isContractionThatStartsWithApostrophe(int currentPosition,
                                                     java.lang.String textSegment)

main

public static void main(java.lang.String[] args)

tokenLengthCheckingForHyphenatedTerms

public static int tokenLengthCheckingForHyphenatedTerms(java.lang.String lowerCasedString)
Determines how much of a hyphenated term to keep together as one token. A fixed list of hyphenated words is never split (hyphenatedWordsLookup), and words built with affixes are also kept together. Some made-up examples of words whose affixes keep them together:

chronic-itis: 1 suffix
mega-huge: 1 prefix
e-game-fest: 1 prefix and 1 suffix
salon-o-torium: 1 suffix that contains 2 hyphens
urban-esque-wise: 2 suffixes

Parameters:
lowerCasedString - the lower-cased text to check. It may contain more than 1 hyphen, because a suffix such as "-o-torium" itself contains a hyphen.
Returns:
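the number of characters to keep together, as far as can be determined (see the hyphen-hyphen-hyphen case in the method source). A result of n characters does not necessarily mean the text should be split at the next (n+1) hyphen; that hyphen needs to be rechecked. Throws an exception if the input contains no hyphen.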
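
The description above can be illustrated with a minimal sketch. It is not the cTAKES implementation: the class name HyphenKeepSketch, the lookup contents, the helper suffixLengthAt, the choice of IllegalArgumentException, and the rule of returning the index of the first hyphen when nothing matches are all assumptions made for illustration. The affix entries are taken from the made-up examples, and "x-ray" is a placeholder for the fixed word list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; NOT the actual HyphenatedPTB code.
final class HyphenKeepSketch {
    // Hypothetical lookup contents (the real lists are not shown on this page).
    static final Set<String> WORDS = new HashSet<>(Arrays.asList("x-ray"));
    static final Set<String> PREFIXES = new HashSet<>(Arrays.asList("mega-", "e-"));
    static final Set<String> SUFFIXES = new HashSet<>(
            Arrays.asList("-itis", "-fest", "-o-torium", "-esque", "-wise"));

    // Length of the longest known suffix starting at pos, or 0 if none.
    // A suffix only counts if it ends at the end of the text or at a hyphen.
    static int suffixLengthAt(String lower, int pos) {
        int best = 0;
        for (String suffix : SUFFIXES) {
            if (lower.startsWith(suffix, pos)) {
                int end = pos + suffix.length();
                boolean atBoundary = end == lower.length() || lower.charAt(end) == '-';
                if (atBoundary && suffix.length() > best) {
                    best = suffix.length();
                }
            }
        }
        return best;
    }

    // Number of leading characters to keep together as one token.
    static int keepLength(String lower) {
        int firstHyphen = lower.indexOf('-');
        if (firstHyphen < 0) {
            // Assumed exception type; the page only says "throws exception".
            throw new IllegalArgumentException("no hyphen in: " + lower);
        }
        if (WORDS.contains(lower)) {
            return lower.length(); // fixed list: never split
        }
        int keep = firstHyphen; // assumed default: keep only the text before the hyphen
        if (PREFIXES.contains(lower.substring(0, firstHyphen + 1))) {
            // Known prefix: keep through the word that follows it.
            int next = lower.indexOf('-', firstHyphen + 1);
            keep = (next < 0) ? lower.length() : next;
        }
        // Absorb any run of known suffixes that follows.
        int len;
        while (keep < lower.length() && (len = suffixLengthAt(lower, keep)) > 0) {
            keep += len;
        }
        return keep;
    }

    public static void main(String[] args) {
        String[] samples = {"chronic-itis", "mega-huge", "e-game-fest",
                "salon-o-torium", "urban-esque-wise", "left-handed"};
        for (String s : samples) {
            System.out.println(s + " -> keep " + keepLength(s));
        }
    }
}
```

With these hypothetical lists, each of the five made-up examples is kept whole, while a term with no known affix, such as "left-handed", is kept only up to its first hyphen. Matching the longest suffix first lets "-o-torium" count as one suffix even though it contains a hyphen.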

lenIfHyphenatedSuffix

static int lenIfHyphenatedSuffix(java.lang.String lowerCasedString,
                                 int position)