org.apache.ctakes.core.nlp.tokenizer
Class ContractionsPTB

java.lang.Object
  extended by org.apache.ctakes.core.nlp.tokenizer.ContractionsPTB

public class ContractionsPTB
extends java.lang.Object

Author:
Mayo Clinic

Field Summary
(package private) static ContractionResult contractionResult
           
(package private) static java.lang.String[] contractionsStartingWithApostrophe
           
(package private) static java.lang.String lettersAfterApostropheForMiddleOfContraction
           
(package private) static int[] MultiTokenWordLenToken1
           
(package private) static int[] MultiTokenWordLenToken2
           
(package private) static int[] MultiTokenWordLenToken3
           
(package private) static java.lang.String[] MultiTokenWords
           
(package private) static java.util.HashMap<java.lang.String,java.lang.Integer> MultiTokenWordsLookup
           
(package private) static java.lang.String[] possibleContractionEndings
           
 
Constructor Summary
ContractionsPTB()
           
 
Method Summary
(package private) static boolean allDigits(java.lang.String s)
           
(package private) static boolean breakAtApostrophe(java.lang.String s, int positionOfApostropheToTest)
          Assumes the apostrophe is not the first character (that case is handled elsewhere).
(package private) static int getLenContractionToken(int currentPosition, java.lang.String lowerCasedText)
           
static ContractionResult getLengthIfNextApostIsMiddleOfContraction(int position, int nextNonLetterDigit, java.lang.String lowerCasedText)
          Determines whether the text starting at 'position' within 'lowerCasedText' is the start of a contraction such as "should've", "hasn't", or "it's" by checking that there is a letter before the apostrophe and the appropriate letters after the apostrophe (or, in the case of "n't", that the letter before the apostrophe is an 'n'). Note that if the text starting at 'position' is something like "n't", which isn't a complete word, returns null.
(package private) static boolean isContractionThatStartsWithApostrophe(int currentPosition, java.lang.String lowerCasedText)
           
(package private) static int lenOfFirstTokenInContraction(java.lang.String s)
           
(package private) static int lenOfSecondTokenInContraction(java.lang.String s)
           
(package private) static int lenOfThirdTokenInContraction(java.lang.String s)
           
static void main(java.lang.String[] args)
           
(package private) static int tokenLengthCheckingForSingleQuoteWordsToKeepTogether(java.lang.String lowerCasedText)
          For a word like 80's, P'yongyang, James', Sean's, 80's-like, or 80's-esque (or can't or haven't, which are to be split), determines whether the single quote (apostrophe) needs to be kept with the surrounding letters/numbers, and what to do about a hyphenated suffix if a hyphen follows.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

contractionResult

static ContractionResult contractionResult

MultiTokenWords

static java.lang.String[] MultiTokenWords

MultiTokenWordLenToken1

static int[] MultiTokenWordLenToken1

MultiTokenWordLenToken2

static int[] MultiTokenWordLenToken2

MultiTokenWordLenToken3

static int[] MultiTokenWordLenToken3

MultiTokenWordsLookup

static java.util.HashMap<java.lang.String,java.lang.Integer> MultiTokenWordsLookup

possibleContractionEndings

static java.lang.String[] possibleContractionEndings

lettersAfterApostropheForMiddleOfContraction

static java.lang.String lettersAfterApostropheForMiddleOfContraction

contractionsStartingWithApostrophe

static java.lang.String[] contractionsStartingWithApostrophe
Constructor Detail

ContractionsPTB

public ContractionsPTB()
Method Detail

getLengthIfNextApostIsMiddleOfContraction

public static ContractionResult getLengthIfNextApostIsMiddleOfContraction(int position,
                                                                          int nextNonLetterDigit,
                                                                          java.lang.String lowerCasedText)
Determines whether the text starting at 'position' within 'lowerCasedText' is the start of a contraction such as "should've", "hasn't", or "it's" by checking that there is a letter before the apostrophe and the appropriate letters after the apostrophe (or, in the case of "n't", that the letter before the apostrophe is an 'n'). Note that if the text starting at 'position' is something like "n't", which isn't a complete word, returns null.

Parameters:
position - first char of next token
lowerCasedText - the text into which position is an index
Returns:
the length of the WordToken part of the contraction. Note that this is not always the position of the apostrophe. For example, "can't" is tokenized as "ca" and "n't", so the length is 2. For "it's", the length is also 2.
See Also:
for handling contractions like "cannot" that don't have an apostrophe
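
The splitting rule described above can be illustrated with a minimal sketch. It is not the cTAKES implementation: the class name, the method name wordTokenLength, and the list of endings are hypothetical, and it returns a plain int (-1 when no contraction is found) where the documented method returns a ContractionResult or null.

```java
import java.util.Arrays;
import java.util.List;

final class ContractionSplitSketch {

    // Hypothetical list of letters that may follow the apostrophe;
    // the real class keeps its own fields for this.
    private static final List<String> ENDINGS_AFTER_APOSTROPHE =
            Arrays.asList("s", "ve", "re", "ll", "d", "m");

    /**
     * Returns the length of the word-token part of a lower-cased contraction,
     * or -1 if the word is not a contraction with a mid-word apostrophe.
     */
    static int wordTokenLength(String lowerCasedWord) {
        int apostrophe = lowerCasedWord.indexOf('\'');
        if (apostrophe < 1) {
            return -1; // no apostrophe, or the apostrophe is the first character
        }
        String after = lowerCasedWord.substring(apostrophe + 1);
        char before = lowerCasedWord.charAt(apostrophe - 1);

        // "n't": the split falls before the 'n', and something must precede it.
        if (after.equals("t") && before == 'n') {
            return apostrophe >= 2 ? apostrophe - 1 : -1;
        }
        // Other endings: the split falls at the apostrophe itself.
        if (ENDINGS_AFTER_APOSTROPHE.contains(after) && Character.isLetter(before)) {
            return apostrophe;
        }
        return -1;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"can't", "it's", "should've", "n't"}) {
            System.out.println(w + " -> " + wordTokenLength(w));
        }
    }
}
```

In this sketch "can't" and "it's" both give 2, matching the lengths stated in the Returns text, while a bare "n't" gives -1 in place of null.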

getLenContractionToken

static int getLenContractionToken(int currentPosition,
                                  java.lang.String lowerCasedText)

lenOfFirstTokenInContraction

static int lenOfFirstTokenInContraction(java.lang.String s)
Parameters:
s -
Returns:
See Also:
isMiddleOfContraction

lenOfSecondTokenInContraction

static int lenOfSecondTokenInContraction(java.lang.String s)

lenOfThirdTokenInContraction

static int lenOfThirdTokenInContraction(java.lang.String s)
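
The fields MultiTokenWords, MultiTokenWordLenToken1 through MultiTokenWordLenToken3, and MultiTokenWordsLookup suggest a parallel-array lookup behind the three lenOf...TokenInContraction methods. The following is a minimal sketch of that idea, not the actual code: the class name, the helper lenOfToken, the -1 result for unknown words, and the three sample entries (chosen to follow Penn Treebank tokenization conventions) are all assumptions, since this page does not show the field contents.

```java
import java.util.HashMap;
import java.util.Map;

final class MultiTokenWordSketch {

    // Hypothetical entries; the real MultiTokenWords contents are not shown here.
    static final String[] WORDS = {"cannot", "gonna", "whaddya"};
    // Parallel arrays: length of each token the word is split into (0 = no such token).
    static final int[] LEN_TOKEN_1 = {3, 3, 3}; // "can", "gon", "wha"
    static final int[] LEN_TOKEN_2 = {3, 2, 2}; // "not", "na",  "dd"
    static final int[] LEN_TOKEN_3 = {0, 0, 2}; //               "ya"

    // Maps each word to its index in the parallel arrays.
    static final Map<String, Integer> LOOKUP = new HashMap<>();
    static {
        for (int i = 0; i < WORDS.length; i++) {
            LOOKUP.put(WORDS[i], i);
        }
    }

    /** Returns the token length for a known word, or -1 if the word is not listed. */
    static int lenOfToken(String lowerCasedWord, int[] lengths) {
        Integer index = LOOKUP.get(lowerCasedWord);
        return index == null ? -1 : lengths[index];
    }

    public static void main(String[] args) {
        System.out.println(lenOfToken("cannot", LEN_TOKEN_1));
        System.out.println(lenOfToken("cannot", LEN_TOKEN_2));
        System.out.println(lenOfToken("whaddya", LEN_TOKEN_3));
    }
}
```

Keeping one array per token position means a word that splits into two tokens simply carries a 0 in the third array.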

isContractionThatStartsWithApostrophe

static boolean isContractionThatStartsWithApostrophe(int currentPosition,
                                                     java.lang.String lowerCasedText)

breakAtApostrophe

static boolean breakAtApostrophe(java.lang.String s,
                                 int positionOfApostropheToTest)
Assumes the apostrophe is not the first character (that case is handled elsewhere). Assumes s is lower case.


allDigits

static boolean allDigits(java.lang.String s)

tokenLengthCheckingForSingleQuoteWordsToKeepTogether

static int tokenLengthCheckingForSingleQuoteWordsToKeepTogether(java.lang.String lowerCasedText)
For a word like 80's, P'yongyang, James', Sean's, 80's-like, or 80's-esque (or can't or haven't, which are to be split), determines whether the single quote (apostrophe) needs to be kept with the surrounding letters/numbers, and what to do about a hyphenated suffix if a hyphen follows. Possessives are split. Note that words that start with an apostrophe, like 'Assad, are handled elsewhere.

Returns:
the length of text to keep together: the length up to the apostrophe, up to the next breaking character (for example, the space after the s in "80's "), or up to the end of a hyphenated suffix that should also remain attached; or -1
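
One possible reading of the cases listed above is sketched below. This is an illustration under stated assumptions, not the cTAKES logic: the class and method names, the set of contraction endings, the definition of a breaking character, and the choice of when to return -1 are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class KeepTogetherSketch {

    // Hypothetical endings treated as contractions (left to the contraction logic).
    private static final List<String> CONTRACTION_ENDINGS =
            Arrays.asList("ve", "re", "ll", "d", "m");

    /** True if s is non-empty and every character is a digit. */
    static boolean allDigits(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Assumption: letters, digits, apostrophes, and hyphens do not break a token. */
    static boolean isNonBreaking(char c) {
        return Character.isLetterOrDigit(c) || c == '\'' || c == '-';
    }

    static int lengthToKeep(String lowerCasedText) {
        // Find the next breaking character (or the end of the text).
        int end = 0;
        while (end < lowerCasedText.length() && isNonBreaking(lowerCasedText.charAt(end))) {
            end++;
        }
        int apostrophe = lowerCasedText.indexOf('\'');
        if (apostrophe < 1 || apostrophe >= end) {
            return -1; // no apostrophe in this token, or a leading one (handled elsewhere)
        }
        String before = lowerCasedText.substring(0, apostrophe);
        String after = lowerCasedText.substring(apostrophe + 1, end);

        if (allDigits(before)) {
            return end; // "80's", "80's-like": keep everything, including the suffix
        }
        if ((after.equals("t") && before.endsWith("n")) || CONTRACTION_ENDINGS.contains(after)) {
            return -1; // "can't", "should've": to be split as contractions
        }
        if (after.isEmpty() || after.equals("s")) {
            return apostrophe; // "james'", "sean's": split the possessive at the apostrophe
        }
        return end; // "p'yongyang": keep the whole word together
    }

    public static void main(String[] args) {
        for (String t : new String[] {"80's ", "80's-like", "sean's", "p'yongyang", "can't"}) {
            System.out.println("[" + t + "] -> " + lengthToKeep(t));
        }
    }
}
```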

main

public static void main(java.lang.String[] args)