org.apache.ctakes.core.nlp.tokenizer
Class ContractionsPTB
java.lang.Object
org.apache.ctakes.core.nlp.tokenizer.ContractionsPTB
public class ContractionsPTB
- extends java.lang.Object
- Author:
- Mayo Clinic
Method Summary
(package private) static boolean allDigits(java.lang.String s)
(package private) static boolean breakAtApostrophe(java.lang.String s, int positionOfApostropheToTest)
Assumes apostrophe is not first character....
(package private) static int getLenContractionToken(int currentPosition, java.lang.String lowerCasedText)
static ContractionResult getLengthIfNextApostIsMiddleOfContraction(int position, int nextNonLetterDigit, java.lang.String lowerCasedText)
Determine if the text starting at 'position' within 'text' is the start of a
contraction such as "should've" or "hasn't" or "it's" by looking at whether
there is a letter before the apostrophe, and the appropriate letters after the
apostrophe (or in the case of "n't", verify the letter before is an 'n').
Note that if the text starting at 'position' is something like "n't" which
isn't a complete word, returns null.
(package private) static boolean isContractionThatStartsWithApostrophe(int currentPosition, java.lang.String lowerCasedText)
(package private) static int lenOfFirstTokenInContraction(java.lang.String s)
(package private) static int lenOfSecondTokenInContraction(java.lang.String s)
(package private) static int lenOfThirdTokenInContraction(java.lang.String s)
static void main(java.lang.String[] args)
(package private) static int tokenLengthCheckingForSingleQuoteWordsToKeepTogether(java.lang.String lowerCasedText)
for a word like 80's or P'yongyang or James' or Sean's or 80's-like or 80's-esque
(or can't or haven't, which are to be split)
determine whether the singlequote(apostrophe)
needs to be kept with the surrounding letters/numbers
and what to do about hyphenated afterwards if there is a hyphen after....
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
contractionResult
static ContractionResult contractionResult
MultiTokenWords
static java.lang.String[] MultiTokenWords
MultiTokenWordLenToken1
static int[] MultiTokenWordLenToken1
MultiTokenWordLenToken2
static int[] MultiTokenWordLenToken2
MultiTokenWordLenToken3
static int[] MultiTokenWordLenToken3
MultiTokenWordsLookup
static java.util.HashMap<java.lang.String,java.lang.Integer> MultiTokenWordsLookup
possibleContractionEndings
static java.lang.String[] possibleContractionEndings
lettersAfterApostropheForMiddleOfContraction
static java.lang.String lettersAfterApostropheForMiddleOfContraction
contractionsStartingWithApostrophe
static java.lang.String[] contractionsStartingWithApostrophe
ContractionsPTB
public ContractionsPTB()
getLengthIfNextApostIsMiddleOfContraction
public static ContractionResult getLengthIfNextApostIsMiddleOfContraction(int position,
int nextNonLetterDigit,
java.lang.String lowerCasedText)
- Determines whether the text starting at 'position' within 'lowerCasedText' is the
start of a contraction such as "should've", "hasn't", or "it's". It checks that
there is a letter before the apostrophe and that the appropriate letters follow
the apostrophe (or, in the case of "n't", that the letter before the apostrophe is an 'n').
Note that if the text starting at 'position' is something like "n't", which
isn't a complete word, this method returns null.
- Parameters:
position
- first char of next token
lowerCasedText
- text into which parameter position is an index
- Returns:
- the length of the WordToken part of the contraction. Note that this is not always the position of the
apostrophe. For example, for "can't", which is tokenized as "ca" and "n't", the
length is 2. For "it's", the length is also 2.
- See Also:
for handling contractions like "cannot" that don't have an apostrophe
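The return value described above can be illustrated with a small standalone sketch. This is not the cTAKES implementation: the class name ContractionSplitSketch, the method firstTokenLength, and the list of endings are all hypothetical, and the sketch returns -1 where the real method returns null. It only shows why the first token of "can't" has length 2 rather than 3.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names and the ending list are assumptions, not cTAKES code. */
final class ContractionSplitSketch {

    // Assumed set of letter sequences that may follow the apostrophe.
    private static final List<String> ENDINGS_AFTER_APOSTROPHE =
            Arrays.asList("s", "ve", "re", "ll", "d", "m");

    /**
     * Returns the length of the first (word) token of a single lower-cased
     * contraction, or -1 if the word is not recognised as one.
     */
    static int firstTokenLength(String lowerCasedWord) {
        int apostrophe = lowerCasedWord.indexOf('\'');
        if (apostrophe <= 0) {
            return -1; // no apostrophe, or the apostrophe is the first character
        }
        String after = lowerCasedWord.substring(apostrophe + 1);
        char before = lowerCasedWord.charAt(apostrophe - 1);

        // "n't": the second token starts at the 'n', one character before the apostrophe.
        if (after.equals("t") && before == 'n') {
            int length = apostrophe - 1;
            return length > 0 ? length : -1; // a bare "n't" is not a complete word
        }
        // Other endings: the first token ends right at the apostrophe.
        if (Character.isLetter(before) && ENDINGS_AFTER_APOSTROPHE.contains(after)) {
            return apostrophe;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstTokenLength("can't"));     // 2  ("ca" + "n't")
        System.out.println(firstTokenLength("it's"));      // 2  ("it" + "'s")
        System.out.println(firstTokenLength("should've")); // 6  ("should" + "'ve")
        System.out.println(firstTokenLength("n't"));       // -1 (not a complete word)
    }
}
```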
getLenContractionToken
static int getLenContractionToken(int currentPosition,
java.lang.String lowerCasedText)
lenOfFirstTokenInContraction
static int lenOfFirstTokenInContraction(java.lang.String s)
- Parameters:
s
- See Also:
isMiddleOfContraction
lenOfSecondTokenInContraction
static int lenOfSecondTokenInContraction(java.lang.String s)
lenOfThirdTokenInContraction
static int lenOfThirdTokenInContraction(java.lang.String s)
isContractionThatStartsWithApostrophe
static boolean isContractionThatStartsWithApostrophe(int currentPosition,
java.lang.String lowerCasedText)
breakAtApostrophe
static boolean breakAtApostrophe(java.lang.String s,
int positionOfApostropheToTest)
- Assumes the apostrophe is not the first character; that case is handled elsewhere.
Assumes s is lower case.
allDigits
static boolean allDigits(java.lang.String s)
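The source gives no description for allDigits. Going by the name alone, a check of this kind might look like the sketch below. The class name AllDigitsSketch is hypothetical, and the handling of null and empty input is an assumption, not documented behaviour.

```java
/** Illustrative sketch only; behaviour for null/empty input is an assumption. */
final class AllDigitsSketch {

    /** Returns true if s is non-empty and every character is a digit. */
    static boolean allDigits(String s) {
        if (s == null || s.isEmpty()) {
            return false; // assumption: nothing to check counts as "not all digits"
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allDigits("80"));  // true
        System.out.println(allDigits("80s")); // false
    }
}
```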
tokenLengthCheckingForSingleQuoteWordsToKeepTogether
static int tokenLengthCheckingForSingleQuoteWordsToKeepTogether(java.lang.String lowerCasedText)
- For a word like 80's, P'yongyang, James', Sean's, 80's-like, or 80's-esque
(or can't or haven't, which are to be split),
determines whether the single quote (apostrophe)
needs to be kept with the surrounding letters/numbers,
and what to do about a hyphenated suffix if a hyphen follows.
For possessives, do split.
Note that things that start with an apostrophe, like 'Assad, are handled elsewhere.
- Returns:
- the length of how much to keep: the length up to the apostrophe, or up to the next breaking character (the space after the s for "80's "), or up to the end of a hyphenated suffix that should also remain attached, or -1
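A rough sketch of two of the cases described above, under stated assumptions: a digit run followed by 's (optionally with a hyphenated suffix) is kept whole, while a possessive is split at the apostrophe. The class KeepTogetherSketch and the method keepLength are hypothetical names. The sketch returns -1 for everything else, including cases such as P'yongyang that the real method handles, so it should not be read as the actual cTAKES logic.

```java
/** Illustrative sketch only; covers just the decade ("80's") and possessive cases. */
final class KeepTogetherSketch {

    /** Returns how many leading characters to keep as one token, or -1 if undecided. */
    static int keepLength(String lowerCasedText) {
        // End of the first whitespace-delimited chunk.
        int end = 0;
        while (end < lowerCasedText.length()
                && !Character.isWhitespace(lowerCasedText.charAt(end))) {
            end++;
        }
        int apostrophe = lowerCasedText.indexOf('\'');
        if (apostrophe <= 0 || apostrophe >= end) {
            return -1; // no apostrophe inside the first chunk, or it comes first
        }
        String before = lowerCasedText.substring(0, apostrophe);
        String after = lowerCasedText.substring(apostrophe + 1, end);

        // Decade-style forms: keep "80's" whole, plus a hyphenated suffix like "-like".
        if (before.chars().allMatch(Character::isDigit)) {
            boolean plainS = after.equals("s");
            boolean withSuffix = after.startsWith("s-")
                    && after.length() > 2
                    && after.substring(2).chars().allMatch(Character::isLetter);
            return (plainS || withSuffix) ? end : -1;
        }
        // Possessives ("sean's", "james'"): split, keeping only up to the apostrophe.
        if (before.chars().allMatch(Character::isLetter)
                && (after.isEmpty() || after.equals("s"))) {
            return apostrophe;
        }
        return -1; // everything else is outside this sketch
    }

    public static void main(String[] args) {
        System.out.println(keepLength("80's "));     // 4
        System.out.println(keepLength("80's-like")); // 9
        System.out.println(keepLength("sean's"));    // 4
        System.out.println(keepLength("james'"));    // 5
    }
}
```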
main
public static void main(java.lang.String[] args)