ContractionsPTB

org.apache.ctakes.core.nlp.tokenizer
Class ContractionsPTB

java.lang.Object
  org.apache.ctakes.core.nlp.tokenizer.ContractionsPTB

public class ContractionsPTB
extends java.lang.Object

Author:: Mayo Clinic

Field Summary
`(package private) static ContractionResult`	`contractionResult`
`(package private) static java.lang.String[]`	`contractionsStartingWithApostrophe`
`(package private) static java.lang.String`	`lettersAfterApostropheForMiddleOfContraction`
`(package private) static int[]`	`MultiTokenWordLenToken1`
`(package private) static int[]`	`MultiTokenWordLenToken2`
`(package private) static int[]`	`MultiTokenWordLenToken3`
`(package private) static java.lang.String[]`	`MultiTokenWords`
`(package private) static java.util.HashMap<java.lang.String,java.lang.Integer>`	`MultiTokenWordsLookup`
`(package private) static java.lang.String[]`	`possibleContractionEndings`

Constructor Summary
`ContractionsPTB()`

Method Summary
`(package private) static boolean`	`allDigits(java.lang.String s)`
`(package private) static boolean`	`breakAtApostrophe(java.lang.String s, int positionOfApostropheToTest)` Assumes the apostrophe is not the first character (that case is handled elsewhere).
`(package private) static int`	`getLenContractionToken(int currentPosition, java.lang.String lowerCasedText)`
`static ContractionResult`	`getLengthIfNextApostIsMiddleOfContraction(int position, int nextNonLetterDigit, java.lang.String lowerCasedText)` Determines whether the text starting at 'position' within 'lowerCasedText' is the start of a contraction such as "should've", "hasn't", or "it's" by checking that there is a letter before the apostrophe and the appropriate letters after it (or, in the case of "n't", that the letter before the apostrophe is an 'n'). Note that if the text starting at 'position' is something like "n't", which isn't a complete word, this method returns null.
`(package private) static boolean`	`isContractionThatStartsWithApostrophe(int currentPosition, java.lang.String lowerCasedText)`
`(package private) static int`	`lenOfFirstTokenInContraction(java.lang.String s)`
`(package private) static int`	`lenOfSecondTokenInContraction(java.lang.String s)`
`(package private) static int`	`lenOfThirdTokenInContraction(java.lang.String s)`
`static void`	`main(java.lang.String[] args)`
`(package private) static int`	`tokenLengthCheckingForSingleQuoteWordsToKeepTogether(java.lang.String lowerCasedText)` For a word like 80's, P'yongyang, James', Sean's, 80's-like, or 80's-esque (or can't or haven't, which are to be split), determines whether the single quote (apostrophe) needs to be kept with the surrounding letters/numbers, and how to handle a hyphenated suffix if a hyphen follows.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

contractionResult

static ContractionResult contractionResult

MultiTokenWords

static java.lang.String[] MultiTokenWords

MultiTokenWordLenToken1

static int[] MultiTokenWordLenToken1

MultiTokenWordLenToken2

static int[] MultiTokenWordLenToken2

MultiTokenWordLenToken3

static int[] MultiTokenWordLenToken3

MultiTokenWordsLookup

static java.util.HashMap<java.lang.String,java.lang.Integer> MultiTokenWordsLookup

possibleContractionEndings

static java.lang.String[] possibleContractionEndings

lettersAfterApostropheForMiddleOfContraction

static java.lang.String lettersAfterApostropheForMiddleOfContraction

contractionsStartingWithApostrophe

static java.lang.String[] contractionsStartingWithApostrophe

Constructor Detail

ContractionsPTB

public ContractionsPTB()

Method Detail

getLengthIfNextApostIsMiddleOfContraction

public static ContractionResult getLengthIfNextApostIsMiddleOfContraction(int position,
                                                                          int nextNonLetterDigit,
                                                                          java.lang.String lowerCasedText)

Determines whether the text starting at 'position' within 'lowerCasedText' is the start of a contraction such as "should've", "hasn't", or "it's" by checking that there is a letter before the apostrophe and the appropriate letters after it (or, in the case of "n't", that the letter before the apostrophe is an 'n'). Note that if the text starting at 'position' is something like "n't", which isn't a complete word, this method returns null.

Parameters:: position - index of the first character of the next token; lowerCasedText - the text that position indexes into
Returns:: the length of the WordToken part of the contraction. Note that this is not always the position of the apostrophe. For example, "can't" is tokenized as "ca" and "n't", so the length is 2. For "it's", the length is also 2.
See Also:: for handling contractions like "cannot" that don't have an apostrophe
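
The length rule described above can be illustrated with a minimal sketch. This is not the class's actual implementation: the class name `ContractionSketch`, the method `wordTokenLength`, the `OptionalInt` return type, and the list of endings are all hypothetical, and the sketch looks at a single lower-cased word rather than a position within a longer text. An empty result stands in for the null return described above.

```java
import java.util.OptionalInt;

final class ContractionSketch {
    // Hypothetical endings allowed after the apostrophe; the real class keeps its own lists.
    private static final String[] AFTER_APOSTROPHE = {"s", "ve", "re", "ll", "d", "m"};

    /** Length of the word part before the contraction token, or empty if none is found. */
    static OptionalInt wordTokenLength(String lowerCasedWord) {
        int apos = lowerCasedWord.indexOf('\'');
        if (apos < 1) {
            return OptionalInt.empty(); // no apostrophe, or apostrophe is the first character
        }
        String after = lowerCasedWord.substring(apos + 1);
        char before = lowerCasedWord.charAt(apos - 1);
        // "n't": the 'n' before the apostrophe belongs to the second token ("ca" + "n't").
        if (after.equals("t") && before == 'n') {
            int len = apos - 1;
            // A bare "n't" has no word part, so it is not treated as a contraction.
            return len > 0 ? OptionalInt.of(len) : OptionalInt.empty();
        }
        if (!Character.isLetter(before)) {
            return OptionalInt.empty(); // e.g. "80's" is not handled by this rule
        }
        for (String ending : AFTER_APOSTROPHE) {
            if (after.equals(ending)) {
                return OptionalInt.of(apos); // word part ends right before the apostrophe
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"can't", "it's", "should've", "n't"}) {
            System.out.println(w + " -> " + wordTokenLength(w));
        }
    }
}
```

Under these assumptions, "can't" and "it's" both yield a length of 2, matching the examples given for the return value, while a bare "n't" yields no result.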

getLenContractionToken

static int getLenContractionToken(int currentPosition,
                                  java.lang.String lowerCasedText)

lenOfFirstTokenInContraction

static int lenOfFirstTokenInContraction(java.lang.String s)

Parameters:: s -
Returns:
See Also:: isMiddleOfContraction

lenOfSecondTokenInContraction

static int lenOfSecondTokenInContraction(java.lang.String s)

lenOfThirdTokenInContraction

static int lenOfThirdTokenInContraction(java.lang.String s)

isContractionThatStartsWithApostrophe

static boolean isContractionThatStartsWithApostrophe(int currentPosition,
                                                     java.lang.String lowerCasedText)

breakAtApostrophe

static boolean breakAtApostrophe(java.lang.String s,
                                 int positionOfApostropheToTest)

Assumes the apostrophe is not the first character (that case is handled elsewhere). Assumes s is lower case.

allDigits

static boolean allDigits(java.lang.String s)

tokenLengthCheckingForSingleQuoteWordsToKeepTogether

static int tokenLengthCheckingForSingleQuoteWordsToKeepTogether(java.lang.String lowerCasedText)

For a word like 80's, P'yongyang, James', Sean's, 80's-like, or 80's-esque (or can't or haven't, which are to be split), determines whether the single quote (apostrophe) needs to be kept with the surrounding letters/numbers, and how to handle a hyphenated suffix if a hyphen follows. Possessives are split. Note that words that start with an apostrophe, like 'Assad, are handled elsewhere.

Returns:: the length of text to keep together: the length up to the apostrophe, or up to the next breaking character (the space after the s for "80's "), or up to the end of a hyphenated suffix that should also remain attached, or -1
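
A rough sketch of the return-value cases listed above, under simplifying assumptions. The class `SingleQuoteSketch` and method `keepLength` are hypothetical, the rules cover only digit-based words such as "80's" and simple possessives, and treating -1 as "no rule matched" is an assumption about its meaning. Names like P'yongyang are not covered by this sketch.

```java
final class SingleQuoteSketch {
    /** True if s is non-empty and every character is a digit. */
    static boolean allDigits(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Length of leading text to keep as one token, or -1 if no sketch rule matches. */
    static int keepLength(String lowerCasedText) {
        int apos = lowerCasedText.indexOf('\'');
        if (apos < 1) {
            return -1; // no apostrophe, or a leading apostrophe (handled elsewhere)
        }
        // Find the next breaking character; here only whitespace counts (an assumption).
        int end = apos + 1;
        while (end < lowerCasedText.length()
                && !Character.isWhitespace(lowerCasedText.charAt(end))) {
            end++;
        }
        String before = lowerCasedText.substring(0, apos);
        String after = lowerCasedText.substring(apos + 1, end);
        // "80's" or "80's-like": keep everything, including the hyphenated suffix.
        if (allDigits(before) && (after.equals("s") || after.startsWith("s-"))) {
            return end;
        }
        // Possessives such as "sean's" or "james'": split at the apostrophe.
        if (after.isEmpty() || after.equals("s")) {
            return apos;
        }
        return -1; // e.g. "can't", which is split by the contraction logic instead
    }

    public static void main(String[] args) {
        for (String t : new String[] {"80's ", "80's-like", "sean's", "james'", "can't"}) {
            System.out.println("\"" + t + "\" -> " + keepLength(t));
        }
    }
}
```

With these rules, "80's " keeps 4 characters (up to the space), "80's-like" keeps all 9, and "sean's" keeps 4 (the length up to the apostrophe).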

main

public static void main(java.lang.String[] args)
