java.lang.Object
  org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB
public class TokenizerPTB
A class used to break natural-language text into tokens following Penn Treebank (PTB) rules. See Supplementary Guidelines for ETTB 2.0 dated April 6th, 2009. The token markup is kept external to the text rather than embedded in it: character offsets identify the boundaries of each token.
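To illustrate what external, offset-based token markup means, here is a minimal self-contained sketch. It does not use TokenizerPTB and does not implement the PTB rules; it splits on whitespace only. The names `StandoffTokenSketch`, `Span`, and `whitespaceSpans` are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class StandoffTokenSketch {

    /** Hypothetical token span: begin is inclusive, end is exclusive. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Splits on whitespace only. This is NOT the PTB rule set. */
    static List<Span> whitespaceSpans(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int begin = i;
            // Consume the token characters.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > begin) {
                spans.add(new Span(begin, i));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "No acute distress.";
        // The text itself is never modified; tokens are recovered from offsets.
        for (Span s : whitespaceSpans(text)) {
            System.out.println(s.begin + "-" + s.end + " " + text.substring(s.begin, s.end));
        }
    }
}
```

Because the spans live outside the text, several annotation layers can refer to the same unmodified string without interfering with each other.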
Constructor Summary | |
---|---|
TokenizerPTB() | Constructor |
Method Summary | |
---|---|
int | findFirstCharOfNextToken(String s, int startPosition) |
static void | main(String[] args) |
List<?> | tokenize(String text): Tokenize a string that is assumed to be the entire document (or at least to start at 0) |
List<?> | tokenizeTextSegment(org.apache.uima.jcas.JCas jcas, String textSegment, int offsetAdjustment, boolean includeTextNotJustOffsets): Tokenize text that starts at offset offsetAdjustment within the complete text |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public TokenizerPTB()
Method Detail |
---|
public List<?> tokenizeTextSegment(org.apache.uima.jcas.JCas jcas, String textSegment, int offsetAdjustment, boolean includeTextNotJustOffsets)
Parameters:
textSegment - the text to tokenize
offsetAdjustment - the amount to add to every offset within textSegment so that it becomes an offset from the start of the text for the jcas
includeTextNotJustOffsets - whether to copy the text covered by each token into the token object itself
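The role of offsetAdjustment and includeTextNotJustOffsets can be sketched without cTAKES or UIMA on the classpath. The example below is an assumption-laden illustration, not the library's implementation: `OffsetAdjustmentSketch`, `Token`, and `tokenizeSegment` are hypothetical names, and the splitting is plain whitespace splitting rather than PTB tokenization.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetAdjustmentSketch {

    /** Hypothetical token holder; coveredText is null when only offsets are requested. */
    static final class Token {
        final int begin;
        final int end;
        final String coveredText;

        Token(int begin, int end, String coveredText) {
            this.begin = begin;
            this.end = end;
            this.coveredText = coveredText;
        }
    }

    /** Whitespace-only splitting (not PTB); offsets are shifted by offsetAdjustment. */
    static List<Token> tokenizeSegment(String segment, int offsetAdjustment, boolean includeText) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < segment.length()) {
            while (i < segment.length() && Character.isWhitespace(segment.charAt(i))) {
                i++;
            }
            int begin = i;
            while (i < segment.length() && !Character.isWhitespace(segment.charAt(i))) {
                i++;
            }
            if (i > begin) {
                String covered = includeText ? segment.substring(begin, i) : null;
                // Shift segment-relative offsets so they index into the full document.
                tokens.add(new Token(begin + offsetAdjustment, i + offsetAdjustment, covered));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String document = "Chief complaint: chest pain. Denies fever.";
        int segmentStart = document.indexOf("Denies");
        String segment = document.substring(segmentStart);
        for (Token t : tokenizeSegment(segment, segmentStart, true)) {
            // Adjusted offsets address the document, not the segment.
            System.out.println(t.begin + "-" + t.end + " " + document.substring(t.begin, t.end));
        }
    }
}
```

Passing the segment's start position as the adjustment keeps every token addressable against the complete text, which is what allows a document to be tokenized one segment at a time.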
public List<?> tokenize(String text)
Parameters:
text - the String to tokenize
public int findFirstCharOfNextToken(String s, int startPosition)
public static void main(String[] args)