org.apache.ctakes.core.nlp.tokenizer
Class TokenizerPTB

java.lang.Object
  extended by org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB

public class TokenizerPTB
extends Object

Breaks natural-language text into tokens following Penn Treebank (PTB) rules. See Supplementary Guidelines for ETTB 2.0, dated April 6, 2009. Token markup is kept external to the text rather than embedded in it: each token's boundaries are identified by character offsets into the text.
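
To illustrate what external, offset-based markup means, here is a minimal self-contained sketch. It is not the cTAKES implementation: the class OffsetMarkupSketch, the Span type, and the method whitespaceSpans are hypothetical names, and plain whitespace splitting stands in for the PTB rules. The point is only that the source text is never modified and each token is a pair of character offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: whitespace splitting stands in for the PTB rules.
class OffsetMarkupSketch {

    /** A token recorded as [begin, end) character offsets into the original text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** Recovers the covered text from the untouched source string. */
        String coveredText(String source) {
            return source.substring(begin, end);
        }
    }

    /** Returns one Span per run of non-whitespace characters. */
    static List<Span> whitespaceSpans(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int begin = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > begin) {
                spans.add(new Span(begin, i));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "No acute  distress";
        for (Span s : whitespaceSpans(text)) {
            System.out.println(s.begin + "-" + s.end + " " + s.coveredText(text));
        }
    }
}
```

Because the spans refer back to the original string, the double space in the sample text is preserved and the covered text can always be recovered with substring.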

Author:
Mayo Clinic

Constructor Summary
TokenizerPTB()
          Constructor
 
Method Summary
 int findFirstCharOfNextToken(String s, int startPosition)
           
static void main(String[] args)
           
 List<?> tokenize(String text)
          Tokenizes a string that is assumed to be the entire document, or at least to begin at offset 0.
 List<?> tokenizeTextSegment(org.apache.uima.jcas.JCas jcas, String textSegment, int offsetAdjustment, boolean includeTextNotJustOffsets)
          Tokenizes text that starts at offset offsetAdjustment within the complete text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TokenizerPTB

public TokenizerPTB()
Constructor

Method Detail

tokenizeTextSegment

public List<?> tokenizeTextSegment(org.apache.uima.jcas.JCas jcas,
                                   String textSegment,
                                   int offsetAdjustment,
                                   boolean includeTextNotJustOffsets)
Tokenizes text that starts at offset offsetAdjustment within the complete text.

Parameters:
textSegment - the text to tokenize
offsetAdjustment - the value to add to every offset within textSegment so that the offset is relative to the start of the text for the jcas
includeTextNotJustOffsets - whether to copy the text covered by each token into the token object itself
Returns:
the list of new tokens
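
The role of offsetAdjustment and includeTextNotJustOffsets can be sketched as follows. This is a hedged illustration, not the cTAKES code: SegmentOffsetSketch, its Token type, and tokenizeSegment are hypothetical, no JCas is involved, and whitespace splitting again stands in for the PTB rules. It assumes the adjustment is simply added to each segment-relative offset, as the parameter description above states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; it does not use or reproduce the cTAKES implementation.
class SegmentOffsetSketch {

    /** A token with document-relative offsets and, optionally, its covered text. */
    static final class Token {
        final int begin;
        final int end;
        final String text; // null when only offsets were requested

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /**
     * Splits textSegment on whitespace and shifts every offset by
     * offsetAdjustment so that it indexes into the complete document.
     */
    static List<Token> tokenizeSegment(String textSegment,
                                       int offsetAdjustment,
                                       boolean includeText) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < textSegment.length()) {
            while (i < textSegment.length() && Character.isWhitespace(textSegment.charAt(i))) {
                i++;
            }
            int begin = i;
            while (i < textSegment.length() && !Character.isWhitespace(textSegment.charAt(i))) {
                i++;
            }
            if (i > begin) {
                // Copy the covered text only when the caller asked for it.
                String covered = includeText ? textSegment.substring(begin, i) : null;
                tokens.add(new Token(begin + offsetAdjustment, i + offsetAdjustment, covered));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String document = "Vitals: stable. Plan: discharge home.";
        int segmentStart = document.indexOf("Plan:");
        String segment = document.substring(segmentStart);
        for (Token t : tokenizeSegment(segment, segmentStart, true)) {
            // The adjusted offsets index into the full document, not the segment.
            System.out.println(t.begin + "-" + t.end + " " + document.substring(t.begin, t.end));
        }
    }
}
```

Passing 0 as the adjustment corresponds to the case where the segment starts at the beginning of the document, which is the assumption the tokenize method documents.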

tokenize

public List<?> tokenize(String text)
Tokenizes a string that is assumed to be the entire document, or at least to begin at offset 0.

Parameters:
text - the String to tokenize
Returns:
the list of new tokens

findFirstCharOfNextToken

public int findFirstCharOfNextToken(String s,
                                    int startPosition)

main

public static void main(String[] args)


Copyright © 2012-2013 The Apache Software Foundation. All Rights Reserved.