TokenizerPTB (Apache cTAKES core 3.1.1 API)

org.apache.ctakes.core.nlp.tokenizer
Class TokenizerPTB

java.lang.Object
  org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB

public class TokenizerPTB
extends Object

Breaks natural-language text into tokens following Penn Treebank (PTB) rules. See Supplementary Guidelines for ETTB 2.0, dated April 6th, 2009. Token markup is kept external to the text rather than embedded in it: each token's boundaries are identified by character offsets.
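To make the idea of external, offset-based markup concrete, here is a minimal sketch. It is not the cTAKES implementation and does not apply PTB rules: `OffsetTokenSketch`, `Span`, and the whitespace-only splitting are all hypothetical stand-ins, chosen only to show tokens recorded as begin/end offsets while the text itself is left untouched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in, not cTAKES code: whitespace-only tokenization with external offsets. */
final class OffsetTokenSketch {

    /** A token described only by its [begin, end) character offsets into the original text. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** Recovers the token text on demand; the text is never copied or modified. */
        String coveredText(String text) {
            return text.substring(begin, end);
        }
    }

    /** Assumption: tokens are maximal runs of non-whitespace. Real PTB rules are far richer. */
    static List<Span> tokenize(String text) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int begin = i;
            // Consume one run of non-whitespace characters.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > begin) {
                spans.add(new Span(begin, i));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Patient denies  chest pain";
        for (Span s : tokenize(text)) {
            System.out.println(s.begin + "-" + s.end + " " + s.coveredText(text));
        }
    }
}
```

Because the spans carry only offsets, the original string can be shared by several annotation layers, and a token's text is recovered with `substring` when it is needed.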

Author: Mayo Clinic

Constructor Summary
`TokenizerPTB()` Constructor

Method Summary
`int`	`findFirstCharOfNextToken(String s, int startPosition)`
`static void`	`main(String[] args)`
`List<?>`	`tokenize(String text)` Tokenizes a string that is assumed to be the entire document (or at least to start at offset 0).
`List<?>`	`tokenizeTextSegment(org.apache.uima.jcas.JCas jcas, String textSegment, int offsetAdjustment, boolean includeTextNotJustOffsets)` Tokenizes text that starts at offset `offsetAdjustment` within the complete text.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

TokenizerPTB

public TokenizerPTB()

Constructor

Method Detail

tokenizeTextSegment

public List<?> tokenizeTextSegment(org.apache.uima.jcas.JCas jcas,
                                   String textSegment,
                                   int offsetAdjustment,
                                   boolean includeTextNotJustOffsets)

Tokenizes text that starts at offset offsetAdjustment within the complete text.

Parameters:
  textSegment - the text to tokenize
  offsetAdjustment - the value to add to every offset within textSegment so that it becomes an offset from the start of the text for the jcas
  includeTextNotJustOffsets - whether to copy the text covered by each token into the token object itself
Returns:
  the list of new tokens
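The role of offsetAdjustment can be illustrated with a small sketch. This is an assumption-laden illustration rather than the cTAKES code: `SegmentOffsetSketch`, `tokenizeSegment`, and the whitespace-only splitting are hypothetical, and no JCas is involved. It shows only the arithmetic, in which offsets found inside a segment are shifted so that they index into the full document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in, not cTAKES code: shifts segment-relative offsets to document offsets. */
final class SegmentOffsetSketch {

    /**
     * Returns {begin, end} pairs relative to the full document.
     * Assumption: tokens are runs of non-whitespace, which is not what PTB rules prescribe.
     */
    static List<int[]> tokenizeSegment(String segment, int offsetAdjustment) {
        List<int[]> tokens = new ArrayList<>();
        int i = 0;
        while (i < segment.length()) {
            // Skip whitespace between tokens.
            while (i < segment.length() && Character.isWhitespace(segment.charAt(i))) {
                i++;
            }
            int begin = i;
            // Consume one run of non-whitespace characters.
            while (i < segment.length() && !Character.isWhitespace(segment.charAt(i))) {
                i++;
            }
            if (i > begin) {
                // Shift both boundaries from segment coordinates to document coordinates.
                tokens.add(new int[] {begin + offsetAdjustment, i + offsetAdjustment});
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String document = "No fever. Mild cough.";
        int segmentStart = 10; // the second sentence begins here
        String segment = document.substring(segmentStart);
        for (int[] t : tokenizeSegment(segment, segmentStart)) {
            // Adjusted offsets index directly into the full document.
            System.out.println(t[0] + "-" + t[1] + " " + document.substring(t[0], t[1]));
        }
    }
}
```

Passing 0 as the adjustment reduces this to whole-document tokenization, which matches the stated assumption of `tokenize(String text)` that its input starts at offset 0.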

tokenize

public List<?> tokenize(String text)

Tokenizes a string that is assumed to be the entire document (or at least to start at offset 0).

Parameters:
  text - the String to tokenize
Returns:
  the list of new tokens

findFirstCharOfNextToken

public int findFirstCharOfNextToken(String s,
                                    int startPosition)

main

public static void main(String[] args)