java.lang.Object
  org.apache.lucene.analysis.standard.StandardTokenizerImpl
public final class StandardTokenizerImpl
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
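To make the idea of word-break segmentation concrete, here is a minimal sketch that uses the JDK's own `java.text.BreakIterator` rather than this class. It is a different implementation and may place boundaries differently from StandardTokenizerImpl; the class name `WordBreakSketch` and the rule for discarding punctuation and whitespace segments are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {
    /** Returns the segments between word boundaries that start with a letter or digit. */
    public static List<String> words(String text) {
        // JDK word-boundary analysis; NOT Lucene's scanner (assumption for illustration).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Drop segments that are only punctuation or whitespace.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

Boundaries are computed first and filtered afterwards, which is roughly how a word-break based tokenizer separates "where can the text be split" from "which pieces count as tokens".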
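Tokens produced are of the types identified by the *_TYPE constants in the Field Summary: WORD_TYPE, NUMERIC_TYPE, SOUTH_EAST_ASIAN_TYPE, IDEOGRAPHIC_TYPE, HIRAGANA_TYPE, KATAKANA_TYPE, and HANGUL_TYPE.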
Field Summary

Type | Field | Description |
---|---|---|
static int | HANGUL_TYPE | |
static int | HIRAGANA_TYPE | |
static int | IDEOGRAPHIC_TYPE | |
static int | KATAKANA_TYPE | |
static int | NUMERIC_TYPE | Numbers |
static int | SOUTH_EAST_ASIAN_TYPE | Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). |
static int | WORD_TYPE | Alphanumeric sequences |
static int | YYEOF | This character denotes the end of file |
static int | YYINITIAL | lexical states |
Constructor Summary

Constructor | Description |
---|---|
StandardTokenizerImpl(InputStream in) | Creates a new scanner. |
StandardTokenizerImpl(Reader in) | Creates a new scanner. There is also a java.io.InputStream version of this constructor. |
Method Summary

Return type | Method | Description |
---|---|---|
int | getNextToken() | Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs. |
void | getText(CharTermAttribute t) | Fills CharTermAttribute with the current token text. |
void | yybegin(int newState) | Enters a new lexical state. |
int | yychar() | Returns the current position. |
char | yycharat(int pos) | Returns the character at position pos from the matched text. |
void | yyclose() | Closes the input stream. |
int | yylength() | Returns the length of the matched text region. |
void | yypushback(int number) | Pushes the specified number of characters back into the input stream. |
void | yyreset(Reader reader) | Resets the scanner to read from a new input stream. |
int | yystate() | Returns the current lexical state. |
String | yytext() | Returns the text matched by the current regular expression. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
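The method summary suggests a pull-style loop: call getNextToken() repeatedly until it returns YYEOF, reading the matched text between calls. The sketch below imitates that contract with a hypothetical `ToyScanner` built only on JDK classes, so it runs without Lucene. The class name, the whitespace-splitting rule, and the constant values are assumptions for illustration and do not reflect StandardTokenizerImpl's grammar or its actual constants.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ScanLoopSketch {
    /** Hypothetical stand-in that mirrors some of the documented method names. */
    public static final class ToyScanner {
        public static final int YYEOF = -1;     // assumed value, illustration only
        public static final int WORD_TYPE = 0;  // assumed value, illustration only

        private final Reader in;
        private final StringBuilder text = new StringBuilder();

        public ToyScanner(Reader in) { this.in = in; }

        /** Reads the next whitespace-delimited run; returns YYEOF at end of input. */
        public int getNextToken() throws IOException {
            text.setLength(0);
            int c;
            // Skip leading whitespace.
            while ((c = in.read()) != -1 && Character.isWhitespace(c)) { }
            if (c == -1) {
                return YYEOF;
            }
            // Accumulate characters until whitespace or end of input.
            do {
                text.append((char) c);
            } while ((c = in.read()) != -1 && !Character.isWhitespace(c));
            return WORD_TYPE;
        }

        public String yytext() { return text.toString(); }
        public int yylength() { return text.length(); }
    }

    /** Drives the scanner until YYEOF and collects the matched text of each token. */
    public static List<String> scanAll(String input) {
        ToyScanner scanner = new ToyScanner(new StringReader(input));
        List<String> tokens = new ArrayList<>();
        try {
            while (scanner.getNextToken() != ToyScanner.YYEOF) {
                tokens.add(scanner.yytext());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scanAll("  the quick  brown fox ")); // [the, quick, brown, fox]
    }
}
```

With the real class, the same loop shape would presumably compare the return value against the type constants and call getText(CharTermAttribute) instead of yytext(), but that wiring depends on the surrounding Lucene tokenizer and is not shown here.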
Field Detail |
---|
public static final int YYEOF
This character denotes the end of file

public static final int YYINITIAL
lexical states

public static final int WORD_TYPE
Alphanumeric sequences

public static final int NUMERIC_TYPE
Numbers

public static final int SOUTH_EAST_ASIAN_TYPE
Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA

public static final int IDEOGRAPHIC_TYPE

public static final int HIRAGANA_TYPE

public static final int KATAKANA_TYPE

public static final int HANGUL_TYPE
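As a rough illustration of how single characters relate to the type names above, the following sketch classifies a code point by its Unicode script using `java.lang.Character.UnicodeScript`. This is only an approximation: the scanner is described in terms of word-break rules and the Line_Break = Complex_Context property, which the JDK does not expose directly, so script membership is used here as a stand-in. The class `ScriptTypeSketch`, the method `roughType`, and the returned strings are hypothetical.

```java
import java.lang.Character.UnicodeScript;

public class ScriptTypeSketch {
    /** Maps a code point to a rough type label based on its Unicode script. */
    public static String roughType(int codePoint) {
        if (Character.isDigit(codePoint)) {
            return "NUMERIC";
        }
        switch (UnicodeScript.of(codePoint)) {
            case HAN:
                return "IDEOGRAPHIC";
            case HIRAGANA:
                return "HIRAGANA";
            case KATAKANA:
                return "KATAKANA";
            case HANGUL:
                return "HANGUL";
            // Script-based stand-in for Line_Break = Complex_Context (assumption).
            case THAI:
            case LAO:
            case MYANMAR:
            case KHMER:
                return "SOUTH_EAST_ASIAN";
            default:
                return Character.isLetter(codePoint) ? "WORD" : "OTHER";
        }
    }

    public static void main(String[] args) {
        int[] samples = { 'a', '7', 0x6F22, 0x3042, 0x30A2, 0xD55C, 0x0E01 };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X %s", cp, roughType(cp)));
        }
    }
}
```

A real scanner assigns a type to a whole matched sequence, not to one character, so this per-character view is useful only for building intuition about which scripts fall under which constant.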
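Constructor Detail |
---|
public StandardTokenizerImpl(Reader in)
Creates a new scanner. There is also a java.io.InputStream version of this constructor.
Parameters:
in - the java.io.Reader to read input from.

public StandardTokenizerImpl(InputStream in)
Creates a new scanner.
Parameters:
in - the java.io.InputStream to read input from.

Method Detail |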
---|
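public final int yychar()
Returns the current position.

public final void getText(CharTermAttribute t)
Fills CharTermAttribute with the current token text.

public final void yyclose() throws IOException
Closes the input stream.
Throws:
IOException

public final void yyreset(Reader reader)
Resets the scanner to read from a new input stream.
Parameters:
reader - the new input stream

public final int yystate()
Returns the current lexical state.

public final void yybegin(int newState)
Enters a new lexical state.
Parameters:
newState - the new lexical state

public final String yytext()
Returns the text matched by the current regular expression.

public final char yycharat(int pos)
Returns the character at position pos from the matched text.
Parameters:
pos - the position of the character to fetch. A value from 0 to yylength()-1.

public final int yylength()
Returns the length of the matched text region.

public void yypushback(int number)
Pushes the specified number of characters back into the input stream.
Parameters:
number - the number of characters to be read again. This number must not be greater than yylength()!

public int getNextToken() throws IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
Throws:
IOException - if any I/O error occurs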
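The relationship between yytext(), yylength(), yycharat(int) and yypushback(int) is easier to see on a small model. The sketch below assumes the matched text is a window over a character buffer, delimited by a start index and an end index, and that pushing characters back simply shrinks the window from the right. The class `ToyMatch`, its field names, and the choice of IllegalArgumentException for an out-of-range argument are assumptions for illustration, not this class's actual internals.

```java
public class PushbackSketch {
    /** Hypothetical buffer-backed match that imitates the documented accessors. */
    public static final class ToyMatch {
        private final char[] buffer;
        private final int startRead; // index where the current match begins
        private int markedPos;       // index one past the end of the current match

        public ToyMatch(String input, int start, int end) {
            this.buffer = input.toCharArray();
            this.startRead = start;
            this.markedPos = end;
        }

        /** Text of the current match. */
        public String yytext() {
            return new String(buffer, startRead, markedPos - startRead);
        }

        /** Length of the current match. */
        public int yylength() {
            return markedPos - startRead;
        }

        /** Character at pos within the match; pos ranges from 0 to yylength()-1. */
        public char yycharat(int pos) {
            return buffer[startRead + pos];
        }

        /** Gives back the last `number` characters so they can be read again. */
        public void yypushback(int number) {
            if (number < 0 || number > yylength()) {
                // Error handling here is an assumption of this sketch.
                throw new IllegalArgumentException("number must be between 0 and yylength()");
            }
            markedPos -= number;
        }
    }

    public static void main(String[] args) {
        ToyMatch m = new ToyMatch("hello world", 0, 5);
        System.out.println(m.yytext() + " " + m.yylength()); // hello 5
        m.yypushback(2);
        System.out.println(m.yytext() + " " + m.yylength()); // hel 3
    }
}
```

In this model the constraint that number must not exceed yylength() falls out naturally: pushing back more characters than were matched would move the end of the window before its start.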