org.apache.lucene.analysis.standard
Class UAX29URLEmailTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer
- All Implemented Interfaces:
- Closeable
public final class UAX29URLEmailTokenizer
- extends Tokenizer
This class implements the Word Break rules from the Unicode Text Segmentation
algorithm, as specified in Unicode Standard Annex #29.
URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast
Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
You must specify the required Version
compatibility when creating a UAX29URLEmailTokenizer:
- As of 3.4, Hiragana and Han characters are no longer wrongly split
from their combining characters. If you pass an earlier version number,
the tokenizer reproduces the old, incorrect splitting for backwards compatibility.
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
ALPHANUM
public static final int ALPHANUM
- See Also:
- Constant Field Values
NUM
public static final int NUM
- See Also:
- Constant Field Values
SOUTHEAST_ASIAN
public static final int SOUTHEAST_ASIAN
- See Also:
- Constant Field Values
IDEOGRAPHIC
public static final int IDEOGRAPHIC
- See Also:
- Constant Field Values
HIRAGANA
public static final int HIRAGANA
- See Also:
- Constant Field Values
KATAKANA
public static final int KATAKANA
- See Also:
- Constant Field Values
HANGUL
public static final int HANGUL
- See Also:
- Constant Field Values
URL
public static final int URL
- See Also:
- Constant Field Values
EMAIL
public static final int EMAIL
- See Also:
- Constant Field Values
TOKEN_TYPES
public static final String[] TOKEN_TYPES
- String token types that correspond to token type int constants
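As a rough illustration of how an index-aligned name array relates to the int constants, the sketch below mirrors the layout in a standalone class. The class name TokenTypeLookupSketch, the helper typeName, and the numeric values are assumptions made for this example; the real values are listed on the Constant Field Values page, and the "<KATAKANA>" and "<HANGUL>" strings are assumed by analogy with the types named in the class description.

```java
/** Hypothetical stand-in; this is not the Lucene class. */
final class TokenTypeLookupSketch {
    // ASSUMPTION: placeholder values chosen so that each constant is an
    // index into TOKEN_TYPES. Check Constant Field Values for the real ones.
    static final int ALPHANUM = 0;
    static final int NUM = 1;
    static final int SOUTHEAST_ASIAN = 2;
    static final int IDEOGRAPHIC = 3;
    static final int HIRAGANA = 4;
    static final int KATAKANA = 5;
    static final int HANGUL = 6;
    static final int URL = 7;
    static final int EMAIL = 8;

    // One name per constant, stored at the position given by that constant.
    static final String[] TOKEN_TYPES = {
        "<ALPHANUM>", "<NUM>", "<SOUTHEAST_ASIAN>", "<IDEOGRAPHIC>",
        "<HIRAGANA>", "<KATAKANA>", "<HANGUL>", "<URL>", "<EMAIL>"
    };

    /** Returns the string token type for an int token type constant. */
    static String typeName(int tokenType) {
        return TOKEN_TYPES[tokenType];
    }
}
```

With this layout, TokenTypeLookupSketch.typeName(TokenTypeLookupSketch.EMAIL) returns the same string that appears in the token type list of the class description.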
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer(Version matchVersion,
Reader input)
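- Creates a new instance of the UAX29URLEmailTokenizer. Attaches
the input to the newly created JFlex scanner.
- Parameters:
matchVersion - The Lucene Version to match; see the Version compatibility note in the class description
input - The input reader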
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer(Version matchVersion,
AttributeSource.AttributeFactory factory,
Reader input)
- Creates a new UAX29URLEmailTokenizer with a given AttributeSource.AttributeFactory.
setMaxTokenLength
public void setMaxTokenLength(int length)
- Sets the maximum allowed token length. Any token longer
than this is skipped.
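The practical consequence of "skipped" is that an over-long token is dropped entirely rather than truncated to the limit. The standalone sketch below illustrates that distinction under simplifying assumptions: the class MaxTokenLengthSketch and its tokenize helper are hypothetical, and plain whitespace splitting stands in for the real JFlex scanner, so this is not how the tokenizer itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that only illustrates skip-versus-truncate. */
final class MaxTokenLengthSketch {

    /**
     * Splits text on whitespace (ASSUMPTION: a simplification of the real
     * scanner) and omits every token longer than maxTokenLength.
     */
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> kept = new ArrayList<String>();
        for (String candidate : text.trim().split("\\s+")) {
            if (candidate.isEmpty()) {
                continue; // splitting an empty string yields one empty element
            }
            if (candidate.length() > maxTokenLength) {
                continue; // too long: skipped as a whole, never shortened
            }
            kept.add(candidate);
        }
        return kept;
    }
}
```

For example, MaxTokenLengthSketch.tokenize("a bb cccc dd", 3) keeps "a", "bb" and "dd" and emits nothing for "cccc"; a token whose length equals the limit is kept.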
getMaxTokenLength
public int getMaxTokenLength()
- See Also:
setMaxTokenLength(int)
incrementToken
public final boolean incrementToken()
throws IOException
- Specified by: incrementToken in class TokenStream
- Throws: IOException
end
public final void end()
throws IOException
- Overrides: end in class TokenStream
- Throws: IOException
reset
public void reset()
throws IOException
- Overrides: reset in class TokenStream
- Throws: IOException
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.