org.apache.lucene.analysis.icu.segmentation
Class ICUTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.icu.segmentation.ICUTokenizer
- All Implemented Interfaces:
- Closeable
public final class ICUTokenizer
- extends org.apache.lucene.analysis.Tokenizer
Breaks text into words according to UAX #29: Unicode Text Segmentation
(http://www.unicode.org/reports/tr29/).
Text is first split at script boundaries; each single-script run is then segmented according to
the BreakIterator and token typing provided by the ICUTokenizerConfig.
- See Also:
ICUTokenizerConfig
- WARNING: This API is experimental and might change in incompatible ways in the next release.
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Constructor Summary
ICUTokenizer(Reader input)
Construct a new ICUTokenizer that breaks text into words from the given Reader.
ICUTokenizer(Reader input, ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
ICUTokenizer
public ICUTokenizer(Reader input)
- Construct a new ICUTokenizer that breaks text into words from the given
Reader.
The default script-specific handling is used.
- Parameters:
input
- Reader containing text to tokenize.
- See Also:
DefaultICUTokenizerConfig
ICUTokenizer
public ICUTokenizer(Reader input,
ICUTokenizerConfig config)
- Construct a new ICUTokenizer that breaks text into words from the given
Reader, using a tailored BreakIterator configuration.
- Parameters:
input
- Reader containing text to tokenize.
config
- Tailored BreakIterator configuration.
incrementToken
public boolean incrementToken()
throws IOException
- Specified by:
incrementToken
in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset()
throws IOException
- Overrides:
reset
in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset(Reader input)
throws IOException
- Overrides:
reset
in class org.apache.lucene.analysis.Tokenizer
- Throws:
IOException
end
public void end()
throws IOException
- Overrides:
end
in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.