public class LetterTokenizer extends CharTokenizer
Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
You must specify the required Version compatibility when creating LetterTokenizer: CharTokenizer uses an int-based API to normalize and detect token characters. See CharTokenizer.isTokenChar(int) and CharTokenizer.normalize(int) for details.
Nested classes/interfaces inherited from class AttributeSource: AttributeSource.AttributeFactory, AttributeSource.State
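The practical effect of an int-based (code point) API can be illustrated without Lucene. The following is a minimal sketch using only the JDK; the class name `CodePointLetterDemo` and its two helper methods are hypothetical and are not part of Lucene. It compares letter detection per UTF-16 `char` with letter detection per code point on a string that contains a supplementary character.

```java
class CodePointLetterDemo {
    // Counts letters by inspecting each UTF-16 char unit on its own.
    // Surrogate halves are not letters, so supplementary letters are missed.
    public static int countLettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // Counts letters by inspecting whole Unicode code points (int values).
    public static int countLettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                n++;
            }
            i += Character.charCount(cp); // advance by 1 or 2 char units
        }
        return n;
    }

    public static void main(String[] args) {
        // 'a', U+10400 DESERET CAPITAL LETTER LONG I (a surrogate pair), 'b'
        String s = "a\uD801\uDC00b";
        System.out.println(countLettersByChar(s));      // 2
        System.out.println(countLettersByCodePoint(s)); // 3
    }
}
```

The char-based loop sees the two surrogate halves separately and counts neither, while the code point loop treats them as one letter.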
Constructor and Description |
---|
LetterTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader in) Construct a new LetterTokenizer using a given AttributeSource.AttributeFactory. |
LetterTokenizer(Version matchVersion, AttributeSource source, Reader in) Construct a new LetterTokenizer using a given AttributeSource. |
LetterTokenizer(Version matchVersion, Reader in) Construct a new LetterTokenizer. |
Modifier and Type | Method and Description |
---|---|
protected boolean | isTokenChar(int c) Collects only characters which satisfy Character.isLetter(int). |
Methods inherited from class CharTokenizer: end, incrementToken, normalize, reset
Methods inherited from class Tokenizer: close, correctOffset, setReader
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
public LetterTokenizer(Version matchVersion, Reader in)
Construct a new LetterTokenizer.
Parameters:
matchVersion - Lucene version to match. See above.
in - the input to split up into tokens

public LetterTokenizer(Version matchVersion, AttributeSource source, Reader in)
Construct a new LetterTokenizer using a given AttributeSource.

public LetterTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader in)
Construct a new LetterTokenizer using a given AttributeSource.AttributeFactory.

protected boolean isTokenChar(int c)
Collects only characters which satisfy Character.isLetter(int).
Specified by: isTokenChar in class CharTokenizer
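To make the effect of the isTokenChar rule concrete, here is a minimal JDK-only sketch that emulates the splitting behaviour: tokens are maximal runs of code points for which Character.isLetter(int) returns true. The class `LetterSplitSketch` and its `split` method are hypothetical illustrations, not the Lucene implementation; the sketch ignores offsets, attributes, and any internal limits the real tokenizer may apply.

```java
import java.util.ArrayList;
import java.util.List;

class LetterSplitSketch {
    // Splits text into maximal runs of code points that satisfy
    // Character.isLetter(int); everything else acts as a separator.
    public static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A non-letter ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Digits and punctuation are dropped; only letter runs survive.
        System.out.println(split("Hello, world! 42 times")); // [Hello, world, times]
        // Four CJK ideographs with no spaces: every character is a letter,
        // so the whole run comes back as a single token.
        System.out.println(split("\u6211\u7231\u5317\u4EAC").size()); // 1
    }
}
```

The second call shows why this rule works poorly for languages written without spaces between words: an entire unspaced sentence of letters is returned as one token.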
Copyright © 2000-2012 Apache Software Foundation. All Rights Reserved.