org.apache.lucene.analysis.ar
Class ArabicLetterTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.CharTokenizer
org.apache.lucene.analysis.LetterTokenizer
org.apache.lucene.analysis.ar.ArabicLetterTokenizer
public class ArabicLetterTokenizer
extends org.apache.lucene.analysis.LetterTokenizer
Tokenizer that breaks text into runs of letters and diacritics.
The standard LetterTokenizer accepts only letters as token characters, so it splits a word wherever a diacritic (a nonspacing mark) appears.
Similar handling is necessary for Indic scripts, Hebrew, Thaana, etc.
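The splitting behavior described above can be sketched without Lucene, using only java.lang.Character. This is a minimal illustration, not the class's actual source: the class name LetterAndMarkSplitter and its tokenize helper are hypothetical, and the predicate assumes that "diacritics" means characters of type Character.NON_SPACING_MARK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Lucene.
class LetterAndMarkSplitter {

    // Assumed rule: a token character is a letter or a nonspacing mark.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    // Collects maximal runs of token characters; everything else is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Arabic words; the first carries a kasra (U+0650) after its first letter.
        String text = "\u0643\u0650\u062A\u0627\u0628 \u062C\u062F\u064A\u062F";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

With a letter-only predicate the kasra would act as a separator and cut the first word in two; accepting nonspacing marks keeps each word as one token.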
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Method Summary
protected boolean isTokenChar(char c)
Allows for Letter category or NonspacingMark category
Methods inherited from class org.apache.lucene.analysis.CharTokenizer
end, incrementToken, next, next, normalize, reset
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
Methods inherited from class org.apache.lucene.analysis.TokenStream
getOnlyUseNewAPI, reset, setOnlyUseNewAPI
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
ArabicLetterTokenizer
public ArabicLetterTokenizer(Reader in)
ArabicLetterTokenizer
public ArabicLetterTokenizer(org.apache.lucene.util.AttributeSource source,
Reader in)
ArabicLetterTokenizer
public ArabicLetterTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory,
Reader in)
isTokenChar
protected boolean isTokenChar(char c)
Accepts characters in a Unicode Letter category or in the NonspacingMark (Mn) category.
Overrides:
isTokenChar in class org.apache.lucene.analysis.LetterTokenizer
See Also:
LetterTokenizer.isTokenChar(char)
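To make the difference between the two predicates concrete, the following sketch compares a letter-only check with a letter-or-nonspacing-mark check on a few sample characters. It is an assumption-laden illustration rather than the library source: TokenCharCheck, letterOnly, and letterOrMark are hypothetical names, and both rules are expressed through java.lang.Character.

```java
// Hypothetical comparison of the two rules; not part of Lucene.
class TokenCharCheck {

    // Letter-only rule, assumed to mirror the inherited behavior.
    static boolean letterOnly(char c) {
        return Character.isLetter(c);
    }

    // Extended rule, assumed to mirror the override: letter or nonspacing mark.
    static boolean letterOrMark(char c) {
        return Character.isLetter(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        char[] samples = {
            '\u0628', // ARABIC LETTER BEH
            '\u064E', // ARABIC FATHA (nonspacing mark)
            '\u05B8', // HEBREW POINT QAMATS (nonspacing mark)
            '7',      // decimal digit
            ' '       // space
        };
        for (char c : samples) {
            System.out.printf("U+%04X letterOnly=%b letterOrMark=%b%n",
                    (int) c, letterOnly(c), letterOrMark(c));
        }
    }
}
```

Only the two nonspacing marks change status between the rules; digits and whitespace remain separators under both.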
Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.