java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.TokenFilter
        org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
          org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
public class DictionaryCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages.
For example, "Donaudampfschiff" is decomposed into Donau, dampf, schiff, so that a search for "schiff" alone also finds "Donaudampfschiff". The filter uses a brute-force algorithm to achieve this.
You must specify the required Version compatibility when creating this filter; see CompoundWordTokenFilterBase for details.
If you pass in a CharArraySet as the dictionary, it should be case-insensitive unless it contains only lowercased entries and a LowerCaseFilter precedes this filter in your analysis chain.
For optimal performance, prefer the latter setup (a lowercased CharArraySet behind a LowerCaseFilter), because this filter performs many dictionary lookups. Be aware: if you supply arbitrary Sets or String[] dictionaries to the constructors, they are automatically converted to case-insensitive dictionaries.
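The two dictionary setups described above can be compared with standard collections. This is only an analogy under stated assumptions: a `TreeSet` ordered by `String.CASE_INSENSITIVE_ORDER` stands in for a case-insensitive CharArraySet, and an explicit `toLowerCase` call stands in for a LowerCaseFilter earlier in the chain. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the two dictionary setups; not Lucene API.
final class DictionaryCaseDemo {

    /** Option 1: lookups ignore case (stand-in for a case-insensitive CharArraySet). */
    static Set<String> caseInsensitive(String... entries) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(entries));
        return set;
    }

    /** Option 2: a case-sensitive set that stores only lowercased entries. */
    static Set<String> lowercasedOnly(String... entries) {
        Set<String> set = new HashSet<>();
        for (String entry : entries) {
            set.add(entry.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    public static void main(String[] args) {
        Set<String> insensitive = caseInsensitive("Donau", "Dampf", "Schiff");
        Set<String> lowercased = lowercasedOnly("Donau", "Dampf", "Schiff");

        // Option 1 matches regardless of the token's case.
        System.out.println(insensitive.contains("SCHIFF")); // prints true

        // Option 2 misses tokens that were not lowercased first...
        System.out.println(lowercased.contains("Schiff")); // prints false

        // ...and matches once the token has been lowercased upstream.
        System.out.println(lowercased.contains("Schiff".toLowerCase(Locale.ROOT))); // prints true
    }
}
```

Option 2 keeps each lookup a plain exact match, which is the reason the text recommends it when the chain already lowercases tokens.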
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
---|
CompoundWordTokenFilterBase.CompoundToken |
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource |
---|
AttributeSource.AttributeFactory, AttributeSource.State |
Field Summary |
---|
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
---|
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens |
Fields inherited from class org.apache.lucene.analysis.TokenFilter |
---|
input |
Constructor Summary | |
---|---|
DictionaryCompoundWordTokenFilter(TokenStream input,
Set dictionary)
Deprecated. use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set) instead |
|
DictionaryCompoundWordTokenFilter(TokenStream input,
Set dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set, int, int, int, boolean) instead |
|
DictionaryCompoundWordTokenFilter(TokenStream input,
String[] dictionary)
Deprecated. use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[]) instead |
|
DictionaryCompoundWordTokenFilter(TokenStream input,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[], int, int, int, boolean) instead |
|
DictionaryCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
Set<?> dictionary)
Creates a new DictionaryCompoundWordTokenFilter |
|
DictionaryCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
Set<?> dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Creates a new DictionaryCompoundWordTokenFilter |
|
DictionaryCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
String[] dictionary)
Deprecated. Use the constructors taking Set |
|
DictionaryCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. Use the constructors taking Set |
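The effect of the minWordSize, minSubwordSize, maxSubwordSize, and onlyLongestMatch arguments can be sketched in plain Java. This is a hedged illustration rather than the filter's code: `SizeLimitedDecompounder` is a hypothetical name, the size limits are treated as inclusive bounds, and onlyLongestMatch is applied per start offset. Both of those choices are assumptions that should be checked against the actual implementation. The values 5, 2, and 15 are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the size parameters; bounds are assumed inclusive.
final class SizeLimitedDecompounder {

    static List<String> decompose(String word, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize,
                                  int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> parts = new ArrayList<>();
        String lower = word.toLowerCase(Locale.ROOT);
        if (lower.length() < minWordSize) {
            return parts; // too short to be treated as a compound
        }
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            String longest = null;
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length();
                 len++) {
                String candidate = lower.substring(start, start + len);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // len only grows, so the last hit is the longest
                } else {
                    parts.add(candidate);
                }
            }
            if (longest != null) {
                parts.add(longest);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("dampf", "dampfschiff", "schiff"));

        // prints [dampf, dampfschiff, schiff]
        System.out.println(decompose("Dampfschiff", dictionary, 5, 2, 15, false));

        // prints [dampfschiff, schiff]
        System.out.println(decompose("Dampfschiff", dictionary, 5, 2, 15, true));

        // prints [dampf, schiff] because "dampfschiff" exceeds maxSubwordSize = 6
        System.out.println(decompose("Dampfschiff", dictionary, 5, 2, 6, false));
    }
}
```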
Method Summary | |
---|---|
protected void |
decompose()
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. |
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
---|
incrementToken, makeDictionary, reset |
Methods inherited from class org.apache.lucene.analysis.TokenFilter |
---|
close, end |
Methods inherited from class org.apache.lucene.util.AttributeSource |
---|
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[], int, int, int, boolean) instead
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, String[] dictionary)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[]) instead
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, Set dictionary)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set) instead
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set, int, int, int, boolean) instead
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
@Deprecated public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Deprecated. Use the constructors taking Set
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
@Deprecated public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, String[] dictionary)
Deprecated. Use the constructors taking Set
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set<?> dictionary)
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
Method Detail |
---|
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, as it is automatically passed through this filter.
Specified by:
decompose in class CompoundWordTokenFilterBase
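The contract of decompose() follows a template-method pattern: the base class passes the original token through and then emits whatever the subclass placed in the tokens list. The sketch below illustrates that contract with plain strings. It is an assumption-laden illustration, not Lucene code: `CompoundFilterSketch`, `PrefixScanSketch`, `term`, and `process` are hypothetical names, and input terms are assumed to be lowercased already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical base class: emits the original term, then the subwords
// that the subclass collected for it.
abstract class CompoundFilterSketch {
    protected final List<String> tokens = new ArrayList<>(); // subwords of the current term
    protected String term;                                   // the term being decomposed

    /** Subclasses add subwords of term to tokens; they must not add term itself. */
    protected abstract void decompose();

    final List<String> process(List<String> input) {
        List<String> output = new ArrayList<>();
        for (String current : input) {
            term = current;
            tokens.clear();
            output.add(current);   // the original token always passes through
            decompose();
            output.addAll(tokens); // subwords follow the original token
        }
        return output;
    }
}

// Hypothetical subclass: at each offset, adds every dictionary entry
// that starts there.
final class PrefixScanSketch extends CompoundFilterSketch {
    private final List<String> dictionary;

    PrefixScanSketch(List<String> dictionary) {
        this.dictionary = dictionary;
    }

    @Override
    protected void decompose() {
        for (int start = 0; start < term.length(); start++) {
            for (String entry : dictionary) {
                if (term.startsWith(entry, start)) {
                    tokens.add(entry);
                }
            }
        }
    }

    public static void main(String[] args) {
        PrefixScanSketch filter =
                new PrefixScanSketch(Arrays.asList("donau", "dampf", "schiff"));
        // prints [donaudampfschiff, donau, dampf, schiff, und]
        System.out.println(filter.process(Arrays.asList("donaudampfschiff", "und")));
    }
}
```

If decompose() also added the original term to the list, that term would appear twice in the output, which is why the contract forbids it.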