Apache Lucene.Net 2.4.0 Class Library API
Lucene.Net.Analysis Namespace
Classes
Analyzer
    An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
CachingTokenFilter
    This class can be used if the Tokens of a TokenStream are intended to be consumed more than once. It caches all Tokens locally in a List. CachingTokenFilter implements the optional method TokenStream.Reset(), which repositions the stream to the first Token.
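The caching idea described above can be sketched, independently of the Lucene.Net API, as a wrapper that records tokens from a one-shot source on the first pass and replays them after a reset. The names below (CachingFilterSketch, next, reset) are invented for the illustration; this is not the library class itself:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch of the CachingTokenFilter idea: the first pass through
// next() fills a local cache from the wrapped one-shot source; reset()
// rewinds to the first cached token so the stream can be consumed again.
class CachingFilterSketch {
    private final Iterator<String> source;        // stands in for the wrapped TokenStream
    private final List<String> cache = new ArrayList<>();
    private int position = 0;

    CachingFilterSketch(Iterator<String> source) {
        this.source = source;
    }

    // Returns the next token, serving from the cache when possible;
    // returns null once the source is exhausted.
    String next() {
        if (position < cache.size()) {
            return cache.get(position++);
        }
        if (source.hasNext()) {
            String token = source.next();
            cache.add(token);
            position++;
            return token;
        }
        return null;
    }

    // Repositions the stream to the first token, like the optional
    // TokenStream.Reset() mentioned above.
    void reset() {
        position = 0;
    }
}
```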
CharArraySet
    A simple class that stores Strings as char[]'s in a hash table. Note that this is not a general-purpose class: it cannot remove items from the set, nor does it resize its hash table to be smaller. It is designed to make it quick to test whether a char[] is in the set without first converting it to a String.
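The char[]-keyed hash table that CharArraySet describes can be sketched as follows. This is an illustrative, add-only, fixed-capacity open-addressing table (CharSetSketch and its method names are invented for the example), not the Lucene.Net implementation:

```java
import java.util.Arrays;

// Illustrative sketch of the CharArraySet idea: hash directly on char[]
// contents so that membership tests on a text buffer allocate no Strings.
// Add-only and fixed capacity, matching the "not general purpose" caveat.
class CharSetSketch {
    private final char[][] slots = new char[64][];  // capacity must stay a power of two

    private int hash(char[] text, int off, int len) {
        int code = 0;
        for (int i = off; i < off + len; i++) {
            code = code * 31 + text[i];
        }
        return code & (slots.length - 1);           // mask keeps the index in range
    }

    private boolean sameChars(char[] slot, char[] text, int off, int len) {
        if (slot.length != len) return false;
        for (int i = 0; i < len; i++) {
            if (slot[i] != text[off + i]) return false;
        }
        return true;
    }

    // Adds a word, copying it so the caller may reuse its buffer.
    void add(char[] word) {
        int i = hash(word, 0, word.length);
        while (slots[i] != null && !sameChars(slots[i], word, 0, word.length)) {
            i = (i + 1) & (slots.length - 1);       // linear probing
        }
        slots[i] = Arrays.copyOf(word, word.length);
    }

    // Tests a region of a char[] buffer without building a String first.
    boolean contains(char[] text, int off, int len) {
        int i = hash(text, off, len);
        while (slots[i] != null) {
            if (sameChars(slots[i], text, off, len)) return true;
            i = (i + 1) & (slots.length - 1);
        }
        return false;
    }
}
```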
CharArraySet.CharArraySetIterator
    An iterator over the Strings stored in a CharArraySet.
CharTokenizer
    An abstract base class for simple, character-oriented tokenizers.
ISOLatin1AccentFilter
    A filter that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalents.
KeywordAnalyzer
    "Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
KeywordTokenizer
    Emits the entire input as a single token.
LengthFilter
    Removes words that are too long or too short from the stream.
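The effect of a length filter can be sketched with plain lists; LengthFilterSketch and its min/max bounds are hypothetical names, and the real filter wraps a TokenStream rather than a List:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the LengthFilter idea: pass through only those
// tokens whose length falls within the inclusive [min, max] range,
// dropping words that are too long or too short.
class LengthFilterSketch {
    static List<String> filter(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() >= min && token.length() <= max) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```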
LetterTokenizer
    A LetterTokenizer divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as determined by the System.Char.IsLetter predicate. Note: this does a decent job for most European languages, but a terrible job for some Asian languages, where words are not separated by spaces.
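The "maximal strings of adjacent letters" rule can be sketched as a small scanning loop. LetterTokenizerSketch is an invented name, and the real tokenizer reads incrementally from a reader rather than scanning a whole String:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the LetterTokenizer idea: accumulate runs of
// letter characters and emit each run as a token whenever a non-letter
// (or the end of input) is reached.
class LetterTokenizerSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);               // extend the current run of letters
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // a non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());      // flush the final run
        }
        return tokens;
    }
}
```

Note how the apostrophe splits "don't" into two tokens, which is exactly the behavior the caveat above warns about for some inputs.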
LowerCaseFilter
    Normalizes token text to lower case.
LowerCaseTokenizer
    Performs the function of LetterTokenizer and LowerCaseFilter together: it divides text at non-letters and normalizes the resulting tokens to lower case.
PerFieldAnalyzerWrapper
    An Analyzer used to facilitate scenarios where different fields require different analysis techniques.
PorterStemFilter
    Transforms the token stream as per the Porter stemming algorithm.
SimpleAnalyzer
    An Analyzer that filters LetterTokenizer with LowerCaseFilter.
SinkTokenizer
    A SinkTokenizer can be used to cache Tokens for use in an Analyzer.
StopAnalyzer
    Filters LetterTokenizer with LowerCaseFilter and StopFilter.
StopFilter
    Removes stop words from a token stream.
TeeTokenFilter
    Works in conjunction with SinkTokenizer to provide the ability to set aside tokens that have already been analyzed. This is useful in situations where multiple fields share many common analysis steps and then go their separate ways. It is also useful for doing things like entity extraction or proper-noun analysis as part of the analysis workflow and saving off those tokens for use in another field:

        SinkTokenizer sink1 = new SinkTokenizer(null);
        SinkTokenizer sink2 = new SinkTokenizer(null);
        TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
        TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);
        TokenStream final1 = new LowerCaseFilter(source1);
        TokenStream final2 = source2;
        TokenStream final3 = new EntityDetect(sink1);
        TokenStream final4 = new URLDetect(sink2);
        d.add(new Field("f1", final1));
        d.add(new Field("f2", final2));
        d.add(new Field("f3", final3));
        d.add(new Field("f4", final4));

    In this example, sink1 and sink2 both receive tokens from both reader1 and reader2 after whitespace tokenization. Any of these streams can then be wrapped in further analysis, and more "sources" can be inserted if desired. Note that the EntityDetect and URLDetect TokenStreams are for the example only and do not currently exist in Lucene. See http://issues.apache.org/jira/browse/LUCENE-1058
Token
    A Token is an occurrence of a term from the text of a field. It consists of a term's text and the start and end offsets of the term in the field.
TokenFilter
    A TokenFilter is a TokenStream whose input is another TokenStream.
Tokenizer
    A Tokenizer is a TokenStream whose input is a TextReader.
TokenStream
    A TokenStream enumerates the sequence of tokens, either from fields of a document or from query text.
WhitespaceAnalyzer
    An Analyzer that uses WhitespaceTokenizer.
WhitespaceTokenizer
    A WhitespaceTokenizer divides text at whitespace; adjacent sequences of non-whitespace characters form tokens.
WordlistLoader
    Loader for text files that represent a list of stopwords.