Apache Lucene.Net 2.4.0 Class Library API

Lucene.Net.Analysis Namespace

Namespace hierarchy

Classes

Class Description
Analyzer An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
CachingTokenFilter This class can be used if the Tokens of a TokenStream are intended to be consumed more than once. It caches all Tokens locally in a List. CachingTokenFilter implements the optional method TokenStream.Reset(), which repositions the stream to the first Token. See the usage sketch after the class list.
CharArraySet A simple class that stores Strings as char[]'s in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the set, nor does it resize its hash table to be smaller, etc. It is designed to be quick to test if a char[] is in the set without the necessity of converting it to a String first. See the usage sketch after the class list.
CharArraySet.CharArraySetIterator An iterator over the elements of a CharArraySet.
CharTokenizer An abstract base class for simple, character-oriented tokenizers.
ISOLatin1AccentFilter A filter that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent.
KeywordAnalyzer "Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
KeywordTokenizer Emits the entire input as a single token.
LengthFilter Removes words that are too long or too short from the stream.
LetterTokenizer A LetterTokenizer is a tokenizer that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate (Char.IsLetter in .NET). Note: this does a decent job for most European languages, but a terrible job for some Asian languages, where words are not separated by spaces.
LowerCaseFilter Normalizes token text to lower case.
LowerCaseTokenizer LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the tokens to lower case.
PerFieldAnalyzerWrapper This analyzer is used to facilitate scenarios where different fields require different analysis techniques. See the usage sketch after the class list.
PorterStemFilter Transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lower case, so a LowerCaseFilter or LowerCaseTokenizer is needed farther down the tokenizer chain.
SimpleAnalyzer An Analyzer that filters LetterTokenizer with LowerCaseFilter.
SinkTokenizer A SinkTokenizer can be used to cache Tokens for use in an Analyzer. It is typically used in conjunction with a TeeTokenFilter (see below).
StopAnalyzer Filters LetterTokenizer with LowerCaseFilter and StopFilter.
StopFilter Removes stop words from a token stream. See the usage sketch after the class list.
TeeTokenFilter Works in conjunction with the SinkTokenizer to provide the ability to set aside tokens that have already been analyzed. This is useful in situations where multiple fields share many common analysis steps and then go their separate ways.

It is also useful for doing things like entity extraction or proper noun analysis as part of the analysis workflow and saving off those tokens for use in another field.
    // reader1 and reader2 are System.IO.TextReaders over the two source texts;
    // EntityDetect and URLDetect are illustrative TokenStreams (see the note below).
    SinkTokenizer sink1 = new SinkTokenizer(null);
    SinkTokenizer sink2 = new SinkTokenizer(null);
    TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
    TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);
    TokenStream final1 = new LowerCaseFilter(source1);
    TokenStream final2 = source2;
    TokenStream final3 = new EntityDetect(sink1);
    TokenStream final4 = new URLDetect(sink2);
    Document d = new Document();
    d.Add(new Field("f1", final1));
    d.Add(new Field("f2", final2));
    d.Add(new Field("f3", final3));
    d.Add(new Field("f4", final4));
In this example, sink1 and sink2 both receive the tokens produced from reader1 and reader2 by the WhitespaceTokenizers; any of these streams can then be wrapped in further analysis, and more "sources" can be inserted if desired. Note that the EntityDetect and URLDetect TokenStreams are illustrative only and do not currently exist in Lucene.

See http://issues.apache.org/jira/browse/LUCENE-1058
Token A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.
TokenFilter A TokenFilter is a TokenStream whose input is another TokenStream.
Tokenizer A Tokenizer is a TokenStream whose input is a System.IO.TextReader.
TokenStream A TokenStream enumerates the sequence of tokens, either from fields of a document or from query text.
WhitespaceAnalyzer An Analyzer that uses WhitespaceTokenizer.
WhitespaceTokenizer A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
WordlistLoader Loader for text files that represent a list of stopwords.
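
Usage sketch: CachingTokenFilter

A minimal sketch of consuming a cached stream twice, assuming the 2.4-era Next()/Reset() API; the sample text and the per-token work are placeholders.

    using System.IO;
    using Lucene.Net.Analysis;

    TextReader reader = new StringReader("some sample text");
    TokenStream stream = new CachingTokenFilter(new WhitespaceTokenizer(reader));

    // First pass: tokens are pulled from the tokenizer and cached locally.
    for (Token t = stream.Next(); t != null; t = stream.Next())
    {
        // consume the token, e.g. feed it to one field
    }

    stream.Reset(); // repositions the cached stream to the first Token

    // Second pass: the same tokens are served from the local cache.
    for (Token t = stream.Next(); t != null; t = stream.Next())
    {
        // consume the tokens again, e.g. feed a second field
    }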
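
Usage sketch: CharArraySet

A sketch of the allocation-free membership test described above; the exact Add/Contains overloads shown are assumptions based on the 2.4 API.

    using Lucene.Net.Analysis;

    CharArraySet set = new CharArraySet(10, true); // initial size, ignoreCase = true
    set.Add("lucene");
    set.Add("analysis");

    char[] buffer = new char[] { 'L', 'u', 'c', 'e', 'n', 'e' };
    // Test membership directly on the char[] -- no String is created.
    bool found = set.Contains(buffer, 0, buffer.Length); // true, because ignoreCase was set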
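
Usage sketch: PerFieldAnalyzerWrapper

A sketch of mixing analyzers per field; the "zipcode" field name is illustrative.

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    // StandardAnalyzer handles all fields by default; "zipcode" gets KeywordAnalyzer.
    PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    wrapper.AddAnalyzer("zipcode", new KeywordAnalyzer());
    // The "zipcode" field is now kept as a single token, as described for
    // KeywordAnalyzer above, while every other field uses StandardAnalyzer.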
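
Usage sketch: StopFilter

A sketch of stripping common English stop words from a lower-cased stream, assuming the String[]-based constructor of the 2.4 API.

    using System.IO;
    using Lucene.Net.Analysis;

    TextReader reader = new StringReader("The quick brown fox");
    TokenStream stream = new LowerCaseTokenizer(reader);
    // StopAnalyzer.ENGLISH_STOP_WORDS holds a default English stop word list.
    stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
    // "the" is now dropped; "quick", "brown", and "fox" remain.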