Apache Lucene.Net 2.4.0 Class Library API

Lucene.Net.Analysis Namespace

Namespace hierarchy

Classes

Class Description
Analyzer An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
CachingTokenFilter This class can be used if the Tokens of a TokenStream are intended to be consumed more than once. It caches all Tokens locally in a List. CachingTokenFilter implements the optional method TokenStream.Reset(), which repositions the stream to the first Token. See the usage sketch after the class list.
CharArraySet A simple class that stores Strings as char[]'s in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the set, nor does it resize its hash table to be smaller, etc. It is designed to be quick to test if a char[] is in the set without the necessity of converting it to a String first. See the usage sketch after the class list.
CharArraySet.CharArraySetIterator An iterator over the elements of a CharArraySet.
CharTokenizer An abstract base class for simple, character-oriented tokenizers.
ISOLatin1AccentFilter A filter that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent.
KeywordAnalyzer "Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
KeywordTokenizer Emits the entire input as a single token.
LengthFilter Removes words that are too long or too short from the stream.
LetterTokenizer A LetterTokenizer is a tokenizer that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate (Char.IsLetter in .NET). Note: this does a decent job for most European languages, but a terrible job for some Asian languages, where words are not separated by spaces.
LowerCaseFilter Normalizes token text to lower case.
LowerCaseTokenizer LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the tokens to lower case.
PerFieldAnalyzerWrapper This analyzer is used to facilitate scenarios where different fields require different analysis techniques. See the usage sketch after the class list.
PorterStemFilter Transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lower case, so a LowerCaseFilter or LowerCaseTokenizer is needed farther down the tokenizer chain.
SimpleAnalyzer An Analyzer that filters LetterTokenizer with LowerCaseFilter.
SinkTokenizer A SinkTokenizer can be used to cache Tokens for use in an Analyzer. It is typically used in conjunction with a TeeTokenFilter (see below).
StopAnalyzer Filters LetterTokenizer with LowerCaseFilter and StopFilter.
StopFilter Removes stop words from a token stream. See the usage sketch after the class list.
TeeTokenFilter Works in conjunction with the SinkTokenizer to provide the ability to set aside tokens that have already been analyzed. This is useful in situations where multiple fields share many common analysis steps and then go their separate ways.

It is also useful for doing things like entity extraction or proper noun analysis as part of the analysis workflow and saving off those tokens for use in another field.
    // reader1 and reader2 are System.IO.TextReaders over the two source texts;
    // EntityDetect and URLDetect are illustrative TokenStreams (see the note below).
    SinkTokenizer sink1 = new SinkTokenizer(null);
    SinkTokenizer sink2 = new SinkTokenizer(null);
    TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
    TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);
    TokenStream final1 = new LowerCaseFilter(source1);
    TokenStream final2 = source2;
    TokenStream final3 = new EntityDetect(sink1);
    TokenStream final4 = new URLDetect(sink2);
    Document d = new Document();
    d.Add(new Field("f1", final1));
    d.Add(new Field("f2", final2));
    d.Add(new Field("f3", final3));
    d.Add(new Field("f4", final4));
In this example, sink1 and sink2 both receive the tokens produced from reader1 and reader2 by the WhitespaceTokenizers; any of these streams can then be wrapped in further analysis, and more "sources" can be inserted if desired. Note that the EntityDetect and URLDetect TokenStreams are illustrative only and do not currently exist in Lucene.

See http://issues.apache.org/jira/browse/LUCENE-1058
Token A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.
TokenFilter A TokenFilter is a TokenStream whose input is another TokenStream.
Tokenizer A Tokenizer is a TokenStream whose input is a System.IO.TextReader.
TokenStream A TokenStream enumerates the sequence of tokens, either from fields of a document or from query text.
WhitespaceAnalyzer An Analyzer that uses WhitespaceTokenizer.
WhitespaceTokenizer A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
WordlistLoader Loader for text files that represent a list of stopwords.
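
Usage sketch: CachingTokenFilter

A minimal sketch of consuming a cached stream twice, assuming the 2.4-era Next()/Reset() API; the sample text and the per-token work are placeholders.

    using System.IO;
    using Lucene.Net.Analysis;

    TextReader reader = new StringReader("some sample text");
    TokenStream stream = new CachingTokenFilter(new WhitespaceTokenizer(reader));

    // First pass: tokens are pulled from the tokenizer and cached locally.
    for (Token t = stream.Next(); t != null; t = stream.Next())
    {
        // consume the token, e.g. feed it to one field
    }

    stream.Reset(); // repositions the cached stream to the first Token

    // Second pass: the same tokens are served from the local cache.
    for (Token t = stream.Next(); t != null; t = stream.Next())
    {
        // consume the tokens again, e.g. feed a second field
    }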
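
Usage sketch: CharArraySet

A sketch of the allocation-free membership test described above; the exact Add/Contains overloads shown are assumptions based on the 2.4 API.

    using Lucene.Net.Analysis;

    CharArraySet set = new CharArraySet(10, true); // initial size, ignoreCase = true
    set.Add("lucene");
    set.Add("analysis");

    char[] buffer = new char[] { 'L', 'u', 'c', 'e', 'n', 'e' };
    // Test membership directly on the char[] -- no String is created.
    bool found = set.Contains(buffer, 0, buffer.Length); // true, because ignoreCase was set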
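
Usage sketch: PerFieldAnalyzerWrapper

A sketch of mixing analyzers per field; the "zipcode" field name is illustrative.

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    // StandardAnalyzer handles all fields by default; "zipcode" gets KeywordAnalyzer.
    PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    wrapper.AddAnalyzer("zipcode", new KeywordAnalyzer());
    // The "zipcode" field is now kept as a single token, as described for
    // KeywordAnalyzer above, while every other field uses StandardAnalyzer.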
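
Usage sketch: StopFilter

A sketch of stripping common English stop words from a lower-cased stream, assuming the String[]-based constructor of the 2.4 API.

    using System.IO;
    using Lucene.Net.Analysis;

    TextReader reader = new StringReader("The quick brown fox");
    TokenStream stream = new LowerCaseTokenizer(reader);
    // StopAnalyzer.ENGLISH_STOP_WORDS holds a default English stop word list.
    stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
    // "the" is now dropped; "quick", "brown", and "fox" remain.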