Class Analyzer

- java.lang.Object
  - org.apache.lucene.analysis.Analyzer

All Implemented Interfaces:
    Closeable, AutoCloseable
Direct Known Subclasses:
    AnalyzerWrapper

public abstract class Analyzer extends Object implements Closeable
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

In order to define what analysis is done, subclasses must define their TokenStreamComponents in createComponents(String, Reader). The components are then reused in each call to tokenStream(String, Reader).

Simple example:

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new FooTokenizer(reader);
        TokenStream filter = new FooFilter(source);
        filter = new BarFilter(filter);
        return new TokenStreamComponents(source, filter);
      }
    };

For more examples, see the Analysis package documentation.

For some concrete implementations bundled with Lucene, look in the analysis modules:
- Common: Analyzers for indexing content in different languages and domains.
- ICU: Exposes functionality from ICU to Apache Lucene.
- Kuromoji: Morphological analyzer for Japanese text.
- Morfologik: Dictionary-driven lemmatization for the Polish language.
- Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
- Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
- Stempel: Algorithmic Stemmer for the Polish Language.
- UIMA: Analysis integration with Apache UIMA.
Nested Class Summary

- static class Analyzer.GlobalReuseStrategy
  Deprecated. This implementation class will be hidden in Lucene 5.0.
- static class Analyzer.PerFieldReuseStrategy
  Deprecated. This implementation class will be hidden in Lucene 5.0.
- static class Analyzer.ReuseStrategy
  Strategy defining how TokenStreamComponents are reused per call to tokenStream(String, java.io.Reader).
- static class Analyzer.TokenStreamComponents
  This class encapsulates the outer components of a token stream.
Field Summary

- static Analyzer.ReuseStrategy GLOBAL_REUSE_STRATEGY
  A predefined Analyzer.ReuseStrategy that reuses the same components for every field.
- static Analyzer.ReuseStrategy PER_FIELD_REUSE_STRATEGY
  A predefined Analyzer.ReuseStrategy that reuses components per field by maintaining a Map of TokenStreamComponents per field name.
Constructor Summary

- Analyzer()
  Create a new Analyzer, reusing the same set of components per-thread across calls to tokenStream(String, Reader).
- Analyzer(Analyzer.ReuseStrategy reuseStrategy)
  Expert: create a new Analyzer with a custom Analyzer.ReuseStrategy.
Method Summary

- void close()
  Frees persistent resources used by this Analyzer.
- protected abstract Analyzer.TokenStreamComponents createComponents(String fieldName, Reader reader)
  Creates a new Analyzer.TokenStreamComponents instance for this analyzer.
- int getOffsetGap(String fieldName)
  Just like getPositionIncrementGap(java.lang.String), but for Token offsets instead.
- int getPositionIncrementGap(String fieldName)
  Invoked before indexing an IndexableField instance if terms have already been added to that field.
- Analyzer.ReuseStrategy getReuseStrategy()
  Returns the used Analyzer.ReuseStrategy.
- protected Reader initReader(String fieldName, Reader reader)
  Override this if you want to add a CharFilter chain.
- TokenStream tokenStream(String fieldName, Reader reader)
  Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.
- TokenStream tokenStream(String fieldName, String text)
  Returns a TokenStream suitable for fieldName, tokenizing the contents of text.
Field Detail

GLOBAL_REUSE_STRATEGY

    public static final Analyzer.ReuseStrategy GLOBAL_REUSE_STRATEGY

A predefined Analyzer.ReuseStrategy that reuses the same components for every field.

PER_FIELD_REUSE_STRATEGY

    public static final Analyzer.ReuseStrategy PER_FIELD_REUSE_STRATEGY

A predefined Analyzer.ReuseStrategy that reuses components per field by maintaining a Map of TokenStreamComponents per field name.
Constructor Detail

Analyzer

    public Analyzer()

Create a new Analyzer, reusing the same set of components per-thread across calls to tokenStream(String, Reader).
Analyzer

    public Analyzer(Analyzer.ReuseStrategy reuseStrategy)

Expert: create a new Analyzer with a custom Analyzer.ReuseStrategy.

NOTE: if you just want to reuse on a per-field basis, it is easier to use a subclass of AnalyzerWrapper such as PerFieldAnalyzerWrapper instead.
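For example, a per-field setup via PerFieldAnalyzerWrapper might look like the following sketch (assuming Lucene 4.x with the analyzers-common module on the classpath; the field names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Hypothetical setup: "id" fields are indexed verbatim (no tokenization);
// every other field falls back to StandardAnalyzer.
Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("id", new KeywordAnalyzer());
Analyzer analyzer = new PerFieldAnalyzerWrapper(
    new StandardAnalyzer(Version.LUCENE_47), perField);
```

The wrapper handles per-field component reuse internally, so no custom ReuseStrategy subclass is needed.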
Method Detail

createComponents

    protected abstract Analyzer.TokenStreamComponents createComponents(String fieldName, Reader reader)

Creates a new Analyzer.TokenStreamComponents instance for this analyzer.

Parameters:
    fieldName - the name of the field whose content is passed to the Analyzer.TokenStreamComponents sink as a reader
    reader - the reader passed to the Tokenizer constructor
Returns:
    the Analyzer.TokenStreamComponents for this analyzer
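As a concrete variant of the FooTokenizer example in the class description, a createComponents implementation using real tokenizers from the Lucene 4.x analyzers-common module might look like this sketch:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Split on whitespace, then lower-case each token.
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_47, reader);
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source);
    return new TokenStreamComponents(source, filter);
  }
};
```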
tokenStream

    public final TokenStream tokenStream(String fieldName, Reader reader) throws IOException

Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.

This method uses createComponents(String, Reader) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

NOTE: If your data is available as a String, use tokenStream(String, String), which reuses a StringReader-like instance internally.

Parameters:
    fieldName - the name of the field the created TokenStream is used for
    reader - the reader the stream's source reads from
Returns:
    TokenStream for iterating the analyzed content of reader
Throws:
    AlreadyClosedException - if the Analyzer is closed
    IOException - if an I/O error occurs
See Also:
    tokenStream(String, String)
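The consumer workflow referred to above (reset, incrementToken, end, close) can be sketched as follows, assuming a Lucene 4.x StandardAnalyzer; the field name and text are illustrative:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
TokenStream ts = analyzer.tokenStream("body", new StringReader("Some Text"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
try {
  ts.reset();                  // mandatory before the first incrementToken()
  while (ts.incrementToken()) {
    System.out.println(term.toString());
  }
  ts.end();                    // record final offset/position state
} finally {
  ts.close();                  // releases the Reader; components stay cached for reuse
}
```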
tokenStream

    public final TokenStream tokenStream(String fieldName, String text) throws IOException

Returns a TokenStream suitable for fieldName, tokenizing the contents of text.

This method uses createComponents(String, Reader) to obtain an instance of Analyzer.TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through Analyzer.TokenStreamComponents.setReader(Reader).

NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Analysis package documentation for some examples demonstrating this.

Parameters:
    fieldName - the name of the field the created TokenStream is used for
    text - the String the stream's source reads from
Returns:
    TokenStream for iterating the analyzed content of text
Throws:
    AlreadyClosedException - if the Analyzer is closed
    IOException - if an I/O error occurs (may rarely happen for strings)
See Also:
    tokenStream(String, Reader)
initReader

    protected Reader initReader(String fieldName, Reader reader)

Override this if you want to add a CharFilter chain.

The default implementation returns reader unchanged.

Parameters:
    fieldName - IndexableField name being indexed
    reader - original Reader
Returns:
    reader, optionally decorated with CharFilter(s)
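A sketch of overriding this hook, assuming the Lucene 4.x analyzers-common module (which provides HTMLStripCharFilter); the whitespace tokenizer is only a placeholder component:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    return new TokenStreamComponents(new WhitespaceTokenizer(Version.LUCENE_47, reader));
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Strip HTML/XML markup before the Tokenizer sees the text.
    return new HTMLStripCharFilter(reader);
  }
};
```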
getPositionIncrementGap

    public int getPositionIncrementGap(String fieldName)

Invoked before indexing an IndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field, including across IndexableField instances, are in successive positions, allowing exact PhraseQuery matches, for instance, across IndexableField instance boundaries.

Parameters:
    fieldName - IndexableField name being indexed
Returns:
    position increment gap, added to the next token emitted from tokenStream(String, Reader). This value must be >= 0.
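For example, a subclass can return a large gap so that a PhraseQuery cannot match across the values of a multi-valued field (a sketch; the gap of 100 is an arbitrary illustrative value, and the tokenizer is a placeholder):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    return new TokenStreamComponents(new WhitespaceTokenizer(Version.LUCENE_47, reader));
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    // Hypothetical gap: leaves 100 empty positions between successive
    // values of the same field; any value >= 0 is allowed.
    return 100;
  }
};
```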
getOffsetGap

    public int getOffsetGap(String fieldName)

Just like getPositionIncrementGap(java.lang.String), but for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.

Parameters:
    fieldName - the field just indexed
Returns:
    offset gap, added to the next token emitted from tokenStream(String, Reader). This value must be >= 0.
getReuseStrategy

    public final Analyzer.ReuseStrategy getReuseStrategy()

Returns the used Analyzer.ReuseStrategy.
close

    public void close()

Frees persistent resources used by this Analyzer.

Specified by:
    close in interface Closeable
Specified by:
    close in interface AutoCloseable