Lucene.Net
3.0.3
Lucene.Net is a .NET port of the Java Lucene Indexing Library
A grammar-based tokenizer constructed with JFlex.
Inherits Lucene.Net.Analysis.Tokenizer.
Public Member Functions

StandardTokenizer (Version matchVersion, System.IO.TextReader input)
Creates a new instance of the Lucene.Net.Analysis.Standard.StandardTokenizer. Attaches the input to the newly created JFlex scanner.

StandardTokenizer (Version matchVersion, AttributeSource source, System.IO.TextReader input)
Creates a new StandardTokenizer with a given AttributeSource.

StandardTokenizer (Version matchVersion, AttributeFactory factory, System.IO.TextReader input)
Creates a new StandardTokenizer with a given Lucene.Net.Util.AttributeSource.AttributeFactory.

override bool IncrementToken ()
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Util.Attributes with the attributes of the next token.

override void End ()
This method is called by the consumer after the last token has been consumed, i.e., after IncrementToken returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature. This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g., when one or more whitespace characters trail the last token and a WhitespaceTokenizer was used.

override void Reset (System.IO.TextReader reader)
Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.

void SetReplaceInvalidAcronym (bool replaceInvalidAcronym)
Deprecated: to be removed in 3.x, when true becomes the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068
Public Attributes

const int ALPHANUM = 0
const int APOSTROPHE = 1
const int ACRONYM = 2
const int COMPANY = 3
const int EMAIL = 4
const int HOST = 5
const int NUM = 6
const int CJ = 7
const int ACRONYM_DEP = 8

Static Public Attributes

static readonly System.String[] TOKEN_TYPES = new System.String[]{"<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>", "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"}
String token types that correspond to the token type int constants.

Properties

int MaxTokenLength [get, set]
Gets or sets the maximum allowed token length. Any token longer than this is skipped.

Additional Inherited Members

Protected Member Functions inherited from Lucene.Net.Analysis.Tokenizer

override void Dispose (bool disposing)
A grammar-based tokenizer constructed with JFlex
This should be a good tokenizer for most European-language documents.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
You must specify the required Version compatibility when creating StandardTokenizer.
Definition at line 56 of file StandardTokenizer.cs.
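A minimal usage sketch (hedged: assumes the Lucene.Net 3.0.3 attribute API, where ITermAttribute and ITypeAttribute carry the token text and type; the sample input string is illustrative):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

class StandardTokenizerDemo
{
    static void Main()
    {
        // Tokenize a short string; the Version argument selects back-compat behavior.
        using (var reader = new StringReader("The quick brown fox, admin@example.com"))
        using (var tokenizer = new StandardTokenizer(Version.LUCENE_30, reader))
        {
            // Retrieve attribute references once, before the consumption loop.
            var term = tokenizer.AddAttribute<ITermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            while (tokenizer.IncrementToken())
            {
                Console.WriteLine("{0}\t{1}", term.Term, type.Type);
            }
            tokenizer.End(); // perform end-of-stream work, e.g. set the final offset
        }
    }
}
```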
Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer (Version matchVersion, System.IO.TextReader input)
Creates a new instance of the Lucene.Net.Analysis.Standard.StandardTokenizer. Attaches the input
to the newly created JFlex scanner.
Parameters:
matchVersion: the Lucene Version compatibility to match
input: the input reader
See http://issues.apache.org/jira/browse/LUCENE-1068
Definition at line 106 of file StandardTokenizer.cs.
Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer (Version matchVersion, AttributeSource source, System.IO.TextReader input)
Creates a new StandardTokenizer with a given AttributeSource.
Definition at line 114 of file StandardTokenizer.cs.
Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer (Version matchVersion, AttributeFactory factory, System.IO.TextReader input)
Creates a new StandardTokenizer with a given Lucene.Net.Util.AttributeSource.AttributeFactory
Definition at line 124 of file StandardTokenizer.cs.
override void Lucene.Net.Analysis.Standard.StandardTokenizer.End () [virtual]
This method is called by the consumer after the last token has been consumed, i.e., after IncrementToken returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature. This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g., when one or more whitespace characters trail the last token and a WhitespaceTokenizer was used.
Throws: IOException
Reimplemented from Lucene.Net.Analysis.TokenStream.
Definition at line 207 of file StandardTokenizer.cs.
override bool Lucene.Net.Analysis.Standard.StandardTokenizer.IncrementToken () [virtual]
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Util.Attributes with the attributes of the next token.
The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.CaptureState to create a copy of the current attribute state.
This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.AddAttribute{T}() and AttributeSource.GetAttribute{T}(), references to all Util.Attributes that this stream uses should be retrieved during instantiation.
To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in IncrementToken().
Implements Lucene.Net.Analysis.TokenStream.
Definition at line 159 of file StandardTokenizer.cs.
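The contract above — attributes added during instantiation, references cached, and no per-token attribute lookups inside IncrementToken — can be sketched with a hypothetical filter (the class name and behavior are illustrative, not part of the library):

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Hypothetical filter that upper-cases each term. It caches its
// attribute reference at construction time rather than per token.
public sealed class UpperCaseSketchFilter : TokenFilter
{
    private readonly ITermAttribute termAttr;

    public UpperCaseSketchFilter(TokenStream input) : base(input)
    {
        // Add (or retrieve) the attribute once, during instantiation,
        // so downstream consumers know it is available.
        termAttr = AddAttribute<ITermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (!input.IncrementToken())
            return false;
        termAttr.SetTermBuffer(termAttr.Term.ToUpperInvariant());
        return true;
    }
}
```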
override void Lucene.Net.Analysis.Standard.StandardTokenizer.Reset (System.IO.TextReader reader) [virtual]
Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.
Reimplemented from Lucene.Net.Analysis.Tokenizer.
Definition at line 214 of file StandardTokenizer.cs.
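A sketch of the reuse pattern, in the spirit of an analyzer's reusableTokenStream method (assumes the same 3.0.3 API as above; the document strings are placeholders):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

// Create the tokenizer once...
var tokenizer = new StandardTokenizer(Version.LUCENE_30, new StringReader("first document"));
// ... consume all tokens, then call End() ...

// ...and re-point it at the next document instead of allocating a new one.
tokenizer.Reset(new StringReader("second document"));
// ... consume tokens again from the new reader ...
```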
void Lucene.Net.Analysis.Standard.StandardTokenizer.SetReplaceInvalidAcronym (bool replaceInvalidAcronym)
Deprecated: to be removed in 3.x, when true becomes the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068

Parameters:
replaceInvalidAcronym: set to true to replace mischaracterized acronyms as HOST.
Definition at line 227 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.ACRONYM = 2
Definition at line 67 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.ACRONYM_DEP = 8
Deprecated: this solves a bug where HOSTs that end with '.' are identified as ACRONYMs.
Definition at line 78 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.ALPHANUM = 0
Definition at line 65 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.APOSTROPHE = 1
Definition at line 66 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.CJ = 7
Definition at line 72 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.COMPANY = 3
Definition at line 68 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.EMAIL = 4
Definition at line 69 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.HOST = 5
Definition at line 70 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.NUM = 6
Definition at line 71 of file StandardTokenizer.cs.
static readonly System.String[] Lucene.Net.Analysis.Standard.StandardTokenizer.TOKEN_TYPES [static]
String token types that correspond to token type int constants
Definition at line 81 of file StandardTokenizer.cs.
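The integer constants index into this array, so a token's type value can be mapped to its display string:

```csharp
using System.Diagnostics;
using Lucene.Net.Analysis.Standard;

// EMAIL == 4, and TOKEN_TYPES[4] is "<EMAIL>" per the definitions above.
string emailType = StandardTokenizer.TOKEN_TYPES[StandardTokenizer.EMAIL];
Debug.Assert(emailType == "<EMAIL>");
```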
int Lucene.Net.Analysis.Standard.StandardTokenizer.MaxTokenLength [get, set]
Gets or sets the maximum allowed token length. Any token longer than this is skipped.
Definition at line 91 of file StandardTokenizer.cs.
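Usage sketch (tokenizer is assumed to be a StandardTokenizer instance created as above):

```csharp
// Skip any token longer than 10 characters (illustrative limit).
tokenizer.MaxTokenLength = 10;
```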