Lucene.Net
3.0.3
Lucene.Net is a .NET port of the Java Lucene Indexing Library
A grammar-based tokenizer constructed with JFlex.
Inherits Lucene.Net.Analysis.Tokenizer.
Public Member Functions

StandardTokenizer (Version matchVersion, System.IO.TextReader input)
Creates a new instance of the Lucene.Net.Analysis.Standard.StandardTokenizer. Attaches the input to the newly created JFlex scanner.

StandardTokenizer (Version matchVersion, AttributeSource source, System.IO.TextReader input)
Creates a new StandardTokenizer with a given AttributeSource.

StandardTokenizer (Version matchVersion, AttributeFactory factory, System.IO.TextReader input)
Creates a new StandardTokenizer with a given Lucene.Net.Util.AttributeSource.AttributeFactory.

override bool IncrementToken ()
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Util.Attributes with the attributes of the next token.

override void End ()
This method is called by the consumer after the last token has been consumed, i.e., after IncrementToken returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature. This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g., when one or more whitespace characters trail the last token and a WhitespaceTokenizer was used.

override void Reset (System.IO.TextReader reader)
Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.

void SetReplaceInvalidAcronym (bool replaceInvalidAcronym)
Deprecated: to be removed in 3.x, when true becomes the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068
Public Attributes

const int ALPHANUM = 0
const int APOSTROPHE = 1
const int ACRONYM = 2
const int COMPANY = 3
const int EMAIL = 4
const int HOST = 5
const int NUM = 6
const int CJ = 7
const int ACRONYM_DEP = 8

Static Public Attributes

static readonly System.String[] TOKEN_TYPES = new System.String[]{"<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>", "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"}
String token types that correspond to the token type int constants.

Properties

int MaxTokenLength [get, set]
Gets or sets the maximum allowed token length. Any token longer than this is skipped.

Additional Inherited Members

Protected Member Functions inherited from Lucene.Net.Analysis.Tokenizer

override void Dispose (bool disposing)
A grammar-based tokenizer constructed with JFlex
This should be a good tokenizer for most European-language documents.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
You must specify the required Version compatibility when creating StandardTokenizer.
Definition at line 56 of file StandardTokenizer.cs.
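A minimal usage sketch (hedged: assumes the Lucene.Net 3.0.3 attribute API, where ITermAttribute and ITypeAttribute carry the token text and type; the sample input string is illustrative):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

class StandardTokenizerDemo
{
    static void Main()
    {
        // Tokenize a short string; the Version argument selects back-compat behavior.
        using (var reader = new StringReader("The quick brown fox, admin@example.com"))
        using (var tokenizer = new StandardTokenizer(Version.LUCENE_30, reader))
        {
            // Retrieve attribute references once, before the consumption loop.
            var term = tokenizer.AddAttribute<ITermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            while (tokenizer.IncrementToken())
            {
                Console.WriteLine("{0}\t{1}", term.Term, type.Type);
            }
            tokenizer.End(); // perform end-of-stream work, e.g. set the final offset
        }
    }
}
```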
Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer (Version matchVersion, System.IO.TextReader input)
Creates a new instance of the Lucene.Net.Analysis.Standard.StandardTokenizer. Attaches the input
to the newly created JFlex scanner.
Parameters:
matchVersion: the Lucene Version compatibility to match
input: the input reader
See http://issues.apache.org/jira/browse/LUCENE-1068
Definition at line 106 of file StandardTokenizer.cs.
Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer (Version matchVersion, AttributeSource source, System.IO.TextReader input)
Creates a new StandardTokenizer with a given AttributeSource.
Definition at line 114 of file StandardTokenizer.cs.
Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer (Version matchVersion, AttributeFactory factory, System.IO.TextReader input)
Creates a new StandardTokenizer with a given Lucene.Net.Util.AttributeSource.AttributeFactory
Definition at line 124 of file StandardTokenizer.cs.
override void Lucene.Net.Analysis.Standard.StandardTokenizer.End () [virtual]
This method is called by the consumer after the last token has been consumed, i.e., after IncrementToken returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature. This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g., when one or more whitespace characters trail the last token and a WhitespaceTokenizer was used.
Throws: IOException
Reimplemented from Lucene.Net.Analysis.TokenStream.
Definition at line 207 of file StandardTokenizer.cs.
override bool Lucene.Net.Analysis.Standard.StandardTokenizer.IncrementToken () [virtual]
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Util.Attributes with the attributes of the next token.
The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.CaptureState to create a copy of the current attribute state.
This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.AddAttribute{T}() and AttributeSource.GetAttribute{T}(), references to all Util.Attributes that this stream uses should be retrieved during instantiation.
To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in IncrementToken().
Implements Lucene.Net.Analysis.TokenStream.
Definition at line 159 of file StandardTokenizer.cs.
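The contract above — attributes added during instantiation, references cached, and no per-token attribute lookups inside IncrementToken — can be sketched with a hypothetical filter (the class name and behavior are illustrative, not part of the library):

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Hypothetical filter that upper-cases each term. It caches its
// attribute reference at construction time rather than per token.
public sealed class UpperCaseSketchFilter : TokenFilter
{
    private readonly ITermAttribute termAttr;

    public UpperCaseSketchFilter(TokenStream input) : base(input)
    {
        // Add (or retrieve) the attribute once, during instantiation,
        // so downstream consumers know it is available.
        termAttr = AddAttribute<ITermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (!input.IncrementToken())
            return false;
        termAttr.SetTermBuffer(termAttr.Term.ToUpperInvariant());
        return true;
    }
}
```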
override void Lucene.Net.Analysis.Standard.StandardTokenizer.Reset (System.IO.TextReader reader) [virtual]
Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.
Reimplemented from Lucene.Net.Analysis.Tokenizer.
Definition at line 214 of file StandardTokenizer.cs.
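A sketch of the reuse pattern, in the spirit of an analyzer's reusableTokenStream method (assumes the same 3.0.3 API as above; the document strings are placeholders):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

// Create the tokenizer once...
var tokenizer = new StandardTokenizer(Version.LUCENE_30, new StringReader("first document"));
// ... consume all tokens, then call End() ...

// ...and re-point it at the next document instead of allocating a new one.
tokenizer.Reset(new StringReader("second document"));
// ... consume tokens again from the new reader ...
```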
void Lucene.Net.Analysis.Standard.StandardTokenizer.SetReplaceInvalidAcronym (bool replaceInvalidAcronym)
Deprecated: to be removed in 3.x, when true becomes the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068

Parameters:
replaceInvalidAcronym: set to true to replace mischaracterized acronyms as HOST.
Definition at line 227 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.ACRONYM = 2
Definition at line 67 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.ACRONYM_DEP = 8
Deprecated: this solves a bug where HOSTs that end with '.' are identified as ACRONYMs.
Definition at line 78 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.ALPHANUM = 0
Definition at line 65 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.APOSTROPHE = 1
Definition at line 66 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.CJ = 7
Definition at line 72 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.COMPANY = 3
Definition at line 68 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.EMAIL = 4
Definition at line 69 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.HOST = 5
Definition at line 70 of file StandardTokenizer.cs.
const int Lucene.Net.Analysis.Standard.StandardTokenizer.NUM = 6
Definition at line 71 of file StandardTokenizer.cs.
static readonly System.String[] Lucene.Net.Analysis.Standard.StandardTokenizer.TOKEN_TYPES [static]
String token types that correspond to token type int constants
Definition at line 81 of file StandardTokenizer.cs.
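The integer constants index into this array, so a token's type value can be mapped to its display string:

```csharp
using System.Diagnostics;
using Lucene.Net.Analysis.Standard;

// EMAIL == 4, and TOKEN_TYPES[4] is "<EMAIL>" per the definitions above.
string emailType = StandardTokenizer.TOKEN_TYPES[StandardTokenizer.EMAIL];
Debug.Assert(emailType == "<EMAIL>");
```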
int Lucene.Net.Analysis.Standard.StandardTokenizer.MaxTokenLength [get, set]
Gets or sets the maximum allowed token length. Any token longer than this is skipped.
Definition at line 91 of file StandardTokenizer.cs.
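Usage sketch (tokenizer is assumed to be a StandardTokenizer instance created as above):

```csharp
// Skip any token longer than 10 characters (illustrative limit).
tokenizer.MaxTokenLength = 10;
```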