A grammar-based tokenizer constructed with JFlex

This should be a good tokenizer for most European-language documents:

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

You must specify the required Version compatibility when creating StandardAnalyzer:

As of 2.4, tokens incorrectly identified as acronyms are corrected (see LUCENE-1608).
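The following is a minimal sketch of passing a Version when constructing the tokenizer and then reading the tokens it produces. It assumes the 2.9-style attribute API (a non-generic AddAttribute(Type) call and a TermAttribute interface with a Term() method); later Lucene.Net releases changed these names. The StandardTokenizerSketch class, its Tokenize helper, and the sample sentence are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
// Alias avoids a clash with System.Version.
using Version = Lucene.Net.Util.Version;

// Hypothetical helper class, not part of Lucene.Net.
public static class StandardTokenizerSketch
{
    // Collects the term text of every token StandardTokenizer emits for the input.
    public static List<string> Tokenize(string text)
    {
        var terms = new List<string>();

        // The Version argument selects the compatibility behavior described above.
        var tokenizer = new StandardTokenizer(Version.LUCENE_29, new StringReader(text));

        // Assumption: 2.9-style attribute lookup by Type, cast to the interface.
        var termAtt = (TermAttribute) tokenizer.AddAttribute(typeof(TermAttribute));

        try
        {
            // IncrementToken advances to the next token and returns false at end of input.
            while (tokenizer.IncrementToken())
            {
                terms.Add(termAtt.Term());
            }
        }
        finally
        {
            // Releases the underlying reader.
            tokenizer.Close();
        }

        return terms;
    }

    public static void Main()
    {
        // Sample input exercising an email address and a hyphenated token with digits.
        foreach (string term in Tokenize("Mail support@example.com about part XJ-200."))
        {
            Console.WriteLine(term);
        }
    }
}
```

Note that StandardTokenizer only splits text; it does not lowercase terms or remove stop words. Those steps are applied by filters that an analyzer such as StandardAnalyzer chains after the tokenizer.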

Namespace: Lucene.Net.Analysis.Standard
Assembly: Lucene.Net (in Lucene.Net.dll) Version: 2.9.4.1

Syntax

C#
public class StandardTokenizer : Tokenizer

Visual Basic
Public Class StandardTokenizer _
    Inherits Tokenizer

Visual C++
public ref class StandardTokenizer : public Tokenizer

Inheritance Hierarchy

System.Object
  Lucene.Net.Util.AttributeSource
    Lucene.Net.Analysis.TokenStream
      Lucene.Net.Analysis.Tokenizer
        Lucene.Net.Analysis.Standard.StandardTokenizer
