NutchDocumentTokenizer (Nutch 1.2 API)

org.apache.nutch.analysis
Class NutchDocumentTokenizer

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.Tokenizer
              org.apache.nutch.analysis.NutchDocumentTokenizer

All Implemented Interfaces: Closeable, NutchAnalysisConstants

public final class NutchDocumentTokenizer
extends Tokenizer
implements NutchAnalysisConstants

The tokenizer used for Nutch document text. Implemented in terms of our JavaCC-generated lexical analyzer, NutchAnalysisTokenManager, shared with the query parser.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`AttributeSource.AttributeFactory, AttributeSource.State`

Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer
`input`

Fields inherited from interface org.apache.nutch.analysis.NutchAnalysisConstants
`ACRONYM, APOSTROPHE, ATSIGN, C_PLUS_PLUS, C_SHARP, CJK, COLON, DEFAULT, DIGIT, DOT, EOF, IRREGULAR_WORD, LETTER, MINUS, PLUS, QUOTE, SIGRAM, SLASH, tokenImage, WHITE, WORD, WORD_PUNCT`

Constructor Summary
`NutchDocumentTokenizer(Reader reader)` Construct a tokenizer for the text in a Reader.

Method Summary
`boolean`	`incrementToken()` Lucene 3.0 API: advances the stream to the next token, returning false at end of stream.
`static void`	`main(String[] args)` For debugging.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
`close, correctOffset, reset`

Methods inherited from class org.apache.lucene.analysis.TokenStream
`end, reset`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Constructor Detail

NutchDocumentTokenizer

public NutchDocumentTokenizer(Reader reader)

Construct a tokenizer for the text in a Reader.

Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException

Lucene 3.0 API. Advances the stream to the next token and returns false once the input is exhausted.

Specified by: incrementToken in class TokenStream

Throws: IOException
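
The consume loop implied by the boolean return value can be sketched without Lucene or Nutch on the classpath. The example below is an illustration only: `TokenSource`, `WhitespaceSource`, and `collect` are hypothetical stand-ins invented for this sketch, and the whitespace splitting does not reproduce the behavior of NutchAnalysisTokenManager. With the real classes, the same loop would presumably call incrementToken() on a NutchDocumentTokenizer and read each term from an attribute obtained through the inherited addAttribute method.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {

    /** Hypothetical stand-in for the incrementToken() contract; not a Lucene class. */
    interface TokenSource {
        /** Advances to the next token; returns false once the input is exhausted. */
        boolean incrementToken();

        /** Text of the current token; only meaningful after incrementToken() returned true. */
        String term();
    }

    /** Stand-in that splits on whitespace. It does NOT mirror Nutch's JavaCC grammar. */
    static final class WhitespaceSource implements TokenSource {
        private final String[] parts;
        private int next = 0;
        private String current;

        WhitespaceSource(String text) {
            String trimmed = text.trim();
            this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        public boolean incrementToken() {
            if (next >= parts.length) {
                return false; // end of stream
            }
            current = parts[next++];
            return true;
        }

        public String term() {
            return current;
        }
    }

    /** Drains a token source with the usual while-loop and collects the terms in order. */
    static List<String> collect(String text) {
        TokenSource source = new WhitespaceSource(text);
        List<String> terms = new ArrayList<String>();
        while (source.incrementToken()) {
            terms.add(source.term());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(collect("Nutch document text"));
    }
}
```

The caller never asks for a token count up front; it keeps calling incrementToken() until the method returns false, reading the current token's state only after a true result.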

main

public static void main(String[] args)
                 throws Exception

For debugging.

Throws: Exception