org.apache.nutch.analysis
Class NutchDocumentTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.nutch.analysis.NutchDocumentTokenizer
All Implemented Interfaces:
Closeable, NutchAnalysisConstants

public final class NutchDocumentTokenizer
extends Tokenizer
implements NutchAnalysisConstants

The tokenizer used for Nutch document text. It is implemented on top of the JavaCC-generated lexical analyzer NutchAnalysisTokenManager, which is shared with the query parser.


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Fields inherited from interface org.apache.nutch.analysis.NutchAnalysisConstants
ACRONYM, APOSTROPHE, ATSIGN, C_PLUS_PLUS, C_SHARP, CJK, COLON, DEFAULT, DIGIT, DOT, EOF, IRREGULAR_WORD, LETTER, MINUS, PLUS, QUOTE, SIGRAM, SLASH, tokenImage, WHITE, WORD, WORD_PUNCT
 
Constructor Summary
NutchDocumentTokenizer(Reader reader)
          Constructs a tokenizer for the text in a Reader.
 
Method Summary
 boolean incrementToken()
          Advances the stream to the next token (Lucene 3.0 TokenStream API).
static void main(String[] args)
          Command-line entry point, for debugging.
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, reset
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
end, reset
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

NutchDocumentTokenizer

public NutchDocumentTokenizer(Reader reader)
Constructs a tokenizer for the text in a Reader.

Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException
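Implements the Lucene 3.0 TokenStream API: advances the stream to the next token and returns true, or returns false once the input is exhausted.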

Specified by:
incrementToken in class TokenStream
Throws:
IOException
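
The calling pattern implied by the incrementToken contract can be sketched with plain JDK classes. This is only an illustration under stated assumptions: SketchTokenizer and TokenLoopSketch are hypothetical stand-ins, not part of Nutch or Lucene, and the stand-in splits on non-alphanumeric characters instead of using the JavaCC grammar behind NutchDocumentTokenizer. What carries over is the shape of the loop: call incrementToken() until it returns false, reading the current token after each successful call.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in that mirrors only the incrementToken() contract:
 * true after exposing the next token, false once the input is exhausted.
 * Its splitting rule is an assumption made for this sketch and is NOT
 * the grammar used by NutchDocumentTokenizer.
 */
final class SketchTokenizer {
    private final Reader input;
    private final StringBuilder term = new StringBuilder();

    SketchTokenizer(Reader input) {
        this.input = input;
    }

    /** Advances to the next token; returns false at end of stream. */
    boolean incrementToken() throws IOException {
        term.setLength(0);
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                term.append((char) c);
            } else if (term.length() > 0) {
                return true; // a delimiter closes the current token
            }
        }
        return term.length() > 0; // flush a trailing token, if any
    }

    /** Text of the token exposed by the last successful incrementToken(). */
    String term() {
        return term.toString();
    }
}

final class TokenLoopSketch {
    /** Drains the tokenizer with the usual while-incrementToken loop. */
    static List<String> tokens(String text) {
        SketchTokenizer tokenizer = new SketchTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            while (tokenizer.incrementToken()) {
                out.add(tokenizer.term());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With the real class, the same loop would wrap a NutchDocumentTokenizer built from a Reader and read each token through the attributes registered on the stream; which tokens it produces depends on the Nutch grammar, not on the rule used in this stand-in.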

main

public static void main(String[] args)
                 throws Exception
Command-line entry point, for debugging.

Throws:
Exception


Copyright © 2006 The Apache Software Foundation