org.apache.nutch.analysis
Class NutchDocumentTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.nutch.analysis.NutchDocumentTokenizer
- All Implemented Interfaces:
- Closeable, NutchAnalysisConstants
public final class NutchDocumentTokenizer
- extends Tokenizer
- implements NutchAnalysisConstants
The tokenizer used for Nutch document text. Implemented in terms of our
JavaCC-generated lexical analyzer, NutchAnalysisTokenManager, shared with the query parser.
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Fields inherited from interface org.apache.nutch.analysis.NutchAnalysisConstants
ACRONYM, APOSTROPHE, ATSIGN, C_PLUS_PLUS, C_SHARP, CJK, COLON, DEFAULT, DIGIT, DOT, EOF, IRREGULAR_WORD, LETTER, MINUS, PLUS, QUOTE, SIGRAM, SLASH, tokenImage, WHITE, WORD, WORD_PUNCT
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
NutchDocumentTokenizer
public NutchDocumentTokenizer(Reader reader)
- Construct a tokenizer for the text in a Reader.
incrementToken
public boolean incrementToken()
throws IOException
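- Advances the stream to the next token and returns false once the input is exhausted. Implements the Lucene 3.0 TokenStream API.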
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
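The constructor and incrementToken together imply a pull-style loop: wrap the text in a Reader, then call incrementToken until it returns false. The sketch below illustrates only that loop shape and depends on nothing beyond the JDK. WhitespaceSketchTokenizer, its term() accessor, and TokenLoopSketch are hypothetical stand-ins, not Nutch or Lucene classes, and the whitespace splitting is an assumption that does not reproduce the JavaCC grammar. With the real class, term text would presumably be read through an attribute obtained via the inherited addAttribute method rather than through a term() method.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in, NOT the Nutch class: splits on whitespace only. */
final class WhitespaceSketchTokenizer {
    private final Reader input;
    private final StringBuilder term = new StringBuilder();

    /** Mirrors the documented constructor shape: a tokenizer over a Reader. */
    WhitespaceSketchTokenizer(Reader reader) {
        this.input = reader;
    }

    /** Advances to the next token; returns false once the input is exhausted. */
    boolean incrementToken() throws IOException {
        term.setLength(0);
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (term.length() > 0) {
                    return true; // whitespace closes the current token
                }
            } else {
                term.append((char) c);
            }
        }
        return term.length() > 0; // last token before end of input, if any
    }

    /** Text of the current token (assumed accessor for this sketch only). */
    String term() {
        return term.toString();
    }
}

final class TokenLoopSketch {
    /** Drains a tokenizer with the while (incrementToken()) idiom. */
    static List<String> collectTokens(String text) {
        WhitespaceSketchTokenizer tokenizer =
                new WhitespaceSketchTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (tokenizer.incrementToken()) {
                tokens.add(tokenizer.term());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : collectTokens("Nutch  document text")) {
            System.out.println(token);
        }
    }
}
```

The loop relies only on the boolean return value of incrementToken, so the same consumer code works for any token source that follows this contract.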
main
public static void main(String[] args)
throws Exception
- For debugging.
- Throws:
Exception
Copyright © 2006 The Apache Software Foundation