filter
public NutchDocument filter(NutchDocument doc,
                            Parse parse,
                            Text urlText,
                            CrawlDatum datum,
                            Inlinks inlinks)
                     throws IndexingException
Adds fields or otherwise modifies the document that will be indexed for a
parse. Unwanted documents can be removed from indexing by returning a null
value.
- Specified by:
filter in interface IndexingFilter
- Parameters:
doc - document instance for collecting fields
parse - parse data instance
urlText - page URL
datum - crawl datum for the page (fetch datum from segment containing
fetch status and fetch time)
inlinks - page inlinks
- Returns:
- modified (or a new) document instance, or null (meaning the
document should be discarded)
- Throws:
IndexingException
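
The return contract described above (return the modified document, or null to have it discarded) can be illustrated with a minimal, self-contained sketch. It deliberately does not use the Nutch classes: `FilterContractSketch`, `Doc`, the field names `url` and `contentLength`, and the "discard pages with no text" rule are all hypothetical stand-ins chosen for illustration, and the simplified method takes plain strings and omits the declared IndexingException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins: none of these types are the Nutch classes.
class FilterContractSketch {

    /** Stand-in for the document being indexed: a bag of named fields. */
    public static class Doc {
        public final Map<String, String> fields = new LinkedHashMap<>();
    }

    /**
     * Mirrors the documented contract: add fields and return the document,
     * or return null so that the caller discards it.
     */
    public static Doc filter(Doc doc, String parsedText, String url) {
        // Assumed rule for this sketch: pages with no extracted text are unwanted.
        if (parsedText == null || parsedText.trim().isEmpty()) {
            return null;
        }
        // Otherwise enrich the same document instance and hand it back.
        doc.fields.put("url", url);
        doc.fields.put("contentLength", String.valueOf(parsedText.length()));
        return doc;
    }

    public static void main(String[] args) {
        Doc kept = filter(new Doc(), "hello world", "http://example.com/");
        System.out.println(kept == null ? "discarded" : "kept: " + kept.fields);

        Doc dropped = filter(new Doc(), "   ", "http://example.com/empty");
        System.out.println(dropped == null ? "discarded" : "kept: " + dropped.fields);
    }
}
```

A caller following this contract would check the result for null before passing the document on to the indexer. A real implementation would instead implement IndexingFilter with the five-parameter signature shown above.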