org.apache.nutch.parse
Interface Parser

All Superinterfaces:
Configurable, Pluggable
All Known Implementing Classes:
ExtParser, FeedParser, HtmlParser, JSParseFilter, SWFParser, TikaParser, ZipParser

public interface Parser
extends Pluggable, Configurable

A parser for content generated by a Protocol implementation. This interface is implemented by extensions. Nutch's core contains no page parsing code.


Field Summary
static String X_POINT_ID
          The name of the extension point.
 
Method Summary
 ParseResult getParse(Content c)
           This method parses the given content and returns a map of <key, parse> pairs.
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

X_POINT_ID

static final String X_POINT_ID
The name of the extension point.

Method Detail

getParse

ParseResult getParse(Content c)

This method parses the given content and returns a map of <key, parse> pairs. Each Parse instance will be persisted under its associated key.

Note: Meta-redirects should be followed only when they come from the original URL. That is:
Assume the fetcher is in parsing mode and is currently processing foo.bar.com/redirect.html. If this URL contains a meta-redirect to another URL, the fetcher should follow the redirect only if the map contains an entry of the form <"foo.bar.com/redirect.html", Parse with a ParseStatus indicating the redirect>.

Parameters:
c - Content to be parsed
Returns:
a map containing <key, parse> pairs
Since:
NUTCH-443
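
The meta-redirect rule described for getParse can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: a plain java.util.Map stands in for ParseResult, and ParseStub, redirectToFollow, and the target URLs are hypothetical names introduced only for this illustration. The sketch assumes the fetcher looks up the URL it is currently processing and follows a redirect only when that exact key carries a redirect status.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaRedirectSketch {

    /** Hypothetical stand-in for a Parse whose ParseStatus may indicate a redirect. */
    static final class ParseStub {
        final boolean redirect;      // true if the status indicates a meta-redirect
        final String redirectTarget; // target URL, or null when there is no redirect

        ParseStub(boolean redirect, String redirectTarget) {
            this.redirect = redirect;
            this.redirectTarget = redirectTarget;
        }
    }

    /**
     * Returns the redirect target the fetcher should follow, or null if none.
     * Only the entry keyed by the URL being fetched is consulted; redirects
     * recorded under any other key in the map are ignored.
     */
    static String redirectToFollow(String fetchedUrl, Map<String, ParseStub> parseResult) {
        ParseStub parse = parseResult.get(fetchedUrl);
        if (parse != null && parse.redirect) {
            return parse.redirectTarget;
        }
        return null;
    }

    public static void main(String[] args) {
        String fetched = "foo.bar.com/redirect.html";

        // Case 1: the map has an entry for the original URL with a redirect status.
        Map<String, ParseStub> withOriginal = new LinkedHashMap<>();
        withOriginal.put(fetched, new ParseStub(true, "foo.bar.com/target.html"));
        System.out.println(redirectToFollow(fetched, withOriginal));

        // Case 2: a redirect exists, but only under a different key
        // (for example a sub-document produced by the parser).
        Map<String, ParseStub> otherKeyOnly = new LinkedHashMap<>();
        otherKeyOnly.put("foo.bar.com/item.html", new ParseStub(true, "foo.bar.com/elsewhere.html"));
        System.out.println(redirectToFollow(fetched, otherKeyOnly));
    }
}
```

In the first case the lookup yields the redirect target; in the second it yields null, because the redirect is not recorded under the URL being fetched. A real implementation would work with ParseResult, Parse, and ParseStatus rather than these stand-ins.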


Copyright © 2012 The Apache Software Foundation