org.apache.nutch.protocol.http.api
Class HttpRobotRulesParser

java.lang.Object
  extended by org.apache.nutch.protocol.RobotRulesParser
      extended by org.apache.nutch.protocol.http.api.HttpRobotRulesParser
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable

public class HttpRobotRulesParser
extends RobotRulesParser

Parses robots.txt rules for URLs that use the HTTP protocol. It extends the generic RobotRulesParser class and adds the HTTP-specific implementation for obtaining the robots file.


Field Summary
protected  boolean allowForbidden
           
static org.slf4j.Logger LOG
           
 
Fields inherited from class org.apache.nutch.protocol.RobotRulesParser
agentNames, CACHE, EMPTY_RULES, FORBID_ALL_RULES
 
Constructor Summary
HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)
           
 
Method Summary
 crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url)
          For a host whose robots rules are not yet cached, sends an HTTP request to the host of the given URL, fetches the robots file, parses the rules, and caches the resulting rules object to avoid repeating the work later.
 
Methods inherited from class org.apache.nutch.protocol.RobotRulesParser
getConf, getRobotRulesSet, main, parseRules, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

allowForbidden

protected boolean allowForbidden
Constructor Detail

HttpRobotRulesParser

public HttpRobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Detail

getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http,
                                                             URL url)
For a host whose robots rules are not yet cached, sends an HTTP request to the host of the given URL, fetches the robots file, parses the rules, and caches the resulting rules object to avoid repeating the work later.

Specified by:
getRobotRulesSet in class RobotRulesParser
Parameters:
http - the Protocol object
url - the URL whose host's robots rules are requested
Returns:
A BaseRobotRules object holding the rules for the host
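
The fetch, parse, and cache flow described above can be sketched in plain Java. This is a minimal illustration, not the Nutch source. The class RobotsCacheSketch, the Rules enum, the cacheKey and url helpers, and the injected fetchStatus function are all hypothetical names. The status-code policy is also an assumption, since this page does not document allowForbidden: the sketch treats it as deciding whether a 403 response for the robots file still permits crawling.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

// Hypothetical sketch of a per-host robots rules cache; not the Nutch implementation.
class RobotsCacheSketch {

    // Stand-ins for a parsed rule set and for the inherited EMPTY_RULES / FORBID_ALL_RULES.
    enum Rules { PARSED, EMPTY, FORBID_ALL }

    private final Map<String, Rules> cache = new ConcurrentHashMap<>();
    private final boolean allowForbidden;
    // Assumed fetcher: takes the robots file URL, returns the HTTP status code.
    private final ToIntFunction<String> fetchStatus;

    RobotsCacheSketch(boolean allowForbidden, ToIntFunction<String> fetchStatus) {
        this.allowForbidden = allowForbidden;
        this.fetchStatus = fetchStatus;
    }

    // Convenience helper that hides the checked MalformedURLException.
    static URL url(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // One cache entry per protocol, host, and port; a missing port maps to the default.
    static String cacheKey(URL url) {
        int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
        return (url.getProtocol() + "://" + url.getHost() + ":" + port).toLowerCase(Locale.ROOT);
    }

    Rules getRules(URL url) {
        String key = cacheKey(url);
        Rules cached = cache.get(key);
        if (cached != null) {
            return cached; // rules for this host are already known, no request needed
        }
        int status = fetchStatus.applyAsInt(key + "/robots.txt");
        Rules rules;
        if (status == 200) {
            rules = Rules.PARSED;      // a real parser would parse the response body here
        } else if (status == 403 && !allowForbidden) {
            rules = Rules.FORBID_ALL;  // assumed policy: forbidden robots file blocks the host
        } else {
            rules = Rules.EMPTY;       // assumed policy: anything else places no restrictions
        }
        cache.put(key, rules);
        return rules;
    }
}
```

In this sketch, two URLs on the same host, such as http://example.com/a and http://example.com:80/b, share one cache entry, so the fetcher is invoked only once for them.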


Copyright © 2013 The Apache Software Foundation