org.apache.nutch.protocol
Class RobotRulesParser

java.lang.Object
  extended by org.apache.nutch.protocol.RobotRulesParser
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable
Direct Known Subclasses:
FtpRobotRulesParser, HttpRobotRulesParser

public abstract class RobotRulesParser
extends Object
implements org.apache.hadoop.conf.Configurable

This class uses crawler-commons to parse robots.txt files. It emits SimpleRobotRules objects, which describe download permissions as documented in SimpleRobotRulesParser.


Field Summary
protected  String agentNames
           
protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
           
static crawlercommons.robots.BaseRobotRules EMPTY_RULES
          A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
          A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
static org.slf4j.Logger LOG
           
 
Constructor Summary
RobotRulesParser()
           
RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
           
 
Method Summary
 org.apache.hadoop.conf.Configuration getConf()
          Get the Configuration object
 crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, String url)
           
abstract  crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
           
static void main(String[] argv)
          Command-line main for testing.
 crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
          Parses the robots.txt content using the SimpleRobotRulesParser from crawler-commons.
 void setConf(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration object
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

CACHE

protected static final Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE
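
The page does not say how CACHE is keyed or filled. As a minimal sketch of the general idea (a shared, synchronized map consulted before robots.txt is fetched again), the class below uses plain strings in place of BaseRobotRules. The class name, the key format (scheme, host and port), and the loader callback are all assumptions made for illustration, not Nutch code.

```java
import java.net.URI;
import java.util.Hashtable;
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical stand-in for a per-host rules cache; not taken from Nutch. */
class RobotsCacheSketch {
    // Hashtable is synchronized, so several fetcher threads can share one instance.
    private static final Hashtable<String, String> CACHE = new Hashtable<>();

    /** Assumed key format: scheme, host and port (-1 when the URI names no port). */
    static String cacheKey(URI uri) {
        String key = uri.getScheme() + ":" + uri.getHost() + ":" + uri.getPort();
        return key.toLowerCase(Locale.ROOT);
    }

    /** Returns the cached rules for the URI's host, calling the loader only on a miss. */
    static String rulesFor(URI uri, Function<URI, String> loader) {
        return CACHE.computeIfAbsent(cacheKey(uri), k -> loader.apply(uri));
    }
}
```

With a key of this shape, two pages on the same host share one cache entry, so the loader runs once per host rather than once per page.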

EMPTY_RULES

public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.


FORBID_ALL_RULES

public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
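
The two constants above cover the cases where there is nothing to parse. A minimal sketch of that decision follows, assuming an HTTP-style status code and using a hypothetical enum in place of the real BaseRobotRules objects. The helper name and the treatment of status codes other than 200, 403 and 404 are assumptions; the page does not describe them.

```java
import java.util.Optional;

/** Hypothetical sketch of choosing a fallback rule set; not taken from Nutch. */
class RobotsFallbackSketch {
    /** Stand-ins for EMPTY_RULES (allow all) and FORBID_ALL_RULES (allow none). */
    enum Fallback { ALLOW_ALL, ALLOW_NONE }

    /**
     * Picks a fallback from the outcome of fetching robots.txt.
     * An empty result means no fallback applies in this sketch, for example
     * a 200 response with content (which would be parsed instead).
     */
    static Optional<Fallback> fallbackFor(int httpStatus, byte[] body) {
        if (httpStatus == 403) {
            return Optional.of(Fallback.ALLOW_NONE); // forbidden: disallow every request
        }
        boolean emptyBody = body == null || body.length == 0;
        if (httpStatus == 404 || (httpStatus == 200 && emptyBody)) {
            return Optional.of(Fallback.ALLOW_ALL);  // missing or empty: allow every request
        }
        return Optional.empty();                     // other outcomes are out of scope here
    }

    /** A fallback answers every URL the same way. */
    static boolean isAllowed(Fallback fallback) {
        return fallback == Fallback.ALLOW_ALL;
    }
}
```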


agentNames

protected String agentNames
Constructor Detail

RobotRulesParser

public RobotRulesParser()

RobotRulesParser

public RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object

Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object

Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

parseRules

public crawlercommons.robots.BaseRobotRules parseRules(String url,
                                                       byte[] content,
                                                       String contentType,
                                                       String robotName)
Parses the robots.txt content using the SimpleRobotRulesParser from crawler-commons.

Parameters:
url - A string containing the URL
content - Contents of the robots file in a byte array
contentType - The content type of the robots file
robotName - A string containing value of
Returns:
BaseRobotRules object
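
To give a feel for what turning the byte content and a robot name into rules involves, here is a deliberately reduced sketch. It is not SimpleRobotRulesParser and does not reproduce its behavior: it reads only User-agent and Disallow lines, matches the robot name exactly (ignoring case), lets a group naming the robot take precedence over the `*` group, and ignores Allow, Crawl-delay, and path wildcards. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Deliberately tiny robots.txt reader; a teaching aid, not the crawler-commons parser. */
class TinyRobotsSketch {
    private enum Scope { NONE, WILDCARD, NAMED }

    /** Returns the Disallow prefixes that apply to robotName under the reduced rules above. */
    static List<String> disallowedPrefixes(byte[] content, String robotName) {
        List<String> named = new ArrayList<>();     // from groups naming this robot
        List<String> wildcard = new ArrayList<>();  // from groups for "*"
        Scope scope = Scope.NONE;
        boolean inAgentLines = false;               // true while reading consecutive User-agent lines
        boolean sawNamedGroup = false;
        for (String raw : new String(content, StandardCharsets.UTF_8).split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim();  // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;                           // blank or malformed line
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("user-agent")) {
                if (!inAgentLines) scope = Scope.NONE;         // a new group starts here
                inAgentLines = true;
                if (value.equalsIgnoreCase(robotName)) {
                    scope = Scope.NAMED;
                    sawNamedGroup = true;
                } else if (value.equals("*") && scope == Scope.NONE) {
                    scope = Scope.WILDCARD;
                }
            } else {
                inAgentLines = false;
                if (field.equalsIgnoreCase("disallow") && !value.isEmpty()) {
                    if (scope == Scope.NAMED) named.add(value);
                    else if (scope == Scope.WILDCARD) wildcard.add(value);
                }
            }
        }
        return sawNamedGroup ? named : wildcard;
    }

    /** A path is allowed when no collected prefix matches its start. */
    static boolean isAllowed(List<String> prefixes, String path) {
        return prefixes.stream().noneMatch(path::startsWith);
    }
}
```

In real use the parsing is delegated to crawler-commons, which handles many more directives and edge cases than this sketch.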

getRobotRulesSet

public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol,
                                                             String url)

getRobotRulesSet

public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol,
                                                                      URL url)
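
The two getRobotRulesSet overloads suggest a common pattern: a convenience method that accepts a string, converts it to a URL, and delegates to the abstract method that each protocol-specific subclass implements. The sketch below shows that pattern under stated assumptions. The type parameter R stands in for BaseRobotRules, the Protocol argument is left out, and returning a permissive fallback for an unparsable URL is an assumption rather than documented behavior.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

/** Hypothetical sketch of a String/URL overload pair; R stands in for BaseRobotRules. */
abstract class RulesLookupSketch<R> {
    private final R fallback;

    RulesLookupSketch(R fallback) {
        this.fallback = fallback;
    }

    /** Each protocol-specific subclass decides how rules are obtained for a URL. */
    abstract R getRules(URL url);

    /** Convenience overload: parse the string first, then delegate. */
    R getRules(String url) {
        URL parsed;
        try {
            parsed = new URI(url).toURL();
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            return fallback; // assumption: an unparsable URL gets the fallback rules
        }
        return getRules(parsed);
    }
}
```

Keeping the parsing step outside the delegated call means an exception thrown by a subclass is not mistaken for a bad URL.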

main

public static void main(String[] argv)
Command-line main for testing.



Copyright © 2013 The Apache Software Foundation