public abstract class RobotRulesParser extends Object implements org.apache.hadoop.conf.Configurable
This class uses crawler-commons for handling the parsing of robots.txt files.
It emits SimpleRobotRules objects, which describe the download permissions
as described in SimpleRobotRulesParser.

Modifier and Type | Field and Description |
---|---|
protected String | agentNames |
protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> | CACHE |
static crawlercommons.robots.BaseRobotRules | EMPTY_RULES: A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed. |
static crawlercommons.robots.BaseRobotRules | FORBID_ALL_RULES: A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed. |
static org.slf4j.Logger | LOG |
Constructor and Description |
---|
RobotRulesParser() |
RobotRulesParser(org.apache.hadoop.conf.Configuration conf) |
Modifier and Type | Method and Description |
---|---|
org.apache.hadoop.conf.Configuration | getConf(): Get the Configuration object |
crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url) |
abstract crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol protocol, URL url) |
static void | main(String[] argv): command-line main for testing |
crawlercommons.robots.BaseRobotRules | parseRules(String url, byte[] content, String contentType, String robotName): Parses the robots content using the SimpleRobotRulesParser from crawler commons |
void | setConf(org.apache.hadoop.conf.Configuration conf): Set the Configuration object |
public static final org.slf4j.Logger LOG
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
protected String agentNames
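The descriptions of EMPTY_RULES and FORBID_ALL_RULES imply a choice based on how the robots.txt fetch went. The following is a minimal sketch of that choice using only the JDK. The class RobotsFallbackSketch, the enum FallbackRules, and the method fallbackFor are hypothetical names made up for this illustration and are not part of Nutch or crawler-commons. Treating HTTP 404 as "missing", and an HTTP 200 response with no bytes as "empty", are assumptions of the sketch.

```java
import java.net.HttpURLConnection;

/** Illustrative stand-in only; these types are hypothetical and not part of Nutch. */
class RobotsFallbackSketch {

    /** Hypothetical rule set that gives the same answer for every URL. */
    enum FallbackRules {
        EMPTY_RULES(true),        // robots.txt empty or missing: allow everything
        FORBID_ALL_RULES(false);  // robots.txt answered 403/Forbidden: disallow everything

        private final boolean allowAll;

        FallbackRules(boolean allowAll) {
            this.allowAll = allowAll;
        }

        boolean isAllowed(String url) {
            return allowAll;
        }
    }

    /**
     * Picks a fallback for a robots.txt response, or returns null when neither
     * description applies (for example, non-empty content that should be parsed).
     * Mapping 404 to "missing" and an empty 200 body to "empty" is an assumption.
     */
    static FallbackRules fallbackFor(int httpStatus, byte[] content) {
        if (httpStatus == HttpURLConnection.HTTP_FORBIDDEN) {
            return FallbackRules.FORBID_ALL_RULES;
        }
        boolean missing = httpStatus == HttpURLConnection.HTTP_NOT_FOUND;
        boolean empty = httpStatus == HttpURLConnection.HTTP_OK
                && (content == null || content.length == 0);
        if (missing || empty) {
            return FallbackRules.EMPTY_RULES;
        }
        return null;
    }

    public static void main(String[] args) {
        FallbackRules forbidden = fallbackFor(403, new byte[0]);
        FallbackRules missing = fallbackFor(404, null);
        System.out.println(forbidden.isAllowed("http://example.com/page")); // prints "false"
        System.out.println(missing.isAllowed("http://example.com/page"));   // prints "true"
    }
}
```

Returning null for every other case keeps the sketch from guessing at behaviour that this page does not document, such as how server errors are handled.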
public RobotRulesParser()
public RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
Specified by: setConf in interface org.apache.hadoop.conf.Configurable
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
Specified by: getConf in interface org.apache.hadoop.conf.Configurable
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Parses the robots content using the SimpleRobotRulesParser from crawler commons.
Parameters:
url - A string containing url
content - Contents of the robots file in a byte array
contentType - The
robotName - A string containing value of
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url)
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
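To give a rough idea of what parseRules consumes (robots.txt bytes plus a robot name) and what the resulting rules answer (whether a path may be fetched), here is a deliberately simplified, JDK-only sketch. It is not the SimpleRobotRulesParser algorithm: it handles only User-agent and Disallow lines with plain prefix matching, ignores the url and contentType arguments, and assumes UTF-8 content. The class RobotsParseSketch and its methods are hypothetical names used for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified illustration only; this is not the crawler-commons parser. */
class RobotsParseSketch {

    /**
     * Collects the Disallow path prefixes that apply to robotName. Groups that
     * name the robot take precedence over the "*" group.
     */
    static List<String> disallowedPrefixes(byte[] content, String robotName) {
        String agent = robotName.toLowerCase(Locale.ROOT);
        List<String> specific = new ArrayList<>();
        List<String> wildcard = new ArrayList<>();
        boolean groupMatchesAgent = false;
        boolean groupMatchesWildcard = false;
        boolean sawSpecificGroup = false;
        boolean readingAgentLines = false;

        String text = new String(content, StandardCharsets.UTF_8);
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!readingAgentLines) {
                    // A User-agent line that follows rule lines starts a new group.
                    groupMatchesAgent = false;
                    groupMatchesWildcard = false;
                }
                readingAgentLines = true;
                String name = value.toLowerCase(Locale.ROOT);
                if (name.equals("*")) {
                    groupMatchesWildcard = true;
                } else if (!name.isEmpty() && agent.contains(name)) {
                    groupMatchesAgent = true;
                    sawSpecificGroup = true;
                }
            } else {
                readingAgentLines = false;
                if (field.equals("disallow") && !value.isEmpty()) {
                    if (groupMatchesAgent) {
                        specific.add(value);
                    }
                    if (groupMatchesWildcard) {
                        wildcard.add(value);
                    }
                }
            }
        }
        return sawSpecificGroup ? specific : wildcard;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n\n"
                + "User-agent: nutch-test\nDisallow: /tmp/\n";
        byte[] content = robots.getBytes(StandardCharsets.UTF_8);
        List<String> rules = disallowedPrefixes(content, "nutch-test");
        System.out.println(isAllowed(rules, "/tmp/cache.html")); // prints "false"
        System.out.println(isAllowed(rules, "/index.html"));     // prints "true"
    }
}
```

In real code the parsing would be delegated to crawler-commons, as the parseRules description states; the sketch only makes the input and output shapes concrete.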
public static void main(String[] argv)
Copyright © 2014 The Apache Software Foundation