public abstract class RobotRulesParser extends Object implements Configurable
This class handles the parsing of robots.txt files using crawler-commons. It emits SimpleRobotRules objects, which describe the download permissions as defined by SimpleRobotRulesParser.
Modifier and Type | Field and Description |
---|---|
protected String | agentNames |
protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> | CACHE |
static crawlercommons.robots.BaseRobotRules | EMPTY_RULES: A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed. |
static crawlercommons.robots.BaseRobotRules | FORBID_ALL_RULES: A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed. |
static org.slf4j.Logger | LOG |
Constructor and Description |
---|
RobotRulesParser() |
RobotRulesParser(Configuration conf) |
Modifier and Type | Method and Description |
---|---|
Configuration | getConf(): Get the Configuration object |
crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol protocol, Text url) |
abstract crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol protocol, URL url) |
static void | main(String[] argv): Command-line main for testing |
crawlercommons.robots.BaseRobotRules | parseRules(String url, byte[] content, String contentType, String robotName): Parses the robots content using the SimpleRobotRulesParser from crawler commons |
void | setConf(Configuration conf): Set the Configuration object |
public static final org.slf4j.Logger LOG
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
protected String agentNames
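The difference between EMPTY_RULES and FORBID_ALL_RULES can be illustrated with a small, self-contained sketch. It does not use the crawler-commons classes: `RulesSketch`, `ALLOW_ALL`, `FORBID_ALL`, and `forFetchResult` are hypothetical stand-ins, and treating HTTP 404 as "missing" is an assumption made only for this example.

```java
import java.util.Optional;

/** Stand-in rule set for illustration; NOT crawlercommons.robots.BaseRobotRules. */
final class RulesSketch {
    // Hypothetical counterparts of EMPTY_RULES and FORBID_ALL_RULES.
    static final RulesSketch ALLOW_ALL = new RulesSketch(true);
    static final RulesSketch FORBID_ALL = new RulesSketch(false);

    private final boolean allowAll;

    private RulesSketch(boolean allowAll) {
        this.allowAll = allowAll;
    }

    /** A constant rule set gives the same answer for every URL. */
    boolean isAllowed(String url) {
        return allowAll;
    }

    /**
     * Chooses a constant rule set from the outcome of fetching robots.txt.
     * Returns an empty Optional when the content still has to be parsed.
     */
    static Optional<RulesSketch> forFetchResult(int httpStatus, byte[] content) {
        if (httpStatus == 403) {
            // 403/Forbidden: all requests are disallowed.
            return Optional.of(FORBID_ALL);
        }
        // Assumption: a 404 response counts as a missing robots.txt file.
        if (httpStatus == 404 || content == null || content.length == 0) {
            // Empty or missing file: all requests are allowed.
            return Optional.of(ALLOW_ALL);
        }
        return Optional.empty();
    }
}
```

In this sketch, only a non-empty robots.txt body falls through to a real parser; the two constant rule sets cover the cases the field descriptions name.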
public RobotRulesParser()
public RobotRulesParser(Configuration conf)
public void setConf(Configuration conf)
Set the Configuration object.
Specified by: setConf in interface Configurable
public Configuration getConf()
Get the Configuration object.
Specified by: getConf in interface Configurable
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Parses the robots content using the SimpleRobotRulesParser from crawler commons.
Parameters:
url - A string containing the URL
content - Contents of the robots file in a byte array
contentType - The content type of the robots file
robotName - A string containing all the robots agent names used by the parser for matching
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, Text url)
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
public static void main(String[] argv)
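Because getRobotRulesSet(Protocol, URL) is abstract, each protocol-specific subclass decides how rules are fetched, and the CACHE field suggests that results are reused between calls. The following is a minimal sketch of that lookup-then-load pattern, not the class's actual implementation. `RobotsCacheSketch`, `cacheKey`, and `getRules` are hypothetical names, the protocol-plus-host key format is an assumption, and `java.net.URI` is used instead of `URL` only to avoid checked exceptions in the example.

```java
import java.net.URI;
import java.util.Hashtable;
import java.util.Locale;
import java.util.function.Function;

/** Sketch of a per-host rules cache; every name here is hypothetical. */
final class RobotsCacheSketch<R> {
    // Same shape as the CACHE field: a synchronized, String-keyed table.
    private final Hashtable<String, R> cache = new Hashtable<>();
    // Stand-in for "fetch robots.txt and parse it".
    private final Function<URI, R> loader;

    RobotsCacheSketch(Function<URI, R> loader) {
        this.loader = loader;
    }

    /** Assumed key format: scheme and host (plus any explicit port), lower-cased. */
    static String cacheKey(URI uri) {
        String key = uri.getScheme() + ":" + uri.getHost();
        if (uri.getPort() != -1) {
            key += ":" + uri.getPort();
        }
        return key.toLowerCase(Locale.ROOT);
    }

    /** Returns cached rules for the URI's host, loading them on first use. */
    R getRules(URI uri) {
        return cache.computeIfAbsent(cacheKey(uri), k -> loader.apply(uri));
    }
}
```

With this key, two URLs on the same host share one cache entry, while the same host under a different scheme is loaded separately.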
Copyright © 2015 The Apache Software Foundation