public class HttpRobotRulesParser extends RobotRulesParser

This class extends the RobotRulesParser class and contains the HTTP protocol specific implementation for obtaining the robots file.

Field Summary

Modifier and Type | Field and Description |
---|---|
protected boolean | allowForbidden |
static org.slf4j.Logger | LOG |
Fields inherited from class RobotRulesParser: agentNames, CACHE, EMPTY_RULES, FORBID_ALL_RULES
Constructor Summary

Constructor and Description |
---|
HttpRobotRulesParser(Configuration conf) |
Method Summary

Modifier and Type | Method and Description |
---|---|
protected static String | getCacheKey(URL url): Compose unique key to store and access robot rules in cache for a given URL |
crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol http, URL url): Get the rules from robots.txt which apply for the given url |
void | setConf(Configuration conf): Set the Configuration object |
Methods inherited from class RobotRulesParser: getConf, getRobotRulesSet, main, parseRules
Field Detail

public static final org.slf4j.Logger LOG

protected boolean allowForbidden

Constructor Detail

public HttpRobotRulesParser(Configuration conf)
Method Detail

public void setConf(Configuration conf)

Description copied from class: RobotRulesParser
Set the Configuration object.

Specified by: setConf in interface Configurable
Overrides: setConf in class RobotRulesParser
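The allowForbidden field is populated from the Configuration passed to setConf. A minimal sketch of wiring this up; the property name http.robots.403.allow and the behaviour attached to it (treat a 403 response for robots.txt as allow-all) are assumptions taken from typical Nutch configurations and should be checked against nutch-default.xml for your version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class RobotsConfExample {
  public static void main(String[] args) {
    // Load the standard Nutch configuration (nutch-default.xml / nutch-site.xml).
    Configuration conf = NutchConfiguration.create();

    // Assumed property name: when true, a 403/Forbidden answer for robots.txt
    // is treated as "allow all" via the allowForbidden field.
    conf.setBoolean("http.robots.403.allow", true);

    // Construct with the Configuration; setConf(Configuration) applies the same settings.
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    System.out.println("parser configured: " + parser);
  }
}
```

In a real crawl these settings normally come from nutch-site.xml rather than being set programmatically.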
protected static String getCacheKey(URL url)
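Only the signature survives in the detail entry above; the summary describes getCacheKey as composing a unique cache key for a URL, and getRobotRulesSet below states that rules are cached per protocol, host, and port combination. A minimal illustrative sketch of such a key; the exact layout and the default-port handling are assumptions, not the actual Nutch implementation:

```java
import java.net.URL;
import java.util.Locale;

class CacheKeySketch {
  // Illustrative only: build a "protocol:host:port" key so that all URLs on the
  // same protocol/host/port share one cached robots.txt ruleset. How the real
  // HttpRobotRulesParser.getCacheKey normalizes the key is an assumption here.
  static String getCacheKey(URL url) {
    String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
    String host = url.getHost().toLowerCase(Locale.ROOT);
    int port = (url.getPort() == -1) ? url.getDefaultPort() : url.getPort();
    return protocol + ":" + host + ":" + port;
  }
}
```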
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url)

Get the rules from robots.txt which apply for the given url.

Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt is then parsed and the rules are cached to avoid re-fetching and re-parsing it again.

Specified by: getRobotRulesSet in class RobotRulesParser
Parameters:
http - The Protocol object
url - URL robots.txt applies to

Returns:
BaseRobotRules holding the rules from robots.txt
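A usage sketch tying the pieces together: obtain the rules once, then consult them for individual URLs; subsequent calls for the same protocol/host/port combination are served from the cache as described above. The use of the protocol-http plugin's Http class as the Protocol implementation and the example URL are assumptions for illustration:

```java
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.Http;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class RobotRulesExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();

    // Assumed Protocol implementation from the protocol-http plugin.
    Http http = new Http();
    http.setConf(conf);

    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);

    URL url = new URL("http://example.com/some/page.html");
    // First call fetches and parses http://example.com:80/robots.txt; later calls
    // for the same protocol/host/port are answered from the cache.
    BaseRobotRules rules = parser.getRobotRulesSet(http, url);

    System.out.println("allowed:     " + rules.isAllowed(url.toString()));
    System.out.println("crawl delay: " + rules.getCrawlDelay());
  }
}
```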