java.lang.Object
  org.apache.nutch.protocol.RobotRulesParser
public abstract class RobotRulesParser
This class uses crawler-commons for handling the parsing of robots.txt files.
It emits SimpleRobotRules objects, which describe the download permissions
as defined by SimpleRobotRulesParser.
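The two constant rule sets exposed by this class (EMPTY_RULES and FORBID_ALL_RULES, see the field summary below) capture the "allow everything" and "forbid everything" cases. The following self-contained sketch illustrates their semantics; the class and method names here are hypothetical stand-ins, not the real crawler-commons API:

```java
// Illustrative stand-in for the semantics of EMPTY_RULES and
// FORBID_ALL_RULES. This is NOT crawlercommons.robots.BaseRobotRules;
// it is a minimal sketch of the two constant rule sets.
public class RobotRulesSketch {
    /** Simplified rules object: either allows every URL or none. */
    public static final class SimpleRules {
        private final boolean allowAll;
        private SimpleRules(boolean allowAll) { this.allowAll = allowAll; }
        public boolean isAllowed(String url) { return allowAll; }
    }

    // Used when robots.txt is empty or missing: all requests are allowed.
    public static final SimpleRules EMPTY_RULES = new SimpleRules(true);
    // Used when the robots.txt fetch returned 403/Forbidden:
    // all requests are disallowed.
    public static final SimpleRules FORBID_ALL_RULES = new SimpleRules(false);

    public static void main(String[] args) {
        System.out.println(EMPTY_RULES.isAllowed("http://example.com/page"));      // true
        System.out.println(FORBID_ALL_RULES.isAllowed("http://example.com/page")); // false
    }
}
```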
Field Summary
---
protected String agentNames

protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE

static crawlercommons.robots.BaseRobotRules EMPTY_RULES
    A BaseRobotRules object appropriate for use when the robots.txt file is
    empty or missing; all requests are allowed.

static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
    A BaseRobotRules object appropriate for use when the robots.txt file is
    not fetched due to a 403/Forbidden response; all requests are disallowed.

static org.slf4j.Logger LOG
Constructor Summary
---
RobotRulesParser()

RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Summary
---
org.apache.hadoop.conf.Configuration getConf()
    Get the Configuration object.

crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, String url)

abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)

static void main(String[] argv)
    Command-line main for testing.

crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
    Parses the robots content using the SimpleRobotRulesParser from crawler-commons.

void setConf(org.apache.hadoop.conf.Configuration conf)
    Set the Configuration object.
Methods inherited from class java.lang.Object
---
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
---
public static final org.slf4j.Logger LOG

protected static final Hashtable<String,crawlercommons.robots.BaseRobotRules> CACHE

public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
    A BaseRobotRules object appropriate for use when the robots.txt file is
    empty or missing; all requests are allowed.

public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
    A BaseRobotRules object appropriate for use when the robots.txt file is
    not fetched due to a 403/Forbidden response; all requests are disallowed.

protected String agentNames
Constructor Detail
---
public RobotRulesParser()

public RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
Method Detail
---
public void setConf(org.apache.hadoop.conf.Configuration conf)
    Set the Configuration object.
    Specified by: setConf in interface org.apache.hadoop.conf.Configurable

public org.apache.hadoop.conf.Configuration getConf()
    Get the Configuration object.
    Specified by: getConf in interface org.apache.hadoop.conf.Configurable
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
    Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
    Parameters:
        url - A string containing the URL of the robots.txt file
        content - Contents of the robots file in a byte array
        contentType - The content type of the robots file
        robotName - A string containing the agent name(s) to match against the robots.txt rules
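In Nutch the actual parsing is delegated to crawler-commons' SimpleRobotRulesParser. As a rough, self-contained illustration of what such parsing involves (this is NOT the crawler-commons implementation, and the class name MiniRobotsParser is a hypothetical stand-in), the sketch below collects Disallow path prefixes for a matching User-agent group:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustrative robots.txt parsing, loosely analogous to what
// parseRules() delegates to crawler-commons. This sketch only handles
// a single User-agent group with plain Disallow prefixes; the real
// parser supports grouped agents, Allow rules, crawl-delay, and more.
public class MiniRobotsParser {
    /** Return the Disallow path prefixes that apply to the given agent name. */
    public static List<String> disallowsFor(String robotsTxt, String agentName) {
        List<String> disallows = new ArrayList<>();
        boolean inMatchingGroup = false;
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                String agent = line.substring("user-agent:".length()).trim();
                // "*" matches any agent; otherwise compare case-insensitively.
                inMatchingGroup = agent.equals("*")
                        || agent.equalsIgnoreCase(agentName);
            } else if (inMatchingGroup && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    disallows.add(path);
                }
            }
        }
        return disallows;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private\nDisallow: /tmp\n";
        System.out.println(disallowsFor(robots, "nutch")); // [/private, /tmp]
    }
}
```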
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, String url)
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
public static void main(String[] argv)