public class File extends Object implements Protocol

Creates a FileResponse object and gets the content of the url from it.
Configurable parameters are file.content.limit and file.crawl.parent, defined under the "file properties" section of nutch-default.xml.

Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |

Fields inherited from interface Protocol: CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID
Constructor and Description |
---|
File() |
Modifier and Type | Method and Description |
---|---|
org.apache.hadoop.conf.Configuration | getConf() Get the Configuration object. |
ProtocolOutput | getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum) Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received. |
crawlercommons.robots.BaseRobotRules | getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum) No robots parsing is done for file protocol. |
static void | main(String[] args) Quick way for running this class. |
void | setConf(org.apache.hadoop.conf.Configuration conf) Set the Configuration object. |
void | setMaxContentLength(int maxContentLength) Set the length after which content is truncated. |
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
Specified by: setConf in interface org.apache.hadoop.conf.Configurable

public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
Specified by: getConf in interface org.apache.hadoop.conf.Configurable

public void setMaxContentLength(int maxContentLength)
Set the length after which content is truncated.
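
The truncation behavior that setMaxContentLength controls can be pictured with a minimal, self-contained sketch. This is not the Nutch implementation: the class TruncationSketch and its truncate helper are hypothetical names introduced for illustration, and treating a negative limit as "no truncation" is an assumption made here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncationSketch {

    // Hypothetical helper, not part of Nutch: caps content at maxContentLength bytes.
    // Assumption: a negative limit disables truncation entirely.
    public static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content; // nothing to cut
        }
        // Keep only the first maxContentLength bytes.
        return Arrays.copyOf(content, maxContentLength);
    }

    public static void main(String[] args) {
        byte[] content = "hello, nutch".getBytes(StandardCharsets.UTF_8); // 12 bytes
        System.out.println(truncate(content, 5).length);  // prints 5
        System.out.println(truncate(content, -1).length); // prints 12
    }
}
```

Under these assumptions, a limit of 0 yields empty content, while any limit at or above the content size returns the content unchanged.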
public ProtocolOutput getProtocolOutput(org.apache.hadoop.io.Text url, CrawlDatum datum)
Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received.
Specified by: getProtocolOutput in interface Protocol
Parameters:
url - Text containing the url
datum - The CrawlDatum object corresponding to the url
Returns: ProtocolOutput object for the content of the file indicated by url

public static void main(String[] args) throws Exception
Quick way for running this class.
Throws: Exception
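
As a rough illustration of what getting "the content of the file indicated by url" involves, the following sketch resolves a file: url to a local path and reads its bytes using only the JDK. It deliberately does not use FileResponse, ProtocolOutput, or CrawlDatum; the class FileUrlSketch and the method readFileUrl are hypothetical names, and the real plugin may handle directories, redirects, and errors differently.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileUrlSketch {

    // Hypothetical helper, not Nutch code: reads the bytes behind a file: url.
    public static byte[] readFileUrl(String url) throws IOException {
        URI uri = URI.create(url);
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("Not a file: url: " + url);
        }
        Path path = Paths.get(uri); // maps the file: URI to a local path
        return Files.readAllBytes(path);
    }

    public static void main(String[] args) throws IOException {
        // Write a temporary file, then read it back through its file: url.
        Path tmp = Files.createTempFile("sketch", ".txt");
        Files.write(tmp, "sample content".getBytes(StandardCharsets.UTF_8));
        byte[] bytes = readFileUrl(tmp.toUri().toString());
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```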
public crawlercommons.robots.BaseRobotRules getRobotRules(org.apache.hadoop.io.Text url, CrawlDatum datum)
No robots parsing is done for file protocol.
Specified by: getRobotRules in interface Protocol
Parameters:
url - url to check
datum - page datum

Copyright © 2014 The Apache Software Foundation