java.lang.Object
  org.apache.nutch.protocol.file.File
public class File
This class is a protocol plugin for the file: scheme. It creates a FileResponse object and gets the content of the URL from it.
The configurable parameters are file.content.limit and file.crawl.parent, defined in the "file properties" section of nutch-default.xml.
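
As a rough illustration of what fetching content for a file: URL involves, the sketch below resolves a URL to a local path and reads its bytes using only the JDK. It is not the FileResponse implementation; the class FileUrlReaderSketch and the method readFileUrl are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Nutch.
final class FileUrlReaderSketch {

  /** Resolves a file: URL to a local path and returns the file's bytes. */
  static byte[] readFileUrl(String url) {
    URI uri = URI.create(url);
    // Reject anything that is not a file: URL (including scheme-less input).
    if (!"file".equalsIgnoreCase(uri.getScheme())) {
      throw new IllegalArgumentException("Not a file: URL: " + url);
    }
    Path path = Paths.get(uri);
    try {
      return Files.readAllBytes(path);
    } catch (IOException e) {
      // Wrap so callers are not forced to handle a checked exception.
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    // Create a throwaway file so the example does not depend on local paths.
    Path tmp = Files.createTempFile("sketch", ".txt");
    Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
    byte[] content = readFileUrl(tmp.toUri().toString());
    System.out.println(content.length + " bytes read from " + tmp.toUri());
    Files.delete(tmp);
  }
}
```

The real plugin returns a ProtocolOutput rather than a raw byte array; the sketch only covers the step of mapping a URL to file content.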
Field Summary | |
---|---|
static org.slf4j.Logger | LOG |
Fields inherited from interface org.apache.nutch.protocol.Protocol |
---|
CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID |
Constructor Summary | |
---|---|
File() | |
Method Summary | |
---|---|
org.apache.hadoop.conf.Configuration | getConf() Get the Configuration object. |
Collection<WebPage.Field> | getFields() |
ProtocolOutput | getProtocolOutput(String url, WebPage page) Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received. |
crawlercommons.robots.BaseRobotRules | getRobotRules(String url, WebPage page) No robots parsing is done for file protocol. |
static void | main(String[] args) Quick way for running this class. |
void | setConf(org.apache.hadoop.conf.Configuration conf) Set the Configuration object. |
void | setMaxContentLength(int maxContentLength) Set the point at which content is truncated. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final org.slf4j.Logger LOG
Constructor Detail |
---|
public File()
Method Detail |
---|
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object.
Specified by: setConf in interface org.apache.hadoop.conf.Configurable
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object.
Specified by: getConf in interface org.apache.hadoop.conf.Configurable
public void setMaxContentLength(int maxContentLength)
Set the point at which content is truncated.
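
The effect of a truncation point such as the one passed to setMaxContentLength can be sketched as follows. This is an illustration under stated assumptions, not Nutch's code: the class ContentLimitSketch and its truncate method are hypothetical, and treating a negative limit as "no truncation" is an assumption about how file.content.limit behaves.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Nutch.
final class ContentLimitSketch {

  /**
   * Returns at most maxContentLength bytes of content.
   * Assumption: a negative limit disables truncation.
   */
  static byte[] truncate(byte[] content, int maxContentLength) {
    if (maxContentLength < 0 || content.length <= maxContentLength) {
      return content; // nothing to cut
    }
    // Keep only the leading maxContentLength bytes.
    return Arrays.copyOf(content, maxContentLength);
  }

  public static void main(String[] args) {
    byte[] content = new byte[100];
    System.out.println(truncate(content, 64).length);
    System.out.println(truncate(content, -1).length);
  }
}
```

Returning the original array when no cut is needed avoids a copy; a caller that mutates the result would need a defensive copy instead.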
public ProtocolOutput getProtocolOutput(String url, WebPage page)
Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received.
Specified by: getProtocolOutput in interface Protocol
Parameters:
url - Text containing the url
datum - The CrawlDatum object corresponding to the url
Returns: ProtocolOutput object for the content of the file indicated by url
public Collection<WebPage.Field> getFields()
Specified by: getFields in interface FieldPluggable
public static void main(String[] args) throws Exception
Quick way for running this class.
Throws: Exception
public crawlercommons.robots.BaseRobotRules getRobotRules(String url, WebPage page)
No robots parsing is done for file protocol.
Specified by: getRobotRules in interface Protocol
Parameters:
url - url to check