org.apache.nutch.protocol.file
Class File

java.lang.Object
  extended by org.apache.nutch.protocol.file.File
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, FieldPluggable, Pluggable, Protocol

public class File
extends Object
implements Protocol

This class is a protocol plugin for the file: scheme. It creates a FileResponse object and gets the content of the URL from it. The configurable parameters are file.content.limit and file.crawl.parent, defined in the "file properties" section of nutch-default.xml.


Field Summary
static org.slf4j.Logger LOG
           
 
Fields inherited from interface org.apache.nutch.protocol.Protocol
CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID
 
Constructor Summary
File()
           
 
Method Summary
 org.apache.hadoop.conf.Configuration getConf()
          Get the Configuration object
 Collection<WebPage.Field> getFields()
           
 ProtocolOutput getProtocolOutput(String url, WebPage page)
          Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object for the content received.
 crawlercommons.robots.BaseRobotRules getRobotRules(String url, WebPage page)
          No robots parsing is done for the file protocol.
static void main(String[] args)
          Quick way to run this class.
 void setConf(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration object
 void setMaxContentLength(int maxContentLength)
          Set the point at which content is truncated.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

File

public File()
Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object

Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object

Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

setMaxContentLength

public void setMaxContentLength(int maxContentLength)
Set the point at which content is truncated.
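
To illustrate what a truncation point means, here is a minimal sketch in plain Java. It is not the Nutch implementation: the class ContentLimitSketch and its truncate method are hypothetical names, and treating a negative limit as "no truncation" is an assumption based on how content limits are usually described in nutch-default.xml.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the Nutch API.
class ContentLimitSketch {

    // Returns content cut to at most maxContentLength bytes.
    // Assumption: a negative limit disables truncation.
    static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content;
        }
        return Arrays.copyOf(content, maxContentLength);
    }

    public static void main(String[] args) {
        byte[] content = "0123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(truncate(content, 4).length);   // prints 4
        System.out.println(truncate(content, -1).length);  // prints 10
    }
}
```

Content shorter than the limit is returned unchanged, so the limit only ever shortens what was read.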


getProtocolOutput

public ProtocolOutput getProtocolOutput(String url,
                                        WebPage page)
Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object for the content received.

Specified by:
getProtocolOutput in interface Protocol
Parameters:
url - Text containing the url
page - The WebPage object corresponding to the url
Returns:
ProtocolOutput object for the content of the file indicated by url
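
The general idea of turning a file: URL into content can be sketched with the JDK alone. This is a rough illustration under stated assumptions, not what FileResponse actually does: FileUrlSketch, toPath, and fetch are hypothetical names, and the sketch ignores directories, content limits, and the ProtocolOutput and WebPage types.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for illustration; in Nutch the work is done by FileResponse.
class FileUrlSketch {

    // Converts a file: URL into a local Path, rejecting any other scheme.
    static Path toPath(String url) {
        URI uri = URI.create(url);
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("Not a file: URL: " + url);
        }
        return Paths.get(uri);
    }

    // Reads the whole file behind the URL; I/O failures surface unchecked.
    static byte[] fetch(String url) {
        try {
            return Files.readAllBytes(toPath(url));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("file-url-sketch", ".txt");
        try {
            Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
            byte[] content = fetch(tmp.toUri().toString());
            System.out.println(content.length); // prints 5
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```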

getFields

public Collection<WebPage.Field> getFields()
Specified by:
getFields in interface FieldPluggable

main

public static void main(String[] args)
                 throws Exception
Quick way to run this class. Useful for debugging.

Throws:
Exception

getRobotRules

public crawlercommons.robots.BaseRobotRules getRobotRules(String url,
                                                          WebPage page)
No robots parsing is done for the file protocol, so this returns an empty rule set that allows every URL.

Specified by:
getRobotRules in interface Protocol
Parameters:
url - url to check
Returns:
robot rules (specific for this url or default), never null
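
The "empty rules that allow everything" contract can be pictured with a small dependency-free sketch. The Rules interface and AllowAllRulesSketch class are hypothetical; the real method returns a crawlercommons.robots.BaseRobotRules, and this sketch only mirrors the documented behavior (never null, every URL allowed).

```java
// Hypothetical types for illustration; the real method returns
// crawlercommons.robots.BaseRobotRules.
class AllowAllRulesSketch {

    interface Rules {
        boolean isAllowed(String url);
    }

    // A single shared rule set with no restrictions: every URL passes.
    static final Rules EMPTY_RULES = url -> true;

    // Mirrors the documented contract: never null, same answer for any URL.
    static Rules getRobotRules(String url) {
        return EMPTY_RULES;
    }

    public static void main(String[] args) {
        Rules rules = getRobotRules("file:///etc/hosts");
        System.out.println(rules.isAllowed("file:///etc/hosts")); // prints true
    }
}
```

Returning one shared instance avoids allocating a new rule object per URL, which is reasonable because the rules carry no per-URL state.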


Copyright © 2013 The Apache Software Foundation