org.apache.nutch.protocol.file
Class File

java.lang.Object
  extended by org.apache.nutch.protocol.file.File
All Implemented Interfaces:
Configurable, Pluggable, Protocol

public class File
extends Object
implements Protocol

Protocol implementation that handles URLs with the file: scheme. Configurable parameters are defined in the "FILE properties" section of ./conf/nutch-default.xml or an equivalent configuration file.
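As a rough illustration of what handling the file: scheme involves, the following minimal sketch checks a URL's scheme and maps it to a local path using only the JDK. It is not this class's implementation; the names FileUrlSketch, isFileUrl and toLocalPath are hypothetical and exist only for this example.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: shows scheme detection and
// URL-to-path mapping for file: URLs with plain JDK classes.
final class FileUrlSketch {

    /** Returns true when the URL uses the file: scheme (case-insensitive). */
    public static boolean isFileUrl(String url) {
        String scheme = URI.create(url).getScheme();
        return scheme != null && scheme.equalsIgnoreCase("file");
    }

    /** Converts a file: URL to a local Path; rejects any other scheme. */
    public static Path toLocalPath(String url) {
        if (!isFileUrl(url)) {
            throw new IllegalArgumentException("Not a file: URL: " + url);
        }
        // Paths.get(URI) requires the lowercase "file" scheme and no authority.
        URI uri = URI.create(url);
        return Paths.get(URI.create("file:" + uri.getRawPath()));
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "file:///tmp/example.txt";
        System.out.println(url + " -> " + toLocalPath(url));
    }
}
```

The sketch rebuilds the URI from its raw path so that differently cased schemes resolve the same way; a real protocol handler would also need to deal with directories, missing files and permissions.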

Author:
John Xing

Field Summary
static org.apache.commons.logging.Log LOG
           
 
Fields inherited from interface org.apache.nutch.protocol.Protocol
CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID
 
Constructor Summary
File()
           
 
Method Summary
 Configuration getConf()
           
 ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
          Returns the Content for a fetchlist entry.
 RobotRules getRobotRules(Text url, CrawlDatum datum)
          Retrieve robot rules applicable for this url.
static void main(String[] args)
          For debugging.
 void setConf(Configuration conf)
           
 void setMaxContentLength(int length)
          Set the point at which content is truncated.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.apache.commons.logging.Log LOG

Constructor Detail

File

public File()

Method Detail

setMaxContentLength

public void setMaxContentLength(int length)
Set the point at which content is truncated.
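To make the idea of a truncation point concrete, here is a minimal sketch of reading content up to a length limit. It is an illustration under stated assumptions, not this class's code: the names TruncationSketch, readTruncated and truncate are hypothetical, the limit is assumed to be counted in bytes, and treating a negative limit as "no truncation" is a convention chosen for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Nutch: reads at most maxLength bytes.
final class TruncationSketch {

    /**
     * Reads from the stream until it ends or maxLength bytes have been read.
     * Assumption of this sketch: a negative maxLength means "no limit".
     */
    public static byte[] readTruncated(InputStream in, int maxLength) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int remaining = maxLength < 0 ? Integer.MAX_VALUE : maxLength;
        while (remaining > 0) {
            // Never request more than the bytes still allowed by the limit.
            int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n < 0) {
                break; // end of stream before reaching the limit
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    /** Convenience wrapper for in-memory content. */
    public static byte[] truncate(byte[] content, int maxLength) {
        try {
            return readTruncated(new ByteArrayInputStream(content), maxLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Capping each read at the remaining allowance means the reader never pulls more data than the limit permits, which matters when the source is a large file.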


getProtocolOutput

public ProtocolOutput getProtocolOutput(Text url,
                                        CrawlDatum datum)
Description copied from interface: Protocol
Returns the Content for a fetchlist entry.

Specified by:
getProtocolOutput in interface Protocol

main

public static void main(String[] args)
                 throws Exception
For debugging.

Throws:
Exception

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable

getRobotRules

public RobotRules getRobotRules(Text url,
                                CrawlDatum datum)
Description copied from interface: Protocol
Retrieve robot rules applicable for this url.

Specified by:
getRobotRules in interface Protocol
Parameters:
url - url to check
datum - page datum
Returns:
robot rules (specific for this url or default), never null


Copyright © 2011 The Apache Software Foundation