org.apache.nutch.protocol.ftp
Class Ftp

java.lang.Object
  extended by org.apache.nutch.protocol.ftp.Ftp
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, FieldPluggable, Pluggable, Protocol

public class Ftp
extends Object
implements Protocol

This class is a protocol plugin for the ftp: scheme. It creates an FtpResponse object and gets the content of the url from it. Configurable parameters are ftp.username, ftp.password, ftp.content.limit, ftp.timeout, ftp.server.timeout, ftp.keep.connection and ftp.follow.talk. For details, see the "FTP properties" section in nutch-default.xml.
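
The following is a minimal sketch of how typed settings with fallbacks might be read for the property names listed above. It is not the plugin's own code: the plugin receives a Hadoop Configuration through setConf, whereas this sketch uses java.util.Properties as a stand-in so that it runs without Nutch on the classpath. The class FtpSettingsSketch, its helper methods, and every value shown are hypothetical; the fallback numbers are illustrative and should not be read as the defaults shipped in nutch-default.xml.

```java
import java.util.Properties;

public class FtpSettingsSketch {

    /** Returns the integer stored under key, or fallback when the key is unset. */
    static int getInt(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    /** Returns the boolean stored under key, or fallback when the key is unset. */
    static boolean getBoolean(Properties props, String key, boolean fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        // Property names come from the class description; values are made up.
        Properties props = new Properties();
        props.setProperty("ftp.username", "anonymous");
        props.setProperty("ftp.timeout", "30000");
        props.setProperty("ftp.keep.connection", "true");

        // Keys that were set are returned as stored; unset keys use the fallback.
        int timeout = getInt(props, "ftp.timeout", 60000);
        int contentLimit = getInt(props, "ftp.content.limit", 65536);
        boolean keepConnection = getBoolean(props, "ftp.keep.connection", false);
        boolean followTalk = getBoolean(props, "ftp.follow.talk", false);

        System.out.println("ftp.timeout = " + timeout);
        System.out.println("ftp.content.limit = " + contentLimit);
        System.out.println("ftp.keep.connection = " + keepConnection);
        System.out.println("ftp.follow.talk = " + followTalk);
    }
}
```

In a real deployment these properties would normally be overridden in the site configuration file rather than set in code.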


Field Summary
static org.slf4j.Logger LOG
           
 
Fields inherited from interface org.apache.nutch.protocol.Protocol
CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID
 
Constructor Summary
Ftp()
           
 
Method Summary
protected  void finalize()
           
 org.apache.hadoop.conf.Configuration getConf()
          Get the Configuration object
 Collection<WebPage.Field> getFields()
           
 ProtocolOutput getProtocolOutput(String url, WebPage page)
          Creates an FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received
 crawlercommons.robots.BaseRobotRules getRobotRules(String url, WebPage page)
          Get the robots rules for a given url
static void main(String[] args)
          For debugging.
 void setConf(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration object
 void setFollowTalk(boolean followTalk)
          Set followTalk
 void setKeepConnection(boolean keepConnection)
          Set keepConnection
 void setMaxContentLength(int length)
          Set the point at which content is truncated.
 void setTimeout(int to)
          Set the timeout.
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

Ftp

public Ftp()
Method Detail

setTimeout

public void setTimeout(int to)
Set the timeout.


setMaxContentLength

public void setMaxContentLength(int length)
Set the point at which content is truncated.
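
To illustrate what a truncation point means in practice, here is a minimal sketch that cuts a byte array at a maximum length. It is not taken from the Nutch source. The class ContentLimitSketch and the method truncate are hypothetical, and treating a negative limit as "no truncation" is an assumption about how the limit is interpreted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ContentLimitSketch {

    /**
     * Returns content unchanged when it fits within maxContentLength,
     * otherwise a copy holding only the first maxContentLength bytes.
     * Assumption: a negative limit disables truncation.
     */
    static byte[] truncate(byte[] content, int maxContentLength) {
        if (maxContentLength < 0 || content.length <= maxContentLength) {
            return content;
        }
        return Arrays.copyOf(content, maxContentLength);
    }

    public static void main(String[] args) {
        byte[] body = "hello, ftp".getBytes(StandardCharsets.US_ASCII); // 10 bytes

        System.out.println(truncate(body, 5).length);   // prints 5
        System.out.println(truncate(body, 64).length);  // prints 10
        System.out.println(truncate(body, -1).length);  // prints 10
    }
}
```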


setFollowTalk

public void setFollowTalk(boolean followTalk)
Set followTalk


setKeepConnection

public void setKeepConnection(boolean keepConnection)
Set keepConnection


getProtocolOutput

public ProtocolOutput getProtocolOutput(String url,
                                        WebPage page)
Creates an FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received

Specified by:
getProtocolOutput in interface Protocol
Parameters:
url - Text containing the ftp url
page - The WebPage object corresponding to the url
Returns:
ProtocolOutput object for the url
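
Since the url argument is expected to be an ftp url, a caller may want to check the scheme and pull out the host, port and path before handing the string to a fetcher. The sketch below does this with java.net.URI only. It is an illustration under assumptions, not how this class parses urls: FtpUrlSketch, isFtp and portOf are hypothetical names, and ftp.example.org is a placeholder host. Port 21 is the standard FTP control port and is used when the url does not specify one.

```java
import java.net.URI;

public class FtpUrlSketch {

    static final int DEFAULT_FTP_PORT = 21;

    /** Returns true when the url's scheme is ftp, ignoring case. */
    static boolean isFtp(String url) {
        return "ftp".equalsIgnoreCase(URI.create(url).getScheme());
    }

    /** Returns the explicit port, or the default FTP port when none is given. */
    static int portOf(String url) {
        int port = URI.create(url).getPort(); // -1 when the url has no port
        return port == -1 ? DEFAULT_FTP_PORT : port;
    }

    public static void main(String[] args) {
        String url = "ftp://ftp.example.org/pub/README";
        URI uri = URI.create(url); // throws IllegalArgumentException on a malformed url

        System.out.println(isFtp(url));      // prints true
        System.out.println(uri.getHost());   // prints ftp.example.org
        System.out.println(portOf(url));     // prints 21
        System.out.println(uri.getPath());   // prints /pub/README
    }
}
```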

finalize

protected void finalize()
Overrides:
finalize in class Object

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object

Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object

Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

main

public static void main(String[] args)
                 throws Exception
For debugging.

Throws:
Exception

getFields

public Collection<WebPage.Field> getFields()
Specified by:
getFields in interface FieldPluggable

getRobotRules

public crawlercommons.robots.BaseRobotRules getRobotRules(String url,
                                                          WebPage page)
Get the robots rules for a given url

Specified by:
getRobotRules in interface Protocol
Parameters:
url - url to check
Returns:
robot rules (specific for this url or default), never null


Copyright © 2013 The Apache Software Foundation