org.apache.nutch.fetcher
Class Fetcher

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.fetcher.Fetcher
All Implemented Interfaces:
Configurable, JobConfigurable, MapRunnable<Text,CrawlDatum,Text,NutchWritable>, Tool

public class Fetcher
extends Configured
implements Tool, MapRunnable<Text,CrawlDatum,Text,NutchWritable>

A queue-based fetcher.

This fetcher uses the well-known producer-consumer model, with one producer (a QueueFeeder) and many consumers (FetcherThread-s).

QueueFeeder reads input fetchlists and populates a set of FetchItemQueue-s, which hold FetchItem-s that describe the items to be fetched. There are as many queues as there are unique hosts, but at any given time the total number of fetch items in all queues is less than a fixed number (currently set to a multiple of the number of threads).

As items are consumed from the queues, the QueueFeeder continues to add new input items so that their total count stays fixed (FetcherThread-s may also add new items to the queues, e.g. as a result of redirection). Once all input items are exhausted, the number of items in the queues begins to decrease. When this number reaches 0, the fetcher finishes.
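The feeding behaviour described above can be sketched with plain JDK collections. This is a minimal illustration, not the Nutch implementation: the class FeederSketch, its methods, the `capacity` parameter and the example URLs are all hypothetical, and the real QueueFeeder runs in its own thread rather than being called step by step.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified model of the feeder; not taken from the Nutch source.
class FeederSketch {
    private final Map<String, Queue<String>> queuesByHost = new LinkedHashMap<>();
    private final Iterator<String> input; // stands in for the input fetchlist
    private final int capacity;           // assumed cap on items across all queues
    private int totalQueued = 0;

    FeederSketch(Iterator<String> input, int capacity) {
        this.input = input;
        this.capacity = capacity;
    }

    // Producer step: top up the per-host queues until the cap is reached
    // or the input is exhausted. Returns the number of items added.
    int feed() {
        int added = 0;
        while (totalQueued < capacity && input.hasNext()) {
            String url = input.next();
            String host = URI.create(url).getHost();
            queuesByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
            totalQueued++;
            added++;
        }
        return added;
    }

    // Consumer step: take one item from the first non-empty queue, or null if all are empty.
    String take() {
        for (Queue<String> queue : queuesByHost.values()) {
            String url = queue.poll();
            if (url != null) {
                totalQueued--;
                return url;
            }
        }
        return null;
    }

    int totalQueued() {
        return totalQueued;
    }

    // The fetch is done once the input is exhausted and every queue has drained.
    boolean finished() {
        return !input.hasNext() && totalQueued == 0;
    }

    public static void main(String[] args) {
        List<String> fetchlist = List.of(
                "http://a.example/1", "http://a.example/2",
                "http://b.example/1", "http://c.example/1", "http://c.example/2");
        FeederSketch feeder = new FeederSketch(fetchlist.iterator(), 3);
        feeder.feed();
        while (!feeder.finished()) {
            String url = feeder.take();
            feeder.feed(); // refill so the total stays at the cap while input remains
            System.out.println(url + " (queued now: " + feeder.totalQueued() + ")");
        }
    }
}
```

In this sketch the queued total stays at the cap while input remains and then drains to zero, which mirrors the lifecycle described above.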

This fetcher implementation handles per-host blocking itself, instead of delegating this work to protocol-specific plugins. Each per-host queue manages its own "politeness" settings, such as the maximum number of concurrent requests and the crawl delay between consecutive requests. It also tracks the list of requests in progress and the time the last request finished. When FetcherThread-s ask for new items to fetch, a queue returns an eligible item, or null if for "politeness" reasons the host's queue is not yet ready.
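As a rough illustration of the per-host politeness check, the sketch below models a single host queue. It is an assumption-laden simplification: HostQueueSketch, `maxConcurrent`, `crawlDelayMs` and the method names are hypothetical, the in-progress requests are reduced to a counter, and the current time is passed in explicitly so the behaviour is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified per-host queue; names are illustrative only
// and are not taken from the Nutch source.
class HostQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final int maxConcurrent;  // assumed limit on requests in flight for this host
    private final long crawlDelayMs;  // assumed minimum gap after the last request finished
    private int inProgress = 0;
    private long nextFetchTimeMs = 0; // earliest time the next request may start

    HostQueueSketch(int maxConcurrent, long crawlDelayMs) {
        this.maxConcurrent = maxConcurrent;
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized void add(String url) {
        pending.add(url);
    }

    // Returns an eligible item, or null if the queue is empty or the host
    // is not yet ready for another request.
    synchronized String poll(long nowMs) {
        if (inProgress >= maxConcurrent) {
            return null; // too many requests already in flight
        }
        if (nowMs < nextFetchTimeMs) {
            return null; // crawl delay has not elapsed yet
        }
        String url = pending.poll();
        if (url != null) {
            inProgress++;
        }
        return url;
    }

    // Called when a request completes; starts the crawl-delay clock.
    synchronized void finish(long nowMs) {
        inProgress--;
        nextFetchTimeMs = nowMs + crawlDelayMs;
    }

    public static void main(String[] args) {
        HostQueueSketch queue = new HostQueueSketch(1, 1000);
        queue.add("http://a.example/1");
        queue.add("http://a.example/2");
        System.out.println("t=0:    " + queue.poll(0));    // first item is eligible
        System.out.println("t=0:    " + queue.poll(0));    // blocked: request in flight
        queue.finish(100);
        System.out.println("t=500:  " + queue.poll(500));  // blocked: crawl delay
        System.out.println("t=1100: " + queue.poll(1100)); // second item is eligible
    }
}
```

Returning null rather than blocking leaves the decision about waiting to the caller, which matches the description of FetcherThread-s asking queues for items.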

If there are still unfetched items in the queues, but none of the items are ready, FetcherThread-s will spin-wait until either some items become available, or a timeout is reached (at which point the Fetcher will abort, assuming the task is hung).
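The spin-wait with a timeout can be sketched as a small polling loop. This is only an illustration of the idea: SpinWaitSketch, `awaitItem`, `timeoutMs` and `pauseMs` are hypothetical names, and the actual Fetcher tracks the timeout across all threads rather than per call.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a spin-wait with an abort timeout; not the Nutch code.
class SpinWaitSketch {
    // Polls `source` until it yields a non-null item. If nothing becomes
    // ready within timeoutMs, gives up and throws, treating the task as hung.
    static <T> T awaitItem(Supplier<T> source, long timeoutMs, long pauseMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            T item = source.get();
            if (item != null) {
                return item;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                        "no item became ready within " + timeoutMs + " ms; assuming the task is hung");
            }
            try {
                Thread.sleep(pauseMs); // brief pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The source is "not ready" for the first two polls, then returns an item.
        String item = awaitItem(() -> ++calls[0] < 3 ? null : "http://a.example/1", 1000, 10);
        System.out.println("got " + item + " after " + calls[0] + " polls");
        try {
            awaitItem(() -> null, 50, 10); // never ready: hits the timeout
        } catch (IllegalStateException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```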

Author:
Andrzej Bialecki

Nested Class Summary
static class Fetcher.InputFormat
           
 
Field Summary
static String CONTENT_REDIR
           
static org.apache.commons.logging.Log LOG
           
static int PERM_REFRESH_TIME
           
static String PROTOCOL_REDIR
           
 
Constructor Summary
Fetcher()
           
Fetcher(Configuration conf)
           
 
Method Summary
 void close()
           
 void configure(JobConf job)
           
 void fetch(Path segment, int threads, boolean parsing)
           
static boolean isParsing(Configuration conf)
           
static boolean isStoringContent(Configuration conf)
           
static void main(String[] args)
          Run the fetcher.
 void run(RecordReader<Text,CrawlDatum> input, OutputCollector<Text,NutchWritable> output, Reporter reporter)
           
 int run(String[] args)
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

PERM_REFRESH_TIME

public static final int PERM_REFRESH_TIME
See Also:
Constant Field Values

CONTENT_REDIR

public static final String CONTENT_REDIR
See Also:
Constant Field Values

PROTOCOL_REDIR

public static final String PROTOCOL_REDIR
See Also:
Constant Field Values

LOG

public static final org.apache.commons.logging.Log LOG
Constructor Detail

Fetcher

public Fetcher()

Fetcher

public Fetcher(Configuration conf)
Method Detail

configure

public void configure(JobConf job)
Specified by:
configure in interface JobConfigurable

close

public void close()

isParsing

public static boolean isParsing(Configuration conf)

isStoringContent

public static boolean isStoringContent(Configuration conf)

run

public void run(RecordReader<Text,CrawlDatum> input,
                OutputCollector<Text,NutchWritable> output,
                Reporter reporter)
         throws IOException
Specified by:
run in interface MapRunnable<Text,CrawlDatum,Text,NutchWritable>
Throws:
IOException

fetch

public void fetch(Path segment,
                  int threads,
                  boolean parsing)
           throws IOException
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Run the fetcher.

Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2011 The Apache Software Foundation