public class Fetcher extends org.apache.hadoop.conf.Configured implements org.apache.hadoop.util.Tool, org.apache.hadoop.mapred.MapRunnable<org.apache.hadoop.io.Text,CrawlDatum,org.apache.hadoop.io.Text,NutchWritable>
This fetcher uses a well-known model of one producer (a QueueFeeder) and many consumers (FetcherThread-s).
QueueFeeder reads input fetchlists and populates a set of FetchItemQueue-s, which hold FetchItem-s that describe the items to be fetched. There are as many queues as there are unique hosts, but at any given time the total number of fetch items in all queues is less than a fixed number (currently set to a multiple of the number of threads).
As items are consumed from the queues, the QueueFeeder continues to add new input items, so that their total count stays fixed (FetcherThread-s may also add new items to the queues, e.g. as a result of redirection) until all input items are exhausted, at which point the number of items in the queues begins to decrease. When this number reaches 0, the fetcher finishes.
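The feeding behaviour described above can be illustrated with a minimal sketch. It is not the Nutch implementation: the class `FeedSketch`, its methods, and the `maxQueued` parameter are hypothetical names chosen for illustration, and the sketch caps the total at `maxQueued` items inclusive. The methods are synchronized so that several consumer threads could call `poll()` while one feeder calls `feed()`.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a capped, per-host feed; not the Nutch classes. */
class FeedSketch {
    private final Map<String, Deque<String>> queuesByHost = new HashMap<>();
    private final Iterator<String> input; // stands in for the input fetchlist
    private final int maxQueued;          // assumed cap on items across all queues
    private int totalQueued = 0;

    FeedSketch(List<String> urls, int maxQueued) {
        this.input = urls.iterator();
        this.maxQueued = maxQueued;
    }

    /** Producer step: top up the queues until the cap is hit or input runs out. */
    synchronized int feed() {
        int added = 0;
        while (totalQueued < maxQueued && input.hasNext()) {
            String url = input.next();
            String host = URI.create(url).getHost();
            // One queue per unique host, created on first use.
            queuesByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).addLast(url);
            totalQueued++;
            added++;
        }
        return added;
    }

    /** Consumer step: take any queued item, or null if every queue is empty. */
    synchronized String poll() {
        for (Deque<String> queue : queuesByHost.values()) {
            String url = queue.pollFirst();
            if (url != null) {
                totalQueued--;
                return url;
            }
        }
        return null;
    }

    synchronized int totalQueued() {
        return totalQueued;
    }

    /** True once the input is exhausted and the queues have drained to 0. */
    synchronized boolean finished() {
        return totalQueued == 0 && !input.hasNext();
    }
}
```

In this sketch the total stays at the cap while input remains, because each `poll()` is followed by a `feed()` that refills the gap. Once the input iterator is exhausted, the count only decreases.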
This fetcher implementation handles per-host blocking itself, instead of delegating this work to protocol-specific plugins. Each per-host queue handles its own "politeness" settings, such as the maximum number of concurrent requests and crawl delay between consecutive requests - and also a list of requests in progress, and the time the last request was finished. As FetcherThread-s ask for new items to be fetched, queues may return eligible items or null if for "politeness" reasons this host's queue is not yet ready.
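As a rough illustration of per-host politeness, the sketch below models one host's queue with a concurrency limit, a crawl delay, a set of requests in progress, and the time after which the next request may start. The class `HostQueueSketch` and its method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic. The actual FetchItemQueue may differ in its details.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of one host's queue; not the Nutch FetchItemQueue. */
class HostQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Set<String> inProgress = new HashSet<>();
    private final int maxConcurrent; // assumed limit on simultaneous requests
    private final long crawlDelayMs; // assumed delay after a request finishes
    private long nextFetchTimeMs = Long.MIN_VALUE; // no request finished yet

    HostQueueSketch(int maxConcurrent, long crawlDelayMs) {
        this.maxConcurrent = maxConcurrent;
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized void add(String url) {
        pending.addLast(url);
    }

    /** Returns an eligible item, or null if the host is not yet ready. */
    synchronized String next(long nowMs) {
        if (inProgress.size() >= maxConcurrent) {
            return null; // too many requests already running for this host
        }
        if (nowMs < nextFetchTimeMs) {
            return null; // crawl delay since the last finished request not over
        }
        String url = pending.pollFirst();
        if (url != null) {
            inProgress.add(url);
        }
        return url;
    }

    /** Records that a request finished and starts the crawl delay. */
    synchronized void finish(String url, long nowMs) {
        if (inProgress.remove(url)) {
            nextFetchTimeMs = nowMs + crawlDelayMs;
        }
    }
}
```

A consumer that receives null from `next` simply moves on to another host's queue or waits, which is how a null result expresses "not ready" rather than "empty" in this sketch.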
If there are still unfetched items in the queues, but none of the items are ready, FetcherThread-s will spin-wait until either some items become available, or a timeout is reached (at which point the Fetcher will abort, assuming the task is hung).
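The waiting behaviour can be sketched as a polling loop with a deadline. This is a simplified illustration under stated assumptions: `SpinWaitSketch.awaitItem`, the `pollMs` interval, and the use of a `Supplier` as the item source are all hypothetical, and the sketch returns null on timeout where the Fetcher itself aborts the task.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of spin-waiting with a timeout; not the Nutch code. */
class SpinWaitSketch {
    /**
     * Polls the source until it yields a non-null item.
     * Returns null if timeoutMs elapses first or the thread is interrupted.
     */
    static <T> T awaitItem(Supplier<T> source, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            T item = source.get();
            if (item != null) {
                return item; // an item became eligible
            }
            if (System.nanoTime() - deadline >= 0) {
                return null; // timed out: the caller may treat the task as hung
            }
            try {
                Thread.sleep(pollMs); // back off briefly before asking again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }
}
```

Sleeping between polls keeps the loop from consuming a full CPU core while every host queue is waiting out its crawl delay.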
Modifier and Type | Class and Description |
---|---|
static class | Fetcher.InputFormat |
Modifier and Type | Field and Description |
---|---|
static String | CONTENT_REDIR |
static org.slf4j.Logger | LOG |
static int | PERM_REFRESH_TIME |
static String | PROTOCOL_REDIR |
Constructor and Description |
---|
Fetcher() |
Fetcher(org.apache.hadoop.conf.Configuration conf) |
Modifier and Type | Method and Description |
---|---|
void | close() |
void | configure(org.apache.hadoop.mapred.JobConf job) |
void | fetch(org.apache.hadoop.fs.Path segment, int threads) |
static boolean | isParsing(org.apache.hadoop.conf.Configuration conf) |
static boolean | isStoringContent(org.apache.hadoop.conf.Configuration conf) |
static void | main(String[] args) Run the fetcher. |
void | run(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,CrawlDatum> input, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output, org.apache.hadoop.mapred.Reporter reporter) |
int | run(String[] args) |
public static final int PERM_REFRESH_TIME
public static final String CONTENT_REDIR
public static final String PROTOCOL_REDIR
public static final org.slf4j.Logger LOG
public Fetcher()
public Fetcher(org.apache.hadoop.conf.Configuration conf)
public void configure(org.apache.hadoop.mapred.JobConf job)
Specified by: configure in interface org.apache.hadoop.mapred.JobConfigurable
public void close()
public static boolean isParsing(org.apache.hadoop.conf.Configuration conf)
public static boolean isStoringContent(org.apache.hadoop.conf.Configuration conf)
public void run(org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,CrawlDatum> input, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output, org.apache.hadoop.mapred.Reporter reporter) throws IOException
Specified by: run in interface org.apache.hadoop.mapred.MapRunnable<org.apache.hadoop.io.Text,CrawlDatum,org.apache.hadoop.io.Text,NutchWritable>
Throws: IOException
public void fetch(org.apache.hadoop.fs.Path segment, int threads) throws IOException
Throws: IOException
Copyright © 2014 The Apache Software Foundation