org.apache.nutch.crawl
Class AdaptiveFetchSchedule

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.crawl.AbstractFetchSchedule
          extended by org.apache.nutch.crawl.AdaptiveFetchSchedule
All Implemented Interfaces:
Configurable, FetchSchedule
Direct Known Subclasses:
MimeAdaptiveFetchSchedule

public class AdaptiveFetchSchedule
extends AbstractFetchSchedule

This class implements an adaptive re-fetch algorithm. This works as follows:

NOTE: values of DEC_FACTOR and INC_FACTOR (held in the DEC_RATE and INC_RATE fields) higher than 0.4f may destabilize the algorithm, so that the fetch interval keeps increasing or decreasing with little relation to actual page changes. Use the main(String[]) method to test values before applying them in a production system.
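
The instability described in the note can be explored with a small simulation. The following is a minimal sketch, not the code of this class: it assumes that the interval is multiplied by (1 - decRate) when a change is detected and by (1 + incRate) otherwise, and is then clamped to a minimum and maximum. The class name AdaptiveIntervalSketch, its methods, and the constants passed in main are hypothetical and chosen only for illustration.

```java
// Illustrative sketch only: the names and the update rule are assumptions,
// not taken from AdaptiveFetchSchedule itself.
class AdaptiveIntervalSketch {

    /** Returns the next fetch interval in seconds, clamped to [min, max]. */
    static float nextInterval(float interval, boolean modified,
                              float incRate, float decRate,
                              float min, float max) {
        // Shrink the interval when the page changed, grow it otherwise.
        float next = modified ? interval * (1.0f - decRate)
                              : interval * (1.0f + incRate);
        return Math.max(min, Math.min(max, next));
    }

    /**
     * Simulates a page that changes every changePeriod seconds and returns
     * the fetch interval reached after the given number of fetches.
     */
    static float simulate(float startInterval, long changePeriod, int fetches,
                          float incRate, float decRate, float min, float max) {
        float interval = startInterval;
        long now = 0;
        long lastFetch = 0;
        for (int i = 0; i < fetches; i++) {
            now += (long) interval;
            // The page counts as modified if a change boundary was crossed
            // between the previous fetch and this one.
            boolean modified = (now / changePeriod) > (lastFetch / changePeriod);
            interval = nextInterval(interval, modified, incRate, decRate, min, max);
            lastFetch = now;
        }
        return interval;
    }

    public static void main(String[] args) {
        float day = 86400f;
        // Compare a moderate rate with two more aggressive ones.
        for (float rate : new float[] {0.2f, 0.4f, 0.8f}) {
            float result = simulate(30 * day, 7 * 86400L, 100,
                                    rate, rate, 60f, 365 * day);
            System.out.printf("rate=%.1f -> interval after 100 fetches: %.1f days%n",
                              rate, result / day);
        }
    }
}
```

Running the sketch with different rates gives a rough feel for how quickly the interval moves; it is not a substitute for testing with main(String[]) of this class.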

Author:
Andrzej Bialecki

Field Summary
protected  float DEC_RATE
           
protected  float INC_RATE
           
static org.slf4j.Logger LOG
           
 
Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
defaultInterval, maxInterval
 
Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
 
Constructor Summary
AdaptiveFetchSchedule()
           
 
Method Summary
static void main(String[] args)
           
 void setConf(Configuration conf)
           
 CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
          Sets the fetchInterval and fetchTime on a successfully fetched page.
 
Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

INC_RATE

protected float INC_RATE

DEC_RATE

protected float DEC_RATE
Constructor Detail

AdaptiveFetchSchedule

public AdaptiveFetchSchedule()
Method Detail

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
Overrides:
setConf in class AbstractFetchSchedule

setFetchSchedule

public CrawlDatum setFetchSchedule(Text url,
                                   CrawlDatum datum,
                                   long prevFetchTime,
                                   long prevModifiedTime,
                                   long fetchTime,
                                   long modifiedTime,
                                   int state)
Description copied from class: AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.

Specified by:
setFetchSchedule in interface FetchSchedule
Overrides:
setFetchSchedule in class AbstractFetchSchedule
Parameters:
url - url of the page
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in datum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in datum to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to have changed before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
Returns:
adjusted page information, including all original information. NOTE: this may be a different instance than datum, but implementations should make sure that it contains at least all information from datum.
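
The state argument can be derived by comparing page signatures before and after fetching, as the parameter description mentions. A minimal sketch of that comparison follows, assuming an MD5 digest of the raw content serves as the signature. The class SignatureStateSketch, the methods signature and stateFor, and the local constant values are hypothetical stand-ins for the FetchSchedule.STATUS_* constants, not the library's own definitions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative sketch only: names and constant values are assumptions.
class SignatureStateSketch {

    // Hypothetical local stand-ins for the FetchSchedule.STATUS_* constants.
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    /** Computes an MD5 digest of the page content as its signature. */
    static byte[] signature(String content) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Maps a pair of signatures to a modification state. */
    static int stateFor(byte[] previousSignature, byte[] currentSignature) {
        // Without both signatures there is nothing to compare.
        if (previousSignature == null || currentSignature == null) {
            return STATUS_UNKNOWN;
        }
        return Arrays.equals(previousSignature, currentSignature)
                ? STATUS_NOTMODIFIED
                : STATUS_MODIFIED;
    }

    public static void main(String[] args) {
        byte[] before = signature("<html>old</html>");
        byte[] after = signature("<html>new</html>");
        System.out.println("changed page state:   " + stateFor(before, after));
        System.out.println("unchanged page state: " + stateFor(before, before));
        System.out.println("first fetch state:    " + stateFor(null, after));
    }
}
```

The resulting value would then be passed as the state argument, with the unknown case left to whatever default the schedule implementation chooses.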

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2012 The Apache Software Foundation