public class MimeAdaptiveFetchSchedule extends AdaptiveFetchSchedule
Modifier and Type | Field and Description |
---|---|
static org.slf4j.Logger | LOG |
static String | SCHEDULE_DEC_RATE |
static String | SCHEDULE_INC_RATE |
static String | SCHEDULE_MIME_FILE |
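Fields inherited from class AdaptiveFetchSchedule: DEC_RATE, INC_RATE
Fields inherited from class AbstractFetchSchedule: defaultInterval, maxInterval
Fields inherited from interface FetchSchedule: SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN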
Constructor and Description |
---|
MimeAdaptiveFetchSchedule() |
Modifier and Type | Method and Description |
---|---|
static void | main(String[] args) |
void | setConf(org.apache.hadoop.conf.Configuration conf) |
CrawlDatum | setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) Sets the fetchInterval and fetchTime on a successfully fetched page. |
Methods inherited from class AbstractFetchSchedule: calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch
public static final org.slf4j.Logger LOG
public static final String SCHEDULE_INC_RATE
public static final String SCHEDULE_DEC_RATE
public static final String SCHEDULE_MIME_FILE
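The name SCHEDULE_MIME_FILE suggests a configuration key that points to a file of per-MIME-type rates. The page does not describe that file, so the following is only a minimal sketch under stated assumptions: each data line is assumed to hold a MIME type, an increment rate, and a decrement rate separated by tabs, and lines starting with `#` are assumed to be comments. `MimeRatesSketch`, `Rates`, `parse`, and `ratesFor` are hypothetical names, not part of Nutch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: models a per-MIME-type rate table.
final class MimeRatesSketch {

    // Pair of rates for one MIME type (assumed to mirror INC_RATE / DEC_RATE).
    static final class Rates {
        final float inc;
        final float dec;

        Rates(float inc, float dec) {
            this.inc = inc;
            this.dec = dec;
        }
    }

    // Parses lines of the ASSUMED form "<mime type>\t<inc rate>\t<dec rate>".
    // Blank lines, '#' comments, and lines without exactly three fields are skipped.
    static Map<String, Rates> parse(List<String> lines) {
        Map<String, Rates> rates = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\t");
            if (parts.length != 3) {
                continue;
            }
            rates.put(parts[0].trim(),
                new Rates(Float.parseFloat(parts[1].trim()), Float.parseFloat(parts[2].trim())));
        }
        return rates;
    }

    // Falls back to the supplied defaults when the MIME type has no entry.
    static Rates ratesFor(Map<String, Rates> rates, String mimeType, Rates defaults) {
        return rates.getOrDefault(mimeType, defaults);
    }
}
```

In this sketch the lines could come from `java.nio.file.Files.readAllLines`. The actual file format and lookup behavior should be checked against the Nutch source before relying on either.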
public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by: setConf in interface org.apache.hadoop.conf.Configurable
Overrides: setConf in class AdaptiveFetchSchedule
public CrawlDatum setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Description copied from class: AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter; extending classes should call super.setFetchSchedule() to preserve this behavior.
Specified by: setFetchSchedule in interface FetchSchedule
Overrides: setFetchSchedule in class AdaptiveFetchSchedule
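The adjustment of fetchInterval and fetchTime described above can be illustrated with a small stand-alone sketch. It is not the Nutch implementation: it assumes that the interval shrinks by a decrement rate when the page changed, grows by an increment rate when it did not, stays the same when the state is unknown, and is clamped to a minimum and maximum. The class `AdaptiveIntervalSketch`, the `State` enum (a stand-in for the FetchSchedule.STATUS_* constants), and both method names are hypothetical.

```java
// Hypothetical stand-in, not the Nutch implementation.
final class AdaptiveIntervalSketch {

    // Stand-in for FetchSchedule.STATUS_MODIFIED / STATUS_NOTMODIFIED / STATUS_UNKNOWN.
    enum State { MODIFIED, NOTMODIFIED, UNKNOWN }

    // Returns the next fetch interval in seconds, clamped to [minInterval, maxInterval].
    static float nextInterval(float interval, State state, float incRate, float decRate,
                              float minInterval, float maxInterval) {
        switch (state) {
            case MODIFIED:
                // Page changed: re-fetch sooner (assumed behavior).
                interval *= (1.0f - decRate);
                break;
            case NOTMODIFIED:
                // Page unchanged: re-fetch later (assumed behavior).
                interval *= (1.0f + incRate);
                break;
            default:
                // Unknown: keep the current interval as the default behavior.
                break;
        }
        return Math.max(minInterval, Math.min(maxInterval, interval));
    }

    // Next fetch time in milliseconds: the fetch time plus the interval in seconds.
    static long nextFetchTime(long fetchTimeMs, float intervalSeconds) {
        return fetchTimeMs + Math.round(intervalSeconds * 1000.0);
    }
}
```

With both rates set to 0.5, a 1000-second interval becomes 500 seconds after a modified fetch and 1500 seconds after an unmodified one. A per-MIME-type schedule could plausibly pick `incRate` and `decRate` from the page's content type before applying the same rule.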
Parameters:
url - url of the page
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, the content is considered to have changed before fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, it is unknown whether the page changed, and implementations are free to follow a sensible default behavior.
Copyright © 2014 The Apache Software Foundation