public class DefaultFetchSchedule extends AbstractFetchSchedule
The fetchInterval remains unchanged, and the updated page fetchTime will always be set to fetchTime + fetchInterval * 1000.
Fields inherited from class AbstractFetchSchedule: defaultInterval, maxInterval
Fields inherited from interface FetchSchedule: SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
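The scheduling rule in the class description amounts to one line of arithmetic. The following is a minimal standalone sketch, not the Nutch source: `DefaultScheduleSketch` and `nextFetchTime` are hypothetical names, and the sketch assumes that fetchTime is in epoch milliseconds and fetchInterval in seconds, which is what the `* 1000` factor suggests.

```java
public class DefaultScheduleSketch {
    // Hypothetical helper. Assumed units: fetchTime in milliseconds,
    // fetchInterval in seconds.
    public static long nextFetchTime(long fetchTimeMs, int fetchIntervalSec) {
        // Multiply by a long literal so large intervals do not overflow int.
        return fetchTimeMs + fetchIntervalSec * 1000L;
    }

    public static void main(String[] args) {
        long fetchedAt = 1_000_000L;       // example fetch time in ms
        int interval = 30 * 24 * 3600;     // example interval: 30 days in seconds
        System.out.println(nextFetchTime(fetchedAt, interval));
    }
}
```

Because the interval itself is never modified, applying the rule repeatedly yields evenly spaced fetch times regardless of whether the page changed.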
Constructor and Description |
---|
DefaultFetchSchedule() |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) Sets the fetchInterval and fetchTime on a successfully fetched page. |
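Methods inherited from class AbstractFetchSchedule: calculateLastFetchTime, forceRefetch, initializeSchedule, setConf, setPageGoneSchedule, setPageRetrySchedule, shouldFetch
Methods inherited from class Configured: getConf
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf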
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Description copied from class: AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter; extending classes should call super.setFetchSchedule() to preserve this behavior.
Specified by: setFetchSchedule in interface FetchSchedule
Overrides: setFetchSchedule in class AbstractFetchSchedule
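The note about calling super.setFetchSchedule() can be illustrated with a small standalone sketch. None of the types below are Nutch classes: `PageRecord`, `BaseSchedule`, and `FixedIntervalSchedule` are hypothetical stand-ins with simplified signatures, and the units (milliseconds for fetchTime, seconds for fetchInterval) are assumed.

```java
public class RetryResetSketch {
    // Hypothetical stand-in for a crawl record; not Nutch's CrawlDatum.
    public static class PageRecord {
        public long fetchTime;     // epoch milliseconds (assumed)
        public int fetchInterval;  // seconds (assumed)
        public int retries;        // failed-fetch counter
    }

    // Stand-in for the base behavior described above: reset the retry counter.
    public static class BaseSchedule {
        public PageRecord setFetchSchedule(PageRecord datum, long fetchTime) {
            datum.retries = 0;
            return datum;
        }
    }

    // Stand-in subclass: delegates to super first so the retry reset is kept,
    // then applies the fixed-interval rule.
    public static class FixedIntervalSchedule extends BaseSchedule {
        @Override
        public PageRecord setFetchSchedule(PageRecord datum, long fetchTime) {
            datum = super.setFetchSchedule(datum, fetchTime);
            datum.fetchTime = fetchTime + datum.fetchInterval * 1000L;
            return datum;
        }
    }

    public static void main(String[] args) {
        PageRecord page = new PageRecord();
        page.fetchInterval = 3600;
        page.retries = 2;
        new FixedIntervalSchedule().setFetchSchedule(page, 5_000L);
        System.out.println(page.retries + " " + page.fetchTime);
    }
}
```

A subclass that skipped the super call in this sketch would leave `retries` at its old value, which is the situation the note warns about.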
Parameters:
url - URL of the page.
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetchTime, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the most recent time at which the page was re-fetched. Most FetchSchedule implementations should update the value in CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, the content is considered to have changed before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
Copyright © 2015 The Apache Software Foundation