MimeAdaptiveFetchSchedule (apache-nutch 1.6 API)

org.apache.nutch.crawl
Class MimeAdaptiveFetchSchedule

java.lang.Object
  org.apache.hadoop.conf.Configured
      org.apache.nutch.crawl.AbstractFetchSchedule
          org.apache.nutch.crawl.AdaptiveFetchSchedule
              org.apache.nutch.crawl.MimeAdaptiveFetchSchedule

All Implemented Interfaces: Configurable, FetchSchedule

public class MimeAdaptiveFetchSchedule
extends AdaptiveFetchSchedule

Extension of AdaptiveFetchSchedule that allows more flexible configuration of the DEC and INC factors for various MIME types. This class is typically useful when a recrawl covers many different MIME types. MIME types other than text/html rarely change frequently. With this class you can configure different factors per MIME type, so that frequently changing MIME types are preferred over others. To work, this class relies on the Content-Type metadata key being present in the CrawlDB. This can be achieved either when injecting new URLs or by adding "Content-Type" to the db.parsemeta.to.crawldb configuration setting, which forces the MIME types of newly discovered URLs to be added to the CrawlDB.

Author: markus
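
The per-MIME-type idea described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch implementation: the class `MimeRateSketch`, its methods, and the rate values are hypothetical, and it assumes the usual adaptive rule that the fetch interval shrinks by the DEC factor when a page was modified and grows by the INC factor when it was not. The real class also works on `CrawlDatum` objects and bounds the interval, which is omitted here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, simplified illustration of per-MIME-type INC/DEC factors. */
public class MimeRateSketch {

    /** INC/DEC factor pair for one MIME type. */
    static final class Rate {
        final double inc;
        final double dec;

        Rate(double inc, double dec) {
            this.inc = inc;
            this.dec = dec;
        }
    }

    private final Rate defaultRate;
    private final Map<String, Rate> perMime = new HashMap<>();

    /** Default factors are used when no MIME-specific entry exists. */
    public MimeRateSketch(double defaultInc, double defaultDec) {
        this.defaultRate = new Rate(defaultInc, defaultDec);
    }

    /** Registers factors for one MIME type (matched case-insensitively). */
    public void register(String mimeType, double inc, double dec) {
        perMime.put(mimeType.toLowerCase(Locale.ROOT), new Rate(inc, dec));
    }

    /**
     * Returns the next fetch interval: shorter if the page was modified,
     * longer if it was not. A null or unknown Content-Type falls back to
     * the default factors.
     */
    public double nextInterval(double interval, String contentType, boolean modified) {
        Rate rate = defaultRate;
        if (contentType != null) {
            rate = perMime.getOrDefault(contentType.toLowerCase(Locale.ROOT), defaultRate);
        }
        return modified ? interval * (1.0 - rate.dec) : interval * (1.0 + rate.inc);
    }

    public static void main(String[] args) {
        MimeRateSketch sketch = new MimeRateSketch(0.25, 0.25);
        // Illustrative values only: HTML intervals shrink fast, PDF intervals grow fast.
        sketch.register("text/html", 0.25, 0.5);
        sketch.register("application/pdf", 0.5, 0.25);

        System.out.println(sketch.nextInterval(1000.0, "text/html", true));        // 500.0
        System.out.println(sketch.nextInterval(1000.0, "application/pdf", false)); // 1500.0
        System.out.println(sketch.nextInterval(1000.0, "image/png", false));       // 1250.0
    }
}
```

With factors like these, a modified HTML page is revisited sooner, while an unchanged PDF drifts toward longer intervals, which is the preference for frequently changing MIME types that the class description refers to.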

Field Summary
`static org.slf4j.Logger`	`LOG`
`static String`	`SCHEDULE_DEC_RATE`
`static String`	`SCHEDULE_INC_RATE`
`static String`	`SCHEDULE_MIME_FILE`

Fields inherited from class org.apache.nutch.crawl.AdaptiveFetchSchedule
`DEC_RATE, INC_RATE`

Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
`defaultInterval, maxInterval`

Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
`SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN`

Constructor Summary
`MimeAdaptiveFetchSchedule()`

Method Summary
`static void`	`main(String[] args)`
`void`	`setConf(Configuration conf)`
`CrawlDatum`	`setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)` Sets the `fetchInterval` and `fetchTime` on a successfully fetched page.

Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
`calculateLastFetchTime, forceRefetch, initializeSchedule, setPageGoneSchedule, setPageRetrySchedule, shouldFetch`

Methods inherited from class org.apache.hadoop.conf.Configured
`getConf`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Methods inherited from interface org.apache.hadoop.conf.Configurable
`getConf`

Field Detail

LOG

public static final org.slf4j.Logger LOG

SCHEDULE_INC_RATE

public static final String SCHEDULE_INC_RATE

See Also: Constant Field Values

SCHEDULE_DEC_RATE

public static final String SCHEDULE_DEC_RATE

See Also: Constant Field Values

SCHEDULE_MIME_FILE

public static final String SCHEDULE_MIME_FILE

See Also: Constant Field Values

Constructor Detail