org.apache.nutch.crawl
Class Generator
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.crawl.Generator
- All Implemented Interfaces:
- Configurable, Tool
public class Generator
- extends Configured
- implements Tool
Generates a subset of a crawl db to fetch. This version can generate
fetchlists for several segments in one go. Unlike the initial version
(OldGenerator), IP resolution is done ONLY on the entries that have been
selected for fetching. The URLs are partitioned by IP, domain or host within a
segment. How URLs are counted to limit the entries, i.e. by domain or by host,
can be chosen separately.
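The counting described above can be pictured with a small, self-contained sketch. It is not the Generator's actual code: the class `CountModeSketch`, the enum `CountMode`, and the methods `countKey` and `limit` are hypothetical names introduced for illustration, and the domain is guessed naively from the last two host labels, which is an assumption that does not hold for suffixes such as co.uk.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountModeSketch {

    /** Hypothetical count modes, mirroring the "by domain or host" choice. */
    public enum CountMode { HOST, DOMAIN }

    /** Returns the key under which a URL is counted for the given mode. */
    public static String countKey(String url, CountMode mode) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        if (mode == CountMode.HOST) {
            return host;
        }
        // Assumption: the domain is the last two labels of the host name.
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** Keeps at most maxCount URLs per host or domain, in input order. */
    public static List<String> limit(List<String> urls, CountMode mode, int maxCount) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            int count = seen.merge(countKey(url, mode), 1, Integer::sum);
            if (count <= maxCount) {
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example.org/1",
            "http://a.example.org/2",
            "http://b.example.org/1",
            "http://other.net/1");
        // Counting by host keeps one URL for each of the three hosts.
        System.out.println(limit(urls, CountMode.HOST, 1));
        // Counting by domain keeps one URL for example.org and one for other.net.
        System.out.println(limit(urls, CountMode.DOMAIN, 1));
    }
}
```

Counting by domain is the stricter choice here, because all hosts under the same domain share a single quota.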
Method Summary
Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime)
Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean force)
Old signature kept for compatibility. It has no parameter for normalisation
and sets the number of segments to 1.
Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments)
Generate fetchlists in one or more segments.
static String generateSegmentName()
static void main(String[] args)
Generate a fetchlist from the crawldb.
int run(String[] args)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.apache.commons.logging.Log LOG
GENERATE_UPDATE_CRAWLDB
public static final String GENERATE_UPDATE_CRAWLDB
- See Also:
- Constant Field Values
GENERATOR_MIN_SCORE
public static final String GENERATOR_MIN_SCORE
- See Also:
- Constant Field Values
GENERATOR_FILTER
public static final String GENERATOR_FILTER
- See Also:
- Constant Field Values
GENERATOR_NORMALISE
public static final String GENERATOR_NORMALISE
- See Also:
- Constant Field Values
GENERATOR_MAX_COUNT
public static final String GENERATOR_MAX_COUNT
- See Also:
- Constant Field Values
GENERATOR_COUNT_MODE
public static final String GENERATOR_COUNT_MODE
- See Also:
- Constant Field Values
GENERATOR_COUNT_VALUE_DOMAIN
public static final String GENERATOR_COUNT_VALUE_DOMAIN
- See Also:
- Constant Field Values
GENERATOR_COUNT_VALUE_HOST
public static final String GENERATOR_COUNT_VALUE_HOST
- See Also:
- Constant Field Values
GENERATOR_TOP_N
public static final String GENERATOR_TOP_N
- See Also:
- Constant Field Values
GENERATOR_CUR_TIME
public static final String GENERATOR_CUR_TIME
- See Also:
- Constant Field Values
GENERATOR_DELAY
public static final String GENERATOR_DELAY
- See Also:
- Constant Field Values
GENERATOR_MAX_NUM_SEGMENTS
public static final String GENERATOR_MAX_NUM_SEGMENTS
- See Also:
- Constant Field Values
GENERATE_MAX_PER_HOST_BY_IP
public static final String GENERATE_MAX_PER_HOST_BY_IP
- See Also:
- Constant Field Values
GENERATE_MAX_PER_HOST
public static final String GENERATE_MAX_PER_HOST
- See Also:
- Constant Field Values
Generator
public Generator()
Generator
public Generator(Configuration conf)
generate
public Path[] generate(Path dbDir,
Path segments,
int numLists,
long topN,
long curTime)
throws IOException
- Throws:
IOException
generate
public Path[] generate(Path dbDir,
Path segments,
int numLists,
long topN,
long curTime,
boolean filter,
boolean force)
throws IOException
- Old signature kept for compatibility. It has no parameter for normalisation
and sets the number of segments to 1.
- Throws:
IOException
generate
public Path[] generate(Path dbDir,
Path segments,
int numLists,
long topN,
long curTime,
boolean filter,
boolean norm,
boolean force,
int maxNumSegments)
throws IOException
- Generate fetchlists in one or more segments. Whether to filter URLs is read
from the crawl.generate.filter property in the configuration files. If the
property is not found, the URLs are filtered. The same applies to
normalisation.
- Parameters:
dbDir - Crawl database directory
segments - Segments directory
numLists - Number of reduce tasks
topN - Number of top URLs to be selected
curTime - Current time in milliseconds
- Returns:
- Path to generated segment or null if no entries were selected
- Throws:
IOException - When an I/O error occurs
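The "filter unless told otherwise" rule described for this method amounts to a boolean lookup that defaults to true. The sketch below shows that pattern using java.util.Properties as a stand-in for the Hadoop configuration. The class `FilterDefaultSketch` and its `getBoolean` helper are hypothetical; only the property name crawl.generate.filter comes from the description above, and the real Generator may read it differently.

```java
import java.util.Properties;

public class FilterDefaultSketch {

    /** Reads a boolean property, falling back to defaultValue when the key is absent. */
    public static boolean getBoolean(Properties conf, String key, boolean defaultValue) {
        String value = conf.getProperty(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // Property not set: the default applies, so URLs would be filtered.
        System.out.println(getBoolean(conf, "crawl.generate.filter", true));

        // Property set explicitly: the configured value wins over the default.
        conf.setProperty("crawl.generate.filter", "false");
        System.out.println(getBoolean(conf, "crawl.generate.filter", true));
    }
}
```

Hadoop's Configuration class offers a getBoolean(String, boolean) method with the same fallback behaviour, which is why a default of true is enough to express "filtered if the property is not found".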
generateSegmentName
public static String generateSegmentName()
main
public static void main(String[] args)
throws Exception
- Generate a fetchlist from the crawldb.
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Specified by:
run
in interface Tool
- Throws:
Exception
Copyright © 2006 The Apache Software Foundation