org.apache.nutch.crawl
Class Generator

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.crawl.Generator
All Implemented Interfaces:
Configurable, Tool

public class Generator
extends Configured
implements Tool

Generates a subset of a crawl db to fetch. This version can generate fetchlists for several segments in one run. Unlike the initial version (OldGenerator), IP resolution is performed only on the entries that have been selected for fetching. Within a segment, the URLs are partitioned by IP, domain or host. The mode used to count URLs when limiting entries (by domain or by host) can be chosen separately.
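The per-host limit described above can be illustrated with a small standalone sketch. This is not Nutch's implementation: the class HostCountLimitSketch, the method limitPerHost and the maxCount parameter are hypothetical names introduced here, and the input is assumed to be already ordered by priority. It only shows the idea of counting URLs by host and dropping entries once a host reaches its cap.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of org.apache.nutch.crawl.
final class HostCountLimitSketch {

    /** Keeps at most maxCount URLs per host, preserving the input order. */
    static List<String> limitPerHost(List<String> urls, int maxCount) {
        Map<String, Integer> countByHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // skip URLs without a parsable host
            }
            // merge returns the updated count for this host
            int count = countByHost.merge(host.toLowerCase(Locale.ROOT), 1, Integer::sum);
            if (count <= maxCount) {
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example.com/1",
            "http://a.example.com/2",
            "http://b.example.com/1",
            "http://a.example.com/3");
        // With a cap of 2, the third a.example.com URL is dropped.
        System.out.println(limitPerHost(urls, 2));
    }
}
```

Counting by domain instead of host would only change how the map key is derived from the URL; the capping logic stays the same.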


Nested Class Summary
static class Generator.CrawlDbUpdater
          Update the CrawlDB so that the next generate won't include the same URLs.
static class Generator.DecreasingFloatComparator
           
static class Generator.GeneratorOutputFormat
           
static class Generator.HashComparator
          Sort fetch lists by hash of URL.
static class Generator.PartitionReducer
           
static class Generator.Selector
          Selects entries due for fetch.
static class Generator.SelectorEntry
           
static class Generator.SelectorInverseMapper
           
 
Field Summary
static String GENERATE_MAX_PER_HOST
           
static String GENERATE_MAX_PER_HOST_BY_IP
           
static String GENERATE_UPDATE_CRAWLDB
           
static String GENERATOR_COUNT_MODE
           
static String GENERATOR_COUNT_VALUE_DOMAIN
           
static String GENERATOR_COUNT_VALUE_HOST
           
static String GENERATOR_CUR_TIME
           
static String GENERATOR_DELAY
           
static String GENERATOR_FILTER
           
static String GENERATOR_MAX_COUNT
           
static String GENERATOR_MAX_NUM_SEGMENTS
           
static String GENERATOR_MIN_INTERVAL
           
static String GENERATOR_MIN_SCORE
           
static String GENERATOR_NORMALISE
           
static String GENERATOR_RESTRICT_STATUS
           
static String GENERATOR_TOP_N
           
static org.slf4j.Logger LOG
           
 
Constructor Summary
Generator()
           
Generator(Configuration conf)
           
 
Method Summary
 Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime)
           
 Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean force)
          Old signature kept for compatibility. It does not specify whether to normalise and sets the number of segments to 1.
 Path[] generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean norm, boolean force, int maxNumSegments)
          Generate fetchlists in one or more segments.
static String generateSegmentName()
           
static void main(String[] args)
          Generate a fetchlist from the crawldb.
 int run(String[] args)
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

GENERATE_UPDATE_CRAWLDB

public static final String GENERATE_UPDATE_CRAWLDB
See Also:
Constant Field Values

GENERATOR_MIN_SCORE

public static final String GENERATOR_MIN_SCORE
See Also:
Constant Field Values

GENERATOR_MIN_INTERVAL

public static final String GENERATOR_MIN_INTERVAL
See Also:
Constant Field Values

GENERATOR_RESTRICT_STATUS

public static final String GENERATOR_RESTRICT_STATUS
See Also:
Constant Field Values

GENERATOR_FILTER

public static final String GENERATOR_FILTER
See Also:
Constant Field Values

GENERATOR_NORMALISE

public static final String GENERATOR_NORMALISE
See Also:
Constant Field Values

GENERATOR_MAX_COUNT

public static final String GENERATOR_MAX_COUNT
See Also:
Constant Field Values

GENERATOR_COUNT_MODE

public static final String GENERATOR_COUNT_MODE
See Also:
Constant Field Values

GENERATOR_COUNT_VALUE_DOMAIN

public static final String GENERATOR_COUNT_VALUE_DOMAIN
See Also:
Constant Field Values

GENERATOR_COUNT_VALUE_HOST

public static final String GENERATOR_COUNT_VALUE_HOST
See Also:
Constant Field Values

GENERATOR_TOP_N

public static final String GENERATOR_TOP_N
See Also:
Constant Field Values

GENERATOR_CUR_TIME

public static final String GENERATOR_CUR_TIME
See Also:
Constant Field Values

GENERATOR_DELAY

public static final String GENERATOR_DELAY
See Also:
Constant Field Values

GENERATOR_MAX_NUM_SEGMENTS

public static final String GENERATOR_MAX_NUM_SEGMENTS
See Also:
Constant Field Values

GENERATE_MAX_PER_HOST_BY_IP

public static final String GENERATE_MAX_PER_HOST_BY_IP
See Also:
Constant Field Values

GENERATE_MAX_PER_HOST

public static final String GENERATE_MAX_PER_HOST
See Also:
Constant Field Values

Constructor Detail

Generator

public Generator()

Generator

public Generator(Configuration conf)

Method Detail

generate

public Path[] generate(Path dbDir,
                       Path segments,
                       int numLists,
                       long topN,
                       long curTime)
                throws IOException
Throws:
IOException

generate

public Path[] generate(Path dbDir,
                       Path segments,
                       int numLists,
                       long topN,
                       long curTime,
                       boolean filter,
                       boolean force)
                throws IOException
Old signature kept for compatibility. It does not specify whether to normalise and sets the number of segments to 1.

Throws:
IOException

generate

public Path[] generate(Path dbDir,
                       Path segments,
                       int numLists,
                       long topN,
                       long curTime,
                       boolean filter,
                       boolean norm,
                       boolean force,
                       int maxNumSegments)
                throws IOException
Generate fetchlists in one or more segments. Whether to filter URLs is read from the crawl.generate.filter property in the configuration files. If the property is not found, the URLs are filtered. The same applies to normalisation.

Parameters:
dbDir - Crawl database directory
segments - Segments directory
numLists - Number of reduce tasks
topN - Number of top URLs to be selected
curTime - Current time in milliseconds
Returns:
Paths to the generated segments, or null if no entries were selected
Throws:
IOException - When an I/O error occurs
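
The "filtered unless configured otherwise" behaviour described for crawl.generate.filter amounts to reading a boolean property with a default of true. The sketch below shows that pattern with java.util.Properties as a stand-in for the Hadoop configuration, so that it runs without Hadoop on the classpath; the class FilterDefaultSketch and the method shouldFilter are hypothetical names. With a Hadoop Configuration, the comparable call would be getBoolean(name, defaultValue).

```java
import java.util.Properties;

// Hypothetical illustration only; java.util.Properties stands in for
// org.apache.hadoop.conf.Configuration.
final class FilterDefaultSketch {

    // Property name taken from the method description above.
    static final String FILTER_KEY = "crawl.generate.filter";

    /** Returns true unless the property is present and set to something other than "true". */
    static boolean shouldFilter(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(FILTER_KEY, "true"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(shouldFilter(conf)); // property absent, so filtering is on

        conf.setProperty(FILTER_KEY, "false");
        System.out.println(shouldFilter(conf)); // explicitly disabled
    }
}
```

Defaulting to true means a missing or incomplete configuration errs on the side of filtering URLs rather than letting unfiltered entries into a fetchlist.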

generateSegmentName

public static String generateSegmentName()

main

public static void main(String[] args)
                 throws Exception
Generate a fetchlist from the crawldb.

Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface Tool
Throws:
Exception


Copyright © 2012 The Apache Software Foundation