org.apache.nutch.crawl
Class Injector
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.crawl.Injector
- All Implemented Interfaces:
- Configurable, Tool
public class Injector
- extends Configured
- implements Tool
This class takes a flat file of URLs and adds them to the database of pages to be
crawled (the crawlDb). Useful for bootstrapping the system.
The URL files contain one URL per line, optionally followed by custom metadata.
Metadata entries are separated by tabs, and each key is separated from its value by '='.
Note that some metadata keys are reserved:
- nutch.score : sets a custom score for a specific URL
- nutch.fetchInterval : sets a custom fetch interval for a specific URL
e.g. http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
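The line format described above can be illustrated with a small, self-contained sketch. It is not the Injector's own parsing code: the class SeedLine and its fields are hypothetical names introduced for illustration, and the choice to read nutch.score as a float, to read nutch.fetchInterval as an integer, and to skip fields without a '=' are assumptions. Only the tab-separated key=value layout and the two reserved key names come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): parses one line of a seed URL file. */
final class SeedLine {
    // Reserved metadata keys named in the class description.
    static final String SCORE_KEY = "nutch.score";
    static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";

    final String url;
    final Float score;            // null when the line sets no custom score
    final Integer fetchInterval;  // null when the line sets no custom interval
    final Map<String, String> metadata;

    private SeedLine(String url, Float score, Integer fetchInterval,
                     Map<String, String> metadata) {
        this.url = url;
        this.score = score;
        this.fetchInterval = fetchInterval;
        this.metadata = metadata;
    }

    static SeedLine parse(String line) {
        // The first tab-separated field is the URL; the rest are key=value pairs.
        String[] fields = line.split("\t");
        String url = fields[0].trim();
        Float score = null;
        Integer fetchInterval = null;
        Map<String, String> metadata = new LinkedHashMap<>();

        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: silently skip fields without a key or '='
            }
            String key = field.substring(0, eq).trim();
            String value = field.substring(eq + 1).trim();
            if (key.equals(SCORE_KEY)) {
                score = Float.valueOf(value);           // assumed numeric type
            } else if (key.equals(FETCH_INTERVAL_KEY)) {
                fetchInterval = Integer.valueOf(value); // assumed numeric type
            } else {
                metadata.put(key, value);               // ordinary custom metadata
            }
        }
        return new SeedLine(url, score, fetchInterval, metadata);
    }

    public static void main(String[] args) {
        SeedLine seed = parse("http://www.nutch.org/\tnutch.score=10"
                + "\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println(seed.url + " score=" + seed.score
                + " fetchInterval=" + seed.fetchInterval
                + " metadata=" + seed.metadata);
    }
}
```

In this sketch the reserved keys are kept out of the general metadata map, so a caller can tell a custom score or fetch interval apart from user-defined entries such as userType.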
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
nutchScoreMDName
public static String nutchScoreMDName
- metadata key reserved for setting a custom score for a specific URL
nutchFetchIntervalMDName
public static String nutchFetchIntervalMDName
- metadata key reserved for setting a custom fetchInterval for a specific URL
Injector
public Injector()
Injector
public Injector(Configuration conf)
inject
public void inject(Path crawlDb,
Path urlDir)
throws IOException
- Throws:
IOException
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
run
public int run(String[] args)
throws Exception
- Specified by:
run in interface Tool
- Throws:
Exception
Copyright © 2011 The Apache Software Foundation