org.apache.nutch.crawl
Class InjectorJob
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.util.NutchTool
org.apache.nutch.crawl.InjectorJob
- All Implemented Interfaces:
- Configurable, Tool
public class InjectorJob
- extends NutchTool
- implements Tool
This class takes a flat file of URLs and adds them to the set of pages to be
crawled. Useful for bootstrapping the system.
Each URL file contains one URL per line, optionally followed by custom metadata.
Metadata entries are separated by tabs, and each key is separated from its value by '='.
Note that some metadata keys are reserved:
- nutch.score: sets a custom score for a specific URL
- nutch.fetchInterval: sets a custom fetch interval for a specific URL
For example (where \t denotes a tab character):
http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
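The seed line format described above can be sketched as a small standalone parser. This is only an illustration of the format, not the code InjectorJob itself uses: the `SeedLineParser` and `Seed` names are hypothetical, and trimming whitespace around each tab-separated field is an assumption made here so that the example line parses cleanly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Nutch API.
final class SeedLineParser {

    /** A parsed seed entry: the URL plus its key=value metadata. */
    static final class Seed {
        final String url;
        final Map<String, String> metadata;

        Seed(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /** Splits one seed line on tabs; the first field is the URL, the rest are key=value pairs. */
    static Seed parse(String line) {
        String[] fields = line.split("\t");
        String url = fields[0].trim();
        // LinkedHashMap keeps the metadata in the order it appears on the line.
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: silently skip fields without a key or '='
            }
            metadata.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
        }
        return new Seed(url, metadata);
    }

    public static void main(String[] args) {
        Seed seed = parse(
            "http://www.nutch.org/\tnutch.score=10\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println(seed.url);
        System.out.println(seed.metadata);
    }
}
```

Reserved keys such as nutch.score arrive in the same map as custom keys like userType; a caller would have to look them up by name to treat them specially.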
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
nutchScoreMDName
public static String nutchScoreMDName
- metadata key reserved for setting a custom score for a specific URL
nutchFetchIntervalMDName
public static String nutchFetchIntervalMDName
- metadata key reserved for setting a custom fetch interval for a specific URL
InjectorJob
public InjectorJob()
InjectorJob
public InjectorJob(Configuration conf)
run
public Map<String,Object> run(Map<String,Object> args)
throws Exception
- Description copied from class: NutchTool
- Runs the tool, using a map of arguments. May return results, or null.
- Specified by:
run in class NutchTool
- Throws:
Exception
inject
public void inject(Path urlDir)
throws Exception
- Throws:
Exception
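As a rough sketch of preparing input for injection, the following writes seed lines in the tab-separated format into a file inside a fresh directory. It rests on several assumptions: that urlDir names a directory holding one or more such files (suggested by the parameter name only), that the file name seed.txt is acceptable, and that a local temporary directory is a suitable location. The `SeedFileWriter` class is hypothetical, and the `java.nio.file.Path` used here is not the `Path` type that inject takes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Nutch API.
final class SeedFileWriter {

    /** Joins a URL and its metadata into one tab-separated seed line. */
    static String formatLine(String url, Map<String, String> metadata) {
        StringBuilder line = new StringBuilder(url);
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            line.append('\t').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return line.toString();
    }

    /** Writes the seed lines to seed.txt (an assumed name) in a new temporary directory. */
    static Path writeSeedDir(List<String> lines) throws IOException {
        Path dir = Files.createTempDirectory("urls");
        Files.write(dir.resolve("seed.txt"), lines, StandardCharsets.UTF_8);
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("nutch.score", "10");
        metadata.put("userType", "open_source");
        String line = formatLine("http://www.nutch.org/", metadata);
        Path dir = writeSeedDir(Collections.singletonList(line));
        System.out.println("Seed directory: " + dir);
    }
}
```

The resulting directory path would then need to be converted to whatever Path type inject expects before being passed in.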
run
public int run(String[] args)
throws Exception
- Specified by:
run in interface Tool
- Throws:
Exception
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
Copyright © 2012 The Apache Software Foundation