org.apache.nutch.crawl
Class InjectorJob

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.util.NutchTool
          extended by org.apache.nutch.crawl.InjectorJob
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class InjectorJob
extends NutchTool
implements org.apache.hadoop.util.Tool

This class takes a flat file of URLs and adds them to the set of pages to be crawled. Useful for bootstrapping the system. Each URL file contains one URL per line, optionally followed by custom metadata. Metadata fields are separated by tabs, and within each field the key is separated from its value by '='.
Note that some metadata keys are reserved:
- nutch.score : sets a custom score for a specific URL
- nutch.fetchInterval : sets a custom fetch interval for a specific URL
For example (\t denotes a tab character): http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
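
The line format described above can be illustrated with a minimal, standalone sketch. This is not the parsing code used by InjectorJob; the class SeedLine and its parse method are hypothetical names introduced only for this example, and the handling of malformed fields (they are skipped) is an assumption rather than documented Nutch behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): splits one seed-file line into a URL and its metadata. */
final class SeedLine {
    // Reserved metadata keys named in the class description above.
    static final String SCORE_KEY = "nutch.score";
    static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";

    final String url;
    // LinkedHashMap keeps the metadata in the order it appears on the line.
    final Map<String, String> metadata = new LinkedHashMap<>();

    private SeedLine(String url) {
        this.url = url;
    }

    /** Parses "url TAB key=value TAB key=value ..." into a SeedLine. */
    static SeedLine parse(String line) {
        String[] parts = line.split("\t");
        SeedLine seed = new SeedLine(parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            String field = parts[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: fields without a key or '=' are ignored
            }
            String key = field.substring(0, eq).trim();
            String value = field.substring(eq + 1).trim();
            seed.metadata.put(key, value);
        }
        return seed;
    }

    public static void main(String[] args) {
        SeedLine seed = parse(
            "http://www.nutch.org/\tnutch.score=10\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println("url = " + seed.url);
        System.out.println("score = " + seed.metadata.get(SCORE_KEY));
        System.out.println("fetchInterval = " + seed.metadata.get(FETCH_INTERVAL_KEY));
        System.out.println("userType = " + seed.metadata.get("userType"));
    }
}
```

In this sketch the reserved keys are stored alongside custom keys such as userType; a caller that wants to treat them specially can look them up by the two constants.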


Nested Class Summary
static class InjectorJob.UrlMapper
           
 
Field Summary
static org.slf4j.Logger LOG
           
static String nutchFetchIntervalMDName
          metadata key reserved for setting a custom fetchInterval for a specific URL
static String nutchScoreMDName
          metadata key reserved for setting a custom score for a specific URL
 
Fields inherited from class org.apache.nutch.util.NutchTool
currentJob, currentJobNum, numJobs, results, status
 
Constructor Summary
InjectorJob()
           
InjectorJob(org.apache.hadoop.conf.Configuration conf)
           
 
Method Summary
 void inject(org.apache.hadoop.fs.Path urlDir)
           
static void main(String[] args)
           
 Map<String,Object> run(Map<String,Object> args)
          Runs the tool, using a map of arguments.
 int run(String[] args)
           
 
Methods inherited from class org.apache.nutch.util.NutchTool
getProgress, getStatus, killJob, stopJob
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

nutchScoreMDName

public static String nutchScoreMDName
metadata key reserved for setting a custom score for a specific URL


nutchFetchIntervalMDName

public static String nutchFetchIntervalMDName
metadata key reserved for setting a custom fetchInterval for a specific URL

Constructor Detail

InjectorJob

public InjectorJob()

InjectorJob

public InjectorJob(org.apache.hadoop.conf.Configuration conf)

Method Detail

run

public Map<String,Object> run(Map<String,Object> args)
                       throws Exception
Description copied from class: NutchTool
Runs the tool, using a map of arguments. May return results, or null.

Specified by:
run in class NutchTool
Throws:
Exception

inject

public void inject(org.apache.hadoop.fs.Path urlDir)
            throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface org.apache.hadoop.util.Tool
Throws:
Exception

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2013 The Apache Software Foundation