org.apache.nutch.parse
Class ParserChecker

java.lang.Object
  extended by org.apache.nutch.parse.ParserChecker
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class ParserChecker
extends Object
implements org.apache.hadoop.util.Tool

Parser checker, useful for testing parsers. It also reports possible fetching and parsing failures and presents protocol status signals to aid debugging. The tool retrieves the following data for any URL:

  1. contentType: The content type of the URL.
  2. signature: A digest that identifies the page (like a unique ID) and is used to remove duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
  3. Version: From org.apache.nutch.parse.ParseData.
  4. Status: From org.apache.nutch.parse.ParseData.
  5. Title: The title of the page at the URL.
  6. Outlinks: The outlinks associated with the URL.
  7. Content Metadata: such as X-AspNet-Version, Date, Content-length, servedBy, Content-Type, Cache-Control, etc.
  8. Parse Metadata: such as CharEncodingForConversion, OriginalCharEncoding, language, etc.
  9. ParseText: The parsed text of the page, which varies in length depending on the content.length configuration.
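
The role of the signature in item 2 can be illustrated with a minimal, self-contained sketch. It is not Nutch's MD5Signature implementation: the class name SignatureSketch, the helpers md5Hex and dedup, and the example.com URLs are all hypothetical, and the sketch assumes the digest is taken over the raw page bytes. It shows only the general idea that pages with identical content share a digest, so later duplicates can be dropped.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Nutch.
public class SignatureSketch {

    // Returns the hex-encoded MD5 digest of the given page bytes.
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Maps each distinct signature to the first URL seen with that content;
    // later URLs with the same signature are treated as duplicates.
    public static Map<String, String> dedup(Map<String, String> contentByUrl) {
        Map<String, String> firstUrlBySignature = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : contentByUrl.entrySet()) {
            byte[] bytes = page.getValue().getBytes(StandardCharsets.UTF_8);
            firstUrlBySignature.putIfAbsent(md5Hex(bytes), page.getKey());
        }
        return firstUrlBySignature;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "hello"); // assumed sample data
        pages.put("http://example.com/b", "hello"); // same content as /a
        pages.put("http://example.com/c", "world");
        dedup(pages).forEach((sig, url) -> System.out.println(sig + " " + url));
    }
}
```

In this sketch the first URL seen for a given digest is kept. An exact byte digest treats any change in content as a new page, whereas a text-profile signature such as TextProfileSignature is intended to tolerate small differences between near-duplicate pages.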

Author:
John Xing

Field Summary
static org.slf4j.Logger LOG
           
 
Constructor Summary
ParserChecker()
           
 
Method Summary
 org.apache.hadoop.conf.Configuration getConf()
           
static void main(String[] args)
           
 int run(String[] args)
           
 void setConf(org.apache.hadoop.conf.Configuration c)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

ParserChecker

public ParserChecker()
Method Detail

run

public int run(String[] args)
        throws Exception
Specified by:
run in interface org.apache.hadoop.util.Tool
Throws:
Exception

getConf

public org.apache.hadoop.conf.Configuration getConf()
Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

setConf

public void setConf(org.apache.hadoop.conf.Configuration c)
Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2013 The Apache Software Foundation