org.apache.nutch.parse
Class ParserChecker
java.lang.Object
org.apache.nutch.parse.ParserChecker
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class ParserChecker
- extends Object
- implements org.apache.hadoop.util.Tool
Parser checker, useful for testing parsers.
It reports fetching and parsing failures and presents
protocol status signals to aid debugging.
The tool retrieves the following data for any URL:
- contentType: The content type of the URL.
- signature: A digest that identifies the page (like a unique ID) and is used to remove
duplicates during the dedup procedure. It is calculated using MD5Signature or TextProfileSignature.
- Version: From org.apache.nutch.parse.ParseData.
- Status: From org.apache.nutch.parse.ParseData.
- Title: of the URL
- Outlinks: associated with the URL
- Content Metadata: such as X-AspNet-Version, Date,
Content-length, servedBy, Content-Type, Cache-Control, etc.
- Parse Metadata: such as CharEncodingForConversion,
OriginalCharEncoding, language, etc.
- ParseText: The parsed text of the page, which varies in length depending on the
content.length configuration.
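To illustrate the signature entry in the list above, the following is a minimal sketch of how a digest can act as a page identity for deduplication. It assumes the digest is an MD5 hash over the raw content bytes. The class `SignatureSketch` and the method `md5Hex` are hypothetical names used only for illustration; they are not part of Nutch and do not reproduce the implementation of MD5Signature or TextProfileSignature.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only: not a Nutch class.
class SignatureSketch {

    /** Returns the MD5 digest of the given bytes as a lowercase hex string. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so each byte prints as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String pageA = md5Hex("<html>same page</html>".getBytes(StandardCharsets.UTF_8));
        String pageB = md5Hex("<html>same page</html>".getBytes(StandardCharsets.UTF_8));
        String pageC = md5Hex("<html>other page</html>".getBytes(StandardCharsets.UTF_8));

        // Identical content yields identical digests, so a dedup step can drop one copy.
        System.out.println(pageA.equals(pageB)); // prints true
        // Different content yields a different digest.
        System.out.println(pageA.equals(pageC)); // prints false
    }
}
```

An exact byte-level digest such as this one treats pages as duplicates only when their content is identical, so even a small change in the bytes produces a different signature.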
- Author:
- John Xing
Field Summary
static org.slf4j.Logger | LOG
Method Summary
org.apache.hadoop.conf.Configuration | getConf()
static void | main(String[] args)
int | run(String[] args)
void | setConf(org.apache.hadoop.conf.Configuration c)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
ParserChecker
public ParserChecker()
run
public int run(String[] args)
throws Exception
- Specified by:
run
in interface org.apache.hadoop.util.Tool
- Throws:
Exception
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
getConf
in interface org.apache.hadoop.conf.Configurable
setConf
public void setConf(org.apache.hadoop.conf.Configuration c)
- Specified by:
setConf
in interface org.apache.hadoop.conf.Configurable
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
Copyright © 2013 The Apache Software Foundation