org.apache.nutch.parse.js
Class JSParseFilter

java.lang.Object
  extended by org.apache.nutch.parse.js.JSParseFilter
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, ParseFilter, Parser, FieldPluggable, Pluggable

public class JSParseFilter
extends Object
implements ParseFilter, Parser

This class is a heuristic link extractor for JavaScript files and code snippets. The general idea of two-pass regex matching comes from Heritrix. Parts of the code come from OutlinkExtractor.java by Stephan Strittmatter.
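
The two-pass idea mentioned above can be illustrated with a minimal, self-contained sketch. This is not the code of JSParseFilter: the class name TwoPassLinkSketch, the method extractCandidates and both regular expressions are assumptions made for illustration only. The first pass collects quoted string literals, and the second pass keeps the literals that look like a URL or path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of two-pass regex link extraction.
// The patterns below are assumptions, not the ones used by JSParseFilter.
public class TwoPassLinkSketch {

    // Pass 1: quoted string literals containing no whitespace or quotes.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("([\"'])([^\\s\"']+?)\\1");

    // Pass 2: a literal is a link candidate if it starts with http(s)://
    // or contains a slash or dot between non-space characters.
    private static final Pattern URI_LIKE =
        Pattern.compile("^(https?://\\S+|/?\\S+[/.]\\S+)$");

    public static List<String> extractCandidates(String js) {
        List<String> candidates = new ArrayList<>();
        Matcher literals = STRING_LITERAL.matcher(js);
        while (literals.find()) {
            String literal = literals.group(2);
            // Keep only literals that pass the second, URL-shaped check.
            if (URI_LIKE.matcher(literal).matches()) {
                candidates.add(literal);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        String js = "var a = 'http://example.com/a.html'; "
                  + "var b = \"hello\"; go('img/pic.png');";
        System.out.println(extractCandidates(js));
    }
}
```

Because the match is heuristic, a sketch like this can both miss links that are assembled at run time and report strings that merely resemble paths.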

Author:
Andrzej Bialecki <ab@getopt.org>

Field Summary
static org.slf4j.Logger LOG
           
 
Fields inherited from interface org.apache.nutch.parse.ParseFilter
X_POINT_ID
 
Fields inherited from interface org.apache.nutch.parse.Parser
X_POINT_ID
 
Constructor Summary
JSParseFilter()
           
 
Method Summary
 Parse filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
          Scan the JavaScript looking for possible outlinks.
 org.apache.hadoop.conf.Configuration getConf()
          Get the Configuration object
 Collection<WebPage.Field> getFields()
          Gets all the fields for a given WebPage. Many datastores need to set up the mapreduce job by specifying the fields needed.
 Parse getParse(String url, WebPage page)
          Parse the JavaScript content of the given WebPage and extract possible outlinks.
static void main(String[] args)
          Main method, which can be run from the command line with the plugin option.
 void setConf(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration object
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

JSParseFilter

public JSParseFilter()
Method Detail

filter

public Parse filter(String url,
                    WebPage page,
                    Parse parse,
                    HTMLMetaTags metaTags,
                    DocumentFragment doc)
Scan the JavaScript looking for possible outlinks.

Specified by:
filter in interface ParseFilter
Parameters:
url - URL of the WebPage to be parsed
page - WebPage object relative to the URL
parse - Parse object holding parse status
metaTags - HTML meta tags of the page
doc - DOM DocumentFragment of the parsed page
Returns:
the resulting Parse object

getParse

public Parse getParse(String url,
                      WebPage page)
Parse the JavaScript content of the given WebPage and extract possible outlinks.

Specified by:
getParse in interface Parser
Parameters:
url - URL of the WebPage which is parsed
page - WebPage object relative to the URL
Returns:
the resulting Parse object

main

public static void main(String[] args)
                 throws Exception
Main method, which can be run from the command line with the plugin option. The method takes two arguments, the JavaScript file and the base URL, e.g. o.a.n.parse.js.JSParseFilter file.js baseURL

Parameters:
args - the JavaScript file to scan, followed by the base URL
Throws:
Exception
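
The baseURL argument suggests that relative links found in the file are resolved against a base address. The following sketch shows one way such resolution could work using only java.net.URI. It is an assumption about the purpose of the argument, not the code of this method, and the names BaseUrlSketch and resolveAll are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of resolving link candidates against a base URL.
public class BaseUrlSketch {

    public static List<String> resolveAll(String baseUrl, List<String> candidates) {
        URI base = URI.create(baseUrl);
        List<String> resolved = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                // Absolute candidates are returned unchanged; relative
                // ones are resolved against the base (RFC 3986 rules).
                resolved.add(base.resolve(candidate).toString());
            } catch (IllegalArgumentException e) {
                // Skip strings that are not syntactically valid URIs.
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> links = List.of("img/pic.png", "/x.js", "http://other.org/a");
        System.out.println(resolveAll("http://example.com/dir/page.html", links));
    }
}
```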

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object

Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

getConf

public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object

Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

getFields

public Collection<WebPage.Field> getFields()
Gets all the fields for a given WebPage. Many datastores need to set up the mapreduce job by specifying the fields needed. All extensions that work on WebPage are able to specify what fields they need.

Specified by:
getFields in interface FieldPluggable


Copyright © 2013 The Apache Software Foundation