java.lang.Object
  org.apache.nutch.parse.js.JSParseFilter
public class JSParseFilter
This class is a heuristic link extractor for JavaScript files and code snippets. The general idea of two-pass regex matching comes from Heritrix. Parts of the code come from OutlinkExtractor.java by Stephan Strittmatter.
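The two-pass idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in JSParseFilter: the class name TwoPassJsLinkSketch, both regular expressions, and the sample URLs are hypothetical and chosen for brevity. The first pass collects quoted string literals from the script; the second pass keeps only literals that look like a URL or path and resolves them against a base URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of two-pass link extraction; not the Nutch implementation.
final class TwoPassJsLinkSketch {

    // Pass 1 (assumed pattern): a quoted literal with no whitespace or quotes inside.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("([\"'])([^\\s\"']+)\\1");

    // Pass 2 (assumed heuristic): an http(s) URL, any text containing a slash
    // followed by more text, or a name ending in .htm/.html.
    private static final Pattern URI_LIKE =
            Pattern.compile("https?://\\S+|\\S*/\\S+|\\S+\\.html?");

    static List<String> extract(String script, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher literal = STRING_LITERAL.matcher(script);
        while (literal.find()) {
            String candidate = literal.group(2);
            if (!URI_LIKE.matcher(candidate).matches()) {
                continue; // plain string, not a link candidate
            }
            try {
                // Relative candidates are resolved against the base URL.
                links.add(base.resolve(candidate).toString());
            } catch (IllegalArgumentException e) {
                // Candidate is not a syntactically valid URI; skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String js = "var a = 'http://example.com/a.html'; go(\"/docs/page.html\");"
                + " var s = 'hello'; load('next.html');";
        for (String link : extract(js, "http://example.com/app/index.html")) {
            System.out.println(link);
        }
    }
}
```

Because the second pass is only a heuristic, a sketch like this will accept some strings that are not links (for example, MIME types such as text/javascript contain a slash), so callers would normally filter or validate the results further.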
Field Summary | |
---|---|
static org.slf4j.Logger | LOG |
Fields inherited from interface org.apache.nutch.parse.ParseFilter |
---|
X_POINT_ID |
Fields inherited from interface org.apache.nutch.parse.Parser |
---|
X_POINT_ID |
Constructor Summary |
---|
JSParseFilter() |
Method Summary | |
---|---|
Parse | filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc): Scan the JavaScript for possible Outlinks. |
org.apache.hadoop.conf.Configuration | getConf(): Get the Configuration object. |
Collection<WebPage.Field> | getFields(): Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed. |
Parse | getParse(String url, WebPage page): Parse the WebPage at the given URL and return the resulting Parse object. |
static void | main(String[] args): Main method which can be run from the command line with the plugin option. |
void | setConf(org.apache.hadoop.conf.Configuration conf): Set the Configuration object. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final org.slf4j.Logger LOG
Constructor Detail |
---|
public JSParseFilter()
Method Detail |
---|
public Parse filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
Scan the JavaScript for possible Outlinks.
Specified by:
filter in interface ParseFilter
Parameters:
url - URL of the WebPage to be parsed
page - WebPage object relative to the URL
parse - Parse object holding parse status
metaTags - HTMLMetaTags within the NutchDocument
doc - The NutchDocument object
Returns:
Parse object
public Parse getParse(String url, WebPage page)
Parse the WebPage at the given URL and return the resulting Parse object.
Specified by:
getParse in interface Parser
Parameters:
url - URL of the WebPage which is parsed
page - WebPage object relative to the URL
Returns:
Parse object
public static void main(String[] args) throws Exception
Main method which can be run from the command line with the plugin option.
Parameters:
args -
Throws:
Exception
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object
Specified by:
setConf in interface org.apache.hadoop.conf.Configurable
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object
Specified by:
getConf in interface org.apache.hadoop.conf.Configurable
public Collection<WebPage.Field> getFields()
Gets all the fields for a given WebPage. Many datastores need to set up the MapReduce job by specifying the fields needed. All extensions that work on WebPage are able to specify what fields they need.
Specified by:
getFields in interface FieldPluggable
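To illustrate why each extension declares its fields, here is a small sketch of how a job might combine those declarations into one set of fields to load. It is an assumption-laden stand-in, not Nutch code: the Field enum substitutes for WebPage.Field, the FieldSource interface substitutes for FieldPluggable, and the field and plugin names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: merge the fields requested by several plugins.
final class FieldUnionSketch {

    // Stand-in for WebPage.Field; the constants are illustrative only.
    enum Field { BASE_URL, CONTENT, CONTENT_TYPE, OUTLINKS, TITLE }

    // Stand-in for the FieldPluggable contract described above.
    interface FieldSource {
        Collection<Field> getFields();
    }

    // Union of every field any plugin asks for; a plugin that needs
    // nothing may return null or an empty collection.
    static Set<Field> requiredFields(List<? extends FieldSource> plugins) {
        Set<Field> all = EnumSet.noneOf(Field.class);
        for (FieldSource plugin : plugins) {
            Collection<Field> fields = plugin.getFields();
            if (fields != null) {
                all.addAll(fields);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        FieldSource linkPlugin = () -> EnumSet.of(Field.CONTENT, Field.OUTLINKS);
        FieldSource titlePlugin = () -> EnumSet.of(Field.CONTENT, Field.TITLE);
        FieldSource noFieldsPlugin = () -> null;
        System.out.println(
                requiredFields(Arrays.asList(linkPlugin, titlePlugin, noFieldsPlugin)));
    }
}
```

Collecting the union once, before the job starts, lets the datastore query fetch only the columns that some extension will actually read.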