public class DOMContentUtils extends Object
Constructor and Description |
---|
DOMContentUtils(org.apache.hadoop.conf.Configuration conf) |
Modifier and Type | Method and Description |
---|---|
void | getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node) |
void | getText(StringBuffer sb, Node node) This is a convenience method, equivalent to getText(sb, node, false). |
boolean | getTitle(StringBuffer sb, Node node) This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer. |
void | setConf(org.apache.hadoop.conf.Configuration conf) |
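The getTitle contract summarized above (append the text beneath the first title node to a StringBuffer) can be illustrated with the JDK's own DOM API. The following is a minimal sketch, not the Nutch implementation: the class name TitleSketch and the parse helper are hypothetical, and it assumes the boolean return value signals whether a title node was found, which this page does not state.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical stand-in for illustration only; not org.apache.nutch code.
class TitleSketch {

    // Depth-first search for the first <title> element at or below `node`.
    // Appends its text content to `sb` and returns true when one is found
    // (the meaning of the return value is an assumption).
    static boolean getTitle(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            sb.append(node.getTextContent());
            return true;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (getTitle(sb, c)) {
                return true; // stop at the first title in document order
            }
        }
        return false;
    }

    // Hypothetical helper: parses well-formed XHTML into a DOM tree.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Node doc = parse("<html><head><title>Hello</title></head><body><p>x</p></body></html>");
        StringBuffer sb = new StringBuffer();
        boolean found = getTitle(sb, doc);
        System.out.println(found + " " + sb); // prints: true Hello
    }
}
```

Only the first title in document order contributes text, so a stray second title element further down the tree leaves the buffer unchanged.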
public DOMContentUtils(org.apache.hadoop.conf.Configuration conf)
public void setConf(org.apache.hadoop.conf.Configuration conf)
public void getText(StringBuffer sb, Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public boolean getTitle(StringBuffer sb, Node node)
This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer.
public void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links that contain only a single nested link and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
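The link handling described above (resolve each anchor against the base URL, discard anchors with no inner structure, discard anchors that wrap only a single nested link plus empty text) can be sketched with the JDK alone. This is an illustrative sketch under stated assumptions, not the Nutch implementation: OutlinkSketch, shouldDiscard, collectLinks, parse, and url are hypothetical names, the output is a list of resolved URL strings rather than Outlink records, and the exact discard conditions are one reading of the description.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical stand-in for illustration only; not org.apache.nutch code.
class OutlinkSketch {

    // True for anchors with no child nodes, and for anchors whose only
    // children are one nested <a> plus whitespace-only text nodes.
    static boolean shouldDiscard(Node anchor) {
        if (!anchor.hasChildNodes()) {
            return true;
        }
        int nestedAnchors = 0;
        boolean otherContent = false;
        for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE && c.getNodeValue().trim().isEmpty()) {
                continue; // empty text node, ignored
            }
            if (c.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(c.getNodeName())) {
                nestedAnchors++;
            } else {
                otherContent = true;
            }
        }
        return !otherContent && nestedAnchors == 1;
    }

    // Walks the tree below `node` and adds each kept link, resolved against `base`.
    static void collectLinks(URL base, List<String> out, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty() && !shouldDiscard(node)) {
                try {
                    out.add(base.toURI().resolve(href).toString());
                } catch (URISyntaxException | IllegalArgumentException e) {
                    // unparseable href: skipped in this sketch
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectLinks(base, out, c); // nested anchors are visited too
        }
    }

    // Hypothetical helper: parses well-formed XHTML into a DOM tree.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Hypothetical helper: builds a URL without checked exceptions.
    static URL url(String s) {
        try {
            return URI.create(s).toURL();
        } catch (Exception e) {
            throw new IllegalArgumentException(s, e);
        }
    }

    public static void main(String[] args) {
        Node doc = parse("<body><a href=\"a.html\">text</a><a href=\"empty.html\"></a>"
            + "<a href=\"outer.html\"> <a href=\"/inner.html\">in</a> </a></body>");
        List<String> links = new ArrayList<>();
        collectLinks(url("http://example.com/dir/page.html"), links, doc);
        System.out.println(links);
    }
}
```

In the sample input, the empty anchor and the outer wrapper anchor are dropped, while the nested inner anchor is still collected because the traversal continues into the wrapper's children.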
Copyright © 2014 The Apache Software Foundation