org.apache.nutch.parse.tika
Class DOMContentUtils

java.lang.Object
  extended by org.apache.nutch.parse.tika.DOMContentUtils

public class DOMContentUtils
extends Object

A collection of utility methods for extracting content from DOM trees, such as getOutlinks, getText, and getTitle.


Constructor Summary
DOMContentUtils(org.apache.hadoop.conf.Configuration conf)
           
 
Method Summary
 void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
          Finds all anchors below the supplied DOM node, creates an Outlink record for each (resolved relative to the supplied base URL), and adds the records to the outlinks ArrayList.
 void getText(StringBuffer sb, Node node)
          This is a convenience method, equivalent to getText(sb, node, false).
 boolean getTitle(StringBuffer sb, Node node)
          Appends the text content found beneath the first title node under the supplied DOM Node to the supplied StringBuffer.
 void setConf(org.apache.hadoop.conf.Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DOMContentUtils

public DOMContentUtils(org.apache.hadoop.conf.Configuration conf)
Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)

getText

public void getText(StringBuffer sb,
                    Node node)
This is a convenience method, equivalent to getText(sb, node, false).


getTitle

public boolean getTitle(StringBuffer sb,
                        Node node)
Appends the text content found beneath the first title node under the supplied DOM Node to the supplied StringBuffer.

Returns:
true if a title node was found, false otherwise
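
The contract described for getTitle (append the text beneath the first title node, and report whether one was found) can be illustrated with a minimal sketch that uses only the JDK's DOM API. This is not the Nutch implementation: the class TitleSketch and the helpers parse, findFirstTitle, and appendTitle are hypothetical names, and trimming the title text is an assumption made for the demo.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; it does not use or reproduce Nutch code.
class TitleSketch {

    // Demo helper: parses a well-formed XHTML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Depth-first, document-order search for the first element named "title".
    static Node findFirstTitle(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return node;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node hit = findFirstTitle(children.item(i));
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Appends the title text to sb; returns true only if a title node exists.
    static boolean appendTitle(StringBuffer sb, Node node) {
        Node title = findFirstTitle(node);
        if (title == null) {
            return false;
        }
        sb.append(title.getTextContent().trim()); // trimming is an assumption
        return true;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><head><title> Hello </title></head><body><p>x</p></body></html>");
        StringBuffer sb = new StringBuffer();
        boolean found = appendTitle(sb, doc);
        System.out.println(found + " " + sb); // prints "true Hello"
    }
}
```

As with the documented method, the caller supplies the StringBuffer, so the boolean return value is the only way to tell an empty title from a missing one.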

getOutlinks

public void getOutlinks(URL base,
                        ArrayList<Outlink> outlinks,
                        Node node)
Finds all anchors below the supplied DOM node, creates an Outlink record for each (resolved relative to the supplied base URL), and adds the records to the outlinks ArrayList.

Links without inner structure (tags, text, etc.) are discarded, as are links that contain only a single nested link and empty text nodes (a common DOM-fixup artifact, at least with nekohtml).
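
The discard rules above can be sketched roughly as follows, again with only the JDK's DOM API. This is a simplified illustration, not the Nutch implementation: the names OutlinkSketch, Link, shouldDiscard, and collect are hypothetical, a plain Link holder stands in for Outlink, a java.net.URI stands in for the URL base, and only `a` elements with an `href` attribute are treated as anchors.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; it does not use or reproduce Nutch code.
class OutlinkSketch {

    // Minimal stand-in for an outlink record: target URL plus anchor text.
    static final class Link {
        final String url;
        final String anchor;
        Link(String url, String anchor) { this.url = url; this.anchor = anchor; }
    }

    // Demo helper: parses a well-formed XHTML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static boolean isAnchor(Node n) {
        return n.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(n.getNodeName());
    }

    // True for anchors with no children, or whose children are exactly one
    // nested anchor plus, optionally, whitespace-only text nodes.
    static boolean shouldDiscard(Node anchor) {
        NodeList kids = anchor.getChildNodes();
        if (kids.getLength() == 0) {
            return true; // no inner structure
        }
        int nestedAnchors = 0;
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            if (isAnchor(k)) {
                nestedAnchors++;
            } else if (k.getNodeType() != Node.TEXT_NODE
                    || !k.getNodeValue().trim().isEmpty()) {
                return false; // real content: keep the link
            }
        }
        return nestedAnchors == 1;
    }

    // Walks the subtree in document order and collects the surviving links.
    static void collect(URI base, List<Link> out, Node node) {
        if (isAnchor(node) && ((Element) node).hasAttribute("href")
                && !shouldDiscard(node)) {
            String href = ((Element) node).getAttribute("href");
            out.add(new Link(base.resolve(href).toString(),
                    node.getTextContent().trim()));
        }
        NodeList kids = node.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            collect(base, out, kids.item(i));
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div><a href=\"a.html\">First</a>"
                + "<a href=\"empty.html\"/>"
                + "<a href=\"outer.html\"><a href=\"inner.html\">In</a></a></div>");
        List<Link> links = new ArrayList<>();
        collect(URI.create("http://example.com/dir/"), links, doc);
        for (Link l : links) {
            System.out.println(l.url + " [" + l.anchor + "]");
        }
    }
}
```

In this sketch the empty anchor and the outer wrapper anchor are dropped, while the nested inner anchor is still visited and kept because the walk continues into discarded anchors.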



Copyright © 2013 The Apache Software Foundation