org.apache.any23.extractor.html
Class DomUtils

java.lang.Object
  extended by org.apache.any23.extractor.html.DomUtils

public class DomUtils
extends Object

This class provides utility methods for DOM manipulation. It is separated from HTMLDocument so that its methods can be run on single DOM nodes without having to wrap them into an HTMLDocument. We use a mix of XPath and DOM manipulation.

This is likely to be a performance bottleneck but at least everything is localized here.


Method Summary
static String find(Node node, String xpath)
          Gets the string value of an XPath expression.
static List<Node> findAll(Node node, String xpath)
          Returns a list of all the nodes that match an XPath expression, which must be valid.
static List<Node> findAllByAttributeName(Node root, String attrName)
          Finds all nodes that have a declared attribute.
static List<Node> findAllByClassName(Node root, String className)
          Finds all nodes that have a declared class.
static List<Node> findAllByTag(Node root, String tagName)
           
static List<Node> findAllByTagAndClassName(Node root, String tagName, String className)
           
static Node findNodeById(Node root, String id)
          Mimics the JS DOM API, or prototype's $()
static int getIndexInParent(Node n)
          Returns the index of the given node within the list of children of its parent node.
static int[] getNodeLocation(Node n)
          Returns the row/col location of the given node.
static String getXPathForNode(Node node)
          Does a reverse walking of the DOM tree to generate a unique XPath expression leading to this node.
static String[] getXPathListForNode(Node n)
          Returns a list of tag names representing the path from the document root to the given node n.
static boolean hasAttribute(Node node, String attributeName)
          Checks the presence of an attribute in the given node.
static boolean hasAttribute(Node node, String attributeName, String className)
          Checks the presence of an attribute value in attributes that contain whitespace-separated lists of values.
static boolean hasClassName(Node node, String className)
          Tells whether an element has a class name, without checking its parents in the hierarchy; mimics the CSS .foo match.
static boolean isAncestorOf(Node candidateAncestor, Node candidateSibling)
          Checks whether a node is an ancestor of, or the same as, another node.
static boolean isAncestorOf(Node candidateAncestor, Node candidateSibling, boolean strict)
          Checks whether a node is an ancestor of, or the same as, another node.
static boolean isElementNode(Node target)
          Verifies if the given target node is an element.
static String readAttribute(Node node, String attribute)
          Reads the value of an attribute, returning the empty string if not present.
static String readAttribute(Node node, String attribute, String defaultValue)
          Reads the value of the specified attribute, returning the defaultValue string if not present.
static String readAttributeWithPrefix(Node node, String attributePrefix, String defaultValue)
          Reads the value of the first attribute whose name matches the specified attributePrefix.
static String serializeToXML(Node node, boolean indent)
          Given a DOM Node produces the XML serialization omitting the XML declaration.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

getIndexInParent

public static int getIndexInParent(Node n)
Returns the index of the given node within the list of children of its parent node.

Parameters:
n - the node whose index is returned.
Returns:
a non-negative number.

getXPathForNode

public static String getXPathForNode(Node node)
Walks the DOM tree in reverse to generate a unique XPath expression leading to this node. The generated XPath is the canonical one based on sibling index, for example /html[1]/body[1]/div[2]/span[3].

Parameters:
node - the input node.
Returns:
the XPath location of node as String.
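
The reverse walk described above can be sketched with the JDK's own org.w3c.dom API. This is a minimal illustration, not the DomUtils implementation: the class XPathSketch and its helpers are hypothetical names, and it assumes the sibling index counts only preceding element siblings with the same tag name, starting at 1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical helper: 1-based position of n among the element siblings
    // that share its tag name (assumption about how the index is defined).
    public static int positionAmongSameNameSiblings(Node n) {
        int position = 1;
        for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.ELEMENT_NODE
                    && s.getNodeName().equals(n.getNodeName())) {
                position++;
            }
        }
        return position;
    }

    // Walks from the node up to the root element, prepending one step per element.
    public static String xpathFor(Node node) {
        StringBuilder path = new StringBuilder();
        for (Node n = node;
             n != null && n.getNodeType() == Node.ELEMENT_NODE;
             n = n.getParentNode()) {
            path.insert(0, "/" + n.getNodeName()
                    + "[" + positionAmongSameNameSiblings(n) + "]");
        }
        return path.toString();
    }

    // Parses a small XML string into a DOM document for the demo.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><body><div/><div><span/><p/><span/></div></body></html>");
        Node secondSpan = doc.getElementsByTagName("span").item(1);
        // prints /html[1]/body[1]/div[2]/span[2]
        System.out.println(xpathFor(secondSpan));
    }
}
```

With the real class, the equivalent call would be DomUtils.getXPathForNode(node).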

getXPathListForNode

public static String[] getXPathListForNode(Node n)
Returns a list of tag names representing the path from the document root to the given node n.

Parameters:
n - the node for which to retrieve the path.
Returns:
a sequence of HTML tag names.
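
As a rough illustration of the returned path, the sketch below collects tag names while climbing from a node to the root element, using only the JDK DOM API. PathListSketch and pathNames are hypothetical names, and the exact handling of non-element nodes in DomUtils is not documented here, so this version simply stops at the first non-element ancestor.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class PathListSketch {

    // Collects element names from the root down to n (assumed behavior).
    public static String[] pathNames(Node n) {
        Deque<String> names = new ArrayDeque<>();
        for (Node cur = n;
             cur != null && cur.getNodeType() == Node.ELEMENT_NODE;
             cur = cur.getParentNode()) {
            names.addFirst(cur.getNodeName()); // root ends up first
        }
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element div = doc.createElement("div");
        doc.appendChild(html);
        html.appendChild(body);
        body.appendChild(div);
        // prints [html, body, div]
        System.out.println(Arrays.toString(pathNames(div)));
    }
}
```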

getNodeLocation

public static int[] getNodeLocation(Node n)
Returns the row/col location of the given node.

Parameters:
n - input node.
Returns:
an array of four elements of the form [<begin-row>, <begin-col>, <end-row>, <end-col>], or null if such data cannot be extracted.

isAncestorOf

public static boolean isAncestorOf(Node candidateAncestor,
                                   Node candidateSibling,
                                   boolean strict)
Checks whether a node is an ancestor of, or the same as, another node.

Parameters:
candidateAncestor - the candidate ancestor node.
candidateSibling - the candidate sibling node.
strict - if true, the ancestor and the sibling are not allowed to be the same node.
Returns:
true if candidateAncestor is an ancestor of candidateSibling, false otherwise.
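
The effect of the strict flag can be sketched as follows. This is an illustrative re-implementation on the JDK DOM API under the semantics stated above, not the DomUtils source; the class name AncestorSketch is hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class AncestorSketch {

    public static boolean isAncestorOf(Node candidateAncestor,
                                       Node candidateSibling,
                                       boolean strict) {
        // In strict mode start from the parent, so a node never matches itself.
        Node current = strict ? candidateSibling.getParentNode() : candidateSibling;
        while (current != null) {
            if (current.isSameNode(candidateAncestor)) {
                return true;
            }
            current = current.getParentNode();
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element body = doc.createElement("body");
        Element div = doc.createElement("div");
        doc.appendChild(body);
        body.appendChild(div);

        System.out.println(isAncestorOf(body, div, true));  // true
        System.out.println(isAncestorOf(div, div, false));  // true: same node allowed
        System.out.println(isAncestorOf(div, div, true));   // false: strict excludes self
        System.out.println(isAncestorOf(div, body, false)); // false: wrong direction
    }
}
```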

isAncestorOf

public static boolean isAncestorOf(Node candidateAncestor,
                                   Node candidateSibling)
Checks whether a node is an ancestor of, or the same as, another node. Equivalent to isAncestorOf(org.w3c.dom.Node, org.w3c.dom.Node, boolean) with strict=false.

Parameters:
candidateAncestor - the candidate ancestor node.
candidateSibling - the candidate sibling node.
Returns:
true if candidateAncestor is an ancestor of candidateSibling, false otherwise.

findAllByClassName

public static List<Node> findAllByClassName(Node root,
                                            String className)
Finds all nodes that have a declared class. Note that the className is transformed to lower case before being matched against the DOM.

Parameters:
root - the root node from which to start searching.
className - the name of the filtered class.
Returns:
list of matching nodes or an empty list.

findAllByAttributeName

public static List<Node> findAllByAttributeName(Node root,
                                                String attrName)
Finds all nodes that have a declared attribute. Note that the className is transformed to lower case before being matched against the DOM.

Parameters:
root - the root node from which to start searching.
attrName - the name of the filtered attribute.
Returns:
list of matching nodes or an empty list.

findAllByTag

public static List<Node> findAllByTag(Node root,
                                      String tagName)

findAllByTagAndClassName

public static List<Node> findAllByTagAndClassName(Node root,
                                                  String tagName,
                                                  String className)

findNodeById

public static Node findNodeById(Node root,
                                String id)
Mimics the JS DOM API, or prototype's $()


findAll

public static List<Node> findAll(Node node,
                                 String xpath)
Returns a list of all the nodes that match an XPath expression, which must be valid.
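
A node-set query of this kind can be sketched with the standard javax.xml.xpath API. The class FindAllSketch is a hypothetical name, and unlike the signature above this sketch declares the checked XPathExpressionException instead of hiding it; how DomUtils reports an invalid expression is not stated on this page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FindAllSketch {

    // Evaluates the expression as a node-set and copies it into a List<Node>.
    public static List<Node> findAll(Node context, String xpath)
            throws XPathExpressionException {
        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, context, XPathConstants.NODESET);
        List<Node> result = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            result.add(matches.item(i));
        }
        return result;
    }

    // Parses a small XML string into a DOM document for the demo.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<ul><li>a</li><li>b</li><li>c</li></ul>");
        System.out.println(findAll(doc, "//li").size());                       // 3
        System.out.println(findAll(doc, "//li[2]").get(0).getTextContent());   // b
        System.out.println(findAll(doc, "//table").isEmpty());                 // true
    }
}
```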


find

public static String find(Node node,
                          String xpath)
Gets the string value of an XPath expression.
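
For comparison, the string value of an expression can be obtained with the two-argument XPath.evaluate method of the JDK. FindSketch is a hypothetical class name and the sketch makes no claim about how DomUtils.find is implemented; it only shows what "string value" means in XPath 1.0 terms.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FindSketch {

    // Returns the XPath string value: the text of the first matching node,
    // or the empty string when the node-set is empty.
    public static String find(Node context, String xpath)
            throws XPathExpressionException {
        return XPathFactory.newInstance().newXPath().evaluate(xpath, context);
    }

    // Parses a small XML string into a DOM document for the demo.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<p>Hello <b>World</b></p>");
        System.out.println(find(doc, "//b"));               // World
        System.out.println(find(doc, "string(//p)"));       // Hello World
        System.out.println(find(doc, "//missing").isEmpty()); // true
    }
}
```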


hasClassName

public static boolean hasClassName(Node node,
                                   String className)
Tells whether an element has a class name, without checking its parents in the hierarchy; mimics the CSS .foo match.


hasAttribute

public static boolean hasAttribute(Node node,
                                   String attributeName,
                                   String className)
Checks the presence of an attribute value in attributes that contain whitespace-separated lists of values. The semantics are those of CSS classes: "foo" matches "bar foo" and "foo", but not "foob".
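
The whole-token matching described above can be sketched as follows. TokenMatchSketch and hasToken are hypothetical names built on the JDK DOM API; the sketch assumes tokens are split on any run of whitespace and compared case-sensitively, which this page does not confirm for DomUtils.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class TokenMatchSketch {

    // True only if the attribute contains the token as a whole
    // whitespace-separated entry, never as a substring of another entry.
    public static boolean hasToken(Node node, String attributeName, String token) {
        if (node.getNodeType() != Node.ELEMENT_NODE || token.isEmpty()) {
            return false;
        }
        // getAttribute returns "" when the attribute is absent.
        String value = ((Element) node).getAttribute(attributeName);
        for (String candidate : value.trim().split("\\s+")) {
            if (candidate.equals(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element div = doc.createElement("div");

        div.setAttribute("class", "bar foo");
        System.out.println(hasToken(div, "class", "foo")); // true

        div.setAttribute("class", "foob");
        System.out.println(hasToken(div, "class", "foo")); // false
    }
}
```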


hasAttribute

public static boolean hasAttribute(Node node,
                                   String attributeName)
Checks the presence of an attribute in the given node.

Parameters:
node - the node container.
attributeName - the name of the attribute.

isElementNode

public static boolean isElementNode(Node target)
Verifies if the given target node is an element.

Parameters:
target - the node to check.
Returns:
true if the node is an element, false otherwise.

readAttribute

public static String readAttribute(Node node,
                                   String attribute,
                                   String defaultValue)
Reads the value of the specified attribute, returning the defaultValue string if not present.

Parameters:
node - node to read the attribute.
attribute - attribute name.
defaultValue - the default value to return if attribute is not found.
Returns:
the attribute value or defaultValue if not found.
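
The default-value fallback can be illustrated with the JDK DOM API alone. ReadAttributeSketch is a hypothetical class, and the sketch assumes that nodes without an attribute map (for example text nodes) also yield the default.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class ReadAttributeSketch {

    public static String readAttribute(Node node, String attribute, String defaultValue) {
        // getAttributes() is null for every node type except Element.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return defaultValue;
        }
        Node attr = attributes.getNamedItem(attribute);
        return attr == null ? defaultValue : attr.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element a = doc.createElement("a");
        a.setAttribute("href", "http://example.org/");

        System.out.println(readAttribute(a, "href", "none"));  // http://example.org/
        System.out.println(readAttribute(a, "title", "none")); // none
    }
}
```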

readAttributeWithPrefix

public static String readAttributeWithPrefix(Node node,
                                             String attributePrefix,
                                             String defaultValue)
Reads the value of the first attribute whose name matches the specified attributePrefix. Returns the defaultValue if not found.

Parameters:
node - node to look for attributes.
attributePrefix - attribute prefix.
defaultValue - default returned value.
Returns:
the value found or default.
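
A prefix lookup along these lines might look like the sketch below. It assumes that "matches" means the attribute name starts with the prefix, which this page does not spell out; PrefixSketch is a hypothetical name. Note that the iteration order of a NamedNodeMap is not defined by the DOM specification, so "first" depends on the parser when several attributes share the prefix.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class PrefixSketch {

    public static String readAttributeWithPrefix(Node node,
                                                 String attributePrefix,
                                                 String defaultValue) {
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return defaultValue; // not an element
        }
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attr = attributes.item(i);
            // Assumption: a match is a plain startsWith on the attribute name.
            if (attr.getNodeName().startsWith(attributePrefix)) {
                return attr.getNodeValue();
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element div = doc.createElement("div");
        div.setAttribute("class", "c");
        div.setAttribute("data-id", "42");

        System.out.println(readAttributeWithPrefix(div, "data-", "none")); // 42
        System.out.println(readAttributeWithPrefix(div, "aria-", "none")); // none
    }
}
```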

readAttribute

public static String readAttribute(Node node,
                                   String attribute)
Reads the value of an attribute, returning the empty string if not present.

Parameters:
node - node to read the attribute.
attribute - attribute name.
Returns:
the attribute value or "" if not found.

serializeToXML

public static String serializeToXML(Node node,
                                    boolean indent)
                             throws TransformerException,
                                    IOException
Given a DOM Node produces the XML serialization omitting the XML declaration.

Parameters:
node - node to be serialized.
indent - if true the output is indented.
Returns:
the XML serialization.
Throws:
TransformerException - if an error occurs during the serializer initialization and activation.
IOException
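
Serialization without the XML declaration can be sketched with the standard javax.xml.transform API. This is an assumption-labeled illustration, not the DomUtils source: SerializeSketch is a hypothetical class, and the exact whitespace produced when indent is true depends on the Transformer implementation in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SerializeSketch {

    public static String serializeToXML(Node node, boolean indent)
            throws TransformerException {
        // An identity transformer copies the DOM source to the output unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Parses a small XML string into a DOM document for the demo.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a><b>x</b></a>");
        String xml = serializeToXML(doc.getDocumentElement(), false);
        System.out.println(xml);
        System.out.println(xml.startsWith("<?xml")); // false: declaration omitted
    }
}
```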


Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.