org.apache.any23.extractor.html
Class HTMLDocument

java.lang.Object
  extended by org.apache.any23.extractor.html.HTMLDocument

public class HTMLDocument
extends Object

A wrapper around the DOM representation of an HTML document. Provides convenience access to various parts of the document.

Author:
Gabriele Renzi, Michele Mostarda

Nested Class Summary
static class HTMLDocument.TextField
          This class represents a text extracted from the HTML DOM, related to the node from which that text was retrieved.
 
Constructor Summary
HTMLDocument(Node document)
          Constructor accepting the root node.
 
Method Summary
static String extractRelTag(NamedNodeMap attributes)
          Extracts the href-specific rel-tag string.
static String extractRelTag(String hrefAttributeContent)
          Extracts the href-specific rel-tag string.
 HTMLDocument.TextField[] extractRelTagNodes()
          Extracts all the rel tag nodes.
 String find(String xpath)
           
 List<Node> findAll(String xpath)
           
 List<Node> findAllByClassName(String clazz)
          Finds all the nodes by class name.
 Node findMicroformattedObjectNode(String objectTag, String name)
           
 String findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)
           
 Node findNodeById(String id)
           
 String getDefaultLanguage()
          Returns the document default language.
 Node getDocument()
           
 String[] getPathToLocalRoot()
          Returns the sequence of ancestors from the document root to the local root (document).
 HTMLDocument.TextField[] getPluralTextField(String className)
          Returns a plural text field.
 HTMLDocument.TextField[] getPluralUrlField(String className)
          Returns the list of URLs associated with the fields marked with class className.
 HTMLDocument.TextField getSingularTextField(String className)
          Returns a singular text field.
 HTMLDocument.TextField getSingularUrlField(String className)
          Returns the URL associated with the field marked with class className.
 String getText()
          Returns the text contained inside the node if it is a leaf, null otherwise.
 String readAttribute(String attribute)
          Reads an attribute while avoiding NullPointerExceptions; if the attribute is missing, it returns an empty string.
static String readNodeContent(Node node, boolean prettify)
          Reads the text content of the given node and returns it.
static HTMLDocument.TextField readTextField(Node node)
          Reads a text field from the given node.
static void readUrlField(List<HTMLDocument.TextField> res, Node node)
          Reads a URL field from the given node, adding the content to the given res list.
 org.openrdf.model.URI resolveURI(String uri)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HTMLDocument

public HTMLDocument(Node document)
Constructor accepting the root node.

Parameters:
document - the root node of the document.
Method Detail

readTextField

public static HTMLDocument.TextField readTextField(Node node)
Reads a text field from the given node.

Parameters:
node - the node from which to read the content.
Returns:
a valid TextField

readUrlField

public static void readUrlField(List<HTMLDocument.TextField> res,
                                Node node)
Reads a URL field from the given node, adding the content to the given res list.

Parameters:
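res - the list to which the content read from the node is added.
node - the node from which to read the URL field.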

extractRelTag

public static String extractRelTag(String hrefAttributeContent)
Extracts the href-specific rel-tag string. See the rel-tag specification.

Parameters:
hrefAttributeContent - the content of the href attribute.
Returns:
the extracted rel-tag string.
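
As a rough illustration of what extracting a rel-tag from an href can look like, the sketch below takes the last non-empty path segment of the URL, which is where the rel-tag microformat places the tag. The class RelTagSketch and the method lastPathSegment are hypothetical names used only here; this is not the Any23 implementation, and treating "+" as a space is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; not part of Any23.
final class RelTagSketch {

    private RelTagSketch() {}

    /** Returns the decoded last non-empty path segment of href, or null if there is none. */
    static String lastPathSegment(String href) {
        if (href == null) {
            return null;
        }
        String path;
        try {
            // getRawPath() excludes the query and fragment parts.
            path = new URI(href.trim()).getRawPath();
        } catch (URISyntaxException e) {
            return null; // not a parseable URI
        }
        if (path == null) {
            return null; // opaque URI such as "mailto:..."
        }
        // Strip trailing slashes so ".../tag/java/" still yields "java".
        while (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        String segment = path.substring(path.lastIndexOf('/') + 1);
        if (segment.isEmpty()) {
            return null;
        }
        try {
            // Percent-decoding; URLDecoder also maps "+" to a space (assumed acceptable here).
            return URLDecoder.decode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

With this sketch, lastPathSegment("http://example.org/tags/semantic+web/") yields "semantic web", and an href without a path yields null.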

extractRelTag

public static String extractRelTag(NamedNodeMap attributes)
Extracts the href-specific rel-tag string. See the rel-tag specification.

Parameters:
attributes - the list of attributes of a node.
Returns:
the extracted rel-tag string.

readNodeContent

public static String readNodeContent(Node node,
                                     boolean prettify)
Reads the text content of the given node and returns it. If the prettify flag is true the text is cleaned up.

Parameters:
node - the node whose content is read.
prettify - if true, blank characters are removed.
Returns:
the read text.
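
The effect of the prettify flag can be pictured with a small standalone sketch based on the JDK DOM API. It assumes that "cleaning up" means collapsing runs of whitespace and trimming the ends; the actual Any23 behaviour may differ. NodeContentSketch, readContent and parse are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Any23.
final class NodeContentSketch {

    private NodeContentSketch() {}

    /** Reads the text content of a node, optionally normalizing whitespace. */
    static String readContent(Node node, boolean prettify) {
        String text = node.getTextContent();
        if (text == null) {
            return ""; // e.g. Document nodes have no text content
        }
        // Assumption: prettifying collapses whitespace runs and trims the result.
        return prettify ? text.replaceAll("\\s+", " ").trim() : text;
    }

    /** Parses a well-formed XML string into a DOM document (test convenience). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```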

resolveURI

public org.openrdf.model.URI resolveURI(String uri)
                                 throws ExtractionException
Returns:
an absolute URI, or null if the URI cannot be fixed.
Throws:
ExtractionException - If the base URI is invalid

find

public String find(String xpath)

findNodeById

public Node findNodeById(String id)

findAll

public List<Node> findAll(String xpath)
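
The find and findAll signatures suggest XPath evaluation against the wrapped DOM, although this page does not describe them. Under that assumption, a minimal standalone sketch with the JDK's javax.xml.xpath API could look as follows; XPathSketch and its methods are hypothetical and take the root node explicitly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Any23.
final class XPathSketch {

    private XPathSketch() {}

    /** Evaluates the expression and returns its string value. */
    static String find(Node root, String xpath) {
        try {
            XPath evaluator = XPathFactory.newInstance().newXPath();
            return evaluator.evaluate(xpath, root);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath: " + xpath, e);
        }
    }

    /** Evaluates the expression and returns all matching nodes in document order. */
    static List<Node> findAll(Node root, String xpath) {
        try {
            XPath evaluator = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) evaluator.evaluate(xpath, root, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>(hits.getLength());
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath: " + xpath, e);
        }
    }

    /** Parses a well-formed XML string into a DOM document (test convenience). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```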

findMicroformattedValue

public String findMicroformattedValue(String objectTag,
                                      String object,
                                      String fieldTag,
                                      String field,
                                      String key)

getDocument

public Node getDocument()

getSingularTextField

public HTMLDocument.TextField getSingularTextField(String className)
Returns a singular text field.

Parameters:
className - name of class containing text.
Returns:
the first value if multiple values are found; to check that there are no n-ary values, use the plural finder (getPluralTextField).

getPluralTextField

public HTMLDocument.TextField[] getPluralTextField(String className)
Returns a plural text field.

Parameters:
className - name of the node class containing the text.
Returns:
list of fields.

getSingularUrlField

public HTMLDocument.TextField getSingularUrlField(String className)
Returns the URL associated with the field marked with class className.

Parameters:
className - name of node class containing the URL field.
Returns:
the first value if multiple values are found; to check that there are no n-ary values, use the plural finder (getPluralUrlField).

getPluralUrlField

public HTMLDocument.TextField[] getPluralUrlField(String className)
Returns the list of URLs associated with the fields marked with class className.

Parameters:
className - name of node class containing the URL field.
Returns:
the list of HTMLDocument.TextField found.

findMicroformattedObjectNode

public Node findMicroformattedObjectNode(String objectTag,
                                         String name)

readAttribute

public String readAttribute(String attribute)
Reads an attribute while avoiding NullPointerExceptions; if the attribute is missing, it returns an empty string.

Parameters:
attribute - the attribute name.
Returns:
the string representing the attribute.
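
The null-safe read described above can be sketched with the plain DOM API. The helper below is a hypothetical standalone version that takes the node explicitly; it is not the Any23 code, only an illustration of returning an empty string where a direct lookup would risk a NullPointerException.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Any23.
final class AttributeSketch {

    private AttributeSketch() {}

    /** Returns the attribute value, or an empty string if it cannot be read. */
    static String readAttribute(Node node, String name) {
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return ""; // only element nodes carry attributes
        }
        Node attribute = attributes.getNamedItem(name);
        return attribute == null ? "" : attribute.getNodeValue();
    }

    /** Parses a well-formed XML string into a DOM document (test convenience). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```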

findAllByClassName

public List<Node> findAllByClassName(String clazz)
Finds all the nodes by class name.

Parameters:
clazz - the class name.
Returns:
list of matching nodes.
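
One way to picture a lookup by class name is a depth-first walk that compares the requested name against the whitespace-separated tokens of each element's class attribute. The sketch below follows that approach with hypothetical names; whether Any23 uses a DOM walk or an XPath query is not stated on this page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Any23.
final class ClassNameSketch {

    private ClassNameSketch() {}

    /** Returns, in document order, all elements whose class attribute contains className as a token. */
    static List<Node> findAllByClassName(Node root, String className) {
        List<Node> result = new ArrayList<>();
        collect(root, className, result);
        return result;
    }

    private static void collect(Node node, String className, List<Node> out) {
        if (node instanceof Element) {
            // getAttribute returns "" when the attribute is absent.
            String classes = ((Element) node).getAttribute("class").trim();
            for (String token : classes.split("\\s+")) {
                if (!token.isEmpty() && token.equals(className)) {
                    out.add(node);
                    break;
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, className, out);
        }
    }

    /** Parses a well-formed XML string into a DOM document (test convenience). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

Token matching matters for microformats: an element with class "fn org" matches both "fn" and "org", while class "fnx" does not match "fn".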

getText

public String getText()
Returns the text contained inside the node if it is a leaf, null otherwise.

Returns:
the text of a leaf node.

getDefaultLanguage

public String getDefaultLanguage()
Returns the document default language.

Returns:
default language if any, null otherwise.

getPathToLocalRoot

public String[] getPathToLocalRoot()
Returns the sequence of ancestors from the document root to the local root (document).

Returns:
a sequence of node names.
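
A path of ancestor names can be built by walking parent links and prepending each name. The sketch below makes two assumptions that this page does not confirm: the local root itself is excluded, and the enclosing Document node is not listed. PathSketch and pathTo are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Any23.
final class PathSketch {

    private PathSketch() {}

    /** Returns the names of the ancestors of localRoot, outermost first. */
    static String[] pathTo(Node localRoot) {
        Deque<String> names = new ArrayDeque<>();
        Node current = localRoot.getParentNode();
        // Stop at the Document node, which is not reported (assumption).
        while (current != null && current.getNodeType() != Node.DOCUMENT_NODE) {
            names.addFirst(current.getNodeName());
            current = current.getParentNode();
        }
        return names.toArray(new String[0]);
    }

    /** Parses a well-formed XML string into a DOM document (test convenience). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```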

extractRelTagNodes

public HTMLDocument.TextField[] extractRelTagNodes()
Extracts all the rel tag nodes.

Returns:
list of rel tag nodes.


Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.