HTMLDocument (Apache Any23 :: Core 0.7.0-incubating-SNAPSHOT API)

org.apache.any23.extractor.html
Class HTMLDocument

java.lang.Object
  org.apache.any23.extractor.html.HTMLDocument

public class HTMLDocument
extends Object

A wrapper around the DOM representation of an HTML document. Provides convenience access to various parts of the document.

Author:: Gabriele Renzi, Michele Mostarda

Nested Class Summary
`static class`	`HTMLDocument.TextField` Represents text extracted from the HTML DOM, together with the node from which that text was retrieved.

Constructor Summary
`HTMLDocument(Node document)` Constructor accepting the root node.

Method Summary
`static String`	`extractRelTag(NamedNodeMap attributes)` Extracts the rel-tag string from the href attribute found in the given attributes.
`static String`	`extractRelTag(String hrefAttributeContent)` Extracts the rel-tag string from the content of an href attribute.
`HTMLDocument.TextField[]`	`extractRelTagNodes()` Extracts all the `rel` tag nodes.
`String`	`find(String xpath)`
`List<Node>`	`findAll(String xpath)`
`List<Node>`	`findAllByClassName(String clazz)` Finds all the nodes by class name.
`Node`	`findMicroformattedObjectNode(String objectTag, String name)`
`String`	`findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)`
`Node`	`findNodeById(String id)`
`String`	`getDefaultLanguage()` Returns the document default language.
`Node`	`getDocument()`
`String[]`	`getPathToLocalRoot()` Returns the sequence of ancestors from the document root to the local root (document).
`HTMLDocument.TextField[]`	`getPluralTextField(String className)` Returns a plural text field.
`HTMLDocument.TextField[]`	`getPluralUrlField(String className)` Returns the list of URLs associated with the fields marked with class className.
`HTMLDocument.TextField`	`getSingularTextField(String className)` Returns a singular text field.
`HTMLDocument.TextField`	`getSingularUrlField(String className)` Returns the URL associated with the field marked with class className.
`String`	`getText()` Returns the text contained inside a node if leaf, `null` otherwise.
`String`	`readAttribute(String attribute)` Reads an attribute without throwing a NullPointerException; if the attribute is missing, returns an empty string.
`static String`	`readNodeContent(Node node, boolean prettify)` Reads the text content of the given node and returns it.
`static HTMLDocument.TextField`	`readTextField(Node node)` Reads a text field from the given node and returns it.
`static void`	`readUrlField(List<HTMLDocument.TextField> res, Node node)` Reads a URL field from the given node, adding the content to the given res list.
`org.openrdf.model.URI`	`resolveURI(String uri)`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

HTMLDocument

public HTMLDocument(Node document)

Constructor accepting the root node.

Parameters:: document - the root node of the document.

Method Detail

readTextField

public static HTMLDocument.TextField readTextField(Node node)

Reads a text field from the given node and returns it.

Parameters:: node - the node from which to read the content.
Returns:: a valid TextField

readUrlField

public static void readUrlField(List<HTMLDocument.TextField> res,
                                Node node)

Reads a URL field from the given node, adding the content to the given res list.

Parameters:: res - the list to which the read field is added.; node - the node from which to read the URL field.

extractRelTag

public static String extractRelTag(String hrefAttributeContent)

Extracts the rel-tag string from the content of an href attribute. See the rel-tag specification.

Parameters:: hrefAttributeContent - the content of the href attribute.
Returns:: the rel-tag string.
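In the rel-tag microformat, the tag is the last path segment of the link's href. The rule can be sketched in plain Java as below. The class and method names (`RelTagSketch`, `lastPathSegment`) are hypothetical, and this is an illustration under that assumption, not necessarily how `extractRelTag` is implemented in Any23.

```java
final class RelTagSketch {

    // Hypothetical helper (not part of Any23): returns the last '/'-separated
    // segment of an href, ignoring any query string, fragment, or trailing '/'.
    static String lastPathSegment(String href) {
        // Keep only the part before the first '?' or '#'.
        String path = href.split("[#?]", 2)[0];
        // String.split drops trailing empty strings, so a trailing '/' is ignored.
        String[] parts = path.split("/");
        return parts.length == 0 ? "" : parts[parts.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastPathSegment("http://example.org/tags/java?page=2"));
    }
}
```

The sketch does not percent-decode the segment; a real extractor may need to.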

extractRelTag

public static String extractRelTag(NamedNodeMap attributes)

Extracts the rel-tag string from the href attribute found in the given attributes. See the rel-tag specification.

Parameters:: attributes - the list of attributes of a node.
Returns:: the rel-tag string.

readNodeContent

public static String readNodeContent(Node node,
                                     boolean prettify)

Reads the text content of the given node and returns it. If the prettify flag is true, the text is cleaned up.

Parameters:: node - the node whose content is read.; prettify - if true, blank characters will be removed.
Returns:: the read text.
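The page does not say exactly what the clean-up does. One plausible reading, shown here as a sketch only, is to trim the text and collapse runs of whitespace. `NodeContentSketch`, `readContent`, and `parseRoot` are hypothetical names built on the standard `org.w3c.dom` API; the real `readNodeContent` may behave differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeContentSketch {

    // Hypothetical stand-in for a "prettify" step (an assumption, not Any23 code):
    // trims the text and collapses each run of whitespace into a single space.
    static String readContent(Node node, boolean prettify) {
        String text = node.getTextContent();
        if (text == null) {
            return "";
        }
        return prettify ? text.trim().replaceAll("\\s+", " ") : text;
    }

    // Parses a small well-formed snippet and returns its root element.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse snippet", e);
        }
    }

    public static void main(String[] args) {
        Node p = parseRoot("<p>  Hello \n  <b>world</b> </p>");
        System.out.println("[" + readContent(p, true) + "]");
    }
}
```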

resolveURI

public org.openrdf.model.URI resolveURI(String uri)
                                 throws ExtractionException

Returns:: An absolute URI, or null if the URI is not fixable
Throws:: ExtractionException - If the base URI is invalid
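The contract above (absolute URI on success, null for an unusable reference, an exception for a bad base) can be illustrated with `java.net.URI` alone. This is a sketch under assumptions: `ResolveSketch` and `resolve` are hypothetical names, it returns `java.net.URI` rather than `org.openrdf.model.URI`, it signals a bad base with `IllegalArgumentException` instead of `ExtractionException`, and it makes no attempt to repair malformed references.

```java
import java.net.URI;

final class ResolveSketch {

    // Hypothetical helper (not Any23 code): resolves a reference against a base URI.
    // Returns null when the reference cannot be parsed as a URI.
    static URI resolve(String base, String reference) {
        // URI.create throws IllegalArgumentException if the base is malformed.
        URI baseUri = URI.create(base);
        try {
            return baseUri.resolve(reference);
        } catch (IllegalArgumentException e) {
            // The reference is not a syntactically valid URI.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.org/a/b.html", "../c"));
    }
}
```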

find

public String find(String xpath)

findNodeById

public Node findNodeById(String id)

findAll

public List<Node> findAll(String xpath)

findMicroformattedValue

public String findMicroformattedValue(String objectTag,
                                      String object,
                                      String fieldTag,
                                      String field,
                                      String key)

getDocument

public Node getDocument()

getSingularTextField

public HTMLDocument.TextField getSingularTextField(String className)

Returns a singular text field.

Parameters:: className - name of class containing text.
Returns:: the first value if multiple values are found. To check that there are no n-ary values, use the plural finder.

getPluralTextField

public HTMLDocument.TextField[] getPluralTextField(String className)

Returns a plural text field.

Parameters:: className - name of class node containing text.
Returns:: list of fields.

getSingularUrlField

public HTMLDocument.TextField getSingularUrlField(String className)

Returns the URL associated with the field marked with class className.

Parameters:: className - name of node class containing the URL field.
Returns:: the first value if multiple values are found. To check that there are no n-ary values, use the plural finder.

getPluralUrlField

public HTMLDocument.TextField[] getPluralUrlField(String className)

Returns the list of URLs associated with the fields marked with class className.

Parameters:: className - name of node class containing the URL field.
Returns:: the list of HTMLDocument.TextField found.

findMicroformattedObjectNode

public Node findMicroformattedObjectNode(String objectTag,
                                         String name)

readAttribute

public String readAttribute(String attribute)

Reads an attribute without throwing a NullPointerException; if the attribute is missing, returns an empty string.

Parameters:: attribute - the attribute name.
Returns:: the string representing the attribute.
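A null-safe attribute read of this kind might look like the following sketch against the standard `org.w3c.dom` API. `AttributeSketch` and its methods are hypothetical; unlike the instance method documented here, the sketch takes the node as an explicit argument.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AttributeSketch {

    // Hypothetical helper (not Any23 code): returns the attribute value,
    // or an empty string if the node has no such attribute.
    static String readAttribute(Node node, String name) {
        // getAttributes() returns null for non-element nodes such as text nodes.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return "";
        }
        Node attribute = attributes.getNamedItem(name);
        return attribute == null ? "" : attribute.getNodeValue();
    }

    // Parses a small well-formed snippet and returns its root element.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse snippet", e);
        }
    }

    public static void main(String[] args) {
        Node link = parseRoot("<a href='x.html'>text</a>");
        System.out.println(readAttribute(link, "href"));
        System.out.println(readAttribute(link, "rel").isEmpty());
    }
}
```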

findAllByClassName

public List<Node> findAllByClassName(String clazz)

Finds all the nodes by class name.

Parameters:: clazz - the class name.
Returns:: list of matching nodes.
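A class-name lookup is commonly expressed as an XPath query over the `class` attribute. The sketch below uses the standard `javax.xml.xpath` API; `ClassNameSketch` and its methods are hypothetical, and whether Any23 matches whole class tokens in this way is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ClassNameSketch {

    // Hypothetical helper (not Any23 code): finds every element in the document
    // whose class attribute contains className as a whole token.
    // className must not contain a single quote, since it is pasted into the query.
    static List<Node> findAllByClassName(Node root, String className) {
        // Padding with spaces makes "fn" match class="fn org" but not class="fname".
        String expression = "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                + className + " ')]";
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, root, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("cannot evaluate XPath", e);
        }
    }

    // Parses a small well-formed snippet and returns its root element.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse snippet", e);
        }
    }

    public static void main(String[] args) {
        Node root = parseRoot(
                "<div><span class='fn org'>A</span><span class='fname'>B</span><p class='fn'>C</p></div>");
        System.out.println(findAllByClassName(root, "fn").size());
    }
}
```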

getText

public String getText()

Returns the text contained inside a node if leaf, null otherwise.

Returns:: the text of a leaf node.

getDefaultLanguage

public String getDefaultLanguage()

Returns the document default language.

Returns:: default language if any, null otherwise.

getPathToLocalRoot

public String[] getPathToLocalRoot()

Returns the sequence of ancestors from the document root to the local root (document).

Returns:: a sequence of node names.
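One way to build such a sequence is to walk the parent links upward and collect node names in root-first order. The sketch below assumes the path excludes both the starting node and the DOM document node; the page does not state this, and `PathSketch` and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class PathSketch {

    // Hypothetical helper (not Any23 code): names of the element ancestors of node,
    // ordered from the outermost element down to the direct parent.
    static String[] pathFromRoot(Node node) {
        Deque<String> names = new ArrayDeque<>();
        for (Node parent = node.getParentNode();
             parent != null && parent.getNodeType() != Node.DOCUMENT_NODE;
             parent = parent.getParentNode()) {
            // Prepend, so the outermost ancestor ends up first.
            names.addFirst(parent.getNodeName());
        }
        return names.toArray(new String[0]);
    }

    // Parses a small well-formed snippet and returns its root element.
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse snippet", e);
        }
    }

    public static void main(String[] args) {
        Node div = parseRoot("<html><body><div/></body></html>")
                .getFirstChild().getFirstChild();
        System.out.println(String.join("/", pathFromRoot(div)));
    }
}
```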

extractRelTagNodes

public HTMLDocument.TextField[] extractRelTagNodes()

Extracts all the rel tag nodes.

Returns:: list of rel tag nodes.
