java.lang.Object org.apache.any23.extractor.html.HTMLDocument
public class HTMLDocument
A wrapper around the DOM representation of an HTML document. Provides convenience access to various parts of the document.
Nested Class Summary | |
---|---|
static class | HTMLDocument.TextField: Represents text extracted from the HTML DOM, together with the node from which that text was retrieved. |
Constructor Summary | |
---|---|
HTMLDocument(Node document) | Constructor accepting the root node. |
Method Summary | |
---|---|
static String | extractRelTag(NamedNodeMap attributes): Extracts the href-specific rel-tag string. |
static String | extractRelTag(String hrefAttributeContent): Extracts the href-specific rel-tag string. |
HTMLDocument.TextField[] | extractRelTagNodes(): Extracts all the rel tag nodes. |
String | find(String xpath) |
List<Node> | findAll(String xpath) |
List<Node> | findAllByClassName(String clazz): Finds all the nodes by class name. |
Node | findMicroformattedObjectNode(String objectTag, String name) |
String | findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key) |
Node | findNodeById(String id) |
String | getDefaultLanguage(): Returns the document default language. |
Node | getDocument() |
String[] | getPathToLocalRoot(): Returns the sequence of ancestors from the document root to the local root (document). |
HTMLDocument.TextField[] | getPluralTextField(String className): Returns a plural text field. |
HTMLDocument.TextField[] | getPluralUrlField(String className): Returns the list of URLs associated with the fields marked with class className. |
HTMLDocument.TextField | getSingularTextField(String className): Returns a singular text field. |
HTMLDocument.TextField | getSingularUrlField(String className): Returns the URL associated with the field marked with class className. |
String | getText(): Returns the text contained inside a node if it is a leaf, null otherwise. |
String | readAttribute(String attribute): Reads an attribute without risking a NullPointerException; if the attribute is missing, returns an empty string. |
static String | readNodeContent(Node node, boolean prettify): Reads the text content of the given node and returns it. |
static HTMLDocument.TextField | readTextField(Node node): Reads a text field from the given node and returns it. |
static void | readUrlField(List<HTMLDocument.TextField> res, Node node): Reads a URL field from the given node, adding the content to the given res list. |
org.openrdf.model.URI | resolveURI(String uri) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public HTMLDocument(Node document)
document - the root node.
Method Detail |
---|
public static HTMLDocument.TextField readTextField(Node node)
node - the node from which to read the content.
public static void readUrlField(List<HTMLDocument.TextField> res, Node node)
res - the list to which the content is added.
node - the node from which to read the URL field.
public static String extractRelTag(String hrefAttributeContent)
hrefAttributeContent - the content of the href attribute.
public static String extractRelTag(NamedNodeMap attributes)
attributes - the list of attributes of a node.
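As a rough illustration of what extracting a rel-tag string from an href can look like, the sketch below follows the rel-tag microformat convention that the tag is the last path segment of the link target. The class name RelTagSketch is hypothetical, and the logic is an assumption based on that convention, not the Any23 implementation.

```java
/** Hypothetical helper; a sketch of rel-tag extraction, not the Any23 code. */
final class RelTagSketch {
    private RelTagSketch() {}

    /** Returns the last path segment of an href, ignoring any query string or fragment. */
    static String extractRelTag(String href) {
        // Keep only the part before the first '#' or '?'.
        String path = href.split("[#?]", 2)[0];
        // String.split drops trailing empty strings, so a trailing '/' is ignored.
        String[] segments = path.split("/");
        return segments.length == 0 ? "" : segments[segments.length - 1];
    }
}
```

Under these assumptions, an href such as `http://example.org/tags/java?x=1` would yield the tag `java`.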
public static String readNodeContent(Node node, boolean prettify)
Reads the text content of the given node and returns it. If the prettify flag is true, the text is cleaned up.
node - the node whose content is read.
prettify - if true, blank characters will be removed.
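The effect of the prettify flag can be pictured with the standard DOM API alone. The sketch below assumes that "cleaning up" means collapsing runs of whitespace and trimming the ends; the class NodeContentSketch and its parse helper are hypothetical, and the actual Any23 behavior may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

/** Hypothetical helper; a sketch only, not the Any23 code. */
final class NodeContentSketch {
    private NodeContentSketch() {}

    /** Parses well-formed markup and returns its root element. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    /** Returns the node's text; when prettify is true, whitespace is collapsed and trimmed. */
    static String readNodeContent(Node node, boolean prettify) {
        String content = node.getTextContent();
        if (content == null) {
            return ""; // Assumption: treat nodes without text content as empty.
        }
        return prettify ? content.replaceAll("\\s+", " ").trim() : content;
    }
}
```

With prettify disabled, the text is returned exactly as stored in the DOM, including indentation and line breaks.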
public org.openrdf.model.URI resolveURI(String uri) throws ExtractionException
ExtractionException - if the base URI is invalid.
public String find(String xpath)
public Node findNodeById(String id)
public List<Node> findAll(String xpath)
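The find and findAll methods carry no description here, so the following is only a guess at their shape: a minimal sketch that evaluates an XPath expression against a context node with the JDK's javax.xml.xpath API. The class XPathLookupSketch, the explicit context parameter, and the choice to return the string value for find are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper; a sketch only, not the Any23 code. */
final class XPathLookupSketch {
    private XPathLookupSketch() {}

    /** Parses well-formed markup and returns its root element. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    /** Returns every node matched by the expression, in document order. */
    static List<Node> findAll(Node context, String xpath) {
        try {
            NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, context, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>(matches.getLength());
            for (int i = 0; i < matches.getLength(); i++) {
                result.add(matches.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XPath: " + xpath, e);
        }
    }

    /** Returns the string value of the expression (empty when nothing matches). */
    static String find(Node context, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, context);
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XPath: " + xpath, e);
        }
    }
}
```

Wrapping the checked XPath exceptions in an unchecked one keeps call sites short; whether the real methods do the same is not stated in this page.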
public String findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)
public Node getDocument()
public HTMLDocument.TextField getSingularTextField(String className)
className - name of the class containing the text.
public HTMLDocument.TextField[] getPluralTextField(String className)
className - name of the class of the nodes containing the text.
public HTMLDocument.TextField getSingularUrlField(String className)
className - name of the node class containing the URL field.
public HTMLDocument.TextField[] getPluralUrlField(String className)
className - name of the node class containing the URL field.
Returns: the HTMLDocument.TextField entries found.
public Node findMicroformattedObjectNode(String objectTag, String name)
public String readAttribute(String attribute)
attribute - the attribute name.
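The null-safe contract described for readAttribute (an empty string instead of a NullPointerException) can be sketched with the standard DOM API. AttributeSketch and its explicit node parameter are hypothetical; the real method reads from the wrapped document node.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

/** Hypothetical helper; a sketch only, not the Any23 code. */
final class AttributeSketch {
    private AttributeSketch() {}

    /** Parses well-formed markup and returns its root element. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    /** Returns the attribute value, or an empty string when it cannot be read. */
    static String readAttribute(Node node, String attribute) {
        // getAttributes() returns null for non-element nodes such as text nodes.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return "";
        }
        // getNamedItem() returns null when the attribute is absent.
        Node item = attributes.getNamedItem(attribute);
        return item == null ? "" : item.getNodeValue();
    }
}
```

Returning an empty string lets callers chain string operations without null checks, at the cost of not distinguishing a missing attribute from an empty one.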
public List<Node> findAllByClassName(String clazz)
clazz - the class name.
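One plausible way to find nodes by class name is a recursive DOM walk that compares each whitespace-separated token of the class attribute. The sketch below is an assumption about the matching rule (exact token match, root included); ClassNameSketch and its explicit root parameter are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper; a sketch only, not the Any23 code. */
final class ClassNameSketch {
    private ClassNameSketch() {}

    /** Parses well-formed markup and returns its root element. */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    /** Returns, in document order, every element whose class attribute contains the token. */
    static List<Node> findAllByClassName(Node root, String clazz) {
        List<Node> result = new ArrayList<>();
        collect(root, clazz, result);
        return result;
    }

    private static void collect(Node node, String clazz, List<Node> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // Element.getAttribute returns "" when the attribute is absent.
            String classes = ((Element) node).getAttribute("class").trim();
            if (Arrays.asList(classes.split("\\s+")).contains(clazz)) {
                out.add(node);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, clazz, out);
        }
    }
}
```

Matching on whole tokens means that a lookup for `fn` does not pick up an element whose class is `fnx`, which is the behavior microformat parsing typically needs.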
public String getText()
Returns the text contained inside the node if it is a leaf, null otherwise.
public String getDefaultLanguage()
Returns the document default language if any, null otherwise.
public String[] getPathToLocalRoot()
public HTMLDocument.TextField[] extractRelTagNodes()
Extracts all the rel tag nodes.