org.apache.any23.extractor.html
Class TagSoupParser
java.lang.Object
org.apache.any23.extractor.html.TagSoupParser
public class TagSoupParser
- extends Object
Parses an InputStream
into an HTML DOM tree using a TagSoup parser.
Note: The resulting DOM tree is not namespace-aware. All element names
are upper case, while attribute names are lower case. This is because the
NekoHTML-based TagSoup parser
uses the Xerces HTML DOM implementation by default,
which doesn't support namespaces and forces upper-case element names. This works
with the RDFa XSLT Converter and with XPath, so we left it this way.
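The practical consequence of the note above is that XPath expressions run against the resulting DOM need upper-case element names and lower-case attribute names, because XPath name tests are case sensitive. The sketch below illustrates this with the JDK's own DOM and XPath APIs only. It does not call TagSoupParser: the DOM is built by hand as a stand-in with the same naming convention, and the class name UpperCaseDomXPathSketch, its methods, and the sample markup are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of Any23.
final class UpperCaseDomXPathSketch {

    // Builds a small stand-in DOM by hand: no namespaces,
    // upper-case element names, lower-case attribute names.
    static Document buildSampleDom() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element html = doc.createElement("HTML");
        Element body = doc.createElement("BODY");
        Element div = doc.createElement("DIV");
        div.setAttribute("class", "vcard");
        div.setTextContent("Jane Doe");
        doc.appendChild(html);
        html.appendChild(body);
        body.appendChild(div);
        return doc;
    }

    // Evaluates an XPath expression and returns the number of matching nodes.
    static int countMatches(Document doc, String expression)
            throws XPathExpressionException {
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildSampleDom();
        // Upper-case element name matches the stand-in DOM.
        System.out.println(countMatches(doc, "//DIV[@class='vcard']")); // prints 1
        // Lower-case element name does not match: name tests are case sensitive.
        System.out.println(countMatches(doc, "//div[@class='vcard']")); // prints 0
    }
}
```

Under these assumptions, an expression written for lower-case HTML (such as //div) would silently match nothing, so queries against a DOM with this naming convention are best written with upper-case element names from the start.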
- Author:
- Richard Cyganiak (richard at cyganiak dot de), Michele Mostarda (mostarda@fbk.eu), Davide Palmisano (palmisano@fbk.eu)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ELEMENT_LOCATION
public static final String ELEMENT_LOCATION
- See Also:
- Constant Field Values
TagSoupParser
public TagSoupParser(InputStream input,
String documentURI)
TagSoupParser
public TagSoupParser(InputStream input,
String documentURI,
String encoding)
getDOM
public Document getDOM()
throws IOException
- Returns the DOM of the document identified by the given document URI.
- Returns:
- the HTML DOM.
- Throws:
IOException
getValidatedDOM
public DocumentReport getValidatedDOM(boolean applyFix)
throws IOException,
ValidatorException
- Returns the validated DOM and applies fixes to it if applyFix is set to true.
- Parameters:
applyFix - if true, fixes are applied to the validated DOM.
- Returns:
- a report containing the HTML DOM, which has been validated and, if applyFix is true, fixed. The report also contains information about the activated rules and the detected issues.
- Throws:
IOException
ValidatorException
Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.