org.apache.any23.extractor.html
Class TagSoupParser

java.lang.Object
  extended by org.apache.any23.extractor.html.TagSoupParser

public class TagSoupParser
extends Object

Parses an InputStream into an HTML DOM tree using a TagSoup parser.

Note: The resulting DOM tree is not namespace-aware; all element names are upper case, while all attribute names are lower case. This is because the NekoHTML-based TagSoup parser uses the Xerces HTML DOM implementation by default, which does not support namespaces and forces upper-case element names. This works with the RDFa XSLT Converter and with XPath, so we left it this way.
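
The practical consequence of the note above is that XPath expressions must match the upper-case element names. The following is a minimal sketch of that behaviour. It does not call TagSoupParser; it builds a small stand-in DOM by hand with the standard JAXP API, using the casing described in the note. The class name UpperCaseDomXPathSketch, its helper methods, and the sample URL are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class UpperCaseDomXPathSketch {

    /**
     * Builds a stand-in for the kind of DOM described in the note:
     * not namespace-aware, upper-case element names, lower-case attribute names.
     * (Assumption: hand-built here; a real tree would come from the parser.)
     */
    public static Document buildSampleDom() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(false);
        Document doc = factory.newDocumentBuilder().newDocument();

        Element html = doc.createElement("HTML");
        Element body = doc.createElement("BODY");
        Element link = doc.createElement("A");
        link.setAttribute("href", "http://example.org/");
        link.setTextContent("Example");

        body.appendChild(link);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    /** Evaluates an XPath expression and collects the node values of the matches. */
    public static List<String> selectValues(Document doc, String expression)
            throws XPathExpressionException {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildSampleDom();
        // XPath name tests are case-sensitive, so only the upper-case form matches.
        System.out.println(selectValues(doc, "//A/@href"));
        System.out.println(selectValues(doc, "//a/@href"));
    }
}
```

In this sketch the expression //A/@href selects the attribute value, while //a/@href selects nothing, because XPath name tests are case-sensitive. Expressions written against lower-case HTML element names would need to be adapted in the same way.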

Author:
Richard Cyganiak (richard at cyganiak dot de), Michele Mostarda (mostarda@fbk.eu), Davide Palmisano (palmisano@fbk.eu)

Nested Class Summary
static class TagSoupParser.ElementLocation
          Describes a DOM Element location.
 
Field Summary
static String ELEMENT_LOCATION
           
 
Constructor Summary
TagSoupParser(InputStream input, String documentURI)
           
TagSoupParser(InputStream input, String documentURI, String encoding)
           
 
Method Summary
 Document getDOM()
          Returns the DOM of the given document URI.
 DocumentReport getValidatedDOM(boolean applyFix)
          Returns the validated DOM and applies fixes to it if applyFix is set to true.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ELEMENT_LOCATION

public static final String ELEMENT_LOCATION
See Also:
Constant Field Values
Constructor Detail

TagSoupParser

public TagSoupParser(InputStream input,
                     String documentURI)

TagSoupParser

public TagSoupParser(InputStream input,
                     String documentURI,
                     String encoding)
Method Detail

getDOM

public Document getDOM()
                throws IOException
Returns the DOM of the given document URI.

Returns:
the HTML DOM.
Throws:
IOException

getValidatedDOM

public DocumentReport getValidatedDOM(boolean applyFix)
                               throws IOException,
                                      ValidatorException
Returns the validated DOM and applies fixes to it if applyFix is set to true.

Parameters:
applyFix - if true, fixes are applied to the validated DOM.
Returns:
a report containing the HTML DOM that has been validated, and fixed if applyFix is true. The report also contains information about the activated rules and the detected issues.
Throws:
IOException
ValidatorException


Copyright © 2010-2012 The Apache Software Foundation. All Rights Reserved.