TagSoupParser (Apache Any23 :: Core 0.7.0-incubating-SNAPSHOT API)

org.apache.any23.extractor.html
Class TagSoupParser

java.lang.Object
  org.apache.any23.extractor.html.TagSoupParser

public class TagSoupParser
extends Object

Parses an InputStream into an HTML DOM tree using a TagSoup parser.

Note: The resulting DOM tree will not be namespace aware, and all element names will be upper case, while attribute names will be lower case. This is because the NekoHTML-based TagSoup parser uses the Xerces HTML DOM implementation by default, which doesn't support namespaces and forces upper-case element names. This works with the RDFa XSLT Converter and with XPath, so we left it this way.
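As an illustration of what the note means for XPath queries, the following sketch uses only the JDK XML APIs. It does not call TagSoupParser: the class name `UpperCaseDomXPath`, its helper methods, and the hand-built document are hypothetical stand-ins for a DOM of the shape described above (no namespaces, upper-case element names, lower-case attribute names). Because XPath name tests are case-sensitive, expressions have to use the upper-case element names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class UpperCaseDomXPath {

    /**
     * Builds a small stand-in DOM by hand (assumption: this only mimics the
     * shape described in the note, it is not produced by TagSoupParser):
     * HTML > BODY > two P elements, the first with class="intro".
     */
    static Document buildSampleDom() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(false); // no namespace support, as in the note
            Document doc = factory.newDocumentBuilder().newDocument();

            Element html = doc.createElement("HTML"); // upper-case element names
            Element body = doc.createElement("BODY");
            Element first = doc.createElement("P");
            first.setAttribute("class", "intro");     // lower-case attribute name
            Element second = doc.createElement("P");

            doc.appendChild(html);
            html.appendChild(body);
            body.appendChild(first);
            body.appendChild(second);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the nodes selected by an XPath expression evaluated on the document. */
    static int count(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildSampleDom();
        System.out.println(count(doc, "//P"));                 // prints 2
        System.out.println(count(doc, "//p"));                 // prints 0
        System.out.println(count(doc, "//P[@class='intro']")); // prints 1
    }
}
```

In the sketch, the lower-case query `//p` selects nothing, while attribute tests such as `@class` keep their lower-case form.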

Author:: Richard Cyganiak (richard at cyganiak dot de), Michele Mostarda (mostarda@fbk.eu), Davide Palmisano (palmisano@fbk.eu)

Nested Class Summary
`static class`	`TagSoupParser.ElementLocation` Describes a DOM Element location.

Field Summary
`static String`	`ELEMENT_LOCATION`

Constructor Summary
`TagSoupParser(InputStream input, String documentURI)`
`TagSoupParser(InputStream input, String documentURI, String encoding)`

Method Summary
`Document`	`getDOM()` Returns the DOM of the given document URI.
`DocumentReport`	`getValidatedDOM(boolean applyFix)` Returns the validated DOM and applies fixes on it if applyFix is set to `true`.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

ELEMENT_LOCATION

public static final String ELEMENT_LOCATION

See Also:: Constant Field Values

Constructor Detail

TagSoupParser

public TagSoupParser(InputStream input,
                     String documentURI)

TagSoupParser

public TagSoupParser(InputStream input,
                     String documentURI,
                     String encoding)

Method Detail

getDOM

public Document getDOM()
                throws IOException

Returns the DOM of the given document URI.

Returns:: the HTML DOM.
Throws:: IOException

getValidatedDOM

public DocumentReport getValidatedDOM(boolean applyFix)
                               throws IOException,
                                      ValidatorException

Returns the validated DOM and applies fixes on it if applyFix is set to true.

Parameters:: applyFix - if true, fixes are applied to the validated DOM.
Returns:: a report containing the HTML DOM that has been validated, and fixed if applyFix is true. The report also contains information about the activated rules and the detected issues.
Throws:: IOException; ValidatorException
