The elementtree.HTMLTreeBuilder Module

Tools to build element trees from HTML files.

Module Contents

HTMLTreeBuilder(builder=None, encoding=None) (class)

ElementTree builder for HTML source code.

builder=
Optional builder object. If omitted, the parser uses the standard elementtree builder.
encoding=
Optional character encoding, if known. If omitted, the parser looks for META tags inside the document. If no such tag is found, the parser defaults to ISO-8859-1. Note that if your document uses an encoding that is not ASCII-compatible, you must decode the document before parsing.

For more information about this class, see The HTMLTreeBuilder Class.

parse(source, encoding=None)

Parse an HTML document or document fragment.

source
A filename or file object containing HTML data.
encoding
Optional character encoding, if known. If omitted, the parser looks for META tags inside the document. If no such tag is found, the parser defaults to ISO-8859-1.
Returns:
An ElementTree instance.
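
The lookup order described for the encoding argument (explicit value, then a META tag, then ISO-8859-1) can be sketched roughly as follows. This is a minimal illustration only, not the module's actual code: resolve_encoding and the regular expression are hypothetical names introduced here, and the real parser may recognise META tags differently.

```python
import re

# Hypothetical helper, not part of elementtree. It mirrors the documented
# lookup order: explicit argument, then a META tag, then ISO-8859-1.
_META_CHARSET = re.compile(
    rb"<meta\b[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def resolve_encoding(data, encoding=None, default="iso-8859-1"):
    """Return the encoding name to use for the raw HTML bytes in `data`."""
    if encoding:
        # An explicit encoding always wins.
        return encoding
    match = _META_CHARSET.search(data)
    if match:
        # A charset declared in a META tag is used next.
        return match.group(1).decode("ascii")
    # No declaration found: fall back to the documented default.
    return default


page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(resolve_encoding(page))
print(resolve_encoding(b"<p>no declaration</p>"))
```

A byte-level scan like this can only find the META tag when the document's encoding is ASCII-compatible, which is why documents in other encodings need to be decoded before parsing.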

TreeBuilder (variable)

An alias for the HTMLTreeBuilder class.

The HTMLTreeBuilder Class

HTMLTreeBuilder(builder=None, encoding=None) (class)

ElementTree builder for HTML source code. This builder converts an HTML document or fragment to an ElementTree.

The parser is relatively picky and requires balanced tags for most elements. However, elements belonging to the following group are closed automatically: P, LI, TR, TH, and TD. In addition, for elements in the following group, the parser inserts an end tag immediately after the start tag and ignores any explicit end tags: IMG, HR, META, and LINK.

builder=
Optional builder object. If omitted, the parser uses the standard elementtree builder.
encoding=
Optional character encoding, if known. If omitted, the parser looks for META tags inside the document. If no such tag is found, the parser defaults to ISO-8859-1. Note that if your document uses an encoding that is not ASCII-compatible, you must decode the document before parsing.
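
To make the two tag groups concrete, here is a rough stand-in built on the Python 3 standard library (html.parser and xml.etree.ElementTree.TreeBuilder). It is not the elementtree source: SketchHTMLBuilder, AUTOCLOSE, and EMPTY are names made up for this illustration, and unlike the parser described above, the sketch silently tolerates unbalanced tags.

```python
from html.parser import HTMLParser
from xml.etree.ElementTree import TreeBuilder, tostring

AUTOCLOSE = {"p", "li", "tr", "th", "td"}  # closed implicitly
EMPTY = {"img", "hr", "meta", "link"}      # ended right after the start tag


class SketchHTMLBuilder(HTMLParser):
    """Illustrative stand-in; assumed behaviour, not the real class."""

    def __init__(self):
        super().__init__()
        self._builder = TreeBuilder()
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # A new P/LI/TR/TH/TD closes an open element of the same kind.
        if tag in AUTOCLOSE and self._open and self._open[-1] == tag:
            self.handle_endtag(tag)
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in EMPTY:
            self._builder.end(tag)  # end tag inserted immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in EMPTY:
            return  # explicit end tags for this group are ignored
        # Implicitly close auto-closable elements still open inside `tag`.
        while self._open and self._open[-1] != tag and self._open[-1] in AUTOCLOSE:
            self._builder.end(self._open.pop())
        if self._open and self._open[-1] == tag:
            self._builder.end(self._open.pop())

    def handle_data(self, data):
        if self._open:  # skip text outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open:  # lenient: close whatever is still open
            self._builder.end(self._open.pop())
        return self._builder.close()


parser = SketchHTMLBuilder()
parser.feed("<ul><li>one<li>two</ul>")
root = parser.close()
print(tostring(root, encoding="unicode"))
# prints: <ul><li>one</li><li>two</li></ul>
```

The sketch keeps the same feed-then-close shape as the class documented here: close() flushes the parser and hands back the root element.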

close()

Flushes the parser buffers and returns the root element.

Returns:
An Element instance.

handle_charref(char)

(Internal) Handles character references.

handle_data(data)

(Internal) Handles character data.

handle_endtag(tag)

(Internal) Handles end tags.

handle_entityref(name)

(Internal) Handles entity references.

handle_starttag(tag, attrs)

(Internal) Handles start tags.

unknown_entityref(name)

(Hook) Handles unknown entity references. The default action is to ignore unknown entities.
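
The hook pattern can be illustrated with a small standard-library stand-in. This is a sketch under assumptions, not the module's implementation: EntityAwareParser, KeepUnknown, and text_of are hypothetical names, and only the unknown_entityref(name) signature and its ignore-by-default behaviour are taken from the description above.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityAwareParser(HTMLParser):
    """Illustrative stand-in that collects text and resolves entities."""

    def __init__(self):
        # convert_charrefs=False makes html.parser call handle_entityref.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is not None:
            self.chunks.append(chr(codepoint))
        else:
            self.unknown_entityref(name)  # delegate to the hook

    def unknown_entityref(self, name):
        pass  # default action: ignore the unknown entity


class KeepUnknown(EntityAwareParser):
    """Overrides the hook to keep unknown references as literal text."""

    def unknown_entityref(self, name):
        self.chunks.append("&%s;" % name)


def text_of(parser_class, html):
    """Parse `html` with `parser_class` and return the collected text."""
    parser = parser_class()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(text_of(EntityAwareParser, "<p>a &amp; b &bogus; c</p>"))
print(text_of(KeepUnknown, "<p>a &amp; b &bogus; c</p>"))
```

Overriding the hook in a subclass, rather than editing the entity handler itself, keeps the known-entity path untouched while letting callers decide whether unknown references are dropped, preserved, or reported.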