Architecture Overview

Parsing
Configuration
Validation
Notes

Parsing

The parser is designed as a pipeline. Information flows from the scanner to the validator to the parser. In the pipeline architecture, one component acts as a source of events while the other components in the pipeline are filters. The other components both listen to and propagate the information. The following diagram illustrates the layout of the pipeline in the parser.

[Diagram: XML Document -> Scanner -> Validator -> Parser -> Application API]

The scanner implements XMLDocumentSource and parses an XML document to perform callbacks to the next stage in the pipeline. The validator implements both XMLDocumentHandler to receive the callbacks from the scanner and XMLDocumentSource to act as the source for the next stage in the pipeline. ^[1] Finally, the parser implements XMLDocumentHandler to receive the callbacks from the validator and then expose that information to the application API.

By abstracting the components into sources and handlers, the stages in the pipeline can be connected together like puzzle pieces. Stages can be removed, added, or replaced with other implementations. For example, in a parser configuration with no validation, the scanner component can be connected directly as the source for the parser, thereby bypassing the validator and shortening the code path.
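The source/handler wiring described above can be sketched in a few lines of Java. This is only an illustration: `DocumentHandler` and `DocumentSource` are simplified, hypothetical stand-ins for XMLDocumentHandler and XMLDocumentSource (the real interfaces carry many more callbacks), and the `Scanner`, `Validator`, and `Parser` classes here merely log the events they see rather than doing real work.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-ins for XMLDocumentHandler and
// XMLDocumentSource; the real interfaces define many more callbacks.
class PipelineSketch {

    /** Receives document events from the previous stage. */
    interface DocumentHandler {
        void startElement(String name);
        void endElement(String name);
    }

    /** Emits document events to whichever handler is registered. */
    interface DocumentSource {
        void setDocumentHandler(DocumentHandler handler);
    }

    /** Source only. Emits events for one element instead of scanning real XML. */
    static class Scanner implements DocumentSource {
        private DocumentHandler handler;

        public void setDocumentHandler(DocumentHandler handler) {
            this.handler = handler;
        }

        void scan(String elementName) {
            handler.startElement(elementName);
            handler.endElement(elementName);
        }
    }

    /** Filter: handles events from the scanner, then propagates them. */
    static class Validator implements DocumentHandler, DocumentSource {
        private final List<String> log;
        private DocumentHandler next;

        Validator(List<String> log) { this.log = log; }

        public void setDocumentHandler(DocumentHandler handler) { this.next = handler; }

        public void startElement(String name) {
            log.add("validator:start " + name); // validation would happen here
            next.startElement(name);
        }

        public void endElement(String name) {
            log.add("validator:end " + name);
            next.endElement(name);
        }
    }

    /** Handler only: the final stage, which would expose events to the application. */
    static class Parser implements DocumentHandler {
        private final List<String> log;

        Parser(List<String> log) { this.log = log; }

        public void startElement(String name) { log.add("parser:start " + name); }
        public void endElement(String name) { log.add("parser:end " + name); }
    }

    /** Wires the stages together, with or without the validator, and runs one scan. */
    static List<String> run(boolean validate) {
        List<String> log = new ArrayList<>();
        Scanner scanner = new Scanner();
        Parser parser = new Parser(log);
        if (validate) {
            Validator validator = new Validator(log);
            scanner.setDocumentHandler(validator);
            validator.setDocumentHandler(parser);
        } else {
            scanner.setDocumentHandler(parser); // validator bypassed
        }
        scanner.scan("doc");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

Because the scanner only knows about the handler interface, removing the validator is a matter of registering a different handler; no stage needs to change.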

This architecture can also be used to implement upcoming standards like XInclude. A new XInclude filter can be written that includes document fragments into the pipeline stream before or after the validator. ^[2]

Parsing of DTDs can also be viewed as a pipeline. Since the DTD is referenced in the document instance by XML syntax (the DOCTYPE declaration), the DTD pipeline is triggered by the document scanner. This contrasts with XML Schema because there is no XML syntax that associates a Schema grammar with a document; a special attribute in the document instance is used as a hint to the location of the grammar. The following diagram illustrates the layout of the DTD pipeline.

[Diagram: DTD Document -> DTD Scanner -> Validator -> Parser -> Application API, with the Validator also feeding the DTD Grammar]

Note that the DTD scanner communicates directly with the validator. The validator receives the callbacks from the DTD scanner in order to create and populate the DTD grammar object. In this way, the validator acts as a "tee", propagating the DTD events to both the next stage in the pipeline and the DTD grammar object. This allows the validation stage in the pipeline to be completely removed from the parser configuration, if needed.
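A minimal sketch of the "tee" behavior follows, under stated assumptions: the `DTDHandler` interface below is a hypothetical one-method simplification (a real DTD handler would cover attribute lists, entities, notations, and more), and `DTDGrammar` is reduced to a map of element declarations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DtdTeeSketch {

    /** Hypothetical, single-callback stand-in for a DTD event handler. */
    interface DTDHandler {
        void elementDecl(String name, String contentModel);
    }

    /** Hypothetical grammar object: just remembers element declarations. */
    static class DTDGrammar {
        final Map<String, String> elementDecls = new LinkedHashMap<>();
    }

    /** Validator acting as a tee: each event goes to the grammar and to the next stage. */
    static class ValidatorTee implements DTDHandler {
        final DTDGrammar grammar = new DTDGrammar();
        private DTDHandler next; // may be null when nothing is listening downstream

        void setDTDHandler(DTDHandler next) { this.next = next; }

        public void elementDecl(String name, String contentModel) {
            grammar.elementDecls.put(name, contentModel); // branch 1: populate the grammar
            if (next != null) {
                next.elementDecl(name, contentModel);     // branch 2: propagate downstream
            }
        }
    }

    /** Downstream stage that records the declarations it is told about. */
    static class RecordingHandler implements DTDHandler {
        final List<String> seen = new ArrayList<>();

        public void elementDecl(String name, String contentModel) {
            seen.add(name + " " + contentModel);
        }
    }

    public static void main(String[] args) {
        ValidatorTee validator = new ValidatorTee();
        RecordingHandler parser = new RecordingHandler();
        validator.setDTDHandler(parser);

        // Stand-in for a callback from the DTD scanner.
        validator.elementDecl("book", "(title,author+)");

        System.out.println(validator.grammar.elementDecls);
        System.out.println(parser.seen);
    }
}
```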

Configuration

The parser object used by the application is composed of a series of connected components. The parser puts the components together, transmits settings to each component, and controls their actions. The configuration is managed by the parser object, which communicates the values to the components. The following diagram shows a general parser configuration and its components. (No ordering or direct connection between components should be implied.)

[Diagram: the Parser and its components: Symbol Table, Error Reporter, Entity Manager, Document Scanner, DTD Scanner, Validator, Grammar Pool, Datatype Validator Factory]

In the parser configuration, the parser object acts both as the configuration manager and as a component in the pipeline. The parser object does not have to be the final stage in the pipeline, but there is a practical reason to put it there. Most users of the parser will not need to create new parsers by assembling the components, but they may want to change some part of the parser's behavior (e.g., capitalize the names of elements that get created in the DOM tree). When the parser is the last stage, the user can extend the parser class and override the necessary methods.
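The element-name example can be sketched as follows. The class `TreeBuildingParser` and its `startElement` callback are hypothetical simplifications introduced for illustration; they are not the actual parser classes, and a list of names stands in for the DOM tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ExtendParserSketch {

    /** Hypothetical final pipeline stage; a list of names stands in for the DOM tree. */
    static class TreeBuildingParser {
        private final List<String> createdElements = new ArrayList<>();

        /** Callback from the previous stage in the pipeline. */
        public void startElement(String name) {
            createdElements.add(name);
        }

        public List<String> createdElements() {
            return createdElements;
        }
    }

    /** Changes one behavior by overriding a single callback. */
    static class UpperCasingParser extends TreeBuildingParser {
        @Override
        public void startElement(String name) {
            // Locale.ROOT keeps the result independent of the default locale.
            super.startElement(name.toUpperCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        TreeBuildingParser parser = new UpperCasingParser();
        parser.startElement("book");
        parser.startElement("title");
        System.out.println(parser.createdElements()); // prints [BOOK, TITLE]
    }
}
```

The rest of the pipeline is untouched; only the last stage differs, which is why keeping the parser last makes this kind of customization cheap.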

Features & Properties

Features and properties are provided via the extensible mechanism found in SAX2. Features are boolean settings on the parser, while properties are object settings. There are a number of SAX2 features as well as additional parser features defined by the Xerces parser. The following list includes only those features that are of general use to most components and should not be considered a complete list.

http://xml.org/sax/features/namespaces: Sets whether the parser should process namespaces.
http://xml.org/sax/features/validation: Sets whether the parser should validate instance documents.
http://apache.org/xml/features/validation/dynamic: This setting tells the parser to validate instance documents if the document contains a reference to a grammar. If no reference to a grammar is included, then validation is turned off. Setting the http://xml.org/sax/features/validation feature to false overrides this setting.
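From the application's side, these features are set through the standard SAX2 `XMLReader.setFeature` call. The sketch below obtains a reader through JAXP's `SAXParserFactory`; which parser implementation that returns depends on the runtime, so the Apache-specific dynamic validation feature may or may not be recognized. The helper names (`trySetFeature`, `isEnabled`, `newReader`) are illustrative, not part of any API.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureSketch {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";
    static final String DYNAMIC = "http://apache.org/xml/features/validation/dynamic";

    /** Creates a SAX2 reader from whatever parser JAXP finds at runtime. */
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no usable SAX parser", e);
        }
    }

    /** Sets a feature; returns false if the parser does not recognize or support it. */
    static boolean trySetFeature(XMLReader reader, String featureId, boolean state) {
        try {
            reader.setFeature(featureId, state);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    /** Reads a feature; treats an unrecognized or unsupported feature as disabled. */
    static boolean isEnabled(XMLReader reader, String featureId) {
        try {
            return reader.getFeature(featureId);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        System.out.println("namespaces accepted: " + trySetFeature(reader, NAMESPACES, true));
        System.out.println("validation accepted: " + trySetFeature(reader, VALIDATION, true));
        // Only meaningful on parsers that define this Apache-specific feature.
        System.out.println("dynamic accepted:    " + trySetFeature(reader, DYNAMIC, true));
    }
}
```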

The SAX2 property mechanism is used to communicate information about other components in the parser and its internals. The following properties are defined for general use by most parser instances, but not every parser configuration will have all of these components. All of the following properties are read-only.

http://apache.org/xml/properties/internal/symbol-table: The symbol table used by this parser instance. The object returned by this property is of type SymbolTable.
http://apache.org/xml/properties/internal/error-reporter: The error reporter used by this parser instance. The object returned by this property is of type XMLErrorReporter.
http://apache.org/xml/properties/internal/entity-manager: The entity manager used by this parser instance. The object returned by this property is of type XMLEntityManager.
http://apache.org/xml/properties/internal/grammar-pool: The grammar pool used by this parser instance, if appropriate. The object returned by this property is of type GrammarPool.
http://apache.org/xml/properties/internal/datatype-validator-factory: The datatype validator factory used by this parser instance, if appropriate. The object returned by this property is of type DatatypeValidatorFactory.

Settings Management

The parser implements the XMLComponentManager interface and each component implements the XMLComponent interface. For this configuration system to work, the parser must adhere to the following guidelines:

Before each parse, the parser must call the reset method on each configurable component. This call allows each component to query the state of only those features and properties that are important to the operation of the component.
Whenever the application sets a feature or property on the parser during a parse, the parser must pass that setting to each configurable component. This is important because configuration settings can change while an XML document is being parsed, and those settings may directly affect the operation of components. Settings changed before or after a parse do not need to be propagated, because each component queries its settings during the call to reset.
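The two guidelines can be sketched as follows. The `ComponentManager` and `Component` interfaces below are reduced, hypothetical versions of XMLComponentManager and XMLComponent (features only, no properties), and the `startParse`/`endParse` methods are assumptions made to mark the boundaries of a parse.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SettingsSketch {
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Reduced stand-in for XMLComponentManager: components query settings from it. */
    interface ComponentManager {
        boolean getFeature(String featureId);
    }

    /** Reduced stand-in for XMLComponent. */
    interface Component {
        void reset(ComponentManager manager);
        void setFeature(String featureId, boolean state);
    }

    /** Caches only the one feature it cares about. */
    static class ValidatorComponent implements Component {
        boolean validating;

        public void reset(ComponentManager manager) {
            validating = manager.getFeature(VALIDATION);
        }

        public void setFeature(String featureId, boolean state) {
            if (VALIDATION.equals(featureId)) {
                validating = state;
            }
        }
    }

    /** The parser owns the settings and decides when to push them to components. */
    static class Parser implements ComponentManager {
        private final Map<String, Boolean> features = new HashMap<>();
        private final List<Component> components = new ArrayList<>();
        private boolean parsing;

        void addComponent(Component component) { components.add(component); }

        public boolean getFeature(String featureId) {
            return features.getOrDefault(featureId, false);
        }

        void setFeature(String featureId, boolean state) {
            features.put(featureId, state);
            if (parsing) { // guideline 2: propagate only while a parse is in progress
                for (Component c : components) {
                    c.setFeature(featureId, state);
                }
            }
        }

        void startParse() {
            for (Component c : components) {
                c.reset(this); // guideline 1: reset every component before each parse
            }
            parsing = true;
        }

        void endParse() { parsing = false; }
    }

    public static void main(String[] args) {
        Parser parser = new Parser();
        ValidatorComponent validator = new ValidatorComponent();
        parser.addComponent(validator);

        parser.setFeature(VALIDATION, true);  // before the parse: stored, not pushed
        parser.startParse();                  // reset() picks the value up
        System.out.println("during parse: " + validator.validating);
        parser.endParse();
    }
}
```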

Parser Initialization

To be written.

Notes

[1] For the purposes of discussion, any component that implements both the document handler and document source interfaces will be known as a "filter".

[2] XInclude is defined to work on the XML Infoset which means that it should occur after validation but there may be situations where the application wants to validate the included material. In this case, the filter would be inserted before the validator so that the document inclusion would be expanded without the validator ever knowing that there was an inclusion. However, since the pipeline does not buffer the document, there is a problem with supporting features of specifications that operate on the Infoset. Section 3.1 of the XInclude specification highlights this problem in the example. Any use of XPointer to reference parts of the document containing the XInclude could not be supported without buffering the document in some way. Perhaps this means that the best way to support XInclude, and other specifications that rely on the XML Infoset, is to operate on the document model (such as DOM) and not to try to do it in the parser pipeline.
