The parser is designed as a pipeline. Information flows from the scanner to the validator to the parser. In the pipeline architecture, one component acts as a source of events while the other components in the pipeline are filters. The other components both listen to and propagate the information. The following diagram illustrates the layout of the pipeline in the parser.
The scanner implements XMLDocumentSource and parses an XML document to perform callbacks to the next stage in the pipeline. The validator implements both XMLDocumentHandler to receive the callbacks from the scanner and XMLDocumentSource to act as the source for the next stage in the pipeline. [1] Finally, the parser implements XMLDocumentHandler to receive the callbacks from the validator and then expose that information to the application API.
By abstracting the components into sources and handlers, the stages in the pipeline can be connected together like puzzle pieces. Stages can be removed, added, or replaced with other implementations. For example, in a parser configuration with no validation, the scanner component can be connected directly as the source for the parser, thereby bypassing the validator and shortening the code path.
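The source/handler wiring described above can be sketched in a few lines of Java. This is only an illustration: DocSource, DocHandler, ToyScanner, ToyValidator, and ToyParser are hypothetical, heavily simplified stand-ins invented for this example, not the actual XMLDocumentSource and XMLDocumentHandler interfaces, which carry many more callbacks.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a source/handler pipeline; every type here is a simplified stand-in. */
class PipelineSketch {

    /** Receives document events (hypothetical stand-in for a document handler). */
    interface DocHandler {
        void startElement(String name);
        void endElement(String name);
    }

    /** Emits document events to the attached handler (hypothetical stand-in for a document source). */
    interface DocSource {
        void setDocHandler(DocHandler handler);
    }

    /** First stage: a pure source. A real scanner would tokenize XML text. */
    static class ToyScanner implements DocSource {
        private DocHandler handler;
        public void setDocHandler(DocHandler handler) { this.handler = handler; }

        /** Pretends to scan <root><child/></root>. */
        void scan() {
            handler.startElement("root");
            handler.startElement("child");
            handler.endElement("child");
            handler.endElement("root");
        }
    }

    /** Middle stage: a filter, that is, both a handler and a source. */
    static class ToyValidator implements DocHandler, DocSource {
        private DocHandler next;
        int elementsSeen = 0;
        public void setDocHandler(DocHandler handler) { this.next = handler; }
        public void startElement(String name) {
            elementsSeen++;            // placeholder for real validation work
            next.startElement(name);   // propagate the event downstream
        }
        public void endElement(String name) { next.endElement(name); }
    }

    /** Final stage: a pure handler that exposes events to the application. */
    static class ToyParser implements DocHandler {
        final List<String> events = new ArrayList<>();
        public void startElement(String name) { events.add("start:" + name); }
        public void endElement(String name) { events.add("end:" + name); }
    }

    /** Connects the stages, with or without the validator, and runs the scan. */
    static List<String> run(boolean validating) {
        ToyScanner scanner = new ToyScanner();
        ToyParser parser = new ToyParser();
        if (validating) {
            ToyValidator validator = new ToyValidator();
            scanner.setDocHandler(validator);
            validator.setDocHandler(parser);
        } else {
            scanner.setDocHandler(parser);  // validator bypassed entirely
        }
        scanner.scan();
        return parser.events;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

In this sketch the application sees the same event stream in both configurations; the non-validating one simply has one fewer hop between the scanner and the parser.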
This architecture can also be used to implement upcoming standards like XInclude. A new XInclude filter can be written that includes document fragments into the pipeline stream before or after the validator. [2]
Parsing of DTDs can also be viewed as a pipeline. Since the DTD is referenced in the document instance by XML syntax (the DOCTYPE declaration), the DTD pipeline is triggered by the document scanner. This contrasts with XML Schema because there is no XML syntax that associates a Schema grammar with a document; a special attribute in the document instance is used as a hint to the location of the grammar. The following diagram illustrates the layout of the DTD pipeline.
Note that the DTD scanner communicates directly with the validator. The validator receives the callbacks from the DTD scanner in order to create and populate the DTD grammar object. In this way, the validator acts as a "tee", propagating the DTD events to both the next stage in the pipeline and the DTD grammar object. This allows the validation stage in the pipeline to be completely removed from the parser configuration, if needed.
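A minimal sketch of the "tee" idea follows. The names DtdHandler, DtdGrammar, ValidatorTee, and RecordingHandler, along with the single elementDecl callback, are assumptions made for this illustration and do not reflect the parser's real DTD interfaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a validator acting as a tee for DTD events; all types are hypothetical. */
class DtdTeeSketch {

    /** Receives DTD events (reduced to one callback for brevity). */
    interface DtdHandler {
        void elementDecl(String name, String contentModel);
    }

    /** Stand-in for the DTD grammar object that the validator populates. */
    static class DtdGrammar {
        final List<String> declaredElements = new ArrayList<>();
    }

    /** Tee: records each declaration in the grammar and forwards it downstream. */
    static class ValidatorTee implements DtdHandler {
        final DtdGrammar grammar = new DtdGrammar();
        private final DtdHandler next;  // may be null if nothing is downstream

        ValidatorTee(DtdHandler next) { this.next = next; }

        public void elementDecl(String name, String contentModel) {
            grammar.declaredElements.add(name);                      // branch 1: grammar
            if (next != null) next.elementDecl(name, contentModel);  // branch 2: next stage
        }
    }

    /** Downstream stage that simply remembers what it was told. */
    static class RecordingHandler implements DtdHandler {
        final List<String> seen = new ArrayList<>();
        public void elementDecl(String name, String contentModel) {
            seen.add(name + " " + contentModel);
        }
    }

    /** Plays the role of the DTD scanner by emitting two declarations. */
    static void scanDtd(DtdHandler handler) {
        handler.elementDecl("note", "(to,body)");
        handler.elementDecl("to", "(#PCDATA)");
    }

    public static void main(String[] args) {
        RecordingHandler downstream = new RecordingHandler();
        ValidatorTee tee = new ValidatorTee(downstream);
        scanDtd(tee);
        System.out.println(tee.grammar.declaredElements);
        System.out.println(downstream.seen);

        // Without validation, the DTD scanner can feed the next stage directly.
        RecordingHandler direct = new RecordingHandler();
        scanDtd(direct);
        System.out.println(direct.seen);
    }
}
```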
The parser object used by the application is composed of a series of connected components. The parser puts the components together, transmits settings to each component, and controls their actions. The configuration is managed by the parser object and the values are communicated to the components. The following diagram shows a general parser configuration and its components. (No ordering or direct connection between components should be implied.)
In the parser configuration, the parser object acts both as the configuration manager and as a component in the pipeline. The parser object does not have to be the final stage in the pipeline, though. Most users of the parser will not need to create new parsers by assembling the components but may want to change some part of the behavior of the parser (e.g. capitalize element names that get created in the DOM tree). By having the parser be the last stage, the user can extend the parser class and override the necessary methods.
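The element-name example can be illustrated with a small sketch. TreeParser below is a hypothetical stand-in for a tree-building parser that sits at the end of the pipeline; it is not the actual DOM parser class, and its single startElement callback is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch of customizing a parser by subclassing it; all types are hypothetical. */
class ExtendParserSketch {

    /** Stand-in for a parser that is the last pipeline stage and builds a tree. */
    static class TreeParser {
        /** Names of the elements "created in the tree", in document order. */
        final List<String> elementNames = new ArrayList<>();

        /** Callback invoked by the previous stage in the pipeline. */
        public void startElement(String name) {
            elementNames.add(name);
        }
    }

    /** User extension: capitalizes element names before they reach the tree. */
    static class UpperCaseParser extends TreeParser {
        @Override
        public void startElement(String name) {
            super.startElement(name.toUpperCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        TreeParser parser = new UpperCaseParser();
        parser.startElement("note");
        parser.startElement("body");
        System.out.println(parser.elementNames);
    }
}
```

Because the override sits in the final stage, none of the upstream components needs to know that the application has changed how the tree is built.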
Features and properties are provided via the extensible mechanism
found in SAX2. Features are boolean settings on the parser while
properties are object settings. There are a number of SAX2 features
as well as additional parser features defined by the Xerces parser.
The following list includes only those features which are of
general use to most components and should not be considered a
complete list.
false
overrides this setting.
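As an illustration of the SAX2 feature mechanism from the application's side, the sketch below sets and reads a boolean feature through the standard org.xml.sax.XMLReader interface. It assumes the parser is obtained through JAXP (SAXParserFactory); the helper setAndGet is a name invented for this example, and the two feature identifiers are the standard SAX2 namespaces and validation features.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Sketch of setting and querying SAX2 features on an XMLReader. */
class FeatureSketch {

    // Standard SAX2 feature identifiers.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Sets a feature on a fresh reader and returns the value the reader reports back. */
    static boolean setAndGet(String featureId, boolean value) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(featureId, value);  // features are boolean settings
            return reader.getFeature(featureId);
        } catch (ParserConfigurationException | SAXException e) {
            // Unrecognized or unsupported features are reported as SAX exceptions.
            throw new IllegalStateException("feature not available: " + featureId, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("namespaces: " + setAndGet(NAMESPACES, true));
        System.out.println("validation: " + setAndGet(VALIDATION, false));
    }
}
```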
The SAX2 property mechanism is used to communicate information about
other components in the parser and its internals. The following
list includes properties defined for general use by most
parser instances; it does not imply that every parser configuration
will have all of these components. All of the following properties
are read-only.
SymbolTable
XMLErrorReporter
XMLEntityManager
GrammarPool
DatatypeValidatorFactory
The parser implements the XMLComponentManager interface and each component implements the XMLComponent interface. For this configuration system to work, the parser must adhere to the following guidelines:
reset method on each configurable component. This call allows each component to query the state of only those features and properties that are important to the operation of the component.
reset
.
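The relationship between the component manager and its components might look roughly like the sketch below. The interfaces ComponentManager and Component are simplified, hypothetical stand-ins for XMLComponentManager and XMLComponent, and the property key "example:symbol-table" is a made-up identifier used only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a parser that manages settings and resets its components; types are hypothetical. */
class ComponentConfigSketch {

    static final String VALIDATION = "http://xml.org/sax/features/validation";
    static final String SYMBOL_TABLE = "example:symbol-table";  // made-up property key

    /** Stand-in for the component manager: answers feature and property queries. */
    interface ComponentManager {
        boolean getFeature(String featureId);
        Object getProperty(String propertyId);
    }

    /** Stand-in for a configurable component. */
    interface Component {
        void reset(ComponentManager manager);
    }

    /** A component that queries only the settings it cares about. */
    static class ToyValidator implements Component {
        boolean validating;
        Object symbolTable;

        public void reset(ComponentManager manager) {
            validating = manager.getFeature(VALIDATION);
            symbolTable = manager.getProperty(SYMBOL_TABLE);
        }
    }

    /** The parser owns the settings and resets every component before parsing. */
    static class ToyParser implements ComponentManager {
        private final Map<String, Boolean> features = new HashMap<>();
        private final Map<String, Object> properties = new HashMap<>();
        private final List<Component> components = new ArrayList<>();

        void addComponent(Component component) { components.add(component); }
        void setFeature(String featureId, boolean state) { features.put(featureId, state); }
        void setProperty(String propertyId, Object value) { properties.put(propertyId, value); }

        public boolean getFeature(String featureId) {
            return features.getOrDefault(featureId, false);
        }
        public Object getProperty(String propertyId) { return properties.get(propertyId); }

        /** Resets each component so it can pick up current settings; parsing itself is omitted. */
        void parse() {
            for (Component component : components) {
                component.reset(this);
            }
        }
    }

    public static void main(String[] args) {
        ToyParser parser = new ToyParser();
        ToyValidator validator = new ToyValidator();
        parser.addComponent(validator);
        parser.setFeature(VALIDATION, true);
        parser.setProperty(SYMBOL_TABLE, new Object());
        parser.parse();
        System.out.println("validating = " + validator.validating);
    }
}
```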
To be written.
[1] For the purposes of discussion, any component that implements both the document handler and document source interfaces will be known as a "filter".
[2] XInclude is defined to work on the XML Infoset, which means that it should occur after validation, but there may be situations where the application wants to validate the included material. In this case, the filter would be inserted before the validator so that the document inclusion would be expanded without the validator ever knowing that there was an inclusion. However, since the pipeline does not buffer the document, there is a problem with supporting features of specifications that operate on the Infoset. The example in Section 3.1 of the XInclude specification highlights this problem. Any use of XPointer to reference parts of the document containing the XInclude could not be supported without buffering the document in some way. Perhaps this means that the best way to support XInclude, and other specifications that rely on the XML Infoset, is to operate on the document model (such as DOM) rather than in the parser pipeline.