Entity Management

Overview
- Xerces
- Crimson
Assumptions
Entity Manager
Entity Scanner
Notes
- Open Issues

Overview

An XML document is composed of various entities, each of which can use a different character encoding. The document instance is known as the document entity, whereas we'll call the DTD the DTD entity. In addition, general entities and parameter entities act as macros for inserting fragments into the parse stream when the entity is referenced in the document and DTD, respectively.

There must be a way to declare, locate, and read entities in their respective character encoding. The entity manager handles locating entities and obtaining an entity scanner capable of scanning the entity content. Depending on the character encoding, there may be custom readers for performance reasons. Regardless of the character encoding, though, the interface to scan the underlying content must be consistent and simple.

Xerces

The complexity of the original Xerces code resulted, in large part, from the readers and entity management. The entity readers were defined with a large set of methods so that read operations could be optimized for each reader and character transcoding could be deferred. However, this meant that every reader had to implement all of the methods separately, which introduced more chances for bugs and made the system harder to understand.

Crimson

Crimson took a simpler approach to reading entities. There is only one reader class, which delegates the read calls to a few optimized input stream readers. And because it does not attempt to defer character transcoding, the code path is greatly simplified. But it's not all roses. If you look deep enough in the code, you'll find that the entity management code is somewhat complex because of the nature of XML entities.

Assumptions

Before designing the entity management, a few assumptions were made:

Characters are always transcoded: This greatly simplifies the system and allows us to avoid using a string pool. There is a performance cost, but the simplicity and understandability of the code far outweigh any performance lost.
There will be a single entity manager per parser instance: Scanners need a way of locating entities and reading their contents. An entity manager provides that mechanism.
There will be a single entity scanner per parser instance that XML scanners will use: This entity scanner can still delegate to custom, optimized input stream readers for performance.

Entity Manager

The entity manager is a core component in any parser configuration and there is only one entity manager per parser instance. Some of the responsibilities of the entity manager are:

- Registering declared entities
- Resolving external entities
- Starting entities

The XMLEntityManager class implements entity management in the parser. This class contains methods for registering general and parameter entities; resolving entities either by default or by using the SAX EntityResolver registered by the user; and starting named and unnamed entities. The various XML scanners obtain the entity scanner by calling getEntityScanner on the entity manager.
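The three responsibilities above can be sketched roughly as follows. This is a minimal illustration, not the actual XMLEntityManager API: the class SketchEntityManager, its method names, and its reader stack are assumptions made for this example. Only EntityResolver, InputSource, and SAXException are real SAX types.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch; NOT the real XMLEntityManager API.
class SketchEntityManager {
    /** A declared entity: internal replacement text or an external system id. */
    private static final class Decl {
        final String text;
        final String systemId;
        Decl(String text, String systemId) { this.text = text; this.systemId = systemId; }
    }

    private final Map<String, Decl> declared = new HashMap<>();
    private final Deque<Reader> readerStack = new ArrayDeque<>();
    private EntityResolver resolver; // optional, supplied by the user

    void setEntityResolver(EntityResolver resolver) { this.resolver = resolver; }

    // Registering: the first declaration of a name is binding (XML 1.0 rule).
    void addInternalEntity(String name, String text) {
        declared.putIfAbsent(name, new Decl(text, null));
    }

    void addExternalEntity(String name, String systemId) {
        declared.putIfAbsent(name, new Decl(null, systemId));
    }

    // Resolving: ask the user's resolver first, then fall back to the system id.
    InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        InputSource source =
            (resolver != null) ? resolver.resolveEntity(publicId, systemId) : null;
        return (source != null) ? source : new InputSource(systemId);
    }

    // Starting: push a reader for the entity onto the reader stack.
    void startEntity(String name) throws SAXException, IOException {
        Decl decl = declared.get(name);
        if (decl == null) throw new SAXException("Entity not declared: " + name);
        if (decl.text != null) {
            readerStack.push(new StringReader(decl.text));
            return;
        }
        Reader reader = resolveEntity(null, decl.systemId).getCharacterStream();
        if (reader == null) {
            // A real parser would open the system id; the sketch stops here.
            throw new IOException("Sketch needs a character stream for: " + decl.systemId);
        }
        readerStack.push(reader);
    }

    Reader currentReader() { return readerStack.peek(); }

    int depth() { return readerStack.size(); }

    public static void main(String[] args) throws Exception {
        SketchEntityManager manager = new SketchEntityManager();
        manager.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("external text")));
        manager.addInternalEntity("greeting", "hello");
        manager.addExternalEntity("chapter", "chapter1.xml");
        manager.startEntity("greeting");
        manager.startEntity("chapter");
        System.out.println(manager.depth());
    }
}
```

Keeping one map for both internal and external declarations is a simplification chosen for the sketch; it makes the "first declaration wins" rule a single putIfAbsent call.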

Entity Scanner

The entity scanner is responsible for scanning "primitive" XML structures from an entity and reporting the parse location. The XMLEntityScanner class contains methods to peek at the current character, scan names and content, and so on.
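To make the idea of "primitive" scanning concrete, here is a small sketch of peeking, scanning a name, and tracking the parse location. It is not the real XMLEntityScanner: the class SketchEntityScanner and its method names are assumptions, and the name rules are deliberately reduced to ASCII rather than the full XML Name production.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch; NOT the real XMLEntityScanner API.
class SketchEntityScanner {
    private final PushbackReader in;
    private int line = 1;
    private int column = 1;

    SketchEntityScanner(Reader reader) { this.in = new PushbackReader(reader, 1); }

    /** Returns the next character without consuming it, or -1 at end of input. */
    int peekChar() throws IOException {
        int c = in.read();
        if (c != -1) in.unread(c);
        return c;
    }

    /** Consumes and returns the next character, updating the parse location. */
    int scanChar() throws IOException {
        int c = in.read();
        if (c == '\n') { line++; column = 1; }
        else if (c != -1) column++;
        return c;
    }

    /** Scans a name (simplified ASCII rules); returns null if none starts here. */
    String scanName() throws IOException {
        if (!isNameStart(peekChar())) return null;
        StringBuilder name = new StringBuilder();
        while (isNameChar(peekChar())) name.append((char) scanChar());
        return name.toString();
    }

    int getLineNumber() { return line; }

    int getColumnNumber() { return column; }

    private static boolean isNameStart(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == ':';
    }

    private static boolean isNameChar(int c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    public static void main(String[] args) throws IOException {
        SketchEntityScanner scanner = new SketchEntityScanner(new StringReader("<doc>"));
        scanner.scanChar();                      // consume '<'
        System.out.println(scanner.scanName());  // the element name
        System.out.println(scanner.getLineNumber() + ":" + scanner.getColumnNumber());
    }
}
```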

There is only one entity scanner per entity manager. The entity scanner works directly with the entity manager in order to read from the underlying character readers. This makes scanning of the entities transparent to the caller. Changing readers, auto-detecting encodings from input streams, and buffering are done "under the covers" and do not affect how the caller interacts with the entity scanner.
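As an illustration of work done "under the covers", the following sketch auto-detects an encoding from the first bytes of an input stream and hands back an ordinary Reader, so the caller never sees the detection step. The class SketchEncodingDetector and its open method are assumptions for this example. It only recognizes UTF-8 and UTF-16 byte order marks and otherwise assumes UTF-8; a real parser would also inspect the bytes of the XML declaration and handle more encodings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; BOM detection only, UTF-32 is not handled.
class SketchEncodingDetector {
    /** Wraps the stream in a Reader whose charset is guessed from a byte order mark. */
    static Reader open(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3); // Java 9+; may return fewer than 3 at end of stream

        Charset charset = StandardCharsets.UTF_8; // default when no BOM is present
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            bomLength = 3;                         // UTF-8 BOM
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;   // big-endian BOM
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;   // little-endian BOM
            bomLength = 2;
        }

        // Push back everything after the BOM so the reader sees only content.
        in.unread(head, bomLength, n - bomLength);
        return new InputStreamReader(in, charset);
    }

    public static void main(String[] args) throws IOException {
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x68, 0x00, 0x69}; // BOM + "hi"
        Reader reader = open(new ByteArrayInputStream(utf16));
        System.out.println((char) reader.read());
    }
}
```

The push-back step is what keeps the detection invisible: the bytes that were inspected are returned to the stream, so the reader produced here behaves as if nothing had been read.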

If both the entity manager and entity scanner are singletons per parser instance, why aren't they a single object? The manager and scanner could be a single object but they are separate in order to have a cleaner separation of functionality and API. Even though they are separate, they share common data, as shown in the following diagram.

[Diagram: Entity Manager and Entity Scanner, with the labels "entity resolver", "reader stack", and "entity handler".]

Notes

It is expected that the entity management and readers will need to be re-evaluated as the Xerces 2 concept is implemented. Reading entities directly affects the performance of the parser; although performance isn't an initial requirement, it is important.

Open Issues

There are currently some open issues. [Note: these should move to implementation issues.]

Entity Encoding: When an entity that is read from an input stream is started, the encoding must first be auto-detected. Then, as the appropriate scanner parses the XMLDecl or TextDecl line, a new encoding must be set on the entity scanner. An API must be created in order to set the encoding. However, the work of swapping out the character reader is done transparently to the caller.
Open Readers: Who closes open readers? The parser should close all readers that it created but should not close any readers that are passed to the parser via the parse(InputSource) method. And at what time are the readers closed in the case of an unrecoverable error?
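One possible answer to the Open Readers question is sketched below: the parser records every reader it opens itself and closes exactly those, leaving caller-supplied readers alone, and it does so in a try-with-resources block so the cleanup also runs after an unrecoverable error. This is only an illustration of that policy, not a decision recorded in this document; SketchReaderRegistry, own, and TrackingReader are hypothetical names.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one ownership policy for readers.
class SketchReaderRegistry implements Closeable {
    /** Test helper: a reader that remembers whether it was closed. */
    static final class TrackingReader extends StringReader {
        boolean closed;
        TrackingReader(String text) { super(text); }
        @Override public void close() { closed = true; super.close(); }
    }

    private final List<Reader> owned = new ArrayList<>();

    /** Registers a reader that the parser created and must therefore close. */
    Reader own(Reader reader) {
        owned.add(reader);
        return reader;
    }

    /** Closes only the owned readers, most recently opened first. */
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (int i = owned.size() - 1; i >= 0; i--) {
            try {
                owned.get(i).close();
            } catch (IOException e) {
                if (first == null) first = e; else first.addSuppressed(e);
            }
        }
        owned.clear();
        if (first != null) throw first;
    }

    public static void main(String[] args) throws IOException {
        TrackingReader fromCaller = new TrackingReader("passed to parse(InputSource)");
        TrackingReader fromParser = new TrackingReader("opened for an external entity");
        try (SketchReaderRegistry registry = new SketchReaderRegistry()) {
            registry.own(fromParser);
            throw new IllegalStateException("simulated unrecoverable error");
        } catch (IllegalStateException e) {
            // The registry has already been closed by the time control reaches here.
        }
        System.out.println(fromParser.closed + " " + fromCaller.closed);
    }
}
```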
