Entity Management

Overview

An XML document is composed of various entities, each of which can use a different character encoding. The document instance is known as the document entity; we'll call the DTD the DTD entity. In addition, general entities and parameter entities act as macros that insert fragments into the parse stream when they are referenced in the document and the DTD, respectively.

There must be a way to declare, locate, and read entities in their respective character encodings. The entity manager handles locating entities and obtaining an entity scanner capable of scanning the entity content. Depending on the character encoding, there may be custom readers for performance reasons. Regardless of the character encoding, though, the interface for scanning the underlying content must be consistent and simple.

Xerces

The complexity of the original Xerces code resulted, in large part, from the readers and entity management. The entity readers were defined with a large set of methods so that read operations could be optimized for each reader and character transcoding could be deferred. However, this meant that every reader had to implement all of the methods separately, which introduced more chances for bugs and made the system harder to understand.

Crimson

Crimson took a simpler approach to reading entities. There is only one reader class, which delegates the read calls to a few optimized input stream readers. And because it does not attempt to defer character transcoding, the code path is greatly simplified. But it's not all roses. If you look deep enough in the code, you'll find that the entity management code is still somewhat complex because of the nature of XML entities.

Assumptions

Before designing the entity management, a few assumptions were made:

Entity Manager

The entity manager is a core component in any parser configuration and there is only one entity manager per parser instance. Some of the responsibilities of the entity manager are:

The XMLEntityManager class implements the entity management in the parser. This class contains methods for registering general and parameter entities, resolving entities (either by default or by using the SAX EntityResolver registered by the user), and starting named and unnamed entities. The various XML scanners obtain the entity scanner by calling getEntityScanner on the entity manager.
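As a rough illustration of the responsibilities described above, here is a much-simplified sketch of an entity manager. The class name SketchEntityManager and all of its methods are hypothetical and are not the actual XMLEntityManager API; only the SAX types (EntityResolver, InputSource, SAXException) are real. The sketch assumes that the first declaration of an entity is binding, as the XML specification requires, and it leaves out default resolution of system identifiers entirely.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical, simplified entity manager; not the real XMLEntityManager. */
class SketchEntityManager {
    /** Declared internal entities: name -> replacement text. */
    private final Map<String, String> internalEntities = new HashMap<>();
    /** Declared external entities: name -> system identifier. */
    private final Map<String, String> externalEntities = new HashMap<>();
    /** Readers of the entities currently being scanned, innermost on top. */
    private final Deque<Reader> readerStack = new ArrayDeque<>();
    /** Names of the open entities, used to detect recursive references. */
    private final Deque<String> openNames = new ArrayDeque<>();
    /** Optional resolver supplied by the user. */
    private EntityResolver resolver;

    void setEntityResolver(EntityResolver resolver) {
        this.resolver = resolver;
    }

    void addInternalEntity(String name, String text) {
        internalEntities.putIfAbsent(name, text); // first declaration wins
    }

    void addExternalEntity(String name, String systemId) {
        externalEntities.putIfAbsent(name, systemId); // first declaration wins
    }

    /** Starts a named entity by pushing its reader onto the reader stack. */
    void startEntity(String name) throws IOException, SAXException {
        if (openNames.contains(name)) {
            throw new SAXException("recursive entity reference: " + name);
        }
        Reader reader;
        if (internalEntities.containsKey(name)) {
            reader = new StringReader(internalEntities.get(name));
        } else if (externalEntities.containsKey(name)) {
            reader = resolve(null, externalEntities.get(name));
        } else {
            throw new SAXException("undeclared entity: " + name);
        }
        openNames.push(name);
        readerStack.push(reader);
    }

    /** Ends the innermost entity and closes its reader. */
    void endEntity() throws IOException {
        openNames.pop();
        readerStack.pop().close();
    }

    /** Returns the reader of the innermost open entity, or null if none is open. */
    Reader currentReader() {
        return readerStack.peek();
    }

    private Reader resolve(String publicId, String systemId)
            throws IOException, SAXException {
        InputSource source =
                (resolver != null) ? resolver.resolveEntity(publicId, systemId) : null;
        if (source != null && source.getCharacterStream() != null) {
            return source.getCharacterStream();
        }
        // Default resolution (opening the system identifier) is omitted here.
        throw new IOException("cannot resolve " + systemId);
    }
}
```

In this sketch the reader stack is what lets a reference inside one entity temporarily redirect scanning to another entity and then resume where it left off once endEntity is called.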

Entity Scanner

The entity scanner is responsible for scanning "primitive" XML structures from an entity and for reporting the parse location. The XMLEntityScanner class contains methods to peek at the current character, scan names and content, and so on.

There is only one entity scanner per entity manager. The entity scanner works directly with the entity manager in order to read from the underlying character readers. This makes scanning of the entities transparent to the caller. Changing readers, auto-detecting encodings from input streams, and buffering are all done "under the covers" and do not affect how the caller interacts with the entity scanner.
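To make the idea of scanning "primitive" structures concrete, here is a minimal sketch of a scanner with a one-character lookahead that tracks the parse location. The class SketchEntityScanner and its method names are assumptions made for illustration and are not the XMLEntityScanner API. The name rules are an ASCII-only approximation of the XML name productions, line-ending normalization is ignored, and the sketch reads from a single reader rather than from the entity manager's readers.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical, simplified scanner; not the real XMLEntityScanner. */
class SketchEntityScanner {
    private final Reader reader;
    private int next = -2; // one-character lookahead; -2 means "not loaded yet"
    private int line = 1;
    private int column = 1;

    SketchEntityScanner(Reader reader) {
        this.reader = reader;
    }

    /** Returns the next character without consuming it, or -1 at end of input. */
    int peekChar() throws IOException {
        if (next == -2) {
            next = reader.read();
        }
        return next;
    }

    /** Consumes and returns the next character, updating the parse location. */
    int scanChar() throws IOException {
        int c = peekChar();
        if (c == -1) {
            return -1;
        }
        next = -2;
        if (c == '\n') {
            line++;
            column = 1;
        } else {
            column++;
        }
        return c;
    }

    /** Consumes the expected character if it is next; otherwise consumes nothing. */
    boolean skipChar(char expected) throws IOException {
        if (peekChar() != expected) {
            return false;
        }
        scanChar();
        return true;
    }

    /** Scans a name, or returns null if no name starts at the current position. */
    String scanName() throws IOException {
        if (!isNameStart(peekChar())) {
            return null;
        }
        StringBuilder name = new StringBuilder();
        while (isNameChar(peekChar())) {
            name.append((char) scanChar());
        }
        return name.toString();
    }

    int getLineNumber() {
        return line;
    }

    int getColumnNumber() {
        return column;
    }

    // ASCII-only approximation of the XML NameStartChar production.
    private static boolean isNameStart(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_' || c == ':';
    }

    // ASCII-only approximation of the XML NameChar production.
    private static boolean isNameChar(int c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }
}
```

A caller such as a document scanner would use it along the lines of `if (scanner.skipChar('<')) { String name = scanner.scanName(); ... }`, without knowing which reader or encoding sits underneath.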

If both the entity manager and entity scanner are singletons per parser instance, why aren't they a single object? The manager and scanner could be a single object but they are separate in order to have a cleaner separation of functionality and API. Even though they are separate, they share common data, as shown in the following diagram.

[Diagram: the Entity Manager and the Entity Scanner, with the following items listed between them]

  • entity resolver
  • reader stack
  • entity handler

Notes

It is expected that the entity management and readers will need to be re-evaluated as the Xerces 2 concept is implemented. The way entities are read directly affects the performance of the parser, and while performance isn't an initial requirement, it is important.

Open Issues

There are currently some open issues. [Note: these should move to implementation issues.]

Entity Encoding
When an entity that is read from an input stream is started, the encoding must first be auto-detected. Then, as the appropriate scanner parses the XMLDecl or TextDecl line, a new encoding must be set on the entity scanner. An API must be created in order to set the encoding. However, the work of swapping out the character reader is done transparently to the caller.

Open Readers
Who closes open readers? The parser should close all readers that it created but should not close any readers that are passed to the parser via the parse(InputSource) method. And at what point are the readers closed in the case of an unrecoverable error?
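For the Entity Encoding issue, the auto-detection step could look roughly like the following sketch, which follows the general approach of the non-normative appendix on encoding auto-detection in the XML 1.0 specification. The class EncodingSketch and its methods are hypothetical. The sketch recognizes only UTF-8 and UTF-16 (with or without a byte order mark), falls back to UTF-8 otherwise, and ignores UCS-4 and EBCDIC. It does not show the second step, switching the encoding after the XMLDecl or TextDecl has been scanned.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

/** Hypothetical helper; the class and method names are not taken from Xerces. */
class EncodingSketch {

    /** Guesses an encoding name from up to four leading bytes of an entity. */
    static String detect(byte[] b, int count) {
        int b0 = count > 0 ? b[0] & 0xFF : -1;
        int b1 = count > 1 ? b[1] & 0xFF : -1;
        int b2 = count > 2 ? b[2] & 0xFF : -1;
        int b3 = count > 3 ? b[3] & 0xFF : -1;
        // Byte order marks.
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        // No byte order mark: look for "<?" encoded as two 16-bit units.
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        // Default assumed by this sketch when nothing else matches.
        return "UTF-8";
    }

    /** Returns the length in bytes of a leading byte order mark, or 0 if none. */
    static int bomLength(byte[] b, int count) {
        int b0 = count > 0 ? b[0] & 0xFF : -1;
        int b1 = count > 1 ? b[1] & 0xFF : -1;
        int b2 = count > 2 ? b[2] & 0xFF : -1;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return 3;
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) return 2;
        return 0;
    }

    /** Wraps the stream in a character reader using the detected encoding. */
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int count = 0;
        while (count < 4) {
            int n = pushback.read(head, count, 4 - count);
            if (n < 0) {
                break; // entity is shorter than four bytes
            }
            count += n;
        }
        String encoding = detect(head, count);
        // Push the peeked bytes back, minus any byte order mark, so the
        // reader starts at the first real character of the entity.
        int bom = bomLength(head, count);
        pushback.unread(head, bom, count - bom);
        return new InputStreamReader(pushback, encoding);
    }
}
```

Because InputStreamReader buffers bytes internally, changing the encoding after the declaration has been read would need additional care (for example, re-reading from a buffered copy of the bytes); that part of the issue remains open in this sketch.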

