An XML document is composed of various entities, each of which can be encoded using a different character encoding. The document instance is known as the document entity, whereas we'll call the DTD the DTD entity. In addition, general entities and parameter entities act as macros for inserting fragments into the parse stream when the entity is referenced in the document and DTD, respectively.
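To make the macro behavior concrete, here is a minimal sketch that uses the standard JAXP DOM API (not the internal design described in this document). The class name `EntityMacroDemo`, the entity names `decl` and `author`, and the sample text are all made up for illustration. A parameter entity is referenced inside the DTD to inject a declaration, and a general entity is referenced in the document content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; only the JAXP/DOM calls are real API.
class EntityMacroDemo {
    // The parameter entity %decl; expands, inside the DTD, to the declaration
    // of the general entity &author;, which is then referenced in the content.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE note [\n"
      + "  <!ENTITY % decl \"<!ENTITY author 'Jane Doe'>\">\n"
      + "  %decl;\n"
      + "]>\n"
      + "<note>Written by &author;.</note>";

    /** Parses XML and returns the root element's text after entity expansion. */
    static String expandedText() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedText());
    }
}
```

The caller only ever sees the expanded text; where each fragment came from is the parser's concern.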
There must be a way to declare, locate, and read entities in their respective character encoding. The entity manager handles locating entities and obtaining an entity scanner capable of scanning the entity content. Depending on the character encoding, there may be custom readers for performance reasons. Regardless of the character encoding, though, the interface to scan the underlying content must be consistent and simple.
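As an illustration of reading an entity in its own encoding, the following sketch guesses an encoding from the first bytes of an entity, loosely following the auto-detection hints in Appendix F of the XML 1.0 Recommendation. The class `EncodingSniffer` and its `guess` method are hypothetical names, and the sketch deliberately ignores UTF-32, EBCDIC, and the encoding declaration itself, so treat it as a starting point rather than a complete detector.

```java
// Hypothetical helper; not part of any parser API.
class EncodingSniffer {
    /** Guesses an encoding name from the leading bytes; falls back to UTF-8. */
    static String guess(byte[] b) {
        // Read up to four unsigned bytes; -1 marks "not present".
        int b0 = b.length > 0 ? b[0] & 0xFF : -1;
        int b1 = b.length > 1 ? b[1] & 0xFF : -1;
        int b2 = b.length > 2 ? b[2] & 0xFF : -1;
        int b3 = b.length > 3 ? b[3] & 0xFF : -1;

        // Byte order marks.
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";

        // No byte order mark: look for "<?" encoded as 16-bit code units.
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";

        // Anything else is assumed to be UTF-8 (or an ASCII-compatible encoding
        // that an encoding declaration would name; not handled in this sketch).
        return "UTF-8";
    }
}
```

Whatever the guess, the caller would then wrap the stream in a reader for that encoding, so the scanning interface above it stays the same.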
The complexity of the original Xerces code resulted, in large part, from the readers and entity management. The entity readers were defined with a large set of methods so that read operations could be optimized for each reader and character transcoding could be deferred. However, this meant that every reader had to implement all of the methods separately, which introduced more chances for bugs and made the system harder to understand.
Crimson took a simpler approach to reading entities. There is only one reader class, which delegates the read calls to a few optimized input stream readers. And without attempting to defer character encoding, the code path is greatly simplified. But it's not all roses. If you look deep enough in the code, you'll find that the entity management code is somewhat complex because of the nature of XML entities.
Before designing the entity management, a few assumptions were made:
The entity manager is a core component in any parser configuration and there is only one entity manager per parser instance. Some of the responsibilities of the entity manager are:
The XMLEntityManager class implements the entity management in the parser. This class contains methods for registering general and parameter entities; resolving entities either by default or by using the SAX EntityResolver registered by the user; and starting named and unnamed entities. The various XML scanners obtain the entity scanner by calling getEntityScanner on the entity manager.
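From the user's side, resolution through a registered SAX EntityResolver might look like the sketch below. It uses only the public JAXP and SAX interfaces, not XMLEntityManager itself; the class name `ResolverDemo`, the system identifier, and the replacement text are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical demo class; the system identifier below is made up.
class ResolverDemo {
    static final String XML =
        "<!DOCTYPE doc [\n"
      + "  <!ENTITY greeting SYSTEM \"http://example.invalid/greeting.ent\">\n"
      + "]>\n"
      + "<doc>&greeting;</doc>";

    /** Parses XML with a user resolver and returns the root element's text. */
    static String parseWithResolver() {
        // Supply the entity content from memory instead of fetching the URL.
        // Returning null would ask the parser to fall back to its default lookup.
        EntityResolver resolver = (publicId, systemId) ->
            systemId != null && systemId.endsWith("greeting.ent")
                ? new InputSource(new StringReader("Hello from the resolver"))
                : null;
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(resolver);
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithResolver());
    }
}
```

The resolver decides where the bytes or characters come from; starting the entity and scanning it remain the parser's job.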
The entity scanner is responsible for scanning "primitive" XML structure from an entity and reporting the parse location. The XMLEntityScanner class contains methods to peek at the current character, scan names and content, and so on.
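The following toy scanner sketches what such primitive operations could look like. It is not the XMLEntityScanner class: the class name `SketchEntityScanner` and every method signature here are assumptions made for illustration, the input is a plain string, and the name rules are a rough approximation of the XML Name production.

```java
// Hypothetical scanner over an in-memory string; signatures are illustrative only.
class SketchEntityScanner {
    private final String content;
    private int pos = 0;
    private int line = 1;
    private int column = 1;

    SketchEntityScanner(String content) {
        this.content = content;
    }

    /** Returns the next character without consuming it, or -1 at end of entity. */
    int peekChar() {
        return pos < content.length() ? content.charAt(pos) : -1;
    }

    /** Consumes and returns the next character, updating the parse location. */
    int scanChar() {
        int c = peekChar();
        if (c == -1) return -1;
        pos++;
        if (c == '\n') {
            line++;
            column = 1;
        } else {
            column++;
        }
        return c;
    }

    /** Scans a name starting at the current position, or returns null if none starts here. */
    String scanName() {
        int c = peekChar();
        boolean startsName = c != -1 && (Character.isLetter(c) || c == '_' || c == ':');
        if (!startsName) return null;
        StringBuilder name = new StringBuilder();
        while (isNameChar(peekChar())) {
            name.append((char) scanChar());
        }
        return name.toString();
    }

    // Simplified name-character test; the real XML rules are broader and stricter.
    private static boolean isNameChar(int c) {
        return c != -1
            && (Character.isLetterOrDigit(c) || c == '_' || c == ':' || c == '-' || c == '.');
    }

    int getLineNumber() { return line; }

    int getColumnNumber() { return column; }
}
```

A higher-level scanner would combine these calls, for example consuming `<` with `scanChar` and then reading the element name with `scanName`, and would use the line and column values when reporting errors.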
There is only one entity scanner per entity manager. The entity scanner works directly with the entity manager in order to read from the underlying character readers. This makes scanning of the entities transparent to the caller. Changing readers, auto-detecting encodings from input streams, and buffering are done "under the covers" and do not affect how the caller interacts with the entity scanner.
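One way to picture this transparency is a stack of open entities owned by the manager, with the single scanner always reading from the top of the stack. The sketch below is an assumption about how that could be arranged, not the actual implementation; all class and method names are hypothetical, entities are plain strings, and there is no check for recursive entity references.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a manager and scanner sharing a stack of open entities.
class TransparentScanDemo {
    /** One open entity: its text and the current read offset. */
    static final class OpenEntity {
        final String text;
        int pos;
        OpenEntity(String text) { this.text = text; }
    }

    /** Owns entity declarations and the stack of currently open entities. */
    static final class Manager {
        private final Map<String, String> declared = new HashMap<>();
        final Deque<OpenEntity> open = new ArrayDeque<>();
        private final EntityScanner scanner = new EntityScanner(this);

        void addEntity(String name, String text) { declared.put(name, text); }

        /** Starts an unnamed entity, such as the document itself. */
        void startEntity(OpenEntity entity) { open.push(entity); }

        /** Starts a named, previously declared entity. */
        void startEntity(String name) {
            String text = declared.get(name);
            if (text == null) throw new IllegalArgumentException("undeclared entity: " + name);
            open.push(new OpenEntity(text));
        }

        EntityScanner getEntityScanner() { return scanner; }
    }

    /** Reads from whichever entity is on top of the manager's stack. */
    static final class EntityScanner {
        private final Manager manager;
        EntityScanner(Manager manager) { this.manager = manager; }

        int scanChar() {
            // Drop exhausted entities so the caller never sees an entity boundary.
            while (!manager.open.isEmpty()
                    && manager.open.peek().pos >= manager.open.peek().text.length()) {
                manager.open.pop();
            }
            if (manager.open.isEmpty()) return -1;
            OpenEntity top = manager.open.peek();
            return top.text.charAt(top.pos++);
        }
    }

    /** Reads a document, expanding &name; references, and returns what the caller saw. */
    static String readAll(Manager manager, String document) {
        manager.startEntity(new OpenEntity(document));
        EntityScanner scanner = manager.getEntityScanner();
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = scanner.scanChar()) != -1) {
            if (c == '&') {
                StringBuilder name = new StringBuilder();
                while ((c = scanner.scanChar()) != ';' && c != -1) name.append((char) c);
                manager.startEntity(name.toString());
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }
}
```

For example, after `addEntity("who", "world")`, calling `readAll(manager, "hello &who;!")` yields `hello world!`: the loop in `readAll` keeps calling `scanChar` on the same scanner while the manager switches the underlying source.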
If both the entity manager and entity scanner are singletons per parser instance, why aren't they a single object? The manager and scanner could be a single object but they are separate in order to have a cleaner separation of functionality and API. Even though they are separate, they share common data, as shown in the following diagram.
[Diagram: data shared between the entity manager and the entity scanner]
It is expected that the entity management and readers will need to be re-evaluated as the Xerces 2 concept is implemented. The operation of reading entities directly impacts the performance of the parser, and while performance isn't an initial requirement, it is important.
There are currently some open issues. [Note: these should move to implementation issues.]
parse(InputSource) method. And at what time are the readers closed in the case of an unrecoverable error?