Evaluation of Xerces Code

Overview

Historically, the Xerces code was developed to be the fastest XML parser on the planet. This goal drove the design decisions and caused the parser to be written from the inside out. While this produced an extremely fast XML parser with DTD validation, the overall design suffered. Xerces developers have found the code difficult to understand, debug, and extend. Hence the Xerces2 effort.

The Good

Standards: The Xerces parser is an extremely complete XML parser. Besides conforming to the XML and Namespace specifications, it offers support for SAX 1 and 2; DOM Level 1 and 2; and most of the latest working draft of XML Schema.
Modularity: The current Xerces code makes a decent attempt at modularity by defining a set of interfaces between components of the parser, such as the scanner and the validator. The parser is designed as a pipeline of components. This is a good idea whose implementation was complicated by performance considerations and feature creep.
Validation: Xerces is able to validate documents with grammars specified in DTD and XML Schema syntax. All validation is performed by a universal validator that can validate the union of features found in both syntaxes. This enables the parser to handle current and future grammars in a consistent way.
Performance: The Xerces parser has always performed well. Implementation of XML Schema has caused the performance to slip, but this is to be expected: you can't do a lot more work per element without incurring a performance penalty.
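
The pipeline idea described under Modularity can be illustrated with a minimal sketch. Every name here (PipelineSketch, DocumentHandler, Validator, Recorder, scan) is hypothetical and chosen only for illustration; none of it is the actual Xerces interface. The scan method stands in for a scanner by emitting a fixed sequence of events instead of reading XML, and the validator only checks element names against a declared set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; this is not the Xerces API.
final class PipelineSketch {

    /** Events that flow from one pipeline stage to the next. */
    interface DocumentHandler {
        void startElement(String name);
        void characters(String text);
        void endElement(String name);
    }

    /** Middle stage: checks element names, then forwards every event. */
    static class Validator implements DocumentHandler {
        private final Set<String> declared;
        private final DocumentHandler next;
        final List<String> errors = new ArrayList<>();

        Validator(Set<String> declared, DocumentHandler next) {
            this.declared = declared;
            this.next = next;
        }

        public void startElement(String name) {
            if (!declared.contains(name)) {
                errors.add("undeclared element: " + name);
            }
            next.startElement(name);
        }

        public void characters(String text) {
            next.characters(text);
        }

        public void endElement(String name) {
            next.endElement(name);
        }
    }

    /** Final stage: records the events it receives. */
    static class Recorder implements DocumentHandler {
        final List<String> log = new ArrayList<>();

        public void startElement(String name) { log.add("start " + name); }
        public void characters(String text)   { log.add("text " + text); }
        public void endElement(String name)   { log.add("end " + name); }
    }

    /** Stands in for the scanner: emits fixed events rather than reading XML. */
    static void scan(DocumentHandler out) {
        out.startElement("doc");
        out.characters("hello");
        out.startElement("extra");
        out.endElement("extra");
        out.endElement("doc");
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        Set<String> declared = new HashSet<>(Arrays.asList("doc"));
        Validator validator = new Validator(declared, recorder);
        scan(validator); // scanner -> validator -> recorder
        System.out.println(recorder.log);
        System.out.println(validator.errors);
    }
}
```

Because each stage sees only the handler interface, a stage such as the validator can in principle be removed or replaced without touching the scanner. That is the property the pipeline design aims for.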

The Bad

Size: The parser is too big, but not all of the size comes from the code required to parse XML files. Many contributed features have been rolled into the Xerces jar file, for example the HTML and WML DOM implementations and the document serializers. It would be nice to find a way to package these features into separate distributable jars.
Simplicity: The code needs to be simplified. Much of the complexity of the Xerces parser lies in the entity readers and in the use of the string pool throughout the system.
Documentation: There is little to no documentation of the Xerces code, and the javadoc comments are frequently missing or incorrect. More effort must be made in Xerces2 to ensure that everything is well documented.
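
To make the string pool point under Simplicity concrete, here is a minimal sketch of the general technique, assuming a pool that hands out integer handles for strings. The class and method names (StringPoolSketch, addSymbol, lookup) are hypothetical and do not reproduce the real Xerces string pool. The sketch suggests one way such a design can spread complexity: any component that receives a handle also needs access to the pool in order to turn it back into text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a handle-based string pool; not the Xerces class.
final class StringPoolSketch {
    private final Map<String, Integer> handles = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    /** Returns the handle for s, adding s to the pool the first time it is seen. */
    int addSymbol(String s) {
        Integer existing = handles.get(s);
        if (existing != null) {
            return existing;
        }
        int handle = strings.size();
        strings.add(s);
        handles.put(s, handle);
        return handle;
    }

    /** Resolves a handle back to its string. */
    String lookup(int handle) {
        return strings.get(handle);
    }

    public static void main(String[] args) {
        StringPoolSketch pool = new StringPoolSketch();
        int first = pool.addSymbol("title");
        int second = pool.addSymbol("title");
        int other = pool.addSymbol("body");
        // Equal names share one handle, so comparing names is an int comparison.
        System.out.println(first == second);
        // Callers holding only a handle must go back to the pool for the text.
        System.out.println(pool.lookup(other));
    }
}
```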


Author: Andy Clark
Last modified: $Date$