Evaluation of Crimson Code

Overview

The Crimson code donated to the XML Apache Project by Sun Microsystems is a relatively clean and straightforward implementation of a conforming XML parser. However, there are some serious drawbacks to its design that hamper its use in the Xerces2 effort. This page will highlight some of the problems that I see with the Crimson code. This doesn't mean, however, that there aren't good ideas in Crimson! I'll highlight some of the things that I like about Crimson as well.

The Good

Size: Crimson has a small code footprint.
Simplicity: The code is very straightforward and easy to grok. I especially like the simple approach to reading the input streams. The advanced reader code in Xerces has been a continual source of bugs and developer confusion. See my evaluation of Xerces for more detail.

The Bad

Standards: Crimson lacks implementations of important standards, such as DOM Level 2 and XML Schema.
Modularity: The design of Crimson is not modular enough to be of general use in a wide variety of applications. For example, the document and DTD scanning code is hard-coded into the parser. Also, a lot of the classes used by the parser rely on package visibility of members. (Yuck!)
Validation: The validation engine is rather simplistic and not very fast. Plus, it doesn't seem to be able to handle the advanced validation requirements of XML Schema.
Performance: The general performance of the Crimson code is good, but there are some areas where it can (and should) be tuned. First, validation is not as fast as it could be, although there are comments in the code that suggest "compiling" the model into a DFA for faster validation. Also, the DOM implementation wastes a lot of memory when traversing the document.
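
To make the DFA suggestion concrete, here is a minimal sketch of a table-driven matcher for one element's content model. It is not taken from Crimson or Xerces: the class name ContentModelDfa, its methods, and the example model (title, para*) are all hypothetical, and the DFA is built by hand rather than compiled from a DTD. The point is only that, once the table exists, validating each child element costs a single lookup.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical table-driven DFA for one element's content model (not Crimson code). */
final class ContentModelDfa {
    // transitions.get(state).get(childName) gives the next state, if any.
    private final Map<Integer, Map<String, Integer>> transitions = new HashMap<>();
    private final Set<Integer> accepting = new HashSet<>();

    ContentModelDfa addTransition(int from, String child, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(child, to);
        return this;
    }

    ContentModelDfa addAccepting(int state) {
        accepting.add(state);
        return this;
    }

    /** Returns true if the sequence of child element names matches the model. */
    boolean matches(List<String> children) {
        int state = 0; // state 0 is the start state by convention in this sketch
        for (String child : children) {
            Map<String, Integer> row = transitions.get(state);
            Integer next = (row == null) ? null : row.get(child);
            if (next == null) {
                return false; // no transition: this child is not allowed here
            }
            state = next;
        }
        // The children are valid only if we stop in an accepting state.
        return accepting.contains(state);
    }

    /** Hand-built DFA for the example content model (title, para*). */
    static ContentModelDfa titleThenParas() {
        return new ContentModelDfa()
            .addTransition(0, "title", 1)
            .addTransition(1, "para", 1)
            .addAccepting(1);
    }
}
```

With this table, a "title" followed by any number of "para" children is accepted, while an empty child list or a leading "para" is rejected. A real validator would generate the table from the parsed content model instead of building it by hand.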


Author: Andy Clark
Last modified: $Date$