Jena2 Database Interface - Performance Notes

28 August, 2003

Very little tuning or performance analysis has been done. However, a simple performance comparison of Jena1 and Jena2 on both memory and persistent models was performed using the jena-perf benchmark (available from http://sourceforge.net/projects/jena/). This showed Jena2 to be up to three times faster than Jena1 (see PerfTest). However, Jena2 models may consume more disk space than Jena1 models. This is due to the different table layouts used in Jena1 and Jena2 (see PerfTestSpace). In Jena1, all resources and literals were stored in separate tables that were referenced from the statement table. This greatly reduced space consumption since resources and literals that appeared in multiple statements were only stored once. However, retrieving a statement required joining the statement table with the literals and resources tables.

In Jena2, short resources and short literals are stored directly in the statement table. Separate tables for resources and literals exist but are only used for long objects whose length exceeds some threshold. The threshold has a database-dependent default but may be redefined by users (see LongObjectLength). Consequently, Jena2 requires fewer join operations than Jena1 at the expense of greater memory consumption. Reducing the value of LongObjectLength can reduce disk space consumption. Jena2 offers another mechanism to reduce disk space consumption. Often, resource URIs have common prefixes. By default, URIs are stored fully expanded in statement tables. However, the DoCompressURI option can be used to compress URIs by storing common prefixes in a separate table and encoding the URI in a statement table with a reference to the prefix table.

For loading large databases, two techniques can be used to reduce load time. First, if the dataset to be loaded contains no duplicate statements, the option DoDuplicateCheck  may be used to disable checking for duplicate statements on insertion. Another option is to manually delete the database indexes and then recreate them after the loading has completed. Unfortunately, Jena2 does not currently support a utility to do this on behalf of users. We cannot provide any guidance on the benefit of deleting and recreating the indexes and mention it only as an option users may wish to try if load performance cannot be improved through other techniques.

Effect of Database Options on Loading

A simple study was conducted to determine the effect of some of the database options on database space and load time. A synthetic test data set consisting of 1000 Dublin Core documents was loaded under various configurations. This data set contains 22,000 RDF statements with no duplicate statements. Some property values are long. The tests were run using MySQL on WindowsXP. For this data set, the table below shows how the various option settings enable trade-offs between space and time. The results should not be considered representative because the impact of these option settings is dependent on the characteristics of the data set. Although this table only shows individual option settings, these options may be set in any combination.

Option Settings Database Space Load Time Comments
Default option settings (long object length = 250 bytes, URI compression disabled, duplicate checked enabled) 16.38 MB 29.5 sec. Default long object length of 250, dup checking enabled, no compression.
Duplicate checking disabled 16.38 MB 16.3 sec. Use only if no duplicates in load set.
Long object length = 25 bytes 9.62 MB 58.7 sec. Heavy use of literals and resources table.
Long object length = 50 bytes 13.40 MB 35.3 sec. Moderate increased time and reduced space.
Long object length = 100 bytes 15.50 MB 29.5 sec. Similar to default case.
URI compression enabled, length = 20 bytes, cache size = 100 9.01 MB 24.0 sec. Use of memory cache for URIs is beneficial.

Some preliminary experiments were done to determine the impact of indexes during database loading. It was speculated that dropping the indexes prior to loading and recreating them afterward would be faster than loading with indexes. The thinking was that database engines (some of them, at least) have good algorithms for creating indexes as a batch operation so that creating an entire index at once would be significantly less expensive than updating the indexes for each statement added.

However, our initial experiments with MySQL showed no significant benefit to dropping and recreating the indexes. The load without indexes was significantly faster than loading with indexes. However the overhead and dropping the index (there was significant overhead to dropping the index when the table was already populated) and recreating the index approximately equaled the savings. We will continue these experiments with different engines and database configurations but for now we can provide no recommendations.

Reification Performance

The Jena2 persistent reification code is new and typical application usage patterns are not yet known. However, based on preliminary testing, we can make some recommendations for improved application performance.

Recall that Jena supports two mechanisms for adding a reified statement (see the Reification HowTo). The RDF standard represents a reified statement as four individual statements and these statements may be added individually to a model. Jena will internally convert these back to the original reified statement. Alternatively, Jena supports a createReifiedStatement method (on the Model class) that adds a complete reified statement with one call. We recommend using this method for persistent models. It is significantly faster to add reified statements using createReifiedStatement rather than adding the four statements individually. If one is using the Graph interface rather than Model, the Reifier interface for graphs supports a method similar to createReifiedStatement. Adding a fully reified statement using these optimized methods takes approximately the same amount of time as adding a single non-reified statement - it is four times faster than individually adding the four reification statements.

When creating a new Model, Jena2 now provides several different reification styles. For persistent models, we recommend using the Standard reification style if you have a choice. There is very little performance difference between the styles. However, Standard is more powerful (supports queries over the reified statements), is consistent with the RDF standard (has no unexpected behaviour), uses no more space than the other styles and operates at similar speeds.

If you have a mix of reified and non-reified statements it may be convenient to simply create two models, using one for just the reified statements. This allows you to query while ignoring the reified statements and may give a small performance gain in future implementations. We are looking at optimizing the special cases of models with only reified statements, or only non-reified statements. In fact, Fastpath already supports this optimization for RDQL query processing as an option.