Improving the Performance of Jena Query over Memory Graph

Jeremy J. Carroll

This is a report on building a faster version of the in-memory graph and query function of Jena.

Sample Data Set and Query

The methodology used was to drill down on one specific query against one specific data set and make changes that were not too specific to that query, but would have general impact.

Sample Data

The data set was one of those offered in the Jena vs Sesame performance comparison thread. However, some of the tests used smaller data sets containing just the subset of the data that is touched by the query (excluding data that is navigated, but passed over, during hash table look-ups).

This data consisted of 2600 instances of the following triples:

_:r rdf:type uddi:BusinessService .
_:r uddi:hasServiceKey _:k .
_:r uddi:hasName _:b .
_:b rdf:type uddi:NameBag .
_:b rdf:_2 "Weijian's Business' Service"+i+"x" .
_:r uddi:hasBusinessEntity _:d .
_:r uddi:hasDescription _:e .
_:r uddi:hasBindingTemplate _:f .
_:r uddi:hasBindingTemplate _:g .

The blank nodes are different for each one of the instances, and the literal string depends on the number i of the instance (between 0 and 2599). The prefix uddi is bound to http://www.mygrid.ecs.soton.ac.uk/uddiv2.rdfs#.
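For concreteness, one instance of this nine-triple pattern could be generated with the Jena Model API roughly as follows. This is a sketch only: the helper method is mine, and the property URIs are simply built from the uddi namespace above.

import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.RDF;

// Sketch: build one instance of the nine-triple pattern for index i.
static void addInstance( Model m, int i )
   {
   String uddi = "http://www.mygrid.ecs.soton.ac.uk/uddiv2.rdfs#";
   Resource r = m.createResource();   // _:r
   Resource b = m.createResource();   // _:b (the name bag)
   r.addProperty( RDF.type, m.createResource( uddi + "BusinessService" ) );
   r.addProperty( m.createProperty( uddi + "hasServiceKey" ), m.createResource() );
   r.addProperty( m.createProperty( uddi + "hasName" ), b );
   b.addProperty( RDF.type, m.createResource( uddi + "NameBag" ) );
   b.addProperty( RDF.li( 2 ), "Weijian's Business' Service" + i + "x" );
   r.addProperty( m.createProperty( uddi + "hasBusinessEntity" ), m.createResource() );
   r.addProperty( m.createProperty( uddi + "hasDescription" ), m.createResource() );
   r.addProperty( m.createProperty( uddi + "hasBindingTemplate" ), m.createResource() );
   r.addProperty( m.createProperty( uddi + "hasBindingTemplate" ), m.createResource() );
   }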

This small data set had a total of 23400 triples; the data from the test file had 358800 triples. Later in the investigation, to understand better why the performance of the two cases differed, three further data sets were considered. Each was generated in the same way as the small data set, but with a further 7*2600 copies of the following 19 triples added:

_:x eg:(a+i) _:y .
   i = 1, 2, ... 19

where, for example, when i = 5 eg:(a+i) represents the URI eg:a5.

These 345800 triples were added to the model either before the other triples, or after the other triples, or with the other triples (i.e. 133 of these eg:ai triples with each instance of the significant data triples).

Sample Query

The sample query was:

SELECT ?key WHERE
    (?businessService0 rdf:type uddi:BusinessService)
    (?businessService0 uddi:hasName ?nameBag3)
    (?nameBag3 ?v_7 "Weijian's Business' Service42x")
    (?businessService0 uddi:hasServiceKey ?key)
USING rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
      uddi FOR <http://www.mygrid.ecs.soton.ac.uk/uddiv2.rdfs#>

We explicitly excluded query reordering strategies, and concentrated on exploring the current strategy of matching each pattern in turn.

Hence, executing this query involves finding the 2600 matches of the first triple. For each of these, the second triple has one match. For exactly one of these, the third triple has a match. For that one, the fourth triple matches. Thus, a total of 5202 patterns are matched against the graph when executing the query.

When using the current Jena strategy of preferring a subject index to an object index, the second, third and fourth triples are matched using the subject index, and the first using the object index. Thus the first pattern accepts all 2600 triples in its index entry. The second pattern accepts 1 triple out of the 7 in its index entry (2600 times). The third pattern rejects both of the triples in its index entry (2599 times) and accepts one of the two (once). The fourth pattern accepts 1 triple out of 7 in its index entry. Thus the underlying next on the iterator over the set of triples in the index entry is called a total of 2600*(1+7+2)+7 = 26007 times, and the filter accepts 5202 of these.

This analysis involves the observation that when executing a query Graph.find does two logically separable things:

  1. Decide on an execution strategy given which of the S/P/O are concrete (or bound variables) as opposed to ANYs or unbound variables. This execution strategy also takes into account whether the graph has a non-empty reifier, and whether any of the triples stored there are relevant to the query or not.
  2. Use the current values of the bound variables to construct a specific filtering iterator.

We will call the first part find-planning, and the second part find-binding. We can then draw up the following table:

function | call count
query | 1
find-planning | 4
find-binding | 5202
triple collection next | 26007
unsuccessful filtering | 20805
successful filtering | 5202

Note that these numbers represent a logical minimum, not figures reported by a profiler. For example, the Jena 2.2 code has the find-planning and find-binding stages combined into a single find method; therefore, with Jena 2.2 the find-planning stage must be executed at least 5202 times, rather than the logical minimum of 4.
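To make the separation concrete, the two stages could be expressed with interfaces along the following lines. These are hypothetical names for exposition, not the Jena 2.2 API, which combines both stages inside Graph.find.

import java.util.Iterator;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.query.Domain;

// Sketch of the logical split; FindPlanner and FindPlan are made-up names.
interface FindPlanner
   {
   // find-planning: done once per triple pattern in the query, choosing
   // which index to use and how to treat any reified triples.
   FindPlan plan( Node s, Node p, Node o );   // nodes may be ANY or variables
   }

interface FindPlan
   {
   // find-binding: done once per set of current variable bindings,
   // returning a filtering iterator over the chosen index entry.
   Iterator bind( Domain currentBindings );
   }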

Changes made

Test Results

Various performance tests from the Jena performance test suite were run. The Sesame comparison queries, other than those involving reading the file data, were reduced from 2600 instances to 1000 instances, since the machine I am using (a Windows XP box, 512M, Java 1.5.02) started thrashing on the bigger tests. The numbers for the Sesame file tests should hence be regarded with suspicion.

The Jena versions used were:

The comparisons between Jena 2.2 and the optimized version indicate the possible improvement.

The comparison between the optimized version and the various other variants give an indication of the contribution of each optimization to the improvement.

Separate files provide:

The column headings refer to different variants of Jena being tested; these will be discussed below. The quadhash and quadhash2 columns are both the same system, and can be compared to give some idea as to the variance of the test harness (+/- 5%).

The quadhash variant uses all the optimizations, except that it still supports remove on the iterator returned from Graph.find.

Remove on find results

It would be possible to have two separate Graph.finds, one of which did not support remove. The one without remove can go a bit faster, as shown by the 85% figure for the Sesame Query.
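A minimal sketch of what the cheaper, remove-free result could look like, assuming the index entry is held as a plain Triple array as described later (the class name is hypothetical):

import java.util.Iterator;
import java.util.NoSuchElementException;
import com.hp.hpl.jena.graph.Triple;

// Sketch: an iterator straight over an index entry's triple array.
// Because remove() is simply refused, no link back to the owning
// bucket (needed to delete a triple from the store) is carried around.
final class ReadOnlyTripleIterator implements Iterator
   {
   private final Triple[] triples;
   private int next = 0;

   ReadOnlyTripleIterator( Triple[] triples ) { this.triples = triples; }

   public boolean hasNext() { return next < triples.length; }

   public Object next()
      {
      if (next >= triples.length) throw new NoSuchElementException();
      return triples[next++];
      }

   public void remove() { throw new UnsupportedOperationException(); }
   }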

Hashing Costs in Node to Triple Map

The Sesame query tests are parameterized:

small
Only the triples that are actually looked at by any find in the query are in the model.
file
All the triples from the file are included. The triples looked at by the find are 2.6 times as many as for the other tests, which were made smaller because of the machine limitations mentioned above. The (file) figures are unreliable.
early additional triples
Many irrelevant triples were stuffed into the model, before the relevant triples were added.
mid additional triples
Many irrelevant triples were stuffed into the model, while the relevant triples were added.
late additional triples
Many irrelevant triples were stuffed into the model, after the relevant triples were added.

The code execution paths for all four of the non-file tests are identical, except for the hash lookup from the node-to-triple map.

Since a hash lookup is meant to be a constant-time operation, theoretically all four rows of the table should have identical results.

Instead we see that adding more triples, either before or while adding the triples that matter, makes the query slower by roughly 200-300 milliseconds. This cost is roughly the same regardless of the base time shown for the small tests.

Times are in milliseconds.

test | jena 2.2 | quadhash | quadhash2 | Jena CVS July 11 | my own chaining hash map | hashmap (java.util.HashMap)
Sesame Query - API listStatements (small) | 1975 | 850 | 821 | 1855 | 864 | 888
Sesame Query - API listStatements (early additional triples) | 1812 | 1090 | 1060 | 1765 | 1105 | 1113
Sesame Query - API listStatements (mid additional triples) | 2340 | 1056 | 1031 | 2169 | 1113 | 1239
Sesame Query - API listStatements (late additional triples) | 1870 | 891 | 881 | 1832 | 901 | 1013
Sesame Query - direct graph SPI query (small) | 1650 | 78 | 75 | 1669 | 71 | 80
Sesame Query - direct graph SPI query (early additional triples) | 1534 | 262 | 299 | 1635 | 320 | 344
Sesame Query - direct graph SPI query (mid additional triples) | 2203 | 274 | 269 | 2086 | 359 | 517
Sesame Query - direct graph SPI query (late additional triples) | 1647 | 83 | 79 | 1592 | 84 | 119
Sesame Query - API listStatements (file) | 6859 | 3199 | 3490 | 6569 | 3295 | 3254
Sesame Query - direct graph SPI query (file) | 6319 | 1436 | 1547 | 6049 | 1436 | 1549

The most even performance is shown by the quadratic probe hash map. Implementing my own chaining hash map was not significantly different from using java.util.HashMap. There may be other ways of addressing this. A plausible explanation for the figures is in terms of memory cache coherency and locality; it may be worth further work to investigate this issue. Since the slowdown from this issue alone is two to three times the cost of the fully optimized query, the percentage improvements shown by the early/mid/late and file tests are significantly smaller than those for the small query.
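For reference, a quadratic probe table resolves collisions by stepping through a single flat array with growing strides, rather than following a chain of separately allocated entries, which keeps probes close to the home slot. A minimal sketch of the lookup, assuming a power-of-two table that is never completely full:

// Sketch of a quadratic (triangular-number) probe sequence.
static int findSlot( Object[] keys, Object key )
   {
   int mask = keys.length - 1;                     // table size is a power of two
   int slot = key.hashCode() & mask;
   for (int i = 1; keys[slot] != null; i++)
      {
      if (keys[slot].equals( key )) return slot;   // found the key
      slot = (slot + i) & mask;                    // strides grow: +1, +2, +3, ...
      }
   return slot;                                    // empty slot: key absent
   }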

The Triple Tables

Jena 2.2 stores triples in HashSets, kept as the values of the node-to-triple maps. The relevant hash set is looked up in the node-to-triple map, and then iterated over. In the test data these hash sets were typically quite small, with 1, 2 or 7 items; the entry used by the first pattern of the query has 1000 items. Iterating over these is relatively expensive.

The optimized code instead uses an array within its own data structure, whose iterators do all the incrementing, filtering and side effects in a single operation, without indirection.
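A sketch of the idea (a hypothetical class, not the actual optimized code): the bucket is a plain Triple array, and the iterator filters on the predicate and object inline in its own next/hasNext code, so there is no FilterIterator layered on top and no extra indirection per element.

import java.util.Iterator;
import java.util.NoSuchElementException;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;

// Sketch: iterate a subject-index bucket, filtering on predicate and
// object inline rather than through a wrapping FilterIterator.
final class BucketIterator implements Iterator
   {
   private final Triple[] bucket;   // triples sharing one subject
   private final Node p, o;         // concrete nodes, or Node.ANY
   private int i = 0;
   private Triple pending;

   BucketIterator( Triple[] bucket, Node p, Node o )
      { this.bucket = bucket; this.p = p; this.o = o; }

   public boolean hasNext()
      {
      while (pending == null && i < bucket.length)
         {
         Triple t = bucket[i++];
         if (p.matches( t.getPredicate() ) && o.matches( t.getObject() ))
            pending = t;             // accept: the filtering happens right here
         }
      return pending != null;
      }

   public Object next()
      {
      if (!hasNext()) throw new NoSuchElementException();
      Triple t = pending; pending = null; return t;
      }

   public void remove() { throw new UnsupportedOperationException(); }
   }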

To explore the impact of different triple table strategies, a variant of the optimized code that wraps an ArrayList was tested. When going through the query engine, this wrapping is expensive (a factor of three). Using a HashSet instead of an ArrayList adds a comparable amount of time. I offer no explanation as to why the ArrayList is faster for the walking a chain test.

The wrapping itself is costly, as shown by the last column, in which the underlying triple store is an ArrayList, but any iterators are wrapped with a FilterIterator that accepts everything. Again the slowdown is noticeable in many of the tests, particularly the Sesame Query tests that exercise the inner loop of the find iteration.

test | quadhash | hashset | arraylist | wrapped arraylist
Sesame Query - API listStatements (small) | 100% | 149% | 109% | 120%
Sesame Query - SPI loops (small) | 100% | 157% | 101% | 132%
Sesame Query - direct graph SPI query (small) | 100% | 638% | 308% | 515%
Simple-RDQL (OneProperty*1000) | 100% | 110% | 102% | 106%
listStatements() (DER*1000) | 100% | 112% | 111% | 125%
listStatements() (DMOZ-1000) | 100% | 107% | 100% | 107%
loading/creating (Chain |1000|) | 100% | 123% | 113% | 113%
loading/creating (DER*1000) | 100% | 179% | 126% | 125%
loading/creating (DMOZ-1000) | 100% | 108% | 100% | 100%
loading/creating (OneProperty*1000) | 100% | 180% | 132% | 130%
walking a chain (Chain |1000|) | 100% | 112% | 83% | 93%

Wrapping in more detail

Jena iterators get wrapped a lot. When a filter iterator sits between two wrapping layers, the layer that the filter wraps (below it) does more work than the layer that wraps the filter (above it), because the filter reduces the number of items to be considered.

Thus the difference between the arraylist and the wrapped arraylist shows the impact of an extra layer of wrapping inside a find.

Two additional variants of the optimized code added one or two layers of wrapping to the find result. The difference between these and the quadhash base code shows the cost of one or two layers of wrapping at that point. This is still substantial, though much less than the cost of a layer of wrapping deeper in the iterator stack.

test | quadhash | arraylist | wrapped arraylist | wrapped find | doubly wrapped find
Sesame Query - API listStatements (small) | 100% | 109% | 120% | 103% | 105%
Sesame Query - SPI loops (small) | 100% | 101% | 132% | 114% | 131%
Sesame Query - direct graph SPI query (small) | 100% | 308% | 515% | 138% | 170%
Simple-RDQL (OneProperty*1000) | 100% | 102% | 106% | 109% | 104%
listStatements() (DER*1000) | 100% | 111% | 125% | 107% | 119%
listStatements() (DMOZ-1000) | 100% | 100% | 107% | 98% | 109%

Costs of the Wrapping Improvements

A significant part of the improvement in the optimized code comes from removing all wrapping. In part this is achieved by having specialized iterator classes for each type of filter on an S/P/O pattern. Each field may be a concrete node, a bound variable, an unbound variable, or ANY.

The code is different for each of these cases, giving 64 variations, which could be expressed as 64 subclasses. This avoids wrapping but adds a code maintenance cost. It is probably possible to find some trade-off in which some wrapping is retained to allow some generality in the code, while at least some of the current layers are peeled away.
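One way to organize such a dispatch, sketched here with made-up names, is to encode the kind of each field into a single index from 0 to 63 and use it to pick the specialized iterator class:

// Sketch: encode the kind of each S/P/O field (four possibilities each)
// into one index 0..63 used to select a specialized iterator class.
static final int CONCRETE = 0, BOUND_VAR = 1, UNBOUND_VAR = 2, ANY = 3;

static int caseIndex( int sKind, int pKind, int oKind )
   {
   return (sKind * 4 + pKind) * 4 + oKind;   // 4 * 4 * 4 = 64 cases
   }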

Useless microoptimizations

The optimized code allowed public access to the subj, pred and obj fields of Triple. Replacing this with final accessors or even non-final accessors made precious little difference.

Making Triple.hashCode or Triple.equals final does not seem to have any impact either.

Predicate Interning

The optimized code interns (using a WeakHashMap rather than String.intern()) the labels of nodes used in predicate position. This allows a fastIdentity operation to use == on the labels, for tests in the p position in find.

Since there are not very many predicates, the cost of this interning is fairly small, and it gives a significant improvement. Switching off the interning gives a 44% slowdown on the Sesame query, and a 21% slowdown on the simple RDQL query.
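A minimal sketch of such an intern pool (the class and method names here are mine, not Jena's): equal labels are canonicalized to a single String object, so the find inner loop can compare them with ==.

import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch: canonicalize predicate labels so that equal labels become
// the identical object, allowing == comparison in the p position.
final class PredicateLabels
   {
   private static final Map POOL = new WeakHashMap();

   static synchronized String intern( String label )
      {
      WeakReference ref = (WeakReference) POOL.get( label );
      String canonical = (ref == null) ? null : (String) ref.get();
      if (canonical == null)
         {
         canonical = label;
         POOL.put( label, new WeakReference( label ) );
         }
      return canonical;
      }

   // fastIdentity-style test: valid only for interned labels.
   static boolean sameLabel( String a, String b ) { return a == b; }
   }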

Node hashCode

The optimized code adds an extra field to Node in which the hashCode is computed by the constructor and kept.

This gives a 79% improvement on the sesame query, without significant slowdown elsewhere. (Hmmm, I seem not to have the figures, but recollect that this was also good for the cache coherency issue.)
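A sketch of the change (the class and field names here are illustrative only):

// Sketch: compute the hash once in the constructor and return the
// cached value, instead of recomputing it on every hash lookup.
abstract class CachedHashNode
   {
   protected final Object label;
   private final int cachedHash;

   CachedHashNode( Object label )
      {
      this.label = label;
      this.cachedHash = label.hashCode();   // computed once, then kept
      }

   public int hashCode() { return cachedHash; }
   }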

Triple Cache

The triple cache gets used during a find. This is useless, and costly. To show the cost, the optimized code switches the triple cache off completely. A different approach would be not to use the cache during find, but to use it at other times. Given that the optimized code for the small query goes faster with the triple cache on, the triple cache itself is evidently not useless; however, the results for the API and SPI loops show that the cost of the triple cache in the find planning is quite high. So the recommended change is to ensure that the triple cache is not used when converting an s, p, o pattern into a triple during find, but to use the triple cache at other times.

test | quadhash | triple cache on
Sesame Query - API listStatements (small) | 100% | 151%
Sesame Query - SPI loops (small) | 100% | 170%
Sesame Query - direct graph SPI query (small) | 100% | 92%
Simple-RDQL (OneProperty*1000) | 100% | 112%
walking a chain (Chain |1000|) | 100% | 132%
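The recommended split could look roughly like this (a hypothetical helper; the cache plumbing itself is elided):

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;

// Sketch: consult the triple cache when creating triples to add,
// but build the pattern triple for find directly.
final class TripleMaker
   {
   Triple tripleForAdd( Node s, Node p, Node o )
      {
      Triple cached = lookupInCache( s, p, o );   // hypothetical cache lookup
      return cached != null ? cached : remember( Triple.create( s, p, o ) );
      }

   Triple tripleForFind( Node s, Node p, Node o )
      {
      return Triple.create( s, p, o );            // skip the cache entirely
      }

   // Cache plumbing elided; only the split between the two paths matters here.
   private Triple lookupInCache( Node s, Node p, Node o ) { return null; }
   private Triple remember( Triple t ) { return t; }
   }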

Threading

The standard query code uses two threads. The optimized code uses one thread, and a 'multiplying iterator' to transform the control flow appropriately. Under a profiler the multiplying iterator appears to take about 15% of the time, so a different control structure, e.g. a visitor pattern, may be significantly faster. For the two tests involving the query code (i.e. the Sesame test and the legacy RDQL test) the threading costs a slowdown of a factor of 2 and 3.6 respectively. Having threading in the Query subsystem is costly.

Other

A further optimization that is not shown in the tests is avoiding the pattern.match() in the following code:

com.hp.hpl.jena.graph.query.PatternStage

protected void nest( Pipe sink, Domain current, int index )
   {
   if (index == compiled.length)
      sink.put( current.copy() );   // all patterns matched: emit a solution
   else
      {
      Pattern p = compiled[index];
      ValuatorSet guard = guards[index];
      // find the triples matching this pattern under the current bindings
      Iterator it = find( graph, p.asTripleMatch( current ).asTriple() );
      while (stillOpen && it.hasNext())
         // p.match re-checks the triple and binds variables; then the guard is evaluated
         if (p.match( current, (Triple) it.next()) && guard.evalBool( current )) 
            nest( sink, current, index + 1 );   // recurse into the next pattern
      NiceIterator.close( it );
      }
   }
 

The optimized code avoids all of the nesting machinery as unnecessary overhead, but the p.match call is particularly expensive. It does three things:

  1. It duplicates the condition already executed in the find
  2. It binds any variables that need binding
  3. It checks the conditions in queries that have the same variable occurring two or three times in the triple pattern.

The first of these should simply not be done. The second is implemented in the optimized code by passing a pointer into the find call that allows the value to be assigned to the variable during the find. It is not significant whether this pointer is a Java object with a set method, or an array and an index with a set method.
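A sketch of such a pointer in both of the forms mentioned (the names are hypothetical):

import com.hp.hpl.jena.graph.Node;

// Sketch: a slot handed into find, so the matched node is written
// straight into the result bindings as part of the find itself.
interface BindingSlot
   {
   void set( Node value );
   }

// The array-and-index form mentioned above.
final class DomainSlot implements BindingSlot
   {
   private final Node[] bindings;
   private final int index;

   DomainSlot( Node[] bindings, int index )
      { this.bindings = bindings; this.index = index; }

   public void set( Node value ) { bindings[index] = value; }
   }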

The third of these is not implemented in the optimized code. It can be achieved by, for example, effectively treating a triple pattern: ?x ?x ?x as ?x ?y ?z with a guard of ?x = ?y and ?y = ?z.

All the nesting is avoided by expressing each of the conditions, most of which are trivial, as iterators, and then expressing the combined query as an array of iterators rather than as nested stages. This is largely equivalent, but allows the control to be inverted with a multiplying iterator, which for each value of the ith iterator considers all values of the (i+1)th iterator, rewinding the iterators as necessary. This is slightly inelegant when guards are present (which is rare); such cases would perhaps be better served by pushing the guard down into the find.
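A sketch of the inverted control (hypothetical interfaces, guards omitted): the stages form an array, and a single loop plays the role of the recursion in nest(), rewinding a stage whenever the stage to its left moves on.

// Sketch of a multiplying iterator: an array of rewindable stages is
// driven as one loop, replacing both the recursion and the second thread.
interface RewindableStage
   {
   void rewind();       // restart for the bindings chosen by earlier stages
   boolean advance();   // move to the next match; false when exhausted
   }

final class MultiplyingIterator
   {
   private final RewindableStage[] stages;

   MultiplyingIterator( RewindableStage[] stages ) { this.stages = stages; }

   // Calls action once per complete solution; the stages are assumed to
   // write their bindings into a shared domain as a side effect of advance().
   void run( Runnable action )
      {
      int i = 0;
      stages[0].rewind();
      while (i >= 0)
         {
         if (!stages[i].advance())
            i--;                              // exhausted: back up a stage
         else if (i == stages.length - 1)
            action.run();                     // all stages matched
         else
            stages[++i].rewind();             // descend to the next stage
         }
      }
   }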

Bottom Line

Using all the optimizations, including disabling remove on find for query, and having the triple cache on except during find, would result in a speed-up of about a factor of 25 for the sample query on the small data set. Caching issues dominate on larger data sets, making the maximum speed-up a factor of 5 or 6.

The code is in the Jena CVS Scratch area, under jjc.