Improving the Performance of Jena Query over Memory Graph

Jeremy J. Carroll

This is a report on building a faster version of the in-memory graph and query function of Jena.

Sample Data Set and Query

The methodology used was to drill down on one specific query against one specific data set and make changes that were not too specific to that query, but would have general impact.

Sample Data

The data set was one of those offered in the Jena vs Sesame performance comparison thread. However, some of the tests used smaller data sets containing just the subset of the data that is touched by the query (excluding data that is navigated, but passed over, during hash table look-ups).

This data consisted of 2600 instances of the following triples:

_:r rdf:type uddi:BusinessService .
_:r uddi:hasServiceKey _:k .
_:r uddi:hasName _:b .
_:b rdf:type uddi:NameBag .
_:b rdf:_2 "Weijian's Business' Service"+i+"x" .
_:r uddi:hasBusinessEntity _:d .
_:r uddi:hasDescription _:e .
_:r uddi:hasBindingTemplate _:f .
_:r uddi:hasBindingTemplate _:g .

The blank nodes are different for each one of the instances, and the literal string depends on the number i of the instance (between 0 and 2599). The prefix uddi is bound to http://www.mygrid.ecs.soton.ac.uk/uddiv2.rdfs#.
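For concreteness, one instance of this nine-triple pattern could be generated with the Jena Model API roughly as follows. This is a sketch only: the helper method is mine, and the property URIs are simply built from the uddi namespace above.

import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.RDF;

// Sketch: build one instance of the nine-triple pattern for index i.
static void addInstance( Model m, int i )
   {
   String uddi = "http://www.mygrid.ecs.soton.ac.uk/uddiv2.rdfs#";
   Resource r = m.createResource();   // _:r
   Resource b = m.createResource();   // _:b (the name bag)
   r.addProperty( RDF.type, m.createResource( uddi + "BusinessService" ) );
   r.addProperty( m.createProperty( uddi + "hasServiceKey" ), m.createResource() );
   r.addProperty( m.createProperty( uddi + "hasName" ), b );
   b.addProperty( RDF.type, m.createResource( uddi + "NameBag" ) );
   b.addProperty( RDF.li( 2 ), "Weijian's Business' Service" + i + "x" );
   r.addProperty( m.createProperty( uddi + "hasBusinessEntity" ), m.createResource() );
   r.addProperty( m.createProperty( uddi + "hasDescription" ), m.createResource() );
   r.addProperty( m.createProperty( uddi + "hasBindingTemplate" ), m.createResource() );
   r.addProperty( m.createProperty( uddi + "hasBindingTemplate" ), m.createResource() );
   }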

This small data set had a total of 23400 triples; the data from the test file had 358800 triples. Later in the investigation, to understand better why the performance of the two cases differed, three further data sets were considered. Each was generated in the same way as the small data set, but with a further 7*2600 copies of the following 19 triples added:

_:x eg:(a+i) _:y .
   i = 1, 2, ... 19

where, for example, when i = 5 eg:(a+i) represents the URI eg:a5.

These 345800 triples were added to the model either before the other triples, or after the other triples, or with the other triples (i.e. 133 of these eg:ai triples with each instance of the significant data triples).

Sample Query

The sample query was:

SELECT ?key WHERE
    (?businessService0 rdf:type uddi:BusinessService)
    (?businessService0 uddi:hasName ?nameBag3)
    (?nameBag3 ?v_7 "Weijian's Business' Service42x")
    (?businessService0 uddi:hasServiceKey ?key)
USING rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
      uddi FOR <http://www.mygrid.ecs.soton.ac.uk/uddiv2.rdfs#>

We explicitly excluded query reordering strategies, and concentrated on exploring the current strategy of matching each pattern in turn.

Hence, executing this query involves finding the 2600 matches of the first triple. For each of these, the second triple has one match. For exactly one of these, the third triple has a match. For that one, the fourth triple matches. Thus, a total of 5202 patterns are matched against the graph when executing the query.

When using the current Jena strategy of preferring a subject index to an object index, the second, third and fourth triples are matched using the subject index, and the first using the object index. Thus the first pattern accepts all 2600 triples in its index entry. The second pattern accepts 1 triple out of the 7 in its index entry (2600 times). The third pattern rejects both of the triples in its index entry (2599 times) and accepts one of the two (once). The fourth pattern accepts 1 triple out of 7 in its index entry. Thus the underlying next on the iterator over the set of triples in the index entry is called a total of 2600*(1+7+2)+7 = 26007 times, and the filter accepts 5202 of these.

This analysis involves the observation that when executing a query Graph.find does two logically separable things:

  1. Decide on an execution strategy given which of the S/P/O are concrete (or bound variables) as opposed to ANYs or unbound variables. This execution strategy also takes into account whether the graph has a non-empty reifier, and whether any of the triples stored there are relevant to the query or not.
  2. Use the current values of the bound variables to construct a specific filtering iterator.

We will call the first part find-planning, and the second part find-binding. We can then draw up the following table:

function | call count
query | 1
find-planning | 4
find-binding | 5202
triple collection next | 26007
unsuccessful filtering | 20805
successful filtering | 5202

Note that these numbers represent a logical minimum, not figures reported by a profiler. For example, the Jena 2.2 code has the find-planning and find-binding stages combined into a single find method; therefore, with Jena 2.2 the find-planning stage must be executed at least 5202 times, rather than the logical minimum of 4.
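To make the separation concrete, the two stages could be expressed with interfaces along the following lines. These are hypothetical names for exposition, not the Jena 2.2 API, which combines both stages inside Graph.find.

import java.util.Iterator;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.query.Domain;

// Sketch of the logical split; FindPlanner and FindPlan are made-up names.
interface FindPlanner
   {
   // find-planning: done once per triple pattern in the query, choosing
   // which index to use and how to treat any reified triples.
   FindPlan plan( Node s, Node p, Node o );   // nodes may be ANY or variables
   }

interface FindPlan
   {
   // find-binding: done once per set of current variable bindings,
   // returning a filtering iterator over the chosen index entry.
   Iterator bind( Domain currentBindings );
   }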

Changes made

Test Results

Various performance tests from the Jena performance test suite were run. The Sesame comparison queries, other than those involving reading the file data, were reduced from 2600 instances to 1000 instances, since the machine I am using (a Windows XP box, 512M, Java 1.5.02) started thrashing on the bigger tests. The numbers for the Sesame file tests should hence be regarded with suspicion.

The Jena versions used were:

The comparisons between Jena 2.2 and the optimized version indicate the possible improvement.

The comparison between the optimized version and the various other variants give an indication of the contribution of each optimization to the improvement.

Separate files provide:

The column headings refer to different variants of Jena being tested; these will be discussed below. The quadhash and quadhash2 columns are both the same system, and can be compared to give some idea as to the variance of the test harness (+/- 5%).

The quadhash variant uses all the optimizations, except that it still supports remove on the iterator returned from Graph.find.

Remove on find results

It would be possible to have two separate Graph.finds, one of which did not support remove. The one without remove can go a bit faster, as shown by the 85% figure for the Sesame Query.
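A minimal sketch of what the cheaper, remove-free result could look like, assuming the index entry is held as a plain Triple array as described later (the class name is hypothetical):

import java.util.Iterator;
import java.util.NoSuchElementException;
import com.hp.hpl.jena.graph.Triple;

// Sketch: an iterator straight over an index entry's triple array.
// Because remove() is simply refused, no link back to the owning
// bucket (needed to delete a triple from the store) is carried around.
final class ReadOnlyTripleIterator implements Iterator
   {
   private final Triple[] triples;
   private int next = 0;

   ReadOnlyTripleIterator( Triple[] triples ) { this.triples = triples; }

   public boolean hasNext() { return next < triples.length; }

   public Object next()
      {
      if (next >= triples.length) throw new NoSuchElementException();
      return triples[next++];
      }

   public void remove() { throw new UnsupportedOperationException(); }
   }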

Hashing Costs in Node to Triple Map

The Sesame query tests are parameterized:

small
Only the triples that are actually looked at by any find in the query are in the model.
file
All the triples from the file are included. The triples looked at by the find are 2.6 times as many as for the other tests, which were made smaller because of the machine limitations mentioned above. The (file) figures are unreliable.
early additional triples
Many irrelevant triples were stuffed into the model, before the relevant triples were added.
mid additional triples
Many irrelevant triples were stuffed into the model, while the relevant triples were added.
late additional triples
Many irrelevant triples were stuffed into the model, after the relevant triples were added.

The code execution paths for all four of the non-file tests are identical, except for the hash lookup from the node-to-triple map.

Since a hash lookup is meant to be a constant-time operation, theoretically all four rows of the table should have identical results.

Instead we see that adding more triples, either before or while adding the triples that matter, makes the query slower by roughly 200-300 milliseconds. This cost is roughly the same regardless of the base time shown for the small tests.

Times are in milliseconds.

test | jena 2.2 | quadhash | quadhash2 | Jena CVS July 11 | my own chaining hash map | hashmap (java.util.HashMap)
Sesame Query - API listStatements (small) | 1975 | 850 | 821 | 1855 | 864 | 888
Sesame Query - API listStatements (early additional triples) | 1812 | 1090 | 1060 | 1765 | 1105 | 1113
Sesame Query - API listStatements (mid additional triples) | 2340 | 1056 | 1031 | 2169 | 1113 | 1239
Sesame Query - API listStatements (late additional triples) | 1870 | 891 | 881 | 1832 | 901 | 1013
Sesame Query - direct graph SPI query (small) | 1650 | 78 | 75 | 1669 | 71 | 80
Sesame Query - direct graph SPI query (early additional triples) | 1534 | 262 | 299 | 1635 | 320 | 344
Sesame Query - direct graph SPI query (mid additional triples) | 2203 | 274 | 269 | 2086 | 359 | 517
Sesame Query - direct graph SPI query (late additional triples) | 1647 | 83 | 79 | 1592 | 84 | 119
Sesame Query - API listStatements (file) | 6859 | 3199 | 3490 | 6569 | 3295 | 3254
Sesame Query - direct graph SPI query (file) | 6319 | 1436 | 1547 | 6049 | 1436 | 1549

The most even performance is shown by the quadratic probe hash map. Implementing my own chaining hash map was not significantly different from using java.util.HashMap. There may be other ways of addressing this. A plausible explanation for the figures is in terms of memory cache coherency and locality; it may be worth further work to investigate this issue. Since the slowdown from this issue alone is two to three times the cost of the fully optimized query, the percentage improvements shown by the early/mid/late and file tests are significantly smaller than those for the small query.
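For reference, a quadratic probe table resolves collisions by stepping through a single flat array with growing strides, rather than following a chain of separately allocated entries, which keeps probes close to the home slot. A minimal sketch of the lookup, assuming a power-of-two table that is never completely full:

// Sketch of a quadratic (triangular-number) probe sequence.
static int findSlot( Object[] keys, Object key )
   {
   int mask = keys.length - 1;                     // table size is a power of two
   int slot = key.hashCode() & mask;
   for (int i = 1; keys[slot] != null; i++)
      {
      if (keys[slot].equals( key )) return slot;   // found the key
      slot = (slot + i) & mask;                    // strides grow: +1, +2, +3, ...
      }
   return slot;                                    // empty slot: key absent
   }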

The Triple Tables

Jena 2.2 stores triples in HashSets, kept as the values of the node-to-triple maps. The relevant hash set is looked up in the node-to-triple map, and then iterated over. In the test data these hash sets were typically quite small, with 1, 2 or 7 items; the entry used by the first pattern of the query has 1000 items. Iterating over these is relatively expensive.

The optimized code instead uses an array within its own data structure, whose iterators do all the incrementing, filtering and side effects in a single operation, without indirection.
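A sketch of the idea (a hypothetical class, not the actual optimized code): the bucket is a plain Triple array, and the iterator filters on the predicate and object inline in its own next/hasNext code, so there is no FilterIterator layered on top and no extra indirection per element.

import java.util.Iterator;
import java.util.NoSuchElementException;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;

// Sketch: iterate a subject-index bucket, filtering on predicate and
// object inline rather than through a wrapping FilterIterator.
final class BucketIterator implements Iterator
   {
   private final Triple[] bucket;   // triples sharing one subject
   private final Node p, o;         // concrete nodes, or Node.ANY
   private int i = 0;
   private Triple pending;

   BucketIterator( Triple[] bucket, Node p, Node o )
      { this.bucket = bucket; this.p = p; this.o = o; }

   public boolean hasNext()
      {
      while (pending == null && i < bucket.length)
         {
         Triple t = bucket[i++];
         if (p.matches( t.getPredicate() ) && o.matches( t.getObject() ))
            pending = t;             // accept: the filtering happens right here
         }
      return pending != null;
      }

   public Object next()
      {
      if (!hasNext()) throw new NoSuchElementException();
      Triple t = pending; pending = null; return t;
      }

   public void remove() { throw new UnsupportedOperationException(); }
   }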

To explore the impact of different triple table strategies, a variant of the optimized code that wraps an ArrayList was tested. When going through the query engine, this wrapping is expensive (a factor of three). Using a HashSet instead of an ArrayList adds a comparable amount of time. I offer no explanation as to why the ArrayList is faster for the walking a chain test.

The wrapping itself is costly, as shown by the last column, in which the underlying triple store is an ArrayList, but any iterators are wrapped with a FilterIterator that accepts everything. Again the slowdown is noticeable in many of the tests, particularly the Sesame Query tests that exercise the inner loop of the find iteration.

test | quadhash | hashset | arraylist | wrapped arraylist
Sesame Query - API listStatements (small) | 100% | 149% | 109% | 120%
Sesame Query - SPI loops (small) | 100% | 157% | 101% | 132%
Sesame Query - direct graph SPI query (small) | 100% | 638% | 308% | 515%
Simple-RDQL (OneProperty*1000) | 100% | 110% | 102% | 106%
listStatements() (DER*1000) | 100% | 112% | 111% | 125%
listStatements() (DMOZ-1000) | 100% | 107% | 100% | 107%
loading/creating (Chain |1000|) | 100% | 123% | 113% | 113%
loading/creating (DER*1000) | 100% | 179% | 126% | 125%
loading/creating (DMOZ-1000) | 100% | 108% | 100% | 100%
loading/creating (OneProperty*1000) | 100% | 180% | 132% | 130%
walking a chain (Chain |1000|) | 100% | 112% | 83% | 93%

Wrapping in more detail

Jena iterators get wrapped a lot. When a filter iterator sits between two wrapping layers, the layer that the filter wraps (below it) does more work than the layer that wraps the filter (above it), because the filter reduces the number of items to be considered.

Thus the difference between the arraylist and the wrapped arraylist shows the impact of an extra layer of wrapping inside a find.

Two additional variants of the optimized code added one or two layers of wrapping to the find result. The difference between these and the quadhash base code shows the cost of one or two layers of wrapping at that point. This is still substantial, though much less than the cost of a layer of wrapping deeper in the iterator stack.

test | quadhash | arraylist | wrapped arraylist | wrapped find | doubly wrapped find
Sesame Query - API listStatements (small) | 100% | 109% | 120% | 103% | 105%
Sesame Query - SPI loops (small) | 100% | 101% | 132% | 114% | 131%
Sesame Query - direct graph SPI query (small) | 100% | 308% | 515% | 138% | 170%
Simple-RDQL (OneProperty*1000) | 100% | 102% | 106% | 109% | 104%
listStatements() (DER*1000) | 100% | 111% | 125% | 107% | 119%
listStatements() (DMOZ-1000) | 100% | 100% | 107% | 98% | 109%

Costs of the Wrapping Improvements

A significant part of the improvement in the optimized code comes from removing all wrapping. In part this is achieved by having specialized iterator classes for each type of filter on an S/P/O pattern. Each field may be a concrete node, a bound variable, an unbound variable, or ANY.

The code is different for each of these cases, giving 64 variations, which could be expressed as 64 subclasses. This avoids wrapping but adds a code maintenance cost. It is probably possible to find some trade-off in which some wrapping is retained to allow some generality in the code, while at least some of the current layers are peeled away.
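One way to organize such a dispatch, sketched here with made-up names, is to encode the kind of each field into a single index from 0 to 63 and use it to pick the specialized iterator class:

// Sketch: encode the kind of each S/P/O field (four possibilities each)
// into one index 0..63 used to select a specialized iterator class.
static final int CONCRETE = 0, BOUND_VAR = 1, UNBOUND_VAR = 2, ANY = 3;

static int caseIndex( int sKind, int pKind, int oKind )
   {
   return (sKind * 4 + pKind) * 4 + oKind;   // 4 * 4 * 4 = 64 cases
   }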

Useless microoptimizations

The optimized code allowed public access to the subj, pred and obj fields of Triple. Replacing this with final accessors or even non-final accessors made precious little difference.

Making Triple.hashCode or Triple.equals final does not seem to have any impact either.

Predicate Interning

The optimized code interns (using a WeakHashMap rather than String.intern()) the labels of nodes used in predicate position. This allows a fastIdentity operation to use == on the labels, for tests in the p position in find.

Since there are not very many predicates, the cost of this interning is fairly small, and it gives a significant improvement. Switching off the interning gives a 44% slowdown on the Sesame query, and a 21% slowdown on the simple RDQL query.
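A minimal sketch of such an intern pool (the class and method names here are mine, not Jena's): equal labels are canonicalized to a single String object, so the find inner loop can compare them with ==.

import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch: canonicalize predicate labels so that equal labels become
// the identical object, allowing == comparison in the p position.
final class PredicateLabels
   {
   private static final Map POOL = new WeakHashMap();

   static synchronized String intern( String label )
      {
      WeakReference ref = (WeakReference) POOL.get( label );
      String canonical = (ref == null) ? null : (String) ref.get();
      if (canonical == null)
         {
         canonical = label;
         POOL.put( label, new WeakReference( label ) );
         }
      return canonical;
      }

   // fastIdentity-style test: valid only for interned labels.
   static boolean sameLabel( String a, String b ) { return a == b; }
   }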

Node hashCode

The optimized code adds an extra field to Node in which the hashCode is computed by the constructor and kept.

This gives a 79% improvement on the sesame query, without significant slowdown elsewhere. (Hmmm, I seem not to have the figures, but recollect that this was also good for the cache coherency issue.)
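A sketch of the change (the class and field names here are illustrative only):

// Sketch: compute the hash once in the constructor and return the
// cached value, instead of recomputing it on every hash lookup.
abstract class CachedHashNode
   {
   protected final Object label;
   private final int cachedHash;

   CachedHashNode( Object label )
      {
      this.label = label;
      this.cachedHash = label.hashCode();   // computed once, then kept
      }

   public int hashCode() { return cachedHash; }
   }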

Triple Cache

The triple cache gets used during a find. This is useless, and costly. To show the cost, the optimized code switches the triple cache off completely. A different approach would be not to use the cache during find, but to use it at other times. Given that the optimized code for the small query goes faster with the triple cache on, the triple cache itself is evidently not useless; however, the results for the API and SPI loops show that the cost of the triple cache in the find planning is quite high. So the recommended change is to ensure that the triple cache is not used when converting an s, p, o pattern into a triple during find, but to use the triple cache at other times.

test | quadhash | triple cache on
Sesame Query - API listStatements (small) | 100% | 151%
Sesame Query - SPI loops (small) | 100% | 170%
Sesame Query - direct graph SPI query (small) | 100% | 92%
Simple-RDQL (OneProperty*1000) | 100% | 112%
walking a chain (Chain |1000|) | 100% | 132%
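The recommended split could look roughly like this (a hypothetical helper; the cache plumbing itself is elided):

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;

// Sketch: consult the triple cache when creating triples to add,
// but build the pattern triple for find directly.
final class TripleMaker
   {
   Triple tripleForAdd( Node s, Node p, Node o )
      {
      Triple cached = lookupInCache( s, p, o );   // hypothetical cache lookup
      return cached != null ? cached : remember( Triple.create( s, p, o ) );
      }

   Triple tripleForFind( Node s, Node p, Node o )
      {
      return Triple.create( s, p, o );            // skip the cache entirely
      }

   // Cache plumbing elided; only the split between the two paths matters here.
   private Triple lookupInCache( Node s, Node p, Node o ) { return null; }
   private Triple remember( Triple t ) { return t; }
   }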

Threading

The standard query code uses two threads. The optimized code uses one thread, and a 'multiplying iterator' to transform the control flow appropriately. Under a profiler the multiplying iterator appears to take about 15% of the time, so a different control structure, e.g. a visitor pattern, may be significantly faster. For the two tests involving the query code (i.e. the Sesame test and the legacy RDQL test) the threading costs a slowdown of a factor of 2 and 3.6 respectively. Having threading in the Query subsystem is costly.

Other

A further optimization that is not shown in the tests is avoiding the pattern.match() in the following code:

com.hp.hpl.jena.graph.query.PatternStage

protected void nest( Pipe sink, Domain current, int index )
   {
   if (index == compiled.length)
      sink.put( current.copy() );   // all patterns matched: emit a solution
   else
      {
      Pattern p = compiled[index];
      ValuatorSet guard = guards[index];
      // find the triples matching this pattern under the current bindings
      Iterator it = find( graph, p.asTripleMatch( current ).asTriple() );
      while (stillOpen && it.hasNext())
         // p.match re-checks the triple and binds variables; then the guard is evaluated
         if (p.match( current, (Triple) it.next()) && guard.evalBool( current )) 
            nest( sink, current, index + 1 );   // recurse into the next pattern
      NiceIterator.close( it );
      }
   }
 

The optimized code avoids all of the nesting machinery as unnecessary overhead, but the p.match call is particularly expensive. It does three things:

  1. It duplicates the condition already executed in the find
  2. It binds any variables that need binding
  3. It checks the conditions in queries that have the same variable occurring two or three times in the triple pattern.

The first of these should simply not be done. The second is implemented in the optimized code by passing a pointer into the find call that allows the value to be assigned to the variable during the find. It is not significant whether this pointer is a Java object with a set method, or an array and an index with a set method.
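A sketch of such a pointer in both of the forms mentioned (the names are hypothetical):

import com.hp.hpl.jena.graph.Node;

// Sketch: a slot handed into find, so the matched node is written
// straight into the result bindings as part of the find itself.
interface BindingSlot
   {
   void set( Node value );
   }

// The array-and-index form mentioned above.
final class DomainSlot implements BindingSlot
   {
   private final Node[] bindings;
   private final int index;

   DomainSlot( Node[] bindings, int index )
      { this.bindings = bindings; this.index = index; }

   public void set( Node value ) { bindings[index] = value; }
   }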

The third of these is not implemented in the optimized code. It can be achieved by, for example, effectively treating a triple pattern: ?x ?x ?x as ?x ?y ?z with a guard of ?x = ?y and ?y = ?z.

All the nesting is avoided by expressing each of the conditions, most of which are trivial, as iterators, and then expressing the combined query as an array of iterators rather than as nested stages. This is largely equivalent, but allows the control to be inverted with a multiplying iterator, which for each value of the ith iterator considers all values of the (i+1)th iterator, rewinding the iterators as necessary. This is slightly inelegant when guards are present (which is rare); such cases would perhaps be better served by pushing the guard down into the find.
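A sketch of the inverted control (hypothetical interfaces, guards omitted): the stages form an array, and a single loop plays the role of the recursion in nest(), rewinding a stage whenever the stage to its left moves on.

// Sketch of a multiplying iterator: an array of rewindable stages is
// driven as one loop, replacing both the recursion and the second thread.
interface RewindableStage
   {
   void rewind();       // restart for the bindings chosen by earlier stages
   boolean advance();   // move to the next match; false when exhausted
   }

final class MultiplyingIterator
   {
   private final RewindableStage[] stages;

   MultiplyingIterator( RewindableStage[] stages ) { this.stages = stages; }

   // Calls action once per complete solution; the stages are assumed to
   // write their bindings into a shared domain as a side effect of advance().
   void run( Runnable action )
      {
      int i = 0;
      stages[0].rewind();
      while (i >= 0)
         {
         if (!stages[i].advance())
            i--;                              // exhausted: back up a stage
         else if (i == stages.length - 1)
            action.run();                     // all stages matched
         else
            stages[++i].rewind();             // descend to the next stage
         }
      }
   }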

Bottom Line

Using all the optimizations, including disabling remove on find for query, and having the triple cache on except during find, would result in a speed-up of about a factor of 25 for the sample query on the small data set. Caching issues dominate on larger data sets, making the maximum speed-up a factor of 5 or 6.

The code is in the Jena CVS Scratch area, under jjc.