sparql2sql – a query engine for SPARQL over Jena triple stores

Last update: June 15th 2005

Overview

sparql2sql is a query engine for SPARQL over Jena triple stores. It rewrites SPARQL queries into SQL, which offloads most of the query execution work to the database and should improve performance.

This is an experimental implementation. It cannot deal with all SPARQL queries and is not fully tested. See the Limitations and known issues section for details.

Please direct feedback and bug reports to the Jena mailing list, jena-dev@groups.yahoo.com.

Author: Richard Cyganiak (richard@cyganiak.de)

Contents

  1. Download and CVS access
  2. Example: Querying a persistent Jena model
  3. Example: Working with RDF Datasets and named graphs
  4. Limitations and known issues
  5. Database schema
  6. SPARQL to SQL mapping details

Download and CVS access

Currently sparql2sql is only available as Java source code from CVS.

cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/jena login
cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/jena co sparql2sql

When asked for a password, just press Enter.

All required jar files (the Jena 2.2 jars, the MySQL JDBC connector, and a CVS build of ARQ) are in the lib directory.

There's a runnable example, sparql2sql/Test.java, and a unit test suite in the tests-src directory. Both require a live MySQL 4.1 database. The connection is configured in etc/db_connection.properties.
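
For code outside the bundled test suite, the same settings can be loaded with java.util.Properties and passed to Jena's DBConnection (com.hp.hpl.jena.db). This is only a sketch; the property key names used below (url, user, password, type) are assumptions, so check the actual keys used by the file from CVS.

// load the JDBC settings from the properties file
// (the key names below are illustrative -- check the file in CVS
//  for the actual keys it uses)
Properties props = new Properties();
props.load(new FileInputStream("etc/db_connection.properties"));

// open a Jena database connection; the last argument is the
// database type, e.g. "MySQL"
IDBConnection conn = new DBConnection(
		props.getProperty("url"),
		props.getProperty("user"),
		props.getProperty("password"),
		props.getProperty("type"));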

Example: Querying a persistent Jena model

sparql2sql can be used to query database-persisted Jena models (ModelRDB). The example below creates a ModelRDB, reads an RDF file into it, re-opens the model as an RDBDataSource, and executes a SPARQL query against it.

// register the sparql2sql query engine
// (must be done once at startup time)
RDBQueryEngineFactory.registerSelf();

// Open a DB connection and DB model
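// (url, user, password and engine are the JDBC settings, e.g. from
//  etc/db_connection.properties; engine is the database type, e.g. "MySQL")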
IDBConnection conn = new DBConnection(url, user, password, engine);
ModelMaker maker = ModelFactory.createModelRDBMaker(conn);
Model persistentModel = maker.createModel("myModelName");

// ... do interesting stuff with the model ...
persistentModel.read("http://xmlns.com/foaf/0.1/index.rdf");

// Open the same model as an RDBDataSource (an ARQ Dataset)
RDBDataSource ds = RDBDataSource.open(conn, "myModelName");

// Execute a SPARQL query
String sparql =
	"PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
	"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
	"SELECT ?class ?label " +
	"WHERE { ?class rdf:type rdfs:Class . " +
	"        ?class rdfs:label ?label }";
ResultSet results = QueryExecutionFactory.create(
		QueryFactory.create(sparql), ds).execSelect();

// Pretty-print results to System.out
new ResultSetFormatter(results).printAll(System.out);

Example: Working with RDF Datasets and named graphs

SPARQL's Dataset is a collection consisting of a default graph and any number of named graphs, each identified by a URI.

sparql2sql's implementation of this concept is the RDBDataSource.

The example below sets up an RDBDataSource, reads RDF files into the default graph and several named graphs, and executes a SPARQL query over the Dataset.

// set up datasource
RDBDataSource ds = RDBDataSource.open(
		new DBConnection(url, user, password, engine),
		"my_dataset");

// clear the dataset if it still contains data from a previous run
ds.clear();

// read some arbitrary RDF into the default graph and some named graphs
ds.getDefaultModel().read("http://www.w3.org/1999/02/22-rdf-syntax-ns");
// we have to generate the named graphs first -- clunky!
ds.addNamedModel("urn:my:graph1", ModelFactory.createDefaultModel());
ds.addNamedModel("urn:my:graph2", ModelFactory.createDefaultModel());
ds.addNamedModel("urn:my:graph3", ModelFactory.createDefaultModel());
// now read some stuff
ds.getNamedModel("urn:my:graph1").read("http://www.w3.org/2000/01/rdf-schema");
ds.getNamedModel("urn:my:graph2").read("http://purl.org/dc/elements/1.1/");
ds.getNamedModel("urn:my:graph3").read("http://xmlns.com/foaf/0.1/index.rdf");

// register the sparql2sql query engine -- must be done once at
// startup time
RDBQueryEngineFactory.registerSelf();

// Set log level to debug
// This causes the engine to log executed SELECT statements
Logger.getLogger(RDBDataSource.class).setLevel(Level.DEBUG);

// do a SPARQL query
String sparql =
	"PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
	"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
	"SELECT ?source ?uri ?superclass " +
	"WHERE { GRAPH ?source { " +
	"{ ?uri rdf:type rdfs:Class } UNION { ?uri rdf:type rdf:Property } " +
	"OPTIONAL { ?uri rdfs:subClassOf ?superclass } } }";
Query q = QueryFactory.create(sparql);
ResultSet results = QueryExecutionFactory.create(q, ds).execSelect();

// print results using an ARQ utility class
ResultSetFormatter.out(System.out, results, q);

// close the dataset
ds.close();

Limitations and known issues

This is experimental software in a very early stage of development. No extensive testing has been performed.

Database schema

sparql2sql uses the Jena ModelRDB database schema.

This allows SPARQL queries over existing ModelRDB stores, but comes at a performance and complexity cost since the Jena DB schema was not designed with RDF Datasets in mind.

ModelRDB is able to store multiple models in a single statement table. This feature is used by sparql2sql to simulate RDF Datasets. The model ID is used to store graph name URIs. The URIs are encoded using ModelRDB's node encoding scheme to improve join performance.
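
Since named graphs are stored as regular ModelRDB models, they should also show up when listing the models on the connection with the standard Jena DB API. A rough sketch (reusing the conn variable from the first example; note that graph names may appear in an encoded form rather than as plain URIs):

// list all models stored on this connection; named graphs created
// through an RDBDataSource appear here too, possibly under encoded names
Iterator it = conn.getAllModelNames();
while (it.hasNext()) {
	System.out.println("model: " + it.next());
}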

SPARQL to SQL mapping details

Generated SQL statements can be logged by lowering the log level:

Logger.getLogger(RDBDataSource.class).setLevel(Level.DEBUG);
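
The DEBUG messages only appear if log4j has been configured with an appender. If there is no log4j.properties on the classpath, a minimal programmatic setup (plain log4j, nothing sparql2sql-specific) is enough:

// send log output to the console, then lower the threshold for the
// RDBDataSource logger so the generated SQL statements become visible
BasicConfigurator.configure();
Logger.getLogger(RDBDataSource.class).setLevel(Level.DEBUG);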