Andy Seaborne <andy.seaborne@hp.com>
April 2002
Updated February 2004
New users should start with the SPARQL query language, which is newer and more sophisticated than RDQL.
ARQ - A SPARQL Query Processor for Jena (also supports RDQL).
RDQL is a query language for RDF in Jena models. The idea is to provide a data-oriented query model so that there is a more declarative approach to complement the fine-grained, procedural Jena API.
It is "data-oriented" in that it only queries the information held in the models; there is no inference being done. Of course, the Jena model may be 'smart' in that it provides the impression that certain triples exist by creating them on-demand. However, the RDQL system does not do anything other than take the description of what the application wants, in the form of a query, and returns that information, in the form of a set of bindings.
RDQL is an implementation of the SquishQL RDF query language, which itself is derived from rdfDB. This class of query languages regards RDF as triple data, without schema or ontology information unless explicitly included in the RDF source.
RDF provides a graph with directed edges - the nodes are resources or literals. RDQL provides a way of specifying a graph pattern that is matched against the graph to yield a set of matches. It returns a list of bindings - each binding is a set of name-value pairs for the values of the variables. All variables are bound (there is no disjunction in the query).
In this tutorial, we will start with the simple data in vc-db-1.rdf: this file contains RDF for a number of vCard descriptions of people. vCards are described in RFC2426 and the RDF translation is described in the W3C note "Representing vCard Objects in RDF/XML". Our example database just contains some name information.
Graphically, the data looks like this: [figure: graph of the vCard data in vc-db-1.rdf]
The file "vc-q1" contains the following query:
    SELECT ?x
    WHERE (?x <http://www.w3.org/2001/vcard-rdf/3.0#FN> "John Smith")
To execute this query with the command line application, run:
    java jena.rdfquery --data vc-db-1.rdf --query vc-q1
This executes the query in the file "vc-q1" on the data file "vc-db-1.rdf" and yields the following:
    x
    =============================
    <http://somewhere/JohnSmith/>
We'll look at the structure of this query, how to execute it from the command line, then show how such a query is used from within a Java programme.
Queries can retrieve related pieces of information: the next query retrieves two variables, for the resource and the formatted name (query vc-q2).
    SELECT ?x, ?fname
    WHERE (?x <http://www.w3.org/2001/vcard-rdf/3.0#FN> ?fname)
which gives:
    x                                | fname
    ================================================
    <http://somewhere/JohnSmith/>    | "John Smith"
    <http://somewhere/RebeccaSmith/> | "Becky Smith"
    <http://somewhere/SarahJones/>   | "Sarah Jones"
    <http://somewhere/MattJones/>    | "Matt Jones"
So what did we just do?
Looking at the first query, we have a pattern (?x, <vCard:FN>, "John Smith") for triples in the RDF source file. This pattern is matched against each triple in the file and the results collected together (in the example, there is only one such match). The command line application has a built-in formatter and it lists the variables declared in the SELECT clause.
Here, URIs are quoted using <> (see RFC2396 for a definition of URI syntax), variables are introduced by a leading '?' and a constant is a quoted string. Constants can also be unquoted numbers.
The command line application takes a query and a data source, executes the query, then formats the results. We'll look at calling queries from Java later (see below).
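The matching step described above can be sketched in plain Java. This is a toy illustration, not how Jena implements RDQL: the class `TriplePatternSketch`, its `match` method and the two-triple sample data are all assumptions made for this example, and triples are simply arrays of three strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy matcher, not Jena's RDQL engine.
final class TriplePatternSketch {

    static final String FN = "http://www.w3.org/2001/vcard-rdf/3.0#FN";

    // A triple is {subject, predicate, object}.
    // Pattern terms that start with '?' are variables; all other terms are constants.
    static List<Map<String, String>> match(String[] pattern, List<String[]> triples) {
        List<Map<String, String>> results = new ArrayList<>();
        for (String[] triple : triples) {
            Map<String, String> binding = new LinkedHashMap<>();
            boolean ok = true;
            for (int i = 0; i < 3 && ok; i++) {
                String term = pattern[i];
                if (term.startsWith("?")) {
                    // A variable binds to the value in this slot, but must stay
                    // consistent if it occurs twice in the same pattern.
                    String previous = binding.putIfAbsent(term.substring(1), triple[i]);
                    ok = previous == null || previous.equals(triple[i]);
                } else {
                    // A constant must be equal to the value in this slot.
                    ok = term.equals(triple[i]);
                }
            }
            if (ok) {
                results.add(binding);
            }
        }
        return results;
    }

    // Two of the FN triples from the tutorial data, written out by hand.
    static List<String[]> sampleData() {
        List<String[]> data = new ArrayList<>();
        data.add(new String[] {"http://somewhere/JohnSmith/", FN, "John Smith"});
        data.add(new String[] {"http://somewhere/RebeccaSmith/", FN, "Becky Smith"});
        return data;
    }

    public static void main(String[] args) {
        String[] pattern = {"?x", FN, "John Smith"};
        // Prints: [{x=http://somewhere/JohnSmith/}]
        System.out.println(match(pattern, sampleData()));
    }
}
```

Each map in the returned list plays the role of one row of the formatter's output: variable names are the keys and the matched values are the entries.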
In the example above, we had a single pattern for triples. In RDQL, the WHERE clause describes the shape of the graph to be matched: a graph pattern, written as a list of triple (edge) patterns. Suppose we want the given names of the Smiths (see "vc-q3").
    SELECT ?givenName
    WHERE (?y <http://www.w3.org/2001/vcard-rdf/3.0#Family> "Smith") ,
          (?y <http://www.w3.org/2001/vcard-rdf/3.0#Given> ?givenName)
In this query, we want to find a node in the graph, ?y, which has the vCard property Family with the value "Smith". ?y also has another property, the vCard given name, which we want to put into a variable ?givenName.
Executing this gives:
    givenName
    =========
    "John"
    "Rebecca"
We have found 2 matches: one for John Smith, one for Rebecca Smith.
In the query, the variable ?y is the same in each triple pattern. For a successful match, the value of the variable must be the same in each triple pattern. Here, the value is the bNode making up the vCard information for the N property.
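The requirement that a shared variable takes the same value in every triple pattern amounts to a join. A minimal sketch in plain Java follows; it is an illustration under stated assumptions, not Jena's implementation. The class `GraphPatternSketch`, the helper `t`, the abbreviated property names and the labels `_:n1`, `_:n2`, `_:n3` (standing in for bNodes) are all invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy join of triple patterns on shared variables; not Jena's RDQL engine.
final class GraphPatternSketch {

    static String[] t(String s, String p, String o) {
        return new String[] {s, p, o};
    }

    // Try to extend an existing binding so that 'pattern' matches 'triple'.
    // Returns null if a constant differs or a variable is already bound to another value.
    static Map<String, String> extend(Map<String, String> binding, String[] pattern, String[] triple) {
        Map<String, String> result = new LinkedHashMap<>(binding);
        for (int i = 0; i < 3; i++) {
            String term = pattern[i];
            if (term.startsWith("?")) {
                String previous = result.putIfAbsent(term.substring(1), triple[i]);
                if (previous != null && !previous.equals(triple[i])) {
                    return null;
                }
            } else if (!term.equals(triple[i])) {
                return null;
            }
        }
        return result;
    }

    // Match the patterns one after another, carrying the bindings forward.
    static List<Map<String, String>> match(List<String[]> patterns, List<String[]> triples) {
        List<Map<String, String>> bindings = new ArrayList<>();
        bindings.add(new LinkedHashMap<>()); // start with a single empty binding
        for (String[] pattern : patterns) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> binding : bindings) {
                for (String[] triple : triples) {
                    Map<String, String> extended = extend(binding, pattern, triple);
                    if (extended != null) {
                        next.add(extended);
                    }
                }
            }
            bindings = next;
        }
        return bindings;
    }

    // Made-up data shaped like the structured name part of the vCard example.
    static List<String[]> sampleData() {
        return List.of(
            t("_:n1", "vCard:Family", "Smith"), t("_:n1", "vCard:Given", "John"),
            t("_:n2", "vCard:Family", "Smith"), t("_:n2", "vCard:Given", "Rebecca"),
            t("_:n3", "vCard:Family", "Jones"), t("_:n3", "vCard:Given", "Sarah"));
    }

    // (?y vCard:Family "Smith") , (?y vCard:Given ?givenName)
    static List<String[]> smithQuery() {
        return List.of(t("?y", "vCard:Family", "Smith"), t("?y", "vCard:Given", "?givenName"));
    }

    public static void main(String[] args) {
        // Prints John, then Rebecca.
        for (Map<String, String> binding : match(smithQuery(), sampleData())) {
            System.out.println(binding.get("givenName"));
        }
    }
}
```

Because the second pattern is only tried against bindings that survived the first, a given name is reported only when it hangs off a node whose family name is "Smith".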
One very common structure is to know the path in the graph, whether from a known point or from a variable. Such a path is made up of a number of edges, linked by a graph node which has to be given a variable (see "vc-q4").
    SELECT ?resource ?givenName
    WHERE (?resource <http://www.w3.org/2001/vcard-rdf/3.0#N> ?z)
          (?z <http://www.w3.org/2001/vcard-rdf/3.0#Given> ?givenName)
Here, the variable ?z is internal to the query to link the resource to the given name by the path composed of properties vCard:N and vCard:Given. We didn't ask for ?z in the SELECT clause so we get:
    resource                         | givenName
    ============================================
    <http://somewhere/JohnSmith/>    | "John"
    <http://somewhere/RebeccaSmith/> | "Rebecca"
    <http://somewhere/SarahJones/>   | "Sarah"
    <http://somewhere/MattJones/>    | "Matthew"
URIs tend to be quite long. An RDFS schema will have a URI that defines a space of identifiers, and each identifier is a further name concatenated on to this. RDQL has a syntactic convenience for this: it allows prefix strings to be defined in the USING clause. The examples above become:
    SELECT ?x
    WHERE (?x vCard:FN "John Smith")
    USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
and:
    SELECT ?givenName
    WHERE (?y vCard:Family "Smith")
          (?y vCard:Given ?givenName)
    USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
As more properties appear in a query, this mechanism helps maintain a readable query, without long URIs obscuring the structure of the patterns.
For readability, you can also insert commas where lists of things occur: you can use commas in some places and not others, as suits readability.
    SELECT ?resource, ?givenName
    WHERE (?resource, vCard:N, ?z),
          (?z, vCard:Given, ?givenName)
    USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
For convenience, the namespaces 'rdf' , 'rdfs', 'owl' and 'xsd' are built-in. It is as if every query has:
    USING rdf FOR <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          rdfs FOR <http://www.w3.org/2000/01/rdf-schema#>
          xsd FOR <http://www.w3.org/2001/XMLSchema#>
          owl FOR <http://www.w3.org/2002/07/owl#>
They can be redefined if an application wishes to.
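Since the prefix mechanism is purely syntactic, it can be pictured as a lookup table that is consulted while the query is read. The sketch below assumes that view; `PrefixSketch`, `builtIns`, `examplePrefixes` and `expand` are names invented for this example and are not part of the Jena API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative prefix expansion; the real RDQL parser may work differently.
final class PrefixSketch {

    // The four prefixes that are available without a USING clause.
    static Map<String, String> builtIns() {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        prefixes.put("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
        prefixes.put("xsd", "http://www.w3.org/2001/XMLSchema#");
        prefixes.put("owl", "http://www.w3.org/2002/07/owl#");
        return prefixes;
    }

    // Built-ins plus the equivalent of:
    //   USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
    // A USING entry for an existing name (e.g. "rdf") would simply overwrite it.
    static Map<String, String> examplePrefixes() {
        Map<String, String> prefixes = builtIns();
        prefixes.put("vCard", "http://www.w3.org/2001/vcard-rdf/3.0#");
        return prefixes;
    }

    // Expand "prefix:local" to a full URI; terms with an unknown prefix are returned unchanged.
    static String expand(String term, Map<String, String> prefixes) {
        int colon = term.indexOf(':');
        if (colon < 0) {
            return term;
        }
        String namespace = prefixes.get(term.substring(0, colon));
        return namespace == null ? term : namespace + term.substring(colon + 1);
    }

    public static void main(String[] args) {
        // Prints: http://www.w3.org/2001/vcard-rdf/3.0#FN
        System.out.println(expand("vCard:FN", examplePrefixes()));
        // Prints: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
        System.out.println(expand("rdf:type", examplePrefixes()));
    }
}
```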
There are times when the application wants to filter on the value of a property. In the data file vc-db-2.rdf, we have added an extra field for age. Age is not defined by the vCard schema so we have created a new property for the purpose of this tutorial. RDF allows such mixing of different definitions of information because URIs are unique.
So, a query to find the people who are aged 24 or over is (this query is in file vc-q5):
    SELECT ?resource
    WHERE (?resource info:age ?age)
    AND ?age >= 24
    USING info FOR <http://somewhere/peopleInfo#>
which results in:
    resource
    =============================
    <http://somewhere/JohnSmith/>
Just one match, resulting in the resource URI for John Smith. Turning this round to ask for those less than 24 also yields one match, for Rebecca Smith. Nothing about the Joneses.
The database contains no age information about the Joneses: there are no info:age properties on these vCards.
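The behaviour for the Joneses follows from the order of evaluation: the triple pattern has to match before the constraint is even considered. A small Java sketch of that idea follows. The class `AgeFilterSketch` and its methods are invented for illustration, and the ages 25 and 23 are hypothetical values, since the tutorial does not state what is stored in vc-db-2.rdf.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: pattern match first, then a value constraint.
final class AgeFilterSketch {

    static List<String> people() {
        return List.of(
            "http://somewhere/JohnSmith/",
            "http://somewhere/RebeccaSmith/",
            "http://somewhere/SarahJones/",
            "http://somewhere/MattJones/");
    }

    // Hypothetical ages; only the Smiths have an info:age property at all.
    static Map<String, Integer> sampleAges() {
        Map<String, Integer> ages = new LinkedHashMap<>();
        ages.put("http://somewhere/JohnSmith/", 25);
        ages.put("http://somewhere/RebeccaSmith/", 23);
        return ages;
    }

    // Mirrors: (?resource info:age ?age) AND ?age >= minAge
    static List<String> atLeast(int minAge, List<String> resources, Map<String, Integer> ages) {
        List<String> matches = new ArrayList<>();
        for (String resource : resources) {
            Integer age = ages.get(resource); // null means no info:age triple, so the pattern fails
            if (age != null && age >= minAge) {
                matches.add(resource);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Prints: [http://somewhere/JohnSmith/]
        System.out.println(atLeast(24, people(), sampleAges()));
    }
}
```

Even a constraint that every age would satisfy, such as `atLeast(0, ...)`, returns only the two Smiths, because resources without the property never produce a binding for `?age`.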
Filter expressions can also include regular expressions (provided by the Jakarta ORO package):
    SELECT ?person
    WHERE (?person vCard:FN ?fullName)
    AND ?fullName =~ /Smith/i
    USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
which searches for people with a formatted name containing "smith" (case insensitive). It would be less efficient than looking for the vCard:Family property with value "Smith". The "does not match" operator is "!~".
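For readers more familiar with Java's own regular expressions, the two operators behave roughly like the sketch below. Note the substitution: this uses `java.util.regex` as a stand-in for Jakarta ORO, so details of the pattern syntax may differ, and `RegexFilterSketch` with its two methods is an invented name.

```java
import java.util.regex.Pattern;

// Rough java.util.regex analogue of the RDQL =~ and !~ operators.
final class RegexFilterSketch {

    // Like: ?value =~ /regex/  (or /regex/i when ignoreCase is true).
    // find() is used because the pattern may match anywhere in the value.
    static boolean matches(String value, String regex, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(regex, flags).matcher(value).find();
    }

    // Like: ?value !~ /regex/
    static boolean doesNotMatch(String value, String regex, boolean ignoreCase) {
        return !matches(value, regex, ignoreCase);
    }

    public static void main(String[] args) {
        System.out.println(matches("Becky Smith", "Smith", true));      // true
        System.out.println(matches("becky smith", "Smith", false));     // false
        System.out.println(doesNotMatch("Sarah Jones", "Smith", true)); // true
    }
}
```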
In the file vc-q6 is the query:
    SELECT ?resource, ?familyName
    WHERE (?resource info:age ?age)
          (?resource, vCard:N ?y) , (?y, <vCard:Family>, ?familyName)
    AND ?age >= 24
    USING info FOR <http://somewhere/peopleInfo#>
          vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
which finds the age and family name for each resource, then filters the results by the age. It results in:
    resource                      | familyName
    ==========================================
    <http://somewhere/JohnSmith/> | "Smith"
Variables can appear in the subject, property or value slots of a pattern. We have seen variables in the subject and value slots of a triple but that is not a restriction. We could have:
    SELECT ?prop
    WHERE (<http://somewhere/JohnSmith/> , ?prop, "John Smith")
    USING info FOR <http://somewhere/peopleInfo#>
          vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
which seeks to determine how the resource with URI http://somewhere/JohnSmith/ is related to the string "John Smith". See vc-q7. Try it - do you get what you expected?
    prop
    =========================================
    <http://www.w3.org/2001/vcard-rdf/3.0#FN>
You may have expected the prefix abbreviation "vCard:" to be used; unfortunately, the prefix mechanism is only a syntactic convenience for writing queries. By the time the query gets executed, that information has been lost.
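An application that wants abbreviated names in its own output therefore has to keep a prefix table itself and apply it to the results. A minimal sketch of that idea follows; `AbbreviateSketch`, `examplePrefixes` and `abbreviate` are hypothetical names for this example, not something the RDQL classes are known to provide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: re-abbreviating result URIs with an application-side prefix table.
final class AbbreviateSketch {

    static Map<String, String> examplePrefixes() {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("vCard", "http://www.w3.org/2001/vcard-rdf/3.0#");
        prefixes.put("info", "http://somewhere/peopleInfo#");
        return prefixes;
    }

    // Returns "prefix:local" for the first namespace that the URI starts with,
    // or the URI in angle brackets if no namespace applies.
    static String abbreviate(String uri, Map<String, String> prefixes) {
        for (Map.Entry<String, String> entry : prefixes.entrySet()) {
            String namespace = entry.getValue();
            if (uri.startsWith(namespace)) {
                return entry.getKey() + ":" + uri.substring(namespace.length());
            }
        }
        return "<" + uri + ">";
    }

    public static void main(String[] args) {
        // Prints: vCard:FN
        System.out.println(abbreviate("http://www.w3.org/2001/vcard-rdf/3.0#FN", examplePrefixes()));
        // Prints: <http://somewhere/JohnSmith/>
        System.out.println(abbreviate("http://somewhere/JohnSmith/", examplePrefixes()));
    }
}
```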
The database we have been using has bNodes (also known as anonymous nodes) in it. This query (see vc-q8) intentionally finds the bNode for the structured information on the vCard:N property.
    SELECT ?b
    WHERE (<http://somewhere/JohnSmith/> vCard:N ?b)
    USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
The output is:
    b
    ==============================
    <anon:113fe2:ec900a9f26:-7fff>
The weird-looking output is Jena's internal identifier for bNodes. You can see something similar in N-Triples files written by Jena.
bNodes don't really have labels (the "b" stands for "blank"); what you see here is a detail of the current Jena implementation. You therefore can't put them in a query, although you can get them out. In the next section, we discuss using RDQL inside Java. There, bNodes found through queries can be used in further operations through the Jena API, because RDQL returns the Jena Resource, Property or Literal as fits the query.
So far, all the RDF literals we have seen have been plain strings, enclosed in double quotes, or numbers, which needn't be quoted. RDF literals can have an optional datatype and an optional (XML) language tag.
Examples:
    "123"^^<http://www.w3.org/2001/XMLSchema#integer>
    "123"^^xsd:integer
    "foo"@en^^rdf:XMLLiteral
Here are a couple of exercises: using the database vc-db-3.rdf, find:
So far, we have looked at executing queries from the command line and getting the results by seeing the output of the text-based results formatter. The command line application is just a wrapper around a set of Java classes which actually provide the query execution.
All the important classes are in package com.hp.hpl.jena.rdf.rdql; the package com.hp.hpl.jena.rdf.rdql.parser contains the parser for the concrete syntax. There is no need to access the parser directly because the Query class has a constructor to do that for you.
A query is created by passing a string to the Query class. If the source is not specified within the query, then it must be given separately, usually by passing a model to the query object, or by specifying the URL of a model.
    String queryString = "SELECT ....." ;
    Query query = new Query(queryString) ;
    // Need to set the source if the query does not.
    query.setSource(model) ;
    QueryExecution qe = new QueryEngine(query) ;
    QueryResults results = qe.exec() ;
    for ( Iterator iter = results ; iter.hasNext() ; ) {
        ResultBinding res = (ResultBinding)iter.next() ;
        // ... process result here ...
    }
    // Always close the results when finished with them.
    results.close() ;
Results are returned as an iterator, with each call to QueryResults.next returning one set of variable bindings. The text formatter called by jena.rdfquery prints each ResultBinding out as a single line.
Variables are accessed by name. To just print out the results we could do the following (the full code is in rdql_code1):
    for ( Iterator iter = results ; iter.hasNext() ; ) {
        ResultBinding res = (ResultBinding)iter.next() ;
        // Look up each variable by its name, without the leading '?'.
        Object x = res.get("x") ;
        Object fname = res.get("fname") ;
        System.out.println("x = "+x+" fname = "+fname) ;
    }
which gives:
    x = http://somewhere/JohnSmith/ fname = John Smith
    x = http://somewhere/RebeccaSmith/ fname = Becky Smith
    x = http://somewhere/SarahJones/ fname = Sarah Jones
    x = http://somewhere/MattJones/ fname = Matt Jones
The objects returned are real Jena objects (see rdql_code2):
    for ( Iterator iter = results ; iter.hasNext() ; ) {
        ResultBinding res = (ResultBinding)iter.next() ;
        // The bound values can be cast to the Jena types they really are.
        Resource r = (Resource)res.get("x") ;
        Literal l = (Literal)res.get("fname") ;
        System.out.println("Resource: "+r+" Literal: "+l) ;
        break ;   // only look at the first result
    }
With some care, it is possible to mix Jena API calls with queries.
In rdql_code3, we have:
    for ( Iterator iter = results ; iter.hasNext() ; ) {
        ResultBinding res = (ResultBinding)iter.next() ;
        Resource r = (Resource)res.get("x") ;
        Literal l = (Literal)res.get("fname") ;
        System.out.println("Resource: "+r+" Literal: "+l) ;
        // Read-only Jena API calls on a resource returned by the query.
        for ( StmtIterator sIter = r.listProperties() ; sIter.hasNext() ; ) {
            Statement s = sIter.nextStatement() ;
            System.out.println(" Predicate: "+s.getPredicate()) ;
        }
        break ;   // only look at the first result
    }
which reads some information about the resource retrieved in the query. It is necessary to understand a little about the implementation of RDQL to understand what is, and is not, safe.
The basic rule: Don't modify the model (add or remove statements) while a query is executing. Reading information is safe. The way to get round this is to record changes to be made in a separate data structure, such as recording statements to be removed in a set, then perform them after the query results iterator has been closed.
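The "record now, apply later" approach can be sketched with plain Java collections. In this illustration a `List<String>` stands in for the Jena model and each string for a statement; `DeferredUpdateSketch` and `removeMatching` are invented names, and real code would collect Jena statements during the results loop and remove them from the model after `results.close()`.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: defer modifications until iteration has finished.
final class DeferredUpdateSketch {

    // 'model' stands in for a Jena model; each string stands in for a statement.
    static List<String> removeMatching(List<String> model, String substring) {
        Set<String> pendingRemovals = new LinkedHashSet<>();

        // Phase 1: read-only pass, as if iterating over query results.
        // Calling model.remove(...) in this loop would be the unsafe case.
        for (String statement : model) {
            if (statement.contains(substring)) {
                pendingRemovals.add(statement);
            }
        }

        // Phase 2: iteration has finished, so the recorded changes can be applied.
        model.removeAll(pendingRemovals);
        return model;
    }

    public static void main(String[] args) {
        List<String> model = new ArrayList<>(
            List.of("John Smith", "Sarah Jones", "Becky Smith", "Matt Jones"));
        // Prints: [John Smith, Becky Smith]
        System.out.println(removeMatching(model, "Jones"));
    }
}
```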
It is worth understanding a little about the implementation of RDQL provided by the QueryEngine class. To ensure that not too much memory is used in executing a query, the query engine has a pipeline of matching, filtering and returning results. Each pipeline stage is a separate thread (and queries can go faster on a multiprocessor). If the application is not reading results then the pipeline will fill and further query execution will pause until there is space.
So the query engine can be ahead of the application in processing the query, and is making calls into the Jena model from a different thread. Modifying the model while a query is active is unpredictable and might even cause a crash, as internal Jena data structures are not protected against concurrent updates and reads.
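The general shape of such a pipeline, with stages in separate threads that pause when a downstream buffer is full, can be sketched with bounded queues from `java.util.concurrent`. This is a generic producer/consumer illustration and an assumption about the overall design only; the buffer size of 2, the `END` marker and the class `PipelineSketch` are invented here and say nothing about how QueryEngine is actually written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy two-stage pipeline with bounded buffers; not the QueryEngine implementation.
final class PipelineSketch {

    // Unique end-of-stream marker, deliberately compared by reference.
    private static final String END = new String("END");

    static List<String> run(List<String> candidates, String mustContain) {
        // Small buffers: a stage blocks in put() when the next stage falls behind.
        BlockingQueue<String> matched = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> filtered = new ArrayBlockingQueue<>(2);

        Thread matcher = new Thread(() -> {
            try {
                for (String candidate : candidates) {
                    matched.put(candidate); // pauses while the buffer is full
                }
                matched.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread filter = new Thread(() -> {
            try {
                for (String c = matched.take(); c != END; c = matched.take()) {
                    if (c.contains(mustContain)) {
                        filtered.put(c);
                    }
                }
                filtered.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        matcher.start();
        filter.start();

        // The application side: each take() frees buffer space, letting earlier stages resume.
        List<String> results = new ArrayList<>();
        try {
            for (String r = filtered.take(); r != END; r = filtered.take()) {
                results.add(r);
            }
            matcher.join();
            filter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading results", e);
        }
        return results;
    }

    public static void main(String[] args) {
        // Prints: [John Smith, Becky Smith]
        System.out.println(run(List.of("John Smith", "Sarah Jones", "Becky Smith", "Matt Jones"), "Smith"));
    }
}
```

In this sketch the worker threads read `candidates` while the caller is still consuming results, which is the same situation that makes modifying a model during an active query unsafe.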
As an exercise in calling RDQL from Java, find the subclass relationships in this RDF model (also available in RDF/XML).
Here is one possible solution.
The Jena toolkit comes with a command line program for executing RDQL queries.
java -cp ... jena.rdfquery ...
This programme will execute a query on a data source, specified in the FROM clause of the query or on the command line. It can query all forms of Jena models: XML, N-Triple, BerkeleyDB or a relational database.
This programme has a built-in formatter for the result data. It can print in text as aligned columns and in HTML, as well as raw formats more suited to further processing.
It takes a number of arguments:
    Usage: [--xml|--ntriple] [--data URL] [queryString | --query file]
       --query file      Read one query from a file
       --xml             Data source is XML (default)
       --ntriple         Data source is n-triple
       --data URL        Data source (can also be part of query)
       --time            Print some time information
       --test [file]     Run the test suite
       --format FMT      One of text, html, tuples, dump or none
       --verbose         Verbose - more messages
       --quiet           Quiet - less messages
RDQL is an SQL-like syntax for this query model derived from SquishQL and rdfDB. A description of the full grammar, as the output of JJTree (part of the JavaCC package), is included in this tutorial.
In SQL, a database is a closed world; the FROM clause identifies the tables in the database; the WHERE clause identifies constraints and can be extended with AND. By analogy, the web is the database and the FROM clause identifies the RDF models. Variables are introduced with a leading ‘?’ and URIs are quoted with <>; unquoted URIs can be used where there is no ambiguity.
The up-to-date grammar is distributed as part of the Jena toolkit.