Jena Tutorial

A Programmer's Introduction to RDQL

Andy Seaborne <andy_seaborne@hp.com>

April 2002

RDQL Introduction

RDQL is a query language for RDF in Jena models.  The idea is to provide a data-oriented query model so that there is a more declarative approach to complement the fine-grained, procedural Jena API.

It is "data-oriented" in that it only queries the information held in the models; there is no inference being done.  Of course, the Jena model may be 'smart' in that it provides the impression that certain triples exist by creating them on-demand.  However, the RDQL system does nothing other than take the description of what the application wants, in the form of a query, and return that information, in the form of a set of bindings.

RDQL is an implementation of the SquishQL RDF query language, which itself is derived from rdfDB.  This class of query languages regards RDF as triple data, without schema or ontology information unless explicitly included in the RDF source.

RDF provides a graph with directed edges - the nodes are resources or literals.  RDQL provides a way of specifying a graph pattern that is matched against the graph to yield a set of matches.  It returns a list of bindings - each binding is a set of name-value pairs for the values of the variables.  All variables are bound (there is no disjunction in the query).

Contents

  1. RDQL Introduction
  2. RDQL-by-example
  3. Writing Queries
    1. Graph Patterns
    2. Paths
    3. URI Prefixes
    4. Filters
    5. Another Example
    6. Querying for properties
    7. bNodes
    8. More on literals
  4. Using RDQL from Java
    1. Key classes
    2. Example Java Code
    3. Mixing RDQL and Jena API calls
  5. Reference
    1. RDQL Command Line Application
    2. RDQL Syntax

RDQL-by-Example

In this tutorial, we will start with the simple data in vc-db-1.rdf: this file contains RDF for a number of vCard descriptions of people.  vCards are described in RFC2426 and the RDF translation is described in the W3C note "Representing vCard Objects in RDF/XML".  Our example database just contains some name information.

Graphically, the data looks like this:

Graph of the vCard database

The file "vc-q1" contains the following query:

SELECT ?x
WHERE (?x  <http://www.w3.org/2001/vcard-rdf/3.0#FN>  "John Smith")

Executing this query with the command line application:

java jena.rdfquery --data vc-db-1.rdf --query vc-q1

which executes the query in the file "vc-q1" on the data file "vc-db-1.rdf" and yields the following:

x                            
=============================
<http://somewhere/JohnSmith/>

We'll look at the structure of this query, how to execute it from the command line, then show how such a query is used from within a Java programme.

Queries can retrieve related pieces of information: the next query retrieves two variables, for the resource and the formatted name (query vc-q2).

SELECT ?x, ?fname
WHERE (?x  <http://www.w3.org/2001/vcard-rdf/3.0#FN>  ?fname)

which gives:

x                                | fname        
================================================
<http://somewhere/JohnSmith/>    | "John Smith" 
<http://somewhere/RebeccaSmith/> | "Becky Smith"
<http://somewhere/SarahJones/>   | "Sarah Jones"
<http://somewhere/MattJones/>    | "Matt Jones" 

Explanation

So what did we just do? 

Looking at the first query, we have a pattern (?x, <vCard:FN>, "John Smith") for triples in the RDF source file.  This pattern is matched against each triple in the  file and the results collected together (in the example, there is only one such match).  The command line application has a built-in formatter and it lists the variables declared in the SELECT clause.

Here, URIs are quoted using <> (see RFC2396 for a definition of URI syntax), variables are introduced by a leading '?' and a constant is a quoted string.  Constants can also be unquoted numbers.

The command line application takes a query and a data source, executes the query, then formats the results.  We'll look at calling queries from Java later (see below).

Writing Queries

Graph Patterns

In the example above, we had a single pattern for triples.  In RDQL, the WHERE clause is actually matching a description of the shape of the graph, as given by a graph pattern, given as a list of triple (edge) patterns.  Suppose we want the given names of the Smiths (see "vc-q3").

SELECT ?givenName
WHERE (?y  <http://www.w3.org/2001/vcard-rdf/3.0#Family>  "Smith") ,
      (?y  <http://www.w3.org/2001/vcard-rdf/3.0#Given>  ?givenName)

In this query, we want to find a node in the graph, ?y, which has the vCard property Family with the value "Smith".  ?y also has another property, the vCard given name, which we want to put into a variable ?givenName.

Executing this gives:

givenName
=========
"John"   
"Rebecca"

We have found 2 matches: one for John Smith, one for Rebecca Smith.

In the query, the variable ?y is the same in each triple pattern.  For a successful match, the value of the variable must be the same in each triple pattern.  Here, the value is the bNode making up the vCard information for the N property.
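The way a shared variable ties triple patterns together can be sketched in plain Java.  This is only an illustration under stated assumptions, not how Jena implements RDQL: the class GraphPatternSketch, its methods and the cut-down sample data are all hypothetical, triples are simple string arrays, and "_:a" and "_:b" stand in for bNodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not Jena code. A triple or triple pattern is a String[3]
// of {subject, property, value}; a term starting with '?' is a variable.
final class GraphPatternSketch {

    // Try to match one pattern against one triple, extending the given binding.
    // Returns the extended binding, or null if the triple does not match.
    static Map<String, String> matchTriple(String[] pattern, String[] triple,
                                           Map<String, String> binding) {
        Map<String, String> extended = new HashMap<String, String>(binding);
        for (int i = 0; i < 3; i++) {
            String term = pattern[i];
            if (term.startsWith("?")) {
                String bound = extended.get(term);
                if (bound == null) {
                    extended.put(term, triple[i]);   // first use: bind the variable
                } else if (!bound.equals(triple[i])) {
                    return null;                     // must agree with the earlier use
                }
            } else if (!term.equals(triple[i])) {
                return null;                         // a constant must match exactly
            }
        }
        return extended;
    }

    // Match each pattern in turn, carrying bindings from one pattern to the next.
    static List<Map<String, String>> match(List<String[]> patterns,
                                           List<String[]> triples) {
        List<Map<String, String>> results = new ArrayList<Map<String, String>>();
        results.add(new HashMap<String, String>()); // start with one empty binding
        for (String[] pattern : patterns) {
            List<Map<String, String>> next = new ArrayList<Map<String, String>>();
            for (Map<String, String> binding : results) {
                for (String[] triple : triples) {
                    Map<String, String> extended = matchTriple(pattern, triple, binding);
                    if (extended != null) {
                        next.add(extended);
                    }
                }
            }
            results = next;
        }
        return results;
    }

    // A cut-down, made-up version of the vCard data.
    static List<String[]> sampleTriples() {
        return Arrays.asList(
            new String[] {"_:a", "vCard:Family", "Smith"},
            new String[] {"_:a", "vCard:Given", "John"},
            new String[] {"_:b", "vCard:Family", "Jones"},
            new String[] {"_:b", "vCard:Given", "Sarah"});
    }

    public static void main(String[] args) {
        List<String[]> patterns = Arrays.asList(
            new String[] {"?y", "vCard:Family", "Smith"},
            new String[] {"?y", "vCard:Given", "?givenName"});
        for (Map<String, String> binding : match(patterns, sampleTriples())) {
            System.out.println("givenName = " + binding.get("?givenName"));
        }
    }
}
```

With this sample data, only the node "_:a" satisfies both patterns, so the sketch prints a single given name, "John".  Sarah's vCard is rejected at the first pattern, because its family name is not "Smith".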

More Graph patterns: Paths

A very common case is to know a path through the graph, starting either from a known node or from a variable.  Such a path is made up of a number of edges, with each pair of edges linked by a graph node, which has to be given a variable (see "vc-q4"):

SELECT ?resource  ?givenName
WHERE (?resource  <http://www.w3.org/2001/vcard-rdf/3.0#N>   ?z)
      (?z  <http://www.w3.org/2001/vcard-rdf/3.0#Given>  ?givenName)

Here, the variable ?z is internal to the query to link the resource to the given name by the path composed of properties vCard:N and vCard:Given.  We didn't ask for ?z in the SELECT clause so we get:

resource                         | givenName
============================================
<http://somewhere/JohnSmith/>    | "John"   
<http://somewhere/RebeccaSmith/> | "Rebecca"
<http://somewhere/SarahJones/>   | "Sarah"  
<http://somewhere/MattJones/>    | "Matthew"

URI Prefixes : USING

URIs tend to be quite long.  An RDFS schema will have a URI that defines a space of identifiers, and each identifier is a further name concatenated on to this.  RDQL has a syntactic convenience for this: it allows prefix strings to be defined in the USING clause.  The examples above become:

SELECT ?x
WHERE (?x  vCard:FN  "John Smith")
USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>

SELECT ?givenName
WHERE (?y  vCard:Family  "Smith")
      (?y  vCard:Given  ?givenName)
USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>

As more properties appear in a query, this mechanism helps maintain a readable query, without long URIs obscuring the structure of the patterns.

For readability, you can also insert commas where lists of things occur: you can use commas in some places and not others, as suits readability.

SELECT ?resource,  ?givenName
WHERE (?resource, vCard:N, ?z), 
      (?z,  vCard:Given,   ?givenName)
USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>

For convenience, the namespaces 'rdf' , 'rdfs', 'owl' and 'xsd' are built-in.  It is as if every query has:

USING rdf  FOR  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      rdfs FOR  <http://www.w3.org/2000/01/rdf-schema#>
      xsd  FOR  <http://www.w3.org/2001/XMLSchema#>
      owl  FOR  <http://www.w3.org/2002/07/owl#>

  They can be redefined if an application wishes to.
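Because a prefix is only an abbreviation, expanding one amounts to string concatenation.  The following is a minimal sketch of that idea in plain Java; the class PrefixSketch and its expand method are hypothetical and are not part of Jena or of its RDQL parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not Jena code. Expands "prefix:local" by string concatenation.
final class PrefixSketch {

    static String expand(String term, Map<String, String> prefixes) {
        int colon = term.indexOf(':');
        if (colon < 0) {
            return term;                       // no prefix part at all
        }
        String base = prefixes.get(term.substring(0, colon));
        if (base == null) {
            return term;                       // unknown prefix: leave the term unchanged
        }
        return base + term.substring(colon + 1);
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new LinkedHashMap<String, String>();
        prefixes.put("vCard", "http://www.w3.org/2001/vcard-rdf/3.0#");
        // Prints http://www.w3.org/2001/vcard-rdf/3.0#FN
        System.out.println(expand("vCard:FN", prefixes));
    }
}
```

Redefining a prefix, in this sketch, is just a matter of putting a different string into the map under the same name.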

Filters

There are times when the application wants to filter on the value of a property.  In the data file vc-db-2.rdf, we have added an extra field for age.  Age is not defined by the vCard schema so we have created this for the purpose of this tutorial.  RDF allows such mixing of different definitions of information because URIs are unique.

So, a query to find the people who are 24 or older is (this query is in file vc-q5):

SELECT ?resource
WHERE (?resource info:age ?age)
AND ?age >= 24
USING info FOR <http://somewhere/peopleInfo#>

which results in:

resource                     
=============================
<http://somewhere/JohnSmith/>

Just one match, resulting in the resource URI for John Smith.  Turning this round to ask for those less than 24 also yields one match, for Rebecca Smith.  Nothing is returned for the Joneses.

The database contains no age information about the Joneses: there are no info:age properties on these vCards.
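The order of events, pattern match first and filter second, can be sketched in plain Java.  This is an illustration only: the class FilterSketch is hypothetical, no Jena classes are used, and the ages are made up rather than taken from vc-db-2.rdf.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not Jena code. The ages below are made up for illustration.
final class FilterSketch {

    // Stands in for the bindings produced by (?resource info:age ?age):
    // only resources that actually have an info:age property appear here.
    static Map<String, Integer> sampleAgeBindings() {
        Map<String, Integer> ages = new LinkedHashMap<String, Integer>();
        ages.put("http://somewhere/JohnSmith/", 25);
        ages.put("http://somewhere/RebeccaSmith/", 23);
        return ages;
    }

    // Stands in for "AND ?age >= minAge": keep the bindings that pass the test.
    static List<String> atLeast(Map<String, Integer> ageBindings, int minAge) {
        List<String> kept = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : ageBindings.entrySet()) {
            if (entry.getValue() >= minAge) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(atLeast(sampleAgeBindings(), 24));
    }
}
```

A resource with no age never reaches the filter at all, which is why neither query says anything about the vCards that lack the property.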

Mixing it

In the file vc-q6 is the query:

SELECT ?resource, ?familyName
WHERE (?resource  info:age  ?age)
      (?resource, vCard:N  ?y) , (?y, vCard:Family, ?familyName)
AND ?age >= 24
USING  info  FOR <http://somewhere/peopleInfo#>
       vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>

which finds the age and family name for each resource, then filters the results by the age.  It results in:

resource                      | familyName
==========================================
<http://somewhere/JohnSmith/> | "Smith"   

Querying for Properties

Variables can appear in the subject, property or value slots of a pattern.  We have seen variables in the subject and value slots of a triple but that is not a restriction.  We could have:

SELECT ?prop
WHERE (<http://somewhere/JohnSmith/> , ?prop, "John Smith")
USING  info  FOR <http://somewhere/peopleInfo#>
       vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>

which seeks to determine how the resource with URI http://somewhere/JohnSmith/ is related to the string "John Smith".  See vc-q7.  Try it - do you get what you expected?

prop                                     
=========================================
<http://www.w3.org/2001/vcard-rdf/3.0#FN>

You may have expected the prefix abbreviation "vCard:" to be used; unfortunately, the prefix mechanism is only a syntactic convenience for writing queries.  By the time the query gets executed, that information has been lost.

bNodes

The database we have been using has bNodes (also known as anonymous nodes) in it.  This query (see vc-q8) intentionally finds the bNode for the structured information on the vCard:N property.

SELECT ?b
WHERE (<http://somewhere/JohnSmith/>  vCard:N  ?b)
USING vCard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>

The output is:

b                             
==============================
<anon:113fe2:ec900a9f26:-7fff>

The weird looking output is Jena's internal identifier for bNodes.  You can see something similar in N-TRIPLE files written by Jena.

bNodes don't really have labels (the "b" stands for "blank"); what you see here is a detail of the current Jena implementation.  So you can't put bNodes into a query, but you can get them out.  In the next section, we discuss using RDQL inside Java.  There, a bNode found through a query can be used in further operations through the Jena API, because RDQL returns the Jena Resource, Property or Literal as fits the query.

More on literals

So far, all the RDF literals we have seen have been plain strings, enclosed in double quotes, or numbers, which needn't be quoted.  RDF literals can have an optional datatype and an optional (XML) language tag.

Examples:

  "123"^^<http://www.w3.org/2001/XMLSchema#integer>
  "123"^^xsd:integer
  "foo"@en^^rdf:XMLLiteral

Exercises

Here are a couple of exercises.  Using the database vc-db-3.rdf, find:

  1. All the top-level properties for the vcard for John Smith (sample query, results)
  2. Find the work telephone number for John Smith (sample query, results)

Using RDQL from Java

So far, we have looked at executing queries from the command line and getting the results by seeing the output of the text-based results formatter.  The command line application is just a wrapper around a set of Java classes which actually provide the query execution.

Key Classes

All the important classes are in package com.hp.hpl.jena.rdf.rdql; the package com.hp.hpl.jena.rdf.rdql.parser contains the parser for the concrete syntax.  There is no need to access the parser directly because the Query class has a constructor to do that for you.

Java Code

A query is created by passing a string to the Query class.  If the source is not specified within the query, then it must be given separately, usually by passing a model to the query object or by specifying the URL of a model.

String queryString = "SELECT ....." ;
Query query = new Query(queryString) ;

// Need to set the source if the query does not.
query.setSource(model);
QueryExecution qe = new QueryEngine(query) ;

QueryResults results = qe.exec() ;
for ( Iterator iter = results ; iter.hasNext() ; )
{
    // Each element is one set of variable bindings.
    ResultBinding res = (ResultBinding)iter.next() ;
    // ... process result here ...
}
// Close the results once the application has finished with them.
results.close() ;

Results are returned as an iterator, with each call to QueryResults.next returning one set of variable bindings.  The text formatter called by jena.rdfquery prints each ResultBinding out as a single line.

Variables are accessed by name.  To just print out the results, we could do the following (the full code is in rdql_code1):

for ( Iterator iter = results ; iter.hasNext() ; )
{
    ResultBinding res = (ResultBinding)iter.next() ;
    Object x = res.get("x") ;
    Object fname = res.get("fname") ;
    System.out.println("x = "+x+"   fname = "+fname) ;
}

which gives:

x = http://somewhere/JohnSmith/   fname = John Smith
x = http://somewhere/RebeccaSmith/   fname = Becky Smith
x = http://somewhere/SarahJones/   fname = Sarah Jones
x = http://somewhere/MattJones/   fname = Matt Jones

The objects returned are real Jena objects (see rdql_code2):

for ( Iterator iter = results ; iter.hasNext() ; )
{
    ResultBinding res = (ResultBinding)iter.next() ;
    Resource r = (Resource)res.get("x") ;
    Literal l = (Literal)res.get("fname") ;
    System.out.println("Resource: "+r+"   Literal: "+l);
    break ;
}

Mixing RDQL and Jena API calls

With some care, it is possible to mix Jena API calls with queries.

In rdql_code3, we have:

for ( Iterator iter = results ; iter.hasNext() ; )
{
    ResultBinding res = (ResultBinding)iter.next() ;
    Resource r = (Resource)res.get("x") ;
    Literal l = (Literal)res.get("fname") ;
    System.out.println("Resource: "+r+"   Literal: "+l);
    for ( StmtIterator sIter = r.listProperties(); sIter.hasNext() ; )
    {
        Statement s = sIter.next() ;
        System.out.println("   Predicate: "+s.getPredicate()) ;
    }
    break ;
}

which reads some information about the resource retrieved in the query.  Some care is needed over what is, and is not, safe.

The basic rule: Don't modify the model (add or remove statements) while a query is executing.  Reading information is safe. The way to get round this is to record changes to be made in a separate data structure, such as recording statements to be removed in a set, then perform them after the query results iterator has been closed.
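The "record now, modify later" approach can be sketched without any Jena classes.  In this hypothetical example, a List<String> stands in for the model and its iterator stands in for the query results; the class DeferredUpdateSketch and the sample statements are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a List<String> stands in for the model and its
// iterator stands in for the query results; no Jena classes are used.
final class DeferredUpdateSketch {

    static List<String> removeMatching(List<String> model, String unwanted) {
        Set<String> toRemove = new HashSet<String>();
        // Phase 1: read only. Record what should change, but change nothing.
        for (Iterator<String> iter = model.iterator(); iter.hasNext(); ) {
            String statement = iter.next();
            if (statement.contains(unwanted)) {
                toRemove.add(statement);
            }
        }
        // Phase 2: the iteration has finished, so it is now safe to modify.
        model.removeAll(toRemove);
        return model;
    }

    public static void main(String[] args) {
        List<String> model = new ArrayList<String>(Arrays.asList(
            "JohnSmith age 25", "RebeccaSmith age 23", "JohnSmith FN John Smith"));
        // Prints [JohnSmith FN John Smith]
        System.out.println(removeMatching(model, "age"));
    }
}
```

With a real query, the second phase would come after the call to results.close().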

It is worth understanding a little about the implementation of RDQL provided by the QueryEngine class.  To ensure that not too much memory is used in executing a query, the query engine has a pipeline of matching, filtering and returning results.  Each pipeline stage is a separate thread (and queries can go faster on a multiprocessor).  If the application is not reading results then the pipeline will fill and further query execution will pause until there is space.

So the query engine can be ahead of the application in processing the query, and is making calls into the Jena model from a different thread.  Modifying the model while a query is active is unpredictable and might even cause a crash, as internal Jena data structures are not protected against concurrent updates and reads.
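The pausing behaviour of a full pipeline can be illustrated with a bounded queue between two threads.  This is a sketch of the general technique only, not the QueryEngine implementation: the class PipelineSketch, the end marker and the use of java.util.concurrent are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: not the QueryEngine implementation. One producer thread
// stands in for a pipeline stage; a bounded queue stands in for the pipeline.
final class PipelineSketch {

    private static final String END = "<end>";   // marker for "no more results"

    static List<String> run(final List<String> matches, int capacity) {
        final BlockingQueue<String> pipe = new ArrayBlockingQueue<String>(capacity);
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (String match : matches) {
                        pipe.put(match);         // blocks while the pipe is full
                    }
                    pipe.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();

        List<String> received = new ArrayList<String>();
        try {
            // The application reads results; each take() frees space in the pipe.
            for (String m = pipe.take(); !m.equals(END); m = pipe.take()) {
                received.add(m);
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading results", e);
        }
        return received;
    }

    public static void main(String[] args) {
        List<String> matches = Arrays.asList("John", "Rebecca", "Sarah", "Matthew");
        // Prints [John, Rebecca, Sarah, Matthew]
        System.out.println(run(matches, 2));
    }
}
```

With a capacity of 2, the producer can be at most two results ahead of the reader.  While it is running, it is working on another thread, which is the situation in which a concurrent change to the model would be unsafe.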

Exercise

As an exercise in calling RDQL from Java, find the subclass relationships in this RDF model (also available in RDF/XML).

Here is one possible solution.

Reference

RDQL Command Line Application

The Jena toolkit comes with a command line program for executing RDQL queries.

java -cp ... jena.rdfquery ...

This programme will execute a query on a data source, specified in the FROM clause of the query or on the command line. It can query all forms of Jena models: XML, N-Triple, BerkeleyDB or a relational database.

This programme has a built-in formatter for the result data.  It can print in text as aligned columns and in HTML, as well as raw formats more suited to further processing.

It takes a number of arguments:

Usage: [--xml|--ntriple] [--data URL] [queryString | --query file]
   --query file         Read one query from a file
   --xml                Data source is XML (default)
   --ntriple            Data source is n-triple
   --data URL           Data source (can also be part of query)
   --time               Print some time information
   --test [file]        Run the test suite
   --format FMT         One of text, html, tuples, dump or none
   --verbose            Verbose - more messages
   --quiet              Quiet - less messages

RDQL Syntax

RDQL is an SQL-like syntax for this query model derived from SquishQL and rdfDB.  A description of the full grammar, as the output of JJTree (part of the JavaCC package), is included in this tutorial.  The up-to-date grammar is to be found in the Jena toolkit.

In SQL, a database is a closed world; the FROM clause identifies the tables in the database; the WHERE clause identifies constraints and can be extended with AND.  By analogy, the web is the database and the FROM clause identifies the RDF models. Variables are introduced with a leading ‘?’ and URIs are quoted with <>; unquoted URIs can be used where there is no ambiguity.

SELECT Clause
    Identifies the variables to be returned to the application.  If not all the variables are needed by the application, then specifying the required results can reduce the amount of memory needed for the results set as well as providing information to a query optimizer.
FROM Clause
    Specifies the model by URI.
WHERE Clause
    Specifies the graph pattern as a list of triple patterns.
AND Clause
    Specifies the Boolean expressions.
USING Clause
    A way to shorten the length of URIs.  As SquishQL is likely to be written by people, this mechanism helps make for an easier to understand syntax.  This is not a namespace mechanism; instead it is simply an abbreviation mechanism for long URIs by defining a string prefix.

The up-to-date grammar is distributed as part of the Jena toolkit.