This version: 0.2
Date: 28 Aug 2003

RDQL Fastpath Query Processing

Fastpath Overview

The goal of RDQL Fastpath query processing is to perform as much of the RDQL query processing as possible within the database engine. The standard RDQL query processing algorithm is to process each RDQL pattern as a separate Find operation. For database-backed models, this algorithm has high overhead for queries with variables. For example, consider the RDQL query:
                    <Var1, ex:Name, "Alice"> AND <Var1, ex:City, "Wonderland">
The standard algorithm processes this query by generating one Find operation to get all Alices and then, for each Alice, generating an additional Find operation to check if that Alice lives in Wonderland. However, a database engine could process both RDQL query patterns with a single join operation. The Fastpath algorithm generates SQL queries that match multiple patterns.

The Fastpath algorithm works by partitioning the patterns and constraints for a query into one or more groups, referred to as stages, where each stage consists of patterns that can be processed together. Stages that consist of more than one pattern are processed by a single, dynamically-generated SQL query. Stages that consist of just one pattern are processed by a Find operation. This is considered more efficient than generating an SQL query (although this has not been measured). The Fastpath algorithm also determines a partial execution order for the stages in an attempt to minimize execution time.

The Fastpath code is new and not thoroughly tested. If problems with Fastpath are suspected or found, it is possible to disable Fastpath and revert to the default nested-loops algorithm by invoking setDoFastpath(false) on the ModelRDB instance (see Options). The remainder of this document describes the Fastpath algorithm in more detail, including limitations, optimizations and future work. Jena2's support for different reification styles and use of a separate table for reification somewhat complicates the algorithm. These issues are discussed below.

Fastpath Support for Standard Models

In this section, we describe the Fastpath algorithm for the most general case, a model that uses the Standard reification style (i.e., standard RDF treatment of reification, see the Reification-HowTo for a description of the styles). Models that use this style may may contain both asserted statements and reification triples. As mentioned above, a Jena2 query consists of a set of triple patterns and a set of constraints that apply to the variable bindings (e.g., Var5 != 1000). The Fastpath algorithm is summarized in the pseudo-code below.

patternsToDo = all patterns in query
// patternsToDo has the RDQL query patterns that are not yet staged.
while ( patternsToDo is not empty ) {
    staged = the lowest cost pattern in remaining 
    // staged has the query patterns to join in the next stage
    remove staged from patternsToDo
    while ( patternsToDo is not empty ) {
                joinPattern = a pattern in patternsToDo that joins with one in staged
           if ( joinPattern is null ) break
                add joinPattern to staged
           remove joinPattern from patternsToDo
         }
    // now decide how to process the stage - as a find op or an SQL query
    if staged has one pattern
        generate find operation for this stage
    else
        generate SQL query for this stage
}

The first step is to order the triple patterns by their estimated cost of evaluation. Currently, the cost estimate is a simple measure that ranks patterns with bound variables higher than patterns with unbound variables. Further, unbound variables over predicates are ranked lower than unbound variables over subjects or objects. The highest ranked pattern is then chosen as the first pattern in the next stage of patterns to process.

The next step is to all find patterns that can join with a staged pattern. The process iterates until no more patterns remain or no joins are found. In general, two patterns are considered joinable if they apply to the same database table, have a common variable and neither variable is bound to a predicate. However, in some cases joins of predicate variables are supported as described below. Joins across database tables are somewhat tricky and are deferred to a future release.

The final step is to generate code for the staged patterns. If there is a single pattern in the stage, it is considered more efficient to process the pattern as a find operation. For multiple patterns, a new SQL query is generated dynamically.

The final result of the Fastpath algorithm is a series of stages of query results where each stage is either a find operation or an SQL query. The stages are then evaluated in order, i.e., the output of one stage is used as input for variable bindings in the next stage. A final stage is used for processing the constraints by filtering the final results. Processing constraints within the database is deferred to a future release. Another limitation is that Fastpath processes patterns for different graph separately, i.e., it cannot generate SQL code for joins across graphs. This is also deferred to a future release.

Fastpath Optimizations

If it is known that a Jena model contains only asserted statements or if a user does not wish to query over reified statments, then variables over predicates can be supported. This affects the algorithm that looks for joins (mentioned above) and should provide a performance benefit for queries that include predicate variables by reducing the number of stages. The RDB Query Handler supports a method, setQueryOnlyAsserted(true), for the user to declare that Fastpath should ignore reified statements in queries. Similarly, if a model contains only reified statements or the user only wishes to query reified statements, then setQueryOnlyReified(true), may be called. However, currently, this latter call has no effect. A future release will support joins across predicate variable for reified statements.

If a model contains only fully reified statements or the user is only interested in querying fully reified statements (i.e., partially reified statements are to be ignored), then another optimization is possible that eliminates unnecessary joins for reified statements. For example, consider the following query over reified statements.
                    <Var1, rdf:subject, "Alice"> AND <Var1, rdf:object, V2>
This retrieves all reified statements about Alice. The normal Fastpath algorithm will generate a join for these two patterns. However, we note that fully reified statements are stored in a single database row. So, if the user is only interested in query results for fully reified statements, then the join can be replaced by a simple select. This optimization is enabled by calling setQueryFullReified(true).

To retrieve the current settings for any of these optimization settings, there are corresponding get methods, that return a boolean e.g., getQueryOnlyAsserted(). Note that the values of these options are not persisted in the database (see Options).

Fastpath Support for Convenient and Minimal Reification Styles

Convenient and Minimal reification styles hide reification triples from the Find operation. Consequently, using these reification styles has the same effect as specifying setQueryOnlyAsserted(true).

Limitations

Known Bugs

Concurrent updates: if the RDQL query is not run within a transaction, anomalies may occur if the underlying persistent model is updated by some other application between the the time an RDQL query is compiled and subsequently executed. Generally, this should not be a problem but there are situations where a query may return no results. In particular, if a query references a long literal or URI, the database identifier for that object is obtained at compilation time. If the literal does not exist at compilation time but is subsequently added prior to running the query, then no results are returned when, in fact, results should be returned. There may be other examples of anomalies.

Limitations/Future Enhancements

Querying across graphs. Currently, Fastpath does not optimize queries across graphs. A separate SQL statement is generated for each graph. Some support for querying across graphs will be enabled in the future.
Querying across tables. Within a model, Fastpath only optimizes patterns over the same table. Limited support for queryng across tables will be enabled in the future.
Predicate variables. Fastpath only supports predicate variables in the case that queryOnlyAsserted is true. In the future, it should also be possible to enable predicate variables for some types of RDQL queries when queryOnlyReified is true.
Query Log. There is currently no easy-to-use facility to log the SQL statements generated by Fastpath. However, to print a  trace of the generated SQL on System.out, uncomment line 60 of jena.db.impl.DBStage.java. In the future, we hope to add an API for accessing information about the generated SQL, including the statement string.
Query Constraints. Currently RDQL constraints are not processed within the database engine. In the future, constraints will be processed in the database engine.