The goal of RDQL Fastpath query processing is to perform as much of the RDQL
query processing as possible within the database engine. The standard RDQL query
processing algorithm is to process each RDQL pattern as a separate Find
operation. For database-backed models, this algorithm has high overhead for
queries with variables. For example, consider the RDQL query:
<Var1, ex:Name, "Alice"> AND <Var1, ex:City,
"Wonderland">
The standard algorithm processes this query by generating one Find
operation to get all Alices and then, for each Alice, generating an additional
Find operation to check if that Alice lives in Wonderland. However, a
database engine could process both RDQL query patterns with a single join
operation. The Fastpath algorithm generates SQL queries that match multiple
patterns.
The Fastpath algorithm works by partitioning the patterns and constraints for a query into one or more groups, referred to as stages, where each stage consists of patterns that can be processed together. Stages that consist of more than one pattern are processed by a single, dynamically-generated SQL query. Stages that consist of just one pattern are processed by a Find operation. This is considered more efficient than generating an SQL query (although this has not been measured). The Fastpath algorithm also determines a partial execution order for the stages in an attempt to minimize execution time.
The Fastpath code is new and not thoroughly tested. If problems with Fastpath are suspected or found, it is possible to disable Fastpath and revert to the default nested-loops algorithm by invoking setDoFastpath(false) on the ModelRDB instance (see Options). The remainder of this document describes the Fastpath algorithm in more detail, including limitations, optimizations and future work. Jena2's support for different reification styles and use of a separate table for reification somewhat complicates the algorithm. These issues are discussed below.
In this section, we describe the Fastpath algorithm for the most general case, a model that uses the Standard reification style (i.e., standard RDF treatment of reification, see the Reification-HowTo for a description of the styles). Models that use this style may may contain both asserted statements and reification triples. As mentioned above, a Jena2 query consists of a set of triple patterns and a set of constraints that apply to the variable bindings (e.g., Var5 != 1000). The Fastpath algorithm is summarized in the pseudo-code below.
The first step is to order the triple patterns by their estimated cost of evaluation. Currently, the cost estimate is a simple measure that ranks patterns with bound variables higher than patterns with unbound variables. Further, unbound variables over predicates are ranked lower than unbound variables over subjects or objects. The highest ranked pattern is then chosen as the first pattern in the next stage of patterns to process.
The next step is to all find patterns that can join with a staged pattern. The process iterates until no more patterns remain or no joins are found. In general, two patterns are considered joinable if they apply to the same database table, have a common variable and neither variable is bound to a predicate. However, in some cases joins of predicate variables are supported as described below. Joins across database tables are somewhat tricky and are deferred to a future release.
The final step is to generate code for the staged patterns. If there is a single pattern in the stage, it is considered more efficient to process the pattern as a find operation. For multiple patterns, a new SQL query is generated dynamically.
The final result of the Fastpath algorithm is a series of stages of query results where each stage is either a find operation or an SQL query. The stages are then evaluated in order, i.e., the output of one stage is used as input for variable bindings in the next stage. A final stage is used for processing the constraints by filtering the final results. Processing constraints within the database is deferred to a future release. Another limitation is that Fastpath processes patterns for different graph separately, i.e., it cannot generate SQL code for joins across graphs. This is also deferred to a future release.
If it is known that a Jena model contains only asserted statements or if a user does not wish to query over reified statments, then variables over predicates can be supported. This affects the algorithm that looks for joins (mentioned above) and should provide a performance benefit for queries that include predicate variables by reducing the number of stages. The RDB Query Handler supports a method, setQueryOnlyAsserted(true), for the user to declare that Fastpath should ignore reified statements in queries. Similarly, if a model contains only reified statements or the user only wishes to query reified statements, then setQueryOnlyReified(true), may be called. However, a current limitation of Fastpath is that if QueryOnlyReified is true then QueryFullReified must also be true.
If a model contains only fully reified statements or the user is only
interested in querying fully reified statements (i.e., partially reified
statements are to be ignored), then another optimization is possible that eliminates unnecessary joins for reified statements. For
example, consider the following query over reified statements.
<Var1, rdf:subject, "Alice"> AND <Var1,
rdf:object, V2>
This retrieves all reified statements about Alice. The normal Fastpath
algorithm will generate a join for these two patterns. However, we note that
fully reified statements are stored in a single database row. So, if the user is
only interested in query results for fully reified statements, then the join can be replaced by a
simple select. This optimization is enabled by calling
setQueryFullReified(true).
To retrieve the current settings for any of these optimization settings, there are corresponding get methods, that return a boolean e.g., getQueryOnlyAsserted(). Note that the values of these options are not persisted in the database (see Options).
Convenient and Minimal reification styles hide reification triples from the Find operation. Consequently, using these reification styles has the same effect as specifying setQueryOnlyAsserted(true).