This page describes the mechanisms that can be used to extend and modify query execution within ARQ. Through these mechanisms, ARQ can be used to query different graph implementations and to provide different query evaluation and optimization strategies for particular circumstances. These mechanisms are used by SDB and TDB.
ARQ can be extended in various ways to incorporate custom code into a query. Custom filter functions and property functions provide ways to add application specific code. The free text search capabilities, using Apache Lucene, are provided via a property function. Custom filter functions and property functions should be used where possible.
Jena itself can be extended by providing a new implementation of the
Graph
interface. This can be used to encapsulate specific
specialised storage and also for wrapping non-RDF sources to look like
RDF. Theer is a common implementation framework provided by
GraphBase
so only one operation, the find
method,
needs to be written for a read-only data source. Basic find works well
is many cases, and the whole Jena API will be able to use the extension.
For higher SPARQL performance, ARQ can be extended at the
basic graph matching or
algebra level.
Applications writers who extend ARQ at the query execution level should be prepared
to work with the source code for ARQ for specific details and for finding code to
reuse. Some example can be found in the src-examples
directory in the ARQ download.
The sequence of actions performed by ARQ to perform a query are parsing, algebra generation, execution building, high-level optimization, low-level optimization and finally evaluation. It is not usual to modify the parsing step nor the conversion from the parse tree to the algebra form, which is a fixed algorithm defined by the SPARQL standard. Extensions can modify the algebra form by transforming it from one algebra expression to another, including introducing new operators. See also the documentation on working with the SPARQL algebra in ARQ including building algebra expressions programmatically, rather than obtaining them from a query string.
The parsing step turns a query string into a Query
object. The class
Query
reprensets the abstract syntax tree
(AST) for the query and provides methods to create the AST, primarily for
use by the parser.
The query object also provides methods to serialize the query to a string. Because
this is the AST, the string produced is very close to the original query with the
same syntactic elements, but without comments, and formatted with a whitespace for readability.
It is not usually the best way to build a query programmatically and the AST is
not normally an extension point.
The query object can be used many times. It is not modified once created, and in particular it is not modified by query execution.
ARQ generates the SPARQL
algebra expression for the query. After this a number of transformations
can be applied (for example, identification of property functions) but the first
step is the application of the algorithm in the SPARQL specification for translating
a SPARQL query string, as held in a Query
object into a SPARQL algebra
expression. This includes the process of removing joins involving the identity pattern
(the empty graph pattern).
For example, the query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT ?name ?mbox ?nick WHERE { ?x foaf:name ?name ; foaf:mbox ?mbox . OPTIONAL { ?x foaf:nick ?nick } }
becomes
(prefix ((foaf: <http://xmlns.com/foaf/0.1/>)) (project (?name ?mbox ?nick) (leftjoin (bgp (triple ?x foaf:name ?name) (triple ?x foaf:mbox ?mbox) ) (bgp (triple ?x foaf:nick ?nick) ) )))
using the SSE syntax to write out the internal datastructure for the algebra.
The online SPARQL validator at sparql.org can be used to see the algebra expression for a SPARQL query.
There is a collection of transformations that can be applied to the algebra, such as replacing equality filters with a more efficient graph pattern and an assignment. When extending ARQ, a query processor for a custom storage layout can choose which optimizations are appropriate and can also provide its own algebra transformations.
A transform is code that converts an algebra operation into other algebra operations.
It is applied using the Transformer
class:
Op op = ... ; Transform someTransform = ... ; op = Transformer.transform(someTransform, op) ;
The Transformer
class applies the transform to each operation in the
algebra expression tree. Transform
itself is an interface, with one
method signature for each operation type, returning a replacement for the operator
instance it
is called on.
One such transformation is to turn a SPARQL algebra expression involving
named graphs and triples into one using quads. This transformation is
performed by a call to Algebra.toQuadForm
.
Transformations proceed from the bottom of the expression tree to the
top. Algebra expressions are best treated as immutable so a change made
in one part of the tree should result in a copy of the tree above it.
This is automated by the TransformCopy
class which is
the commonly used base class for writing transforms. The other helper
base class is TransformBase,
which provides the identify
operation (returns the node supplied) for each transform operation.
Operations can be printed out in SSE
syntax. The Java toString
method is overridden to provide pretty
printing and the static methods in WriterOp
provide output to various
output objects like java.io.OutputStream
.
The step of evaluating a query is the process of executing the algebra expression, as modified by any transformations applied, to yield a stream of pattern solutions. Low-level optimizations include choosing the order in which to evaluate basic graph patterns. These are the responsibility of the custom storage layer. Low-level optimization can be carried out dynamically as part of evaluation.
Internally, ARQ uses iterators extensively. Where possible, evaluation of an operation
is achieved by feeding the stream of results from the previous stage into the evaluation.
A common pattern is to take each intermediate result one at a time (use QueryIterRepeatApply
to be called for each binding) , substituting the variables of pattern with those
in the incoming binding, and evaluating to a query iterator of all results for this
incoming row. The result can be the empty iterator (one that always
returns false for hasNext
). It is also common to not have
to touch the incoming stream at all but merely to pass it to sub-operations.
The steps from algebra generation to query evaluation are carried out when a query
is executed via the QueryExecution.execSelect
or other QueryExecution
exec operation. It is possible to carry out storage-specific operations when the
query execution is created. A query engine works in conjunction with a
QueryExecution
created by the QueryExecutionFactory
to provide the evaluation of a query pattern. QueryExecutionBase
provides all the machinery for the different result types and does not
need to be modified by extensions to query execution.
ARQ provides three query engine factories; the main query engine factory, one for a reference query engine and one to remotely execute a query. SDB and TDB provide their own query engine factories which they register during sub-system initialization. Both extend the main query engine described below.
The reference query engine is a direct top-down evaluation of the expression. It's purpose is to be simple so it can be easily verified and checked then its results used to check more complicated processing in the main engine and other implementations. All arguments to each operator are fully evaluated to produce intermediate in-memory tables then a simple implementation of the operator is called to calculate the results. It does not scale and does not perform any optimizations. It is intended to be clear and simple; it is not designed to be efficient.
Query engines are chosen by referring to the registry of query engine factories.
public interface QueryEngineFactory { public boolean accept(Query query, DatasetGraph dataset, Context context) ; public Plan create(Query query, DatasetGraph dataset, Binding inputBinding, Context context) ; public boolean accept(Op op, DatasetGraph dataset, Context context) ; public Plan create(Op op, DatasetGraph dataset, Binding inputBinding, Context context) ; }
When the query execution factory is given a dataset and query, the query execution
factory tries each registered engine factory in turn calling the accept
method (for query of algebra depending on how it was presented). The registry is
kept in reverse registration order - the most recently registered query engine factory
is tried first. The first query engine factor to return true is chosen and no further
engine factories are checked.
When a query engine factory is chosen, the create
method is called
to return a Plan
object for the execution. The main operation of the
plan interface is to get the QueryIterator
for the query.
See the example in src-examples/arq.examples.engine.MyQueryEngine
.
The main query engine can execute any query. It contains a number of basic graph
pattern matching implementations including one that uses the Graph.find
operation so it can work with any implementation of the Jena Graph SPI. The main
query engine works with general purpose datasets but not quad stores directly; it
evaluates patterns on each graph in turn.
The main query engine includes optimizations for the standard Jena implementation
of in-memory graphs.
High-level optimization is performed by a sequence of transformations. This set of
optimizations is evolving. A custom implementation of a query engine can reuse some
or all of these transformations (see Algebra.optimize
which is the
set of transforms used by the main query engine).
The main query engine is a streaming engine. It evaluates expressions as the client
consumes each query solution.
After preparing the execution by creating the initial conditions (a partial solution
of one row and no bound variables or any initial bindings of variables), the main
query engine calls
QC.execute
which is the algorithm to execute a query. Any extension
that wished to reuse some of the main query engine by providing it's own OpExecutor
must call this method to evaluate a sub-operation.
QC.execute
finds the currently active OpExecutor
factory,
creates an OpExecutor
object and invokes it to evaluate one algebra
operation.
There are two points of extension for the main query engine:
OpExecutor
to execute any algebra operator specially.
The standard OpExecutor
invokes the stage generator mechanism to match
a basic graph pattern.
The correct point to hook into ARQ for just extending basic graph pattern matching
(BGPs) is to provide a custom StageGenerator
. (To hook into filtered
basic graph patterns, the extension will need to provide it's own OpExecutor
factory). The advantage of the StageGenerator
mechanism, as compared
to the more general OpExecutor
described below, is that it more self-contained
and requires less detail about the internal evaluation of the other SPARQL algebra
operators. This extension point corresponds to section 12.6 "Extending
SPARQL Basic Graph Matching".
Below is the default code to match a BGP from OpExecutor.execute(OpBGP, QueryIterator)
.
It merely calls fixed code in the StageBuilder
class.The input is a stream of results from earlier stages. The execution must return
a query iterator that is all the possible ways to match the basic graph pattern
for each of the inputs in turn. Order of results does not matter.
protected QueryIterator execute(OpBGP opBGP, QueryIterator input) { BasicPattern pattern = opBGP.getPattern() ; return StageBuilder.execute(pattern, input, execCxt) ; }
The StageBuilder
looks for the stage generator by accessing the context
for the execution:
StageGenerator stageGenerator = (StageGenerator)context.get(ARQ.stageGenerator) ;
where the context is the global context and any query execution specific additions together with various execution control elements.
A StageGenerator
is an implementation of:
public interface StageGenerator { public QueryIterator execute(BasicPattern pattern, QueryIterator input, ExecutionContext execCxt) ; }
An extension stage generator can be registered on a per-query execution basis or (more usually) in the global context.
StageBuilder.setGenerator(Context, StageGenerator)
The global context can be obtained by a call to ARQ.getContext()
StageBuilder.setGenerator(ARQ.getContext()
, myStageGenerator) ;
In order to allow an extensions to still permit other graphs to be used, stage generators are usually chained, with each new custom one passing the execution request up the chain if the request is not supported by this custom stage generator.
public class MyStageGenerator implements StageGenerator { StageGenerator above = null ; public MyStageGenerator (StageGenerator original) { above = original ; } @Override public QueryIterator execute(BasicPattern pattern, QueryIterator input, ExecutionContext execCxt) { Graph g = execCxt.getActiveGraph() ; // Test to see if this is a graph we support. if ( ! ( g instanceof MySpecialGraphClass ) ) // Not us - bounce up the StageGenerator chain return above.execute(pattern, input, execCxt) ; MySpecialGraphClass graph = (MySpecialGraphClass )g ; // Create a QueryIterator for this request ...
This is registered by setting the global context (StageBuilder
has
a convenience operation to do this):
// Get the standard one.
StageGenerator orig = (StageGenerator)ARQ.getContext().get(ARQ.stageGenerator) ;
// Create a new one
StageGenerator myStageGenerator= new MyStageGenerator(orig) ;
// Register it
StageBuilder.setGenerator(ARQ.getContext()
, myStageGenerator) ;
Example: src-examples/arq.examples.bgpmatching
.
A StageGenerator
provides matching for a basic graph pattern. If an
extension wishes to take responsibility for more of the evaluation then it needs
to work with OpExecutor
. This includes evaluation of filtered basic
graph patterns.
An example query using a filter:
PREFIX dc: <http://purl.org/dc/elements/1.1/> PREFIX books: <http://example.org/book/> SELECT * WHERE { ?book dc:title ?title . FILTER regex(?title, "Paddington") }
results in the algebra expression for the pattern:
(filter (regex ?title "Paddington") (bgp (triple ?book dc:title ?title) ))
showing that the filter is being applied to the results of a basic graph pattern matching.
Note: this is not the way to provide custom filter operations. See the documentation for application-provided filter functions.
Each step of evaluation in the main query engine is performed by a OpExecutor
and a new one is created from a factory at each step. The factory is registered
in the execution context. The implementation of a specialized OpExecutor
can inherit from the standard one and override only those algebra operators it wishes
to deal with, including inspecting the execution and choosing to passing up to the
super-class based on the details of the operation. From the query above, only
regex filters might be specially handled.
Registering an OpExecutorFactory
:
OpExecutorFactory customExecutorFactory = new MyOpExecutorFactory(...) ; QC.setFactory(ARQ.getCOntext(), customExecutorFactory) ;
QC is a point of indirection that chooses the execution process at each stage in
a query so if the custom execution wishes to evaluate an algebra operation within
another operation, it shoudl call QC.execute
. Be careful not to loop endlessly
if the operation is itself handled by the custom evaluator. This can be done by
swapping in a different OpExecutorFactory
.
// Execute an operation with a different OpExecution Factory // New context. ExecutionContext ec2 = new ExecutionContext(execCxt) ; ec2.setExecutor(plainFactory) ; QueryIterator qIter = QC.execute(op, input, ec2) ;
private static OpExecutorFactory plainFactory = new OpExecutorFactory() { @Override public OpExecutor create(ExecutionContext execCxt) { // The default OpExecutor of ARQ. return new OpExecutor(execCxt) ; } } ;
If a custom extension provides named graphs, then it may be useful to execute the
quad form of the query. This is done by writing a custom query engine and overriding
QueryEngineMain.modifyOp
:
@Override protected Op modifyOp(Op op) { // Cope with initial bindings. op = Substitute.substitute(op, initialInput) ; // Use standard optimizations. op = super.modifyOp(op) ; // Turn into quad form. op = Algebra.toQuadForm(op) ; return op ; }
The extension may need to provide its own dataset implementation so that it can detect when queries are directed to its named graph storage. SDB and TDB are examples of this.
The dataset implementation used in normal operation does not work on quads but instead
can provide a dataset with a collection of graphs each from different implementation
sub-systems. In-memory graphs can be mixed with database backed graphs as well as
custom storage systems. Query execution proceeds per-graph so that an custom OpExecutor
will need to test the graph to work with to make sure it is of the right class.
The pattern in the StageGenerator
extension point is an example of
design pattern in that situation.
A custom query engine enables an extension to choose which datasets it wishes to
handle. It also allows the extension to intercept query execution during the setup
of the execution so it can modify the algebra expression, introduce it's own algebra
extensions, choose which high-level optimizations to apply and also transform to
the expression into quad form. Execution can proceed with the normal algorithm or
a custom OpExecutor
or a custom Stage Generator or a combination of
all three extension mechanism.
Only a small, skeleton custom query engine is needed to intercept the initial setup.
See the example in src-examples/arq.examples.engine.MyQueryEngine
.
While it is possible to replace the entire process of query evaluation, this is
a substantial endeavour. QueryExecutionBase
provides the machinery
for result presentation (SELECT
, CONSTRUCT
,
DESCRIBE
, ASK
), leaving the work of
pattern evaluation to the custom query engine. QueryExecutionFactory
assumes that QueryExecutionBase
will be used.
New operators can be added to the algebra using the OpExt
class as
the super-class of the new operator. They can be inserted into the expression to
be evaluated using a custom query engine to intercept evaluation initialization.
When evaluation of a query requires the evaluation of a sub-class of OpExt
,
the eval
method is called. SDB uses this to introduce an operator that
is implemented in SQL.