Title: ARQ - Extending Query Execution

This page describes the mechanisms that can be used to extend and modify query execution within ARQ. Through these mechanisms, ARQ can be used to query different graph implementations and to provide different query evaluation and optimization strategies for particular circumstances. These mechanisms are used by [TDB](../tdb/) and [SDB](../sdb/).

ARQ can be [extended in various ways](extension.html) to incorporate custom code into a query. [Custom filter functions](extension.html#value-functions) and [property functions](extension.html#property-functions) provide ways to add application-specific code. The [free text search](/documentation/larq/) capabilities, using Apache Lucene, are provided via a property function. Custom filter functions and property functions should be used where possible.

Jena itself can be extended by providing a new implementation of the `Graph` interface. This can be used to encapsulate specific, specialised storage and also for wrapping non-RDF sources to look like RDF. There is a common implementation framework provided by `GraphBase`, so only one operation, the `find` method, needs to be written for a read-only data source. Basic `find` works well in many cases, and the whole Jena API will be able to use the extension.

For higher SPARQL performance, ARQ can be extended at the [basic graph matching](#stage-generator) or [algebra level](#opexecutor). Application writers who extend ARQ at the query execution level should be prepared to work with the source code for ARQ for specific details and for finding code to reuse. Some examples can be found in the `src-examples` directory of the ARQ download.

- [Overview of ARQ Query Processing](#overview)
- [The Main Query Engine](#main-query-engine)
- [Graph matching and a custom StageGenerator](#stage-generator)
- [OpExecutor](#opexecutor)
- [Quads](#quads)
- [Mixed Graph Implementation Datasets](#mixed-datasets)
- [Custom Query Engines](#custom-query-engines)
- [Extend the algebra](#algebra-extensions)

## Overview of ARQ Query Processing

The sequence of actions performed by ARQ to perform a query is parsing, algebra generation, execution building, high-level optimization, low-level optimization and, finally, evaluation. It is not usual to modify the parsing step nor the conversion from the parse tree to the algebra form, which is a fixed algorithm defined by the SPARQL standard. Extensions can modify the algebra form by transforming it from one algebra expression to another, including introducing new operators.

See also the documentation on [working with the SPARQL algebra in ARQ](algebra.html), including building algebra expressions programmatically, rather than obtaining them from a query string.

### Parsing

The parsing step turns a query string into a `Query` object. The class `Query` represents the abstract syntax tree (AST) for the query and provides methods to create the AST, primarily for use by the parser. The query object also provides methods to serialize the query to a string. Because this is the AST, the string produced is very close to the original query, with the same syntactic elements, but without comments, and formatted with whitespace for readability. It is not usually the best way to build a query programmatically, and the AST is not normally an extension point.

The query object can be used many times. It is not modified once created, and in particular it is not modified by query execution.
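For instance, parsing and re-serializing a query looks like this (a minimal sketch; the class name is illustrative and the import paths are those of current Apache Jena releases, older releases used `com.hp.hpl.jena.query`):

    import org.apache.jena.query.Query ;
    import org.apache.jena.query.QueryFactory ;

    public class ParseExample {
        public static void main(String[] args) {
            // Parse the query string into a Query object (the AST).
            Query query = QueryFactory.create("SELECT * WHERE { ?s ?p ?o }") ;
            // Serialize the AST: same syntactic elements as the input,
            // no comments, formatted with whitespace.
            System.out.println(query) ;
        }
    }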
### Algebra generation

ARQ generates the [SPARQL algebra](http://www.w3.org/TR/sparql11-query/#sparqlQuery) expression for the query. After this, a number of transformations can be applied (for example, identification of property functions), but the first step is the application of the algorithm in the SPARQL specification for translating a SPARQL query string, as held in a `Query` object, into a SPARQL algebra expression. This includes the process of removing joins involving the identity pattern (the empty graph pattern).

For example, the query:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?mbox ?nick
    WHERE {
        ?x foaf:name ?name ;
           foaf:mbox ?mbox .
        OPTIONAL { ?x foaf:nick ?nick }
    }

becomes:

    (prefix ((foaf: <http://xmlns.com/foaf/0.1/>))
      (project (?name ?mbox ?nick)
        (leftjoin
          (bgp
            (triple ?x foaf:name ?name)
            (triple ?x foaf:mbox ?mbox)
          )
          (bgp (triple ?x foaf:nick ?nick))
        )))

using the [SSE syntax](../notes/sse.html) to write out the internal data structure for the algebra. The [online SPARQL validator](http://www.sparql.org/validator.html) at [sparql.org](http://sparql.org/) can be used to see the algebra expression for a SPARQL query. This validator is also included in [Fuseki](../serving_data/).

### High-Level Optimization and Transformations

There is a collection of transformations that can be applied to the algebra, such as replacing equality filters with a more efficient graph pattern and an assignment. When extending ARQ, a query processor for a custom storage layout can choose which optimizations are appropriate and can also provide its own algebra transformations.

A transform is code that converts an algebra operation into other algebra operations. It is applied using the `Transformer` class:

    Op op = ... ;
    Transform someTransform = ... ;
    op = Transformer.transform(someTransform, op) ;

The `Transformer` class applies the transform to each operation in the algebra expression tree. `Transform` itself is an interface, with one method signature for each operation type, returning a replacement for the operator instance it is called on.

One such transformation is to turn a SPARQL algebra expression involving named graphs and triples into one using quads. This transformation is performed by a call to `Algebra.toQuadForm`.

Transformations proceed from the bottom of the expression tree to the top. Algebra expressions are best treated as immutable, so a change made in one part of the tree should result in a copy of the tree above it. This is automated by the `TransformCopy` class, which is the commonly used base class for writing transforms. The other helper base class is `TransformBase`, which provides the identity operation (returns the node supplied) for each transform operation.

Operations can be printed out in [SSE](../notes/sse.html) syntax. The Java `toString` method is overridden to provide pretty printing, and the static methods in `WriterOp` provide output to various output objects like `java.io.OutputStream`.
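To give the flavour, here is a minimal sketch of a custom transform based on `TransformCopy` (the class name and the no-op rewrite of basic graph patterns are placeholders; only the overridden operation type is touched, everything else is copied unchanged):

    import org.apache.jena.sparql.algebra.Op ;
    import org.apache.jena.sparql.algebra.TransformCopy ;
    import org.apache.jena.sparql.algebra.op.OpBGP ;

    public class MyTransform extends TransformCopy {
        @Override
        public Op transform(OpBGP opBGP) {
            // Inspect or rewrite the basic graph pattern here.
            // Returning the operator unchanged is always safe.
            return opBGP ;
        }
    }

It is applied as before: `op = Transformer.transform(new MyTransform(), op) ;`.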
### Low-Level Optimization and Evaluation

The step of evaluating a query is the process of executing the algebra expression, as modified by any transformations applied, to yield a stream of pattern solutions. Low-level optimizations include choosing the order in which to evaluate basic graph patterns. These are the responsibility of the custom storage layer. Low-level optimization can be carried out dynamically as part of evaluation.

Internally, ARQ uses iterators extensively. Where possible, evaluation of an operation is achieved by feeding the stream of results from the previous stage into the evaluation. A common pattern is to take each intermediate result one at a time (subclass `QueryIterRepeatApply` to be called with each binding), substituting the variables of the pattern with those in the incoming binding, and evaluating to a query iterator of all results for this incoming row. The result can be the empty iterator (one that always returns false for `hasNext`). It is also common not to have to touch the incoming stream at all, but merely to pass it on to sub-operations.

### Query Engines and Query Engine Factories

The steps from algebra generation to query evaluation are carried out when a query is executed via `QueryExecution.execSelect`, or another `QueryExecution` exec operation. It is possible to carry out storage-specific operations when the query execution is created.

A query engine works in conjunction with a `QueryExecution` created by the `QueryExecutionFactory` to provide the evaluation of a query pattern. `QueryExecutionBase` provides all the machinery for the different result types and does not need to be modified by extensions to query execution.

ARQ provides three query engine factories: the main query engine factory, one for a reference query engine and one to remotely execute a query. SDB and TDB provide their own query engine factories, which they register during sub-system initialization. Both extend the main query engine described below.

The reference query engine is a direct top-down evaluation of the expression. Its purpose is to be simple so it can be easily verified and checked, and then its results used to check more complicated processing in the main engine and other implementations. All arguments to each operator are fully evaluated to produce intermediate in-memory tables, then a simple implementation of the operator is called to calculate the results. It does not scale and does not perform any optimizations. It is intended to be clear and simple; it is not designed to be efficient.

Query engines are chosen by referring to the registry of query engine factories:

    public interface QueryEngineFactory {
        public boolean accept(Query query, DatasetGraph dataset, Context context) ;
        public Plan create(Query query, DatasetGraph dataset,
                           Binding inputBinding, Context context) ;

        public boolean accept(Op op, DatasetGraph dataset, Context context) ;
        public Plan create(Op op, DatasetGraph dataset,
                           Binding inputBinding, Context context) ;
    }

When the query execution factory is given a dataset and query, it tries each registered engine factory in turn, calling the `accept` method (for the query or algebra form, depending on how the query was presented). The registry is kept in reverse registration order - the most recently registered query engine factory is tried first. The first query engine factory to return true is chosen and no further engine factories are checked.

When a query engine factory is chosen, the `create` method is called to return a `Plan` object for the execution. The main operation of the plan interface is to get the `QueryIterator` for the query.

See the example in `src-examples/arq.examples.engine.MyQueryEngine`.
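As a concrete (if skeletal) illustration, the sketch below shows the shape of a factory and its registration. The class name is a placeholder and the `accept` tests are left returning false; a real factory would test for its own dataset implementation and build its own engine rather than delegating to `QueryEngineMain`:

    import org.apache.jena.query.Query ;
    import org.apache.jena.sparql.algebra.Op ;
    import org.apache.jena.sparql.core.DatasetGraph ;
    import org.apache.jena.sparql.engine.Plan ;
    import org.apache.jena.sparql.engine.QueryEngineFactory ;
    import org.apache.jena.sparql.engine.binding.Binding ;
    import org.apache.jena.sparql.engine.main.QueryEngineMain ;
    import org.apache.jena.sparql.util.Context ;

    public class MyQueryEngineFactory implements QueryEngineFactory {
        @Override
        public boolean accept(Query query, DatasetGraph dataset, Context context) {
            // Replace with a test for the extension's storage,
            // e.g. "dataset instanceof MyDatasetGraph" (hypothetical class).
            return false ;
        }

        @Override
        public Plan create(Query query, DatasetGraph dataset, Binding inputBinding, Context context) {
            // A real extension would create its own engine; delegating to
            // the main engine is shown only to complete the sketch.
            return new QueryEngineMain(query, dataset, inputBinding, context).getPlan() ;
        }

        @Override
        public boolean accept(Op op, DatasetGraph dataset, Context context) {
            return false ;
        }

        @Override
        public Plan create(Op op, DatasetGraph dataset, Binding inputBinding, Context context) {
            return new QueryEngineMain(op, dataset, inputBinding, context).getPlan() ;
        }
    }

Registration makes it the first factory tried:

    QueryEngineRegistry.addFactory(new MyQueryEngineFactory()) ;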
## The Main Query Engine

The main query engine can execute any query. It contains a number of basic graph pattern matching implementations, including one that uses the `Graph.find` operation so it can work with any implementation of the Jena Graph SPI. The main query engine works with general-purpose datasets, but not quad stores directly; it evaluates patterns on each graph in turn. It includes optimizations for the standard Jena implementation of in-memory graphs.

High-level optimization is performed by a sequence of transformations. This set of optimizations is evolving. A custom implementation of a query engine can reuse some or all of these transformations (see `Algebra.optimize`, which is the set of transforms used by the main query engine).

The main query engine is a streaming engine. It evaluates expressions as the client consumes each query solution. After preparing the execution by creating the initial conditions (a partial solution of one row and no bound variables, or any initial bindings of variables), the main query engine calls `QC.execute`, which is the algorithm to execute a query. Any extension that wishes to reuse some of the main query engine by providing its own `OpExecutor` must call this method to evaluate a sub-operation. `QC.execute` finds the currently active `OpExecutor` factory, creates an `OpExecutor` object and invokes it to evaluate one algebra operation.

There are two points of extension for the main query engine:

- Stage generators, for evaluating basic graph patterns and reusing the rest of the engine.
- `OpExecutor`, to execute any algebra operator specially.

The standard `OpExecutor` invokes the stage generator mechanism to match a basic graph pattern.

## Graph matching and a custom StageGenerator

The correct point to hook into ARQ for just extending basic graph pattern (BGP) matching is to provide a custom `StageGenerator`. (To hook into filtered basic graph patterns, the extension will need to provide its own `OpExecutor` factory.) The advantage of the `StageGenerator` mechanism, as compared to the more general `OpExecutor` described below, is that it is more self-contained and requires less detail about the internal evaluation of the other SPARQL algebra operators. This extension point corresponds to section 12.6 "[Extending SPARQL Basic Graph Matching](http://www.w3.org/TR/sparql11-query/#sparqlBGPExtend)".

Below is the default code to match a BGP, from `OpExecutor.execute(OpBGP, QueryIterator)`. It merely calls fixed code in the `StageBuilder` class. The input is a stream of results from earlier stages. The execution must return a query iterator that is all the possible ways to match the basic graph pattern for each of the inputs in turn. Order of results does not matter.

    protected QueryIterator execute(OpBGP opBGP, QueryIterator input) {
        BasicPattern pattern = opBGP.getPattern() ;
        return StageBuilder.execute(pattern, input, execCxt) ;
    }

The `StageBuilder` looks for the stage generator by accessing the context for the execution:

    StageGenerator stageGenerator =
        (StageGenerator)context.get(ARQ.stageGenerator) ;

where the context is the global context, together with any query-execution-specific additions and various execution control elements.

A `StageGenerator` is an implementation of:

    public interface StageGenerator {
        public QueryIterator execute(BasicPattern pattern,
                                     QueryIterator input,
                                     ExecutionContext execCxt) ;
    }

### Setting the Stage Generator

An extension stage generator can be registered on a per-query execution basis, as shown in the sketch below, or (more usually) in the global context.
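A per-execution registration might look like this (a minimal sketch, assuming a current Jena release where `QueryExecution` is `AutoCloseable`; the class name is illustrative and `query`, `model` and `myStageGenerator` are assumed to be already prepared):

    import org.apache.jena.query.* ;
    import org.apache.jena.rdf.model.Model ;
    import org.apache.jena.sparql.engine.main.StageBuilder ;
    import org.apache.jena.sparql.engine.main.StageGenerator ;

    public class PerExecutionExample {
        public static void run(Query query, Model model, StageGenerator myStageGenerator) {
            try ( QueryExecution qExec = QueryExecutionFactory.create(query, model) ) {
                // Applies to this execution only; the global registration is untouched.
                StageBuilder.setGenerator(qExec.getContext(), myStageGenerator) ;
                ResultSetFormatter.out(qExec.execSelect()) ;
            }
        }
    }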
Registration uses:

    StageBuilder.setGenerator(Context, StageGenerator)

The global context can be obtained by a call to `ARQ.getContext()`:

    StageBuilder.setGenerator(ARQ.getContext(), myStageGenerator) ;

To allow an extension to still permit other graphs to be used, stage generators are usually chained, with each new custom one passing the execution request up the chain if the request is not supported by that custom stage generator.

    public class MyStageGenerator implements StageGenerator {
        StageGenerator above = null ;

        public MyStageGenerator(StageGenerator original) {
            above = original ;
        }

        @Override
        public QueryIterator execute(BasicPattern pattern,
                                     QueryIterator input,
                                     ExecutionContext execCxt) {
            Graph g = execCxt.getActiveGraph() ;
            // Test to see if this is a graph we support.
            if ( ! ( g instanceof MySpecialGraphClass ) )
                // Not us - bounce up the StageGenerator chain.
                return above.execute(pattern, input, execCxt) ;
            MySpecialGraphClass graph = (MySpecialGraphClass)g ;
            // Create a QueryIterator for this request ...
        }
    }

This is registered by setting the global context (`StageBuilder` has a convenience operation to do this):

    // Get the standard one.
    StageGenerator orig =
        (StageGenerator)ARQ.getContext().get(ARQ.stageGenerator) ;
    // Create a new one.
    StageGenerator myStageGenerator = new MyStageGenerator(orig) ;
    // Register it.
    StageBuilder.setGenerator(ARQ.getContext(), myStageGenerator) ;

Example: `src-examples/arq.examples.bgpmatching`.

## OpExecutor

A `StageGenerator` provides matching for a basic graph pattern. If an extension wishes to take responsibility for more of the evaluation, then it needs to work with `OpExecutor`. This includes evaluation of filtered basic graph patterns. An example query using a filter:

    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX books: <http://example.org/book/>

    SELECT *
    WHERE {
        ?book dc:title ?title .
        FILTER regex(?title, "Paddington")
    }

results in the algebra expression for the pattern:

    (filter (regex ?title "Paddington")
      (bgp (triple ?book dc:title ?title)))

showing that the filter is applied to the results of a basic graph pattern matching. Note: this is not the way to provide custom filter operations. See the documentation for [application-provided filter functions](extension.html#value-functions).

Each step of evaluation in the main query engine is performed by an `OpExecutor`, and a new one is created from a factory at each step. The factory is registered in the execution context. The implementation of a specialized `OpExecutor` can inherit from the standard one and override only those algebra operators it wishes to deal with, including inspecting the execution and choosing to pass up to the super-class based on the details of the operation. For the query above, only `regex` filters might be specially handled.

Registering an `OpExecutorFactory`:

    OpExecutorFactory customExecutorFactory = new MyOpExecutorFactory(...) ;
    QC.setFactory(ARQ.getContext(), customExecutorFactory) ;

`QC` is a point of indirection that chooses the execution process at each stage in a query, so if the custom execution wishes to evaluate an algebra operation within another operation, it should call `QC.execute`. Be careful not to loop endlessly if the operation is itself handled by the custom evaluator. This can be done by swapping in a different `OpExecutorFactory`:

    // Execute an operation with a different OpExecutorFactory.
    // New context.
    ExecutionContext ec2 = new ExecutionContext(execCxt) ;
    ec2.setExecutor(plainFactory) ;

    QueryIterator qIter = QC.execute(op, input, ec2) ;

    private static OpExecutorFactory plainFactory =
        new OpExecutorFactory() {
            @Override
            public OpExecutor create(ExecutionContext execCxt) {
                // The default OpExecutor of ARQ.
                return new OpExecutor(execCxt) ;
            }
        } ;
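Putting these pieces together, a minimal sketch of a specialized `OpExecutor` might look like the following (the class name and the choice to intercept only `OpFilter` are illustrative; the override signature is that of the standard `OpExecutor` base class in current Jena releases):

    import org.apache.jena.sparql.algebra.op.OpFilter ;
    import org.apache.jena.sparql.engine.ExecutionContext ;
    import org.apache.jena.sparql.engine.QueryIterator ;
    import org.apache.jena.sparql.engine.main.OpExecutor ;
    import org.apache.jena.sparql.engine.main.OpExecutorFactory ;

    public class MyOpExecutor extends OpExecutor {

        public static final OpExecutorFactory factory = new OpExecutorFactory() {
            @Override
            public OpExecutor create(ExecutionContext execCxt) {
                return new MyOpExecutor(execCxt) ;
            }
        } ;

        protected MyOpExecutor(ExecutionContext execCxt) {
            super(execCxt) ;
        }

        @Override
        protected QueryIterator execute(OpFilter opFilter, QueryIterator input) {
            // Inspect opFilter here; handle the cases this storage layer
            // supports, otherwise fall back to the standard behaviour.
            return super.execute(opFilter, input) ;
        }
    }

It would be registered with `QC.setFactory(ARQ.getContext(), MyOpExecutor.factory) ;` as above.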
## Quads

If a custom extension provides named graphs, then it may be useful to execute the quad form of the query. This is done by writing a custom query engine and overriding `QueryEngineMain.modifyOp`:

    @Override
    protected Op modifyOp(Op op) {
        // Cope with initial bindings.
        op = Substitute.substitute(op, initialInput) ;
        // Use standard optimizations.
        op = super.modifyOp(op) ;
        // Turn into quad form.
        op = Algebra.toQuadForm(op) ;
        return op ;
    }

The extension may need to provide its own dataset implementation so that it can detect when queries are directed to its named graph storage. [TDB](../tdb/) and [SDB](../sdb/) are examples of this.

## Mixed Graph Implementation Datasets

The dataset implementation used in normal operation does not work on quads, but instead can provide a dataset with a collection of graphs, each from a different implementation sub-system. In-memory graphs can be mixed with database-backed graphs as well as custom storage systems. Query execution proceeds per-graph, so a custom `OpExecutor` will need to test the graph it is working with to make sure it is of the right class. The chaining pattern shown for the `StageGenerator` extension point is an example of a design pattern for that situation.

## Custom Query Engines

A custom query engine enables an extension to choose which datasets it wishes to handle. It also allows the extension to intercept query execution during the setup of the execution, so it can modify the algebra expression, introduce its own algebra extensions, choose which high-level optimizations to apply and also transform the expression into quad form. Execution can proceed with the normal algorithm, or with a custom `OpExecutor`, or a custom `StageGenerator`, or a combination of all three extension mechanisms.

Only a small, skeleton custom query engine is needed to intercept the initial setup. See the example in `src-examples/arq.examples.engine.MyQueryEngine`.

While it is possible to replace the entire process of query evaluation, this is a substantial endeavour. `QueryExecutionBase` provides the machinery for result presentation (`SELECT`, `CONSTRUCT`, `DESCRIBE`, `ASK`), leaving the work of pattern evaluation to the custom query engine. `QueryExecutionFactory` assumes that `QueryExecutionBase` will be used.

## Algebra Extensions

New operators can be added to the algebra using the `OpExt` class as the super-class of the new operator. They can be inserted into the expression to be evaluated using a custom query engine to intercept evaluation initialization. When evaluation of a query requires the evaluation of a sub-class of `OpExt`, the `eval` method is called. SDB uses this to introduce an operator that is implemented in SQL.
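For illustration, a minimal do-nothing extension operator might look like the following sketch. The method set shown is that of `OpExt` and its `OpBase` super-class in current Jena releases (check the `OpExt` source for your version); the operator name and the identity `eval` are placeholders.

    import org.apache.jena.atlas.io.IndentedWriter ;
    import org.apache.jena.sparql.algebra.Op ;
    import org.apache.jena.sparql.algebra.op.OpExt ;
    import org.apache.jena.sparql.engine.ExecutionContext ;
    import org.apache.jena.sparql.engine.QueryIterator ;
    import org.apache.jena.sparql.serializer.SerializationContext ;
    import org.apache.jena.sparql.util.NodeIsomorphismMap ;

    public class OpMyExtension extends OpExt {

        public OpMyExtension() { super("myExtension") ; }

        @Override
        public QueryIterator eval(QueryIterator input, ExecutionContext execCxt) {
            // Called when the engine reaches this operator:
            // here, simply pass every binding through unchanged.
            return input ;
        }

        @Override
        public Op effectiveOp() {
            // No equivalent expression in the standard algebra.
            return null ;
        }

        @Override
        public void outputArgs(IndentedWriter out, SerializationContext sCxt) {}

        @Override
        public int hashCode() { return getName().hashCode() ; }

        @Override
        public boolean equalTo(Op other, NodeIsomorphismMap labelMap) {
            return other instanceof OpMyExtension ;
        }
    }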