LARQ is a combination of ARQ and Lucene. It gives ARQ the ability to perform free text searches. Lucene indexes are additional information for accessing the RDF graph, not storage for the graph itself.
Some example code is available in the directory src-examples/arq/examples in the ARQ distribution.
Two helper commands are provided: arq.larqbuilder and arq.larq. They support updating and querying LARQ indexes.
A full description of the free text query language syntax is given in the Lucene query syntax document.
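There are three basic usage patterns supported. Pattern 1 indexes string literals, and a search returns the matching literals. Pattern 2 indexes subject resources by a string literal, and a search returns the subjects. Pattern 3 indexes graph nodes by content that is held outside the graph.
Patterns 1 and 2 have the indexed content in the graph. Both can be modified by specifying a property so that only values of that property are indexed. Pattern 2 is less flexible, as discussed below. Pattern 3 is covered separately.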
LARQ can be used in other ways as well; the supplied classes cover these patterns. In both patterns 1 and 2, the strings indexed are plain strings, strings with any language tag, and literals with datatype XSD string.
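The selection rule for patterns 1 and 2 can be sketched in plain Java. This is only an illustration of the rule as described here, not LARQ's implementation: the class `LiteralIndexFilter` and its method `shouldIndex` are hypothetical, and a literal is represented only by its datatype URI (with `null` standing for a plain literal, with or without a language tag).

```java
/**
 * Hypothetical stand-in for the literal selection rule described in the text.
 * Not LARQ code: a literal is represented by its datatype URI only.
 */
final class LiteralIndexFilter {
    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

    /**
     * @param predicateUri property of the statement that holds the literal
     * @param datatypeUri  datatype of the literal, or null for a plain literal
     *                     (with or without a language tag)
     * @param onlyProperty property to restrict indexing to, or null for no restriction
     * @return true if the literal would be indexed under the stated rule
     */
    static boolean shouldIndex(String predicateUri, String datatypeUri, String onlyProperty) {
        // Plain literals and xsd:string literals count as strings.
        boolean isString = datatypeUri == null || XSD_STRING.equals(datatypeUri);
        // With a property restriction, only statements using that property qualify.
        boolean propertyOk = onlyProperty == null || onlyProperty.equals(predicateUri);
        return isString && propertyOk;
    }

    public static void main(String[] args) {
        String title = "http://purl.org/dc/elements/1.1/title";
        System.out.println(shouldIndex(title, null, null));        // plain literal, no restriction
        System.out.println(shouldIndex(title, XSD_STRING, title)); // xsd:string, matching property
    }
}
```

Both calls in `main` satisfy the rule. A literal with another datatype, such as xsd:integer, or a statement using a different property than the restriction, would be skipped.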
There are many ways to use Lucene, which can be set up to handle particular features or languages. The index is created outside the ARQ query system proper and is only accessed at query time. LARQ includes some platform classes and also utility classes to create indexes on string literals for the use cases above. Indexing can be performed as the graph is read in, or the index can be built from an existing graph.
An index builder is a class to create a Lucene index from RDF data.
IndexBuilderString
: This is the most commonly used index builder. It indexes plain literals (with or without language tags) and XSD strings and stores the complete literal. Optionally, a property can be supplied which restricts indexing to strings in statements using that property.

IndexBuilderSubject
: Indexes the subject resource by a string literal and stores the subject resource, possibly restricted by a specified property.

Lucene has many ways to create indexes and the index builder classes do not attempt to provide all possible Lucene features. Applications may need to extend or modify the standard index builders provided by LARQ.
An index can be built while reading RDF into a model:

    // -- Read and index all literal strings.
    IndexBuilderString larqBuilder = new IndexBuilderString() ;

    // -- Index statements as they are added to the model.
    model.register(larqBuilder) ;
    FileManager.get().readModel(model, datafile) ;

    // -- Finish indexing
    larqBuilder.closeWriter() ;
    model.unregister(larqBuilder) ;

    // -- Create the access index
    IndexLARQ index = larqBuilder.getIndex() ;

or it can be created from an existing model:

    // -- Index all literal strings.
    IndexBuilderString larqBuilder = new IndexBuilderString() ;

    // -- Create an index based on existing statements
    larqBuilder.indexStatements(model.listStatements()) ;

    // -- Finish indexing
    larqBuilder.closeWriter() ;

    // -- Create the access index
    IndexLARQ index = larqBuilder.getIndex() ;

Next the index is made available to ARQ. This can be done globally:

    // -- Make globally available
    LARQ.setDefaultIndex(index) ;

or it can be set on a per-query execution basis.

    QueryExecution qExec = QueryExecutionFactory.create(query, model) ;

    // -- Make available to this query execution only
    LARQ.setDefaultIndex(qExec.getContext(), index) ;

In both these cases, the default index is set, which is the one expected by the property function pf:textMatch. Use of multiple indexes in the same query can be achieved by introducing new properties. The application can subclass the search class com.hp.hpl.jena.query.larq.LuceneSearch to set different indexes with different property names.
Query execution is as usual, using the property function pf:textMatch. "textMatch" can be thought of as an implied relationship in the data. Note that the prefix ends in "#".

    String queryString = StringUtils.join("\n", new String[]{
        "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>",
        "SELECT * {" ,
        "    ?lit pf:textMatch '+text'",
        "}"
    }) ;
    Query query = QueryFactory.create(queryString) ;
    QueryExecution qExec = QueryExecutionFactory.create(query, model) ;
    ResultSetFormatter.out(System.out, qExec.execSelect(), query) ;

The subjects with a property value of the matched literals can be retrieved by looking up the literals in the model:

    PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
    SELECT ?doc
    {
        ?lit pf:textMatch '+text' .
        ?doc ?p ?lit
    }

This is a more flexible way of achieving the effect of using an IndexBuilderSubject. IndexBuilderSubject can be more compact when there are many large literals (it stores the subject, not the literal) but does not work for blank node subjects without extremely careful co-ordination with a persistent model. Looking the literal up in the model does not have this complication.
The application can get access to the Lucene match score by using a list argument for the subject of pf:textMatch. The list must have two arguments, both unbound variables at the time of the query.

    PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
    SELECT ?doc ?score
    {
        (?lit ?score) pf:textMatch '+text' .
        ?doc ?p ?lit
    }

When used with just a query string, pf:textMatch returns all the Lucene matches. In many applications, the application is only interested in the first few matches (Lucene returns matches in order, highest scoring first), or only in matches above some score threshold. The query argument that forms the object of the pf:textMatch property can also be a list that adds a score threshold, a total limit on the number of results matched, or both.

    ?lit pf:textMatch ( '+text' 100 ) .       # Limit to at most 100 hits
    ?lit pf:textMatch ( '+text' 0.5 ) .       # Limit to Lucene scores of 0.5 and over
    ?lit pf:textMatch ( '+text' 0.5 100 ) .   # Limit to scores of 0.5 and over, and to at most 100 hits

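When the threshold and limit come from application settings, the three list forms above can be generated rather than written by hand. The following is a minimal sketch in plain Java. The class `TextMatchPattern` and its methods are hypothetical helpers, not part of LARQ. It also assumes, as the examples above suggest, that a number written with a decimal point is taken as the score threshold and an integer as the hit limit.

```java
import java.math.BigDecimal;

/** Hypothetical helper, not part of LARQ: builds pf:textMatch triple patterns as text. */
final class TextMatchPattern {

    /**
     * @param luceneQuery Lucene query string, e.g. "+text"
     * @param minScore    score threshold, or null for none
     * @param maxHits     limit on the number of hits, or null for none
     * @return a triple pattern using ?lit as the subject variable
     */
    static String pattern(String luceneQuery, Double minScore, Integer maxHits) {
        String literal = "'" + escape(luceneQuery) + "'";
        if (minScore == null && maxHits == null) {
            // Plain form: the object is just the query string.
            return "?lit pf:textMatch " + literal + " .";
        }
        // List form: query string, then optional score threshold, then optional limit.
        StringBuilder list = new StringBuilder("( ").append(literal);
        if (minScore != null) {
            list.append(' ').append(decimal(minScore));
        }
        if (maxHits != null) {
            list.append(' ').append(maxHits.intValue());
        }
        list.append(" )");
        return "?lit pf:textMatch " + list + " .";
    }

    /** Escape backslashes and single quotes for a SPARQL single-quoted string. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("'", "\\'");
    }

    /** Render the score with a decimal point (assumption: this marks it as a score, not a limit). */
    private static String decimal(double score) {
        String plain = BigDecimal.valueOf(score).toPlainString();
        return plain.contains(".") ? plain : plain + ".0";
    }

    public static void main(String[] args) {
        System.out.println(pattern("+text", 0.5, 100));
    }
}
```

The returned pattern still needs the `PREFIX pf:` declaration and a surrounding `SELECT` before it is passed to `QueryFactory.create`.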
The IndexLARQ class provides the ability to search programmatically, not just from ARQ. The searchModelByIndex method returns an iterator over RDFNodes.

    // -- Create the access index
    IndexLARQ index = larqBuilder.getIndex() ;

    NodeIterator nIter = index.searchModelByIndex("+text") ;
    for ( ; nIter.hasNext() ; )
    {
        // if it's an index storing literals ...
        Literal lit = (Literal)nIter.nextNode() ;
    }

Sometimes the index needs to be created from external material, with the index returning nodes in the graph. This can be done by using IndexBuilderNode, a helper class that relates external material to some RDF node.
Here, the indexed content is not in the RDF graph at all. For example, the indexed content may come from HTML, XHTML, PDF or XML documents, and the RDF graph only holds the metadata about these content items.
The Lucene contributions page lists some content converters.
A new LARQ is available as a separate module from ARQ, which enables the two modules to have independent release cycles. The Lucene dependency has been upgraded from v2.3.1 to v3.1.0 (i.e. the latest stable Lucene release). There are two other improvements to LARQ. The first is support for index removals/deletions, which can be used to keep a Lucene index in sync with an RDF Dataset/DataSource as RDF triples are added to it or removed from it. The second is duplicate avoidance using the Lucene index itself instead of in-memory data structures. These two improvements required an additional field in the Lucene index, so a reindex is necessary to use the new LARQ module.
Once LARQ is included in the classpath, the larq.larqbuilder and larq.larq helper commands are available. They work the same as the arq.larqbuilder and arq.larq commands, with only one additional option for larq.larqbuilder:
--allow-duplicates
: Suppress duplicate avoidance using the Lucene index. This is recommended for bulk indexing of large RDF datasets, even though it might add a few duplicate documents to the Lucene index.

The new LARQ module is distributed as a Maven artifact and can be included in a project, like any other dependency, using:

    <dependency>
      <groupId>org.apache.jena</groupId>
      <artifactId>larq</artifactId>
      <version>0.2.2-SNAPSHOT</version>
    </dependency>

It is possible to attach an existing Lucene index built by larqbuilder to an RDF Dataset using the ja:textIndex property. For example, this is the assembler specification of a TDB Dataset with LARQ enabled:

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .
    @prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .

    [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
    tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
    tdb:GraphTDB   rdfs:subClassOf ja:Model .

    <#dataset> rdf:type tdb:DatasetTDB ;
        tdb:location "/path/to/tdb/indexes/" ;
        ja:textIndex "/path/to/lucene/index/" ;
        .