Jena has operations useful in processing RDF in a streaming fashion. Streaming can be used for manipulating RDF at scale. Jena provides high performance readers and writers for all standard RDF formats, and it can be extended with custom formats.
RDF Binary, which uses Apache Thrift, provides the highest input parsing performance. Among the W3C standard formats, N-Triples and N-Quads provide the highest input parsing performance.
Files ending in .gz are assumed to be gzip-compressed. Input and output to such files take this into account, including looking at the other file extension to determine the syntax. For example, data.nt.gz is parsed as a gzip-compressed N-Triples file.
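The extension handling described above can be sketched with the JDK alone. This is a minimal illustration, not Jena's implementation: the class `GzipNames` and its methods `isGzip`, `innerExtension` and `open` are hypothetical names. The sketch strips a trailing `.gz` to find the extension that identifies the syntax, and wraps the input in a `GZIPInputStream` when the name indicates compression.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper, not part of Jena: illustrates ".gz" handling by file name. */
final class GzipNames {
    private GzipNames() {}

    /** True if the file name ends in ".gz" (case-insensitive). */
    static boolean isGzip(String filename) {
        return filename.toLowerCase(Locale.ROOT).endsWith(".gz");
    }

    /**
     * Extension that identifies the syntax, ignoring a trailing ".gz".
     * Returns the empty string when no such extension is present.
     */
    static String innerExtension(String filename) {
        String name = isGzip(filename)
                ? filename.substring(0, filename.length() - ".gz".length())
                : filename;
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Open a file, decompressing transparently when the name ends in ".gz". */
    static InputStream open(Path path) throws IOException {
        InputStream in = Files.newInputStream(path);
        if (!isGzip(path.getFileName().toString())) {
            return in;
        }
        try {
            return new GZIPInputStream(in);
        } catch (IOException e) {
            in.close(); // not a valid gzip stream: release the file before failing
            throw e;
        }
    }
}
```

Under these assumptions, `GzipNames.innerExtension("data.nt.gz")` gives `nt`, the same extension as for the uncompressed name `data.nt`, so the choice of parser does not depend on whether the file is compressed.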
The central abstraction is StreamRDF, which is an interface for streamed RDF data. It covers triples and quads, and also parser events for prefix settings and base URI declarations.

    public interface StreamRDF {
        /** Start processing */
        public void start() ;

        /** Triple emitted */
        public void triple(Triple triple) ;

        /** Quad emitted */
        public void quad(Quad quad) ;

        /** base declaration seen */
        public void base(String base) ;

        /** prefix declaration seen */
        public void prefix(String prefix, String iri) ;

        /** Finish processing */
        public void finish() ;
    }

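The callback lifecycle of the interface (start, one call per item, finish) can be illustrated without Jena on the classpath. The sketch below is an assumption-laden stand-in, not Jena code: `MiniStreamRDF` is a hypothetical interface that mirrors only `start`, `triple` and `finish`, with a triple reduced to three strings, and `CountingSink`, `PredicateFilter` and `MiniStreamDemo` are illustrative names. With Jena itself, the same shape would implement StreamRDF and receive Triple and Quad objects.

```java
import java.util.List;

/** Hypothetical stand-in for StreamRDF: string triples, no quads, base or prefixes. */
interface MiniStreamRDF {
    void start();
    void triple(String subject, String predicate, String object);
    void finish();
}

/** Destination that only counts what it receives; nothing is stored. */
final class CountingSink implements MiniStreamRDF {
    private long count = 0;
    private boolean finished = false;

    @Override public void start() { count = 0; finished = false; }
    @Override public void triple(String s, String p, String o) { count++; }
    @Override public void finish() { finished = true; }

    long count() { return count; }
    boolean isFinished() { return finished; }
}

/** Wrapper that forwards only triples with a given predicate to another stream. */
final class PredicateFilter implements MiniStreamRDF {
    private final String wanted;
    private final MiniStreamRDF next;

    PredicateFilter(String wanted, MiniStreamRDF next) {
        this.wanted = wanted;
        this.next = next;
    }

    @Override public void start() { next.start(); }
    @Override public void triple(String s, String p, String o) {
        if (wanted.equals(p)) next.triple(s, p, o);
    }
    @Override public void finish() { next.finish(); }
}

final class MiniStreamDemo {
    private MiniStreamDemo() {}

    /** Plays the role of a parser: start, one call per triple, then finish. */
    static void send(List<String[]> triples, MiniStreamRDF dest) {
        dest.start();
        for (String[] t : triples) dest.triple(t[0], t[1], t[2]);
        dest.finish();
    }
}
```

Because each stage sees one triple at a time and keeps no collection of them, memory use in this sketch does not grow with the size of the input, which is the property that makes the streaming style suitable for large data.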
There are utilities to help:

- StreamRDFLib: create StreamRDF objects
- StreamOps: helpers for sending RDF data to StreamRDF objects

All parsers of RDF syntaxes provided by RIOT are streaming, with the exception of JSON-LD. A JSON object can have members in any order, so the parser may need the whole top-level object in order to have the information needed for parsing.
The parse functions of RDFDataMgr direct the output of the parser to a StreamRDF. For example:

    StreamRDF destination = ...
    RDFDataMgr.parse(destination, "http://example/data.ttl") ;

The above code reads the remote URL, with content negotiation, and sends the triples to the destination.
Not all RDF formats are suitable for writing as a stream. Formats that
provide pretty printing (for example the default RDFFormat
for each of
Turtle, TriG and RDF/XML) require analysis of the whole of a model in order
to determine nestable structures of blank nodes and for using specific
syntax for RDF lists.
These languages can be used for streaming output but with an appearance that is necessarily "less pretty". See "Streamed Block Formats" for details.
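One way to see why streamed output is "less pretty" is that a streaming writer can only use what it has already seen. The sketch below is a hypothetical, JDK-only illustration and not Jena's writer: it groups consecutive triples that share a subject into one Turtle-like block and closes the block as soon as the subject changes. The class name `BlockWriter` and the string-based triples (terms assumed to be already in Turtle syntax) are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch, not Jena's writer: Turtle-like blocks from a triple stream. */
final class BlockWriter {
    private final Appendable out;
    private String currentSubject = null; // subject of the block that is still open

    BlockWriter(Appendable out) {
        this.out = out;
    }

    /** Write one triple; terms are assumed to be already in Turtle syntax. */
    void triple(String s, String p, String o) {
        if (s.equals(currentSubject)) {
            // Same subject as the previous triple: continue the open block.
            append(" ;\n    " + p + " " + o);
        } else {
            // Subject changed: close the open block, if any, and start a new one.
            closeBlock();
            append(s + " " + p + " " + o);
            currentSubject = s;
        }
    }

    /** Close the last block; call once at the end of the stream. */
    void finish() {
        closeBlock();
    }

    private void closeBlock() {
        if (currentSubject != null) {
            append(" .\n");
            currentSubject = null;
        }
    }

    private void append(String text) {
        try {
            out.append(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If triples for one subject arrive, then triples for another, then more for the first, this sketch writes three blocks. A pretty printer that holds the whole graph could merge the first and third, which is the kind of whole-model analysis a streaming writer has to give up.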
The StreamRDFWriter class has functions that write graphs and datasets using a streaming writer:

    StreamRDFWriter.write(output, model.getGraph(), lang) ;

It also provides for the creation of a StreamRDF backed by a stream-based writer, so the same output can be produced as:

    StreamRDF writer = StreamRDFWriter.getWriterStream(output, lang) ;
    StreamOps.graphToStream(writer, model.getGraph()) ;

N-Triples and N-Quads are always written as a stream.

RDFFormat | Lang shortcut |
---|---|
RDFFormat.TURTLE_BLOCKS | Lang.TURTLE |
RDFFormat.TURTLE_FLAT | |
RDFFormat.TRIG_BLOCKS | Lang.TRIG |
RDFFormat.TRIG_FLAT | |
RDFFormat.NTRIPLES_UTF8 | Lang.NTRIPLES |
RDFFormat.NTRIPLES_ASCII | |
RDFFormat.NQUADS_UTF8 | Lang.NQUADS |
RDFFormat.NQUADS_ASCII | |
RDFFormat.TRIX | Lang.TRIX |
RDFFormat.RDF_THRIFT | Lang.RDFTHRIFT |