Jena I/O Mini HowTo

This is a guide to the I/O subsystem of Jena. It has some obligatory reading for all Jena users migrating from Jena1 to Jena2. The bulk of the document is aimed at users wishing to use advanced features within the I/O subsystem.

1. Rush Guide - Jena1 Migration (Must Read)
2. Character Encoding in Java and XML
3. When to Use Reader and Writer?
4. Introduction to Advanced Jena I/O
5. Advanced RDF/XML Input
- 5.1 ARP properties
- 5.2 Interrupting ARP
6. Advanced RDF/XML Output
7. Conformance
8. Faster RDF/XML I/O

1. Rush Guide - Jena1 Migration (Must Read)

The Jena2 I/O subsystem uses InputStreams and OutputStreams where Jena1 used Readers and Writers.

The main I/O methods to use in Jena are found on the Model interface. These are:

`Model`	`read(java.io.InputStream in, java.lang.String base)` Add statements from an RDF/XML serialization
`Model`	`read(java.io.InputStream in, java.lang.String base, java.lang.String lang)` Add RDF statements represented in language `lang` to the model.
`Model`	`read(java.lang.String url)` Add the RDF statements from an XML document.
`Model`	`write(java.io.OutputStream out)` Write the model as an XML document.
`Model`	`write(java.io.OutputStream out, java.lang.String lang)` Write a serialized representation of a model in a specified language.
`Model`	`write(java.io.OutputStream out, java.lang.String lang, java.lang.String base)` Write a serialized representation of a model in a specified language.

The built-in languages are "RDF/XML", "RDF/XML-ABBREV", "N-TRIPLE" and "N3".

The Jena1 developer will note that all but one of these methods are new in Jena2. Jena1 mistakenly required the use of java.io.Reader and java.io.Writer. While these classes work well on a single machine, the way they address character encoding problems differs significantly from the solution offered by XML. A typical use of the Jena1 input interface such as mdl.read(new FileReader(fName)); is incorrect, and gives the wrong results if the file contains non-ASCII data. Also, in Jena1, mdl.write(new FileWriter(fName)); wrote incorrect XML for non-ASCII data (now fixed). However, these two bugs in Jena1 canceled each other out as long as the system doing the reading had the same default character encoding as the one doing the writing, which is why the problem did not bite in a typical development environment. Porting Jena1 code to Jena2 therefore involves a substantial migration task: reviewing all the I/O calls and changing them to use InputStreams and OutputStreams.
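The cancellation effect described above can be reproduced with the standard library alone. The following is a minimal sketch, not Jena code: the class `EncodingMismatchDemo` and its helper `writeThenRead` are hypothetical names introduced for illustration. It encodes a string with one charset and decodes the bytes with another, which is what happens when a FileWriter on one machine and a FileReader on another use different platform defaults.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    static String writeThenRead(String text, Charset writerCharset, Charset readerCharset) {
        byte[] bytes = text.getBytes(writerCharset);
        return new String(bytes, readerCharset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // contains one non-ASCII character

        // Same charset on both sides: the two "bugs" cancel out and the text survives.
        String same = writeThenRead(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println("same charset, text preserved: " + same.equals(original));

        // Different charsets: the non-ASCII character is misread.
        String mixed = writeThenRead(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("mixed charsets, text preserved: " + mixed.equals(original));
    }
}
```

Pure ASCII text survives either combination, which is why the problem tends to stay hidden until non-ASCII data appears.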

The old methods are still in the interface, and are not deprecated. They are useful, see below; and there is every intention to continue to support them. However, the RDF/XML parser now checks to see if the Model.read(Reader …) calls are being abused, and issues ERR_ENCODING_MISMATCH and WARN_ENCODING_MISMATCH errors. A naive Jena1 to Jena2 port will almost certainly result in such errors. This reflects code that had bugs in Jena1, which would exhibit themselves by misreading certain (non-ASCII) input files. The old output routines Model.write(Writer …) are not suitable for N3, except as indicated below. With RDF/XML by default they produce correct XML by using an appropriate XML declaration giving the encoding - e.g. <?xml version='1.0' encoding='ISO-8859-15'?> on my system. However, such XML is less portable than XML in UTF-8. Using the Model.write(OutputStream …) methods allows the Jena system code to choose UTF-8 encoding, which is the best choice.

1.1 RDF/XML, RDF/XML-ABBREV

For input, both of these are the same, and fully implement the RDF Syntax Last Call Working Draft, see conformance.

For output, "RDF/XML" produces regular output reasonably efficiently, but it is not readable. In contrast, "RDF/XML-ABBREV" produces readable output without much regard to efficiency.

All the readers and writers for RDF/XML are configurable, see below, input and output.

1.2 N3

The N3 readers and writers implement Tim Berners-Lee's N3 language.

There are actually 4 writers:

N3: The standard writer, which chooses one of the other three.
N3-PP: The full N3 pretty writer.
N3-PLAIN: An N3 writer that does not nest bNode structures but does write record-like groups of all properties for a subject.
N3-TRIPLE: Writes one statement per line, like N-TRIPLES, but also does qname conversion of URIrefs.

The standard Jena writer "N3" chooses which writer to use based on the system property com.hp.hpl.jena.n3.N3JenaWriter.writer.

N3 Writer Properties

The N3 pretty printer (which is used by default) and the N3 plain writer provide a number of properties to control their output. The properties all start: http://jena.hpl.hp.com/n3/properties/. The name used can be the full name, starting with this string, or the short form of just the name below. All values are strings; they may be interpreted as integer, boolean or string as defined below.

Property Name	Description	Default	Legal Values of String
Properties to Control N3 Output
minGap	Minimum gap between items on a line	1	positive integer
objectLists	Print object lists as comma separated lists	true	boolean "true" or "false"
subjectColumn	If the subject is shorter than this value, the first property may go on the same line.	indentColumn	positive integer
propertyColumn	Width of the property column	8	positive integer
indentProperty	Width to indent properties	6	positive integer
widePropertyLen	Width of the property column	20	integer, greater than propertyColumn
abbrevBaseURI	Control whether to use abbreviations `<>` or `<#>`	true	boolean "true" or "false"
usePropertySymbols	Control whether to use "a", "=" and "=>" in output	true	boolean "true" or "false"
useTripleQuotedStrings	Allow the use of """ to delimit long strings	true	boolean "true" or "false"
useDoubles	Allow the use of doubles as 123.456	true	boolean "true" or "false"

Notes:

Only the N3 pretty printer prints object lists as comma separated lists.

1.3 N-TRIPLE

The N-TRIPLE readers and writers implement RDF Core's N-Triples language. They are not configurable.

1.4 TURTLE

The N3 reader accepts any valid Turtle. Note that the N3 and Turtle writers produce internationalized qnames, with the character set from XML Namespaces (except for ':'), not restricted to ASCII as is the definition of N3 and Turtle.

The Turtle writer is the N3 writer configured with: usePropertySymbols=false, useTripleQuotedStrings=false, useDoubles=false.

2. Character Encoding in Java and XML

What went wrong with character encoding?

The java.io.* classes based around Readers and Writers are intended to help us avoid encoding problems. The encoding attribute in the XML declaration at the top of an XML document is intended to help us avoid encoding problems. Unfortunately, these are two different approaches; and Jena1 went with the Java conventions, whereas the Web scalable conventions are those used by XML.

The Java approach is that the machine on which the Java is running has some default encoding. I/O done with FileReaders, PrintWriters etc. is then done using that encoding, unless there is a specific user instruction when the Reader or Writer is created. It is not possible to change the encoding used by a Reader or Writer while it is being used.

The XML approach is that XML documents are in UTF-8 or UTF-16 unless they say otherwise in the first line of the document (this first line is sufficiently restricted to make it possible to read it without knowing the encoding). Hence, an XML reader should start by looking at the first few bytes and work out from those whether it is UTF-8, UTF-16 or some other encoding as declared in the first line. From then on, it uses that encoding.
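The detection step can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm used by Xerces or ARP: the class `XmlEncodingSniffer` is a hypothetical name, and the sketch assumes the document either starts with a byte order mark or uses an ASCII-compatible encoding for its first line. UTF-16 without a byte order mark and EBCDIC-based encodings are ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEncodingSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration at the start of the input.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Guesses the encoding name from the first bytes of an XML document. */
    static String sniff(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";                       // UTF-8 byte order mark
        }
        if (startsWith(head, 0xFE, 0xFF) || startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16";                      // UTF-16 byte order mark, either endianness
        }
        // Read the first line as single-byte characters; only ASCII characters matter here.
        int length = Math.min(head.length, 200);
        String prefix = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : "UTF-8";   // no declaration: XML defaults to UTF-8
    }

    private static boolean startsWith(byte[] data, int... expected) {
        if (data.length < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if ((data[i] & 0xFF) != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the decision is made from bytes, it can only happen when the parser is handed an InputStream; a Reader has already committed to an encoding before the parser sees any input.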

The Java approach is designed for ease of use on a single machine, which uses a single encoding; often being a one-byte encoding, e.g. for European languages which do not need thousands of different characters.

The XML approach is designed for the Web, which uses multiple encodings, some of which require thousands of characters.

In Jena1, we had not understood these issues; and went with the Java solution. This was the wrong call. We have now fixed it. We are sorry that this does cause our users work when migrating to Jena2.

3. When to Use Reader and Writer?

Infrequently.

Despite these problems it is still sometimes appropriate to use Readers and Writers with Jena I/O. A good example is using Readers and Writers into StringBuffers in memory. These do not need to be encoded and decoded so a character encoding does not need to be specified. Other examples are when an advanced user explicitly wishes to correctly control the encoding.

`Model`	`read(java.io.Reader reader, java.lang.String base)` Using this method is often a mistake.
`Model`	`read(java.io.Reader reader, java.lang.String base, java.lang.String lang)` Using this method is often a mistake.
`Model`	`write(java.io.Writer writer)` Caution! Write the model as an XML document.
`Model`	`write(java.io.Writer writer, java.lang.String lang)` Caution! Write a serialized representation of a model in a specified language.
`Model`	`write(java.io.Writer writer, java.lang.String lang, java.lang.String base)` Caution! Write a serialized representation of a model in a specified language.

Incorrect use of these read(Reader, …) methods results in warnings and errors with RDF/XML and RDF/XML-ABBREV (except in a few cases where the incorrect use cannot be automatically detected). Incorrect use of the write(Writer, …) methods results in peculiar XML declarations such as <?xml version="1.0" encoding="WINDOWS-1252"?>. This would reflect that the character encoding you used (probably without realizing) in your Writer is registered with IANA under the name "WINDOWS-1252". The resulting XML is of reduced portability as a result. Glenn Marcy notes:

since UTF-8 and UTF-16 are the only encodings REQUIRED to be understood by all conformant XML processors, even ISO-8859-1 would technically be on shaky ground if not for the fact that it is in such widespread use that every reasonable XML processor supports it.

With N-TRIPLE, incorrect use is usually benign, since N-TRIPLE is ASCII based.

Character encoding issues of N3 are not well-defined; hence use of these methods may require changes in the future. Use of the InputStream and OutputStream methods will allow your code to work with future versions of Jena which do the right thing - whatever that is. Currently the OutputStream methods use UTF-8 encoding.

4. Introduction to Advanced Jena I/O

The RDF/XML input and output is configurable.

However, to configure it, it is necessary to access an RDFReader or RDFWriter object that remains hidden in the simpler interface above.

The four vital calls in the Model interface are:

`RDFReader`	`getReader()` return an RDFReader instance for the default serialization language.
`RDFReader`	`getReader(java.lang.String lang)` return an RDFReader instance for the specified serialization language.
`RDFWriter`	`getWriter()` return an RDFWriter instance for the default serialization language.
`RDFWriter`	`getWriter(java.lang.String lang)` return an RDFWriter instance for the specified serialization language.

Each of these calls returns an RDFReader or RDFWriter that can be used to read or write any Model (not just the one which created it). As well as the necessary read and write methods, these interfaces provide:

RDFErrorHandler setErrorHandler(RDFErrorHandler errHandler)
Set an error handler for the reader

java.lang.Object setProperty(java.lang.String propName, java.lang.Object propValue)
Set the value of a reader property.

Setting properties, or the error handler, on an RDFReader or an RDFWriter allows the programmer to access non-default behaviour. Moreover, since the RDFReader and RDFWriter are not bound to a specific Model, a typical idiom is to create the RDFReader or RDFWriter on system initialization, to set the appropriate properties so that it behaves exactly as required in your system, and then to do all subsequent I/O through it.

    Model m = ModelFactory.createDefaultModel();
    RDFWriter writer = m.getWriter();
    m = null; // m is no longer needed.
    writer.setErrorHandler(myErrorHandler);
    writer.setProperty("showXmlDeclaration","true");
    writer.setProperty("tab","8");
    writer.setProperty("relativeURIs","same-document,relative");
    …
    Model marray[];
    …
    for (int i=0; i<marray.length; i++) {
    …
        OutputStream out = new FileOutputStream("foo" + i + ".rdf");
        writer.write(marray[i],
                           out,
          "http://example.org/");
        out.close();
    }

Note that all of the current implementations are synchronized, so that a specific RDFReader cannot be reading two different documents at the same time. In a multithreaded application this may suggest a need for a pool of RDFReaders and/or RDFWriters, or alternatively to create, initialize, use and discard them as needed.
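One way to build such a pool is sketched below using only java.util.concurrent. The class `SimplePool` is a hypothetical helper, not part of Jena; it is written generically so that the same code could hold pre-configured RDFReaders or RDFWriters, and the sketch assumes a fixed pool size chosen at start-up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

final class SimplePool<T> {

    private final BlockingQueue<T> idle;

    /** Creates the pool and fills it with instances built (and configured) by the factory. */
    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Borrows an instance, runs the task with it, and always hands the instance back. */
    <R> R use(Function<T, R> task) {
        T item;
        try {
            item = idle.take(); // blocks until an instance is free
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while waiting for a pooled instance", e);
        }
        try {
            return task.apply(item);
        } finally {
            idle.add(item);
        }
    }

    /** Number of instances currently not in use. */
    int available() {
        return idle.size();
    }
}
```

With Jena, the factory would presumably obtain a writer with getWriter(...) and apply the setProperty(...) calls once, so that each thread borrows an already configured instance instead of contending for a single synchronized one.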

For the languages N3 and N-TRIPLE there are currently no properties supported for either the RDFReader or the RDFWriter. Hence the idiom above is not very helpful, and just using the Model.write() methods may prove easier.

For RDF/XML and RDF/XML-ABBREV there are many options in both the RDFReader and the RDFWriter. These options are detailed in the javadoc for JenaReader.setProperty(String, Object) and RDFXMLWriterI.setProperty(String, Object) , and they are also given below.

5. Advanced RDF/XML Input

For access to these advanced features, first get an RDFReader object that is an instance of an ARP parser, by using the getReader() method on any Model. It is then configured using the setProperty(String, Object) method. This changes the properties for parsing RDF/XML. Many of the properties change the RDF parser, some change the XML parser. (The Jena RDF/XML parser, ARP, is built using a fairly direct implementation of the RDF grammar over a Xerces2-J XML parser). Changing the features and properties of the XML parser is not likely to be useful, but was easy to implement.

This method is not tested in the standard Jena test suite.

setProperty(String, Object) can be used to set and get:

ARP properties: These allow fine-grained control over the extensive error reporting capabilities of ARP, and are detailed directly below.
SAX2 features: See Xerces features. Value should be given as a String "true" or "false" or a Boolean.
SAX2 properties: See Xerces properties.
Xerces features: See Xerces features. Value should be given as a String "true" or "false" or a Boolean.
Xerces properties: See Xerces properties.

5.1 ARP properties

An ARP property is referred to either by its property name, (see below) or by an absolute URL of the form http://jena.hpl.hp.com/arp/properties/<PropertyName>. The value should be a String, an Integer or a Boolean depending on the property.

ARP property names and string values are case insensitive.

Property Name	Description	Value class	Legal Values
ARP Properties
`error-mode`	`ARPOptions.setDefaultErrorMode()` `ARPOptions.setLaxErrorMode()` `ARPOptions.setStrictErrorMode()` `ARPOptions.setStrictErrorMode(int)` This allows a coarse-grained approach to control of error handling. Setting this property is equivalent to setting many of the fine-grained error handling properties.	String	`default` `lax` `strict` `strict-ignore` `strict-warning` `strict-error` `strict-fatal`
`embedding`	`ARPOptions.setEmbedding(boolean)` This sets ARP to look for RDF embedded within an enclosing XML document.	String or Boolean	`true` or `false`
`ERR_<XXX>` `WARN_<XXX>` `IGN_<XXX>`	See `ARPErrorNumbers` for a complete list of the error conditions detected. Setting one of these properties is equivalent to the method `ARPOptions.setErrorMode(int, int)`. Thus fine-grained control over the behaviour in response to specific error conditions is possible.	String or Integer	`EM_IGNORE` `EM_WARNING` `EM_ERROR` `EM_FATAL`

As an example, if you are working in an environment with legacy RDF data that uses unqualified RDF attributes such as "about" instead of "rdf:about", then the following code is appropriate:

    Model m = ModelFactory.createDefaultModel();
    RDFReader arp = m.getReader();
    m = null; // m is no longer needed.
    // initialize arp
    // Do not warn on use of unqualified RDF attributes.
    arp.setProperty("WARN_UNQUALIFIED_RDF_ATTRIBUTE","EM_IGNORE");

    …
        // m here is whichever Model should receive the statements
        InputStream in = new FileInputStream(fname);
        arp.read(m,in,url);
        in.close();

As a second example, suppose you wish to work in strict mode, but allow "daml:collection", the following works:

     …
     arp.setProperty("error-mode", "strict" );
     arp.setProperty("IGN_DAML_COLLECTION","EM_IGNORE");
     …

The other way round does not work.

     …
     arp.setProperty("IGN_DAML_COLLECTION","EM_IGNORE");
     arp.setProperty("error-mode", "strict" );
     …

This is because in strict mode IGN_DAML_COLLECTION is treated as an error, and so the second call to setProperty overwrites the effect of the first.
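The ordering effect can be modelled with a small map. The sketch below is a toy model, not ARP's implementation: the class `ErrorModeModel` and its methods are hypothetical, and it tracks only the single condition discussed here, assuming that condition is ignored unless something sets it otherwise.

```java
import java.util.HashMap;
import java.util.Map;

final class ErrorModeModel {

    private final Map<String, String> modes = new HashMap<>();

    /** Coarse-grained switch: overwrites the mode of every condition it covers. */
    void setStrict() {
        modes.put("IGN_DAML_COLLECTION", "EM_ERROR");
        // A real parser would reset many more conditions here.
    }

    /** Fine-grained switch: changes one condition only. */
    void setMode(String condition, String mode) {
        modes.put(condition, mode);
    }

    /** Returns the current mode, assuming "EM_IGNORE" for conditions never set. */
    String modeOf(String condition) {
        return modes.getOrDefault(condition, "EM_IGNORE");
    }

    public static void main(String[] args) {
        ErrorModeModel strictFirst = new ErrorModeModel();
        strictFirst.setStrict();
        strictFirst.setMode("IGN_DAML_COLLECTION", "EM_IGNORE");
        System.out.println(strictFirst.modeOf("IGN_DAML_COLLECTION")); // the later, specific setting wins

        ErrorModeModel strictLast = new ErrorModeModel();
        strictLast.setMode("IGN_DAML_COLLECTION", "EM_IGNORE");
        strictLast.setStrict();
        System.out.println(strictLast.modeOf("IGN_DAML_COLLECTION")); // strict mode overwrote it
    }
}
```

The rule of thumb that follows is to set the coarse-grained error-mode first and apply the fine-grained exceptions afterwards.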

5.2 Interrupting ARP

ARP can be interrupted using the Thread.interrupt() method. It throws a java.io.InterruptedIOException.

Here is an illustrative code sample:

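  ARP a = new ARP();
  // The parse runs on the current thread; killt interrupts it after tim milliseconds.
  final Thread arpt = Thread.currentThread();
  Thread killt = new Thread(new Runnable() {
      public void run() {
          try {
              Thread.sleep(tim);
          } catch (InterruptedException e) {
              // ignored: interrupt the parsing thread anyway
          }
          arpt.interrupt();
      }
  });
  killt.start();
  try {
      InputStream in = new FileInputStream(fileName);
      a.load(in);
      in.close();
      fail("Thread was not interrupted.");
  } catch (InterruptedIOException e) {
      // expected: the parse was interrupted
  } catch (SAXParseException e) {
      // tolerated in this sample
  }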

I am not sure what happens when using a Jena Model and a read operation. Check the latest version of this document. I suspect this exception gets reported through the error handler and has to be rethrown to end the parse. If you have working code, and this section has not yet been updated, please send it to jena-devel for inclusion in this document.
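Independently of how a Model read reports the interruption, the stream handed to the parser can be made to honour Thread.interrupt() itself. The following is a sketch under that assumption, using only java.io: the class `InterruptibleInputStream` is hypothetical and not part of Jena, and whether the resulting exception reaches the caller of Model.read unchanged has not been verified here.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;

final class InterruptibleInputStream extends FilterInputStream {

    InterruptibleInputStream(InputStream in) {
        super(in);
    }

    /** Throws if the current thread has been interrupted; this also clears the interrupt flag. */
    private static void checkInterrupted() throws InterruptedIOException {
        if (Thread.interrupted()) {
            throw new InterruptedIOException("read interrupted");
        }
    }

    @Override
    public int read() throws IOException {
        checkInterrupted();
        return super.read();
    }

    @Override
    public int read(byte[] buffer, int offset, int length) throws IOException {
        checkInterrupted();
        return super.read(buffer, offset, length);
    }
}
```

Wrapping a FileInputStream in this class before passing it to a read call means the check happens each time the parser asks for more input, so an interrupt takes effect at the next read request.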

6. Advanced RDF/XML Output

The first RDF/XML output question is whether to use the "RDF/XML" or the "RDF/XML-ABBREV" writer. While some of the code is shared, these two writers are really very different, resulting in different but equivalent output. RDF/XML-ABBREV is slower, but should produce more readable XML.

For access to advanced features, first get an RDFWriter object, of the appropriate language, by using getWriter("RDF/XML") or getWriter("RDF/XML-ABBREV") on any Model. It is then configured using the setProperty(String, Object) method. This changes the properties for writing RDF/XML.

Property Name	Description	Value class	Legal Values
Properties to Control RDF/XML Output
xmlbase	The value for xml:base in the file as a string.	String	a URI string, or null (default)
longId	Whether to use long or short id's for anon resources. Short id's are easier to read and are the default, but can run out of memory on very large models.	String or Boolean	"true", "false" (default)
allowBadURIs	URIs in the graph are, by default, checked prior to serialization.	String or Boolean	"true", "false" (default)
relativeURIs	What sort of relative URIs should be used. A comma separated list of options: `same-document`: same-document references (e.g. "" or "#foo"); `network`: network paths, e.g. "//example.org/foo", omitting the URI scheme; `absolute`: absolute paths, e.g. "/foo", omitting the scheme and authority; `relative`: relative path not beginning in "../"; `parent`: relative path beginning in "../"; `grandparent`: relative path beginning in "../../". The default value is "same-document, absolute, relative, parent". To switch off relative URIs use the value "". Relative URIs of any of these types are output where possible if and only if the option has been specified.	String
showXmlDeclaration	If true, an XML Declaration is included in the output; if false, no XML declaration is included. The default behaviour only gives an XML Declaration when asked to write to an OutputStreamWriter that uses some encoding other than UTF-8 or UTF-16. In this case the encoding is shown in the XML declaration. To ensure that the encoding attribute is shown in the XML declaration, either use the `write(Model,Writer,String)` variant with an appropriate OutputStreamWriter, or set this option to false and write the declaration to an OutputStream before calling `write(Model,OutputStream,String)`.	true, "true", false, "false" or "default"	can be true, false or "default" (null)
tab	The number of spaces with which to indent XML child elements.	String or Integer	positive integer "2" is the default
attributeQuoteChar	How to write XML attributes.	String	"\"" or "'"
blockRules	A list of Resource or a String being a comma separated list of fragment IDs from http://www.w3.org/TR/rdf-syntax-grammar indicating grammar rules that will not be used. Rules that can be avoided are: section-Reification (`RDFSyntax.sectionReification`), section-List-Expand (`RDFSyntax.sectionListExpand`), parseTypeLiteralPropertyElt (`RDFSyntax.parseTypeLiteralPropertyElt`), parseTypeResourcePropertyElt (`RDFSyntax.parseTypeResourcePropertyElt`), parseTypeCollectionPropertyElt (`RDFSyntax.parseTypeCollectionPropertyElt`), idAttr (`RDFSyntax.idAttr`), propertyAttr (`RDFSyntax.propertyAttr`). In addition "daml:collection" (`DAML_OIL.collection`) can be blocked. Blocking idAttr also blocks section-Reification. By default, rule propertyAttr is blocked. For the basic writer (RDF/XML) only parseTypeLiteralPropertyElt has any effect, since none of the other rules are implemented by that writer.	Resource[] or String
prettyTypes	Only for the RDF/XML-ABBREV writer. This is a list of the types of the principal objects in the model. The writer will tend to create RDF/XML with resources of these types at the top level. Example usage showing the default value: w.setProperty("prettyTypes", new Resource[]{ DAML_OIL.Ontology, DAML_OIL.Class, DAML_OIL.Datatype, DAML_OIL.Property, DAML_OIL.ObjectProperty, DAML_OIL.DatatypeProperty, DAML_OIL.TransitiveProperty, DAML_OIL.UnambigousProperty, DAML_OIL.UniqueProperty, });	Resource[]
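To make the kinds of relative URI named by the relativeURIs option concrete, the sketch below resolves one reference of each kind against a base URI using java.net.URI from the standard library. It illustrates the reference forms only and is not the writer's own relativization code; the class `RelativeUriKinds`, the base http://example.org/a/b/c and the target names are assumptions made for the example.

```java
import java.net.URI;

final class RelativeUriKinds {

    /** Resolves a relative reference against a base URI and returns the absolute form. */
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.org/a/b/c";
        String[] references = {
            "#foo",              // same-document
            "//example.org/foo", // network
            "/foo",              // absolute
            "bar",               // relative
            "../bar",            // parent
            "../../bar"          // grandparent
        };
        for (String reference : references) {
            System.out.println(reference + " -> " + resolve(base, reference));
        }
    }
}
```

A reader of the written file performs this resolution against xml:base (or the document's own location), so each enabled option trades shorter output for a dependence on that base.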

As an example,

     RDFWriter w = m.getWriter("RDF/XML-ABBREV");
     w.setProperty("attributeQuoteChar","'");
     w.setProperty("showXmlDeclaration","true");
     w.setProperty("tab","1");
     w.setProperty("blockRules",
       "daml:collection,parseTypeLiteralPropertyElt,"
       +"parseTypeResourcePropertyElt,parseTypeCollectionPropertyElt");

creates a writer that does not use rdf:parseType (preferring rdf:datatype for rdf:XMLLiteral), indents only a little, and produces the XMLDeclaration. Attributes are used, and are quoted with "'".

7. Conformance

The RDF/XML I/O endeavours to conform with the RDF Syntax Last Call Working Draft.

The parser must be set to strict mode, in which case its only non-conformant behaviour is that it treats rdf:parseType="daml:collection" as an error. (The conformant behaviour is to silently turn "daml:collection" into "Literal").

The RDF/XML writer is conformant, but does not exercise much of the grammar.

The RDF/XML-ABBREV writer exercises all of the grammar and is conformant except that it uses the "daml:collection" construct for DAML ontologies. This non-conformant behaviour can be switched off using the "blockRules" property.

To create such conformant readers and writers use the following:

   Model m;
   …
   RDFReader conformantParser = m.getReader();
   conformantParser.setProperty("error-mode","strict");
   …
   RDFWriter conformantAbbrevWriter = m.getWriter("RDF/XML-ABBREV");
   conformantAbbrevWriter.setProperty("blockRules","daml:collection");

8. Faster RDF/XML I/O

ARP in Jena2 is significantly faster than in Jena1. Future optimizations may allow error checking to be suppressed in response to setting the properties on specific errors. Currently only error messages get suppressed. Errors are checked for, whether or not the message is actually desired.

To optimise the speed of writing RDF/XML it is suggested that all URI processing is turned off. Also do not use RDF/XML-ABBREV. It is unclear whether the longId attribute is faster or slower; the short IDs have to be generated on the fly and a table maintained during writing. The longer IDs are long, and hence take longer to write. The following creates a faster writer:

   Model m;
   …
   …
   RDFWriter fasterWriter = m.getWriter("RDF/XML");
   fasterWriter.setProperty("allowBadURIs","true");
   fasterWriter.setProperty("relativeURIs","");
   fasterWriter.setProperty("tab","0");

When reading RDF/XML the check for reuse of rdf:ID has a memory overhead, which can be significant for very large files. In this case, this check can be suppressed by telling ARP to ignore this error.

   Model m;
   …
   …
   RDFReader bigFileReader = m.getReader("RDF/XML");
   bigFileReader.setProperty("WARN_REDEFINITION_OF_ID","EM_IGNORE");
   …
