Title: Notes on Jena internals

**Note:** These notes are quite old now, but may still be of some interest in the design and architecture of Jena.

## Enhanced Nodes

This note is a development of the original note on the enhanced node and graph design of Jena 2.

### Key objectives for the enhanced node design

One problem with the Jena 1 design was that both the DAML layer and the RDB layer independently extended Resource with domain-specific information. That made it impossible to have a DAML-over-RDB implementation. While this could have been fixed by using the "enhanced resource" mechanism of Jena 1, that would have left a second problem.

In Jena 1.0, once a resource has been determined to be a DAML Class (for instance), that remains true for the lifetime of the model. If a resource starts out not qualifying as a DAML Class (no `rdf:type daml:Class`) then adding the type assertion later doesn't make it a Class. Similarly, if a resource is a DAML Class, but then the type assertion is retracted, the resource is still apparently a class. Hence being a DAMLClass is a *view* of the resource that may change over time. Moreover, a given resource may validly have a number of different views simultaneously. Using the current `DAMLClass` implementation method means that a given resource is limited to a single such view.

A key objective of the new design is to allow different views, or *facets*, to be used dynamically when accessing a node. The new design allows nodes to be polymorphic, in the sense that the same underlying node from the graph can present different encapsulations - thus different affordances to the programmer - on request.

In summary, the enhanced node design in Jena 2.0 allows programmers to:

- provide alternative perspectives onto a node from a graph, supporting additional functionality particular to that perspective;
- dynamically convert between perspectives on nodes;
- register implementations of implementation classes that present the node as an alternative perspective.

### Terminology

To assist the following discussion, the key terms are introduced first.

node
~ A subject or object from a triple in the underlying graph

graph
~ The underlying container of RDF triples that simplifies the previous abstraction Model

enhanced node
~ An encapsulation of a node that adds additional state or functionality to the interface defined for node. For example, a bag is a resource that contains a number of other resources; an enhanced node encapsulating a bag might provide simplified programmatic access to the members of the bag.

enhanced graph
~ Just as an enhanced node encapsulates a node and adds extra functionality, an enhanced graph encapsulates an underlying graph and provides additional features. For example, both Model and DAMLModel can be thought of as enhancements to the (deliberately simple) interface to graphs.

polymorphic
~ An abstract super-class of enhanced graph and enhanced node that exists purely to provide shared implementation.

personality
~ An abstraction that circumscribes the set of alternative views that are available in a given context. In particular, defines a mapping from types (q.v.) to implementations (q.v.). This seems to be taken to be closed for graphs.

implementation
~ A factory object that is able to generate polymorphic objects that present a given enhanced node according to a given type. For example, an alt implementation can produce a sub-class of enhanced node that provides accessors for the members of the alt.

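The relationship between a personality and its implementations can be illustrated with a small, self-contained sketch. This is not Jena's code: `PersonalitySketch`, its nested `Node`, `Implementation`, `Personality` and `Bag` types, and the `demoPersonality()` helper are hypothetical stand-ins, and a node's type assertions are modelled as a plain set of strings rather than triples in a graph.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PersonalitySketch {

    /** Hypothetical plain node: a label plus the rdf:type values currently asserted for it. */
    static final class Node {
        final String label;
        final Set<String> types = new HashSet<>();
        Node(String label) { this.label = label; }
    }

    /** Hypothetical implementation: a factory for one kind of view. */
    interface Implementation<T> {
        boolean canWrap(Node n);
        T wrap(Node n);
    }

    /** An example view type: the node seen as a bag. */
    interface Bag {
        String describe();
    }

    /** Hypothetical personality: the mapping from view types to implementations. */
    static final class Personality {
        private final Map<Class<?>, Implementation<?>> byType = new HashMap<>();

        <T> Personality add(Class<T> type, Implementation<? extends T> impl) {
            byType.put(type, impl);
            return this;
        }

        /** True iff a factory is registered for the type and accepts the node right now. */
        boolean canAs(Node n, Class<?> type) {
            Implementation<?> impl = byType.get(type);
            return impl != null && impl.canWrap(n);
        }

        /** Builds the requested view, or throws if the node does not currently qualify. */
        <T> T as(Node n, Class<T> type) {
            if (!canAs(n, type)) {
                throw new UnsupportedOperationException(
                    "cannot view " + n.label + " as " + type.getSimpleName());
            }
            return type.cast(byType.get(type).wrap(n));
        }
    }

    /** A personality that knows a single view, Bag, available to nodes typed rdf:Bag. */
    static Personality demoPersonality() {
        return new Personality().add(Bag.class, new Implementation<Bag>() {
            @Override public boolean canWrap(Node n) { return n.types.contains("rdf:Bag"); }
            @Override public Bag wrap(Node n) { return () -> "bag view of " + n.label; }
        });
    }

    public static void main(String[] args) {
        Personality p = demoPersonality();
        Node n = new Node("x");
        System.out.println(p.canAs(n, Bag.class));   // no type assertion yet
        n.types.add("rdf:Bag");                      // the underlying data changes
        System.out.println(p.as(n, Bag.class).describe());
    }
}
```

Because the factory re-checks the node each time it is asked, the view becomes available as soon as the type assertion is added, which is the dynamic behaviour the design objectives call for.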
#### Key points

Some key features of the design are:

- every enhanced graph has a single graph personality, which represents the types of all the enhanced nodes that can be created in this graph;
- every enhanced node refers to that personality;
- different kinds of enhanced graph can have different personalities, for example, may implement interfaces in different ways, or not implement some at all;
- enhanced nodes wrap information in the graph, but keep no independent state; they may be discarded and regenerated at whim.

### How an enhanced node is created

#### Creation from another enhanced node

If `en` is an enhanced node representing some resource we wish to be able to view as being of some (Java) class/interface `T`, the expression `en.as(T.class)` will either deliver an EnhNode of type `T`, if it is possible to do so, or throw an exception if not.

To check if the conversion is allowed, without having to catch exceptions, the expression `en.canAs(T.class)` delivers `true` iff the conversion is possible.

#### Creation from a base node

Somehow, some seed enhanced node must be created, otherwise `as()` would have nothing to work on. Subclasses of enhanced node provide constructors (perhaps hidden behind factories) which wrap plain nodes up in enhanced graphs. Eventually these invoke the constructor `EnhNode(Node,EnhGraph)`.

It's up to the constructors for the enhanced node subclasses to ensure that they are called with appropriate arguments.

#### Internal operation of the conversion

`as(Class T)` is defined on EnhNode to invoke `asInternal(T)` in `Polymorphic`. If the original enhanced node `en` is already a valid instance of `T`, it is returned as the result. Validity is checked by the method `isValid()`.

If `en` is not already of type `T`, then a cache of alternative views of `en` is consulted to see if a suitable alternative exists.
The cache is implemented as a *sibling ring* of enhanced nodes - each enhanced node has a link to its next sibling, and the "last" node links back to the "first". This makes it cheap to find alternative views if there are not too many of them, and avoids caches filling up with dead nodes and having to be flushed.

If there is no existing suitable enhanced node, the node's personality is consulted. The personality maps the desired class type to an `Implementation` object, which is a factory with a `wrap` method which takes a (plain) node and an enhanced graph and delivers the new enhanced node after checking that its conditions apply. The new enhanced node is then linked into the sibling ring.

### How to build an enhanced node & graph

What you have to do to define an enhanced node/graph implementation:

1. define an interface `I` for the new enhanced node. (You could use just the implementation class, but we've stuck with the interface, because there might be different implementations)
2. define the implementation class `C`. This is just a front for the enhanced node. All the state of `C` is reflected in the graph (except for caching; but beware that the graph can change without notice).
3. define an `Implementation` class for the factory. This class defines methods `canWrap` and `wrap`, which test a node to see if it is allowed to represent `I` and construct an instance of `C` respectively.
4. arrange that the personality of the graph maps the class of `I` to the factory. At the moment we do this by using (a copy of) the built-in graph personality as the personality for the enhanced graph.

For an example, see the code for `ReifiedStatementImpl`.

## Reification API

### Introduction

This document describes the reification API in Jena2, following discussions based on the 0.5a document. The essential decision made during that discussion is that reification triples are captured and dealt with by the Model transparently and appropriately.

### Context

The first Jena implementation made some attempt to optimise the representation of reification. In particular it tried to avoid so called 'triple bloat', *ie* requiring four triples to represent the reification of a statement. The approach taken was to make a *Statement* a subclass of *Resource* so that properties could be directly attached to statement objects.

There are a number of defects in the Jena 1 approach.

- Not everyone in the team was bought in to the approach
- The *.equals()* method for *Statement*s was arguably wrong and also violated the Java requirements on a *.equals()*
- The implied triples of a reification were not present so could not be searched for
- There was confusion between the optimised representation and explicit representation of reification using triples
- The optimisation did not round trip through RDF/XML using the writers and ARP.

However, there are some supporters of the approach. They liked:

- the avoidance of triple bloat
- that the extra reification statements are not there to be found on queries or ListStatements and do not affect the *size()* method.

Since Jena was first written the RDFCore WG have clarified the meaning of a reified statement. Whilst Jena 1 took a reified statement to denote a statement, RDFCore have decided that a reified statement denotes an occurrence of a statement, otherwise called a stating. The Jena 1 *.equals()* method for *Statement*s is thus inappropriate for comparing reified statements.

The goals of reification support in the Jena 2 implementation are:

- to conform to the revised RDF specifications
- to maintain the expectations of Jena 1; *ie* they should still be able to reify everything without worrying about triple bloat if they want to
- as far as is consistent with the second goal above, to not break existing code, or at least make it easy to transition old code to Jena 2.
- to enable round tripping through RDF/XML and other RDF representation languages
- to enable a complete standard compliant implementation, but not necessarily as default

### Presentation API

*Statement* will no longer be a subclass of *Resource*. Thus a statement may not be used where a resource is expected. Instead, a new interface *ReifiedStatement* will be defined:

    public interface ReifiedStatement extends Resource {
        public Statement getStatement();
        // could call it a day at that or could duplicate convenience
        // methods from Statement, eg getSubject(), getInt().
        ...
    }

The *Statement* interface will be extended with the following methods:

    public interface Statement
        ...
        public ReifiedStatement createReifiedStatement();
        public ReifiedStatement createReifiedStatement(String URI);
        /* */
        public boolean isReified();
        public ReifiedStatement getAnyReifiedStatement();
        /* */
        public RSIterator listReifiedStatements();
        /* */
        public void removeAllReifications();
        ...

*RSIterator* is a new iterator which returns *ReifiedStatement*s. It is an extension of *ResourceIterator*.

The *Model* interface will be extended with the following methods:

    public interface Model
        ...
        public ReifiedStatement createReifiedStatement(Statement stmt);
        public ReifiedStatement createReifiedStatement(String URI, Statement stmt);
        /* */
        public boolean isReified(Statement st);
        public ReifiedStatement getAnyReifiedStatement(Statement stmt);
        /* */
        public RSIterator listReifiedStatements();
        public RSIterator listReifiedStatements(Statement stmt);
        /* */
        public void removeReifiedStatement(ReifiedStatement rs);
        public void removeAllReifications(Statement st);
        ...

The methods in *Statement* are defined to be the obvious calls of methods in *Model*. The interaction of those methods is expressed below.

Reification operates over statements in the model which use predicates **rdf:subject**, **rdf:predicate**, **rdf:object**, and **rdf:type** with object **rdf:Statement**.

*Statements with those predicates are, by default, invisible*. They do not appear in calls of *listStatements*, *contains*, or uses of the *Query* mechanism. Adding them to the model will not affect *size()*. Models that do not hide reification quads will also be available.

### Retrieval

The *Model::as()* mechanism will allow the retrieval of reified statements.

    someResource.as( ReifiedStatement.class )

If *someResource* has an associated reification quad, then this will deliver an instance *rs* of *ReifiedStatement* such that *rs.getStatement()* will be the statement *rs* reifies. Otherwise a *DoesNotReifyException* will be thrown. (Use the predicate *canAs()* to test if the conversion is possible.)

It does not matter how the quad components have arrived in the model; explicitly asserted or by the *create* mechanisms described below. If quad components are removed from the model, existing *ReifiedStatement* objects will continue to function, but conversions using *as()* will fail.

### Creation

*createReifiedStatement(Statement stmt)* creates a new *ReifiedStatement* object that reifies *stmt*; the appropriate quads are inserted into the model. The resulting resource is a blank node.

*createReifiedStatement(String URI, Statement stmt)* creates a new *ReifiedStatement* object that reifies *stmt*; the appropriate quads are inserted into the model. The resulting resource is a *Resource* with the URI given.

### Equality

Two reified statements are *.equals()* iff they reify the same statement and have *.equals()* resources. Thus it is possible for equal *Statement*s to have unequal reifications.

### IsReified

*isReified(Statement st)* is true iff in the *Model* of this *Statement* there is a reification quad for this *Statement*. It does not matter if the quad was inserted piece-by-piece or all at once using a *create* method.

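The rules above for creation, retrieval and *isReified* can be illustrated with a toy model written in plain Java. It is only a sketch under simplifying assumptions, not the Jena API: `ReificationQuadSketch`, `ToyModel`, `Triple` and `statementOf` are hypothetical names, nodes are plain strings, and the quad components are stored as ordinary visible triples (so the toy corresponds to a model that does not hide reification quads).

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

class ReificationQuadSketch {

    /** Hypothetical triple with value equality; all nodes are plain strings. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        @Override public boolean equals(Object other) {
            if (!(other instanceof Triple)) return false;
            Triple t = (Triple) other;
            return s.equals(t.s) && p.equals(t.p) && o.equals(t.o);
        }
        @Override public int hashCode() { return Objects.hash(s, p, o); }
    }

    /** Toy model: a set of triples in which reification quads are ordinary triples. */
    static final class ToyModel {
        final Set<Triple> triples = new LinkedHashSet<>();
        private int blankCounter = 0;

        /** Reifies stmt with a fresh blank-node label and returns that label. */
        String createReifiedStatement(Triple stmt) {
            return createReifiedStatement("_:b" + blankCounter++, stmt);
        }

        /** Reifies stmt with the given resource by inserting the four quad components. */
        String createReifiedStatement(String uri, Triple stmt) {
            triples.add(new Triple(uri, "rdf:type", "rdf:Statement"));
            triples.add(new Triple(uri, "rdf:subject", stmt.s));
            triples.add(new Triple(uri, "rdf:predicate", stmt.p));
            triples.add(new Triple(uri, "rdf:object", stmt.o));
            return uri;
        }

        /** The statement r reifies, or null if r has no complete, unambiguous quad. */
        Triple statementOf(String r) {
            if (!triples.contains(new Triple(r, "rdf:type", "rdf:Statement"))) return null;
            String s = only(r, "rdf:subject");
            String p = only(r, "rdf:predicate");
            String o = only(r, "rdf:object");
            return (s == null || p == null || o == null) ? null : new Triple(s, p, o);
        }

        /** The single object of (r, pred, ?), or null if there are none or several. */
        private String only(String r, String pred) {
            String found = null;
            for (Triple t : triples) {
                if (t.s.equals(r) && t.p.equals(pred)) {
                    if (found != null) return null;
                    found = t.o;
                }
            }
            return found;
        }

        /** True iff some resource in the model carries a complete quad for stmt. */
        boolean isReified(Triple stmt) {
            for (Triple t : triples) {
                if (t.p.equals("rdf:type") && t.o.equals("rdf:Statement")
                        && stmt.equals(statementOf(t.s))) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        ToyModel m = new ToyModel();
        Triple stmt = new Triple("ex:a", "ex:knows", "ex:b");
        System.out.println(m.isReified(stmt));
        String r = m.createReifiedStatement(stmt);
        System.out.println(r + " reifies stmt: " + stmt.equals(m.statementOf(r)));
    }
}
```

In this sketch `statementOf` returning `null` plays the role of a failed `as(ReifiedStatement.class)` conversion; it makes no difference whether the four components were added by `createReifiedStatement` or one at a time.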
### Fetching

*getAnyReifiedStatement(Statement st)* delivers an existing *ReifiedStatement* object that reifies *st*, if there is one; otherwise it creates a new one. If there are multiple reifications for *st*, it is not specified which one will be returned.

### Listing

*listReifiedStatements()* will return an *RSIterator* which will deliver all the reified statements in the model.

*listReifiedStatements( Statement st )* will return an *RSIterator* which will deliver all the reified statements in the model that reify *st*.

### Removal

*removeReifiedStatement(ReifiedStatement rs)* will remove the reification *rs* from the model by removing the reification quad. Other reified statements with different resources will remain.

*removeAllReifications(Statement st)* will remove all the reifications in this model which reify *st*.

### Input and output

The writers will have access to the complete set of *Statement*s and will be able to write out the quad components.

The readers need have no special machinery, but it would be efficient for them to be able to call *createReifiedStatement* when detecting a reification.

### Performance

Jena1's "statements as resources" approach avoided triple bloat by not storing the reification quads. How, then, do we avoid triple bloat in Jena2?

The underlying machinery is intended to capture the reification quad components and store them in a form optimised for reification. In particular, in the case where a statement is completely reified, it is expected to store only the implementation representation of the *Statement*.

*createReifiedStatement* is expected to bypass the construction and detection of the quad components, so that in the "usual case" they will never come into existence.

## The Reification SPI

### Introduction

This document describes the reification SPI, the mechanisms by which the Graph family supports the Model API reification interface.

Graphs handle reification at two levels.
First, their reifier supports requests to reify triples and to search for reifications. The reifier is responsible for managing the reification information it adds and removes - the graph is not involved.

Second, a graph may optionally allow all triples added and removed through its normal operations (including the bulk update interfaces) to be monitored by its reifier. If so, all appropriate triples become the property of the reifier - they are no longer visible through the graph.

A graph may also have a reifier that doesn't do any reification. This is useful for internal graphs that are not exposed as models. So there are three kinds of `Graph`: graphs that do no reification; graphs that only do explicit reification; graphs that do implicit reification.

### Graph operations for reification

The primary reification operation on graphs is to extract their `Reifier` instance. Handing reification off to a different class allows reification to be handled independently of other Graph issues, eg query handling, bulk update.

#### Graph.getReifier() -\> Reifier

Returns the `Reifier` for this `Graph`. Each graph has a single reifier during its lifetime. The reifier object need not be allocated until the first call of `getReifier()`.

#### add(Triple), delete(Triple)

These two operations may defer their triples to the graph's reifier using `handledAdd(Triple)` and `handledRemove(Triple)`; see below for details.

### Interface Reifier

Instances of `Reifier` handle reification requests from their `Graph` and from the API level code (issued by the API class `ModelReifier`).

#### reifier.getHiddenTriples() -\> Graph

The reifier may keep reification triples to itself, coded in some special way, rather than having them stored in the parent `Graph`. This method exposes those triples as another `Graph`. This is a dynamic graph - it changes as the underlying reifications change. However, it is read-only; triples cannot be added to or removed from it.
The `SimpleReifier` implementation currently does not implement a dynamic graph. This is a bug that will need fixing.

#### reifier.getParentGraph() -\> Graph

Get the `Graph` that this reifier serves; the result is never `null`. (Thus the observable relationship between graphs and reifiers is 1-1.)

#### class AlreadyReifiedException

This class extends `RDFException`; it is the exception that may be thrown by `reifyAs`.

#### reifier.reifyAs( Triple t, Node n ) -\> Node

Records `t` as reified in the parent `Graph` by the given `n`, and returns `n`. If `n` already reifies a different `Triple`, throws an `AlreadyReifiedException`.

Calling `reifyAs(t,n)` is like adding the triples:

    n rdf:type rdf:Statement
    n rdf:subject t.getSubject()
    n rdf:predicate t.getPredicate()
    n rdf:object t.getObject()

to the associated Graph; however, it is intended that it is efficient in both time and space.

#### reifier.hasTriple( Triple t ) -\> boolean

Returns true iff some `Node n` reifies `t` in this `Reifier`, typically by an unretracted call of `reifyAs(t,n)`.

The intended (and actual) use for `hasTriple(Triple)` is in the implementation of `isReified(Statement)` in `Model`.

#### reifier.getTriple( Node n ) -\> Triple

Get the single `Triple` associated with `n`, if there is one. If there isn't, return `null`.

A node reifies at most one triple. If `reifyAs`, with its explicit check, is bypassed, and extra reification triples are asserted into the parent graph, then `getTriple()` will simply return `null`.

#### reifier.allNodes() -\> ExtendedIterator

Returns an (extended) iterator over all the nodes that (still) reify something in this reifier.

This is intended for the implementation of `listReifiedStatements` in `Model`.

#### reifier.allNodes( Triple t ) -\> ClosableIterator

Returns an iterator over all the nodes that (still) reify the triple `t`.

#### reifier.remove( Node n, Triple t )

Remove the association between `n` and the triple `t`.
Subsequently, `hasNode(n)` will return false and `getTriple(n)` will return `null`.

This method is used to implement `removeReification(Statement)` in `Model`.

#### reifier.remove( Triple t )

Remove all the associations between any node `n` and `t`; ie, for all `n` do `remove(n,t)`.

This method is used to implement `removeAllReifications` in `Model`.

#### handledAdd( Triple t ) -\> boolean

A graph doing reification may choose to monitor the triples being added to it and have the reifier handle reification triples. In this case, the graph's `add(t)` should call `handledAdd(t)` and only proceed with its add if the result is `false`.

A graph that does not use `handledAdd()` [and `handledRemove()`] can only use the explicit reification supplied by its reifier.

#### handledRemove( Triple t )

As for `handledAdd(t)`, but applied to `delete`.

### SimpleReifier

`SimpleReifier` is an implementation of `Reifier` suitable for in-memory `Graph`s built over `GraphBase`. It operates in either of two modes: with and without triple interception. With interception enabled, reification triples fed to (or removed from) its parent graph are captured using `handledAdd()` and `handledRemove()`; otherwise they are ignored and the graph must store them itself.

`SimpleReifier` keeps a map from nodes to the reification information about that node. Nodes which have no reification information (most of them, in the usual case) do not appear in the map at all.

Nodes with partial or excessive reification information are associated with `Fragments`. A `Fragments` for a node `n` records separately

- the `S`s of all `n rdf:subject S` triples
- the `P`s of all `n rdf:predicate P` triples
- the `O`s of all `n rdf:object O` triples
- the `T`s of all `n rdf:type T[Statement]` triples

If the `Fragments` becomes *singular*, ie each of these sets contains exactly one element, then `n` represents a reification of the triple `(S, P, O)`, and the `Fragments` object is replaced by that triple.
(If another reification triple for `n` arrives, then the triple is re-exploded into `Fragments`.)
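
The bookkeeping described for `SimpleReifier` can be sketched in a few lines of standalone Java. This is an illustration of the idea only, not the Jena source: `FragmentsSketch` and `ToyReifier` are hypothetical names, nodes are plain strings, a complete triple is represented as a three-element `String[]`, and the toy `handledAdd` takes the three components of the triple rather than a `Triple` object.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class FragmentsSketch {

    /** Partial or excessive reification information for one node. */
    static final class Fragments {
        final Set<String> subjects = new HashSet<>();
        final Set<String> predicates = new HashSet<>();
        final Set<String> objects = new HashSet<>();
        final Set<String> types = new HashSet<>();

        /** Files one (predicate, value) pair in its slot; false if it is not a reification triple. */
        boolean add(String predicate, String value) {
            switch (predicate) {
                case "rdf:subject":   subjects.add(value);   return true;
                case "rdf:predicate": predicates.add(value); return true;
                case "rdf:object":    objects.add(value);    return true;
                case "rdf:type":
                    if (!"rdf:Statement".equals(value)) return false;
                    types.add(value);
                    return true;
                default:
                    return false;
            }
        }

        /** Singular: every slot holds exactly one element. */
        boolean isSingular() {
            return subjects.size() == 1 && predicates.size() == 1
                && objects.size() == 1 && types.size() == 1;
        }

        /** The triple (S, P, O); only meaningful when singular. */
        String[] asTriple() {
            return new String[] {
                subjects.iterator().next(), predicates.iterator().next(), objects.iterator().next()
            };
        }
    }

    /** Toy reifier: maps a node either to Fragments or to a complete triple. */
    static final class ToyReifier {
        private final Map<String, Object> info = new HashMap<>();

        /** Returns true iff the triple (n, predicate, value) was captured as reification data. */
        boolean handledAdd(String n, String predicate, String value) {
            Object current = info.get(n);
            Fragments f;
            if (current instanceof String[]) {
                f = explode((String[]) current);      // re-explode a complete triple
            } else if (current != null) {
                f = (Fragments) current;
            } else {
                f = new Fragments();
            }
            if (!f.add(predicate, value)) return false;
            info.put(n, f.isSingular() ? f.asTriple() : f);
            return true;
        }

        /** The triple n reifies, or null if n has no, partial or excessive information. */
        String[] getTriple(String n) {
            Object current = info.get(n);
            return current instanceof String[] ? (String[]) current : null;
        }

        private static Fragments explode(String[] t) {
            Fragments f = new Fragments();
            f.subjects.add(t[0]);
            f.predicates.add(t[1]);
            f.objects.add(t[2]);
            f.types.add("rdf:Statement");
            return f;
        }
    }

    public static void main(String[] args) {
        ToyReifier r = new ToyReifier();
        r.handledAdd("n", "rdf:type", "rdf:Statement");
        r.handledAdd("n", "rdf:subject", "S");
        r.handledAdd("n", "rdf:predicate", "P");
        r.handledAdd("n", "rdf:object", "O");
        System.out.println(String.join(" ", r.getTriple("n")));
    }
}
```

Nodes that never receive a reification triple never enter the map, and a node with, say, two `rdf:subject` values stays as `Fragments`, so `getTriple` reports `null` for it.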