Triple oriented vs. molecule oriented versioning

Finding the right level of granularity for versioning.

Triple oriented approach

The idea is to add the versioning information to each triple. This could be implemented by adding two extra columns to the table in which the triples are stored, one containing the revision number from which the triple is valid and one containing the number of the first revision in which the triple is no longer contained (the value of this column may be null).
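
The two extra columns can be pictured with a small in-memory sketch. The class name `VersionedTripleTable`, the record `Row` and its field names are assumptions made for this illustration only; a real store would keep the same information in a database table.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for the triple table with two revision columns. */
final class VersionedTripleTable {

    /** One row: the triple plus the revisions in which it starts and stops being valid. */
    record Row(String subject, String predicate, String object,
               int validFrom, Integer validUntil) {

        /** A null validUntil means the triple is still contained in the latest revision. */
        boolean isValidAt(int revision) {
            return revision >= validFrom
                    && (validUntil == null || revision < validUntil);
        }
    }

    private final List<Row> rows = new ArrayList<>();

    void add(Row row) {
        rows.add(row);
    }

    /** Returns the rows that make up the graph at the given revision. */
    List<Row> snapshot(int revision) {
        List<Row> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.isValidAt(revision)) {
                result.add(row);
            }
        }
        return result;
    }
}
```

Reconstructing any revision is then a filter over the validity interval, and revoking a triple only means filling in its second column.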

An implementation consequence of the triple based approach is that anonymous nodes can keep an identity across versions, namely the identifier stored in the field of the table of triples. A triple oriented versioning system may or may not expose this identity. Depending on this, the possible ways to keep the number of recorded changes low are:

Store the transactions made through a model API or with a query language and assume the application won't make unnecessary revocations/re-assertions
An algorithm to find a matching from anonymous resources in the old version to anonymous resources in the new version so that the number of revocations/re-assertions is minimal

The approach discussed here is the first one; it allows the API user to point to an anonymous resource across different versions.
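
For comparison, the second option could be approximated as sketched here. This is a greedy heuristic, not an algorithm that guarantees a minimal number of revocations/re-assertions, and it only looks at the properties directly attached to each anonymous resource. The class name `BlankNodeMatcher` and the representation of a node as a set of "predicate object" strings are assumptions made for this illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that guesses which old blank node corresponds to which new one. */
final class BlankNodeMatcher {

    /**
     * Greedy heuristic: pairs each old node with the still unmatched new node that
     * shares the most "predicate object" strings with it. The result is a plausible
     * matching, but it is not guaranteed to be the minimal one.
     * Keys are blank-node labels, values are the properties attached to that node.
     */
    static Map<String, String> match(Map<String, Set<String>> oldNodes,
                                     Map<String, Set<String>> newNodes) {
        Map<String, String> matching = new LinkedHashMap<>();
        Set<String> taken = new HashSet<>();
        for (Map.Entry<String, Set<String>> oldNode : oldNodes.entrySet()) {
            String best = null;
            int bestShared = 0;
            for (Map.Entry<String, Set<String>> newNode : newNodes.entrySet()) {
                if (taken.contains(newNode.getKey())) {
                    continue;
                }
                Set<String> shared = new HashSet<>(oldNode.getValue());
                shared.retainAll(newNode.getValue());
                if (shared.size() > bestShared) {
                    best = newNode.getKey();
                    bestShared = shared.size();
                }
            }
            // Old nodes that share nothing with any free new node stay unmatched.
            if (best != null) {
                matching.put(oldNode.getKey(), best);
                taken.add(best);
            }
        }
        return matching;
    }
}
```

Triples of matched nodes that are unchanged need no revocation; only the differing properties would be recorded as changes.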

Molecule based approach

Molecules are the elements of a lossless decomposition of a graph, i.e. a graph can be decomposed into molecules without losing information and without assigning a cross-component identity to bnodes. To such a molecule the information

Example

In this example "graph over time" (GOT) the old version looks like this:

<http://examle.org/content> a foaf:PersonalProfileDocument.
<http://examle.org/content> dc:modified "2001-10-13".
        [ a foaf:Person;
        foaf:firstName "Chris";
        foaf:lastName "Dollin";
        foaf:nickName "Electric Hedgehog"
        ] 
        [ a foaf:Person;
        foaf:firstName "Reto";
        foaf:lastName "Gmür"
        ]

and the new one:

<http://examle.org/content> a foaf:PersonalProfileDocument.
<http://examle.org/content> dc:modified "2006-12-15".
        [ a foaf:Person;
        foaf:firstName "Chris";
        foaf:lastName "Dollin";
        foaf:nickName "Perikles triumphant"
        ] 
        [ a foaf:Person;
        foaf:firstName "Reto";
        foaf:lastName "Bachmann-Gmür"
        ]

With the triple oriented approach the changes that happen in the database reflect the transactions done on the Model. Some Java code modifying the Model could have changed the properties of the existing anonymous resources, in which case 3 statements get invalidated in the database for the new version while 3 other statements start to be valid. If however the program had imported the new version from a file and updated the model without additional knowledge or guessing, all properties of the two anonymous resources in the old version would have been invalidated and a total of 8 statements added.
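
The two counts can be checked with a small sketch that compares versions as plain sets of triples. The class `TripleChangeCount`, its helper `version` and the blank-node labels `_:a` to `_:d` are assumptions made for this illustration; the in-place case reuses the labels of the old version, the import case gets fresh ones.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that counts the changes recorded by a triple oriented store. */
final class TripleChangeCount {

    record Triple(String subject, String predicate, String object) {}

    /** Builds one version of the example graph with the given blank-node labels. */
    static Set<Triple> version(String chris, String reto, String modified,
                               String nickName, String lastName) {
        String doc = "<http://examle.org/content>";
        return new HashSet<>(List.of(
                new Triple(doc, "rdf:type", "foaf:PersonalProfileDocument"),
                new Triple(doc, "dc:modified", modified),
                new Triple(chris, "rdf:type", "foaf:Person"),
                new Triple(chris, "foaf:firstName", "Chris"),
                new Triple(chris, "foaf:lastName", "Dollin"),
                new Triple(chris, "foaf:nickName", nickName),
                new Triple(reto, "rdf:type", "foaf:Person"),
                new Triple(reto, "foaf:firstName", "Reto"),
                new Triple(reto, "foaf:lastName", lastName)));
    }

    /** Triples contained in {@code a} but not in {@code b}. */
    static Set<Triple> minus(Set<Triple> a, Set<Triple> b) {
        Set<Triple> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Triple> old = version("_:a", "_:b", "2001-10-13", "Electric Hedgehog", "Gmür");
        // Same labels: the application modified the existing anonymous resources.
        Set<Triple> inPlace = version("_:a", "_:b", "2006-12-15", "Perikles triumphant", "Bachmann-Gmür");
        // Fresh labels: the new version was imported from a file.
        Set<Triple> imported = version("_:c", "_:d", "2006-12-15", "Perikles triumphant", "Bachmann-Gmür");

        System.out.println(minus(old, inPlace).size() + " invalidated, "
                + minus(inPlace, old).size() + " added");   // 3 invalidated, 3 added
        System.out.println(minus(old, imported).size() + " invalidated, "
                + minus(imported, old).size() + " added");  // 8 invalidated, 8 added
    }
}
```

The only difference between the two runs is whether the blank-node labels survive the update, which is exactly the knowledge the importing program lacks.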

With the molecule based approach the validity information stored in the meta-model refers to the following molecules:

<http://examle.org/content> a foaf:PersonalProfileDocument.
<http://examle.org/content> dc:modified "2006-12-15".

<http://examle.org/content> dc:modified "2001-10-13".

[ a foaf:Person;
        foaf:firstName "Chris";
        foaf:lastName "Dollin";
        foaf:nickName "Electric Hedgehog"
]

[ a foaf:Person;
        foaf:firstName "Reto";
        foaf:lastName "Gmür"
]

[ a foaf:Person;
        foaf:firstName "Chris";
        foaf:lastName "Dollin";
        foaf:nickName "Perikles triumphant"
]

[ a foaf:Person;
        foaf:firstName "Reto";
        foaf:lastName "Bachmann-Gmür"
]

In terms of the total number of triples stored, the molecule approach is as high as the worst case of the triple oriented approach; the number of component assertions/revocations is however equal to the best case of the triple oriented approach: 3 things get revoked and 3 asserted. The minimal changes to the database possible with the triple oriented approach rely on the external knowledge of the programmer or user (which would be reflected in different changes depending on whether Reto changed his name or one Reto left and another one joined).
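
A minimal sketch of how these counts come about, under a strong simplification: blank nodes occur only as subjects and all objects are ground, as in the example. The class `MoleculeDiff` and its label-free string form of a molecule are assumptions made for this illustration; a general decomposition also has to handle blank nodes in object position and inverse functional properties.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that diffs two graph versions on the level of molecules. */
final class MoleculeDiff {

    record Triple(String subject, String predicate, String object) {}

    /**
     * Simplified decomposition: every ground triple is its own molecule, and all
     * triples sharing a blank-node subject (label starting with "_:") form one
     * molecule. Each molecule is returned in a form that omits the blank-node label.
     */
    static Set<String> molecules(List<Triple> graph) {
        Set<String> result = new TreeSet<>();
        Map<String, Set<String>> byBlankNode = new LinkedHashMap<>();
        for (Triple t : graph) {
            if (t.subject().startsWith("_:")) {
                byBlankNode.computeIfAbsent(t.subject(), k -> new TreeSet<>())
                        .add(t.predicate() + " " + t.object());
            } else {
                result.add(t.subject() + " " + t.predicate() + " " + t.object());
            }
        }
        // Sorting the properties makes the string independent of label and triple order.
        for (Set<String> properties : byBlankNode.values()) {
            result.add("[ " + String.join("; ", properties) + " ]");
        }
        return result;
    }

    /** Molecules contained in {@code a} but not in {@code b}. */
    static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```

Applied to the two versions of the example, `minus(molecules(oldGraph), molecules(newGraph))` yields the 3 revoked molecules and the reverse call the 3 asserted ones, whatever blank-node labels the two versions happen to use.
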
If the two anonymous resources had constant inverse functional properties in the two versions, the recorded changes about the anonymous resources would be smaller:

revoked:

[       foaf:mbox <chris.dollin@hp.com>;
        foaf:nickName "Electric Hedgehog"
]

[       foaf:mbox <reto@gmuer.ch>;
        foaf:lastName "Gmür"
]

asserted:

[       foaf:mbox <chris.dollin@hp.com>;
        foaf:nickName "Perikles triumphant"
]

[       foaf:mbox <reto@gmuer.ch>;
        foaf:lastName "Bachmann-Gmür"
]

Different usage scenarios

Synchronization

Both approaches allow the transfer of changes to a remote system. For the triple oriented approach to keep the data transferred low, the anonymous nodes must have constant identifiers across the systems. This is not a problem if one system is a read-only copy of the other, but in the situation that both models can be changed independently the result of synchronizing two true models could be a false one.

Aggregation

An aggregator records the changes from different sources. This is possible with both approaches as long as, with the triple based approach, the b-node IDs are globally unique. If different sources assert the same information, the aggregator has to store both b-node IDs.

Resource oriented API

The Java code

Model model = ModelFactory.createDefaultModel();
Resource r1 = model.createResource(FOAF.Person);
Resource r2 = model.createResource(FOAF.Person);

creates a non-lean graph expressing the same content as the one created by

Model model = ModelFactory.createDefaultModel();
Resource r1 = model.createResource(FOAF.Person);

A dynamic merging, however, would probably break the expectations of the user of the OO language. As long as the Java objects live they cannot be treated as existential variables but must be treated as things with their own identity. With the triple oriented approach this is straightforward: as the different objects map to bNode ids, the Java instances can be stored losslessly in the system.
A molecule based store guarantees to keep the asserted content; redundant information may and ideally should be removed so that the returned graphs are lean. The space of Java instances can be seen as a scratch board which is converted to RDF when it is committed. The framework could be designed so as to discourage the programmer from keeping references to (anonymous) resources between transactions and/or to switch the objects to a "read-only" mode after committing.

Storing named graphs

A triple oriented store may well store multiple named graphs; the graph in which a triple is contained could be an additional field in the database table. A triple may thus be stored several times and be considered distinct depending on the containing graph; the same bNode id never appears in two graphs.
The GVS concept of a Source is a named graph changing over time; a molecule may be asserted by several sources. Isomorphic molecules are never stored twice, which makes it easy and fast to return the union of several overlapping models.
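
The sharing of molecules between sources can be sketched as a map from molecule to asserting sources. The class `MoleculeStore` and its methods are assumptions made for this illustration and are not the GVS API. The sketch also assumes that each molecule arrives as a canonical string that is identical for isomorphic molecules, and it leaves out the time dimension.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical store in which each molecule is kept once, however many sources assert it. */
final class MoleculeStore {

    // Canonical molecule string -> names of the sources asserting it.
    private final Map<String, Set<String>> assertedBy = new LinkedHashMap<>();

    /** Records that the source asserts the molecule; an already known molecule is reused. */
    void assertMolecule(String source, String canonicalMolecule) {
        assertedBy.computeIfAbsent(canonicalMolecule, k -> new TreeSet<>()).add(source);
    }

    /** Number of distinct molecules actually stored. */
    int storedMolecules() {
        return assertedBy.size();
    }

    /** Union of the given sources: every molecule asserted by at least one of them. */
    Set<String> union(Set<String> sources) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : assertedBy.entrySet()) {
            if (!Collections.disjoint(entry.getValue(), sources)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Because overlapping content is stored once, the union needs no duplicate elimination: it is a single pass over the stored molecules.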

Diffs

From the molecule-based store it is trivial to extract diffs which do not depend on b-node ids. The advantage of such a diff is that it has a context independent meaning, i.e. knowing the meaning of the named resources is sufficient to understand it. A diff depending on b-node ids, for instance, can only reasonably be signed with reference to the context of the resource in the compared models.

Conclusions

The triple oriented versioning approach fits nicely into a scenario where anonymous resources are treated similarly to named resources, i.e. where graphs are not leanified and an API user can keep references to anonymous resources. The molecule oriented approach is to be preferred when the relevant information is the content expressed according to RDF Semantics and where there is no way, aside from the meaning expressed by two versions of a graph, to associate anonymous resources as being one time-varying resource.