Ymris: the Jena3 rules language

introduction

Ymris is a language for making computations over RDF data. The computations are expressed as rules: given that some pattern of data exists, deduce some new pattern of data. Patterns are RDF triples linked by variables, augmented by an extensible bunch of filters written as Java classes.

The core syntax of Ymris is inspired by SPARQL [REF SPARQL]. The functionality of Ymris covers that of the Jena 2 rules language [REF RULES] with augmentations (eg aggregation) to increase coverage. This functionality is sufficient to implement OWL using the OWL 2 rules [REF OWL2]. Ymris also allows the description of combinations of sets of rules. Ymris will be implemented on a variety of scales ranging from a simple in-memory version to multi-processor streaming networks.

A current implementation of Ymris can be downloaded from [REF DOWNLOAD]. The reader may find it helpful to follow along the tutorial section of this document using that implementation. [The beta-reader will find it helpful to realise that every single command-line and GUI example below is completely made up pending some design and building.]

tutorial

a very simple rule

We'll start with a single rule that marks a dataset as having been processed:

RULE
  PREFIX my: <http://cdollin.hpl.hp.com/my/>
  PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/>

  { ?x a my:Header } => { ?x my:processedBy myRules:A }

The PREFIX declarations allow readably shortened names in the body of the rule. Here the premise of the rule is a single triple pattern which matches any triple with predicate rdf:type (abbreviated "a" in the style of N3/Turtle/SPARQL) and object (the expansion of) my:Header. The subject of the triple is matched by the variable ?x.

The conclusion of the rule is also a single triple, obtained by replacing the variable ?x by its matching value. The effects of applying this rule to some RDF dataset is to augment any my:Header individuals with a my:processedBy property whose value is myRules:A.

Note that this rule can only add my:processedBy properties to my:Header entities that already exist in the graph. Later on we'll see how to test whether something already exists in the graph or not.

a rule with two conclusions

As well as marking a dataset with what rules it has been processed with, we might want to mark them with when those rules were applied. Ymris allows functions to be called to provide the objects of statements, and the built-in function now returns the date when the rule is executed (more exactly, the date and time when the execution of the rules started; every call of now in that execution delivers the same date).

RULE
  PREFIX my: <http://cdollin.hpl.hp.com/my/>
  PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/>

  { ?x a my:Header } 
  => { ?x my:processedBy myRules:A; my:processedOn now() }

The SPARQL-style ";" operator allows us to provide multiple properties for the same entity.

blank nodes in conclusions

A problem with the rule above is that if two different rulesets -- myRules:A and myRules:B, say -- are applied, one to the results from the other, then the resulting graph will have the elements so mixed that we cannot tell which ruleset was applied when:

  my:example a my:Header
    ; my:processedBy myRules:A, myRules:B
    ; my:processedOn "[DATE HERE]", my:processedOn "DIFFERENT DATE HERE"
    .

We need to keep the rules name and date together, which we can do by introducing a bnode:

RULE
  PREFIX my: <http://cdollin.hpl.hp.com/my/>
  PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/>

  { ?x a my:Header } 
  => { ?x my:processed [my:processedBy myRules:A; my:processedOn now()] }

We permit the SPARQL syntax for bnodes to be used to introduce new blank nodes. Each binding for ?x will produce a new bnode, so if there are N header entities in the input graph, each one of those will get its own new my:processed bnode.

multiple premises

Rules are more interesting when they have multiple premises.

RULE
  PREFIX my: <http://cdollin.hpl.hp.com/my/>

  {?x my:hasParent ?y. ?y my:hasBrother ?z} => {?x my:hasUncle ?z}

The two statement patterns in the premises are separated by ".". The variable ?y links them together; it must be the same entity that is the parent of ?x has brother ?z.

multiple rules

If we can infer uncles, we can infer aunts, using a very similar rule. But we can't use just one rule to infer uncles and aunts as well. Ymris allows us to write multiple rules gathered into a ruleset.

RULESET "relatives, part 1"
PREFIX my: <http://cdollin.hpl.hp.com/my/>
  {
  RULE "infer uncles"
    {?x my:hasParent ?y. ?y my:hasBrother ?z} => {?x my:hasUncle ?z}

  RULE "infer aunts"
    {?x my:hasParent ?y. ?y my:hasSister ?z} => {?x my:hasAunt ?z}
  }

A ruleset can declare prefixes just as a rule can; those prefixes are shared by all the rules in the set. Rulesets and rules can have arbitrarily many labels, which are strings with an optional language tag, just like SPARQL string literals.

feedacross

relatives, part 1 doesn't allow for uncles-in-law and aunts-in-law; we can deal with this by having ?z to be appropriately married to the sibling ?y. We could write a parallel rule for this:

RULE "infer uncles 2"
  {?x my:hasParent ?y1. ?y1 my:marriedTo ?y2. ?y2 my:hasBrother ?z} => {?x my:hasUncle ?z}

and a similar parallel rule for "infer aunts". But there's another way, and that's to realise that we could generalise the notion of "brother" to "brother or brother-in-law", and similarly for aunts. Let's call the predicates for these generalised relations hasGeneralBrother and hasGeneralSister.

RULE "infer uncles 3"
  {?x my:hasParent ?y. ?y1 my:hasGeneralBrother ?z} => {?x my:hasUncle ?z}

RULE "infer aunts 3"
  {?x my:hasParent ?y. ?y1 my:hasGeneralSister ?z} => {?x my:hasAunt ?z}

RULE "general brothers 1"
  {?x my:hasBrother ?y} => {?x my:hasGeneralBrother ?y}

RULE "general brothers 2"
  {?x my:marriedTo ?y. ?y my:hasBrother ?z} => {?x my:hasGeneralBrother ?z}

RULE "general sisters 1"
  {?x my:hasSister ?y} => {?x my:hasGeneralSister ?y}

RULE "general sisters 2"
  {?x my:marriedTo ?y. ?y my:hasSister ?z} => {?x my:hasGeneralSister ?z}

This works because the results from one rule are available to the premises of the others, including itself.

filters

Let's take a step backwards and wonder how we might know that A hasBrother B, if that fact isn't provided. Let's define that someone's brother is a male who has the same parents as that someone:

RULE "brother"
  { ?x my:hasFather ?f; my:hasMother ?m. ?y my:hasFather ?f; my:hasMother ?m; a my:Male }
  => { ?x my:hasBrother ?y }

Something inconvenient happens when we apply this rule to a little family:

my:Bob a my:Male; my:hasFather my:A; my:hasMother my:B.

It concludes that Bob is his own brother; Bob is male and certainly has the same parents as himself. This mathematical elegance doesn't cut much mustard in the day-to-day world, so we will have to filter out the cases in brother where ?x and ?y are the same:

 RULE "brother with filter"
  { ?x my:hasFather ?f; my:hasMother ?m. ?y my:hasFather ?f; my:hasMother ?m; a my:Male
    FILTER ?x != ?y 
  }
  => { ?x my:hasBrother ?y }

The expression following FILTER is a condition which has to be true for the rule to reach its conclusion. The condition ?x != ?y means that (the values of) ?x and ?y must be different. This prevents Bob from being his own brother.

aggregation

Suppose we wanted to count the number of people with mothers in our dataset. We can certainly discover those people:

RULE "people with mothers"
  { ?x my:hasMother ?m } => ... what ... ?

If Ymris were an imperative programming language, we might think of initialising variables to zero and incrementing them every time we find a mothered person. For various technical reasons (summarised as "that can get very confusing and inefficient"), you can't do that in Ymris. What you can do is to bundle all the x's up together and count the size of the resulting collection:

RULE "people with mothers"
  {ALL ?x. { ?x my:hasMother ?m }} => { my:motheredCount rdf:value count(?x) }

The ALL ?x. says that we're looking for all the x's that are bound by the pattern ?x my:hasMother ?m. Outside the ALL, the variable x is bound to the complete collection -- the aggregate -- of matching values.

In the conclusion, the object of the rdf:value statement is the built-in function count, which returns the number of elements in its aggregate argument. So the effect of the rule is to count the number N of mothered persons in the input data and deduce the statement my:motheredCount rdf:value N.

The only useful thing you can do with an aggregate variable like x is to use it in an expression, eg as a function argument -- you can't use it as a subject, predicate, or object in a triple. Useful functions include max, min, sum, mean, [OTHERS?] and count.

There's a subtlety in the people with mothers rule which is shown up by asking for a count of mothers:

RULE "mothers of people"
  {ALL ?m. { ?x my:hasMother ?m }} => { my:motherCount rdf:value count(?m) }

Suppose we apply this to the model:

@prefix my: <http://cdollin.hpl.hp.com/my/>.

my:tom my:mother my:alice.
my:harry my:mother my:alice.

We will deduce my:motherCount rdf:value 2, because we're counting how many times someone is a mother not how many people are mothers. So mothers with multiple children are counted multiple times.

Fear not:

RULE "mothers of people"
  {ALL DISTINCT ?m. { ?x my:hasMother ?m }} => { my:motherCount rdf:value count(?m) }

Now we deduce my:motherCount rdf:value 1, because the DISTINCT modifier discards duplicate values of m.

multiple-variable aggregation, grouping, converting aggregates to lists/seqs/strings/etc, gotchas (circularities).

term negation with UNLESS

RDF is based on the open-world assumption that anyone can makes statements about anything, and so you can't draw conclusions from a local absence of some statement. But sometimes you can realistically say you have all the relevant data, or you'd like to draw a provisional conclusion.

Ymris allows you to draw conclusions from the absence of some triple pattern. This can be useful in data validation or for providing a default.

RULE "everyone should have an age"
  {?x a my:Person. UNLESS GIVEN {?x my:age ?y }} => {?x a my:AgeProblem}

The UNLESS GIVEN introduces a pattern which is looked for in the given instance data. If the pattern is not found -- ie, if there is some Person with no age -- then we conclude that x is an AgeProblem person.

RULE "faking a genre"
  {?x a my:Book. UNLESS GIVEN {?x my:genre ?y }} => {?x my:genre my:MainstreamFiction }

Here, if some Book hasn't been given a specific genre in the base data, we conclude ("assume" would be a better word) that it's mainstream fiction.

The term following UNLESS GIVEN can use all the matching machinery: multiple triples, use of functions and predicates, aggregation, even nested uses of UNLESS.

UNLESS GIVEN cannot see any triples that are deduced by the ruleset. This makes it easy to implement and understand. Sometimes we want to able to detect that even with inference some triple pattern never turns up.

RULE "if all else fails"
  {?x a my:Book. UNLESS ALREADY {?x my:genre ?y }} => {?x my:genre my:MainstreamFiction }

UNLESS ALREADY looks in both the base data and the deductions for any genre statements: it preferentially allows other rules to try and deduce a genre for the book. (A third variant form is UNLESS DERIVED, which checks only the deductions.)

I wonder if some of this should be dealt with by the layering possibilities of the ruleset combination language. An UNLESS DERIVED could just be an UNLESS GIVEN in the layer after the derivations. However, the rules compiler/deployer might be better positioned than the programmer to make these decisions. Does the layering argument also apply to aggregations?

the built-in predicates

Ymris has a range of built-in predicates, some of which it inherits from SPARQL, some of which are present for compatability with Jena 2, and some are specific to Ymris.

predicates inherited from SPARQL

is_iri(X) is true if X is a IRI node (not a literal value, not a blank node)
is_blank(X) is true if X is a blank node (not a literal value, not a IRI node).
is_literal(X) is true if X is a literal node (not a blank or IRI node).
regex(S, R, F) is true if S is a string literal which matches the regular expression R with the (optional) flags string F.
sameTerm(X, Y) is true if X and Y represent the same RDF terms. (Different from = in that = will fail if it finds literals whose types it does not understand.)
langMatches(T, R) is true if the language tag T matches the language range R according to the rules in http://www.ietf.org/rfc/rfc4647.txt.
=
!=
<
>

predicates inherited from Jena 2

See also functions inherited from Jena 2. Note that not all the Jena 2 predicates and functions are supported, for example the bound/unbound predicates are not supported because the expected implementation is as a forward engine where there are no unbound terms. Also predicates and functions with state are not supported. noValue is superseded by UNLESS.

isLiteral(?x), notLiteral(?x), isFunctor(?x), notFunctor(?x), isBNode(?x), notBNode(?x).

equal(?x,?y), notEqual(?x,?y) Test if x=y (or x != y). The equality test is semantic equality so that, for example, the xsd:int 1 and the xsd:decimal 1 would test equal.

lessThan(?x, ?y), greaterThan(?x, ?y) le(?x, ?y), ge(?x, ?y) Test if x is <, >,

= y. Only passes if both x and y are numbers or time instants (can be integer or floating point or XSDDateTime).

isDType(?l, ?t) notDType(?l, ?t) Tests if literal ?l is (or is not) an instance of the datatype defined by resource ?t.

print(?x, ...) Print (to standard out) a representation of each argument. This is useful for debugging rather than serious IO work.

listContains(?l, ?x) listNotContains(?l, ?x) Passes if ?l is a list which contains (does not contain) the element ?x, both arguments must be ground, can not be used as a generator.

listEntry(?list, ?index, ?val) Binds ?val to the ?index'th entry in the RDF list ?list. If there is no such entry the variable will be unbound and the call will fail. Only useable in rule bodies.

listLength(?l, ?len) Binds ?len to the length of the list ?l.

listEqual(?la, ?lb) listNotEqual(?la, ?lb) listEqual tests if the two arguments are both lists and contain the same elements. The equality test is semantic equality on literals (sameValueAs) but will not take into account owl:sameAs aliases. listNotEqual is the negation of this (passes if listEqual fails).

the built-in functions

Ymris has a range of built-in functions, from the same sources as its predicates. Any predicate may be used as a function delivering a boolean value, and any function that delivers a boolean value can be used as a predicate.

functions shared with SPARQL

datatype
lang
str
+
-
*
/

functions inherited from Jena 2

Some Jena 2 predicates are replaced by functions, since the predicates manipulated bindings and Ymris predicates and functions are not permitted to do so.

now(?x) Binds ?x to an xsd:dateTime value corresponding to the current time.

makeInstance(?x, ?p, ?v) makeInstance(?x, ?p, ?t, ?v) Binds ?v to be a blank node which is asserted as the value of the ?p property on resource ?x and optionally has type ?t. Multiple calls with the same arguments will return the same blank node each time - thus allowing this call to be used in backward rules.

sum(?a, ?b, ?c) addOne(?a, ?c) difference(?a, ?b, ?c) min(?a, ?b, ?c) max(?a, ?b, ?c) product(?a, ?b, ?c) quotient(?a, ?b, ?c) Sets c to be (a+b), (a+1) (a-b), min(a,b), max(a,b), (a*b), (a/b). Note that these do not run backwards, if in sum a and c are bound and b is unbound then the test will fail rather than bind b to (c-a). This could be fixed.

strConcat(?a1, .. ?an, ?t) uriConcat(?a1, .. ?an, ?t) Concatenates the lexical form of all the arguments except the last, then binds the last argument to a plain literal (strConcat) or a URI node (uriConcat) with that lexical form. In both cases if an argument node is a URI node the URI will be used as the lexical form.

regex(?t, ?p) regex(?t, ?p, ?m1, .. ?mn) Matches the lexical form of a literal (?t) against a regular expression pattern given by another literal (?p). If the match succeeds, and if there are any additional arguments then it will bind the first n capture groups to the arguments ?m1 to ?mn. The regular expression pattern syntax is that provided by java.util.regex. Note that the capture groups are numbered from 1 and the first capture group will be bound to ?m1, we ignore the implicit capature group 0 which corresponds to the entire matched string. So for example regexp('foo bar', '(.*) (.*)', ?m1, ?m2) will bind m1 to "foo" and m2 to "bar".

functions specific to Ymris

max
min
count
mean
sum

user-defined functions and predicates

An implementation of Ymris is expected to be able to allow the user to define their own functions and predicates. Because functions are referred to by URIs, different functions with the same (local) name can be used in a single ruleset.

No special syntax is required (or provided) by Ymris to use a user-declared function or predicate. Each implementation will provide some way to attach a function's URI to its implementation (eg, as a Java class with an appropriate method), and document that method.

definitions with LET

Sometimes it is useful to be able to introduce names for computed values within a rule, eg to avoid writing an expression multiple times, or to give the value an explanatory name.

RULE "demonstrate let (needs improvement)"
  { ?x P ?y. LET y2 = y * 2. ?z Q ?y2 } => { ?x PQ ?z }

The defined name appears between the LET and the = of the definition; the defining expression follows the =. The placement of the LET within the terms of the premises of the rule does not matter.

ruleset composition

Ymris allows rulesets to be composed from other rulesets using composition expressions following the explicit rules of the ruleset.

The choice of operator symbols may not be the best; what's important is the different semantic possibilities.

RULESET "composition soup"
PREFIX my: <http://cdollin.hpl.hp.com/my/>
RUN my:local, my:global

The rulesets are referred to by their URIs, which may be shortened as usual using declared prefixes. The comma-separated rulesets are merged and run as one ruleset; any inferences from my:global are visible to my:local and vice versa.

RULESET "composition by sequencing"
PREFIX my: <http://cdollin.hpl.hp.com/my/>
RUN my:local; my:global

Separating the ruleset references with the ; operator (note that this is not the same as the use of ; in terms) specifies that the rulesets are applied in sequence; the inferences from my:local are available to my:global, but the my:global inferences are not visible to my:local.

RULESET "composition by independant parallelism"
PREFIX my: <http://cdollin.hpl.hp.com/my/>
RUN my:local & my:global

Separating the ruleset inferences with the & operator specifies that the rulesets are applied independantly and their inferences are combined. Neither sub-ruleset sees the inferences produced by the other.

The operators can be combined, including the use of brackets () for grouping, allowing expressions like a & b & (c; d). The & operator has the weakest precedence and the ; the strongest, with , between them, so that:

a, b; c, d & e, f; g, h

parses as:

((a, b); (c, d)) & ((e, f); (g, h))

The rules in this ruleset can be referred to by the special identifier these. If no RUN directive is given, it is as though RUN these had been supplied.

Ymris grammar

See the full ymris grammar for the details of the syntax.

Note that there may be inconsistencies between the body of this document and the grammar. In that case, typically this document will reflect intentions that the grammar has not caught up with. Sometimes it'll be a bug in this document.

the Ymris processing model

The meaning of Ymris rules is given by the Ymris semantics, which is expressed in terms of a translation of Ymris constructs into a graph of elements to be executed by the rules engine.

The elements express basic components of rule execution and are connected by typed streams. The element type can be:

Triple - the elements of the stream are (encodings of) RDF triples.
Bind(Name*) - each element of the stream is a binding for the named variables. (All elements of the stream have the same names bound.)
Poke - the elements of the stream are empty tuples. (Each poke is a signal. Typically only one poke is expected.)

There are nine defined element types. Two of them are paramterised on another type, the triple pattern: a triple pattern P is a triple where some (possibly zero) of the components have been replaced by named variables.

The notation [StreamType_in*]Element[StreamType_out*] describes an element with input inputs typed by in and output elements typed by out.

[Bind(B)]subst(P)[Triple] - the names appearing in P must be a subset of the names bound in B. Each input binding produces an output triple with the names in P replaced by their values from B.
[Triple]match(P)[Bind(B)] - the names appearing in B must be a subset of those appearing in P. Each input triple produces an output binding with all the names in P bound to the corresponding element of the triple.
[Bind(B_in)]filter(F(X))[Bind(B_out)] - the names in B_in and B_out must be the same. (See DROP below.) F is a predicate over tuples of values, and X is a tuple of names; every name in X must be bound in B_in. Each input binding is validated by applying F to the tuple of values obtained by replacing the names in X with their values from B_in. Valid bindings are sent to the output; invalid bindings are discarded.
[Bind(B_L),Bind(B_R)]join[Bind(B_out)] - The names in B_out must be the union of the names in B_L and B_R. Each input binding L (R) generates as output every consistent union of L (R) with all the previous bindings of R (L). (A consistent union is one in which any variable bound in both the operands is bound to the same value.)
Note that this implies a join element consumes store; in general, every binding that arrives must be preserved so that it may generate new bindings for arrivals on the other input stream.
[Bind(T), Bind(U), Poke]unless[Bind(T)] - this element generates no output until a Poke signal arrives. Then all bindings T that have arrived or will arrive are delivered to the output unless a binding U arrived which is a subset of T.
Note that this implies that an unless element consumes store for every U, and for every T that arrives before the Poke.
[Bind(X)]ignore - any bindings that arrive on the input are discarded.
ignore makes explict the points where bindings can be disposed of.
[S_in]fork[S_out*] - fork copies its input stream (which may be of any type) to all of its output streams (which are all of the same type as the input).
[Bind(X), Poke]aggregate(A, Y)[Bind(X), Bind(A)] - the names Y must be a subset of the names bound in X. aggregate copies its input bindings X to its X output. When the Poke signal arrives, it outputs down it's A output a single aggregate value binding A to a collection of bindings for the subset of X extracted by Y.
Sometimes the X output is not required, in which case it can be eaten by an ignore element.

The bindings X must all be stored until the Poke signal arrives.
[Bind(X), Poke]summarise[H(Y), Bind(X), Bind(Y)] - summarise copies its X input to its X output. When the Poke signal arrives, it outputs a summary of all the bindings X to its Y stream. The summary is defined by H and should require much less state than remembering all the X's.
The set of H options includes at least count, min, max, mean, sum, product.

summarise exists to make explicit some obvious optimisations of aggregation.
[Bind(B)]distinct[Bind(B)] - each input B generates an output of that same B unless such a B has already been output: duplicates are eliminated.
Hence every distinct B that arrives must be recorded.
[Bind(B)]drop(X)[Bind(B')] - B' must be a subset of B, and the names X must all appear in B. Each binding B that arrives generates a B' which is B with all the bindings for names in X removed.
drop makes explicit the points where bindings become irrelevant.

translation of rules into elements

We sketch here the unoptimised translation of Ymris rules into the rules processing network. We suppose that there is some finite incoming stream of triples from the base dataset, and a resulting stream of inferred triples to the deductions consumer: the details of these streams are not important to the unoptimised translation. We also suppose that it is possible to independantly query the base dataset for triple patterns for one of the implementations of UNLESS.

First pass: doesn't account for UNLESS, ALL, and their signals.

translation of a ruleset

To translate a ruleset RS consisting of rules R_i, create an input stream and output stream for each R_i and a feedback stream F.

merge the output of F with the input stream to RS. fork that merged stream to each of the input streams to the R_i.

merge the output streams of the R_i together and fork that merge, one copy to the output stream of RS and the other to the input side of F.

translation of a rule

Translate the premises of the rule, taking the input from the input triples stream. Translate the conclusions of the rule, feeding the output to the output of the rule. Connect the translated premises output bindings to the translated conclusions input bindings.

translation of a rule's conclusion

Fork the incoming bindings stream to each term in the conclusion. Merge the output triples streams from those terms to the output of the rule.

A translated conclusion term for a triple pattern P outputs P(B) for each binding B from its input stream, where P(B) is the triple resulting from replacing each variable in P with its binding from B.

translation of a rule's premises

A translated rules input stream is forked to provide a copy for each translated triple-pattern term. Each such translated term matches its pattern against its term and, if it matches, outputs a binding of the variable in the term to the corresponding value in the triple.

The premise term triple output streams are joined pairwise until only one joined stream remains. To join 1 stream, we're done; to join N>1 streams, join any N/2 streams to produce stream A, join the remaining streams to produce stream B, and then join A and B to produce the term output binding stream.

Each filter predicate in the premises becomes a filter element; all those elements are cascaded on the term output stream to produce the premise output output stream. The order of then filters does not matter. (No filter can refer to a variable unless it has been bound by triple terms; filters cannot bind variables.)

TODO

Aggregations. Unless. Let.

bibliography

The Riddle-Masters Game
SPARQL
Jena 2 rules
OWL 2 rules
RIF