Ymris: the Jena3 rules language

introduction

Ymris is a language for making computations over RDF data. The computations are expressed as rules: given that some pattern of data exists, deduce some new pattern of data. Patterns are RDF triples linked by variables, augmented by an extensible bunch of filters written as Java classes.

The core syntax of Ymris is inspired by SPARQL [REF SPARQL]. The functionality of Ymris covers that of the Jena 2 rules language [REF RULES] with augmentations (eg aggregation) to increase coverage. This functionality is sufficient to implement OWL using the OWL 2 rules [REF OWL2]. Ymris also allows the description of combinations of sets of rules. Ymris can be implemented on a variety of scales ranging from a simple in-memory version to multi-processor streaming networks.

A current implementation of Ymris can be downloaded from [REF DOWNLOAD]. The reader may find it helpful to follow along the tutorial section of this document using that implementation. [The beta-reader will find it helpful to realise that every single command-line and GUI example below is completely made up pending some design and building.]

tutorial

a very simple rule

We'll start with a single rule that marks a dataset as having been processed:

RULE
  PREFIX my: <http://cdollin.hpl.hp.com/my/>
  PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/>

  { ?x a my:Header } => { ?x my:processedBy myRules:A }

The PREFIX declarations allow readably shortened names in the body of the rule. Here the premise of the rule is a single triple pattern which matches any triple with predicate rdf:type (abbreviated "a" in the style of N3/Turtle/SPARQL) and object (the expansion of) my:Header. The subject of the triple is matched by the variable ?x.

The conclusion of the rule is also a single triple, obtained by replacing the variable ?x by its matching value. The effects of applying this rule to some RDF dataset is to augment any my:Header individuals with a my:processedBy property whose value is myRules:A.

Note that this rule can only add my:processedBy properties to my:Header entities that already exist in the graph. Later on we'll see how to test whether something already exists in the graph or not.

a rule with two conclusions

As well as marking a dataset with what rules it has been processed with, we might want to mark them with when those rules were applied. Ymris allows functions to be called to provide the objects of statements, and the built-in function now returns the date when the rule is executed (more exactly, the date and time when the execution of the rules started; every call of now in that execution delivers the same date).

RULE
  PREFIX my: <http://cdollin.hpl.hp.com/my/>
  PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/>

  { ?x a my:Header } 
  => { ?x my:processedBy myRules:A; my:processedOn now() }

The SPARQL-style ";" operator allows us to provide multiple properties for the same entity.

blank nodes in conclusions

A problem with the rule above is that if two different rulesets -- myRules:A and myRules:B, say -- are applied, one to the results from the other, then the resulting graph will have the elements so mixed that we cannot tell which ruleset was applied when:

  my:example a my:Header
    ; my:processedBy myRules:A, myRules:B
    ; my:processedOn "[DATE HERE]", my:processedOn "DIFFERENT DATE HERE"
    .

We need to keep the rules name and date together, which we can do by introducing a bnode:

RULE
  PREFIX my: <http://cdollin.hpl.hp.com/my/>
  PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/>

  { ?x a my:Header } 
  => { ?x my:processed [my:processedBy myRules:A; my:processedOn now()] }

We permit the SPARQL syntax for bnodes to be used to introduce new blank nodes. Each binding for ?x will produce a new bnode, so if there are N header entities in the input graph, each one of those will get its own new my:processed bnode.

multiple premises

Rules are more interesting when they have multiple premises.

RULE
  PREFIX my: <http://cdollin.hpl.hp.com/my/>

  {?x my:hasParent ?y. ?y my:hasBrother ?z} => {?x my:hasUncle ?z}

The two statement patterns in the premises are separated by ".". The variable ?y links them together; it must be the same entity that is the parent of ?x has brother ?z.

multiple rules

If we can infer uncles, we can infer aunts, using a very similar rule. But we can't use just one rule to infer uncles and aunts as well. Ymris allows us to write multiple rules gathered into a ruleset.

RULESET "relatives, part 1"
PREFIX my: <http://cdollin.hpl.hp.com/my/>
  {
  RULE "infer uncles"
    {?x my:hasParent ?y. ?y my:hasBrother ?z} => {?x my:hasUncle ?z}

  RULE "infer aunts"
    {?x my:hasParent ?y. ?y my:hasSister ?z} => {?x my:hasAunt ?z}
  }

A ruleset can declare prefixes just as a rule can; those prefixes are shared by all the rules in the set. Rulesets and rules can have arbitrarily many labels, which are strings with an optional language tag, just like SPARQL string literals.

feedacross

relatives, part 1 doesn't allow for uncles-in-law and aunts-in-law; we can deal with this by having ?z to be appropriately married to the sibling ?y. We could write a parallel rule for this:

RULE "infer uncles 2"
  {?x my:hasParent ?y1. ?y1 my:marriedTo ?y2. ?y2 my:hasBrother ?z} => {?x my:hasUncle ?z}

and a similar parallel rule for "infer aunts". But there's another way, and that's to realise that we could generalise the notion of "brother" to "brother or brother-in-law", and similarly for aunts. Let's call the predicates for these generalised relations hasGeneralBrother and hasGeneralSister.

RULE "infer uncles 3"
  {?x my:hasParent ?y. ?y1 my:hasGeneralBrother ?z} => {?x my:hasUncle ?z}

RULE "infer aunts 3"
  {?x my:hasParent ?y. ?y1 my:hasGeneralSister ?z} => {?x my:hasAunt ?z}

RULE "general brothers 1"
  {?x my:hasBrother ?y} => {?x my:hasGeneralBrother ?y}

RULE "general brothers 2"
  {?x my:marriedTo ?y. ?y my:hasBrother ?z} => {?x my:hasGeneralBrother ?z}

RULE "general sisters 1"
  {?x my:hasSister ?y} => {?x my:hasGeneralSister ?y}

RULE "general sisters 2"
  {?x my:marriedTo ?y. ?y my:hasSister ?z} => {?x my:hasGeneralSister ?z}

This works because the results from one rule are available to the premises of the others, including itself.

filters

Let's take a step backwards and wonder how we might know that A hasBrother B, if that fact isn't provided. Let's define that someone's brother is a male who has the same parents as that someone:

RULE "brother"
  { ?x my:hasFather ?f; my:hasMother ?m. ?y my:hasFather ?f; my:hasMother ?m; a my:Male }
  => { ?x my:hasBrother ?y }

Something inconvenient happens when we apply this rule to a little family:

my:Bob a my:Male; my:hasFather my:A; my:hasMother my:B.

It concludes that Bob is his own brother; Bob is male and certainly has the same parents as himself. This mathematical elegance doesn't cut much mustard in the day-to-day world, so we will have to filter out the cases in brother where ?x and ?y are the same:

 RULE "brother with filter"
  { ?x my:hasFather ?f; my:hasMother ?m. ?y my:hasFather ?f; my:hasMother ?m; a my:Male
    FILTER ?x != ?y 
  }
  => { ?x my:hasBrother ?y }

The expression following FILTER is a condition which has to be true for the rule to reach its conclusion. The condition ?x != ?y means that (the values of) ?x and ?y must be different. This prevents Bob from being his own brother.

aggregation

Suppose we wanted to count the number of people with mothers in our dataset. We can certainly discover those people:

RULE "people with mothers"
  { ?x my:hasMother ?m } => ... what ... ?

If Ymris were an imperative programming language, we might think of initialising variables to zero and incrementing them every time we find a mothered person. For various technical reasons (summarised as "that can get very confusing and inefficient"), you can't do that in Ymris. What you can do is to bundle all the x's up together and count the size of the resulting collection:

RULE "people with mothers"
  {ALL ?x. { ?x my:hasMother ?m }} => { my:motheredCount rdf:value count(?x) }

The ALL ?x. says that we're looking for all the x's that are bound by the pattern ?x my:hasMother ?m. Outside the ALL, the variable x is bound to the complete collection -- the aggregate -- of matching values.

In the conclusion, the object of the rdf:value statement is the built-in function count, which returns the number of elements in its aggregate argument. So the effect of the rule is to count the number N of mothered persons in the input data and deduce the statement my:motheredCount rdf:value N.

The only useful thing you can do with an aggregate variable like x is to use it in an expression, eg as a function argument -- you can't use it as a subject, predicate, or object in a triple. Useful functions include max, min, sum, mean, [OTHERS?] and count.

There's a subtlety in the people with mothers rule which is shown up by asking for a count of mothers:

RULE "mothers of people"
  {ALL ?m. { ?x my:hasMother ?m }} => { my:motherCount rdf:value count(?m) }

Suppose we apply this to the model:

@prefix my: <http://cdollin.hpl.hp.com/my/>.

my:tom my:mother my:alice.
my:harry my:mother my:alice.

We will deduce my:motherCount rdf:value 2, because we're counting how many times someone is a mother not how many people are mothers. So mothers with multiple children are counted multiple times.

Fear not:

RULE "mothers of people"
  {ALL DISTINCT ?m. { ?x my:hasMother ?m }} => { my:motherCount rdf:value count(?m) }

Now we deduce my:motherCount rdf:value 1, because the DISTINCT modifier discards duplicate values of m.

multiple-variable aggregation, grouping, converting aggregates to lists/seqs/strings/etc, gotchas (circularities).

term negation with UNLESS

RDF is based on the open-world assumption that anyone can makes statements about anything, and so you can't draw conclusions from a local absence of some statement. But sometimes you can realistically say you have all the relevant data, or you'd like to draw a provisional conclusion.

Ymris allows you to draw conclusions from the absence of some triple pattern. This can be useful in data validation or for providing a default.

RULE "everyone should have an age"
  {?x a my:Person. UNLESS GIVEN {?x my:age ?y }} => {?x a my:AgeProblem}

The UNLESS GIVEN introduces a pattern which is looked for in the given instance data. If the pattern is not found -- ie, if there is some Person with no age -- then we conclude that x is an AgeProblem person.

RULE "faking a genre"
  {?x a my:Book. UNLESS GIVEN {?x my:genre ?y }} => {?x my:genre my:MainstreamFiction }

Here, if some Book hasn't been given a specific genre in the base data, we conclude ("assume" would be a better word) that it's mainstream fiction.

The term following UNLESS GIVEN can use all the matching machinery: multiple triples, use of functions and predicates, aggregation, even nested uses of UNLESS.

UNLESS GIVEN cannot see any triples that are deduced by the ruleset. This makes it easy to implement and understand. Sometimes we want to able to detect that even with inference some triple pattern never turns up.

RULE "if all else fails"
  {?x a my:Book. UNLESS ALREADY {?x my:genre ?y }} => {?x my:genre my:MainstreamFiction }

UNLESS ALREADY looks in both the base data and the deductions for any genre statements: it preferentially allows other rules to try and deduce a genre for the book. (A third variant form is UNLESS DERIVED, which checks only the deductions.)

I wonder if some of this should be dealt with by the layering possibilities of the ruleset combination language. An UNLESS DERIVED could just be an UNLESS GIVEN in the layer after the derivations. However, the rules compiler/deployer might be better positioned than the programmer to make these decisions. Does the layering argument also apply to aggregations?

the built-in predicates

Ymris has a range of built-in predicates, some of which it inherits from SPARQL, some of which are present for compatability with Jena 2, and some are specific to Ymris.

predicates inherited from SPARQL

is_iri(X) is true if X is a IRI node (not a literal value, not a blank node)
is_blank(X) is true if X is a blank node (not a literal value, not a IRI node).
is_literal(X) is true if X is a literal node (not a blank or IRI node).
regex(S, R, F) is true if S is a string literal which matches the regular expression R with the (optional) flags string F.
sameTerm(X, Y) is true if X and Y represent the same RDF terms. (Different from = in that = will fail if it finds literals whose types it does not understand.)
langMatches(T, R) is true if the language tag T matches the language range R according to the rules in http://www.ietf.org/rfc/rfc4647.txt.
=
!=
<
>

predicates inherited from Jena 2

See also functions inherited from Jena 2. Note that not all the Jena 2 predicates and functions are supported, for example the bound/unbound predicates are not supported because the expected implementation is as a forward engine where there are no unbound terms. Also predicates and functions with state are not supported. noValue is superseded by UNLESS.

isLiteral(?x), notLiteral(?x), isFunctor(?x), notFunctor(?x), isBNode(?x), notBNode(?x).

equal(?x,?y), notEqual(?x,?y) Test if x=y (or x != y). The equality test is semantic equality so that, for example, the xsd:int 1 and the xsd:decimal 1 would test equal.

lessThan(?x, ?y), greaterThan(?x, ?y) le(?x, ?y), ge(?x, ?y) Test if x is <, >,

= y. Only passes if both x and y are numbers or time instants (can be integer or floating point or XSDDateTime).

isDType(?l, ?t) notDType(?l, ?t) Tests if literal ?l is (or is not) an instance of the datatype defined by resource ?t.

print(?x, ...) Print (to standard out) a representation of each argument. This is useful for debugging rather than serious IO work.

listContains(?l, ?x) listNotContains(?l, ?x) Passes if ?l is a list which contains (does not contain) the element ?x, both arguments must be ground, can not be used as a generator.

listEntry(?list, ?index, ?val) Binds ?val to the ?index'th entry in the RDF list ?list. If there is no such entry the variable will be unbound and the call will fail. Only useable in rule bodies.

listLength(?l, ?len) Binds ?len to the length of the list ?l.

listEqual(?la, ?lb) listNotEqual(?la, ?lb) listEqual tests if the two arguments are both lists and contain the same elements. The equality test is semantic equality on literals (sameValueAs) but will not take into account owl:sameAs aliases. listNotEqual is the negation of this (passes if listEqual fails).

the built-in functions

Ymris has a range of built-in functions, from the same sources as its predicates. Any predicate may be used as a function delivering a boolean value, and any function that delivers a boolean value can be used as a predicate.

functions shared with SPARQL

datatype
lang
str
+
-
*
/

functions inherited from Jena 2

Some Jena 2 predicates are replaced by functions, since the predicates manipulated bindings and Ymris predicates and functions are not permitted to do so.

now(?x) Binds ?x to an xsd:dateTime value corresponding to the current time.

makeInstance(?x, ?p, ?v) makeInstance(?x, ?p, ?t, ?v) Binds ?v to be a blank node which is asserted as the value of the ?p property on resource ?x and optionally has type ?t. Multiple calls with the same arguments will return the same blank node each time - thus allowing this call to be used in backward rules.

sum(?a, ?b, ?c) addOne(?a, ?c) difference(?a, ?b, ?c) min(?a, ?b, ?c) max(?a, ?b, ?c) product(?a, ?b, ?c) quotient(?a, ?b, ?c) Sets c to be (a+b), (a+1) (a-b), min(a,b), max(a,b), (a*b), (a/b). Note that these do not run backwards, if in sum a and c are bound and b is unbound then the test will fail rather than bind b to (c-a). This could be fixed.

strConcat(?a1, .. ?an, ?t) uriConcat(?a1, .. ?an, ?t) Concatenates the lexical form of all the arguments except the last, then binds the last argument to a plain literal (strConcat) or a URI node (uriConcat) with that lexical form. In both cases if an argument node is a URI node the URI will be used as the lexical form.

regex(?t, ?p) regex(?t, ?p, ?m1, .. ?mn) Matches the lexical form of a literal (?t) against a regular expression pattern given by another literal (?p). If the match succeeds, and if there are any additional arguments then it will bind the first n capture groups to the arguments ?m1 to ?mn. The regular expression pattern syntax is that provided by java.util.regex. Note that the capture groups are numbered from 1 and the first capture group will be bound to ?m1, we ignore the implicit capature group 0 which corresponds to the entire matched string. So for example regexp('foo bar', '(.*) (.*)', ?m1, ?m2) will bind m1 to "foo" and m2 to "bar".

functions specific to Ymris

max
min
count
mean
sum

user-defined functions and predicates

An implementation of Ymris is expected to be able to allow the user to define their own functions and predicates. Because functions are referred to by URIs, different functions with the same (local) name can be used in a single ruleset.

No special syntax is required (or provided) by Ymris to use a user-declared function or predicate. Each implementation will provide some way to attach a function's URI to its implementation (eg, as a Java class with an appropriate method), and document that method.

definitions with LET

Sometimes it is useful to be able to introduce names for computed values within a rule, eg to avoid writing an expression multiple times, or to give the value an explanatory name.

RULE "demonstrate let (needs improvement)"
  { ?x P ?y. LET y2 = y * 2. ?z Q ?y2 } => { ?x PQ ?z }

The defined name appears between the LET and the = of the definition; the defining expression follows the =. The placement of the LET within the terms of the premises of the rule does not matter.

ruleset composition

Ymris allows rulesets to be composed from other rulesets using composition expressions following the explicit rules of the ruleset.

The choice of operator symbols may not be the best; what's important is the different semantic possibilities.

RULESET "composition soup"
PREFIX my: <http://cdollin.hpl.hp.com/my/>
RUN my:local, my:global

The rulesets are referred to by their URIs, which may be shortened as usual using declared prefixes. The comma-separated rulesets are merged and run as one ruleset; any inferences from my:global are visible to my:local and vice versa.

RULESET "composition by sequencing"
PREFIX my: <http://cdollin.hpl.hp.com/my/>
RUN my:local; my:global

Separating the ruleset references with the ; operator (note that this is not the same as the use of ; in terms) specifies that the rulesets are applied in sequence; the inferences from my:local are available to my:global, but the my:global inferences are not visible to my:local.

RULESET "composition by independant parallelism"
PREFIX my: <http://cdollin.hpl.hp.com/my/>
RUN my:local & my:global

Separating the ruleset inferences with the & operator specifies that the rulesets are applied independantly and their inferences are combined. Neither sub-ruleset sees the inferences produced by the other.

The operators can be combined, including the use of brackets () for grouping, allowing expressions like a & b & (c; d). The & operator has the weakest precedence and the ; the strongest, with , between them, so that:

a, b; c, d & e, f; g, h

parses as:

((a, b); (c, d)) & ((e, f); (g, h))

The rules in this ruleset can be referred to by the special identifier these. If no RUN directive is given, it is as though RUN these had been supplied.

Ymris grammar

See the full ymris grammar for the details of the syntax.

Note that there may be inconsistencies between the body of this document and the grammar. In that case, typically this document will reflect intentions that the grammar has not caught up with. Sometimes it'll be a bug in this document.

bibliography

The Riddle-Masters Game
SPARQL
Jena 2 rules
OWL 2 rules
RIF