Ymris is a language for making computations over RDF data. The computations are expressed as rules: given that some pattern of data exists, deduce some new pattern of data. Patterns are RDF triples linked by variables, augmented by an extensible bunch of filters written as Java classes.
The core syntax of Ymris is inspired by SPARQL [REF SPARQL]. The functionality of Ymris covers that of the Jena 2 rules language [REF RULES] with augmentations (eg aggregation) to increase coverage. This functionality is sufficient to implement OWL using the OWL 2 rules [REF OWL2]. Ymris also allows the description of combinations of sets of rules. Ymris can be implemented on a variety of scales ranging from a simple in-memory version to multi-processor streaming networks.
A current implementation of Ymris can be downloaded from [REF DOWNLOAD]. The reader may find it helpful to follow along the tutorial section of this document using that implementation. [The beta-reader will find it helpful to realise that every single command-line and GUI example below is completely made up pending some design and building.]
We'll start with a single rule that marks a dataset as having been processed:
RULE PREFIX my: <http://cdollin.hpl.hp.com/my/> PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/> { ?x a my:Header } => { ?x my:processedBy myRules:A }
The PREFIX declarations allow readably shortened names in the body
of the rule. Here the premise of the rule is a single triple
pattern which matches any triple with predicate rdf:type
(abbreviated "a
" in the style of N3/Turtle/SPARQL) and object
(the expansion of) my:Header
. The subject of the triple is
matched by the variable ?x
.
The conclusion of the rule is also a single triple, obtained
by replacing the variable ?x
by its matching value. The effects
of applying this rule to some RDF dataset is to augment any
my:Header
individuals with a my:processedBy
property
whose value is myRules:A
.
Note that this rule can only add my:processedBy
properties to
my:Header
entities that already exist in the graph.
Later on we'll see how to test whether something already exists in the
graph or not.
As well as marking a dataset with what rules it has been processed with,
we might want to mark them with when those rules were applied.
Ymris allows functions to be called to provide the objects of
statements, and the built-in function now
returns
the date when the rule is executed (more exactly, the date and time
when the execution of the rules started; every call of
now
in that execution delivers the same date).
RULE PREFIX my: <http://cdollin.hpl.hp.com/my/> PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/> { ?x a my:Header } => { ?x my:processedBy myRules:A; my:processedOn now() }
The SPARQL-style ";" operator allows us to provide multiple properties for the same entity.
A problem with the rule above is that if two different rulesets
-- myRules:A
and myRules:B
, say --
are applied, one to the results from the other, then the
resulting graph will have the elements so mixed that we
cannot tell which ruleset was applied when:
my:example a my:Header ; my:processedBy myRules:A, myRules:B ; my:processedOn "[DATE HERE]", my:processedOn "DIFFERENT DATE HERE" .
We need to keep the rules name and date together, which we can do by introducing a bnode:
RULE PREFIX my: <http://cdollin.hpl.hp.com/my/> PREFIX myRules: <http://cdollin.hpl.hp.com/myRules/> { ?x a my:Header } => { ?x my:processed [my:processedBy myRules:A; my:processedOn now()] }
We permit the SPARQL syntax for bnodes to be used to introduce new blank
nodes. Each binding for ?x
will produce a new bnode, so if
there are N
header entities in the input graph, each one of
those will get its own new my:processed
bnode.
Rules are more interesting when they have multiple premises.
RULE PREFIX my: <http://cdollin.hpl.hp.com/my/> {?x my:hasParent ?y. ?y my:hasBrother ?z} => {?x my:hasUncle ?z}
The two statement patterns in the premises are separated by ".".
The variable ?y
links them together; it must be the
same entity that is the parent of ?x
has brother
?z
.
If we can infer uncles, we can infer aunts, using a very similar rule. But we can't use just one rule to infer uncles and aunts as well. Ymris allows us to write multiple rules gathered into a ruleset.
RULESET "relatives, part 1" PREFIX my: <http://cdollin.hpl.hp.com/my/> { RULE "infer uncles" {?x my:hasParent ?y. ?y my:hasBrother ?z} => {?x my:hasUncle ?z} RULE "infer aunts" {?x my:hasParent ?y. ?y my:hasSister ?z} => {?x my:hasAunt ?z} }
A ruleset can declare prefixes just as a rule can; those prefixes are shared by all the rules in the set. Rulesets and rules can have arbitrarily many labels, which are strings with an optional language tag, just like SPARQL string literals.
relatives, part 1
doesn't allow for uncles-in-law and
aunts-in-law; we can deal with this by having ?z
to
be appropriately married to the sibling ?y
.
We could write a parallel rule for this:
RULE "infer uncles 2" {?x my:hasParent ?y1. ?y1 my:marriedTo ?y2. ?y2 my:hasBrother ?z} => {?x my:hasUncle ?z}
and a similar parallel rule for "infer aunts". But there's another way,
and that's to realise that we could generalise the notion of "brother"
to "brother or brother-in-law", and similarly for aunts. Let's
call the predicates for these generalised relations hasGeneralBrother
and hasGeneralSister
.
RULE "infer uncles 3" {?x my:hasParent ?y. ?y1 my:hasGeneralBrother ?z} => {?x my:hasUncle ?z} RULE "infer aunts 3" {?x my:hasParent ?y. ?y1 my:hasGeneralSister ?z} => {?x my:hasAunt ?z} RULE "general brothers 1" {?x my:hasBrother ?y} => {?x my:hasGeneralBrother ?y} RULE "general brothers 2" {?x my:marriedTo ?y. ?y my:hasBrother ?z} => {?x my:hasGeneralBrother ?z} RULE "general sisters 1" {?x my:hasSister ?y} => {?x my:hasGeneralSister ?y} RULE "general sisters 2" {?x my:marriedTo ?y. ?y my:hasSister ?z} => {?x my:hasGeneralSister ?z}
This works because the results from one rule are available to the premises of the others, including itself.
Let's take a step backwards and wonder how we might know that
A hasBrother B
, if that fact isn't provided.
Let's define that someone's brother is a male who has the
same parents as that someone:
RULE "brother" { ?x my:hasFather ?f; my:hasMother ?m. ?y my:hasFather ?f; my:hasMother ?m; a my:Male } => { ?x my:hasBrother ?y }
Something inconvenient happens when we apply this rule to a little family:
my:Bob a my:Male; my:hasFather my:A; my:hasMother my:B.
It concludes that Bob is his own brother; Bob is male and certainly
has the same parents as himself. This mathematical elegance doesn't
cut much mustard in the day-to-day world, so we will have to
filter out the cases in brother
where ?x
and ?y
are the same:
RULE "brother with filter" { ?x my:hasFather ?f; my:hasMother ?m. ?y my:hasFather ?f; my:hasMother ?m; a my:Male FILTER ?x != ?y } => { ?x my:hasBrother ?y }
The expression following FILTER
is a condition which has to be true
for the rule to reach its conclusion. The condition ?x != ?y
means
that (the values of) ?x
and ?y
must be different. This
prevents Bob from being his own brother.
Suppose we wanted to count the number of people with mothers in our dataset. We can certainly discover those people:
RULE "people with mothers" { ?x my:hasMother ?m } => ... what ... ?
If Ymris were an imperative programming language, we might think of
initialising variables to zero and incrementing them every time we
find a mothered person. For various technical reasons (summarised as
"that can get very confusing and inefficient"), you can't do that
in Ymris. What you can do is to bundle all the x
's
up together and count the size of the resulting collection:
RULE "people with mothers" {ALL ?x. { ?x my:hasMother ?m }} => { my:motheredCount rdf:value count(?x) }
The ALL ?x.
says that we're looking for all the x
's
that are bound by the pattern ?x my:hasMother ?m
. Outside the
ALL
, the variable x
is bound to the complete
collection -- the aggregate -- of matching values.
In the conclusion, the object of the rdf:value
statement is the built-in function count
, which returns
the number of elements in its aggregate argument. So the effect of
the rule is to count the number N
of mothered persons
in the input data and deduce the statement my:motheredCount rdf:value N
.
The only useful thing you can do with an aggregate variable
like x
is to use it in an expression, eg as a
function argument -- you can't use it as a subject, predicate, or
object in a triple. Useful functions include max
,
min
, sum
, mean
,
[OTHERS?] and count
.
There's a subtlety in the people with mothers rule which is shown up by asking for a count of mothers:
RULE "mothers of people" {ALL ?m. { ?x my:hasMother ?m }} => { my:motherCount rdf:value count(?m) }
Suppose we apply this to the model:
@prefix my: <http://cdollin.hpl.hp.com/my/>. my:tom my:mother my:alice. my:harry my:mother my:alice.
We will deduce my:motherCount rdf:value 2
, because
we're counting how many times someone is a mother not
how many people are mothers. So mothers with multiple
children are counted multiple times.
Fear not:
RULE "mothers of people" {ALL DISTINCT ?m. { ?x my:hasMother ?m }} => { my:motherCount rdf:value count(?m) }
Now we deduce my:motherCount rdf:value 1
, because the
DISTINCT
modifier discards duplicate values of m
.
multiple-variable aggregation, grouping, converting aggregates to lists/seqs/strings/etc, gotchas (circularities).
RDF is based on the open-world assumption that anyone can makes statements about anything, and so you can't draw conclusions from a local absence of some statement. But sometimes you can realistically say you have all the relevant data, or you'd like to draw a provisional conclusion.
Ymris allows you to draw conclusions from the absence of some triple pattern. This can be useful in data validation or for providing a default.
RULE "everyone should have an age" {?x a my:Person. UNLESS GIVEN {?x my:age ?y }} => {?x a my:AgeProblem}
The UNLESS GIVEN
introduces a pattern which is looked
for in the given instance data. If the pattern is not found --
ie, if there is some Person
with no age
--
then we conclude that x
is an AgeProblem
person.
RULE "faking a genre" {?x a my:Book. UNLESS GIVEN {?x my:genre ?y }} => {?x my:genre my:MainstreamFiction }
Here, if some Book
hasn't been given a specific
genre
in the base data, we conclude ("assume" would be
a better word) that it's mainstream fiction.
The term following UNLESS GIVEN
can use all the matching
machinery: multiple triples, use of functions and predicates,
aggregation, even nested uses of UNLESS
.
UNLESS GIVEN
cannot see any triples that are deduced
by the ruleset. This makes it easy to implement and understand. Sometimes
we want to able to detect that even with inference some triple
pattern never turns up.
RULE "if all else fails" {?x a my:Book. UNLESS ALREADY {?x my:genre ?y }} => {?x my:genre my:MainstreamFiction }
UNLESS ALREADY
looks in both the base data and the deductions
for any genre
statements: it preferentially allows other rules
to try and deduce a genre
for the book. (A third variant form
is UNLESS DERIVED
, which checks only the deductions.)
I wonder if some of this should be dealt with by the layering possibilities of the ruleset combination language. AnUNLESS DERIVED
could just be anUNLESS GIVEN
in the layer after the derivations. However, the rules compiler/deployer might be better positioned than the programmer to make these decisions. Does the layering argument also apply to aggregations?
Ymris has a range of built-in predicates, some of which it inherits from SPARQL, some of which are present for compatability with Jena 2, and some are specific to Ymris.
is_iri(X)
is true if X
is a IRI node
(not a literal value, not a blank node)
is_blank(X)
is true if X
is a blank node
(not a literal value, not a IRI node).
is_literal(X)
is true if X
is a literal
node (not a blank or IRI node).
regex(S, R, F)
is true if S
is a string
literal which matches the regular expression R
with
the (optional) flags string F
.
sameTerm(X, Y)
is true if X
and Y
represent the same RDF terms. (Different from =
in that
=
will fail if it finds literals whose types it does not
understand.)
langMatches(T, R)
is true if the language tag T
matches the language range R
according to the rules in
http://www.ietf.org/rfc/rfc4647.txt.
=
!=
<
>
See also functions inherited from Jena 2. Note that not
all the Jena 2 predicates and functions are supported, for
example the bound/unbound
predicates are not supported
because the expected implementation is as a forward engine where
there are no unbound terms. Also predicates and functions with
state are not supported. noValue
is superseded
by UNLESS
.
isLiteral(?x), notLiteral(?x), isFunctor(?x), notFunctor(?x), isBNode(?x), notBNode(?x).
equal(?x,?y), notEqual(?x,?y) Test if x=y (or x != y). The equality test is semantic equality so that, for example, the xsd:int 1 and the xsd:decimal 1 would test equal.
lessThan(?x, ?y), greaterThan(?x, ?y) le(?x, ?y), ge(?x, ?y) Test if x is <, >,
= y. Only passes if both x and y are numbers or time instants (can be integer or floating point or XSDDateTime).
isDType(?l, ?t) notDType(?l, ?t) Tests if literal ?l is (or is not) an instance of the datatype defined by resource ?t.
print(?x, ...) Print (to standard out) a representation of each argument. This is useful for debugging rather than serious IO work.
listContains(?l, ?x) listNotContains(?l, ?x) Passes if ?l is a list which contains (does not contain) the element ?x, both arguments must be ground, can not be used as a generator.
listEntry(?list, ?index, ?val) Binds ?val to the ?index'th entry in the RDF list ?list. If there is no such entry the variable will be unbound and the call will fail. Only useable in rule bodies.
listLength(?l, ?len) Binds ?len to the length of the list ?l.
listEqual(?la, ?lb) listNotEqual(?la, ?lb) listEqual tests if the two arguments are both lists and contain the same elements. The equality test is semantic equality on literals (sameValueAs) but will not take into account owl:sameAs aliases. listNotEqual is the negation of this (passes if listEqual fails).
Ymris has a range of built-in functions, from the same sources as its predicates. Any predicate may be used as a function delivering a boolean value, and any function that delivers a boolean value can be used as a predicate.
Some Jena 2 predicates are replaced by functions, since the predicates manipulated bindings and Ymris predicates and functions are not permitted to do so.
now(?x) Binds ?x to an xsd:dateTime value corresponding to the current time.
makeInstance(?x, ?p, ?v) makeInstance(?x, ?p, ?t, ?v) Binds ?v to be a blank node which is asserted as the value of the ?p property on resource ?x and optionally has type ?t. Multiple calls with the same arguments will return the same blank node each time - thus allowing this call to be used in backward rules.
sum(?a, ?b, ?c) addOne(?a, ?c) difference(?a, ?b, ?c) min(?a, ?b, ?c) max(?a, ?b, ?c) product(?a, ?b, ?c) quotient(?a, ?b, ?c) Sets c to be (a+b), (a+1) (a-b), min(a,b), max(a,b), (a*b), (a/b). Note that these do not run backwards, if in sum a and c are bound and b is unbound then the test will fail rather than bind b to (c-a). This could be fixed.
strConcat(?a1, .. ?an, ?t) uriConcat(?a1, .. ?an, ?t) Concatenates the lexical form of all the arguments except the last, then binds the last argument to a plain literal (strConcat) or a URI node (uriConcat) with that lexical form. In both cases if an argument node is a URI node the URI will be used as the lexical form.
regex(?t, ?p) regex(?t, ?p, ?m1, .. ?mn) Matches the lexical form of a literal (?t) against a regular expression pattern given by another literal (?p). If the match succeeds, and if there are any additional arguments then it will bind the first n capture groups to the arguments ?m1 to ?mn. The regular expression pattern syntax is that provided by java.util.regex. Note that the capture groups are numbered from 1 and the first capture group will be bound to ?m1, we ignore the implicit capature group 0 which corresponds to the entire matched string. So for example regexp('foo bar', '(.*) (.*)', ?m1, ?m2) will bind m1 to "foo" and m2 to "bar".
An implementation of Ymris is expected to be able to allow the user to define their own functions and predicates. Because functions are referred to by URIs, different functions with the same (local) name can be used in a single ruleset.
No special syntax is required (or provided) by Ymris to use a user-declared function or predicate. Each implementation will provide some way to attach a function's URI to its implementation (eg, as a Java class with an appropriate method), and document that method.
Sometimes it is useful to be able to introduce names for computed values within a rule, eg to avoid writing an expression multiple times, or to give the value an explanatory name.
RULE "demonstrate let (needs improvement)" { ?x P ?y. LET y2 = y * 2. ?z Q ?y2 } => { ?x PQ ?z }
The defined name appears between the LET
and the =
of the definition; the defining
expression follows the =
. The placement of
the LET
within the terms of the premises of
the rule does not matter.
Ymris allows rulesets to be composed from other rulesets using composition expressions following the explicit rules of the ruleset.
RULESET "composition soup" PREFIX my: <http://cdollin.hpl.hp.com/my/> RUN my:local, my:global
The rulesets are referred to by their URIs, which may be
shortened as usual using declared prefixes. The comma-separated
rulesets are merged and run as one ruleset; any inferences from
my:global
are visible to my:local
and
vice versa.
RULESET "composition by sequencing" PREFIX my: <http://cdollin.hpl.hp.com/my/> RUN my:local; my:global
Separating the ruleset references with the ;
operator
(note that this is not the same as the use of ;
in terms) specifies that the rulesets are applied in sequence; the
inferences from my:local
are available to
my:global
, but the my:global
inferences
are not visible to my:local
.
RULESET "composition by independant parallelism" PREFIX my: <http://cdollin.hpl.hp.com/my/> RUN my:local & my:global
Separating the ruleset inferences with the &
operator specifies that the rulesets are applied independantly
and their inferences are combined. Neither sub-ruleset sees
the inferences produced by the other.
The operators can be combined, including the use of brackets
()
for grouping, allowing expressions like
a & b & (c; d)
. The &
operator
has the weakest precedence and the ;
the
strongest, with ,
between them, so that:
parses as:
The rules in this ruleset can be referred to by the special
identifier these
. If no RUN
directive
is given, it is as though RUN these
had been
supplied.
See the full ymris grammar for the details of the syntax.