The Stanbol Enhancement Structure (PROPOSAL)

Please NOTE: This is a proposal for the future version of the Enhancement Structure used by the Stanbol Enhancer. This DOES NOT describe the Enhancement Structure used by the current version of the Stanbol Enhancer!

This document describes the schema (ontology) used by the Apache Stanbol Enhancer to express features extracted from parsed content items. Its main purpose is to standardize the information created by EnhancementEngines, so that users can easily work with enhancement results and different enhancement engines can cooperate.

Overview

The Stanbol Enhancement Structure is built around the following main concepts. Each of these concepts covers a specific aspect of the content enhancement process.

The following list gives an overview of the concepts used by the Stanbol Enhancement Structure:

Overview of the Stanbol Enhancement Structure

Enhancements encoded based on this specification need to conform to the following rules:

Specification

Namespaces and used Notations

While the Stanbol Enhancement Structure defines some concepts and properties of its own, it also reuses many terms from existing ontologies. To improve readability, this specification uses namespace prefixes and local names instead of full URLs.

All the namespace prefixes used within this specification are described by the following list:

Notations used by this specification:

A special NOTE on the usage of <{code}> in comparison to {value}^^xsd:anyURI:

ContentItem <ci>

The ContentItem <ci> represents a piece of content passed to the Stanbol Enhancer. It is the central resource used to link all the enhancements created by the EnhancementEngines.

<ci> rdf:type sb:ContentItem
[<ci> sb:embeds-knowledge {knowledgeGraphId}]
[<ci> sb:has-section sb:ContentItem]
[<ci> <{metadatafield}> {value(s)}]

The ContentItem itself defines only two fields:

In addition, metadata extracted from or passed along with the content (e.g. Dublin Core, EXIF, ID3 ...) can also be added directly to the ContentItem <ci>. EnhancementEngines may use such information during the enhancement process.

Example: Embedded Knowledge

TODO: Move this to an own section about RDFa support!

This example shows how SIOC (Semantically-Interlinked Online Communities) and RDFa can be used to embed knowledge to tell Stanbol how to process parsed HTML markup.

<body about="http://www.examplenews.com/featuredNews"><table><tr>
    <td><!-- The menu: Not to be enhanced --> </td>
    <td><span property="sioc:content" about="http://www.examplenews.com/story123">
        This is the content of this page to be enhanced by the Stanbol Enhancer
    </span><span property="sioc:content" about="http://www.examplenews.com/interview456">
        And there may be more than one section within the document that needs to be enhanced
    </span></td>
    <td> <!-- Advertisements: Not to be enhanced --> </td>
</tr></table></body>

When this markup is passed as content, the Stanbol Enhancer should create:

NOTE: This assumes the presence of

Enhancement

The concept "Enhancement" defines properties that allow Stanbol EnhancementEngines to formally describe information about the enhancement process. This information is crucial for EnhancementEngines to cooperate with each other, but typical Stanbol users will not need to bother with it. In some situations such knowledge can still be useful on the client side, e.g. if someone wants to ignore all enhancements created by a specific enhancement engine, or to calculate all enhancements affected by the removal of a part of the content.

The following code segment shows the knowledge typically described by using the Enhancement concept:

<e> rdf:type sb:Enhancement
<e> dc:creator enhancementEngine^^xsd:anyURI
<e> dc:contributor enhancementEngine^^xsd:anyURI
<e> dc:created date^^xsd:dateTime
<e> dc:modified date^^xsd:dateTime
[<e> sb:relatedTo <relatedEnhancement>]
[<e> sb:dependsOn <dependsOnEnhancement>]
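As an illustration of the client-side use case mentioned above (calculating all enhancements affected by a removal), the following Python sketch follows sb:dependsOn links transitively. It assumes that the enhancement graph is available as a list of (subject, predicate, object) tuples using the prefixed names of this document. The function name affected_enhancements is hypothetical and not part of any Stanbol API.

```python
from collections import deque


def affected_enhancements(triples, removed):
    """Return every resource that directly or transitively depends on `removed`."""
    # Index: enhancement -> set of enhancements that declare sb:dependsOn on it.
    dependants = {}
    for subject, predicate, obj in triples:
        if predicate == "sb:dependsOn":
            dependants.setdefault(obj, set()).add(subject)

    affected = set()
    queue = deque([removed])
    while queue:
        current = queue.popleft()
        for dependant in dependants.get(current, ()):
            if dependant not in affected:
                affected.add(dependant)
                queue.append(dependant)
    # The removed resource itself is not reported, even if a cycle leads back to it.
    affected.discard(removed)
    return affected
```

A client could call this with the identifier of a removed annotation and discard every enhancement in the returned set.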

The presence of the statement "<e> rdf:type sb:Enhancement" indicates that enhancement metadata are present for the resource <e>. This also means that if some configuration is set to exclude such information, then all the above properties MUST be removed from the results of the enhancement process. The metadata defined by sb:Enhancement MUST be added for all sb:Annotation and sb:Suggestion instances created by an EnhancementEngine. This also includes any rdfs:subClassOf of those two concepts.
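The exclusion rule can be sketched in a few lines of Python. This is an illustration only: it assumes the results are held as (subject, predicate, object) tuples with prefixed names, it treats the rdf:type marker as part of the metadata to remove (an interpretation, not something the text states), and the function name is hypothetical.

```python
# Properties contributed by sb:Enhancement, taken from the code segment above.
ENHANCEMENT_PROPERTIES = {
    "dc:creator", "dc:contributor", "dc:created", "dc:modified",
    "sb:relatedTo", "sb:dependsOn",
}


def strip_enhancement_metadata(triples):
    """Return the triples without sb:Enhancement metadata."""
    # Only resources typed as sb:Enhancement carry enhancement metadata.
    enhancements = {s for s, p, o in triples
                    if p == "rdf:type" and o == "sb:Enhancement"}
    kept = []
    for s, p, o in triples:
        if s in enhancements:
            if p == "rdf:type" and o == "sb:Enhancement":
                continue  # drop the marker type (assumption, see lead-in)
            if p in ENHANCEMENT_PROPERTIES:
                continue  # drop the metadata properties
        kept.append((s, p, o))
    return kept
```

Note that a dc:creator value on the ContentItem itself is left untouched, because the ContentItem is not typed as sb:Enhancement.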

The following figure shows an example of an sb:Annotation and an sb:Suggestion for Paris with the corresponding metadata as defined by the sb:Enhancement concept.

Example: sb:Annotation and sb:Suggestion including sb:Enhancement metadata

Note that sb:Annotation and sb:Suggestion are not sub-classes of sb:Enhancement. EnhancementEngines need to add sb:Enhancement as an additional rdf:type to sb:Annotation and sb:Suggestion.

Description of the properties defined/used by sb:Enhancement:

In addition, EnhancementEngines might want or need to add further metadata to the sb:Annotation and sb:Suggestion instances they create. Implementors of such EnhancementEngines are free to define their own Enhancement types. Such types MUST be defined as rdfs:subClassOf sb:Enhancement and SHOULD use *Enhancement in their concept name. EnhancementEngines MUST also add both the specific type AND sb:Enhancement as rdf:type values.
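The typing rule for custom Enhancement types can be checked mechanically. The following Python sketch assumes a graph given as (subject, predicate, object) tuples with prefixed names. The type ex:NlpEnhancement used in the usage note and the function name are purely hypothetical.

```python
def missing_enhancement_type(triples):
    """Return resources typed with a subclass of sb:Enhancement
    that lack sb:Enhancement itself as an rdf:type value."""
    # Custom types declared as direct subclasses of sb:Enhancement.
    subclasses = {s for s, p, o in triples
                  if p == "rdfs:subClassOf" and o == "sb:Enhancement"}
    # Collect all rdf:type values per resource.
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    return sorted(resource for resource, values in types.items()
                  if values & subclasses and "sb:Enhancement" not in values)
```

For a graph in which <a1> carries both ex:NlpEnhancement and sb:Enhancement while <a2> carries only ex:NlpEnhancement, the function reports <a2> as violating the rule.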


Sections below are not yet updated


Annotations

The concept "Annotation" provides metadata about the extracted feature. This information is important both for the enhancement process and for the users of the Stanbol Enhancer. The following code segment shows the knowledge typically provided by an Annotation <a>. A description of the properties is provided below:

<a> rdf:type sb:Annotation
[<a> rdf:type sb:Enhancement, sb:Occurrence]
<a> sb:extracted-from <ci>
<a> dc:title label  //TODO: maybe it is better to use rdfs:label
<a> dc:role annotationRole^^xsd:anyURI
<a> dc:type annotationType^^xsd:anyURI
<a> sb:confidence value^^xsd:float
<a> sb:entity entity^^xsd:anyURI
<a> sb:entity-type entityType^^xsd:anyURI
<a> sb:suggestion <a1>

The following properties are defined for Annotations <a>

Annotation Types describe the type of the annotated feature based on a terminology standardized by Stanbol. Current types include:

This list should only contain types that are useful for grouping Annotations in user interfaces. The exact types of entities can still be added by using the sb:entity-type property.

TODO: We need to decide if we create our own controlled vocabulary within the Stanbol namespace or if we select some concepts defined in an external ontology (such as the dbpedia ontology that is currently used).

Annotation Roles describe the proposed role of the extracted feature in relation to the content. The following list shows the currently defined roles:

NOTE: Such roles should make it easier to support additional Annotation roles as suggested by STANBOL-48 and STANBOL-12, which includes STANBOL-28 and STANBOL-29.

sb:Suggestion

Suggestions are used by the Stanbol Enhancer to suggest possible values for the resolution of features extracted from the parsed content. Currently two different use cases for Suggestions are defined:

sb:Suggestion uses the following properties:

In addition, all sb:Suggestions are also of type sb:Enhancement to allow EnhancementEngines to provide enhancement metadata for them.

For details on how they are used, please see the following example.

Example

As an example, let's assume that the following RDFa-annotated content is passed to the Stanbol Enhancer:

<span typeof="cal:Vevent">
    <h3 property="dc:title"> Stanbol Teleconference </h3>
    <span property="cal:summary">
        <p> Agenda: </p>
        <ul>
            <li> ... </li>
        </ul>
        <p> Participants: </p>
        <ul>
            <li typeof="foaf:Person" property="foaf:name">Rupert Westenthaler</li>
            <li typeof="foaf:Person" property="foaf:name">Olivier Grisel</li>
            <li> ... </li>
        </ul>
    </span>
</span>

(1) Suggest the Entities for Rupert and Olivier
(2) Suggest to link Rupert and Olivier as values for "cal:attendee"

An EntityAnnotation would be present for both Rupert Westenthaler and Olivier Grisel. In this case it is created by the RDFa extractor, but in principle this also works if the RDFa markup is missing. In such cases the EntityAnnotations could be created by an NLPEnhancementEngine.

<a1> rdf:type sb:EntityAnnotation
<a1> dc:title "Rupert Westenthaler"
<a1> sb:entity-type foaf:Person
<a1> sb:hasOccurrence <o1>
<a1> sb:hasSuggestion <s1>

<a2> rdf:type sb:EntityAnnotation
<a2> dc:title "Olivier Grisel"
<a2> sb:entity-type foaf:Person
<a2> sb:hasOccurrence <o2>
<a2> sb:hasSuggestion <s2>

Let's ignore the occurrences (how to create Occurrences for RDFa markup is a whole different story that needs to be specified) and concentrate on the suggestions.

<s1> rdf:type sb:Suggestion
<s1> sb:entity <http://www.example.com/person/Rupert_Westenthaler>
<s1> sb:entity-type foaf:Person, vCard:vCard, dbpedia-ont:Person
<s1> sb:confidence 123.456

<s2> rdf:type sb:Suggestion
<s2> sb:entity <http://www.example.com/person/Olivier_Grisel>
<s2> sb:entity-type foaf:Person, vCard:vCard, dbpedia-ont:Person
<s2> sb:confidence 234.567

If the suggestion is accepted by the client, the RDFa markup could be updated like this:

<li about="http://www.example.com/person/Rupert_Westenthaler"
    typeof="foaf:Person" property="foaf:name">Rupert Westenthaler</li>
<li about="http://www.example.com/person/Olivier_Grisel"
    typeof="foaf:Person" property="foaf:name">Olivier Grisel</li>

Now let's take a detailed look at the suggestions to add Rupert and Olivier as "cal:attendee" values of the meeting. First we need an EntityAnnotation for the meeting, which would be created by the RDFa extractor:

<a> rdf:type sb:EntityAnnotation
<a> dc:title "Stanbol Teleconference"
<a> sb:entity-type cal:Vevent
<a> sb:hasOccurrence <o>
<a> sb:hasSuggestion <s3>
<a> sb:hasSuggestion <s4>

Again, let's skip the occurrence and look at the two suggestions. The goal here is to suggest using the Annotations for Rupert and Olivier as values for the property "cal:attendee".

It is important to suggest the annotations themselves as values here and NOT the suggested entities (e.g. http://www.example.com/person/Rupert_Westenthaler for Rupert), because the Stanbol Enhancer cannot assume that the user will accept the entity suggestions for Rupert and Olivier.

The following suggestions also use the sb:field property to tell the user that each suggestion is about values for the "cal:attendee" property.

<s3> rdf:type sb:Suggestion
<s3> sb:field cal:attendee
<s3> sb:entity <a1>
<s3> sb:entity-type sb:EntityAnnotation
<s3> sb:confidence 12.34

<s4> rdf:type sb:Suggestion
<s4> sb:field cal:attendee
<s4> sb:entity <a2>
<s4> sb:entity-type sb:EntityAnnotation
<s4> sb:confidence 12.34

NOTE:

Here is the RDFa markup if the user accepts the "cal:attendee" suggestions but not the entity suggestions:

<span typeof="cal:Vevent">
    [...]
    <p> Participants: </p>
    <ul property="cal:attendee">
        <li typeof="foaf:Person" property="foaf:name">Rupert Westenthaler</li>
        <li typeof="foaf:Person" property="foaf:name">Olivier Grisel</li>
        <li> ... </li>
    </ul>
</span>

And finally, the RDFa markup if all suggestions are accepted on the client side:

<span typeof="cal:Vevent">
    [...]
    <p> Participants: </p>
    <ul property="cal:attendee">
        <li about="http://www.example.com/person/Rupert_Westenthaler"
            typeof="foaf:Person" property="foaf:name">Rupert Westenthaler</li>
        <li about="http://www.example.com/person/Olivier_Grisel"
            typeof="foaf:Person" property="foaf:name">Olivier Grisel</li>
    </ul>
</span>

Occurrences

By default, detected features are considered to be extracted from the whole content. While this assumption is appropriate for things like categorizations and keywords, in many cases it is possible to specify the exact occurrence of a feature within the content and/or the metadata of the content. In such cases the sb:Annotation will define one or more values for the sb:occurrence property.

Different Occurrence descriptions are needed to describe the position of a feature within different types of content or within the parsed metadata.

TextOccurrence:

Describes the occurrence of a feature within textual content.

<o> rdf:type sb:TextOccurrence
    sb:TextOccurrence rdfs:subClassOf sb:Occurrence
<o> rdf:type sb:Occurrence
<o> sb:selected-text selectedText
<o> sb:start startPosition^^xsd:long
<o> sb:end endPosition^^xsd:long
<o> sb:context selectionContext
<o> sb:occurrence-within-context count^^xsd:int
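How the TextOccurrence values relate to the text can be sketched in Python. The specification does not state the offset convention, so this sketch assumes 1-based, inclusive positions, which matches the "Paris" example (start 28, end 32) in the use case section. It also assumes that the whole text is used as the context and that sb:occurrence-within-context is the ordinal of the occurrence within that context. The function name is hypothetical.

```python
def text_occurrence(content, selected_text, occurrence=1):
    """Build TextOccurrence values for the n-th occurrence of selected_text."""
    index = -1
    for _ in range(occurrence):
        # Continue searching behind the previous hit.
        index = content.find(selected_text, index + 1)
        if index < 0:
            raise ValueError("selected text does not occur that often")
    return {
        "sb:selected-text": selected_text,
        # Assumption: 1-based position of the first selected character.
        "sb:start": index + 1,
        # Assumption: 1-based position of the last selected character.
        "sb:end": index + len(selected_text),
        # Assumption: the whole content serves as the context.
        "sb:context": content,
        "sb:occurrence-within-context": occurrence,
    }
```

Calling text_occurrence("Next week I will travel to Paris", "Paris") yields start 28 and end 32 under these assumptions.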

MetadataOccurrence:

Describes the occurrence of a feature within the metadata of the parsed content. This is extremely useful for linking entities to literal values provided by metadata standards, such as creator information in Dublin Core, Artist, Album, Label ... information provided by ID3, or Camera Model information as present in EXIF metadata. Enhancements that map a geo-point to a City, Region or Country could also be done using this type of occurrence.

<o> rdf:type sb:MetadataOccurrence
    sb:MetadataOccurrence rdfs:subClassOf sb:Occurrence
<o> rdf:type sb:Occurrence
<o> sb:field metadataProperty^^xsd:anyURI
<o> sb:value value
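A minimal Python sketch of how MetadataOccurrence values could be derived from parsed metadata follows. It assumes the metadata is available as a mapping from field name to a list of literal values, and it groups all fields that share the same literal into a single occurrence, as done for "Richard Cypher" in the metadata use case further below. The function name and the dictionary layout are assumptions made for illustration.

```python
def metadata_occurrences(metadata):
    """Group metadata fields by literal value, one occurrence per value."""
    fields_by_value = {}
    for field, values in metadata.items():
        for value in values:
            fields_by_value.setdefault(value, []).append(field)
    # One sb:MetadataOccurrence description per distinct literal value.
    return [{"sb:value": value, "sb:field": fields}
            for value, fields in fields_by_value.items()]
```

Whether several fields should share one occurrence or get one occurrence each is a design decision this proposal does not settle. The grouping shown here simply mirrors the example.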

Other Occurrence Types

Use Cases and Examples

This section describes use cases showing how the Stanbol Enhancement Structure is used to enhance documents. It also provides examples of how users can query for enhancements based on the returned knowledge.

Simple Text Enhancement

A user types the text "Next week I will travel to Paris" and would like to get general enhancements such as tags, keywords and categories.

Let's assume that "Paris" was detected as a location and "travel" as a keyword. There are also two known entities with the name "Paris" and the type Location. This would result in the following enhancement graph:

# The content item 
<ci> rdf:type sb:ContentItem

# Paris as detected by the nlpEngine as location
<a1> rdf:type sb:Enhancement
<a1> rdf:type sb:Annotation
<a1> rdf:type sb:Occurrence
<a1> rdf:type sb:TextOccurrence
# Properties for Enhancement
<a1> sb:extracted-from <ci>
<a1> dc:creator urn:stanbol.engines:nlpEngine
<a1> dc:created "2011-02-28T12:13:14Z"
# Properties for Annotation
<a1> dc:title "Paris"
<a1> dc:role sb:Tag
<a1> dc:type dbpedia-ont:Place
<a1> sb:suggestion <a2>, <a3>
<a1> sb:confidence 0.85
# Properties for TextOccurrence
<a1> sb:selected-text "Paris"
<a1> sb:start 28
<a1> sb:end 32
<a1> sb:context "Next week I will travel to Paris"
<a1> sb:occurrence-within-context 1

# dbpedia:Paris as suggested Entity
<a2> rdf:type sb:Enhancement
<a2> rdf:type sb:Annotation
# Properties for Enhancement
<a2> sb:extracted-from <ci>
<a2> dc:requires <a1>
<a2> dc:creator urn:stanbol.engines:entityTaggingEngine
<a2> dc:created "2011-02-28T12:13:18Z"
# Properties for Annotation
<a2> dc:title "Paris"
<a2> dc:role sb:Suggestion
<a2> dc:type dbpedia-ont:Place
<a2> sb:entity http://dbpedia.org/resource/Paris
<a2> sb:entity-type dbpedia-ont:City, dbpedia-ont:Settlement, dbpedia-ont:PopulatedPlace, dbpedia-ont:Place
<a2> sb:confidence 123.456

# dbpedia:Paris,_Texas as suggested Entity
<a3> rdf:type sb:Enhancement
<a3> rdf:type sb:Annotation
# Properties for Enhancement
<a3> sb:extracted-from <ci>
<a3> dc:requires <a1>
<a3> dc:creator urn:stanbol.engines:entityTaggingEngine
<a3> dc:created "2011-02-28T12:13:19Z"
# Properties for Annotation
<a3> dc:title "Paris, Texas"
<a3> dc:role sb:Suggestion
<a3> dc:type dbpedia-ont:Place
<a3> sb:entity http://dbpedia.org/resource/Paris,_Texas
<a3> sb:entity-type dbpedia-ont:City, dbpedia-ont:Settlement, dbpedia-ont:PopulatedPlace, dbpedia-ont:Place
<a3> sb:confidence 12.34

# travel as detected keyword
<a4> rdf:type sb:Enhancement
<a4> rdf:type sb:Annotation
# Properties for Enhancement
<a4> sb:extracted-from <ci>
<a4> dc:creator urn:stanbol.engines:keywordExtractionEngine
<a4> dc:created "2011-02-28T12:13:22Z"
# Properties for Annotation
<a4> dc:title "travel"
<a4> dc:role sb:Keyword
<a4> dc:type dbpedia-ont:Activity //can we expect this to be available -> probably not

When consuming the results, the following queries could be used:

Getting all Tags (to get all Keywords/Categories, replace sb:Tag with sb:Keyword/sb:Category):

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX sb: <http://stanbol.apache.org/ontology/1.0/>
SELECT ?id ?title ?type
WHERE {
    ?id dc:role sb:Tag .
    ?id dc:title ?title .
    OPTIONAL { ?id dc:type ?type }
}
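For clients that hold the enhancement results in memory rather than behind a SPARQL endpoint, the query above can be approximated in plain Python. This is only a sketch: it assumes the graph is available as a list of (subject, predicate, object) tuples using the prefixed names of this document, and the function name select_by_role is hypothetical.

```python
def select_by_role(triples, role="sb:Tag"):
    """Return (id, title, type) rows; type is None when no dc:type is present."""

    def objects(subject, predicate):
        return [o for s, p, o in triples if s == subject and p == predicate]

    rows = []
    ids = sorted({s for s, p, o in triples if p == "dc:role" and o == role})
    for subject in ids:
        for title in objects(subject, "dc:title"):
            # Mirrors OPTIONAL: one row per type, or a single row with None.
            for type_ in objects(subject, "dc:type") or [None]:
                rows.append((subject, title, type_))
    return rows
```

Passing "sb:Keyword" or "sb:Category" as role corresponds to replacing sb:Tag in the query.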

Getting suggestions for a known Annotation (e.g. urn:annotation1)

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX sb: <http://stanbol.apache.org/ontology/1.0/>
SELECT ?entity ?title ?type ?score
WHERE {
    <urn:annotation1> sb:suggestion ?id .
    ?id dc:title ?title .
    ?id sb:entity ?entity .
    OPTIONAL { ?id sb:entity-type ?type } .
    OPTIONAL { ?id sb:confidence ?score }
}
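Once the suggestions are retrieved, a client will often want to pick one of them. The following Python sketch selects the suggestion with the highest sb:confidence. It assumes that the query results were converted into (entity, title, score) tuples and that a higher score means a better match. Neither assumption is stated by this specification, and the function name is hypothetical.

```python
def best_suggestion(rows):
    """Return the (entity, title, score) row with the highest score, or None."""
    # sb:confidence is OPTIONAL in the query, so the score may be missing.
    scored = [row for row in rows if row[2] is not None]
    if not scored:
        return None
    return max(scored, key=lambda row: row[2])
```

With the two suggestions from the Paris example, the row for "Paris" (123.456) is selected over "Paris, Texas" (12.34).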

Getting all selected Entities within the Text

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX sb: <http://stanbol.apache.org/ontology/1.0/>    
SELECT ?id ?title ?start ?end ?type
WHERE {
    ?id dc:role sb:Tag .
    ?id dc:title ?title .
    ?id sb:start ?start .
    ?id sb:end ?end .
    OPTIONAL { ?id dc:type ?type }
}

Getting all Locations and optionally the occurrences within the text

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX sb: <http://stanbol.apache.org/ontology/1.0/>    
PREFIX dbpedia-ont: <http://dbpedia.org/ontology/>  
SELECT ?id ?title ?start ?end
WHERE {
    ?id dc:type dbpedia-ont:Place .
    ?id dc:title ?title .
    OPTIONAL {
        ?id sb:start ?start .
        ?id sb:end ?end
    }
}

Enhancement of Metadata

This example shows how the Enhancement Structure allows enhancements to be created based on parsed metadata.

Let's assume that a user passes a content item and an additional file providing Dublin Core metadata that includes (among others):

Further assume that both Richard and Rachel work for the company running the Stanbol Enhancer and that there is an EnhancementEngine that knows about company resources. This example uses the URIs "http://www.company.org/team/Richard_Cypher" and "http://www.company.org/team/Rachel_Brandstone" to identify the two example employees.

#The content item
<ci> rdf:type sb:ContentItem
<ci> dc:creator "Richard Cypher", "Rachel Brandstone"
<ci> dc:contributor "Richard Cypher"
<ci> {other Dublin Core metadata extracted from the parsed file}

# Annotation describing the "Richard Cypher"
# Assumed to be created by the dcAnnotationEngine with the help
# of the entityTaggingEngine.
<a1> rdf:type sb:Enhancement
<a1> rdf:type sb:Annotation
<a1> rdf:type sb:Occurrence
<a1> rdf:type sb:MetadataOccurrence
# Properties for Enhancement
<a1> sb:extracted-from <ci>
<a1> dc:creator urn:stanbol.engines:dcAnnotationEngine
<a1> dc:contributor urn:stanbol.engines:entityTaggingEngine
<a1> dc:created "2011-02-28T13:14:15Z"
# Properties for Annotation
<a1> dc:title "Richard Cypher"
<a1> dc:role sb:Tag
<a1> dc:type dbpedia-ont:Person
<a1> sb:confidence 1.0
<a1> sb:entity http://www.company.org/team/Richard_Cypher
<a1> sb:entity-type foaf:Agent, foaf:Person, vCard:Contact
# Properties for MetadataOccurrence
<a1> sb:field dc:creator, dc:contributor
<a1> sb:value "Richard Cypher"

# Annotation describing the "Rachel Brandstone"
<a2> rdf:type sb:Enhancement
<a2> rdf:type sb:Annotation
<a2> rdf:type sb:Occurrence
<a2> rdf:type sb:MetadataOccurrence
# Properties for Enhancement
<a2> sb:extracted-from <ci>
<a2> dc:creator urn:stanbol.engines:dcAnnotationEngine
<a2> dc:contributor urn:stanbol.engines:entityTaggingEngine
<a2> dc:created "2011-02-28T13:14:22Z"
# Properties for Annotation
<a2> dc:title "Rachel Brandstone"
<a2> dc:role sb:Tag
<a2> dc:type dbpedia-ont:Person
<a2> sb:confidence 1.0
<a2> sb:entity http://www.company.org/team/Rachel_Brandstone
<a2> sb:entity-type foaf:Agent, foaf:Person, vCard:Contact
# Properties for MetadataOccurrence
<a2> sb:field dc:creator
<a2> sb:value "Rachel Brandstone"

NOTE: One could also create two sb:Annotations each for Richard and Rachel: one Annotation describing the annotated value and a second suggesting the entity for the first. That seems like unnecessary complexity as long as there is only one person with this name in the company. Nonetheless, this decision needs to be reviewed. For comparison, here is the code for Richard when using this variant:

#Annotation describing "Richard Cypher" as extracted from the DC description
<a1> rdf:type sb:Enhancement
<a1> rdf:type sb:Annotation
<a1> rdf:type sb:Occurrence
<a1> rdf:type sb:MetadataOccurrence
# Properties for Enhancement
<a1> sb:extracted-from <ci>
<a1> dc:creator urn:stanbol.engines:dcAnnotationEngine
<a1> dc:created "2011-02-28T13:14:15Z"
# Properties for Annotation
<a1> dc:title "Richard Cypher"
<a1> dc:role sb:Tag
<a1> dc:type dbpedia-ont:Person
<a1> sb:confidence 1.0
<a1> sb:suggestion <a3>
# Properties for MetadataOccurrence
<a1> sb:field dc:creator, dc:contributor
<a1> sb:value "Richard Cypher"

# Annotation describing the employee Richard Cypher
<a3> rdf:type sb:Enhancement
<a3> rdf:type sb:Annotation
# Properties for Enhancement
<a3> sb:extracted-from <ci>
<a3> dc:requires <a1>
<a3> dc:creator urn:stanbol.engines:entityTaggingEngine
<a3> dc:created "2011-02-28T13:14:18Z"
# Properties for Annotation
<a3> dc:title "Richard Cypher"
<a3> dc:role sb:Suggestion
<a3> dc:type dbpedia-ont:Person
<a3> sb:entity http://www.company.org/team/Richard_Cypher
<a3> sb:entity-type foaf:Agent, foaf:Person, vCard:Contact
<a3> sb:confidence 8.76

When consuming the results, the following queries could be used:

Getting all Annotations for the dc:creator field

Version based on variant 1:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX sb: <http://stanbol.apache.org/ontology/1.0/>    
SELECT ?id ?title ?creatorId
WHERE {
    ?id dc:title ?title .
    ?id sb:entity ?creatorId .
    ?id sb:field dc:creator .
}

Version for variant 2:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX sb: <http://stanbol.apache.org/ontology/1.0/>    
SELECT ?id ?title ?creatorId
WHERE {
    ?ma sb:field dc:creator .
    ?ma sb:suggestion ?id .
    ?id dc:title ?title .
    ?id sb:entity ?creatorId .
}

Getting all Annotations created for DC properties

Version based on variant 1:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX sb: <http://stanbol.apache.org/ontology/1.0/>    
SELECT ?id ?title ?field ?entity
WHERE {
    ?id dc:title ?title .
    ?id sb:entity ?entity .
    ?id sb:field ?field .
    FILTER(REGEX(STR(?field), "^http://purl.org/dc/terms/.*"))
}

Version based on variant 2:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX sb: <http://stanbol.apache.org/ontology/1.0/>    
SELECT ?id ?title ?field ?entity
WHERE {
    ?ma sb:field ?field .
    ?ma sb:suggestion ?id .
    ?id dc:title ?title .
    ?id sb:entity ?entity .
    FILTER(REGEX(STR(?field), "^http://purl.org/dc/terms/.*"))
}
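The intent of the namespace filter in the two queries above is to keep only fields whose URI starts with the Dublin Core terms namespace. The same check can be sketched in Python for clients that post-process results themselves. The function name is hypothetical, and the pattern is anchored at the start of the URI so that a URI merely containing the namespace somewhere in the middle is not matched.

```python
import re

# Matches URIs that start with the Dublin Core terms namespace.
DC_TERMS_FIELD = re.compile(r"^http://purl\.org/dc/terms/")


def dc_fields(field_uris):
    """Keep only the field URIs that lie in the Dublin Core terms namespace."""
    return [uri for uri in field_uris if DC_TERMS_FIELD.search(uri)]
```

A plain prefix comparison with str.startswith would work equally well. The regular expression is used here only to mirror the REGEX filter of the queries.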