Adopting Linked Media principles for Stanbol Entityhub

Linked Data describes the idea of linking - formerly unconnected - bits of data over the web. Think about how hyperlinks are used to navigate within the Web of documents; Linked Data tries to do the same for the Web of Data. This basic idea is also central to most of the Apache Stanbol components. However, Stanbol is concerned not only with linking data but also with interlinking the Web of documents with the Web of Data. Therefore the proposal to extend Linked Data principles to support content as well as data seems like a natural fit for Apache Stanbol.

This document first provides a short introduction to Linked Data and the proposed Linked Media extensions. The second part of the document analyzes the requirements of the Stanbol Entityhub related to Linked Data and Linked Media. The third section then goes into more detail on how Linked Media principles could be implemented by the Entityhub.

Short Introduction to Linked Data and proposed Linked Media extensions

from linkeddata.org

What is Linked Data?

The Web enables us to link related documents. Similarly it enables us to link related data. The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. Key technologies that support Linked Data are URIs (a generic means to identify entities or concepts in the world), HTTP (a simple yet universal mechanism for retrieving resources, or descriptions of resources), and RDF (a generic graph-based data model with which to structure and link data that describes things in the world).

The following terminology is often used with Linked Data:

A more detailed overview is provided by the Linked Data Tutorial.

Linked Media

The Linked Media proposal tries to extend Linked Data by two features.

  1. Creating and updating resources: Linked Data currently covers only the retrieval of information, which is sufficient for sites like DBpedia or Geonames where users only consume data. Interactive (web) applications, however, also need to create, update and remove information. These features are currently not covered by Linked Data, but they are well defined for RESTful services. The Linked Media proposal therefore suggests using HTTP PUT, POST and DELETE requests for this purpose.
  2. Handling both content and metadata: Linked Data uses Content Negotiation to select suitable content types. In addition it provides means to redirect to Information Resources about Non-Information Resources. However, Linked Data does not differentiate between metadata and content. One cannot explicitly ask first for a GIF image and later for its metadata as RDF, or first for an HTML blog post and later for its metadata formatted as HTML. Such a differentiation is only supported for Non-Information Resources, e.g. for a famous painting (Non-Information Resource) and a photo (Information Resource). Linked Media proposes to use the "rel" parameter of the Accept header to allow users to explicitly ask for content ("Accept: type/subtype; rel=content") or metadata ("Accept: type/subtype; rel=meta").

For a more detailed description please follow the link to the Linked Media proposal [1] as posted by Sebastian Schaffert on the W3C Linked Open Data mailing list. You might also be interested in reading the following discussion. Note also ResourceWebService [2], a first implementation of the Linked Media proposal based on the Kiwi2/Linked Media Framework [3][4].

Requirements of the Stanbol Entityhub

This section tries to identify requirements of the Stanbol Entityhub related to Linked Data and Linked Media. The goal of this analysis is to identify where it makes sense to adopt Linked Data/Media principles for the RESTful interface of the Entityhub.

The Entityhub fulfills two requirements:

  1. It allows users to define and manage a network of referenced sites from which information about entities is retrieved. In addition the Entityhub supports the use of local caches to speed up access and to be independent of the availability of remote services.
  2. It manages its own (local) site that is used to manage local entities. Such entities can be created locally, but it is also possible to import them from any referenced site. Typical examples of locally managed entities are customers, employees, concepts of a company thesaurus, offices, meeting rooms ...

Entity Model of the Entityhub

Each Entity managed by the Entityhub first defines a unique ID. If the referenced site follows Linked Data principles this will be the HTTP URI of the Non-Information Resource. However, it might be any valid URI (including URNs). The URI prefix of locally managed entities is configurable; the URI type of locally managed entities therefore depends on the configuration. The Entity itself represents a Non-Information Resource. Each Entity comes with a Representation, which holds all information known by the site about the entity. In Linked Data terminology the Representation is the Information Resource a user needs to be redirected to when requesting the Entity (Non-Information Resource). Finally an Entity also links to the ID of the (referenced) site managing it. This allows users to track who is providing the information for an Entity.

Currently the Entityhub distinguishes three different types of Entities:

  1. Sign: All Entities managed by referenced sites.
  2. Symbol: All locally managed Entities. Symbols hold additional metadata such as a preferred label and a state.
  3. EntityMapping: Mappings from Symbols to Signs. Linked Data typically uses owl:sameAs to define such mappings; in the Entityhub, however, such mappings need to hold additional meta information such as the state, the expiration date of the mapping ...

Metadata such as license, copyright statements and attributions, as well as information about the organization managing a referenced site, is managed with the referenced site and not with single entities.

All the additional information provided by these three Entity types, as well as the additional metadata provided for referenced sites, is - in terms of Linked Data principles - metadata about the Information Resource (the Representation) and not about the Non-Information Resource (the Entity).

Therefore the Entityhub manages:

RESTful Services of the Entityhub

The Entityhub defines the following service endpoints:

  1. The (referenced) Site Manager: Provides retrieval and search over all referenced sites.
  2. (referenced) Site Endpoint: Provides the same interface but for a specific referenced site.
  3. The Entityhub Endpoint: Provides full read/write and retrieval access for locally managed Entities.

Therefore the Entityhub needs to support read-only access for Entities managed by referenced sites and full read/write access (CRUD) for locally managed Entities.

Summary

Consuming Linked Data:

Referenced Entities (Entities of Referenced Sites)

Local Entities (Entities managed by the Entityhub)

Based on this evaluation of the model and the services provided by the Entityhub, the proposed Linked Media extension to the Linked Data principles would be sufficient to cover most of the functionality exposed by the Entityhub as RESTful services. For referenced sites only the distinction between metadata and content is needed. For locally managed Entities the possibility to create, update and remove Entities, their Representation (content) and metadata is also of central importance. The main functionality not covered is the import of Entities from referenced sites. In addition, for functionality like the creation of mappings and the management of the Entity workflow, special additions to the generic Linked Media/Linked Data API would be useful.

Specific Considerations

This section contains Entityhub specific considerations about some of the principles defined for Linked Data and Linked Media.

Resource Identifier

Linked Data defines the principle of using HTTP URIs as Resource Identifiers so that one can retrieve data by directly accessing the URI of a resource. This does not work for the Entityhub because it also needs to manage remote entities, and even for local entities it will not always be an option. Because of that the RESTful interface also needs to support an alternative that allows the URI of an entity to be passed as a parameter. This is also required so that the IDs of entities are not affected when the Entityhub is deployed on a different host or even accessed via localhost. In addition this allows the use of other URI types (mainly URNs but also other protocols such as LDAP) as identifiers for locally managed entities.
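The parameter-based alternative can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the path layout and the "uri" parameter name are taken from the URI scheme proposed later in this document. The main point is that the entity ID has to be percent-encoded before it is used as a query parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper (not part of the Entityhub API). */
final class EntityRequestUri {

    private EntityRequestUri() {}

    /**
     * Builds the parameter-based request URI for an entity.
     *
     * @param serviceRoot root of the Entityhub, e.g. "http://localhost:8080/entityhub" (assumed value)
     * @param site        ID of the (referenced) site
     * @param entityUri   ID of the entity; any URI type (HTTP URI, URN, ...)
     */
    static String forEntity(String serviceRoot, String site, String entityUri) {
        try {
            // URLEncoder applies form encoding, which is sufficient for a query parameter value.
            String encoded = URLEncoder.encode(entityUri, "UTF-8");
            return serviceRoot + "/" + site + "/entity?uri=" + encoded;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is a mandatory charset", e);
        }
    }
}
```

Because the entity ID is carried as an opaque parameter value, it stays the same regardless of the host the Entityhub is deployed on, and URNs work the same way as HTTP URIs.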

Redirects for Content Negotiation

It is important to consider that Entities are Non-Information Resources and, based on Linked Data principles, requests for Non-Information Resources need to be answered with redirects ("303 See Other") to the URI of the Information Resource. In practice such redirects are used for two things:

  1. To allow users to directly access (and bookmark) URIs of a specific format and therefore bypass content negotiation. This is mainly because browsers do not allow users to set the "Accept" header. Without this indirection typical users would be unable to retrieve formats other than HTML.

    For the Entityhub, where most requests will be issued by clients that support "Accept" headers, the use of redirects seems unfavorable for two reasons. First, it doubles the number of requests and adds an additional RTT (round trip time). Second, browsers always issue a GET request when following a redirect, independent of the method of the initial request. This can cause problems when returning redirects for POST, PUT and DELETE requests. For the Entityhub it would therefore make sense to provide the possibility to deactivate/activate the use of redirects (e.g. via a configuration, a request property or even a header field).

  2. To attach metadata to the Information Resources. As an example take the Linked Data endpoint of the New York Times. It uses "http://data.nytimes.com/{uuid}" for Entities and "http://data.nytimes.com/{uuid}.rdf" for the RDF/XML representations. When looking at the representations provided for Entities (e.g. take North Carolina) one can see that triples using "http://data.nytimes.com/{uuid}" as subject are data about North Carolina, whereas triples that use "http://data.nytimes.com/{uuid}.rdf" as subject represent metadata. Note also that the metadata is connected to the representation of North Carolina by the foaf:primaryTopic relation.

    When using the extensions proposed by Linked Media, it would be possible to directly refer to the metadata by setting the "rel" parameter of the "Accept" header to "meta". A request defining "Accept: application/rdf+xml; rel=meta" would therefore - assuming that redirects are deactivated - directly return the metadata for the requested entity (e.g. the license) encoded as RDF/XML. If redirects are enabled it would return a "303 See Other" with the URI of the metadata.

    Note that - in principle - there are two kinds of redirects: (1) redirects between Resources, which include redirects from Entities to the Representation ("rel=content") as well as to the Metadata ("rel=meta"); (2) redirects used for Content Negotiation. It would therefore be possible to enable/disable these types separately.

    Also note that in cases where several redirects would be needed to reach the final resource (e.g. when requesting information about a Non-Information Resource in "text/html": Non-Information Resource -> Information Resource -> HTML version), the request will directly return the final destination.

Redesigning the Entityhub

This section evaluates necessary changes to the Entityhub.

URI scheme for Resources

The support of Linked Data requires the use of a local URI. This is in contrast to the parameter based approach ("?id={remoteURI}") as currently used by the Entityhub. The goal is that the Entityhub allows both variants

http://{host}/entityhub/{site}/entity/{localname} and
http://{host}/entityhub/{site}/entity?uri={uri}

to refer to an Entity. This requires that the Entityhub provides a local HTTP URI for any (local or remote) entity. The suggestion is to use the local name of the remote entity, or the MD5 hash of the whole URI in cases where this is not possible.
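One possible way to derive such a local name is sketched below. The rule for what counts as a usable local name is an assumption of this sketch, not something this proposal defines: the part after the last '/' or '#' is used only if it consists of letters, digits, '_' and '-', so that it cannot clash with the dot-separated extensions introduced below. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of the Entityhub API). */
final class LocalNames {

    private LocalNames() {}

    /** Returns the local name of the entity URI, or the MD5 hex digest of the whole URI. */
    static String localNameFor(String entityUri) {
        int cut = Math.max(entityUri.lastIndexOf('#'), entityUri.lastIndexOf('/'));
        String candidate = entityUri.substring(cut + 1);
        // Assumed rule: only plain names without dots or reserved characters are used directly.
        if (cut >= 0 && candidate.matches("[A-Za-z0-9_-]+")) {
            return candidate;
        }
        return md5Hex(entityUri);
    }

    /** Lower-case hexadecimal MD5 digest of the UTF-8 bytes of the value. */
    static String md5Hex(String value) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 must be available on every Java platform", e);
        }
    }
}
```

With this rule "http://dbpedia.org/resource/Paris" maps to "Paris", while a URN such as "urn:example:customer:42" falls back to the 32-character MD5 digest. Note that two remote URIs from different namespaces can share the same local name; the sketch does not try to resolve such clashes.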

To support the redirects as defined by Linked Data it is also necessary to generate separate URIs for Representations. To support the differentiation between content and metadata a separate URI for the metadata is needed as well.

The proposal is to use file-extension-like additions to the local name of Entities:

http://{host}/entityhub/{site}/entity/{localname}.rep

is used to directly refer to the Representation of an Entity - in Linked Media terminology the Information Resource. Note that the local HTTP URI is used as the base for the ".rep" extension; "?uri={uri}.rep" will not be supported. Users of the Entityhub can therefore use the ".rep" extension to directly access the content for an Entity. Note that content negotiation will still be needed when requesting this kind of URI.

Similar to the above the ".meta" extension will be used for constructing URIs for the metadata:

http://{host}/entityhub/{site}/entity/{localname}.meta

For referenced entities such representations will be created by merging remote metadata with locally managed metadata. Remote metadata will be recognized by Resources with a foaf:primaryTopic relation to the Entity. Local metadata can include information known for the referenced site (e.g. license, copyright, attributions, information about the managing organization ...) as well as mappings to other (locally managed) entities.

For locally managed Entities the metadata will also include all the additional information as currently defined by the Symbol API (state, predecessors, successors).

Note that the URIs for Representations and Metadata are optional and will be omitted, based on HTTP request headers, if redirects are disabled. However, even if redirects are disabled it is still possible to use such URIs for requests.

URI scheme for Content Negotiation

To conform to the Linked Data principles the Entityhub needs to provide unique HTTP URIs for every content type that Information Resources (Content and Metadata Resources) can be serialized to. As with the ".rep" and ".meta" extensions used to directly access Representations and their Metadata, the proposal is to use file extensions to indicate the media type. If users wish to pass the remote URI as a parameter, it is also possible to pass the extension or the media type as a parameter.

http://{host}/entityhub/{site}/entity/{localname}.{extension} or
http://{host}/entityhub/{site}/entity?uri={uri}&format={extension}&mediaType={mediatype}

This shows the case where the extension is directly added to the local URI of the entity. In this case the "rel" parameter of the Accept header would be used to determine whether the content - the representation - or the metadata needs to be encoded in the response. If not specified, the representation will be returned.

To also allow the representation or the metadata to be addressed directly in a specific format, the Entityhub also supports the following two variants:

http://{host}/entityhub/{site}/entity/{localname}.rep.{extension}
http://{host}/entityhub/{site}/entity/{localname}.meta.{extension}

Note that the URIs used for content negotiation are optional and will be omitted, based on HTTP request headers, if redirects are disabled. However, even if redirects are disabled it is still possible to use such URIs for requests.
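To illustrate how the last path segment of these URI variants could be interpreted, here is a minimal sketch. It assumes that local names never contain a dot, which this proposal does not state explicitly; the class, enum and field names are hypothetical.

```java
/** Hypothetical parser for the last path segment of an entity URI. */
final class EntityPath {

    enum Kind { ENTITY, REPRESENTATION, METADATA }

    final String localName;
    final Kind kind;
    /** Format extension such as "rdf", or null if none was given. */
    final String extension;

    private EntityPath(String localName, Kind kind, String extension) {
        this.localName = localName;
        this.kind = kind;
        this.extension = extension;
    }

    /** Accepts {localname}[.rep|.meta][.{extension}] and rejects everything else. */
    static EntityPath parse(String segment) {
        String[] parts = segment.split("\\.", -1);
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty part in: " + segment);
            }
        }
        Kind kind = Kind.ENTITY;
        int next = 1;
        if (parts.length > 1) {
            if (parts[1].equals("rep")) {
                kind = Kind.REPRESENTATION;
                next = 2;
            } else if (parts[1].equals("meta")) {
                kind = Kind.METADATA;
                next = 2;
            }
        }
        String extension = null;
        if (next < parts.length) {
            extension = parts[next];
            next++;
        }
        if (next != parts.length) {
            throw new IllegalArgumentException("Unexpected trailing parts in: " + segment);
        }
        return new EntityPath(parts[0], kind, extension);
    }
}
```

For a segment such as "Paris.json" the kind stays ENTITY, so the "rel" parameter of the Accept header would still decide between content and metadata; for "Paris.meta.rdf" both the resource and the format are fixed by the URI.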

HTTP Request/Response Headers with special use

This section provides information about header fields that are specially evaluated by the Entityhub. The normal evaluation of headers as specified by RFC2616 section 14, e.g. the use of Content-Type to read data sent with PUT/POST requests, is not described.

Accept header

The Accept header allows clients to specify the media type they expect in the response. The Linked Media proposal suggests using the "rel" parameter to specify whether the response should return the data or the metadata of the requested resource. The semantics of the "rel" parameter are defined for the Link header by RFC5988. A related example can be found on the LinkHeader page on the W3C wiki.

The pattern usable for the Accept header looks like

Accept: {media-type}[; rel=meta]

If no "rel" parameter is specified the Entityhub will return the data (the representation of the entity) by default. If users want to retrieve the metadata they need to add "rel=meta". The {media-type} is always applied to the information selected by the "rel" parameter.
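The evaluation of this pattern could look roughly like the following sketch. It is deliberately simplified and only handles a single media range; comma-separated lists and quality values, which a real implementation would have to support, are ignored. The class and field names are hypothetical.

```java
/** Hypothetical holder for the parts of an Accept header the Entityhub evaluates. */
final class AcceptRel {

    final String mediaType;
    /** true if "rel=meta" was given, false for "rel=content" or a missing "rel". */
    final boolean metadata;

    private AcceptRel(String mediaType, boolean metadata) {
        this.mediaType = mediaType;
        this.metadata = metadata;
    }

    /** Parses a header value of the form {media-type}[; rel=meta]. */
    static AcceptRel parse(String headerValue) {
        String[] parts = headerValue.split(";");
        String mediaType = parts[0].trim();
        boolean metadata = false; // default: return the representation
        for (int i = 1; i < parts.length; i++) {
            String[] keyValue = parts[i].split("=", 2);
            if (keyValue.length == 2 && keyValue[0].trim().equalsIgnoreCase("rel")) {
                // Tolerate an optionally quoted value such as rel="meta".
                String value = keyValue[1].trim().replace("\"", "");
                metadata = value.equalsIgnoreCase("meta");
            }
        }
        return new AcceptRel(mediaType, metadata);
    }
}
```

For example, "application/rdf+xml; rel=meta" selects the metadata serialized as RDF/XML, while a plain "text/html" falls back to the representation.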

Cache-Control

The Entityhub supports the following cache-request-directives to allow clients some control over local caching of entities managed by remote sites. Note that the Stanbol OFFLINE mode has precedence over Cache-Control specifications.

Link header

The Link header is central to Linked Data and Linked Media because it is used to expose internal structures defined in-between Resources (in-between Entities but also between Entities and their Representations and Metadata).

The principal syntax of Link headers is as follows:

Link: <{uri}>; rel="{relation}"; type="{media-type}"

The relation parameter defines the type of the relation. Registered relation types are mainly used to improve the navigation of users. The values "content" and "meta" as suggested by the Linked Media proposal are currently not registered. In such cases RFC5988 requires the use of absolute URIs as {relation}. This document will use "content" and "meta" instead of the full URIs as required by RFC5988.

Regardless of that, the values used for the "rel" parameter within the "Link" header by the Entityhub MUST be the same as the values supported for the "rel" parameter in the "Accept" header of requests. A pragmatic solution would be to support both the short form and a full URI.
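As a small illustration of the syntax above, the following sketch formats a single Link header value. It uses the short relation names "content" and "meta" as this document does; the class and method names are hypothetical, and the example URI in the usage note only follows the URI scheme proposed above.

```java
/** Hypothetical helper (not part of the Entityhub API). */
final class LinkHeaders {

    private LinkHeaders() {}

    /**
     * Formats the value of a Link header.
     *
     * @param uri       target of the link
     * @param relation  relation type, e.g. "content" or "meta"
     * @param mediaType media type of the target, or null to omit the "type" parameter
     */
    static String linkValue(String uri, String relation, String mediaType) {
        StringBuilder value = new StringBuilder();
        value.append('<').append(uri).append(">; rel=\"").append(relation).append('"');
        if (mediaType != null) {
            value.append("; type=\"").append(mediaType).append('"');
        }
        return value.toString();
    }
}
```

Calling linkValue("http://localhost:8080/entityhub/dbpedia/entity/Paris.meta", "meta", "application/rdf+xml") yields the value for a link from an Entity to its metadata in RDF/XML.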

The Entityhub will add the following Links (if applicable)

Entity Model

These changes to the RESTful API should also be reflected in the Java API. Currently on the API level there are three types of Entities: Sign, Symbol and EntityMapping. The only difference between those Entities is a different set of metadata. However there is no plan to distinguish such types on the RESTful API level.

To streamline the domain model and to bring it more in line with the RESTful API the proposal is to drop the different Entity types. The Sign, Symbol and EntityMapping interfaces will be replaced by a single Entity interface with the following methods:

Entity
    + getId() : String
    + getSite() : String
    + getRepresentation() : Representation
    + getMetadata() : Representation

Using the Representation interface for the metadata as well allows the use of the same parsers and serializers for both content and metadata. Functionality currently depending on the special APIs of Sign, Symbol and EntityMapping needs to be adapted to retrieve the information via the Representation interface. This should be implemented by a utility class.
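In Java the proposed interface and such a utility class might look like the sketch below. To keep the example self-contained, Representation is declared here as a simplified stand-in with a single accessor; it is not the actual Entityhub interface. The utility class name and the field URI used for the state are placeholders, since this document does not define the metadata vocabulary.

```java
/** Simplified stand-in for the Entityhub Representation interface (assumption of this sketch). */
interface Representation {
    /** Returns the first value of the given field, or null if there is none. */
    Object getFirst(String field);
}

/** The single Entity interface as proposed above. */
interface Entity {
    String getId();
    String getSite();
    Representation getRepresentation();
    Representation getMetadata();
}

/** Hypothetical utility class replacing the type-specific accessors of Symbol. */
final class EntityMetadata {

    /** Placeholder field URI; the real vocabulary is not defined in this document. */
    static final String STATE_FIELD = "urn:example:entityhub:state";

    private EntityMetadata() {}

    /** Reads the state from the metadata of the entity, or returns null if it is not set. */
    static String getState(Entity entity) {
        Object value = entity.getMetadata().getFirst(STATE_FIELD);
        return value == null ? null : value.toString();
    }
}
```

Code that previously called a state accessor on Symbol would call EntityMetadata.getState(entity) instead, so the type-specific knowledge lives in one place rather than in separate interfaces.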

References

[1] http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html

[2] http://code.google.com/p/kiwi/source/browse/kiwi-core/src/main/java/kiwi/core/webservices/resource/ResourceWebService.java

[3] Kiwi Project: http://www.kiwi-community.eu/ Blog: http://planet.kiwi-project.eu/

[4] Kiwi Source Repository: http://code.google.com/p/kiwi/