Title: Using custom/local vocabularies with Apache Stanbol

The ability to work with custom vocabularies is necessary for many organisations. Use cases range from detecting various types of named entities specific to a company to detecting and working with concepts from a specific domain.

For text enhancement and linking to external sources, the Entityhub component of Apache Stanbol lets you work with local indexes of datasets. There are several reasons to do so:

- you do not want to rely on internet connectivity to external services and need to work offline with a huge set of entities,
- you want to manage local updates of public repositories, and
- you want to work with local resources only, such as your LDAP directory or a specific and private enterprise vocabulary of a specific domain.

Creating your own indexes is the preferred way of working with custom vocabularies. For small vocabularies, you can also upload simple ontologies together with instance data directly to the Entityhub and manage them there. The major downside of this approach is that you can only manage one ontology per installation.

This document focuses on the main case: creating and using local Solr indexes of custom vocabularies, e.g. a SKOS thesaurus or taxonomy of your domain.

## Creating and working with custom local indexes

Stanbol provides the machinery to start with vocabularies in standard languages such as [SKOS - Simple Knowledge Organization System](http://www.w3.org/2004/02/skos/) or more general [RDF](http://www.w3.org/TR/rdf-primer/) encoded data sets. The Stanbol components needed for this functionality are the Entityhub, for creating and managing the index, and several [Enhancement Engines](enhancer/engines/list.html), which make use of the indexes during the enhancement process.

### A. Create your own index

**Step 1 : Create the indexing tool**

The indexing tool provides a default configuration for creating a Solr index of RDF files (e.g.
a SKOS export of a thesaurus or a set of FOAF files). If the tool was not built during the Stanbol build process of the Entityhub, call

    mvn install

in the directory `{root}/entityhub/indexing/genericrdf/` and then

    mvn assembly:single

Move the generated tool from

    target/org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar

into a custom directory where you want to index your files.

**Step 2 : Create the index**

Initialize the tool with

    java -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar init

This creates a directory with the default configuration files, a directory for the sources and a distribution directory for the resulting files. Make sure that you adapt the default configuration with at least

- the id/name and license information of your data and
- the namespaces and property mappings you want to include in the index (see the example [mappings.txt](examples/anl-mappings.txt), which includes default and specific mappings for one dataset).

Then copy your source files into the respective directory `indexing/resources/rdfdata`. Several standard RDF formats are supported, as are multiple files and archives of them.

*For more details of possible configurations, please consult the [README](https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/genericrdf/README.md).*

Then you can start the indexing by running

    java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar index

Depending on your hardware and on the complexity and size of your sources, it may take several hours to build the index. As a result, you will get an archive of a [Solr](http://lucene.apache.org/solr/) index together with an OSGi bundle to work with the index in Stanbol.

**Step 3 : Initialize the index within Stanbol**

At your running Stanbol instance, copy the ZIP archive into `{root}/sling/datafiles`. Then, in the "Bundles" tab of the administration console, add and start the `org.apache.stanbol.data.site.{name}-{version}.jar`.

### B. Configure and use the index with enhancement engines

Before you can make use of the custom vocabulary, you need to decide which kind of enhancements you want to support. If your enhancements are Named Entities in the strict sense (Persons, Locations, Organizations), you may use the standard NER engine together with its EntityLinkingEngine to configure the destination of your links.

If you want to match all kinds of named entities and concepts from your custom vocabulary, you should work with the [KeywordLinkingEngine](enhancer/engines/keywordlinkingengine.html) to both find occurrences and link them to custom entities. In this case you only get results when there is a match in your vocabulary, whereas with the NER-based approach you also get entities for which no exact link is found. The KeywordLinkingEngine approach has its advantages when you need a high recall rate on your custom entities.

The configuration options are described briefly below.

**Use the KeywordLinkingEngine only**

(1) To make sure that the enhancement process uses the KeywordLinkingEngine only, deactivate the "standard NLP" enhancement engines, especially the NamedEntityExtractionEnhancementEngine (NER) and the EntityLinkingEngine, before working with the KeywordLinkingEngine.

(2) Open the configuration console at http://localhost:8080/system/console/configMgr and navigate to the KeywordLinkingEngine. Its main options are configurable via the UI.
- Referenced Site: {put the id/name of your index}
- Label Field: {the property to search for}
- Type Field: {types of matched entries}
- Redirect Field: {redirection links}
- Redirect Mode: {ignore, follow, add values}
- Min Token Length: {set minimal token length}
- Suggestions: {maximum number of suggestions}
- Languages: {languages to use}

*Full details on the engine and its configuration are available [here](enhancer/engines/keywordlinkingengine.html).*

**Use several instances of the KeywordLinkingEngine**

Working with several instances of the KeywordLinkingEngine at the same time can be useful when you have two or more distinct custom vocabularies/indexes, or when you want to combine your specific domain vocabulary with general-purpose datasets such as dbpedia.

**Use the KeywordLinkingEngine together with the NER engine and the EntityLinkingEngine**

If your text corpus contains both common and enterprise-specific entities and you are interested in getting enhancements for both, you may use the KeywordLinkingEngine for your custom thesaurus and, at the same time, the NER engine together with the EntityLinkingEngine targeting e.g. dbpedia.

## Examples

You can find guidance for the following indexers in the README files at `{root}/entityhub/indexing/{name-for-indexer}`:

- [dbpedia](http://dbpedia.org/) dataset (Wikipedia data). For dbpedia, there is also a [script](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh) available that helps in generating your own dbpedia index.
- [geonames.org](http://www.geonames.org) dataset (geolocation data)
- [DBLP](http://dblp.uni-trier.de/) dataset (scientific bibliography data)

## Demos and Resources

- The full [demo](http://dev.iks-project.eu:8081/) installation of Stanbol is configured to also work with an environmental thesaurus. If you test it with unstructured text from that domain, you should get enhancements with additional results for specific "concepts".
- Download custom test indexes and installer bundles for Stanbol from [here](http://dev.iks-project.eu/downloads/stanbol-indices/) (e.g. for the GEMET environmental thesaurus, or a big dbpedia index).
- A very concrete example using metadata from the Austrian National Library is described [here](http://blog.iks-project.eu/using-custom-vocabularies-with-apache-stanbol/).
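
Once a custom index is installed and an engine is configured as described in sections A and B, a quick smoke test over Stanbol's REST interface can help confirm that the vocabulary is actually used. The following is a minimal sketch, not an official procedure. It assumes that Stanbol listens on `http://localhost:8080`, that `myvocab` stands in for the id/name of your index, and that the `/entityhub/site/{id}/find` and `/enhancer` paths are available in your Stanbol version. The helper names `run`, `find_label` and `enhance_text` are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Smoke test for a custom index (sketch).
# Assumptions, not taken from the text above:
#   - Stanbol listens on http://localhost:8080
#   - "myvocab" is a placeholder for the id/name of your index
#   - the /entityhub/site/{id}/find and /enhancer REST paths exist
#     in your Stanbol version

STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"
SITE_ID="${SITE_ID:-myvocab}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the curl commands instead of running them

# Run a command, or only print it (without shell quoting) when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Look up a label in the custom index via the Entityhub.
find_label() {
    run curl -s -X POST --data-urlencode "name=$1" -d "limit=5" \
        "$STANBOL_URL/entityhub/site/$SITE_ID/find"
}

# Send plain text to the enhancer and ask for Turtle output.
enhance_text() {
    run curl -s -X POST \
        -H "Content-Type: text/plain" -H "Accept: text/turtle" \
        --data "$1" "$STANBOL_URL/enhancer"
}

find_label "water"
enhance_text "Water pollution affects many rivers."
```

By default the script only prints the two `curl` commands, so it is safe to run without a Stanbol instance. Setting `DRY_RUN=0` sends the requests. If the index is active, the first call should return entities from your vocabulary and the second should return enhancements that reference them.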