Title: Configure Apache Stanbol to work with multiple languages

The following languages are supported:

- English
- German
- Danish
- Swedish
- Dutch
- Portuguese

##Configuration steps

- Have language labels in your target data and install the index
- Add language models to your Stanbol instance
- Activate the LangIdEnhancementEngine and the KeywordLinkingEngine
- Configure the KeywordLinkingEngine

###Install your index

DBpedia provides language labels for many entities. If you want to use an index of your custom vocabulary, first [create the index](customvocabulary.html) from it and add the index to your Stanbol instance: copy the {yourindex}.solr.zip into your {stanbol-root}/sling/datafiles directory and install the respective OSGi bundle at your OSGi admin console. Make sure that this index contains labels in all languages you want to work with and that they are properly indexed.

###Build and add the necessary language bundles

To build the language bundles, go to "{stanbol-root}/data/" and call

    mvn clean install -P opennlp

This enables the profile that builds the OpenNLP models for all languages. Afterwards the bundles are available in the folder

    {stanbol-root}/data/opennlp/lang/{language}/target

The bundles are named "org.apache.stanbol.data.opennlp.lang.{language}-*.jar". Add them via the bundles tab of the OSGi admin console. The language bundles will fetch and install the corresponding [OpenNLP](http://dev.iks-project.eu/downloads/opennlp/models-1.5/) models for the languages you want to use.

###Activate LangID engine and KeywordLinkingEngine

Go to the admin console and deactivate the engines you do not need. In particular, the standard NER engine and the Entity Linking Engines should be deactivated, as they do not support multiple languages.

At least two engines need to be activated:

- The [Language Identification Engine](enhancer/engines/langidengine.html) detects the language of the text you want to enhance and records it as a dc:terms language property.
- The [Keyword Linking Engine](enhancer/engines/keywordlinkingengine.html) provides the TextAnnotations (the parts of your text selected as potential mentions) as well as the EntityAnnotations (the suggestions for links).

Be aware that the result (especially the recall) depends heavily on the number of entities in your target data source.

###Configure the KeywordLinkingEngine

The OSGi admin console exposes the most relevant configuration options of the Keyword Linking Engine:

- **Referenced Site:** The ID of the Entityhub Referenced Site holding the Controlled Vocabulary (e.g. a taxonomy or just a set of named entities).
- **Label Field:** The field used to match Entities with mentions within the parsed text.
- **Type Field:** The field used to retrieve the types of matched Entities. Values of that field are expected to be URIs.
- **Redirect Field:** Entities may define redirects to other Entities (e.g. "USA" (http://dbpedia.org/resource/USA) -> "United States" (http://dbpedia.org/resource/United_States)). Values of this field are expected to link to other Entities that are part of the Controlled Vocabulary.
- **Redirect Mode:** Defines how to process redirects of Entities mentioned in the parsed content. Three modes are supported: ignore redirects; add values from redirected Entities to the extracted Entity; follow redirects and suggest the redirected Entity instead of the extracted one.
- **Min Token Length:** The minimum length of Tokens used to look up Entities within the Controlled Vocabulary. This parameter is ignored if a POS (Part of Speech) tagger is available for the language of the parsed content.
- **Suggestions:** The maximum number of suggestions returned for a single mention.
  (org.apache.stanbol.enhancer.engines.keywordextraction.maxSuggestions)

Languages:

- **Languages to process:** An empty text indicates that all languages are processed. Use ',' as separator for languages (e.g. 'en,de' to enhance only English and German texts).
- **Default Matching Language:** The language that is used, in addition to the language detected for the analysed text, to search for Entities. Typically this is an empty string, so that labels without any language defined are searched as well. For data sets that add a language to every label (such as DBpedia.org), changing this setting may improve results (e.g. to 'en' in the case of DBpedia.org).

Read the technical description of this [Enhancement Engine](enhancer/engines/keywordlinkingengine.html) to learn about more configuration options.

##Results

Depending on your linking target dataset, the engine provides enhancement suggestions using labels in your chosen language(s).

Note: In the current version of the DBpedia index, the link points to the English version of the resource.

##Examples

This [article](http://blog.iks-project.eu/apache-stanbol-now-with-multi-language-support/) from October 2011 describes how to deal with multilingual texts.
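
As a further illustration, the following minimal Python sketch models how the two language settings of the Keyword Linking Engine described above could be read: "Languages to process" as a comma-separated filter where an empty value means all languages, and "Default Matching Language" as an extra label language where an empty value means labels without a language. It is not Stanbol code. The function names `parse_languages`, `is_processed` and `label_languages` are hypothetical, and the case-insensitive comparison is an assumption made for the example.

```python
def parse_languages(config_value):
    """Split a comma-separated 'Languages to process' value into a set.

    Hypothetical helper; whitespace and empty parts are ignored.
    """
    return {part.strip().lower() for part in config_value.split(",") if part.strip()}


def is_processed(detected_language, config_value):
    """Return True if text in the detected language would be enhanced.

    An empty configuration value means that all languages are processed.
    """
    configured = parse_languages(config_value)
    return not configured or detected_language.lower() in configured


def label_languages(detected_language, default_matching_language=""):
    """Return the label languages searched for a text.

    The detected language is always used. The default matching language is
    used in addition; an empty string is modelled as None, meaning labels
    without any language defined.
    """
    default = default_matching_language.strip() or None
    return [detected_language, default]


if __name__ == "__main__":
    print(is_processed("de", "en,de"))   # German is in the configured list
    print(is_processed("fr", "en,de"))   # French is filtered out
    print(is_processed("fr", ""))        # empty value: every language passes
    print(label_languages("de", "en"))   # DBpedia-style setup from the text
```

With the DBpedia-style setting 'en', a German text would be matched against German and English labels; with the default empty string it would be matched against German labels and labels that carry no language.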