Machine Translation for Search

Oak supports CLIR (Cross-Language Information Retrieval) by using Machine Translation to decorate search queries. This extension is provided by the oak-search-mt bundle.

Query time MT for Lucene indexes

Machine translation at query time is supported for Oak Lucene indexes by an extension of Oak Lucene's FulltextQueryTermsProvider API called MTFulltextQueryTermsProvider. The initial implementation details can be found in OAK-4348.

The MTFulltextQueryTermsProvider takes the text of a given query, translates it where possible, and provides a new Lucene query that is added to the original one. Query time machine translation is performed in the MTFulltextQueryTermsProvider only if the index definition of the selected index matches one of the node types defined in the MTFulltextQueryTermsProvider configuration (e.g. oak:Unstructured).

The MTFulltextQueryTermsProvider first tries to translate the whole text and then the single tokens, as they are created by the Lucene Analyzer passed in the #getQueryTerm(String text, Analyzer analyzer, NodeState indexDefinition) API call.
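
The whole-text-then-tokens strategy can be illustrated with a minimal sketch. This is not Oak's implementation: QueryExpansionSketch, Translator, Scored, expand and tableTranslator are hypothetical names, a whitespace split stands in for the Lucene Analyzer, and the returned strings stand in for the clauses that would be added to the Lucene query. The sketch also assumes that both steps always run and that translations scoring below a minimum threshold are discarded.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Illustrative only: none of these types are part of Oak or Apache Joshua. */
final class QueryExpansionSketch {

    /** A translated string with a confidence score in [0, 1] (hypothetical type). */
    static final class Scored {
        final String text;
        final double score;

        Scored(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for a machine translation engine. */
    interface Translator {
        Optional<Scored> translate(String source);
    }

    /**
     * Returns the extra terms to add to the original query: the translation of
     * the whole text first, then the translations of its single tokens.
     */
    static List<String> expand(String text, Translator translator, double minScore) {
        Set<String> terms = new LinkedHashSet<>(); // keeps order, drops duplicates

        // Step 1: translate the whole text as one sentence.
        translator.translate(text)
                .filter(t -> t.score >= minScore)
                .ifPresent(t -> terms.add(t.text));

        // Step 2: translate each token (a whitespace split replaces the Analyzer).
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            translator.translate(token)
                    .filter(t -> t.score >= minScore)
                    .ifPresent(t -> terms.add(t.text));
        }
        return new ArrayList<>(terms);
    }

    /** Toy translator backed by a fixed lookup table, for demonstration only. */
    static Translator tableTranslator(Map<String, Scored> table) {
        return source -> Optional.ofNullable(table.get(source));
    }

    public static void main(String[] args) {
        Translator italianToEnglish = tableTranslator(Map.of(
                "gatto nero", new Scored("black cat", 0.9),
                "gatto", new Scored("cat", 0.8),
                "nero", new Scored("black", 0.3)));
        // "black" is dropped because its score is below the 0.5 threshold.
        System.out.println(expand("gatto nero", italianToEnglish, 0.5)); // prints [black cat, cat]
    }
}
```

In the actual provider the translated terms would become part of a Lucene query rather than a list of strings; the list is used here only to keep the sketch free of external dependencies.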

Machine Translation is currently implemented by means of Apache Joshua, a statistical machine translation toolkit. MTFulltextQueryTermsProvider requires a language pack (an SMT model) in order to translate search queries.

Apache Joshua

Apache Joshua is a statistical machine translation toolkit originally developed at Johns Hopkins University and the University of Pennsylvania, and donated in 2015 to the Apache Software Foundation. For more information on the usage of Apache Joshua for multi-language search see the slides/video from the Berlin Buzzwords 2017 presentation Embracing diversity: searching over multiple languages.

Language Packs

Apache Joshua can be used to train machine translation models called language packs. It also provides a set of ready-to-use (Apache licensed) language packs for many language pairs at:

https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs

Setup

Multiple MTFulltextQueryTermsProvider instances can be configured (one per language pair) by using the MTFulltextQueryTermsProviderFactory OSGi configuration factory. In order to instantiate an MTFulltextQueryTermsProviderFactory the following properties need to be configured:

  • path.to.config -> the path to the joshua.config configuration file (e.g. of a downloaded language pack)
  • node.types -> the list of node types for which query time MT expansion should be done
  • min.score -> the minimum score (between 0 and 1) for a translated sentence / token to be used while expanding the query (this is used to filter out low quality translations)
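
As a rough illustration of how these three properties fit together, the sketch below reads them from a plain map and validates them. It is not the factory's actual code: MtConfigSketch, fromProperties and appliesTo are hypothetical names, the value types (String, String[], Number) are assumptions, the file path is a placeholder, and treating a "match" as any overlap between the configured node types and those covered by an index is an assumption as well.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a hypothetical holder for the three properties listed above. */
final class MtConfigSketch {
    final String pathToConfig;
    final Set<String> nodeTypes;
    final double minScore;

    private MtConfigSketch(String pathToConfig, Set<String> nodeTypes, double minScore) {
        this.pathToConfig = pathToConfig;
        this.nodeTypes = nodeTypes;
        this.minScore = minScore;
    }

    /** Reads and validates the properties; the value types are assumed. */
    static MtConfigSketch fromProperties(Map<String, ?> props) {
        Object path = props.get("path.to.config");
        Object types = props.get("node.types");
        Object score = props.get("min.score");

        if (!(path instanceof String) || ((String) path).isEmpty()) {
            throw new IllegalArgumentException("path.to.config is required");
        }
        if (!(types instanceof String[]) || ((String[]) types).length == 0) {
            throw new IllegalArgumentException("node.types needs at least one node type");
        }
        if (!(score instanceof Number)) {
            throw new IllegalArgumentException("min.score must be a number");
        }
        double min = ((Number) score).doubleValue();
        if (!(min >= 0.0 && min <= 1.0)) { // written this way so NaN is rejected too
            throw new IllegalArgumentException("min.score must be between 0 and 1");
        }
        Set<String> typeSet = new LinkedHashSet<>(Arrays.asList((String[]) types));
        return new MtConfigSketch((String) path, typeSet, min);
    }

    /** Assumption: expansion applies when the index covers any configured node type. */
    boolean appliesTo(Set<String> indexedNodeTypes) {
        return !Collections.disjoint(nodeTypes, indexedNodeTypes);
    }

    public static void main(String[] args) {
        Map<String, Object> props = Map.of(
                "path.to.config", "/path/to/language-pack/joshua.config", // placeholder path
                "node.types", new String[] {"oak:Unstructured"},
                "min.score", 0.5f);
        MtConfigSketch config = fromProperties(props);
        System.out.println(config.appliesTo(Set.of("oak:Unstructured"))); // prints true
        System.out.println(config.appliesTo(Set.of("nt:file")));          // prints false
    }
}
```

A higher min.score keeps only the translations the model is most confident about, at the cost of expanding fewer queries.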