                       ----------------
                       Configuring Tika
                       ----------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Configuring Tika

   Out of the box, Apache Tika will attempt to start with all available
   Detectors and Parsers, running with sensible defaults. For most users,
   this default configuration will work well.

   This page explains how to configure the various components of Apache
   Tika, such as Parsers and Detectors, if you need fine-grained control
   over ordering, exclusions and the like.

%{toc|section=1|fromDepth=1}

* {Configuring Parsers}

    Through the Tika Config XML, it is possible to have a high degree of
    control over which parsers are or aren't used, and in what order of
    preference. It is also possible to override just certain parts, to
    (for example) have "default except for PDF".

    Currently, only a single parser can be run against a document. There
    is ongoing discussion around fallback parsers and combining the output
    of multiple parsers running on a document, but none of these are
    available yet.

    To override only certain default parser behaviours, include the
    <<< DefaultParser >>> in your configuration, with excludes, then add
    your other parser definitions.
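    To prevent the <<< DefaultParser >>> (with its auto-discovery) being
    used, simply omit it from your config, and list all the other parsers
    you want instead.

    To override just some default behaviour, you can use a Tika Config
    something like this:

---
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for 2 mime types, and never
         use the Executable Parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
    <!-- Use a different parser for PDF -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
---

    To configure things in code, the key classes to use to build up your
    own custom parser hierarchy are
    {{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}},
    {{{./api/org/apache/tika/parser/CompositeParser.html}org.apache.tika.parser.CompositeParser}}
    and
    {{{./api/org/apache/tika/parser/ParserDecorator.html}org.apache.tika.parser.ParserDecorator}}.

* {Configuring Detectors}

    Through the Tika Config XML, it is possible to have a high degree of
    control over which detectors are or aren't used, and in what order of
    preference. It is also possible to override just certain parts, to
    (for example) have "default except for POIFS Container Detection".

    To override only certain default detector behaviours, include the
    <<< DefaultDetector >>> in your configuration, with any
    <<< detector-exclude >>> entries you need, then add your other
    detector definitions. To prevent the <<< DefaultDetector >>> (with its
    auto-discovery) being used, simply omit it from your config, and list
    all the other detectors you want instead.

    To override just some default behaviour, you can use a Tika Config
    something like this:

---
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <!-- All detectors except built-in container ones -->
    <detector class="org.apache.tika.detect.DefaultDetector">
      <detector-exclude class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
      <detector-exclude class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
    </detector>
  </detectors>
</properties>
---

    Or, to use only certain detectors, you can use a Tika Config something
    like this:

---
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <!-- Only use these two detectors, and ignore all others -->
    <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
</properties>
---

    In code, the key classes to use to build up your own custom detector
    hierarchy are
    {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}
    and
    {{{./api/org/apache/tika/detect/CompositeDetector.html}org.apache.tika.detect.CompositeDetector}}.
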
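
    The two configuration styles described above differ only in where the
    detector list starts: either from the auto-discovered defaults minus
    some exclusions, or from nothing at all. As a rough illustration, the
    plain-Java sketch below models that selection logic using class names
    as strings. It does not use Tika's API and is not how Tika implements
    it; <<< DetectorListSketch >>>, <<< compose >>> and the
    <<< com.example >>> names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model only: these are not Tika classes.
final class DetectorListSketch {

    /**
     * Builds the effective list: the discovered defaults minus any
     * exclusions, followed by the explicitly listed entries in order.
     */
    static List<String> compose(List<String> discovered,
                                Set<String> excluded,
                                List<String> explicit) {
        List<String> effective = new ArrayList<>();
        for (String name : discovered) {
            if (!excluded.contains(name)) {
                effective.add(name);
            }
        }
        effective.addAll(explicit);
        return effective;
    }

    public static void main(String[] args) {
        // Stand-in for whatever auto-discovery would find (assumed names).
        List<String> discovered = Arrays.asList(
                "com.example.ContainerDetector",
                "com.example.MagicDetector");

        // Style 1: keep the defaults, exclude one, add a custom detector.
        Set<String> excluded = new HashSet<>(
                Collections.singletonList("com.example.ContainerDetector"));
        System.out.println(compose(discovered, excluded,
                Collections.singletonList("com.example.CustomDetector")));

        // Style 2: omit the defaults entirely and list only what you want.
        System.out.println(compose(Collections.<String>emptyList(),
                Collections.<String>emptySet(),
                Collections.singletonList("com.example.MagicDetector")));
    }
}
```

    The same reasoning applies to parsers: including the default composite
    with excludes keeps auto-discovery, while omitting it means only the
    entries you list are used.
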
* {Configuring Mime Types}

    TODO Mention non-standard paths, and custom mime type files

* {Configuring Language Identifiers}

    At this time, there is no unified way to configure language
    identifiers. While work on that is ongoing, for now you will need to
    review the {{{./api/}Tika Javadocs}} to see how individual identifiers
    are configured.

* {Configuring Translators}

    At this time, there is no unified way to configure Translators. While
    work on that is ongoing, for now you will need to review the
    {{{./api/}Tika Javadocs}} to see how individual Translators are
    configured.

~~ When Translators can have their parameters configured, mention here about
~~ specifying which single one to use in the Tika Config XML

* {Configuring the Service Loader}

    Tika has a number of service provider types such as parsers,
    detectors, and translators. The
    {{{./api/org/apache/tika/config/ServiceLoader.html}org.apache.tika.config.ServiceLoader}}
    class provides a registry of each type of provider. This allows Tika
    to create implementations such as
    {{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}},
    {{{./api/org/apache/tika/language/translate/DefaultTranslator.html}org.apache.tika.language.translate.DefaultTranslator}},
    and
    {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}
    that can match the appropriate provider to an incoming piece of
    content.

    The ServiceLoader's registry can be populated either statically or
    dynamically.

** Static

    Static loading is the default and requires no configuration. This
    option is used in Tika deployments where the Tika JAR files reside
    together in the same classloader hierarchy. The service providers are
    loaded from provider configuration files located within the
    tika-parsers JAR file at META-INF/services.

** Dynamic

    Dynamic loading may be required if the Tika service providers will
    reside in different classloaders, such as in OSGi.
    To allow a provider created in tika-config.xml to use dynamically
    loaded services, you need to configure the ServiceLoader to be dynamic
    with the following configuration:

---
<properties>
  <service-loader dynamic="true"/>
  ....
</properties>
---

** Load Error Handling

    The ServiceLoader contains a handler to deal with errors that occur
    during provider initialization. For example, if a class fails to
    initialize, the LoadErrorHandler deals with the exception that is
    thrown. This handler can be configured to:

    * <<< IGNORE >>> - (Default) Do nothing when providers fail to initialize.

    * <<< WARN >>> - Log a warning when providers fail to initialize.

    * <<< THROW >>> - Throw an exception when providers fail to initialize.

    []

    For example, to set the LoadErrorHandler to WARN, use the following
    configuration:

---
<properties>
  <service-loader loadErrorHandler="WARN"/>
  ....
</properties>
---

* {Using a Tika Configuration XML file}

    However you call Tika, the System Property of <<< tika.config >>> is
    checked first, and the Environment Variable of <<< TIKA_CONFIG >>> is
    tried next. Setting one of those will cause Tika to use your given
    Tika Config XML file.

    If you are calling Tika from your own code, then you can pass in the
    location of your Tika Config XML file when you construct your
    <<< TikaConfig >>> instance. From that, you can fetch your configured
    parsers, detectors etc.

---
// May throw TikaException, IOException or SAXException
TikaConfig config = new TikaConfig("/path/to/tika-config.xml");
Detector detector = config.getDetector();
Parser autoDetectParser = new AutoDetectParser(config);
---

    For users of the Tika App, in addition to the system property and the
    environment variable, you can also use the
    <<< --config=[tika-config.xml] >>> option to select a different Tika
    Config XML file to use.

    For users of the Tika Server, in addition to the system property and
    the environment variable, you can also use the
    <<< -c [tika-config.xml] >>> or <<< --config [tika-config.xml] >>>
    options to select a different Tika Config XML file to use.

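
    The lookup order described in this section (system property first,
    then environment variable) can be sketched in plain Java as follows.
    This is a minimal illustration under stated assumptions, not Tika's
    own code: the class <<< TikaConfigLocator >>> and its
    <<< resolve >>> method are hypothetical, and treating a blank value as
    "not set" is an assumption. The two lookups are passed in as functions
    so the order can be checked without touching the real process
    environment.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: not part of Tika.
final class TikaConfigLocator {

    /**
     * Returns the first usable config location, consulting the
     * "tika.config" system property before the "TIKA_CONFIG"
     * environment variable. Blank values are skipped (an assumption).
     */
    static Optional<String> resolve(Function<String, String> systemProperties,
                                    Function<String, String> environment) {
        String fromProperty = systemProperties.apply("tika.config");
        if (fromProperty != null && !fromProperty.trim().isEmpty()) {
            return Optional.of(fromProperty);
        }
        String fromEnvironment = environment.apply("TIKA_CONFIG");
        if (fromEnvironment != null && !fromEnvironment.trim().isEmpty()) {
            return Optional.of(fromEnvironment);
        }
        // Neither is set, so no override applies.
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> location =
                resolve(System::getProperty, System::getenv);
        System.out.println(location
                .map(path -> "Config override: " + path)
                .orElse("No override set; the default configuration applies"));
    }
}
```

    Running it with <<< -Dtika.config=/path/to/tika-config.xml >>> on the
    <<< java >>> command line shows the system property taking precedence
    over any <<< TIKA_CONFIG >>> value in the environment.
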