                       ----------------
                       Configuring Tika
                       ----------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Configuring Tika

   Out of the box, Apache Tika will attempt to start with all available
   Detectors and Parsers, running with sensible defaults. For most users,
   this default configuration will work well.

   This page describes how to configure the various components of
   Apache Tika, such as Parsers and Detectors, if you need fine-grained
   control over ordering, exclusions and the like.

%{toc|section=1|fromDepth=1}

* {Configuring Parsers}

~~ TODO Add more on in 1.10, which has more support

   In Tika 1.9, there is some support for configuring Parsers in the
   Tika Config XML. You can provide a custom list of parsers to use, in a
   custom order, and you can also force particular parsers to handle, or
   not handle, certain mime types.
   You can do so with a Tika Config file something like:

---
image/jpeg application/pdf application/pdf
---

   In code, the key classes to use to build up your own custom parser
   hierarchy are
   {{{./api/org/apache/tika/parser/DefaultParser.html}org.apache.tika.parser.DefaultParser}},
   {{{./api/org/apache/tika/parser/CompositeParser.html}org.apache.tika.parser.CompositeParser}}
   and
   {{{./api/org/apache/tika/parser/ParserDecorator.html}org.apache.tika.parser.ParserDecorator}}.

* {Configuring Detectors}

~~ TODO Add more on in 1.10, which has more support

   In Tika 1.9, there is limited support for configuring Detectors in the
   Tika Config XML. You can provide a custom list of detectors to use, in a
   custom order, with a Tika Config file something like:

---
---

   In code, the key classes to use to build up your own custom detector
   hierarchy are
   {{{./api/org/apache/tika/detect/DefaultDetector.html}org.apache.tika.detect.DefaultDetector}}
   and
   {{{./api/org/apache/tika/detect/CompositeDetector.html}org.apache.tika.detect.CompositeDetector}}.

* {Configuring Mime Types}

   TODO Mention non-standard paths, and custom mime type files

* {Configuring Language Identifiers}

   At this time, there is no unified way to configure language identifiers.
   While work on that is ongoing, you will need to review the
   {{{./api/}Tika Javadocs}} to see how individual identifiers are
   configured.

* {Configuring Translators}

   At this time, there is no unified way to configure Translators.
   While work on that is ongoing, you will need to review the
   {{{./api/}Tika Javadocs}} to see how individual Translators are
   configured.

* {Using a Tika Configuration XML file}

   However you call Tika, the System Property <<< tika.config >>> is
   checked first, and the Environment Variable <<< TIKA_CONFIG >>> is
   tried next. Setting either of those will cause Tika to use the
   Tika Config XML file you specify.
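
   The lookup order described above can be sketched in plain Java. This is
   only an illustration under stated assumptions, not Tika's own code: the
   class <<<ConfigLocationSketch>>> and its <<<resolve>>> method are
   hypothetical names, and treating an empty value as "not set" is an
   assumption made for the example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical helper, not part of Tika: mirrors the lookup order described above. */
public class ConfigLocationSketch {

    /**
     * Returns the config location from the tika.config system property if set,
     * otherwise from the TIKA_CONFIG environment variable, otherwise empty.
     * Assumption: an empty string counts as "not set".
     */
    static Optional<String> resolve(Properties systemProps, Map<String, String> env) {
        String fromProperty = systemProps.getProperty("tika.config");
        if (fromProperty != null && !fromProperty.isEmpty()) {
            return Optional.of(fromProperty);
        }
        String fromEnv = env.get("TIKA_CONFIG");
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return Optional.of(fromEnv);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Use the real JVM properties and process environment.
        Optional<String> location = resolve(System.getProperties(), System.getenv());
        System.out.println(location
                .map(l -> "Tika Config XML file: " + l)
                .orElse("No override set, so the default configuration applies"));
    }
}
```

   Passing the properties and environment in as arguments keeps the sketch
   easy to exercise without changing the real process environment.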
   If you are calling Tika from your own code, then you can pass in the
   location of your Tika Config XML file when you construct your
   <<<TikaConfig>>> instance. From that, you can fetch your configured
   parsers, detectors etc.

---
// Load the configuration (may throw TikaException, IOException or SAXException)
TikaConfig config = new TikaConfig("/path/to/tika-config.xml");
// Fetch the configured detector
Detector detector = config.getDetector();
// Build an auto-detecting parser that uses the same configuration
Parser autoDetectParser = new AutoDetectParser(config);
---

   For users of the Tika App, in addition to the system property and the
   environment variable, you can also use the
   <<< --config=[tika-config.xml] >>> option to select a different
   Tika Config XML file to use.

   For users of the Tika Server, in addition to the system property and the
   environment variable, you can also use the <<< -c [tika-config.xml] >>> or
   <<< --config [tika-config.xml] >>> options to select a different
   Tika Config XML file to use.