Getting Started with Apache Tika
This document describes how to build Apache Tika from sources and how to start using Tika in an application.
Getting and building the sources
To build Tika from sources, you first need to either download a source release or check out the latest sources from version control.
Once you have the sources, you can build them using the Apache Maven build system (Maven 3). Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.
mvn install
If you want to build only the app or the server with the standard parsers, you can save time with:
mvn install -am -pl :tika-app
Or:
mvn install -am -pl :tika-server-standard
See the Maven documentation for more information about the available build options.
Note that you need Java 11 or higher to build Tika 3.x (Java 8 was sufficient for Tika 2.x). For a full build, you'll also need to have Docker installed.
Build artifacts
The Tika build consists of a number of components and produces the following main binaries:
- tika-core/target/tika-core-*.jar
  Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations.
- tika-parsers/tika-parsers-standard/tika-parsers-standard-package/target/tika-parsers-standard-package-*.jar
  Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries. This includes the most commonly used parsers. Users may also want to add tika-parser-sqlite3-package, tika-parser-scientific-package, or other parser modules.
- tika-app/target/tika-app-*.jar
  Tika application. Combines the above components and the standard parser libraries into a single runnable jar with a GUI and a command line interface.
- tika-server/tika-server-standard/target/tika-server-standard-*.jar
  Tika JAX-RS REST application. This is a Jetty web server running Tika REST services with the parsers in tika-parsers-standard-package, as described in the Tika server documentation.
- tika-bundles/tika-bundle-standard/target/tika-bundle-standard-*.jar
  Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified parser libraries to make them easy to deploy in an OSGi environment.
- tika-eval/tika-eval-app/target/tika-eval-app-*.jar
  Tika eval module. Command-line tool to assess the output of Tika, or to compare the output of two different versions of Tika or of other text extraction packages.
Using Tika as a Maven dependency
The core library, tika-core, contains the key interfaces and classes of Tika and can be used by itself if you don't need the full set of parsers from the tika-parsers component. The tika-core dependency looks like this:
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>3.0.0-BETA</version>
    </dependency>
If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you'll want to add a dependency on at least tika-parsers-standard-package:
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers-standard-package</artifactId>
      <version>3.0.0-BETA</version>
    </dependency>
Note that adding this dependency will introduce a number of transitive dependencies to your project. You need to make sure that these dependencies won't conflict with your existing project dependencies. You can use the following command in the tika-parsers-standard-package directory to get a full listing of all the dependencies.
$ mvn dependency:tree | grep :compile
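One way to act on that listing is to compare it with the same listing taken from your own project. The following is only a minimal sketch and not part of Tika: the class name DependencyConflicts and its methods are hypothetical, and it assumes that both `mvn dependency:tree` outputs were saved to text files and that each line uses Maven's usual groupId:artifactId:type[:classifier]:version:scope layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika or Maven.
public class DependencyConflicts {

    /**
     * Turns `mvn dependency:tree` lines into a map of
     * "groupId:artifactId" -> version, keeping compile-scope entries only.
     */
    static Map<String, String> compileDeps(List<String> treeLines) {
        Map<String, String> deps = new LinkedHashMap<>();
        for (String line : treeLines) {
            // Strip the "[INFO]" prefix and the tree-drawing characters (| + \ -).
            String coords = line
                    .replaceFirst("^\\[INFO\\]\\s*", "")
                    .replaceFirst("^[-|+\\\\\\s]*", "")
                    .trim();
            String[] parts = coords.split(":");
            // Assumed layout: group:artifact:type[:classifier]:version:scope
            if (parts.length < 5 || !parts[parts.length - 1].startsWith("compile")) {
                continue; // root artifact, other scopes, or unrelated log lines
            }
            deps.put(parts[0] + ":" + parts[1], parts[parts.length - 2]);
        }
        return deps;
    }

    /** Lists artifacts present in both maps whose versions differ. */
    static List<String> conflicts(Map<String, String> mine, Map<String, String> tika) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : mine.entrySet()) {
            String tikaVersion = tika.get(entry.getKey());
            if (tikaVersion != null && !tikaVersion.equals(entry.getValue())) {
                result.add(entry.getKey() + ": " + entry.getValue() + " vs " + tikaVersion);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DependencyConflicts <my-tree.txt> <tika-tree.txt>");
            return;
        }
        Map<String, String> mine = compileDeps(Files.readAllLines(Paths.get(args[0])));
        Map<String, String> tika = compileDeps(Files.readAllLines(Paths.get(args[1])));
        conflicts(mine, tika).forEach(System.out::println);
    }
}
```

Each printed line names an artifact that your project and tika-parsers-standard-package pull in at different versions. Whether a given mismatch is actually harmful still has to be judged case by case.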
You may also want to add one or more of the following dependencies:
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parser-sqlite3-package</artifactId>
      <version>3.0.0-BETA</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parser-scientific-package</artifactId>
      <version>3.0.0-BETA</version>
    </dependency>
You may also consider adding dependencies on modules under the tika-parsers-ml module.
Using Tika in a Gradle-built project
To add a dependency on Apache Tika to your Gradle-built project, including the full set of parsers, you should depend on the tika-core artifact and the tika-parsers-standard-package artifact:
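    dependencies {
        // 'implementation' replaces the 'runtime' configuration, which was removed in Gradle 7
        implementation 'org.apache.tika:tika-core:3.0.0-BETA'
        implementation 'org.apache.tika:tika-parsers-standard-package:3.0.0-BETA'
    }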
Using Tika in an Ant project
If you are using Apache Ivy as your dependency manager with Ant, then to include Tika with the full set of parsers, you should depend on the tika-core and tika-parsers-standard-package artifacts like this:
    <dependencies>
      <dependency org="org.apache.tika" name="tika-core" rev="3.0.0-BETA"/>
      <dependency org="org.apache.tika" name="tika-parsers-standard-package" rev="3.0.0-BETA"/>
    </dependencies>
Otherwise, the easiest way to use Tika is probably to include the full tika-app jar on your classpath. If you only need core functionality, you can add just the tika-core jar. Avoid assembling the full parser set by hand: the parsers have a large number of dependencies, all of which must be on the classpath, and managing them manually with Ant is very fiddly. To include Tika in your Ant project, you should do something like:
    <classpath>
      ... <!-- your other classpath entries -->

      <!-- either: Tika Core only, no parsers -->
      <pathelement location="path/to/tika-core-3.0.0-BETA.jar"/>

      <!-- or: Tika with all Parsers -->
      <pathelement location="path/to/tika-app-3.0.0-BETA.jar"/>
    </classpath>
Using Tika as a command line utility
The Tika application jar (tika-app-*.jar) can be used as a command line utility for extracting text content and metadata from all sorts of files. This runnable jar contains all the dependencies it needs, so you don't need to worry about classpath settings to run it.
The usage instructions are shown below.
    usage: java -jar tika-app.jar [option...] [file|port...]

    Options:
        -?  or --help          Print this usage message
        -v  or --verbose       Print debug level messages
        -V  or --version       Print the Apache Tika version number

        -g  or --gui           Start the Apache Tika GUI
        -s  or --server        Start the Apache Tika server
        -f  or --fork          Use Fork Mode for out-of-process extraction

        --config=<tika-config.xml>
            TikaConfig file. Must be specified before -g, -s, -f or the dump-x-config !
        --dump-minimal-config  Print minimal TikaConfig
        --dump-current-config  Print current TikaConfig
        --dump-static-config   Print static config
        --dump-static-full-config  Print static explicit config

        -x  or --xml           Output XHTML content (default)
        -h  or --html          Output HTML content
        -t  or --text          Output plain text content
        -T  or --text-main     Output plain text content (main content only)
        -m  or --metadata      Output only metadata
        -j  or --json          Output metadata in JSON
        -y  or --xmp           Output metadata in XMP
        -J  or --jsonRecursive Output metadata and content from all
                               embedded files (choose content type
                               with -x, -h, -t or -m; default is -x)
        -l  or --language      Output only language
        -d  or --detect        Detect document type
               --digest=X      Include digest X (md2, md5, sha1,
                               sha256, sha384, sha512)
        -eX or --encoding=X    Use output encoding X
        -pX or --password=X    Use document password X
        -z  or --extract       Extract all attachments into current directory
        --extract-dir=<dir>    Specify target directory for -z
        -r  or --pretty-print  For JSON, XML and XHTML outputs, adds newlines and
                               whitespace, for better readability

        --list-parsers
             List the available document parsers
        --list-parser-details
             List the available document parsers and their supported mime types
        --list-parser-details-apt
             List the available document parsers and their supported mime types in apt format.
        --list-detectors
             List the available document detectors
        --list-met-models
             List the available metadata models, and their supported keys
        --list-supported-types
             List all known media types and related information

        --compare-file-magic=<dir>
             Compares Tika's known media types to the File(1) tool's magic directory

    Description:
        Apache Tika will parse the file(s) specified on the
        command line and output the extracted text content
        or metadata to standard output.

        Instead of a file name you can also specify the URL
        of a document to be parsed.

        If no file name or URL is specified (or the special
        name "-" is used), then the standard input stream
        is parsed. If no arguments were given and no input
        data is available, the GUI is started instead.

    - GUI mode

        Use the "--gui" (or "-g") option to start the
        Apache Tika GUI. You can drag and drop files from
        a normal file explorer to the GUI window to extract
        text content and metadata from the files.

    - Batch mode

        Simplest method.
        Specify two directories as args with no other args:
             java -jar tika-app.jar <inputDirectory> <outputDirectory>

    Batch Options:
        -i  or --inputDir          Input directory
        -o  or --outputDir         Output directory
        -numConsumers              Number of processing threads
        -bc                        Batch config file
        -maxRestarts               Maximum number of times the
                                   watchdog process will restart the child process.
        -timeoutThresholdMillis    Number of milliseconds allowed to a parse
                                   before the process is killed and restarted
        -fileList                  List of files to process, with
                                   paths relative to the input directory
        -includeFilePat            Regular expression to determine which
                                   files to process, e.g. "(?i)\.pdf"
        -excludeFilePat            Regular expression to determine which
                                   files to avoid processing, e.g. "(?i)\.pdf"
        -maxFileSizeBytes          Skip files longer than this value

        Control the type of output with -x, -h, -t and/or -J.

        To modify child process jvm args, prepend "J" as in:
        -JXmx4g or -JDlog4j.configuration=file:log4j.xml.
You can also use the jar as a component in a Unix pipeline or as an external tool in many scripting languages.
    # Check if an Internet resource contains a specific keyword
    curl http://.../document.doc \
      | java -jar tika-app.jar --text \
      | grep -q keyword
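The same approach works from a program that has no shell pipeline available. As a rough sketch only (the class TikaAppRunner is hypothetical, and it assumes a `java` executable on the PATH and a tika-app jar at a path you supply), a Java application could launch the jar as a child process, using nothing but options listed in the usage text:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical wrapper, not part of Tika: runs tika-app as an external tool.
public class TikaAppRunner {

    /**
     * Builds the command line. --encoding=UTF-8 is passed so that the
     * output can be decoded predictably on any platform.
     */
    static List<String> command(String tikaAppJar, String option, String input) {
        return List.of("java", "-jar", tikaAppJar, "--encoding=UTF-8", option, input);
    }

    /** Runs tika-app and returns everything it wrote to standard output. */
    static String run(String tikaAppJar, String option, String input)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(tikaAppJar, option, input))
                // Let warnings and stack traces go to this program's stderr.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        // Drain stdout fully before waiting, so the child cannot block on a full pipe.
        String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tika-app exited with status " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: TikaAppRunner <path/to/tika-app.jar> <file-or-URL>");
            return;
        }
        // --text asks for plain text; --metadata or --json would work the same way.
        System.out.print(run(args[0], "--text", args[1]));
    }
}
```

Starting a new JVM for every document is slow, so this pattern suits occasional extraction. For heavier use, depending on tika-core and the parser package directly, or running the Tika server, is likely the better fit.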