Getting Started with Apache Tika

This document describes how to build Apache Tika from sources and how to start using Tika in an application.

Getting and building the sources

To build Tika from sources, you first need to either download a source release or check out the latest sources from version control.

Once you have the sources, you can build them using the Maven 2 build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.

mvn install

See the Maven documentation for more information about the available build options.

Note that you need Java 5 or higher to build Tika.

Build artifacts

The Tika 1.2 build consists of a number of components and produces the following main binaries:

tika-core/target/tika-core-1.2.jar
    Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.
tika-parsers/target/tika-parsers-1.2.jar
    Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
tika-app/target/tika-app-1.2.jar
    Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
tika-bundle/target/tika-bundle-1.2.jar
    Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified parser libraries to make them easy to deploy in an OSGi environment.

Using Tika as a Maven dependency

The core library, tika-core, contains the key interfaces and classes of Tika and can be used by itself if you don't need the full set of parsers from the tika-parsers component. The tika-core dependency looks like this:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.2</version>
  </dependency>

If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you'll want to depend on tika-parsers instead:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.2</version>
  </dependency>

Note that adding this dependency will introduce a number of transitive dependencies to your project, including one on tika-core. You need to make sure that these dependencies won't conflict with your existing project dependencies. The listing below shows all the compile-scope dependencies of tika-parsers in the Tika 1.2 release.

+- org.apache.tika:tika-core:jar:1.2:compile
+- org.gagravarr:vorbis-java-tika:jar:0.1:compile
|  \- org.gagravarr:vorbis-java-core:jar:tests:0.1:runtime
+- org.apache.felix:org.apache.felix.scr.annotations:jar:1.6.0:provided
+- edu.ucar:netcdf:jar:4.2-min:compile
|  \- org.slf4j:slf4j-api:jar:1.5.6:compile
+- org.apache.james:apache-mime4j-core:jar:0.7:compile
+- org.apache.james:apache-mime4j-dom:jar:0.7:compile
+- org.apache.commons:commons-compress:jar:1.3:compile
+- commons-codec:commons-codec:jar:1.5:compile
+- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
|  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
|  +- org.apache.pdfbox:jempbox:jar:1.6.0:compile
|  \- commons-logging:commons-logging:jar:1.1.1:compile
+- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
+- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
+- org.apache.poi:poi:jar:3.8-beta5:compile
+- org.apache.poi:poi-scratchpad:jar:3.8-beta5:compile
+- org.apache.poi:poi-ooxml:jar:3.8-beta5:compile
|  +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta5:compile
|  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
|  \- dom4j:dom4j:jar:1.6.1:compile
+- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
+- asm:asm:jar:3.1:compile
+- com.googlecode.mp4parser:isoparser:jar:1.0-beta-5:compile
|  \- net.sf.scannotation:scannotation:jar:1.0.2:compile
|     \- javassist:javassist:jar:3.6.0.GA:compile
+- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
+- de.l3s.boilerpipe:boilerpipe:jar:1.2.0:compile
+- rome:rome:jar:0.9:compile
|  \- jdom:jdom:jar:1.0:compile
+- org.gagravarr:vorbis-java-core:jar:0.1:compile
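
One way to check for version conflicts is to compare a listing like the one above with the dependency tree of your own project. The following is a minimal sketch, not part of Tika: the class DependencyConflicts and its methods are hypothetical names, the project line with slf4j-api 1.6.4 is made-up sample data, and the parser assumes each line ends in version:scope, as in the listing above (presumably produced by mvn dependency:tree).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: compares two dependency listings.
class DependencyConflicts {

    /** Maps "groupId:artifactId" to the version found on each listing line. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> versions = new LinkedHashMap<String, String>();
        for (String line : lines) {
            // Strip the tree-drawing prefix such as "+- " or "|  \- ".
            String coords = line.replaceFirst("^[|+\\\\\\- ]+", "");
            String[] parts = coords.split(":");
            // Expected: group:artifact:type[:classifier]:version:scope
            if (parts.length < 5) {
                continue;
            }
            // The version is the second-to-last field in both forms.
            versions.put(parts[0] + ":" + parts[1], parts[parts.length - 2]);
        }
        return versions;
    }

    /** Lists artifacts that appear in both maps with different versions. */
    static List<String> conflicts(Map<String, String> mine, Map<String, String> tika) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, String> e : tika.entrySet()) {
            String myVersion = mine.get(e.getKey());
            if (myVersion != null && !myVersion.equals(e.getValue())) {
                result.add(e.getKey() + ": " + myVersion + " vs " + e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two lines copied from the tika-parsers listing.
        List<String> tika = Arrays.asList(
            "+- org.apache.commons:commons-compress:jar:1.3:compile",
            "|  \\- org.slf4j:slf4j-api:jar:1.5.6:compile");
        // Made-up sample data standing in for your own project's tree.
        List<String> mine = Arrays.asList(
            "+- org.slf4j:slf4j-api:jar:1.6.4:compile",
            "+- org.apache.commons:commons-compress:jar:1.3:compile");
        for (String c : conflicts(parse(mine), parse(tika))) {
            System.out.println(c); // prints: org.slf4j:slf4j-api: 1.6.4 vs 1.5.6
        }
    }
}
```

A version mismatch found this way is only a hint that the two versions need to be reconciled; whether they are actually incompatible depends on the libraries involved.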

Using Tika in an Ant project

Unless you use a dependency manager such as Apache Ivy, the easiest way to use Tika in an Ant project is to include either the tika-core or the tika-app jar in your classpath, depending on whether you want just the core functionality or also all the parser implementations.

<classpath>
  ... <!-- your other classpath entries -->

  <!-- either: -->
  <pathelement location="path/to/tika-core-${tika.version}.jar"/>
  <!-- or: -->
  <pathelement location="path/to/tika-app-${tika.version}.jar"/>

</classpath>

Using Tika as a command line utility

The Tika application jar (tika-app-1.2.jar) can be used as a command line utility for extracting text content and metadata from all sorts of files. This runnable jar contains all the dependencies it needs, so you don't need to worry about classpath settings to run it.

The usage instructions are shown below.

Options:
    -?  or --help          Print this usage message
    -v  or --verbose       Print debug level messages
    -V  or --version       Print the Apache Tika version number

    -g  or --gui           Start the Apache Tika GUI
    -s  or --server        Start the Apache Tika server
    -f  or --fork          Use Fork Mode for out-of-process extraction

    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -y  or --xmp           Output metadata in XMP
    -l  or --language      Output only language
    -d  or --detect        Detect document type
    -eX or --encoding=X    Use output encoding X
    -pX or --password=X    Use document password X
    -z  or --extract       Extract all attachments into current directory
    --extract-dir=<dir>    Specify target directory for -z
    -r  or --pretty-print  For XML and XHTML outputs, adds newlines and
                           whitespace, for better readability

    --create-profile=X
         Create NGram profile, where X is a profile name
    --list-parsers
         List the available document parsers
    --list-parser-details
         List the available document parsers, and their supported mime types
    --list-detectors
         List the available document detectors
    --list-met-models
         List the available metadata models, and their supported keys
    --list-supported-types
         List all known media types and related information

Description:
    Apache Tika will parse the file(s) specified on the
    command line and output the extracted text content
    or metadata to standard output.

    Instead of a file name you can also specify the URL
    of a document to be parsed.

    If no file name or URL is specified (or the special
    name "-" is used), then the standard input stream
    is parsed. If no arguments were given and no input
    data is available, the GUI is started instead.

- GUI mode

    Use the "--gui" (or "-g") option to start the
    Apache Tika GUI. You can drag and drop files from
    a normal file explorer to the GUI window to extract
    text content and metadata from the files.

- Server mode

    Use the "--server" (or "-s") option to start the
    Apache Tika server. The server will listen to the
    ports you specify as one or more arguments.

You can also use the jar as a component in a Unix pipeline or as an external tool in many scripting languages.

# Check if an Internet resource contains a specific keyword
curl http://.../document.doc \
  | java -jar tika-app-1.2.jar --text \
  | grep -q keyword
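
The same idea works from Java code by launching the runnable jar as an external process. The following is a minimal sketch under stated assumptions: the class TikaAppRunner and its methods are hypothetical names, a java executable is assumed to be on the PATH, and the code needs Java 7 or higher. It uses only the --text and --encoding=X options from the usage message above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: runs tika-app as an external tool.
class TikaAppRunner {

    /** Builds the command line; the output encoding is fixed to UTF-8. */
    static List<String> buildCommand(String jarPath, String option, String input) {
        return Arrays.asList("java", "-jar", jarPath, "--encoding=UTF-8", option, input);
    }

    /** Runs the command and returns everything it wrote to standard output. */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass error messages through to our own standard error stream.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode);
        }
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: TikaAppRunner <tika-app jar> <file or URL> <keyword>");
            return;
        }
        // Extract plain text and check whether it contains the keyword.
        String text = run(buildCommand(args[0], "--text", args[1]));
        System.out.println(text.contains(args[2]) ? "found" : "not found");
    }
}
```

Starting a new JVM for every document is slow, so this approach suits occasional use. An application that parses many documents would normally depend on tika-parsers directly instead.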