Apache Tika

Getting Started with Apache Tika

This document describes how to build Apache Tika from sources and how to start using Tika in an application.

Getting and building the sources

To build Tika from sources, you first need to either download a source release or check out the latest sources from version control.

Once you have the sources, you can build them using the Maven 2 build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.

mvn install

See the Maven documentation for more information about the available build options.

Note that you need Java 5 or higher to build Tika.

Build artifacts

Starting with Tika 0.5, the build consists of a number of components and produces the following main binaries (x.y stands for the current Tika version number):

tika-core/target/tika-core-x.y.jar: Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.
tika-core/target/tika-core-x.y-jdk14.jar: Java 1.4 version of the Tika core library.
tika-parsers/target/tika-parsers-x.y.jar: Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
tika-app/target/tika-app-x.y.jar: Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.

Using Tika as a Maven dependency

Since the 0.5 release, Tika has been split into components to give you more control over which parts of Tika you want to use in your application. The core library, tika-core, contains the key interfaces and classes, so you'll always want to include a dependency on it:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>x.y</version>  <!-- 0.5 or higher -->
  </dependency>

This dependency only gives you basic Tika functionality without any of the parser libraries. If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you also need the tika-parsers dependency:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>x.y</version>  <!-- same version as in tika-core -->
  </dependency>

Note that adding this dependency will introduce a number of transitive dependencies to your project, so make sure that they don't conflict with your existing project dependencies. The listing below shows the dependency trees of the current Tika release (0.5, November 2009), including all the compile-scope dependencies of the parsers component. To see the current tree for any of Tika's core, parsers and app projects, run "mvn dependency:tree" in that project's directory.

org.apache.tika:tika-parent:pom:0.5
org.apache.tika:tika-core:bundle:0.5
\- junit:junit:jar:3.8.1:test
org.apache.tika:tika-parsers:bundle:0.5
+- org.apache.tika:tika-core:jar:0.5:compile
+- org.apache.commons:commons-compress:jar:1.0:compile
+- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:compile
|  +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:compile
|  \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:compile
+- org.apache.poi:poi:jar:3.5-FINAL:compile
+- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:compile
+- org.apache.poi:poi-ooxml:jar:3.5-FINAL:compile
|  +- org.apache.poi:ooxml-schemas:jar:1.0:compile
|  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
|  \- dom4j:dom4j:jar:1.6.1:compile
|     \- xml-apis:xml-apis:jar:1.0.b2:compile
+- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
+- commons-logging:commons-logging:jar:1.1.1:compile
+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
+- asm:asm:jar:3.1:compile
+- log4j:log4j:jar:1.2.14:compile
+- junit:junit:jar:3.8.1:test
+- org.mockito:mockito-core:jar:1.7:test
|  +- org.hamcrest:hamcrest-core:jar:1.1:test
|  \- org.objenesis:objenesis:jar:1.0:test
\- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
org.apache.tika:tika-app:bundle:0.5
\- org.apache.tika:tika-parsers:jar:0.5:provided
   +- org.apache.tika:tika-core:jar:0.5:provided
   +- org.apache.commons:commons-compress:jar:1.0:provided
   +- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:provided
   |  +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:provided
   |  \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:provided
   +- org.apache.poi:poi:jar:3.5-FINAL:provided
   +- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:provided
   +- org.apache.poi:poi-ooxml:jar:3.5-FINAL:provided
   |  +- org.apache.poi:ooxml-schemas:jar:1.0:provided
   |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:provided
   |  \- dom4j:dom4j:jar:1.6.1:provided
   |     \- xml-apis:xml-apis:jar:1.0.b2:provided
   +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:provided
   +- commons-logging:commons-logging:jar:1.1.1:provided
   +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:provided
   +- asm:asm:jar:3.1:provided
   +- log4j:log4j:jar:1.2.14:provided
   \- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:provided
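
If one of the transitive dependencies listed above conflicts with a library your project already provides, Maven's standard exclusions element can keep it out of your build. The fragment below is a minimal sketch, not a recommendation: it assumes, purely for illustration, that log4j is supplied elsewhere in your project. Whether the Tika parsers still work without an excluded library depends on which parsers you use, so test before relying on it.

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>x.y</version>  <!-- same version as in tika-core -->
  <exclusions>
    <!-- Assumption for illustration: log4j is already provided by your project -->
    <exclusion>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

After changing exclusions, running "mvn dependency:tree" in your own project shows whether the excluded artifact is really gone.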

Using Tika in an Ant project

Unless you use a dependency management tool like Apache Ivy, you can use Tika in your application by including the Tika jar files and their dependencies individually.

<classpath>
  ... <!-- your other classpath entries -->
  <pathelement location="path/to/tika-core-0.5.jar"/>
  <pathelement location="path/to/tika-parsers-0.5.jar"/>
  <pathelement location="path/to/commons-compress-1.0.jar"/>
  <pathelement location="path/to/pdfbox-0.8.0-incubating.jar"/>
  <pathelement location="path/to/fontbox-0.8.0-incubator.jar"/>
  <pathelement location="path/to/jempbox-0.8.0-incubator.jar"/>
  <pathelement location="path/to/poi-3.5-FINAL.jar"/>
  <pathelement location="path/to/poi-scratchpad-3.5-FINAL.jar"/>
  <pathelement location="path/to/poi-ooxml-3.5-FINAL.jar"/>
  <pathelement location="path/to/ooxml-schemas-1.0.jar"/>
  <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
  <pathelement location="path/to/dom4j-1.6.1.jar"/>
  <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
  <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.1.jar"/>
  <pathelement location="path/to/commons-logging-1.1.1.jar"/>
  <pathelement location="path/to/tagsoup-1.2.jar"/>
  <pathelement location="path/to/asm-3.1.jar"/>
  <pathelement location="path/to/log4j-1.2.14.jar"/>
  <pathelement location="path/to/metadata-extractor-2.4.0-beta-1.jar"/>
</classpath>

An easy way to gather all these libraries is to run "mvn dependency:copy-dependencies" in the Tika source directory. By default, this copies all Tika dependencies to the target/dependency directory.
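
Once the jars are collected in one directory, an Ant classpath can reference the whole directory instead of naming each jar. The fragment below is a sketch using Ant's standard path and fileset elements. The id tika.classpath and the directories lib/tika, src and build/classes are hypothetical names chosen for this example, and the javac element must sit inside a target of your build file.

```xml
<!-- Hypothetical: lib/tika holds the Tika jars and the copied dependencies -->
<path id="tika.classpath">
  <fileset dir="lib/tika">
    <include name="*.jar"/>
  </fileset>
</path>

<!-- Compile your own sources against every jar in that directory -->
<javac srcdir="src" destdir="build/classes" classpathref="tika.classpath"/>
```

This trades explicit control over each jar for convenience: any jar placed in the directory ends up on the classpath, so keep the directory free of unrelated or duplicate libraries.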

Alternatively, you can simply add the tika-app jar to your classpath to get Tika and all of the above dependencies in a single archive.
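
With the application jar, the classpath shown earlier reduces to a single Tika entry. A sketch, using the same placeholder path and version as the rest of this document:

```xml
<classpath>
  <!-- your other classpath entries -->
  <pathelement location="path/to/tika-app-x.y.jar"/>
</classpath>
```

Because the application jar bundles the external parser libraries, make sure your other classpath entries do not include different versions of the same libraries.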

Using Tika as a command line utility

The Tika application jar (tika-app-x.y.jar) can be used as a command line utility for extracting text content and metadata from all sorts of files. This runnable jar contains all the dependencies it needs, so you don't need to worry about classpath settings to run it.

The usage instructions are shown below.

usage: java -jar tika-app-x.y.jar [option] [file]

Options:
    -? or --help       Print this usage message
    -v or --verbose    Print debug level messages
    -g or --gui        Start the Apache Tika GUI
    -x or --xml        Output XHTML content (default)
    -h or --html       Output HTML content
    -t or --text       Output plain text content
    -m or --metadata   Output only metadata

Description:
    Apache Tika will parse the file(s) specified on the
    command line and output the extracted text content
    or metadata to standard output.

    Instead of a file name you can also specify the URL
    of a document to be parsed.

    If no file name or URL is specified (or the special
    name "-" is used), then the standard input stream
    is parsed.

    Use the "--gui" (or "-g") option to start
    the Apache Tika GUI. You can drag and drop files
    from a normal file explorer to the GUI window to
    extract text content and metadata from the files.

You can also use the jar as a component in a Unix pipeline or as an external tool in many scripting languages.

# Check if an Internet resource contains a specific keyword
curl http://.../document.doc \
  | java -jar tika-app-x.y.jar --text \
  | grep -q keyword
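
The same jar can also be called from an Ant build as an external tool. The target below is a rough sketch using Ant's standard java task with the "--text" option from the usage message above. The target name extract-text and the input and output paths are hypothetical and only serve the example.

```xml
<target name="extract-text">
  <!-- fork="true" is required by Ant when the jar attribute is used -->
  <java jar="path/to/tika-app-x.y.jar" fork="true" failonerror="true"
        output="build/document.txt">
    <arg value="--text"/>
    <!-- Hypothetical input document -->
    <arg value="path/to/document.doc"/>
  </java>
</target>
```

The output attribute redirects what Tika prints to standard output into a file, so later build steps can read the extracted text from there.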
