Apache UIMA TikaAnnotator README file INTRODUCTION Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. TikaAnnotator uses Tika to generate annotations representing the original markup of a document, extract its text and metadata. It consists of three resources (see /desc): - FileSystemCollectionReader : similar to the one in UIMA examples but uses TIKA to extract the text from binary documents and generates annotations to represent the markup - MarkupAnnotator : takes the original content from a view and generates a new view containing the extracted text with markup annotations - TikaWrapper : utility class which allows to populate a CAS from a binary document; used by the FileSystemCollectionReader VERSION This version wraps Tika 0.4. In that version of Tika, the packaging for Tika was split into several parts. The tika-core jar contains only the core client-visible classes and interfaces and has zero dependencies beyond Java 5. All the actual parser implementations and external parser dependencies are in the tika-parsers jar. See http://lucene.apache.org/tika/gettingstarted.html for the full details. COMPILATION You can use the ANT script to compile the sources. Note that you need to add the Tika-jars in the /lib directory; it is recommended to use the Tika-*-standalone.jar which contains all the libraries used internally by Tika. For more information on UIMA, see: http://incubator.apache.org/uima For more information on Tika, see: http://incubator.apache.org/tika/