                       -----------------------
                       Tika API Usage Examples
                       -----------------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Apache Tika API Usage Examples

  This page provides a number of examples on how to use the various
  Tika APIs. All of the examples shown are also available in the
  {{{https://svn.apache.org/repos/asf/tika/trunk/tika-example}Tika Example module}} in SVN.

%{toc|section=1|fromDepth=1}

* {Parsing}

  Tika provides a number of different ways to parse a file. These
  provide different levels of control, flexibility, and complexity.

** {Parsing using the Tika Facade}

  The {{{./api/org/apache/tika/Tika.html}Tika facade}} provides a number
  of quick and easy ways to have your content parsed by Tika and the
  resulting plain text returned.

%{include|source=src/examples-src/main/java/org/apache/tika/example/ParsingExample.java|snippet=aj:..parseToStringExample()|show-gutter=false}

** {Parsing using the Auto-Detect Parser}

  For more control, you can call the
  {{{./api/org/apache/tika/parser/Parser.html}Tika Parsers}} directly.
  Most likely, you'll want to start out using the
  {{{./api/org/apache/tika/parser/AutoDetectParser.html}Auto-Detect Parser}},
  which automatically figures out what kind of content you have, then
  calls the appropriate parser for you.

%{include|source=src/examples-src/main/java/org/apache/tika/example/ParsingExample.java|snippet=aj:..parseExample()|show-gutter=false}

* {Picking different output formats}

  With Tika, you can get the textual content of your files returned in a
  number of different formats: plain text, HTML, XHTML, the XHTML of just
  one part of the file, and so on. The format is controlled by the
  {{{http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
  you supply to the Parser.

** {Parsing to Plain Text}

  By using the
  {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}},
  you can request that Tika return only the content of the document's
  body as a plain-text string.

%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToPlainText()|show-gutter=false}

** {Parsing to XHTML}

  By using the
  {{{./api/org/apache/tika/sax/ToXMLContentHandler.html}ToXMLContentHandler}},
  you can get the XHTML content of the whole document as a string.

%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToHTML()|show-gutter=false}

  If you just want the body of the XHTML document, without the header,
  you can chain together a
  {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
  and a
  {{{./api/org/apache/tika/sax/ToXMLContentHandler.html}ToXMLContentHandler}}
  as shown:

%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseBodyToHTML()|show-gutter=false}

** {Fetching just certain bits of the XHTML}

  It is possible to execute XPath queries on the parse results, to fetch
  only certain bits of the XHTML.
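
  To make the idea of an XPath query over XHTML concrete, here is a
  minimal sketch that uses only the JDK's <<<javax.xml.xpath>>> API on an
  XHTML string that has already been produced. It is not the approach
  used by the example module snippet on this page: it buffers the whole
  document in a DOM, which may be unsuitable for large files. The class
  name <<<XhtmlXPathSketch>>>, the method <<<selectText>>>, the
  <<<xhtml>>> prefix, and the sample markup are all hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: JDK-only illustration, not part of Tika.
class XhtmlXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Returns the text of the first node matched by the expression,
    // or an empty string when nothing matches.
    static String selectText(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // XHTML elements live in a namespace, so the parser must be namespace aware.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the (assumed) "xhtml" prefix to the XHTML namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "xhtml" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                            ? Collections.singletonList("xhtml").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for XHTML parse output.
        String sample = "<html xmlns=\"" + XHTML_NS + "\"><head><title>Demo</title></head>"
                + "<body><div><p>First section</p></div>"
                + "<div><p>Second section</p></div></body></html>";
        System.out.println(selectText(sample, "/xhtml:html/xhtml:body/xhtml:div[1]"));
    }
}
```

  Note that the element names in the expression must carry the namespace
  prefix; in this sketch an unprefixed path such as
  <<</html/body/div>>> matches nothing.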

%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseOnePartToHTML()|show-gutter=false}

* {Custom Content Handlers}

  The textual output of parsing a file with Tika is returned via the SAX
  {{{http://docs.oracle.com/javase/7/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
  you pass to the parse method. You can customise your parsing by
  supplying your own ContentHandler that performs additional processing
  as the content is parsed.

** {Extract Phone Numbers from Content into the Metadata}

  By using the
  {{{./api/org/apache/tika/sax/PhoneExtractingContentHandler.html}PhoneExtractingContentHandler}},
  you can have any phone numbers found in the textual content of the
  document extracted and placed into the Metadata object for you.

%{include|source=src/examples-src/main/java/org/apache/tika/example/GrabPhoneNumbersExample.java|snippet=aj:..process(..File)|show-gutter=false}

** {Streaming the plain text in chunks}

  Sometimes you want to split the resulting text into chunks, perhaps to
  write output as you go and so minimise memory use, or perhaps to write
  it to HDFS files. With a small custom content handler, you can do that.

%{include|source=src/examples-src/main/java/org/apache/tika/example/ContentHandlerExample.java|snippet=aj:..parseToPlainTextChunks()|show-gutter=false}

* {Translation}

  Tika provides a pluggable Translation system, which allows you to send
  the results of parsing to an external system or program to have the
  text translated into another language.

** {Translation using the Microsoft Translation API}

  In order to use the Microsoft Translation API, you need to sign up for
  a Microsoft account, get an API key, then pass the key to Tika before
  translating.

%{include|source=src/examples-src/main/java/org/apache/tika/example/TranslatorExample.java|snippet=aj:..microsoftTranslateToFrench(..String)|show-gutter=false}

* {Language Identification}

  Tika provides support for identifying the language of text, through the
  {{{./api/org/apache/tika/language/LanguageIdentifier.html}LanguageIdentifier}}
  class.

%{include|source=src/examples-src/main/java/org/apache/tika/example/LanguageIdentifierExample.java|snippet=aj:..identifyLanguage(..String)|show-gutter=false}

* {Additional Examples}

  A number of other examples are also available, including all of the
  examples from the {{{http://manning.com/mattmann/}Tika In Action book}}.
  These can all be found in the
  {{{https://svn.apache.org/repos/asf/tika/trunk/tika-example}Tika Example module}}
  in SVN.