Class ParsingExample

java.lang.Object
org.apache.tika.example.ParsingExample

public class ParsingExample extends Object
  • Constructor Details

    • ParsingExample

      public ParsingExample()
  • Method Details

    • parseToStringExample

      public String parseToStringExample() throws IOException, SAXException, TikaException
      Example of how to use Tika's parseToString method to parse the content of a file, and return any text found.

      Note: Tika.parseToString() will extract content from the outer container document and any embedded/attached documents.

      Returns:
      The content of a file.
      Throws:
      IOException
      SAXException
      TikaException
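A minimal sketch of the facade call described above, assuming tika-core plus a parser bundle on the classpath. The class name `ParseToStringSketch` and the helper `extractText` are hypothetical names chosen for illustration; only `Tika.parseToString(InputStream)` comes from the library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical illustration class, not part of Tika.
final class ParseToStringSketch {

    // Detects the type of the stream and returns whatever text Tika extracts.
    static String extractText(InputStream stream) throws IOException, TikaException {
        Tika tika = new Tika();
        return tika.parseToString(stream);
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample input (an assumption; the original example reads a resource file).
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println(extractText(in));
        }
    }
}
```

How much text comes back depends on which parser modules are available at runtime; with tika-core alone, the result may be empty.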
    • parseExample

      public String parseExample() throws IOException, SAXException, TikaException
      Example of how to use Tika to parse a file when you do not know its file type ahead of time.

      AutoDetectParser attempts to discover the file's type automatically, then calls the Parser built for that file type.

      The stream is the input to be parsed by the Parser. In this case, we get a file from the resources folder of this project.

      Handlers are used to select the information you want from everything the Parser gathers. The BodyContentHandler extracts everything that would go between HTML body tags.

      The Metadata object will be filled by the Parser with the metadata it discovers about the file being parsed.

      Note: This example will extract content from the outer document and all embedded documents. However, if you choose to pass your own ParseContext, make sure to set a Parser in it, or else embedded content will not be parsed.

      Returns:
      The content of a file.
      Throws:
      IOException
      SAXException
      TikaException
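The pieces described above (parser, stream, handler, metadata) might fit together as in the sketch below. `AutoDetectSketch` and `parse` are hypothetical names, and the caller supplies the stream instead of loading a project resource.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class, not part of Tika.
final class AutoDetectSketch {

    // Parses a stream of unknown type; the caller's Metadata object is filled in as a side effect.
    static String parse(InputStream stream, Metadata metadata)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();      // picks a concrete Parser by detected type
        BodyContentHandler handler = new BodyContentHandler(); // collects the body text
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
```

After the call, `metadata.names()` lists whichever keys the parser populated. The no-argument `BodyContentHandler` constructor applies a default write limit of 100,000 characters; passing `-1` to the constructor disables that limit.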
    • parseNoEmbeddedExample

      public String parseNoEmbeddedExample() throws IOException, SAXException, TikaException
      If you don't want content from embedded documents, pass in a ParseContext that contains an EmptyParser.
      Returns:
      The content of a file.
      Throws:
      IOException
      SAXException
      TikaException
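One way to wire this up is sketched below. `NoEmbeddedSketch` and `parseOuterOnly` are hypothetical names; the sketch registers the shared `EmptyParser.INSTANCE` under `Parser.class` so that embedded documents are handed to a parser that does nothing.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class, not part of Tika.
final class NoEmbeddedSketch {

    // Returns text from the outer document only.
    static String parseOuterOnly(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();

        // Embedded documents are delegated to the Parser found in the context;
        // EmptyParser produces no content for them.
        ParseContext context = new ParseContext();
        context.set(Parser.class, EmptyParser.INSTANCE);

        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }
}
```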
    • parseEmbeddedExample

      public String parseEmbeddedExample() throws IOException, SAXException, TikaException
      This example shows how to extract content from the outer document and all embedded documents. The key is to specify a Parser in the ParseContext.
      Returns:
      content, including from embedded documents
      Throws:
      IOException
      SAXException
      TikaException
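A sketch of the ParseContext setup this example relies on, under the assumption that the same AutoDetectParser should also handle embedded documents. `EmbeddedSketch` and `parseWithEmbedded` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class, not part of Tika.
final class EmbeddedSketch {

    // Returns text from the outer document and from any embedded documents.
    static String parseWithEmbedded(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();

        // Registering the parser in the context tells Tika which Parser
        // to apply to embedded documents.
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        parser.parse(stream, handler, new Metadata(), context);
        return handler.toString();
    }
}
```

The only difference from the no-embedded variant is the object registered under `Parser.class`.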
    • recursiveParserWrapperExample

      public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException
      For documents that may contain embedded documents, it might be helpful to create a list of Metadata objects, one for the container document and one for each embedded document. This allows easy access to both the extracted content and the metadata of each embedded document. Note that many document formats can contain embedded documents, including traditional container formats -- zip, tar and others -- but also common office document formats, including MSWord, MSExcel, MSPowerPoint, RTF, PDF, MSG and several others.

      The "content" format is determined by the ContentHandlerFactory, and the content is stored in org.apache.tika.parser.RecursiveParserWrapper#TIKA_CONTENT.

      The drawback to the RecursiveParserWrapper is that it caches metadata and contents in memory. This should not be used on files whose contents are too big to be handled in memory.

      Returns:
      a list of Metadata objects, one for the container file and one for each embedded file
      Throws:
      IOException
      SAXException
      TikaException
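The sketch below shows one possible shape of a RecursiveParserWrapper call. It assumes a Tika release that provides `RecursiveParserWrapperHandler` (the handler-based API); older releases expose a different API. `RecursiveSketch` and `parseAll` are hypothetical names, and the choice of plain-text content with no write limit (`-1`) is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;
import org.xml.sax.SAXException;

// Hypothetical illustration class, not part of Tika.
final class RecursiveSketch {

    // Returns one Metadata object for the container and one per embedded document.
    static List<Metadata> parseAll(InputStream stream)
            throws IOException, SAXException, TikaException {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(new AutoDetectParser());

        // The factory decides the format of the stored content: here plain text,
        // with -1 meaning no write limit (everything is held in memory).
        RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

        wrapper.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.getMetadataList();
    }
}
```

Because every Metadata object, including its extracted content, stays in memory until the list is returned, a positive write limit may be the safer choice for large inputs.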
    • serializedRecursiveParserWrapperExample

      public String serializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException
      We include a simple JSON serializer for a list of metadata with JsonMetadataList. That class also includes a deserializer to convert from JSON back to a List.

      This functionality is also available in tika-app's GUI, and with the -J option on tika-app's command line. For tika-server users, the "rmeta" service returns this format.

      Returns:
      a JSON representation of a list of Metadata objects
      Throws:
      IOException
      SAXException
      TikaException
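A sketch of the serialization step, assuming the tika-serialization module is on the classpath and that `JsonMetadataList` lives in `org.apache.tika.metadata.serialization` (the package has differed between Tika releases, so this import may need adjusting). `JsonListSketch` and `toJson` are hypothetical names; the helper declares `throws Exception` because the checked exception type of the library call also varies by version.

```java
import java.io.StringWriter;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadataList;

// Hypothetical illustration class, not part of Tika.
final class JsonListSketch {

    // Serializes a list of Metadata objects (e.g. from RecursiveParserWrapper) to a JSON string.
    static String toJson(List<Metadata> metadataList) throws Exception {
        StringWriter writer = new StringWriter();
        JsonMetadataList.toJson(metadataList, writer);
        return writer.toString();
    }
}
```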
    • extractEmbeddedDocumentsExample

      public List<Path> extractEmbeddedDocumentsExample(Path outputPath) throws IOException, SAXException, TikaException
      Parameters:
      outputPath - output directory in which to place files
      Returns:
      list of files created
      Throws:
      IOException
      SAXException
      TikaException