Apache Tika 1.9

The most notable changes in Tika 1.9 over the previous release are:

  • The ability to use the cTAKES clinical text knowledge extraction system for biomedical data is now included as a Tika parser (TIKA-1645, TIKA-1642).
  • Tika Server now allows a user to specify the Tika config from the command line (TIKA-1652, TIKA-1426).
  • Matlab file detection has been improved (TIKA-1634).
  • ExifTool was added as an external parser (TIKA-1639).
  • FFmpeg can now be used as a Tika parser when it is installed and on the PATH (TIKA-1510).
  • Fixes have been applied to the ExternalParser to make it functional (TIKA-1638).
  • Tika service loading can now be more verbose with the org.apache.tika.service.error.warn system property (TIKA-1636).
  • Tika Server now allows metadata extraction from remote URLs and outputs the detected language as a metadata field (TIKA-1625).
  • Fixed OUTPUT_FILE_TOKEN not being replaced in ExternalParser; fix contributed by Pascal Essiembre (TIKA-1620).
  • Tika REST server now supports language identification (TIKA-1622).
  • All of the example code from the Tika in Action book has been donated to Tika and added to tika-examples (TIKA-1562).
  • Tika server now logs errors determining ContentDisposition (TIKA-1621).
  • An algorithm for using Byte Histogram frequencies to construct a Neural Network and to perform MIME detection was added (TIKA-1582).
  • A Bayesian algorithm for MIME detection by probabilistic means was added (TIKA-1517).
  • Tika now incorporates the Apache Spatial Information System capability of parsing Geographic ISO 19139 files, and can also detect those files (TIKA-443).
  • The MimeTypes code was updated to support inheritance (TIKA-1535).
  • Tika can now parse and identify Global Change Master Directory Interchange Format (GCMD DIF) scientific data files (TIKA-1532).
  • Detection of CBOR files by extension has been improved (TIKA-1610).
  • The xerial.org sqlite-jdbc jar is now a "provided" dependency (TIKA-1511). Users will now need to add sqlite-jdbc to their classpath for the Sqlite3Parser to work.
  • ExternalParser.check now catches (suppresses) SecurityException and returns false, so it's OK to run Tika with a security policy that does not allow execution of external processes (TIKA-1628).
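
The SecurityException handling described for ExternalParser.check (TIKA-1628) can be illustrated with a small standalone sketch. This is not Tika's code: the class ExternalCheckSketch and the method isCommandUsable are hypothetical names, and the sketch assumes, as a simplification, that any non-zero exit status means the tool is unavailable. It only shows the general pattern of treating a forbidden process launch the same way as a missing executable.

```java
import java.io.IOException;
import java.io.InputStream;

final class ExternalCheckSketch {

    /**
     * Hypothetical helper, not Tika's ExternalParser.check: returns true only
     * if the command could be started and exited with status 0.
     */
    static boolean isCommandUsable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // Drain the output so the child cannot block on a full pipe.
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (out.read(buffer) != -1) {
                    // discard
                }
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            // Command not found or not executable.
            return false;
        } catch (SecurityException e) {
            // A security policy forbids launching processes: report the
            // tool as unavailable instead of propagating the exception.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        // The command name here is an arbitrary placeholder.
        System.out.println(isCommandUsable("tika-sketch-no-such-command-12345"));
    }
}
```

With this pattern, a caller can probe for an external tool once and skip it when the probe returns false, whether the tool is missing or process execution is disallowed.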
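
The byte-histogram approach to MIME detection (TIKA-1582) uses byte frequencies as the input to a neural network. As a rough illustration of the feature-extraction step only, the sketch below computes a normalized 256-bin histogram of a byte array. ByteHistogramSketch and byteHistogram are hypothetical names rather than Tika classes, and the preprocessing Tika actually applies may differ; the network itself is not shown.

```java
import java.nio.charset.StandardCharsets;

final class ByteHistogramSketch {

    /** Returns 256 relative frequencies, one per possible byte value. */
    static double[] byteHistogram(byte[] data) {
        double[] histogram = new double[256];
        if (data.length == 0) {
            return histogram; // all zeros for empty input
        }
        for (byte b : data) {
            histogram[b & 0xFF]++; // mask so the byte is treated as unsigned
        }
        for (int i = 0; i < histogram.length; i++) {
            histogram[i] /= data.length; // convert counts to frequencies
        }
        return histogram;
    }

    public static void main(String[] args) {
        byte[] sample = "aab".getBytes(StandardCharsets.US_ASCII);
        double[] h = byteHistogram(sample);
        System.out.println("a: " + h['a'] + ", b: " + h['b']);
    }
}
```

Normalizing by the input length keeps the feature vector comparable across files of different sizes, which is one plausible reason to prefer frequencies over raw counts.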

The following people have contributed to Tika 1.9 by submitting or commenting on the issues resolved in this release:

  • Aakarsh Medleri Hire Math
  • Anya Yun Li
  • Arturo Beltran
  • Chris A. Mattmann
  • Gautham Gowrishankar
  • Giuseppe Totaro
  • Jan Kronquist
  • Ji-Hyun Oh
  • Konstantin Gribov
  • Lewis John McGibbney
  • Lorenz Leutgeb
  • Luke sh
  • Michael McCandless
  • Nick Burch
  • Pascal Essiembre
  • Pavel Micka
  • Selina Chu
  • Tim Allison
  • Tyler Palsulich

See http://s.apache.org/4n1 for more details on these contributions.