Apache Tika 0.6

The most notable changes in Tika 0.6 over the previous release are:

  • MIME type detection for HTML (and all other types) has been improved. Malformed HTML files, and HTML files that require more observed content before their type can be determined, are now correctly identified by the AutoDetectParser. (TIKA-327, TIKA-357, TIKA-366, TIKA-367)
  • Tika now has an additional OSGi bundle packaging that includes all the required parser libraries. This bundle package makes it easy to use all Tika features in an OSGi environment. (TIKA-340, TIKA-342)
  • The Apache POI dependency used for parsing Microsoft Office file formats has been upgraded to version 3.6. The most visible improvement in this version is the notably reduced ooxml jar file size. The tika-app jar size is now down to 15MB from the 25MB in Tika 0.5. (TIKA-353)
  • Handling of character encoding information in input metadata and HTML <meta> tags has been improved. When no applicable encoding information is available, the encoding is detected by looking at the input data. (TIKA-332, TIKA-334, TIKA-335, TIKA-341)
  • Some document types, such as Excel spreadsheets, contain numbers or formulas whose exact text format depends on the current locale. Until now Tika has used the platform default locale in such cases, but clients can now explicitly specify the locale by passing a Locale instance in the parse context. (TIKA-125)
  • The default text output encoding of the tika-app jar is now UTF-8 when running on Mac OS X. This is because the default encoding used by Java is not compatible with the console application in Mac OS X. On all other platforms the text output from tika-app still uses the platform default encoding. (TIKA-324)
  • A Flash Video (video/x-flv) parser has been added. (TIKA-328)
  • Handling of number and date cell formatting in Microsoft Excel documents has been added. This includes currency, percentage and scientific formats. (TIKA-103)
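
To illustrate why the locale matters for the spreadsheet item above (TIKA-125), here is a minimal sketch that uses only the standard Java library, not Tika itself. The class name LocaleFormatDemo and the method formatCell are hypothetical names chosen for this example. It shows how the same numeric cell value produces different text depending on the Locale used for formatting.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical demo class, not part of Tika.
class LocaleFormatDemo {

    // Formats a numeric value with the given locale instead of relying
    // on the platform default locale.
    static String formatCell(double value, Locale locale) {
        NumberFormat format = NumberFormat.getNumberInstance(locale);
        format.setMaximumFractionDigits(2);
        return format.format(value);
    }

    public static void main(String[] args) {
        double cell = 1234.5;
        // Same value, different text depending on the locale.
        System.out.println(formatCell(cell, Locale.US));
        System.out.println(formatCell(cell, Locale.GERMANY));
        // Result here depends on the machine the code runs on.
        System.out.println(formatCell(cell, Locale.getDefault()));
    }
}
```

With Tika, the equivalent step would be to place the chosen Locale in the parse context handed to the parser, as described in the item above. The exact call is not shown here because it depends on the Tika API version in use.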

The following people have contributed to Tika 0.6 by submitting or commenting on the issues resolved in this release:

  • Andrzej Bialecki
  • Bertrand Delacretaz
  • Chris A. Mattmann
  • Dave Meikle
  • Erik Hetzner
  • Felix Meschberger
  • Jukka Zitting
  • Julien Nioche
  • Ken Krugler
  • Luke Nezda
  • Maxim Valyanskiy
  • Niall Pemberton
  • Peter Wolanin
  • Piotr B.
  • Sami Siren
  • Yuan-Fang Li

See http://tinyurl.com/yc3dk67 for more details on these contributions.