Apache Tika 0.5

The most notable changes in Tika 0.5 over the previous release are:

  • Improved RDF/OWL MIME type detection using both MIME magic and pattern matching. (TIKA-309)
  • An org.apache.tika.Tika facade class has been added to simplify common text extraction and type detection use cases. (TIKA-269)
  • A new parse context argument was added to the Parser.parse() method. This context map can be used to pass a delegate parser or other settings to the parsing process. The previous parse() method signature has been deprecated and will be removed in Tika 1.0. (TIKA-275)
  • A simple ngram-based language detection mechanism has been added along with predefined language profiles for 18 languages. (TIKA-209)
  • The media type registry in Tika was synchronized with the MIME type configuration in the Apache HTTP Server. Tika now knows about 1274 different media types and can detect 672 of them using 927 file extension patterns and 280 magic byte patterns. (TIKA-285)
  • Tika now uses Apache PDFBox 0.8.0-incubating for parsing PDF documents. This version is a notable improvement over the 0.7.3 release used earlier. (TIKA-158)
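
For readers unfamiliar with ngram-based language detection (TIKA-209), the general idea can be sketched as follows. This is a minimal illustration only, not Tika's implementation: the class NgramLanguageSketch, its methods, the use of character trigrams, and the cosine similarity measure are all assumptions chosen for the example, and the tiny "profiles" are built from single sentences rather than from the predefined language profiles shipped with Tika.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of ngram-based language detection; not Tika's code.
class NgramLanguageSketch {

    // Builds a trigram frequency profile. Text is lower-cased and every run
    // of non-letter characters is collapsed into a single '_' separator.
    static Map<String, Integer> profile(String text) {
        String normalized = "_"
                + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_")
                + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count vectors (assumed measure).
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Returns the key of the most similar profile, or null if none is given.
    static String detect(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> sample = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = similarity(sample, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy profiles built from one sentence each, for illustration only.
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        profiles.put("de", profile("der schnelle braune fuchs springt auf den faulen hund und die katze"));
        System.out.println(detect("the dog and the fox", profiles));
    }
}
```

A real detector would train its profiles on much larger text samples; the sketch only shows how ngram counts from an input can be compared against per-language profiles.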

The following people have contributed to Tika 0.5 by submitting or commenting on the issues resolved in this release:

  • Alex Baranov
  • Bart Hanssens
  • Benson Margulies
  • Chris A. Mattmann
  • Daan de Wit
  • Erik Hetzner
  • Frank Hellwig
  • Jeff Cadow
  • Joachim Zittmayr
  • Jukka Zitting
  • Julien Nioche
  • Ken Krugler
  • Maxim Valyanskiy
  • MRIT64
  • Paul Borgermans
  • Piotr B.
  • Robert Newson
  • Sascha Szott
  • Ted Dunning
  • Thilo Goetz
  • Uwe Schindler
  • Yuan-Fang Li

See http://tinyurl.com/yl9prwp for more details on these contributions.