Apache Tika 0.10

The most notable changes in Tika 0.10 over the previous release are:

  • A parser for CHM help files was added (TIKA-245).
  • Invalid characters are now replaced with the Unicode replacement character (U+FFFD) instead of spaces. If your code processes Tika's output, you may need to update it to handle U+FFFD (TIKA-698).
  • The RTF parser was rewritten to perform its own direct shallow parse of the RTF content, instead of using RTFEditorKit from javax.swing. This fixes several issues in the old parser, including doubling of Unicode characters in certain cases (TIKA-683), exceptions on malformed RTF documents (TIKA-666), and missing text from some elements (header/footer, hyperlinks, footnotes, text inside pictures).
  • Handling of temporary files within Tika was much improved (TIKA-701, TIKA-654, TIKA-645, TIKA-153).
  • The Tika GUI got a facelift and some extra features (TIKA-635).
  • The apache-mime4j dependency of the email message parser was upgraded from version 0.6 to 0.7 (TIKA-716). The parser also now accepts a MimeConfig object in the ParseContext as configuration (TIKA-640).
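
As an illustration of the U+FFFD change noted above, the following is a minimal sketch of how downstream code might detect the replacement character or map it back to a space to approximate the earlier behaviour. It uses only the Java standard library; the class name, the method names, and the sample string are hypothetical and are not part of Tika.

```java
public class ReplacementCharDemo {
    // The Unicode replacement character, U+FFFD.
    static final char REPLACEMENT = '\uFFFD';

    // Map each replacement character to a space, which approximates
    // the output produced before this release.
    static String replaceWithSpaces(String text) {
        return text.replace(REPLACEMENT, ' ');
    }

    // Count replacement characters, e.g. to flag documents whose
    // extracted text contained invalid characters.
    static int countReplacements(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == REPLACEMENT) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for text extracted by Tika.
        String extracted = "caf\uFFFD au lait";
        System.out.println(countReplacements(extracted));
        System.out.println(replaceWithSpaces(extracted));
    }
}
```

Whether to map U+FFFD back to spaces or to keep it as a marker of lossy extraction depends on the application; the count can help decide per document.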

The following people have contributed to Tika 0.10 by submitting or commenting on the issues resolved in this release:

  • Alain Viret
  • Alex Ott
  • Alexander Chow
  • Andreas Kemkes
  • Andrew Khoury
  • Babak Farhang
  • Benjamin Douglas
  • Benson Margulies
  • Chris A. Mattmann
  • chris hudson
  • Chris Lott
  • Cristian Vat
  • Curt Arnold
  • Cynthia L Wong
  • Dave Brosius
  • David Benson
  • Enrico Donelli
  • Erik Hetzner
  • Erna de Groot
  • Gabriele Columbro
  • Gavin
  • Geoff Jarrad
  • Gregory Kanevsky
  • Günter Rombauts
  • Henning Gross
  • Henri Bergius
  • Ingo Renner
  • Ingo Wiarda
  • Izaak Alpert
  • Jan Høydahl
  • Jens Wilmer
  • Jeremy Anderson
  • Joseph Vychtrle
  • Joshua Turner
  • Jukka Zitting
  • Julien Nioche
  • Karl Heinz Marbaise
  • Ken Krugler
  • Kostya Gribov
  • Luciano Leggieri
  • Mads Hansen
  • Mark Butler
  • Matt Sheppard
  • Maxim Valyanskiy
  • Michael McCandless
  • Michael Pisula
  • Murad Shahid
  • Nick Burch
  • Oleg Tikhonov
  • Pablo Queixalos
  • Paul Jakubik
  • Raimund Merkert
  • Rajiv Kumar
  • Robert Trickey
  • Sami Siren
  • samraj
  • Selva Ganesan
  • Sjoerd Smeets
  • Stephen Duncan Jr
  • Tran Nam Quang
  • Uwe Schindler
  • Vitaliy Filippov

See http://s.apache.org/vR for more details on these contributions.