Apache Tika 1.5

The most notable changes in Tika 1.5 over the previous release are:

  • Fixed a bug in the processing of embedded files in PDFs (TIKA-1228).
  • Added SourceCodeParser to support Java, Groovy, and C++ files (TIKA-1224).
  • Updated Tika Server to support multipart/form-data payloads (TIKA-1198).
  • Updated Tika Server to CXF 2.7.8 (TIKA-1197).
  • Updated Tika Server to accept requests over wildcard addresses (TIKA-1196).
  • Added option to use alternate NonSequentialPDFParser (TIKA-1201).
  • Content from PDF AcroForms is now extracted (TIKA-973).
  • Fixed extraction of invalid asterisks from the master slide in PPT files (TIKA-1171).
  • Added test cases to confirm handling of auto-date in PPT and PPTX (TIKA-817).
  • Text from tables in PPT files is once again extracted correctly (TIKA-1076).
  • Text is extracted from text boxes in XLSX (TIKA-1100).
  • Tika no longer hangs when processing Excel files with a custom fraction format (TIKA-1132).
  • A disconcerting stack trace from missing beans is no longer printed for some DOCX files (TIKA-792).
  • Upgraded POI to 3.10-beta2 (TIKA-1173).
  • Upgraded PDFBox to 1.8.4 (TIKA-1230).
  • Made HtmlEncodingDetector more flexible in finding meta header charset (TIKA-1001).
  • Added sanitized test HTML file for local file test (TIKA-1139).
  • Fixed a bug that prevented attachments within a PDF from being processed if the PDF itself was an attachment (TIKA-1124).
  • Text from paragraph-level structured document tags in DOCX files is now extracted (TIKA-1130).
  • RTF: Fixed ArrayIndexOutOfBoundsException when parsing list override (TIKA-1192).
  • CLI: TikaCLI now escapes invalid filename characters as hex characters (TIKA-1078).

The following people have contributed to Tika 1.5 by submitting or commenting on the issues resolved in this release:

    • Albert L.
    • Andrew Jackson
    • Andrzej Bialecki
    • Boris Naguet
    • Chris A. Mattmann
    • Curtis Warner
    • Damien Dykman
    • Daniel Bonniot de Ruisselet
    • Daniel Gibby
    • Dave Kincaid
    • Dave Meikle
    • Dietmar Glachs
    • Emil Burzo
    • Gaurav
    • Giuseppe Totaro
    • Grzegorz Kaczmarczyk
    • Hong-Thai Nguyen
    • Jason Sherman
    • Jeremy
    • Jukka Zitting
    • Kabron Kline
    • Kai-Uwe Schmidt
    • Kazuaki Matsuba
    • Ken Krugler
    • Lewis John McGibbney
    • Lutz Theurer
    • Marius Dumitru Florea
    • Markus Jelsma
    • Michael Graessle
    • Michael McCandless
    • Nick Burch
    • Niels Beekman
    • Oliver Heger
    • Paul Brinich
    • Ralf Schmitt
    • Ray Gauss II
    • Rian Stockbower
    • Ryan Krueger
    • Sergey Beryozkin
    • Stefano Fornari
    • Sumeet Gorab
    • Tim Allison
    • Timo Boehme
    • Uwe Schindler
    • Vadim Roizman
    • Yegor Kozlov
    • brat
    • David Rapin
    • Gunter Rombauts
    • Isha Marwah

See http://s.apache.org/oQ for more details on these contributions.