                     ---------------
                     Apache Tika 1.5
                     ---------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Apache Tika 1.5

   The most notable changes in Tika 1.5 over the previous release are:

     * Fixed bug in handling of embedded file processing in PDFs
       ({{{http://issues.apache.org/jira/browse/TIKA-1228}TIKA-1228}}).

     * Added SourceCodeParser to support Java, Groovy, and C++ files
       ({{{http://issues.apache.org/jira/browse/TIKA-1224}TIKA-1224}}).

     * Updated Tika Server to support multipart/form-data payloads
       ({{{http://issues.apache.org/jira/browse/TIKA-1198}TIKA-1198}}).

     * Updated Tika Server to CXF 2.7.8
       ({{{http://issues.apache.org/jira/browse/TIKA-1197}TIKA-1197}}).

     * Updated Tika Server to accept requests over wildcard addresses
       ({{{http://issues.apache.org/jira/browse/TIKA-1196}TIKA-1196}}).

     * Added option to use alternate NonSequentialPDFParser
       ({{{http://issues.apache.org/jira/browse/TIKA-1201}TIKA-1201}}).

     * Content from PDF AcroForms is now extracted
       ({{{http://issues.apache.org/jira/browse/TIKA-973}TIKA-973}}).

     * Fixed invalid asterisks from master slide in PPT
       ({{{http://issues.apache.org/jira/browse/TIKA-1171}TIKA-1171}}).

     * Added test cases to confirm handling of auto-date in PPT and PPTX
       ({{{http://issues.apache.org/jira/browse/TIKA-817}TIKA-817}}).

     * Text from tables in PPT files is once again extracted correctly
       ({{{http://issues.apache.org/jira/browse/TIKA-1076}TIKA-1076}}).

     * Text is extracted from text boxes in XLSX
       ({{{http://issues.apache.org/jira/browse/TIKA-1100}TIKA-1100}}).

     * Tika no longer hangs when processing Excel files with custom
       fraction format
       ({{{http://issues.apache.org/jira/browse/TIKA-1132}TIKA-1132}}).

     * A disconcerting stack trace from missing beans is no longer printed
       for some DOCX files
       ({{{http://issues.apache.org/jira/browse/TIKA-792}TIKA-792}}).

     * Upgraded POI to 3.10-beta2
       ({{{http://issues.apache.org/jira/browse/TIKA-1173}TIKA-1173}}).

     * Upgraded PDFBox to 1.8.4
       ({{{http://issues.apache.org/jira/browse/TIKA-1230}TIKA-1230}}).

     * Made HtmlEncodingDetector more flexible in finding meta header
       charset
       ({{{http://issues.apache.org/jira/browse/TIKA-1001}TIKA-1001}}).

     * Added sanitized test HTML file for local file test
       ({{{http://issues.apache.org/jira/browse/TIKA-1139}TIKA-1139}}).

     * Fixed bug that prevented attachments within a PDF from being
       processed if the PDF itself was an attachment
       ({{{http://issues.apache.org/jira/browse/TIKA-1124}TIKA-1124}}).

     * Text from paragraph-level structured document tags in DOCX files
       is now extracted
       ({{{http://issues.apache.org/jira/browse/TIKA-1130}TIKA-1130}}).

     * RTF: Fixed ArrayIndexOutOfBoundsException when parsing list
       override
       ({{{http://issues.apache.org/jira/browse/TIKA-1192}TIKA-1192}}).

     * CLI: TikaCLI now escapes invalid filename characters as hex
       characters
       ({{{http://issues.apache.org/jira/browse/TIKA-1078}TIKA-1078}}).

   The following people have contributed to Tika 1.5 by submitting or
   commenting on the issues resolved in this release:

     * Albert L.

     * Andrew Jackson

     * Andrzej Bialecki

     * Boris Naguet

     * Chris A. Mattmann

     * Curtis Warner

     * Damien Dykman

     * Daniel Bonniot de Ruisselet

     * Daniel Gibby

     * Dave Kincaid

     * Dave Meikle

     * Dietmar Glachs

     * Emil Burzo

     * Gaurav

     * Giuseppe Totaro

     * Grzegorz Kaczmarczyk

     * Hong-Thai Nguyen

     * Jason Sherman

     * Jeremy

     * Jukka Zitting

     * Kabron Kline

     * Kai-Uwe Schmidt

     * Kazuaki Matsuba

     * Ken Krugler

     * Lewis John McGibbney

     * Lutz Theurer

     * Marius Dumitru Florea

     * Markus Jelsma

     * Michael Graessle

     * Michael McCandless

     * Nick Burch

     * Niels Beekman

     * Oliver Heger

     * Paul Brinich

     * Ralf Schmitt

     * Ray Gauss II

     * Rian Stockbower

     * Ryan Krueger

     * Sergey Beryozkin

     * Stefano Fornari

     * Sumeet Gorab

     * Tim Allison

     * Timo Boehme

     * Uwe Schindler

     * Vadim Roizman

     * Yegor Kozlov

     * brat

     * David Rapin

     * Gunter Rombauts

     * Isha Marwah

   See {{http://s.apache.org/oQ}} for more details on these contributions.