                     ----------------
                     Apache Tika 1.17
                     ----------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Apache Tika 1.17

   The most notable changes in Tika 1.17 over the previous release are:

   * This will be the last version that supports Java 7. The next version will require Java 8.

   * Fix thread-safety in ChmExtractor ({{{http://issues.apache.org/jira/browse/TIKA-2519}TIKA-2519}}).

   * Upgrade cxf to 3.0.16 ({{{http://issues.apache.org/jira/browse/TIKA-2516}TIKA-2516}}).

   * Allow users to configure maxMainMemoryBytes for PDFs via shrike (PR-213).

   * Extract underline and strikethrough in docx ({{{http://issues.apache.org/jira/browse/TIKA-2347}TIKA-2347}} and {{{http://issues.apache.org/jira/browse/TIKA-2512}TIKA-2512}}).

   * Cache TikaConfig in EmbeddedDocumentUtil for better performance in documents with a large number of attachments ({{{http://issues.apache.org/jira/browse/TIKA-2511}TIKA-2511}}).

   * Extract media files from ooxml ({{{http://issues.apache.org/jira/browse/TIKA-2510}TIKA-2510}}).

   * Standardize the way the Image and Video captioning dockers and extraction work ({{{http://issues.apache.org/jira/browse/TIKA-2400}TIKA-2400}}, {{{http://github.com/apache/tika/pull/208}Github-208}}).

   * Upgrade to xmpcore 5.1.3 ({{{http://issues.apache.org/jira/browse/TIKA-2034}TIKA-2034}}).

   * Upgrade to metadata-extractor 2.10.1 ({{{http://issues.apache.org/jira/browse/TIKA-2486}TIKA-2486}}).

   * Upgrade to OpenNLP 1.8.3 ({{{http://issues.apache.org/jira/browse/TIKA-2502}TIKA-2502}}).

   * Upgrade to Jackson 2.9.2 ({{{http://issues.apache.org/jira/browse/TIKA-2501}TIKA-2501}}).

   * Catch potential NPE in getting InputStream for attachments in PST file ({{{http://issues.apache.org/jira/browse/TIKA-2488}TIKA-2488}}).

   * Upgrade to PDFBox 2.0.8 ({{{http://issues.apache.org/jira/browse/TIKA-2489}TIKA-2489}}).

   * Allow configuration of markLimit in EncodingDetectors via tika-config.xml ({{{http://issues.apache.org/jira/browse/TIKA-2485}TIKA-2485}}).

   * RFC822Parser now selects the best alternative for multipart/alternative body components. This aligns with the behavior of the OutlookParser ({{{http://issues.apache.org/jira/browse/TIKA-2478}TIKA-2478}}). Users can select legacy behavior via the "extractAllAlternatives" parameter in the RFC822 parser definition in tika-config.xml.

   * Narrow mime detection for ms-owner files and add detection for .nls files ({{{http://issues.apache.org/jira/browse/TIKA-2469}TIKA-2469}}).

   * Fix bug in CharsetDetector that led to different detected charsets depending on whether the user called setText with a byte[] or an InputStream, via Sean Story ({{{http://issues.apache.org/jira/browse/TIKA-2475}TIKA-2475}}).

   * Remove JAXB for easier use with Java 9, via Robert Munteanu ({{{http://issues.apache.org/jira/browse/TIKA-2466}TIKA-2466}}).

   * Upgrade to POI 3.17 ({{{http://issues.apache.org/jira/browse/TIKA-2429}TIKA-2429}}).

   * Enable extraction of standard references from text ({{{http://issues.apache.org/jira/browse/TIKA-2449}TIKA-2449}}).

   * Load external custom mimetypes XML from system property tika.custom-mimetypes ({{{http://issues.apache.org/jira/browse/TIKA-2460}TIKA-2460}}).

   * Extract number of tiffs in a multi-page tiff ({{{http://issues.apache.org/jira/browse/TIKA-2451}TIKA-2451}}).

   * Fix detection of emails extracted from mbox ({{{http://issues.apache.org/jira/browse/TIKA-2456}TIKA-2456}}).

   * Add OverrideDetector and allow PSTParser to specify body content type as text or html, to avoid incorrect auto-detection of rfc/mbox, etc. ({{{http://issues.apache.org/jira/browse/TIKA-2454}TIKA-2454}}).

   * AutoDetectParser throws ZeroByteFileException for zero-byte files after detection on the file extension ({{{http://issues.apache.org/jira/browse/TIKA-2450}TIKA-2450}}).

   * Extract phonetic runs in docx with experimental SAX parser ({{{http://issues.apache.org/jira/browse/TIKA-2448}TIKA-2448}}).

   * Extract phonetic runs from xls and allow users to turn off extraction of phonetic runs in both xls and xlsx ({{{http://issues.apache.org/jira/browse/TIKA-2440}TIKA-2440}}).

   * OOXML locale should be set by POI's LocaleUtil, not Locale.getDefault(). Fix unit tests to be robust against different locales in OOXML and ExcelParser ({{{http://issues.apache.org/jira/browse/TIKA-2438}TIKA-2438}}).

   * Tika now has support for automatic image captioning, which combines Computer Vision and Natural Language Processing to automatically generate a readable caption for an image ({{{http://issues.apache.org/jira/browse/TIKA-2262}TIKA-2262}}, {{{http://issues.apache.org/jira/browse/TIKA-2355}TIKA-2355}}, {{{http://issues.apache.org/jira/browse/TIKA-2402}TIKA-2402}}, Gh-198, Gh-196, Gh-189).

   * Add TestCorruptedFiles to allow devs to test parsers against corrupted input files ({{{http://issues.apache.org/jira/browse/TIKA-2430}TIKA-2430}}).

   * Correct the mimetype definition for Windows batch files (CMD and BAT), which are the same ({{{http://issues.apache.org/jira/browse/TIKA-2445}TIKA-2445}}).

   * PSDParser memory use improvements ({{{http://issues.apache.org/jira/browse/TIKA-2447}TIKA-2447}}).

   * Add underline extraction from Word documents (doc/docx), via Stuart Hendren, as well as strikethrough extraction in docx ({{{http://issues.apache.org/jira/browse/TIKA-2347}TIKA-2347}}, {{{http://github.com/apache/tika/pull/173}Github-173}}).

   The following people have contributed to Tika 1.17 by submitting or commenting on the issues resolved in this release:

   * Aashish Chaudhary

   * Abhijit Rajwade

   * Advokat

   * Albert L.

   * Alessandro De Angelis

   * Aman R Mathur

   * Ann Burgess

   * Bin Hawking

   * Bob Paulin

   * Chris A. Mattmann

   * Chris Bryant

   * Chris Wilson

   * Daniel Bonniot de Ruisselet

   * Dave Meikle

   * Dillon Welch

   * Dustin Spicuzza

   * Eamonn Saunders

   * frank

   * Giuseppe Totaro

   * Jan Burkhardt

   * jefferyyuan

   * Julian Reschke

   * Karl Buchta

   * Karl Richter

   * Ken Krugler

   * Konstantin Gribov

   * Lewis John McGibbney

   * Luis Filipe Nassif

   * Łukasz Ozimek

   * Madhav Sharan

   * Markus Jelsma

   * Matthew Caruana Galizia

   * Michael McCandless

   * Mike Cantrell

   * Nick Burch

   * Paul Ramirez

   * Peter Weiss

   * RameshKalidindi

   * Ravi

   * Ray Gauss II

   * Reinhard Schwab

   * Robert Letzler

   * Robert Munteanu

   * Roberto Benedetti

   * Rupert Westenthaler

   * Sam H

   * Sergey Beryozkin

   * Sergey Tsalkov

   * Stefano Fornari

   * Stuart Hendren

   * Takahiro Ochi

   * Thamme Gowda

   * Thejan Wijesinghe

   * Thomas Mortagne

   * Tilman Hausherr

   * Tim Allison

   * Tyler Palsulich

   * TzeKai Lee

   * Uwe Schindler

   * Yaniv Kunda

   See {{https://s.apache.org/bX5z}} for more details on these contributions.