                     ---------------
                     Apache Tika 1.8
                     ---------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Apache Tika 1.8

   The most notable changes in Tika 1.8 over the previous release are:

     * Fix null pointer exception when processing ODT footer styles
       ({{{http://issues.apache.org/jira/browse/TIKA-1600}TIKA-1600}}).

     * Upgrade com.drewnoakes' metadata-extractor to 2.0 and add parser
       for webp metadata
       ({{{http://issues.apache.org/jira/browse/TIKA-1594}TIKA-1594}}).

     * Duration is now extracted from MP3s with no ID3 tags
       ({{{http://issues.apache.org/jira/browse/TIKA-1589}TIKA-1589}}).

     * Upgraded to PDFBox 1.8.9
       ({{{http://issues.apache.org/jira/browse/TIKA-1575}TIKA-1575}}).

     * Tika now supports the IsaTab data standard for bioinformatics,
       both in terms of MIME identification and in terms of parsing
       ({{{http://issues.apache.org/jira/browse/TIKA-1580}TIKA-1580}}).

     * Tika server can now enable CORS requests with the command line
       "--cors" or "-C" option
       ({{{http://issues.apache.org/jira/browse/TIKA-1586}TIKA-1586}}).

     * Update jhighlight dependency to avoid using LGPL license. Thanks to
       @kkrugler for his great contribution
       ({{{http://issues.apache.org/jira/browse/TIKA-1581}TIKA-1581}}).

     * Updated HDF and NetCDF parsers to output file version in metadata
       ({{{http://issues.apache.org/jira/browse/TIKA-1578}TIKA-1578}} and
       {{{http://issues.apache.org/jira/browse/TIKA-1579}TIKA-1579}}).

     * Upgraded to POI 3.12-beta1
       ({{{http://issues.apache.org/jira/browse/TIKA-1531}TIKA-1531}}).

     * Added tika-batch module for directory-to-directory batch processing.
       This is a new, experimental capability, and the API will likely
       change in future releases
       ({{{http://issues.apache.org/jira/browse/TIKA-1330}TIKA-1330}}).

     * Translator.translate() exceptions are now restricted to
       TikaException and IOException
       ({{{http://issues.apache.org/jira/browse/TIKA-1416}TIKA-1416}}).

     * Tika now supports MIME detection for Microsoft Enhanced
       Metafiles (EMF)
       ({{{http://issues.apache.org/jira/browse/TIKA-1554}TIKA-1554}}).

     * Tika has improved delineation in XML and HTML MIME detection
       ({{{http://issues.apache.org/jira/browse/TIKA-1365}TIKA-1365}}).

     * Upgraded the Drew Noakes metadata-extractor to version 2.7.2
       ({{{http://issues.apache.org/jira/browse/TIKA-1576}TIKA-1576}}).

     * Added basic style support for ODF documents, contributed by
       Axel Dörfler
       ({{{http://issues.apache.org/jira/browse/TIKA-1063}TIKA-1063}}).

     * Move Tika server resources and writers to separate
       org.apache.tika.server.resource and writer packages
       ({{{http://issues.apache.org/jira/browse/TIKA-1564}TIKA-1564}}).

     * Upgrade UCAR dependencies to 4.5.5
       ({{{http://issues.apache.org/jira/browse/TIKA-1571}TIKA-1571}}).

     * Fix paths in Tika server welcome page
       ({{{http://issues.apache.org/jira/browse/TIKA-1567}TIKA-1567}}).

     * Fixed infinite recursion while parsing some PDFs
       ({{{http://issues.apache.org/jira/browse/TIKA-1038}TIKA-1038}}).

     * XHTMLContentHandler now properly passes along body attributes,
       contributed by Markus Jelsma
       ({{{http://issues.apache.org/jira/browse/TIKA-995}TIKA-995}}).

     * Added TikaCLI option --compare-file-magic, which reports MIME types
       known to the file(1) tool but not known, or not fully known, to Tika.

     * MediaTypeRegistry support for returning known child types.

     * Support for excluding (blacklisting) certain Parsers from being
       used by DefaultParser via the Tika Config file, using the new
       parser-exclude tag
       ({{{http://issues.apache.org/jira/browse/TIKA-1558}TIKA-1558}}).

     * Detect Global Change Master Directory (GCMD) Directory Interchange
       Format (DIF) files
       ({{{http://issues.apache.org/jira/browse/TIKA-1561}TIKA-1561}}).

     * Tika's JAX-RS server can now return stack traces for parse
       exceptions
       ({{{http://issues.apache.org/jira/browse/TIKA-1323}TIKA-1323}}).

     * Added MockParser for testing handling of exceptions, errors and
       hangs in code that uses parsers
       ({{{http://issues.apache.org/jira/browse/TIKA-1553}TIKA-1553}}).

     * The ForkParser service was removed from Activator. This rolls back
       {{{http://issues.apache.org/jira/browse/TIKA-1354}TIKA-1354}}.

     * Increased the speed of language identification by a factor of two,
       contributed by Toke Eskildsen
       ({{{http://issues.apache.org/jira/browse/TIKA-1549}TIKA-1549}}).

     * Added parser for SQLite3 db files. BEWARE: the org.xerial dependency
       includes native libs. Some users may need to exclude this dependency
       or configure it specially for their environment
       ({{{http://issues.apache.org/jira/browse/TIKA-1511}TIKA-1511}}).

     * Use POST instead of PUT for tika-server form methods
       ({{{http://issues.apache.org/jira/browse/TIKA-1547}TIKA-1547}}).

     * A basic wrapper around the UNIX file command was added to extract
       Strings. In addition, a parser was added to handle Strings parsing
       from octet-streams using Latin1 charsets
       ({{{http://issues.apache.org/jira/browse/TIKA-1541}TIKA-1541}},
       {{{http://issues.apache.org/jira/browse/TIKA-1483}TIKA-1483}}).

     * Add test files and detection mechanism for Gridded Binary (GRIB)
       files
       ({{{http://issues.apache.org/jira/browse/TIKA-1539}TIKA-1539}}).

     * The RAR parser was updated to handle Chinese characters using the
       functionality provided by allowing encoding to be used within
       ZipArchiveInputStream
       ({{{http://issues.apache.org/jira/browse/TIKA-936}TIKA-936}}).

     * Fix out of memory error in surefire plugin
       ({{{http://issues.apache.org/jira/browse/TIKA-1537}TIKA-1537}}).

     * Build a parser to extract data from GRIB formats
       ({{{http://issues.apache.org/jira/browse/TIKA-1423}TIKA-1423}}).

     * Upgrade to Commons Compress 1.9
       ({{{http://issues.apache.org/jira/browse/TIKA-1534}TIKA-1534}}).

     * Include media duration in metadata parsed by MP4Parser
       ({{{http://issues.apache.org/jira/browse/TIKA-1530}TIKA-1530}}).

     * Support password protected 7zip files (using a PasswordProvider,
       in keeping with the other password supporting formats)
       ({{{http://issues.apache.org/jira/browse/TIKA-1521}TIKA-1521}}).

     * Password protected Zip files should not trigger an exception
       ({{{http://issues.apache.org/jira/browse/TIKA-1028}TIKA-1028}}).

   The following people have contributed to Tika 1.8 by submitting or
   commenting on the issues resolved in this release:

     * Adam Lamar

     * Alejandro León Mora

     * Aleksandr Dubinsky

     * Andrew Hwang

     * Ann Burgess

     * Ben McCann

     * Chris A. Mattmann

     * David Pilato

     * Giuseppe Totaro

     * Hari Sekhon

     * Jan Goyvaerts

     * Juha Haaga

     * Karl Wright

     * Konstantin Gribov

     * Lewis John McGibbney

     * lixin

     * Luis Filipe Nassif

     * Luke sh

     * Markus Jelsma

     * Matt Sheppard

     * Max Daniline

     * Michael McCandless

     * mortee

     * Nick Burch

     * Oleg Oshmyan

     * Oskar Wickström

     * Pascal Essiembre

     * Rob Tulloh

     * Sean Zhao

     * Sergey Beryozkin

     * Shinichiro Abe

     * Tien Nguyen Manh

     * Tilman Hausherr

     * Tim Allison

     * Toke Eskildsen

     * Tyler Palsulich

     * Vineet Ghatge

   See {{http://s.apache.org/L6Z}} for more details on these contributions.