                     ---------------
                     Apache Tika 1.6
                     ---------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Apache Tika 1.6

   The most notable changes in Tika 1.6 over the previous release are:

   * Parse output should indicate which Parser was actually used
     ({{{http://issues.apache.org/jira/browse/TIKA-674}TIKA-674}}).

   * Use the forbidden-apis Maven plugin to check for unsafe Java
     operations
     ({{{http://issues.apache.org/jira/browse/TIKA-1387}TIKA-1387}}).

   * Created an ExternalTranslator class to interface with command line
     Translators
     ({{{http://issues.apache.org/jira/browse/TIKA-1385}TIKA-1385}}).

   * Created a MosesTranslator as a subclass of ExternalTranslator that
     calls the Moses Decoder machine translation program
     ({{{http://issues.apache.org/jira/browse/TIKA-1385}TIKA-1385}}).

   * Created the tika-example module. It will have examples of how to
     use the main Tika interfaces
     ({{{http://issues.apache.org/jira/browse/TIKA-1390}TIKA-1390}}).

   * Upgraded to Commons Compress 1.8.1
     ({{{http://issues.apache.org/jira/browse/TIKA-1275}TIKA-1275}}).

   * Upgraded to POI 3.11-beta1
     ({{{http://issues.apache.org/jira/browse/TIKA-1380}TIKA-1380}}).

   * Tika now extracts SDTCell content from tables in .docx files
     ({{{http://issues.apache.org/jira/browse/TIKA-1317}TIKA-1317}}).

   * Tika now supports detection of the Persian/Farsi language
     ({{{http://issues.apache.org/jira/browse/TIKA-1337}TIKA-1337}}).

   * The Tika Detector interface is now exposed through the JAX-RS
     server
     ({{{http://issues.apache.org/jira/browse/TIKA-1335}TIKA-1335}},
     {{{http://issues.apache.org/jira/browse/TIKA-1336}TIKA-1336}}).

   * Tika now has support for parsing binary Matlab files as part of
     our larger effort to increase the number of scientific data
     formats supported
     ({{{http://issues.apache.org/jira/browse/TIKA-1327}TIKA-1327}}).

   * The Tika Server URLs for the unpacker resources have been changed,
     to bring them under a common prefix. The mapping is
     /unpacker/{id} -> /unpack/{id} and /all/{id} -> /unpack/all/{id}
     ({{{http://issues.apache.org/jira/browse/TIKA-1324}TIKA-1324}}).

   * Added module and core Tika interface for translating text between
     languages and added a default implementation that calls
     Microsoft's translate service
     ({{{http://issues.apache.org/jira/browse/TIKA-1319}TIKA-1319}}).

   * Added a Translator implementation that calls Lingo24's Premium
     Machine Translation API
     ({{{http://issues.apache.org/jira/browse/TIKA-1381}TIKA-1381}}).

   * Made RTFParser's list handling slightly more robust against
     corrupt list metadata
     ({{{http://issues.apache.org/jira/browse/TIKA-1305}TIKA-1305}}).

   * Fixed bug in CLI JSON output
     ({{{http://issues.apache.org/jira/browse/TIKA-1291}TIKA-1291}},
     {{{http://issues.apache.org/jira/browse/TIKA-1310}TIKA-1310}}).

   * Added ability to turn off image extraction from PDFs. Users must
     now turn on this capability via the PDFParserConfig
     ({{{http://issues.apache.org/jira/browse/TIKA-1294}TIKA-1294}}).

   * Upgraded to PDFBox 1.8.6
     ({{{http://issues.apache.org/jira/browse/TIKA-1290}TIKA-1290}},
     {{{http://issues.apache.org/jira/browse/TIKA-1231}TIKA-1231}},
     {{{http://issues.apache.org/jira/browse/TIKA-1233}TIKA-1233}},
     {{{http://issues.apache.org/jira/browse/TIKA-1352}TIKA-1352}}).

   * Zip Container Detection for DWFX and XPS formats, which are
     OPC-based
     ({{{http://issues.apache.org/jira/browse/TIKA-1204}TIKA-1204}},
     {{{http://issues.apache.org/jira/browse/TIKA-1221}TIKA-1221}}).

   * Added a user-facing welcome page to the Tika Server, which says
     what it is and gives a very brief summary of what is available
     ({{{http://issues.apache.org/jira/browse/TIKA-1269}TIKA-1269}}).

   * Added Tika Server endpoints to list the available mime types,
     Parsers and Detectors, similar to the --list- methods on the
     Tika CLI App
     ({{{http://issues.apache.org/jira/browse/TIKA-1270}TIKA-1270}}).

   * Improvements to NetCDF and HDF parsing to mimic the output of
     ncdump and extract text dimensions and spatial and variable
     information from scientific data files
     ({{{http://issues.apache.org/jira/browse/TIKA-1265}TIKA-1265}}).

   * Extract attachments from RTF files
     ({{{http://issues.apache.org/jira/browse/TIKA-1010}TIKA-1010}}).

   * Support Outlook Personal Folders File Format *.pst
     ({{{http://issues.apache.org/jira/browse/TIKA-623}TIKA-623}}).

   * Added mime entries for additional Ogg-based formats
     ({{{http://issues.apache.org/jira/browse/TIKA-1259}TIKA-1259}}).

   * Updated the Ogg Vorbis plugin to v0.4, which adds detection for a
     wider range of Ogg formats, and parsers for more Ogg Audio ones
     ({{{http://issues.apache.org/jira/browse/TIKA-1113}TIKA-1113}}).

   * PDF: Images in PDF documents can now be extracted as embedded
     resources
     ({{{http://issues.apache.org/jira/browse/TIKA-1268}TIKA-1268}}).

   * Fixed RuntimeException thrown for certain Word Documents
     ({{{http://issues.apache.org/jira/browse/TIKA-1251}TIKA-1251}}).

   * CLI: TikaCLI now has another option: --list-parser-details-apt,
     which outputs the list of supported parsers in APT format. This is
     used to generate the list on the formats page
     ({{{http://issues.apache.org/jira/browse/TIKA-411}TIKA-411}}).

   The following people have contributed to Tika 1.6 by submitting or
   commenting on the issues resolved in this release:

   * Alexander Chow

   * Amit Gupta

   * Andreas

   * Andreas Hubold

   * Andrzej Bialecki

   * Ann Burgess

   * Avi

   * Boris Naguet

   * Chris A. Mattmann

   * Chris Bamford

   * Christian Reuschling

   * Cservenak, Tamas

   * Damiano

   * Dave Meikle

   * Erik Hetzner

   * Fabian Lange

   * Hassan Akram

   * Hong-Thai Nguyen

   * Jonathan Evans

   * Jukka Zitting

   * Kaijian Xu

   * Ken Krugler

   * Konstantin Gribov

   * Lewis John McGibbney

   * Luis Filipe Nassif

   * Marco Quaranta

   * Martin Kalcher

   * Matthias Krueger

   * Matthieu Neamar

   * Nick Burch

   * Nicolas Gavalda

   * Omid Pourhadi

   * Pradeep Singh

   * Ray Gauss II

   * Sasa Milenkovic

   * Sebastian Nagel

   * Sergey Beryozkin

   * Steffen

   * Steve R

   * Tim Allison

   * Tran Nam Quang

   * Tyler Palsulich

   * Vladimir Glina

   See {{http://s.apache.org/ojn}} for more details on these contributions.