                     ----------------
                     Apache Tika 1.15
                     ----------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Apache Tika 1.15

   The most notable changes in Tika 1.15 over the previous release are:

   * Tika now has a module for deep learning powered by the DL4J toolkit.
     The initial included model is InceptionV3, so with this module Tika
     can use deep learning, natively in Java, for metadata/text extraction
     from images using the Inception model
     ({{{http://github.com/apache/tika/pull/165}Github-165}}).

   * A new parser for sentiment analysis was added, supporting categorical
     (multi-class: angry, sad, neutral, like, love) and binary
     (positive/negative) classification and leveraging the USC data science
     work ({{{http://issues.apache.org/jira/browse/TIKA-2016}TIKA-2016}}).

   * Tika now has the ability to automatically detect objects in videos,
     using OpenCV and TensorFlow
     ({{{http://issues.apache.org/jira/browse/TIKA-2322}TIKA-2322}}).

   * Change default behavior to parse embedded documents even if the user
     forgets to specify a Parser.class in the ParseContext
     ({{{http://issues.apache.org/jira/browse/TIKA-2096}TIKA-2096}}).
     Users who wish to parse only the container document should set
     an EmptyParser as the Parser.class in the ParseContext.

   * Change default behavior of Office Parsers to _not_ extract
     macros. Users need to setExtractMacros to "true"
     ({{{http://issues.apache.org/jira/browse/TIKA-2302}TIKA-2302}}).

   * Added tika-eval module
     ({{{http://issues.apache.org/jira/browse/TIKA-1332}TIKA-1332}}).

   * Unified logging across Tika: SLF4J as logging API, Apache Log4j as
     implementation with JCL and JUL bridges in standalone tools like
     tika-app, tika-batch and tika-server
     ({{{http://issues.apache.org/jira/browse/TIKA-2245}TIKA-2245}}).

   * Add parser for XLSB files
     ({{{http://issues.apache.org/jira/browse/TIKA-1195}TIKA-1195}}).

   * Add parsers for EMF/WMF files
     ({{{http://issues.apache.org/jira/browse/TIKA-2246}TIKA-2246}}/{{{http://issues.apache.org/jira/browse/TIKA-2247}TIKA-2247}}).

   * Add parsers for WordPerfect and QuattroPro (.qpw) files.
     Contributed by Pascal Essiembre
     ({{{http://issues.apache.org/jira/browse/TIKA-1946}TIKA-1946}} and
     {{{http://issues.apache.org/jira/browse/TIKA-2228}TIKA-2228}}).

   * Add experimental SAX parser for .pptx files. To select this parser,
     set useSAXPptxExtractor(true) on OfficeParserConfig
     ({{{http://issues.apache.org/jira/browse/TIKA-2210}TIKA-2210}}).

   * Add experimental SAX parser for .docx files. To select this parser,
     set useSAXDocxExtractor(true) on OfficeParserConfig
     ({{{http://issues.apache.org/jira/browse/TIKA-1321}TIKA-1321}},
     {{{http://issues.apache.org/jira/browse/TIKA-2191}TIKA-2191}}).

   * Add mime detection and parser for Word 2006ML format
     ({{{http://issues.apache.org/jira/browse/TIKA-2179}TIKA-2179}}).

   * Bug fix for WordPerfect via Pascal Essiembre
     ({{{http://issues.apache.org/jira/browse/TIKA-2352}TIKA-2352}}).

   * Added "text-main" equivalent option to tika-server via
     /tika/main
     ({{{http://issues.apache.org/jira/browse/TIKA-2343}TIKA-2343}}).

   * Enabled configuration of the EncodingDetector used by
     parsers that extend AbstractEncodingDetectorParser
     ({{{http://issues.apache.org/jira/browse/TIKA-2273}TIKA-2273}}).

   * Prevent easily preventable OOMs for both detection and parsing
     of some compression formats
     ({{{http://issues.apache.org/jira/browse/TIKA-2330}TIKA-2330}}).

   * Extract images and thumbnails from ODT via Sam Bayer
     ({{{http://issues.apache.org/jira/browse/TIKA-2295}TIKA-2295}}).

   * Fix potential NPE in FeedParser via Julien Nioche
     ({{{http://issues.apache.org/jira/browse/TIKA-2269}TIKA-2269}}).

   * Official mime types for BMP, EMF and WMF have been registered with
     IANA, so switch to these (image/bmp image/emf image/wmf)
     ({{{http://issues.apache.org/jira/browse/TIKA-2250}TIKA-2250}}).

   * Be more parsimonious with BufferedInputStreams via Josh Hight
     ({{{http://issues.apache.org/jira/browse/TIKA-2244}TIKA-2244}}).

   * Enable handling of hyphenated language codes in TesseractOCRParser
     via Graham Russell
     ({{{http://issues.apache.org/jira/browse/TIKA-2231}TIKA-2231}}).

   * Improve style tags in ODT
     ({{{http://issues.apache.org/jira/browse/TIKA-2242}TIKA-2242}}).

   * Add container detection for embedded MSEquation files
     ({{{http://issues.apache.org/jira/browse/TIKA-2238}TIKA-2238}}).

   * Add parsing of JBIG2 and extraction of JBIG2 from PDFs when
     required dependencies are added to class path by user.
     Contributed by Pascal Essiembre
     ({{{http://issues.apache.org/jira/browse/TIKA-2232}TIKA-2232}}).

   * Mime magic for the OneNote family (.one / .onetoc / .onepkg), no parser
     ({{{http://issues.apache.org/jira/browse/TIKA-2224}TIKA-2224}}).

   * Add configurability of "preserve-interword-spacing" to
     TesseractOCRParser
     ({{{http://issues.apache.org/jira/browse/TIKA-2190}TIKA-2190}}).

   * Upgrade PDFBox to 2.0.6 and JempBox 1.8.13
     ({{{http://issues.apache.org/jira/browse/TIKA-2361}TIKA-2361}}).

   * Refactor MockParser to consolidate service loading
     and mime types into tika-core/src/test
     ({{{http://issues.apache.org/jira/browse/TIKA-2195}TIKA-2195}}).

   * Enabled extraction of embedded objects from headers, footers,
     footnotes, endnotes and comments in legacy .docx parser
     ({{{http://issues.apache.org/jira/browse/TIKA-2192}TIKA-2192}}).

   * Allow extraction of PDActions (including JavaScript) from
     PDFs ({{{http://issues.apache.org/jira/browse/TIKA-2090}TIKA-2090}}).
     This is turned off by default. Users
     must setExtractActions(true) on the PDFParserConfig.

   * Change default behavior in experimental .docx parser to ignore
     deleted text to align with .doc
     ({{{http://issues.apache.org/jira/browse/TIKA-2187}TIKA-2187}}).

   * Upgrade to Apache POI 3.16
     ({{{http://issues.apache.org/jira/browse/TIKA-2116}TIKA-2116}},
     {{{http://issues.apache.org/jira/browse/TIKA-2181}TIKA-2181}},
     {{{http://issues.apache.org/jira/browse/TIKA-2329}TIKA-2329}}).

   * Allow configuration of timeout for ForkParser
     ({{{http://issues.apache.org/jira/browse/TIKA-2170}TIKA-2170}}).

   * Add extraction of .jpx inline images from PDFs when required
     dependencies are added by user to class path
     ({{{http://issues.apache.org/jira/browse/TIKA-2175}TIKA-2175}}).

   * Add .jpx, .jp2, .ppm to formats handled by Tesseract
     ({{{http://issues.apache.org/jira/browse/TIKA-2174}TIKA-2174}}).

   * Upgrade "provided" SQLite to 3.16.1
     ({{{http://issues.apache.org/jira/browse/TIKA-2334}TIKA-2334}}).

   * Upgrade CXF version to 3.0.12
     ({{{http://issues.apache.org/jira/browse/TIKA-2292}TIKA-2292}}).

   * Add Lingo24 Language Detector
     ({{{http://issues.apache.org/jira/browse/TIKA-2297}TIKA-2297}}).

   * Further mime magic for WebVTT
     ({{{http://issues.apache.org/jira/browse/TIKA-1772}TIKA-1772}}).

   * Extend support for increased PSM options up to 13 for modern
     versions of Tesseract
     ({{{http://issues.apache.org/jira/browse/TIKA-2357}TIKA-2357}}).

   The following people have contributed to Tika 1.15 by submitting or
   commenting on the issues resolved in this release:

   * Adam Carroll

   * Aeham Abushwashi

   * Anastasija Mensikova

   * Bipul Kumar

   * Chris A. Mattmann

   * Dave Meikle

   * David Pilato

   * Fabio

   * Frederic Ronny

   * Jan Van Raemdonck

   * Jasper Hafkenscheid

   * Jorge Spinsanti

   * Joshua Hight

   * Julian

   * Julien Nioche

   * Ken Krugler

   * Kevin Oberlag

   * Konstantin Gribov

   * Laszlo Marai

   * Lewis John McGibbney

   * Luis Filipe Nassif

   * Madhav Sharan

   * Matthew Caruana Galizia

   * Michal Hlavac

   * Mike Liu

   * Nick Burch

   * Nick C

   * Nino Skopac

   * Panagiotis Mpailis

   * Pascal Essiembre

   * Peter Weiss

   * Robin Schimpf

   * Sean Story

   * senthil

   * Sergey Beryozkin

   * Seva Alekseyev

   * Thamme Gowda

   * Thomas Galla

   * Tim Allison

   * Tim Kingsbury

   See {{https://s.apache.org/XowY}} for more details on these contributions.