Apache Tika 1.18
The most notable changes in Tika 1.18 over the previous release are:
- Upgrade to Jackson 2.9.5 (TIKA-2634).
- Add support for brotli (TIKA-2621).
- Upgrade PDFBox to 2.0.9 and include new jbig2-imageio from org.apache.pdfbox (TIKA-2579 and TIKA-2607).
- Support for TIFF images in PDF files (TIKA-2338)
- Detection of fully encrypted 7z files (TIKA-2568)
- Various new mimes and typo fixes in tika-mimetypes.xml via Andreas Meier (TIKA-2527).
- Revert to listenForAllRecords=false in ExcelExtractor via Grigoriy Alekseev (TIKA-2590)
- Add workaround to identify TIFFs that might confuse commons-compress's tar detection via Daniel Schmidt (TIKA-2591)
- Ignore non-IANA supported charsets in HTML meta-headers during charset detection in HTMLEncodingDetector via Andreas Meier (TIKA-2592)
- Add detection and parsing of zstd (if user provides com.github.luben:zstd-jni) via Andreas Meier (TIKA-2576)
- Allow for RFC822 detection for files starting with "dkim-" and/or "x-" via Andreas Meier (TIKA-2578 and TIKA-2587)
- Extract xlsx files embedded in OLE objects within PPT and PPTX via Brian McColgan (TIKA-2588).
- Extract files embedded in HTML and JavaScript inside HTML that are stored in the Data URI scheme (TIKA-2563).
- Extract text from grouped text boxes in PPT (TIKA-2569).
- Extract language metadata item from PDF files via Matt Sheppard (TIKA-2559)
- In RFC822 messages with multipart/mixed, treat the first text element as the main body of the email, not as an attachment (TIKA-2547).
- Swap out com.tdunning:json for com.github.openjson:openjson to avoid jar conflicts (TIKA-2556).
- No longer hardcode HtmlParser for XML files in tika-server (TIKA-2551).
- Require Java 8 (TIKA-2553).
- Add a parser for XPS (TIKA-2524).
- Add mime magic for Dolby Digital AC3 and EAC3 files
- Fixed bug where TesseractOCRParser ignores configured ImageMagickPath, and set rotation script to ignore Python warnings (TIKA-2509)
- Upgrade geo-apis to 3.0.1 (TIKA-2535).
- Added local Docker image build using dockerfile-maven-plugin to allow images to be built from source (TIKA-1518).
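As an illustration of the Data URI item above (TIKA-2563), the following is a minimal sketch of how a "data:" URI can be split into its media type and decoded payload using only the Java 8 standard library. It is not Tika's implementation; the class name DataUriSketch, the holder type DataUri, and the method parse are hypothetical names chosen for this example, and the default media type is the one given in RFC 2397.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;

// Hypothetical illustration only; this is not Tika's own code.
public class DataUriSketch {

    /** Simple holder for the declared media type and the decoded bytes. */
    public static final class DataUri {
        public final String mediaType;
        public final byte[] data;

        DataUri(String mediaType, byte[] data) {
            this.mediaType = mediaType;
            this.data = data;
        }
    }

    /** Parses "data:[<mediatype>][;base64],<payload>" as described in RFC 2397. */
    public static DataUri parse(String uri) {
        if (uri == null || !uri.regionMatches(true, 0, "data:", 0, 5)) {
            throw new IllegalArgumentException("Not a data URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Data URI has no comma separator");
        }
        String header = uri.substring(5, comma);
        String payload = uri.substring(comma + 1);

        // A trailing ";base64" marks a base64 payload; otherwise it is percent-encoded.
        boolean base64 = header.toLowerCase(Locale.ROOT).endsWith(";base64");
        String mediaType = base64
                ? header.substring(0, header.length() - ";base64".length())
                : header;
        if (mediaType.isEmpty()) {
            mediaType = "text/plain;charset=US-ASCII"; // default per RFC 2397
        }

        // The MIME decoder tolerates line breaks, which can occur in embedded HTML.
        byte[] data = base64
                ? Base64.getMimeDecoder().decode(payload)
                : percentDecode(payload);
        return new DataUri(mediaType, data);
    }

    /** Decodes %XX escapes; any other byte is copied through unchanged. */
    private static byte[] percentDecode(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        for (int i = 0; i < raw.length; i++) {
            int hi = (raw[i] == '%' && i + 2 < raw.length)
                    ? Character.digit((char) raw[i + 1], 16) : -1;
            int lo = hi >= 0 ? Character.digit((char) raw[i + 2], 16) : -1;
            if (hi >= 0 && lo >= 0) {
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(raw[i]);
            }
        }
        return out.toByteArray();
    }
}
```

In a pipeline like the one the item describes, the decoded bytes and media type could then be handed to an embedded-document handler. That step is omitted here because it depends on the parser in use.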
The following people have contributed to Tika 1.18 by submitting or commenting on the issues resolved in this release:
- Andreas Meier
- Andrei Rebegea
- Anto
- Asela
- Brian McColgan
- Daniel Schmidt
- Dave Meikle
- David Pilato
- Ewan Mellor
- Grigoriy Alekseev
- Guillaume Smet
- Julian Reschke
- Konstantin Gribov
- Luis Filipe Nassif
- Manolo Caracuel
- Marc Prudhommeaux
- Matt Sheppard
- Nick Burch
- Nicolas Belisle
- Nik Everett
- Ohad R
- Peter Davies
- Richard A
- Richard Jones
- Sasha Goodman
- Stefan Sveen
- Tim Allison
See https://s.apache.org/CJNU for more details on these contributions.