Apache Tika 2.8.0
The most notable changes in Tika 2.8.0 over the previous release are:
- Enable counting and/or parsing of incremental updates in PDFs. This is an experimental feature and may change in later releases (TIKA-4017).
- Fixed bug that prevented the loading of CompositeExternalParser in tika-app and tika-server-standard. This parser will call exiftool and ffmpeg if those are installed, as was the behavior in Tika 1.x. Exclude org.apache.tika.parser.external.CompositeExternalParser if you do not want this behavior (TIKA-4022).
- Geotopic parser moved back to o.a.t.parser.geo (TIKA-4009).
- Removed the shading of tika-parsers-standard-module (TIKA-4038).
- Enable optional extraction of file system metadata in FileSystemFetcher (TIKA-4035).
- Allow pretty printing in FileSystemEmitter (TIKA-4034).
- Add detection and a new mime type, "application/illustrator+ps", for older PostScript-based Adobe Illustrator files (TIKA-3971).
- Add magic detection for Canon raw file types: crw, cr2 and cr3 (TIKA-3991).
- Add detection for ONIX message files (TIKA-4011).
- Add detection and a parser for ActiveMime files (TIKA-3987).
- Add extraction of rendition layout value and version from Epub (TIKA-4013).
- Improve embedded file extraction from PDFs (TIKA-4012).
- Improve metadata extraction from WARCs (TIKA-4018).
- Update to PDFBox 2.0.28 (TIKA-4016).
- Users may now avoid the ZeroByteFileException via a setting on the AutoDetectParserConfig (TIKA-3976).
- Fix bug in closing a elements in the presence of b elements in RTF files (TIKA-3972).
- Improve extraction of embedded file names in .docx (TIKA-3968).
- Normalize author, title, subject and description to their Dublin Core properties in the HTMLParser (TIKA-3963).
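For the CompositeExternalParser change (TIKA-4022), excluding the parser might look like the tika-config.xml sketch below. It is a minimal illustration that assumes the parser-exclude mechanism on DefaultParser applies to this parser in your deployment; check it against the configuration documentation for your Tika version before relying on it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes CompositeExternalParser is loaded through DefaultParser -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Skip the parser that shells out to exiftool and ffmpeg -->
      <parser-exclude class="org.apache.tika.parser.external.CompositeExternalParser"/>
    </parser>
  </parsers>
</properties>
```

A file like this is typically passed to tika-app or tika-server at startup via their config option.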
The following people have contributed to Tika 2.8.0 by submitting or commenting on the issues resolved in this release:
- Amit Pandey
- Chris Mattmann
- Gregory Lepore
- Josh Burchard
- Tayseer Sabha
- Thomas Ledoux
- Tilman Hausherr
- Tim Allison
See https://s.apache.org/sigxx for more details on these contributions.