                     ---------------
                     Apache Tika 0.8
                     ---------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Apache Tika 1.1

   The most notable changes in Tika 1.1 over the previous release are:

   * Link Extraction: The rel attribute is now extracted from links by the
     LinkContentHandler.
     ({{{http://issues.apache.org/jira/browse/TIKA-824}TIKA-824}})

   * MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously the
     last character in a UTF-16 tag could be corrupted)
     ({{{http://issues.apache.org/jira/browse/TIKA-793}TIKA-793}})

   * Performance: Loading of the default media type registry is now
     significantly faster.
     ({{{http://issues.apache.org/jira/browse/TIKA-780}TIKA-780}})

   * PDF: Allow controlling whether overlapping duplicated text should be
     removed.  Disabling this (the default) can give big speedups to text
     extraction and may work around cases where non-duplicated characters
     were incorrectly removed
     ({{{http://issues.apache.org/jira/browse/TIKA-767}TIKA-767}}).
     Allow controlling whether text tokens should be sorted by their x/y
     position before extracting text
     ({{{http://issues.apache.org/jira/browse/TIKA-612}TIKA-612}}); this is
     necessary for certain PDFs.  Fixed cases where too many \</p\> tags
     appear in the XHTML output, causing NPE when opening some PDFs with
     the GUI
     ({{{http://issues.apache.org/jira/browse/TIKA-778}TIKA-778}}).

   * RTF: Fixed case where a font change would result in processing bytes
     in the wrong font's charset, producing bogus text output
     ({{{http://issues.apache.org/jira/browse/TIKA-777}TIKA-777}}).
     Don't output whitespace in ignored group states, avoiding excessive
     whitespace output
     ({{{http://issues.apache.org/jira/browse/TIKA-781}TIKA-781}}).
     Binary embedded content (using the \\bin control word) is now skipped
     correctly; previously it could cause the parser to incorrectly
     extract binary content as text
     ({{{http://issues.apache.org/jira/browse/TIKA-782}TIKA-782}}).

   * CLI: New TikaCLI option "--list-detectors", which displays the
     mimetype detectors that are available, similar to the existing
     "--list-parsers" option for parsers
     ({{{http://issues.apache.org/jira/browse/TIKA-785}TIKA-785}}).

   * Detectors: The order of detectors, as supplied via the service
     registry loader, is now controlled.  User supplied detectors are
     preferred, then Tika detectors (such as the container aware ones),
     and finally the core Tika MimeTypes is used as a backup.  This allows
     specific, detailed detectors to take precedence over the default
     mime magic + filename detector.
     ({{{http://issues.apache.org/jira/browse/TIKA-786}TIKA-786}})

   * Microsoft Project (MPP): Filetype detection has been fixed, and basic
     metadata (but no text) is now extracted.
     ({{{http://issues.apache.org/jira/browse/TIKA-789}TIKA-789}})

   * Outlook: Fixed NullPointerException in TikaGUI when messages with
     embedded RTF or HTML content were filtered
     ({{{http://issues.apache.org/jira/browse/TIKA-801}TIKA-801}}).

   * Ogg Vorbis and FLAC: Parser added for Ogg Vorbis and FLAC audio
     files, which extract audio metadata and tags
     ({{{http://issues.apache.org/jira/browse/TIKA-747}TIKA-747}}).

   * MP4: Improved mime magic detection for MP4 based formats (including
     QuickTime, MP4 Video and Audio, and 3GPP)
     ({{{http://issues.apache.org/jira/browse/TIKA-851}TIKA-851}}).

   * MP4: Basic metadata extracting parser for MP4 files added, which
     includes limited audio and video metadata, along with the iTunes
     media metadata (such as Artist and Title)
     ({{{http://issues.apache.org/jira/browse/TIKA-852}TIKA-852}}).

   * Document Passwords: A new ParseContext object, PasswordProvider, has
     been added.  This provides a way to supply the password for a
     document during processing.  Currently, only password protected PDFs
     and Microsoft OOXML Files are supported.
     ({{{http://issues.apache.org/jira/browse/TIKA-850}TIKA-850}}).

   The following people have contributed to Tika 1.1 by submitting or
   commenting on the issues resolved in this release:

   * Alex Ott

   * Alexander Chow

   * Ali Oral

   * Andrzej Bialecki

   * Antoni Mylka

   * Arjohn Kampman

   * Bastian Mathes

   * Chris A. Mattmann

   * Craig Stires

   * David Tran

   * Etienne Jouvin

   * Fabian Lange

   * Geoff Jarrad

   * Jan Høydahl

   * Jerome Lacoste

   * John Mastarone

   * Jukka Zitting

   * Julien Nioche

   * Ken Krugler

   * Lau Brino

   * Markus Jelsma

   * Maxim Valyanskiy

   * Michael McCandless

   * Nick Burch

   * Pablo Queixalos

   * Paul Hill

   * Paul Pearcy

   * peter royal

   * PNS

   * Radek

   * Ray Gauss II

   * Stephan Mühlstrasser

   * Swapna Vuppala

   * Torsten Krah

   * William Seemann

   * Yegor Kozlov

   See {{http://s.apache.org/Jn4}} for more details on these contributions.