                     ---------------
                     Apache Tika 0.5
                     ---------------

~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.

Apache Tika 0.5

   The most notable changes in Tika 0.5 over the previous release are:

     * Improved RDF/OWL mime detection using both MIME magic and
       pattern matching.
       ({{{https://issues.apache.org/jira/browse/TIKA-309}TIKA-309}})

     * An org.apache.tika.Tika facade class has been added to simplify
       common text extraction and type detection use cases.
       ({{{https://issues.apache.org/jira/browse/TIKA-269}TIKA-269}})

     * A new parse context argument was added to the Parser.parse() method.
       This context map can be used to pass things like a delegate parser
       or other settings to the parsing process. The previous parse()
       method signature has been deprecated and will be removed in Tika 1.0.
       ({{{https://issues.apache.org/jira/browse/TIKA-275}TIKA-275}})

     * A simple ngram-based language detection mechanism has been added
       along with predefined language profiles for 18 languages.
       ({{{https://issues.apache.org/jira/browse/TIKA-209}TIKA-209}})

     * The media type registry in Tika was synchronized with the MIME type
       configuration in the Apache HTTP Server. Tika now knows about 1274
       different media types and can detect 672 of those using 927 file
       extension and 280 magic byte patterns.
       ({{{https://issues.apache.org/jira/browse/TIKA-285}TIKA-285}})

     * Tika now uses Apache PDFBox version 0.8.0-incubating for parsing
       PDF documents. This version is notably better than the 0.7.3
       release used earlier.
       ({{{https://issues.apache.org/jira/browse/TIKA-158}TIKA-158}})

   The following people have contributed to Tika 0.5 by submitting or
   commenting on the issues resolved in this release:

     * Alex Baranov

     * Bart Hanssens

     * Benson Margulies

     * Chris A. Mattmann

     * Daan de Wit

     * Erik Hetzner

     * Frank Hellwig

     * Jeff Cadow

     * Joachim Zittmayr

     * Jukka Zitting

     * Julien Nioche

     * Ken Krugler

     * Maxim Valyanskiy

     * MRIT64

     * Paul Borgermans

     * Piotr B.

     * Robert Newson

     * Sascha Szott

     * Ted Dunning

     * Thilo Goetz

     * Uwe Schindler

     * Yuan-Fang Li

   See {{http://tinyurl.com/yl9prwp}} for more details on these contributions.
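
   To give a feel for the ngram-based language detection listed among the
   notable changes (TIKA-209), here is a minimal, self-contained sketch of
   the general idea: build a character trigram profile per language, then
   pick the profile closest to the input text. This is not Tika's
   implementation or API. The class and method names (NgramLanguageSketch,
   profile, similarity, detect) are hypothetical, cosine similarity is an
   assumed scoring choice, and the two tiny training sentences stand in for
   the predefined profiles that Tika ships.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative trigram language guesser; a sketch, not Tika's code. */
public class NgramLanguageSketch {

    /** Counts overlapping character trigrams in the lower-cased text. */
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Collapse runs of non-letters into one space so that word
        // boundaries become part of the trigrams.
        String normalized = " "
                + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim()
                + " ";
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors, in [0, 1]. */
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int count = e.getValue();
            normA += (double) count * count;
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) count * other;
            }
        }
        for (int count : b.values()) {
            normB += (double) count * count;
        }
        if (normA == 0 || normB == 0) {
            return 0; // an empty profile matches nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the best-matching profile, or null if none overlaps. */
    public static String detect(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> sample = profile(text);
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = similarity(sample, e.getValue());
            if (score > bestScore) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy profiles built from one sentence each (assumed sample data).
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat"));
        profiles.put("de", profile("der schnelle braune fuchs springt und der hund und die katze schlafen"));

        System.out.println(detect("the dog and the fox", profiles));
        System.out.println(detect("der hund und der fuchs", profiles));
    }
}
```

   Real profiles are built from much larger corpora, and short inputs remain
   unreliable with any ngram approach, so treat a result on a few words as a
   hint rather than a decision.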