The Apache Tika toolkit detects and extracts metadata and structured text content from documents in a variety of formats, using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
Tika is a project of the Apache Software Foundation and was formerly a subproject of Apache Lucene.