ApacheCon NA 2010 Session

Scientific data curation and processing with Apache Tika

In the first part of the talk, I will give a detailed overview of Apache Tika, a content analysis and detection toolkit from the ASF. Tika provides flexible interfaces for exploring the information landscape: a Java API, a command-line tool, and an interactive GUI. I'll cover Tika's major components, including its MIME type detection system, its parsing framework (which integrates several major third-party parsing libraries), its language identifier, and its metadata system.

The second part of the talk will show how NASA processes tens to hundreds of terabytes of scientific data in myriad formats (e.g., HDF4/HDF5, NetCDF-4, GRIB) and how it harmonizes the data's associated metadata models (e.g., HDF-EOS, CF, FGDC) using Tika. The discussion will center on the architecture of NASA's processing systems for the Earth science Decadal Survey missions, including OCO and SMAP, and the important open source technologies like Tika that help implement that architecture. At NASA, Tika is being used in concert with other Apache technologies, including OODT (a grid technology for science data processing currently incubating at Apache), Apache Lucene/Solr, and Apache SIS (another Incubator project whose goal is to provide a computational library for geospatial data), to automate, virtualize, and increase the efficiency of NASA's science data processing pipeline.
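
To give a flavor of the Java interface mentioned above, here is a minimal sketch that runs MIME type detection, parsing with metadata extraction, and language identification over a single file. Class and package names follow roughly the Tika 1.x API and may differ between releases.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.language.LanguageIdentifier;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaTour {

        public static void main(String[] args) throws Exception {
            File file = new File(args[0]);
            Metadata metadata = new Metadata();
            metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

            // 1. MIME type detection: combines magic-byte and filename heuristics.
            Detector detector = new DefaultDetector();
            InputStream detectStream = TikaInputStream.get(new FileInputStream(file));
            try {
                MediaType type = detector.detect(detectStream, metadata);
                System.out.println("Detected MIME type: " + type);
            } finally {
                detectStream.close();
            }

            // 2. Parsing: AutoDetectParser routes to the appropriate parser,
            //    filling the Metadata object while streaming text to the handler.
            Parser parser = new AutoDetectParser();
            BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
            InputStream parseStream = TikaInputStream.get(new FileInputStream(file));
            try {
                parser.parse(parseStream, text, metadata, new ParseContext());
            } finally {
                parseStream.close();
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }

            // 3. Language identification over the extracted plain text.
            LanguageIdentifier lang = new LanguageIdentifier(text.toString());
            System.out.println("Language: " + lang.getLanguage());
        }
    }

Scientific formats enter this same pipeline through dedicated parsers (Tika ships parsers for formats such as NetCDF and HDF), so metadata extraction for the NASA data described in the second half of the talk looks like just another parse call against the interface above.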