ApacheCon NA 2010 Session

Hive Design

Hive is an open source, peta-byte scale date warehousing framework built on top of Hadoop that enables scalable analytics on large data sets using SQL and some language extensions. Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis and business intelligence applications used by analysts across the company, a number of Facebook products are also based on analytics. These products range from simple reporting applications like Insights for the Facebook Ad Network, to more advanced machine learning applications. As a result a flexible infrastructure that caters to the needs of these diverse applications and users and that also scales up in a cost effective manner with the ever increasing amounts of data being generated on Facebook, is critical. Hive fills that need and brings the power of Hadoop to users who are familar with SQL. It is flexible enough to understand different data formats (including custom formats) and also allows users to embedded cutom map/reduce logic or functions within a SQL like query. It is powerful enough to support many different kinds of analytics applications. In this presentation we will be talking in more detail about Hive, the motivations behind it and how it is used at Facebook to analyze and manage 6PB of compressed data in our Hadoop cluster.