Pig is a platform for data flow programming on large data sets in a parallel
environment. It consists of a language to specify these programs,
Pig Latin,
a compiler for this language, and an execution engine to execute the programs.
Pig runs on Hadoop
MapReduce, reading data from and writing data to HDFS, and doing processing via
one or more MapReduce jobs.
Design
This section gives a very high-level overview of the design of the Pig system.
Throughout the documentation, you can find the design of a given package or class by
looking for the Design heading in its documentation.
Overview
Pig's design is guided by our
Pig philosophy.
Pig shares many similarities with a traditional RDBMS design. It has a parser,
type checker, optimizer, and operators that perform the data processing. However,
there are some
significant differences. Pig does not have a data catalog, it has no
transactions, it does not directly manage data storage, and it does not implement the
execution framework.
High Level Architecture
Pig is split between the front and back ends of the engine. In the front end,
the parser transforms a Pig Latin script into a Logical Plan.
Semantic checks (such
as type checking) and some optimizations (such as determining which fields in the data need
to be read to satisfy the script) are done on this Logical Plan. The Logical
Plan is then transformed into a
{@link org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan}.
This Physical Plan contains the operators that will be applied to the data. The
{@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler} then divides it into
a set of MapReduce jobs, represented as an
{@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.MROperPlan}. This
MROperPlan (also called the MapReduce plan) is then optimized: for example, the combiner is used where
possible, and jobs that scan the same input data are combined where possible. Finally, a set of
MapReduce jobs is generated by the
{@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler}. These are
submitted to Hadoop and monitored by the
{@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher}.
On the back end, each
{@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Map},
{@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner.Combine}, and
{@link org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce}
uses the pipeline of physical operators constructed in the front end to load, process, and store
data.
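The division of a physical plan into MapReduce jobs described above can be pictured with a small toy model. This is only a sketch under simplifying assumptions, not the algorithm used by MRCompiler: the class <code>MRSplitSketch</code>, the <code>Op</code> enum, and the <code>Job</code> type are hypothetical names invented for illustration, and the only rule modeled is that an operator needing a shuffle (here just GROUP) moves execution from the map side to the reduce side, and a second shuffle forces a new job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only (hypothetical names); this is not Pig's MRCompiler. */
class MRSplitSketch {

    /** Hypothetical operator kinds; Pig's real physical operators are far richer. */
    enum Op {
        LOAD, FILTER, FOREACH, GROUP, STORE;

        /** In this toy model only GROUP requires a shuffle. */
        boolean needsShuffle() {
            return this == GROUP;
        }
    }

    /** One toy MapReduce job: operators that run map-side and reduce-side. */
    static final class Job {
        final List<Op> map = new ArrayList<>();
        final List<Op> reduce = new ArrayList<>();

        @Override
        public String toString() {
            return "map=" + map + " reduce=" + reduce;
        }
    }

    /** Walks a linear plan and assigns each operator to a job and a phase. */
    static List<Job> split(List<Op> plan) {
        List<Job> jobs = new ArrayList<>();
        Job current = new Job();
        jobs.add(current);
        boolean inReduce = false;
        for (Op op : plan) {
            if (op.needsShuffle()) {
                if (inReduce) {
                    // A job has one shuffle, so a second one starts a new job.
                    // (A real compiler would also store and reload intermediate
                    // data between the jobs; the toy model omits that.)
                    current = new Job();
                    jobs.add(current);
                }
                inReduce = true;
                current.reduce.add(op); // grouped rows are assembled reduce-side
            } else if (inReduce) {
                current.reduce.add(op);
            } else {
                current.map.add(op);
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<Op> plan = Arrays.asList(
                Op.LOAD, Op.GROUP, Op.FOREACH, Op.GROUP, Op.FOREACH, Op.STORE);
        for (Job job : split(plan)) {
            System.out.println(job);
        }
    }
}
```

In this model a plan with a single GROUP fits in one job, while the plan in <code>main</code>, which groups twice, is split into two jobs.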
Programmatic Interface
In addition to the command line and Grunt interfaces, users can connect to
{@link org.apache.pig.PigServer} from a Java program.
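As a rough illustration of embedding Pig in a Java program, the sketch below registers a two-statement Pig Latin script one statement at a time. So that it compiles without Pig on the classpath, it talks to a hypothetical <code>QueryRegistrar</code> interface that stands in for the <code>registerQuery(String)</code> method of PigServer; the class name <code>EmbeddedPigSketch</code>, the input path, and the aliases are all illustrative assumptions. The comment in <code>main</code> shows how the real wiring would roughly look; check the PigServer documentation for the release in use before relying on it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: QueryRegistrar is a hypothetical stand-in for PigServer. */
class EmbeddedPigSketch {

    /** Mirrors the shape of PigServer's registerQuery(String) method. */
    interface QueryRegistrar {
        void registerQuery(String query) throws IOException;
    }

    /**
     * Registers a small script that loads a colon-delimited file and keeps
     * its first field. The path and aliases (A, B) are illustrative.
     */
    static void registerIdScript(QueryRegistrar pig, String inputPath)
            throws IOException {
        pig.registerQuery("A = load '" + inputPath + "' using PigStorage(':');");
        pig.registerQuery("B = foreach A generate $0 as id;");
    }

    public static void main(String[] args) throws IOException {
        // With Pig on the classpath, the wiring would look roughly like:
        //   PigServer pigServer = new PigServer(ExecType.LOCAL);
        //   registerIdScript(pigServer::registerQuery, "passwd");
        //   pigServer.store("B", "id.out");
        // Here the statements are only recorded and printed.
        List<String> recorded = new ArrayList<>();
        registerIdScript(recorded::add, "passwd");
        recorded.forEach(System.out::println);
    }
}
```

Keeping script registration behind a small interface like this is a design choice of the sketch, not something Pig requires; it simply lets the statements be inspected without starting an execution engine.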
Pig makes it easy for users to extend its functionality by implementing User Defined Functions
(UDFs). There are interfaces for defining functions that load data
({@link org.apache.pig.LoadFunc}), store data ({@link org.apache.pig.StoreFunc}), evaluate
fields, including collections of data, so user-defined aggregates are possible
({@link org.apache.pig.EvalFunc}), and filter data ({@link org.apache.pig.FilterFunc}).
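The difference between evaluating a single field and evaluating a collection can be sketched without any Pig classes. In the toy model below, <code>SketchEvalFunc</code> is a hypothetical stand-in for an evaluation function, a <code>List&lt;Object&gt;</code> plays the role of a row of fields, and a nested list plays the role of a collection of rows; none of these names are part of Pig's API, and a real UDF would be written against the Pig classes linked above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model only: these types are hypothetical and not part of Pig. */
class UdfShapeSketch {

    /** Stand-in for an evaluation function; the List plays the role of a row. */
    interface SketchEvalFunc<T> {
        T exec(List<Object> input);
    }

    /** Scalar function: upper-cases the first field, or returns null if absent. */
    static final SketchEvalFunc<String> UPPER = input ->
            (input == null || input.isEmpty() || input.get(0) == null)
                    ? null
                    : input.get(0).toString().toUpperCase(Locale.ROOT);

    /** Aggregate: the first field is a collection of rows; returns its size. */
    static final SketchEvalFunc<Long> COUNT = input ->
            (input == null || input.isEmpty() || input.get(0) == null)
                    ? 0L
                    : (long) ((List<?>) input.get(0)).size();

    public static void main(String[] args) {
        List<Object> row = Arrays.asList((Object) "pig");
        System.out.println(UPPER.exec(row));

        List<Object> rows = Arrays.asList((Object) "a", "b", "c");
        System.out.println(COUNT.exec(Arrays.asList((Object) rows)));
    }
}
```

Because the aggregate receives the whole collection as one field, the same calling convention serves both per-row functions and user-defined aggregates.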