Apache Beam (incubating)

Apache Beam is an open source, unified model and set of language-specific SDKs for defining data processing workflows that may then be executed on top of a set of supported runners, currently including Apache Flink, Apache Spark, and Google Cloud Dataflow.

Using Apache Beam

You can use Beam for nearly any kind of data processing task, including both batch and streaming data processing. Beam provides a unified data model that can represent any size data set, including an unbounded or infinite data set from a continuously updating data source such as Kafka.

In particular, Beam pipelines can represent high-volume computations, where the steps in your job need to process an amount of data that exceeds the memory capacity of a cost-effective cluster. Beam is particularly useful for Embarrassingly Parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel.

You can also use Beam for Extract, Transform, and Load (ETL) tasks and pure data integration. These tasks are useful for moving data between different storage media and data sources, transforming data into a more desirable format, or loading data onto a new system.

Programming Model

Beam provides a simple and elegant programming model to express your data processing jobs. Each job is represented by a data processing pipeline that you create by writing a program with Beam. Each pipeline is an independent entity that reads some input data, performs some transforms on that data to gain useful or actionable intelligence about it, and produces some resulting output data. A pipeline’s transform might include filtering, grouping, comparing, or joining data.

Beam provides several useful abstractions that allow you to think about your data processing pipeline in a simple, logical way. Beam simplifies the mechanics of large-scale parallel data processing, freeing you from the need to manage orchestration details such as partitioning your data and coordinating individual workers.

Key Concepts

See the programming model documentation to learn more about how Beam implements these concepts.


Twitter

Apache Project

Apache Beam is an Apache Software Foundation project, available under the Apache v2 license.