Apache Flink

Fast and reliable large-scale data processing engine

Apache Flink is Fast

Fast

State-of-the art performance exploiting in-memory processing and data streaming.

Apache Flink is Reliable

Reliable

Flink is designed to perform very well even when the cluster's memory runs out.

Apache Flink is Expressive

Expressive

Write beautiful, type-safe code in Java and Scala. Execute it on a cluster.

Apache Flink is Easy to use

Easy to use

Few configuration parameters required. Cost-based optimizer built in.

Apache Flink is Scalable

Scalable

Tested on clusters of 100s of machines, Google Compute Engine, and Amazon EC2.

Apache Flink is Hadoop integrated

Hadoop-compatible

Flink runs on YARN and HDFS and has a Hadoop compatibility package.

Fast

Flink exploits in-memory data streaming and integrates iterative processing deeply into the system runtime.

This makes the system extremely fast for data-intensive and iterative jobs.

Apache Flink is Fast
Apache Flink is Reliable

Reliable and Scalable

Flink is designed to perform well when memory runs out.

Flink contains its own memory management component, serialization framework, and type inference engine.

Tested in clusters of 100s of nodes, Amazon EC2, and Google Compute Engine.

Expressive

Write beautiful, type-safe, and maintainable code in Java or Scala. Execute it on a cluster.

You can use native Java and Scala data types without packing them into key-value pairs, logical field addressing, and a wealth of operators.

Word Count in Flink's Scala API

case class Word (word: String, frequency: Int)

val counts = text
  .flatMap {line => line.split(" ").map(word => Word(word,1))}
  .groupBy("word").sum("frequency")

Transitive Closure

case class Path (from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges).where("to").equalTo("from") {
      (path, edge) => Path(path.from, edge.to)
    }
    .union(paths).distinct()
  next
}
Apache Flink is Easy to use

Easy to use

Flink requires few configuration parameters. And the system's bult-in optimizer takes care of finding the best way to execute the program in any enviroment.

Run on YARN with 3 commands, in a stand-alone cluster, or locally in a debugger.

Apache Flink integrated with hadoop

Compatible with Hadoop

Flink supports all Hadoop input and output formats and data types.

You can run your legacy MapReduce operators unmodified and faster on Flink.

Flink can read data from HDFS and HBase, and runs on top of YARN.

System Overview

Flink contains APIs in Java and Scala for analyzing data from batch and streaming data sources, as well as its own optimizer and distributed runtime with custom memory management