January 2015 in the Flink community
Happy 2015! Here is a (hopefully digestible) summary of what happened last month in the Flink community.
0.8.0 release
Flink 0.8.0 was released. See here for the release notes.
Flink roadmap
The community has published a roadmap for 2015 on the Flink wiki. Check it out to see what is coming up in Flink, and pick up an issue to contribute!
Scaling ALS
Flink committers employed at data Artisans published a blog post on how they scaled matrix factorization with Flink and Google Compute Engine to matrices with 28 billion elements.
Articles in the press
The Apache Software Foundation announced Flink as a Top-Level Project. The announcement was picked up by the media, e.g., here, here, and here.
Hadoop Summit
A submitted abstract on Flink Streaming won the community vote at “The Future of Hadoop” track.
Meetups and talks
Flink was presented at the Paris Hadoop User Group, the Bay Area Hadoop User Group, the Apache Tez User Group, and FOSDEM 2015. The January Flink meetup in Berlin had talks on recent community updates and new features.
Notable code contributions
Note: Code contributions listed here may not be part of a release or even the Flink master repository yet.
Using off-heap memory
This pull request enables Flink to use off-heap memory for its internal memory uses (sort, hash, caching of intermediate data sets).
Gelly, Flink’s Graph API
This pull request introduces Gelly, Flink’s brand new Graph API. Gelly offers a native graph programming abstraction with functionality for vertex-centric programming, as well as available graph algorithms. See this slide set for an overview of Gelly.
Semantic annotations
Semantic annotations are a powerful mechanism to expose information about the behavior of Flink functions to Flink’s optimizer. The optimizer can leverage this information to generate more efficient execution plans. For example the output of a Reduce operator that groups on the second field of a tuple is still partitioned on that field if the Reduce function does not modify the value of the second field. By exposing this information to the optimizer, the optimizer can generate plans that avoid expensive data shuffling and reuse the partitioned output of Reduce. Semantic annotations can be defined for most data types, including (nested) tuples and POJOs. See the snapshot documentation for details (not online yet).
New YARN client
The improved YARN client of Flink now allows users to deploy Flink on YARN for executing a single job. Older versions only supported a long-running YARN session. The code of the YARN client has been refactored to provide an (internal) Java API for controlling YARN clusters more easily.