Upcoming Events

03 Oct 2014

We are happy to announce several upcoming Flink events both in Europe and the US. The series starts with a Flink hackathon in Stockholm (Oct 8-9) and a talk about Flink at the Stockholm Hadoop User Group (Oct 8), followed by the very first Flink Meetup in Berlin (Oct 15). In the US, there will be two Flink Meetup talks: the first one at the Pasadena Big Data User Group (Oct 29) and the second one at Silicon Valley Hands On Programming Events (Nov 4).

We are looking forward to seeing you at any of these events. The following is an overview of each event and links to the respective Meetup pages.

Flink Hackathon, Stockholm (Oct 8-9)

The hackathon will take place at KTH/SICS on Oct 8-9. You can sign up here: https://docs.google.com/spreadsheet/viewform?formkey=dDZnMlRtZHJ3Z0hVTlFZVjU2MWtoX0E6MA.

Here is a rough agenda and a list of topics to work on or look into. Suggestions and more topics are welcome.

Wednesday (8th)

9:00 - 10:00 Introduction to Apache Flink, System overview, and Dev environment (by Stephan)

10:15 - 11:00 Introduction to the topics (Streaming API and system by Gyula & Marton), (Graphs by Vasia / Martin / Stephan)

11:00 - 12:30 Happy hacking (part 1)

12:30 - Lunch (Food will be provided by KTH / SICS. A big thank you to them and also to Paris, for organizing that)

13:xx - Happy hacking (part 2)

Thursday (9th)

Happy hacking (continued)

Suggestions for topics

Streaming
  • Sample streaming applications (e.g. continuous heavy hitters and topics on the twitter stream)

  • Implement a simple SQL to Streaming program parser. Possibly using Apache Calcite (http://optiq.incubator.apache.org/)

  • Implement different windowing methods (count-based, time-based, ...)

  • Implement different windowed operations (windowed-stream-join, windowed-stream-co-group)

  • Streaming state, and interaction with other programs (that access state of a stream program)

Graph Analysis
  • Prototype a Graph DSL (simple graph building, filters, graph properties, some algorithms)

  • Prototype abstractions for different graph processing paradigms (vertex-centric, partition-centric).

  • Generalize the delta iterations, allow flexible state access.

Meetup: Hadoop User Group Talk, Stockholm (Oct 8)

Hosted by Spotify, opens at 6 PM.

http://www.meetup.com/stockholm-hug/events/207323222/

Meetup: First Flink Meetup, Berlin (Oct 15)

We are happy to announce the first Flink meetup in Berlin. You are very welcome to sign up and attend. The event will be held at the Betahaus Cafe.

http://www.meetup.com/Apache-Flink-Meetup/events/208227422/

Meetup: Pasadena Big Data User Group (Oct 29)

http://www.meetup.com/Pasadena-Big-Data-Users-Group/

Meetup: Silicon Valley Hands On Programming Events (Nov 4)

http://www.meetup.com/HandsOnProgrammingEvents/events/210504392/


Apache Flink 0.6 available

26 Aug 2014

We are happy to announce the availability of Flink 0.6. This is the first release of the system inside the Apache Incubator and under the name Flink. Releases up to 0.5 were under the name Stratosphere, the academic and open source project that Flink originates from.

What is Flink?

Apache Flink is a general-purpose data processing engine for clusters. It runs on YARN clusters on top of data stored in Hadoop, as well as stand-alone. Flink currently has programming APIs in Java and Scala. Jobs are executed via Flink's own runtime engine. Flink features:

Robust in-memory and out-of-core processing: once read, data stays in memory as much as possible, and is gracefully de-staged to disk in the presence of memory pressure from limited memory or other applications. The runtime is designed to perform very well both in setups with abundant memory and in setups where memory is scarce.

POJO-based APIs: when programming, you do not have to pack your data into key-value pairs or some other framework-specific data model. Rather, you can use arbitrary Java and Scala types to model your data.

Efficient iterative processing: Flink contains explicit "iterate" operators that enable very efficient loops over data sets, e.g., for machine learning and graph applications.

A modular system stack: Flink is not a direct implementation of its APIs but a layered system. All programming APIs are translated to an intermediate program representation that is compiled and optimized via a cost-based optimizer. Lower-level layers of Flink also expose programming APIs for extending the system.

Data pipelining/streaming: Flink's runtime is designed as a pipelined data processing engine rather than a batch processing engine. Operators do not wait for their predecessors to finish before they start processing data. This results in very efficient handling of large data sets.

Release 0.6

Flink 0.6 builds on the latest Stratosphere 0.5 release. It includes many bug fixes and improvements that make the system more stable and robust, as well as breaking API changes.

The full release notes are available here.

Download the release here.

Contributors

  • Wilson Cao
  • Ufuk Celebi
  • Stephan Ewen
  • Jonathan Hasenburg
  • Markus Holzemer
  • Fabian Hueske
  • Sebastian Kunert
  • Vikhyat Korrapati
  • Aljoscha Krettek
  • Sebastian Kruse
  • Raymond Liu
  • Robert Metzger
  • Mingliang Qi
  • Till Rohrmann
  • Henry Saputra
  • Chesnay Schepler
  • Kostas Tzoumas
  • Robert Waury
  • Timo Walther
  • Daniel Warneke
  • Tobias Wiens

Stratosphere version 0.5 available

31 May 2014

We are happy to announce a new major Stratosphere release, version 0.5. This release adds many new features and improves the interoperability, stability, and performance of the system. The major theme of the release is the completely new Java API that makes it easy to write powerful distributed programs.

The release can be downloaded from the Stratosphere website and from GitHub. All components are available as Apache Maven dependencies, making it simple to include Stratosphere in other projects. The website provides extensive documentation of the system and the new features.

Shortlist of new Features

Below is a short list of the most important additions to the Stratosphere system.

New Java API

This release introduces a completely new data set-centric Java API. This programming model significantly eases the development of Stratosphere programs, supports flexible use of regular Java classes as data types, and adds many new built-in operators to simplify the writing of powerful programs. The results are programs that need less code, are more readable, interoperate better with existing code, and execute faster.

Take a look at the examples to get a feel for the API.

General API Improvements

Broadcast Variables: Publish a data set to all instances of another operator. This is handy if your operator depends on the result of a computation, e.g., filter all values smaller than the average.

Distributed Cache: Make (local and HDFS) files locally available on each machine processing a task.

Iteration Termination Improvements: Iterative algorithms can now terminate based on intermediate data sets, not only through aggregated statistics.

Collection data sources and sinks: Speed up the development and testing of Stratosphere programs by reading data from regular Java collections and writing results back into them.

JDBC data sources and sinks: Read data from and write data to relational databases using a JDBC driver.

Hadoop input format and output format support: Read and write data with any Hadoop input or output format.

Support for Avro encoded data: Read data that has been materialized using Avro.

Deflate Files: Stratosphere now transparently reads .deflate compressed files.
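
To make the broadcast variable use case above concrete (dropping all values smaller than the average), here is a minimal plain-Java sketch. It does not use the Stratosphere API: the class and method names are hypothetical, and the "broadcast" is only simulated by handing the same precomputed value to every filter call, one call per partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names are part of the Stratosphere API.
final class BroadcastSketch {

    // Step 1: a separate computation produces a small result (here: the average).
    static double average(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return values.isEmpty() ? 0.0 : (double) sum / values.size();
    }

    // Step 2: each parallel instance filters its own partition and uses the
    // broadcast value, which every instance receives in full.
    static List<Integer> filterPartition(List<Integer> partition, double broadcastAverage) {
        List<Integer> kept = new ArrayList<>();
        for (int v : partition) {
            if (v >= broadcastAverage) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> partition1 = Arrays.asList(1, 8, 3);
        List<Integer> partition2 = Arrays.asList(10, 2, 6);

        List<Integer> all = new ArrayList<>(partition1);
        all.addAll(partition2);
        double avg = average(all); // 30 / 6 = 5.0

        System.out.println(filterPartition(partition1, avg)); // [8]
        System.out.println(filterPartition(partition2, avg)); // [10, 6]
    }
}
```

The point of the pattern is that the average is computed once and shipped to every instance, instead of each instance recomputing it or joining against it.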

Runtime and Optimizer Improvements

DAG Runtime Streaming: Detection and resolution of streaming data flow deadlocks in the data flow optimizer.

Intermediate results across iteration boundaries: Intermediate results computed outside iterative parts can be used inside iterative parts of the program.

Stability fixes: Various stability fixes in both optimizer and runtime.

Setup & Tooling

Improved YARN support: Many improvements based on user feedback, covering packaging, permissions, and error handling.

Java 8 compatibility

Contributors

In total, 26 people have contributed to Stratosphere since the last release. Thank you for making this project possible!

  • Alexander Alexandrov
  • Jesus Camacho
  • Ufuk Celebi
  • Mikhail Erofeev
  • Stephan Ewen
  • Alexandr Ferodov
  • Filip Haase
  • Jonathan Hasenberg
  • Markus Holzemer
  • Fabian Hueske
  • Vasia Kalavri
  • Aljoscha Krettek
  • Rajika Kumarasiri
  • Sebastian Kunert
  • Aaron Lam
  • Robert Metzger
  • Faisal Moeen
  • Martin Neumann
  • Mingliang Qi
  • Till Rohrmann
  • Chesnay Schepler
  • Vyachislav Soludev
  • Tuan Trieu
  • Artem Tsikiridis
  • Timo Walther
  • Robert Waury

Stratosphere is going Apache

The Stratosphere project has been accepted to the Apache Incubator and will continue its work under the umbrella of the Apache Software Foundation. Due to a name conflict, we are switching the name of the project. We will make future releases of Stratosphere through the Apache foundation under a new name.


Stratosphere accepted as Apache Incubator Project

16 Apr 2014

We are happy to announce that Stratosphere has been accepted as a project for the Apache Incubator. The proposal was accepted by the Incubator PMC members earlier this week. The Apache Incubator is the first step in the process of giving a project to the Apache Software Foundation. While under incubation, the project will move to the Apache infrastructure and adopt the community-driven development principles of the Apache Foundation. Projects can graduate from incubation to become top-level projects if they show activity, a healthy community dynamic, and releases.

We are glad to have Alan Gates as champion on board, as well as a set of great mentors, including Sean Owen, Ted Dunning, Owen O'Malley, Henry Saputra, and Ashutosh Chauhan. We are confident that we will make this a great open source effort.


Stratosphere got accepted for Google Summer of Code 2014

24 Feb 2014

Students: Apply now for exciting summer projects in the Big Data / Analytics field

We are pleased to announce that Stratosphere has been accepted to Google Summer of Code 2014 as a mentoring organization. This means that we will host a group of students who will carry out projects within Stratosphere over the summer. Read more in the GSoC manual for students and the official FAQ. Students can improve their coding skills, learn to work with open-source projects, improve their CV, and get a nice paycheck from Google.

If you are an interested student, check out our idea list in the wiki. It contains different projects with varying ranges of difficulty and requirement profiles. Students can also suggest their own projects.

We welcome students to sign up at our developer mailing list to discuss their ideas. Applying students can use our wiki (create a new page) to create a project proposal. We are happy to have a look at it.


Use Stratosphere with Amazon Elastic MapReduce

18 Feb 2014

Get started with Stratosphere within 10 minutes using Amazon Elastic MapReduce.

This step-by-step tutorial will guide you through the setup of Stratosphere using Amazon Elastic MapReduce.

Background

Amazon Elastic MapReduce (Amazon EMR) is part of Amazon Web Services. EMR lets you create Hadoop clusters that analyze data stored in Amazon S3 (AWS' cloud storage). Stratosphere runs on top of Hadoop using the recently released cluster resource manager YARN. YARN lets you use many different data analysis tools in your cluster side by side. Tools that run on YARN include, for example, Apache Giraph, Spark, and HBase. Stratosphere also runs on YARN, and that is the approach used in this tutorial.

1. Step: Log in to AWS and prepare secure access

You need to have SSH keys to access the Hadoop master node. If you do not have keys for your computer, generate them:

  • Select EC2 and click on "Key Pairs" in the "NETWORK & SECURITY" section.
  • Click on "Create Key Pair" and give it a name
  • After pressing "Yes" it will download a .pem file.
  • Change the permissions of the .pem file:
    chmod og-rwx ~/work-laptop.pem

2. Step: Create your Hadoop Cluster in the cloud

  • Select Elastic MapReduce from the AWS console
  • Click the blue "Create cluster" button.
  • Choose a Cluster name
  • You can let the other settings remain unchanged (termination protection, logging, debugging)
  • For the Hadoop distribution, it is very important to choose one with YARN support. We use 3.0.3 (Hadoop 2.2.0) (the minor version might change over time)
  • Remove all applications to be installed (unless you want to use them)
  • Choose the instance types you want to start. Stratosphere runs fine with m1.large instances. Core and Task instances both run Stratosphere, but only core instances contain HDFS data nodes.
  • Choose the EC2 key pair you've created in the previous step!
  • That's it! You can now press the "Create cluster" button at the end of the form to boot it!

3. Step: Launch Stratosphere

You might need to wait a few minutes until Amazon has started your cluster. (You can monitor the progress of the instances in EC2.) Use the refresh button in the top right corner.

The master is up once the field Master public DNS contains a value (first line). Connect to it using SSH:

ssh hadoop@<your master public DNS> -i <path to your .pem>
# for my example, it looks like this:
ssh hadoop@ec2-54-213-61-105.us-west-2.compute.amazonaws.com -i ~/Downloads/work-laptop.pem
(Windows users have to follow these instructions to SSH into the machine running the master.)

Once connected to the master, download and start Stratosphere for YARN:
  • Download and extract Stratosphere-YARN
    wget http://stratosphere-bin.s3-website-us-east-1.amazonaws.com/stratosphere-dist-0.5-SNAPSHOT-yarn.tar.gz
    # extract it
    tar xvzf stratosphere-dist-0.5-SNAPSHOT-yarn.tar.gz
  • Start Stratosphere in the cluster using Hadoop YARN
    cd stratosphere-yarn-0.5-SNAPSHOT/
    ./bin/yarn-session.sh -n 4 -jm 1024 -tm 3000
    The arguments have the following meaning:
    • -n number of TaskManagers (=workers). This number must not exceed the number of task instances
    • -jm memory (heapspace) for the JobManager
    • -tm memory for the TaskManagers
Once the output has changed from
JobManager is now running on N/A:6123
to
JobManager is now running on ip-172-31-13-68.us-west-2.compute.internal:6123
Stratosphere has started the JobManager. It will take a few seconds until the TaskManagers (workers) have connected to the JobManager. To see how many TaskManagers connected, you have to access the JobManager's web interface. Follow the steps below to do that ...

4. Step: Launch a Stratosphere Job

This step shows how to submit and monitor a Stratosphere Job in the Amazon Cloud.
  • Open an additional terminal and connect again to the master of your cluster.
  • We recommend creating a SOCKS proxy with SSH so that you can easily connect into the cluster. (If you already have a VPN set up with EC2, you can probably use that as well.)
    ssh -D localhost:2001 hadoop@<your master dns name> -i <your pem file>
    Notice the -D localhost:2001 argument: It opens a SOCKS proxy on your computer allowing any application to use it to communicate through the proxy via an SSH tunnel to the master node. This allows you to access all services in your EMR cluster, such as the HDFS NameNode or the YARN web interface.
  • Configure a browser to use the SOCKS proxy. Open a browser with SOCKS proxy support (such as Firefox). Ideally, do not use your primary browser for this, since ALL traffic will be routed through Amazon.
    • To configure the SOCKS proxy with Firefox, click on "Edit", "Preferences", choose the "Advanced" tab and press the "Settings ..." button.
    • Enter the details of the SOCKS proxy localhost:2001. Choose SOCKS v4.
    • Close the settings. Your browser is now talking to the master node of your cluster.

Since you're connected to the master now, you can open several web interfaces:
YARN Resource Manager: http://<masterIPAddress>:9026/
HDFS NameNode: http://<masterIPAddress>:9101/

You can find the masterIPAddress by entering ifconfig into the terminal:

[hadoop@ip-172-31-38-95 ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 02:CF:8E:CB:28:B2  
          inet addr:172.31.38.95  Bcast:172.31.47.255  Mask:255.255.240.0
          inet6 addr: fe80::cf:8eff:fecb:28b2/64 Scope:Link
          RX bytes:166314967 (158.6 MiB)  TX bytes:89319246 (85.1 MiB)

Optional: If you want to use the hostnames within your Firefox (that also makes the NameNode links work), you have to enable DNS resolution over the SOCKS proxy. Open the Firefox config about:config and set network.proxy.socks_remote_dns to true.

The YARN ResourceManager also allows you to connect to Stratosphere's JobManager web interface. Click the ApplicationMaster link in the "Tracking UI" column.

To run the Wordcount example, you have to upload some sample data.

# download a text
wget http://www.gnu.org/licenses/gpl.txt
# upload it to HDFS:
hadoop fs -copyFromLocal gpl.txt /input

To run a Job, enter the following command into the master's command line:

# optional: go to the extracted directory
cd stratosphere-yarn-0.5-SNAPSHOT/
# run the wordcount example
./bin/stratosphere run -w -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar  -a 16 hdfs:///input hdfs:///output

Make sure that the expected number of TaskManagers have connected to the JobManager.

Let's go through the command in detail:

  • ./bin/stratosphere is the standard launcher for Stratosphere jobs from the command line
  • The -w flag stands for "wait". It is very useful for tracking the progress of the job.
  • -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar: the -j option sets the jar file containing the job. If you have your own application, place your jar file here.
  • -a 16 hdfs:///input hdfs:///output: the -a option specifies the job-specific arguments. In this case, the WordCount example expects the following input: <numSubTasks> <input> <output>.

You can monitor the progress of your job in the JobManager web interface. Once the job has finished (which should take less than 10 seconds), you can analyze it there. Inspect the result in HDFS using:

hadoop fs -tail /output

If you want to shut down the whole cluster in the cloud, use Amazon's web interface and click on "Terminate cluster". If you just want to stop the YARN session, press CTRL+C in the terminal. The Stratosphere instances will be killed by YARN.



Written by Robert Metzger (@rmetzger_).


Accessing Data Stored in MongoDB with Stratosphere

28 Jan 2014

We recently merged a pull request that allows you to use any existing Hadoop InputFormat with Stratosphere. So you can now (in the 0.5-SNAPSHOT and upwards versions) define a Hadoop-based data source:

HadoopDataSource source = new HadoopDataSource(new TextInputFormat(), new JobConf(), "Input Lines");
TextInputFormat.addInputPath(source.getJobConf(), new Path(dataInput));

In this article, we describe how to access data stored in MongoDB with Stratosphere. This allows users to join data from multiple sources (e.g. MongoDB and HDFS) or perform machine learning with the documents stored in MongoDB.

The approach here is to use the MongoInputFormat that was developed for Apache Hadoop but now also runs with Stratosphere.

JobConf conf = new JobConf();
conf.set("mongo.input.uri","mongodb://localhost:27017/enron_mail.messages");
HadoopDataSource src = new HadoopDataSource(new MongoInputFormat(), conf, "Read from Mongodb", new WritableWrapperConverter());

Example Program

The example program reads data from the enron dataset that contains about 500k internal e-mails. The data is stored in MongoDB and the Stratosphere program counts the number of e-mails per day.

The complete code of this sample program is available on GitHub.

Prepare MongoDB and the Data

  • Install MongoDB
  • Download the enron dataset from their website.
  • Unpack and load it
 bunzip2 enron_mongo.tar.bz2
 tar xvf enron_mongo.tar
 mongorestore dump/enron_mail/messages.bson

We used Robomongo to visually examine the dataset stored in MongoDB.

Build MongoInputFormat

MongoDB offers an InputFormat for Hadoop on their GitHub page. The code is not available in any Maven repository, so we have to build the jar file on our own.

  • Check out the repository
git clone https://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
  • Set the appropriate Hadoop version in build.sbt (we used 1.1).
hadoopRelease in ThisBuild := "1.1"
  • Build the input format
./sbt package

The jar-file is now located in core/target.

The Stratosphere Program

Now we have everything prepared to run the Stratosphere program. I only ran it on my local computer, out of Eclipse. To do that, check out the code ...

git clone https://github.com/stratosphere/stratosphere-mongodb-example.git

... and import it as a Maven project into your Eclipse. You have to manually add the previously built mongo-hadoop jar-file as a dependency. You can now press the "Run" button and see how Stratosphere executes the little program. It was running for about 8 seconds on the 1.5 GB dataset.

The result (located in /tmp/enronCountByDay) now looks like this.

11,Fri Sep 26 10:00:00 CEST 1997
154,Tue Jun 29 10:56:00 CEST 1999
292,Tue Aug 10 12:11:00 CEST 1999
185,Thu Aug 12 18:35:00 CEST 1999
26,Fri Mar 19 12:33:00 CET 1999

There is one thing left I want to point out here. MongoDB represents objects stored in the database as JSON documents. Since Stratosphere's standard types do not support JSON documents, I used the WritableWrapper here. This wrapper allows you to use any Hadoop data type with Stratosphere.

The following code example shows how the JSON-documents are accessed in Stratosphere.

public void map(Record record, Collector<Record> out) throws Exception {
    Writable valWr = record.getField(1, WritableWrapper.class).value();
    BSONWritable value = (BSONWritable) valWr;
    Object headers = value.getDoc().get("headers");
    BasicDBObject headerOb = (BasicDBObject) headers;
    String date = (String) headerOb.get("Date");
    // further date processing
}
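
The "further date processing" step is not shown in the post. As a minimal sketch, assuming the Date header is an RFC 1123 style string such as "Tue, 14 Nov 2000 08:22:00 -0800 (PST)" (an assumption about the dataset, not something stated above), the calendar day could be extracted with plain Java 8+ as follows. The class and method names are hypothetical, and this is not the code of the example program on GitHub.

```java
import java.time.LocalDate;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; assumes RFC 1123 style Date headers (see lead-in).
final class DateHeaderSketch {

    static LocalDate dayOf(String dateHeader) {
        // Drop a trailing zone comment such as " (PST)", which
        // DateTimeFormatter.RFC_1123_DATE_TIME does not accept.
        String cleaned = dateHeader.replaceAll("\\s*\\([^)]*\\)\\s*$", "").trim();
        ZonedDateTime timestamp =
                ZonedDateTime.parse(cleaned, DateTimeFormatter.RFC_1123_DATE_TIME);
        // Keep only the calendar day, in the sender's own UTC offset.
        return timestamp.toLocalDate();
    }

    public static void main(String[] args) {
        System.out.println(dayOf("Tue, 14 Nov 2000 08:22:00 -0800 (PST)")); // 2000-11-14
    }
}
```

A day value like this could then serve as the grouping key when counting e-mails per day.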

Please use the comments if you have questions or if you want to showcase your own MongoDB-Stratosphere integration.

Written by Robert Metzger (@rmetzger_).


Optimizer Plan Visualization Tool

26 Jan 2014

Stratosphere's hybrid approach combines MapReduce and MPP database techniques. One central part of this approach is to have a separation between the programming (API) and the way programs are executed (execution plans). The compiler/optimizer decides the details concerning caching or when to partition/broadcast with a holistic view of the program. The same program may actually be executed differently in different scenarios (input data of different sizes, different number of machines).

If you want to know exactly how the system executes your program, you can find out in two ways:

  1. The browser-based webclient UI, which takes programs packaged into JARs and draws the execution plan as a visual data flow (check out the documentation for details).

  2. For programs using the Local- or Remote Executor, you can get the optimizer plan using the method LocalExecutor.optimizerPlanAsJSON(plan). The resulting JSON string describes the execution strategies chosen by the optimizer. Naturally, you do not want to parse that yourself, especially for longer programs.

The builds 0.5-SNAPSHOT and later come with a tool that visualizes the JSON string. It is a standalone version of the webclient's visualization, packed as an html document tools/planVisualizer.html.

If you open it in a browser (for example chromium-browser tools/planVisualizer.html) it shows a text area where you can paste the JSON string and it renders that string as a dataflow plan (assuming it was a valid JSON string and plan). The pictures below show how that looks for the included sample program that uses delta iterations to compute the connected components of a graph.


Stratosphere 0.4 Released

13 Jan 2014

We are pleased to announce that version 0.4 of the Stratosphere system has been released.

Our team has been working hard during the last few months to create an improved and stable Stratosphere version. The new version comes with many new features as well as usability and performance improvements at all levels, including a new Scala API for the concise specification of programs, a Pregel-like API, support for YARN clusters, and major performance improvements. The system now features first-class support for iterative programs and thus covers traditional analytical use cases as well as data mining and graph processing use cases with great performance.

In the course of the transition from v0.2 to v0.4 of the system, we have changed pre-existing APIs based on valuable user feedback. This means that, in the interest of easier programming, we have broken backwards compatibility and existing jobs must be adapted, as described in the migration guide.

This article will guide you through the feature list of the new release.

Scala Programming Interface

The new Stratosphere version comes with a new programming API in Scala that supports very fluent and efficient programs that can be expressed with very few lines of code. The API uses Scala's native type system (no special boxed data types) and supports grouping and joining on types beyond key/value pairs. We use code analysis and code generation to transform Scala's data model to the Stratosphere runtime. Stratosphere Scala programs are optimized before execution by Stratosphere's optimizer just like Stratosphere Java programs.

Learn more about the Scala API at the Scala Programming Guide

Iterations

Stratosphere v0.4 introduces deep support for iterative algorithms, required by a large class of advanced analysis algorithms. In contrast to most other systems, "looping over the data" is done inside the system's runtime, rather than in the client. Individual iterations (supersteps) can be as fast as sub-second times. Loop-invariant data is automatically cached in memory.

We support a special form of iterations called “delta iterations” that selectively modify only some elements of the intermediate solution in each iteration. These are applicable to a variety of applications, e.g., use cases of Apache Giraph. We have observed speedups of 70x when using delta iterations instead of regular iterations.
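
The idea behind delta iterations can be illustrated without the Stratosphere API. The following plain-Java sketch (hypothetical names, single machine, not how the runtime implements it) computes connected components by keeping a solution set of component IDs and a workset of vertices whose ID changed in the previous superstep. Only the neighbors of workset vertices are revisited, so the work per superstep shrinks as the computation converges.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of the workset/solution-set idea; not Stratosphere code.
final class DeltaIterationSketch {

    // adjacency: vertex -> neighbors. Every vertex must appear as a key, and
    // undirected edges must be listed in both directions.
    static Map<Integer, Integer> connectedComponents(Map<Integer, List<Integer>> adjacency) {
        // Solution set: each vertex starts in its own component.
        Map<Integer, Integer> solution = new HashMap<>();
        for (Integer v : adjacency.keySet()) {
            solution.put(v, v);
        }
        // Workset: vertices whose component ID changed (initially all of them).
        Set<Integer> workset = new HashSet<>(adjacency.keySet());

        while (!workset.isEmpty()) {
            Set<Integer> nextWorkset = new HashSet<>();
            for (Integer v : workset) {
                int id = solution.get(v);
                for (Integer neighbor : adjacency.get(v)) {
                    // Update only the elements that actually improve.
                    if (id < solution.get(neighbor)) {
                        solution.put(neighbor, id);
                        nextWorkset.add(neighbor);
                    }
                }
            }
            workset = nextWorkset; // unchanged vertices are not touched again
        }
        return solution;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, Arrays.asList(2));
        graph.put(2, Arrays.asList(1, 3));
        graph.put(3, Arrays.asList(2));
        graph.put(4, Arrays.asList(5));
        graph.put(5, Arrays.asList(4));
        // Sorted for stable output: {1=1, 2=1, 3=1, 4=4, 5=4}
        System.out.println(new TreeMap<>(connectedComponents(graph)));
    }
}
```

A regular (bulk) iteration would recompute the ID of every vertex in every superstep. The workset is what lets later supersteps skip the parts of the graph that have already converged.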

Read more about the new iteration feature in the documentation

Hadoop YARN Support

YARN (Yet Another Resource Negotiator) is the major new feature of the recently announced Hadoop 2.2. It allows different runtimes to share an existing cluster, so you can run MapReduce alongside Storm and others. With the 0.4 release, Stratosphere supports YARN. Follow our guide on how to start a Stratosphere YARN session.

Improved Scripting Language Meteor

The high-level language Meteor now natively serializes JSON trees for greater performance and offers additional operators and file formats. Users can now write crisper scripts thanks to second-order functions, multi-output operators, and other syntactic sugar. For developers of Meteor packages, the API is much more comprehensive and allows defining custom data types that can be easily embedded in JSON trees through ad-hoc byte code generation.

Spargel: Pregel Inspired Graph Processing

Spargel is a vertex-centric API similar to the interface proposed in Google's Pregel paper and implemented in Apache Giraph. Spargel is implemented in 500 lines of code (including comments) on top of Stratosphere's delta iterations feature. This confirms the flexibility of Stratosphere's architecture.

Web Frontend

Using the new web frontend, you can monitor the progress of Stratosphere jobs. For finished jobs, the frontend shows a breakdown of the execution times for each operator. The webclient also visualizes the execution strategies chosen by the optimizer.

Accumulators

Stratosphere's accumulators allow program developers to compute simple statistics, such as counts, sums, min/max values, or histograms, as a side effect of the processing functions. An example application would be to count the total number of records/tuples processed by a function. Stratosphere will not launch additional tasks (reducers), but will compute the number "on the fly" as a by-product of the function's application to the data. The concept is similar to Hadoop's counters, but supports more types of aggregation.
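
The following plain-Java sketch illustrates the idea only; it is not Stratosphere's accumulator API, and all names are hypothetical. Each parallel task updates a local counter while it processes its records, and the partial counters are merged once the tasks have finished, so no extra pass over the data is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names come from Stratosphere.
final class AccumulatorSketch {

    /** A simple sum accumulator whose partial results can be merged. */
    static final class LongCounter {
        private long value;

        void add(long delta) {
            value += delta;
        }

        void merge(LongCounter other) {
            value += other.value;
        }

        long get() {
            return value;
        }
    }

    /** The "user function": upper-cases each record and counts records as a side effect. */
    static List<String> processPartition(List<String> records, LongCounter recordCount) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            recordCount.add(1); // side effect, no separate counting pass
            out.add(record.toUpperCase());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two "parallel tasks", each with its own local accumulator.
        LongCounter task1 = new LongCounter();
        LongCounter task2 = new LongCounter();
        processPartition(Arrays.asList("a", "b", "c"), task1);
        processPartition(Arrays.asList("d", "e"), task2);

        // Partial results are merged after the tasks finish.
        LongCounter total = new LongCounter();
        total.merge(task1);
        total.merge(task2);
        System.out.println(total.get()); // 5
    }
}
```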

Refactored APIs

Based on valuable user feedback, we refactored the Java programming interface to make it more intuitive and easier to use. The basic concepts are still the same; however, the naming of most interfaces changed and the structure of the code was adapted. When updating to the 0.4 release you will need to adapt your jobs and dependencies. A previous blog post has a guide to the necessary changes to adapt programs to Stratosphere 0.4.

Local Debugging

You can now test and debug Stratosphere jobs locally. The LocalExecutor lets you execute Stratosphere jobs from IDEs. The same code that runs on clusters also runs multi-threaded in a single JVM. This mode supports the full debugging capabilities known from regular applications (placing breakpoints and stepping through the program's functions). An advanced mode supports simulating fully distributed operation locally.

Miscellaneous

  • The configuration of Stratosphere has been changed to YAML
  • HBase support
  • JDBC Input format
  • Improved Windows Compatibility: Batch-files to start Stratosphere on Windows and all unit tests passing on Windows.
  • Stratosphere is available in Maven Central and Sonatype Snapshot Repository
  • Improved build system that supports different Hadoop versions using Maven profiles
  • Maven Archetypes for Stratosphere Jobs.
  • Stability and Usability improvements with many bug fixes.

Download and get started with Stratosphere v0.4

There are several options for getting started with Stratosphere.

Tell us what you think!

Are you using, or planning to use, Stratosphere? Sign up to our mailing list and drop us a line.

Have you found a bug? Post an issue on GitHub.

Follow us on Twitter and GitHub to stay in touch with the latest news!
