S4: Distributed Stream Computing Platform

August 2012: S4 0.5.0 has been released! Get it here!

S4 0.5.0 "Piper" is a complete refactoring of the previous version of S4. It provides a clearer API and a more robust implementation, and it features TCP-based communications, state recovery, and a new set of tools. More information is available in the overview and in the release notes.

S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

motivation

S4 fills the gap between complex proprietary systems and batch-oriented open source computing platforms. We aim to develop a high-performance computing platform that hides the complexity inherent in parallel processing systems from the application programmer.

implementation

The core platform is written in Java. The implementation is modular and pluggable, and S4 applications can be combined easily and dynamically to create more sophisticated stream processing systems.

open source

S4 was initially released by Yahoo! Inc. in October 2010 and has been an Apache Incubator project since September 2011. It is licensed under the Apache 2.0 license.

overview

proven

S4 has been deployed in production systems at Yahoo! to process thousands of search queries per second.

decentralized

All nodes are symmetric with no centralized service and no single point of failure. This greatly simplifies deployments and cluster configuration changes.

scalable

Throughput increases linearly as additional nodes are added to the cluster. There is no predefined limit on the number of nodes that can be supported.

extensible

Applications can easily be written and deployed using a simple API. Many basic applications for stream processing are available out of the box and more are being written.

cluster management

S4 hides all cluster management tasks behind a communication layer built on top of ZooKeeper, an open-source coordination service for distributed applications.

fault-tolerance

When a server in the cluster fails, a stand-by server is automatically activated to take over the tasks. Checkpointing and recovery minimize state loss.
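To make the idea of checkpointing and recovery concrete, here is a minimal sketch in plain Java. It is not S4 code: the class `CheckpointingCounter`, its methods, and the "snapshot every N events" policy are all hypothetical names and assumptions chosen for illustration, and the snapshot is kept in memory rather than in a real storage backend. It only shows why periodic checkpoints limit state loss to the events received since the last snapshot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: these names are not part of the S4 API.
final class CheckpointingCounter {
    private final Map<String, Long> counts = new HashMap<>();
    private Map<String, Long> lastCheckpoint = new HashMap<>();
    private final int checkpointInterval; // assumed policy: snapshot every N events
    private int eventsSinceCheckpoint = 0;

    CheckpointingCounter(int checkpointInterval) {
        if (checkpointInterval <= 0) {
            throw new IllegalArgumentException("checkpointInterval must be positive");
        }
        this.checkpointInterval = checkpointInterval;
    }

    // Update the per-key count and take a snapshot once enough events have arrived.
    void onEvent(String key) {
        counts.merge(key, 1L, Long::sum);
        eventsSinceCheckpoint++;
        if (eventsSinceCheckpoint >= checkpointInterval) {
            checkpoint();
        }
    }

    // Copy the current state; a real system would write it to durable storage.
    void checkpoint() {
        lastCheckpoint = new HashMap<>(counts);
        eventsSinceCheckpoint = 0;
    }

    // Return a defensive copy of the most recent snapshot.
    Map<String, Long> lastCheckpoint() {
        return new HashMap<>(lastCheckpoint);
    }

    // Rebuild a counter on a stand-by node from the most recent snapshot.
    static CheckpointingCounter recover(Map<String, Long> snapshot, int checkpointInterval) {
        CheckpointingCounter restored = new CheckpointingCounter(checkpointInterval);
        restored.counts.putAll(snapshot);
        restored.lastCheckpoint = new HashMap<>(snapshot);
        return restored;
    }

    long count(String key) {
        return counts.getOrDefault(key, 0L);
    }
}
```

With an interval of 2, feeding the events "a", "a", "b" triggers a snapshot after the second event. A counter recovered from that snapshot still reports 2 for "a", while the single "b" event that arrived after the snapshot is lost. A shorter interval reduces that loss at the cost of more frequent snapshots.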