S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
motivation
S4 fills the gap between complex proprietary systems and batch-oriented open source computing platforms. We aim to develop a high performance computing platform that hides the complexity inherent in parallel processing system from the application programmer.
implementation
The core platform is written in Java. The implementation is modular and pluggable, and S4 applications can be easily and dynamically combined for creating more sophisticated stream processing systems.
open source
S4 was initially released by Yahoo! Inc. in October 2010 and is an Apache Incubator project since September 2011. It is licensed under the Apache 2.0 license.
overview
proven
S4 has been deployed in production systems at Yahoo! to process thousands of search queries per second.
decentralized
All nodes are symmetric with no centralized service and no single point of failure. This greatly simplifies deployments and cluster configuration changes.
scalable
Throughput increases linearly as additional nodes are added to the cluster. There is no predefined limit on the number of nodes that can be supported.
extensible
Applications can easily be written and deployed using a simple API. Many basic applications for stream processing are available out of the box and more are being written.
cluster management
S4 hides all cluster management tasks using a communication layer built on top of ZooKeeper, a distributed, open-source coordination service for distributed applications.
fault-tolerance
When a server in the cluster fails, a stand-by server is automatically activated to take over the tasks. Checkpointing and recovery minimize state loss.