On using Samza at Optimizely to compute analytics over session windows.
Optimizely is the world’s leading experimentation platform, enabling businesses to deliver continuous experimentation and personalization across websites, mobile apps and connected devices. At Optimizely, billions of events are tracked daily and session metrics are provided to its users in real time.
Before introducing Samza for real-time computation, the engineering team at Optimizely built its data pipeline on a complex Lambda architecture using Druid and HBase. Because some session metrics were computed by MapReduce jobs, they could lag by hours after the events were received. As business requirements evolved, this solution became increasingly difficult to scale.
The engineering team at Optimizely turned to stream processing to reduce latency. In their solution, each upstream client associates a sessionId with the events it generates. On receiving each event, the Samza job extracts various fields (e.g. IP address, location information, browser version) and updates the aggregated metrics for the session. At the end of a time window, the merged metrics for that session are written to HBase.
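The per-event update and end-of-window flush described above can be illustrated with a minimal sketch. This is plain Python, not the Samza API and not Optimizely's code: the `SessionMetrics` fields, the event keys other than `sessionId`, and the `sink` callable standing in for the HBase writer are all hypothetical, and the in-memory dict stands in for the job's local state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class SessionMetrics:
    """Running aggregate for one session (hypothetical fields)."""
    event_count: int = 0
    first_seen_ms: Optional[int] = None
    last_seen_ms: Optional[int] = None
    ips: Set[str] = field(default_factory=set)
    browsers: Set[str] = field(default_factory=set)


class SessionAggregator:
    def __init__(self, sink: Callable[[str, SessionMetrics], None]):
        self.sink = sink  # assumption: stands in for the HBase writer
        self.state: Dict[str, SessionMetrics] = {}  # stands in for local state

    def process(self, event: dict) -> None:
        """Update the aggregate for the event's session."""
        m = self.state.setdefault(event["sessionId"], SessionMetrics())
        ts = event["timestamp_ms"]
        m.event_count += 1
        m.first_seen_ms = ts if m.first_seen_ms is None else min(m.first_seen_ms, ts)
        m.last_seen_ms = ts if m.last_seen_ms is None else max(m.last_seen_ms, ts)
        m.ips.add(event["ip"])
        m.browsers.add(event["browser"])

    def window(self) -> None:
        """End of a time window: emit one merged record per session, then reset."""
        for session_id, metrics in self.state.items():
            self.sink(session_id, metrics)
        self.state.clear()


# Usage: four events across two sessions produce two downstream writes.
writes = []
agg = SessionAggregator(sink=lambda sid, m: writes.append((sid, m)))
events = [
    {"sessionId": "s1", "timestamp_ms": 1000, "ip": "10.0.0.1", "browser": "Chrome"},
    {"sessionId": "s1", "timestamp_ms": 4000, "ip": "10.0.0.1", "browser": "Chrome"},
    {"sessionId": "s2", "timestamp_ms": 2000, "ip": "10.0.0.2", "browser": "Firefox"},
    {"sessionId": "s1", "timestamp_ms": 2500, "ip": "10.0.0.1", "browser": "Chrome"},
]
for e in events:
    agg.process(e)
agg.window()
print(len(events), "events ->", len(writes), "writes")
```

Because each session is merged before it leaves the job, the store receives one write per session per window instead of one write per event. As a simplification, the sketch clears all state at every window boundary, so a session that spans two windows would be emitted twice.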
With the new solution:
- The median query latency was reduced from 40+ ms to 5 ms
- Session metrics are now available in real time
- The write rate to HBase is reduced, since the metrics are pre-aggregated by Samza
- Storage requirements on HBase are drastically reduced
- Development effort is lower thanks to out-of-the-box Kafka integration
Here is a testimonial from Optimizely:
“At Optimizely, we have built the world’s leading experimentation platform, which ingests billions of click-stream events a day from millions of visitors for analysis. Apache Samza has been a great asset to Optimizely’s Event ingestion pipeline allowing us to perform large scale, real time stream computing such as aggregations (e.g. session computations) and data enrichment on a multiple billion events / day scale. The programming model, durability and the close integration with Apache Kafka fit our needs perfectly,” says Vignesh Sukumar, Senior Engineering Manager at Optimizely.
In addition to this case study, Apache Samza is also used for other use cases such as data enrichment, repartitioning of event streams and computing real-time metrics.
Key Samza features: Stateful processing, Windowing, Kafka integration