Hedwig - Converting Hadoop M/R ETL systems to Stream Processing
TripAdvisor is one of the world’s largest travel websites that provides hotel and restaurant reviews, accommodation bookings and other travel-related content. It produces and processes billions events everyday including billing records, reports, monitoring events and application notifications.
Prior to migrating to Samza, TripAdvisor used Hadoop to ETL its data. In this model, raw data was rolled up to hourly and daily snapshots in a number of stages with joins and sliding windows applied. Session data was then extracted from the daily snapshots. About 300 million sessions were produced daily. With this solution, the engineering team were faced with a few challenges
- Long lag time to produce business-critical metrics
- Difficult to debug and troubleshoot due to scripts, environments, etc.
The engineering team at TripAdvisor decided to replace the Hadoop solution with a multi-stage Samza pipeline.
In the new solution, after raw data is first collected by Flume and ingested through a Kafka cluster, it is parsed, cleansed and re-partitioned by the Lookback Router; then processing logic such as windowing, grouping, joining, fraud detection are applied by the Session Collector and the Fraud Collector, The pipeline uses Samza’s RocksDB store to perform stateful aggregations; finally the Uploader writes results to ElasticSearch, RedShift and Hive.
The new solution achieved significant improvements:
- Processing time is reduced from 3 hours to 1 hour
- Individual stages in the pipeline are scaled independently
- Overall hardware requirement is reduced to ⅓ thanks to optimized usage
- Much simpler to debug and test the solution
Key Samza features: Stateful processing, Windowing, Kafka-integration