On Availability of Intermediate Data in Cloud Computations
Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta
Distributed Protocols Research Group (DPRG)
University of Illinois at Urbana-Champaign

Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
◦ Outline of a solution
This talk
◦ Builds up the case
◦ Emphasizes the need, not the solution

Dataflow Programming Frameworks
Runtime systems that execute dataflow programs
◦ MapReduce (Hadoop), Pig, Hive, etc.
◦ Gaining popularity for massive-scale data processing
◦ Distributed and parallel execution on clusters
A dataflow program consists of
◦ Multi-stage computation
◦ Communication patterns between stages

Example 1: MapReduce
Two-stage computation with all-to-all communication
◦ Introduced by Google, open-sourced by Yahoo! (Hadoop)
◦ Two functions – Map and Reduce – supplied by the programmer
◦ Massively parallel execution of Map and Reduce
[Figure: Stage 1: Map → Shuffle (all-to-all) → Stage 2: Reduce]

Example 2: Pig and Hive
Pig from Yahoo! & Hive from Facebook
Built atop MapReduce
Declarative, SQL-style languages
Automatic generation & execution of multiple MapReduce jobs
Multi-stage with either all-to-all or 1-to-1 communication
[Figure: Stage 1: Map → Shuffle (all-to-all) → Stage 2: Reduce → 1-to-1 comm. → Stage 3: Map → Stage 4: Reduce]

Usage
Google (MapReduce)
◦ Indexing: a chain of 24 MapReduce jobs
◦ ~200K jobs processing 50 PB/month (in 2006)
Yahoo! (Hadoop + Pig)
◦ WebMap: a chain of 100 MapReduce jobs
Facebook (Hadoop + Hive)
◦ ~300 TB total, adding 2 TB/day (in 2008)
◦ 3K jobs processing 55 TB/day
Amazon
◦ Elastic MapReduce service (pay-as-you-go)
Academic clouds
◦ Google-IBM Cluster at UW (Hadoop service)
◦ CCT at UIUC (Hadoop & Pig service)

One Common Characteristic
Intermediate data
◦ What is intermediate data? Data between stages
Similarities to traditional intermediate data
◦ E.g., .o files
◦ Critical to produce the final output
◦ Short-lived, written once, read once, and used immediately

One Common Characteristic
Intermediate data
◦ Written locally & read remotely
◦ Possibly a very large amount of intermediate data (depending on the workload)
◦ Acts as a computational barrier
[Figure: Stage 1: Map → Computational Barrier → Stage 2: Reduce]
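To make the Map → Shuffle → Reduce pattern and the role of intermediate data concrete, here is a minimal, self-contained sketch in plain Python (not Hadoop code; the word-count example and the names `map_fn`, `shuffle`, and `reduce_fn` are illustrative assumptions, not part of the talk). The grouped data produced by the shuffle step is the intermediate data discussed above: every Reduce task needs its share of it before it can start, which is why it acts as a computational barrier.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Stage 1: Map -- user-supplied function, run in parallel over input splits.
def map_fn(line: str) -> Iterator[tuple[str, int]]:
    for word in line.split():
        yield (word, 1)

# Shuffle (all-to-all): group every map output by key and route it to the
# reduce task responsible for that key. The grouped data is the
# *intermediate data*; no reduce task can start until it is available.
def shuffle(map_outputs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in map_outputs:
        groups[key].append(value)
    return groups

# Stage 2: Reduce -- user-supplied function, run in parallel over keys.
def reduce_fn(key: str, values: list[int]) -> tuple[str, int]:
    return (key, sum(values))

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    map_outputs = [kv for line in lines for kv in map_fn(line)]
    intermediate = shuffle(map_outputs)          # the computational barrier
    final = dict(reduce_fn(k, v) for k, v in intermediate.items())
    print(final)                                 # {'the': 3, 'quick': 1, ...}
```

If the node holding part of `intermediate` is lost, the dependent Reduce tasks stall until the corresponding Map tasks are re-run, which is exactly the failure scenario examined next.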
Computational Barrier + Failures
Availability becomes critical.
◦ Loss of intermediate data before or during the execution of a task => the task cannot proceed
[Figure: Stage 1: Map → Stage 2: Reduce]

Current Solution
Store locally & re-generate when lost
◦ Re-run the affected Map & Reduce tasks
◦ No support from a storage system
Assumption: re-generation is cheap and easy
[Figure: Stage 1: Map → Stage 2: Reduce]

Hadoop Experiment
Emulab setting (for all plots in this talk)
◦ 20 machines sorting 36 GB
◦ 4 LANs and a core switch (all 100 Mbps)
Normal execution: Map–Shuffle–Reduce
[Figure: timeline of the Map, Shuffle, and Reduce phases]

Hadoop Experiment
1 failure after Map
◦ Re-execution of Map–Shuffle–Reduce
◦ ~33% increase in completion time
[Figure: timeline showing a second Map–Shuffle–Reduce pass triggered by the failure]

Re-Generation for Multi-Stage
Cascaded re-execution: expensive
[Figure: Stage 1: Map → Stage 2: Reduce → Stage 3: Map → Stage 4: Reduce]

Importance of Intermediate Data
Why?
◦ Critical for execution (barrier)
◦ When lost, very costly
Current systems handle it themselves.
◦ Re-generate when lost: can lead to expensive cascaded re-execution
◦ No support from the storage
We believe the storage, not the dataflow frameworks, is the right abstraction.

Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Outline of a solution
   Why is storage the right abstraction?
   Challenges
   Research directions

Why is Storage the Right Abstraction?
Replication stops cascaded re-execution.
[Figure: Stage 1: Map → Stage 2: Reduce → Stage 3: Map → Stage 4: Reduce]

So, Are We Done?
No! Challenge: minimal interference
◦ The network is heavily utilized during Shuffle.
◦ Replication requires network transmission too.
◦ Minimizing interference is critical for the overall job completion time.
Any existing approaches?
◦ HDFS (Hadoop's default file system): much interference (next slide)
◦ Background replication with TCP-Nice: not designed for network utilization & control (no further discussion here; please refer to our paper)

Modified HDFS Interference
Unmodified HDFS
◦ Much overhead with synchronous replication
Modification for asynchronous replication
◦ Evaluated with an increasing level of interference
Four levels of interference
◦ Hadoop: original, no replication, no interference
◦ Read: disk read, no network transfer, no actual replication
◦ Read-Send: disk read & network send, no actual replication
◦ Rep.: full replication

Modified HDFS Interference
Asynchronous replication
◦ Network utilization makes the difference
Both Map & Shuffle get affected
◦ Some Maps need to read remotely
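As a rough illustration of the synchronous-versus-asynchronous distinction in the HDFS experiments above, here is a conceptual sketch, not HDFS code: `write_local`, `send_to_replica`, and the queue-plus-thread design are assumptions made purely for illustration. Synchronous replication keeps the writing task on the critical path until the remote copy exists, while asynchronous replication returns right after the local write and creates the remote copy in the background, so its remaining cost is the extra disk reads and network traffic it generates.

```python
import queue
import threading

def write_local(block: bytes) -> None:
    pass  # placeholder: write the block to the local disk

def send_to_replica(block: bytes) -> None:
    pass  # placeholder: ship the block to a remote node over the network

# Synchronous replication (HDFS-style): the writer blocks until the
# remote copy is made, so replication sits directly on the critical path.
def write_block_sync(block: bytes) -> None:
    write_local(block)
    send_to_replica(block)

# Asynchronous replication: the writer returns after the local write;
# a background thread drains a queue and creates remote copies later.
_replication_queue: "queue.Queue[bytes]" = queue.Queue()

def _replicator() -> None:
    while True:
        block = _replication_queue.get()
        send_to_replica(block)          # still consumes disk + network
        _replication_queue.task_done()

threading.Thread(target=_replicator, daemon=True).start()

def write_block_async(block: bytes) -> None:
    write_local(block)
    _replication_queue.put(block)       # replication moves off the critical path
```

Even the asynchronous version competes with Shuffle for the network, which is what the Read / Read-Send / Rep. breakdown above is designed to isolate.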
Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Outline of a new storage system design
   Why is storage the right abstraction?
   Challenges
   Research directions

Research Directions
Two requirements
◦ Intermediate data availability, to stop cascaded re-execution
◦ Interference minimization, focusing on network interference
Solution
◦ Replication with minimal interference

Research Directions
Replication using spare bandwidth
◦ Not much network activity during Map & Reduce computation
◦ Tight bandwidth monitoring & control
Deadline-based replication
◦ Replicate every N stages
Replication based on a cost model
◦ Replicate only when re-execution is more expensive (a sketch of this check appears after the backup slides)

Summary
Our position
◦ Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
Problem: cascaded re-execution
Requirements
◦ Intermediate data availability
◦ Interference minimization
Further research is needed.

BACKUP

Default HDFS Interference
◦ Replication of Map and Reduce outputs
◦ Replication policy: local, then remote-rack
◦ Synchronous replication
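To close, here is a hedged sketch of the cost-model direction from the Research Directions slide. The cost formula, the parameter names (`exec_time_s`, `output_bytes`, `failure_prob`, `REPLICATION_BW`), and the example numbers are all assumptions introduced for illustration; the talk itself only states the principle of replicating a stage's output when re-executing it would be more expensive.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    exec_time_s: float        # time to (re)run this stage
    output_bytes: int         # size of its intermediate output
    failure_prob: float       # chance a machine holding the output fails
                              # before the output is consumed

REPLICATION_BW = 100e6 / 8    # bytes/s of spare bandwidth (illustrative: 100 Mbps links)

def replication_cost(stage: Stage) -> float:
    """Time spent copying this stage's output to another node."""
    return stage.output_bytes / REPLICATION_BW

def reexecution_cost(pipeline: list[Stage], i: int) -> float:
    """Expected time lost if stage i's output disappears.

    Without replication, regenerating stage i may cascade all the way
    back to the earliest stage whose output is also gone; as a simple
    worst-case proxy we charge the sum of all stage times up to i.
    """
    cascaded = sum(s.exec_time_s for s in pipeline[: i + 1])
    return pipeline[i].failure_prob * cascaded

def should_replicate(pipeline: list[Stage], i: int) -> bool:
    return reexecution_cost(pipeline, i) > replication_cost(pipeline[i])

if __name__ == "__main__":
    job = [Stage(600, 30e9, 0.05), Stage(900, 5e9, 0.05),
           Stage(300, 20e9, 0.05), Stage(450, 1e9, 0.05)]
    for i, stage in enumerate(job):
        print(f"stage {i}: replicate = {should_replicate(job, i)}")
```

A real system would drive both sides of the comparison with measured bandwidth and failure rates; the sketch only shows that the replicate-or-regenerate choice can be made per stage.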