Shared Scan Batch Scheduling in Cloud Computing
Xiaodan Wang, Randal Burns (Johns Hopkins University)
Chris Olston, Anish Das Sarma (Yahoo! Research)

Project Goals
- Eliminate redundant data processing for concurrent workflows that access the same dataset in the Cloud
- Batch MapReduce workflows to enable scan sharing
  - Data-intensive workloads (tens of minutes to hours)
  - Single-pass scan of shared data segments
  - Joins across multiple datasets
- Alleviate contention and improve scalability
  - Utilize fewer map/reduce slots under load
- User-specified rewards for early completion
  - Trade-offs between efficient resource utilization and deadlines

Data-Driven Batch Scheduling
[Diagram: queries Q1 (R1, R2, R3), Q2 (R2, R3, R4), and Q3 (R1, R2) are decomposed by data access into sub-queries, which the batch scheduler co-schedules per region of the Turbulence DB before assembling query results]
- Throughput scales with contention (Astro. & Turbulence)
- Decompose queries into sub-queries based on data access
- Co-schedule sub-queries to amortize I/O
- Evaluate data atoms based on a utility metric
  - Reordering based on contention vs. arrival order (CIDR'09)
  - Adaptive starvation resistance
  - Job-aware scheduling (queries with data dependencies) (SC'10)

Application in Cloud Computing
- Fixed Cloud (fixed resources)
  - Single-pass scan of shared data
  - Alleviate contention (use fewer map/reduce slots; share loading and shuffling of data)
  - Earn rewards for early completion (soft deadlines)
  - Local improvement with simulated annealing, greedy ordering
- Elastic Cloud
  - Machine charge = (# of machines) x (# of hours)
  - Speed-up factors with more machines (i.e., more parallelism)
  - Add machines to meet soft deadlines
  - Aggressive batching to minimize machine charge (efficiency)

Nova Workflow Platform
What is Nova?
- Content management and workflow scheduling for the Cloud
- Leverages existing resources
  - Cloud data: HDFS/Zebra storage
  - Cloud computing: Oozie, Pig/MapReduce/Hadoop
- Users define complex workflows in Oozie that consume the data
- Oozie is a workflow engine for coordinating MapReduce/Pig jobs in Hadoop (i.e., a workflow is a DAG in which nodes are MR tasks and edges are dataflows)
[Diagram: software stack serving Apps 1-3 -- advanced workflows: Nova; simple workflows: Oozie; dataflow: Pig; processing: Hadoop MR; storage: HDFS]

Sample Pig script:
  A = load 'input1' as (a, b, c);
  B = filter A by a > 5;
  store B into 'output1';
  C = group B by b;
  store C into 'output2';

Sample Nova Workflow
[Diagram: crawler output feeds crawled pages (url, content); the candidate entity extractor task emits candidate entity occurrences (url, entity string); a join task against the editor-maintained entities table (entity id, entity string) emits validated entity occurrences (url, entity id); a groupwise count task emits entity occurrence counts (entity id, count)]

Shared Scan via Workflow Merging
[Diagram: the Workflow Merger combines Nova Workflow 1 (inputs c1s0, c2s0; outputs c3s0, c4s0) and Nova Workflow 2 (input c2s0; output c5s0) into merged Workflow 1.2, which scans the shared input c2s0 once]
- Sample use cases in Nova
  - Concurrent research, production, and maintenance workflows over the same data
  - Content enrichment workflows (e.g., dedup, clustering) over news content
  - Webmap workflows consuming the same URL table

Performance Impact
(1) Shared loading (network, redundant processing)
(2) Consolidated computation (shared startup/teardown)
(3) Reducer parallelism (max/sum of the # of reducers)
[Diagram: merged MapReduce plan -- input data is split per tuple across nested plans inside Map1..Mapn; combine, shuffle, and demux route tuples to Reduce1..Reducem, each with its own nested plan and output data]

Completion Time by Scheduling Strategy
[Chart: completion time (ms) vs. number of shingling workflows (1-6) for Sequential-NoMerge, Concurrent-NoMerge, and Merged]
Performance in Nova for different enrichment workflows (e.g., de-dup) on news content (SIGMOD'11)

Utilization of Grid Resources (Slot Time)
[Chart: map and reduce slot time (ms) vs. number of shingling workflows (1-7) for Concurrent-NoMerge and Merged]

PigMix: Load Cost Savings

PigMix: Estimating Makespan

Ongoing Work
- Starvation resistance
  - Account for heterogeneity in workflow sizes
  - Provide soft deadline guarantees
  - Handle cascading failures
  - Prefer jobs with high load cost (less dilation, high slot-time savings, map-only jobs)
- Predicting workflow runtime and frequency
  - Robustness to inaccuracies in cost estimates
  - Conserve or expend Cloud resources based on deadline requirements and system load
- Jobs that join/scan multiple input sources

Questions?

Nova Workflow Platform
Nova features
- Abstraction for complex workflows that consume data
  - Incrementally arriving data (logs, crawls, feeds, ...)
- Incremental processing of arriving data
  - Stateless: shingle every newly crawled page
  - Stateful: maintain inlink counts as the web grows
- Scheduling of processing steps
  - Periodic: run the inlink counter once per week
  - Triggered: run the inlink counter after the link extractor
- Provides provenance, metadata management, incremental processing (e.g., joins), data replication, and transactional guarantees

PigMix: Reducer Parallelism

Optimizing for Shared Scan
- Define a job J (e.g., MapReduce or Pig)
  - J scans files f(J) = (F1, ..., Fi); the scan time per file is s(Fi)
  - J has a fixed processing cost c(J)
- d(J) defines a soft deadline for each job
  - Step: d is defined by n pairs (ti, pi) where 0 < ti < ti+1 and pi > pi+1; a job that completes by ti is awarded pi points
  - Linear decay: enforce eventual completion with negative points
- Cost of a shared scan for jobs J1 and J2:
  c(J1) + c(J2) + Σ_{F ∈ f(J1) ∪ f(J2)} s(F)
- Maximize points and minimize resources
  - Local improvement with simulated annealing, greedy ordering
  - Aggressive batching when load is high

Performance Evaluation
Experimental setup
- Nova with the Shared Scan module
- 200-node Hadoop cluster
  - 128 MB HDFS block size
  - 1 GB RAM per node
  - 640 mapper and 320 reducer slots
- Shingling workflow (offline content enrichment)
  - De-duplication of news content
  - Filter and extract features from content
  - Cluster content by feature and pick one item per cluster
  - Execute multiple de-dup workflows using different clustering algorithms
- Scheduling strategies compared
  - Sequential-NoMerge (slower, conserves Grid resources)
  - Concurrent-NoMerge (fast, elastic Grid resources)
  - Merged (fast, conserves Grid resources)
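The shared-scan cost c(J1) + c(J2) + Σ s(F) and the step-deadline reward defined on the "Optimizing for Shared Scan" slide can be sketched in a few lines of Python. This is a minimal illustration, not Nova's implementation: the job dictionaries, file names, scan times, and deadline pairs below are all hypothetical inputs.

```python
# Sketch of the shared-scan cost model (illustrative names, not Nova's API).

def shared_scan_cost(jobs, scan_time):
    """Cost of running a batch of jobs as one merged scan:
    the sum of fixed processing costs c(J), plus one scan s(F)
    per *distinct* file in the union of the jobs' file sets."""
    files = set()
    for j in jobs:
        files.update(j["files"])
    return sum(j["cost"] for j in jobs) + sum(scan_time[f] for f in files)

def step_reward(deadline_steps, finish_time):
    """Step deadline d(J): pairs (t_i, p_i) with t_i increasing and
    p_i decreasing; a job finishing by t_i earns p_i points, else 0."""
    for t, p in deadline_steps:
        if finish_time <= t:
            return p
    return 0

# Hypothetical jobs: fixed processing cost c(J) and scanned files f(J).
scan_time = {"F1": 10.0, "F2": 20.0, "F3": 5.0}
j1 = {"cost": 4.0, "files": {"F1", "F2"}}
j2 = {"cost": 6.0, "files": {"F2", "F3"}}

separate = shared_scan_cost([j1], scan_time) + shared_scan_cost([j2], scan_time)
merged = shared_scan_cost([j1, j2], scan_time)
print(separate, merged)  # 65.0 45.0 -- merging saves one scan of shared file F2
```

Merging the two jobs saves exactly one scan of the shared file F2 (20.0 time units here); the greedy ordering with local improvement described on the same slide weighs this kind of merged-vs-separate cost saving against the deadline points lost by delaying whichever job would otherwise have run first.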