Survey on Programming and Tasking in Cloud Computing Environments
PhD Qualifying Exam
Zhiqiang Ma
Supervisor: Lin Gu
Feb. 18, 2011

Outline
- Introduction
- Approaches
  - Application framework level approach
  - Language level approach
  - Instruction level approach
- Our work: MRlite
- Conclusion

Cloud computing
- Internet services are the most popular applications nowadays
  - Millions of users
  - Computation is large and complex: Google already processed 20 TB of data in 2004
- Cloud computing provides massive computing resources
  - Available on demand
  - A promising model to support processing of large datasets housed on clusters

How to program and task?
- Challenges
  - Parallelizing the execution
  - Scheduling the large-scale distributed computation
  - Handling faults
  - Achieving high performance
  - Ensuring fairness
- Programming models for the Grid
  - Do not automatically parallelize users' programs
  - Pass the fault-tolerance work on to applications

Approaches
We compare approaches at three levels: application framework level, language level, and instruction level, weighing the advantages and disadvantages of each as it is surveyed.

MapReduce
- MapReduce: a parallel computing framework for large-scale data processing
- Successfully used in datacenters comprising commodity computers
- A fundamental piece of software in the Google architecture for many years
- An open-source variant already exists: Hadoop
  - Hadoop and its variants are widely used in solving data-intensive problems

MapReduce: higher-order functions
- Map and Reduce are higher-order functions
  - Map: apply an operation to all elements in a list
  - Reduce: like "fold"; aggregate the elements of a list
- Example: 1^2 + 2^2 + 3^2 + 4^2 + 5^2 = ?
  - m: x -> x^2, r: +, initial value 0
  - Mapping m over 1 2 3 4 5 yields 1 4 9 16 25; folding with r produces the running values 1, 5, 14, 30, 55; the final value is 55

[Figure: MapReduce's data flow]

MapReduce: massive parallel processing made simple
- Example: word count
  - Map: parse a document and generate <word, 1> pairs
  - Reduce: receive all pairs for a specific word, and count them

    Map(document D):
        for each word w in D:
            output <w, 1>

    Reduce(word w, items):
        count = 0
        for each input item:
            count = count + 1
        output <w, count>
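To make the pseudocode above concrete, here is a minimal runnable sketch of the same word count in Python. It mimics the MapReduce dataflow (map, shuffle by key, reduce) within a single process; it illustrates the model only and is not Hadoop's API.

    from collections import defaultdict

    def map_phase(document):
        # Map: parse a document and emit <word, 1> pairs
        for word in document.split():
            yield (word, 1)

    def reduce_phase(word, counts):
        # Reduce: receive all pairs for one word and count them
        return (word, sum(counts))

    def mapreduce_wordcount(documents):
        # Shuffle: group intermediate pairs by key, as the framework would
        groups = defaultdict(list)
        for doc in documents:
            for word, one in map_phase(doc):
                groups[word].append(one)
        return dict(reduce_phase(w, c) for w, c in groups.items())

    print(mapreduce_wordcount(["a rose is a rose", "is a rose"]))
    # {'a': 3, 'rose': 3, 'is': 2}

In a real deployment the map calls, the shuffle, and the reduce calls each run in parallel across many machines; the per-key grouping is what makes that decomposition safe.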
[Figure: MapReduce easily scales up: input files, map phase, intermediate files, reduce phase, output files]

[Figure: MapReduce viewed as input, computation, output]

Dryad
- A general-purpose execution environment for distributed, data-parallel applications
- Concentrates on throughput, not latency
- An application written in Dryad is modeled as a directed acyclic graph (DAG)
- Many programs can be represented as a distributed execution graph

[Figure: a Dryad job: inputs, processing vertices, channels (file, pipe, shared memory), outputs]

Dryad
- Concurrency arises from vertices running simultaneously across multiple machines
- Vertex subroutines are usually quite simple as sequential programs
- Users have control over the communication graph
- Each vertex can have multiple inputs and outputs

Approaches so far
- Application framework level. Advantage: automatically parallelizes users' programs; users are relieved of the details of distributing the execution. Disadvantage: programs must follow the specific model.

Tasking of execution
- Performance
  - Locality is crucial
  - Speculative execution
- Fairness
  - The same cluster is shared by multiple users
  - Small jobs require short response times, while throughput is what matters for big jobs
- Correctness
  - Fault tolerance

Locality and fairness
- Locality is crucial
  - Bandwidth is a scarce resource
  - Input data, with duplicates, are stored in the same cluster that runs the computation
- Fairness
  - Short jobs require short response times
- Locality and fairness conflict with each other

FIFO scheduler in Hadoop
- Jobs wait in a queue in priority order, FIFO by default
- When there are available slots
  - Assign slots to tasks that have local data, in priority order
  - Limit the assignment of non-local tasks to optimize locality

[Figure: FIFO scheduler: a 2-task job and a 1-task job dispatched from the job queue to Nodes 1-4]

FIFO scheduler: locality optimization
- Only dispatch one non-local task at a time

[Figure: a 4-task job and a 1-task job in the queue; Node 4 is far away in the network topology]

Problem: fairness
[Figure: two 3-task jobs contending for the slots on Nodes 1-4]

Problem: response time
[Figure: a small job with only 1 task waits behind two 3-task jobs]

Fair scheduling
- Assign free slots to the job that has the fewest running tasks
- Strict fairness
  - Running jobs get a nearly equal number of slots
  - Small jobs finish quickly

[Figure: fair scheduling spreading the queued jobs across Nodes 1-4]

Problem: locality
[Figure: fair scheduling launching tasks on nodes that do not hold their input data]

Delay scheduling
- Skip a job that cannot launch a local task
- Relax fairness slightly
  - Allow a job to launch non-local tasks if it has been skipped long enough
  - Avoids starvation

[Figure: delay scheduling with per-job skip counts and a threshold of 2; waiting times are short because tasks finish quickly, and the skipped job stays at the head of the queue]
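A minimal sketch of the delay-scheduling decision in Python. The skip-count threshold of 2 follows the figure above; the Job class, its fields, and the returned tuples are illustrative stand-ins, not Hadoop's actual implementation.

    class Job:
        def __init__(self, name, local_nodes):
            self.name = name
            self.local_nodes = local_nodes  # nodes holding this job's input data
            self.skip_count = 0

    SKIP_THRESHOLD = 2  # allow a non-local launch after this many skips

    def assign_task(free_node, job_queue):
        """Delay scheduling: prefer local tasks, but never starve a job."""
        for job in job_queue:  # jobs in fairness order
            if free_node in job.local_nodes:
                job.skip_count = 0
                return (job.name, free_node, "local")
            job.skip_count += 1  # relax fairness slightly: skip this job
            if job.skip_count >= SKIP_THRESHOLD:
                job.skip_count = 0  # skipped long enough: avoid starvation
                return (job.name, free_node, "non-local")
        return None

    queue = [Job("J1", {"node1"}), Job("J2", {"node2", "node3"})]
    print(assign_task("node2", queue))  # J1 skipped once; J2 launches locally
    print(assign_task("node4", queue))  # J1 hits the threshold: non-local launch

The design choice is visible in the trace: a short wait usually buys locality, and the bounded skip count caps how much fairness is sacrificed.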
"Fault" tolerance
- Nodes fail: re-run their tasks
- Nodes are slow (stragglers): run backup tasks (speculative execution)
  - To minimize the job's response time
  - Important for short jobs

Speculative execution
- The scheduler schedules backup executions of the remaining in-progress tasks
- A task is marked as completed whenever either the primary or the backup execution completes
- Improves job response time by 44% according to Google's experiments

Speculative execution mechanism
- Seems a simple problem, but
  - Resources for speculative tasks are not free
  - How to choose the nodes on which to run speculative tasks?
  - How to distinguish stragglers from nodes that are only slightly slower?
  - Stragglers should be found out early

Hadoop's scheduler
- Starts speculative tasks based on a simple heuristic: compare each task's progress to the average
- Assumes a homogeneous environment, in which the default scheduler works well
- The assumption breaks in utility computing, i.e., virtualized "utility computing" environments such as EC2
- How can speculative execution (backup tasks) be performed robustly in heterogeneous environments?

Speculative execution in Hadoop
- When there are no "higher priority" tasks, Hadoop looks for a task to execute speculatively
  - Assumption: there is no cost to launching a speculative task
- Compares each task's progress to the average progress
  - Assumption: nodes perform similarly ("a slow node is faulty"; "nodes that ask for new tasks are fast")
  - In utility computing, nodes may be slightly (2-3x) slower without hurting the response time, and nodes that ask for tasks are not necessarily fast
- Threshold for speculative execution: (average progress score of each category of tasks) - 0.2
  - Tasks beyond the threshold are treated as "equally slow"
  - Candidates are ranked by locality, so the wrong task may be chosen: a 35%-completed, 2x-slower task with data available on an idle node, or a 5%-completed, 10x-slower task?
  - Too many speculative tasks cause thrashing, taking resources away from useful tasks
- Progress score
  - Map: the fraction of input data processed
  - Reduce: three phases (1/3 for each), plus the fraction of data processed within the phase
  - This causes incorrect speculation of reduce tasks: the copy phase takes most of the time but accounts for only 1/3
  - If 30% of the tasks finish quickly and 70% are in the copy phase: average progress score = 30% x 1 + 70% x 1/3 ≈ 53%, so the threshold is 53% - 20% = 33%, and tasks still in the copy phase sit right at it

LATE
- Longest Approximate Time to End
- Principles
  - Rank candidates by the longest time to end: choose the task that hurts the job's response time most; slow nodes can be utilized as long as they do not hurt the response time
  - Only launch speculative tasks on fast nodes: not every node that asks for a task is fast
  - Cap speculative tasks: limit resource contention and thrashing

LATE algorithm
- If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running:
  - Ignore the request if the node's total progress is below SlowNodeThreshold (only launch speculative tasks on fast nodes)
  - Rank the currently running tasks by estimated time left, longest first
  - Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
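The LATE decision above fits in a few lines of Python. The estimate of time left uses the formula from the appendix; the dictionary-based task representation and the concrete threshold values are placeholders (the LATE paper derives its thresholds from percentiles of observed progress rates), so treat this as a sketch of the control flow only.

    def estimated_time_left(task):
        # From the appendix: (1 / progress_score - 1) * execution_time
        return (1.0 / max(task["progress"], 1e-6) - 1.0) * task["exec_time"]

    def late_choose(node_progress, running, num_speculating,
                    SpeculativeCap=10, SlowNodeThreshold=0.25,
                    SlowTaskThreshold=0.005):
        # Cap speculative tasks to limit resource contention and thrashing
        if num_speculating >= SpeculativeCap:
            return None
        # Only launch speculative tasks on fast nodes
        if node_progress < SlowNodeThreshold:
            return None
        # Rank running tasks by estimated time left, longest first
        for task in sorted(running, key=estimated_time_left, reverse=True):
            rate = task["progress"] / task["exec_time"]  # progress per second
            if rate < SlowTaskThreshold:
                return task  # launch a backup copy of this task
        return None

    running = [{"id": "t1", "progress": 0.9, "exec_time": 100},
               {"id": "t2", "progress": 0.3, "exec_time": 100}]
    print(late_choose(node_progress=0.8, running=running, num_speculating=0))
    # picks t2: its rate (0.003) is below the threshold and its estimated
    # time left (~233 s) is the longest

Note how the ranking key is time left rather than locality or raw progress, which is exactly what repairs the two failure cases above.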
Language level approach
- Programming frameworks use a traditional programming language
  - Still not clear and compact enough
  - No special focus on high parallelism on large computing clusters
- A new language instead
  - Clear, compact and expressive
  - Automatically parallelizes "normal" programs
  - A comfortable way for users to think about data-processing problems on large distributed datasets

Sawzall
- An interpreted, procedural, high-level programming language
- Exploits high parallelism
- Automates the analysis of very large data sets
- Gives users a way to clearly and expressively design distributed data-processing programs

Overall flow
- Filtering (the map step)
  - Analyzes each record individually
  - Expressed in Sawzall
- Aggregation (the reduce step)
  - Collates and reduces the intermediate values
  - Predefined aggregators

An example
- Find the most-linked-to page of each domain
- The aggregator keeps the highest value; it stores a url, indexed by domain, weighted by pagerank:

    max_pagerank_url: table maximum(1)[domain: string] of url: string
        weight pagerank: int;
    doc: Document = input;
    emit max_pagerank_url[domain(doc.url)] <- doc.url
        weight doc.pagerank;

- input is a pre-defined variable initialized by Sawzall and interpreted as the Document type
- emit sends an intermediate value to the aggregator

Unusual features
- Sawzall runs on one record at a time
  - Nothing in the language lets one input record influence another
- The emit statement is the only output primitive
- An explicit line between filtering and aggregation
  - Enables a high degree of parallelism, even though it is hidden from the language

Approaches so far
- Language level. Advantage: clearer and more expressive; a comfortable way of programming. Disadvantage: an even more restrictive programming model.

Instruction level approach
- Provides instruction-level abstractions and compatibility to users' applications
- May choose a traditional ISA such as x86/x86-64
  - Runs traditional applications without any modification
  - Easier to migrate applications to cloud computing environments

Amazon Elastic Compute Cloud (EC2)
- Provides virtual machines running traditional OSes
  - Traditional programs can work on EC2
- Amazon Machine Image (AMI)
  - Used to boot instances
  - The unit of deployment: a packaged-up environment
- Users design and implement the application logic in an AMI; EC2 handles the deployment and resource allocation

vNUMA
- A virtual shared-memory multiprocessor machine built from commodity workstations
- Makes the computational power available to legacy applications and OSes

[Figure: several physical machines (PMs) virtualized into one vNUMA virtual machine]

Architecture
- A hypervisor runs on each node
- CPU: virtual CPUs are mapped to real CPUs on the nodes
- Memory: divided between the nodes in equal-sized portions; each node manages a subset of the pages

Memory mapping
- The application reads *a, where a is an address in its virtual memory
- The OS translates a to the VM's physical memory address b
- The VMM maps b to the real physical address c on some node and finds *c
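The two-level translation can be illustrated with a tiny Python sketch: one table stands in for the guest OS mapping from virtual pages to VM-physical pages, a second for the VMM mapping from VM-physical pages to (node, machine-physical) pairs. The page size, the table contents, and fetch_from_node are invented for illustration; vNUMA's real mechanism operates on hardware page tables.

    PAGE = 4096  # assumed page size

    # Guest OS page table: application virtual page -> VM physical page
    guest_page_table = {0x10: 0x2A}
    # VMM mapping: VM physical page -> (node id, machine physical page)
    vmm_page_map = {0x2A: (3, 0x07)}

    def fetch_from_node(node, machine_addr):
        # Placeholder: in vNUMA the VMM services this access, going over
        # the network when the managing node is remote
        return ("node %d" % node, hex(machine_addr))

    def read(virtual_addr):
        vpage, offset = divmod(virtual_addr, PAGE)
        b = guest_page_table[vpage] * PAGE + offset   # a -> b (VM physical)
        node, mpage = vmm_page_map[b // PAGE]         # b -> c (on some node)
        c = mpage * PAGE + offset
        return fetch_from_node(node, c)

    print(read(0x10 * PAGE + 8))  # ('node 3', '0x7008')

The key point the sketch captures is that the guest is unaware of the second mapping: the same load instruction may resolve to local or remote memory.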
Approaches compared

    Approach                     | Advantage                                                       | Disadvantage
    Application framework level  | Automatically parallelizes users' programs; users are relieved | Programs must follow the specific model
                                 | of the details of distributing the execution                   |
    Language level               | Clearer and more expressive; a comfortable way of programming  | An even more restrictive programming model
    Instruction level            | Supports traditional applications                               | Users handle the tasking; hard to scale up

Our work
- Analyzed MapReduce's design and used a case study to probe its limitations
  - One-way scalability
  - Difficult to handle dynamic, interactive and semantics-rich applications
- Designed a new parallelization framework: MRlite
  - Able to scale "up" like MapReduce, and to scale "down" to process moderate-size data
  - Low latency and massive parallelism
  - Small run-time system overhead
- Goal: a general parallelization framework and programming paradigm for cloud computing

Architecture of MRlite
- The MRlite master accepts jobs from clients and schedules them to execute on slaves
- Distributed nodes (slaves) accept tasks from the master and execute them
- Linked with the application, the MRlite client library accepts calls from the application and submits jobs to the master
- High-speed distributed storage stores the intermediate files

[Figure: the application with the MRlite client library, the MRlite master with its scheduler, the slaves, and the high-speed distributed storage, connected by data flows and command flows]

Result
[Figure: build times in seconds for gcc (on one node), mrcc/Hadoop and mrcc/MRlite on the Linux kernel, ImageMagick and Xen tools]
- The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty handling

Conclusion
- Cloud computing needs a general programming framework
  - Cloud computing shall not be a platform that runs just simple OLAP applications; it is important to support complex computation and even OLTP on large data sets
- We designed MRlite: a general parallelization framework for cloud computing
  - Handles applications with complex logic flow and data dependencies
  - Mitigates the one-way scalability problem
  - Able to handle all MapReduce tasks with comparable (if not better) performance
- Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU
  - MRlite respects applications' natural logic flow and data dependencies
  - This modularization of parallelization capability apart from application logic enables MRlite to integrate GPGPU processing very easily (future work)

Thank you!

Appendix

LATE: estimating finish times

    progress rate = progress score / execution time
    estimated time left = (1 - progress score) / progress rate
                        = (1 / progress score - 1) x execution time

The smaller the progress score, the longer the estimated time left.

LATE: the problems it addresses in Hadoop's default scheduler
- Nodes may be slightly (2-3x) slower in "utility computing", which need not hurt the response time, and nodes that ask for tasks are not necessarily fast
- Too many speculative tasks and thrashing
- Ranking candidates by locality, so the wrong tasks may be chosen
- Incorrect speculation of reducers
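As a concrete check of the appendix formula, a few lines of Python (the function name is mine; the arithmetic is the formula's):

    def estimated_time_left(progress_score, execution_time):
        progress_rate = progress_score / execution_time
        return (1 - progress_score) / progress_rate
        # equivalently: (1 / progress_score - 1) * execution_time

    # A task 20% done after 60 s: rate = 0.2/60, so it has about
    # (1 - 0.2) / (0.2/60) = 240 s left
    print(estimated_time_left(0.2, 60))   # 240.0
    # A task 80% done after 60 s is estimated to finish in about 15 s
    print(estimated_time_left(0.8, 60))   # ~15.0

This is why a 5%-completed, 10x-slower task outranks a 35%-completed, 2x-slower one under LATE: its estimated time left dominates.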