Fault-Tolerant Workflow Scheduling Using Spot Instances on Clouds
Deepak Poola, Kotagiri Ramamohanarao, and Rajkumar Buyya
Cloud Computing and Distributed Systems (CLOUDS) Laboratory
Department of Computing and Information Systems, The University of Melbourne
Email: deepakc@student.unimelb.edu.au, {kotagiri,rbuyya}@unimelb.edu.au
ICCS-2014, Cairns, Australia

Cloud Computing
• Offers resources as a subscription-based service
• Highly scalable
• Highly available
• Driven by market principles
• Dynamically configured and delivered on demand
• Different pricing models

Benefits of Cloud Computing
• Scalability or elasticity
• On-demand resource provisioning
• Wide range of resource types
• Pay-as-you-go model
• Attractive cost models
• Illusion of unlimited resources
• Cheaper and faster storage facilities
• Plethora of tools for ease of use
  – Content delivery
  – Networking
  – Deployment and management
  – Monitoring

Spot Instances
• Started by Amazon around December 2009
• Idle or unused datacenter capacity
• Spot price is decided by an auction-like mechanism
• Spot price varies with time and instance type
• Spot price varies between regions and availability zones
• The bid must be higher than or equal to the spot price
• Offers up to 60% cost reductions

Workflows
• Scientific workflow systems aim at automating large, complex data analysis to make it easier for scientists
• Workflows are collections of tasks that are data dependent or control dependent
• Workflows can be represented as a Directed Acyclic Graph (DAG)
• Workflow scheduling maps tasks to resources whilst maintaining dependencies
Jargon
  – Makespan
  – Deadline
  – Cost
  – Budget
Figure: Sample Workflow

Research overview
• Just-in-time and adaptive scheduling heuristic
• Using spot and on-demand instances
• An intelligent bidding strategy
• Minimizing the execution cost
• Providing a robust schedule
• Satisfying the deadline constraint

Background
• A workflow is represented as a DAG
• Makespan is the total elapsed time
• Pricing models
  – On-Demand
  – Spot
• Critical Path (CP) is the longest path from the start node to the exit node

Latest Time to On-Demand (LTO)
• It is the latest time at which the algorithm has to switch to on-demand instances to satisfy the deadline constraint
Figure: timeline with the labels Start, Spot Instances, LTO, On-Demand, Deadline

System Model

Runtime Estimation
• We use Downey's analytical model
• Downey's model requires:
  – the task's average parallelism, A
  – the coefficient of variance of parallelism, σ
  – the task length
  – the number of cores
• The model of Cirne et al. is used to generate A and σ

Failure Estimator
• Estimates the failure probability of a particular bid price
• Based on the spot price
• The price history of the prior month is considered
• The total time of the spot price history, HT, is measured
• The total out-of-bid time, OBT_bid_t, is measured

Scheduling Algorithm

Two Types of Scheduling Algorithms
• Conservative: CP and LTO are estimated on the lowest-cost instance.
  – CP is the longest, hence less slack time
  – Uses spot instances cautiously under relaxed deadlines
• Aggressive: CP and LTO are estimated on the highest-cost instance.
  – CP is the smallest, hence more slack time
  – Opts for expensive on-demand instances under failures

Bidding Strategy

Intelligent Bidding Strategy
• Current spot price (p_spot)
• On-demand price (p_OD)
• Failure probability (FP) of the previous bid price
• LTO
• Current time (CT)
• α
• β

Intelligent Bidding Strategy (Contd..)
• α: dictates how much higher the bid value must be above the current spot price
• β: determines how fast the bid value reaches the on-demand price
• FP of the previous bid is used as feedback for the current bid price

Other Bidding Strategies
• On-Demand Bidding Strategy: uses the on-demand price as the bid price
• Naive Bidding Strategy: uses the current spot price as the bid price for the instance

Simulation Setup
• CloudSim was used for simulation
• A LIGO workflow with 1000 tasks was considered
• For on-demand, 9 different VM types were considered
• For spot, 1 VM type was used

Results: Comparison between algorithms
Mean execution cost of algorithms with varying deadline (with 95% confidence interval)

Results: Comparison between bidding strategies
Mean execution cost of bidding strategies with varying deadline (with 95% confidence interval)

Results: Task Failures
Mean of task failures due to bidding strategies

Results: Checkpointing

Conclusion
• Two scheduling heuristics that map workflow tasks onto spot and on-demand instances are presented
• They minimize the execution cost
• They are robust and fault-tolerant towards out-of-bid failures and performance variations
• A bidding strategy that bids intelligently to minimize the cost is presented
• The use of checkpointing is demonstrated, which offers cost savings of up to 14%

© Copyright The University of Melbourne 2009
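
Appendix: Failure Estimator sketch
The Failure Estimator slide measures the total time of the spot price history (HT) and the total out-of-bid time for a bid (OBT). The slides do not state how the two are combined, so the minimal sketch below assumes the failure probability is the ratio OBT / HT. The function name, the (timestamp, price) sample layout, and the piecewise-constant price assumption are all hypothetical and are not taken from the slides.

```python
from typing import List, Tuple


def failure_probability(
    history: List[Tuple[float, float]], end_time: float, bid: float
) -> float:
    """Estimate the fraction of the history window during which `bid` is out of bid.

    history:  (timestamp, spot_price) samples sorted by timestamp. Each price is
              assumed to hold until the next sample (hypothetical data layout).
    end_time: timestamp that closes the history window.
    bid:      candidate bid price.

    Assumption: failure probability = OBT / HT (not stated in the slides).
    """
    if not history:
        raise ValueError("price history is empty")

    total_time = end_time - history[0][0]  # HT: length of the history window
    if total_time <= 0:
        raise ValueError("end_time must be later than the first sample")

    out_of_bid_time = 0.0  # OBT: time during which the spot price exceeded the bid
    for i, (start, price) in enumerate(history):
        stop = history[i + 1][0] if i + 1 < len(history) else end_time
        # A bid stays valid while it is higher than or equal to the spot price.
        if price > bid:
            out_of_bid_time += stop - start

    return out_of_bid_time / total_time


# Toy history: price changes at t = 0, 10, 15, 30; window ends at t = 40.
toy_history = [(0.0, 0.10), (10.0, 0.30), (15.0, 0.12), (30.0, 0.25)]
fp_mid = failure_probability(toy_history, end_time=40.0, bid=0.20)
fp_high = failure_probability(toy_history, end_time=40.0, bid=0.30)
fp_low = failure_probability(toy_history, end_time=40.0, bid=0.05)
```

In the toy history a bid of 0.20 is out of bid for 15 of the 40 time units, a bid equal to the peak price is never out of bid, and a bid below every observed price is always out of bid. In the setting of the slides the history would cover the prior month of spot prices.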