Chapter 9 – Cloud Resource Management and Scheduling

Contents
1. Policies and mechanisms for resource management.
2. Cloud resource utilization.
3. Resource management and dynamic application scaling.
4. Control theory and optimal resource management.
5. A two-level resource allocation architecture.
6. Feedback control based on dynamic thresholds.
7. Coordination of power and performance management.
8. A utility-based model for cloud-based Web services.
9. Scheduling algorithms for computer clouds.
10. Delay scheduling.
11. Data-aware scheduling.
12. Apache capacity scheduler.
13. Start-up fair queuing.
14. Borrowed virtual time.

Policies and mechanisms for resource management
◼ Resource management → a critical function of any man-made system. Such systems have finite resources that must be shared among different system components.
◼ Policies and mechanisms for resource allocation:
  Policy → the principles guiding decisions.
  Mechanism → the means to implement a policy.
◼ Resource management affects the three basic criteria for system evaluation:
  Functionality → whether the system functions according to its specification.
  Performance → whether the system design performance criteria are met.
  Cost → whether the cost of building and maintaining the system meets specifications.
◼ Scheduling in a computing system → deciding how to allocate the resources of a system, such as CPU cycles, memory, secondary storage space, and I/O and network bandwidth, among users and tasks.

Challenges for cloud resource management
◼ CRM, cloud resource management:
  Requires complex policies and decisions for multi-objective optimization.
  Is affected by unpredictable interactions with the environment, e.g., system failures and attacks.
  Cloud service providers face large, fluctuating loads which challenge the claim of cloud elasticity.
◼ Effective CRM is extremely challenging:
  The scale of the cloud infrastructure → makes it impossible to have accurate global state information.
  The interactions of the system with a large user population → make it nearly impossible to predict the type and the intensity of the system workload.
◼ The strategies for resource management for IaaS, PaaS, and SaaS are different.

CRM policies
1. Admission control → prevent the system from accepting workload in violation of high-level system objectives.
2. Capacity allocation → allocate resources for individual activations of a service.
3. Load balancing → distribute the workload evenly among the servers.
4. Energy optimization → minimize energy consumption.
5. Quality of service (QoS) guarantees → the ability to satisfy timing or other conditions specified by a Service Level Agreement (SLA).

CRM mechanisms
◼ Control theory → uses feedback to guarantee system stability and to predict transient behavior.
◼ Machine learning → does not need a performance model of the system.
◼ Utility-based → requires a performance model and a mechanism to correlate user-level performance with cost.
◼ Market-oriented/economic → does not require a model of the system, e.g., combinatorial auctions for bundles of resources.
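The policy/mechanism separation can be made concrete in code. A minimal sketch, with all names illustrative rather than from the text: a generic dispatch mechanism parameterized by interchangeable admission-control policies.

```python
# Sketch: one dispatch mechanism, two interchangeable admission policies.

def utilization_cap_policy(load, capacity, cap=0.8):
    """Policy: admit only while projected utilization stays under a cap."""
    return (load + 1) / capacity <= cap

def always_admit_policy(load, capacity):
    """Policy: no admission control at all."""
    return True

def dispatch(requests, capacity, admit):
    """Mechanism: walk the request stream, admitting or rejecting per policy."""
    load, admitted, rejected = 0, [], []
    for r in requests:
        if admit(load, capacity):
            load += 1
            admitted.append(r)
        else:
            rejected.append(r)
    return admitted, rejected

admitted, rejected = dispatch(range(100), capacity=100, admit=utilization_cap_policy)
print(len(admitted), len(rejected))   # 80 20
```

Swapping `utilization_cap_policy` for `always_admit_policy` changes the decisions without touching the mechanism; that is exactly the separation the slide describes.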
Tradeoffs
Figure: the normalized performance and the energy consumption as functions of the processor speed; the performance decreases at a lower rate than the energy when the clock rate decreases.
◼ To reduce cost and save energy we may need to concentrate the load on fewer servers rather than balance the load among them.

Cloud resource utilization and energy efficiency
◼ Energy use for computing scales linearly with the number of computing devices. Indeed, the performance growth rate and the improvements in electrical efficiency almost cancel out:
  According to Moore's Law the number of transistors on a chip, and thus the computing power of microprocessors, doubles every 1.5 years.
  The electrical efficiency of computing devices also doubles about every 1.5 years.
◼ The energy consumption of cloud data centers is growing, has a significant ecological impact, and affects the cost of cloud services.
◼ Cloud service costs are affected by energy costs. Example: the costs for two AWS regions, US East and South America, are $2,604/year versus $5,632/year upfront; hourly costs are $0.412 versus $0.724. Higher energy and communication costs are responsible for the difference in this example; the energy costs for the two regions differ by about 40%.

Cloud elasticity and overprovisioning
◼ Elasticity → additional resources are guaranteed to be allocated when an application needs them, and these resources are released when they are no longer needed. The user ends up paying only for the resources actually used.
◼ Overprovisioning → a cloud service provider has to invest in a larger infrastructure than the typical cloud workload warrants. It follows that the average cloud server utilization is low.
◼ Elasticity is based on overprovisioning and on two assumptions:
  There is an effective admission control mechanism.
  The likelihood of all running applications dramatically increasing their resource consumption at the same time is extremely low.
◼ Performance per Watt (PWP) → a common measure of energy efficiency. Low server utilization negatively affects PWP and the ecological impact of cloud computing.
◼ Conclusion → overprovisioning is not economically sustainable.

Figure: power usage and energy efficiency versus the percentage of system utilization, with the typical operating region marked. Even when power requirements scale linearly with the load, the energy efficiency of a computing system is not a linear function of the load. Even when idle, a system may use 50% of the power corresponding to the full load. The typical operating region for the servers at a data center is from about 10% to 50% of the load.

Energy efficiency and energy-proportional systems
◼ An energy-proportional system consumes no power when idle, very little power under a light load, and more power as the load increases. Such a system always operates at 100% efficiency.
◼ The dynamic range is determined by the lower and the upper limits of the device power consumption. A large dynamic range means that the device is able to operate at a low fraction of its peak power when its load is low.
◼ Different subsystems of a computing system behave differently in terms of energy efficiency.
◼ Processors used in cloud servers consume less than one third of their peak power at very low load and have a dynamic range of more than 70% of peak power. Processors used in mobile and/or embedded systems are better in this respect.
◼ Example: a 2.4 GHz Intel Q6600 processor with 4 GB of RAM consumes 110 W when idle and 175 W when fully loaded.
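The nonlinearity is easy to see numerically. A minimal sketch, assuming a linear power model with idle power equal to 50% of peak, consistent with the figure above; all wattage values are illustrative.

```python
# Sketch: energy efficiency under a linear power model with high idle power.
# Assumes power(u) = P_IDLE + (P_PEAK - P_IDLE) * u, with P_IDLE = 0.5 * P_PEAK.

P_PEAK, P_IDLE = 100.0, 50.0   # watts (illustrative values)

def power(u):
    """Power draw at utilization u in [0, 1]."""
    return P_IDLE + (P_PEAK - P_IDLE) * u

def efficiency(u):
    """Useful work per watt, normalized so that u = 1 gives 1.0."""
    return (u / power(u)) * P_PEAK

for u in (0.1, 0.3, 0.5, 1.0):
    print(f"utilization {u:4.0%}: power {power(u):5.1f} W, efficiency {efficiency(u):.2f}")
# utilization  10%: power  55.0 W, efficiency 0.18
# utilization  30%: power  65.0 W, efficiency 0.46
# utilization  50%: power  75.0 W, efficiency 0.67
# utilization 100%: power 100.0 W, efficiency 1.00
```

In the typical 10% to 50% operating region the servers deliver well under 70% of their peak efficiency, which is the argument behind consolidating load on fewer servers.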
Energy saving
◼ The alternative to a resource management policy where servers are always on, regardless of their load, is to develop energy-aware load balancing and scaling policies. Such policies combine dynamic power management with load balancing; they attempt to identify servers operating outside their optimal energy regime and decide if and when those servers should be switched to a sleep state, or what other actions should be taken to optimize the energy consumption.
◼ Energy optimization cannot be considered in isolation; it has to be coupled with admission control, capacity allocation, load balancing, and quality of service.
◼ Existing mechanisms cannot support concurrent optimization of all these policies: mechanisms based on a solid foundation such as control theory are too complex and do not scale well; those based on machine learning are not fully developed; and the others require a model of a system with a dynamic configuration operating in a fast-changing environment.

Resource management and dynamic scaling
◼ Two application scaling strategies:
  Vertical scaling → keeps the number of VMs of an application constant but increases the amount of resources allocated to each one of them. This can be done either by migrating the VMs to more powerful servers, or by keeping the VMs on the same servers but increasing their share of the CPU time. The first alternative involves additional overhead: the VM is stopped, a snapshot of it is taken, the file is transported to a more powerful server, and, finally, the VM is restarted at the new site.
  Horizontal scaling → the common scaling strategy on a cloud; it is done by increasing the number of VMs as the load increases and reducing this number when the load decreases. Often this leads to an increase of the communication bandwidth consumed by the application. Load balancing among the running VMs is critical for this mode of operation.
◼ An application should be designed to support scaling:
  Workload partitioning of a modularly divisible application is static.
  The workload of an arbitrarily divisible application can be partitioned dynamically.

Control theory and optimal CRM
◼ Control theory has been used to design adaptive resource management for several classes of applications, including power management, task scheduling, QoS adaptation in web servers, and load balancing.
◼ Classical feedback control methods are used to regulate the key operating parameters of the system based on measurements of the system output.
◼ The main components of a cloud control system:
  Inputs → the offered workload and the policies for admission control, capacity allocation, load balancing, energy optimization, and QoS guarantees.
  Control system components → sensors used to estimate relevant measures of performance, and controllers which implement the policies.
  Outputs → the resource allocations to the individual applications.
◼ The feedback control assumes a linear time-invariant system model and a closed-loop controller. This controller is based on an open-loop system transfer function which satisfies stability and sensitivity constraints.
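To make the closed loop concrete, a minimal sketch, not from the text, of a discrete-time feedback loop: the "plant" is a queue and a proportional controller adjusts the allocated capacity u(k) from the measured queue length q(k), driving it toward a target. All parameters are illustrative.

```python
# Sketch: a discrete-time closed loop in the spirit of the cloud controller.
# Plant: q(k+1) = q(k) + arrivals - u(k); controller: proportional feedback.

TARGET_Q = 50.0     # desired queue length
GAIN     = 0.5      # proportional gain; too large a gain causes oscillation

def controller(q):
    """Proportional feedback: adjust capacity from the tracking error."""
    error = q - TARGET_Q
    return max(0.0, 100.0 + GAIN * error)   # baseline capacity 100 req/s

q = 200.0                                   # initial backlog
for k in range(10):
    arrivals = 100.0                        # offered load (constant here)
    u = controller(q)
    q = max(0.0, q + arrivals - u)
    print(f"k={k}: capacity={u:6.1f}, queue={q:6.1f}")
```

With this gain the tracking error halves each step (200 → 125 → 87.5 → ... → 50); with a gain above 2 the same loop overshoots and diverges, a preview of the stability issues discussed next.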
Feedback and stability
◼ Control granularity → the level of detail of the information used to control the system:
  Fine control → very detailed information about the parameters controlling the system state is used.
  Coarse control → the accuracy of these parameters is traded for the efficiency of implementation.
◼ The controllers use the feedback provided by sensors to stabilize the system. Stability is related to the change of the output.
◼ Sources of instability in any control system:
  The delay in getting the system reaction after a control action.
  The granularity of the control: a small change enacted by the controllers may lead to very large changes of the output.
  Oscillations, when the changes of the input are too large and the control is too weak, such that the changes of the input propagate directly to the output.

The structure of a cloud controller
Figure: the external traffic acts as a disturbance; a predictive filter produces a forecast fed to the optimal controller, which computes the optimal input u*(k) driving the queuing dynamics; the state q(k) is fed back to the controller. The controller uses the feedback regarding the current state and the estimate of the future disturbance due to the environment to compute the optimal inputs over a finite horizon; r and s are the weighting factors of the performance index.

A two-level resource allocation architecture
◼ The automatic resource management is based on two levels of controllers, one for the service provider and one for the application:
  Inputs → the offered workload and the policies for admission control, capacity allocation, load balancing, energy optimization, and the QoS guarantees in the cloud.
  System components → sensors used to estimate relevant measures of performance, and controllers implementing the various policies.
  Output → the resource allocations to the individual applications.
◼ It is beneficial to have two types of controllers:
  Application controllers → determine if additional resources are needed.
  Cloud controllers → arbitrate requests for resources and allocate the physical resources.
◼ Design choices: fine versus coarse control; dynamic thresholds based on time averages versus static thresholds; a high and a low threshold versus a high threshold only.

Two-level cloud controller
Figure: applications 1 through n, each with its own SLA and application controller, run in VMs on the cloud platform; the cloud controller closes a monitor-decision-actuator loop around the VMs of each application.

Lessons from the two-level experiment
◼ The actions of the control system should be carried out in a rhythm that does not lead to instability; adjustments should only be carried out after the performance of the system has stabilized.
◼ If an upper and a lower threshold are set, instability occurs when they are too close to one another, the variations of the workload are large, and the time required to adapt does not allow the system to stabilize.
◼ The actions consist of the allocation/deallocation of one or more virtual machines. Sometimes the allocation/deallocation of a single VM required by one of the thresholds may cause the other threshold to be crossed, another source of instability, as the sketch below illustrates.
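A toy simulation, with illustrative parameters, of the single-VM oscillation just described: when the two thresholds are too close, releasing one VM pushes utilization above the high threshold and adding one pushes it back below the low threshold.

```python
# Sketch: oscillation when the two thresholds are too close together.
# Utilization = LOAD / (n_vms * VM_CAPACITY); load is constant here.

LOW, HIGH = 0.58, 0.68           # a band only 10 points wide (too narrow)
VM_CAPACITY, LOAD = 100.0, 280.0

n_vms = 4
for step in range(6):
    util = LOAD / (n_vms * VM_CAPACITY)
    if util > HIGH:
        n_vms += 1               # allocate one VM
    elif util < LOW:
        n_vms -= 1               # release one VM
    print(f"step {step}: utilization={util:.2f} -> vms={n_vms}")
# 4 VMs give 0.70 (> HIGH) and 5 VMs give 0.56 (< LOW): the controller
# flips between 4 and 5 VMs forever. Widening the band (e.g., LOW = 0.50)
# lets the 5-VM operating point sit inside it and the oscillation stops.
```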
More on the application of control theory to CRM
◼ Regulate the key operating parameters of the system based on measurements of the system output.
◼ The system transfer function satisfies stability and sensitivity constraints.
◼ A threshold → the value of a parameter related to the state of a system that triggers a change in the system behavior.
◼ Thresholds → used to keep critical parameters of a system in a predefined range.
◼ Two types of policies:
  1. Threshold-based → upper and lower bounds on performance trigger adaptation through resource reallocation; such policies are simple and intuitive but require setting per-application thresholds.
  2. Sequential decision → based on Markovian decision models.

Feedback control based on dynamic thresholds
◼ Algorithm:
  Compute the integral values of the high and the low thresholds as averages of the maximum and, respectively, the minimum of the processor utilization over the process history.
  Request additional VMs when the average value of the CPU utilization over the current time slice exceeds the high threshold.
  Release a VM when the average value of the CPU utilization over the current time slice falls below the low threshold.
◼ Conclusions:
  Dynamic thresholds perform better than static ones.
  Two thresholds are better than one.

Coordination of power and performance management
◼ Use separate controllers/managers for the two objectives.
◼ Identify a minimal set of parameters to be exchanged between the two managers.
◼ Use a joint utility function for power and performance.
◼ Set up a power cap for individual systems based on the utility-optimized power management policy.
◼ Use a standard performance manager, modified only to accept input from the power manager regarding the frequency determined according to the power management policy.
◼ Use standard software systems.

Communication between autonomous managers
Figure: the performance manager and the power manager, each with its own control policy, exchange performance data and power data; a workload generator feeds the workload distribution to the blades and the power manager controls the power assignment to the blades. Autonomous performance and power managers cooperate to ensure the prescribed performance and energy optimization; they are fed with performance and power data and implement the performance and power management policies.

Utility-based model for cloud-based web services
◼ A service level agreement (SLA) → specifies the rewards as well as the penalties associated with specific performance metrics.
◼ The SLA for cloud-based web services uses the average response time to reflect the quality of service.
◼ We assume a cloud providing K different classes of service, each class k involving Nk applications.
◼ The system is modeled as a network of queues with multi-queues for each server.
◼ A delay center models the think time of the user between the completion of service at one server and the start of processing at the next server.
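The SLA utility of this model maps the response time to a reward or a penalty, as the next figure shows. A minimal sketch of such a stepwise U(R); the breakpoints R0 < R1 < R2 and the reward/penalty levels are illustrative, since the text specifies only the shape.

```python
# Sketch: stepwise SLA utility U(R) whose reward/penalty level changes at
# response times R0 < R1 < R2 (all numeric values are illustrative).

R0, R1, R2 = 0.1, 0.5, 1.0          # seconds
LEVELS     = [1.0, 0.5, 0.0, -1.0]  # reward levels, then the penalty

def utility(r):
    """Reward for fast responses, penalty beyond the SLA limit R2."""
    for bound, level in zip((R0, R1, R2), LEVELS):
        if r <= bound:
            return level
    return LEVELS[-1]

for r in (0.05, 0.3, 0.8, 2.0):
    print(f"R={r}: U={utility(r)}")
# R=0.05: U=1.0   R=0.3: U=0.5   R=0.8: U=0.0   R=2.0: U=-1.0
```

A quadratic approximation such as the dotted line in the figure is often substituted for the steps to make the optimization problem smooth.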
Utility function U(R)
Figure: the utility function U(R), ranging from reward (positive) to penalty (negative) as the response time R grows. U(R) is a series of step functions with jumps at the response times R = R0, R1, R2, where the reward and the penalty levels change according to the SLA. A dotted line shows a quadratic approximation of the utility function.

Figure: (a) the utility function: vk, the revenue (or the penalty), as a function of the response time rk for a request of class k; the revenue is bounded by vkmax and the response time by rkmax. (b) A network of multiqueues with servers Si1 to Si6.

The model requires a large number of parameters.

Cloud scheduling algorithms
◼ Scheduling → responsible for resource sharing at several levels:
  A server can be shared among several virtual machines.
  A virtual machine could support several applications.
  An application may consist of multiple threads.
◼ A scheduling algorithm should be efficient, fair, and starvation-free.
◼ The objectives of a scheduler:
  Batch system → maximize throughput and minimize turnaround time.
  Real-time system → meet the deadlines and be predictable.
◼ Best-effort applications: batch applications and analytics.
◼ Common algorithms for best-effort applications:
  Round-robin.
  First-Come-First-Serve (FCFS).
  Shortest-Job-First (SJF).
  Priority algorithms.

More on cloud scheduling algorithms
◼ Multimedia applications (e.g., audio and video streaming):
  Have soft real-time constraints.
  Require statistically guaranteed maximum delay and throughput.
◼ Real-time applications have hard real-time constraints.
◼ Scheduling algorithms for real-time applications (see the EDF sketch at the end of this group of slides):
  Earliest Deadline First (EDF).
  Rate Monotonic Algorithms (RMA).
◼ Algorithms for integrated scheduling of several classes of applications:
  Resource Allocation/Dispatching (RAD).
  Rate-Based Earliest Deadline (RBED).

Requirements imposed by the three classes of scheduling policies:

  Policy              Quantity                   Timing
  Best-effort         loose                      loose
  Soft-requirements   statistically guaranteed   statistically guaranteed
  Hard-requirements   strict                     strict

Best-effort policies → do not impose requirements regarding either the amount of resources allocated to an application or the timing when an application is scheduled.
Soft-requirements policies → require statistically guaranteed amounts of resources and timing constraints.
Hard-requirements policies → demand strict timing and precise amounts of resources.

Delay scheduling
◼ How can we simultaneously ensure fairness and maximize resource utilization without compromising locality and throughput for Big Data applications running on large computer clusters? This was one of the questions faced early in the cloud computing era by Facebook, Yahoo, and other large IT service providers.
◼ Example: each Hadoop job consists of multiple Map and Reduce tasks, and the question is how to allocate resources to the tasks of newly submitted jobs.
◼ Hadoop approach:
  The job tracker of the Hadoop master manages a number of slave servers running under the control of task trackers, with slots for Map and Reduce tasks.
  A FIFO scheduler with five priority levels assigns slots to tasks based on their priority. The fewer the tasks of a job already running in slots across all servers, the higher the priority of the job's remaining tasks.
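As referenced above, a minimal sketch of Earliest Deadline First: at each dispatch point, run the ready task with the closest absolute deadline. The task set is illustrative, and the sketch is non-preemptive for brevity; real EDF preempts the running task when an earlier deadline arrives.

```python
# Sketch: non-preemptive Earliest Deadline First over an illustrative task set.
import heapq

# (release time, absolute deadline, name, execution time)
tasks = [(0, 10, "t1", 3), (2, 6, "t2", 2), (3, 15, "t3", 4)]

time, ready = 0, []
pending = sorted(tasks)                       # ordered by release time
while pending or ready:
    while pending and pending[0][0] <= time:  # move released tasks to ready
        r, d, name, c = pending.pop(0)
        heapq.heappush(ready, (d, name, c))   # heap ordered by deadline
    if not ready:
        time = pending[0][0]                  # idle until the next release
        continue
    d, name, c = heapq.heappop(ready)         # earliest deadline first
    print(f"t={time}: run {name} for {c} (deadline {d})")
    time += c
# t=0: run t1 for 3 (deadline 10)
# t=3: run t2 for 2 (deadline 6)
# t=5: run t3 for 4 (deadline 15)
```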
Challenges of priority-based mechanisms
◼ Priority-based allocation does not consider data locality, namely the need to place tasks close to their input data. Locality affects the throughput:
  Server locality, i.e., getting data from the local server, is significantly better in terms of time and overhead than rack locality, i.e., getting input data from a different server in the same rack.
  The network bandwidth in a large cluster is considerably lower than the disk bandwidth; also, the latency of local data access is much lower than the latency of remote disk access.
◼ In steady state, priority scheduling tends to assign the same slot repeatedly to the next task(s) of the same job: as one of the job's tasks completes execution, its priority decreases and the available slot is allocated to the next task of the same job.
◼ Priority scheduling favors the occurrence of sticky slots!

Task locality and average job locality
◼ A task assignment satisfies the locality requirement if the input data of the task are stored on the server hosting the slot allocated to the task.
◼ How should a fair scheduler operate on a shared cluster? What number n of slots of a shared cluster should the scheduler allocate to a job, assuming that the tasks of all jobs take an average of T seconds to complete?
◼ A sensible answer: the scheduler should provide enough slots that the response time on the shared cluster equals the completion time of the job on a fictitious private cluster where n slots are available for the n tasks of the job as soon as the job arrives.

Delay scheduling, a counterintuitive policy
◼ This policy delays scheduling the tasks of a new job for a relatively short time to address the conflict between fairness and locality.
◼ The policy skips a task of the job at the head of the priority queue if the input data are not available on the server where the slot is located, and repeats this process up to D times; a sketch of the algorithm follows the HFS design goals below.
◼ Delay scheduling performs well when most tasks are short relative to the job duration and when a running task can read a given data block from multiple locations.
◼ Results for workloads at Yahoo and Facebook show an almost doubling of the throughput under the new policy, while ensuring fairness.

Hadoop fair scheduler (HFS)
◼ HFS design goals:
  1. Fair sharing at the level of users rather than jobs. This requires two-level scheduling: the first level allocates task slots to pools of jobs using a fair-sharing policy; at the second level, each pool allocates its slots to the jobs in the pool.
  2. User-controlled scheduling: the second-level policy can be either FIFO or fair sharing of the slots in the pool.
  3. Predictable turnaround time: each pool has a guaranteed minimum share of slots. To accomplish this goal HFS defines a minimum share timeout and a fair share timeout; when the corresponding timeout occurs, it kills buggy jobs or tasks taking a very long time.
◼ Instead of using a minimum skip count, HFS uses a wait time to determine how long a job waits before allocating a slot to its next ready-to-run task.
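As promised above, a minimal sketch of the delay-scheduling idea. The data structures are simplified and the names illustrative: D is the maximum number of scheduling opportunities a job may skip, and jobs are assumed to be kept in fair-share order.

```python
# Sketch of delay scheduling: when a slot frees up on a server, skip the
# head-of-queue job up to D times if it has no local input data there.

D = 3   # maximum number of scheduling opportunities a job may skip

def assign_slot(server, jobs, skip_count):
    """jobs: list in fair-share order; skip_count: per-job skips so far.
    Returns the job assigned to the free slot on `server`, or None."""
    for job in jobs:
        jid = job["id"]
        if server in job["input_servers"]:   # node-local data available
            skip_count[jid] = 0
            return job
        if skip_count[jid] >= D:             # waited long enough:
            skip_count[jid] = 0
            return job                       # launch non-locally
        skip_count[jid] += 1                 # skip now, hope for locality
    return None

jobs = [{"id": "j1", "input_servers": {"s2", "s3"}},
        {"id": "j2", "input_servers": {"s1"}}]
skips = {"j1": 0, "j2": 0}
for _ in range(5):
    job = assign_slot("s1", jobs, skips)
    print(job["id"] if job else None, skips)
# j2 runs locally while j1 waits; after D skips j1 launches non-locally.
```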
HFS operation
◼ HFS creates a sorted list of jobs ordered according to its scheduling policy.
◼ It scans down this list to identify the job allowed to schedule a task next; within each pool it applies the pool's internal scheduling policy.
◼ Pools missing their minimum share are placed at the head of the sorted list; the other pools are ordered so as to achieve weighted fair sharing.
◼ A job starts at locality level 0 and can only launch node-local tasks.
◼ After at least W1 seconds the job advances to level 1 and may also launch rack-local tasks; after a further W2 seconds it goes to level 2 and may also launch off-rack tasks.
◼ If a job launches a task with better locality than the level it is on, it goes back down to the corresponding lower level.

9.11 Data-aware scheduling

Apache capacity scheduler

Fair queuing and start-time fair queuing
Fair queuing schedules multiple flows through a switch. The transmission of packet i of a flow a can only start after the packet has arrived and the transmission of the previous packet of the flow has finished. With R(t) the round number at time t, t_a(i) the arrival time of packet i of flow a, P_a(i) its transmission time, and S_a(i) and F_a(i) its start and finish tags:
  F_a(i) = S_a(i) + P_a(i), where
  (a) S_a(i) = R(t_a(i)) when the new packet arrives after the previous one has finished, and
  (b) S_a(i) = F_a(i-1) when the new packet arrives before the previous one has finished.

Start-time fair queuing (SFQ)
◼ Organize the consumers of the CPU bandwidth in a tree structure; the root node is the processor and the leaves of the tree are the threads of each application (a sketch of the tag computation follows the BVT overview below).
◼ When a virtual machine is not active, its bandwidth is reallocated to the other VMs active at the time.
◼ When one of the applications of a virtual machine is not active, its allocation is transferred to the other applications running on the same VM.
◼ If one of the threads of an application is not runnable, its allocation is transferred to the other threads of the application.

Figure: the SFQ tree for scheduling when two virtual machines, VM1 (weight 1) and VM2 (weight 3), run on a powerful server; VM1 supports applications A1 (weight 3), with threads t1,1, t1,2, and t1,3 (weight 1 each), and A2 (weight 1), with thread t2 (weight 1); VM2 supports A3 (weight 1) with vs1, vs2, and vs3 (weight 1 each).

Borrowed virtual time (BVT)
◼ Objective: support low-latency dispatching of real-time applications and weighted sharing of the CPU among several classes of applications.
◼ A thread i has:
  an effective virtual time, Ei;
  an actual virtual time, Ai;
  a virtual time warp, Wi.
◼ The scheduler thread maintains its own scheduler virtual time (SVT), defined as the minimum actual virtual time of any thread.
◼ The threads are dispatched in the order of their effective virtual time, a policy called Earliest Virtual Time (EVT).
◼ Context switches are triggered by events such as:
  the running thread blocks waiting for an event to occur;
  the time quantum expires;
  an interrupt occurs;
  a thread becomes runnable after sleeping.

Figure: (top) the virtual start-up time and the virtual finish time as functions of the real time t for each activation of threads a and b; thread b is suspended at t = 24 and reactivated at t = 48. (Bottom) the virtual time of the scheduler, v(t), as a function of the real time.
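As noted above, here is a minimal sketch of SFQ-style tag computation, flattened to a single level rather than the full tree: each activation gets a start tag S equal to its previous finish tag (or the virtual time, if the thread just woke up) and a finish tag F = S + q/w for quantum q and weight w; the scheduler always runs the thread with the smallest start tag.

```python
# Sketch: single-level start-time fair queuing over three runnable threads.
# S = previous F (all threads stay runnable here); F = S + QUANTUM / weight;
# dispatch min S; the virtual time tracks the running thread's start tag.
import heapq

QUANTUM = 10.0
threads = {"t1": 1.0, "t2": 1.0, "t3": 2.0}   # name -> weight

v = 0.0                                       # virtual time
queue = [(0.0, name) for name in threads]     # (start tag, thread)
heapq.heapify(queue)

for _ in range(8):
    s, name = heapq.heappop(queue)            # thread with min start tag
    v = s                                     # advance virtual time
    f = s + QUANTUM / threads[name]           # finish tag
    print(f"run {name}: S={s:5.1f}, F={f:5.1f}")
    heapq.heappush(queue, (f, name))          # start tag of next activation
```

Because its finish tags advance half as fast, t3 (weight 2) is dispatched twice as often as t1 or t2, which is exactly the weighted sharing the tree in the figure generalizes level by level.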
Figure: the effective virtual time versus the real time (in mcu) of threads a (solid line) and b (dotted line), with weights wa = 2 wb, when the actual virtual time is incremented in steps of 90 mcu.

Figure: the effective virtual time versus the real time of threads a (solid line), b (dotted line), and c, a thread with real-time constraints (thick solid line). Thread c wakes up periodically at times t = 9, 18, 27, 36, ..., is active for 3 units of time, and has a time warp of 60 mcu.
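A toy BVT dispatcher in the spirit of these figures; the parameters are illustrative and this is not an exact reproduction of the plots. The effective virtual time of a warped thread is E = A - W, so a freshly woken real-time thread "borrows" virtual time and is dispatched ahead of the others.

```python
# Sketch: borrowed virtual time dispatching. E = A - (W if warped else 0);
# run the thread with the earliest E (EVT), then advance its actual virtual
# time A by QUANTUM / weight. Illustrative parameters, not the book's figure.

QUANTUM = 90.0   # mcu

class Thread:
    def __init__(self, name, weight, warp=0.0):
        self.name, self.weight, self.warp = name, weight, warp
        self.A = 0.0            # actual virtual time
        self.warped = False

    def E(self):                # effective virtual time
        return self.A - (self.warp if self.warped else 0.0)

a = Thread("a", weight=2.0)
b = Thread("b", weight=1.0)
c = Thread("c", weight=1.0, warp=60.0)

c.warped = True                 # c just woke up for a real-time event
for step in range(6):
    t = min((a, b, c), key=Thread.E)   # EVT: earliest effective virtual time
    print(f"dispatch {t.name}: E={t.E():6.1f}")
    t.A += QUANTUM / t.weight
    if t is c:
        c.warped = False        # in this sketch the warp covers one dispatch
```

The warp puts c's effective virtual time at -60 mcu, so it preempts a and b immediately; once its actual virtual time has advanced, the weighted sharing between a and b (2:1) resumes.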