Cluster Scheduler
Ytao
2013.5.15

References:
• Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011.
• Multi-agent Cluster Scheduling for Scalability and Flexibility. UC Berkeley Technical Report UCB/EECS-2012-273 (doctoral dissertation).
• Omega: Flexible, Scalable Schedulers for Large Compute Clusters. EuroSys 2013.

Cluster Scheduler (Intro)
• Cloud computing frameworks vary.
• New frameworks will likely continue to emerge, and no single framework will be optimal for all applications.
• Running multiple frameworks on one cluster improves utilization and data sharing.

Problem Statement
• In the face of increasing demand for cluster resources by diverse cluster computing applications, and the growing number of machines in typical clusters, it is a challenge to design cluster schedulers that provide flexible, scalable, and efficient resource allocation.

Cluster Scheduler: Monolithic State Scheduling (MSS)
• Traditional and popular examples: LSF, Condor, Hadoop.
• Concept: a single scheduling agent process makes all scheduling decisions sequentially.
• Usage: the agent takes framework requirements, resource availability, and organizational policies as input, and computes a global schedule for all tasks.

Cluster Scheduler: Monolithic State Scheduling (MSS 2)
• Advantage: optimal scheduling, because decisions are global.
• Challenges:
– Complexity: the agent must capture all framework requirements.
– Scalability: new frameworks keep emerging.
– Frameworks lose their own scheduling optimizations.

Cluster Scheduler (Design Goals)
• Scalability: response time as the number of machines grows.
• Flexibility: handle a heterogeneous mix of jobs.
• Usability and maintainability: easily adapt to new types of jobs and frameworks.
• Fault isolation: minimize dependencies between unrelated jobs.
• Utilization: achieve high cluster resource utilization (e.g., CPU utilization, memory utilization).

Partitioned State Scheduling (PSS)
• In PSS, cluster state is divided among multiple scheduling agents as non-overlapping scheduling domains.
• Statically Partitioned State Scheduling (SPS): cluster resources are statically assigned to particular frameworks.
• Dynamically Partitioned State Scheduling (DPS): Mesos (NSDI 2011).

Replicated State Scheduling (RSS)
• In RSS, scheduling domains may overlap, and optimistic concurrency control is used to resolve conflicting transactions.
• Example: Omega (EuroSys 2013).

Cluster Environment
• Use of commodity servers.
• Tens to hundreds of thousands of servers.
• Heterogeneous resources.
• Use of commodity networks.

Mixed Workloads
• Service jobs vs. terminating jobs.
• Service jobs consist of a set of service tasks that conceptually run forever and are interacted with through request-response interfaces, e.g., a set of web servers or relational database servers.
• Terminating jobs, on the other hand, are given a set of inputs, perform some work as a function of those inputs, and are intended to terminate eventually (traditional HPC cluster management considers only this kind).

Mesos
• Goals:
– Support and demonstrate multi-agent scheduling.
– Support a fair-sharing meta-scheduling policy.
– Increase overall cluster utilization.
– Scale to tens of thousands of machines and hundreds of jobs.
• Mesos aims to provide a scalable and resilient core that enables various frameworks to efficiently share clusters.
• It defines a minimal interface that enables efficient resource sharing across frameworks,
• and pushes control of task scheduling and execution to the frameworks themselves (sketched below).
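The two-level split can be made concrete with a small sketch. The Python below is an illustration of the offer mechanism described in the Mesos paper, not the real Mesos API: all names (MesosMaster, FrameworkScheduler, resource_offered, and the round-robin stand-in for the fair-sharing policy) are hypothetical.

# Hedged sketch of Mesos-style two-level scheduling via resource offers.
# All names are illustrative; this is not the actual Mesos API.

class Offer:
    """Spare resources on one machine, offered to one framework at a time."""
    def __init__(self, machine, cpus, mem):
        self.machine, self.cpus, self.mem = machine, cpus, mem

class FrameworkScheduler:
    """A framework's own scheduler: it, not the master, places its tasks."""
    def __init__(self, name, task_cpus, task_mem, pending_tasks):
        self.name = name
        self.task_cpus, self.task_mem = task_cpus, task_mem
        self.pending = pending_tasks  # tasks still waiting to be placed

    def resource_offered(self, offer):
        """Return tasks to launch on this offer, or [] to reject and wait."""
        if self.pending == 0 or offer.cpus < self.task_cpus or offer.mem < self.task_mem:
            return []  # reject: hold out for an offer that fits
        n = min(self.pending, offer.cpus // self.task_cpus, offer.mem // self.task_mem)
        self.pending -= n
        return [(offer.machine, self.task_cpus, self.task_mem)] * n

class MesosMaster:
    """Thin core: tracks spare resources and chooses who is offered next."""
    def __init__(self, machines, frameworks):
        self.spare = dict(machines)        # machine -> (cpus, mem) currently free
        self.frameworks = list(frameworks)

    def offer_round(self):
        # Round-robin stands in here for the fairness/priority policy that
        # decides which framework receives the next offer.
        fw = self.frameworks.pop(0)
        self.frameworks.append(fw)
        for machine, (cpus, mem) in list(self.spare.items()):
            for _, c, m in fw.resource_offered(Offer(machine, cpus, mem)):
                cpus, mem = cpus - c, mem - m
            self.spare[machine] = (cpus, mem)

# Example: a framework needing 2-CPU tasks rejects the offer that is too small.
master = MesosMaster({"m1": (4, 8), "m2": (1, 2)},
                     [FrameworkScheduler("batch", task_cpus=2, task_mem=2, pending_tasks=3)])
master.offer_round()   # two tasks land on m1; the 1-CPU offer from m2 is rejected

Because the framework simply declines offers it cannot use, the master never needs to understand framework-specific constraints such as data locality; rejected resources are just re-offered, which works well when tasks are short.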
Mesos (Resource Allocation)
• Resource offer strategy: offer spare available resources, allocated according to
– fairness
– priority
• A framework rejects an offer if the resources do not satisfy its requirements, and waits for a better offer.

Mesos (Summary)
• Takes advantage of short tasks to increase cluster utilization.
• The two-level design (resource offers plus per-framework schedulers) makes it scalable.
• Good at batch jobs. How about service jobs?

RSS: Omega
• A major drawback of Partitioned State Scheduling is that scheduling domains must be selected before the scheduling agent performs its task-resource assignments, potentially restricting the "goodness of fit" the agent can achieve in those assignments.

Role of the Job Manager (see the sketch at the end of these notes)
• 1. If the job queue is not empty, remove the next job from the queue.
• 2. Sync: begin a transaction by synchronizing private cluster state with common cluster state.
• 3. Schedule: engage the scheduling agent to attempt to create task-resource assignments for all tasks in the job, modifying private cluster state in the process.
• 4. Submit: attempt to commit the job transaction (i.e., all task-resource assignments for the job) from private cluster state back to common cluster state. The job transaction can succeed or fail.
• 5. Record which task-resource assignments were successfully committed to common cluster state.
• 6. If any tasks in the job remain unscheduled, either because no suitable resources were found during the schedule stage or because the task-resource assignment experienced a conflict, insert the job back into the job queue to be handled again in a future transaction.

Role of the Meta-Scheduling Agent
• Attempt to execute transactions submitted by job managers according to transaction mode settings.
• Detect conflicts according to conflict detection semantics and policies.
• Enforce meta-scheduling policies. For each submitted job:
– 1. Reject task-resource assignments that would violate policies.
– 2. Reject task-resource assignments that conflict with previously accepted transactions.
– 3. Reject all task-resource assignments in a job transaction if all-or-nothing transaction semantics are in use and at least one task-resource assignment was rejected.

Conflict Mode
• Machine granularity: reject a task-resource assignment only if both the machine and its sequence number conflict.
• Resource granularity: reject a task-resource assignment only if it would put the cluster state into an invalid state (e.g., oversubscribed CPU or memory).

Transaction Granularity Semantics
• All-or-nothing: if a single task-resource assignment conflicts, then none of the task-resource assignments in the transaction are applied to common cluster state.
• Incremental: each task-resource assignment can fail independently of all others.

Omega (Summary)
• A trade-off between whole-cluster resource utilization and conflicts.
• Because each scheduling agent has a view of the whole cluster, dynamically assigning tasks toward a global optimum is feasible.
• The meta-scheduler needs more modification.
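To make the job-manager loop and the conflict semantics above concrete, here is a minimal Python sketch of replicated-state scheduling with optimistic concurrency control. It is a reconstruction from the description in these notes, not Omega's actual code: the class names, the per-machine sequence-number check (machine-granularity conflict mode), and the two commit modes are all illustrative.

import copy
from collections import deque

class Machine:
    def __init__(self, cpus, mem):
        self.cpus, self.mem = cpus, mem
        self.seq = 0   # bumped on every committed change (machine granularity)

class CommonState:
    """The shared, authoritative copy of cluster state."""
    def __init__(self, machines):
        self.machines = machines   # machine id -> Machine

    def sync(self):
        # Step 2 (Sync): hand the agent a private snapshot to schedule against.
        return copy.deepcopy(self.machines)

    def _apply(self, entry):
        (cpus, mem), mid, seq = entry
        mach = self.machines[mid]
        if mach.seq != seq or mach.cpus < cpus or mach.mem < mem:
            return False   # conflict: machine changed since this agent's sync
        mach.cpus -= cpus; mach.mem -= mem; mach.seq += 1
        return True

    def commit(self, entries, all_or_nothing):
        # Step 4 (Submit): detect conflicts, apply what the semantics allow.
        if all_or_nothing:
            trial = copy.deepcopy(self)   # dry run: one conflict fails the job
            if not all(trial._apply(e) for e in entries):
                return [False] * len(entries)
        return [self._apply(e) for e in entries]   # incremental (or clean) commit

def job_manager(common, job_queue, all_or_nothing=True):
    """Each job is a list of (cpus, mem) task demands (steps 1-6 above)."""
    while job_queue:
        job = job_queue.popleft()                     # step 1: next job
        private = common.sync()                       # step 2: snapshot
        placed, unplaced = [], []
        for task in job:                              # step 3: first fit, privately
            cpus, mem = task
            for mid, mach in private.items():
                if mach.cpus >= cpus and mach.mem >= mem:
                    placed.append((task, mid, mach.seq))
                    mach.cpus -= cpus; mach.mem -= mem
                    mach.seq += 1   # anticipate our own commit on this machine
                    break
            else:
                unplaced.append(task)
        results = common.commit(placed, all_or_nothing)   # steps 4 and 5
        retry = unplaced + [t for (t, _, _), ok in zip(placed, results) if not ok]
        if retry:
            job_queue.append(retry)                   # step 6: try again later

# Example: two machines, one job of three 2-CPU tasks, all-or-nothing commit.
common = CommonState({"m1": Machine(4, 8), "m2": Machine(4, 8)})
job_manager(common, deque([[(2, 2), (2, 2), (2, 2)]]))

With several job managers sharing one CommonState and syncing at different times, the sequence-number check is what detects their interleaved commits: overlapping domains raise utilization, but every foreign commit forces the slower agent to retry, which is the utilization-versus-conflicts trade-off noted in the summary.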