Apache Mesos Design Decisions
mesos.apache.org @ApacheMesos
Benjamin Hindman – @benh

this is not a talk about YARN (at least not explicitly!); this talk is about Mesos!

a little history: Mesos started as a research project at Berkeley in early 2009, by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica

our motivation: increase the performance and utilization of clusters

our intuition:
① static partitioning considered harmful: sharing the datacenter dynamically is faster and drives higher utilization!
② build new frameworks: "Map/Reduce is a big hammer, but not everything is a nail!"

Apache Mesos is a distributed system for running and building other distributed systems

Mesos is a cluster manager, a resource manager, a resource negotiator: Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

Mesos is a distributed system with a master/slave architecture; frameworks register with the Mesos master in order to run jobs/tasks

Mesos @Twitter in early 2010; the goal: run long-running services elastically on Mesos

Apache Aurora (incubating) is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc.! (Storm, Jenkins, … run on Mesos too)

a lot of interesting design decisions along the way; many appear (IMHO) in YARN too:
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

design decision ①: two-level scheduling and resource offers

frameworks get allocated resources from the masters; resources are allocated via resource offers, e.g., (hostname, 4 CPUs, 4 GB RAM); a resource offer represents a snapshot of available resources (one offer per host) that a framework can use to run tasks

frameworks use these resources to decide what tasks to run; a task, e.g., (3 CPUs, 2 GB RAM), can use a subset of an offer

Mesos challenged the status quo of cluster managers: in the status quo, an application hands the cluster manager a specification, the specification includes as much information as possible to assist the cluster manager in scheduling and execution, and the application waits for the task to be executed and for the result

problems with specifications:
① hard to specify certain desires or constraints
② hard to update specifications dynamically as tasks execute and finish/fail

an alternative model: the framework sends the masters a request, e.g., (3 CPUs, 2 GB RAM); a request is a purposely simplified subset of a specification, mainly including the required resources

question: what should Mesos do if it can't satisfy a request?
① wait until it can …
② offer the best it can immediately

an alternative model: the masters respond with resource offers, one per host, each of the form (hostname, 4 CPUs, 4 GB RAM); the framework uses the offers to perform its own scheduling

an analogue: non-blocking sockets; the application calls write(s, buffer, size) and the kernel answers "42 of 100 bytes written!"

resource offers address asynchrony in resource allocation; IIUC, even YARN allocates "the best it can" to an application when it can't satisfy a request

requests are complementary (but not necessary); offers represent the currently available resources a framework can use

question: should resources within offers be disjoint? (can the same offer, (hostname, 4 CPUs, 4 GB RAM), go to both framework1 and framework2?)

concurrency control: pessimistic vs. optimistic
optimistic: all offers overlap with one another, thus causing frameworks to "compete" first-come-first-served
pessimistic: offers made to different frameworks are disjoint
Mesos semantics: assume overlapping offers

design comparison: Google's Omega
in the Omega model, a framework gets a snapshot of the cluster state from a database (note: it does not make a request!); the framework then submits a transaction to the database to "acquire" resources (which it can then use to run tasks); failed transactions occur when another framework has already acquired the sought resources

isomorphism? observation: snapshots are optimistic offers; Omega's snapshot plays the role of a Mesos offer, and Omega's transaction plays the role of launching a task, e.g., (3 CPUs, 2 GB RAM), against an offer

thought experiment: what's gained by exploiting the continuous spectrum from pessimistic to optimistic?

design decision ②: fair-sharing and revocable resources

Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)

DRF, born of static partitioning: statically partition the datacenter across teams (promotions, trends, recommendations) and it's "fairly shared!"; the goal: fairly share the resources without static partitioning

partition utilizations:
  promotions: 45% CPU, 100% RAM
  trends: 75% CPU, 100% RAM
  recommendations: 100% CPU, 50% RAM

observation: a dominant resource bottlenecks each team from running any more jobs/tasks; the bottleneck is RAM for promotions, RAM for trends, and CPU for recommendations

insight: allocating a fair share of each team's dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!

… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!
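To make DRF concrete, here is a toy progressive-filling allocator (a sketch in C++, not code from Mesos; the team names, per-task demands, and cluster size are invented for illustration): it repeatedly grants one task's worth of resources to whichever team currently has the lowest dominant share.

```cpp
// Toy sketch of Dominant Resource Fairness (DRF): progressive filling.
// Illustrative only; not Mesos source. Teams, demands, and cluster
// capacity are made up.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>

struct Resources {
  double cpus = 0.0;
  double mem = 0.0;  // GB
};

int main() {
  const Resources total{100.0, 400.0};  // the whole cluster

  // Per-task demand of each (hypothetical) team's framework.
  std::map<std::string, Resources> demand = {
      {"promotions", {1.0, 4.0}},        // RAM-heavy
      {"trends", {1.0, 3.0}},            // RAM-heavy
      {"recommendations", {2.0, 1.0}}};  // CPU-heavy

  std::map<std::string, Resources> allocated = {
      {"promotions", {}}, {"trends", {}}, {"recommendations", {}}};

  Resources used;
  while (true) {
    // A team's dominant share is the larger of its CPU share and its
    // RAM share of the whole cluster.
    std::string next;
    double lowest = 2.0;  // shares are always <= 1.0
    for (const auto& [team, a] : allocated) {
      double share = std::max(a.cpus / total.cpus, a.mem / total.mem);
      if (share < lowest) { lowest = share; next = team; }
    }

    // Grant one more task to the team with the lowest dominant share,
    // equalizing dominant shares over time. Stop when its task no longer
    // fits (a real allocator would mark the team saturated and keep
    // going for the others).
    const Resources& d = demand[next];
    if (used.cpus + d.cpus > total.cpus || used.mem + d.mem > total.mem) break;
    allocated[next].cpus += d.cpus; allocated[next].mem += d.mem;
    used.cpus += d.cpus; used.mem += d.mem;
  }

  for (const auto& [team, a] : allocated)
    std::cout << team << ": " << a.cpus << " CPUs, " << a.mem << " GB\n";
}
```

With these invented demands, the RAM-heavy teams end up RAM-bound and the CPU-heavy team CPU-bound, with dominant shares roughly equal: exactly the "at least as good as static partitioning" guarantee above.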
DRF in Mesos:
① frameworks specify a role when they register (i.e., the team to charge for the resources)
② the master calculates each role's dominant resource (dynamically) and allocates appropriately

in practice, fair sharing is insufficient: weighted fair sharing
  team            weight
  promotions      0.17
  trends          0.5
  recommendations 0.33

Mesos implements weighted DRF: masters can be configured with weights per role, and resource allocation decisions incorporate the weights to determine dominant fair shares

in practice, weighted fair sharing is still insufficient: a non-cooperative framework (i.e., one with long tasks, or a buggy one) can get allocated too many resources

Mesos provides reservations: slaves can be configured with resource reservations for particular roles (dynamic, time-based, and percentage-based reservations are in development), and resource offers include the reservation role (if any), e.g., the trends framework is offered (hostname, 4 CPUs, 4 GB RAM, role: trends)

reservations provide guarantees, but at the cost of utilization, e.g.: trends reserves 20%, promotions reserves 40%, and recommendations reserves 40% but uses only 10%, leaving 30% unused

revocable resources: reserved resources that are unused can be allocated to frameworks from different roles, e.g., the promotions framework is offered (hostname, 4 CPUs, 4 GB RAM, role: trends), but those resources may be revoked at any time

preemption via revocation: … my tasks will not be killed unless I'm using revocable resources!

design decision ③: high-availability and fault-tolerance

high-availability and fault-tolerance were a prerequisite @twitter: machine failures, process failures (bugs!), and upgrades demand
① framework failover
② master failover
③ slave failover

① framework failover: the framework re-registers with the master and resumes operation; all tasks keep running across framework failover!

② master failover: after a new master is elected, all frameworks and slaves connect to the new master; all tasks keep running across master failover!
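A sketch of what this looks like from the framework side, assuming the C++ scheduler API of this era (mesos/scheduler.hpp): the framework sets a failover_timeout and re-registers with its previous FrameworkID, and pointing the driver at ZooKeeper lets it find the newly elected master after a master failover. loadSavedFrameworkId() is a hypothetical helper; a real framework would persist its FrameworkID somewhere durable.

```cpp
// Sketch: opting a framework into failover with the Mesos C++ API of
// this era. loadSavedFrameworkId() is a hypothetical helper; a real
// framework persists its FrameworkID durably (e.g., in ZooKeeper).
#include <string>
#include <mesos/scheduler.hpp>

// Hypothetical: read a previously saved FrameworkID ("" if none).
std::string loadSavedFrameworkId() { return ""; }

int runFramework(mesos::Scheduler* scheduler) {
  mesos::FrameworkInfo framework;
  framework.set_user("");            // let Mesos fill in the current user
  framework.set_name("my-framework");

  // Ask the master to wait (here, an hour) for us to fail over before
  // declaring us gone: our tasks keep running while we restart.
  framework.set_failover_timeout(3600);

  // Reuse our previous FrameworkID (if any) so re-registration counts
  // as a failover of the same framework, not a brand new framework.
  const std::string id = loadSavedFrameworkId();
  if (!id.empty()) {
    framework.mutable_id()->set_value(id);
  }

  // Discover the (possibly re-elected) leading master via ZooKeeper.
  mesos::MesosSchedulerDriver driver(
      scheduler, framework, "zk://localhost:2181/mesos");
  return driver.run();  // blocks until the driver stops or aborts
}
```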
③ slave failover: the mesos-slave process can fail (or be taken down for an upgrade) and restart while the tasks on that slave keep running; the restarted mesos-slave reconnects to the still-running tasks

slave failover matters @twitter because of large in-memory services that are expensive to restart

design decision ④: execution and isolation

execution: frameworks launch fine-grained tasks, e.g., (3 CPUs, 2 GB RAM); if necessary, a framework can provide an executor to handle the execution of its tasks, and one executor can run many tasks

goal: isolation; executors and tasks run in containers, and the executor + task design means containers can have changing resource allocations as tasks come and go

making the task first-class gives us true fine-grained resource sharing

requirement: fast task launching (i.e., milliseconds or less); virtual machines are an anti-pattern here; instead, operating-system virtualization: containers (zones and projects), control groups (cgroups), namespaces

isolation support: tight integration with cgroups
  CPU (upper and lower bounds)
  memory
  network I/O (traffic controller, in development)
  filesystem (using LVM, in development)

statistics too: rarely does allocation == usage (humans are bad at estimating the amount of resources they're using); used @twitter for capacity planning (and, in development, oversubscription)

CPU upper bounds? in practice, determinism trumps utilization

design decision ⑤: C++

requirements:
① performance
② maintainability (static typing)
③ interfaces to the low-level OS (for isolation, etc.)
④ interoperability with other languages (for library bindings)

garbage collection: a performance anti-pattern

consequences of C++:
① antiquated libraries (especially around concurrency and networking)
② nascent community

github.com/3rdparty/libprocess: concurrency via futures/actors, networking via message passing
github.com/3rdparty/stout: monads in C++, safe and understandable utilities

but … it scales: simulations to 50,000+ slaves; @twitter we run multiple Mesos clusters, each with 3500+ nodes

design decisions, in review:
① two-level scheduling and resource offers
② fair-sharing and revocable resources
③ high-availability and fault-tolerance
④ execution and isolation
⑤ C++

final remarks

frameworks:
• Hadoop (github.com/mesos/hadoop)
• Spark (github.com/mesos/spark)
• DPark (github.com/douban/dpark)
• Storm (github.com/nathanmarz/storm)
• Chronos (github.com/airbnb/chronos)
• MPICH2 (in the mesos git repository)
• Marathon (github.com/mesosphere/marathon)
• Aurora (github.com/twitter/aurora)

write your next distributed system with Mesos!
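To make that concrete, here is roughly what a minimal framework looks like against the C++ scheduler API of this era; a hedged sketch, not production code: it launches one shell command per offer, and resource accounting, error handling, and task bookkeeping are elided.

```cpp
// Minimal framework sketch against the Mesos C++ scheduler API
// (mesos/scheduler.hpp, circa 2013). Illustrative only.
#include <iostream>
#include <string>
#include <vector>
#include <mesos/scheduler.hpp>

using namespace mesos;

class EchoScheduler : public Scheduler {
public:
  virtual void registered(SchedulerDriver*, const FrameworkID&,
                          const MasterInfo&) {
    std::cout << "registered with master" << std::endl;
  }

  // The heart of two-level scheduling: Mesos decides what to offer us,
  // we decide what to run on it.
  virtual void resourceOffers(SchedulerDriver* driver,
                              const std::vector<Offer>& offers) {
    for (const Offer& offer : offers) {
      TaskInfo task;
      task.set_name("echo");
      task.mutable_task_id()->set_value("task-" + std::to_string(next++));
      task.mutable_slave_id()->MergeFrom(offer.slave_id());
      task.mutable_command()->set_value("echo hello from mesos");

      // Use a subset of the offer: 1 CPU, 128 MB.
      Resource* cpus = task.add_resources();
      cpus->set_name("cpus");
      cpus->set_type(Value::SCALAR);
      cpus->mutable_scalar()->set_value(1);
      Resource* mem = task.add_resources();
      mem->set_name("mem");
      mem->set_type(Value::SCALAR);
      mem->mutable_scalar()->set_value(128);

      driver->launchTasks(offer.id(), {task});
    }
  }

  virtual void statusUpdate(SchedulerDriver*, const TaskStatus& status) {
    std::cout << "task " << status.task_id().value()
              << " is in state " << status.state()  // numeric TaskState
              << std::endl;
  }

  // Remaining callbacks stubbed out in this sketch.
  virtual void reregistered(SchedulerDriver*, const MasterInfo&) {}
  virtual void disconnected(SchedulerDriver*) {}
  virtual void offerRescinded(SchedulerDriver*, const OfferID&) {}
  virtual void frameworkMessage(SchedulerDriver*, const ExecutorID&,
                                const SlaveID&, const std::string&) {}
  virtual void slaveLost(SchedulerDriver*, const SlaveID&) {}
  virtual void executorLost(SchedulerDriver*, const ExecutorID&,
                            const SlaveID&, int) {}
  virtual void error(SchedulerDriver*, const std::string&) {}

private:
  int next = 0;
};

int main() {
  FrameworkInfo framework;
  framework.set_user("");  // Mesos fills in the current user
  framework.set_name("echo-framework");

  EchoScheduler scheduler;
  MesosSchedulerDriver driver(&scheduler, framework, "localhost:5050");
  return driver.run() == DRIVER_STOPPED ? 0 : 1;
}
```

The shape is the point: Mesos decides what to offer, and the few interesting lines in resourceOffers decide what to run with it.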
port a framework to Mesos: write a "wrapper"; it takes ~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity and other Mesos features); see http://github.com/mesos/hadoop

Thank You!
mesos.apache.org / mesos.apache.org/blog / @ApacheMesos

② master failover, revisited: after a new master is elected, all frameworks and slaves connect to the new master; all tasks keep running across master failover!

stateless master: to make master failover fast, we chose to make the master stateless; state is stored in the leaves, at the frameworks and the slaves; this makes sense for frameworks that don't want to store state (i.e., that can't actually failover); consequences: slaves are fairly complicated (they need to checkpoint), and frameworks need to save their own state

Apache Mesos is a distributed system for running and building other distributed systems

origins: a Berkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica; see mesos.apache.org/documentation

the ecosystem: operators, framework developers, mesos developers; a tour of Mesos from the different perspectives of the ecosystem

the operator: the people who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc.); their tools: virtual machines, Chef, Puppet (emerging: PaaS, Docker); "ops" at most companies (SREs at Twitter); the static partitioners

for the operator, Mesos is a cluster manager, a resource manager, a resource negotiator: it replaces static partitioning of resources to frameworks with dynamic resource allocation, and it is a distributed system with a master/slave architecture

frameworks/applications register with the Mesos master in order to run jobs/tasks

frameworks can be required to authenticate as a principal*: SASL with the CRAM-MD5 secret mechanism (Kerberos in development); the masters are initialized with the secrets

Mesos is highly-available and fault-tolerant

the framework developer: Mesos uses Apache ZooKeeper for coordination

increase utilization with revocable resources and preemption: a host's resources, e.g., (hostname: 4 CPUs, 4 GB RAM), can be split across framework1, framework2, and framework3 via reservations (e.g., 15% / 24% / 61%)

authorization*: principals can be used for authorizing allocation roles and for authorizing operating system users (for execution)

agenda: motivation and overview; resource allocation; frameworks, schedulers, tasks, status updates; high-availability; resource isolation and statistics; security; case studies

I'd love to answer some questions with the help of my data! I think I'll try Hadoop.
your datacenter + Hadoop: happy? Not exactly … Hadoop is a big hammer, but not everything is a nail!
I've got some iterative algorithms, I want to try Spark!
datacenter management: the status quo is static partitioning

static partitioning considered harmful:
(1) hard to share data
(2) hard to scale elastically (to exploit statistical multiplexing)
(3) hard to fully utilize machines
(4) hard to deal with failures

consider Hadoop: map/reduce plus HDFS (the distributed file system); could we just give Spark its own HDFS cluster too? that means HDFS x 2: teeing incoming data (2 copies) and a periodic copy/sync between the clusters; that sounds annoying … let's not do that; can we do any better, though? with statically partitioned clusters, sharing data is hard

elasticity is hard too: during the day I'd rather give more machines to Spark, but at night I'd rather give more machines to Hadoop!

and statically partitioned machines are hard to fully utilize and hard to deal with when they fail; I don't want to deal with this!

the datacenter … rather than think about the datacenter as racks of machines … is a computer: think about it as one computer

a computer runs applications on a kernel over resources and a filesystem; the datacenter runs frameworks on mesos (the kernel) over resources and a (distributed) filesystem

Step 1: filesystem
Step 2: mesos; run a "master" (or multiple for high availability), then run "slaves" on the rest of the machines
Step 3: frameworks
$tep 4: profit (statistical multiplexing): reduces CapEx and OpEx! reduces latency!
$tep 4: profit (utilize): machines are fully utilized; $tep 4: profit (failures): failures are easier to deal with

agenda: resource allocation

resource allocation: reservations; you can reserve resources per slave to provide guaranteed resources; this requires human participation (ops) to determine which roles should reserve which resources; it's kind of like thread affinity, but across many machines (and not just for CPUs)

the allocation policy:
(1) allocate reserved resources to frameworks authorized for a particular role
(2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks, according to their weights

preemption: if a framework runs tasks outside of its reservations, they can be preempted (i.e., the tasks killed and the resources revoked) in favor of a framework running a task within its reservation

agenda: frameworks, schedulers, tasks, status updates

framework ≈ distributed system; framework commonality:
run processes/tasks simultaneously (distributed)
coordinate execution
handle process failures (fault-tolerant)
optimize performance (elastic)

frameworks are execution coordinators … frameworks are execution schedulers

the end-to-end principle: "application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes"; i.e., frameworks want to coordinate their tasks' execution, and they should be able to

framework anatomy: a scheduler talking to the masters over the scheduling API
scheduling: "i'd like to run some tasks!" … "here are some resource offers!"

resource offers: an offer represents the snapshot of available resources on a particular machine that a framework can use to run tasks, e.g., foo.bar.com: 4 CPUs, 4 GB RAM; schedulers pick which resources to use to run their tasks

"two-level scheduling":
mesos: controls resource allocations to schedulers
schedulers: make decisions about what to run given allocated resources

concurrency control: the same resources may be offered to different frameworks
pessimistic: no overlapping offers
optimistic: all overlapping offers

tasks: the "threads" of the framework, a consumer of resources (cpu, memory, etc.); either a concrete command line or an opaque description (which requires an executor)
tasks: "here are some resources!" … "launch these tasks!"

status updates: "task status update!"

more scheduling: "i'd like to run some tasks!" (the loop continues; a sketch of this loop follows)
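The loop these slides trace (offers in, tasks out, status updates back) can be sketched generically; the types below are illustrative stand-ins, not the Mesos API: when a task fails or is lost, its work goes back on a pending queue so the next suitable offer can relaunch it.

```cpp
// Sketch of the status-update half of the scheduling loop. Illustrative
// types only, not Mesos source: a real framework would react to the
// Mesos TaskStatus/TaskState values instead.
#include <deque>
#include <map>
#include <string>

enum class TaskState { STAGING, RUNNING, FINISHED, FAILED, KILLED, LOST };

struct StatusUpdate {
  std::string taskId;
  TaskState state;
};

class SchedulingLoop {
public:
  // Called by the "kernel" (Mesos) whenever a task changes state.
  void statusUpdate(const StatusUpdate& update) {
    switch (update.state) {
      case TaskState::FINISHED:
        work.erase(update.taskId);  // done, forget about it
        break;
      case TaskState::FAILED:
      case TaskState::LOST:
        // Fault tolerance lives in the framework (end-to-end principle):
        // re-queue the work; it runs again when a suitable offer arrives.
        pending.push_back(work[update.taskId]);
        break;
      default:
        break;  // STAGING/RUNNING/KILLED: nothing to do in this sketch
    }
  }

  bool hasPendingWork() const { return !pending.empty(); }

private:
  std::map<std::string, std::string> work;  // taskId -> work description
  std::deque<std::string> pending;          // work awaiting an offer
};
```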
agenda: high-availability

high-availability (master): when a master fails, a new master is elected, and schedulers keep scheduling ("i'd like to run some tasks!") and keep receiving status updates ("task status update!")
high-availability (framework): framework failover, as before
high-availability (slave): slave failover, as before

agenda: resource isolation and statistics

resource isolation: leverage Linux control groups (cgroups)
  CPU (upper and lower bounds)
  memory
  network I/O (traffic controller, in progress)
  filesystem (lvm, in progress)

resource statistics: rarely does allocation == usage (humans are bad at estimating the amount of resources they're using); per-task/executor statistics are collected (for all fork/exec'ed processes too!) and can help with capacity planning

agenda: security

security: Twitter recently added SASL support; the default mechanism is CRAM-MD5, and Kerberos will be supported in the short term

agenda: case studies

framework commonality, again: run processes/tasks simultaneously (distributed), handle process failures (fault-tolerant), optimize performance (elastic); as a "kernel", mesos provides a lot of primitives that make writing a new framework easier, such as launching tasks, doing failure detection, etc.; why re-implement them each time!?

case study: chronos
distributed cron with dependencies, developed at airbnb; ~3k lines of Scala! distributed, highly available, and fault tolerant, without any network programming!
http://github.com/airbnb/chronos

with mesos, the same cluster grows from analytics to analytics + services

case study: aurora
"run 200 of these, somewhere, forever"; developed at Twitter; highly available (uses the mesos replicated log); uses a python DSL to describe services; leverages service discovery and proxying (see Twitter commons)
http://github.com/twitter/aurora

frameworks: Hadoop, Spark, DPark, Storm, Chronos, MPICH2, Marathon, Aurora (URLs listed earlier); write your next distributed system with mesos!

port a framework to mesos: write a "wrapper" scheduler; ~100 lines of code (the more lines, the more you can take advantage of elasticity or other mesos features); see http://github.com/mesos/hadoop

conclusions:
datacenter management is a pain
mesos makes running frameworks on your datacenter easier, as well as increasing utilization and performance, while reducing CapEx and OpEx!
rather than build your next distributed system from scratch, consider using mesos
you can share your datacenter between analytics and online services!

Questions?
mesos.apache.org @ApacheMesos

framework commonality: run processes simultaneously (distributed), handle process failures (fault-tolerance), optimize execution (elasticity, scheduling)

the primitives:
scheduler – the distributed system's "master" or "coordinator"
(executor – lower-level control of task execution; optional)
requests/offers – resource allocations
tasks – the "threads" of the distributed system …

a scheduler (e.g., for Apache Hadoop or Chronos):
(1) brokers for resources
(2) launches tasks
(3) handles task termination

brokering for resources means:
(1) making resource requests, e.g., (2 CPUs, 1 GB RAM, slave: *)
(2) responding to resource offers, e.g., (4 CPUs, 4 GB RAM, slave: foo.bar.com)

offers: non-blocking resource allocation; offers exist to answer the question "what should mesos do if it can't satisfy a request?"
(1) wait until it can
(2) offer the best allocation it can immediately

resource allocation: requests flow from schedulers (Apache Hadoop, Chronos) into the allocator, which applies dominant resource fairness and resource reservations, and sits somewhere between pessimistic (no overlapping offers) and optimistic (all overlapping offers)

"two-level scheduling":
mesos: controls resource allocations to framework schedulers
schedulers: make decisions about what to run given allocated resources

the end-to-end principle, again: "application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes"

tasks: either a concrete command line or an opaque description (which requires a framework executor to execute); a consumer of resources

task operations: launching/killing, health monitoring/reporting (failure detection), resource usage monitoring (statistics)

resource isolation: a cgroup per executor, or per task (if there is no executor); resource controls are adjusted dynamically as tasks come and go! (see the cgroups sketch after the case studies below)

case study: chronos
distributed cron with dependencies, built at airbnb by @flo
before chronos: a single point of failure (and AWS was unreliable), and resource starved (not scalable)
chronos requirements: fault tolerance, distributed (elastically take advantage of resources), retries (make sure a command eventually finishes), dependencies
chronos leverages the primitives of mesos: ~3k lines of scala; highly available (uses Mesos state); distributed / elastic; no actual network programming!
after chronos: it runs alongside hadoop on the same cluster

case study: aurora
"run 200 of these, somewhere, forever"; built at Twitter
before aurora: static partitioning of machines to services, hardware outages caused site outages, puppet + monit, and ops couldn't scale as fast as engineers
aurora: highly available (uses the mesos replicated log); uses a python DSL to describe services; leverages service discovery and proxying (see Twitter commons)
after aurora: power loss to 19 racks, no lost services!
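As referenced above, here is what "resource controls adjusted dynamically" can bottom out in: with cgroups (v1), a CPU lower bound maps to cpu.shares, an upper bound to the CFS quota, and the memory limit to memory.limit_in_bytes. The sketch below writes those control files directly; the cgroup path and helper are illustrative, since Mesos manages these files through its cgroups isolators rather than like this.

```cpp
// Sketch: adjusting a container's cgroup (v1) knobs as its allocation
// changes. The cgroup path and this helper are illustrative; Mesos
// manages these control files via its cgroups isolators.
#include <fstream>
#include <string>

static void writeControl(const std::string& path, long long value) {
  std::ofstream file(path);
  file << value;  // each cgroup control file takes a single integer
}

void setLimits(const std::string& cgroup, double cpus, long long memBytes) {
  const std::string base = "/sys/fs/cgroup";

  // Lower bound: a proportional share when the machine is contended.
  writeControl(base + "/cpu/" + cgroup + "/cpu.shares",
               static_cast<long long>(cpus * 1024));

  // Upper bound: a hard cap via the CFS quota (quota / period = cpus).
  writeControl(base + "/cpu/" + cgroup + "/cpu.cfs_period_us", 100000);
  writeControl(base + "/cpu/" + cgroup + "/cpu.cfs_quota_us",
               static_cast<long long>(cpus * 100000));

  // Memory: a hard limit in bytes.
  writeControl(base + "/memory/" + cgroup + "/memory.limit_in_bytes",
               memBytes);
}
```

For example, setLimits("mesos/executor-42", 2.0, 1LL << 30) would cap a (hypothetical) container at 2 CPUs and 1 GB; calling it again later with new values is the "adjusted dynamically as tasks come and go" part.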
more than 400 engineers running services; the largest cluster has >2500 machines

one Mesos layer across all the nodes, running Hadoop, Spark, MPI, Storm, Chronos, …

$tep 4: Profit (statistical multiplexing)