Slide 1
Cluster-on-Demand (COD)
Justin Moore
Duke University

Slide 2
How Big Is It?
500? 5000? 25,000?
Clusters are growing
Clusters are expensive
– Power, A/C, Management …
How to manage {heat, power, failures}?
How to keep everything organized?
How to divide resources?

Slide 3
How Do You Use It?
We’ve got good middleware
– Batch queues, Internet Services, research apps …
But customers are very picky
– “Linux!” “FreeBSD!” “Windows!” “Minix!” “Minix??”
– “I only need it for 30 minutes!!”
Customers != administrators
– Contributing to the problem, not the solution
How to share and manage our clusters?
“Can’t we all just get along??”

Slide 4
COD: The More the Merrier
Automated framework for resource management
Owners define policies, customers define configs
COD creates, configures dynamic virtual clusters
– Isolated, secure collection of nodes
– Backed by network storage
– Automatic configuration: fast and OS-agnostic
Middleware negotiates allocations with COD
– Virtual Cluster Manager: COD-aware layer

Slide 5
Dynamic Virtual Clusters
[Diagram labels: COD Manager, DB, Ninja Virtual Cluster, SGE Virtual Cluster, Reserve pool (off-power), Node reallocation]
Example: CNN on 9/11

Slide 6
Those Wonderful Toys
Leverage open standards and open source
– DHCP, NFS, NIS, XML
– Only constraint is that Linux must support hardware
– PXELinux-based installer, RHAT/Debian tools
Currently testing working COD prototype
– Core of policy-based scheduling engine: CSP-solver
– Framework of node requests + allocation negotiation
– OS- and filesystem-agnostic installer
– Testbed to examine policies and microbenchmarks

Slide 7
COD: Size Doesn’t Matter
Enable management scalability for hosting centers
– Hierarchical policy-driven mechanisms
– Empower owners and customers
Details and paper at http://www.cs.duke.edu/~justin/cod/

Slide 8
Questions?
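
Supplementary sketch (not part of the original slides): the node reallocation idea from the "Dynamic Virtual Clusters" slide can be illustrated with a toy manager that grows a virtual cluster from the reserve pool first and then reclaims nodes from lower-priority clusters. Everything here is an assumption made for illustration: the names `CodManager`, `VirtualCluster`, `request`, and `release`, the integer `priority` policy, and the cluster names "web" and "batch" are hypothetical. The slides say the prototype's scheduling core is a CSP-solver; this sketch swaps in a simple greedy priority rule instead.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualCluster:
    name: str
    priority: int  # hypothetical owner-defined policy weight (higher wins)
    nodes: set = field(default_factory=set)


class CodManager:
    """Toy manager (hypothetical API): tracks a reserve pool and virtual clusters."""

    def __init__(self, node_names):
        self.reserve = set(node_names)  # idle nodes, e.g. powered off
        self.clusters = {}

    def add_cluster(self, name, priority):
        self.clusters[name] = VirtualCluster(name, priority)

    def request(self, name, count):
        """Try to grow cluster `name` by `count` nodes; return how many were granted."""
        target = self.clusters[name]
        granted = 0
        # Step 1: take idle nodes from the reserve pool.
        while granted < count and self.reserve:
            target.nodes.add(self.reserve.pop())
            granted += 1
        # Step 2: reclaim from strictly lower-priority clusters, lowest first.
        donors = sorted(
            (c for c in self.clusters.values() if c.priority < target.priority),
            key=lambda c: c.priority,
        )
        for donor in donors:
            while granted < count and donor.nodes:
                target.nodes.add(donor.nodes.pop())
                granted += 1
        return granted

    def release(self, name, count):
        """Return up to `count` nodes from cluster `name` to the reserve pool."""
        cluster = self.clusters[name]
        released = 0
        while released < count and cluster.nodes:
            self.reserve.add(cluster.nodes.pop())
            released += 1
        return released


# Example: a load spike on the higher-priority "web" cluster pulls nodes from "batch".
manager = CodManager([f"node{i}" for i in range(6)])
manager.add_cluster("web", priority=2)
manager.add_cluster("batch", priority=1)
batch_granted = manager.request("batch", 4)
web_granted = manager.request("web", 4)
```

In this run the batch cluster first takes four of the six reserve nodes; the web request then drains the remaining two reserve nodes and reclaims two more from batch. A real policy engine would also have to weigh customer configurations and negotiate with the middleware before taking nodes away, which this sketch does not attempt.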