SEMCO SIG-Computing: Compute Clusters—Building Blocks of the Public Cloud Innovation Intelligence® Jeff Marraccini, Vice President, Computer Systems jeff@altair.com 14-Dec-2014 Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Overview of today’s talk • Why clusters? • Some history • “Private cloud” clusters • Architecture • Failures • The Virtual Machine era • “Public cloud” clusters • Facebook and the Open Data Center • Appliance Computing • Resources to learn more Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Why clusters? And what’s the big deal? • Mainframe costs • Microcomputer performance and Moore’s Law • Networking + computers + “cluster stack” = often vast power • What do we do with these 3-5 year old computers on a 7-10 year budget cycle? • Sony PlayStations and Apple Xserves Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Universities, government agencies, companies, and basements near you… • They got us started… • NASA (you may be using an Ethernet driver based on their work!) • NSA fed back scalability ideas (!!) • Universities world wide – open source contributions • Military projects • The Beowulf Project • Basement clusters • Renderman users • LucasFilm • MASSIVE (Peter Jackson/WETA Digital!) • Older operating systems: Digital VAX/VMS & OpenVMS, Some UNIX, Microsoft Windows Server Clusters Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. What do they do? • Scientific and engineering computing – the start of it all • Render farms – special effects for movies, TV, commercials, games, live TV and sports overlays… • Media conversion (YouTube!) • Web services • E-Mail • Databases, “Big Data” • Storage • Building and testing software! • Social media (combining a lot of the above) • Cracking passwords, encryption • Neural networks / expert systems / IBM WATSON Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Some of the largest clusters are… • 10’s-100’s of thousands of cores • NSA (probably), along with other governments’ security arms • Other classified installations • CERN • TOP 500 Supercomputers • Public clouds (Google, Amazon, Microsoft, Rackspace, IBM, others) • 1’s-10’s of thousands of cores • Square Kilometer Array (Australia / South Africa, just got back from there) • Weather forecasting • Japan’s Earth project (early 2000’s) • Government Labs • Large organizations (corporate, universities, “smaller” public cloud providers) • Small businesses often have dozens to hundreds of cores, and may not realize it if leasing public cloud services! Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Software development for clusters & architecture • Message passing (MPI), can achieve huge scales • Shared memory with proprietary interconnect (Some Cray, NEC, SGI Altix) • Single instance (Beowulf, OpenMOSIX, some Cray, NEC, SGI Altix) Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. “Private Cloud” • Internal use clusters • Sometimes accessible via remote access, Virtual Private Networks • “Secret sauce” behind internal tools • Requires a forging of networking, storage, and computing teams • Oracle 10g databases often first exposure • Scalable internal storage (EMC Isilon, ExaGrid, HP 3PAR, etc.) Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Altair’s Internal Clusters • We use PBS Professional for all (it’s our product!) • HyperWorks Unlimited – “cluster in a box” – several around the world, hundreds to 2048 cores • Legacy “E-Compute” & Compute Manager (newer) – several clusters of a couple hundred cores each • HyperWorks build/QA – several hundred cores, Windows, Linux, Mac, Michigan and India, 50+ compilations and thousands of tests daily • Test clusters – 128-256 cores, often restaged Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. High Availability Private Cluster Block Diagram Firewall Head Nodes • Protects often unpatched cluster software and firmware • Load balancer • Remote access • 1 • 2 • Authentication, Scheduling, Staging, Reloading, Push notifications, Periodic Checkpointing Switch Fabrics Execution Nodes • 1 • 2 • Infiniband, Myrinet, 1/10/40/100GB Ethernet, Proprietary (Cray!) • 1…N • Local storage Shared Storage Pools • Staging • Checkpoints 256 core SuperMicro TwinBlade chassis w/ 100TB storage, QDR InfiniBand, no HA Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. A regular cluster (or a basement one!) Head Node Cluster fabric(s) • Authentication, Scheduling, Staging, (Reloading, Push notifications, Periodic Checkpointing) • Ethernet switch • Infiniband switch • Storage Area Network Execution Nodes Shared Storage Pools • 1 … N (could be varying hardware) • Local storage (maybe!) • Staging • Checkpoints (maybe!) • Could be FreeNAS, NFS over ZFS, Lustre… Could well ALL be running on a single virtual machine hypervisor! Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. The Fabric – cluster scaling and speed • Infiniband • MyriNet • PCIe • Ethernet • Proprietary (CrayLink and others) • Virtual network switches 256 core SGI half rack, QDR InfiniBand, Nvidia GPU’s, no HA Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Storage Varying needs = varying capacities (CFD, “crash”, chemistry, atmospherics, password cracking…) Cluster storage is HARD, especially scale out Reliability - High availability often is more than 2X the cost Local storage limits (blades) Spinning it down = complex Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Management • Staging nodes – potentially thousands • Herding cats = scheduling different user communities’ requirements • Failures and recovery • Staging jobs in/out • Push notifications • Portals • Continuous resource monitoring • Checkpointing • Energy efficiency Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. When it breaks • Nodes will fail • We have hardware failures every week, bigger clusters may have hourly failures or even more • Checkpointing = costly in storage and processing time • Restoring from a checkpoint • Restaging • Job migration • Jeff’s “I meant to type a 11 and typed 1” glitch • The dreaded faulty Infiniband cable Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. The Virtual Machine Cluster • Great way to demo cluster software • SIMH & OpenVMS (Jeff’s VMS cluster on a Surface Pro 2 tablet) • Virtual network switches • “Pull” the virtual network cable • Test your upgrades • Learn without spending $50,000 • Hypervisors add I/O latency • Fabric support limited • = Scalability limited Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. The Virtual Machine Cluster • Great way to demo cluster software • SIMH & OpenVMS (Jeff’s VMS cluster on a Surface Pro 2 tablet) • Virtual network switches • “Pull” the virtual network cable • Test your upgrades • Learn without spending $50,000 • Hypervisors add I/O latency • Fabric support limited • = Scalability limited Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. “I’m out of oomph” -> BURSTING • “Promise” of the Public Cloud • Credit card financed computing • Possibly loosely coupled • Fabric compromises • Getting better! Internal Cluster VPN to Amazon AWS/Microsoft Azure Cloud Execution Nodes Cloud storage Cloud fabric Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Spread out clusters • May be in the “Public Cloud” or at multiple “Private Cloud” sites (research centers, remote data centers, leased private capacity) • Redundancy – Google File System, etc. quickly copy object data and store archival copies, etc. • Scalability • Lots of “dark fiber” available for leasing • Latency sensitivity Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Facebook and Open Compute Project • Mainly useful for big organizations • Power efficiency, reduce impact • Shared power supplies • Optimized cooling • Storage & node spin-down • Designed to fail and be easily serviceable • Quick upgrades • Scalability beyond conventional designs • Might slow down commodity server price drops, volume decreasing • http://www.opencompute.org/ Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Appliances and Platform as a Service (PaaS) • “Cluster in a box” (well, racks!) • Bursting • Project-based computing • Nimble • Geek skills embedded • Easy portal / front ends Copyright © 2014 Altair Engineering, Inc. Proprietary and Confidential. All rights reserved. Where do we go from here? • Public library access to Lynda.com – Amazon AWS & Microsoft Azure “Up and Running” courses • Install SIMH and set up a hobbyist OpenVMS cluster! https://vanalboom.org/node/18 • OpenStack on virtual machines: http://www.openstack.org/ and http://docs.openstack.org/developer/devstack/#quick-start • Example appliance: http://www.altair.com/hwul/ • PBS Professional, IBM LSF, GridEngine, other cluster mgmt. software • Lustre storage free software - http://wiki.lustre.org/ • Aside from security, the ability to build and maintain private and public cluster systems are near the top of the pay scale in IT!