EFFICIENT HIGH PERFORMANCE COMPUTING IN THE CLOUD
Abhishek Gupta (gupta59@illinois.edu)
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL

MOTIVATION: WHY CLOUDS FOR HPC?
• Rent vs. own, pay-as-you-go: no startup/maintenance cost, no cluster-creation time
• Elastic resources: no risk of under-provisioning; prevents underutilization
• Benefits of virtualization: flexibility and customization, security and isolation, migration and resource control
• Multiple providers offering Infrastructure as a Service
Cloud for HPC: a cost-effective and timely solution?

MOTIVATION: THE HPC-CLOUD DIVIDE
• HPC: application performance, dedicated execution, HPC-optimized interconnects and OS; not cloud-aware
• Cloud: service, cost, and resource utilization; multi-tenancy, commodity networks, virtualization; not HPC-aware
• The mismatch between HPC requirements and cloud characteristics means that today only embarrassingly parallel, small-scale HPC applications run well in clouds

OBJECTIVES AND CONTRIBUTIONS
• Goals: understand HPC in the cloud (what, why, who); improve HPC performance by bridging the HPC-cloud gap; improve cloud utilization => reduce cost
• Techniques: performance and cost analysis; smart selection of platforms for applications; application-aware VM consolidation; heterogeneity- and multi-tenancy-aware HPC; malleable jobs (dynamic shrink/expand)
• Tools: extended OpenStack Nova scheduler, CloudSim simulator, load balancing framework, object migration

OUTLINE
• Performance of HPC in the cloud: trends, challenges, and opportunities
• Application-aware cloud schedulers: HPC-aware schedulers (improve HPC performance); application-aware consolidation (improve cloud utilization => reduce cost)
• Cloud-aware HPC runtime: dynamic load balancing (improve HPC performance); parallel runtime for shrink/expand (improve cloud utilization => reduce cost)
• Conclusions

EXPERIMENTAL TESTBED AND APPLICATIONS
Platform           | Network
Ranger (TACC)      | Infiniband (10 Gbps)
Taub (UIUC)        | Voltaire QDR Infiniband
Open Cirrus (HP)   | 10 Gbps Ethernet internal; 1 Gbps Ethernet x-rack
Private Cloud (HP) | Emulated network under KVM hypervisor (1 Gbps physical Ethernet)
Public Cloud (HP)  | Emulated network under KVM hypervisor (1 Gbps physical Ethernet)

Applications:
• NAS Parallel Benchmarks, class B (NPB3.3-MPI)
• NAMD: highly scalable molecular dynamics
• ChaNGa: cosmology, N-body simulation
• Sweep3D: a particle transport code (ASCI)
• Jacobi2D: 5-point stencil computation kernel
• NQueens: backtracking state-space search

PERFORMANCE (1/3)
Some applications are cloud-friendly.

PERFORMANCE (2/3)
Some applications scale only up to 16-64 cores in the cloud.

PERFORMANCE (3/3)
Some applications cannot survive in the cloud.

BOTTLENECKS IN CLOUD: COMMUNICATION LATENCY
(Lower is better.) Cloud message latencies (~256 μs) are off by two orders of magnitude compared to supercomputers (~4 μs).

BOTTLENECKS IN CLOUD: COMMUNICATION BANDWIDTH
(Higher is better.) Cloud communication bandwidth is also off by two orders of magnitude. Why?
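The latency and bandwidth figures above come from point-to-point microbenchmarks. As a hedged illustration (the talk does not name the benchmark used; an OSU-style ping-pong and the mpi4py binding are assumptions here), a minimal sketch that measures both:

```python
# Minimal ping-pong microbenchmark sketch (assumes mpi4py is installed).
# Measures one-way latency for small messages and bandwidth for large ones;
# illustrative only, not the exact benchmark behind the plots above.
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def pingpong(nbytes, iters=1000):
    buf = bytearray(nbytes)
    comm.Barrier()                      # ranks beyond the first two just idle
    t0 = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    elapsed = time.perf_counter() - t0
    return elapsed / (2 * iters)        # one-way time per message

if __name__ == "__main__":
    for size in (8, 1 << 20):           # 8 B for latency, 1 MiB for bandwidth
        t = pingpong(size)
        if rank == 0:
            print(f"{size} B: {t * 1e6:.1f} us one-way, "
                  f"{size / t / 1e6:.1f} MB/s")
```

Run with, e.g., `mpiexec -n 2 python pingpong.py`. On the cloud VMs evaluated above, the small-message time would sit near the ~256 μs quoted, vs. ~4 μs on a supercomputer interconnect.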
COMMODITY NETWORK OR VIRTUALIZATION OVERHEAD (OR BOTH?)
(Latency: lower is better; bandwidth: higher is better.) Comparing physical and virtual machines shows significant virtualization overhead. This led to collaborative work with HP Labs Singapore on "Optimizing virtualization for HPC: thin VMs, containers, CPU affinity."

PUBLIC VS. PRIVATE CLOUD
(Lower is better.) Network performance is similar for the public and private clouds. Then why does the public cloud perform worse? Heterogeneity and multi-tenancy.

HETEROGENEITY AND MULTI-TENANCY CHALLENGE
• Heterogeneity: cloud economics is based on creating a cluster from an existing pool of resources and adding new resources incrementally.
• Multi-tenancy: cloud providers run a profitable business by improving the utilization of underutilized resources, at the cluster level by serving a large number of users, and at the server level by consolidating VMs of complementary nature (such as memory- and compute-intensive) on the same server.
• Heterogeneity and multi-tenancy are thus intrinsic to clouds, driven by cloud economics. But for HPC, one slow processor leaves all the other processors underutilized.

NEXT: HPC IN AN HPC-AWARE CLOUD (SCHEDULING/PLACEMENT)
• Challenges/bottlenecks: heterogeneity, multi-tenancy
• Opportunity: VM consolidation
• Approach: application-aware cloud schedulers

BACKGROUND: OPENSTACK NOVA
• OpenStack: an open-source cloud management system, the "Linux of cloud management"
• The Nova scheduler (VM placement) ignores the nature of the application, as well as heterogeneity and network topology
• It treats the k VMs requested by an HPC user as k separate placement problems, with no correlation between the VMs of a single request
• In short, the Nova scheduler is HPC-agnostic

HARDWARE- AND TOPOLOGY-AWARE VM PLACEMENT
• Setup: OpenStack on the Open Cirrus testbed at HP Labs; three server types: Intel Xeon E5450 (3.00 GHz), Intel Xeon X3370 (3.00 GHz), Intel Xeon X3210 (2.13 GHz); KVM as hypervisor, virtio-net for network virtualization, m1.small VMs
• CPU timelines of 8 VMs running one iteration of Jacobi2D show a decrease in execution time: a 20% improvement, consistent across all processors

WHAT ABOUT MULTI-TENANCY? VM CONSOLIDATION FOR HPC IN CLOUD (1)
• Tension: HPC performance (prefers dedicated execution) vs. resource utilization (shared usage in the cloud)
• Consolidating 0.5 GB VM requests yields up to 23% savings, but how much interference does it cause?

VM CONSOLIDATION FOR HPC IN CLOUD (2)
• Experiment: shared mode (2 applications per node, 2 cores each on a 4-core node), 4 VMs per application; performance normalized with respect to dedicated mode (higher is better)
• EP = embarrassingly parallel, LU = LU factorization, IS = integer sort, ChaNGa = cosmology
• Challenge: interference. Scope: careful co-locations can actually improve performance. Why? Shared-mode performance correlates with LLC misses/sec.

METHODOLOGY: (1) APPLICATION CHARACTERIZATION
Characterize applications along two dimensions:
1. Cache intensiveness: assign each application a cache score (= LLC misses/sec in units of 100K), representative of the pressure it puts on the last-level cache and memory controller subsystem (see the sketch after this list).
2. Parallel synchronization and network sensitivity:
   • ExtremeHPC: IS (parallel sorting)
   • SyncHPC: LU, ChaNGa
   • AsyncHPC: EP, MapReduce applications
   • NonHPC: Web applications
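A minimal sketch of how the cache score above could be computed. This is an assumption: the talk does not specify the measurement tool; Linux `perf stat` and the `LLC-load-misses` event are used here for illustration, and event names vary across CPUs.

```python
# Sketch: derive the "cache score" defined above (LLC misses/sec, in units
# of 100K) for a running application, by sampling hardware counters with
# Linux 'perf stat'. Illustrative only; not the tooling used in the talk.
import subprocess

def cache_score(pid, seconds=10):
    """Sample LLC misses of process `pid` and return its cache score."""
    out = subprocess.run(
        ["perf", "stat", "-e", "LLC-load-misses",
         "-p", str(pid), "--", "sleep", str(seconds)],
        capture_output=True, text=True).stderr   # perf prints to stderr
    for line in out.splitlines():
        if "LLC-load-misses" in line:
            misses = int(line.split()[0].replace(",", ""))
            return misses / seconds / 100_000    # units of 100K misses/sec
    raise RuntimeError("LLC-load-misses counter not found")
```

The second dimension (synchronization and network sensitivity) is assigned per application class, as in the list above (IS: ExtremeHPC, LU/ChaNGa: SyncHPC, EP: AsyncHPC).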
METHODOLOGY: (2) HPC-AWARE SCHEDULER: DESIGN AND IMPLEMENTATION
• Co-locate applications with complementary profiles, cross-interference aware via LLC misses/sec
• Topology awareness for ExtremeHPC and SyncHPC
• Dedicated execution for ExtremeHPC; resource packing for the remaining classes, with less aggressive packing for SyncHPC (bulk-synchronous applications)

MDOBP (MULTI-DIMENSIONAL ONLINE BIN PACKING)
• Pack a VM request into physical hosts (bins) with a dimension-aware heuristic: select the host for which the vector of requested resources aligns the most with the vector of remaining capacities*, i.e., the host with the minimum angle α between the two vectors in (CPU, memory) space
• With residual capacities (CPURes, MemRes) of a host and a requested VM (CPUReq, MemReq), cos(α) is computed from the dot product of the two vectors:

  cos(α) = (CPUReq × CPURes + MemReq × MemRes) / (|(CPUReq, MemReq)| × |(CPURes, MemRes)|)

* S. Lee, R. Panigrahy, V. Prabhakaran, V. Ramasubramanian, K. Talwar, L. Uyeda, and U. Wieder, "Validating Heuristics for Virtual Machines Consolidation," Microsoft Research, Tech. Rep., 2011.
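A small sketch of this heuristic (host names and capacities are hypothetical; the real implementation lives inside the extended Nova scheduler):

```python
# Sketch of the MDOBP dimension-aware heuristic described above: among hosts
# with enough residual CPU and memory, pick the one whose residual-capacity
# vector best aligns with the request vector (maximum cos(alpha), i.e.,
# minimum angle alpha). Host data below is illustrative, not from OpenStack.
import math

def cos_alpha(req, res):
    dot = req[0] * res[0] + req[1] * res[1]
    return dot / (math.hypot(*req) * math.hypot(*res))

def place_vm(request, hosts):
    """request = (cpus, mem_gb); hosts = {name: (cpu_res, mem_res)}."""
    feasible = {name: res for name, res in hosts.items()
                if res[0] >= request[0] and res[1] >= request[1]}
    if not feasible:
        return None
    # Minimum angle == maximum cosine.
    return max(feasible, key=lambda name: cos_alpha(request, feasible[name]))

hosts = {"hostA": (8, 16.0), "hostB": (2, 24.0), "hostC": (4, 4.0)}
print(place_vm((2, 4.0), hosts))  # hostA: residuals perfectly aligned (cos=1)
```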
IMPLEMENTATION ATOP OPENSTACK NOVA

RESULTS: CASE STUDY OF APPLICATION-AWARE SCHEDULING
• EP = embarrassingly parallel, LU = LU factorization, IS = integer sort; labels such as IS.B.4 give the problem size (class B) and the number of requested VMs (4); 8 nodes (32 cores); higher is better
• Less aggressive packing yields performance gains of up to 45% for a single application while limiting the negative impact of interference to 8%
• But what about resource utilization?

SIMULATION
• CloudSim: a simulation tool for modeling a cloud computing environment in a datacenter
• Extended the existing VmAllocationPolicySimple class into a VmAllocationPolicyHPC that handles a user request comprising multiple VM instances and performs application-aware scheduling
• Implemented dynamic VM creation and termination

SIMULATION RESULTS
• Assigned each job a cache score drawn uniformly at random from (0-30)
• Modified execution times by -10% and -20% to account for the performance improvement from cache-awareness
• Workload: METACENTRUM-02.swf log from the Parallel Workloads Archive; simulated the first 1500 jobs on 1024 cores for 100 seconds; β = cache threshold (higher is better)
• For a cache threshold of 60 and an adjustment of -10%, throughput improved by 259/801 = 32.3% (259 additional jobs completed)

NEXT: CLOUD-AWARE HPC RUNTIME

HETEROGENEITY AND MULTI-TENANCY AT RUNTIME
• Multi-tenancy => dynamic heterogeneity: interference is random and unpredictable
• Challenge: running in VMs makes it difficult to determine whether load imbalance is application-intrinsic or caused by extraneous factors such as interference
• When VMs time-share a CPU, application functions appear to take longer, and idle times appear
• Existing HPC load balancers ignore the effect of extraneous factors

CHARM++ AND LOAD BALANCING
• Charm++ provides migratable objects (chares) and object-based over-decomposition
• The load balancer migrates objects (work/data units) from overloaded to underloaded VMs, e.g., away from an HPC VM whose physical host also runs a background/interfering VM

CLOUD-AWARE LOAD BALANCER
• Static heterogeneity: estimate the CPU capability of each VCPU and use those estimates to drive load balancing (a simple estimation strategy plus periodic load redistribution)
• Dynamic heterogeneity: instrument the time spent on each task; capture the impact of interference by instrumenting the load external to the application under consideration (background load)
• Normalize execution time to a number of ticks (processor-independent)
• Predict future load from the loads of recently completed iterations (principle of persistence)
• Create sets of overloaded and underloaded cores, and migrate objects based on projected loads from overloaded to underloaded VMs (periodic refinement)

LOAD BALANCING APPROACH
• To get a processor-independent measure of task loads, normalize execution times to a number of ticks
• All processors should have load close to the average load, which depends on task execution time and overhead
• Overhead is the time a processor spends neither executing tasks nor idling; with Tlb the wall-clock time between two load balancing steps and Ti the CPU time consumed by task i on VCPU p, the overhead on p is Tlb - Σi Ti - (idle time)
• Task loads come from the Charm++ LB database; background load is read from the /proc/stat file
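A toy sketch of this rebalancing step (the data structures are illustrative stand-ins, not the Charm++ LB database API): task loads in ticks plus the measured background load give per-VCPU loads, and objects are migrated greedily from overloaded to underloaded VCPUs.

```python
# Sketch of the rebalancing described above: task loads are normalized to
# ticks, background (interference) load is charged to each VCPU, and objects
# are greedily migrated from overloaded to underloaded VCPUs.

def rebalance(task_ticks, bg_ticks):
    """task_ticks: per-VCPU dict {task_id: ticks over recent iterations};
    bg_ticks: per-VCPU background load measured externally (e.g. /proc/stat)."""
    loads = [sum(t.values()) + bg for t, bg in zip(task_ticks, bg_ticks)]
    avg = sum(loads) / len(loads)
    migrations = []
    for src, tasks in enumerate(task_ticks):
        # Move largest tasks first until the source is no longer overloaded.
        for tid in sorted(tasks, key=tasks.get, reverse=True):
            if loads[src] <= avg:
                break
            dst = min(range(len(loads)), key=loads.__getitem__)
            if loads[dst] + tasks[tid] >= loads[src]:
                continue  # migration would not reduce the imbalance
            loads[src] -= tasks[tid]
            loads[dst] += tasks[tid]
            migrations.append((tid, src, dst))
    return migrations

# Two VCPUs with equal task load, but VCPU 0 also carries interference,
# so one of its objects is migrated away:
print(rebalance([{"a": 40, "b": 40}, {"c": 40}], [30, 0]))
```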
RESULTS: STENCIL3D
• Setup: OpenStack on the Open Cirrus testbed (3 processor types), KVM, virtio-net, m1.small VMs, vcpupin to pin VCPUs to physical cores
• Interference: sequential NPB-FT, with the interfering VM pinned to one of the cores used by the VMs of our parallel runs (lower is better)
• Both multi-tenancy awareness and heterogeneity awareness help: periodically measuring idle time and migrating load away from time-shared VMs works well in practice

RESULTS: LOAD IMBALANCE
• The load balancer improves CPU utilization (higher colored bars are better)

RESULTS: IMPROVEMENTS BY LB
• (Higher is better.) Heterogeneity and interference: one slow node, hence four slow VMs, the rest fast, plus one interfering VM (on a fast core) that starts at iteration 50
• Up to 40% benefit

MALLEABLE PARALLEL JOBS
• Malleable jobs can dynamically shrink or expand the number of processors they run on; this has twofold merit in the context of cloud computing
• Cloud user perspective: exploit dynamic pricing offered by cloud providers such as Amazon EC2; better value for the money spent, based on priorities and deadlines
• Cloud provider perspective: malleable jobs plus a smart scheduler => better system utilization, response time, and throughput while meeting QoS and honoring job priorities

SHRINK AND EXPAND PROTOCOLS
• Shrink: the launcher (charmrun) sends a CCS shrink request to the application processes; at a synchronization point the tasks/objects check for a shrink/expand request, then perform object evacuation and load balancing; processes checkpoint to Linux shared memory and either are reborn (exec) or die (exit); a reconnect protocol re-establishes communication; objects are restored from the checkpoint; a ShrinkAck goes to the external client; execution resumes via a stored callback.
• Expand: the launcher (charmrun) sends a CCS expand request; at a synchronization point the tasks/objects check for the request; processes checkpoint to Linux shared memory and either are reborn (exec) or are newly launched (ssh, fork); a connect protocol establishes communication; objects are restored from the checkpoint; an ExpandAck goes to the external client; load balancing redistributes objects; execution resumes via a stored callback.
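The shrink path can be sketched as follows. This is a toy, single-process stand-in: plain dicts play the roles of the shared-memory checkpoint and the object-to-processor map, while the actual CCS/exec/reconnect machinery lives in the Charm++ runtime.

```python
# Toy sketch of the shrink path described above, with plain Python objects
# standing in for Charm++ chares and a dict standing in for the shared-memory
# checkpoint. It only mirrors the control flow, not the real protocol.

def shrink(object_map, new_procs):
    """object_map: {object_id: proc}; returns the remapped object_map."""
    # 1. Sync point: all objects reach the checkpoint barrier.
    checkpoint = dict(object_map)        # 2. Checkpoint to "shared memory".
    survivors = list(range(new_procs))   # 3. Procs >= new_procs would exit.
    # 4. Restore: evacuate objects from dying procs onto survivors,
    #    round-robin as a stand-in for the load balancing step.
    remapped = {}
    for i, (obj, proc) in enumerate(sorted(checkpoint.items())):
        remapped[obj] = proc if proc in survivors else survivors[i % new_procs]
    # 5. A ShrinkAck would be sent to the external client here;
    #    execution then resumes via the stored callback.
    return remapped

objs = {f"chare{i}": i % 8 for i in range(16)}   # 16 objects on 8 procs
print(shrink(objs, new_procs=4))                 # shrink to 4 procs
```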
CONCLUSIONS
• Bridge the gap between HPC and cloud, in both performance and cost: HPC-aware clouds and cloud-aware HPC
• Key ideas extend beyond HPC clouds: application-aware scheduling, characterization, and consolidation; load balancing; malleable jobs
• Comprehensive evaluation and analysis: performance benchmarking, application characterization

FINDINGS
• Who: small and medium-scale organizations (pay-as-you-go benefits), owning applications whose performance/cost ratio is best in the cloud vs. other platforms
• What: applications with less-intensive communication patterns, low sensitivity to noise/interference, and small to medium scale
• Why: HPC users in small-medium enterprises are much more sensitive to the CAPEX/OPEX argument; ability to exploit a large variety of architectures (better utilization at global scale, potential consumer savings)
• How: technically, lightweight virtualization, CPU affinity, HPC-aware cloud schedulers, and cloud-aware HPC runtimes; HPC-in-the-cloud models such as cloud bursting and a hybrid supercomputer-cloud approach with application-aware mapping

FUTURE WORK
• Combine application-aware cloud consolidation with the cloud-aware HPC load balancer
• Mapping applications to platforms

POSSIBLE BENEFITS
• Cost = charging rate ($ per core-hour) × P × time (lower is better)
• Under a time constraint, choose the platform that meets the deadline; under a cost constraint, choose the platform that fits the budget
• Interesting cross-over points appear when considering cost: the best platform depends on scale and budget

CLOUD PROVIDER PERSPECTIVE
• A queue of jobs, scheduled with standard policies (backfilling, FCFS, priority) plus application-awareness: a multi-dimensional optimization giving less load on the supercomputer, reduced wait time, and better cloud utilization
• Online job scheduling: job Ji = (t, pn), where pn = f(t, n) for each of the n ∈ N platforms, with deadlines; output: start time si and platform ni for each job
• Optimization objectives: utilization, per-job turnaround time, throughput; with simplifications

WHAT ELSE I HAVE DONE: LARGE-SCALE HPC APPLICATIONS
• EpiSimdemics: collaborated with Virginia Tech researchers to enable parallel simulation of contagion diffusion over very large social networks; it scales up to 300,000 cores on Blue Waters. My focus was on leveraging (and developing) Charm++ runtime features to optimize the performance of EpiSimdemics.
• Information sets for game trees: parallelized information-set generation for game-tree search applications; analyzed the impact of load balancing strategies, problem sizes, and computational granularity on parallel scaling.

WHAT ELSE I HAVE DONE: RUNTIME SYSTEMS AND SCHEDULERS
• Charm++ runtime system: various projects on research and development of the Charm++ parallel programming system and its ecosystem (tools, etc.)
• Adaptive job scheduler: extended an open-source job scheduler (SLURM) to enable malleable HPC jobs, with runtime support in Charm++ for dynamic shrink/expand
• Power-aware load balancing and scheduling
• Scalable tree startup: a multi-level scalable startup technique for parallel applications

WHAT ELSE I HAVE DONE: ARCHITECTURES FOR DATA-INTENSIVE APPLICATIONS
• Graph500 and HPCC GUPS simulation (Sandia SST)

QUESTIONS?

BACKUP SLIDES

HPC-CLOUD ECONOMICS
• Why cloud for HPC? Small-medium enterprises and startups with HPC needs; is the cost of running in the cloud lower than on a supercomputer, at least for some applications?
• Cost = charging rate ($ per core-hour) × P × time; comparing $ per CPU-hour on a supercomputer with $ per CPU-hour in the cloud (a high ratio means it is cheaper to run in the cloud)*
• The cloud can be cost-effective up to some scale, but what about performance? (Lower is better.) The best platform depends on application characteristics: how do we select a platform for an application?
* Thanks to Dejan Milojicic and Paolo Faraboschi, who originally drew this figure.

PROPOSED WORK (1): APPLICATIONS TO PLATFORMS
1. Application characterization and relative performance estimation for structured applications
2. One-time benchmarking plus interpolation for complex applications
• Platform selection algorithms (cloud user perspective): minimize cost while meeting a performance target; maximize performance under a cost constraint; consider an application set as a whole (which application on which cloud)
• Benefits: performance, cost

IMPACT
• Effective HPC in the cloud (performance, cost); some techniques are applicable beyond clouds
• Artifacts: Charm++ production system, OpenStack scheduler, CloudSim
• Industry participation (HP Labs award, internships); 2 patents
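Finally, a sketch of the platform-selection idea from the economics and proposed-work slides above. The rates, core counts, and runtimes below are made-up placeholders, not measured data.

```python
# Sketch of cost-based platform selection (Cost = rate * P * time), as in the
# economics slides: pick the cheapest platform meeting a deadline, or the
# fastest platform within a budget.

def cost(rate_per_core_hour, cores, hours):
    return rate_per_core_hour * cores * hours

def select(platforms, deadline_h=None, budget=None):
    """platforms: {name: (rate $/core-hour, cores P, runtime hours)}."""
    feasible = {
        name: (cost(r, p, t), t)
        for name, (r, p, t) in platforms.items()
        if (deadline_h is None or t <= deadline_h)
        and (budget is None or cost(r, p, t) <= budget)
    }
    if not feasible:
        return None
    if deadline_h is not None:        # minimize cost under the time constraint
        return min(feasible, key=lambda n: feasible[n][0])
    return min(feasible, key=lambda n: feasible[n][1])  # fastest within budget

platforms = {"supercomputer": (1.00, 256, 1.0),   # fast but expensive
             "cloud": (0.10, 256, 6.0)}           # slow but cheap
print(select(platforms, deadline_h=2.0))  # supercomputer: only one in time
print(select(platforms, budget=200.0))    # cloud: SC costs $256 > budget
```

This mirrors the cross-over observation above: tighten the deadline and the supercomputer wins; tighten the budget and the cloud wins.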