EFFICIENT HIGH PERFORMANCE COMPUTING IN THE CLOUD
Abhishek Gupta (gupta59@illinois.edu)
Department of Computer Science,
University of Illinois at Urbana-Champaign, Urbana, IL
MOTIVATION: WHY CLOUDS FOR HPC?
• Rent vs. own, pay-as-you-go
  • No startup/maintenance cost, no cluster creation time
• Elastic resources
  • No risk of under-provisioning
  • Prevents underutilization
• Benefits of virtualization
  • Flexibility and customization
  • Security and isolation
  • Migration and resource control
• Multiple providers offer Infrastructure as a Service

Cloud for HPC: a cost-effective and timely solution?
MOTIVATION: HPC-CLOUD DIVIDE

HPC:
• Application performance
• Dedicated execution
• HPC-optimized interconnects, OS
• Not cloud-aware

Cloud:
• Service, cost, resource utilization
• Multi-tenancy
• Commodity network, virtualization
• Not HPC-aware

Mismatch between HPC requirements and cloud characteristics: today, only embarrassingly parallel, small-scale HPC applications run in clouds.
OBJECTIVES
• HPC in cloud: what, why, who
• Improve HPC performance
• Improve cloud utilization => reduce cost
• How: bridge the HPC-cloud gap

OBJECTIVES AND CONTRIBUTIONS
• Goals: HPC in cloud (what, why, who); bridge the HPC-cloud gap
• Techniques:
  • Performance and cost analysis
  • Smart selection of platforms for applications
  • Application-aware VM consolidation
  • Heterogeneity- and multi-tenancy-aware HPC
  • Malleable jobs: dynamic shrink/expand
• Tools:
  • Extended OpenStack Nova scheduler
  • CloudSim simulator
  • Load balancing framework
  • Object migration
OUTLINE
• Performance of HPC in cloud
  • Trends
  • Challenges and opportunities
• Application-aware cloud schedulers
  • HPC-aware schedulers: improve HPC performance
  • Application-aware consolidation: improve cloud utilization => reduce cost
• Cloud-aware HPC runtime
  • Dynamic load balancing: improve HPC performance
  • Parallel runtime for shrink/expand: improve cloud utilization => reduce cost
• Conclusions
EXPERIMENTAL TESTBED AND APPLICATIONS

Platform           | Network
-------------------|--------------------------------------------------------
Ranger (TACC)      | Infiniband (10 Gbps)
Taub (UIUC)        | Voltaire QDR Infiniband
Open Cirrus (HP)   | 10 Gbps Ethernet internal; 1 Gbps Ethernet cross-rack
Private Cloud (HP) | Emulated network card under KVM hypervisor (1 Gbps physical Ethernet)
Public Cloud (HP)  | Emulated network under KVM hypervisor (1 Gbps physical Ethernet)

Applications:
• NAS Parallel Benchmarks, class B (NPB3.3-MPI)
• NAMD: highly scalable molecular dynamics
• ChaNGa: cosmology, N-body
• Sweep3D: a particle transport (ASCI) code
• Jacobi2D: 5-point stencil computation kernel
• NQueens: backtracking state-space search
PERFORMANCE (1/3)
Some applications are cloud-friendly.

PERFORMANCE (2/3)
Some applications scale only up to 16-64 cores.

PERFORMANCE (3/3)
Some applications cannot survive in the cloud.
BOTTLENECKS IN CLOUD: COMMUNICATION LATENCY
Cloud message latencies (256 μs) are off by two orders of magnitude compared to supercomputers (4 μs).
BOTTLENECKS IN CLOUD: COMMUNICATION BANDWIDTH
Cloud communication performance is off by two orders of magnitude. Why?
COMMODITY NETWORK OR VIRTUALIZATION OVERHEAD (OR BOTH)?
• Significant virtualization overhead (physical vs. virtual)
• Led to collaborative work on "Optimizing Virtualization for HPC – Thin VMs, Containers, CPU Affinity" with HP Labs, Singapore.
PUBLIC VS. PRIVATE CLOUD
Network performance is similar for the public and private clouds. Then why does the public cloud perform worse?
Answer: heterogeneity and multi-tenancy.
HETEROGENEITY AND MULTI-TENANCY CHALLENGE
• Heterogeneity: cloud economics is based on
  • creation of a cluster from an existing pool of resources, and
  • incremental addition of new resources.
• Multi-tenancy: cloud providers run a profitable business by improving utilization of underutilized resources
  • at the cluster level, by serving a large number of users;
  • at the server level, by consolidating VMs of complementary nature (such as memory- and compute-intensive) on the same server.

Heterogeneity and multi-tenancy are intrinsic to clouds, driven by cloud economics. But for HPC, one slow processor leaves all the other processors underutilized.
NEXT: SCHEDULING/PLACEMENT
HPC in an HPC-aware cloud: the challenges/bottlenecks of heterogeneity and multi-tenancy become opportunities (e.g., VM consolidation) for application-aware cloud schedulers.
BACKGROUND: OPENSTACK NOVA
• OpenStack
  • Open-source cloud management system
  • The "Linux" of cloud management
• Nova scheduler (VM placement)
  • Ignores the nature of the application
  • Ignores heterogeneity and network topology
  • Treats the k VMs requested by an HPC user as k separate placement problems
  • No correlation between the VMs of a single request

The Nova scheduler is HPC-agnostic; the sketch below makes the contrast concrete.
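To illustrate the difference, here is a minimal Python sketch, not actual OpenStack code: the host/rack structures and function names are assumptions for illustration only. It contrasts placing each VM of a k-VM request independently with treating the request as one placement problem (e.g., keeping it inside a single rack).

```python
# Illustrative sketch (not Nova code): HPC-agnostic vs. HPC-aware placement.
from typing import Dict, List, Optional

def place_agnostic(hosts: Dict[str, int], k: int) -> List[str]:
    """Place k VMs one at a time on whichever host has the most free cores,
    with no notion that the VMs belong together."""
    placement = []
    for _ in range(k):
        host = max(hosts, key=hosts.get)      # most free cores first
        if hosts[host] == 0:
            raise RuntimeError("out of capacity")
        hosts[host] -= 1
        placement.append(host)
    return placement

def place_hpc_aware(racks: Dict[str, Dict[str, int]], k: int) -> Optional[List[str]]:
    """Treat the k VMs as one request: prefer a single rack so tightly
    coupled VMs share the fast intra-rack network."""
    for rack_hosts in racks.values():
        if sum(rack_hosts.values()) >= k:     # whole request fits here
            return place_agnostic(rack_hosts, k)
    return None                               # reject or fall back
```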
HARDWARE- AND TOPOLOGY-AWARE VM PLACEMENT
• OpenStack on the Open Cirrus testbed at HP Labs; three types of servers:
  • Intel Xeon E5450 (3.00 GHz)
  • Intel Xeon X3370 (3.00 GHz)
  • Intel Xeon X3210 (2.13 GHz)
• KVM as hypervisor, virtio-net for network virtualization, VMs: m1.small

CPU timelines of 8 VMs running one iteration of Jacobi2D show a decrease in execution time: a 20% improvement, across all processors.
WHAT ABOUT MULTI-TENANCY? VM CONSOLIDATION FOR HPC IN CLOUD (1)
The tension: HPC performance (prefers dedicated execution) vs. resource utilization (shared usage in cloud).
Consolidating VM requests (figure: packing of requests, e.g., 0.5 GB VMs) yields up to 23% savings. But how much interference does it cause?
VM CONSOLIDATION FOR HPC IN CLOUD (2)
Experiment: shared-mode performance (2 apps per node, 2 cores each on a 4-core node; 4 VMs per app) normalized w.r.t. dedicated mode.
• EP = Embarrassingly Parallel
• LU = LU factorization
• IS = Integer Sort
• ChaNGa = cosmology

Challenge: interference. But there is scope: careful co-locations can actually improve performance. Why? There is a correlation between LLC misses/sec and shared-mode performance.
METHODOLOGY: (1) APPLICATION CHARACTERIZATION
Characterize applications along two dimensions:
1. Cache intensiveness
  • Assign each application a cache score (in units of 100K LLC misses/sec)
  • Representative of the pressure it puts on the last-level cache and memory-controller subsystem
2. Parallel synchronization and network sensitivity
  • ExtremeHPC: IS (parallel sorting)
  • SyncHPC: LU, ChaNGa
  • AsyncHPC: EP, MapReduce applications
  • NonHPC: web applications
A sketch of this characterization follows.
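A minimal Python sketch of the two-dimensional characterization, assuming the class labels and score unit from the slide; the example application mapping and the conservative default are illustrative assumptions.

```python
# Sketch of the characterization above; thresholds/examples are assumptions.

def cache_score(llc_misses_per_sec: float) -> float:
    """One cache-score unit per 100K LLC misses/sec (per the slide)."""
    return llc_misses_per_sec / 100_000

# Sensitivity classes from the slide, ordered by how tightly the
# application's processes synchronize (hence how much interference hurts).
CLASSES = ["ExtremeHPC", "SyncHPC", "AsyncHPC", "NonHPC"]

def classify(app_name: str) -> str:
    examples = {
        "IS": "ExtremeHPC",      # parallel sorting, all-to-all traffic
        "LU": "SyncHPC",
        "ChaNGa": "SyncHPC",
        "EP": "AsyncHPC",
        "MapReduce": "AsyncHPC",
        "WebApp": "NonHPC",
    }
    return examples.get(app_name, "SyncHPC")  # conservative default
```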
METHODOLOGY: (2) HPC-AWARE SCHEDULER: DESIGN AND IMPLEMENTATION
• Co-locate applications with complementary profiles
• Cross-interference aware, using LLC misses/sec
• Dedicated execution for ExtremeHPC
• Topology-awareness for ExtremeHPC and SyncHPC
• Resource packing for the remaining classes
• Less aggressive packing for SyncHPC (bulk-synchronous apps)
A sketch of the co-location rule appears below.
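A hedged sketch of the co-location rule these design points imply: never share a node with ExtremeHPC jobs, and co-locate others only while the summed cache scores stay under a threshold (called β on the later simulation slide). The default threshold value and function shape are assumptions.

```python
# Sketch of the cross-interference guard; beta default is an assumption.

def can_colocate(host_apps, candidate, beta: float = 60.0) -> bool:
    """host_apps: list of (hpc_class, cache_score) already on the host.
    candidate: (hpc_class, cache_score) of the incoming application."""
    cls, score = candidate
    if cls == "ExtremeHPC":
        return not host_apps                  # dedicated execution only
    if any(c == "ExtremeHPC" for c, _ in host_apps):
        return False                          # never disturb ExtremeHPC
    total = score + sum(s for _, s in host_apps)
    return total <= beta                      # cache-pressure guard
```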
MDOBP (MULTI-DIMENSIONAL ONLINE BIN PACKING)
• Pack a VM request into hosts (bins)
• Dimension-aware heuristic
• Select the host for which the vector of requested resources aligns the most with the vector of remaining capacities*: the host with the minimum α, where cos(α) is calculated using the dot product of the two vectors:

cos(α) = (CPUReq × CPURes + MemReq × MemRes) / (|(CPUReq, MemReq)| × |(CPURes, MemRes)|)

Residual capacities of a host: (CPURes, MemRes); requested VM: (CPUReq, MemReq).

*S. Lee, R. Panigrahy, V. Prabhakaran, V. Ramasubramanian, K. Talwar, L. Uyeda, and U. Wieder, "Validating Heuristics for Virtual Machines Consolidation," Microsoft Research, Tech. Rep., 2011.
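A minimal, runnable Python sketch of this dimension-aware heuristic; host names and the example request are made up for illustration.

```python
# Sketch of MDOBP host selection: pick the host whose remaining-capacity
# vector (CPU, memory) is best aligned with the request vector, i.e.
# minimum angle alpha, equivalently maximum cos(alpha).
import math

def alignment(req, res):
    """cos(alpha) via the dot product; larger means better aligned."""
    dot = req[0] * res[0] + req[1] * res[1]
    norm = math.hypot(*req) * math.hypot(*res)
    return dot / norm if norm else -1.0

def pick_host(hosts, req):
    """hosts: {name: (cpu_res, mem_res)}; req: (cpu_req, mem_req).
    Only hosts that can fit the request at all are candidates."""
    fitting = {h: r for h, r in hosts.items()
               if r[0] >= req[0] and r[1] >= req[1]}
    if not fitting:
        return None
    return max(fitting, key=lambda h: alignment(req, fitting[h]))

# A (2 CPU, 4 GB) request prefers host B, whose leftover capacity
# (4, 8) points in exactly the same direction as the request.
print(pick_host({"A": (8, 4), "B": (4, 8)}, (2, 4)))  # -> "B"
```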
IMPLEMENTATION ATOP OPENSTACK NOVA
RESULTS: CASE STUDY OF APPLICATION-AWARE SCHEDULING
• EP = Embarrassingly Parallel, LU = LU factorization, IS = Integer Sort
• Benchmark naming: e.g., IS.B.4 = IS, problem size B, 4 requested VMs
• Testbed: 8 nodes (32 cores); less aggressive packing
• Performance gains of up to 45% for a single application
• Negative impact of interference limited to 8%
• But what about resource utilization?
SIMULATION
• CloudSim: simulation tool for modeling a cloud computing environment in a datacenter
• Extended the existing VmAllocationPolicySimple class to create a VmAllocationPolicyHPC that can
  • handle a user request comprising multiple VM instances, and
  • perform application-aware scheduling
• Implemented dynamic VM creation and termination
SIMULATION RESULTS
• Assigned each job a cache score in (0-30) using a uniformly distributed random number generator
• Modified execution times by -10% and -20% to account for the performance improvement from cache-awareness
• Workload: first 1500 jobs of the METACENTRUM-02.swf log from the Parallel Workloads Archive, simulated on 1024 cores for 100 seconds
• β = cache threshold
• For a cache threshold of 60 and an adjustment of -10%, throughput improves by 259/801 = 32.3% (259 additional jobs completed)
HETEROGENEITY AND MULTI-TENANCY
• Multi-tenancy => dynamic heterogeneity
  • Interference is random and unpredictable
• Challenge: running in VMs makes it difficult to determine whether (and how much of) the load imbalance is
  • application-intrinsic, or
  • caused by extraneous factors such as interference
• When VMs share a CPU, application functions appear to take longer (idle times show up on the per-CPU/VM timeline)

Existing HPC load balancers ignore the effect of extraneous factors.
CHARM++ AND LOAD BALANCING
• Migratable objects (chares): work/data units
• Object-based over-decomposition
• The load balancer migrates objects from an overloaded to an underloaded VM, e.g., away from an HPC VM whose physical host also runs a background/interfering VM
CLOUD-AWARE LOAD BALANCER
• Static heterogeneity:
  • Estimate the CPU capability of each VCPU, and use those estimates to drive the load balancing
  • Simple estimation strategy + periodic load redistribution
• Dynamic heterogeneity:
  • Instrument the time spent on each task
  • Impact of interference: instrument the load external to the application under consideration (background load)
  • Normalize execution time to a number of ticks (processor-independent)
  • Predict future load from the loads of recently completed iterations (principle of persistence)
  • Create sets of overloaded and underloaded cores
  • Migrate objects, based on projected loads, from overloaded to underloaded VMs (periodic refinement)
LOAD BALANCING APPROACH
• To get a processor-independent measure of task loads, normalize execution times to a number of ticks
• All processors should end up with load close to the average load
• The average load depends on task execution time and overhead
• Overhead is the time a processor is neither executing tasks nor idle. With T_lb the wall-clock time between two load balancing steps and T_i the CPU time consumed by task i on VCPU p:

  overhead_p = T_lb − Σ_i T_i − idle_p

• Task times come from the Charm++ LB database; idle and overhead times come from the /proc/stat file
A sketch of the resulting rebalancing step follows.
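A hedged Python sketch of the periodic redistribution these two slides describe: normalize task times to processor-independent ticks, compute each VCPU's load including overhead (the interference term above), then move objects from overloaded to underloaded VCPUs. All names and the tolerance value are illustrative; the real implementation lives in the Charm++ load balancing framework.

```python
# Sketch of tick normalization + periodic refinement (names are assumptions).

def normalize(task_times_sec, ticks_per_sec):
    """Wall-clock seconds -> processor-independent ticks."""
    return [t * ticks_per_sec for t in task_times_sec]

def rebalance(vcpus, tolerance=1.10):
    """vcpus: {id: {"tasks": {obj: ticks}, "overhead": ticks}}, where
    overhead = T_lb - sum(T_i) - idle, i.e. time lost to interference."""
    load = {p: sum(v["tasks"].values()) + v["overhead"]
            for p, v in vcpus.items()}
    avg = sum(load.values()) / len(load)
    over = [p for p in load if load[p] > tolerance * avg]
    under = sorted((p for p in load if load[p] < avg), key=load.get)
    moves = []
    for src in over:
        # Migrate the largest objects first until src is near average.
        for obj, t in sorted(vcpus[src]["tasks"].items(),
                             key=lambda kv: -kv[1]):
            if load[src] <= tolerance * avg or not under:
                break
            dst = under[0]
            moves.append((obj, src, dst))
            load[src] -= t
            load[dst] += t
            if load[dst] >= avg:
                under.pop(0)                  # dst is no longer underloaded
    return moves
```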
RESULTS: STENCIL3D
• OpenStack on the Open Cirrus testbed (3 types of processors), KVM, virtio-net, VMs: m1.small, vcpupin used to pin VCPUs to physical cores
• Sequential NPB-FT as interference; the interfering VM is pinned to one of the cores that the VMs of our parallel runs use
• Both heterogeneity awareness and multi-tenancy awareness reduce execution time (lower is better)

Periodically measuring idle time and migrating load away from time-shared VMs works well in practice.
RESULTS
The load balancer reduces load imbalance and improves CPU utilization.
RESULTS: IMPROVEMENTS BY LB
Setup: heterogeneity and interference; one slow node (hence four slow VMs), the rest fast; one interfering VM (on a fast core) that starts at iteration 50.
Benefits of up to 40%.
MALLEABLE PARALLEL JOBS
• Malleable jobs: dynamically shrink/expand the number of processors
• Twofold merit in the context of cloud computing:
  • Cloud user perspective:
    • Dynamic pricing offered by cloud providers, such as Amazon EC2
    • Better value for the money spent, based on priorities and deadlines
  • Cloud provider perspective:
    • Malleable jobs + a smart scheduler => better system utilization, response time, and throughput while meeting QoS
    • Honor job priorities
SHRINK PROTOCOL
1. External client sends a shrink request to the launcher (charmrun) via CCS
2. Application processes reach a synchronization point and check for a shrink/expand request
3. Object evacuation from departing processors
4. Load balancing
5. Checkpoint to Linux shared memory
6. Rebirth (exec) or die (exit)
7. Reconnect protocol
8. Restore objects from checkpoint
9. ShrinkAck to the external client
10. Execution resumes via a stored callback

EXPAND PROTOCOL
1. External client sends an expand request to the launcher (charmrun) via CCS
2. Application processes reach a synchronization point and check for a shrink/expand request
3. Checkpoint to Linux shared memory
4. Rebirth (exec) or launch of new processes (ssh, fork)
5. Connect protocol
6. Restore objects from checkpoint
7. ExpandAck to the external client
8. Load balancing
9. Execution resumes via a stored callback

The two timelines mirror each other; note that load balancing runs before the checkpoint when shrinking but after the restore when expanding, as the sketch below highlights.
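An assumption-level Python rendering of the resize check (not the Charm++ API; step strings mirror the timelines above), showing the shared structure and the differing position of load balancing.

```python
# Sketch only: the runtime polls for a CCS shrink/expand request at each
# synchronization point and, if one is pending, walks the protocol steps.

SHRINK = ["object evacuation", "load balancing",
          "checkpoint to linux shared memory",
          "rebirth (exec) or die (exit)",
          "reconnect protocol", "restore objects from checkpoint",
          "ShrinkAck to external client"]

EXPAND = ["checkpoint to linux shared memory",
          "rebirth (exec) or launch (ssh, fork)",
          "connect protocol", "restore objects from checkpoint",
          "ExpandAck to external client", "load balancing"]

def at_sync_point(pending_request):
    if pending_request is None:
        return                                 # nothing to do this round
    steps = SHRINK if pending_request == "shrink" else EXPAND
    for s in steps:
        print("->", s)
    print("-> execution resumes via stored callback")

at_sync_point("shrink")
```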
CONCLUSIONS
• Bridge the gap between HPC and cloud
  • Performance and cost
  • HPC-aware clouds and cloud-aware HPC
• Key ideas can be extended beyond HPC in clouds
  • Application-aware scheduling, characterization, and consolidation
  • Load balancing
  • Malleable jobs
• Comprehensive evaluation and analysis
  • Performance benchmarking
  • Application characterization
FINDINGS

Who:
• Small and medium-scale organizations (pay-as-you-go benefits)
• Those owning applications that achieve the best performance/cost ratio in the cloud vs. other platforms

What:
• Applications with less-intensive communication patterns
• Less sensitivity to noise/interference
• Small to medium scale

Why:
• HPC users in small-medium enterprises are much more sensitive to the CAPEX/OPEX argument
• Ability to exploit a large variety of different architectures (better utilization at global scale, potential consumer savings)

How:
• Technical: lightweight virtualization, CPU affinity, HPC-aware cloud schedulers, cloud-aware HPC runtime
• HPC-in-the-cloud models: cloud bursting, hybrid supercomputer-cloud approach, application-aware mapping
FUTURE WORK
• Application-aware cloud consolidation + cloud-aware HPC load balancer
• Mapping applications to platforms
POSSIBLE BENEFITS
Cost = charging rate ($ per core-hour) × P × time
Under a time constraint, choose the cheapest platform that meets the deadline; under a cost constraint, choose the fastest platform within budget.
There are interesting cross-over points when considering cost: the best platform depends on scale and budget. A worked sketch follows.
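A worked Python sketch of the cross-over argument, using the deck's cost model Cost = rate × P × time. The rates and timings here are made-up placeholders, not measured numbers from the slides.

```python
# Sketch of platform choice under the deck's cost model.

def cost(rate_per_core_hour, cores, hours):
    return rate_per_core_hour * cores * hours

def pick_platform(platforms, deadline_h=None, budget=None):
    """platforms: {name: (rate $/core-hour, cores P, hours to finish)}."""
    best, best_cost = None, float("inf")
    for name, (rate, p, t) in platforms.items():
        c = cost(rate, p, t)
        if deadline_h is not None and t > deadline_h:
            continue                      # violates the time constraint
        if budget is not None and c > budget:
            continue                      # violates the cost constraint
        if c < best_cost:
            best, best_cost = name, c
    return best, best_cost

platforms = {"supercomputer": (0.50, 64, 1.0),   # fast but expensive
             "cloud": (0.12, 64, 3.5)}           # slow but cheap
print(pick_platform(platforms, deadline_h=4))  # ('cloud', 26.88)
print(pick_platform(platforms, deadline_h=2))  # ('supercomputer', 32.0)
```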
CLOUD PROVIDER PERSPECTIVE
• Queue of jobs; standard scheduling (backfilling, FCFS, priority) + application-awareness
• Multi-dimensional optimization:
  • Less load on the supercomputer
  • Reduced wait time
  • Better cloud utilization
• Online job scheduling: J_i = (t, p_n), p_n = f(t, n), n ∈ N platforms, with deadlines
• Output: start time (s_i) and platform (n_i)
• Optimization function: utilization, per-job turnaround time, throughput
• Simplifications apply; the problem is stated more formally below
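A hedged formalization of the slide's notation, with subscripts made explicit; the deadline symbol d_i is an added assumption for clarity.

```latex
% Job J_i has running time p_{i,n} = f(t_i, n) on platform n; the
% scheduler picks a start time s_i and a platform n_i subject to the
% deadline, optimizing utilization / turnaround time / throughput.
\[
  J_i = (t_i, p_{i,n}), \qquad p_{i,n} = f(t_i, n), \qquad n \in \{1, \dots, N\},
\]
\[
  \text{output } (s_i, n_i) \ \text{ such that } \ s_i + p_{i, n_i} \le d_i .
\]
```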
WHAT ELSE I HAVE DONE
Large-scale HPC applications
• EpiSimdemics:
  • Collaborated with Virginia Tech researchers to enable parallel simulation of contagion diffusion over very large social networks
  • Scales up to 300,000 cores on Blue Waters
  • My focus: leveraging (and developing) Charm++ runtime features to optimize the performance of EpiSimdemics
• Information sets for game trees:
  • Parallelized information-set generation for game-tree search applications
  • Analyzed the impact of load balancing strategies, problem sizes, and computational granularity on parallel scaling
WHAT ELSE I HAVE DONE
Runtime systems and schedulers
• Charm++ runtime system:
  • Various projects for research and development of the Charm++ parallel programming system and its associated ecosystem (tools, etc.)
• Adaptive job scheduler:
  • Extended an open-source job scheduler (SLURM) to enable malleable HPC jobs
  • Runtime support in Charm++ for such dynamic shrink/expand capability
  • Power-aware load balancing and scheduling
• Scalable tree startup:
  • A multi-level scalable startup technique for parallel applications
WHAT ELSE I HAVE DONE
Architectures for data-intensive applications
• Graph500, HPCC GUPS
• Simulation: Sandia SST
QUESTIONS?

BACKUP SLIDES
HPC-CLOUD ECONOMICS
• Then why cloud for HPC?
  • Small-medium enterprises and startups with HPC needs
  • Lower cost of running in the cloud vs. a supercomputer, at least for some applications?
HPC-CLOUD ECONOMICS*
Cost = charging rate ($ per core-hour) × P × time
The figure compares $ per CPU-hour on a supercomputer with $ per CPU-hour in the cloud; a high ratio means it is cheaper to run in the cloud.
The cloud can be cost-effective up to some scale, but what about performance?
* Acknowledgment to Dejan Milojicic and Paolo Faraboschi, who originally drew this figure.
HPC-CLOUD ECONOMICS
Cost = charging rate ($ per core-hour) × P × time
The best platform depends on application characteristics. How to select a platform for an application?
PROPOSED WORK (1): APP-TO-PLATFORM
1. Application characterization and relative performance estimation for structured applications
  • One-time benchmarking + interpolation for complex apps
2. Platform selection algorithms (cloud user perspective)
  • Minimize cost while meeting a performance target
  • Maximize performance under a cost constraint
  • Consider an application set as a whole: which application, which cloud
Benefits: performance, cost
IMPACT
• Effective HPC in the cloud (performance, cost)
• Some techniques applicable beyond clouds
• Charm++ production system
• OpenStack scheduler
• CloudSim
• Industry participation (HP Labs award, internships)
• 2 patents