The past, present, and future of
Green Computing
Kirk W. Cameron
SCAPE Laboratory
Virginia Tech
Enough About Me
• Associate Professor, Virginia Tech
• Co-founder, Green500
• Co-founder, MiserWare
• Founding member, SPECpower
• Consultant for EPA Energy Star for Servers
• IEEE Computer “Green IT” columnist
• Over $4M in federally funded “Green” research
• SystemG Supercomputer
What is SCAPE?
• Scalable Performance Laboratory
– Founded 2001 by Cameron
• Vision
– Improve efficiency of high-end systems
• Approach
– Exploit/create technologies for high-end systems
– Conduct quality research to solve important
problems
– When appropriate, commercialize technologies
– Educate and train the next generation of HPC computer scientists
The Big Picture (Today)
• Past: Challenges
– Need to measure and correlate power data
– Save energy while maintaining performance
• Present
– Software/hardware infrastructure for power measurement
– Intelligent Power Management (CPU Miser, Memory Miser)
– Integration with other toolkits (PAPI, Prophesy)
• Future: Research + Commercialization
– Management Infrastructure for Energy Reduction
– MiserWare, Inc.
– Holistic Power Management
1882 - 2001
Prehistory
1882 - 2001
• Embedded systems
• General Purpose Microarchitecture
– Circa 1999, power becomes a disruptive technology
– Moore’s Law + clock frequency arms race
– Simulators emerge (e.g., Princeton’s Wattch)
– Related work continues today (CMPs, SMT, etc.)
2002
Server Power
2002
• IBM Austin
– Energy-aware commercial servers [Keller et al]
• LANL
– Green Destiny [Feng et al]
• Observations
– IBM targets commercial apps
– Feng et al achieve power savings in exchange for
performance loss
HPC Power
2002
• My observations
– Power will become disruptive to HPC
– Laptops outselling PCs
– Commercial power-aware not appropriate for HPC
At $800,000 per year per megawatt:
– TM CM-5 (0.005 megawatts): $4,000/yr
– Residential A/C (0.015 megawatts): $12,000/yr
– Intel ASCI Red (0.850 megawatts): $680,000/yr
– High-speed train (10 megawatts): $8 million/yr
– Earth Simulator (12 megawatts): $9.6 million/yr
(A conventional power plant produces roughly 300 megawatts.)
HPPAC Emerges
2002
• SCAPE Project
– High-performance,
power-aware computing
– Two initial goals
• Measurement tools
• Power/energy savings
– Big Goals…no funding
(risk all startup funds)
2003 - 2004
Cluster Power
2003 - 2004
• IBM Austin
– On evaluating request-distribution schemes for saving energy
in server clusters, ISPASS ‘03 [Lefurgy et al]
– Improving Server Performance on Transaction Processing Workloads
by Enhanced Data Placement. SBAC-PAD ’04 [Rubio et al]
• Rutgers
– Energy conservation techniques for disk array-based servers.
ICS ’04 [Bianchini et al]
• SCAPE
– High-performance, power-aware computing, SC04
– Power measurement + power/energy savings
2003 - 2004
PowerPack Measurement
Scalable, synchronized, and accurate.
[Architecture diagram: Baytech management unit and power strips supply AC power from the outlet; multi-meters measure AC power at the outlet and DC power from each node’s power supply; one multi-meter control thread per meter feeds a data log, data collection, data analysis, and a profiling-data repository. On the software side, the PowerPack profile/control libraries and per-node DVS control threads provide power/energy control for microbenchmarks and applications on a high-performance, power-aware cluster.]
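Each multi-meter control thread in the diagram simply polls one meter and logs timestamped samples that can be aligned later. The following is a minimal sketch of such a sampling thread, not PowerPack’s actual code; read_meter_watts() and the ~1 kHz sampling rate are assumptions for illustration.

/* Illustrative sketch: one polling thread per multi-meter, logging
 * timestamped power samples. read_meter_watts() is a hypothetical
 * meter-driver call, not part of PowerPack. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

extern double read_meter_watts(int meter_id);   /* assumed meter driver */
static volatile int sampling = 1;               /* cleared elsewhere to stop */

struct meter_args { int meter_id; const char *logfile; };

static void *meter_thread(void *arg)
{
    struct meter_args *m = arg;
    FILE *log = fopen(m->logfile, "w");
    if (!log)
        return NULL;
    while (sampling) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);     /* shared clock lets logs be merged */
        double watts = read_meter_watts(m->meter_id);
        fprintf(log, "%ld.%09ld %d %.2f\n",
                (long)ts.tv_sec, ts.tv_nsec, m->meter_id, watts);
        usleep(1000);                           /* ~1 kHz sampling, for illustration */
    }
    fclose(log);
    return NULL;
}

Merging the per-meter logs on the shared timestamps is what yields the synchronized component profiles shown on the following slides.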
After frying multiple components…
PowerPack Framework
(DC Power Profiling)
! Initialize the power meter and open a new log (root node only)
if (node .eq. root) then
  call pmeter_init(xmhost, xmport)
  call pmeter_log(pmlog, NEW_LOG)
endif
<CODE SEGMENT>
! Label and start a profiling session around the region of interest
if (node .eq. root) then
  call pmeter_start_session(pm_label)
endif
<CODE SEGMENT>
! Pause profiling, close the log, and shut down the meter
if (node .eq. root) then
  call pmeter_pause()
  call pmeter_log(pmlog, CLOSE_LOG)
  call pmeter_finalize()
endif
Multi-meters + 32-node Beowulf
Power Profiles – Single Node
Power consumption distribution, single node:

Component        Idle (39 W total)   Memory-bound 171.swim (59 W total)
CPU                    14%                        35%
Memory                 10%                        16%
Disk                   11%                         7%
NIC                     1%                         1%
Fans                   23%                        15%
Power supply           33%                        21%
Other chipset           8%                         5%

• Under load, the CPU is typically the largest single consumer of power
Power Profiles – Single Node
Note: only power consumed by the CPU, memory, disk, and NIC is considered here.
[Bar chart (0–40 W): per-component power for different workloads — idle, 171.swim (memory-bound), 164.gzip (CPU-bound), cp (disk-bound), and scp (network-bound).]
NAS PB FT – Performance Profiling
[Timeline: FT alternates compute phases with reduce and all-to-all communication phases.]
About 50% of the time is spent in communication.
Power Profile of FT Benchmark (class B, NP=4)
[Time-series plot (0–200 s): CPU, memory, disk, and NIC power (Watts), annotated with the initialization/startup phase and iterations 1–3.]
Power profiles reflect performance profiles.
One FFT Iteration
[Zoomed time-series plot (~110–150 s): CPU and memory power (Watts) over one iteration, annotated with the FT phases evolve, fft (cffts1, cffts2), and transpose_x_yz (transpose_local, mpi_all-to-all send-recv/wait, transpose_finish).]
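Given timestamped power samples like these, the energy of an application phase is simply the integral of power over that phase’s time window. Below is a minimal sketch of that calculation (trapezoidal rule over the samples), not PowerPack’s actual analysis code.

/* Sketch: approximate the energy (Joules) of one phase by integrating the
 * timestamped power samples that fall inside [t_start, t_end]. */
double phase_energy(const double *t, const double *watts, int n,
                    double t_start, double t_end)
{
    double joules = 0.0;
    for (int i = 1; i < n; i++)
        if (t[i - 1] >= t_start && t[i] <= t_end)
            joules += 0.5 * (watts[i - 1] + watts[i]) * (t[i] - t[i - 1]);
    return joules;
}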
2005 - Present
Intuition confirmed
2005 - Present
HPPAC Tool Progress
2005 - Present
• PowerPack
– Modularized PowerPack and SysteMISER
– Extended analytics for applicability
– Extended to support thermals
• SysteMISER
– Improved analytics to weigh tradeoffs at runtime
– Automated cluster-wide DVS scheduling
– Support for automated power-aware memory
Predicting CPU Power
2005 - Present
[Plot (0–100 s): estimated vs. measured CPU power (Watts).]
Predicting Memory Power 2005 - Present
[Plot (0–100 s): estimated vs. measured memory power (Watts).]
Correlating Thermals BT
2005 - Present
Correlating Thermals MG 2005 - Present
Tempest Results FT
2005 - Present
SysteMISER
2005 - Present
• Our software approach to reduce energy
– Management Infrastructure for Energy Reduction
• Power/performance
– measurement
– prediction
– control
[Image: The Heat Miser]
Power-aware DVS scheduling
strategies
2005 - Present
CPUSPEED Daemon
[example]$ start_cpuspeed
[example]$ mpirun -np 16 ft.B.16
Internal scheduling
MPI_Init();
<CODE SEGMENT>
setspeed(600);
<CODE SEGMENT>
setspeed(1400);
<CODE SEGMENT>
MPI_Finalize();
External Scheduling
[example]$ psetcpuspeed 600
[example]$ mpirun -np 16 ft.B.16
NEMO & PowerPack Framework for saving energy
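The setspeed() calls in the internal-scheduling example above are the framework’s own interface. As a rough illustration of what such a call can do on a Linux cluster node, the sketch below lowers a core to 600 MHz before a communication-heavy region and restores 1400 MHz afterwards by writing to the cpufreq “userspace” governor’s scaling_setspeed file; the paths, frequencies, and helper name are assumptions for illustration, not NEMO/PowerPack internals.

/* Sketch: per-core DVS via the Linux cpufreq sysfs interface. Assumes the
 * "userspace" governor is active and the process may write the file. */
#include <stdio.h>

static void set_cpu_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f)
        return;                 /* wrong governor or insufficient permission */
    fprintf(f, "%ld\n", khz);
    fclose(f);
}

/* Usage around a communication phase, mirroring the slide’s pseudocode:
 *   set_cpu_khz(0, 600000);    lower to 600 MHz before the all-to-all
 *   ... communication ...
 *   set_cpu_khz(0, 1400000);   restore 1.4 GHz for the next compute phase
 */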
CPU MISER Scheduling (FT)
2005 - Present
Normalized Energy and Delay with CPU MISER for FT.C.8
[Bar chart: normalized delay and normalized energy for FT.C.8 under the default “auto” governor, fixed frequencies from 600 to 1400 MHz, and CPU MISER.]
36% energy savings, less than 1% performance loss
See SC2004, SC2005 publications.
Where else can we save energy?
2005 - Present
• Processor – DVS
– Where everyone starts.
• NIC
– Very small portion of systems power
• Disk
– A good choice (our future work)
• Power-supply
– A very good choice (for an EE or ME)
• Memory
– Only 20-30% of system power, but…
The Power of Memory
2005 - Present
Effects of increased memory on system power (90 W CPU, 9 W per 4 GB DIMM)
[Line chart: percentage of system power consumed by memory vs. by CPUs as the amount of memory per processor grows from 0 to 256 GB.]
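To see why the memory share grows as in the chart above, here is the arithmetic under the slide’s own assumptions (90 W per CPU, 9 W per 4 GB DIMM), counting only CPU and memory power; the chart itself may include other components, so its exact percentages can differ.

/* Worked example: memory power share vs. capacity, assuming a 90 W CPU
 * and 9 W per 4 GB DIMM, counting only CPU + memory power. */
#include <stdio.h>

int main(void)
{
    const double cpu_w = 90.0, dimm_w = 9.0, dimm_gb = 4.0;
    for (int gb = 32; gb <= 256; gb *= 2) {
        double mem_w = (gb / dimm_gb) * dimm_w;
        double share = 100.0 * mem_w / (mem_w + cpu_w);
        printf("%3d GB/processor: memory %5.1f W -> %4.1f%% of CPU+memory power\n",
               gb, mem_w, share);
    }
    return 0;
}
/* Prints roughly: 32 GB -> 72 W -> 44.4%; 64 GB -> 61.5%; 128 GB -> 76.2%;
 * 256 GB -> 576 W -> 86.5%. */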
Memory Management Policies
2005 - Present
Default
Static
Dynamic
Memory MISER =
Page Allocation Shaping + Allocation Prediction + Dynamic Control
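The dynamic-control piece needs some mechanism for taking memory devices on and off line at runtime. One such mechanism on Linux is the memory-hotplug sysfs interface sketched below; this is an illustrative assumption, not necessarily the exact mechanism Memory MISER uses.

/* Sketch: change the state of one Linux memory block ("online"/"offline")
 * via the memory-hotplug sysfs interface (requires kernel support and
 * sufficient privileges). */
#include <stdio.h>

static int set_memory_block_state(int block, const char *state)
{
    char path[96];
    snprintf(path, sizeof(path),
             "/sys/devices/system/memory/memory%d/state", block);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;              /* block absent, busy, or no permission */
    fprintf(f, "%s\n", state);
    fclose(f);
    return 0;
}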
Memory MISER Evaluation
of Prediction and Control
2005 - Present
[Time-series plot (0–35,000 s): memory online vs. memory demand (0–8 GB).]
Prediction/control looks good, but are we guaranteeing performance?
Memory MISER Evaluation
of Prediction and Control
2005 - Present
[Zoomed time-series plot (22,850–22,950 s): memory online vs. memory demand (0–8 GB).]
Stable, accurate prediction using PID controller.
But, what about big (capacity) spikes?
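The slide attributes the stable tracking to a PID controller. Below is a minimal sketch of that idea, with the controller deciding how much capacity (in GB) to add or remove based on measured demand; the gains and structure are illustrative placeholders, not Memory MISER’s actual tuning, and in practice the output would be quantized to whole devices and padded with headroom to absorb capacity spikes.

/* Illustrative PID loop: track measured memory demand and return an
 * adjustment (GB) to the online capacity. Gains are placeholders. */
struct pid { double kp, ki, kd, integral, prev_err; };

static double pid_step(struct pid *c, double demand_gb, double online_gb, double dt)
{
    double err = demand_gb - online_gb;        /* positive => under-provisioned */
    c->integral += err * dt;
    double deriv = (err - c->prev_err) / dt;
    c->prev_err = err;
    return c->kp * err + c->ki * c->integral + c->kd * deriv;
}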
Memory MISER Evaluation
of Prediction and Control
2005 - Present
[Time-series plot (16,940–17,060 s): memory online vs. memory used (0–8 GB) during a sudden capacity spike.]
Memory MISER guarantees performance in “worst” conditions.
Memory MISER Evaluation
Energy Reduction
2005 - Present
[Time-series plot for the FLASH benchmark: memory demand (GB, left axis) and devices online (0–48, right axis) over time t0–t4. Annotations: pinned OS pages decrease efficiency; demand shows high-frequency cyclic allocation/deallocation and tiered increases in allocations; PID control remains stable throughout.]
30% total system energy savings,
less than 1% performance loss
Present - 2012
SystemG Supercomputer @ VT
SystemG Stats
• 325 Mac Pro compute nodes, each with two 4-core 2.8 GHz Intel Xeon processors.
• Each node has eight gigabytes (GB) of RAM; each core has 6 MB of cache.
• Mellanox 40 Gb/s end-to-end InfiniBand adapters and switches.
• LINPACK result: 22.8 TFLOPS (trillion floating-point operations per second).
• Over 10,000 power and thermal sensors.
• Variable power modes: DVFS control (2.4 and 2.8 GHz), fan-speed control, concurrency throttling, etc.
  (Check /sys/devices/system/cpu/cpuX/cpufreq/scaling_available_frequencies.)
• Intelligent power distribution unit: Raritan Dominion PX
  (remotely controls servers and network devices; also monitors current, voltage, power,
  and temperature through Raritan’s KVM switches and secure console servers).
Deployment Details
* 13 racks total (24 U each), 24 nodes per rack, 8 nodes per layer.
* 5 PDUs per rack (Raritan PDU model DPCS12-20). Each PDU in SystemG has a unique IP address;
  users can use IPMI to access and retrieve information from the PDUs and to control them,
  e.g., remotely shutting down and restarting machines or recording system AC power.
* There are two types of switch:
  1) Ethernet switch: 1 Gb/s Ethernet; 36 nodes share one Ethernet switch.
  2) InfiniBand switch: 40 Gb/s InfiniBand; 24 nodes (one rack) share one IB switch.
Data collection system and LabVIEW
Sample diagram and corresponding front panel from LabVIEW:
[Figure: a power profile for the HPCC benchmark suite]
Published Papers And Useful Links
Papers:
1. Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li, Kirk W. Cameron,
   “PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications,”
   IEEE Transactions on Parallel and Distributed Systems, Apr. 2009.
2. Shuaiwen Song, Rong Ge, Xizhou Feng, Kirk W. Cameron, “Energy Profiling and Analysis of
   the HPC Challenge Benchmarks,” International Journal of High Performance Computing
   Applications, Vol. 23, No. 3, pp. 265–276, 2009.
NI system set details:
http://sine.ni.com/nips/cds/view/p/lang/en/nid/202545
http://sine.ni.com/nips/cds/view/p/lang/en/nid/202571
The future…
Present - 2012
• PowerPack
– Streaming sensor data from any source
• PAPI Integration
– Correlated to various systems and applications
• Prophesy Integration
– Analytics to provide unified interface
• SysteMISER
– Study effects of power-aware disks and NICs
– Study effects of emergent architectures (CMT, SMT, etc)
– Coschedule power modes for energy savings
Outreach
• See http://green500.org
• See http://thegreengrid.org
• See http://www.spec.org/specpower/
• See http://hppac.cs.vt.edu
Acknowledgements
• My SCAPE Team
– Dr. Xizhou Feng (PhD 2006)
– Dr. Rong Ge (PhD 2008)
– Dr. Matt Tolentino (PhD 2009)
– Mr. Dong Li (PhD student, expected 2010)
– Mr. Song Shuaiwen (PhD student, expected 2010)
– Mr. Chun-Yi Su, Mr. Hung-Ching Chang
• Funding Sources
– National Science Foundation (CISE: CCF, CNS)
– Department of Energy (SC)
– Intel
Thank you very much.
http://scape.cs.vt.edu
cameron@cs.vt.edu
Thanks to our sponsors: NSF (Career, CCF, CNS), DOE (SC), Intel