The Yin and Yang of Hardware Heterogeneity: Can Software Survive? Kathryn S McKinley

1
2
Computation
Turing 1936
3
The Transistor
Shockley, Bardeen, Brattain 1947
4
Virtuous Cycle
[Diagram labels: Doubling of Transistors (faster, smaller, cheaper, …); Device Innovation; Software Innovation; Hardware Complexity; Software Complexity; Sequential Interface]
5
Hardware
6
Dennard Scaling is over
Power = Clock Speed × Voltage²
[Figure labels: Performance, Power]
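To illustrate the slide's simplified relation, the sketch below computes relative power when clock speed and voltage are scaled. The function name and the scale factors are hypothetical, chosen only for illustration; the relation itself is the one stated on the slide.

```python
def relative_power(clock_scale, voltage_scale):
    """Relative power under the slide's relation: power = clock speed x voltage^2."""
    return clock_scale * voltage_scale ** 2


# Hypothetical scale factors, for illustration only.
same_clock_lower_voltage = relative_power(1.0, 0.7)   # voltage drops to 70%
faster_clock_same_voltage = relative_power(1.5, 1.0)  # clock rises by 50%

print(f"lower voltage, same clock: {same_clock_lower_voltage:.2f}x power")
print(f"faster clock, same voltage: {faster_clock_same_voltage:.2f}x power")
```

Under this relation, lowering voltage pays off quadratically, so once voltage can no longer drop from one generation to the next, higher clock speeds translate directly into higher power.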
7
Dark silicon
Powered fraction of a chip @ 40 mm², 30 W
[Bar chart: technology nodes 90nm, 45nm, 32nm; data labels 50%, 18%, 9%; y-axis 0% to 60%]
[Goulding et al. Hot Chips 2010]
Electricity
Electricity costs in U.S. Data Centers
2006 $4.5 billion
2011 $7.4 billion
[U.S. EPA 2007]
Battery life
Multicore Hardware
Pentium4 (130): 2003, 130nm, 55M tran., 131mm², 1 core, 2-way SMT, Northwood
C2D (65): 2006, 65nm, 291M tran., 143mm², 2 cores, no SMT, Conroe
C2Q (65): 65nm, Kentsfield
i7 (45): 2008, 45nm, 731M tran., 263mm², 4 cores, 2-way SMT, Bloomfield
Atom (45): 2008, 45nm, 47M tran., 36mm², 1 core, 2-way SMT, Diamondville
C2D (45): 2009, 45nm, 228M tran., 82mm², 2 cores, no SMT, Wolfdale
AtomD (45): 2009, 45nm, 176M tran., 87mm², 2 cores + GPU, 2-way SMT, Pineview
i5 (32): 2010, 32nm, 382M tran., 81mm², 2 cores, 2-way SMT, Clarkdale
Each die shown at correct scale
9
Virtuous Cycle
[Diagram labels: Doubling of Transistors (faster, smaller, cheaper, …); End of Dennard Scaling; Device Innovation; Software Innovation; Hardware Complexity; Software Complexity; Sequential Interface; Parallel Interface]
10
Software
11
PC Software era ending
12
Software =
Mobile + Web + Cloud
13
Software People
App
Writers
Computer Scientists
14
Languages People Use
PHP
15
Software
Performance: fast enough
Productivity: managing complexity
Abstractions
16
Hardware & Software
pulling in opposite directions
17
Hardware & Software
Marriage
18
How’s this marriage
working out?
19
What should we
measure?
20
energy = ∫ power dt
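As a minimal sketch of how this integral is typically approximated from sampled measurements, the code below applies the trapezoidal rule to a series of power readings. The function name and the sample values are hypothetical and are not taken from the measurements in the talk.

```python
def energy_joules(timestamps_s, power_w):
    """Approximate energy = integral of power dt using the trapezoidal rule.

    timestamps_s: sample times in seconds, in increasing order.
    power_w:      power readings in watts, one per timestamp.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("need one power reading per timestamp")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical power trace, for illustration only.
times = [0.0, 1.0, 2.0, 3.0]
watts = [10.0, 20.0, 20.0, 10.0]
print(energy_joules(times, watts), "J")
```

The same routine shows why power alone is not enough to compare machines: a higher-power run that finishes sooner can still consume less energy.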
21
Multicore Hardware
22
Workloads
61 benchmarks: native and managed, scalable and non-scalable
Native Non-scalable: 27 C, C++, Fortran (SPEC CPU2006)
Java Non-scalable: 18 Java (SPEC jvm98, DaCapo, pjbb2005)
Native Scalable: 11 C, C++ (PARSEC)
Java Scalable: 5 Java (DaCapo)
[Chart: each of the four groups is labeled 0.25]
23
Power vs Performance
Power is benchmark dependent
[Plot: Power (W) vs. Performance / Reference Performance for Pentium 4 (130), Core 2 Duo (65), Core 2 Duo (45), i7 (45), and i5 (32), with year labels from 2003 to 2010]
24
Parallelism did not solve
the power problem
25
Web Page Energy on
Windows Phone
[Chart: Load energy, Energy (mWh), and Load data, Data downloaded (bytes), for amazon, apple, baidu, bing, ebay, google, msn, paypal, wikipedia, youtube]
What’s next in
hardware?
27
Pareto Analysis (45nm)
Workload determines energy efficient architecture
[Plot: Normalized Group Energy vs. Group Performance / Group Reference Performance; series: Average, Native Non-scale, Java Non-scale, Java Scale, Native Scale]
28
Workload determines energy efficient architecture
[Table: checkmarks (✔) per workload group (Native scalable, Java non-scalable, Java scalable, Native non-scalable) for each configuration below; the individual checkmark positions are not recoverable from the text]
I7(45)4C2T@2.7GHz
I7(45)4C2T@2.7GHz no TB
I7(45)4C2T@2.1GHz
I7(45)4C2T@1.6GHz
I7(45)4C1T@2.7GHz
I7(45)4C1T@2.7GHz no TB
I7(45)2C2T@1.6GHz
I7(45)2C1T@1.6GHz
I7(45)1C2T@2.4GHz
I7(45)1C2T@1.6GHz
I7(45)1C1T@2.7GHz
I7(45)1C1T@2.7GHz no TB
Core2D(45)2C1T@3.1GHz
Core2D(45)2C1T@1.6GHz
Atom(45)1C2T@1.7GHz
29
Parallelism & Heterogeneity
[Diagram: chip layout with big, small, and custom cores]
30
Motivation
Single ISA Heterogeneity
NVIDIA Tegra3
5 Cortex A9 (4 x 1.4 GHz, 1 x 500 MHz)
Texas Instruments OMAP5432
2 Cortex A15 + 2 Cortex M4
31
Heterogeneous Hardware
Energy
Efficiency
Complexity
32
Heterogeneous parallel
hardware + software
July 31, 1922. Train wreck at Laurel, Maryland [Washington Post, August 1, 1922]
33
Exploiting Heterogeneity
Parallelism
Ubiquity
Differentiation
34
Case Studies
Mobile UI [UIST’13]
Managed runtime [ISCA’12]
Always awake [NSDI’09]
Interactive cloud [ICAC’13]
35
36
User Interface I/O on
OMAP big/little
Kihm, Guimbretière UIST’13
37
User Interface I/O
& Heterogeneity
Characteristics
Parallelism
Ubiquity
Differentiated
UI I/O



38
A9+M3 Heterogeneity
Battery life Increase
[Bar chart: battery life increase (0 to 3) for Scrolling, Pen Inking, Virtual Keyboard, and Keyboard; configurations A9, A9 + M3 Dispatch, A9 + display control, A9 + M3 execute]
39
40
VM Services on little cores
Cao et al., ISCA’12
41
VM Services & Heterogeneity
Characteristics
Parallelism
Ubiquity
Differentiated
VM Services


?
42
[Bar chart: Performance, Power, Energy, and PPE for GC, JIT, and GC & JIT, normalized to 1 core at 2.8 GHz (y-axis 0.92 to 1.20); measured results are filled, model results are empty; configurations: 2.8 GHz AMD + 2.8 GHz AMD and 2.8 GHz AMD + 1.66 GHz Atom]
43
44
Somniloquy
Application stubs on little cores wake up the big core as needed
Agarwal et al., NSDI’09
Applications: filtering, notifications, downloads, keep alive
[Architecture diagram labels: Laptop; Host big core (RAM, peripherals, …) with Apps, Somniloquy daemon, OS, Network stack; Little core (CPU + DRAM + Flash) with Wakeup filters, App stubs, Embedded OS, Network stack; Network interface]
Big: Lenovo X60 Laptop + Little: gumstix
[Bar chart: Power Consumption in Watts (0 to 16) for Little sleep, Somniloquy, Big @ low power, Big Core]
46
47
48
Interactive Cloud Services
Bing, Finance, Recommendations, Games
Ren et al. ICAC’13
Characteristics
Parallelism
Ubiquity
Differentiated
Interactive Services


?
49
Interactive Services
Workload Characterization of Bing Search
1. Responsiveness deadline ~100ms = interactive
2. Partial execution trades quality for responsiveness
Completion vs Quality
[Plot: Quality (0.00 to 1.00) vs. Completion Ratio (0.0 to 1.0)]
Data Centers are
Power Limited
~88 Watts/core
Homogeneous
5 SMT Nehalem cores (i5 670 32nm)
22 SMT AtomD cores (Bonnell 32nm)
51
Homogeneous Cores
Throughput or quality, but not both!
[Plot: Quality (0.95 to 1.00) vs. Queries per second (30 to 300) for 5 Nehalems and 22 Atoms, with the Quality Goal marked]
52
Interactive Services
Workload Characterization of Bing Search
1. Responsiveness deadline ~100ms = interactive
2. Partial execution trades quality for responsiveness
3. Unknown, highly variable service demand
[Histogram: Demand Distribution (ms), frequency axis 0.00 to 0.35]
[Plot: Completion vs Quality; Quality (0.00 to 1.00) vs. Completion Ratio (0.0 to 1.0)]
Heterogeneous: 3 SMT Nehalem + 9 SMT Atom cores
Differentiated
Slow to Fast scheduling
Long jobs: no difference between slow-to-fast & fast-to-slow
Short jobs: slow to fast consumes less energy
Unknown service demand: long jobs migrate to fast cores
[Diagram: two timelines, each showing a short job and a long job over time]
54
FOF: Fast Old & First Scheduler
1. Schedule fastest core first
2. Promote older jobs to faster cores
[Diagram: Small, Medium, and Big cores]
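The two FOF rules can be sketched as a single assignment step that is re-run whenever a job arrives or finishes. This is a simplified illustration, not the scheduler from Ren et al.; the function name, the dictionary-based job representation, and the core speeds are all hypothetical.

```python
def fof_assign(job_arrivals, core_speeds):
    """Return {job_id: core_index}: the oldest job gets the fastest core, and so on.

    job_arrivals: dict of job id -> arrival time (smaller means older).
    core_speeds:  list of relative core speeds, indexed by core.
    Jobs beyond the number of cores are left unassigned (they wait).
    """
    # Rule 1: fastest cores are handed out first.
    cores_fast_first = sorted(range(len(core_speeds)),
                              key=lambda c: core_speeds[c], reverse=True)
    # Rule 2: older jobs are paired with faster cores.
    jobs_old_first = sorted(job_arrivals, key=lambda j: job_arrivals[j])
    return dict(zip(jobs_old_first, cores_fast_first))


# Hypothetical machine: two small cores (index 0, 1) and one big core (index 2).
core_speeds = [1.0, 1.0, 3.0]
print(fof_assign({"a": 0.0}, core_speeds))            # a lone job uses the big core
print(fof_assign({"a": 0.0, "b": 5.0}, core_speeds))  # the younger job gets a small core
print(fof_assign({"b": 5.0}, core_speeds))            # after "a" finishes, "b" is promoted
```

Re-running the assignment on every arrival or completion is what produces promotion in this sketch: a job that started on a small core moves to a faster one once it becomes the oldest job in the system.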
55
Data Centers are
Power Limited
~88 Watts/core
Homogeneous
5 SMT Nehalem cores (i5 670 32nm)
22 SMT AtomD cores (Bonnell 32nm)
Heterogeneous
3 SMT Nehalem + 9 SMT Atom cores
56
FOF Heterogeneous Cores
throughput and quality!
[Plot: Quality (0.95 to 1.00) vs. Queries per second (30 to 300) for 22 Atoms, 5 Nehalems, and Heterogeneous (3 N + 9 Atoms), with the Quality Goal marked]
50% throughput increase
or buy 33% fewer servers
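The equivalence between a 50% throughput increase and 33% fewer servers follows from simple arithmetic: if each server handles 1.5 times the load, the same total load needs 1/1.5 of the servers. The short sketch below works this out; the function name is hypothetical.

```python
def server_reduction(throughput_gain):
    """Fraction of servers saved when each server's throughput rises by throughput_gain.

    With a gain g, one server does the work of (1 + g) old servers,
    so the same load needs 1 / (1 + g) of the original fleet.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


saved = server_reduction(0.5)  # the 50% throughput increase from the slide
print(f"{saved:.0%} fewer servers")
```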
57
Simultaneous Multithreading =
Dynamic Heterogeneity
SMT off: 4 fast
1 SMT core: 3 fast + 2 slow
2 SMT cores: 2 fast + 4 slow
All SMT: 8 slow
58
FOF Scheduler for SMT
1. Schedule fastest core first
Fastest = an unshared core; if none is free, share a core with the youngest job
2. Promote older jobs to faster cores
When a core becomes free, find a core shared by the oldest job and another job, and move the other job to the free core
59
Slow to Fast on SMT
Monte Carlo Finance Server
6 Core 2-way SMT 3.33 GHz Intel Xeon
[Plot: Quality (0.970 to 1.000) vs. Queries per second (4 to 55) for SMT off, SMT + Share Old, and SMT + FOF]
16% Improvement
60
Slow to Fast for Energy
start on slow
Monte Carlo Finance Server
6 Core 2-way SMT 3.33 GHz Intel Xeon
[Plot: Energy (Joules), 1.40 to 2.40, vs. Average Latency (seconds), 0.20 to 0.30, for Homogeneous, Deadline Only, and Slow to Fast]
61
Key Analytical Results
Slow to Fast Theorem
An optimal solution migrates jobs from
slower to faster cores to minimize energy
Heterogeneity Theorem
More heterogeneity (ratio of fastest to
slowest core) is desirable for higher p in Lp
norm (the tail!)
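The remark about the tail can be made concrete: as p grows, the Lp norm of a set of job latencies is increasingly dominated by the largest latency. The sketch below is a plain illustration of that property of the norm, not of the theorem's proof; the function name and the latency values are hypothetical.

```python
def lp_norm(values, p):
    """Lp norm of a sequence: (sum of |v|^p) raised to the power 1/p."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


# Hypothetical latencies in ms: three fast jobs and one slow (tail) job.
latencies = [10.0, 10.0, 10.0, 100.0]

for p in (1, 2, 10, 50):
    # As p increases, the value approaches the maximum latency.
    print(p, round(lp_norm(latencies, p), 3))
```

With p = 1 every job counts equally toward the total; with large p the norm is essentially the worst-case latency, which is why optimizing for a higher p amounts to optimizing the tail.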
62
63
64
Software Challenges in a
Power Constrained World
Optimizing performance, power, & energy
Software/Hardware co-design
Software portability
Programming models & abstractions
OOPSLA’13
65
New Virtuous Cycle?
[Diagram labels: Doubling of Transistors (faster, smaller, cheaper, …); Specialization; Device Innovation; Software Innovation; Hardware Complexity; Software Complexity; Sequential Interface; New Interfaces & Abstractions]
66
Collaborators on this work
Steve Blackburn, Ting Cao, Hadi
Esmaeilzadeh, Ivan Jibaja
Yuxiong He, Shaolei Ren, Sameh
Elnikety, Xi Yang, Yong Hun Eom
Todd Mytkowicz, James Bornholt,
Aman Kansal
67
The Future?
Thank you
68
Bibliography
• Exploiting Processor Heterogeneity in Interactive Systems, S. Ren, Y. He, S. Elnikety, and K. S. McKinley, The USENIX International Conference on Autonomic Computing (ICAC), San Jose, CA, June 2013.
• Asymmetric Cores for Low Power User Interface Systems, J. Kihm and F. Guimbretière, ACM Symposium on User Interface Software and Technology (UIST), October 2013. Poster.
• The Yin and Yang of Hardware Heterogeneity: Can Software Survive? K. S. McKinley, ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Indianapolis, IN, October 2013.
• The New Global Ecosystem in Advanced Computing: Implications for U.S. Competitiveness and National Security, Committee on Global Approaches to Advanced Computing (member), Board on Global Science and Technology Policy and Global Affairs Division, National Research Council of the National Academies, The National Academies Press, Washington, D.C., 2012.
• The Model Is Not Enough: Understanding Energy Consumption in Mobile Devices, J. Bornholt, T. Mytkowicz, and K. S. McKinley, Hot Chips, San Jose, CA, August 2012.
• Looking Back and Looking Forward: Power, Performance, and Upheaval, H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, Communications of the ACM (CACM), Research Highlights, 55(7), July 2012.
• The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software, T. Cao, T. Gao, S. M. Blackburn, and K. S. McKinley, ACM/IEEE International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012.
• Somniloquy: Augmenting Network Interfaces to Reduce PC Energy Usage, Y. Agarwal, S. Hodges, R. Chandra, J. Scott, P. Bahl, and R. Gupta, USENIX Networked Systems Design & Implementation (NSDI), 22 April 2009.
69
A Byte of My Story
Mentors
Family
ACM Fellow
Congressional Testimony
70
I fail a lot
Rejected job applications
1984 (all), 1993 (8 of 11), 2011 (4 of 8)
Failed PhD qualifying exam
Rejected first three grant applications
My most cited paper was rejected 3 times
Rejected papers, grants, papers, …
I learn & persist
71