Energy and Power
Lecture notes S. Yalamanchili and S. Mukhopadhyay
Some Useful Reading
• http://en.wikipedia.org/wiki/CPU_power_dissipation
• http://en.wikipedia.org/wiki/CMOS#Power:_switching_and_leakage
• http://www.xbitlabs.com/articles/cpu/display/core-i5-2500t-2390t-i3-2100t-pentiumg620t.html
• http://www.cpu-world.com/info/charts.html
(2)
Historical Scaling
(3)
Technology Scaling
[Figure: MOSFET cross-sections labeled GATE, SOURCE, DRAIN, BODY, oxide thickness tox, and channel length L]
• 30% scaling down in dimensions → doubles transistor density
• Power per transistor: $P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
  - Vdd scaling → lower power
• Transistor delay = Cgate Vdd/ISAT
  - Cgate, Vdd scaling → lower delay
(4)
Fundamental Trends
High Volume Manufacturing   2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)        90     65     45     32     22     16     11     8
Integration Capacity (BT)   2      4      8      16     32     64     128    256
Delay = CV/I scaling        0.7    ~0.7   >0.7   Delay scaling will slow down
Energy/Logic Op scaling     >0.35  >0.5   >0.5   Energy scaling will slow down
Bulk Planar CMOS            High Probability → Low Probability
Alternate, 3G etc           Low Probability → High Probability
Variability                 Medium → High → Very High
ILD (K)                     ~3     <3     Reduce slowly towards 2-2.5
RC Delay                    1      1      1      1      1      1      1      1
Metal Layers                6-7    7-8    8-9    0.5 to 1 layer per generation
Source: Shekhar Borkar, Intel Corp.
(5)
ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge et al., 2008
(6)
Where Does the Power Go in CMOS?
• Dynamic Power Consumption
  - Charging and discharging capacitance
• Short Circuit Power
  - Short circuit path between supply rails during switching
  - Nominally 10%-20% of dynamic power and can be ignored for a first order analysis
• Leakage
  - Leaky transistors
(7)
Dynamic Power
• Dynamic power is used in charging and
discharging the capacitances in the CMOS circuit.
[Figure: CMOS inverter with supply VDD, supply current iDD, and load capacitance CL; voltage-versus-time plot of the inverter input with period T, showing the output capacitor charging and discharging]
PDYNAMIC = CL x VDD x VDD x Frequency
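The formula above can be evaluated directly. The following is a minimal sketch; the capacitance, voltage, and frequency values are illustrative assumptions, not figures from the lecture.

```python
def dynamic_power(c_load, vdd, freq):
    """Dynamic power in watts: P = C_L * V_DD * V_DD * f."""
    return c_load * vdd * vdd * freq

# Hypothetical values: 1 nF of total switched capacitance, 1.0 V, 2 GHz.
p_nominal = dynamic_power(1e-9, 1.0, 2e9)
# Same circuit at 0.8 V: power falls with the square of the supply voltage.
p_scaled = dynamic_power(1e-9, 0.8, 2e9)

print(f"nominal: {p_nominal:.2f} W, at 0.8 V: {p_scaled:.2f} W")
```

Because VDD appears twice in the product, a 20% drop in voltage cuts dynamic power by 36% in this sketch.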
(8)
Static Power
• Technology scaling has caused transistors to
become smaller and smaller. As a result, static
power has become a substantial portion of the
total power.
[Figure: CMOS inverter with Input = 0 and Output = VDD, showing gate leakage, junction leakage, and sub-threshold leakage paths]
PSTATIC = VDD x ISTATIC
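As a rough illustration of how static power enters the total, the sketch below applies the formula above and reports the static share. The current and power values are hypothetical.

```python
def static_power(vdd, i_static):
    """Static power in watts: P = V_DD * I_STATIC."""
    return vdd * i_static

def static_fraction(p_static, p_dynamic):
    """Share of total power that is static."""
    return p_static / (p_static + p_dynamic)

# Hypothetical values: 1.0 V supply, 0.5 A total leakage, 2 W dynamic power.
p_static = static_power(1.0, 0.5)
share = static_fraction(p_static, 2.0)
print(f"static: {p_static:.2f} W ({share:.0%} of total)")
```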
(9)
Energy-Delay Interaction
[Figure: delay, energy, and energy-delay product (EDP) plotted against VDD]
• Delay decreases with supply voltage but
energy/power increases
(10)
Static Energy-Delay Interaction
[Figure: leakage and delay plotted against threshold voltage Vth; MOSFET cross-section labeled GATE, SOURCE, DRAIN, tox, and L]
• Static energy increases exponentially with
decrease in threshold voltage
• Delay increases with threshold voltage
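To get a feel for the exponential dependence, the sketch below assumes the common textbook subthreshold model, in which leakage current is proportional to exp(-Vth / (n * vT)). The slides do not state this model; the slope factor n and the thermal voltage are assumed values.

```python
import math

THERMAL_VOLTAGE = 0.026  # volts, approximate kT/q near room temperature (assumed)

def leakage_ratio(delta_vth, n=1.5):
    """Factor by which subthreshold leakage grows when Vth drops by delta_vth volts.

    Assumes I_leak is proportional to exp(-Vth / (n * vT)); n is an assumed
    subthreshold slope factor.
    """
    return math.exp(delta_vth / (n * THERMAL_VOLTAGE))

for mv in (50, 100, 200):
    print(f"Vth lowered by {mv} mV -> leakage x{leakage_ratio(mv / 1000):.1f}")
```

Under these assumptions each additional 100 mV reduction multiplies leakage by the same factor, which is why lowering Vth to recover delay quickly becomes expensive in static energy.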
(11)
Power Vs. Energy
[Figure: two plots of power (watts) versus time with power levels P0, P1, and P2; same energy = area under the curve]
• Power is the rate of expenditure of energy
  - One joule/sec = one watt
• Both profiles use the same amount of energy, but at different rates (i.e., at different power)
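A small sketch of the area-under-the-curve idea, using two hypothetical piecewise-constant power profiles (the wattages and durations are made up for illustration):

```python
def energy_joules(profile):
    """Energy of a piecewise-constant power profile.

    profile is a list of (power_watts, duration_seconds) segments.
    """
    return sum(power * duration for power, duration in profile)

def peak_power(profile):
    """Highest power level reached in the profile."""
    return max(power for power, _ in profile)

# Two hypothetical profiles: a short burst at high power vs. a long run at low power.
burst = [(20.0, 5.0)]    # 20 W for 5 s
steady = [(10.0, 10.0)]  # 10 W for 10 s

print(energy_joules(burst), energy_joules(steady))
print(peak_power(burst), peak_power(steady))
```

Both profiles consume 100 J, but the burst profile has twice the peak power, which matters for thermal limits rather than battery life.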
(12)
Optimizing Power vs. Energy
Maximize battery life → minimize energy
Thermal envelopes → minimize peak power
(13)
The Problem
• Historically performance scaling was
accompanied by power scaling
• This is no longer true → power densities are increasing
(14)
The End of Dennard Scaling
[Figure: MOSFET cross-section labeled GATE, SOURCE, DRAIN, tox, and L]
• Voltage is no longer scaling at the same rate
• Slower scaling in power per transistor → increasing power densities
From R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid
State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
(15)
Chip Power Densities
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge et al., 2008
(16)
What is the Problem?
Mukhopadhyay and Yalamanchili (2009)
Based on scaling using Pentium-class cores
  - While Moore’s Law continues, scaling phenomena have changed
  - Power densities are increasing with each generation
(17)
The Power Wall
$P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
• Power per transistor scales with frequency but also scales with Vdd
  - Lower Vdd can be compensated for with increased pipelining to keep throughput constant
  - Power per transistor is not the same as power per area → power density is the problem!
  - Multiple units can be run at lower frequencies to keep throughput constant, while saving power
(18)
The Advent of Dark Silicon?
• Cannot afford to turn on all devices at once
• How do we manage the power and thermals?
[Figure: 64-core asymmetric chip multiprocessor layout (in-order and out-of-order cores) and failure probability distribution]
(19)
What are my Options?
1. Better technology
  - Manufacturing
  - New devices → non-CMOS?
2. Be more efficient – activity management
  - Clock gating
  - Power gating
  - Power management
3. Improved architecture
  - Simpler pipelines
4. Parallelism
(20)
Activity Management
Clock Gating
[Figure labels: input, Combinational Logic, clk, cond]
• Turn off clock to a block of logic
• Eliminate unnecessary transitions/activity
• Clock distribution power
Power Gating
[Figure labels: Vdd, power gate transistor, Core 0, Core 1, clk]
• Turn off power to a block of logic, e.g., core
• No leakage
(21)
Power Management
• Software controlled power management
  - Optimize power and/or energy
  - Orchestrated by the operating system or application libraries
  - Industry standard interfaces for power management
    o Advanced Configuration and Power Interface (ACPI)
      - https://www.acpica.org/
      - http://www.acpi.info/
• Hardware power management
  - Optimized power/energy
  - Failsafe operation, e.g., protect against thermal emergencies
(22)
Processor Power States
• Performance States – P-states
  - Operate at different voltage/frequencies
    o Recall delay-voltage relationship
  - Lower voltage → lower leakage
  - Lower frequency → lower power (not the same as energy!)
  - Lower frequency → longer execution time
• Idle States - C-states
  - Sleep states
  - Differ in how much state is saved
• SW or HW managed transitions between states!
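The distinction between power and energy under frequency and voltage scaling can be sketched with a dynamic-only model. This is a simplified illustration: static power is ignored (it would add energy as execution time grows), and the cycle count, capacitance, and operating points are hypothetical.

```python
def run_task(cycles, c_eff, vdd, freq):
    """Return (time_s, power_w, energy_j) for a task under a dynamic-only model."""
    time_s = cycles / freq
    power_w = c_eff * vdd ** 2 * freq
    return time_s, power_w, power_w * time_s

CYCLES = 2e9   # hypothetical task length in clock cycles
C_EFF = 1e-9   # hypothetical effective switched capacitance in farads

# Three hypothetical operating points: (label, Vdd in volts, frequency in Hz).
operating_points = [("high P-state", 1.0, 2e9),
                    ("half frequency", 1.0, 1e9),
                    ("half frequency, lower Vdd", 0.8, 1e9)]

for label, vdd, freq in operating_points:
    t, p, e = run_task(CYCLES, C_EFF, vdd, freq)
    print(f"{label:28s} time={t:.1f}s power={p:.2f}W energy={e:.2f}J")
```

In this model, halving only the frequency halves power but doubles execution time, so energy is unchanged; the energy saving comes from the lower voltage that the lower frequency permits.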
(23)
Multiple Voltage Frequency Domains
[Figure: Intel Sandy Bridge processor]
• Cores and ring in one DVFS domain
• Graphics unit in another DVFS domain
• Cores and portion of cache can be gated off
From E. Rotem et al., HotChips 2011
(24)
Power States
From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html
(25)
Power Gating
• Turn off components that are not being used
  - Lose all state information
• Costs of powering down
• Costs of powering up
• Smart shutdown
  - Models to guide decisions
[Figure: Intel Sandy Bridge processor]
(26)
Simplify Core Design
[Figure: AMD Bulldozer core]
• Support for out-of-order execution, schedulers, branch prediction, etc. consumes more energy per instruction
[Figure: ARM A7 core (arm.com)]
• Can fit many more simpler cores on a die
(27)
Parallelism and Power
[Figure: IBM Power5. Source: IBM]
[Figure: AMD Trinity. Source: forwardthinking.pcmag.com]
• How much of the chip area is devoted to compute?
• Run many cores slower. Why does this reduce power?
(28)
Parallelism
• Concurrency + lower frequency → greater energy efficiency
Example
[Figure: one core with its cache compared against four cores, each with its own cache]
• 4X #cores
• 0.75x voltage
• 0.5x frequency
• 1X power
• 2X in performance
$P = \alpha C V_{dd}^2 f + V_{dd} I_{st} + V_{dd} I_{leak}$
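The numbers in this example can be checked with the dynamic term of the power equation. The sketch below assumes per-core capacitance is unchanged, ignores the short-circuit and leakage terms, and assumes the workload parallelizes perfectly; the function names are illustrative.

```python
def relative_dynamic_power(cores, vdd_scale, freq_scale):
    """Dynamic power relative to one baseline core, using P ~ N * Vdd^2 * f."""
    return cores * vdd_scale ** 2 * freq_scale

def relative_throughput(cores, freq_scale):
    """Throughput relative to baseline, assuming perfectly parallel work."""
    return cores * freq_scale

power = relative_dynamic_power(cores=4, vdd_scale=0.75, freq_scale=0.5)
perf = relative_throughput(cores=4, freq_scale=0.5)
print(f"power = {power:.3f}x, performance = {perf:.1f}x")
```

The result is about 1.1x the baseline dynamic power (roughly the 1X quoted) for 2x the performance.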
(29)
Microarchitectural Level Models
• How can we study power consumption without building circuits?
  - Models
• Models are available at multiple levels of abstraction.
  - We are interested in microarchitectural models
(30)
Processor Microarchitecture
[Figure: processor block diagram with stages Fetch, Decode, Execute/Writeback, Memory, and Network. Components: Instruction Cache, Instruction TLB, Branch Prediction, Fetch Queue, Instruction Decoder, Instruction Queue, Register Files, ALU, MUL, FPU, LD, ST, Data TLB, L1 Data Cache, L2 Data Cache, NoC Router, On-Chip Network]
(31)
Energy/Power Calculation
• How do we calculate energy or power dissipation
for a given microarchitecture?
• Energy/Power varies between:
  - Different ISAs; ARM vs Intel x86
  - Different microarchitectures; in-order vs out-of-order
  - Different applications; memory-bound vs compute-bound
  - Different technologies; 90nm vs 22nm technology
  - Different operating conditions; frequency, temperature
(32)
Architecture Activity (1)
Activity 1: Instruction Fetch
icache.read++; fbuffer.write++;
[Figure: processor microarchitecture block diagram]
• Collect activity counts of each architecture component (through simulation or measurement).
• List of components differs between microarchitectures.
• Activity counts at each component differ between applications.
(33)
Architecture Activity (2)
Activity 2: Instruction Decode
fbuffer.read++; idecoder.logic++;
[Figure: processor microarchitecture block diagram]
• Read/write accesses to caches, buffers, etc.
• Logical accesses to logic blocks such as decoder, ALUs, etc.
• Tradeoff of differentiating more access types (accuracy) vs simulation speed (complexity).
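The counter updates shown for fetch and decode can be mimicked in a few lines. This is only a sketch of the bookkeeping, not a simulator; the counter names follow the slides, while the `fetch` and `decode` functions and the three-instruction run are hypothetical.

```python
from collections import Counter

counters = Counter()

def fetch():
    # Activity 1: read the instruction cache, write the fetch buffer.
    counters["icache.read"] += 1
    counters["fbuffer.write"] += 1

def decode():
    # Activity 2: read the fetch buffer, exercise the decoder logic.
    counters["fbuffer.read"] += 1
    counters["idecoder.logic"] += 1

# Hypothetical run: three instructions fetched and decoded.
for _ in range(3):
    fetch()
    decode()

print(dict(counters))
```

Splitting each component into more access types (for example, cache hits vs misses) adds counters and improves accuracy at the cost of simulation speed.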
(34)
Power and Architecture Activity
• For example, at the nth clock cycle, the collected counters are:
  - Data cache:
    o read = 20, write = 12;
    o per-read energy = 0.5nJ; per-write energy = 0.6nJ;
    o Read energy = read*per-read energy = 10nJ
    o Write energy = write*per-write energy = 7.2nJ
    o Total activity energy = read+write energies = 17.2nJ
    o If n = 50th clock cycle and clock frequency = 2GHz,
      Total activity power = energy*clock_freq/n = 688mW
*Note: n/clock_freq = duration of n clock periods in seconds;
power = energy averaged over that time
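The arithmetic of this example can be reproduced as follows. The counts, per-access energies, cycle count, and clock frequency are the ones given above; the function and variable names are illustrative.

```python
def activity_energy_nj(counts, per_access_nj):
    """Sum count * per-access energy (in nJ) over all access types."""
    return sum(counts[kind] * per_access_nj[kind] for kind in counts)

counts = {"read": 20, "write": 12}
per_access_nj = {"read": 0.5, "write": 0.6}

energy_nj = activity_energy_nj(counts, per_access_nj)  # read + write energy
n_cycles = 50
clock_hz = 2e9
elapsed_s = n_cycles / clock_hz                         # n clock periods in seconds
power_mw = energy_nj * 1e-9 / elapsed_s * 1e3           # nJ -> J, then W -> mW

print(f"energy = {energy_nj:.1f} nJ, power = {power_mw:.0f} mW")
```

This gives 17.2 nJ over 25 ns, i.e., 688 mW.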
(35)
Things to consider (1)
1. How do we calculate per-read/write energies?
• Per-access energies can be estimated from circuit-level
designs and analyses.
• There are various open-source tools for this.
[Figure: Architecture Specification + Technology Parameters → Circuit-level Estimation Tool → Estimation Results: Area, Energy, Timing, etc.]
(36)
Things to consider (2)
2. Is per-access energy always the same?
• Per-access energy in fact depends on:
• how many bits are switching
• how they are switching (0→1 or 1→0)
• It is reasonable to assume constant per-access energy over a long observation window (e.g., n = 1M clock cycles); the number of switching bits is averaged (e.g., 50% of bits are switching).
• Most architecture simulators do not capture bit-level details due to simulation complexity.
(37)
Things to consider (3)
3. If a register file didn’t have read/write accesses
but held data, what is the energy dissipation?
• Energy (or power) dissipation consists largely of dynamic and
static components.
• Dynamic (or switching) energy refers to energy dissipation
due to switching activities.
• Static (or leakage) energy is dissipation to keep the
electronic system turned on.
• In this case, the register file has no dynamic energy
dissipation but consumes static energy.
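The idle register file case can be sketched by splitting a component's energy into its two parts. The leakage power, per-access energy, and idle interval below are hypothetical values chosen only to illustrate the split.

```python
def component_energy_nj(accesses, per_access_nj, static_power_mw, elapsed_ns):
    """Return (dynamic, static) energy in nJ for one component.

    Static energy is static power times elapsed time; 1 mW * 1 ns = 1e-3 nJ.
    """
    dynamic = accesses * per_access_nj
    static = static_power_mw * elapsed_ns * 1e-3
    return dynamic, static

# Hypothetical register file: no accesses for 1000 ns, with 2 mW of leakage.
dynamic, static = component_energy_nj(accesses=0, per_access_nj=0.1,
                                      static_power_mw=2.0, elapsed_ns=1000.0)
print(f"dynamic = {dynamic} nJ, static = {static} nJ")
```

With zero accesses the dynamic term vanishes, while the static term keeps growing with elapsed time.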
(38)
Thermal Issues
• Heat can cause damage to the chip
  - Need failsafe operation
• Thermal fields change the physical characteristics
  - Leakage current and therefore power increases
  - Delay increases
  - Device degradation becomes worse
• Cooling solution determines the permitted power dissipation
(39)
Thermal Design Power (TDP)
• This is the maximum power at which the part is designed to operate
  - Dictates the design of the cooling system
    o Max temperature → Tjmax
  - Typically fixed by worst case workload
• Parts are typically operating below the TDP
• Opportunities for turbo mode?
[Figure: AMD Trinity APU]
http://ecs.vancouver.wsu.edu/thermofluids-research
(40)
Trinity TDP
Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2
(41)
Exploiting the Physics
• Most of the time the part is operating well below its thermal limit
  - Leaving performance on the table
• Can temporarily boost frequency (and therefore power dissipation) for short periods of time, e.g., seconds
• Temperature changes slowly
Boosting
Intel Sandy Bridge
• Exploit package physics
  - Temperature changes on the order of milliseconds
• Use the thermal headroom
[Figure: power versus time with labels: Low power (build up thermal credits), Turbo boost region, Max Power, TDP Power, 10s of seconds]
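One way to reason about how long a boost can last is a lumped first-order thermal model. This is an assumption for illustration, not the algorithm used by any particular processor: the package is treated as a single thermal resistance and capacitance, and all parameter values are hypothetical.

```python
import math

def boost_duration(p_boost, t_start, t_max, t_amb, r_th, c_th):
    """Seconds a boost at p_boost watts can last before reaching t_max.

    Assumes a lumped model: C_th * dT/dt = P - (T - T_amb) / R_th.
    Returns math.inf if the steady-state temperature never reaches t_max.
    """
    t_steady = t_amb + p_boost * r_th
    if t_steady <= t_max:
        return math.inf
    if t_start >= t_max:
        return 0.0
    tau = r_th * c_th
    return -tau * math.log((t_steady - t_max) / (t_steady - t_start))

# Hypothetical package: 0.5 K/W, 40 J/K, 40 C ambient, 90 C limit, 120 W boost.
for start in (60.0, 80.0):
    secs = boost_duration(120.0, start, 90.0, 40.0, 0.5, 40.0)
    print(f"start at {start:.0f} C -> boost for about {secs:.1f} s")
```

In this sketch a cooler starting temperature (more accumulated thermal credit) allows a longer boost, and a power level whose steady-state temperature stays under the limit can be sustained indefinitely.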
(43)
Conclusions
• Power/energy is the leading driver of modern
architecture design
• Power and energy management is key to
scalability
• Need integrated power/energy, performance,
thermal management in fielded systems
• What about energy/power efficient algorithms?
(44)
Study Guide
• Explain the difference between energy
dissipation and power dissipation
• Distinguish between static power dissipation
and dynamic power dissipation
• Be able to apply the simplified McPAT power
model to a simple datapath and instruction
sequence
• Explain dynamic voltage frequency scaling
  - What are power states?
  - Why is this an advantage?
  - What is the impact of DVFS on i) energy, ii) execution time, and iii) power
(45)
Study Guide (cont.)
• How is thermal design power (TDP) calculated?
• When using boost algorithms, what determines
the duration of the high frequency operation?
• How does a power virus work?
• Describe how throttling works
• Know the power dissipation in some modern
processor-memory systems drawn from the
embedded, server, and high performance
computing segments
(46)