Design for low power
The Crusoe processors
The ARM processors
Final remarks
2015-12-14
Microelectronics technology has exponential performance increase at a low cost.
However, for some applications, low-power consumption is more important than performance.
Mobile communications and computing
Wireless Internet
Medical implants
Deep space applications
Low-power designs will lead to:
Longer battery life time
Lower cost
• Cooling and package
• Electricity bill
Higher reliability and longer life time (due to lower temperature and smaller temperature gradients).
Performance degradation at high temperature:
Reduced carrier mobility and driving current.
Increased interconnect delay.
Temperature has a strong impact on frequency.
[Figure: frequency versus temperature (°C), plotted from 40 to 110 °C at Vdd = 1.4 V.]
Source: Temperature-Aware Performance and Power Modelling, W. P. Liao, In Technical Report 04-250, UCLA Engr.
Device lifetime decreases exponentially with increasing junction temperature.
Source: NXP Semiconductor, PNX8526 Data Sheet (online).
Circuit Techniques:
Power efficient circuits, and asynchronous logic.
Micro-Architecture and Logic:
Logic transformations to reduce switching activities.
Number encoding (sign-magnitude number representation is better than the two’s complement representation for integers).
2's complement vs. sign-magnitude (8-bit patterns):
Value   2's complement   Sign-magnitude
 2      00000010         00000010
 1      00000001         00000001
 0      00000000         00000000
-1      11111111         10000001
-2      11111110         10000010
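The effect of the encoding on switching activity can be checked by counting bit flips. The following is a minimal sketch, assuming an 8-bit bus and a hypothetical sample signal that oscillates around zero; the function names are illustrative and not part of the lecture.

```python
def twos_complement(value: int, bits: int = 8) -> int:
    """Return the two's complement bit pattern of value as an unsigned int."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value: int, bits: int = 8) -> int:
    """Return the sign-magnitude pattern: top bit is the sign, the rest is |value|."""
    sign = 1 << (bits - 1) if value < 0 else 0
    return sign | abs(value)


def toggles(a: int, b: int) -> int:
    """Number of bit positions that differ between two patterns."""
    return bin(a ^ b).count("1")


def total_toggles(samples, encode) -> int:
    """Sum of bit flips on a bus that carries the samples one after another."""
    codes = [encode(s) for s in samples]
    return sum(toggles(x, y) for x, y in zip(codes, codes[1:]))


# Hypothetical signal built from the values in the table above.
signal = [1, -1, 2, -2, 0, -1, 1]
tc = total_toggles(signal, twos_complement)
sm = total_toggles(signal, sign_magnitude)
print(f"two's complement: {tc} toggles, sign-magnitude: {sm} toggles")
```

For small values that change sign often, two's complement flips almost every bit at each sign change, while sign-magnitude flips only the sign bit and a few low-order bits.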
Software:
Use power-efficient algorithms.
Use compilers that optimize for power-efficient code.
Run-time power management by the OS.
Static power
Leakage currents (I_DDQ).
Sub-threshold currents.
Substrate currents.
Ideally, there is no static (DC) power, since in the steady state there is no direct path from Vdd to ground.
Dynamic Power
Transient switching behavior.
Capacitive switching: each time a capacitive node switches from ground to Vdd, an energy of C·Vdd² is consumed.
The result of charging and discharging parasitic capacitances.
This is about 60% of the total power in the current technology.
Average dynamic power consumption for CMOS:
$$P_{dyn} = \frac{1}{2}\,\alpha\,C\,V_{dd}^{2}\,f$$
$V_{dd}$: the supply voltage;
$C$: the total capacitance;
$\alpha$: the expected number of transitions per clock cycle; and
$f$: the clock frequency in a synchronous system.
We therefore have 3 degrees of freedom inherent in the low-power design space:
Supply voltage (Vdd).
Physical capacitance (C).
Switching activity (α f, the expected number of transitions per clock cycle times the clock frequency).
These parameters are not completely orthogonal and cannot be optimized independently.
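As a rough illustration of how the three degrees of freedom enter the dynamic power, the sketch below evaluates one half of the product of activity, capacitance, squared supply voltage and clock frequency. The parameter name `alpha` (expected transitions per clock cycle) and all numeric values are assumptions chosen only for the example.

```python
def dynamic_power(c_farads: float, vdd_volts: float, alpha: float, f_hz: float) -> float:
    """Average dynamic power in watts: 0.5 * alpha * C * Vdd^2 * f."""
    return 0.5 * alpha * c_farads * vdd_volts ** 2 * f_hz


# Hypothetical operating point: 1 nF switched capacitance, 1.4 V, 500 MHz.
baseline = dynamic_power(c_farads=1e-9, vdd_volts=1.4, alpha=0.2, f_hz=500e6)

# Halving only the supply voltage cuts the power to a quarter (quadratic term).
half_vdd = dynamic_power(c_farads=1e-9, vdd_volts=0.7, alpha=0.2, f_hz=500e6)

# Halving only the capacitance or only the activity cuts it to a half (linear terms).
half_c = dynamic_power(c_farads=0.5e-9, vdd_volts=1.4, alpha=0.2, f_hz=500e6)

print(half_vdd / baseline, half_c / baseline)
```

In a real design the parameters interact: a lower supply voltage usually forces a lower clock frequency, which is why they cannot be tuned independently.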
Supply voltage: quadratic relationship to power.
Reducing it is the most direct and dramatic way of minimizing energy consumption.
However, we also need to consider other factors that influence the selection of a system supply voltage:
Performance requirements; and
Compatibility.
Physical capacitance is determined by two primary sources:
Devices and interconnects.
Should be kept at a minimum by using small devices and short wires.
Multi-core is better than a single core, since the interconnect lengths will be shorter in general.
GALS: Globally asynchronous, locally synchronous design:
Uses different clock domains for different parts of the chips.
Switching activity (SA) determines how often switching occurs.
f determines the average periodicity of data arrivals.
α (the expected number of transitions per clock cycle) determines how many transitions each arrival will spark.
Clock gating: disable portions of the circuitry so that the flip-flops in them do not have to switch states.
[Figure: clock-gating circuit in which a condition signal (cond) gates the clock (clk).]
Dynamic frequency scaling: the clock frequency is automatically adjusted "on the fly" to conserve power.
Build adaptive systems.
Learn from nature.
Slow things down, and turn them off whenever appropriate.
Parallel processing and pipelining, so that a low voltage can still deliver the required performance.
Metric          One logic block (Vdd)   Two parallel logic blocks (Vdd/2)
Vdd             1                       0.5
Freq            1                       0.5
Throughput      1                       1
Power           1                       0.25
Area            1                       2
Power density   1                       0.125
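The numbers in this comparison follow from the dynamic power relation (power proportional to capacitance times Vdd squared times frequency). The sketch below reproduces them under two stated assumptions: the achievable frequency scales linearly with Vdd, and the overhead of splitting and merging the data streams is ignored. The function name is illustrative.

```python
def scaled_design(n_blocks: int, vdd: float, freq: float) -> dict:
    """Metrics of n_blocks parallel copies, relative to one block at Vdd = freq = 1."""
    throughput = n_blocks * freq
    power = n_blocks * vdd ** 2 * freq  # capacitance grows with the number of copies
    area = n_blocks
    return {
        "throughput": throughput,
        "power": power,
        "area": area,
        "power_density": power / area,
    }


single = scaled_design(n_blocks=1, vdd=1.0, freq=1.0)
dual = scaled_design(n_blocks=2, vdd=0.5, freq=0.5)
print(single)
print(dual)
```

The duplicated design keeps the throughput at 1 while the power drops to a quarter, at the price of twice the area.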
[Figure: breakdown of processor power by component: clock, FP, I/O, issue, integer, caches, memory, others.]
Simplify the control
SIMD amortizes energy cost of one instruction over many operations.
GPUs have exploited this idea to get great power efficiency.
Pipeline gating: reduce mis-speculated instruction execution.
Mis-speculated instructions increase energy consumption, typically with 16%-105% overhead.
Pipeline gating: stall fetching when confidence is low.
Prevent “bad” instructions from entering the pipeline: may reduce 38% of wrongly executed instructions.
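[Figure: pipeline gating. A counter of low-confidence branch predictions (BP) is incremented and decremented as branches pass through the pipeline (fetch, decode, issue, exe/wb, commit); fetch is stalled when the counter exceeds a threshold.]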
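The gating decision can be sketched as a small counter of unresolved low-confidence branches. This is an illustrative model only: the class and method names are hypothetical, and the confidence estimate itself is assumed to come from the branch predictor.

```python
class PipelineGate:
    """Stall fetch while too many low-confidence branches are in flight."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.low_confidence_in_flight = 0

    def branch_fetched(self, low_confidence: bool) -> None:
        # Increment when a branch with a low-confidence prediction enters the pipeline.
        if low_confidence:
            self.low_confidence_in_flight += 1

    def branch_resolved(self, was_low_confidence: bool) -> None:
        # Decrement once such a branch has executed and its outcome is known.
        if was_low_confidence and self.low_confidence_in_flight > 0:
            self.low_confidence_in_flight -= 1

    def stall_fetch(self) -> bool:
        return self.low_confidence_in_flight > self.threshold


gate = PipelineGate(threshold=1)
gate.branch_fetched(low_confidence=True)
gate.branch_fetched(low_confidence=True)
stalled_before = gate.stall_fetch()   # two unresolved low-confidence branches
gate.branch_resolved(was_low_confidence=True)
stalled_after = gate.stall_fetch()    # back at the threshold, fetch resumes
```

A lower threshold saves more energy on wrong-path instructions but stalls more often when the prediction would in fact have been correct.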
A flat large memory consumes much power.
A memory hierarchy is better.
The lower level memories will usually not be activated.
Banked memory consumes less power.
Hierarchical register files are better (e.g., register windows).
Multiple layers of cache.
Very often only the top levels are visited.
Dynamically adjusting cache size can also save power.
Low-power DRAM with deep sleep modes should be used whenever possible.
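Why a hierarchy saves energy can be seen from the average energy per access: a lower level only costs energy when every level above it misses. The sketch below assumes hypothetical per-access energies (in arbitrary units) and hit rates; none of the numbers come from the lecture.

```python
def average_access_energy(levels) -> float:
    """Average energy per access for a hierarchy.

    levels: list of (energy_per_access, hit_rate), ordered from the top level
    down; the last level should have hit_rate 1.0.
    """
    energy = 0.0
    reach_probability = 1.0  # probability that an access gets down to this level
    for access_energy, hit_rate in levels:
        energy += reach_probability * access_energy
        reach_probability *= 1.0 - hit_rate
    return energy


# Hypothetical numbers: small L1, larger L2, large main memory.
hierarchy = [(1.0, 0.9), (5.0, 0.8), (100.0, 1.0)]
flat = [(100.0, 1.0)]

e_hierarchy = average_access_energy(hierarchy)
e_flat = average_access_energy(flat)
print(e_hierarchy, e_flat)
```

With these assumed values the hierarchy spends about 3.5 units per access against 100 for the flat memory, because the expensive level is visited by only 2% of the accesses.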
Doze (entry time immediate, wakeup time immediate)
Cores idle at reduced frequency; continue snooping on the bus
Wake up immediately — no state reloading needed
Nap (entry time 2–16 μs, wakeup time < 0.5 ms)
Core clock stopped and voltage lowered to reduce leakage
All architecture state is retained
D-cache modified data is flushed by HW to maintain coherence
SRAM (cache) remains powered on; values are retained
Sleep (entry time 2–16 μs, wakeup time < 1 ms)
Core powered off (either or both cores)
Some architecture state must be saved by software
On wakeup, the core goes through power-on-reset sequence
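One way run-time power management could use these modes is to pick the deepest one whose wake-up time still fits the latency the system can tolerate. The sketch below is a hypothetical policy, not the mechanism of any particular processor; it uses the wake-up bounds listed above as worst-case values.

```python
# (mode name, worst-case wake-up time in ms), ordered from shallowest to deepest.
MODES = [("doze", 0.0), ("nap", 0.5), ("sleep", 1.0)]


def deepest_mode(max_wakeup_ms: float) -> str:
    """Return the deepest mode whose wake-up time does not exceed the tolerated latency."""
    chosen = "doze"  # doze wakes up immediately, so it is always allowed
    for name, wakeup_ms in MODES:
        if wakeup_ms <= max_wakeup_ms:
            chosen = name
    return chosen


print(deepest_mode(0.1), deepest_mode(0.7), deepest_mode(5.0))
```

A fuller policy would also weigh the energy spent entering and leaving a mode against the expected idle time.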
Developed for mobile and Internet computing.
VLIW CPU: executes up to 4 operations in each cycle.
Molecule: a long instruction word (128 bits).
All atoms within a molecule are executed in parallel.
1 ALU, 1 FP unit, 1 load/store unit, 1 branch unit.
7-stage integer/10-stage FP pipeline.
Executes x86 code, but is simpler than a superscalar x86 implementation.
• The blue stuff is silicon, and the yellow is software.
• Crusoe's blue part is smaller.
• All of the other hardware was moved off the die and into software.
Code Morphing means dynamic translation of x86 code to native Crusoe code.
Provides the Crusoe processor with x86 compatibility.
It uses a translation cache to exploit the fact that once a loop has been translated, it will be executed many times.
Benefits:
Improvements for power consumption and performance.
Upgrades to the software portion of a microprocessor can be done independently from the hardware.
Decoupling the hardware design from the system and application software.
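The idea behind the translation cache can be sketched as memoization keyed by the address of a code block. This is a toy model, not Transmeta's implementation: `toy_translate` is a stand-in for the real translator, and the class name, address and instruction strings are hypothetical.

```python
class TranslationCache:
    """Translate each block once and reuse the result on later executions."""

    def __init__(self, translate):
        self._translate = translate
        self._cache = {}
        self.translations = 0  # how many times the translator actually ran

    def fetch(self, address: int, x86_block: tuple) -> tuple:
        if address not in self._cache:
            self._cache[address] = self._translate(x86_block)
            self.translations += 1
        return self._cache[address]


def toy_translate(x86_block: tuple) -> tuple:
    """Stand-in for the real translator: just tags each instruction."""
    return tuple(f"native({insn})" for insn in x86_block)


cache = TranslationCache(toy_translate)
loop_body = ("add eax, 1", "cmp eax, ecx", "jne loop")
for _ in range(1000):
    native = cache.fetch(0x4000, loop_body)

print(cache.translations)
```

The loop body is executed a thousand times but translated only once, which is where the translation cost is amortized.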
Traditional x86 processors:
Translate each x86 instruction every time it is encountered.
Are full of complex, power-hungry transistors.
Crusoe processor with Code Morphing software:
Translates instructions once, saving the resulting translation in a cache for re-use.
Implements much of the processor functionality in software:
fewer logic transistors, less power;
can use effective optimization and scheduling algorithms;
can use a larger window of instructions.
Adjusting power to meet user demands:
Scale voltage and frequency dynamically to give just enough performance for current workload.
Switching off processor or changing clock rate.
How do you know if you are running code fast enough?
Real-time software (e.g., a DVD player) is easy: just run it fast enough to keep up with the streaming data.
User-interface software is more heuristic: store profiles so that the user can help specify whether performance is 'good enough'.
Crusoe uses both clock rate and voltage adjustment to achieve cubic power reduction:
Frequency changes in steps of 33 MHz.
Voltage changes in steps of 25 mV.
Supports up to 200 frequency/voltage changes per second.
The adjustment is done dynamically:
If no idle time detected during a workload, the frequency/voltage point is incremented.
If idle time is detected, decrement the frequency/voltage level.
Result: up to 30% power reduction
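The adjustment rule above can be sketched as a simple step controller. The step sizes (33 MHz and 25 mV) are taken from the text; the number of levels and the lowest operating point are hypothetical, and the power estimate assumes that dynamic power is proportional to Vdd squared times frequency.

```python
FREQ_STEP_MHZ = 33  # step size stated in the text
VOLT_STEP_MV = 25   # step size stated in the text


class DvfsController:
    """Toy idle-driven controller; the operating range is hypothetical."""

    def __init__(self, max_level: int = 10, base_mhz: int = 300, base_mv: int = 1100):
        self.max_level = max_level
        self.level = max_level  # start at full speed
        self.base_mhz = base_mhz
        self.base_mv = base_mv

    def update(self, idle_detected: bool) -> None:
        """Step down when idle time was seen, step up otherwise."""
        if idle_detected:
            self.level = max(0, self.level - 1)
        else:
            self.level = min(self.max_level, self.level + 1)

    def frequency_mhz(self, level=None) -> int:
        level = self.level if level is None else level
        return self.base_mhz + level * FREQ_STEP_MHZ

    def voltage_mv(self, level=None) -> int:
        level = self.level if level is None else level
        return self.base_mv + level * VOLT_STEP_MV

    def relative_power(self) -> float:
        """Dynamic power relative to the top level, using P ~ Vdd^2 * f."""
        v_ratio = self.voltage_mv() / self.voltage_mv(self.max_level)
        f_ratio = self.frequency_mhz() / self.frequency_mhz(self.max_level)
        return v_ratio ** 2 * f_ratio


ctrl = DvfsController()
for idle in (True, True, False, True):
    ctrl.update(idle)
print(ctrl.frequency_mhz(), ctrl.voltage_mv(), ctrl.relative_power())
```

Because voltage and frequency are lowered together, the power falls faster than the frequency alone, which is the source of the roughly cubic reduction.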
[Figure: Crusoe TM5400 compared with Intel Pentium III.]
[Figure: bar chart over applications (Office 2000, Web browser, MP3, DVD) for the Mobile Pentium III 500 MHz, TM5400 and TM3120; vertical axis from 0 to 1.2.]
A family of RISC processors designed by the company ARM (Advanced RISC Machines).
The most widely used instruction set architecture in terms of quantity produced.
They were originally targeted at the PC market; however, the designs are particularly suited to low-power applications,
e.g., mobile computing; and
embedded systems, in general.
Being RISC, they require simpler hardware, which results in lower power and makes them very attractive for smaller devices.
As of 2009, ARM processors accounted for approximately 90% of all embedded 32-bit RISC processors.
ARM focuses only on designing the bare bones of the processor: the core.
They license their processor cores to semiconductor companies like Apple and NVIDIA, which combine the core with other parts to produce their products.
ARM doesn’t manufacture the chips.
These cores are especially popular in mobile computing devices because of their low power consumption.
The majority of the recent smart phone processors are based on an ARM core.
In 2010, 95% of smart phones were based on ARM.
Harvard architecture with 32-bit words.
Load/store architecture.
Uniform 16 × 32-bit register file, with additional register banks for different processor modes.
Fixed instruction width of 32 bits to ease decoding and pipelining.
Mostly single-cycle execution.
Conditional execution of most instructions (predicated execution) to get better performance.
1-4 way multi-core.
Split L1 cache.
Unified L2 off-chip cache.
MESI snoopy protocol.
Cache-to-cache transfers to avoid L2.
Multi-issue, speculation, and renaming.
Out of order selection of instructions.
Optional NEON SIMD engine.
Power-efficient and high performance.
The NEON™ general-purpose SIMD engine efficiently processes current and future multimedia formats.
It accelerates multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio/speech processing, and image processing.
It is a 128-bit SIMD architecture extension for the basic ARM instruction set.
It has 32 registers, 64 bits wide (dual view as 16 registers, 128 bits wide).
NEON instructions perform "packed SIMD" processing, with eight 16-bit operations per instruction.
It gives 1.6x-2.5x performance for complex video codecs.
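Packed SIMD processing can be illustrated by treating a 128-bit value as eight independent 16-bit lanes. The sketch below only models the semantics of a lane-wise add in plain Python; it is not NEON code, and the helper names are hypothetical.

```python
LANES = 8
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values) -> int:
    """Pack eight 16-bit values into one 128-bit integer (lane 0 in the low bits)."""
    assert len(values) == LANES
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word: int) -> list:
    """Split a 128-bit integer back into its eight 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a: int, b: int) -> int:
    """Lane-wise addition with wrap-around; no carry crosses a lane boundary."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])


a = pack([1, 2, 3, 4, 5, 6, 7, 65535])
b = pack([10, 10, 10, 10, 10, 10, 10, 1])
result = unpack(packed_add(a, b))
print(result)
```

One such operation replaces eight scalar additions, so the cost of fetching and decoding the instruction is shared by all eight lanes.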
Thursday, Jan 14, 8:00-12:00. Don’t forget to register.
Closed book.
An English/Swedish dictionary is permitted.
Answers can be written in English and/or Swedish.
The exam covers the topics discussed in the lectures:
Follow the lecture notes when preparing for the exam.
Reading instructions can be found on the course website.
Please provide feedback on the course:
by email;
fill in the course evaluations!