Lecture 12: Low-Power Architecture

▪ Design for low power

▪ The Crusoe processors

▪ The ARM processors

▪ Final remarks

2015-12-14

Motivation

Microelectronics technology has delivered exponential performance increases at low cost.

However, for some applications, low power consumption is more important than performance.

▪ Mobile communications and computing

▪ Wireless Internet

▪ Medical implants

▪ Deep space applications

Low-power designs will lead to:

▪ Longer battery lifetime

▪ Lower cost

• Cooling and packaging

• Electricity bill

▪ Higher reliability and longer lifetime (due to lower temperature and smaller temperature gradients).


Thermal Impact on Performance

▪ Performance degradation at high temperature:

▪ Reduced carrier mobility and driving current.

▪ Increased interconnect delay.

Temperature has a strong impact on frequency.

[Figure: frequency versus temperature (40 to 110 °C) at Vdd = 1.4 V.]

Source: Temperature-Aware Performance and Power Modelling, W. P. Liao, In Technical Report 04-250, UCLA Engr.

Thermal Impact on Reliability

High temperature leads to decreased reliability and lifetime.

▪ Device lifetime decreases exponentially with increasing junction temperature.

Source: NXP Semiconductor, PNX8526 Data Sheet. (Online)


Thermal Map of an MP-SoC

Low-Power Technique Examples

Circuit Techniques:

▪ Power-efficient circuits and asynchronous logic.

Micro-Architecture and Logic:

▪ Logic transformations to reduce switching activity.

▪ Number encoding (for integers, sign-magnitude representation is better than two's complement: values that alternate between small positive and small negative numbers toggle far fewer bits).

2's complement vs. sign-magnitude

2's complement     Value   Sign-magnitude
0 0 0 0 0 0 1 0      2     0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1      1     0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0      0     0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1     -1     1 0 0 0 0 0 0 1
1 1 1 1 1 1 1 0     -2     1 0 0 0 0 0 1 0
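The effect of the encoding on switching activity can be checked by counting bit toggles. The following is a minimal sketch, assuming 8-bit words and a signal that steps through the values 2, 1, 0, -1, -2 from the table; the function names are illustrative and not part of the lecture.

```python
def twos_complement(value: int, bits: int = 8) -> int:
    """Encode a signed integer as a two's complement bit pattern."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value: int, bits: int = 8) -> int:
    """Encode a signed integer as sign bit plus magnitude."""
    sign = (1 << (bits - 1)) if value < 0 else 0
    return sign | abs(value)


def toggles(codes):
    """Count the bits that flip between consecutive code words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


# The sequence of values from the table, crossing zero once.
sequence = [2, 1, 0, -1, -2]
tc_toggles = toggles([twos_complement(v) for v in sequence])
sm_toggles = toggles([sign_magnitude(v) for v in sequence])

print("two's complement toggles:", tc_toggles)
print("sign-magnitude toggles: ", sm_toggles)
```

For this sequence the zero crossing flips all eight bits in two's complement but only two bits in sign-magnitude, which is where most of the difference comes from.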

Software:

▪ Use power-efficient algorithms.

▪ Compilers that optimize for more power-efficient code.

▪ Run-time power management by the OS.


CMOS Power Dissipation

Static power

▪ Leakage currents ($I_{DDQ}$).

▪ Sub-threshold currents.

▪ Substrate currents.

▪ Ideally, no static (DC) power, since in the steady state there is no direct path from $V_{dd}$ to ground.

Dynamic Power

▪ Transient switching behavior.

▪ Capacitive switching: each time a capacitive node switches from ground to $V_{dd}$, an energy of $C V_{dd}^2$ is consumed.

▪ The result of charging and discharging parasitic capacitances.

▪ This is about 60% of the total power in current technology.

Design for Low Dynamic Power

Average dynamic power consumption for CMOS:

$$P_{dyn} = \frac{1}{2} C V_{dd}^2 \alpha f$$

$V_{dd}$: the supply voltage;

$C$: the total capacitance;

$\alpha$: the expected number of transitions per clock cycle; and

$f$: the clock frequency in a synchronous system.

We therefore have three degrees of freedom inherent in the low-power design space:

▪ Supply voltage ($V_{dd}$).

▪ Physical capacitance ($C$).

▪ Switching activity ($\alpha f$).

These parameters are not completely orthogonal and cannot be optimized independently.
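To see how each degree of freedom affects the result, the dynamic power formula can be evaluated directly. This is a small sketch; the operating point (activity 0.2, 1 nF, 1.2 V, 1 GHz) is a made-up value chosen only for illustration, and each parameter is varied in isolation even though, as noted above, they are not independent in a real design.

```python
def dynamic_power(alpha: float, c: float, vdd: float, f: float) -> float:
    """Average dynamic power in watts: 1/2 * C * Vdd^2 * alpha * f."""
    return 0.5 * c * vdd ** 2 * alpha * f


# Hypothetical operating point, for illustration only.
baseline = dynamic_power(alpha=0.2, c=1e-9, vdd=1.2, f=1e9)

# Halve one parameter at a time, keeping the others fixed.
half_vdd = dynamic_power(alpha=0.2, c=1e-9, vdd=0.6, f=1e9)
half_c = dynamic_power(alpha=0.2, c=0.5e-9, vdd=1.2, f=1e9)
half_f = dynamic_power(alpha=0.2, c=1e-9, vdd=1.2, f=0.5e9)

print(f"baseline power : {baseline:.3f} W")
print(f"half Vdd       : {half_vdd / baseline:.2f} of baseline")
print(f"half C         : {half_c / baseline:.2f} of baseline")
print(f"half f         : {half_f / baseline:.2f} of baseline")
```

Halving the supply voltage quarters the power, while halving the capacitance or the frequency only halves it.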


Voltage Reduction

Quadratic relationship to power

▪ The most direct and dramatic way of minimizing energy consumption.

However, we must also consider other factors that influence the selection of a system supply voltage:

▪ Performance requirements (gate delay increases as the supply voltage is lowered); and

▪ Compatibility.


Physical Capacitance

Determined by two primary sources:

▪ Devices and interconnects

Should be kept at a minimum by using small devices and short wires.

▪ Multi-core is better than a single core, since the interconnect lengths will generally be shorter.

GALS (globally asynchronous, locally synchronous) design:

▪ Uses different clock domains for different parts of the chip.


Switching Activities

Switching activity ($\alpha f$) determines how often switching occurs.

$f$ determines the average periodicity of data arrivals.

$\alpha$ determines how many transitions each arrival will spark.

Clock gating: disable portions of the circuitry so that the flip-flops in them do not have to switch state.

[Figure: clock-gating circuit in which a condition signal (cond) gates the clock (clk).]
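The saving from clock gating can be illustrated by counting the clock edges that actually reach a register bank. This is a behavioural sketch only, assuming the gated clock is simply the clock ANDed with the condition signal; the enable trace is invented for the example.

```python
def count_clock_edges(cond_per_cycle):
    """Return (ungated, gated) rising-edge counts seen by a register bank.

    cond_per_cycle holds one boolean per clock cycle: True when the
    block has work to do and its clock must be let through.
    """
    # A free-running clock delivers one rising edge every cycle.
    ungated = len(cond_per_cycle)
    # A gated clock (clk AND cond) delivers an edge only when cond is true.
    gated = sum(1 for cond in cond_per_cycle if cond)
    return ungated, gated


# Hypothetical trace: the block is needed in 3 of 8 cycles.
cond = [True, False, False, True, False, False, False, True]
ungated, gated = count_clock_edges(cond)
print(f"clock edges without gating: {ungated}")
print(f"clock edges with gating:    {gated}")
```

Every suppressed edge is a cycle in which the flip-flops and the clock wiring behind the gate do not switch.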

Dynamic frequency scaling: the clock frequency is automatically adjusted "on the fly" to conserve power.

▪ Build adaptive systems.

▪ Learn from nature.

Power-Aware Architecture Design (1)

Slow things down, and turn them off whenever appropriate.

Parallel processing and pipelining

▪ So that low voltage can be used to deliver the required performance.

[Figure: a single logic block at Vdd compared with two parallel logic blocks at Vdd/2.]

               One block at Vdd    Two blocks at Vdd/2
Vdd                   1                  0.5
Freq                  1                  0.5
Throughput            1                  1
Power                 1                  0.25
Area                  1                  2
Pwr Density           1                  0.125
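The power and power-density figures for the parallel configuration follow from the dynamic power formula. The derivation below is a sketch under the assumption that each copy of the logic block has the same capacitance $C$ and switching activity $\alpha$ as the original, and that any overhead for distributing and merging the data is ignored; $P_1, A_1$ denote the single block and $P_2, A_2$ the two-block version.

```latex
\begin{align*}
P_1 &= \tfrac{1}{2}\, C\, V_{dd}^2\, \alpha f \\
P_2 &= 2 \cdot \tfrac{1}{2}\, C \left(\tfrac{V_{dd}}{2}\right)^{2} \alpha\, \tfrac{f}{2}
     = \tfrac{1}{4}\, P_1 \\
\frac{P_2}{A_2} &= \frac{P_1 / 4}{2 A_1} = \tfrac{1}{8}\, \frac{P_1}{A_1}
\end{align*}
```

Throughput is unchanged because two blocks each running at $f/2$ complete as many operations per second as one block at $f$.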


Processor Power Distribution (Alpha 21264)

[Figure: pie chart of power consumption by unit: clock, FP, I/O, issue, Int, caches, memory, and others.]

Power-Aware Architecture Design (2)

Simplify the control

▪ SIMD amortizes the energy cost of one instruction over many operations.

▪ GPUs have exploited this idea to get great power efficiency.

Pipeline gating: reduce mis-speculated instruction execution.

▪ Mis-speculated instructions increase energy consumption, typically by 16% to 105%.

▪ Pipeline gating: stall fetching when confidence is low.

▪ Prevent “bad” instructions from entering the pipeline: may eliminate 38% of wrongly executed instructions.

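[Figure: pipeline gating. A counter of low-confidence branch predictions (BP) is incremented and decremented along the pipeline (fetch, decode, issue, exe/wb, commit); fetch is stalled while the counter exceeds a threshold.]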
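The gating decision described above can be sketched as a small counter. This is an illustrative model, not the lecture's implementation: the class and method names are hypothetical, and it assumes the counter is incremented when a low-confidence branch is fetched and decremented when that branch is resolved.

```python
class PipelineGate:
    """Stall fetch while too many low-confidence branches are in flight."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.low_confidence_in_flight = 0

    def on_branch_fetched(self, low_confidence: bool) -> None:
        # Count only branches whose prediction has low confidence.
        if low_confidence:
            self.low_confidence_in_flight += 1

    def on_branch_resolved(self, low_confidence: bool) -> None:
        # The branch has left the speculative part of the pipeline.
        if low_confidence:
            self.low_confidence_in_flight -= 1

    def stall_fetch(self) -> bool:
        # Gate the front end only when the counter exceeds the threshold.
        return self.low_confidence_in_flight > self.threshold


gate = PipelineGate(threshold=1)
gate.on_branch_fetched(low_confidence=True)
stall_after_one = gate.stall_fetch()
gate.on_branch_fetched(low_confidence=True)
stall_after_two = gate.stall_fetch()
gate.on_branch_resolved(low_confidence=True)
stall_after_resolve = gate.stall_fetch()
print(stall_after_one, stall_after_two, stall_after_resolve)
```

The threshold trades energy against performance: a low threshold blocks more wrong-path instructions but also stalls more often when the prediction would have been correct.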


Power Reduction on Memory System

A large, flat memory consumes a lot of power.

A memory hierarchy is better.

▪ The lower-level memories will usually not be activated.

Banked memory consumes less power, since only the accessed bank needs to be activated.

Hierarchical register files are better (e.g., register windows).

Multiple layers of cache.

▪ Very often only the top levels are visited.

Dynamically adjusting the cache size can also save power.

Low-power DRAM with deep sleep modes should be used whenever possible.

Power Saving Modes (PA6T Core)

Doze (entry time: immediate; wakeup time: immediate)

▪ Cores idle at reduced frequency and continue snooping on the bus.

▪ Wake up immediately; no state reloading is needed.

Nap (entry time: 2–16 μs; wakeup time: < 0.5 ms)

▪ Core clock stopped and voltage lowered to reduce leakage.

▪ All architectural state is retained.

▪ Modified D-cache data is flushed by hardware to maintain coherence.

▪ SRAM (cache) remains powered on; its contents are retained.

Sleep (entry time: 2–16 μs; wakeup time: < 1 ms)

▪ Core powered off (either or both cores).

▪ Some architectural state must be saved by software.

▪ On wakeup, the core goes through the power-on-reset sequence.
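One way an operating system might use these modes is to pick the deepest one whose wakeup latency the workload can tolerate. The sketch below is a hypothetical policy, not something described for the PA6T; it uses the upper bounds of the wakeup times listed above and treats "immediate" as zero.

```python
from typing import Optional

# (mode name, worst-case wakeup latency in ms), ordered shallowest to deepest.
# Latencies are the upper bounds quoted above; "immediate" is modelled as 0.
MODES = [("Doze", 0.0), ("Nap", 0.5), ("Sleep", 1.0)]


def deepest_mode(max_wakeup_ms: float) -> Optional[str]:
    """Return the deepest mode that wakes up within max_wakeup_ms."""
    chosen = None
    for name, wakeup_ms in MODES:
        if wakeup_ms <= max_wakeup_ms:
            chosen = name  # later entries are deeper, so keep overwriting
    return chosen


for budget in (0.1, 0.5, 2.0):
    print(f"wakeup budget {budget} ms -> {deepest_mode(budget)}")
```

A real policy would also weigh the entry time and the energy cost of saving and restoring state against the expected length of the idle period.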


Crusoe Family of Processors

Developed for mobile and Internet computing.

VLIW CPU: executes up to 4 operations (atoms) in each cycle.

▪ Molecule: a long instruction word (128 bits).

▪ All atoms within a molecule are executed in parallel.

1 ALU, 1 FP unit, 1 load/store unit, and 1 branch unit.

7-stage integer/10-stage FP pipeline.

Executing x86 code, but simpler than superscalar x86 implementation.


Crusoe Architecture (TM5800)


Crusoe vs. x86

• In the figure, the blue parts are silicon and the yellow parts are software.

• Crusoe's blue part is smaller.

• All of the other hardware functionality was moved off the die and into software.


Code Morphing


Code Morphing Software

▪ Code Morphing means dynamic translation of x86 code to native Crusoe code.

▪ Provides the Crusoe processor with x86 compatibility.

It uses a translation cache to exploit the fact that once a loop has been translated, it will be executed many times.
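The role of the translation cache can be sketched as a lookup table keyed by the address of an x86 code block. This is a toy model with hypothetical names (`TranslationCache`, `toy_translate`); the actual Code Morphing software is far more elaborate, and the placeholder translator here only returns a label.

```python
from typing import Callable, Dict, List


class TranslationCache:
    """Translate each x86 block once and reuse the result afterwards."""

    def __init__(self, translate: Callable[[int], List[str]]):
        self._translate = translate
        self._cache: Dict[int, List[str]] = {}
        self.translations = 0  # how many times the translator actually ran

    def lookup(self, x86_block_addr: int) -> List[str]:
        if x86_block_addr not in self._cache:
            # Cache miss: pay the translation cost once.
            self._cache[x86_block_addr] = self._translate(x86_block_addr)
            self.translations += 1
        return self._cache[x86_block_addr]


def toy_translate(addr: int) -> List[str]:
    # Placeholder standing in for the real x86-to-native translator.
    return [f"native_op_for_{addr:#x}"]


cache = TranslationCache(toy_translate)
for _ in range(1000):  # a loop body at address 0x4000 executed 1000 times
    native = cache.lookup(0x4000)

print("executions: 1000, translations:", cache.translations)
```

Because the loop body is translated only on its first execution, the translation cost is amortized over all later iterations.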

Benefits:

▪ Improvements in power consumption and performance.

▪ Upgrades to the software portion of a microprocessor can be done independently of the hardware.

▪ Decoupling the hardware design from the system and application software.


Advantages of Code Morphing

Traditional x86 processors:

▪ Translate each x86 instruction every time it is encountered.

▪ Full of complex, power-hungry transistors.

Crusoe processor with Code Morphing software:

▪ Translates instructions once, saving the resulting translation in a cache for reuse.

▪ Much of the processor functionality is implemented in software:

• fewer logic transistors, less power;

• can use effective optimization and scheduling algorithms;

• can use a larger window of instructions.


Dynamic Power Management

Adjusting power to meet user demands:

▪ Scale voltage and frequency dynamically to give just enough performance for the current workload.

▪ Switch off the processor or change the clock rate.

How do you know if you are running code fast enough?

▪ Real-time software (e.g., a DVD player) is easy: just run it fast enough to keep up with the streaming data.

▪ User-interface software is more heuristic: store profiles so that the user can help specify whether performance is "good enough".


LongRun Power Management

Crusoe adjusts both clock rate and voltage to achieve a cubic power reduction (dynamic power scales with $V_{dd}^2 f$, and the voltage is lowered together with the frequency):

▪ Frequency changes in steps of 33 MHz.

▪ Voltage changes in steps of 25 mV.

▪ Supports up to 200 frequency/voltage changes per second.

The adjustment is done dynamically:

▪ If no idle time is detected during a workload, the frequency/voltage point is incremented.

▪ If idle time is detected, the frequency/voltage point is decremented.

Result: up to 30% power reduction.
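The adjustment rule can be sketched as a simple governor that moves one operating point up or down per decision. Only the step sizes (33 MHz, 25 mV) come from the text; the lowest operating point, the number of levels, and the pairing of one frequency step with one voltage step are assumptions made for this illustration.

```python
from dataclasses import dataclass

FREQ_STEP_MHZ = 33  # step size from the text
VOLT_STEP_MV = 25   # step size from the text


@dataclass(frozen=True)
class OperatingPoint:
    freq_mhz: int
    volt_mv: int


# Hypothetical limits, for illustration only.
MIN_POINT = OperatingPoint(freq_mhz=300, volt_mv=1200)
MAX_LEVEL = 8  # highest number of steps above MIN_POINT


def point_at(level: int) -> OperatingPoint:
    """Operating point reached after `level` steps above the minimum."""
    return OperatingPoint(
        MIN_POINT.freq_mhz + level * FREQ_STEP_MHZ,
        MIN_POINT.volt_mv + level * VOLT_STEP_MV,
    )


def next_level(level: int, idle_detected: bool) -> int:
    """Step down when idle time was seen, otherwise step up."""
    if idle_detected:
        return max(0, level - 1)
    return min(MAX_LEVEL, level + 1)


def relative_power(point: OperatingPoint, ref: OperatingPoint) -> float:
    """Dynamic power of `point` relative to `ref`, using P ~ V^2 * f."""
    return (point.volt_mv / ref.volt_mv) ** 2 * (point.freq_mhz / ref.freq_mhz)


level = 4
for idle in [False, False, True, True, True]:
    level = next_level(level, idle)
    p = point_at(level)
    print(f"idle={idle!s:5} -> {p.freq_mhz} MHz, {p.volt_mv} mV, "
          f"power {relative_power(p, point_at(MAX_LEVEL)):.2f} of maximum")
```

Because voltage and frequency fall together, each downward step reduces power by more than the loss in clock rate alone would suggest.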


Example: TM5400


Processor Thermal Comparison

[Figure: thermal comparison of the Crusoe TM5400 and the Intel Pentium III.]

Power Consumption Comparison

[Figure: bar chart of power consumption (vertical axis from 0 to 1.2) for four applications (Office 2000, Web browser, MP3, DVD), comparing the Mobile Pentium III 500 MHz, the TM5400, and the TM3120.]


ARM Processors

A family of RISC processors designed by ARM (Advanced RISC Machines).

The most widely used instruction set architecture in terms of quantity produced.

They were originally targeted at the PC market; however, the designs are particularly suited to low-power applications,

▪ e.g., mobile computing; and

▪ embedded systems in general.

Being RISC, they require simpler hardware, which results in lower power consumption and makes them very attractive for smaller devices.

As of 2009, ARM processors accounted for approximately 90% of all embedded 32-bit RISC processors.

ARM Cores

ARM focuses only on designing the bare bones of the processor, the core.

It licenses its processor cores to semiconductor companies such as Apple and NVIDIA, which combine the core with other parts to produce their products.

▪ ARM doesn’t manufacture the chips.

These cores are especially popular in mobile computing devices because of their low power consumption.

The majority of recent smartphone processors are based on an ARM core.

▪ In 2010, 95% of smartphones were based on ARM.


ARM Architecture Features

Harvard architecture with 32-bit words.

Load/store architecture.

Uniform 16 × 32-bit register file, with additional register banks for different processor modes.

Fixed instruction width of 32 bits to ease decoding and pipelining.

Mostly single-cycle execution.

Conditional execution of most instructions (predicated execution) to get better performance.


ARM Cortex-A9 MP

1-4 way multi-core.

Split L1 cache.

Unified L2 off-chip cache.

MESI snoopy protocol.

Cache-to-cache transfers to avoid L2.

Multi-issue, speculation, and renaming.

Out of order selection of instructions.

Optional NEON SIMD engine.

Power-efficient and high performance.


NEON Technology

The NEON™ general-purpose SIMD engine efficiently processes current and future multimedia formats.

It accelerates multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio/speech processing, and image processing.

It is a 128-bit SIMD architecture extension for the basic ARM instruction set.

It has 32 registers, 64 bits wide (dual view as 16 registers, 128 bits wide).

NEON instructions perform "Packed SIMD" processing with eight 16-bit operations per instruction.

It gives 1.6x-2.5x performance for complex video codecs.
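What "packed SIMD" means can be shown with a behavioural model of one 128-bit register holding eight independent 16-bit lanes. The code below is a plain Python illustration, not NEON code or an ARM API; the helper names are hypothetical, and wraparound (modulo 2^16) addition is assumed for the lanes.

```python
LANES = 8       # eight lanes ...
LANE_BITS = 16  # ... of 16 bits each fill one 128-bit register
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack eight unsigned 16-bit values into one 128-bit integer (lane 0 lowest)."""
    assert len(values) == LANES
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 128-bit integer back into its eight 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Add lane by lane; each lane wraps around and never carries into its neighbour."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])


a = pack([1, 2, 3, 4, 5, 6, 7, 0xFFFF])
b = pack([10, 20, 30, 40, 50, 60, 70, 1])
result = unpack(packed_add(a, b))
print(result)
```

A single packed instruction thus replaces eight scalar additions, which is how the fetch and decode energy of one instruction is shared among many operations.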


Examination and Feedback

Thursday, Jan 14, 8:00-12:00. Don’t forget to register.

Closed book.

An English/Swedish dictionary is permitted.

Answers can be written in English and/or Swedish.

The exam covers the topics discussed in the lectures:

▪ Follow the lecture notes when preparing for the exam.

Reading instructions can be found on the course website.

Please provide feedback on the course:

▪ by email; and

▪ by filling in the course evaluation.
