System-Level Memory Bus Power
And Performance Optimization for
Embedded Systems
Ke Ning
kning@ece.neu.edu
David Kaeli
kaeli@ece.neu.edu
Why Power is More Important?
 "Power: A First Class Design Constraint for Future Architecture" – Trevor Mudge 2001
 Increasing complexity for higher performance (MIPS)
 Parallelism, pipeline, memory/cache size
 Higher clock frequency, larger die size
 Rising dynamic power consumption
 CMOS process continues to shrink:
 Smaller logic gates reduce Vthreshold
 Lower Vthreshold leads to higher leakage
 Leakage power will exceed dynamic power
 Things are getting worse in embedded systems:
 Low power and low cost systems
 Fixed or limited applications/functionalities
 Real-time systems with timing constraints
2
Power Breakdown of An Embedded System
[Pie chart: power breakdown across SDRAM, internal dynamic, internal leakage, UART, SPORT0/SPORT1, PPI, and RTC; the research target is the SDRAM/external bus portion. Conditions: 25°C, 1.2V internal, 400MHz CCLK Blackfin processor, 3.3V external, 133MHz SDRAM, 27MHz PPI. Source: Analog Devices Inc.]
3
Introduction
 Related work on microprocessor power
 Low power design trend
 Power metrics
 Power performance tradeoffs
 Power optimization techniques
 Power estimation framework
 Experimental framework built from a Blackfin cycle-accurate simulator
 Validated through a Blackfin EZ-Kit board
 Power aware bus arbitration
 Memory page remapping
4
Outline
 Research Motivation and Introduction
 Related Work
 Power Estimation Framework
 Optimization I – Power-Aware Bus Arbitration
 Optimization II – Memory Page Remapping
 Summary
5
Power Modeling
 Dynamic power estimation
 Instruction level model: [Tiwari94], JouleTrack [Sinha01]
 Function level model: [Qu00]
 Architecture model: Cai-Lim Model, TEMPEST [CaiLim99], Wattch [Brooks00], SimplePower [Ye00]
 Static power estimation
 Butts-Sohi model [Butts00]
 Previous memory system power estimation
 Activity model: CACTI [Wilton96]
 Trace driven model: Dinero IV [Elder98]
6
Power Equation

P = A·C·VDD²·f + VDD·N·kdesign·Ileakage

(first term: dynamic power; second term: leakage power)

A – Activity Factor
C – Total Capacitance
VDD – Voltage
f – Frequency
N – Transistor Number
kdesign – Technology factor
Ileakage – Leakage current
7
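The two terms map directly to a small calculator. A minimal sketch, with illustrative parameter values (not Blackfin measurements):

```python
def total_power(A, C, vdd, f, N, k_design, i_leak):
    """P = A*C*Vdd^2*f + Vdd*N*k_design*I_leak; returns (dynamic, leakage, total) in watts."""
    dynamic = A * C * vdd ** 2 * f            # switching (dynamic) term
    leakage = vdd * N * k_design * i_leak     # Butts-Sohi style leakage term
    return dynamic, leakage, dynamic + leakage

# Illustrative numbers: activity 0.15, 1 nF switched capacitance, 1.2 V, 400 MHz,
# 5M transistors, design factor 0.5, 1 pA normalized leakage per transistor.
dyn, leak, total = total_power(0.15, 1e-9, 1.2, 400e6, 5e6, 0.5, 1e-12)
```

Note how the dynamic term scales quadratically with VDD while the leakage term scales linearly, which is why voltage scaling attacks dynamic power first.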
Common Power Optimization Techniques
 Gating (turn off unused components)
 Clock gating
 Voltage gating: Cache decay [Hu01]
 Scaling (scale operating point of a component)
 Voltage scaling: Drowsy cache [Flautner02]
 Frequency scaling: [Pering98]
 Resource scaling: DRAM power mode [Delaluz01]
 Banking (break single component into smaller sub-units)
 Vertical sub-banking: Filter cache [Kin97]
 Horizontal sub-banking: Scratchpad [Kandemir01]
 Clustering (partition components into clusters)
 Switching reduction (redesign with lower activity)
 Bus encoding: Permutation Code [Mehta96], redundant code [Stan95, Benini98], WZE [Musoll97]
8
Power Aware Figure of Merit
 Delay, D
 Performance, MIPS
 Power, P
 Battery life (mobile), packaging (high performance)
 Energy, PD – obvious choice for power performance tradeoff
 Joules/instruction, inversely MIPS/W
 Mobile / low power applications
 Energy-Delay, PD² [Gonzalez96]
 MIPS²/W
 Energy-Delay-Square, PD³
 MIPS³/W
 Voltage and frequency independent
 More generically, MIPSᵐ/W
9
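Which design point "wins" depends on the exponent m. A toy comparison with two hypothetical design points:

```python
def figure_of_merit(power, delay, m):
    """P*D^m: m=1 gives energy (PD), m=2 energy-delay (PD^2), m=3 PD^3. Lower is better."""
    return power * delay ** m

fast = (2.0, 1.0)   # 2 W, 1 s per task (hypothetical)
slow = (1.0, 1.3)   # half the power, but 30% slower (hypothetical)

# Energy (m=1) favors the low-power point; PD^3 (m=3) favors the fast point.
assert figure_of_merit(*slow, 1) < figure_of_merit(*fast, 1)
assert figure_of_merit(*slow, 3) > figure_of_merit(*fast, 3)
```

This is why the metric must match the domain: mobile designs weight energy, high-performance designs weight delay more heavily.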
Power Optimization Effect on Power Figure
 Most
of optimization schemes sacrifice performance for lower
power consumption, except switching reduction.
 All of optimization schemes generate higher power efficiency.
 All of optimization schemes increase hardware complexity.
10
Outline
 Research Motivation and Introduction
 Related Work
 Power Estimation Framework
 Optimization I – Power-Aware Bus Arbitration
 Optimization II – Memory Page Remapping
 Summary
11
External Bus
 External Bus Components
 Typically an off-chip bus
 Includes: Control Bus, Address Bus, Data Bus
 External Bus Power Consumption
 Dynamic power factors: activity, capacitance, frequency, voltage
 Leakage power: supply voltage, threshold voltage, CMOS technology
 Differences from internal memory bus power:
 Longer physical distance, higher bus capacitance, lower speed
 Cross line interference, higher leakage current
 Different communication protocols (memory/peripheral dependent)
 Multiplexed row/column address bus, narrower data bus
12
Embedded SOC System Architecture
[Block diagram: a media processor core with instruction/data caches, an internal bus, a system DMA controller, memory DMA 0/1, and PPI/SPORT DMA channels (fed by an NTSC/PAL encoder, streaming interface, S-Video/CVBS, and NIC), all connecting through the External Bus Interface Unit (EBIU) to the external bus with SDRAM, FLASH memory, and asynchronous devices. The power modeling area covers the EBIU, external bus, and SDRAM.]
13
ADSP-BF533 EZ-Kit Lite Board
[Board diagram: BF533 Blackfin processor with audio in/out through an audio codec/AD converter, video in/out through a video codec/ADV converter, SPORT data I/O, SDRAM memory, and FLASH memory.]
14
External Bus Power Estimator
 Previous Approaches
 Used Hamming distance [Benini98]
 Control signals were not considered
 Shared row and column address bus
 Memory state transitions were not considered
 In Our Estimator
 Integrate memory control signal power into the model
 Consider the case where row and column address lines are shared
 Memory state transitions and stalls also cost power
 Consider page miss penalty and traffic reverse penalty

P(bus) = P(page miss) + P(bus turnaround) + P(control signal)
       + P(address generation) + P(data transmission) + P(leakage)
15
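The additive model above maps naturally to code. A sketch where the per-event energy costs are stand-in constants, not measured SDRAM numbers:

```python
def hamming(a, b):
    """Bit toggles between two successive bus values."""
    return bin(a ^ b).count("1")

def bus_power(n_page_miss, n_turnaround, n_commands, addr_pairs, data_pairs,
              e_miss=1.0, e_turn=0.4, e_cmd=0.1, e_bit=0.05, p_leak=0.2):
    """Sum the six components; address/data activity via Hamming distance.

    addr_pairs/data_pairs are (previous, current) bus-value pairs; the e_*
    energy constants are placeholders for calibrated per-event costs.
    """
    addr_toggles = sum(hamming(a, b) for a, b in addr_pairs)
    data_toggles = sum(hamming(a, b) for a, b in data_pairs)
    return (n_page_miss * e_miss        # P(page miss)
            + n_turnaround * e_turn     # P(bus turnaround)
            + n_commands * e_cmd        # P(control signal)
            + addr_toggles * e_bit      # P(address generation)
            + data_toggles * e_bit      # P(data transmission)
            + p_leak)                   # P(leakage)
```

In the real estimator the per-component costs come from the memory technology timing model rather than fixed constants.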
Two External Bus SDRAM Timing Models
[Timing diagrams in system clock cycles (SCLK), with tRP, tRCD, and tCAS annotated:
(a) SDRAM access in sequential command mode – bank 0 and bank 1 requests are serialized; each request issues its PRECHARGE and ACTIVATE, separated by NOPs, before its reads.
(b) SDRAM access in pipelined command mode – bank 1's PRECHARGE/ACTIVATE commands overlap bank 0's read burst.
P - PRECHARGE, A - ACTIVATE, N - NOP, R - READ]
16
Bus Power Simulation Framework
[Flow diagram: Program → Compiler → Target Binary → Instruction Level Simulator → Memory Trace Generator → Memory Hierarchy Model → Memory Power Model → External Bus Power Estimator → Bus Power. The Memory Technology Timing Model feeds the power model. Shaded boxes mark the software modules we developed.]
17
Multimedia Benchmark Configurations

Name      | Description                                                  | I-Cache Size | D-Cache Size
----------|--------------------------------------------------------------|--------------|-------------
MPEG2-ENC | MPEG-2 video encoder with 720x480 4:2:0 input frames.        | 16k          | 16k
MPEG2-DEC | MPEG-2 video decoder of 720x480 sequence with 4:2:2 CCIR frame output. | 16k | 16k
H264-ENC  | H.264/MPEG-4 Part 10 (AVC) digital video encoder for achieving very high data compression. | 16k | 16k
H264-DEC  | H.264/MPEG-4 Part 10 (AVC) video decompression algorithm.    | 16k          | 16k
JPEG-ENC  | JPEG image encoder for 512x512 image.                        | 8k           | 8k
JPEG-DEC  | JPEG image decoder for 512x512 image.                        | 8k           | 8k
PGP-ENC   | Pretty Good Privacy encryption and digital signature of text message. | 8k  | 4k
PGP-DEC   | Pretty Good Privacy decryption of encrypted message.         | 8k           | 4k
G721-ENC  | G.721 voice encoder of 16-bit input audio samples.           | 4k           | 2k
G721-DEC  | G.721 voice decoder of encoded bits.                         | 4k           | 2k
18
Outline
 Research Motivation and Introduction
 Related Work
 Power Estimation Framework
 Optimization I – Power-Aware Bus Arbitration
 Optimization II – Memory Page Remapping
 Summary
19
Optimization I – Bus Arbitration
 Multiple bus access masters in an SOC system
 Processor cores
 Data/Instruction caches
 DMA
 ASIC modules
 Multimedia applications
 High bus bandwidth throughput
 Large memory footprint
 An efficient arbitration algorithm can:
 Increase power awareness
 Increase bus throughput
 Reduce bus power
20
Bus Arbitration Target Region
[Same SOC block diagram as before, now with arbitration enabled in the EBIU: the media processor core, caches, and DMA controllers contend for the external bus to SDRAM, FLASH memory, and asynchronous devices. The bus arbitration target region covers the EBIU and external bus.]
21
Bus Arbitration Schemes
 EBIU with arbitration enabled
 Handles core-to-memory and core-to-peripheral communication
 Resolves bus access contention
 Schedules bus access requests
 Traditional Algorithms
 First Come First Serve (FCFS)
 Fixed Priority
 Power Aware Algorithms (categorized by power metric / cost function)
 Minimum Power (P1D0) or (1, 0)
 Minimum Delay (P0D1) or (0, 1)
 Minimum Power-Delay Product (P1D1) or (1, 1)
 Minimum Power-Delay-Square Product (P1D2) or (1, 2)
 More generically (PnDm) or (n, m)
22
Bus Arbitration Schemes (Continued)
 Power Aware Arbitration
 From the current pending requests in the waiting queue, find a permutation of the external bus requests that achieves the minimum total power and/or performance cost.
 Reducible to the minimum Hamiltonian path problem in a graph G(V,E).
 Vertex = Request R(t,s,b,l)
 t – request arrival time
 s – starting address
 b – block size
 l – read / write
 Edge = Transition from Request i to Request j
 edge weight w(i,j) is the cost of the transition
23
Minimum Hamiltonian Path Problem
[Graph: R0 connected to R1, R2, R3 by weighted edges such as w(1,3) and w(3,1). R0 – last request on the bus, which must be the starting point of a path. R1, R2, R3 – requests in the queue.]

w(i,j) = P(i,j)ⁿ·D(i,j)ᵐ
P(i,j) – Power of Rj after Ri
D(i,j) – Delay of Rj after Ri

Example Hamiltonian path: R0→R3→R1→R2, with minimum path weight = w(0,3) + w(3,1) + w(1,2).
Finding the minimum Hamiltonian path is an NP-complete problem.
24
Greedy Solution
 Greedy algorithm (local minimum): only the next request in the path is needed:
 min{ w(0,j) | w(0,j) is an edge weight of graph G(V,E) }
 In each iteration of arbitration:
 1. A new graph G(V,E) is constructed.
 2. The greedy-solution request is granted the bus.
25
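The greedy step above is a single minimum over pending requests. A minimal sketch, where the cost functions are hypothetical stand-ins for the PEU's power/delay estimates:

```python
def arbitrate(last, pending, cost_power, cost_delay, n=1, m=0):
    """Grant the pending request minimizing w = P(last, r)^n * D(last, r)^m."""
    return min(pending, key=lambda r: cost_power(last, r) ** n * cost_delay(last, r) ** m)

# Toy cost model (illustrative only): requests are (bank, row) pairs, and a
# request hitting the currently open page is cheap while a page miss is expensive.
def p_cost(last, req):
    return 1.0 if last == req else 10.0

def d_cost(last, req):
    return 2.0 if last == req else 10.0

last = (0, 5)
pending = [(0, 7), (0, 5), (1, 3)]
granted = arbitrate(last, pending, p_cost, d_cost, n=1, m=0)   # the (1,0) scheme
```

Each arbitration iteration re-evaluates the costs against the new bus state, matching step 1 above.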
Experimental Setup
 Utilized the embedded power modeling framework
 Implemented eleven different arbitration schemes inside the EBIU
 FCFS, Fixed Priority
 Minimum power (P1D0) or (1, 0), minimum delay (P0D1) or (0, 1), and (1, 1), (1, 2), (2, 1), (1, 3), (3, 1), (3, 2), (2, 3)
 10 multimedia application benchmarks were ported to the Blackfin architecture and simulated, including MPEG-2, H.264, JPEG, PGP and G.721.
26
Power Improvement
[Bar charts: MPEG2 encoder and decoder average external bus power (mW) in sequential vs. pipelined command modes, across arbitration algorithms FP, FCFS, (0,1), (1,0), (1,1), (1,2), (2,1), (1,3), (2,3), (3,2), (3,1).]
 Power-aware arbitration schemes have lower power consumption than Fixed Priority and FCFS.
 The difference across power-aware arbitration strategies is small.
 The pipelined command model yields 6-7% power savings over the sequential command model for MPEG2 ENC & DEC.
 The results are consistent across all other benchmarks.
27
Speed Improvement
[Bar charts: MPEG2 encoder and decoder average external bus delay (SCLK cycles) in sequential vs. pipelined command modes, across the same arbitration algorithms.]
 Power-aware schemes have smaller bus delay than traditional Fixed Priority and FCFS.
 The difference across power-aware arbitration strategies is small.
 The pipelined command model achieves a 3-9% speedup over the sequential command model for MPEG2 ENC & DEC.
 The results are consistent across all other benchmarks.
28
Comparison with Exhaustive Algorithm
[Graph example on R0-R3: exhaustive search finds a path of cost 17 while greedy search finds 18.]
 The greedy algorithm can fail in certain cases.
 Complexity of O(n) vs. O(n!).
 The performance difference is negligible.
29
Comments on Experimental Results
 Power aware arbitrators significantly reduce the external bus power for all 8 benchmarks; on average there is a 14% power saving.
 Power aware arbitrators also reduce the bus access delay; delays are reduced by 21% on average across the 8 benchmarks.
 The pipelined SDRAM model has a big performance advantage over the sequential SDRAM model: it achieves a 6% power saving and a 12% speedup.
 Power and delay on the external bus are highly correlated; minimum power also achieves minimum delay.
 Minimum power schemes lead to simpler design options; scheme (1,0) is preferred due to its simplicity.
30
Design of A Power Estimation Unit (PEU)
[Diagram: the PEU holds open row address registers for banks 0-3, a last bank address register, and a last column address register. For the next request address (bank addr / row addr / column addr): if the bank address differs from the last bank address, it outputs bank miss power; if the row address differs from that bank's open row address, it outputs page miss penalty power and updates the register; the column address data power is calculated from the Hamming distance against the last column address, which is then updated. The sum is the estimated power.]
31
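The PEU's register-and-compare flow can be sketched in software. A hedged model where the energy constants are placeholders, not hardware-derived values:

```python
def hamming(a, b):
    return bin(a ^ b).count("1")

class PEU:
    """Sketch of the PEU decision flow; e_* energy constants are placeholders."""
    def __init__(self, banks=4, e_bank=5.0, e_page=8.0, e_bit=0.1):
        self.open_row = [None] * banks   # per-bank open row address registers
        self.last_bank = None            # last bank address register
        self.last_col = 0                # last column address register
        self.e_bank, self.e_page, self.e_bit = e_bank, e_page, e_bit

    def estimate(self, bank, row, col):
        cost = 0.0
        if bank != self.last_bank:                # bank address compare
            cost += self.e_bank                   # bank miss power
            self.last_bank = bank
        if self.open_row[bank] != row:            # row vs. open row compare
            cost += self.e_page                   # page miss penalty power
            self.open_row[bank] = row
        cost += self.e_bit * hamming(self.last_col, col)  # column data power
        self.last_col = col                       # update last column register
        return cost
```

A hardware PEU evaluating several candidate requests would only commit the register updates for the request that is actually granted; this sketch updates state on every estimate for brevity.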
Two Arbitrator Implementation Structures
[Diagrams: in the shared PEU structure, the request queue buffer entries (t, s, b, l) are evaluated one at a time by a single Power Estimator Unit; in the dedicated PEU structure, each queue entry has its own PEU. In both, a comparator selects the minimum power request, the access command generator drives the external bus, and memory/bus state info is fed back through a state update path.]
32
Performance of Two Structures
[Line charts: MPEG-2 encoder and decoder (1,0) arbitrator average delay (cycles) vs. estimator logic delay (0-10 cycles), comparing shared and dedicated PEU structures.]
 Higher PEU delay lowers the external bus performance for both the MPEG-2 encoder and decoder.
 When the PEU delay is 5 cycles or higher, the dedicated structure is preferred over the shared structure; otherwise, the shared structure is sufficient.
33
Summary of Bus Arbitration Schemes
 Efficient bus arbitration can provide benefits to both power and performance over traditional arbitration schemes.
 Minimum power and minimum delay are highly correlated on the external bus.
 The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.
 Arbitration scheme (1, 0) is recommended.
 The minimum power approach provides more design options and leads to simpler design implementations. The trade-off between design complexity and performance was presented.
34
Outline
 Research Motivation and Introduction
 Related Work
 Power Estimation Framework
 Optimization I – Power-Aware Bus Arbitration
 Optimization II – Memory Page Remapping
 Summary
35
Data Access Pattern in Multimedia Apps
[Plots of address vs. time for three common data access patterns: Fixed Stride, 2-Way Stream, and 2-D Stride.]
 3 common data access patterns in multimedia applications
 Majority of cycles spent in loop bodies and array accesses
 High data access bandwidth
 Poor locality, cross-page references
36
Previous Work on Access Patterns
 Previous work was performance driven and took an OS/compiler-based approach
 Data Pre-fetching [Chen94] [Zhang00]
 Memory Customization [Adve00] [Grun01]
 Data Layout Optimization [Catthoor98] [DeLaLuz04]
 Shortcomings of OS/compiler-based strategies:
 Multimedia benchmarks' dominant activities are within large monolithic data buffers.
 Buffers generally contain many memory pages and cannot be further optimized.
 Constrained by OS and compiler capability; poor flexibility.
37
Optimization II – Page Remapping
 A technique currently used for large-memory-space peripheral memory access
 External memories in embedded multimedia systems
 High bus access overhead
 Page miss penalty
 Efficient page remapping can:
 Reduce page misses
 Improve external bus throughput
 Reduce power / energy consumption
38
Page Remapping Target Region
[Same SOC block diagram: media processor core, caches, internal bus, DMA controllers, and EBIU connecting over the external bus to SDRAM, FLASH memory, and asynchronous devices. The page remapping target region covers the EBIU, external bus, and SDRAM.]
39
SDRAM Memory Pages
 High memory access latency; minimum latency of one SCLK cycle
 Page miss penalty
 Additional latency due to refresh cycles
 No guaranteed access due to arbitration logic
 Non-sequential reads/writes suffer
[Diagram: SDRAM organized as banks 0 to M-1, each with pages 0 to N-1; X marks accesses and X* marks page misses where consecutive accesses within a bank touch different pages.]
40
SDRAM Page Miss Penalty
[Timing diagram in system clock cycles (SCLK): each page miss inserts a PRECHARGE (tRP) and ACTIVATE (tRCD) before the READ burst (tCAS) and its data, so repeated misses interleave P-A-R-R-R-R command groups with D-D-D-D data groups.
P - PRECHARGE, A - ACTIVATE, R - READ, N - NOP, D - DATA]
41
SDRAM Timing Parameters

SDRAM parameter      | SCLK cycles
---------------------|------------
trcd (= tras + trp)  | 1-15
trp                  | 1-7
tras                 | 1-15
tcas                 | 2-3

Access type    | Number of cycles
---------------|------------------
Read cycle     | trp + n*(tcas)
Write cycle    | twp
Page miss      | trp + trcd
Refresh cycle  | 2*(trcd) * nrows

twp = write to precharge, trp = read to precharge, tras = activate to precharge, tcas = read latency

~8-10 SCLK penalty associated with a page miss
42
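The table's cost formulas make the page miss penalty easy to check numerically. A quick calculator, with parameter values chosen from the listed ranges (example values, not a specific SDRAM part):

```python
def read_cycle(trp, tcas, n):
    """Read cycle cost in SCLK: trp + n*(tcas), per the table above."""
    return trp + n * tcas

def page_miss(trp, trcd):
    """Page miss cost in SCLK: trp + trcd, per the table above."""
    return trp + trcd

# e.g. trp=3, trcd=5..7 gives 8-10 SCLK, matching the quoted penalty range.
penalty_low = page_miss(trp=3, trcd=5)    # 8 SCLK
penalty_high = page_miss(trp=3, trcd=7)   # 10 SCLK
```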
SDRAM Page Access Sequence (I)
[Diagram: 12 reads spread across 4 banks, laid out so that successive reads hit different pages; every read incurs a Precharge-Activate-Read (PAR) group on the system clock.]
Typical access pattern of 2-D stride / 2-way stream.
Poor data layout causes significant access overhead.
P – Precharge, A – Activation, R – Read
43
SDRAM Page Access Sequence (II)
[Diagram: the same 12 reads distributed across 4 banks so each bank keeps one page open; only the first access to each bank incurs a PAR group, after which reads proceed back-to-back.]
Less access overhead with a distributed data layout.
P – Precharge, A – Activation, R – Read
44
Why We Use Page Remapping
[Diagram: accesses originally concentrated in one bank are spread across banks 0-3 by remapping. The page remapping entry for page 2 is {2,0,1,3}, redirecting its bank assignment.]
45
Page Remapping Module in an SOC System
 The address translation unit only translates the bank address
 A non-MMU system inserts a page remapping module before the EBIU
 An MMU system can take advantage of the existing address translation unit; no extra hardware is needed
[Diagram: the page remapping module sits between the internal bus and the EBIU, which drives the external bus to SDRAM, FLASH memory, and asynchronous devices.]
46
Sequence (I) after Remapping
[Diagram: after remapping, the 12 reads from sequence (I) are redirected across the 4 banks; only four PAR groups remain, followed by back-to-back reads.]
Same performance as sequence (II).
Applicable to monolithic data buffers (e.g. frame buffers).
P – Precharge, A – Activation, R – Read
47
Page Remapping Algorithm
 An NP-complete problem.
 Reducible to a graph coloring problem on a page transition graph G(V,E).
 Vertex = Page Im,n
 m – page bank number
 n – page row number
 Edge = Transition from Page Im,n to Page Ip,q
 weighted edges capture page traversal during program execution
 edge weight is the number of transitions from Page Im,n to Page Ip,q
 Color = Bank
 Each bank has one distinct color.
 Every page is assigned one color.
48
Page Remapping Algorithm (continued)
 Page Remapping Algorithm
 From the page transition graph, find the color (bank) assignment for each page such that the transition cost between same-color pages is minimized.
 Algorithm Steps:
 Sort the edges based on their transition weight
 Edges are processed in decreasing weight order
 Color the pages associated with each edge
 A weight parameter array for each page represents the cost of mapping that page into each bank, e.g. {500, 200, 0, 0}
 5 different situations arise when processing each edge
 A page remapping table (PMT) is generated as the result of the mapping.
49
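The steps above can be sketched as a greedy edge-driven coloring. This is a simplified reading of the slides: ties are broken by lowest bank index, and the same-row and conflict-resolution cases from the worked example are omitted:

```python
from collections import defaultdict

def remap(edges, n_banks=4):
    """Greedy page coloring sketch. edges: (page_a, page_b, weight) tuples.

    Edges are processed in decreasing weight order; an unmapped page is
    assigned the bank with the lowest accumulated cost in its weight
    parameter array. Returns {page: bank}.
    """
    cost = defaultdict(lambda: [0] * n_banks)   # per-page weight parameter array
    color = {}
    for a, b, w in sorted(edges, key=lambda e: -e[2]):
        if a not in color:
            color[a] = min(range(n_banks), key=lambda k: cost[a][k])
        cost[b][color[a]] += w      # placing b in a's bank would cost w
        if b not in color:
            color[b] = min(range(n_banks), key=lambda k: cost[b][k])
        cost[a][color[b]] += w      # and symmetrically for a
    return color

banks = remap([("I0,0", "I0,1", 500), ("I1,1", "I1,2", 200)])
```

Pages linked by heavy edges end up in different banks, which is exactly what minimizes same-bank page transitions.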
Example Case
Original page allocation:

        Bank 0  Bank 1  Bank 2  Bank 3
Page 0  I0,0
Page 1  I0,1    I1,1    I2,1    I3,1
Page 2          I1,2
Page 3          I1,3

Page transition graph edge weights: (I0,0, I0,1) = 500, (I1,1, I1,2) = 200, (I0,0, I3,1) = 100, (I1,2, I2,1) = 80, (I3,1, I1,3) = 60, (I1,1, I3,1) = 50, (I2,1, I1,3) = 40, (I0,0, I1,1) = 30.
50
Initial Step
No page is mapped. All slots in the 4-bank, 4-page target table are available.
51
Step (1) – two unmapped pages
Selected Edge: (I0,0, I0,1), weight 500
Actions: Allocate unmapped pages I0,0 (bank 0) and I0,1 (bank 1).
Weight Parameter Updates:
I0,0[0]: { 0, 500, 0, 0}
I0,1[1]: { 500, 0, 0, 0}
52
Step (2) – two unmapped pages
Selected Edge: (I1,1, I1,2), weight 200
Actions: Allocate unmapped pages I1,1 (bank 0) and I1,2 (bank 1).
Weight Parameter Updates:
I1,1[0]: { 0, 200, 0, 0}
I1,2[1]: { 200, 0, 0, 0}
53
Step (3) – one unmapped page
Selected Edge: (I0,0, I3,1), weight 100
Actions: Map page I3,1 (bank 2); no change for I0,0.
Weight Parameter Updates:
I3,1[2]: { 100, 0, 0, 0}
I0,0[0]: { 0, 500, 100, 0}
54
Step (4) – one unmapped page
Selected Edge: (I1,2, I2,1), weight 80
Actions: Map page I2,1 (bank 3); no change for I1,2.
Weight Parameter Updates:
I2,1[3]: { 0, 80, 0, 0}
I1,2[1]: { 200, 0, 0, 80}
55
Step (5) – one unmapped page
Selected Edge: (I3,1, I1,3), weight 60
Actions: Map page I1,3 (bank 0); no change for I3,1.
Weight Parameter Updates:
I1,3[0]: { 0, 0, 60, 0}
I3,1[2]: { 160, 0, 0, 0}
56
Step (6) – same row pages
Selected Edge: (I1,1, I3,1), weight 50
Actions: Both I1,1 and I3,1 are on the same row; no actions.
57
Step (7) – two mapped pages
Selected Edge: (I2,1, I1,3), weight 40
Actions: Both I2,1 and I1,3 are mapped; no conflicts.
Weight Parameter Updates:
I1,3[0]: { 0, 0, 60, 40}
I2,1[3]: { 40, 80, 0, 0}
58
Step (8) – conflict resolving
Selected Edge: (I0,0, I1,1), weight 30
Actions: Both I0,0 and I1,1 are mapped and in the same bank; the conflict is resolved by remapping I1,1, using the weight parameter arrays to pick the cheapest bank.
Current Weight Parameters:
I0,1[1]: { 500, 0, 0, 0}
I1,1[0]: { 30, 200, 0, 0}
I2,1[3]: { 40, 80, 0, 0}
I3,1[2]: { 160, 0, 0, 0}
Updated Weight Parameters:
I0,0[0]: { 0, 500, 100, 30}
[Final allocation table: I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, and I3,1 spread across the four banks with no conflicts.]
59
Generated PMT Table
[Diagram: the Page Remapping Table (4kB) holds a 2-bit new bank address per page (xx = unmapped) for pages I0,0 through I3,1. The external memory address from the I-cache/D-cache is split into a memory page address (14 bits, including the 2-bit bank address) and a row/column address (22 bits); the PMT output replaces the bank address before the EBIU accesses the 16MB external SDRAM.]
60
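The address-side use of the PMT translates only the bank bits and passes everything else through. A sketch where the exact bit positions are assumptions for illustration (the slide gives the field widths but not their offsets):

```python
def translate(addr, pmt, page_shift=10, bank_shift=22, bank_bits=2):
    """Replace only the bank bits of an address through the PMT.

    pmt maps page number -> new bank; pages with no entry ("xx" slots in the
    table) pass through untranslated. page_shift/bank_shift are illustrative.
    """
    page = addr >> page_shift            # page number indexes the PMT
    new_bank = pmt.get(page)
    if new_bank is None:
        return addr                      # unmapped page: identity mapping
    mask = ((1 << bank_bits) - 1) << bank_shift
    return (addr & ~mask) | (new_bank << bank_shift)
```

Because only bank bits change, the translation is transparent to row/column addressing, which is what lets an existing MMU absorb it with no extra hardware.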
Experimental Setup
 Utilized the embedded power modeling framework
 Extended the address translation unit for page remapping
 A page coloring program generates the PMT
 Same 10 multimedia application benchmarks
 MPEG-2 encoder and decoder
 H.264 encoder and decoder
 JPEG encoder and decoder
 PGP encoder and decoder
 G.721 encoder and decoder
61
Page Miss Reduction
[Bar chart: page misses per 100 requests for 2-, 4-, and 8-bank configurations, original vs. remapped, across MPEG2-ENC/DEC, H264-ENC/DEC, JPEG-ENC/DEC, PGP-ENC/DEC, and G721-ENC/DEC.]
62
External Bus Power
[Bar chart: external bus power (mW) for 2-, 4-, and 8-bank configurations, original vs. remapped, across the same ten benchmarks.]
63
Average Access Delay
[Bar chart: average request delay (cycles) for 2-, 4-, and 8-bank configurations, original vs. remapped, across the same ten benchmarks.]
64
Comments on Page Remapping
 The page remapping algorithm was presented by example.
 Our algorithm significantly reduces the memory page miss rate, by 70-80% on average.
 For a 4-bank SDRAM memory system, we reduced external memory access time by 12.6%.
 The proposed algorithm reduces power consumption in the majority of the benchmarks, by 13.2% on average.
 Combining the effects on both power and delay, our algorithm significantly benefits the total energy cost.
 A stability study was done in the dissertation: a PMT generated from one test vector input performs well on different inputs.
65
Outline
 Research Motivation and Introduction
 Related Work
 Power Estimation Framework
 Optimization I – Power-Aware Bus Arbitration
 Optimization II – Memory Page Remapping
 Summary
66
Summary
 Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.
 Built an external bus power estimation framework and experimental methodology.
 PACS'04
 Proposed a series of power aware bus arbitration schemes and their performance improvements over traditional schemes.
 HiPEAC'05; also appeared in LNCS Transactions on High Performance Embedded Architectures and Compilers
 Proposed a page remapping algorithm to reduce page misses, with its power and delay improvements.
 LCTES'07
67
Future Work
 Integration of the power estimation framework into a complete tool chain
 Extend arbitration schemes to multiple memory interfaces and other peripheral interfaces
 Compare the performance of page remapping with corresponding OS/compiler schemes
68
Thank You !
69