Low Power Processor Architectures and Software Optimization

ECE 260C – VLSI Advanced Topics
Term paper presentation
Low Power Processor Architectures and
Software Optimization Techniques
May 27, 2014
Keyuan Huang
Ngoc Luong
Motivation
Global Mobile Devices and Connections Growth
 ~10 billion mobile devices in 2018
 Moore’s law is slowing down
 Power dissipation per gate remains
unchanged
 How to reduce power?
 Circuit level optimizations (DVFS, power gating,
clock gating)
 Microarchitecture optimization techniques
 Compiler optimization techniques
Trend: More innovations on architectural and software techniques to optimize power consumption
Low Power Architectures Overview
 Asynchronous Processors
 Eliminate the clock and use handshake protocols
 Save clock power at the cost of higher area
 Ex: SNAP, ARM996HS, Sun Sproull.
 Application Specific Instruction Set Processors
 Applications: cryptography, signal processing, vector processing, physical simulation, computer graphics
 Combine basic instructions with custom instructions based on the application
 Ex: Tensilica’s Xtensa, Altera’s NIOS, Xilinx MicroBlaze, Sony’s Cell, IRAM, Intel’s EXOCHI
 Reconfigurable Instruction Set Processors
 Combine fixed core with reconfigurable logic (FPGA)
 Lower NRE cost than ASIPs
 Ex: Chimaera, GARP, PRISC, Warp, Tensilica’s Stenos, OptimoDE, PICO
 No Instruction Set Computer
 Build a custom datapath based on the application code
 Compiler has low-level control of hardware resources
 Ex: WISHBONE system.
Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of computers 4.10 (2009).
Conservation Cores (C-cores)
 Combine a GP processor with ASIP-like specialized cores to reduce energy and energy-delay for a range of applications
 Broader range of applications compared to an accelerator
 Reconfigurable via a patching algorithm
 Automatically synthesizable by a toolchain from C source code
 Energy consumption is reduced by up to 16x for individual functions and 2.1x for the whole application
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
C-core organization
 Data path (FU, mux, register)
 Control unit (state machine)
 Cache interface (ld, st)
 Scan chain (CPU interface)
C-core execution
 The compiler inserts stubs into code regions compatible with a c-core
 At run time, the stub chooses between the c-core and the CPU: if a matching c-core is available, it executes there; otherwise the general-purpose processor runs the original code
 The c-core raises an exception when it finishes executing and returns the value to the CPU
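The dispatch logic described above can be sketched as follows. This is a minimal illustration, not the paper's actual mechanism: the function names and the availability flag are hypothetical stand-ins for what the c-core toolchain generates automatically.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stub the compiler would insert around a hot function.
 * If a matching c-core exists, execution is handed off to it;
 * otherwise the original software version runs on the GP CPU. */

static bool ccore_available = false;     /* set by a runtime probe */

static int sum_sw(const int *a, int n) { /* software fallback path */
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

static int sum_ccore(const int *a, int n) {
    /* Placeholder: a real c-core would run its datapath in hardware
     * and raise an exception on completion to return control. */
    return sum_sw(a, n);
}

int sum(const int *a, int n) {           /* compiler-inserted stub */
    return ccore_available ? sum_ccore(a, n) : sum_sw(a, n);
}
```

Either path must produce identical results, which is what lets the runtime pick freely based on c-core availability.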
Patching support
 Basic block mapping
 Control flow mapping
 Register mapping
 Patch generation
Patching Example
 Configurable constants
 Generalized single-cycle datapath operators
 Control flow changes
Results
 18 fully placed-and-routed c-cores vs. a MIPS baseline
 3.3x – 16x energy efficiency improvement
 Reduce system energy consumption by up to 47%
 Reduce energy-delay by up to 55% at the full application level
 Even higher energy savings without patching support
Software Optimization Techniques
 The memory system consumes a large share of power (1/10 to 1/4) in portable computers
 System bus switching activity can be controlled by software
 ALU and FPU data paths need good scheduling to avoid pipeline stalls
 Control-logic and clock power are reduced by using the shortest possible program for the computation
K. Roy and M. C. Johnson, Software design for low power, 1996 :NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics
General categories of software optimization
 Minimizing memory accesses
   Minimize the accesses needed by the algorithm
   Minimize the total memory size needed by the algorithm
   Use multiple-word parallel loads, not single-word loads
 Optimal selection and sequencing of machine instructions
 Instruction packing
 Minimizing circuit state effects
 Operand swapping
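A minimal sketch of the first category, minimizing memory accesses. The function names are illustrative only; the point is that keeping the accumulator in a register instead of a memory-resident location removes a load and a store from every loop iteration (the `volatile` global here stands in for a genuinely memory-resident value).

```c
#include <assert.h>

volatile int acc_mem;   /* memory-resident accumulator: every access
                         * really goes through memory */

/* Naive version: one load + one store of the accumulator
 * per iteration, on top of the array loads. */
int dot_naive(const int *a, const int *b, int n) {
    acc_mem = 0;
    for (int i = 0; i < n; i++)
        acc_mem += a[i] * b[i];
    return acc_mem;
}

/* Optimized version: the accumulator lives in a register,
 * so the loop body touches memory only for the array reads. */
int dot_opt(const int *a, const int *b, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```

Both versions compute the same result; only the memory traffic, and hence the memory-system energy, differs.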
Compiler Managed Partitioned Data Caches for Low Power
Rajiv Ravindran, Michael Chu, Scott Mahlke
Basic Idea: Compiler Managed, Hardware Assisted
 Hardware techniques: banking, dynamic voltage/frequency scaling, dynamic resizing
  + Transparent to the user
  + Handle arbitrary instruction/data accesses
  ー Limited program information
  ー Conservative
 Software techniques: software-controlled scratch-pad, data/code reorganization
  + Whole-program information
  + Proactive
 Combining them gives global program knowledge, proactive optimizations, and efficient execution
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
Traditional Cache Architecture
[Figure: 4-way set-associative cache; the address splits into tag, set, and offset; all four tag/data ways and LRU bits are read, four comparators (=?) feed a 4:1 mux, and the LRU bits drive replacement]
• Lookup → activate all ways on every access
• Replacement → choose among all the ways
Disadvantages
– Fixed replacement policy
– Set index ignores program locality
– Set-associativity has high overhead
– Activates multiple data/tag arrays per access
Partitioned Cache Architecture
[Figure: same 4-way structure, but each way is a partition P0–P3; a Ld/St Reg [Addr] instruction carries a k-bit partition vector and an R/U flag that steer lookup and replacement]
• Lookup → restricted to the partitions specified in the bit-vector if ‘R’, else defaults to all partitions
• Replacement → restricted to the partitions specified in the bit-vector
Advantages
+ Improves performance by controlling replacement
+ Reduces cache access power by restricting the number of accesses
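The lookup rule can be sketched in a few lines. This is an illustrative model, not the hardware description from the paper: the partition count and the function name are assumptions, and real hardware would gate the tag/data arrays rather than count them.

```c
#include <assert.h>

#define NPART 4   /* assumed number of cache partitions */

/* How many partitions are probed for one access:
 * with the 'R' (restricted) flag set, only partitions named in the
 * k-bit vector are activated; otherwise all partitions are probed,
 * as in a conventional set-associative cache. */
int partitions_probed(unsigned bitvec, int restricted) {
    if (!restricted)
        return NPART;              /* 'U': activate all ways */
    int count = 0;
    for (int p = 0; p < NPART; p++)
        if (bitvec & (1u << p))
            count++;               /* only partitions in the vector */
    return count;
}
```

The power saving comes directly from this count: each partition left out of the bit-vector is a tag/data array that never switches for this access.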
Partitioned Caches: Example
(a) Annotated code segment (y → ld1/st1, ld2/st2; w1/w2 → ld5, ld6; x → ld3, ld4):
for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];
}
(b) Fused load/store instructions
(c) Trace consisting of array references, cache blocks, and load/stores from the example
(d) Actual cache partition assignment for each instruction:
 part-0 ← y (ld1, st1, ld2, st2): ld1 [100], R
 part-1 ← w1/w2 (ld5, ld6): ld5 [010], R
 part-2 ← x (ld3, ld4): ld3 [001], R
Compiler Controlled Data Partitioning
 Goal: place loads/stores into cache partitions
 Analyze the application’s memory characteristics
  Cache requirements → number of partitions per ld/st
  Predict conflicts
 Place loads/stores into different partitions
  Satisfy each instruction’s caching needs
  Avoid conflicts; overlap if possible
Cache Analysis:
Estimating Number of Partitions
• Minimal partitions to avoid conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the reuse distance to compute the number of partitions
[Figure: access trace for the j-loop (X W1 Y Y X W1 Y Y) and the k-loop (X W2 Y Y X W2 Y Y); blocks B1 and B2 miss (M) on each pass]
• M has reuse distance = 1
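Reuse distance, the quantity the analysis above is built on, is easy to state in code: the number of distinct blocks touched between an access and the previous access to the same block. The following O(n²) sketch over a single-letter block-id trace is my own illustration, not the paper's implementation.

```c
#include <assert.h>

/* Reuse distance of the access at position pos in a trace of
 * block ids (assumed to be capital letters 'A'..'Z'):
 * the number of DISTINCT blocks accessed since the previous
 * access to the same block, or -1 for a first access. */
int reuse_distance(const char *trace, int pos) {
    char target = trace[pos];
    int seen[26] = {0};
    int distinct = 0;
    for (int i = pos - 1; i >= 0; i--) {
        if (trace[i] == target)
            return distinct;           /* found the previous access */
        if (!seen[trace[i] - 'A']) {
            seen[trace[i] - 'A'] = 1;  /* count each block once */
            distinct++;
        }
    }
    return -1;                         /* no previous access */
}
```

For example, in the trace `ABBA` the second `A` has reuse distance 1 (only `B` intervenes, counted once), while in `ABCA` it has distance 2.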
Cache Analysis:
Estimating Number Of Partitions
 Avoid conflict/capacity misses for an instruction
 Estimate the hit-rate based on
• reuse-distance (D), total number of cache blocks (B), and associativity (A) (Brehob et al., ’99)
[Table: estimated hit rate for B = 1, 2, 3, 4, 8, 16, 24, 32 blocks at reuse distances D = 0, 1, 2; the rate is 1 at D = 0 and approaches 1 as B grows, e.g. B = 8 gives ≈ .87 at D = 1 and ≈ .76 at D = 2]
 In practice, compute energy matrices
 Pick the most energy-efficient configuration per instruction
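The hit-rate model cited above can be written out explicitly. The following is my reconstruction from the stated parameters (D, B, A), not a formula copied from the slide: a reference with reuse distance D hits if fewer than A of the D distinct intervening blocks map into its set, and the intervening blocks are assumed to map uniformly at random across the S = B/A sets.

```latex
% Estimated probability that a reference with reuse distance D hits
% in a cache of B blocks with associativity A (S = B/A sets),
% assuming intervening blocks map uniformly at random:
P_{\mathrm{hit}}(D, B, A) = \sum_{i=0}^{A-1} \binom{D}{i}
    \left(\frac{A}{B}\right)^{i}
    \left(1 - \frac{A}{B}\right)^{D-i}
```

For a direct-mapped cache (A = 1) this reduces to (1 − 1/B)^D, which agrees with the surviving table values: B = 8, D = 2 gives (7/8)² ≈ 0.766, the “.76” entry, and D = 1 gives 7/8 = 0.875, the “.87” entry.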
Cache Analysis:
Computing Interferences
 Avoid conflicts among temporally co-located references
 Model conflicts using an interference graph
[Figure: the trace X W1 Y Y X W1 Y Y / X W2 Y Y X W2 Y Y labeled by instruction as M4 M2 M1 M1 M4 M2 M1 M1 / M4 M3 M1 M1 M4 M3 M1 M1, with an interference graph over nodes M1–M4, each with D = 1]
Partition Assignment
 Placement phase can overlap references
 Compute the combined working-set
 Use the graph-theoretic notion of a clique
 For each clique, new D → Σ D of each node
 Combined D for all overlaps → Max (over all cliques)
[Figure: interference graph over M1–M4 (each D = 1) with Clique 1 and Clique 2 marked; final assignment: part-0 ← ld1, st1, ld2, st2 (y); part-1 ← ld5, ld6 (w1/w2); part-2 ← ld3, ld4 (x); ld1 [100], R; ld5 [010], R; ld3 [001], R]
Clique 1: M1, M2, M4 → new reuse distance (D) = 3
Clique 2: M1, M3, M4 → new reuse distance (D) = 3
Combined reuse distance → Max(3, 3) = 3
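The combination rule above reduces to two small reductions: sum within a clique, maximum across cliques. The following sketch uses hypothetical fixed-size arrays in place of the compiler's actual graph structures.

```c
#include <assert.h>

/* Combined working-set estimate for overlapped references:
 * within a clique of temporally co-located references the reuse
 * distances add; across cliques, the combined distance is the max. */

int clique_distance(const int *d, int n) {   /* sum within a clique */
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += d[i];
    return sum;
}

int combined_distance(const int cliques[][8], const int *sizes, int m) {
    int max = 0;                             /* max across cliques */
    for (int c = 0; c < m; c++) {
        int s = clique_distance(cliques[c], sizes[c]);
        if (s > max)
            max = s;
    }
    return max;
}
```

Running it on the slide's example (two cliques of three nodes, each node with D = 1) reproduces the combined reuse distance of 3.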
Experimental Setup
 Trimaran compiler and simulator infrastructure
 ARM9 processor model
 Cache configurations:
  1 KB to 32 KB
  32-byte block size
  2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
 Mediabench suite
 CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks
[Figure: average way accesses (0–8) for 2-, 4-, and 8-partition caches across cache sizes 1 KB to 32 KB, plus the overall average]
• 25%, 30%, 36% access reduction on a 2-, 4-, 8-partition cache
Improvement in Fetch Energy
[Figure: percentage fetch-energy improvement (0–60%) on a 16 KB cache, comparing 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way, across the Mediabench programs (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and their average]
• 8%, 16%, 25% energy reduction on a 2-, 4-, 8-partition cache
Summary
Maintain the advantages of a hardware cache
Expose placement and lookup decisions to the compiler
Avoid conflicts, eliminate redundancies
Achieve higher performance and lower power consumption
Future Work
 Hybrid scratch-pads and caches
 Develop an advanced toolchain for newer technology nodes such as 28 nm
 Incorporate the ability to partition the data cache into the compiler of the ASIP toolchain
References
1. Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).
2. Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News, Vol. 38, No. 1. ACM, 2010.
3. Ravindran, R., Chu, M., Mahlke, S.: "Compiler Managed Partitioned Data Caches for Low Power." In: LCTES 2007 (2007).
4. Roy, K., and Johnson, M. C.: "Software Design for Low Power." NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.