rravindr-lctes07 - University of Michigan

Compiler Managed Partitioned Data
Caches for Low Power
Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture Lab
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor
* Currently with the Java, Compilers, and Tools Lab, Hewlett Packard, Cupertino, California
University of Michigan
Electrical Engineering and Computer Science
Introduction: Memory Power
• On-chip memories are a major contributor to system energy
• Data caches: ~16% in StrongARM [Unsal et al., ’01]
Hardware: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instr/data accesses
– Limited program information
– Reactive
Software: software-controlled scratch-pad, data/code reorganization
+ Whole program information
+ Proactive
– No dynamic adaptability
– Conservative
Reducing Data Memory Power:
Compiler Managed, Hardware Assisted
• Global program knowledge
• Proactive optimizations
• Dynamic adaptability
• Efficient execution
• Aggressive software optimizations
Data Caches: Tradeoffs
Advantages
+ Capture spatial/temporal locality
+ Transparent to the programmer
+ More general than software scratch-pads
+ Efficient lookups
Disadvantages
– Fixed replacement policy
– Set index carries no program locality
– Set-associativity has high overhead
– Activate multiple data/tag arrays per access
Traditional Cache Architecture
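[Figure: 4-way set-associative cache. The address is split into tag, set, and offset; each way holds tag, data, and LRU state; four tag comparators (=?) feed a 4:1 mux; replacement logic selects among the ways.]
• Lookup: activate all ways on every access
• Replacement: choose among all the ways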
Partitioned Cache Architecture
[Figure: partitioned cache. Each memory instruction has the form Ld/St Reg [Addr] [k-bitvector] [R/U]; the address is split into tag, set, and offset; partitions P0 to P3 each hold tag, data, and LRU state; four tag comparators (=?) feed a 4:1 mux; replacement logic.]
• Lookup: restricted to the partitions specified in the bit-vector if ‘R’, else defaults to all partitions
• Replacement: restricted to the partitions specified in the bit-vector
• Advantages
– Improve performance by controlling replacement
– Reduce cache access power by restricting the number of partitions accessed
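The lookup and replacement rules of the partitioned cache can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `PartitionedCache`, the one-block-per-(set, partition) organization, the timestamp-based LRU, and the tag-check accounting (one check per probed partition) are all illustrative choices, not details taken from the slides.

```python
class PartitionedCache:
    """Toy model: one block per (set, partition), LRU tracked by timestamps."""

    def __init__(self, num_sets, num_partitions, block_size=32):
        self.num_sets = num_sets
        self.num_partitions = num_partitions
        self.block_size = block_size
        self.tags = [[None] * num_partitions for _ in range(num_sets)]
        self.last_use = [[0] * num_partitions for _ in range(num_sets)]
        self.clock = 0

    def access(self, addr, mask, restricted=True):
        """Return (hit, tag_checks) for one load/store.

        mask: one 0/1 flag per partition (the k-bit vector).
        restricted: True models 'R' (probe only the masked partitions);
                    False models the default lookup of all partitions.
        """
        self.clock += 1
        block = addr // self.block_size
        s = block % self.num_sets
        tag = block // self.num_sets
        allowed = [p for p in range(self.num_partitions) if mask[p]]
        probed = allowed if restricted else list(range(self.num_partitions))
        for p in probed:
            if self.tags[s][p] == tag:
                self.last_use[s][p] = self.clock
                return True, len(probed)
        # Miss: replacement is always confined to the masked partitions.
        victim = min(allowed, key=lambda p: self.last_use[s][p])
        self.tags[s][victim] = tag
        self.last_use[s][victim] = self.clock
        return False, len(probed)


cache = PartitionedCache(num_sets=4, num_partitions=4)
print(cache.access(0x100, [1, 0, 0, 0]))                    # (False, 1)
print(cache.access(0x100, [1, 0, 0, 0]))                    # (True, 1)
print(cache.access(0x100, [1, 0, 0, 0], restricted=False))  # (True, 4)
```

In this model a restricted access costs one tag check per masked partition, while an unrestricted access probes every partition, which is the source of the power saving claimed on the slide.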
Partitioned Caches: Example
for (i = 0; i < N1; i++) {
    …
    for (j = 0; j < N2; j++)
        y[i + j] += *w1++ + x[i + j];   /* ld1/st1 (y), ld3 (w1), ld5 (x) */
    for (k = 0; k < N3; k++)
        y[i + k] += *w2++ + x[i + k];   /* ld2/st2 (y), ld4 (w2), ld6 (x) */
}
way-0 (tag, data): y, accessed by ld1, st1, ld2, st2
way-1 (tag, data): x, accessed by ld5, ld6
way-2 (tag, data): w1/w2, accessed by ld3, ld4
ld1 [100], R
ld5 [010], R
ld3 [001], R
• Reduces the number of tag checks per iteration from 12 to 4
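One way to read the 12-to-4 figure is that each inner-loop iteration issues four memory accesses (three loads and one store), and a 3-way lookup checks three tags per access. The sketch below reproduces that arithmetic. The bit-vectors for ld1, ld3, and ld5 come from the slide; giving st1 the same vector as ld1 and the helper name `tag_checks` are assumptions made for illustration.

```python
def tag_checks(accesses, num_ways, annotated):
    """Count tag comparisons for a list of (name, bit_vector) accesses."""
    total = 0
    for _name, bits in accesses:
        # An annotated ('R') access probes only the ways set in its bit-vector;
        # an unannotated access probes every way.
        total += bits.count("1") if annotated else num_ways
    return total


# One j-loop iteration: load y, load w1, load x, store y.
# st1 is assumed to share ld1's bit-vector because both access y.
j_loop = [("ld1", "100"), ("ld3", "001"), ("ld5", "010"), ("st1", "100")]

print(tag_checks(j_loop, num_ways=3, annotated=False))  # 12
print(tag_checks(j_loop, num_ways=3, annotated=True))   # 4
```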
Compiler Controlled Data Partitioning
• Goal: Place loads/stores into cache partitions
• Analyze application’s memory characteristics
– Cache requirements: number of partitions per ld/st
– Predict conflicts
• Place loads/stores into different partitions
– Satisfy each instruction's caching needs
– Avoid conflicts; overlap if possible
Cache Analysis:
Estimating Number of Partitions
• Minimal partitions to avoid conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the working-set to compute number of partitions
[Figure: access traces for the j-loop (X W1 Y Y X W1 Y Y) and the k-loop (X W2 Y Y X W2 Y Y), annotated with alternating B1 and M labels]
• M has working-set size = 1
Cache Analysis:
Estimating Number of Partitions
• Avoid conflict/capacity misses for an instruction
• Estimate the hit rate based on
– Reuse distance (D), total number of cache blocks (B), associativity (A) (Brehob et al., ’99)
[Table: estimated hit rates for reuse distances D = 0, 1, 2, tabulated over number of cache blocks (8, 16, 24, 32) and associativity (1 to 4); legible entries include .76, .87, .98, and 1]
• In practice, energy matrices are computed
• Pick the most energy-efficient configuration per instruction
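The slides cite Brehob et al. for the hit-rate estimate but do not reproduce the formula. A common probabilistic model, assumed here, treats each of the D intervening distinct blocks as mapping to the referenced set with probability A/B, so an LRU cache hits when fewer than A of them land there. With B = 8 and A = 1 this gives 0.875 for D = 1 and about 0.766 for D = 2, which resembles the .87 and .76 entries legible in the table. The function names and the 0.95 target below are hypothetical.

```python
from math import comb


def estimated_hit_rate(d, b, a):
    """P(hit) for reuse distance d, b total cache blocks, associativity a (LRU)."""
    sets = b // a
    p = 1.0 / sets  # chance that an intervening block maps to the same set
    # Hit if fewer than `a` of the d intervening blocks land in that set.
    return sum(
        comb(d, i) * p**i * (1 - p) ** (d - i) for i in range(min(a, d + 1))
    )


def min_partitions(d, num_sets, max_partitions, target=0.95):
    """Smallest partition count whose estimated hit rate reaches `target`.

    Each partition is assumed to contribute one block per set, so using
    `a` partitions behaves like an a-way cache with a * num_sets blocks.
    """
    for a in range(1, max_partitions + 1):
        if estimated_hit_rate(d, a * num_sets, a) >= target:
            return a
    return max_partitions


print(estimated_hit_rate(1, 8, 1))  # 0.875
print(estimated_hit_rate(2, 8, 1))  # 0.765625
print(min_partitions(2, num_sets=8, max_partitions=4))  # 2
```

The slides state that the real analysis picks configurations by energy, so the hit-rate threshold here stands in for that selection step.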
Cache Analysis:
Computing Interferences
• Avoid conflicts among temporally co-located references
• Model conflicts using interference graph
[Figure: the traces X W1 Y Y X W1 Y Y and X W2 Y Y X W2 Y Y relabeled as M4 M2 M1 M1 M4 M2 M1 M1 and M4 M3 M1 M1 M4 M3 M1 M1; interference graph over nodes M1, M2, M3, M4, each with D=1]
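A minimal sketch of building the interference graph, assuming that "temporally co-located" means "referenced within the same inner loop". The traces are those shown on the slide; the function name `build_interference_graph` and the region labels are illustrative.

```python
from itertools import combinations


def build_interference_graph(regions):
    """Map each reference to the set of references it shares a region with."""
    graph = {}
    for refs in regions.values():
        for r in refs:
            graph.setdefault(r, set())
        # Every pair of distinct references in the same region interferes.
        for u, v in combinations(sorted(set(refs)), 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


regions = {
    "j-loop": ["M4", "M2", "M1", "M1"],
    "k-loop": ["M4", "M3", "M1", "M1"],
}
graph = build_interference_graph(regions)
print(sorted(graph["M1"]))  # ['M2', 'M3', 'M4']
print("M3" in graph["M2"])  # False
```

M2 and M3 never appear in the same loop, so they do not interfere and may share a partition.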
Partition Assignment
• Placement phase can overlap references
• Compute the combined working-set
• Use the graph-theoretic notion of a clique
• For each clique, new D = Σ D of each node
• Combined D for all overlaps = Max (all cliques)
[Figure: interference graph with nodes M1, M2, M3, M4 (each D=1); Clique 1 and Clique 2 outlined]
Clique 1: M1, M2, M4, new reuse distance (D) = 3
Clique 2: M1, M3, M4, new reuse distance (D) = 3
Combined reuse distance: Max(3, 3) = 3
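The clique computation above can be sketched as follows. The slides do not say how cliques are enumerated; the basic Bron-Kerbosch recursion is used here as one standard choice. The graph edges are inferred from the two cliques on the slide, with M2 and M3 not adjacent.

```python
def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def combined_reuse_distance(graph, d):
    """Sum D over each maximal clique and return the largest sum."""
    return max(sum(d[n] for n in clique) for clique in maximal_cliques(graph))


graph = {
    "M1": {"M2", "M3", "M4"},
    "M2": {"M1", "M4"},
    "M3": {"M1", "M4"},
    "M4": {"M1", "M2", "M3"},
}
d = {"M1": 1, "M2": 1, "M3": 1, "M4": 1}
print(sorted(sorted(c) for c in maximal_cliques(graph)))
# [['M1', 'M2', 'M4'], ['M1', 'M3', 'M4']]
print(combined_reuse_distance(graph, d))  # 3
```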
Experimental Setup
• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations:
– 1-Kb to 32-Kb
– 32-byte block size
– 2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache
• Mediabench suite
• CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks
[Figure: average way accesses (y-axis, 0 to 8) versus cache size (1-K, 2-K, 4-K, 8-K, 16-K, 32-K, and average) for 2-part, 4-part, and 8-part caches]
• 36% reduction on an 8-partition cache
Improvement in Fetch Energy
16-Kb cache
[Figure: percentage energy improvement (y-axis, 0 to 60) per benchmark (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and on average, comparing 2-part vs 2-way, 4-part vs 4-way, and 8-part vs 8-way]
Summary
• Maintain the advantages of a hardware-cache
• Expose placement and lookup decisions to the compiler
– Avoid conflicts, eliminate redundancies
• 24% energy savings for a 4-Kb cache with 4 partitions
• Extensions
– Hybrid scratch-pad and caches
– Disable selected tags, converting those partitions to scratch-pads
– 35% additional savings in a 4-Kb cache with 1 partition as scratch-pad (SP)
Thank You
&
Questions
Cache Analysis
Step 1: Instruction Fusioning
• Combine loads/stores that access the same set of objects
• Avoids coherence and duplication problems
• Uses points-to analysis
for (i = 0; i < N1; i++) {
    …
    for (j = 0; j < readInput1(); j++)
        y[i + j] += *w1++ + x[i + j];   /* ld1/st1, ld3, ld5 */
    for (k = 0; k < readInput2(); k++)
        y[i + k] += *w2++ + x[i + k];   /* ld2/st2, ld4, ld6 */
}
[Figure labels: fused instructions M1 and M2]
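Instruction fusion can be sketched as a union-find pass over points-to sets. The points-to sets below are an assumed reading of the loop (ld1/st1/ld2/st2 access y, ld3 accesses w1, ld4 accesses w2, ld5/ld6 access x). Merging instructions that share any object, rather than only those with identical sets, is also an assumption; the function name `fuse_instructions` is hypothetical.

```python
def fuse_instructions(points_to):
    """Group memory instructions whose points-to sets overlap (union-find)."""
    parent = {inst: inst for inst in points_to}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # object name -> first instruction seen accessing it
    for inst, objects in points_to.items():
        for obj in objects:
            if obj in owner:
                parent[find(inst)] = find(owner[obj])
            else:
                owner[obj] = inst

    groups = {}
    for inst in points_to:
        groups.setdefault(find(inst), []).append(inst)
    return sorted(sorted(g) for g in groups.values())


points_to = {
    "ld1": {"y"}, "st1": {"y"}, "ld2": {"y"}, "st2": {"y"},
    "ld3": {"w1"}, "ld4": {"w2"},
    "ld5": {"x"}, "ld6": {"x"},
}
print(fuse_instructions(points_to))
# [['ld1', 'ld2', 'st1', 'st2'], ['ld3'], ['ld4'], ['ld5', 'ld6']]
```

Placing every instruction of a fused group in the same partitions means an object is never cached in two places, which is how fusion sidesteps coherence and duplication.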
Partition Assignment
• Greedily place instructions based on their cache estimates
• Overlap instructions if required
• Compute number of partitions for overlapped instructions
– Enumerate cliques within interference graph
– Compute combined working-set of all cliques
• Assign the R/U bit to control lookup
[Figure: interference graph with nodes M1, M2, M3, M4 (each D=1); Clique 1 and Clique 2 outlined]
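The greedy placement can be approximated by a graph-coloring-style pass. This is a simplified sketch, not the authors' algorithm: it omits the clique-based working-set recomputation and the R/U decision, and the ordering heuristic, the function names, and the 3-partition setup are assumptions.

```python
def assign_partitions(graph, need, num_partitions):
    """Give each node `need[n]` partitions, avoiding its neighbours' partitions."""
    placement = {}
    # Place the most demanding, most connected nodes first.
    order = sorted(graph, key=lambda n: (-need[n], -len(graph[n]), n))
    for n in order:
        taken = set()
        for m in graph[n]:
            taken.update(placement.get(m, ()))
        free = [p for p in range(num_partitions) if p not in taken]
        if len(free) < need[n]:
            # Not enough conflict-free partitions: overlap with neighbours.
            free += [p for p in range(num_partitions) if p in taken]
        placement[n] = free[: need[n]]
    return placement


def bit_vector(parts, k):
    """Render a partition list as the k-bit vector carried by a load/store."""
    return "".join("1" if p in parts else "0" for p in range(k))


graph = {
    "M1": {"M2", "M3", "M4"},
    "M2": {"M1", "M4"},
    "M3": {"M1", "M4"},
    "M4": {"M1", "M2", "M3"},
}
need = {"M1": 1, "M2": 1, "M3": 1, "M4": 1}
placement = assign_partitions(graph, need, num_partitions=3)
print(bit_vector(placement["M1"], 3))  # 100
print(bit_vector(placement["M4"], 3))  # 010
print(bit_vector(placement["M2"], 3))  # 001
print(bit_vector(placement["M3"], 3))  # 001
```

M2 and M3 do not interfere, so they end up sharing the third partition while M1 and M4 each get their own.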
Related Work
• Direct addressed, cool caches [Unsal ’01, Asanovic ’01]
– Tags maintained in registers that are addressed within loads/stores
• Split temporal/spatial cache [Rivers ’96]
– Hardware managed, two partitions
• Column partitioning [Devadas ’00]
– Individual ways can be configured as a scratch-pad
– No load/store based partitioning
• Region based caching [Tyson ’02]
– Heap, stack, globals
– Finer-grained control and management
• Pseudo set-associative caches [Calder ’96, Inoue ’99, Albonesi ’99]
– Reduce tag check power
– Compromises on cycle time
– Orthogonal to our technique
Code Size Overhead
[Figure: percentage of instructions (y-axis, 0 to 12) per benchmark (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and on average, split into annotated LD/STs and extra MOV instructions; two values are labeled 15% and 16%]