ECE8833 Polymorphous and Many-Core Computer Architecture
Lecture 6 Fair Caching Mechanisms for CMP
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Cache Sharing in CMP [Kim, Chandra, Solihin, PACT'04]
[Diagram: Processor Core 1 and Processor Core 2, each with a private L1 cache, share a single L2 cache]
Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
ECE8833 H.-H. S. Lee 2009
Cache Sharing in CMP
[Diagram: thread t1 on Core 1 and thread t2 on Core 2 contend for the shared L2; t1 occupies most of the L2 space]
t2's throughput is significantly reduced due to unfair cache sharing.
Shared L2 Cache Space Contention
[Charts: gzip's normalized cache misses per instruction and normalized IPC, running alone vs. co-scheduled with applu, apsi, art, or swim; sharing the L2 raises gzip's misses per instruction by up to roughly 10x and drops its IPC well below its solo value]
Impact of Unfair Cache Sharing
• Uniprocessor scheduling: each thread runs in turn, one per time slice (t1 | t2 | t3 | t1 | …)
• 2-core CMP scheduling:
  P1: t1 | t1 | t1 | …
  P2: t2 | t3 | t4 | …
• gzip will get more time slices than the others if it is set to run at higher priority, yet it could still run slower than them → priority inversion
• It could further slow down the other processes (starvation)
• Thus the overall throughput is reduced, instead of the desired uniform slowdown
Stack Distance Profiling Algorithm
A hit counter is kept for each recency position of the cache tags, from MRU (CTR Pos 0) to LRU (CTR Pos 3); a hit on a tag at position i increments CTR Pos i.

Example hit-counter values (4-way):
  CTR Pos 0 (MRU): 30
  CTR Pos 1: 20
  CTR Pos 2: 15
  CTR Pos 3 (LRU): 10
  Misses = 25
[Qureshi+, MICRO-39]
Stack Distance Profiling
• One counter per cache way, plus C_{>A}, the counter for misses (accesses beyond all A ways)
• Shows the reuse frequency for each way of the cache
• Can be used to predict the misses for any associativity smaller than A
  – Misses for a 2-way cache for gzip = C_{>A} + Σ C_i for i = 3 to 8 (on an 8-way cache)
• art likely has poor temporal locality and does not need all of its space
• If art's space is halved and given to gzip, what happens?
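The prediction rule above can be sketched in a few lines of Python. This is an illustrative model of a single fully-LRU set, not the hardware; `simulate_set` and `predict_misses` are assumed names.

```python
def simulate_set(accesses, assoc):
    """Stack-distance profiling of one set: maintain an LRU stack and
    count hits per recency position plus misses (the C_{>A} counter)."""
    stack = []                      # index 0 = MRU, last index = LRU
    hits = [0] * assoc              # hits per recency position
    misses = 0                      # accesses that miss in all assoc ways
    for addr in accesses:
        if addr in stack:
            hits[stack.index(addr)] += 1
            stack.remove(addr)
        else:
            misses += 1
            if len(stack) == assoc:
                stack.pop()         # evict the LRU entry
        stack.insert(0, addr)       # move/insert at MRU
    return hits, misses

def predict_misses(hits, misses, smaller_assoc):
    """LRU is a stack algorithm: an access hitting at recency position
    >= n would miss in an n-way cache."""
    return misses + sum(hits[smaller_assoc:])
```

For example, a repeating pattern of three addresses in a 4-way set always hits at stack distance 3, so shrinking the set to 2 ways converts those hits into misses.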
Fairness Metrics [Kim et al. PACT'04]
• Uniform slowdown: T_shared_i / T_alone_i = T_shared_j / T_alone_j
  – T_alone_i: execution time of t_i when it runs alone
  – T_shared_i: execution time of t_i when it shares the cache with others
Fairness Metrics [Kim et al. PACT'04]
• Uniform slowdown: T_shared_i / T_alone_i = T_shared_j / T_alone_j
• We want to minimize:
  – Ideally: M0_ij = |X_i − X_j|, where X_i = T_shared_i / T_alone_i
  – M1_ij = |X_i − X_j|, where X_i = Miss_shared_i / Miss_alone_i
  – M3_ij = |X_i − X_j|, where X_i = MissRate_shared_i / MissRate_alone_i
  – M5_ij = |X_i − X_j|, where X_i = MissRate_shared_i − MissRate_alone_i
• The proxy metrics (M1, M3, M5) try to equalize the miss increase of each thread
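The metrics translate directly into code. A minimal Python sketch with illustrative names; the per-thread statistics are passed in as dictionaries:

```python
def m0(t_shared, t_alone, i, j):
    """Ideal metric: difference in execution-time slowdown ratios."""
    xi = t_shared[i] / t_alone[i]
    xj = t_shared[j] / t_alone[j]
    return abs(xi - xj)

def m3(missrate_shared, missrate_alone, i, j):
    """Proxy metric: difference in miss-rate increase ratios."""
    xi = missrate_shared[i] / missrate_alone[i]
    xj = missrate_shared[j] / missrate_alone[j]
    return abs(xi - xj)

def m5(missrate_shared, missrate_alone, i, j):
    """Proxy metric: difference in absolute miss-rate increases."""
    xi = missrate_shared[i] - missrate_alone[i]
    xj = missrate_shared[j] - missrate_alone[j]
    return abs(xi - xj)
```

With the numbers from the dynamic fair caching example later (P1: 20% shared vs. 20% alone, P2: 15% shared vs. 5% alone), M3 evaluates to |1.0 − 3.0| = 2.0.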
Partitionable Cache Hardware
• Modified LRU cache replacement policy
  – G. E. Suh, et al., HPCA 2002
• Per-thread counters track the current partition against a target partition
  – Example: current partition P1: 448B, P2: 576B; target partition P1: 384B, P2: 640B
• On a P2 miss, since P2 is below its target, the victim is P1's LRU line rather than P2's
Partitionable Cache Hardware (cont.)
• After such replacements, the current partition converges to the target partition (P1: 384B, P2: 640B)
• Partition granularity could be as coarse as one entire cache way
Dynamic Fair Caching Algorithm
Example: optimizing the M3 metric. Three sets of counters are kept per process (P1, P2):
• MissRate alone: the miss rate when the process runs alone (from stack distance profiling)
• MissRate shared: the dynamic miss rate when running with a shared cache
• Target Partition: the target partition size
A repartitioning interval of 10K accesses was found to be the best.
Dynamic Fair Caching Algorithm — 1st Interval
• MissRate alone: P1: 20%, P2: 5%
• MissRate shared (measured): P1: 20%, P2: 15%
• Target Partition: P1: 256KB, P2: 256KB
Dynamic Fair Caching Algorithm — Repartition!
• Evaluate M3: P1: 20% / 20% = 1.0; P2: 15% / 5% = 3.0
• P2 suffers the larger relative miss increase, so its share grows by one granule (partition granularity: 64KB)
• Target Partition becomes P1: 192KB, P2: 320KB
Dynamic Fair Caching Algorithm — 2nd Interval
• With the new target (P1: 192KB, P2: 320KB), the shared miss rates are measured again over the next interval
Dynamic Fair Caching Algorithm — Repartition!
• MissRate shared: P1: 20%, P2: 15% → 10%
• Evaluate M3: P1: 20% / 20% = 1.0; P2: 10% / 5% = 2.0
• Still unfair, so P2 grows again: Target Partition becomes P1: 128KB, P2: 384KB
Dynamic Fair Caching Algorithm — 3rd Interval
• With target P1: 128KB, P2: 384KB, the measured miss rates become P1: 20% → 25%, P2: 10% → 9%
Dynamic Fair Caching Algorithm — Repartition!
• MissRate shared: P1: 25%, P2: 9%
• Rollback check: Δ = MR_old − MR_new; roll back the last repartition if Δ < T_rollback
• The best T_rollback threshold was found to be 20%
• Here P2's Δ = 10% − 9% = 1% < 20%, so the repartition is rolled back: Target Partition returns to P1: 192KB, P2: 320KB
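The interval-by-interval walk above can be condensed into a small sketch. This is an interpretation of the slides, not the paper's exact pseudocode: `repartition` shifts one 64KB granule from the least-slowed process to the most-slowed one, and `maybe_rollback` treats T_rollback as an absolute miss-rate delta; all names are illustrative.

```python
GRANULARITY = 64           # KB, the partition step from the lecture example
T_ROLLBACK = 0.20          # rollback threshold found best in the paper

def repartition(target, miss_shared, miss_alone):
    """Move one granule from the least-slowed to the most-slowed process."""
    x = {p: miss_shared[p] / miss_alone[p] for p in miss_shared}  # M3 ratios
    loser = max(x, key=x.get)      # most slowed down -> gets more cache
    winner = min(x, key=x.get)     # least slowed down -> gives up cache
    new_target = dict(target)
    if x[loser] > x[winner]:
        new_target[loser] += GRANULARITY
        new_target[winner] -= GRANULARITY
    return new_target

def maybe_rollback(target, prev_target, miss_now, miss_prev):
    """Undo the last repartition if no process improved by at least T_rollback."""
    delta = max(miss_prev[p] - miss_now[p] for p in miss_now)
    return prev_target if delta < T_ROLLBACK else target
```

Running the first interval from the slides (P1: 20%/20%, P2: 15%/5%, both at 256KB) grows P2 to 320KB, and the 3rd-interval rollback check (Δ = 1% < 20%) restores the previous target.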
Generic Repartitioning Algorithm
• Pick the process with the largest slowdown ratio and the one with the smallest as a pair for repartitioning
• Repeat for all candidate processes
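The largest/smallest pairing rule can be sketched as follows (illustrative names; `x` maps each process to its current slowdown ratio):

```python
def pair_candidates(x):
    """Pair the most-slowed with the least-slowed process, repeatedly,
    until fewer than two candidates remain."""
    remaining = dict(x)
    pairs = []
    while len(remaining) >= 2:
        largest = max(remaining, key=remaining.get)
        smallest = min(remaining, key=remaining.get)
        pairs.append((largest, smallest))
        del remaining[largest], remaining[smallest]
    return pairs
```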
Utility-Based Cache Partitioning (UCP)
Running Processes on a Dual-Core [Qureshi & Patt, MICRO-39]
[Charts: misses as a function of the number of ways given (1 to 16) for equake and vpr]
• LRU: in real runs, on average 7 ways were allocated to equake and 9 to vpr
• UTIL
  – How much you use (in a set) is how much you will get
  – Ideally, 3 ways to equake and 13 to vpr
Defining Utility
Utility U_a^b = Misses with a ways − Misses with b ways
[Chart: misses per 1000 instructions vs. number of ways of a 16-way 1MB L2, illustrating low-utility, high-utility, and saturating-utility applications]
Slide courtesy: Moin Qureshi, MICRO-39
Framework for UCP
[Diagram: each core has its own I$, D$, and utility monitor (UMON); both cores share the L2 cache, backed by main memory; the partitioning algorithm (PA) reads the UMONs]
Three components:
• Utility Monitors (UMON), one per core
• Partitioning Algorithm (PA)
• Replacement support to enforce partitions
Utility Monitors (UMON)
• For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD)
• UMON-global: one hit counter per way, shared across all sets
• Hit counters in the ATD count hits per recency position
• LRU is a stack algorithm, so hit counts directly give utility
  – e.g., hits(2 ways) = H0 + H1
[Diagram: ATD over sets A–H with hit counters H0 (MRU) through H15 (LRU)]
Utility Monitors (UMON)
• Extra tags incur hardware and power overhead
• Dynamic Set Sampling (DSS) reduces the overhead [Qureshi et al. ISCA'06]
  – 32 sets are sufficient, based on Chebyshev's inequality
  – The paper samples every 32nd set (simple static sampling)
  – Storage < 2KB per UMON (or 0.17% of the L2)
[Diagram: the full ATD over sets A–H is reduced to a sampled UMON (DSS) over sets B, E, F]
Partitioning Algorithm (PA)
• Evaluate all possible partitions and select the best
• With a ways to core1 and (16 − a) ways to core2:
  Hits_core1 = H0 + H1 + … + H(a−1)        (from UMON1)
  Hits_core2 = H0 + H1 + … + H(16−a−1)     (from UMON2)
• Select the a that maximizes (Hits_core1 + Hits_core2)
• Partitioning is done once every 5 million cycles
• After each partitioning interval, the hit counters in all UMONs are halved to retain some past information
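The exhaustive search and the counter decay can be sketched directly from the formulas above. `h1` and `h2` stand for each core's UMON hit counters (index 0 = MRU); the function names are illustrative, and a minimum of one way per core is assumed.

```python
NUM_WAYS = 16

def best_partition(h1, h2, min_ways=1):
    """Return (a, 16 - a) maximizing total hits across both cores."""
    best_a, best_hits = None, -1
    for a in range(min_ways, NUM_WAYS - min_ways + 1):
        # Hits each core would see with its share of the ways
        hits = sum(h1[:a]) + sum(h2[:NUM_WAYS - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, NUM_WAYS - best_a

def decay(counters):
    """Halve the hit counters after each partitioning interval."""
    return [c // 2 for c in counters]
```

A core whose utility saturates after 2 ways (like equake in the slides) loses the remaining ways to the high-utility core.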
Replacement Policy to Reach Desired Partition
Use way partitioning [Suh+ HPCA'02, Iyer ICS'04]
• Each line contains core-id bits
• On a miss, count ways_occupied in the set by the miss-causing app
• Binary decision for dual-core (in this paper):
  – If ways_occupied < ways_given: victim is the LRU line from the other app
  – Otherwise: victim is the LRU line from the miss-causing app
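The binary victim-selection decision can be sketched as below. The set is modeled as a list of `(core_id, addr)` lines ordered MRU → LRU; names are illustrative.

```python
def pick_victim(cache_set, miss_core, ways_given):
    """Return the index of the line to evict on a miss by miss_core."""
    ways_occupied = sum(1 for core, _ in cache_set if core == miss_core)
    if ways_occupied < ways_given[miss_core]:
        # Under quota: evict the LRU line belonging to another core
        is_victim_owner = lambda core: core != miss_core
    else:
        # At or over quota: evict within the miss-causing core's own lines
        is_victim_owner = lambda core: core == miss_core
    for idx in range(len(cache_set) - 1, -1, -1):   # scan from the LRU end
        if is_victim_owner(cache_set[idx][0]):
            return idx
    return len(cache_set) - 1                        # fallback: plain LRU
```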
UCP Performance (Weighted Speedup)
UCP improves average weighted speedup by 11% (dual core)

UCP Performance (Throughput)
UCP improves average throughput by 17%
Dynamic Insertion Policy
Conventional LRU
[Diagram: the incoming block is inserted at the MRU position, pushing the other blocks toward LRU]
Slide Source: Yuejian Xie
Conventional LRU
[Diagram: a never-reused block inserted at MRU] It occupies one cache block for a long time with no benefit!
LIP: LRU Insertion Policy [Qureshi et al. ISCA'07]
[Diagram: the incoming block is inserted at the LRU position instead of MRU]
LIP: LRU Insertion Policy [Qureshi et al. ISCA'07]
• A useless block is evicted at the next eviction
• A useful block is moved to the MRU position on a hit
LIP is not entirely new: Intel tried this in 1998 when designing "Timna" (integrating the CPU and a graphics accelerator that share the L2).
BIP: Bimodal Insertion Policy [Qureshi et al. ISCA'07]
• LIP may not age older lines, so BIP infrequently inserts lines in the MRU position
• Let ε = bimodal throttle parameter:
  if ( rand() < ε )
      Insert at MRU position;  // as in LRU replacement
  else
      Insert at LRU position;
• Promote to MRU if reused
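The three insertion policies differ only in where a missing block enters the recency stack. A runnable sketch over one set (ordered MRU → LRU; the function name and the deterministic `rng` hook are illustrative, added so the bimodal coin flip can be tested):

```python
import random

def access(cache_set, addr, assoc, policy="lru", epsilon=1/32,
           rng=random.random):
    """Apply one access under the given insertion policy; return True on hit."""
    if addr in cache_set:
        cache_set.remove(addr)
        cache_set.insert(0, addr)          # promote to MRU on reuse
        return True
    if len(cache_set) == assoc:
        cache_set.pop()                    # evict the LRU line
    if policy == "lru" or (policy == "bip" and rng() < epsilon):
        cache_set.insert(0, addr)          # insert at MRU
    else:                                  # "lip", or BIP's common case
        cache_set.append(addr)             # insert at LRU
    return False
```

Under LIP a streaming block that is never reused sits at LRU and is the next victim, while a reused block earns the MRU position; BIP behaves like LIP except for a rare (probability ε) MRU insertion that lets old lines age out.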
DIP: Dynamic Insertion Policy [Qureshi et al. ISCA'07]
Two types of workloads: LRU-friendly or BIP-friendly.
DIP can be implemented by:
1. Monitor both policies (LRU and BIP)
2. Choose the best-performing policy
3. Apply the best policy to the cache
[Diagram: DIP selects between LRU and BIP, where BIP itself inserts at MRU (LRU-style) with probability ε and at LRU (LIP-style) with probability 1 − ε]
A cost-effective implementation is needed → "Set Dueling"
Set Dueling for DIP [Qureshi et al. ISCA'07]
Divide the cache in three:
• Dedicated LRU sets
• Dedicated BIP sets
• Follower sets (use the winner of LRU vs. BIP)
An n-bit saturating counter monitors both policies:
• Misses in the LRU sets: counter++
• Misses in the BIP sets: counter−−
The counter decides the policy for the follower sets:
• MSB = 0: use LRU
• MSB = 1: use BIP
monitor → choose → apply, using a single counter
Slide Source: Moin Qureshi
Promotion/Insertion Pseudo Partitioning
PIPP [Xie & Loh ISCA'09]
• What's PIPP?
  – Promotion/Insertion Pseudo Partitioning
  – Achieves both capacity management (as in UCP) and dead-time management (as in DIP)
• Eviction: the LRU block is the victim
• Insertion: the new block is placed the owning core's quota worth of positions away from LRU (insert position = target allocation, e.g., 3)
• Promotion: on a hit, the block moves toward MRU by only one position
[Diagram: a new block enters at position 3 from LRU; hits promote it one step at a time; the LRU block is evicted]
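The three PIPP rules fit in one function. This is an illustrative model of a single set (ordered MRU → LRU, lines as `(core, addr)` pairs), not Xie & Loh's exact hardware; the insert-position arithmetic is chosen to reproduce the worked example on the next slides.

```python
def pipp_access(cache_set, core, addr, assoc, quota):
    """One access under PIPP; quota[core] is the core's target allocation."""
    for i, (c, a) in enumerate(cache_set):
        if a == addr:
            # Promotion: move toward MRU by only one position
            if i > 0:
                cache_set[i - 1], cache_set[i] = cache_set[i], cache_set[i - 1]
            return True
    if len(cache_set) == assoc:
        cache_set.pop()                    # eviction: the LRU block is the victim
    # Insertion: quota[core] positions away from the LRU end
    pos = max(len(cache_set) - quota[core] + 1, 0)
    cache_set.insert(pos, (core, addr))
    return False
```

With Core0's quota at 5 and Core1's at 3 in an 8-way set, a Core1 miss lands its block 3 positions from LRU, exactly as in the example that follows.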
PIPP Example
Core0 quota: 5 blocks; Core1 quota: 3 blocks (8-way set, shown MRU → LRU; digits are Core0's blocks, letters are Core1's)
• Start: 1 A 2 3 4 B 5 C
• Core1 requests D (miss): C (LRU) is evicted; D is inserted 3 positions from LRU (Core1's quota) → 1 A 2 3 4 D B 5
• Core0 requests 6 (miss): 5 (LRU) is evicted; 6 is inserted 5 positions from LRU (Core0's quota) → 1 A 2 6 3 4 D B
• Core0 requests 7 (miss): B (LRU) is evicted; 7 is inserted 5 positions from LRU → 1 A 2 7 6 3 4 D
• Core1 requests D (hit): D is promoted toward MRU by one position
How PIPP Does Both Management
Quotas — Core0: 6, Core1: 4, Core2: 4, Core3: 2
[Diagram: each core inserts at its quota distance from LRU, so cores with smaller quotas insert closer to the LRU position]
Pseudo Partitioning Benefits
Core0 quota: 5 blocks; Core1 quota: 3 blocks
[Diagram: under a strict partition, each core keeps its own MRU/LRU ordering inside its fixed share, and a new request cannot use the other core's space]
Pseudo Partitioning Benefits (cont.)
[Diagram: under pseudo partitioning, one shared recency ordering lets Core1 temporarily "steal" a line from Core0 when its demand is higher]
Promote By One (PIPP) vs. Directly to MRU (TADIP)
[Diagram: a single-reuse block under PIPP advances only one position toward MRU before being evicted; under TADIP it is moved directly to MRU and lingers in the cache]
Algorithm Comparison

  Algorithm     Capacity mgmt.   Dead-time mgmt.   Note
  LRU           no               no                Baseline, no explicit management
  UCP           yes              no                Strict partitioning
  DIP / TADIP   no               yes               Insert at LRU and promote to MRU on hit
  PIPP          yes              yes               Pseudo-partitioning and incremental promotion