Imbalanced Cache Partitioning for Balanced Data-Parallel Programs
Abhisek Pan & Vijay S. Pai

Contributions
• Shared last-level cache partitioning for balanced data-parallel applications
• Balanced allocation is suboptimal for balanced programs
• Increasing the allocation for one thread at a time improves utilization
• High imbalance helps both the preferred and the un-preferred threads
  - The preferred thread benefits because its working set now fits into its partition
  - The un-preferred threads benefit by using the data left behind in the preferred partition
• Prioritizing each thread in turn ensures balanced progress
• 17% drop in miss rate and 8% drop in execution time on average for a 4-core, 8 MB cache
• Negligible overheads

Motivation
• Last-level cache partitioning has been studied heavily for multiprogrammed workloads
• Multithreading is different from multiprogramming:
  - All threads have to progress equally
  - Pure throughput maximization is not enough
  - Data-parallel threads are similar to each other in their data access patterns
• The only way to improve utilization is through imbalance in allocation
[Figure: Miss rate vs. reuse distance for a data-parallel thread. Annotations: ways added beyond an equal share but short of the next working set are an inefficient allocation with no effect on the miss rate; the preferred thread benefits once its increased allocation covers working set 2; un-preferred threads benefit from data remaining in the preferred partition.]
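The working-set argument in the figure can be made concrete with a toy stack-distance model. The sketch below is not from the poster: it assumes an idealized LRU model and made-up reuse-distance counts, and it only illustrates why a mildly larger share of ways may not help while a heavily imbalanced share can cover the second working set.

```python
# Toy LRU stack-distance model (illustrative assumptions, not measured data).
# Reuse distances and allocations are expressed in "ways" of the shared cache.

def miss_rate(reuse_hist, allocated_ways):
    """Fraction of accesses whose reuse distance exceeds the allocation.
    Under LRU, an access hits iff its stack distance fits in the allocation."""
    total = sum(reuse_hist.values())
    misses = sum(n for dist, n in reuse_hist.items() if dist > allocated_ways)
    return misses / total

# Two working sets: a small one (~4 ways of data) and a large one (~24 ways).
reuse_hist = {4: 50, 24: 40, 10**6: 10}   # reuse distance (ways) -> access count

total_ways, threads = 32, 4
for ways in (total_ways // threads, 12, 28):   # balanced share, mild imbalance, high imbalance
    print(f"{ways:2d} ways -> miss rate {miss_rate(reuse_hist, ways):.2f}")
```

With these made-up numbers, the balanced 8-way share and a mildly imbalanced 12-way share see the same 0.50 miss rate, while a 28-way share covers the larger working set and the miss rate drops to 0.10.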
2-Stage Partitioning

Evaluation Stage
- Triggered at the start of a new program phase
- Divide the cache sets into equal-sized segments
- Run each segment with a different level of imbalance, plus one segment left un-partitioned
- Prioritize each core in turn
- Select the configuration with the fewest misses

Stable Stage
- Maintain the chosen configuration until the next program phase change
- Choose the preferred thread in round-robin order (a sketch of both stages follows this list)
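The control flow of the two stages can be summarized in a short sketch. The code below is a simplified illustration under assumed parameters: the imbalance levels, the simulated miss counts, and the function names are assumptions for illustration, not the poster's hardware interface, which drives per-segment miss counters rather than random numbers.

```python
# Simplified two-stage controller (illustrative; parameters and hooks are assumptions).
from itertools import cycle
import random

NUM_CORES, TOTAL_WAYS = 4, 32
IMBALANCE_LEVELS = [None, 14, 20, 26]   # ways for the preferred core; None = un-partitioned

def allocation(preferred_core, preferred_ways):
    """Heavily imbalanced split: the preferred core gets `preferred_ways`,
    the remaining cores share the rest equally."""
    if preferred_ways is None:
        return None                                     # leave the cache un-partitioned
    rest = (TOTAL_WAYS - preferred_ways) // (NUM_CORES - 1)
    return [preferred_ways if c == preferred_core else rest for c in range(NUM_CORES)]

def evaluation_stage(segment_misses):
    """Each set segment runs one imbalance level; keep the level with the fewest misses."""
    best = min(range(len(IMBALANCE_LEVELS)), key=lambda s: segment_misses[s])
    return IMBALANCE_LEVELS[best]

def controller(num_phases=3):
    preferred = cycle(range(NUM_CORES))      # stable stage rotates the preferred core
    for phase in range(num_phases):
        # Evaluation stage: triggered by a phase change; miss counts are simulated here.
        misses = [random.randint(1000, 5000) for _ in IMBALANCE_LEVELS]
        level = evaluation_stage(misses)
        # Stable stage: hold the chosen level, prioritizing each core in turn.
        for _ in range(NUM_CORES):
            print(f"phase {phase}: apply {allocation(next(preferred), level)}")

if __name__ == "__main__":
    controller()
```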
Overheads
- Per-segment way partitioning and counters (a sketch of way-partitioned victim selection follows this list)
- Program phase detection
- Evaluation-stage overhead for a small cache (1% avg., 5% max.)
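Way partitioning itself only changes victim selection on a miss. The sketch below is a generic illustration under assumed structures, not the poster's hardware design: a core can hit on data in any way, but may only evict from the ways assigned to it, which is also why un-preferred threads can reuse data left behind in the preferred partition.

```python
# Generic way-partitioned LRU victim selection (an illustrative assumption,
# not the poster's hardware). Hits are allowed anywhere in the set; evictions
# are restricted to the ways owned by the requesting core.

class PartitionedSet:
    def __init__(self, num_ways, way_owner):
        self.tags = [None] * num_ways      # tag stored in each way (None = invalid)
        self.lru = [0] * num_ways          # larger value = more recently used
        self.way_owner = way_owner         # way index -> owning core
        self.clock = 0

    def _touch(self, way):
        self.clock += 1
        self.lru[way] = self.clock

    def access(self, core, tag):
        """Return True on a hit; on a miss, evict the LRU way owned by `core`."""
        for way, t in enumerate(self.tags):
            if t == tag:                   # hits may come from any partition
                self._touch(way)
                return True
        owned = [w for w, c in enumerate(self.way_owner) if c == core]
        victim = min(owned, key=lambda w: self.lru[w])
        self.tags[victim] = tag
        self._touch(victim)
        return False

# Example: 8 ways split 5/1/1/1 in favor of core 0 (the preferred core).
s = PartitionedSet(num_ways=8, way_owner=[0, 0, 0, 0, 0, 1, 2, 3])
print(s.access(0, "A"))   # miss: the preferred core installs "A" in its partition
print(s.access(1, "A"))   # hit: an un-preferred core reuses data left in the preferred partition
```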
Evaluation

Method
- 4-core CMP with a 32-way shared L2; 9 data-parallel workloads from the PARSEC and SPEC OMP suites

Baselines
- Compared to a statically equi-partitioned cache (STATIC-EQ) and a CPI-based adaptive partitioning scheme (CPI)
- Misses and execution time normalized to an un-partitioned cache

[Figures: Normalized misses and normalized execution time for the proposed scheme (IMBERR), STATIC-EQ, and CPI across Wupwise, Swim, Art, Streamcluster, Canneal, Fluidanimate, Equake, Blackscholes, and Swaptions, for cache sizes including 32 MB and 128 MB.]
Conclusion
Effective cache utilization and balanced progress for data-parallel applications through:
A. High imbalance in partitions, and
B. Prioritizing each thread in turn
Acknowledgements
Funded by NSF (CNS-0751153, CNS-1117726, and OCI-1216809) and the DOE (DE-FC02-12ER26104).
Thanks to T. N. Vijaykumar and Malek Musleh.