PREDATOR: Predictive False Sharing Detection
Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*
*University of Massachusetts Amherst
Huawei US Research Center
UNIVERSITY OF MASSACHUSETTS, AMHERST • School of Computer Science
Parallelism: Expectation is Awesome
Parallel Program
int count[8];
int W;
void increment(int S)
{
    for(in=S; in<S+W; in++)
        for(j=0; j<1M; j++)
            count[in]++;
}
int main(int THREADS) {
    W=8/THREADS;
    for(i=0; i<8; i+=W)
        spawn(increment,i);
}
[Figure: Expectation. Runtime (s), 0 to 90, vs. number of threads (1, 2, 4, 8)]
Parallelism: Reality is Awful
Parallel Program
int count[8];
int W;
void increment(int S)
{
    for(in=S; in<S+W; in++)
        for(j=0; j<1M; j++)
            count[in]++;    // false sharing
}
int main(int THREADS) {
    W=8/THREADS;
    for(i=0; i<8; i+=W)
        spawn(increment,i);
}
[Figure: Reality vs. Expectation. Runtime (s), 0 to 140, vs. number of threads (1, 2, 4, 8)]
False sharing slows the program by 13X
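The slide's program is pseudocode. A minimal runnable sketch of the same experiment follows, assuming POSIX threads in place of `spawn` and a 64-byte cache line. The names `run_counters`, `counter_value`, `padded_counter` and `LINE_SIZE` are illustrative and not from the slides, and the counters are `volatile` only so that the compiler keeps every increment.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NCOUNTERS 8
#define LINE_SIZE 64   /* assumed cache line size in bytes */

/* One counter per cache line: the alignment pads each element to LINE_SIZE. */
typedef struct {
    _Alignas(LINE_SIZE) volatile long value;
} padded_counter;

volatile long packed[NCOUNTERS];     /* neighbours can share a cache line */
padded_counter padded[NCOUNTERS];    /* each counter has a line to itself */

typedef struct { int start, width, use_padding; long iters; } job;

/* Same loop nest as the slide: each worker owns counters [start, start+width). */
static void *increment(void *arg) {
    const job *j = arg;
    for (int in = j->start; in < j->start + j->width; in++)
        for (long k = 0; k < j->iters; k++) {
            if (j->use_padding) padded[in].value++;
            else packed[in]++;
        }
    return NULL;
}

/* Runs `threads` workers (1, 2, 4 or 8), each owning 8/threads counters.
   Returns 0 on success, -1 if a thread could not be created. */
int run_counters(int threads, long iters, int use_padding) {
    assert(threads > 0 && NCOUNTERS % threads == 0);
    pthread_t tid[NCOUNTERS];
    job jobs[NCOUNTERS];
    int width = NCOUNTERS / threads;
    int started = 0, rc = 0;

    for (int i = 0; i < NCOUNTERS; i++) { packed[i] = 0; padded[i].value = 0; }
    for (int t = 0; t < threads; t++) {
        jobs[t] = (job){ t * width, width, use_padding, iters };
        if (pthread_create(&tid[t], NULL, increment, &jobs[t]) != 0) { rc = -1; break; }
        started++;
    }
    for (int t = 0; t < started; t++) pthread_join(tid[t], NULL);
    return rc;
}

/* Reads back counter i from whichever layout was used. */
long counter_value(int i, int use_padding) {
    return use_padding ? padded[i].value : packed[i];
}
```

Timing `run_counters(8, n, 0)` against `run_counters(8, n, 1)`, for example with `clock_gettime`, is one way to observe the gap on a given machine; the size of the gap depends on the hardware and is not claimed here.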
False Sharing in Real Applications
False sharing slows MySQL by 50%
False Sharing vs. True Sharing
[Diagram: one cache line. False sharing: Tasks 1 to 4 each access a different part of the line. True sharing: Tasks 1 and 2 access the same part.]
Resource Contention at Cache Line Level
False Sharing Causes Performance Problems
[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; an "Invalidate" message passes between the caches]
Cache line: basic unit of data transfer
False Sharing Causes Performance Problems
[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; an "Invalidate" message passes between the caches]
Interleaved accesses cause cache invalidations
False Sharing is Everywhere
me = 1;
you = 1; // globals
me = new Foo;
you = new Bar; // heap
class X {
int me;
int you;
}; // fields
array[me] = 12;
array[you] = 13; // array indices
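Whether two such variables can falsely share depends on whether their addresses fall in the same cache line. The sketch below checks that for the "fields" and "array indices" cases, assuming a 64-byte line; `line_index`, `same_cache_line`, `x_demo` and the 64-byte alignment are illustrative choices, added so the result does not depend on where the linker places the objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the cache line that contains address p, for a given line size. */
uintptr_t line_index(const void *p, size_t line_size) {
    assert(line_size > 0);
    return (uintptr_t)p / line_size;
}

/* True when both addresses map to the same cache line. */
bool same_cache_line(const void *a, const void *b, size_t line_size) {
    return line_index(a, line_size) == line_index(b, line_size);
}

/* Mirrors the "fields" case; aligned to a line boundary (an assumption). */
struct X {
    _Alignas(64) int me;
    int you;
};
struct X x_demo;

/* Mirrors the "array indices" case; also aligned to a line boundary. */
_Alignas(64) int array[32];
```

With 4-byte `int` elements and this alignment, indices 0 to 15 of `array` land in one 64-byte line and index 16 starts the next, so `array[me]` and `array[you]` share a line whenever both indices are in the same group of 16.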
False Sharing is Hard to Diagnose
Multiple experts worked together to diagnose a MySQL scalability issue (1.5M LOC)
Problems of Existing Tools
• No precise information/false positives
– WBIA’09, VEE’11, EuroSys’13, SC’13
• Accurate & Precise
– OOPSLA’11 (cannot detect read-write false sharing)
Shared problem: they only detect observed false sharing
False Sharing Causes Performance Problems
[Diagram: Task 1 on Core 1 and Task 2 on Core 2, each core with its own cache above main memory; an "Invalidate" message passes between the caches]
Interleaved accesses → Cache invalidations → Performance problems
Detect false sharing causing performance problems
Find cache lines with many cache invalidations
Find Lines with Many Invalidations
[Diagram: memory (globals and heap) divided into cache lines]
Track cache invalidations on each cache line
Track Cache Invalidations
• Hardware-based approach
– Needs hardware support
– No portability
• Simulation-based approach
– Needs hardware info such as cache hierarchy, cache capacity
– Very slow
PREDATOR: based on memory access history of each cache line
• Conservative Assumptions
– Each thread runs on a different core with its private cache.
– Infinite cache capacity.
Track Cache Invalidations
Each Entry: {Thread ID, Access Type}
[Diagram: a timeline of read (r) and write (w) accesses by threads T1 and T2 to one cache line, the line's history entries, and the resulting # of invalidations]
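A history-based invalidation counter of this kind can be sketched as follows. This is a simplified illustration, not PREDATOR's exact update rules: it assumes a two-entry history per cache line, counts an invalidation when a write finds an entry from another thread, and then resets the history to that write. All type and function names, and the trace format used by `replay` (thread digit followed by `r` or `w`), are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ACCESS_READ, ACCESS_WRITE } access_type;

/* One history entry, as on the slide: {Thread ID, Access Type}. */
typedef struct { int thread_id; access_type type; } history_entry;

/* Per-cache-line state: a short history plus the invalidation count. */
typedef struct {
    history_entry entry[2];   /* assumed capacity of two entries */
    int used;
    unsigned long invalidations;
} line_history;

void record_access(line_history *h, int tid, access_type type) {
    assert(h != NULL);
    if (type == ACCESS_READ) {
        /* Remember a read only if there is room and this thread is new. */
        if (h->used < 2) {
            for (int i = 0; i < h->used; i++)
                if (h->entry[i].thread_id == tid) return;
            h->entry[h->used++] = (history_entry){ tid, ACCESS_READ };
        }
        return;
    }
    /* A write invalidates the line if another thread holds a copy of it. */
    for (int i = 0; i < h->used; i++) {
        if (h->entry[i].thread_id != tid) { h->invalidations++; break; }
    }
    /* After the write, only the writer's copy remains valid. */
    h->entry[0] = (history_entry){ tid, ACCESS_WRITE };
    h->used = 1;
}

/* Replays a trace such as "1w2r1w" on a fresh line and returns the count. */
unsigned long replay(const char *trace) {
    line_history h = {0};
    for (size_t i = 0; trace[i] != '\0' && trace[i + 1] != '\0'; i += 2) {
        int tid = trace[i] - '0';
        access_type type = (trace[i + 1] == 'w') ? ACCESS_WRITE : ACCESS_READ;
        record_access(&h, tid, type);
    }
    return h.invalidations;
}
```

Under these rules, writes that alternate between two threads invalidate on every write after the first, while repeated writes by a single thread never do, which matches the conservative assumption that each thread runs on its own core with a private cache.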
PREDATOR Components
• Compiler Instrumentation: instruments every memory read/write access
• Runtime System: collects memory accesses and reports false sharing
Detect Problems Correctly & Precisely
• Correctly:
– No false alarms
– Track memory accesses on each word
[Diagram: false sharing (Tasks 1 to 4) vs. true sharing (Tasks 1 and 2)]
• Precisely
– Global variables
– Heap objects: pinpoint the line of memory allocation
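Per-word tracking is what lets a tool tell false sharing from true sharing. The sketch below shows one way to do that classification, assuming 64-byte lines, 4-byte words and at most 32 threads; the bitmask representation and the names `line_words`, `note_access` and `classify` are assumptions for illustration, not PREDATOR's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64                       /* assumed cache line size */
#define WORD_SIZE 4                        /* assumed tracking granularity */
#define WORDS_PER_LINE (LINE_SIZE / WORD_SIZE)

typedef enum { NO_SHARING, TRUE_SHARING, FALSE_SHARING } sharing_kind;

/* For each word of one cache line: which threads touched it, which wrote it. */
typedef struct {
    uint32_t accessed_by[WORDS_PER_LINE];  /* bit t set: thread t accessed */
    uint32_t written_by[WORDS_PER_LINE];   /* bit t set: thread t wrote */
} line_words;

void note_access(line_words *lw, unsigned tid, size_t offset, bool is_write) {
    assert(lw != NULL && tid < 32 && offset < LINE_SIZE);
    size_t w = offset / WORD_SIZE;
    lw->accessed_by[w] |= UINT32_C(1) << tid;
    if (is_write) lw->written_by[w] |= UINT32_C(1) << tid;
}

/* True when at least two bits are set in mask. */
static bool at_least_two(uint32_t mask) { return (mask & (mask - 1)) != 0; }

sharing_kind classify(const line_words *lw) {
    uint32_t threads = 0, writers = 0;
    bool word_shared = false;
    for (size_t w = 0; w < WORDS_PER_LINE; w++) {
        threads |= lw->accessed_by[w];
        writers |= lw->written_by[w];
        /* The same word is used by several threads and written: true sharing. */
        if (at_least_two(lw->accessed_by[w]) && lw->written_by[w] != 0)
            word_shared = true;
    }
    /* A single thread, or read-only data, causes no invalidations. */
    if (!at_least_two(threads) || writers == 0) return NO_SHARING;
    return word_shared ? TRUE_SHARING : FALSE_SHARING;
}
```

A line that contains both kinds of sharing is reported as true sharing here; that is a simplification chosen for the sketch.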
PREDATOR’s Report
Why do we need prediction?
Necessity of False Sharing Prediction
[Diagram: the same accesses by Thread 1 and Thread 2 under different placements across cache line 1 and cache line 2; some placements cause false sharing]
Properties Affecting False Sharing Occurrence
• Change of memory layout
– 32-bit platform vs. 64-bit platform
– Different memory allocator
– Different compiler or optimization
– Different allocation order by changing the code, e.g., printf
• Run on hardware with different cache line size
Example of False Sharing Sensitivity
Cache line size = 64 bytes
[Diagram: memory layout of the same data at starting offsets 0, 8, ..., 56; colors represent threads]
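The sensitivity to starting offset and to cache line size can be worked through with a small calculation. The sketch below assumes, hypothetically, that each thread owns one 64-byte element of a contiguous array (the slide does not state the element size) and counts how many neighbouring elements end up touching a common cache line; `adjacent_pairs_sharing_line` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Number of adjacent element pairs (i, i+1) that touch a common cache line
   when `count` elements of `elem_size` bytes are laid out back to back,
   starting `offset` bytes past a cache line boundary. */
size_t adjacent_pairs_sharing_line(size_t offset, size_t elem_size,
                                   size_t count, size_t line_size) {
    assert(elem_size > 0 && line_size > 0);
    size_t shared = 0;
    for (size_t i = 0; i + 1 < count; i++) {
        size_t last_byte_of_i = offset + (i + 1) * elem_size - 1;
        size_t first_byte_of_next = offset + (i + 1) * elem_size;
        /* The pair shares a line if the boundary between them is mid-line. */
        if (last_byte_of_i / line_size == first_byte_of_next / line_size)
            shared++;
    }
    return shared;
}
```

With eight 64-byte elements and 64-byte lines, offset 0 gives no shared lines, while any offset from 8 to 56 makes all seven neighbouring pairs share a line. The same layout at offset 0 on hardware with 128-byte lines puts elements in the same line two at a time.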
Example of False Sharing Sensitivity
[Figure: Runtime (Seconds), 0 to 6]
PREDATOR predicts false sharing problems before they occur
Prediction Based on Virtual Cache Lines
[Diagram: accesses by Thread 1 and Thread 2 shown against the real cache lines 1 and 2 (real case) and against shifted virtual cache lines (Prediction 1, Prediction 2), which expose potential false sharing]
Track Invalidations on Virtual Cache Lines
[Diagram: accesses X and Y at distance d; the tracked virtual line extends (sz-d)/2 beyond X and (sz-d)/2 beyond Y; other virtual lines are not tracked]
• d < sz, where sz is the cache line size
• (X, Y) from different threads && one of them is a write
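The placement of the tracked virtual line can be sketched as below, assuming `sz` is the cache line size and `d` is the distance between the addresses of the two hot accesses. The types `hot_access` and `virtual_line` and the function `predict_virtual_line` are hypothetical names; the clamping at address 0 and the integer division for odd `sz - d` are choices made for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A frequently accessed location, with who accessed it and how. */
typedef struct { uintptr_t addr; int thread_id; bool is_write; } hot_access;

/* A candidate line of the real line size, not aligned to a real boundary. */
typedef struct { uintptr_t start; uintptr_t size; } virtual_line;

/* Returns true and fills *out when (X, Y) qualify for tracking:
   d < sz, different threads, and at least one of them is a write. */
bool predict_virtual_line(hot_access x, hot_access y, uintptr_t sz,
                          virtual_line *out) {
    assert(sz > 0 && out != NULL);
    if (x.addr > y.addr) { hot_access t = x; x = y; y = t; }  /* X before Y */
    uintptr_t d = y.addr - x.addr;
    if (d >= sz) return false;                       /* cannot fit in one line */
    if (x.thread_id == y.thread_id) return false;    /* same thread */
    if (!x.is_write && !y.is_write) return false;    /* read-read is harmless */

    /* Centre the pair: leave (sz - d) / 2 bytes before X and after Y. */
    uintptr_t margin = (sz - d) / 2;
    out->start = (x.addr >= margin) ? x.addr - margin : 0;
    out->size = sz;
    return true;
}
```

For example, with `sz = 64`, X at address 100 and Y at address 120, `d` is 20, the margin is 22, and the tracked virtual line covers addresses 78 to 141.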
Benchmark Results
Benchmarks          Unknown Problem   Without Prediction   With Prediction   Improvement
Histogram           ✔                 ✔                    ✔                 46%
Linear_regression                                          ✔                 1207%
Reverse_index                         ✔                    ✔                 0.09%
Word_count                            ✔                    ✔                 0.14%
Streamcluster-1                       ✔                    ✔                 4.77%
Streamcluster-2     ✔                 ✔                    ✔                 7.52%
Real Applications Results
• MySQL
– Problem: False sharing occurs when different threads
update the shared bitmap simultaneously.
– Performance improves 180% after fixes.
• Boost library:
– Problem: “there will be 16 spinlocks per cache line”
– Performance improves about 100%.
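The slides do not show the fixes themselves. A common remedy for problems like "16 spinlocks per cache line" is to pad each lock to a full cache line, sketched below with C11 atomics. The 64-byte `LINE_SIZE`, the `padded_spinlock` type, the `lock_pool` array and the functions are assumptions for illustration, not the code from MySQL or Boost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LINE_SIZE 64   /* assumed cache line size in bytes */

/* The alignment pads the struct to LINE_SIZE, so array elements
   never share a cache line with each other. */
typedef struct {
    _Alignas(LINE_SIZE) atomic_int state;   /* 0 = free, 1 = held */
} padded_spinlock;

/* A pool of locks with static storage, so every lock starts out free. */
padded_spinlock lock_pool[4];

/* How many locks of a given size fit in one cache line. */
size_t locks_per_line(size_t lock_size) {
    assert(lock_size > 0);
    return LINE_SIZE / lock_size;
}

/* Tries once to take the lock; returns true on success. */
bool spin_trylock(padded_spinlock *l) {
    return atomic_exchange_explicit(&l->state, 1, memory_order_acquire) == 0;
}

/* Spins until the lock is acquired. */
void spin_lock(padded_spinlock *l) {
    while (!spin_trylock(l)) { /* busy-wait */ }
}

void spin_unlock(padded_spinlock *l) {
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

Sixteen 4-byte locks fit in one 64-byte line, so threads contending on different locks still invalidate each other's caches; with padding, each line holds exactly one lock, at the cost of 16 times the memory.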
Performance Overhead of PREDATOR
[Figure: Execution Time Overhead. Normalized Runtime (axis 0 to 15) for Original, PREDATOR-NP, and PREDATOR; callouts: 23, 26, 5.6X]
[Summary slide: false sharing between cores (cache invalidation); PREDATOR components (Compiler Instrumentation, Runtime System) producing a precise report; prediction based on virtual cache lines (real case, Prediction 1, Prediction 2)]
Detailed Prediction Algorithm
1. Find suspected cache lines
2. Track detailed memory accesses
3. Predict based on hot accesses
[Diagram: hot accesses X and Y at distance d]
d < sz && (X, Y) from different threads: potential false sharing
4: Tracking Cache Invalidations on the Virtual Line
[Diagram: X and Y at distance d; the tracked virtual line extends (sz-d)/2 beyond X and (sz-d)/2 beyond Y; other virtual lines are not tracked]