PREDATOR: Predictive False Sharing Detection

Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*
*University of Massachusetts Amherst
Huawei US Research Center

UNIVERSITY OF MASSACHUSETTS, AMHERST • School of Computer Science


Parallelism: Expectation is Awesome

Parallel Program

    int count[8];
    int W;

    void increment(int S) {
      for (int in = S; in < S + W; in++)
        for (int j = 0; j < 1000000; j++)   // 1M iterations
          count[in]++;
    }

    int main(int THREADS) {                 // slide pseudocode: THREADS is the thread count
      W = 8 / THREADS;
      for (int i = 0; i < 8; i += W)
        spawn(increment, i);                // pseudocode for starting a thread
    }

[Figure: Expectation. Runtime (s) versus number of threads (1, 2, 4, 8).]


Parallelism: Reality is Awful

Parallel Program: the same program as on the previous slide.

[Figure: Reality versus Expectation. Runtime (s) versus number of threads (1, 2, 4, 8); the gap is labeled "False sharing".]

False sharing slows the program by 13X


False Sharing in Real Applications

False sharing slows MySQL by 50%


False Sharing vs. True Sharing

[Figure: a cache line. False sharing: Task 1, Task 2, Task 3, Task 4. True sharing: Task 1, Task 2.]


Resource Contention at Cache Line Level


False Sharing Causes Performance Problems

[Figure: Core 1 (Thread 1) and Core 2 (Thread 2), each with its own cache, above main memory; label: Invalidate.]

Cache line: basic unit of data transfer
Interleaved accesses cause cache invalidations


False Sharing is Everywhere

    me = 1; you = 1;                  // globals
    me = new Foo; you = new Bar;      // heap
    class X { int me; int you; };     // fields
    array[me] = 12; array[you] = 13;  // array indices


False Sharing is Hard to Diagnose

Multiple experts worked together to diagnose a MySQL scalability issue (1.5M LOC)


Problems of Existing Tools

• No precise information/false positives
  – WIBA’09, VEE’11, EuroSys’13, SC’13
• Accurate & Precise
  – OOPSLA’11 (cannot detect read-write false sharing)

Shared problem: only detect observed false sharing


False Sharing Causes Performance Problems

Interleaved accesses → Cache invalidations → Performance problems

Detect false sharing causing performance problems
Find cache lines with many cache invalidations


Find Lines with Many Invalidations

[Figure: memory (globals, heap) divided into cache lines.]

Track cache invalidations on each cache line


Track Cache Invalidations

• Hardware-based approach
  – Needs hardware support
  – No portability
• Simulation-based approach
  – Needs hardware info such as cache hierarchy, cache capacity
  – Very slow
• Conservative Assumptions
  – Each thread runs on a different core with its private cache.
  – Infinite cache capacity.
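The two conservative assumptions above can be turned into a very small counting model. The sketch below is only an illustration of that idea, not PREDATOR's actual data structure: the names `line_state`, `record_access`, and `MAX_THREADS` are hypothetical. Because every thread is assumed to have its own private cache, a write by one thread invalidates any copy held by another thread; because capacity is assumed infinite, a copy is never lost to eviction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 64  /* hypothetical limit so the holder set fits in one mask */

/* Hypothetical per-cache-line state; not PREDATOR's real bookkeeping. */
typedef struct {
    uint64_t holders;        /* bit t set: thread t's private cache holds a copy */
    unsigned invalidations;  /* writes that invalidated another thread's copy */
} line_state;

/* Record one access by thread `tid` to this cache line.
 * Returns true if the access caused a cache invalidation. */
bool record_access(line_state *line, unsigned tid, bool is_write)
{
    assert(tid < MAX_THREADS);
    uint64_t self = UINT64_C(1) << tid;

    if (!is_write) {
        /* A read only adds a copy to the reader's cache. Under the
         * infinite-capacity assumption the copy is never evicted. */
        line->holders |= self;
        return false;
    }

    /* A write invalidates every copy held by a different thread. */
    bool invalidated = (line->holders & ~self) != 0;
    if (invalidated)
        line->invalidations++;
    line->holders = self;  /* only the writer keeps a valid copy */
    return invalidated;
}
```

In this model, two threads that alternate writes to the same line produce one invalidation per write after the first, while repeated writes by a single thread produce none. Cache lines whose counter grows large are the candidates for false sharing.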
PREDATOR: tracks invalidations based on the memory access history of each cache line


Track Cache Invalidations

Each Entry: { Thread ID, Access Type }

[Figure: reads (r) and writes (w) by threads T1 and T2 over time, the recorded history entries, and the resulting # of invalidations.]


PREDATOR Components

• Compiler Instrumentation: instruments every memory read/write access
• Runtime System: collects memory accesses and reports false sharing


Detect Problems Correctly & Precisely

[Figure: false sharing (Tasks 1 to 4) versus true sharing (Tasks 1 and 2).]

• Correctly
  – No false alarms
  – Track memory accesses on each word
• Precisely
  – Global variables
  – Heap objects: pinpoint the line of memory allocation


PREDATOR’s Report


Why do we need prediction?


Necessity of False Sharing Prediction

[Figure: Thread 1 and Thread 2 accessing data placed across Cache line 1 and Cache line 2; some placements are labeled "False Sharing".]


Properties Affecting False Sharing Occurrence

• Change of memory layout
  – 32-bit platform versus 64-bit platform
  – Different memory allocator
  – Different compiler or optimization
  – Different allocation order by changing the code, e.g., printf
• Run on hardware with different cache line size


Example of False Sharing Sensitivity

Cache line size = 64 bytes

[Figure: memory layouts with Offset = 0, Offset = 8, ..., Offset = 56; colors represent threads.]

[Figure: Runtime (Seconds), axis from 0 to 6.]

PREDATOR predicts false sharing problems even when they do not occur in the current execution


Prediction Based on Virtual Cache Lines

[Figure: Thread 1 and Thread 2. Real case: Cache line 1, Cache line 2. Prediction 1: Virtual cache line 1, Virtual cache line 2, labeled "False Sharing". Prediction 2: Virtual cache line 1, labeled "False Sharing".]


Track Invalidations on Virtual Cache Lines

[Figure: accesses X and Y at distance d. The tracked virtual line extends (sz-d)/2 on each side of the pair; the neighboring virtual lines are not tracked.]

Conditions: d < sz, where sz is the cache line size; (X, Y) from different threads && one of them is a write


Benchmark Results

Benchmark           Without Prediction   With Prediction   Improvement
Histogram           ✔                    ✔                 46%
Linear_regression                        ✔                 1207%
Reverse_index       ✔                    ✔                 0.09%
Word_count          ✔                    ✔                 0.14%
Streamcluster-1     ✔                    ✔                 4.77%
Streamcluster-2     ✔                    ✔                 7.52%

Unknown Problem: ✔ for two of the benchmarks (the rows are not recoverable from this text)


Real Applications Results

• MySQL
  – Problem: False sharing occurs when different threads update the shared bitmap simultaneously.
  – Performance improves 180% after fixes.
• Boost library
  – Problem: “there will be 16 spinlocks per cache line”
  – Performance improves about 100%.


Performance Overhead of PREDATOR

[Figure: Execution Time Overhead. Normalized runtime (axis ticks 0 to 15, additional labels 23 and 26) for Original, PREDATOR-NP, and PREDATOR; annotated 5.6X.]


Detailed Prediction Algorithm

1. Find suspected cache lines
2. Track detailed memory accesses
3. Predict based on hot accesses: if d < sz and (X, Y) are from different threads, there is potential false sharing
4. Track cache invalidations on the virtual line, which extends (sz-d)/2 on each side of the pair (X, Y)
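Steps 3 and 4 can be sketched in a few lines of C. This is an illustration under stated assumptions rather than PREDATOR's implementation: the type and function names (`hot_access`, `virtual_line`, `may_falsely_share`, `tracked_virtual_line`) are hypothetical, each access is treated as a single address, and d is taken to be the distance between the two addresses. The check combines the conditions given on the slides (d < sz, different threads, at least one write), and the virtual line is placed so that (sz-d)/2 bytes lie before the lower access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One hot access. All names here are hypothetical, not PREDATOR's API. */
typedef struct {
    uintptr_t addr;   /* address of the accessed word */
    unsigned tid;     /* accessing thread */
    bool is_write;    /* true for a write, false for a read */
} hot_access;

/* A virtual cache line covering the bytes in [start, end). */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} virtual_line;

/* Step 3: X and Y are a potential false sharing pair if they are closer
 * than one cache line (d < sz), come from different threads, and at
 * least one of them is a write. */
bool may_falsely_share(hot_access x, hot_access y, size_t sz)
{
    uintptr_t d = x.addr > y.addr ? x.addr - y.addr : y.addr - x.addr;
    return d < sz && x.tid != y.tid && (x.is_write || y.is_write);
}

/* Step 4: place a virtual line of sz bytes around the pair, leaving
 * (sz - d) / 2 bytes before the lower access. */
virtual_line tracked_virtual_line(hot_access x, hot_access y, size_t sz)
{
    uintptr_t lo = x.addr < y.addr ? x.addr : y.addr;
    uintptr_t hi = x.addr < y.addr ? y.addr : x.addr;
    uintptr_t d = hi - lo;
    assert(d < sz);                      /* caller checked may_falsely_share */

    uintptr_t pad = (sz - d) / 2;
    virtual_line v;
    v.start = lo > pad ? lo - pad : 0;   /* clamp at address 0 */
    v.end = v.start + sz;
    return v;
}
```

As an example, with sz = 64, a write by thread 1 at 0x1038 and a read by thread 2 at 0x1048 fall on different real cache lines, so no false sharing is observed. They are only 16 bytes apart, however, so the pair qualifies, and the tracked virtual line is [0x1020, 0x1060). Invalidations can then be counted on that virtual line in the same way as on a real one.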