Asymmetry-Aware Execution Placement on Manycore Chips
Alexey Tumanov, Joshua Wise, Onur Mutlu, Greg Ganger
Carnegie Mellon University
http://www.pdl.cmu.edu/ Alexey Tumanov © April 2013

Introduction: Core Scaling
• Moore's Law continues: we can still fit more transistors on a chip

Introduction: Memory Controllers
• Multi-core, multi-MC chips are gaining prevalence
• But memory access latency is now neither:
  • uniform (UMA), nor
  • hierarchically non-uniform (NUMA)

NUMA
• NUMA: symmetrically non-uniform
[Figure: bottom-left MC access from all cores through a crossbar; latencies fall into two symmetric classes (x for near cores, 2x for far cores)]

ANUMA
• ANUMA: asymmetrically non-uniform
[Figure: bottom-left MC access from all cores on a mesh of cores and MCs; latency varies gradually with each core's position]

NUMA vs. ANUMA
• Latency and bandwidth asymmetries in multi-MC manycore NoC systems are different in kind:
  • [NUMA] sharp division into local/remote that dwarfs other intra-node asymmetries
  • [ANUMA] gradual continuum of access latencies

Problem Statement
• MC access asymmetries can result in suboptimal placement w.r.t. MC access patterns

Motivating Experiments
• "Bare metal" latency differential ~14%: does it matter?
• GCC: 1 thread, 1 MC, 1 core at a time, Tile Linux
[Figure: GCC runtime (µs) vs. Manhattan distance to the MC; runtime grows with distance, up to a 14% differential]
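The NUMA-vs-ANUMA contrast above can be made concrete with a small sketch. The mesh size, MC position, and latency constants below are illustrative assumptions (not TILEPro measurements); the point is only that a hop-distance model yields a continuum of latencies while a zone model yields two classes.

```python
# Illustrative sketch: NUMA's two latency classes vs. ANUMA's continuum
# on a hypothetical 8x8 mesh with one memory controller at the
# bottom-left corner (0, 0). Latency constants are made up.

GRID = 8
MC = (0, 0)  # bottom-left memory controller

def anuma_latency(core, mc=MC, base=10, per_hop=1):
    """ANUMA: latency grows with Manhattan (hop) distance to the MC."""
    hops = abs(core[0] - mc[0]) + abs(core[1] - mc[1])
    return base + per_hop * hops

def numa_latency(core, local=10, remote=20):
    """NUMA: sharp local/remote split (same quadrant as the MC or not)."""
    local_quadrant = core[0] < GRID // 2 and core[1] < GRID // 2
    return local if local_quadrant else remote

cores = [(x, y) for x in range(GRID) for y in range(GRID)]
anuma = sorted({anuma_latency(c) for c in cores})
numa = sorted({numa_latency(c) for c in cores})

print("distinct ANUMA latencies:", anuma)  # a gradual continuum
print("distinct NUMA latencies:", numa)    # just two classes
```

Under this toy model the ANUMA mesh exhibits 15 distinct latencies (hop counts 0 through 14) where the NUMA zone model exhibits only two, which is exactly the "continuum vs. sharp division" distinction the slides draw.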
Motivation
• The latency differential gets much worse in practice:
  • the 14% figure is NOT an upper bound
• Major factor: contention on the NoC
  • NoC node and link contention slows down packets
  • amplifying effect on "bare metal" asymmetries
• Example: up to a 100% latency differential in a simulator

Exploring Solution Space: NUMA
• Designate grid quadrants as NUMA zones
(+) a well-explored approach
(+) kernel code readily available
(−) lacks flexibility
(−) no contention avoidance

Solution Space: Data Placement
• OS allocates physical frames in a way that:
  • balances NoC contention
  • optimizes some objective function, e.g. maximizes throughput
(+) proactive
(−) no crystal ball: which pages will be hot?
(−) once allocated, can't easily be undone
  • inter-MC page migration worsens NoC contention

Solution Space: Thread Placement
• Move execution closer to the MCs it uses
(−) waits for the problem to occur before acting
(−) cache thrashing on migration
(+) better-informed decisions are possible
  • via MC usage statistics collection

Solution Space: Hybrid
• Combine execution and data placement:
  • allocate frames, then adjust execution placement
  • place execution, then adjust data placement
(+) corrective action possible
(−) complexity

MC Usage Stats Collection
• Interface:
  • procfs control handle to start/stop collection
  • procfs data file for stats extraction
    – pages accessed per MC per sampling cycle
• Implementation:
  • instrumented VM subsystem to force page faults
  • each page fault is attributed to the corresponding MC
  • very inefficient, but it works
  • adjustable sampling frequency (20 Hz)
    – trades off overhead vs. accuracy
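The bookkeeping behind the stats collection described above can be sketched in user-space terms. The real mechanism instruments the kernel VM to take page faults; this sketch models only the accounting: attributing each sampled page access to the MC that owns the page, then L1-normalizing the per-thread access-count vector. The names (`page_to_mc`, `record_fault`) and the frame-interleaving assumption are hypothetical, not the kernel implementation.

```python
# Sketch of the pages-accessed-per-MC accounting (user-space model of
# the kernel-side stats collection; names and interleaving are assumed).
from collections import defaultdict

NUM_MCS = 4

def page_to_mc(page_frame):
    # Assumption: physical frames are interleaved across MCs.
    return page_frame % NUM_MCS

class MCStats:
    """Per-thread pages-accessed-per-MC counters for one sampling cycle."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0] * NUM_MCS)

    def record_fault(self, tid, page_frame):
        # Called once per instrumented page fault.
        self.counts[tid][page_to_mc(page_frame)] += 1

    def access_vector(self, tid):
        # L1-normalized access-count vector, as fed to the placement step.
        c = self.counts[tid]
        total = sum(c) or 1
        return [x / total for x in c]

stats = MCStats()
for frame in [0, 1, 1, 2, 5, 5, 5, 9]:
    stats.record_fault(tid=42, page_frame=frame)
print(stats.access_vector(42))
```

The normalized vector is what makes threads with different absolute fault rates comparable when the placement algorithm weighs MC coordinates.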
Placement Algorithm
• Place threads nearest to the MCs they use
  • greedily, in order of memory intensity
• Weighted geometric placement algorithm:
  • calculate a thread's desired coordinates as a function of
    – MC coordinates
    – the thread's access-count vector (L1-normalized)
• Placement collisions are resolved
  • by finding second choices in the direction of the preferred MCs

Bandwidth Performance Results
[Figure: bar chart of bandwidth results (min/median/max over runs) for the Base, Overhead, Weighted, and Brute-force placements, shown for two workload configurations; y-axis spans roughly 250–550]
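The weighted geometric placement step described earlier can be sketched as follows. A thread's desired coordinates are the average of the MC coordinates weighted by its L1-normalized access counts, snapped to the nearest core; on a collision, this sketch steps one hop toward the thread's most-used MC. The grid size, corner MC positions, and the exact collision fallback here are illustrative assumptions, not the paper's implementation.

```python
# Sketch of weighted geometric placement (assumed 8x8 grid, corner MCs).
GRID = 8
MCS = [(0, 0), (0, GRID - 1), (GRID - 1, 0), (GRID - 1, GRID - 1)]

def desired_coords(access_counts):
    """Weighted average of MC coordinates by L1-normalized access counts,
    rounded to the nearest core coordinate."""
    total = sum(access_counts) or 1
    weights = [c / total for c in access_counts]
    x = sum(w * mc[0] for w, mc in zip(weights, MCS))
    y = sum(w * mc[1] for w, mc in zip(weights, MCS))
    return (round(x), round(y))

def place(threads):
    """Greedily place threads in order of memory intensity; on collision,
    step one hop toward the thread's most-used MC (a simplified
    second-choice rule)."""
    occupied = set()
    placement = {}
    for tid, counts in sorted(threads.items(), key=lambda kv: -sum(kv[1])):
        x, y = desired_coords(counts)
        preferred = MCS[max(range(len(counts)), key=counts.__getitem__)]
        while (x, y) in occupied:
            dx = (preferred[0] > x) - (preferred[0] < x)
            dy = (preferred[1] > y) - (preferred[1] < y)
            if dx == 0 and dy == 0:
                dx = 1 if x < GRID - 1 else -1  # fallback: step aside
            x, y = x + dx, y + dy
        occupied.add((x, y))
        placement[tid] = (x, y)
    return placement

threads = {"t0": [120, 0, 0, 0], "t1": [90, 10, 0, 0]}
print(place(threads))
```

Note how a thread touching mostly one MC lands beside it, while mixed access vectors pull the desired point into the interior of the grid, which is the geometric intuition behind the algorithm.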
Runtime Performance Results
[Figure: bench2 runtime (µs), minimums/medians/maximums, for 1, 2, 4, 8, and 16 threads under each placement policy (Br, BN, O0, WG); y-axis spans roughly 4e6–1.3e7 µs]

Takeaways
• Promising, but stats collection overhead is high
  • a call for per-core MC stats counters in hardware
• Works despite the high cost of migration

Future Work
• More experimentation to understand the benefits
  • as a function of application properties
  • heterogeneous workloads
• ANUMCA
  • exploring how these same ideas apply to caches
  • cache access is asymmetrically non-uniform too
• Optimizing for application-level goals
  • SLAs, multi-threaded application performance goals

Conclusion
• Promising, but stats collection overhead is high
  • a call for per-core MC stats counters
• Works despite the high cost of migration
• Questions?

References
[1] Alexey Tumanov, Joshua Wise, Onur Mutlu, Greg Ganger. "Asymmetry-Aware Execution Placement on Manycore Chips". In Proc. of SFMA'13, EuroSys 2013, Czech Republic.
[2] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. "Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems". In Proc. of HPCA 2013.
[3] Tilera Corporation. "The Processor Architecture Overview for the TILEPro Series", Feb 2013.