Asymmetry-Aware Execution Placement on Manycore Chips
Alexey Tumanov, Joshua Wise, Onur Mutlu, Greg Ganger
Carnegie Mellon University
http://www.pdl.cmu.edu/ Alexey Tumanov © April 2013

Introduction: Core Scaling
• Moore's Law continues: we can still fit more transistors on a chip

Introduction: Memory Controllers
• Multi-core, multi-MC chips are gaining prevalence
• But memory access latency is now neither:
  • uniform (UMA), nor
  • hierarchically non-uniform (NUMA)

NUMA
• NUMA: symmetrically non-uniform
[Figure: bottom-left MC access from all cores through a crossbar; latencies fall into two symmetric classes (x for near cores, 2x for far cores)]

ANUMA
• ANUMA: asymmetrically non-uniform
[Figure: bottom-left MC access from all cores on a mesh of cores and MCs; latency varies gradually with each core's position]

NUMA vs. ANUMA
• Latency and bandwidth asymmetries in multi-MC manycore NoC systems are different in kind:
  • [NUMA] sharp division into local/remote that dwarfs other intra-node asymmetries
  • [ANUMA] gradual continuum of access latencies

Problem Statement
• MC access asymmetries can result in suboptimal placement w.r.t. MC access patterns

Motivating Experiments
• "Bare metal" latency differential ~14%: does it matter?
• GCC: 1 thread, 1 MC, 1 core at a time, Tile Linux
[Figure: GCC runtime (µs) vs. Manhattan distance to the MC; runtime grows with distance, up to a 14% differential]
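The NUMA-vs-ANUMA contrast above can be made concrete with a small sketch. The mesh size, MC position, and latency constants below are illustrative assumptions (not TILEPro measurements); the point is only that a hop-distance model yields a continuum of latencies while a zone model yields two classes.

```python
# Illustrative sketch: NUMA's two latency classes vs. ANUMA's continuum
# on a hypothetical 8x8 mesh with one memory controller at the
# bottom-left corner (0, 0). Latency constants are made up.

GRID = 8
MC = (0, 0)  # bottom-left memory controller

def anuma_latency(core, mc=MC, base=10, per_hop=1):
    """ANUMA: latency grows with Manhattan (hop) distance to the MC."""
    hops = abs(core[0] - mc[0]) + abs(core[1] - mc[1])
    return base + per_hop * hops

def numa_latency(core, local=10, remote=20):
    """NUMA: sharp local/remote split (same quadrant as the MC or not)."""
    local_quadrant = core[0] < GRID // 2 and core[1] < GRID // 2
    return local if local_quadrant else remote

cores = [(x, y) for x in range(GRID) for y in range(GRID)]
anuma = sorted({anuma_latency(c) for c in cores})
numa = sorted({numa_latency(c) for c in cores})

print("distinct ANUMA latencies:", anuma)  # a gradual continuum
print("distinct NUMA latencies:", numa)    # just two classes
```

Under this toy model the ANUMA mesh exhibits 15 distinct latencies (hop counts 0 through 14) where the NUMA zone model exhibits only two, which is exactly the "continuum vs. sharp division" distinction the slides draw.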
Motivation
• The latency differential gets much worse in practice:
  • the 14% figure is NOT an upper bound
• Major factor: contention on the NoC
  • NoC node and link contention slows down packets
  • amplifying effect on "bare metal" asymmetries
• Example: up to a 100% latency differential in a simulator

Exploring Solution Space: NUMA
• Designate grid quadrants as NUMA zones
(+) a well-explored approach
(+) kernel code readily available
(−) lacks flexibility
(−) no contention avoidance

Solution Space: Data Placement
• OS allocates physical frames in a way that:
  • balances NoC contention
  • optimizes some objective function, e.g. maximizes throughput
(+) proactive
(−) no crystal ball: which pages will be hot?
(−) once allocated, can't easily be undone
  • inter-MC page migration worsens NoC contention

Solution Space: Thread Placement
• Move execution closer to the MCs it uses
(−) waits for the problem to occur before acting
(−) cache thrashing on migration
(+) better-informed decisions are possible
  • via MC usage statistics collection

Solution Space: Hybrid
• Combine execution and data placement:
  • allocate frames, then adjust execution placement
  • place execution, then adjust data placement
(+) corrective action possible
(−) complexity

MC Usage Stats Collection
• Interface:
  • procfs control handle to start/stop collection
  • procfs data file for stats extraction
    – pages accessed per MC per sampling cycle
• Implementation:
  • instrumented VM subsystem to force page faults
  • each page fault is attributed to the corresponding MC
  • very inefficient, but it works
  • adjustable sampling frequency (20 Hz)
    – trades off overhead vs. accuracy
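The bookkeeping behind the stats collection described above can be sketched in user-space terms. The real mechanism instruments the kernel VM to take page faults; this sketch models only the accounting: attributing each sampled page access to the MC that owns the page, then L1-normalizing the per-thread access-count vector. The names (`page_to_mc`, `record_fault`) and the frame-interleaving assumption are hypothetical, not the kernel implementation.

```python
# Sketch of the pages-accessed-per-MC accounting (user-space model of
# the kernel-side stats collection; names and interleaving are assumed).
from collections import defaultdict

NUM_MCS = 4

def page_to_mc(page_frame):
    # Assumption: physical frames are interleaved across MCs.
    return page_frame % NUM_MCS

class MCStats:
    """Per-thread pages-accessed-per-MC counters for one sampling cycle."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0] * NUM_MCS)

    def record_fault(self, tid, page_frame):
        # Called once per instrumented page fault.
        self.counts[tid][page_to_mc(page_frame)] += 1

    def access_vector(self, tid):
        # L1-normalized access-count vector, as fed to the placement step.
        c = self.counts[tid]
        total = sum(c) or 1
        return [x / total for x in c]

stats = MCStats()
for frame in [0, 1, 1, 2, 5, 5, 5, 9]:
    stats.record_fault(tid=42, page_frame=frame)
print(stats.access_vector(42))
```

The normalized vector is what makes threads with different absolute fault rates comparable when the placement algorithm weighs MC coordinates.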
Placement Algorithm
• Place threads nearest to the MCs they use
  • greedily, in order of memory intensity
• Weighted geometric placement algorithm:
  • calculate a thread's desired coordinates as a function of
    – MC coordinates
    – the thread's access-count vector (L1-normalized)
• Placement collisions are resolved
  • by finding second choices in the direction of the preferred MCs

Bandwidth Performance Results
[Figure: bar chart of bandwidth results (min/median/max over runs) for the Base, Overhead, Weighted, and Brute-force placements, shown for two workload configurations; y-axis spans roughly 250–550]
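The weighted geometric placement step described earlier can be sketched as follows. A thread's desired coordinates are the average of the MC coordinates weighted by its L1-normalized access counts, snapped to the nearest core; on a collision, this sketch steps one hop toward the thread's most-used MC. The grid size, corner MC positions, and the exact collision fallback here are illustrative assumptions, not the paper's implementation.

```python
# Sketch of weighted geometric placement (assumed 8x8 grid, corner MCs).
GRID = 8
MCS = [(0, 0), (0, GRID - 1), (GRID - 1, 0), (GRID - 1, GRID - 1)]

def desired_coords(access_counts):
    """Weighted average of MC coordinates by L1-normalized access counts,
    rounded to the nearest core coordinate."""
    total = sum(access_counts) or 1
    weights = [c / total for c in access_counts]
    x = sum(w * mc[0] for w, mc in zip(weights, MCS))
    y = sum(w * mc[1] for w, mc in zip(weights, MCS))
    return (round(x), round(y))

def place(threads):
    """Greedily place threads in order of memory intensity; on collision,
    step one hop toward the thread's most-used MC (a simplified
    second-choice rule)."""
    occupied = set()
    placement = {}
    for tid, counts in sorted(threads.items(), key=lambda kv: -sum(kv[1])):
        x, y = desired_coords(counts)
        preferred = MCS[max(range(len(counts)), key=counts.__getitem__)]
        while (x, y) in occupied:
            dx = (preferred[0] > x) - (preferred[0] < x)
            dy = (preferred[1] > y) - (preferred[1] < y)
            if dx == 0 and dy == 0:
                dx = 1 if x < GRID - 1 else -1  # fallback: step aside
            x, y = x + dx, y + dy
        occupied.add((x, y))
        placement[tid] = (x, y)
    return placement

threads = {"t0": [120, 0, 0, 0], "t1": [90, 10, 0, 0]}
print(place(threads))
```

Note how a thread touching mostly one MC lands beside it, while mixed access vectors pull the desired point into the interior of the grid, which is the geometric intuition behind the algorithm.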
Runtime Performance Results
[Figure: bench2 runtime (µs), minimums/medians/maximums, for 1, 2, 4, 8, and 16 threads under each placement policy (Br, BN, O0, WG); y-axis spans roughly 4e6–1.3e7 µs]

Takeaways
• Promising, but stats collection overhead is high
  • a call for per-core MC stats counters in hardware
• Works despite the high cost of migration

Future Work
• More experimentation to understand the benefits
  • as a function of application properties
  • heterogeneous workloads
• ANUMCA
  • exploring how these same ideas apply to caches
  • cache access is asymmetrically non-uniform too
• Optimizing for application-level goals
  • SLAs, multi-threaded application performance goals

Conclusion
• Promising, but stats collection overhead is high
  • a call for per-core MC stats counters
• Works despite the high cost of migration
• Questions?

References
[1] Alexey Tumanov, Joshua Wise, Onur Mutlu, Greg Ganger. "Asymmetry-Aware Execution Placement on Manycore Chips". In Proc. of SFMA'13, EuroSys 2013, Czech Republic.
[2] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. "Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems". In Proc. of HPCA 2013.
[3] Tilera Corporation. "The Processor Architecture Overview for the TILEPro Series", Feb 2013.