Asymmetry-aware execution placement on manycore chips
Alexey Tumanov
Joshua Wise, Onur Mutlu, Greg Ganger
CARNEGIE MELLON UNIVERSITY
Introduction: Core Scaling
•  Moore’s Law continues: can still fit more transistors on a chip
Introduction: Memory Controllers
[Figure: manycore grid of cores with memory controllers (MC)]
•  Multi-core, multi-MC chips are gaining prevalence
•  But memory access latency is no longer:
  •  uniform memory access (UMA)
  •  hierarchically non-uniform memory access (NUMA)
NUMA
•  NUMA: symmetrically non-uniform
[Figure: access latency to the bottom-left MC from all cores through a crossbar: cores in the local quadrant see latency x, all other cores see 2x]
ANUMA
•  ANUMA: asymmetrically non-uniform
[Figure: access latency to the bottom-left MC from all cores; latency varies with each core's distance to the MC]
NUMA vs. ANUMA
•  Latency & bandwidth asymmetries in multi-MC manycore NoC systems differ in kind (the two latency models are contrasted in the sketch below)
  •  [NUMA] sharp division into local vs. remote access that dwarfs other intra-node asymmetries
  •  [ANUMA] gradual continuum of access latencies
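A minimal, illustrative contrast between the two models (the constants and the Manhattan-distance assumption for ANUMA are illustrative assumptions, not figures from the talk):

```c
#include <stdio.h>
#include <stdlib.h>

/* NUMA: only local vs. remote matters; remote is modeled here as 2x
 * the local latency, matching the x / 2x annotations on the NUMA slide. */
static double numa_latency(int core_node, int mc_node, double local_lat)
{
    return (core_node == mc_node) ? local_lat : 2.0 * local_lat;
}

/* ANUMA: latency grows gradually with the core's Manhattan distance
 * to the MC on the on-chip network (illustrative model). */
static double anuma_latency(int cx, int cy, int mcx, int mcy,
                            double base_lat, double per_hop_lat)
{
    int hops = abs(cx - mcx) + abs(cy - mcy);
    return base_lat + hops * per_hop_lat;
}

int main(void)
{
    printf("NUMA  remote access: %.1f\n", numa_latency(0, 1, 10.0));
    printf("ANUMA 8-hop access:  %.1f\n",
           anuma_latency(5, 3, 0, 0, 10.0, 2.0));
    return 0;
}
```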
Problem Statement
•  MC access asymmetries can result in suboptimal placement w.r.t. MC access patterns
Motivating Experiments
•  “Bare metal” latency differential ~ 14%
•  does it matter?
•  GCC: 1 thread, 1 MC, 1 core at a time, Tilera Linux
[Figure: GCC runtime (µs) per core placement and vs. Manhattan distance to the MC; roughly a 14% runtime spread between the nearest and farthest placements (y-axis starts near 1.1e+07 µs, not at 0)]
Motivation
•  Latency differential gets much worse
•  14% figure is NOT an upper bound
•  Major factor: contention on the NoC
•  NoC node and link contention slows down packets
•  amplifying effect on “bare metal” asymmetries
•  Example: up to 100% latency differential observed in a simulator
Exploring Solution Space: NUMA
•  Designate grid quadrants as NUMA zones (zone mapping sketched below)
(+) a well-explored approach
(+) kernel code readily available
(-) lacks flexibility
(-) no contention avoidance
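A toy version of the quadrant-to-zone mapping this option implies; the 8x8 grid size and the zone numbering are assumptions made only for illustration:

```c
#include <stdio.h>

#define GRID_DIM 8   /* assumed 8x8 tile grid, one MC per quadrant */

/* Map a tile coordinate to the NUMA zone of its quadrant. */
static int numa_zone(int x, int y)
{
    int east  = (x >= GRID_DIM / 2);
    int north = (y >= GRID_DIM / 2);
    return (north << 1) | east;        /* zones 0..3 */
}

int main(void)
{
    /* Print the zone map, top row of the grid first. */
    for (int y = GRID_DIM - 1; y >= 0; y--) {
        for (int x = 0; x < GRID_DIM; x++)
            printf("%d ", numa_zone(x, y));
        printf("\n");
    }
    return 0;
}
```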
Solution Space: data placement
•  OS allocates physical frames in a way that (one toy policy is sketched below):
•  balances the NoC contention
•  optimizes some objective function
•  maximizes throughput
(+) proactive
(-) no crystal ball: which pages will be hot?
(-) once allocated, can’t be easily undone
•  inter-MC page migration worsens NoC contention
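A minimal sketch of one possible policy in this family: back each newly allocated frame with the least-loaded MC to spread traffic across controllers. The counters and the policy itself are illustrative assumptions, not the mechanism from the talk.

```c
#include <stdio.h>

#define NUM_MCS 4

/* Frames handed out per MC so far (illustrative bookkeeping). */
static unsigned long frames_per_mc[NUM_MCS];

/* Return the MC that should back the next allocated frame:
 * pick the controller with the fewest frames so far. */
static int pick_mc_for_next_frame(void)
{
    int best = 0;
    for (int i = 1; i < NUM_MCS; i++)
        if (frames_per_mc[i] < frames_per_mc[best])
            best = i;
    frames_per_mc[best]++;
    return best;
}

int main(void)
{
    for (int i = 0; i < 8; i++)
        printf("frame %d -> MC%d\n", i, pick_mc_for_next_frame());
    return 0;
}
```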
Solution Space: thread placement
•  Move execution closer to MCs used
(-) waits for the problem to occur before acting
(-) cache thrashing
(+) better-informed decisions are possible
•  MC usage statistics collection
Solution Space: hybrid
•  Combine execution and data placement
•  allocate frames, then adjust execution placement
•  place execution, then adjust data placement
(+) Corrective action possible
(-) Complexity
MC Usage Stats Collection
•  Interface (user-space usage sketched below):
•  procfs control handle to start/stop collection
•  procfs data file for stats data extraction
– pages accessed per MC per sampling cycle
•  Implementation
  •  instrumented the VM subsystem to force page faults
  •  each page fault is attributed to the corresponding MCi
  •  very inefficient, but it does work
  •  adjustable sampling frequency (20 Hz)
    – trades off overhead vs. accuracy
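A hedged user-space sketch of how the procfs interface above might be driven. The paths (/proc/mcstat/control, /proc/mcstat/data) and the per-line "MC<i> <pages>" format are hypothetical stand-ins, since the talk does not give them:

```c
#include <stdio.h>
#include <unistd.h>

static int write_control(const char *cmd)
{
    FILE *f = fopen("/proc/mcstat/control", "w");   /* hypothetical path */
    if (!f)
        return -1;
    fputs(cmd, f);
    fclose(f);
    return 0;
}

int main(void)
{
    if (write_control("start") < 0) {
        perror("start collection");
        return 1;
    }

    sleep(1);    /* let a few 20 Hz sampling cycles elapse */

    FILE *data = fopen("/proc/mcstat/data", "r");   /* hypothetical path */
    if (data) {
        int mc;
        unsigned long pages;
        /* hypothetical format: one "MC<i> <pages>" line per controller */
        while (fscanf(data, "MC%d %lu\n", &mc, &pages) == 2)
            printf("MC%d: %lu pages this cycle\n", mc, pages);
        fclose(data);
    }

    write_control("stop");
    return 0;
}
```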
Placement Algorithm
•  Place threads nearest to the MCs they use
  •  greedily, in order of memory intensity
•  Weighted geometric placement algorithm (see the sketch below)
  •  calculate each thread's desired coordinates as a function of
    – MC coordinates
    – the thread's per-MC access count vector (L1-normalized)
•  Placement collisions are resolved by
  •  finding second choices in the direction of the preferred MCs
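A minimal sketch of the weighted-coordinate step, read as a weighted centroid of MC coordinates. The centroid interpretation, the MC layout, and the sample counts are assumptions for illustration; the greedy ordering and collision resolution are not shown.

```c
#include <stdio.h>

#define NUM_MCS 4

struct point { double x, y; };

/* MC locations on the tile grid (illustrative values). */
static const struct point mc_coord[NUM_MCS] = {
    {0.0, 0.0}, {7.0, 0.0}, {0.0, 7.0}, {7.0, 7.0}
};

/* accesses[i] = pages touched on MC i during the last sampling cycle. */
static struct point desired_tile(const unsigned long accesses[NUM_MCS])
{
    unsigned long total = 0;
    struct point p = {0.0, 0.0};

    for (int i = 0; i < NUM_MCS; i++)
        total += accesses[i];
    if (total == 0)
        return mc_coord[0];          /* no samples: fall back to any MC */

    for (int i = 0; i < NUM_MCS; i++) {
        double w = (double)accesses[i] / (double)total;  /* L1 normalization */
        p.x += w * mc_coord[i].x;
        p.y += w * mc_coord[i].y;
    }
    return p;                         /* round to the nearest free tile next */
}

int main(void)
{
    unsigned long accesses[NUM_MCS] = {120, 30, 30, 20};
    struct point p = desired_tile(accesses);
    printf("desired tile: (%.1f, %.1f)\n", p.x, p.y);
    return 0;
}
```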
Bandwidth Performance Results
[Figure: bandwidth results for the Base, Overhead, Weighted, and Brute placement configurations; y-axis spans roughly 250 to 550]
Runtime Performance Results
[Figure: bench2 runtime (µs): minimums, medians, and maximums across runs for 1, 2, 4, 8, and 16 threads under the different placement configurations; y-axis spans 4e+06 to 1.3e+07 µs]
Takeaways
•  Promising, but stat collection overhead is high
•  Call for per-core MC stats counters in hardware
•  Works despite high cost of migration
Future Work
•  More experimentation to understand benefits
•  as a function of application properties
•  heterogeneous workloads
•  ANUMCA
•  exploring how these same ideas apply to cache
•  cache access is asymmetrically non-uniform too
•  Optimizing the application-level goal
•  SLAs, multi-threaded app. performance goals
Conclusion
•  Promising, but stat collection overhead is high
•  Call for per-core MC stats counters
•  Works despite high cost of migration
•  Questions?