HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu

Executive Summary
Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance.
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion.
1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance.
⇒ Key idea 1: Application-aware source throttling
2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload.
⇒ Key idea 2: Dynamic throttling rate adjustment
Result: HAT improves both system performance and energy efficiency over state-of-the-art source throttling policies.

Outline
•  Background and Motivation
•  Mechanism
•  Comparison Points
•  Results
•  Conclusions

On-Chip Networks
[Figure: processing elements (PEs) connected by an interconnect]
[Figure: 2D mesh of PEs, one router (R) per PE]
•  Connect cores, caches, memory controllers, etc.
   –  Buses and crossbars are not scalable
•  Packet switched
•  2D mesh: the most commonly used topology
•  Primarily serve cache misses and memory requests
PE = Processing Element (cores, L2 banks, memory controllers, etc.); R = router

Network Congestion Reduces Performance
[Figure: congested 2D mesh; packets (P) pile up in the routers]
Network congestion:
•  Limited shared resources (buffers and links)
   –  due to design constraints: power, chip area, and timing
⇒ degraded system performance

Goal
•  Improve system performance in a highly congested network
•  Observation: Reducing network load (the number of packets in the network) decreases network congestion and hence improves system performance
•  Approach: Source throttling (temporarily delaying new traffic injection) to reduce network load

Source Throttling
[Figure: without throttling, injected packets queue up at the network interfaces; network latency is long when the network is congested]
[Figure: with throttling, some nodes hold packets back before injection]
•  Throttling makes some packets wait longer to inject
•  Average network throughput increases, hence higher system performance

Key Questions of Source Throttling
•  Every cycle when a node has a packet to inject, source throttling blocks the packet with probability P
   –  We call P the "throttling rate" (it ranges from 0 to 1)
•  The throttling rate can be set independently on every node
Two key questions:
1.  Which applications to throttle?
2.  How much to throttle?
Naïve mechanism: throttle every single node with a constant throttling rate.
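To make this injection rule concrete, here is a minimal sketch in Python (ours, not the authors' simulator code; the function name and the random-number framing are illustrative):

    import random

    def allow_injection(throttling_rate: float, rng: random.Random) -> bool:
        # Block this cycle's injection attempt with probability equal to the
        # node's throttling rate (0 = never block, 1 = always block).
        return rng.random() >= throttling_rate

    # A node throttled at rate 0.8 injects on roughly 20% of its attempts.
    rng = random.Random(42)
    attempts = 10_000
    injected = sum(allow_injection(0.8, rng) for _ in range(attempts))
    print(f"injected on {injected} of {attempts} attempts")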
Key Observation #1
Throttling network-intensive applications leads to higher system performance.
Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive).
•  Throttling gromacs decreases system performance by 2%, due to minimal network load reduction.
•  Throttling mcf increases system performance by 9% (gromacs: +14%, mcf: +5%), due to reduced congestion.
•  Throttling network-intensive applications reduces congestion.
•  Throttling mcf reduces congestion and benefits both network-non-intensive and network-intensive applications.
•  gromacs benefits more from the lower network latency.

Key Observation #2
There is no single throttling rate that works well for every application workload.
[Figure: a gromacs-only workload and a gromacs+mcf workload, both throttled at a fixed rate of 0.8]
•  The network runs best at or below a certain network load.
•  Dynamically adjust the throttling rate to avoid both overload and under-utilization.

Outline
•  Background and Motivation
•  Mechanism
•  Comparison Points
•  Results
•  Conclusions

Heterogeneous Adaptive Throttling (HAT)
1.  Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications.
2.  Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases.

Application-Aware Throttling
1.  Measure applications' network intensity
    Use L1 MPKI (misses per thousand instructions) to estimate network intensity.
2.  Throttle network-intensive applications
    How to select the unthrottled applications?
    •  Leaving too many applications unthrottled overloads the network.
    ⇒ Select unthrottled applications so that their total network intensity is less than the total network capacity.
[Diagram: applications ranked by L1 MPKI; the lower-MPKI applications form the network-non-intensive (unthrottled) set, chosen so that Σ MPKI < threshold; the higher-MPKI, network-intensive applications are throttled]
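One plausible rendering of this selection rule as code (a sketch under our assumptions: the ascending-MPKI traversal, the function name, and the example numbers are ours, not the paper's):

    def select_unthrottled(app_mpki: dict[str, float], threshold: float) -> set[str]:
        # Walk applications from lowest to highest L1 MPKI and admit them to
        # the unthrottled set while the running MPKI sum stays below the
        # threshold; every remaining application is throttled.
        unthrottled, total = set(), 0.0
        for app, mpki in sorted(app_mpki.items(), key=lambda kv: kv[1]):
            if total + mpki < threshold:
                unthrottled.add(app)
                total += mpki
        return unthrottled

    # With a threshold of 50 MPKI, only the light applications stay unthrottled.
    apps = {"gromacs0": 5.0, "gromacs1": 6.0, "mcf0": 90.0, "mcf1": 95.0}
    print(select_unthrottled(apps, 50.0))  # {'gromacs0', 'gromacs1'}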
Dynamic Throttling Rate Adjustment
•  Different workloads require different throttling rates to avoid overloading the network.
•  But network load (the fraction of occupied buffers/links) is an accurate indicator of congestion.
•  Key idea: Measure the current network load and dynamically adjust the throttling rate based on that load:

    if network load > target:
        increase throttling rate
    else:
        decrease throttling rate

•  If the network is congested, throttle more; if the network is not congested, avoid unnecessary throttling.
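A runnable version of this adjustment loop might look as follows (a sketch; the step size, the 0.95 cap, and the function name are our assumptions rather than values from the paper):

    def adjust_throttling_rate(rate: float, load: float, target: float,
                               step: float = 0.02) -> float:
        # Congested: throttle more. Not congested: back off so nodes are not
        # throttled unnecessarily. Clamp so the rate stays a valid probability
        # strictly below 1 (a rate of 1.0 would block injection entirely).
        rate = rate + step if load > target else rate - step
        return min(0.95, max(0.0, rate))

    # Example: measured load above a 0.6 target nudges the rate upward.
    print(adjust_throttling_rate(0.80, load=0.70, target=0.60))  # ~0.82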
Epoch-Based Operation
•  Application classification and throttling rate adjustment are expensive if done every cycle.
•  Solution: recompute them at epoch granularity (epochs of 100K cycles).
[Timeline: measurements taken during the current 100K-cycle epoch feed the controller at the start of the next epoch]
During an epoch, every node:
1)  Measures L1 MPKI
2)  Measures network load
At the beginning of the next epoch, all nodes send the measured information to a central controller, which:
1)  Classifies applications
2)  Adjusts the throttling rate
3)  Sends the new classification and throttling rate to each node
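Combining the pieces, one way the epoch boundary could look in code (illustrative only; the Node fields and controller state are assumed interfaces, and select_unthrottled / adjust_throttling_rate are the sketches shown earlier):

    EPOCH_CYCLES = 100_000  # epoch length from the slide

    def on_epoch_boundary(nodes, state):
        # 1) Per-node measurements accumulated during the epoch just ended.
        mpki = {n.app_id: 1000.0 * n.l1_misses / max(1, n.instructions)
                for n in nodes}
        load = (sum(n.occupied_buffer_slots for n in nodes)
                / max(1, sum(n.total_buffer_slots for n in nodes)))
        # 2) Central controller: classify applications and retune the rate.
        unthrottled = select_unthrottled(mpki, state.capacity_threshold)
        state.rate = adjust_throttling_rate(state.rate, load, state.load_target)
        # 3) Send the new policy to every node and reset the epoch counters.
        for n in nodes:
            n.throttling_rate = 0.0 if n.app_id in unthrottled else state.rate
            n.reset_counters()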
Putting It Together: Key Contributions
1.  Application-aware throttling
    –  Throttle network-intensive applications based on the applications' network intensities
2.  Network-load-aware throttling rate adjustment
    –  Dynamically adjust the throttling rate based on network load, to avoid overloading the network
HAT is the first work to combine application-aware throttling with network-load-aware rate adjustment.

Outline
•  Background and Motivation
•  Mechanism
•  Comparison Points
•  Results
•  Conclusions

Comparison Points
•  Source throttling for bufferless NoCs [Nychis+ HotNets'10, SIGCOMM'12]
   –  Throttles network-intensive applications when other applications cannot inject
   –  Does not take network load into account
   –  We call this "Heterogeneous Throttling"
•  Source throttling for buffered networks [Thottethodi+ HPCA'01]
   –  Throttles every application when the network load exceeds a dynamically tuned threshold
   –  Not application-aware
   –  Fully blocks packet injections while throttling
   –  We call this "Self-Tuned Throttling"

Outline
•  Background and Motivation
•  Mechanism
•  Comparison Points
•  Results
•  Conclusions

Methodology
•  Chip multiprocessor simulator
   –  64-node multi-core systems with a 2D-mesh topology
   –  Closed-loop core/cache/NoC cycle-level model
   –  64KB L1; perfect L2 (always hits, to stress the NoC)
•  Router designs
   –  Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS'92]
      •  Input buffers hold contending packets
   –  Bufferless deflection router: BLESS [Moscibroda+ ISCA'09]
      •  Misroutes (deflects) contending packets
•  Workloads
   –  60 multi-core workloads of SPEC CPU2006 benchmarks
   –  4 workload categories based on the network intensity of the applications (mixes of Low/Medium/High intensity)
•  Metrics
   –  System performance: Weighted Speedup $= \sum_i \mathrm{IPC}^{\mathrm{shared}}_i / \mathrm{IPC}^{\mathrm{alone}}_i$
   –  Fairness: Maximum Slowdown $= \max_i \mathrm{IPC}^{\mathrm{alone}}_i / \mathrm{IPC}^{\mathrm{shared}}_i$
   –  Energy efficiency: $\mathrm{PerfPerWatt} = \mathrm{WeightedSpeedup} / \mathrm{Power}$
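For concreteness, the three metrics transcribe directly into code (the per-application IPC values in the example are made up):

    def weighted_speedup(ipc_shared, ipc_alone):
        # System performance: sum over applications of shared-run IPC
        # relative to alone-run IPC.
        return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

    def maximum_slowdown(ipc_shared, ipc_alone):
        # Fairness: the worst per-application slowdown (lower is better).
        return max(a / s for s, a in zip(ipc_shared, ipc_alone))

    def perf_per_watt(ipc_shared, ipc_alone, power_watts):
        # Energy efficiency: weighted speedup achieved per watt.
        return weighted_speedup(ipc_shared, ipc_alone) / power_watts

    # Two applications: one barely slowed down, one at 75% of its alone IPC.
    shared, alone = [0.9, 1.5], [1.0, 2.0]
    print(weighted_speedup(shared, alone))  # 0.9 + 0.75 = 1.65
    print(maximum_slowdown(shared, alone))  # max(1.111..., 1.333...) ≈ 1.33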
Weighted Speedup Performance: Bufferless NoC (BLESS)
[Chart: weighted speedup under No Throttling, Heterogeneous, and HAT across the workload categories (HL, HML, HM, H) and their average; the labeled HAT gains are +9.4%, +11.5%, +11.3%, +7.4%, and +8.6%]
1.  HAT provides a better performance improvement than state-of-the-art throttling approaches.
2.  The highest improvement is on heterogeneous workloads; L and M applications are more sensitive to network latency.

Weighted Speedup Performance: Buffered NoC
[Chart: weighted speedup under No Throttling, Self-Tuned, Heterogeneous, and HAT across the workload categories (HL, HML, HM, H, AVG); the labeled HAT gain at AVG is +3.5%]
HAT provides a better performance improvement than the prior approaches.

Application Fairness
[Charts: normalized maximum slowdown, lower is better; the BLESS panel compares No Throttling, Heterogeneous, and HAT (-15%), and the buffered panel compares No Throttling, Self-Tuned, Heterogeneous, and HAT (-5%)]
HAT provides better fairness than the prior works.

Network Energy Efficiency
[Chart: normalized performance per watt under No Throttling and HAT; +8.5% on BLESS and +5% on the buffered network]
HAT increases energy efficiency by reducing network load (by 4.7% on BLESS and 0.5% on the buffered network).

Other Results in the Paper
•  Performance on CHIPPER [Fallin+ HPCA'11]
   –  HAT improves system performance
•  Performance on multithreaded workloads
   –  HAT is not designed for multithreaded workloads, but it slightly improves system performance
•  Parameter sensitivity sweep of HAT
   –  HAT provides a consistent system performance improvement across different network sizes

Conclusions
Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance.
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion.
1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance.
⇒ Key idea 1: Application-aware source throttling
2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload.
⇒ Key idea 2: Dynamic throttling rate adjustment
Result: HAT improves both system performance and energy efficiency over state-of-the-art source throttling policies.

Throttling Rate Steps
[Figure: the throttling rate steps used by HAT]

Network Sizes
[Charts: weighted speedup (higher is better) and unfairness (lower is better) vs. network size (16, 64, 144 nodes) for the VC-buffered router, CHIPPER, and BLESS, each with and without HAT; the labeled HAT improvements range from 1.3% to 20.0%]

Performance on CHIPPER/BLESS
[Charts: weighted speedup (higher is better) and unfairness (lower is better) across the workload categories (HL, HML, HM, H, AVG) for CHIPPER, CHIPPER-HETERO., CHIPPER-HAT, BLESS, BLESS-HETERO., and BLESS-HAT]
Injection rate (number of injected packets per cycle): +8.5% (BLESS) or +4.8% (CHIPPER).

Multithreaded Workloads
[Chart: results on multithreaded workloads]

Motivation
[Chart: weighted speedup vs. throttling rate (80% to 100%) for two workloads; the annotated best-performing rates are 90% and 94%, one per workload]

Sensitivity to Other Parameters
•  Network load target: weighted speedup peaks between 50% and 65% network utilization, and drops by more than 10% beyond that.
•  Epoch length: weighted speedup varies by less than 1% between 20K and 1M cycles.
•  Low-intensity workloads: HAT does not impact system performance.
•  Unthrottled network intensity threshold:

Implementation
•  L1 MPKI: one L1-miss hardware counter and one instruction hardware counter per node.
•  Network load: one hardware counter to monitor the number of flits routed to neighboring nodes.
•  Computation: done at one central CPU node.
   –  At most several thousand cycles per epoch (a few percent overhead for a 100K-cycle epoch).
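The MPKI estimate itself is just a ratio of those two counters (a trivial sketch; the function name is ours):

    def l1_mpki(l1_miss_counter: int, instruction_counter: int) -> float:
        # L1 misses per thousand instructions, from the two per-node counters.
        return 1000.0 * l1_miss_counter / max(1, instruction_counter)

    # Example: 45,000 misses over 6,000,000 instructions -> 7.5 MPKI.
    print(l1_mpki(45_000, 6_000_000))  # 7.5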