HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu

Executive Summary
Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion
1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
→ Key idea 1: Application-aware source throttling
2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
→ Key idea 2: Dynamic throttling rate adjustment
Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies

Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

On-Chip Networks
• Connect cores, caches, memory controllers, etc.
  – Buses and crossbars are not scalable
[Figure: processing elements (PEs) connected by an interconnect; PE = processing element (cores, L2 banks, memory controllers, etc.)]

On-Chip Networks
• Connect cores, caches, memory controllers, etc.
  – Buses and crossbars are not scalable
• Packet switched
• 2D mesh: the most commonly used topology
• Primarily serve cache misses and memory requests
[Figure: 2D mesh in which each PE is attached to a router (R)]

Network Congestion Reduces Performance
• Network congestion: limited shared resources (buffers and links)
  – Due to design constraints: power, chip area, and timing
• Congestion degrades system performance
[Figure: congested 2D mesh; legend: R = router, P = packet, PE = processing element]

Goal
• Improve system performance in a highly congested network
• Observation: Reducing network load (the number of packets in the network) decreases network congestion, hence improves system performance
• Approach: Source throttling (temporarily delaying new traffic injection) to reduce network load

Source Throttling
• Long network latency when the network is congested
[Figure: PEs injecting packets (P) into a congested network]

Source Throttling
• Throttling makes some packets wait longer to inject
• Average network throughput increases, hence higher system performance
[Figure: throttled PEs holding packets back at injection]

Key Questions of Source Throttling
• Every cycle when a node has a packet to inject, source throttling blocks the packet with probability P
  – We call P the "throttling rate" (it ranges from 0 to 1); a sketch of this injection-gating step follows after this slide
• The throttling rate can be set independently on every node
• Two key questions:
  1. Which applications to throttle?
  2. How much to throttle?
• Naïve mechanism: throttle every single node with a constant throttling rate
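To make the injection-gating step concrete, here is a minimal sketch (Python-style, not from the original talk; the NodeInjector class, the random draw, and the interface flags are illustrative assumptions) of how a node could apply its throttling rate every cycle:

    import random

    class NodeInjector:
        """Per-node source throttling: block a ready packet with probability P."""

        def __init__(self, throttle_rate=0.0):
            # P, the throttling rate, set independently on every node (0 to 1).
            self.throttle_rate = throttle_rate

        def try_inject(self, has_ready_packet, router_can_accept):
            # Inject only if a packet is ready, the local router can accept it,
            # and the throttling coin flip does not block it this cycle.
            if not (has_ready_packet and router_can_accept):
                return False
            if random.random() < self.throttle_rate:
                return False  # throttled: the packet waits and retries next cycle
            return True

With throttle_rate = 0 a node injects whenever it can; setting the same constant rate on every node corresponds to the naïve mechanism above, while HAT instead chooses which nodes to throttle and how strongly, as the following slides show.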
Key Observation #1
Throttling network-intensive applications leads to higher system performance
Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)
[Figure: gromacs nodes throttled while mcf nodes inject freely]
Throttling gromacs decreases system performance by 2% due to minimal network load reduction

Key Observation #1
Throttling network-intensive applications leads to higher system performance
Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)
[Figure: mcf nodes throttled while gromacs nodes inject freely]
Throttling mcf increases system performance by 9% (gromacs: +14%, mcf: +5%) due to reduced congestion

Key Observation #1
Throttling network-intensive applications leads to higher system performance
• Throttling network-intensive applications reduces congestion
• Throttling mcf reduces congestion and benefits both the network-non-intensive and the network-intensive applications
• gromacs benefits more from the lower network latency

Key Observation #2
There is no single throttling rate that works well for every application workload
[Figure: the same throttling rate (0.8) applied to a gromacs-only workload and to a mixed gromacs+mcf workload]
• The network runs best at or below a certain network load
• Dynamically adjust the throttling rate to avoid overload and under-utilization

Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling: throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment: dynamically adjust the throttling rate to adapt to different workloads and program phases

Application-Aware Throttling
1. Measure applications' network intensity
  – Use L1 MPKI (misses per thousand instructions) to estimate network intensity
2. Throttle network-intensive applications
  – How to select unthrottled applications? Leaving too many applications unthrottled overloads the network
  → Select unthrottled applications so that their total network intensity is less than the total network capacity (a short sketch of this selection rule follows after this slide)
[Figure: applications sorted by increasing L1 MPKI; the network-non-intensive applications whose MPKIs sum to less than a threshold (Σ MPKI < Threshold) stay unthrottled, the remaining network-intensive applications are throttled]
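As a rough illustration of this selection rule (hypothetical Python, not the talk's exact implementation; the per-application MPKI dictionary and the capacity threshold are assumed inputs), applications are considered from least to most network-intensive and left unthrottled only while their cumulative MPKI stays under the threshold:

    def classify_applications(l1_mpki_per_app, mpki_threshold):
        """Split applications into unthrottled and throttled sets.

        l1_mpki_per_app: dict {app_id: L1 MPKI measured over the last epoch}
        mpki_threshold:  total network intensity the network is assumed to
                         absorb without congesting (a tunable parameter)
        """
        unthrottled, throttled = set(), set()
        total_mpki = 0.0
        # Walk applications from network-non-intensive to network-intensive.
        for app_id, mpki in sorted(l1_mpki_per_app.items(), key=lambda kv: kv[1]):
            if total_mpki + mpki < mpki_threshold:
                unthrottled.add(app_id)
                total_mpki += mpki
            else:
                throttled.add(app_id)
        return unthrottled, throttled

Every application past the point where the running MPKI total reaches the threshold is classified as network-intensive and throttled, matching the picture on the slide above.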
Dynamic Throttling Rate Adjustment
• Different workloads require different throttling rates to avoid overloading the network
• But network load (the fraction of occupied buffers/links) is an accurate indicator of congestion
• Key idea: measure the current network load and dynamically adjust the throttling rate based on that load
  – if network load > target: increase the throttling rate
  – else: decrease the throttling rate
• If the network is congested, throttle more; if the network is not congested, avoid unnecessary throttling

Epoch-Based Operation
• Application classification and throttling rate adjustment are expensive if done every cycle
• Solution: recompute at epoch granularity
• During an epoch, every node: 1) measures its L1 MPKI, 2) measures the network load
• At the beginning of each epoch, all nodes send the measured info to a central controller, which:
  1) Classifies applications
  2) Adjusts the throttling rate
  3) Sends the new classification and throttling rate to each node
[Timeline: current epoch (100K cycles) → next epoch (100K cycles)]

Putting It Together: Key Contributions
1. Application-aware throttling
  – Throttle network-intensive applications based on the applications' network intensities
2. Network-load-aware throttling rate adjustment
  – Dynamically adjust the throttling rate based on network load to avoid overloading the network
HAT is the first work to combine application-aware throttling and network-load-aware rate adjustment

Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Comparison Points
• Source throttling for bufferless NoCs [Nychis+ HotNets'10, SIGCOMM'12]
  – Throttles network-intensive applications when other applications cannot inject
  – Does not take network load into account
  – We call this "Heterogeneous Throttling"
• Source throttling for buffered networks [Thottethodi+ HPCA'01]
  – Throttles every application when the network load exceeds a dynamically tuned threshold
  – Not application-aware
  – Fully blocks packet injections while throttling
  – We call this "Self-Tuned Throttling"

Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Methodology
• Chip multiprocessor simulator
  – 64-node multi-core systems with a 2D-mesh topology
  – Closed-loop core/cache/NoC cycle-level model
  – 64KB L1, perfect L2 (always hits, to stress the NoC)
• Router designs
  – Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS'92]
    • Input buffers hold contending packets
  – Bufferless deflection router: BLESS [Moscibroda+ ISCA'09]
    • Contending packets are misrouted (deflected)

Methodology
• Workloads
  – 60 multi-core workloads of SPEC CPU2006 benchmarks
  – 4 workload categories based on the network intensity of the applications (Low/Medium/High)
• Metrics (a small computation sketch follows this slide)
  – System performance: Weighted Speedup = Σ_i (IPC_i^shared / IPC_i^alone)
  – Fairness: Maximum Slowdown = max_i (IPC_i^alone / IPC_i^shared)
  – Energy efficiency: Perf. per Watt = Weighted Speedup / Power
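For clarity, a small sketch (Python; the per-application IPC lists and the power value are assumed inputs measured by the simulator) of how these three metrics are computed:

    def weighted_speedup(ipc_shared, ipc_alone):
        """System performance: sum of each application's IPC when sharing
        the system, normalized to its IPC when running alone."""
        return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

    def maximum_slowdown(ipc_shared, ipc_alone):
        """Unfairness: slowdown of the most slowed-down application
        (lower is better)."""
        return max(a / s for s, a in zip(ipc_shared, ipc_alone))

    def perf_per_watt(ipc_shared, ipc_alone, power):
        """Energy efficiency: weighted speedup per watt."""
        return weighted_speedup(ipc_shared, ipc_alone) / power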
Performance: Bufferless NoC (BLESS)
[Figure: weighted speedup for the HL, HML, HM, and H workload categories and the average, under No Throttling, Heterogeneous Throttling, and HAT; HAT improves weighted speedup over No Throttling by 7.4%–11.5% per category and by 8.6% on average]
1. HAT provides a larger performance improvement than state-of-the-art throttling approaches
2. The highest improvement is on heterogeneous workloads
  – L (low-intensity) and M (medium-intensity) applications are more sensitive to network latency

Performance: Buffered NoC
[Figure: weighted speedup for the HL, HML, HM, and H workload categories and the average, under No Throttling, Self-Tuned, Heterogeneous Throttling, and HAT; HAT improves weighted speedup by 3.5% on average]
HAT provides a larger performance improvement than prior approaches

Application Fairness
[Figure: normalized maximum slowdown (lower is better) for BLESS and the buffered network; HAT lowers maximum slowdown by 15% (BLESS) and 5% (buffered)]
HAT provides better fairness than prior works

Network Energy Efficiency
[Figure: normalized performance per watt; HAT improves it by 8.5% (BLESS) and 5% (buffered)]
HAT increases energy efficiency by reducing network load (by 4.7% on BLESS and 0.5% on the buffered network)

Other Results in the Paper
• Performance on CHIPPER [Fallin+ HPCA'11]
  – HAT improves system performance
• Performance on multithreaded workloads
  – HAT is not designed for multithreaded workloads, but it slightly improves system performance
• Parameter sensitivity sweep of HAT
  – HAT provides consistent system performance improvement across different network sizes

Conclusions
Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion
1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
→ Key idea 1: Application-aware source throttling
2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
→ Key idea 2: Dynamic throttling rate adjustment
Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu

Throttling Rate Steps
[Backup figure]

Network Sizes
[Backup figure: weighted speedup (higher is better) and unfairness (lower is better) for the VC-buffered, CHIPPER, and BLESS routers with HAT at network sizes of 16, 64, and 144 nodes]
Performance on CHIPPER/BLESS
[Backup figure: weighted speedup (higher is better) and unfairness (lower is better) for CHIPPER, CHIPPER-HETERO., CHIPPER-HAT, BLESS, BLESS-HETERO., and BLESS-HAT across the HL, HML, HM, and H workload categories and the average]
Injection rate (number of injected packets per cycle): +8.5% (BLESS) or +4.8% (CHIPPER)

Multithreaded Workloads
[Backup figure]

Motivation
[Backup figure: weighted speedup versus throttling rate (80%–100%) for two workloads; Workload 1 peaks at a throttling rate of about 90%, Workload 2 at about 94%]

Sensitivity to Other Parameters
• Network load target: weighted speedup peaks between 50% and 65% network utilization
  – It drops by more than 10% beyond that
• Epoch length: weighted speedup varies by less than 1% between 20K and 1M cycles
• Low-intensity workloads: HAT does not impact system performance
• Unthrottled network intensity threshold:

Implementation
• L1 MPKI: one L1 miss hardware counter and one instruction hardware counter
• Network load: one hardware counter to monitor the number of flits routed to neighboring nodes
• Computation: done at one central CPU node (a control-loop sketch follows below)
  – At most several thousand cycles (a few percent overhead for a 100K-cycle epoch)
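To tie this implementation slide back to the mechanism, here is a rough sketch (hypothetical Python; the parameter values, the fixed rate step, and the reuse of the classify_applications helper from the earlier sketch are all assumptions rather than the talk's exact design) of what the central controller could do at each epoch boundary:

    class CentralController:
        def __init__(self, load_target=0.6, rate_step=0.1, mpki_threshold=150.0):
            # load_target, rate_step, and mpki_threshold are illustrative values,
            # not the parameters used in the paper.
            self.load_target = load_target
            self.rate_step = rate_step
            self.mpki_threshold = mpki_threshold
            self.throttle_rate = 0.0

        def on_epoch_boundary(self, l1_mpki_per_app, measured_network_load):
            # 1) Classify applications by network intensity (see the earlier sketch).
            unthrottled, throttled = classify_applications(
                l1_mpki_per_app, self.mpki_threshold)
            # 2) Adjust the throttling rate based on the measured network load.
            if measured_network_load > self.load_target:
                self.throttle_rate = min(1.0, self.throttle_rate + self.rate_step)
            else:
                self.throttle_rate = max(0.0, self.throttle_rate - self.rate_step)
            # 3) Per-node rates to broadcast: throttled applications use the new
            #    rate, unthrottled applications inject freely.
            return {app: (self.throttle_rate if app in throttled else 0.0)
                    for app in l1_mpki_per_app}

Each node then applies its broadcast rate with the injection-gating step sketched at the beginning of the deck.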