HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu

Executive Summary
Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion
1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
→ Key idea 1: Application-aware source throttling
2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
→ Key idea 2: Dynamic throttling rate adjustment
Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies

Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

On-Chip Networks
• Connect cores, caches, memory controllers, etc.
  – Buses and crossbars are not scalable
[Figure: processing elements (PEs) connected by an interconnect; PE = processing element (cores, L2 banks, memory controllers, etc.)]

On-Chip Networks
• Connect cores, caches, memory controllers, etc.
  – Buses and crossbars are not scalable
• Packet switched
• 2D mesh: the most commonly used topology
• Primarily serve cache misses and memory requests
[Figure: 2D mesh in which each PE is attached to a router (R)]

Network Congestion Reduces Performance
• Network congestion: limited shared resources (buffers and links)
  – Due to design constraints: power, chip area, and timing
• Congestion degrades system performance
[Figure: congested 2D mesh; legend: R = router, P = packet, PE = processing element]

Goal
• Improve system performance in a highly congested network
• Observation: Reducing network load (the number of packets in the network) decreases network congestion, hence improves system performance
• Approach: Source throttling (temporarily delaying new traffic injection) to reduce network load

Source Throttling
• Long network latency when the network is congested
[Figure: PEs injecting packets (P) into a congested network]

Source Throttling
• Throttling makes some packets wait longer to inject
• Average network throughput increases, hence higher system performance
[Figure: throttled PEs holding packets back at injection]

Key Questions of Source Throttling
• Every cycle when a node has a packet to inject, source throttling blocks the packet with probability P
  – We call P the "throttling rate" (it ranges from 0 to 1); a sketch of this injection-gating step follows after this slide
• The throttling rate can be set independently on every node
• Two key questions:
  1. Which applications to throttle?
  2. How much to throttle?
• Naïve mechanism: throttle every single node with a constant throttling rate
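To make the injection-gating step concrete, here is a minimal sketch (Python-style, not from the original talk; the NodeInjector class, the random draw, and the interface flags are illustrative assumptions) of how a node could apply its throttling rate every cycle:

    import random

    class NodeInjector:
        """Per-node source throttling: block a ready packet with probability P."""

        def __init__(self, throttle_rate=0.0):
            # P, the throttling rate, set independently on every node (0 to 1).
            self.throttle_rate = throttle_rate

        def try_inject(self, has_ready_packet, router_can_accept):
            # Inject only if a packet is ready, the local router can accept it,
            # and the throttling coin flip does not block it this cycle.
            if not (has_ready_packet and router_can_accept):
                return False
            if random.random() < self.throttle_rate:
                return False  # throttled: the packet waits and retries next cycle
            return True

With throttle_rate = 0 a node injects whenever it can; setting the same constant rate on every node corresponds to the naïve mechanism above, while HAT instead chooses which nodes to throttle and how strongly, as the following slides show.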
Key Observation #1
Throttling network-intensive applications leads to higher system performance
Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)
[Figure: gromacs nodes throttled while mcf nodes inject freely]
Throttling gromacs decreases system performance by 2% due to minimal network load reduction

Key Observation #1
Throttling network-intensive applications leads to higher system performance
Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)
[Figure: mcf nodes throttled while gromacs nodes inject freely]
Throttling mcf increases system performance by 9% (gromacs: +14%, mcf: +5%) due to reduced congestion

Key Observation #1
Throttling network-intensive applications leads to higher system performance
• Throttling network-intensive applications reduces congestion
• Throttling mcf reduces congestion and benefits both the network-non-intensive and the network-intensive applications
• gromacs benefits more from the lower network latency

Key Observation #2
There is no single throttling rate that works well for every application workload
[Figure: the same throttling rate (0.8) applied to a gromacs-only workload and to a mixed gromacs+mcf workload]
• The network runs best at or below a certain network load
• Dynamically adjust the throttling rate to avoid overload and under-utilization

Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling: throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment: dynamically adjust the throttling rate to adapt to different workloads and program phases

Application-Aware Throttling
1. Measure applications' network intensity
  – Use L1 MPKI (misses per thousand instructions) to estimate network intensity
2. Throttle network-intensive applications
  – How to select unthrottled applications? Leaving too many applications unthrottled overloads the network
  → Select unthrottled applications so that their total network intensity is less than the total network capacity (a short sketch of this selection rule follows after this slide)
[Figure: applications sorted by increasing L1 MPKI; the network-non-intensive applications whose MPKIs sum to less than a threshold (Σ MPKI < Threshold) stay unthrottled, the remaining network-intensive applications are throttled]
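As a rough illustration of this selection rule (hypothetical Python, not the talk's exact implementation; the per-application MPKI dictionary and the capacity threshold are assumed inputs), applications are considered from least to most network-intensive and left unthrottled only while their cumulative MPKI stays under the threshold:

    def classify_applications(l1_mpki_per_app, mpki_threshold):
        """Split applications into unthrottled and throttled sets.

        l1_mpki_per_app: dict {app_id: L1 MPKI measured over the last epoch}
        mpki_threshold:  total network intensity the network is assumed to
                         absorb without congesting (a tunable parameter)
        """
        unthrottled, throttled = set(), set()
        total_mpki = 0.0
        # Walk applications from network-non-intensive to network-intensive.
        for app_id, mpki in sorted(l1_mpki_per_app.items(), key=lambda kv: kv[1]):
            if total_mpki + mpki < mpki_threshold:
                unthrottled.add(app_id)
                total_mpki += mpki
            else:
                throttled.add(app_id)
        return unthrottled, throttled

Every application past the point where the running MPKI total reaches the threshold is classified as network-intensive and throttled, matching the picture on the slide above.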
Dynamic Throttling Rate Adjustment
• Different workloads require different throttling rates to avoid overloading the network
• But network load (the fraction of occupied buffers/links) is an accurate indicator of congestion
• Key idea: measure the current network load and dynamically adjust the throttling rate based on that load
  – if network load > target: increase the throttling rate
  – else: decrease the throttling rate
• If the network is congested, throttle more; if the network is not congested, avoid unnecessary throttling

Epoch-Based Operation
• Application classification and throttling rate adjustment are expensive if done every cycle
• Solution: recompute at epoch granularity
• During an epoch, every node: 1) measures its L1 MPKI, 2) measures the network load
• At the beginning of each epoch, all nodes send the measured info to a central controller, which:
  1) Classifies applications
  2) Adjusts the throttling rate
  3) Sends the new classification and throttling rate to each node
[Timeline: current epoch (100K cycles) → next epoch (100K cycles)]

Putting It Together: Key Contributions
1. Application-aware throttling
  – Throttle network-intensive applications based on the applications' network intensities
2. Network-load-aware throttling rate adjustment
  – Dynamically adjust the throttling rate based on network load to avoid overloading the network
HAT is the first work to combine application-aware throttling and network-load-aware rate adjustment

Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Comparison Points
• Source throttling for bufferless NoCs [Nychis+ HotNets'10, SIGCOMM'12]
  – Throttles network-intensive applications when other applications cannot inject
  – Does not take network load into account
  – We call this "Heterogeneous Throttling"
• Source throttling for buffered networks [Thottethodi+ HPCA'01]
  – Throttles every application when the network load exceeds a dynamically tuned threshold
  – Not application-aware
  – Fully blocks packet injections while throttling
  – We call this "Self-Tuned Throttling"

Outline
• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Methodology
• Chip multiprocessor simulator
  – 64-node multi-core systems with a 2D-mesh topology
  – Closed-loop core/cache/NoC cycle-level model
  – 64KB L1, perfect L2 (always hits, to stress the NoC)
• Router designs
  – Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS'92]
    • Input buffers hold contending packets
  – Bufferless deflection router: BLESS [Moscibroda+ ISCA'09]
    • Contending packets are misrouted (deflected)

Methodology
• Workloads
  – 60 multi-core workloads of SPEC CPU2006 benchmarks
  – 4 workload categories based on the network intensity of the applications (Low/Medium/High)
• Metrics (a small computation sketch follows this slide)
  – System performance: Weighted Speedup = Σ_i (IPC_i^shared / IPC_i^alone)
  – Fairness: Maximum Slowdown = max_i (IPC_i^alone / IPC_i^shared)
  – Energy efficiency: Perf. per Watt = Weighted Speedup / Power
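For clarity, a small sketch (Python; the per-application IPC lists and the power value are assumed inputs measured by the simulator) of how these three metrics are computed:

    def weighted_speedup(ipc_shared, ipc_alone):
        """System performance: sum of each application's IPC when sharing
        the system, normalized to its IPC when running alone."""
        return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

    def maximum_slowdown(ipc_shared, ipc_alone):
        """Unfairness: slowdown of the most slowed-down application
        (lower is better)."""
        return max(a / s for s, a in zip(ipc_shared, ipc_alone))

    def perf_per_watt(ipc_shared, ipc_alone, power):
        """Energy efficiency: weighted speedup per watt."""
        return weighted_speedup(ipc_shared, ipc_alone) / power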
Performance: Bufferless NoC (BLESS)
[Figure: weighted speedup for the HL, HML, HM, and H workload categories and the average, under No Throttling, Heterogeneous Throttling, and HAT; HAT improves weighted speedup over No Throttling by 7.4%–11.5% per category and by 8.6% on average]
1. HAT provides a larger performance improvement than state-of-the-art throttling approaches
2. The highest improvement is on heterogeneous workloads
  – L (low-intensity) and M (medium-intensity) applications are more sensitive to network latency

Performance: Buffered NoC
[Figure: weighted speedup for the HL, HML, HM, and H workload categories and the average, under No Throttling, Self-Tuned, Heterogeneous Throttling, and HAT; HAT improves weighted speedup by 3.5% on average]
HAT provides a larger performance improvement than prior approaches

Application Fairness
[Figure: normalized maximum slowdown (lower is better) for BLESS and the buffered network; HAT lowers maximum slowdown by 15% (BLESS) and 5% (buffered)]
HAT provides better fairness than prior works

Network Energy Efficiency
[Figure: normalized performance per watt; HAT improves it by 8.5% (BLESS) and 5% (buffered)]
HAT increases energy efficiency by reducing network load (by 4.7% on BLESS and 0.5% on the buffered network)

Other Results in the Paper
• Performance on CHIPPER [Fallin+ HPCA'11]
  – HAT improves system performance
• Performance on multithreaded workloads
  – HAT is not designed for multithreaded workloads, but it slightly improves system performance
• Parameter sensitivity sweep of HAT
  – HAT provides consistent system performance improvement across different network sizes

Conclusions
Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion
1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
→ Key idea 1: Application-aware source throttling
2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
→ Key idea 2: Dynamic throttling rate adjustment
Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu

Throttling Rate Steps
[Backup figure]

Network Sizes
[Backup figure: weighted speedup (higher is better) and unfairness (lower is better) for the VC-buffered, CHIPPER, and BLESS routers with HAT at network sizes of 16, 64, and 144 nodes]
Performance on CHIPPER/BLESS
[Backup figure: weighted speedup (higher is better) and unfairness (lower is better) for CHIPPER, CHIPPER-HETERO., CHIPPER-HAT, BLESS, BLESS-HETERO., and BLESS-HAT across the HL, HML, HM, and H workload categories and the average]
Injection rate (number of injected packets per cycle): +8.5% (BLESS) or +4.8% (CHIPPER)

Multithreaded Workloads
[Backup figure]

Motivation
[Backup figure: weighted speedup versus throttling rate (80%–100%) for two workloads; Workload 1 peaks at a throttling rate of about 90%, Workload 2 at about 94%]

Sensitivity to Other Parameters
• Network load target: weighted speedup peaks between 50% and 65% network utilization
  – It drops by more than 10% beyond that
• Epoch length: weighted speedup varies by less than 1% between 20K and 1M cycles
• Low-intensity workloads: HAT does not impact system performance
• Unthrottled network intensity threshold:

Implementation
• L1 MPKI: one L1 miss hardware counter and one instruction hardware counter
• Network load: one hardware counter to monitor the number of flits routed to neighboring nodes
• Computation: done at one central CPU node (a control-loop sketch follows below)
  – At most several thousand cycles (a few percent overhead for a 100K-cycle epoch)
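To tie this implementation slide back to the mechanism, here is a rough sketch (hypothetical Python; the parameter values, the fixed rate step, and the reuse of the classify_applications helper from the earlier sketch are all assumptions rather than the talk's exact design) of what the central controller could do at each epoch boundary:

    class CentralController:
        def __init__(self, load_target=0.6, rate_step=0.1, mpki_threshold=150.0):
            # load_target, rate_step, and mpki_threshold are illustrative values,
            # not the parameters used in the paper.
            self.load_target = load_target
            self.rate_step = rate_step
            self.mpki_threshold = mpki_threshold
            self.throttle_rate = 0.0

        def on_epoch_boundary(self, l1_mpki_per_app, measured_network_load):
            # 1) Classify applications by network intensity (see the earlier sketch).
            unthrottled, throttled = classify_applications(
                l1_mpki_per_app, self.mpki_threshold)
            # 2) Adjust the throttling rate based on the measured network load.
            if measured_network_load > self.load_target:
                self.throttle_rate = min(1.0, self.throttle_rate + self.rate_step)
            else:
                self.throttle_rate = max(0.0, self.throttle_rate - self.rate_step)
            # 3) Per-node rates to broadcast: throttled applications use the new
            #    rate, unthrottled applications inject freely.
            return {app: (self.throttle_rate if app in throttled else 0.0)
                    for app in l1_mpki_per_app}

Each node then applies its broadcast rate with the injection-gating step sketched at the beginning of the deck.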