18-742 Parallel Computer Architecture
Lecture 18: Interconnection Networks II
Chris Fallin, Carnegie Mellon University
Material based on Michael Papamichael's 18-742 lecture slides from Spring 2011, in turn based on Onur Mutlu's 18-742 lecture slides from Spring 2010.

Readings: Interconnection Networks
Required:
- Dally, "Virtual-Channel Flow Control," ISCA 1990.
- Mullins et al., "Low-Latency Virtual-Channel Routers for On-Chip Networks," ISCA 2004.
- Wentzlaff et al., "On-Chip Interconnection Architecture of the Tile Processor," IEEE Micro 2007.
- Fallin et al., "CHIPPER: A Low-complexity Bufferless Deflection Router," HPCA 2011.
- Fallin et al., "MinBD: Minimally-Buffered Deflection Routing for On-Chip Interconnect," NOCS 2012.
- Patel et al., "Processor-Memory Interconnections for Multiprocessors," ISCA 1979.
Recommended:
- Moscibroda and Mutlu, "A Case for Bufferless Routing in On-Chip Networks," ISCA 2009.
- Bjerregaard and Mahadevan, "A Survey of Research and Practices of Network-on-Chip," ACM Computing Surveys (CSUR) 2006.

Last Lecture: Interconnection Networks: introduction & terminology; topology; buffering and flow control.

Today: Review (Topology & Flow Control); More on interconnection networks: Routing, Router design, Network performance metrics, On-chip vs. off-chip differences; Research on NoC Router Design: BLESS (bufferless deflection routing), CHIPPER (cheaper bufferless deflection routing), MinBD (adding small buffers to recover some performance); Research on Congestion Control: HAT (Heterogeneous Adaptive Throttling).

Today: Review (Topology & Flow Control); More on interconnection networks: Routing, Router design, Network performance metrics, On-chip vs.
off-chip differences; BLESS: bufferless deflection routing; CHIPPER: cheaper bufferless deflection routing; MinBD: adding small buffers to recover some performance; Research on Congestion Control: HAT: Heterogeneous Adaptive Throttling.

Review: Topologies

                   Crossbar       Multistage Log.   Mesh
  Direct/Indirect  Indirect       Indirect          Direct
  Blocking?        Non-blocking   Blocking          Blocking
  Cost             O(N^2)         O(N log N)        O(N)
  Latency          O(1)           O(log N)          O(sqrt(N))

Review: Flow Control. Store and Forward buffers each packet fully at every hop from source S to destination D. Cut Through / Wormhole starts forwarding before the whole packet arrives: buffers shrink and latency is reduced. Any other issues? Head-of-Line Blocking: when a downstream buffer is full, the blue packet at the head cannot proceed, and everything blocked behind it stalls even though the channel sits idle; the red packet holding the channel keeps it idle until red proceeds. Solution: use Virtual Channels, so a blocked packet no longer idles the channel for other packets.

Today: Review (Topology & Flow Control); More on interconnection networks: Routing, Router design, Network performance metrics, On-chip vs.
off-chip differences BLESS: bufferless deflection routing CHIPPER: cheaper bufferless deflection routing MinBD: adding small buffers to recover some performance Research on Congestion Control HAT: Heterogeneous Adaptive Throttling 12 Routing Mechanism Arithmetic Simple arithmetic to determine route in regular topologies Dimension order routing in meshes/tori Source Based Source specifies output port for each switch in route + Simple switches no control state: strip output port off header - Large header Table Lookup Based Index into table for output port + Small header - More complex switches 13 Routing Algorithm Types Deterministic: always choose the same path Oblivious: do not consider network state (e.g., random) Adaptive: adapt to state of the network How to adapt Local/global feedback Minimal or non-minimal paths 14 Deterministic Routing All packets between the same (source, dest) pair take the same path Dimension-order routing E.g., XY routing (used in Cray T3D, and many on-chip networks) First traverse dimension X, then traverse dimension Y + Simple + Deadlock freedom (no cycles in resource allocation) - Could lead to high contention - Does not exploit path diversity 15 Deadlock No forward progress Caused by circular dependencies on resources Each packet waits for a buffer occupied by another packet downstream 16 Handling Deadlock Avoid cycles in routing Dimension order routing Restrict the “turns” each packet can take Avoid deadlock by adding virtual channels Cannot build a circular dependency Separate VC pool per distance Detect and break deadlock Preemption of buffers 17 Turn Model to Avoid Deadlock Idea Analyze directions in which packets can turn in the network Determine the cycles that such turns can form Prohibit just enough turns to break possible cycles Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992. 
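As a concrete illustration of the dimension-order (XY) routing described above, here is a minimal Python sketch (function name and coordinate representation are mine, not from the lecture): the route corrects the X coordinate first, then Y, so every (source, destination) pair always yields the same path.

```python
# Sketch of XY dimension-order routing in a 2D mesh (illustrative helper,
# not from the lecture). Correcting X fully before Y makes the route
# deterministic and avoids cycles in resource allocation (deadlock freedom).

def xy_route(src, dst):
    """Return the list of (x, y) hops from src to dst under XY routing."""
    x, y = src
    dx, dy = dst
    path = []
    while x != dx:                      # first traverse dimension X
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then traverse dimension Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# xy_route((0, 0), (2, 3)) -> [(1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

Note the hop count always equals the Manhattan distance: XY routing is minimal, but it does not exploit path diversity and can concentrate contention, exactly the trade-off listed on the slide.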
18 Valiant’s Algorithm An example of oblivious algorithm Goal: Balance network load Idea: Randomly choose an intermediate destination, route to it first, then route from there to destination Between source-intermediate and intermediate-dest, can use dimension order routing + Randomizes/balances network load - Non minimal (packet latency can increase) Optimizations: Do this on high load Restrict the intermediate node to be close (in the same quadrant) 19 Adaptive Routing Minimal adaptive Router uses network state (e.g., downstream buffer occupancy) to pick which “productive” output port to send a packet to Productive output port: port that gets the packet closer to its destination + Aware of local congestion - Minimality restricts achievable link utilization (load balance) Non-minimal (fully) adaptive “Misroute” packets to non-productive output ports based on network state + Can achieve better network utilization and load balance - Need to guarantee livelock freedom 20 More on Adaptive Routing Can avoid faulty links/routers Idea: Route around faults + Deterministic routing cannot handle faulty components - Need to change the routing table to disable faulty routes - Assuming the faulty link/router is detected 21 Today Review (Topology & Flow Control) More on interconnection networks Research on NoC Router Design Routing Router design Network performance metrics On-chip vs. 
off-chip differences; BLESS: bufferless deflection routing; CHIPPER: cheaper bufferless deflection routing; MinBD: adding small buffers to recover some performance; Research on Congestion Control: HAT: Heterogeneous Adaptive Throttling.

On-chip Networks. [Figure: a mesh of processing elements (PEs: cores, L2 banks, memory controllers, etc.), each attached to a router R. Each router has input ports with buffers (virtual channels VC 0-2 with VC identifiers), control logic (routing unit RC, VC allocator VA, switch allocator SA), and a 5x5 crossbar connecting the four neighbor directions (N/S/E/W) and the local PE.]

Router Design: Functions of a Router. Buffering (of flits); route computation; arbitration of flits (i.e., prioritization) when there is contention, called packet scheduling; switching from input port to output port; power management (scaling link/router frequency).

Router Pipeline: BW | RC | VA | SA | ST | LT. Five logical router stages (BW: Buffer Write, RC: Route Computation, VA: Virtual Channel Allocation, SA: Switch Allocation, ST: Switch Traversal), plus LT: Link Traversal.

Wormhole Router Timeline. Head flit: BW, RC, VA, SA, ST, LT. Body 1, Body 2, Tail: BW, then SA, ST, LT. Route computation is performed once per packet and a virtual channel is allocated once per packet; body and tail flits inherit this information from the head flit.

Dependencies in a Router. Wormhole router: decode + routing, then switch arbitration, then crossbar traversal. Virtual channel router: decode + routing, then VC allocation, then switch arbitration, then crossbar traversal. Speculative virtual channel router: decode + routing, then VC allocation in parallel with speculative switch arbitration, then crossbar traversal. The dependence between the output of one module and the input of the next determines the critical path through the router; for example, a flit cannot bid for a switch port until routing has been performed.

Pipeline Optimizations: Lookahead Routing. At the current router, perform the routing computation for the next router, overlapped with BW. Precomputing the route allows flits to compete for VCs immediately after
BW. At each hop, RC then simply decodes the route header computed upstream, while the routing computation needed at the next hop proceeds in parallel with VA. Galles, "Spider: A High-Speed Network Interconnect," IEEE Micro 1997.

Pipeline Optimizations: Speculation. Assume the Virtual Channel Allocation stage will be successful (valid under low to moderate loads) and perform the entire VA and SA in parallel: BW+RC, then VA and SA together, then ST, LT. If VA is unsuccessful (no virtual channel returned), VA/SA must be repeated in the next cycle; non-speculative requests are prioritized.

Pipeline Optimizations: Bypassing. When there are no flits in the input buffer, speculatively enter ST: in the first (setup) stage, a free VC is allocated, routing is performed, and the crossbar is set up; on a port conflict, the speculation is aborted.

Today: Review (Topology & Flow Control); More on interconnection networks: Routing, Router design, Network performance metrics, On-chip vs. off-chip differences; BLESS: bufferless deflection routing; CHIPPER: cheaper bufferless deflection routing; MinBD: adding small buffers to recover some performance; Research on Congestion Control: HAT: Heterogeneous Adaptive Throttling.

Interconnection Network Performance. Plotting latency against offered traffic (bits/sec): the zero-load latency is set by topology + routing + flow control; the minimum latency is bounded first by the topology, then by the routing algorithm; the saturation throughput is bounded first by the topology, then by routing, then by flow control.

Ideal Latency. Due solely to wire delay between source and destination:

    T_ideal = D/v + L/b

where D = Manhattan distance, v = propagation velocity, L = packet size, and b = channel bandwidth.

Actual Latency. Dedicated wiring is impractical, so long wires are segmented by inserting routers:

    T_actual = D/v + L/b + H * T_router + T_c

where H = number of hops, T_router = per-hop router latency, and T_c = latency due to contention.

Network Performance Metrics: packet latency; round-trip latency; saturation throughput; application-level performance (system performance), which is affected
by interference among threads/applications 44 Today Review (Topology & Flow Control) More on interconnection networks Research on NoC Router Design Routing Router design Network performance metrics On-chip vs. off-chip differences BLESS: bufferless deflection routing CHIPPER: cheaper bufferless deflection routing MinBD: adding small buffers to recover some performance Research on Congestion Control HAT: Heterogeneous Adaptive Throttling 45 On-Chip vs. Off-Chip Differences Advantages of on-chip Wires are “free” Can build highly connected networks with wide buses Low latency Can cross entire network in few clock cycles High Reliability Packets are not dropped and links rarely fail Disadvantages of on-chip Sharing resources with rest of components on chip Area Power Limited buffering available Not all topologies map well to 2D plane 46 Today Review (Topology & Flow Control) More on interconnection networks Research on NoC Router Design Routing Router design Network performance metrics On-chip vs. 
off-chip differences; BLESS: bufferless deflection routing; CHIPPER: cheaper bufferless deflection routing; MinBD: adding small buffers to recover some performance; Research on Congestion Control: HAT: Heterogeneous Adaptive Throttling.

A Case for Bufferless Routing in On-Chip Networks. Thomas Moscibroda (Microsoft Research), Onur Mutlu (CMU).

On-Chip Networks (NoC). A multi-core chip connects CPU+L1 tiles, cache banks, memory controllers, accelerators, etc. Examples: Intel 80-core Terascale chip, MIT RAW chip. Design goals in NoC design: high throughput, low latency; fairness between cores, QoS; low complexity, low cost; low power and energy consumption.

Energy/Power in On-Chip Networks. Power is a key constraint in the design of high-performance processors, and NoCs consume a substantial portion of system power: ~30% in the Intel 80-core Terascale chip [IEEE Micro'07] and ~40% in the MIT RAW chip [ISCA'04]; NoCs are estimated to consume 100s of Watts [Borkar, DAC'07].

Current NoC Approaches. Existing approaches differ in numerous ways: network topology [Kim et al., ISCA'07; Kim et al., ISCA'08, etc.]; flow control [Michelogiannakis et al., HPCA'09; Kumar et al., MICRO'08, etc.]; virtual channels [Nicopoulos et al., MICRO'06, etc.]; QoS & fairness mechanisms [Lee et al., ISCA'08, etc.]; routing algorithms [Singh et al., CAL'04]; router architecture [Park et al., ISCA'08]; broadcast, multicast [Jerger et al.,
ISCA'08; Rodrigo et al., MICRO'08]. Existing work assumes the existence of buffers in routers!

A Typical Router. Each input channel feeds an input port with virtual channels VC1..VCv; a routing computation unit, VC arbiter, and switch arbiter (scheduler) control an N x N crossbar to the output channels, with credit flow back to the upstream routers. Buffers are an integral part of existing NoC routers.

Buffers in NoC Routers. Buffers are necessary for high network throughput: they increase the total available bandwidth in the network, so average packet latency saturates at higher injection rates as buffers grow from small to medium to large. But buffers consume significant energy/power (dynamic energy when read/written, static energy even when not occupied); they add complexity and latency (logic for buffer management, virtual channel allocation, credit-based flow control); and they require significant chip area (e.g., in the TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al., ICCD'06]).

Going Bufferless…? How much throughput do we lose, and how is latency affected (latency vs. injection rate, with buffers vs. without)? Up to what injection rates can we use bufferless routing, and are there realistic scenarios in which the NoC operates at injection rates below that threshold? If so, how much? Can we achieve energy reduction? Can we reduce area, complexity, etc.? Answers in our paper!
Overview: Introduction and Background; Bufferless Routing (BLESS): FLIT-BLESS, WORM-BLESS, BLESS with buffers; Advantages and Disadvantages; Evaluations; Conclusions.

BLESS: Bufferless Routing. Always forward all incoming flits to some output port; if no productive direction is available, send the flit in another direction: the packet is deflected (hot-potato routing [Baran'64, etc.]). In the router, the VC arbiter and switch arbiter are replaced by a flit-ranking and port-prioritization (arbitration) policy: 1. create a ranking over all incoming flits; 2. for a given flit in this ranking, find the best free output port; apply this to each flit in order of ranking.

FLIT-BLESS: Flit-Level Routing. Each flit is routed independently, with oldest-first arbitration (other policies are evaluated in the paper). Flit-Ranking: 1. oldest-first ranking. Port-Prioritization: 2. assign the flit to a productive port if possible; otherwise, assign it to a non-productive port. Network topology: can be applied to most topologies (mesh, torus, hypercube, trees, …) provided (1) #output ports >= #input ports at every router and (2) every router is reachable from every other router. Flow control & injection policy: completely local; inject whenever the local input port is free. Absence of deadlocks: every flit is always moving. Absence of livelocks: guaranteed with oldest-first ranking.

WORM-BLESS: Wormhole Routing. Potential downsides of FLIT-BLESS: not energy-optimal (each flit needs header information); increased latency (different flits take different paths); increased receive buffer size. BLESS with wormhole routing…?
[Dally, Seitz'86]. Problems: the injection problem (it is not known when it is safe to inject a new worm) and the livelock problem (packets can be deflected forever).

WORM-BLESS: Wormhole Routing. At low congestion, packets travel routed as worms; worms are deflected or truncated if necessary. Flit-Ranking: 1. oldest-first ranking. Port-Prioritization: 2. if the flit is a head-flit, (a) assign it to an unallocated, productive port; (b) else to an allocated, productive port; (c) else to an unallocated, non-productive port; (d) else to an allocated, non-productive port. Otherwise (a body flit), assign it to the port already allocated to its worm. When a worm is truncated (e.g., a worm allocated to West is truncated and deflected while another worm holds the port allocated to North), a body-flit turns into a head-flit and starts a new worm. See paper for details.

BLESS with Buffers. BLESS without buffers is the extreme end of a continuum; BLESS can be integrated with buffers (FLIT-BLESS with buffers, WORM-BLESS with buffers). Whenever a buffer is full, its first flit becomes must-schedule; must-schedule flits are deflected if necessary. See paper for details.

Overview: Introduction and Background; Bufferless Routing (BLESS): FLIT-BLESS, WORM-BLESS, BLESS with buffers; Advantages and Disadvantages; Evaluations; Conclusions.

BLESS: Advantages & Disadvantages. Advantages: no buffers; purely local flow control; simplicity (no credit flows, no virtual channels, simplified router design); no deadlocks or livelocks; adaptivity (packets are deflected around congested areas!); router latency reduction; area savings. Disadvantages: increased latency; reduced bandwidth; increased buffering at the receiver; header information in each flit. Impact on energy…?
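The FLIT-BLESS flit-ranking and port-prioritization steps can be sketched as a small Python toy model (the flit/port representation here is my own, not the paper's hardware):

```python
# Toy sketch of FLIT-BLESS arbitration (illustrative, not the paper's RTL):
# rank incoming flits oldest-first, then give each flit a free productive
# port if one remains, else any free port (that flit is deflected).
# A flit is a dict with 'age' and 'productive' (set of port names).

def bless_arbitrate(flits, ports):
    """Return {flit index: assigned port}; requires len(ports) >= len(flits)."""
    free = set(ports)
    assignment = {}
    # Oldest-first ranking: larger age = older = higher priority.
    order = sorted(range(len(flits)), key=lambda i: -flits[i]["age"])
    for i in order:
        want = flits[i]["productive"] & free
        # min() just makes the choice deterministic for this sketch.
        port = min(want) if want else min(free)   # no productive port: deflect
        assignment[i] = port
        free.remove(port)
    return assignment
```

Note how every flit always receives *some* free output port: nothing is buffered and nothing is dropped, which is exactly why injection can stay purely local and why every flit keeps moving (deadlock freedom).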
Reduction of Router Latency. BLESS gets rid of input buffers and virtual channels, shortening the pipeline [Dally, Towles'04]. (BW: Buffer Write, RC: Route Computation, VA: Virtual Channel Allocation, SA: Switch Allocation, ST: Switch Traversal, LT: Link Traversal, LA LT: Link Traversal of Lookahead.)
- Baseline router (speculative): head flit BW, RC, VA, SA, ST, LT; body flits BW, SA, ST, LT. Router latency = 3; can be improved to 2.
- BLESS router (standard): RC, ST, LT at each of router 1 and router 2. Router latency = 2.
- BLESS router (optimized): RC overlapped with the lookahead link traversal (LA LT), then ST, LT. Router latency = 1.

BLESS: Advantages & Disadvantages (recap of the list above). Extensive evaluations in the paper! Impact on energy…?
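To put rough numbers on what the router-latency reduction buys, here is a back-of-the-envelope zero-load latency model in the spirit of the T_actual formula from earlier in the lecture (contention T_c is zero at zero load; all parameter values below are illustrative, not from the paper):

```python
# Back-of-the-envelope zero-load packet latency, in cycles, following
# T_actual = D/v + L/b + H*T_router + T_c with T_c = 0 (zero load).
# hops*(t_router + 1) covers the router plus one-cycle link traversals;
# (L/b - 1) extra cycles serialize the remaining flits of the packet.

def zero_load_latency(hops, t_router, packet_bits, channel_bits):
    serialization = packet_bits // channel_bits   # number of flits
    return hops * (t_router + 1) + (serialization - 1)

# E.g., an 8-hop path, 128-bit links, 4-flit (512-bit) packets:
# a 2-cycle router gives 8*(2+1) + 3 = 27 cycles;
# the 1-cycle optimized BLESS router gives 8*(1+1) + 3 = 19 cycles.
```

On long paths the per-hop router latency dominates, which is why shaving the pipeline from 3 to 1 stages is worth so much at low load.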
Evaluation Methodology. Cycle-accurate simulation of a 2D mesh network with 2-cycle router latency; the simulator models stalls in the network and the processors, and the aggressive processor model is self-throttling, so input behavior puts maximal stress on the NoC. Configurations:
- 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
- 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
- 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
128-bit wide links, 4-flit data packets, 1-flit address packets. Baseline configuration: 4 VCs per physical port, 1 packet deep. Processor model: x86, based on Intel Pentium M; 2 GHz, 128-entry instruction window; 64 KB private L1 caches; 16 MB total shared L2, 16 MSHRs per bank; DRAM model based on Micron DDR2-800. Benchmarks: multiprogrammed SPEC CPU2006 and Windows Desktop applications; heterogeneous and homogeneous application mixes; synthetic traffic patterns (uniform random, transpose, tornado, bit complement). Most evaluations use perfect L2 caches.

Energy model provided by the Orion simulator [MICRO'02]: 70 nm technology, 2 GHz routers at 1.0 Vdd. For BLESS, we model the additional energy to transmit header information, the additional buffers needed on the receiver side, and the additional logic to reorder flits of individual packets at the receiver. Network energy is partitioned into buffer energy, router energy, and link energy, each with static and dynamic components. Comparisons are against non-adaptive and aggressive adaptive buffered routing algorithms (DO, MIN-AD, ROMM).

Evaluation – Synthetic Traces. First, the bad news: under uniform random injection, BLESS (FLIT-1/2, WORM-1/2) has significantly lower saturation throughput than the buffered baseline (MIN-AD); average latency climbs steeply at lower injection rates (flits per cycle per node).

Evaluation – Homogeneous Case Study. milc benchmarks (moderately intensive), perfect caches, on 4x4 8x milc, 4x4 16x milc, and 8x8 16x milc. Very little performance degradation with BLESS (less than 4% in the dense network); with router latency 1, BLESS can even outperform the baseline (by ~10%); significant energy improvements (almost 40%, across buffer, link, and router energy). Observations: (1) injection rates are not extremely high on average: self-throttling! (2) For bursts and temporary hotspots, the network links are used as buffers!

Evaluation – Further Results (see paper for details). BLESS increases the buffer requirement at the receiver by at most 2x; overall, energy is still reduced. Impact of memory latency: with real caches, very little slowdown (at most 1.5%). Heterogeneous application mixes (several mixes of intensive and non-intensive applications): little performance degradation, significant energy savings in all cases, and no significant increase in unfairness across different applications. Area savings: ~60% of network area can be saved!

Evaluation – Aggregate Results (over all 29 applications):

                          Perfect L2                Realistic L2
                          Average    Worst-Case     Average    Worst-Case
  Sparse Network
    ∆ Network Energy      -39.4%     -28.1%         -46.4%     -41.0%
    ∆ System Performance  -0.5%      -3.2%          -0.15%     -0.55%
  Dense Network
    ∆ Network Energy      -32.8%     -14.0%         -42.5%     -33.7%
    ∆ System Performance  -3.6%      -17.1%         -0.7%      -1.5%

Conclusion. For a very wide range of applications and network settings, buffers are not needed in NoCs: significant energy savings (32% even in dense networks with perfect caches), area savings of 60%, simplified router and network design (flow control, etc.), and minimal performance slowdown (performance can even increase!). A strong case for a rethinking of NoC design!
Future work: support for quality of service, different traffic classes, energy management, etc.

CHIPPER: A Low-complexity Bufferless Deflection Router. Chris Fallin, Chris Craik, Onur Mutlu.

Motivation. Recent work has proposed bufferless deflection routing (BLESS [Moscibroda, ISCA 2009]): ~40% savings in total NoC energy, ~40% reduction in total NoC area, and minimal performance loss (~4% on average). Unfortunately, it leaves unaddressed complexities in the router: a long critical path and large reassembly buffers. Goal: obtain these benefits while simplifying the router, in order to make bufferless NoCs practical.

Problems that Bufferless Routers Must Solve. 1. Must provide livelock freedom: a packet should not be deflected forever. 2. Must reassemble packets upon arrival (flit: atomic routing unit; packet: one or multiple flits).

A Bufferless Router: A High-Level View. [Figure: a crossbar with deflection routing logic, inject/eject paths to the local node, and reassembly buffers. Problem 1 (livelock freedom) lives in the deflection routing logic; Problem 2 (packet reassembly) lives at the reassembly buffers.]

Complexity in Bufferless Deflection Routers. 1. Livelock freedom: flits are sorted by age, then assigned in age order to output ports, giving a 43% longer critical path than a buffered router. 2. Packet reassembly: reassembly buffers must be sized for the worst case, 4 KB per node (8x8 mesh, 64-byte cache blocks).

Livelock Freedom in Previous Work. What stops a flit from deflecting forever? All flits are timestamped, and the oldest flits are assigned their desired ports. Flit age forms a total order among flits, with new traffic at lowest priority, so progress is guaranteed. But what is the cost of this?
Age-Based Priorities Are Expensive: Sorting. The router must sort flits by age: a long-latency sort network (three comparator stages for 4 flits).

Age-Based Priorities Are Expensive: Allocation. After sorting, flits are assigned to output ports in priority order, so the port assignment of younger flits depends on that of older flits: a sequential dependence in the port allocator. For age-ordered flits 1-4: flit 1, East? GRANT East (remaining {N,S,W}); flit 2, East? DEFLECT North (remaining {S,W}); flit 3, South? GRANT South (remaining {W}); flit 4, South? DEFLECT West.

Age-Based Priorities Are Expensive. Overall, deflection routing logic based on Oldest-First (priority sort plus port allocator) has a 43% longer critical path than a buffered router. Question: is there a cheaper way to route while guaranteeing livelock freedom?

Solution: Golden Packet for Livelock Freedom. What is really necessary for livelock freedom? Key insight: no total order is needed. It is enough to (1) pick one flit to prioritize until it arrives (the "Golden Flit") and (2) ensure any flit is eventually picked. Flit age forms a total order, but a partial ordering is sufficient for guaranteed progress; new traffic remains lowest-priority.

What Does Golden Flit Routing Require? Only the Golden Flit needs to be properly routed. First insight: no need for a full sort. Second insight: no need for sequential allocation.

Golden Flit Routing With Two Inputs. Route the Golden Flit in a two-input router first. Step 1: pick a "winning" flit: the Golden Flit if present, else a random one. Step 2: steer the winning flit to its desired output and deflect the other flit. The Golden Flit always routes toward its destination.

Golden Flit Routing with Four Inputs. Arrange two-input blocks (N/E/S/W) into a permutation network; each block makes its decision independently, so deflection becomes a distributed decision. Permutation Network Operation: in each block the winner (the Golden Flit, if present) is steered toward its desired side (swap or no swap), and a losing flit is deflected.
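The two-input steering step can be sketched as follows (an illustrative toy model with my own flit representation; CHIPPER implements this as a 2x2 block in its permutation network):

```python
# Sketch of one two-input arbiter block from a CHIPPER-style permutation
# network (illustrative model, not the actual hardware). Each flit is a dict
# with 'golden' (bool) and 'wants' (desired output index, 0 or 1), or None
# for an empty slot.

def steer(flit_a, flit_b):
    """Return (out0, out1) after steering; the loser is deflected."""
    # Step 1: pick the winning flit (a golden flit always wins; here the
    # non-golden tie-break is fixed rather than random, for determinism).
    if flit_b is not None and (flit_a is None or
                               (flit_b.get("golden") and not flit_a.get("golden"))):
        winner, loser = flit_b, flit_a
    else:
        winner, loser = flit_a, flit_b
    if winner is None:                # both slots empty
        return (flit_a, flit_b)
    # Step 2: winner gets its desired output; the loser takes the other one.
    out = [None, None]
    out[winner["wants"]] = winner
    out[1 - winner["wants"]] = loser
    return tuple(out)
```

Because each block only compares two flits and never consults the other blocks, there is no sort network and no sequential port allocation, which is exactly where the critical-path savings come from.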
CHIPPER: Cheap Interconnect Partially-Permuting Router. Inject/eject ports attach to the permutation network; cache miss buffers (MSHRs) serve as the reassembly buffers.

Evaluation Methodology. Multiprogrammed workloads: CPU2006, server, desktop; 8x8 (64 cores), 39 homogeneous and 10 mixed sets. Multithreaded workloads: SPLASH-2, 16 threads; 4x4 (16 cores), 5 applications. System configuration: buffered baseline with 2-cycle routers, 4 VCs/channel, 8 flits/VC; bufferless baseline with 2-cycle latency, FLIT-BLESS. Instruction-trace driven, closed-loop, 128-entry OoO window; 64 KB L1, perfect L2 (stresses the interconnect), XOR mapping. Hardware modeling: Verilog models for CHIPPER, BLESS, and buffered logic, synthesized with a commercial 65 nm library; ORION for crossbar, buffers, and links. Power: static and dynamic power from the hardware models, based on event counts in the cycle-accurate simulations.

Results: Performance Degradation (Buffered vs. BLESS vs. CHIPPER; weighted speedup for the multiprogrammed subset of 49 workloads, normalized speedup for the multithreaded workloads; slide figures: 1.8%, 3.6%, 13.6%, 49.8%). Minimal loss for low-to-medium-intensity workloads; the larger degradations occur only on the highest-intensity workloads.

Results: Power Reduction. Removing buffers provides the majority of the power savings relative to the buffered baseline (slide figures: 54.9% and 73.4% for the two workload classes), with slight additional savings from BLESS to CHIPPER.

Results: Area and Critical Path Reduction. Normalized router area: -29.1% and -36.2% vs. buffered; normalized critical path: +1.1% and -1.6%. CHIPPER maintains the area savings of BLESS, and its critical path becomes competitive with the buffered router.

Conclusions. Two key issues in bufferless deflection routing
are livelock freedom and packet reassembly. Bufferless deflection routers were high-complexity and impractical: oldest-first prioritization meant a long critical path in the router, and the lack of end-to-end flow control for reassembly made them prone to deadlock with reasonably-sized reassembly buffers. CHIPPER is a new, practical bufferless deflection router: golden-packet prioritization gives a short critical path; a retransmit-once protocol gives deadlock-free packet reassembly; and using cache miss buffers as reassembly buffers keeps the network truly bufferless. CHIPPER's frequency is comparable to buffered routers at much lower area and power cost, with minimal performance loss.

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect. Chris Fallin, Greg Nazario, Xiangyao Yu*, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu. Carnegie Mellon University; *CMU and Tsinghua University.

Bufferless Deflection Routing. Key idea: packets are never buffered in the network; when two packets contend for the same link, one is deflected. Removing buffers yields significant benefits: reduced power (CHIPPER reduces NoC power by 55%) and reduced die area (CHIPPER reduces NoC area by 36%). But at high network utilization (load), bufferless deflection routing causes unnecessary link and router traversals, which reduce network throughput and application performance and increase dynamic power. Goal: improve the high-load performance of low-cost deflection networks by reducing the deflection rate.
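To see informally why deflections grow with load, here is a toy Monte Carlo of a 4-port router (entirely my own construction for illustration, not from the MinBD paper): each input carries a flit with probability equal to the load, each flit wants a uniformly random output, and any flit whose desired output is already taken that cycle counts as deflected.

```python
import random

# Toy Monte Carlo: deflection probability vs. load in a 4-input router
# (illustrative construction). Each cycle, every input holds a flit with
# probability `load`; each flit wants a uniformly random output port; a
# flit whose desired port was already granted this cycle is deflected.

def deflection_rate(load, trials=20000, seed=0):
    rng = random.Random(seed)
    deflected = total = 0
    for _ in range(trials):
        wants = [rng.randrange(4) for _ in range(4) if rng.random() < load]
        taken = set()
        for w in wants:
            total += 1
            if w in taken:
                deflected += 1      # contention: someone else got the port
            else:
                taken.add(w)
    return deflected / total if total else 0.0
```

Running it shows the rate rising steeply with load: more flits in flight means more same-output contention, and every conflict deflects somebody, which is exactly the high-load regime MinBD targets.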
102
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions 103
Issues in Bufferless Deflection Routing
Correctness: Deliver all packets without livelock → CHIPPER1: Golden Packet (globally prioritize one packet until delivered)
Correctness: Reassemble packets without deadlock → CHIPPER1: Retransmit-Once
Performance: Avoid performance degradation at high load → MinBD
1 Fallin et al., "CHIPPER: A Low-complexity Bufferless Deflection Router", HPCA 2011. 105
Key Performance Issues
1. Link contention: no buffers to hold traffic → any link contention causes a deflection → use side buffers
2. Ejection bottleneck: only one flit can eject per router per cycle → simultaneous arrival causes deflection → eject up to 2 flits/cycle
3.
Deflection arbitration: practical (fast) deflection arbiters deflect unnecessarily → new priority scheme (silver flit) 106
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions 107
Addressing Link Contention
Problem 1: Any link contention causes a deflection
Buffering a flit can avoid deflection on contention
But, input buffers are expensive:
All flits are buffered on every hop → high dynamic energy
Large buffers necessary → high static energy and large area
Key Idea 1: add a small buffer to a bufferless deflection router to buffer only flits that would have been deflected 109
How to Buffer Deflected Flits
[Figure: baseline router datapath1; a flit contending for the output toward its Destination is DEFLECTED. Eject/Inject ports shown.]
1 Fallin et al., "CHIPPER: A Low-complexity Bufferless Deflection Router", HPCA 2011. 110
How to Buffer Deflected Flits
Step 1. Remove up to one deflected flit per cycle from the outputs.
Step 2. Buffer this flit in a small FIFO "side buffer."
Step 3. Re-inject this flit into the pipeline when a slot is available.
[Figure: side-buffered router datapath.] 111
Why Could A Side Buffer Work Well?
Buffer some flits and deflect other flits at the per-flit level
Relative to bufferless routers, the deflection rate is reduced (need not deflect all contending flits): a 4-flit buffer reduces the deflection rate by 39%
Relative to buffered routers, the buffer is used more efficiently (need not buffer all flits): similar performance with 25% of the buffer space 112
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions 113
Addressing the Ejection Bottleneck
Problem 2: Flits deflect unnecessarily because only one flit can eject per router per cycle
In 20% of all ejections, ≥ 2 flits could have ejected → all but one flit must deflect and try again → these deflected flits cause additional contention
An ejection width of 2 flits/cycle reduces the deflection rate by 21%
Key Idea 2: Reduce deflections due to a single-flit ejection port by allowing two flits to eject per cycle 114
Addressing the Ejection Bottleneck
[Figure: single-width ejection — one of two simultaneously arriving local flits is DEFLECTED.] 115
Addressing the Ejection Bottleneck
For fair comparison, baseline routers have dual-width ejection for perf.
(not power/area)
[Figure: dual-width ejection datapath with two Eject ports.] 116
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions 117
Improving Deflection Arbitration
Problem 3: Deflections occur unnecessarily because fast arbiters must use simple priority schemes
Age-based priorities (several past works): a full priority order gives fewer deflections, but requires slow arbiters
State-of-the-art deflection arbitration (Golden Packet & two-stage permutation network):
Prioritize one packet globally (ensures forward progress)
Arbitrate other flits randomly (fast critical path)
Random common case → uncoordinated arbitration 118
Fast Deflection Routing Implementation
Let's route in a two-input router first:
Step 1: pick a "winning" flit (Golden Packet, else random)
Step 2: steer the winning flit to its desired output and deflect the other flit
The highest-priority flit always routes to its destination 119
Fast Deflection Routing with Four Inputs
[Figure: two stages of 2x2 blocks connecting inputs N, E, S, W to outputs N, E, S, W.]
Each block makes decisions independently; deflection is a distributed decision 120
Unnecessary Deflections in Fast Arbiters
How does lack of coordination cause unnecessary deflections?
1. No flit is golden (pseudorandom arbitration) — all flits have equal priority
2. Red flit wins at first stage
3. Green flit loses at first stage (must be deflected now)
4. Red flit loses at second stage; Red and Green are both deflected → unnecessary deflection!
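The two-stage arbitration and the unnecessary deflection it can cause can be sketched as follows. This is an illustrative Python model; the exact block behavior and tie-breaking (toward the first input, rather than randomly as in hardware) are assumptions:

```python
def block2x2(f0, f1, wants_top, priority):
    """One 2x2 arbiter block: the higher-priority flit is steered to
    the side it wants; the other flit takes the remaining side.
    Ties break toward f0 here. Returns (top_output, bottom_output)."""
    win, lose = (f0, f1) if priority[f0] >= priority[f1] else (f1, f0)
    return (win, lose) if wants_top(win) else (lose, win)

# Scenario from the slide: no flit is golden, all priorities are
# equal, and every flit wants the "top" output of its block.
prio = {"red": 0, "green": 0, "blue": 0}
wants_top = lambda f: True

# Stage 1: red wins only because ties break toward the first input;
# green is pushed to the bottom and must now be deflected.
s1_top, s1_bottom = block2x2("red", "green", wants_top, prio)

# Stage 2: red meets blue, which also wants the top output; blue wins
# the tie, so red is deflected too -- both red and green deflect even
# though one of them could have reached its desired output.
s2_top, s2_bottom = block2x2("blue", s1_top, wants_top, prio)
```

Giving one flit a higher local priority would guarantee it wins both stages, which is the motivation for the silver-flit scheme introduced next.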
121
Improving Deflection Arbitration
Key Idea 3: Add a priority level and prioritize one flit to ensure at least one flit is not deflected in each cycle
Highest priority: one Golden Packet in the network — chosen in a static round-robin schedule; ensures correctness
Next-highest priority: one silver flit per router per cycle — chosen pseudo-randomly & local to one router; enhances performance 122
Adding A Silver Flit
Randomly picking a silver flit ensures one flit is not deflected:
1. No flit is golden, but the Red flit is silver (the red flit has higher priority)
2. Red flit wins at first stage (silver)
3. Green flit is deflected at first stage
4. Red flit wins at second stage (silver); not deflected
At least one flit is not deflected 123
Minimally-Buffered Deflection Router
Problem 1: Link Contention → Solution 1: Side Buffer
Problem 2: Ejection Bottleneck → Solution 2: Dual-Width Ejection
Problem 3: Unnecessary Deflections → Solution 3: Two-level priority scheme
[Figure: MinBD router datapath with Eject and Inject ports.] 124
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions 126
Methodology: Simulated System
Chip Multiprocessor Simulation:
64-core and 16-core models
Closed-loop core/cache/NoC cycle-level model
Directory cache coherence protocol (SGI Origin-based)
64KB L1, perfect L2 (stresses interconnect), XOR-mapping
Performance metric: Weighted Speedup (similar conclusions from network-level latency)
Workloads: multiprogrammed SPEC CPU2006; 75 randomly-chosen workloads, binned into network-load categories by average injection rate 127
Methodology: Routers and Network
Input-buffered virtual-channel router
Bufferless
deflection router: CHIPPER1
Bufferless-buffered hybrid router: AFC2 — has input buffers and deflection routing logic; performs coarse-grained (multi-cycle) mode switching
Buffered router configurations:
8 VCs, 8 flits/VC [Buffered(8,8)]: large buffered router
4 VCs, 4 flits/VC [Buffered(4,4)]: typical buffered router
4 VCs, 1 flit/VC [Buffered(4,1)]: smallest deadlock-free router
All power-of-2 buffer sizes up to (8, 8) for perf/power sweep
Common parameters:
2-cycle router latency, 1-cycle link latency
2D-mesh topology (16-node: 4x4; 64-node: 8x8)
Dual ejection assumed for baseline routers (for perf. only)
1 Fallin et al., "CHIPPER: A Low-complexity Bufferless Deflection Router", HPCA 2011.
2 Jafri et al., "Adaptive Flow Control for Robust Performance and Energy", MICRO 2010. 128
Methodology: Power, Die Area, Crit. Path
Hardware modeling: Verilog models for CHIPPER, MinBD, buffered control logic; synthesized with commercial 65nm library; ORION 2.0 for datapath: crossbar, muxes, buffers, and links
Power: static and dynamic power from hardware models, based on event counts in cycle-accurate simulations; broken down into buffer, link, other 129
Reduced Deflections & Improved Perf.
[Chart: weighted speedup and deflection rate for Baseline, B (Side Buffer), D (Dual-Eject), S (Silver Flits), B+D, and B+S+D (MinBD); deflection rates 28%, 17%, 22%, 27%, 11%, 10%.]
1. All mechanisms individually reduce deflections
2. Side buffer alone is not sufficient for performance (ejection bottleneck remains)
3. Overall, 5.8% over baseline and 2.7% over dual-eject, by reducing deflections 64% / 54% 130
Overall Performance Results
[Chart: weighted speedup vs. injection rate for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, AFC (4,4), and MinBD-4.]
• Improves 2.7% over CHIPPER (8.1% at high load)
• Similar perf.
to Buffered (4,1) with 25% of the buffering space
• Within 2.7% of Buffered (4,4) (8.3% at high load) 131
Overall Power Results
[Chart: network power (W), broken into static/dynamic buffer, link, and other components, for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, AFC(4,4), and MinBD-4.]
• Dynamic power increases with deflection routing
• Buffers are a significant fraction of power in baseline routers
• Buffer power is much smaller in MinBD (4-flit buffer)
• Dynamic power reduces in MinBD relative to CHIPPER 132
Performance-Power Spectrum
[Chart: weighted speedup vs. network power (W) for Buf (8,8), Buf (4,4), Buf (4,1), Buf (1,1), AFC, CHIPPER, and MinBD; more perf/power is up and to the left.]
• MinBD is the most energy-efficient (perf/watt) of any evaluated network router design 133
Die Area and Critical Path
[Charts: normalized die area and normalized critical path for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, and MinBD.]
• Only 3% area increase over CHIPPER (4-flit buffer); reduces area by 36% from Buffered (4,4)
• Increases critical path by 7% over CHIPPER, 8% over Buffered (4,4) 134
Conclusions
Bufferless deflection routing offers reduced power & area
But, a high deflection rate hurts performance at high load
MinBD (Minimally-Buffered Deflection Router) introduces:
Side buffer to hold only flits that would have been deflected
Dual-width ejection to address the ejection bottleneck
Two-level prioritization to avoid unnecessary deflections
MinBD yields reduced power (31%) & reduced area (36%) relative to buffered routers
MinBD yields improved performance (8.1% at high load) relative to bufferless routers → closes half of the performance gap
MinBD has the best energy efficiency of all evaluated designs with competitive performance 135
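The three mechanisms summarized in the conclusions can be tied together in a single illustrative router cycle. This is a Python sketch under assumed interfaces (flits as (destination, payload) tuples, a `route` callback for the desired port), not the authors' RTL:

```python
import random
from collections import deque

class MinBDRouter:
    """Sketch of one MinBD cycle: dual-width ejection, a small side
    buffer for would-be-deflected flits, and a locally chosen silver
    flit that is guaranteed to win arbitration."""

    def __init__(self, node, num_ports=4, side_depth=4, seed=0):
        self.node = node
        self.num_ports = num_ports
        self.side = deque()            # small FIFO side buffer
        self.side_depth = side_depth   # MinBD evaluates 4 flits
        self.rng = random.Random(seed)

    def cycle(self, in_flits, route):
        """in_flits: flits arriving this cycle (<= num_ports);
        route(flit) -> desired output port.
        Returns (ejected flits, dict of output port -> flit)."""
        # Dual-width ejection: up to two locally destined flits eject.
        local = [f for f in in_flits if f[0] == self.node]
        ejected = local[:2]
        rest = [f for f in in_flits if f not in ejected]
        # Re-inject one side-buffered flit if an input slot is free.
        if self.side and len(rest) < self.num_ports:
            rest.append(self.side.popleft())
        # Silver flit: one pseudo-randomly chosen flit arbitrates
        # first, so at least one flit gets its desired port per cycle.
        if rest:
            silver = self.rng.choice(rest)
            rest.sort(key=lambda f: f is not silver)
        # Port allocation: winners take their desired port; among the
        # losers, buffer up to one per cycle, deflect the remainder.
        out, losers = {}, []
        for f in rest:
            port = route(f)
            if port not in out:
                out[port] = f
            else:
                losers.append(f)
        if losers and len(self.side) < self.side_depth:
            self.side.append(losers.pop(0))
        free = (p for p in range(self.num_ports) if p not in out)
        for f in losers:
            out[next(free)] = f        # deflected
        return ejected, out
```

With four arriving flits, three of them local, two eject immediately, and a link conflict between the remaining flits is absorbed by the side buffer instead of causing a deflection; the buffered flit re-injects in a later cycle.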