A Case for Bufferless Routing in On-Chip Networks
Thomas Moscibroda (Microsoft Research)    Onur Mutlu (CMU)

On-Chip Networks (NoC)
(Figure: a multi-core chip in which the nodes — CPU + L1 caches, L2 cache banks, memory controllers, accelerators, etc. — are connected by an on-chip network.)

On-Chip Networks (NoC)
• Connect cores, caches, memory controllers, etc.
  • Examples: Intel 80-core Terascale chip, MIT RAW chip
• Design goals in NoC design:
  • High throughput, low latency
  • Fairness between cores, QoS, …
  • Low complexity, low cost
  • Low power, low energy consumption

Energy/Power in On-Chip Networks
• Power is a key constraint in the design of high-performance processors
• NoCs consume a substantial portion of system power
  • ~30% in the Intel 80-core Terascale chip [IEEE Micro'07]
  • ~40% in the MIT RAW chip [ISCA'04]
• NoCs estimated to consume 100s of Watts [Borkar, DAC'07]

Current NoC Approaches
• Existing approaches differ in numerous ways:
  • Network topology [Kim et al., ISCA'07; Kim et al., ISCA'08; etc.]
  • Flow control [Michelogiannakis et al., HPCA'09; Kumar et al., MICRO'08; etc.]
  • Virtual channels [Nicopoulos et al., MICRO'06; etc.]
  • QoS & fairness mechanisms [Lee et al., ISCA'08; etc.]
  • Routing algorithms [Singh et al., CAL'04]
  • Router architecture [Park et al., ISCA'08]
  • Broadcast, multicast [Jerger et al., ISCA'08; Rodrigo et al., MICRO'08]
• Existing work assumes the existence of buffers in routers!

A Typical Router
(Figure: a conventional router with per-input-port virtual-channel buffers (VC1 … VCv), routing computation, VC arbiter, switch arbiter, an N x N crossbar, and credit flow back to the upstream routers.)
• Buffers are an integral part of existing NoC routers.

Buffers in NoC Routers
• Buffers are necessary for high network throughput
  → buffers increase the total available bandwidth in the network
(Figure: average packet latency vs. injection rate; larger buffers push the saturation point of the latency curve to higher injection rates.)
• Buffers consume significant energy/power
  • Dynamic energy when read/written
  • Static energy even when not occupied
• Buffers add complexity and latency
  • Logic for buffer management
  • Virtual channel allocation
  • Credit-based flow control
• Buffers require significant chip area
  • E.g., in the TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al., ICCD'06]

Going Bufferless…?
• How much throughput do we lose? How is latency affected?
(Figure: latency vs. injection rate; without buffers, the network saturates at a lower injection rate than with buffers.)
• Up to what injection rates can we use bufferless routing?
  • Are there realistic scenarios in which the NoC is operated at injection rates below this threshold? If so, by how much?
• Can we achieve energy reductions?
• Can we reduce area, complexity, etc.?
→ Answers in our paper!
Overview
• Introduction and Background
• Bufferless Routing (BLESS)
  • FLIT-BLESS
  • WORM-BLESS
  • BLESS with buffers
• Advantages and Disadvantages
• Evaluations
• Conclusions

BLESS: Bufferless Routing
• Always forward all incoming flits to some output port
• If no productive direction is available, send the flit in another direction
  → the packet is deflected
  → hot-potato routing [Baran'64, etc.]
(Figure: in a buffered network a contending packet waits in a buffer; in BLESS it is deflected and routed around the contended link.)

BLESS: Bufferless Routing
• Routing: the VC arbiter and switch arbiter of a conventional router are replaced by a single arbitration policy:
  1. Flit-Ranking: create a ranking over all incoming flits
  2. Port-Prioritization: for a given flit in this ranking, find the best free output port
• Apply to each flit in order of ranking

FLIT-BLESS: Flit-Level Routing
• Each flit is routed independently
• Oldest-first arbitration (other policies evaluated in the paper)
  1. Flit-Ranking: oldest-first ranking
  2. Port-Prioritization: assign the flit to a productive port, if possible; otherwise, assign it to a non-productive port
• Network topology: can be applied to most topologies (mesh, torus, hypercube, trees, …), provided
  1) #output ports ≥ #input ports at every router
  2) every router is reachable from every other router
• Flow control & injection policy: completely local; inject whenever the local input port is free
• Absence of deadlocks: every flit is always moving
• Absence of livelocks: guaranteed with oldest-first ranking
(A minimal code sketch of this arbitration step follows after the WORM-BLESS and buffered variants below.)

WORM-BLESS: Wormhole Routing
• Potential downsides of FLIT-BLESS:
  • Not energy-optimal (each flit needs header information)
  • Increase in latency (different flits take different paths)
  • Increase in receive buffer size
• BLESS with wormhole routing…? [Dally, Seitz'86]
• Problems:
  • Injection problem (not known when it is safe to inject)
  • Livelock problem (packets can be deflected forever)

WORM-BLESS: Wormhole Routing
• At low congestion, packets travel as worms
• Deflect worms if necessary! Truncate worms if necessary! (A truncated body-flit turns into the head-flit of a new worm.)
  1. Flit-Ranking: oldest-first ranking
  2. Port-Prioritization: if the flit is a head-flit,
     a) assign it to an unallocated, productive port
     b) assign it to an allocated, productive port
     c) assign it to an unallocated, non-productive port
     d) assign it to an allocated, non-productive port
     else, assign it to the port that is allocated to its worm
(Figure: a worm allocated to the West port is truncated when another head-flit bound for West claims that port; the truncated remainder becomes a new worm and is deflected, here to the North port.)
• See paper for details…

BLESS with Buffers
• BLESS without buffers is the extreme end of a continuum
• BLESS can be integrated with buffers
  • FLIT-BLESS with buffers
  • WORM-BLESS with buffers
• Whenever a buffer is full, its first flit becomes must-schedule
  → must-schedule flits must be deflected if necessary
• See paper for details…
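For concreteness, below is a minimal Python sketch of one FLIT-BLESS arbitration cycle at a single router, assuming a 2D mesh with "productive" ports defined by X/Y distance to the destination. The names (Flit, productive_ports, bless_arbitrate) are illustrative, not from the paper; WORM-BLESS would extend the port preference order with the allocation-aware rules a)–d) listed above, and the sketch omits local injection/ejection ports.

    from dataclasses import dataclass

    @dataclass
    class Flit:
        packet_id: int
        injection_time: int   # used for oldest-first ranking
        dest: tuple           # (x, y) destination coordinates

    def productive_ports(router_pos, dest):
        # Ports that bring the flit closer to its destination on a 2D mesh.
        x, y = router_pos
        dest_x, dest_y = dest
        ports = []
        if dest_x < x: ports.append("WEST")
        if dest_x > x: ports.append("EAST")
        if dest_y < y: ports.append("SOUTH")
        if dest_y > y: ports.append("NORTH")
        return ports

    def bless_arbitrate(router_pos, incoming_flits, output_ports):
        # 1. Flit-Ranking: oldest first (smallest injection time wins).
        ranked = sorted(incoming_flits, key=lambda f: f.injection_time)
        free = set(output_ports)
        assignment = {}
        # 2. Port-Prioritization: prefer a free productive port; otherwise
        #    deflect to any remaining free port. Because every router has at
        #    least as many output ports as input ports, a free port always
        #    exists, so no flit ever needs to be buffered.
        for flit in ranked:
            candidates = [p for p in productive_ports(router_pos, flit.dest) if p in free]
            port = candidates[0] if candidates else next(iter(free))  # deflection
            assignment[flit.packet_id] = port
            free.remove(port)
        return assignment

    # Example: two flits contend for EAST at router (1, 1); the older flit
    # gets EAST, the younger one is deflected to some other free port.
    print(bless_arbitrate((1, 1),
                          [Flit(1, 5, (3, 1)), Flit(2, 9, (3, 1))],
                          ["NORTH", "EAST", "SOUTH", "WEST"]))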
BLESS: Advantages & Disadvantages
• Advantages:
  • No buffers
  • Purely local flow control
  • Simplicity: no credit flows, no virtual channels, simplified router design
  • No deadlocks, no livelocks
  • Adaptivity: packets are deflected around congested areas!
  • Router latency reduction
  • Area savings
• Disadvantages:
  • Increased latency
  • Reduced bandwidth
  • Increased buffering at the receiver
  • Header information in each flit
• Impact on energy…? → Extensive evaluations in the paper!

Reduction of Router Latency
• Pipeline stages: BW = Buffer Write, RC = Route Computation, VA = Virtual Channel Allocation, SA = Switch Allocation, ST = Switch Traversal, LT = Link Traversal, LA LT = Link Traversal of Lookahead
• Baseline router (speculative) [Dally, Towles'04]: head flit: BW | RC+VA+SA | ST | LT; body flit: BW | SA | ST | LT → router latency = 3 (can be improved to 2)
• BLESS router (standard): RC | ST | LT, with the lookahead link traversal (LA LT) overlapped → router latency = 2
• BLESS router (optimized): RC+ST | LT → router latency = 1
• BLESS gets rid of input buffers and virtual channels

Evaluation Methodology
• 2D mesh network, router latency is 2 cycles
  o 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
  o 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
  o 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
  o 128-bit wide links, 4-flit data packets, 1-flit address packets
  o For the baseline configuration: 4 VCs per physical input port, 1 packet deep
• Simulation is cycle-accurate; models stalls in the network and processors; captures self-throttling behavior; aggressive processor model
• Processor and memory model
  o x86 processor model based on Intel Pentium M; 2 GHz, 128-entry instruction window
  o 64 KB private L1 caches; 16 MB total shared L2 cache, 16 MSHRs per bank
  o DRAM model based on Micron DDR2-800
  o Most of our evaluations use perfect L2 caches → puts maximal stress on the NoC
• Benchmarks
  o Multiprogrammed SPEC CPU2006 and Windows Desktop applications
  o Heterogeneous and homogeneous application mixes
  o Synthetic traffic patterns: uniform random (UR), Transpose, Tornado, Bit Complement

Evaluation Methodology
• Energy model provided by the Orion simulator [MICRO'02]
  o 70 nm technology, 2 GHz routers at 1.0 Vdd
• For BLESS, we model
  o Additional energy to transmit header information
  o Additional buffers needed on the receiver side
  o Additional logic to reorder flits of individual packets at the receiver
• We partition network energy into buffer energy, router energy, and link energy, each having static and dynamic components (written out below)
• Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms (DO, MIN-AD, ROMM)
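As a shorthand for this partition (the notation here is mine, not the paper's), total network energy can be written as

    E_{\mathrm{network}} = \sum_{c \,\in\, \{\mathrm{buffer},\,\mathrm{router},\,\mathrm{link}\}} \big( E_c^{\mathrm{static}} + E_c^{\mathrm{dynamic}} \big)

BLESS removes the router-side buffer term but adds header, receiver-side buffering, and reordering overheads, so whether the total goes down is exactly what the evaluation has to settle.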
Evaluation – Synthetic Traces
• First, the bad news
• Uniform random injection
• BLESS has significantly lower saturation throughput compared to the buffered baseline
(Figure: average latency vs. injection rate (flits per cycle per node) for FLIT-2, WORM-2, FLIT-1, WORM-1, MIN-AD, and the best buffered baseline; the BLESS variants saturate at lower injection rates.)

Evaluation – Homogeneous Case Study
• milc benchmarks (moderately intensive), perfect caches
• Very little performance degradation with BLESS (less than 4% in the dense network)
• With router latency 1, BLESS can even outperform the baseline (by ~10%)
• Significant energy improvements (almost 40%)
(Figures: W-Speedup for Baseline, BLESS, and BLESS with router latency 1, and normalized network energy broken into BufferEnergy, RouterEnergy, and LinkEnergy, for 4x4/8x milc, 4x4/16x milc, and 8x8/16x milc.)
• Observations:
  1) Injection rates are not extremely high on average → self-throttling!
  2) For bursts and temporary hotspots, BLESS uses the network links as buffers!

Evaluation – Further Results
• BLESS increases the buffer requirement at the receiver by at most 2x
  → overall, energy is still reduced
• Impact of memory latency: with real caches, very little slowdown! (at most 1.5%)
(Figure: W-Speedup of DO, MIN-AD, ROMM, FLIT-2, WORM-2, FLIT-1, and WORM-1 for 4x4/8x matlab, 4x4/16x matlab, and 8x8/16x matlab.)
• Heterogeneous application mixes (we evaluate several mixes of intensive and non-intensive applications)
  → little performance degradation
  → significant energy savings in all cases
  → no significant increase in unfairness across different applications
• Area savings: ~60% of network area can be saved!
• See paper for details…

Evaluation – Aggregate Results
• Aggregate results over all 29 applications (the W-Speedup metric used here is spelled out below)

  Sparse Network            Perfect L2                 Realistic L2
                            Average    Worst-Case      Average    Worst-Case
  ∆ Network Energy          -39.4%     -28.1%          -46.4%     -41.0%
  ∆ System Performance      -0.5%      -3.2%           -0.15%     -0.55%

  Dense Network             Perfect L2                 Realistic L2
                            Average    Worst-Case      Average    Worst-Case
  ∆ Network Energy          -32.8%     -14.0%          -42.5%     -33.7%
  ∆ System Performance      -3.6%      -17.1%          -0.7%      -1.5%

(Figures: normalized energy breakdown (BufferEnergy, RouterEnergy, LinkEnergy) and W-Speedup for BASE, FLIT, and WORM, mean and worst case.)
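For reference, the W-Speedup axis in the charts above denotes weighted speedup; assuming the standard multiprogrammed definition over the N applications in a workload mix (the paper should be consulted for the exact formulation used),

    \mathrm{W\text{-}Speedup} = \sum_{i=1}^{N} \frac{\mathrm{IPC}_i^{\mathrm{shared}}}{\mathrm{IPC}_i^{\mathrm{alone}}}

and the ∆ System Performance rows report the relative change of system performance with respect to the buffered baseline.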
Conclusion
• For a very wide range of applications and network settings, buffers are not needed in NoCs
  • Significant energy savings (32% even in dense networks with perfect caches)
  • Area savings of ~60%
  • Simplified router and network design (flow control, etc.)
  • Performance slowdown is minimal (performance can even increase!)
→ A strong case for rethinking NoC design!
• Ongoing and future work: support for quality of service, different traffic classes, energy management, etc.