A Case for Bufferless Routing in On-Chip Networks
Thomas Moscibroda (Microsoft Research)    Onur Mutlu (CMU)

On-Chip Networks (NoC)
(Figure: a multi-core chip in which the nodes — CPU + L1 caches, L2 cache banks, memory controllers, accelerators, etc. — are connected by an on-chip network.)

On-Chip Networks (NoC)
• Connect cores, caches, memory controllers, etc.
  • Examples: Intel 80-core Terascale chip, MIT RAW chip
• Design goals in NoC design:
  • High throughput, low latency
  • Fairness between cores, QoS, …
  • Low complexity, low cost
  • Low power, low energy consumption

Energy/Power in On-Chip Networks
• Power is a key constraint in the design of high-performance processors
• NoCs consume a substantial portion of system power
  • ~30% in the Intel 80-core Terascale chip [IEEE Micro'07]
  • ~40% in the MIT RAW chip [ISCA'04]
• NoCs estimated to consume 100s of Watts [Borkar, DAC'07]

Current NoC Approaches
• Existing approaches differ in numerous ways:
  • Network topology [Kim et al., ISCA'07; Kim et al., ISCA'08; etc.]
  • Flow control [Michelogiannakis et al., HPCA'09; Kumar et al., MICRO'08; etc.]
  • Virtual channels [Nicopoulos et al., MICRO'06; etc.]
  • QoS & fairness mechanisms [Lee et al., ISCA'08; etc.]
  • Routing algorithms [Singh et al., CAL'04]
  • Router architecture [Park et al., ISCA'08]
  • Broadcast, multicast [Jerger et al., ISCA'08; Rodrigo et al., MICRO'08]
• Existing work assumes the existence of buffers in routers!

A Typical Router
(Figure: a conventional router with per-input-port virtual-channel buffers (VC1 … VCv), routing computation, VC arbiter, switch arbiter, an N x N crossbar, and credit flow back to the upstream routers.)
• Buffers are an integral part of existing NoC routers.

Buffers in NoC Routers
• Buffers are necessary for high network throughput
  → buffers increase the total available bandwidth in the network
(Figure: average packet latency vs. injection rate; larger buffers push the saturation point of the latency curve to higher injection rates.)
• Buffers consume significant energy/power
  • Dynamic energy when read/written
  • Static energy even when not occupied
• Buffers add complexity and latency
  • Logic for buffer management
  • Virtual channel allocation
  • Credit-based flow control
• Buffers require significant chip area
  • E.g., in the TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al., ICCD'06]

Going Bufferless…?
• How much throughput do we lose? How is latency affected?
(Figure: latency vs. injection rate; without buffers, the network saturates at a lower injection rate than with buffers.)
• Up to what injection rates can we use bufferless routing?
  • Are there realistic scenarios in which the NoC is operated at injection rates below this threshold? If so, by how much?
• Can we achieve energy reductions?
• Can we reduce area, complexity, etc.?
→ Answers in our paper!
Overview
• Introduction and Background
• Bufferless Routing (BLESS)
  • FLIT-BLESS
  • WORM-BLESS
  • BLESS with buffers
• Advantages and Disadvantages
• Evaluations
• Conclusions

BLESS: Bufferless Routing
• Always forward all incoming flits to some output port
• If no productive direction is available, send the flit in another direction
  → the packet is deflected
  → hot-potato routing [Baran'64, etc.]
(Figure: in a buffered network a contending packet waits in a buffer; in BLESS it is deflected and routed around the contended link.)

BLESS: Bufferless Routing
• Routing: the VC arbiter and switch arbiter of a conventional router are replaced by a single arbitration policy:
  1. Flit-Ranking: create a ranking over all incoming flits
  2. Port-Prioritization: for a given flit in this ranking, find the best free output port
• Apply to each flit in order of ranking

FLIT-BLESS: Flit-Level Routing
• Each flit is routed independently
• Oldest-first arbitration (other policies evaluated in the paper)
  1. Flit-Ranking: oldest-first ranking
  2. Port-Prioritization: assign the flit to a productive port, if possible; otherwise, assign it to a non-productive port
• Network topology: can be applied to most topologies (mesh, torus, hypercube, trees, …), provided
  1) #output ports ≥ #input ports at every router
  2) every router is reachable from every other router
• Flow control & injection policy: completely local; inject whenever the local input port is free
• Absence of deadlocks: every flit is always moving
• Absence of livelocks: guaranteed with oldest-first ranking
(A minimal code sketch of this arbitration step follows after the WORM-BLESS and buffered variants below.)

WORM-BLESS: Wormhole Routing
• Potential downsides of FLIT-BLESS:
  • Not energy-optimal (each flit needs header information)
  • Increase in latency (different flits take different paths)
  • Increase in receive buffer size
• BLESS with wormhole routing…? [Dally, Seitz'86]
• Problems:
  • Injection problem (not known when it is safe to inject)
  • Livelock problem (packets can be deflected forever)

WORM-BLESS: Wormhole Routing
• At low congestion, packets travel as worms
• Deflect worms if necessary! Truncate worms if necessary! (A truncated body-flit turns into the head-flit of a new worm.)
  1. Flit-Ranking: oldest-first ranking
  2. Port-Prioritization: if the flit is a head-flit,
     a) assign it to an unallocated, productive port
     b) assign it to an allocated, productive port
     c) assign it to an unallocated, non-productive port
     d) assign it to an allocated, non-productive port
     else, assign it to the port that is allocated to its worm
(Figure: a worm allocated to the West port is truncated when another head-flit bound for West claims that port; the truncated remainder becomes a new worm and is deflected, here to the North port.)
• See paper for details…

BLESS with Buffers
• BLESS without buffers is the extreme end of a continuum
• BLESS can be integrated with buffers
  • FLIT-BLESS with buffers
  • WORM-BLESS with buffers
• Whenever a buffer is full, its first flit becomes must-schedule
  → must-schedule flits must be deflected if necessary
• See paper for details…
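For concreteness, below is a minimal Python sketch of one FLIT-BLESS arbitration cycle at a single router, assuming a 2D mesh with "productive" ports defined by X/Y distance to the destination. The names (Flit, productive_ports, bless_arbitrate) are illustrative, not from the paper; WORM-BLESS would extend the port preference order with the allocation-aware rules a)–d) listed above, and the sketch omits local injection/ejection ports.

    from dataclasses import dataclass

    @dataclass
    class Flit:
        packet_id: int
        injection_time: int   # used for oldest-first ranking
        dest: tuple           # (x, y) destination coordinates

    def productive_ports(router_pos, dest):
        # Ports that bring the flit closer to its destination on a 2D mesh.
        x, y = router_pos
        dest_x, dest_y = dest
        ports = []
        if dest_x < x: ports.append("WEST")
        if dest_x > x: ports.append("EAST")
        if dest_y < y: ports.append("SOUTH")
        if dest_y > y: ports.append("NORTH")
        return ports

    def bless_arbitrate(router_pos, incoming_flits, output_ports):
        # 1. Flit-Ranking: oldest first (smallest injection time wins).
        ranked = sorted(incoming_flits, key=lambda f: f.injection_time)
        free = set(output_ports)
        assignment = {}
        # 2. Port-Prioritization: prefer a free productive port; otherwise
        #    deflect to any remaining free port. Because every router has at
        #    least as many output ports as input ports, a free port always
        #    exists, so no flit ever needs to be buffered.
        for flit in ranked:
            candidates = [p for p in productive_ports(router_pos, flit.dest) if p in free]
            port = candidates[0] if candidates else next(iter(free))  # deflection
            assignment[flit.packet_id] = port
            free.remove(port)
        return assignment

    # Example: two flits contend for EAST at router (1, 1); the older flit
    # gets EAST, the younger one is deflected to some other free port.
    print(bless_arbitrate((1, 1),
                          [Flit(1, 5, (3, 1)), Flit(2, 9, (3, 1))],
                          ["NORTH", "EAST", "SOUTH", "WEST"]))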
BLESS: Advantages & Disadvantages
• Advantages:
  • No buffers
  • Purely local flow control
  • Simplicity: no credit flows, no virtual channels, simplified router design
  • No deadlocks, no livelocks
  • Adaptivity: packets are deflected around congested areas!
  • Router latency reduction
  • Area savings
• Disadvantages:
  • Increased latency
  • Reduced bandwidth
  • Increased buffering at the receiver
  • Header information in each flit
• Impact on energy…? → Extensive evaluations in the paper!

Reduction of Router Latency
• Pipeline stages: BW = Buffer Write, RC = Route Computation, VA = Virtual Channel Allocation, SA = Switch Allocation, ST = Switch Traversal, LT = Link Traversal, LA LT = Link Traversal of Lookahead
• Baseline router (speculative) [Dally, Towles'04]: head flit: BW | RC+VA+SA | ST | LT; body flit: BW | SA | ST | LT → router latency = 3 (can be improved to 2)
• BLESS router (standard): RC | ST | LT, with the lookahead link traversal (LA LT) overlapped → router latency = 2
• BLESS router (optimized): RC+ST | LT → router latency = 1
• BLESS gets rid of input buffers and virtual channels

Evaluation Methodology
• 2D mesh network, router latency is 2 cycles
  o 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
  o 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
  o 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
  o 128-bit wide links, 4-flit data packets, 1-flit address packets
  o For the baseline configuration: 4 VCs per physical input port, 1 packet deep
• Simulation is cycle-accurate; models stalls in the network and processors; captures self-throttling behavior; aggressive processor model
• Processor and memory model
  o x86 processor model based on Intel Pentium M; 2 GHz, 128-entry instruction window
  o 64 KB private L1 caches; 16 MB total shared L2 cache, 16 MSHRs per bank
  o DRAM model based on Micron DDR2-800
  o Most of our evaluations use perfect L2 caches → puts maximal stress on the NoC
• Benchmarks
  o Multiprogrammed SPEC CPU2006 and Windows Desktop applications
  o Heterogeneous and homogeneous application mixes
  o Synthetic traffic patterns: uniform random (UR), Transpose, Tornado, Bit Complement

Evaluation Methodology
• Energy model provided by the Orion simulator [MICRO'02]
  o 70 nm technology, 2 GHz routers at 1.0 Vdd
• For BLESS, we model
  o Additional energy to transmit header information
  o Additional buffers needed on the receiver side
  o Additional logic to reorder flits of individual packets at the receiver
• We partition network energy into buffer energy, router energy, and link energy, each having static and dynamic components (written out below)
• Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms (DO, MIN-AD, ROMM)
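As a shorthand for this partition (the notation here is mine, not the paper's), total network energy can be written as

    E_{\mathrm{network}} = \sum_{c \,\in\, \{\mathrm{buffer},\,\mathrm{router},\,\mathrm{link}\}} \big( E_c^{\mathrm{static}} + E_c^{\mathrm{dynamic}} \big)

BLESS removes the router-side buffer term but adds header, receiver-side buffering, and reordering overheads, so whether the total goes down is exactly what the evaluation has to settle.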
Evaluation – Synthetic Traces
• First, the bad news
• Uniform random injection
• BLESS has significantly lower saturation throughput compared to the buffered baseline
(Figure: average latency vs. injection rate (flits per cycle per node) for FLIT-2, WORM-2, FLIT-1, WORM-1, MIN-AD, and the best buffered baseline; the BLESS variants saturate at lower injection rates.)

Evaluation – Homogeneous Case Study
• milc benchmarks (moderately intensive), perfect caches
• Very little performance degradation with BLESS (less than 4% in the dense network)
• With router latency 1, BLESS can even outperform the baseline (by ~10%)
• Significant energy improvements (almost 40%)
(Figures: W-Speedup for Baseline, BLESS, and BLESS with router latency 1, and normalized network energy broken into BufferEnergy, RouterEnergy, and LinkEnergy, for 4x4/8x milc, 4x4/16x milc, and 8x8/16x milc.)
• Observations:
  1) Injection rates are not extremely high on average → self-throttling!
  2) For bursts and temporary hotspots, BLESS uses the network links as buffers!

Evaluation – Further Results
• BLESS increases the buffer requirement at the receiver by at most 2x
  → overall, energy is still reduced
• Impact of memory latency: with real caches, very little slowdown! (at most 1.5%)
(Figure: W-Speedup of DO, MIN-AD, ROMM, FLIT-2, WORM-2, FLIT-1, and WORM-1 for 4x4/8x matlab, 4x4/16x matlab, and 8x8/16x matlab.)
• Heterogeneous application mixes (we evaluate several mixes of intensive and non-intensive applications)
  → little performance degradation
  → significant energy savings in all cases
  → no significant increase in unfairness across different applications
• Area savings: ~60% of network area can be saved!
• See paper for details…

Evaluation – Aggregate Results
• Aggregate results over all 29 applications (the W-Speedup metric used here is spelled out below)

  Sparse Network            Perfect L2                 Realistic L2
                            Average    Worst-Case      Average    Worst-Case
  ∆ Network Energy          -39.4%     -28.1%          -46.4%     -41.0%
  ∆ System Performance      -0.5%      -3.2%           -0.15%     -0.55%

  Dense Network             Perfect L2                 Realistic L2
                            Average    Worst-Case      Average    Worst-Case
  ∆ Network Energy          -32.8%     -14.0%          -42.5%     -33.7%
  ∆ System Performance      -3.6%      -17.1%          -0.7%      -1.5%

(Figures: normalized energy breakdown (BufferEnergy, RouterEnergy, LinkEnergy) and W-Speedup for BASE, FLIT, and WORM, mean and worst case.)
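For reference, the W-Speedup axis in the charts above denotes weighted speedup; assuming the standard multiprogrammed definition over the N applications in a workload mix (the paper should be consulted for the exact formulation used),

    \mathrm{W\text{-}Speedup} = \sum_{i=1}^{N} \frac{\mathrm{IPC}_i^{\mathrm{shared}}}{\mathrm{IPC}_i^{\mathrm{alone}}}

and the ∆ System Performance rows report the relative change of system performance with respect to the buffered baseline.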
Conclusion
• For a very wide range of applications and network settings, buffers are not needed in NoCs
  • Significant energy savings (32% even in dense networks with perfect caches)
  • Area savings of ~60%
  • Simplified router and network design (flow control, etc.)
  • Performance slowdown is minimal (performance can even increase!)
→ A strong case for rethinking NoC design!
• Ongoing and future work: support for quality of service, different traffic classes, energy management, etc.