Networks on Chip: Router Microarchitecture & Network Topologies

Daniel U. Becker, James Chen, Nan Jiang, Prof. William J. Dally
Concurrent VLSI Architecture Group
Stanford University
10/13/09

Outline
- Introduction
- Router Microarchitecture
- Network Topologies
- Open Source Router RTL
- Q&A

Why Networks-on-Chip? (1)
- Clock frequency scaling and ILP exploitation have hit the power wall
  - Single-threaded performance is leveling off
- Transistor budgets are still growing
- Further performance increases rely on exploiting parallelism
- Key problems:
  - Scalability
  - Energy efficiency
  - Design complexity
[Figure; source: Wikipedia]

Why Networks-on-Chip? (2)
- Bus-based solutions don't scale
  - Contention, electrical characteristics, timing, …
- Really want point-to-point links
- Full connectivity is too expensive
  - Area, power, delay, …
- Ad-hoc wiring is too expensive
  - Design, verification, …
- Need efficient, scalable communication fabric on chip: Network!
  - Building blocks (circuits, microarchitecture)
  - Topologies
  - Routing & flow control schemes

NoCs vs. Long-Haul Networks
- Networks are a mature field
  - Extensive body of prior research
  - Internet, wireless, interconnection networks, …
- NoCs have similar building blocks & design choices
- So why not just leverage those results?
  - Well, we do, but…
- NoCs are subject to very different constraints
- These constraints lead to very different design tradeoffs

The On-Chip Environment (1)
- Wires are cheap
  - Favors wider interfaces and more channels
- Buffers are expensive
  - Provide "just enough" buffering (e.g. credit round trip)
  - Minimize occupancy & turnaround time
  - Efficient flow control required
- Power budget is the limiting factor for current chip designs
  - Minimize power required for moving things around
  - Maximize power available for doing actual work

The On-Chip Environment (2)
- Semiconductor technology requires planar layout
  - Favors regular, low-dimensional topologies
- Strict cycle time constraints
  - Favors simple routing algorithms, flow control mechanisms
  - No complex arithmetic or large lookup tables
  - Minimize amount of state required for adaptive routing
- No need to support online reconfigurability
  - Topology is essentially static
  - Possibly some dynamic reconfiguration for power management
  - But must be able to isolate faulty cores, e.g. at boot time

Traffic Characteristics (1)
- Message-oriented
  - Few message types with short, uniform message sizes
- Highly latency-sensitive
  - E.g. memory, cache coherence traffic in CMPs
  - Requires shallow router pipelines, low-diameter topologies
- Bursty
  - Can't just optimize for average network load!
- Message loss usually not acceptable
  - Message may contain the only instance of its payload data
  - End-to-end reliability is expensive

Traffic Characteristics (2)
- Assumptions for this talk:
  - Memory load/store traffic
  - Split transaction protocol
  - Single-packet messages
  - Single-phit flits

  Transaction Type   Message Type   Packet Size
  Read               Request        1 flit
  Read               Reply          1+n flits
  Write              Request        1+n flits
  Write              Reply          1 flit

Router Microarchitecture
- Topology, routing and flow control set the high-level framework for network cost and performance
  - Channel width, connectivity, path diversity, hop count, …
- Router microarchitecture determines the cost of each hop
  - Directly impacts latency, throughput and power consumption
- Comprises a variety of different aspects:
  - Pipeline organization
  - Implementation of routing logic, flow control & allocators
  - Buffer management
  - Power management, reliability & fault tolerance

Virtual Channels
- Share channel capacity between multiple data streams
  - Interleave flits from different packets
  - Provide dedicated buffer space for each virtual channel
  - Decouple channels from buffers
- "The Swiss Army Knife for Interconnection Networks"
  - Prevent deadlocks
  - Reduce head-of-line blocking
  - Also useful for providing QoS

Using VCs for Deadlock Prevention [Dally'87]
- Protocol deadlock
  - Circular dependencies between messages at network edge
  - Solution:
    - Partition range of VCs into different message classes
- Routing deadlock
  - Circular dependencies between resources within network
  - Solution:
    - Partition range of VCs into different resource classes
    - Restrict transitions between resource classes to impose partial order on resource acquisition
- {packet classes} = {message classes} × {resource classes}

Using VCs for Flow Control [Dally'90]
- Coupling between channels and buffers causes head-of-line blocking
  - Adds false dependencies between packets
  - Limits channel utilization
  - Increases latency
  - Even with VCs for deadlock prevention, still applies to packets in same class
- Solution:
  - Assign multiple VCs to each packet class

VC Router Pipeline
- Per packet:
  - Route Computation (RC)
    - Determine candidate output port(s) and VC(s)
    - Can be precomputed at upstream router (lookahead routing)
  - Virtual Channel Allocation (VA)
    - Assign available output VCs to waiting packets at input VCs
- Per flit:
  - Switch Allocation (SA)
    - Assign switch time slots to buffered flits
  - Switch Traversal (ST)
    - Send flits through crossbar switch to appropriate output

Allocator Comparison
- Allocators represent key pieces of router control logic
  - Affect both cost/complexity and performance
- Evaluate & compare several representative VC and switch allocator implementations
- Investigate scaling behavior with router radix, number of VCs, and other key parameters
- To appear in SC'09

Allocation Basics
- Arbitration:
  - Multiple requestors
  - Single resource
  - Request + grant vectors
- Allocation:
  - Multiple requestors
  - Multiple equivalent resources
  - Request + grant matrices
- Matching:
  - Each grant must satisfy a request
  - Each requestor gets at most one grant
  - Each resource is granted at most once

Separable Allocators
[Figure panels: Input-first, Output-first]
- Matchings have at most one grant per row and per column
- Implement via two phases of arbitration
  - Column-wise and row-wise
  - Perform in either order (input-first or output-first)
  - Arbiters in each stage are fully independent
- Fast and cheap
- But bad choices in the first phase can prevent the second stage from generating a good matching!
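
A minimal Python sketch of the separable input-first scheme described above, for illustration only. The names `arbitrate`, `separable_input_first` and `is_matching` are hypothetical and do not come from the talk or its RTL. The arbiters are modeled as simple round-robin priority searches, and priority update after a grant is not modeled.

```python
from typing import List

Matrix = List[List[bool]]


def arbitrate(requests: List[bool], priority: int) -> int:
    """Round-robin style arbiter (assumed model): return the index of the
    first asserted request at or after `priority`, wrapping around.
    Returns -1 if no request is asserted."""
    n = len(requests)
    for offset in range(n):
        idx = (priority + offset) % n
        if requests[idx]:
            return idx
    return -1


def separable_input_first(req: Matrix, in_prio: List[int],
                          out_prio: List[int]) -> Matrix:
    """req[i][j] is True if requestor i wants resource j.
    Phase 1 arbitrates within each row, phase 2 within each column."""
    n_in, n_out = len(req), len(req[0])
    # Phase 1: every requestor independently picks one of its requests.
    picked = [arbitrate(req[i], in_prio[i]) for i in range(n_in)]
    # Phase 2: every resource independently picks one requestor that chose it.
    grant = [[False] * n_out for _ in range(n_in)]
    for j in range(n_out):
        column = [picked[i] == j for i in range(n_in)]
        winner = arbitrate(column, out_prio[j])
        if winner >= 0:
            grant[winner][j] = True
    return grant


def is_matching(req: Matrix, grant: Matrix) -> bool:
    """Check the three matching conditions from the Allocation Basics slide."""
    rows_ok = all(sum(row) <= 1 for row in grant)
    cols_ok = all(sum(col) <= 1 for col in zip(*grant))
    backed = all(r or not g
                 for req_row, gnt_row in zip(req, grant)
                 for r, g in zip(req_row, gnt_row))
    return rows_ok and cols_ok and backed
```

With `req = [[True, True], [True, False]]`, calling `separable_input_first(req, [0, 0], [0, 0])` lets both requestors pick resource 0 in the first phase, so only one grant is produced even though a matching of size two exists. Starting requestor 0 at priority 1 instead yields both grants, which illustrates how a first-phase choice can limit the second phase.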

Wavefront Allocators [Tamir'93]
- Avoid separate phases
  - … and bad decisions in first
- Generate better matchings
- But delay scales linearly
- Also difficult to pipeline
- Principle of operation:
  - Pick initial diagonal
  - Grant all requests on diagonal
    - Never conflict!
  - For each grant, delete requests in same row, column
  - Repeat for next diagonal

Wavefront Allocator Timing
- Originally conceived as full-custom design
  - Tiled design
  - True delay scales linearly
- Signal wraparound creates combinational loops
  - Effectively broken at priority diagonal
  - But static timing analysis cannot infer that
- Synthesized designs must be modified to avoid loops!

Loop-Free Wavefront Allocators
[Figure panels: I/O Transformation, Replication]

Diagonal Propagation Allocator [Hurt'99]
- Unrolled matrix avoids combinational loops
- Sliding priority window activates sub-matrix cells
- But static timing analysis again sees false paths!
  - Actual delay is ~n
  - Reported delay is ~(2n-1)
  - Hurts synthesized designs

Domino Logic Wavefront Allocator [J. Chen]
- Wavefront allocator tends to be on the critical path for high-radix routers
- Full-custom dual-rail domino implementation achieves 2-3x reduction in latency over synthesized implementation
- Exploits the one-hot nature of the priority signal
- Optimize token propagation path
  - Single domino gate per cell on critical path
  - Maximum pull-down depth of two

VC Allocation
- Before packets can proceed through router, need to acquire ownership of VC at downstream router
- VC allocator matches unassigned input VCs with output VCs that are not currently in use
  - P×V requestors (input VCs), P×V resources (output VCs)
- VC is acquired by head flit, inherited by body & tail flits

VC Allocator Implementations
[Figure: VC allocator implementations]
- Not shown:
  - Masking logic for busy VCs

Sparse VC Allocation
- Any-to-any flexibility in VC allocator is unnecessary
- Different use cases for VCs restrict possible transitions:
  - Message class never changes
  - Resource classes are traversed in order
- VCs within a packet class are functionally equivalent
  - Requests apply to all VCs assigned to target class
- Can take advantage of these properties to reduce VC allocator complexity!

  Property   Max. Savings
  Delay      41%
  Area       90%
  Power      83%

VC Allocator Performance
- Type of VC allocator has little impact on performance
  - Each packet only generates a single request
  - Each input VC only has a small set of possible destination VCs
[Figure panels: Mesh, 4 VCs; FBFly, 8 VCs]

VC Allocator Cost
[Figure panels: Delay, Area, Power; series sep_if, sep_of, wf; mesh and fbfly configurations 2x1x1 through 2x2x4]
- Wavefront cost scales badly
  - Synthesis fails for larger FBFly configurations
  - But cost is reasonable for small mesh configurations

Switch Allocation
- Traversal of the router requires access to the crossbar
  - All VCs at a given input port share one crossbar input
- Switch allocator matches ready-to-go flits with crossbar time slots
  - Allocation performed on a cycle-by-cycle basis
  - P×V requestors (input VCs), P resources (output ports)
  - At most one flit at each input port can be granted

Switch Allocator Implementations
[Figure: switch allocator implementations]
- Not shown:
  - Masking logic for credit and VC state check

Switch Allocators – 1 VC / Class
[Figure panels: Mesh, FBFly; annotation: normalized to cycle time]

Switch Allocators – 2 VCs / Class
[Figure]

Switch Allocators – 4 VCs / Class
[Figure]

Switch Allocator Cost
[Figure panels: Delay, Area, Power; series sep_if m, sep_if rr, sep_of m, sep_of rr, wf; mesh and fbfly configurations 2x1x1 through 2x2x4]
- Separable input-first is fastest and has least cost
- Separable output-first has higher delay & cost but similar matching performance
- Matrix arbiters cost much more than round-robin, but delay advantage is small
- Wavefront provides best matching, but at the cost of increased delay, area and power

Speculative Switch Allocation [Peh'01]
- Perform switch allocation in parallel with VC allocation
  - Speculate that the latter will be successful
  - If so, saves a pipeline stage, otherwise try again
- Reduces zero-load latency, but adds complexity
- Prioritize non-speculative requests
  - Avoid performance degradation due to misspeculation
- Usually implemented through secondary switch allocator
  - Simpler/faster/cheaper than true multi-priority allocator
  - But need to prioritize non-speculative grants

Speculative Grant Masking
- Conventional approach kills speculative grants upon conflict with non-speculative ones
- Speculation matters primarily at low load, so can pessimistically kill using non-speculative requests instead
  - Sacrifice some speculation opportunities for lower delay
  - But zero-load latency remains virtually unaffected!
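
The two masking policies above can be contrasted with a small Python sketch. It assumes that requests and grants are port-level boolean matrices (input port by output port), and that a speculative grant conflicts with non-speculative activity whenever they share an input port or an output port. The function `mask_speculative` and this exact conflict model are illustrative assumptions, not taken from the talk.

```python
from typing import List

Matrix = List[List[bool]]


def mask_speculative(spec_gnt: Matrix, nonspec: Matrix) -> Matrix:
    """Kill every speculative grant that shares an input port (row) or an
    output port (column) with any asserted entry of `nonspec`.

    Passing non-speculative GRANTS models the conventional approach.
    Passing non-speculative REQUESTS models the pessimistic approach,
    which does not have to wait for the non-speculative allocator."""
    busy_in = [any(row) for row in nonspec]
    busy_out = [any(col) for col in zip(*nonspec)]
    return [[g and not busy_in[i] and not busy_out[j]
             for j, g in enumerate(row)]
            for i, row in enumerate(spec_gnt)]


# Hypothetical 3-port example: inputs 0 and 1 both request output 1
# non-speculatively, and input 0 wins that arbitration.
nonspec_req = [[False, True, False],
               [False, True, False],
               [False, False, False]]
nonspec_gnt = [[False, True, False],
               [False, False, False],
               [False, False, False]]
# The speculative allocator granted input 1 -> output 2 and input 2 -> output 0.
spec_gnt = [[False, False, False],
            [False, False, True],
            [True, False, False]]

conventional = mask_speculative(spec_gnt, nonspec_gnt)
pessimistic = mask_speculative(spec_gnt, nonspec_req)
```

In this example the conventional policy keeps both speculative grants, while the pessimistic policy drops the grant for input 1 because that port has a pending non-speculative request, even though the request lost. That is the sacrificed speculation opportunity the slide refers to; at low load such collisions are rare.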

Speculation – 1 VC / Class
[Figure panels: Mesh, FBFly; annotation: normalized to cycle time]

Speculation – 2 VCs / Class
[Figure]

Speculation – 4 VCs / Class
[Figure]

Speculation Cost
[Figure panels: Delay, Area, Power; series nonspec, spec_gnt, spec_req; mesh and fbfly configurations 2x1x1 through 2x2x4]
- Pessimistic masking reduces delay hit due to speculation
- Slight increase in area
- Slight power increase in small configurations only

Allocator Study Conclusions
- Network performance largely insensitive to VC allocator
  - For reasonable packet sizes, VC allocator is lightly loaded
  - With light load, all variants produce near-ideal matchings
  - Favors use of simplest / fastest variant (sep/if)
- Sparse VC allocation can greatly reduce delay & cost
- For switch allocation, wavefront allocator produces better matchings at the cost of slightly higher delay
  - Best choice depends on application goals & constraints
- Pessimistic speculation narrows delay gap to non-speculative implementation, but can slightly increase area & power

Dual-Path Switch Allocation (1)
- Significant fraction of cycle is used for peripheral logic
  - VC state & credit checking
  - Input VC selection / combination of requests
- For buffered flits, much of this can be precomputed
  - Leave most of cycle for actual allocation logic
  - In some cases need to be pessimistic!
- For newly arriving flits, precomputation not possible
  - But at most one such flit per input port per cycle!
  - So can simplify allocation for these flits

Dual-Path Switch Allocation (2)
- Idea: Provide separate, optimized logic paths for newly arriving and buffered flits
- New flits always try to bypass buffer (fast path)
  - If no other VC at input port is active, send requests to fast-path output arbiter
  - First part of cycle available for credit checking, etc.
  - Also write to buffer in case bypassing fails
- For buffered flits, precompute control signals (slow path)
  - Preselect next VC to go (round-robin between active VCs)
  - Check for credit and eligibility one cycle ahead
  - Almost entire cycle available for slow-path allocation
- Merge grants from both paths
  - Prioritize slow-path grants to avoid starvation

Network Topologies (1)
- Physical organization of the network
  - Number of nodes per router
  - Connectivity between routers
- Directly affects all key network parameters
  - Channel length, router complexity, latency, throughput, …
- These parameters translate into performance and power

Network Topologies (2)
[Figure: topology diagrams; legend: Node, Router; source: N. Jiang]
- 2D mesh
  - One router per node
  - Connected with 4 neighbors
- Concentrated mesh
  - Several nodes share a router
  - Reduces the network diameter
- Fat tree / folded Clos
  - Several nodes share a router
  - Number of channels between levels is constant

Flattened Butterfly [Kim'07]
[Figure panels: 4-ary 2-flatfly, 2-ary 4-flatfly]

Flattened Butterfly Benefits
- High-radix routers reduce network diameter
  - Routers are fully connected within each dimension
  - Minimal paths require one inter-router hop per dimension
- Folding adds path diversity
  - Can now traverse dimensions in any order
- Highly scalable with network size

Topology Comparison (1) [N. Jiang]
- Evaluate implementation cost for various topologies
  - Focus on cost of routers
  - Area and power from P&R
- Unless indicated otherwise, use the parameters below
- Missing points for kn router because it failed to meet cycle time

  Parameter               Value
  Channel width           64 bits
  Packet injection rate   2%
  Request size            32 bits
  Reply size              128 bits
  Network size            64 nodes
  Operating frequency     200 MHz

Topology Comparison (2)
[Figure: Aggregate Router Area vs. Bisection Bandwidth; y-axis: Area (um^2), in millions; x-axis: Network Bisection Bandwidth (bits per cycle); series mesh, fbfly, cmesh, ftree, kn]

Topology Comparison (3)
[Figure: Aggregate Network Power vs. Bisection Bandwidth; y-axis: Power (W); x-axis: Network Bisection Bandwidth (bits per cycle); series mesh, fbfly, cmesh, ftree, kn]

Topology Study Conclusions
- Meshes are the most commonly used topology in NoCs, but they are not very efficient
  - High average hop count
- Flattened Butterfly provides more bandwidth per watt at the cost of increased complexity
  - Non-tiled layout, more complex routers
- Best choice for a particular application depends on design goals and parameters

Router RTL (1)
- Analytical models are becoming increasingly inaccurate
  - Do not properly model wire delay
    - Crucial in submicron processes!
  - Derived from idealized, full-custom circuit designs
    - But much of the timing-critical control logic is synthesized
- Useful for developing intuition & high-level models
- But detailed evaluations require a more accurate model
- Goal: Provide flexible template to generate customized RTL-level router implementation

Router RTL (2)
- Developed complete VC router implementation in RTL
- Highly parameterized
  - Topologies, VC & flit buffer configurations, allocators, routing logic, and various other implementation details
- Fully synthesizable Verilog-2001
- BSD license allows for virtually unrestricted use
- Live source code repository & bug tracker
  - http://nocs.stanford.edu/

Thank You! Questions?
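
Supplementary sketch for the topology slides: the hop-count remarks ("high average hop count" for meshes, "one inter-router hop per dimension" for the flattened butterfly) can be made concrete with a short Python calculation. The configurations used here (an 8x8 mesh with one node per router, and 4x4 router grids with four nodes per router for the concentrated mesh and the flattened butterfly) are assumptions chosen to give 64 nodes. The talk does not state which configurations its study used, and `avg_hops` is a hypothetical helper.

```python
from itertools import product


def avg_hops(k: int, conc: int, fully_connected: bool) -> float:
    """Average minimal inter-router hop count over all ordered pairs of
    distinct nodes, for a k x k grid of routers with `conc` nodes each.

    fully_connected=False: mesh links, hops = Manhattan distance.
    fully_connected=True:  routers fully connected within each dimension
                           (flattened-butterfly style), so each dimension
                           that differs costs exactly one hop."""
    routers = list(product(range(k), repeat=2))
    total_hops = 0
    total_pairs = 0
    for (x1, y1), (x2, y2) in product(routers, repeat=2):
        if fully_connected:
            hops = (x1 != x2) + (y1 != y2)
        else:
            hops = abs(x1 - x2) + abs(y1 - y2)
        # Node pairs behind this router pair; a node never sends to itself.
        if (x1, y1) == (x2, y2):
            pairs = conc * (conc - 1)
        else:
            pairs = conc * conc
        total_hops += hops * pairs
        total_pairs += pairs
    return total_hops / total_pairs


mesh_hops = avg_hops(8, 1, fully_connected=False)    # 8x8 mesh
cmesh_hops = avg_hops(4, 4, fully_connected=False)   # 4x4 concentrated mesh
fbfly_hops = avg_hops(4, 4, fully_connected=True)    # 4x4 flattened butterfly
```

Under these assumptions and uniform traffic, the averages come out to roughly 5.33 hops for the mesh, 2.54 for the concentrated mesh and 1.52 for the flattened butterfly. Hop count is only one input to the comparison; the area and power plots in the talk also reflect router radix and channel cost.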
