Networks on Chip: Router Microarchitecture & Network Topologies

Daniel U. Becker,
James Chen, Nan Jiang,
Prof. William J. Dally
Concurrent VLSI Architecture Group
Stanford University
Outline
 Introduction
 Router Microarchitecture
 Network Topologies
 Open Source Router RTL
 Q&A
Why Networks-on-Chip? (1)
 Clock frequency scaling and ILP exploitation have hit the power wall
 Single-threaded performance is leveling off
 Transistor budgets are still growing
 Further performance increases rely on exploiting parallelism
 Key problems:
 Scalability
 Energy efficiency
 Design complexity
(source: Wikipedia)
Why Networks-on-Chip? (2)
 Bus-based solutions don’t scale
 Contention, electrical characteristics, timing, …
 Really want point-to-point links
 Full connectivity is too expensive
 Area, power, delay, …
 Ad-hoc wiring is too expensive
 Design, verification, …
 Need efficient, scalable communication fabric on chip: Network!
 Building blocks (circuits, microarchitecture)
 Topologies
 Routing & flow control schemes
NoCs vs. Long-Haul Networks
 Networks are a mature field
 Extensive body of prior research
 Internet, wireless, interconnection networks, …
 NoCs have similar building blocks & design choices
 So why not just leverage those results?
 Well, we do, but…
 NoCs are subject to very different constraints
 These constraints lead to very different design tradeoffs
The On-Chip Environment (1)
 Wires are cheap
 Favors wider interfaces and more channels
 Buffers are expensive
 Provide “just enough” buffering, e.g. sized to the credit round trip (see the sketch below)
 Minimize occupancy & turnaround time
 Efficient flow control required
 Power budget is limiting factor for current chip designs
 Minimize power required for moving things around
 Maximize power available for doing actual work
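To make the “just enough” buffering point concrete, here is a minimal sketch of the standard credit-based sizing rule: to sustain full throughput, each VC needs at least as many flit buffers as the credit round trip, measured in cycles. The parameter names and the example numbers are illustrative assumptions, not values from the slides.

```python
def min_buffers_per_vc(channel_delay_cycles,
                       credit_delay_cycles,
                       router_pipeline_cycles,
                       credit_pipeline_cycles):
    """Minimum flit buffers per VC to avoid credit stalls at full rate.

    The credit round trip covers: the flit's channel traversal, the time until
    the downstream router frees the buffer and returns a credit, the credit's
    trip back, and the time to process the credit upstream. All parameter
    names are illustrative assumptions, not from the slides.
    """
    round_trip = (channel_delay_cycles        # flit: upstream -> downstream
                  + router_pipeline_cycles    # downstream: flit departs, credit issued
                  + credit_delay_cycles       # credit: downstream -> upstream
                  + credit_pipeline_cycles)   # upstream: credit processed, buffer reusable
    # One flit can leave per cycle at full rate, so one buffer slot is needed
    # per cycle of round-trip latency to keep the channel busy.
    return round_trip

# Example: 1-cycle channel each way, 3-cycle router pipeline, 1-cycle credit processing
print(min_buffers_per_vc(1, 1, 3, 1))  # -> 6 flit buffers per VC
```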
The On-Chip Environment (2)
 Semiconductor technology requires planar layout
 Favors regular, low-dimensional topologies
 Strict cycle time constraints
 Favors simple routing algorithms, flow control mechanisms
 No complex arithmetic or large lookup tables
 Minimize amount of state required for adaptive routing
 No need to support online reconfigurability
 Topology is essentially static
 Possibly some dynamic reconfiguration for power management
 But must be able to isolate faulty cores, e.g. at boot time
Traffic Characteristics (1)
 Message-oriented
 Few message types with short, uniform message sizes
 Highly latency-sensitive
 E.g. memory, cache coherence traffic in CMPs
 Requires shallow router pipelines, low-diameter topologies
 Bursty
 Can’t just optimize for average network load!
 Message loss usually not acceptable
 Message may contain the only instance of its payload data
 End-to-end reliability is expensive
Traffic Characteristics (2)
 Assumptions for this talk:
 Memory load/store traffic
 Split transaction protocol
 Single-packet messages
 Single-phit flits
Transaction Type   Message Type   Packet Size
Read               Request        1 flit
Read               Reply          1+n flits
Write              Request        1+n flits
Write              Reply          1 flit
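The table above can be restated in a few lines of code; the sketch below just encodes it, with n derived from an assumed payload size and flit width (both defaults are illustrative, not from the slides).

```python
import math

def packet_length_flits(transaction, message, payload_bits=128, flit_bits=64):
    """Packet length in flits for the traffic model assumed in this talk.

    Read requests and write replies carry no payload (1 header flit);
    read replies and write requests carry the payload (1 + n flits).
    payload_bits and flit_bits are illustrative defaults, not from the slides.
    """
    carries_payload = (transaction, message) in {("read", "reply"), ("write", "request")}
    if not carries_payload:
        return 1
    n = math.ceil(payload_bits / flit_bits)  # number of payload flits
    return 1 + n

print(packet_length_flits("read", "request"))   # -> 1
print(packet_length_flits("write", "request"))  # -> 3 (1 header + 2 payload flits)
```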
Router Microarchitecture
 Topology, routing and flow control set the high-level framework for network cost and performance
 Channel width, connectivity, path diversity, hop count, …
 Router microarchitecture determines the cost of each hop
 Directly impacts latency, throughput and power consumption
 Comprises variety of different aspects:
 Pipeline organization
 Implementation of routing logic, flow control & allocators
 Buffer management
 Power management, reliability & fault tolerance
Virtual Channels
 Share channel capacity between multiple data streams
 Interleave flits from different packets
 Provide dedicated buffer space for each virtual channel
 Decouple channels from buffers
 “The Swiss Army Knife for Interconnection Networks”
 Prevent deadlocks
 Reduce head-of-line blocking
 Also useful for providing QoS
[Dally’87]
Using VCs for Deadlock Prevention
 Protocol deadlock
 Circular dependencies between messages at network edge
 Solution:
 Partition range of VCs into different message classes
 Routing deadlock
 Circular dependencies between resources within network
 Solution:
 Partition range of VCs into different resource classes
 Restrict transitions between resource classes to impose a partial order on resource acquisition
 {packet classes} = {message classes} × {resource classes} (see the sketch below)
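As a minimal illustration of the last bullet, the sketch below maps a (message class, resource class) pair to a contiguous range of VC indices. The specific layout (resource class as the inner dimension, contiguous VC blocks) is an illustrative assumption, not mandated by the slides.

```python
def vc_range(message_class, resource_class,
             num_message_classes, num_resource_classes, vcs_per_packet_class):
    """VC indices assigned to one packet class.

    Packet classes enumerate {message classes} x {resource classes}; each
    packet class owns a dedicated, contiguous block of VCs. The layout
    (resource class as the inner dimension) is an illustrative assumption.
    """
    assert 0 <= message_class < num_message_classes
    assert 0 <= resource_class < num_resource_classes
    packet_class = message_class * num_resource_classes + resource_class
    base = packet_class * vcs_per_packet_class
    return range(base, base + vcs_per_packet_class)

# Example: 2 message classes (request/reply) x 2 resource classes, 2 VCs each -> 8 VCs total
print(list(vc_range(message_class=1, resource_class=0,
                    num_message_classes=2, num_resource_classes=2,
                    vcs_per_packet_class=2)))  # -> [4, 5]
```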
[Dally’90]
Using VCs for Flow Control
 Coupling between channels and buffers causes head-of-line blocking
 Adds false dependencies between packets
 Limits channel utilization
 Increases latency
 Even with VCs for deadlock prevention, this still applies to packets in the same class
 Solution:
 Assign multiple VCs to each packet class
VC Router Pipeline
 Route Computation (RC)
 Determine candidate output port(s) and VC(s)
 Can be precomputed at the upstream router (lookahead routing)
 Virtual Channel Allocation (VA)
 Assign available output VCs to waiting packets at input VCs
 Switch Allocation (SA)
 Assign switch time slots to buffered flits
 Switch Traversal (ST)
 Send flits through the crossbar switch to the appropriate output
 RC and VA are performed once per packet; SA and ST are performed for every flit
Allocator Comparison
 Allocators represent key pieces of router control logic
 Affect both cost/complexity and performance
 Evaluate & compare several representative VC and switch allocator implementations
 Investigate scaling behavior with router radix, number of VCs, and other key parameters
 To appear in SC‘09
Allocation Basics
 Arbitration:
 Multiple requestors
 Single resource
 Request + grant vectors (see the arbiter sketch below)
 Allocation:
 Multiple requestors
 Multiple equivalent resources
 Request + grant matrices
 Matching:
 Each grant must satisfy a request
 Each requester gets at most one grant
 Each resource is granted at most once
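A behavioral sketch of the arbitration primitive (here a round-robin arbiter, one common choice); the allocators discussed next are built out of arbiters like this one. Python is used only as illustrative pseudocode for the logic, not as RTL.

```python
class RoundRobinArbiter:
    """Grant one of N requests; the last winner gets lowest priority next time."""

    def __init__(self, num_requests):
        self.n = num_requests
        self.pointer = 0  # index with highest priority in the next round

    def arbitrate(self, requests):
        """Return the granted index, or None if there are no requests."""
        assert len(requests) == self.n
        for offset in range(self.n):
            idx = (self.pointer + offset) % self.n
            if requests[idx]:
                self.pointer = (idx + 1) % self.n  # winner becomes lowest priority
                return idx
        return None

arb = RoundRobinArbiter(4)
print(arb.arbitrate([False, True, True, False]))  # -> 1
print(arb.arbitrate([False, True, True, False]))  # -> 2 (round-robin rotation)
```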
Separable Allocators
 Matchings have at most one grant per row and per column
 Implemented via two phases of arbitration (input-first or output-first)
 Column-wise and row-wise, performed in either order
 Arbiters in each stage are fully independent
 Fast and cheap
 But bad choices in the first phase can prevent the second stage from generating a good matching! (see the sketch below)
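A behavioral sketch of a separable input-first allocator over a boolean request matrix: one round of arbitration per input (row), then one per output (column). Fixed-priority arbiters (lowest index wins) are used purely for brevity; a real design would use round-robin or matrix arbiters.

```python
def separable_input_first(requests):
    """Separable input-first allocation on an R x C boolean request matrix.

    Phase 1: each input (row) arbitrates among its requests and forwards
    at most one to the output stage.
    Phase 2: each output (column) arbitrates among the forwarded requests.
    Fixed-priority (lowest index wins) arbiters are used for brevity.
    Returns a list of (input, output) grant pairs.
    """
    num_inputs = len(requests)
    num_outputs = len(requests[0]) if requests else 0

    # Phase 1: input (row-wise) arbitration
    forwarded = [[False] * num_outputs for _ in range(num_inputs)]
    for i in range(num_inputs):
        for j in range(num_outputs):
            if requests[i][j]:
                forwarded[i][j] = True
                break  # at most one request forwarded per input

    # Phase 2: output (column-wise) arbitration
    grants = []
    for j in range(num_outputs):
        for i in range(num_inputs):
            if forwarded[i][j]:
                grants.append((i, j))
                break  # at most one grant per output
    return grants

# Example of the weakness noted above: both inputs forward output 0 in phase 1,
# so only one grant results even though a 2-grant matching exists.
print(separable_input_first([[True, True], [True, False]]))  # -> [(0, 0)]
```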
[Tamir’93]
Wavefront Allocators
 Avoid separate phases
 … and bad decisions in the first phase
 Generate better matchings
 But delay scales linearly
 Also difficult to pipeline
 Principle of operation (see the sketch below):
 Pick an initial priority diagonal
 Grant all requests on the diagonal
 They can never conflict!
 For each grant, delete requests in the same row and column
 Repeat for the next diagonal
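A behavioral sketch of the wavefront principle on a square N×N request matrix; the priority diagonal would rotate between invocations for fairness. This models the logic only, not the tiled circuit.

```python
def wavefront_allocate(requests, priority_diagonal=0):
    """Wavefront allocation on an N x N boolean request matrix.

    Starting from the priority diagonal, grant all requests on each diagonal
    (cells on one diagonal never share a row or column, so they cannot
    conflict), then remove the granted rows/columns before moving on.
    Returns a list of (row, column) grants.
    """
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for step in range(n):
        d = (priority_diagonal + step) % n
        # Cells (i, j) with (i + j) % n == d form one wrapped diagonal.
        for i in range(n):
            j = (d - i) % n
            if requests[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = False
                col_free[j] = False
    return grants

# Same request matrix as the separable example above: with this starting
# diagonal the full 2-grant matching [(0, 1), (1, 0)] is found.
print(wavefront_allocate([[True, True], [True, False]], priority_diagonal=1))
```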
Wavefront Allocator Timing
 Originally conceived as a full-custom design
 Tiled design
 True delay scales linearly
 Signal wraparound creates combinational loops
 Effectively broken at the priority diagonal
 But static timing analysis cannot infer that
 Synthesized designs must be modified to avoid loops!
Loop-Free Wavefront Allocators
[Figures: loop-free variants via I/O transformation and via replication]
[Hurt’99]
Diagonal Propagation Allocator
 Unrolled matrix avoids combinational loops
 Sliding priority window activates sub-matrix cells
 But static timing analysis again sees false paths!
 Actual delay is ~n
 Reported delay is ~(2n-1)
 Hurts synthesized designs
[J. Chen]
Domino Logic Wavefront Allocator
 Wavefront allocator tends to be on the critical path for high-radix routers
 Full-custom dual-rail domino implementation achieves a 2-3x reduction in latency over the synthesized implementation
 Exploits the one-hot nature of the priority signal
 Optimizes the token propagation path
 Single domino gate per cell on the critical path
 Maximum pull-down depth of two
VC Allocation
 Before a packet can proceed through the router, it needs to acquire ownership of a VC at the downstream router
 The VC allocator matches unassigned input VCs with output VCs that are not currently in use (see the sketch below)
 P×V requestors (input VCs), P×V resources (output VCs)
 A VC is acquired by the head flit and inherited by body & tail flits
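A minimal sketch of how the VC-allocation request matrix can be formed before running any of the allocators sketched above: only input VCs holding a head flit that has not yet acquired an output VC raise requests, and busy output VCs are masked. The data layout and flat P·V indexing are illustrative assumptions.

```python
def build_vc_requests(input_vcs, output_vc_busy, num_ports, num_vcs):
    """Build the (P*V) x (P*V) request matrix for VC allocation.

    input_vcs: one dict per input VC with keys
        'needs_vc'      -- head flit waiting, no output VC assigned yet
        'out_port'      -- routed output port
        'candidate_vcs' -- candidate VC indices at that output port
    output_vc_busy[p][v]: True if output VC v at port p is already owned.
    Illustrative data layout, not taken from the slides.
    """
    size = num_ports * num_vcs
    requests = [[False] * size for _ in range(size)]
    for ivc_idx, ivc in enumerate(input_vcs):
        if not ivc['needs_vc']:
            continue
        p = ivc['out_port']
        for v in ivc['candidate_vcs']:
            if not output_vc_busy[p][v]:                 # mask busy output VCs
                requests[ivc_idx][p * num_vcs + v] = True
    return requests

# 2 ports x 1 VC: input VC 0 wants VC 0 at output port 1, which is free.
reqs = build_vc_requests(
    [{'needs_vc': True, 'out_port': 1, 'candidate_vcs': [0]},
     {'needs_vc': False, 'out_port': 0, 'candidate_vcs': []}],
    output_vc_busy=[[False], [False]], num_ports=2, num_vcs=1)
print(reqs)  # -> [[False, True], [False, False]]
```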
VC Allocator Implementations
 Not shown: masking logic for busy VCs
Sparse VC Allocation
 Any-to-any flexibility in the VC allocator is unnecessary
 Different use cases for VCs restrict the possible transitions:
 Message class never changes
 Resource classes are traversed in order
 VCs within a packet class are functionally equivalent
 Requests apply to all VCs assigned to the target class
 Can take advantage of these properties to reduce VC allocator complexity!
Property   Max. Savings
Delay      41%
Area       90%
Power      83%
VC Allocator Performance
 Type of VC allocator has little impact on performance
 Each packet only generates a single request
 Each input VC only has a small set of possible destination VCs
[Plots: Mesh (4 VCs) and FBFly (8 VCs)]
VC Allocator Cost
[Charts: VC allocator delay, area and power for sep_if, sep_of and wf across mesh and FBFly configurations (2x1x1 through 2x2x4)]
 Wavefront cost scales badly
 Synthesis fails for larger FBFly configurations
 But cost is reasonable for small mesh configurations
Switch Allocation
 Traversal of the router requires access to the crossbar
 All VCs at a given input port share one crossbar input
 Switch allocator matches ready-to-go flits with crossbar time slots
 Allocation is performed on a cycle-by-cycle basis
 P×V requestors (input VCs), P resources (output ports)
 At most one flit at each input port can be granted (see the sketch below)
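A behavioral sketch of the structure described above: each input port first picks one of its candidate VCs, then the output ports arbitrate among the forwarded requests, so no input and no output is granted more than once per cycle. Fixed-priority selection stands in for the real VC and output arbiters.

```python
def switch_allocate(port_requests):
    """Input-first switch allocation.

    port_requests[p] is a list of (input_vc, output_port) candidates at input
    port p, one entry per VC with a flit ready to go. Each input port forwards
    at most one candidate (first in the list, standing in for a VC arbiter);
    each output port then grants at most one input (lowest port index, standing
    in for an output arbiter). Returns {input_port: (input_vc, output_port)}.
    """
    # Phase 1: per-input VC selection (all VCs at a port share one crossbar input)
    forwarded = {}
    for p, candidates in enumerate(port_requests):
        if candidates:
            forwarded[p] = candidates[0]

    # Phase 2: per-output arbitration (at most one grant per output port)
    output_taken = set()
    grants = {}
    for p in sorted(forwarded):
        vc, out = forwarded[p]
        if out not in output_taken:
            grants[p] = (vc, out)
            output_taken.add(out)
    return grants

# Port 0: VC 1 wants output 2; Port 1: VC 0 wants output 2 (conflict, port 0 wins)
print(switch_allocate([[(1, 2)], [(0, 2)], []]))  # -> {0: (1, 2)}
```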
Switch Allocator Implementations
 Not shown: masking logic for credit and VC state check
Switch Allocators – 1 VC / Class
[Plots: mesh and FBFly, normalized to cycle time]
Switch Allocators – 2 VCs / Class
[Plots: mesh and FBFly, normalized to cycle time]
Switch Allocators – 4 VCs / Class
[Plots: mesh and FBFly, normalized to cycle time]
Switch Allocator Cost
[Charts: switch allocator delay, area and power for sep_if/sep_of with matrix (m) and round-robin (rr) arbiters, and wf, across mesh and FBFly configurations]
 Separable input-first is fastest and has the least cost
 Separable output-first has higher delay & cost but similar matching performance
 Matrix arbiters cost much more than round-robin, but their delay advantage is small
 Wavefront provides the best matching, but at the cost of increased delay, area and power
[Peh’01]
Speculative Switch Allocation
 Perform switch allocation in parallel with VC allocation
 Speculate that the latter will be successful
 If so, saves a pipeline stage, otherwise try again
 Reduces zero-load latency, but adds complexity
 Prioritize non-speculative requests
 Avoid performance degradation due to misspeculation
 Usually implemented through secondary switch allocator
 Simpler/faster/cheaper than true multi-priority allocator
 But need to prioritize non-speculative grants
Speculative Grant Masking
 Conventional approach kills speculative grants upon conflict with non-speculative ones
 Speculation matters primarily at low load, so can pessimistically kill using non-speculative requests instead
 Sacrifices some speculation opportunities for lower delay
 But zero-load latency remains virtually unaffected! (see the sketch below)
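A behavioral sketch contrasting the two masking strategies: conventional masking kills a speculative grant only when a non-speculative grant actually conflicts with it, while the pessimistic variant kills it whenever a non-speculative request merely could conflict. Representing grants and requests as (input_port, output_port) pairs is an illustrative assumption.

```python
def mask_speculative(spec_grants, nonspec_grants, nonspec_requests, pessimistic):
    """Filter speculative switch grants against non-speculative activity.

    spec_grants, nonspec_grants: lists of (input_port, output_port) grants.
    nonspec_requests: list of (input_port, output_port) requests.
    Conventional masking (pessimistic=False) kills a speculative grant that
    shares an input or output with a non-speculative *grant*; the pessimistic
    variant (pessimistic=True) masks against non-speculative *requests*, which
    can be done earlier in the cycle but kills more speculation opportunities.
    Non-speculative grants always take priority in the merged result.
    """
    blockers = nonspec_requests if pessimistic else nonspec_grants
    blocked_inputs = {i for i, _ in blockers}
    blocked_outputs = {o for _, o in blockers}
    surviving = [(i, o) for i, o in spec_grants
                 if i not in blocked_inputs and o not in blocked_outputs]
    return nonspec_grants + surviving

spec = [(2, 1)]                     # speculative grant: input 2 -> output 1
nonspec_req = [(0, 1), (0, 3)]      # input 0 also requested output 1...
nonspec_gnt = [(0, 3)]              # ...but was granted output 3 instead
print(mask_speculative(spec, nonspec_gnt, nonspec_req, pessimistic=False))  # -> [(0, 3), (2, 1)]
print(mask_speculative(spec, nonspec_gnt, nonspec_req, pessimistic=True))   # -> [(0, 3)]
```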
Speculation – 1 VC / Class
[Plots: mesh and FBFly, normalized to cycle time]
Speculation – 2 VCs / Class
[Plots: mesh and FBFly, normalized to cycle time]
Speculation – 4 VCs / Class
[Plots: mesh and FBFly, normalized to cycle time]
Speculation Cost
[Charts: delay, area and power for nonspec, spec_gnt and spec_req across mesh and FBFly configurations]
 Pessimistic masking reduces the delay hit due to speculation
 Slight increase in area
 Slight power increase in small configurations only
Allocator Study Conclusions
 Network performance largely insensitive to VC allocator
 For reasonable packet sizes, VC allocator is lightly loaded
 With light load, all variants produce near-ideal matchings
 Favors use of simplest / fastest variant (sep/if)
 Sparse VC allocation can greatly reduce delay & cost
 For switch allocation, the wavefront allocator produces better matchings at the cost of slightly higher delay
 Best choice depends on application goals & constraints
 Pessimistic speculation narrows the delay gap to the non-speculative implementation, but can slightly increase area & power
Dual-Path Switch Allocation (1)
 Significant fraction of cycle is used for peripheral logic
 VC state & credit checking
 Input VC selection / combination of requests
 For buffered flits, much of this can be precomputed
 Leave most of cycle for actual allocation logic
 In some cases need to be pessimistic!
 For newly arriving flits, precomputation not possible
 But at most one such flit per input port per cycle!
 So can simplify allocation for these flits
Dual-Path Switch Allocation (2)
 Idea: Provide separate, optimized logic paths for newly arriving and buffered flits
 New flits always try to bypass the buffer (fast path)
 If no other VC at the input port is active, send requests to the fast-path output arbiter
 First part of the cycle is available for credit checking, etc.
 Also write to the buffer in case bypassing fails
 For buffered flits, precompute control signals (slow path)
 Preselect the next VC to go (round-robin between active VCs)
 Check for credit and eligibility one cycle ahead
 Almost the entire cycle is available for slow-path allocation
 Merge grants from both paths (see the sketch below)
 Prioritize slow-path grants to avoid starvation
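A behavioral sketch of the grant merge at the end of the dual-path scheme: slow-path (buffered-flit) grants win, and a fast-path (bypass) grant is only honored if it conflicts with no slow-path grant; a losing new flit was already written to the buffer and simply retries via the slow path. The data representation is an illustrative assumption.

```python
def merge_dual_path_grants(slow_grants, fast_grants):
    """Merge slow-path and fast-path switch grants.

    Both arguments map input_port -> output_port. Slow-path grants (buffered
    flits with precomputed control) take priority to avoid starving buffered
    traffic; a fast-path grant (newly arrived flit trying to bypass the buffer)
    survives only if neither its input nor its output is used by a slow-path
    grant. Losing fast-path flits were already written to the buffer, so they
    simply retry later via the slow path.
    """
    merged = dict(slow_grants)
    used_outputs = set(slow_grants.values())
    for in_port, out_port in fast_grants.items():
        if in_port not in merged and out_port not in used_outputs:
            merged[in_port] = out_port
    return merged

slow = {1: 3}            # buffered flit at input 1 granted output 3
fast = {0: 3, 2: 4}      # bypass attempts at inputs 0 and 2
print(merge_dual_path_grants(slow, fast))  # -> {1: 3, 2: 4}; input 0 falls back to the buffer
```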
Network Topologies (1)
 Physical organization of the network
 Number of nodes per router
 Connectivity between routers
 Directly affects all key network parameters
 Channel length, router complexity, latency, throughput, …
 These parameters translate into performance and power
Network Topologies (2)
 2D mesh
 One router per node
 Connected to 4 neighbors
 Concentrated mesh
 Several nodes share a router
 Reduces the network diameter
 Fat tree / folded Clos
 Several nodes share a router
 Number of channels between levels is constant
Source: N. Jiang
[Kim’07]
Flattened Butterfly
[Figures: 4-ary 2-flatfly and 2-ary 4-flatfly]
Flattened Butterfly Benefits
 High-radix routers reduce network diameter
 Routers are fully connected within each dimension
 Minimal paths require one inter-router hop per dimension (see the sketch below)
 Folding adds path diversity
 Can now traverse dimensions in any order
 Highly scalable with network size
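A minimal sketch of the connectivity described above: routers are addressed by coordinates with k positions per dimension and are fully connected within each dimension, so a minimal route needs one inter-router hop per dimension in which source and destination differ. The coordinate representation is an illustrative assumption.

```python
def flatfly_neighbors(router, k):
    """All routers directly connected to `router` in a flattened butterfly.

    Routers are addressed by coordinate tuples with k positions per dimension;
    full connectivity within each dimension means every router that differs
    in exactly one coordinate is a direct neighbor.
    """
    neighbors = []
    for dim in range(len(router)):
        for val in range(k):
            if val != router[dim]:
                nbr = list(router)
                nbr[dim] = val
                neighbors.append(tuple(nbr))
    return neighbors

def min_hops(src, dst):
    """Minimal inter-router hop count: one hop per differing dimension."""
    return sum(1 for a, b in zip(src, dst) if a != b)

# 4 routers per dimension, 2 dimensions
print(len(flatfly_neighbors((0, 0), k=4)))   # -> 6 inter-router links per router
print(min_hops((0, 0), (3, 2)))              # -> 2 hops
```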
[N. Jiang]
Topology Comparison (1)
 Evaluate implementation cost for various topologies
 Focus on the cost of the routers
 Area and power from P&R
 Unless indicated otherwise, use the parameters in the table below
Parameter              Value
Channel width          64 bits
Packet injection rate  2%
Request size           32 bits
Reply size             128 bits
Network size           64 nodes
Operating frequency    200 MHz
 Missing points for the kn router because it failed to meet the cycle time
Topology Comparison (2)
[Chart: aggregate router area (millions of um^2) vs. network bisection bandwidth (bits per cycle) for mesh, fbfly, cmesh, ftree and kn]
Topology Comparison (3)
[Chart: aggregate network power (W) vs. network bisection bandwidth (bits per cycle) for mesh, fbfly, cmesh, ftree and kn]
Topology Study Conclusions
 Meshes are the most commonly used topology in NoCs, but they are not very efficient
 High average hop count
 Flattened Butterfly provides more bandwidth per watt at the cost of increased complexity
 Non-tiled layout, more complex routers
 Best choice for a particular application depends on design goals and parameters
Router RTL (1)
 Analytical models are becoming increasingly inaccurate
 Do not properly model wire delay
 Crucial in submicron processes!
 Derived from idealized, full-custom circuit designs
 But much of the timing-critical control logic is synthesized
 Useful for developing intuition & high-level models
 But detailed evaluations require a more accurate model
 Goal: Provide a flexible template to generate customized RTL-level router implementations
Router RTL (2)
 Developed complete VC router implementation in RTL
 Highly parameterized
 Topologies, VC & flit buffer configurations, allocators, routing logic, and various other implementation details
 Fully synthesizable Verilog-2001
 BSD license allows for virtually unrestricted use
 Live source code repository & bug tracker
 http://nocs.stanford.edu/
Thank You!
Questions?