18-742 Parallel Computer Architecture Lecture 18: Interconnection Networks II

18-742
Parallel Computer Architecture
Lecture 18: Interconnection Networks II
Chris Fallin
Carnegie Mellon University
Material based on Michael Papamichael’s 18-742 lecture slides from Spring 2011,
in turn based on Onur Mutlu’s 18-742 lecture slides from Spring 2010.
Readings: Interconnection Networks

Required

Dally, "Virtual-Channel Flow Control," ISCA 1990.
Mullins et al., "Low-Latency Virtual-Channel Routers for On-Chip Networks," ISCA 2004.
Wentzlaff et al., "On-Chip Interconnection Architecture of the Tile Processor," IEEE Micro 2007.
Fallin et al., "CHIPPER: A Low-Complexity Bufferless Deflection Router," HPCA 2011.
Fallin et al., "MinBD: Minimally-Buffered Deflection Routing for On-Chip Interconnect," NOCS 2012.
Patel et al., "Processor-Memory Interconnections for Multiprocessors," ISCA 1979.

Recommended

Moscibroda and Mutlu, "A Case for Bufferless Routing in On-Chip Networks," ISCA 2009.
Bjerregaard and Mahadevan, "A Survey of Research and Practices of Network-on-Chip," ACM Computing Surveys (CSUR), 2006.
2
Last Lecture

Interconnection Networks
  Introduction & Terminology
  Topology
  Buffering and Flow control
3
Today


Review (Topology & Flow Control)
More on interconnection networks
  Routing
  Router design
  Network performance metrics
  On-chip vs. off-chip differences
Research on NoC Router Design
  BLESS: bufferless deflection routing
  CHIPPER: cheaper bufferless deflection routing
  MinBD: adding small buffers to recover some performance
Research on Congestion Control
  HAT: Heterogeneous Adaptive Throttling
4
Today


Review (Topology & Flow Control)
More on interconnection networks
  Routing
  Router design
  Network performance metrics
  On-chip vs. off-chip differences
Research on NoC Router Design
  BLESS: bufferless deflection routing
  CHIPPER: cheaper bufferless deflection routing
  MinBD: adding small buffers to recover some performance
Research on Congestion Control
  HAT: Heterogeneous Adaptive Throttling
5
Review: Topologies
[Figure: example topologies: a crossbar, a multistage logarithmic network, and a 2D mesh.]

                  Crossbar        Multistage Log.   Mesh
Direct/Indirect   Indirect        Indirect          Direct
Blocking?         Non-blocking    Blocking          Blocking
Cost              O(N^2)          O(N log N)        O(N)
Latency           O(1)            O(log N)          O(sqrt(N))
6
Review: Flow Control
[Figure: store-and-forward vs. cut-through/wormhole flow control from source S to destination D.
Cut-through/wormhole shrinks buffers and reduces latency, but raises another issue: head-of-line
blocking. The blue packet cannot proceed because a downstream buffer is full; the red packet,
queued behind blue, is blocked even though its own channel is idle, and red holding that channel
keeps it idle until red proceeds. Virtual channels address this.]
7
Review: Flow Control
[Figure: the same scenario with virtual channels: the red packet is no longer blocked behind the
blue packet that is stalled by a full downstream buffer.]
8
Today


Review (Topology & Flow Control)
More on interconnection networks
  Routing
  Router design
  Network performance metrics
  On-chip vs. off-chip differences
Research on NoC Router Design
  BLESS: bufferless deflection routing
  CHIPPER: cheaper bufferless deflection routing
  MinBD: adding small buffers to recover some performance
Research on Congestion Control
  HAT: Heterogeneous Adaptive Throttling
12
Routing Mechanism

Arithmetic
  Simple arithmetic to determine the route in regular topologies
  Dimension-order routing in meshes/tori
Source Based
  Source specifies the output port for each switch in the route
  + Simple switches: no control state; strip the output port off the header
  - Large header
Table Lookup Based
  Index into a table for the output port
  + Small header
  - More complex switches
13
Routing Algorithm

Types
  Deterministic: always choose the same path
  Oblivious: do not consider network state (e.g., random)
  Adaptive: adapt to the state of the network
How to adapt
  Local/global feedback
  Minimal or non-minimal paths
14
Deterministic Routing


All packets between the same (source, destination) pair take the same path
Dimension-order routing
  E.g., XY routing (used in the Cray T3D and many on-chip networks); see the sketch below
  First traverse dimension X, then traverse dimension Y
+ Simple
+ Deadlock freedom (no cycles in resource allocation)
- Can lead to high contention
- Does not exploit path diversity
15
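To make dimension-order routing concrete, here is a minimal Python sketch of per-hop XY route computation. The port names and the (x, y) coordinate convention are illustrative assumptions, not taken from the slides.

```python
# Hypothetical sketch of XY dimension-order routing in a 2D mesh.
def xy_route(cur, dest):
    """Return the output port for one hop of XY routing.

    cur, dest: (x, y) coordinates of the current router and the destination.
    Route fully in X first, then in Y; removing the other turns is what makes
    XY routing cycle-free and hence deadlock-free.
    """
    (cx, cy), (dx, dy) = cur, dest
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"   # convention: +y is North
    if dy < cy:
        return "SOUTH"
    return "LOCAL"       # arrived: eject to the local node

# Example: every packet from (0, 0) to (2, 3) takes EAST, EAST, then NORTH three times.
```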
Deadlock



No forward progress
Caused by circular dependencies on resources
Each packet waits for a buffer occupied by another packet downstream
16
Handling Deadlock

Avoid cycles in routing
  Dimension-order routing
  Restrict the "turns" each packet can take
Avoid deadlock by adding virtual channels
  Separate VC pool per distance → a circular dependency cannot be built
Detect and break deadlock
  Preemption of buffers
17
Turn Model to Avoid Deadlock

Idea
  Analyze the directions in which packets can turn in the network
  Determine the cycles that such turns can form
  Prohibit just enough turns to break the possible cycles
Glass and Ni, "The Turn Model for Adaptive Routing," ISCA 1992.
18
Valiant’s Algorithm



An example of an oblivious algorithm
Goal: balance network load
Idea: randomly choose an intermediate destination; route to it first, then route from there to the real destination
  Between source-intermediate and intermediate-destination, can use dimension-order routing
+ Randomizes/balances network load
- Non-minimal (packet latency can increase)
Optimizations (see the sketch below):
  Do this only under high load
  Restrict the intermediate node to be close (in the same quadrant)
19
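A minimal sketch of the two-phase idea, assuming a 2D mesh with integer node coordinates; the helper names and the quadrant-restricted intermediate choice (one of the optimizations above) are illustrative assumptions.

```python
# Hypothetical sketch of Valiant routing: route to a random intermediate node
# first, then to the real destination, using dimension-order (XY) hops.
import random

def xy_path(a, b):
    """Hop-by-hop XY path from node a to node b (helper assumed for this sketch)."""
    path, (x, y) = [], a
    while (x, y) != b:
        if x != b[0]:
            x += 1 if b[0] > x else -1
        else:
            y += 1 if b[1] > y else -1
        path.append((x, y))
    return path

def pick_intermediate(src, dest):
    """Random intermediate node, restricted to the bounding quadrant of (src, dest)."""
    (sx, sy), (dx, dy) = src, dest
    return (random.randint(min(sx, dx), max(sx, dx)),
            random.randint(min(sy, dy), max(sy, dy)))

def valiant_path(src, dest):
    """Phase 1: src -> intermediate; Phase 2: intermediate -> dest."""
    mid = pick_intermediate(src, dest)
    return xy_path(src, mid) + xy_path(mid, dest)
```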
Adaptive Routing

Minimal adaptive
  Router uses network state (e.g., downstream buffer occupancy) to pick which "productive" output port to send a packet to (see the sketch below)
  Productive output port: a port that gets the packet closer to its destination
  + Aware of local congestion
  - Minimality restricts achievable link utilization (load balance)
Non-minimal (fully) adaptive
  "Misroute" packets to non-productive output ports based on network state
  + Can achieve better network utilization and load balance
  - Need to guarantee livelock freedom

20
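Below is a minimal sketch of minimal adaptive port selection, assuming the router tracks downstream buffer occupancy per output port (e.g., from credits); all names and data structures are illustrative assumptions.

```python
# Hypothetical sketch of minimal adaptive routing: among the productive ports,
# pick the one whose downstream buffer is least occupied.
def productive_ports(cur, dest):
    """Output ports that move the packet closer to its destination (XY mesh)."""
    (cx, cy), (dx, dy) = cur, dest
    ports = []
    if dx > cx: ports.append("EAST")
    if dx < cx: ports.append("WEST")
    if dy > cy: ports.append("NORTH")
    if dy < cy: ports.append("SOUTH")
    return ports or ["LOCAL"]

def minimal_adaptive_select(cur, dest, downstream_occupancy):
    """downstream_occupancy: dict port -> number of buffered flits at the next router."""
    candidates = productive_ports(cur, dest)
    return min(candidates, key=lambda p: downstream_occupancy[p])

# Example: from (1, 1) to (3, 3), EAST and NORTH are both productive; the port
# with the emptier downstream buffer is chosen.
print(minimal_adaptive_select((1, 1), (3, 3),
                              {"EAST": 3, "WEST": 0, "NORTH": 1, "SOUTH": 2}))
```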
More on Adaptive Routing

Can avoid faulty links/routers
  Idea: route around faults
  + Deterministic routing cannot handle faulty components
  - Need to change the routing table to disable faulty routes
  - Assumes the faulty link/router can be detected
21
Today


Review (Topology & Flow Control)
More on interconnection networks
  Routing
  Router design
  Network performance metrics
  On-chip vs. off-chip differences
Research on NoC Router Design
  BLESS: bufferless deflection routing
  CHIPPER: cheaper bufferless deflection routing
  MinBD: adding small buffers to recover some performance
Research on Congestion Control
  HAT: Heterogeneous Adaptive Throttling
22
On-chip Networks
[Figure: a mesh of processing elements (PEs: cores, L2 banks, memory controllers, etc.), each
attached to a router (R). Router detail: input ports with per-VC buffers (VC 0/1/2) from East,
West, North, South, and the local PE; control logic with a routing unit (RC), VC allocator (VA),
and switch allocator (SA); and a 5x5 crossbar driving the To East / To West / To North / To South /
To PE outputs.]
23
Router Design: Functions of a Router

Buffering (of flits)
Route computation
Arbitration of flits (i.e., prioritization) when there is contention
  Also called packet scheduling
Switching
  From input port to output port
Power management
  Scale link/router frequency
24
Router Pipeline
BW | RC | VA | SA | ST | LT

Five logical router stages, followed by link traversal:
  BW: Buffer Write
  RC: Route Computation
  VA: Virtual Channel Allocation
  SA: Switch Allocation
  ST: Switch Traversal
  LT: Link Traversal
25
Wormhole Router Timeline
[Timeline: the head flit passes through BW, RC, VA, SA, ST, LT; each body flit and the tail flit
pass through BW, SA, ST, LT in successive cycles behind it.]

Route computation is performed once per packet
A virtual channel is allocated once per packet
Body and tail flits inherit this information from the head flit
26
Dependencies in a Router
[Figure: dependence chains in three router organizations.
  Wormhole router: Decode + Routing → Switch Arbitration → Crossbar Traversal.
  Virtual channel router: Decode + Routing → VC Allocation → Switch Arbitration → Crossbar Traversal.
  Speculative virtual channel router: Decode + Routing → (VC Allocation in parallel with speculative
  switch arbitration) → Crossbar Traversal.]

Dependences between the output of one module and the input of another
  Determine the critical path through the router
  Cannot bid for a switch port until routing is performed
27
Pipeline Optimizations: Lookahead Routing

At the current router, perform the routing computation for the next router
  Overlap it with BW

Pipeline: BW+RC | VA | SA | ST | LT

Precomputing the route allows flits to compete for VCs immediately after BW
RC in this pipeline simply decodes the route header; the routing computation needed at the next hop can be performed in parallel with VA
Galles, "Spider: A High-Speed Network Interconnect," IEEE Micro 1997.
Pipeline Optimizations: Speculation

Assume that the Virtual Channel Allocation stage will be successful
  Valid under low to moderate loads
  Perform the entire VA and SA in parallel

Pipeline: BW+RC | VA+SA | ST | LT

If VA is unsuccessful (no virtual channel returned)
  Must repeat VA/SA in the next cycle
  Prioritize non-speculative requests
Pipeline Optimizations: Bypassing

When there are no flits in the input buffer
  Speculatively enter ST
  On a port conflict, the speculation is aborted

Pipeline: Setup (VA + RC + crossbar setup) | ST | LT

In the first stage, a free VC is allocated, routing is performed, and the crossbar is set up
Today


Review (Topology & Flow Control)
More on interconnection networks
  Routing
  Router design
  Network performance metrics
  On-chip vs. off-chip differences
Research on NoC Router Design
  BLESS: bufferless deflection routing
  CHIPPER: cheaper bufferless deflection routing
  MinBD: adding small buffers to recover some performance
Research on Congestion Control
  HAT: Heterogeneous Adaptive Throttling
39
Interconnection Network Performance
[Figure: latency vs. offered traffic (bits/sec). Zero-load latency is determined by topology,
routing, and flow control; the minimum latency is bounded first by the topology, then by the
routing algorithm. Saturation throughput is bounded by the topology, then by routing, then by
flow control.]
40
Ideal Latency

Ideal latency
  Solely due to wire delay between source and destination

  T_ideal = D/v + L/b

  D = Manhattan distance
  v = propagation velocity
  L = packet size
  b = channel bandwidth
41
Actual Latency

Dedicated wiring is impractical
  Long wires are segmented with the insertion of routers

  T_actual = D/v + L/b + H * T_router + T_c

  D = Manhattan distance
  v = propagation velocity
  L = packet size
  b = channel bandwidth
  H = number of hops
  T_router = router latency
  T_c = latency due to contention

(A small numeric example follows below.)
42
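As a quick sanity check of the two formulas, a small sketch with assumed numbers (all parameter values are illustrative, not from the slides):

```python
# Illustrative evaluation of the latency formulas above; every number is an assumption.
def ideal_latency(D, v, L, b):
    """T_ideal = D/v + L/b: wire propagation plus serialization."""
    return D / v + L / b

def actual_latency(D, v, L, b, H, T_router, T_c):
    """T_actual adds per-hop router latency and contention to the ideal latency."""
    return ideal_latency(D, v, L, b) + H * T_router + T_c

# Example: D = 8 mm, v = 0.5 mm/cycle, a 512-bit packet over 128-bit links,
# 5 hops, 2-cycle routers, no contention.
print(ideal_latency(8, 0.5, 512, 128))             # 16 + 4 = 20 cycles
print(actual_latency(8, 0.5, 512, 128, 5, 2, 0))   # 20 + 10 = 30 cycles
```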
Network Performance Metrics

Packet latency
Round-trip latency
Saturation throughput
Application-level performance: system performance
  Affected by interference among threads/applications
44
Today


Review (Topology & Flow Control)
More on interconnection networks
  Routing
  Router design
  Network performance metrics
  On-chip vs. off-chip differences
Research on NoC Router Design
  BLESS: bufferless deflection routing
  CHIPPER: cheaper bufferless deflection routing
  MinBD: adding small buffers to recover some performance
Research on Congestion Control
  HAT: Heterogeneous Adaptive Throttling
45
On-Chip vs. Off-Chip Differences
Advantages of on-chip
  Wires are "free"
   → can build highly connected networks with wide buses
  Low latency
   → can cross the entire network in a few clock cycles
  High reliability
   → packets are not dropped and links rarely fail

Disadvantages of on-chip
  Shares resources (area, power) with the rest of the components on the chip
  Limited buffering available
  Not all topologies map well to a 2D plane
46
Today


Review (Topology & Flow Control)
More on interconnection networks
  Routing
  Router design
  Network performance metrics
  On-chip vs. off-chip differences
Research on NoC Router Design
  BLESS: bufferless deflection routing
  CHIPPER: cheaper bufferless deflection routing
  MinBD: adding small buffers to recover some performance
Research on Congestion Control
  HAT: Heterogeneous Adaptive Throttling
47
A Case for Bufferless Routing in On-Chip Networks
Thomas Moscibroda, Microsoft Research
Onur Mutlu, CMU
On-Chip Networks (NoC)
[Figure: a multi-core chip; each tile is a CPU+L1 or a cache bank, connected by an on-chip network
that also reaches the memory controller, accelerators, etc.]
On-Chip Networks (NoC)
• Connect cores, caches, memory controllers, etc.
• Examples: Intel 80-core Terascale chip, MIT RAW chip
• Design goals in NoC design:
  • High throughput, low latency
  • Fairness between cores, QoS, …
  • Low complexity, low cost
  • Low power, low energy consumption
Energy/Power in On-Chip Networks
• Power is a key constraint in the design of high-performance processors
• NoCs consume a substantial portion of system power
  • ~30% in the Intel 80-core Terascale chip [IEEE Micro'07]
  • ~40% in the MIT RAW chip [ISCA'04]
• NoCs estimated to consume 100s of Watts [Borkar, DAC'07]
Current NoC Approaches
• Existing approaches differ in numerous ways:
  • Network topology [Kim et al., ISCA'07; Kim et al., ISCA'08, etc.]
  • Flow control [Michelogiannakis et al., HPCA'09; Kumar et al., MICRO'08, etc.]
  • Virtual channels [Nicopoulos et al., MICRO'06, etc.]
  • QoS & fairness mechanisms [Lee et al., ISCA'08, etc.]
  • Routing algorithms [Singh et al., CAL'04]
  • Router architecture [Park et al., ISCA'08]
  • Broadcast, multicast [Jerger et al., ISCA'08; Rodrigo et al., MICRO'08]

Existing work assumes the existence of buffers in routers!
A Typical Router
[Figure: a typical input-buffered virtual-channel router. Each of the N input ports holds per-VC
buffers (VC1..VCv) with credit flow back to the upstream router; routing computation, a VC arbiter,
and a switch arbiter feed an N x N crossbar driving the output channels.]

Buffers are an integral part of existing NoC routers.
Buffers in NoC Routers
• Buffers are necessary for high network throughput
   → buffers increase the total available bandwidth in the network

[Figure: average packet latency vs. injection rate; larger buffers (small → medium → large)
push the saturation point to higher injection rates.]
Buffers in NoC Routers
• Buffers are necessary for high network throughput
   → buffers increase the total available bandwidth in the network
• Buffers consume significant energy/power
  • Dynamic energy when read/written
  • Static energy even when not occupied
• Buffers add complexity and latency
  • Logic for buffer management
  • Virtual channel allocation
  • Credit-based flow control
• Buffers require significant chip area
  • E.g., in the TRIPS prototype chip, input buffers occupy 75% of the total on-chip network area [Gratz et al., ICCD'06]
Going Bufferless…?
• How much throughput do we lose? How is latency affected?

[Figure: latency vs. injection rate, with buffers vs. no buffers; the bufferless curve saturates
at a lower injection rate.]

• Up to what injection rates can we use bufferless routing?
   → Are there realistic scenarios in which the NoC is operated at injection rates below that
  threshold? If so, how much?
• Can we achieve energy reductions?
• Can we reduce area, complexity, etc.?

Answers in our paper!
Overview
• Introduction and Background
• Bufferless Routing (BLESS)
  • FLIT-BLESS
  • WORM-BLESS
  • BLESS with buffers
• Advantages and Disadvantages
• Evaluations
• Conclusions
BLESS: Bufferless Routing
• Always forward all incoming flits to some output port
• If no productive direction is available, send the flit in another direction
   → the packet is deflected ("hot-potato" routing [Baran'64, etc.])

[Figure: a contended flit that a buffered router would hold is instead deflected to a free output.]
BLESS: Bufferless Routing
• The VC arbiter and switch arbiter of a conventional router are replaced by a two-part
  arbitration policy:
  1. Flit-Ranking: create a ranking over all incoming flits
  2. Port-Prioritization: for a given flit in this ranking, find the best free output port
• Apply to each flit in order of ranking
FLIT-BLESS: Flit-Level Routing
• Each flit is routed independently
• Oldest-first arbitration (other policies evaluated in the paper)
  Flit-Ranking: 1. oldest-first ranking
  Port-Prioritization: 2. assign the flit to a productive port if possible; otherwise, assign it to a non-productive port
  (a code sketch of this arbitration follows below)
• Network topology:
   → can be applied to most topologies (mesh, torus, hypercube, trees, …)
  1) #output ports ≥ #input ports at every router
  2) every router is reachable from every other router
• Flow control & injection policy:
   → completely local; inject whenever an input port is free
• Absence of deadlocks: every flit is always moving
• Absence of livelocks: with oldest-first ranking
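To make the Flit-Ranking / Port-Prioritization pair concrete, here is a minimal Python sketch of one cycle of FLIT-BLESS arbitration. The flit fields ('age' in cycles since injection, 'productive' port set) and the port names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of one cycle of FLIT-BLESS arbitration in one router.
def flit_bless_arbitrate(incoming_flits, output_ports):
    """Assign every incoming flit to some free output port (never buffer or drop).

    Assumes len(incoming_flits) <= len(output_ports), which FLIT-BLESS guarantees
    by requiring #output ports >= #input ports at every router.
    """
    free = set(output_ports)
    assignment = {}
    # Flit-Ranking: oldest first (larger 'age' = older); this provides livelock freedom.
    ranked = sorted(range(len(incoming_flits)),
                    key=lambda i: incoming_flits[i]["age"], reverse=True)
    for i in ranked:
        flit = incoming_flits[i]
        # Port-Prioritization: a free productive port if one exists, else deflect.
        productive_free = [p for p in flit["productive"] if p in free]
        port = productive_free[0] if productive_free else next(iter(free))
        assignment[i] = port
        free.remove(port)
    return assignment

# Example: two flits contend for EAST; the older one gets it, the other is deflected.
flits = [{"age": 10, "productive": {"EAST"}}, {"age": 7, "productive": {"EAST"}}]
print(flit_bless_arbitrate(flits, ["NORTH", "EAST", "SOUTH", "WEST"]))
```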
WORM-BLESS: Wormhole Routing
• Potential downsides of FLIT-BLESS
  • Not energy-optimal (each flit needs header information)
  • Increase in latency (different flits may take different paths)
  • Increase in receive buffer size
• BLESS with wormhole routing…? [Dally, Seitz'86]
• Problems:
  • Injection problem (not known when it is safe to inject a new worm)
  • Livelock problem (packets can be deflected forever)
WORM-BLESS: Wormhole Routing
Flit-Ranking: 1. oldest-first ranking
Port-Prioritization: 2. if the flit is a head flit:
    a) assign the flit to an unallocated, productive port
    b) assign the flit to an allocated, productive port
    c) assign the flit to an unallocated, non-productive port
    d) assign the flit to an allocated, non-productive port
  else (body/tail flit):
    a) assign the flit to the port that is allocated to its worm
• At low congestion, packets travel routed as worms
• Worms are deflected or truncated if necessary; when a worm allocated to one port is cut by
  another worm, the first body flit of the cut-off part turns into a head flit (see the sketch below)
• See paper for details…
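A minimal sketch of the head-flit priority order a)-d) above; the per-port state ('free' this cycle, 'allocated_worm' held across cycles) and all names are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of WORM-BLESS port-prioritization for one head flit.
def worm_bless_head_port(productive, ports):
    """Return (port, truncates) following the a)-d) order for a head flit.

    productive: set of ports that move this flit closer to its destination.
    ports: dict port -> {"free": bool, "allocated_worm": worm id or None}.
    truncates is True when the chosen port is already allocated to another worm;
    that worm is cut, and its next body flit is promoted to a head flit.
    """
    def pick(want_productive, want_unallocated):
        for name, state in ports.items():
            if not state["free"]:
                continue
            if (name in productive) == want_productive and \
               (state["allocated_worm"] is None) == want_unallocated:
                return name
        return None

    # a) unallocated productive, b) allocated productive,
    # c) unallocated non-productive, d) allocated non-productive
    for want_prod, want_unalloc in [(True, True), (True, False),
                                    (False, True), (False, False)]:
        port = pick(want_prod, want_unalloc)
        if port is not None:
            return port, ports[port]["allocated_worm"] is not None
    raise RuntimeError("no free output port; cannot happen when #outputs >= #inputs")
```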
BLESS with Buffers
• BLESS without buffers is the extreme end of a continuum
• BLESS can be integrated with buffers
  • FLIT-BLESS with buffers
  • WORM-BLESS with buffers
• Whenever a buffer is full, its first flit becomes must-schedule
  • must-schedule flits must be routed out, deflected if necessary
• See paper for details…
Overview
• Introduction and Background
• Bufferless Routing (BLESS)
  • FLIT-BLESS
  • WORM-BLESS
  • BLESS with buffers
• Advantages and Disadvantages
• Evaluations
• Conclusions
BLESS: Advantages & Disadvantages
Advantages
• No buffers
• Purely local flow control
• Simplicity
  - no credit flows
  - no virtual channels
  - simplified router design
• No deadlocks, no livelocks
• Adaptivity
  - packets are deflected around congested areas!
• Router latency reduction
• Area savings

Disadvantages
• Increased latency
• Reduced bandwidth
• Increased buffering at the receiver
• Header information in each flit
• Impact on energy…?
Reduction of Router Latency
BW: Buffer Write, RC: Route Computation, VA: Virtual Channel Allocation,
SA: Switch Allocation, ST: Switch Traversal, LT: Link Traversal,
LA LT: Link Traversal of Lookahead

• Baseline router (speculative) [Dally, Towles'04]: head flit goes through BW, RC, VA/SA, ST, LT;
  body flits go through BW, SA, ST, LT → router latency = 3 (can be improved to 2)
• BLESS router (standard): RC, ST, LT per flit → router latency = 2
  (BLESS gets rid of input buffers and virtual channels)
• BLESS router (optimized): RC is overlapped with the lookahead link traversal (LA LT)
   → router latency = 1
BLESS: Advantages & Disadvantages
(Same advantages and disadvantages as above.)
Extensive evaluations in the paper!
Evaluation Methodology
• 2D mesh network, router latency is 2 cycles
  o 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
  o 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
  o 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
  o 128-bit wide links, 4-flit data packets, 1-flit address packets
  o For the baseline configuration: 4 VCs per physical input port, 1 packet deep
• Simulation is cycle-accurate; models stalls in the network and processors, and the
  self-throttling behavior of an aggressive processor model
• Benchmarks
  o Multiprogrammed SPEC CPU2006 and Windows Desktop applications
  o Heterogeneous and homogeneous application mixes
  o Synthetic traffic patterns: UR, Transpose, Tornado, Bit Complement
• Most of our evaluations use perfect L2 caches → puts maximal stress on the NoC
• x86 processor model based on Intel Pentium M
  o 2 GHz processor, 128-entry instruction window
  o 64 KB private L1 caches
  o 16 MB total shared L2 cache; 16 MSHRs per bank
  o DRAM model based on Micron DDR2-800
Evaluation Methodology
• Energy model provided by the Orion simulator [MICRO'02]
  o 70nm technology, 2 GHz routers at 1.0 Vdd
• For BLESS, we model
  o Additional energy to transmit header information
  o Additional buffers needed on the receiver side
  o Additional logic to reorder flits of individual packets at the receiver
• We partition network energy into buffer energy, router energy, and link energy,
  each having static and dynamic components
• Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms
  (DO, MIN-AD, ROMM)
Evaluation – Synthetic Traces
• First, the bad news
• Uniform random injection
• BLESS has significantly lower saturation throughput compared to the buffered baseline

[Figure: average latency vs. injection rate (flits per cycle per node) for FLIT-1, FLIT-2, WORM-1,
WORM-2, and the MIN-AD buffered baseline; the BLESS variants saturate earlier than the best
baseline.]

Evaluation – Homogeneous Case Study
• milc benchmarks (moderately intensive), perfect caches
• Very little performance degradation with BLESS (less than 4% in the dense network)
• With router latency 1, BLESS can even outperform the baseline (by ~10%)
• Significant energy improvements (almost 40%)

[Figure: weighted speedup and normalized network energy (buffer / router / link) for Baseline,
BLESS, and RL=1 on the 4x4 8x milc, 4x4 16x milc, and 8x8 16x milc configurations.]
Evaluation – Homogeneous Case Study (continued)
• milc benchmarks (moderately intensive), perfect caches
• Observations:
  1) Injection rates are not extremely high → self-throttling!
  2) For bursts and temporary hotspots, the network links are used as buffers!

[Figure: the same weighted-speedup and energy-breakdown charts as on the previous slide.]
Evaluation – Further Results
• BLESS increases the buffer requirement at the receiver by at most 2x
   → overall, energy is still reduced (see paper for details)
• Impact of memory latency
   → with real caches, very little slowdown! (at most 1.5%)

[Figure: weighted speedup of DO, MIN-AD, ROMM, FLIT-2, WORM-2, FLIT-1, and WORM-1 on the
4x4 8x matlab, 4x4 16x matlab, and 8x8 16x matlab configurations.]
Evaluation – Further Results
• BLESS increases the buffer requirement at the receiver by at most 2x
   → overall, energy is still reduced (see paper for details)
• Impact of memory latency
   → with real caches, very little slowdown! (at most 1.5%)
• Heterogeneous application mixes (we evaluate several mixes of intensive and non-intensive applications)
   → little performance degradation
   → significant energy savings in all cases
   → no significant increase in unfairness across different applications
• Area savings: ~60% of network area can be saved!
Evaluation – Aggregate Results
• Aggregate results over all 29 applications

Sparse network:
                         Perfect L2              Realistic L2
                         Average    Worst-Case   Average    Worst-Case
  ∆ Network Energy       -39.4%     -28.1%       -46.4%     -41.0%
  ∆ System Performance   -0.5%      -3.2%        -0.15%     -0.55%

[Figure: normalized network energy (buffer / router / link) and weighted speedup for BASE, FLIT,
and WORM, mean and worst case.]
Evaluation – Aggregate Results
• Aggregate results over all 29 applications

Sparse network:
                         Perfect L2              Realistic L2
                         Average    Worst-Case   Average    Worst-Case
  ∆ Network Energy       -39.4%     -28.1%       -46.4%     -41.0%
  ∆ System Performance   -0.5%      -3.2%        -0.15%     -0.55%

Dense network:
                         Perfect L2              Realistic L2
                         Average    Worst-Case   Average    Worst-Case
  ∆ Network Energy       -32.8%     -14.0%       -42.5%     -33.7%
  ∆ System Performance   -3.6%      -17.1%       -0.7%      -1.5%
Conclusion
• For a very wide range of applications and network settings, buffers are not needed in the NoC
  • Significant energy savings (32% even in dense networks with perfect caches)
  • Area savings of 60%
  • Simplified router and network design (flow control, etc.)
  • Performance slowdown is minimal (performance can even increase!)
 → A strong case for rethinking NoC design!
• We are currently working on future research
  • Support for quality of service, different traffic classes, energy management, etc.
CHIPPER: A Low-complexity
Bufferless Deflection Router
Chris Fallin
Chris Craik
Onur Mutlu
Motivation

Recent work has proposed bufferless deflection routing (BLESS [Moscibroda+, ISCA 2009])
  Energy savings: ~40% in total NoC energy
  Area reduction: ~40% in total NoC area
  Minimal performance loss: ~4% on average
  Unfortunately: unaddressed complexities in the router
   → long critical path, large reassembly buffers
Goal: obtain these benefits while simplifying the router in order to make bufferless NoCs practical.
79
Problems that Bufferless Routers Must Solve
1. Must provide livelock freedom
   → A packet should not be deflected forever
2. Must reassemble packets upon arrival
   Flit: atomic routing unit; Packet: one or multiple flits
80
A Bufferless Router: A High-Level View
[Figure: a bufferless router at a high level: deflection routing logic feeding a crossbar, with
inject and eject paths to the local node and reassembly buffers on the eject path.
Problem 1: livelock freedom. Problem 2: packet reassembly.]
81
Complexity in Bufferless Deflection Routers
1. Must provide livelock freedom
   Flits are sorted by age, then assigned in age order to output ports
    → 43% longer critical path than a buffered router
2. Must reassemble packets upon arrival
   Reassembly buffers must be sized for the worst case
    → 4 KB per node (8x8 network, 64-byte cache block)
82
Problem 1: Livelock Freedom
[Figure: the same high-level router, highlighting the deflection routing logic
(Problem 1: livelock freedom).]
83
Livelock Freedom in Previous Work




What stops a flit from deflecting forever?
  All flits are timestamped
  Oldest flits are assigned their desired ports
  Total order among flits → guaranteed progress!
  New traffic is lowest priority
[Figure: flit ages form a total order (each flit strictly older than the next).]
But what is the cost of this?
84
Age-Based Priorities are Expensive: Sorting

Router must sort flits by age: a long-latency sort network
  Three comparator stages for 4 flits (as sketched below)
85
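The three comparator stages correspond to the classic 4-input sorting network; a minimal sketch follows (the code itself is an assumption for illustration, only the stage structure follows the slide).

```python
# Illustrative 3-stage sorting network that orders 4 flits oldest-first.
def compare_swap(flits, i, j):
    """One comparator: put the older flit (larger age) in position i."""
    if flits[j]["age"] > flits[i]["age"]:
        flits[i], flits[j] = flits[j], flits[i]

def sort_by_age(flits):
    """Sort exactly 4 flits with 5 comparators arranged in 3 stages."""
    assert len(flits) == 4
    compare_swap(flits, 0, 1); compare_swap(flits, 2, 3)   # stage 1 (parallel)
    compare_swap(flits, 0, 2); compare_swap(flits, 1, 3)   # stage 2 (parallel)
    compare_swap(flits, 1, 2)                              # stage 3
    return flits

print(sort_by_age([{"age": a} for a in (4, 1, 2, 3)]))     # oldest (age 4) first
```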
Age-Based Priorities Are Expensive: Allocation


After sorting, flits are assigned to output ports in priority order
Port assignment of younger flits depends on that of older flits
   → sequential dependence in the port allocator

[Example, flits in age order: Flit 1 wants East → GRANT East (remaining {N,S,W});
 Flit 2 wants East → DEFLECT to North (remaining {S,W}); Flit 3 wants South → GRANT South
 (remaining {W}); Flit 4 wants South → DEFLECT to West.]
86
Age-Based Priorities Are Expensive

Overall, deflection routing logic based on Oldest-First (priority sort + port allocator)
has a 43% longer critical path than a buffered router
Question: is there a cheaper way to route while guaranteeing livelock freedom?
87
Solution: Golden Packet for Livelock Freedom

What is really necessary for livelock freedom?
Key insight: no total order is needed; it is enough to:
  1. Pick one flit (the "Golden Flit") and prioritize it until it arrives
  2. Ensure any flit is eventually picked
 → Guaranteed progress: a partial ordering is sufficient; new traffic is lowest priority
88
What Does Golden Flit Routing Require?



Only need to properly route the Golden Flit
  First insight: no need for a full sort
  Second insight: no need for sequential allocation
 → Both the priority-sort and port-allocator stages can be simplified
89
Golden Flit Routing With Two Inputs



Let's route the Golden Flit in a two-input router first
  Step 1: pick a "winning" flit: the Golden Flit if present, else a random one
  Step 2: steer the winning flit to its desired output and deflect the other flit
 → The Golden Flit always routes toward its destination (see the sketch below)
90
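A minimal sketch of one such two-input arbiter block, assuming a flit is a dict with 'golden' and 'desired_output' (0 or 1) fields and that both inputs hold a flit; all names are illustrative assumptions.

```python
# Hypothetical sketch of the 2x2 block: the golden flit wins, else a random winner;
# the winner is steered to its desired output and the loser takes the other one.
import random

def route_two_input_block(flit_a, flit_b):
    """Return (flit_on_output0, flit_on_output1) for one 2x2 block."""
    if flit_a.get("golden"):
        winner, loser = flit_a, flit_b
    elif flit_b.get("golden"):
        winner, loser = flit_b, flit_a
    else:
        winner, loser = random.sample([flit_a, flit_b], 2)

    outputs = [None, None]
    outputs[winner["desired_output"]] = winner
    outputs[1 - winner["desired_output"]] = loser   # deflected if it wanted the same output
    return tuple(outputs)

# Example: a golden flit and an ordinary flit both want output 0; the golden flit
# gets it, the other flit is deflected to output 1.
print(route_two_input_block({"golden": True, "desired_output": 0},
                            {"golden": False, "desired_output": 0}))
```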
Golden Flit Routing with Four Inputs

Each block makes decisions independently!
 → deflection is a distributed decision
[Figure: two stages of 2x2 arbiter blocks connect inputs N, E, S, W to outputs N, E, S, W
(a partial permutation network).]
91
Permutation Network Operation
wins  swap!
Golden:
N
E
N
wins  swap!
N
E
Priority
Sort
Port Allocator
S
x
S
S
E
W
W
W
wins  no swap!
deflected
wins  no swap!
92
CHIPPER: Cheap Interconnect Partially-Permuting Router
[Figure: the CHIPPER router datapath: an inject/eject stage followed by the two-stage permutation
network; cache miss buffers (MSHRs) are reused as reassembly buffers.]
93
EVALUATION
94
Methodology

Multiprogrammed workloads: CPU2006, server, desktop
  8x8 (64 cores), 39 homogeneous and 10 mixed sets
Multithreaded workloads: SPLASH-2, 16 threads
  4x4 (16 cores), 5 applications
System configuration
  Buffered baseline: 2-cycle router, 4 VCs/channel, 8 flits/VC
  Bufferless baseline: 2-cycle latency, FLIT-BLESS
  Instruction-trace driven, closed-loop, 128-entry OoO window
  64 KB L1, perfect L2 (stresses interconnect), XOR mapping
95
Methodology

Hardware modeling
  Verilog models for CHIPPER, BLESS, buffered logic
  Synthesized with a commercial 65nm library
  ORION for crossbar, buffers, and links
Power
  Static and dynamic power from hardware models
  Based on event counts in cycle-accurate simulations
96
Results: Performance Degradation
[Figure: weighted speedup for multiprogrammed workloads (a subset of the 49 total) and normalized
speedup for multithreaded workloads, comparing Buffered, BLESS, and CHIPPER; annotated losses of
13.6% and 1.8% on the multiprogrammed set and 3.6% and 49.8% on the multithreaded set.]

Minimal loss for low-to-medium-intensity workloads
97
Results: Power Reduction
[Figure: network power (W) for Buffered, BLESS, and CHIPPER on the multiprogrammed (subset of 49)
and multithreaded workloads; annotated reductions of 54.9% and 73.4%.]

Removing buffers → majority of the power savings
Slight additional savings from BLESS to CHIPPER
98
Results: Area and Critical Path Reduction
[Figure: normalized router area and normalized critical path for Buffered, BLESS, and CHIPPER;
annotated changes of -29.1% and -36.2% (area) and +1.1% and -1.6% (critical path) relative to the
buffered router.]

CHIPPER maintains the area savings of BLESS
The critical path becomes competitive with the buffered router
99
Conclusions

Two key issues in bufferless deflection routing: livelock freedom and packet reassembly
Bufferless deflection routers were high-complexity and impractical
  Oldest-first prioritization → long critical path in the router
  No end-to-end flow control for reassembly → prone to deadlock with reasonably-sized reassembly buffers
CHIPPER is a new, practical bufferless deflection router
  Golden Packet prioritization → short critical path in the router
  Retransmit-once protocol → deadlock-free packet reassembly
  Cache miss buffers as reassembly buffers → a truly bufferless network
CHIPPER frequency is comparable to buffered routers, at much lower area and power cost and with minimal performance loss
100
MinBD:
Minimally-Buffered Deflection Routing
for Energy-Efficient Interconnect
Chris Fallin, Greg Nazario, Xiangyao Yu*,
Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu
Carnegie Mellon University
*CMU and Tsinghua University
Bufferless Deflection Routing


Key idea: Packets are never buffered in the network. When two
packets contend for the same link, one is deflected.
Removing buffers yields significant benefits



But, at high network utilization (load), bufferless deflection
routing causes unnecessary link & router traversals



Reduces power (CHIPPER: reduces NoC power by 55%)
Reduces die area (CHIPPER: reduces NoC area by 36%)
Reduces network throughput and application performance
Increases dynamic power
Goal: Improve high-load performance of low-cost deflection
networks by reducing the deflection rate.
102
Outline: This Talk

Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions
103
Outline: This Talk

Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions
104
Issues in Bufferless Deflection Routing

Correctness: deliver all packets without livelock
  CHIPPER¹: Golden Packet (globally prioritize one packet until delivered)
Correctness: reassemble packets without deadlock
  CHIPPER¹: Retransmit-Once
Performance: avoid performance degradation at high load
  MinBD

¹ Fallin et al., "CHIPPER: A Low-complexity Bufferless Deflection Router", HPCA 2011.
105
Key Performance Issues
1. Link contention: no buffers to hold traffic →
   any link contention causes a deflection
    → use side buffers
2. Ejection bottleneck: only one flit can eject per router per cycle
    → simultaneous arrival causes deflection
    → eject up to 2 flits/cycle
3. Deflection arbitration: practical (fast) deflection arbiters deflect unnecessarily
    → new priority scheme (silver flit)
106
Outline: This Talk

Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions
107
Outline: This Talk

Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions
108
Addressing Link Contention



Problem 1: any link contention causes a deflection
Buffering a flit can avoid deflection on contention
But input buffers are expensive:
  All flits are buffered on every hop → high dynamic energy
  Large buffers necessary → high static energy and large area
Key Idea 1: add a small buffer to a bufferless deflection router to buffer only flits that would have been deflected
109
How to Buffer Deflected Flits
[Figure: baseline router¹; a flit that loses arbitration toward its destination is DEFLECTED out of
the router.]

¹ Fallin et al., "CHIPPER: A Low-complexity Bufferless Deflection Router", HPCA 2011.
110
How to Buffer Deflected Flits
[Figure: side-buffered router (a per-cycle sketch follows below).]
Step 1. Remove up to one deflected flit per cycle from the outputs.
Step 2. Buffer this flit in a small FIFO "side buffer."
Step 3. Re-inject this flit into the pipeline when a slot is available.
111
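A per-cycle sketch of the side buffer mechanism described above; the data structures, function boundaries, and the 4-flit depth are illustrative assumptions, not the paper's RTL.

```python
# Hypothetical sketch of MinBD's side buffer: capture at most one deflected flit
# per cycle into a small FIFO, and re-inject it when the injection slot is free.
from collections import deque

SIDE_BUFFER_DEPTH = 4
side_buffer = deque()

def buffer_one_deflected_flit(output_flits):
    """Steps 1-2: pull at most one deflected flit off the outputs into the FIFO."""
    for port, flit in output_flits.items():
        if flit is not None and flit["deflected"] and len(side_buffer) < SIDE_BUFFER_DEPTH:
            side_buffer.append(flit)
            output_flits[port] = None     # the flit leaves the pipeline this cycle
            break                         # at most one flit per cycle

def reinject_if_possible(inject_slot_free):
    """Step 3: re-inject the oldest side-buffered flit when the inject slot is free."""
    if inject_slot_free and side_buffer:
        return side_buffer.popleft()
    return None
```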
Why Could A Side Buffer Work Well?

Buffer some flits and deflect other flits at a per-flit level
Relative to bufferless routers, the deflection rate is reduced (need not deflect all contending flits)
   → a 4-flit buffer reduces the deflection rate by 39%
Relative to buffered routers, the buffer is used more efficiently (need not buffer all flits)
   → similar performance with 25% of the buffer space
112
Outline: This Talk

Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions
113
Addressing the Ejection Bottleneck




Problem 2: flits deflect unnecessarily because only one flit can eject per router per cycle
In 20% of all ejections, ≥ 2 flits could have ejected
   → all but one flit must deflect and try again
   → these deflected flits cause additional contention
An ejection width of 2 flits/cycle reduces the deflection rate by 21%
Key Idea 2: reduce deflections due to the single-flit ejection port by allowing two flits to eject per cycle
114
Addressing the Ejection Bottleneck
[Figure: single-width ejection; when two flits arrive for the local node in the same cycle, one of
them is deflected.]
115
Addressing the Ejection Bottleneck
[Figure: dual-width ejection; both arriving flits eject.]
For a fair comparison, the baseline routers also get dual-width ejection (counted for performance
only, not for power/area).
116
Outline: This Talk

Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions
117
Improving Deflection Arbitration



Problem 3: deflections occur unnecessarily because fast arbiters must use simple priority schemes
Age-based priorities (several past works): a full priority order gives fewer deflections, but requires slow arbiters
State-of-the-art deflection arbitration (Golden Packet & two-stage permutation network)
  Prioritize one packet globally (ensures forward progress)
  Arbitrate other flits randomly (fast critical path)
The random common case leads to uncoordinated arbitration
118
Fast Deflection Routing Implementation



Let's route in a two-input router first:
  Step 1: pick a "winning" flit (Golden Packet, else random)
  Step 2: steer the winning flit to its desired output and deflect the other flit
 → The highest-priority flit always routes to its destination
119
Fast Deflection Routing with Four Inputs

Each block makes decisions independently
 → deflection is a distributed decision
[Figure: two stages of 2x2 arbiter blocks connecting inputs N, E, S, W to outputs N, E, S, W.]
120
Unnecessary Deflections in Fast Arbiters

How does lack of coordination cause unnecessary deflections?
  1. No flit is golden (pseudorandom arbitration); all flits have equal priority
  2. The red flit wins at the first stage
  3. The green flit loses at the first stage (must be deflected now)
  4. The red flit loses at the second stage; both red and green are deflected
 → an unnecessary deflection!
121
Improving Deflection Arbitration


Key Idea 3: add a priority level and prioritize one flit, to ensure that at least one flit is not deflected in each cycle
Highest priority: one Golden Packet in the network
  Chosen in a static round-robin schedule
  Ensures correctness
Next-highest priority: one silver flit per router per cycle
  Chosen pseudo-randomly, local to one router
  Enhances performance
122
Adding A Silver Flit

Randomly picking a silver flit ensures one flit is not deflected (a priority sketch follows below)
  1. No flit is golden, but the red flit is silver (higher priority than the others)
  2. The red flit wins at the first stage (silver)
  3. The green flit is deflected at the first stage
  4. The red flit wins at the second stage (silver); it is not deflected
 → At least one flit is not deflected
123
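A minimal sketch of the resulting two-level priority (golden beats silver beats ordinary, random otherwise); the field names are illustrative assumptions, and this winner choice would slot into the 2x2 arbiter block sketched earlier.

```python
# Hypothetical sketch of MinBD's golden/silver/ordinary prioritization in one block.
import random

def pick_winner(flit_a, flit_b):
    """Return (winner, loser) under golden > silver > ordinary, random tie-break."""
    def priority(flit):
        if flit.get("golden"):
            return 2
        if flit.get("silver"):
            return 1
        return 0

    pa, pb = priority(flit_a), priority(flit_b)
    if pa > pb:
        return flit_a, flit_b
    if pb > pa:
        return flit_b, flit_a
    return tuple(random.sample([flit_a, flit_b], 2))   # equal priority: random

# One flit per router per cycle is marked silver (pseudo-randomly), so it wins
# every block it passes through and is guaranteed not to be deflected that cycle.
```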
Minimally-Buffered Deflection Router
[Figure: the MinBD router, combining all three mechanisms.]
Problem 1: Link Contention → Solution 1: Side Buffer
Problem 2: Ejection Bottleneck → Solution 2: Dual-Width Ejection
Problem 3: Unnecessary Deflections → Solution 3: Two-level priority scheme
124
Outline: This Talk

Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
125
Outline: This Talk

Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
Results
Conclusions
126
Methodology: Simulated System

Chip multiprocessor simulation
  64-core and 16-core models
  Closed-loop core/cache/NoC cycle-level model
  Directory cache coherence protocol (SGI Origin-based)
  64 KB L1, perfect L2 (stresses interconnect), XOR mapping
  Performance metric: weighted speedup (similar conclusions from network-level latency)
Workloads: multiprogrammed SPEC CPU2006
  75 randomly-chosen workloads
  Binned into network-load categories by average injection rate
127
Methodology: Routers and Network
Input-buffered virtual-channel router
  8 VCs, 8 flits/VC [Buffered (8,8)]: large buffered router
  4 VCs, 4 flits/VC [Buffered (4,4)]: typical buffered router
  4 VCs, 1 flit/VC [Buffered (4,1)]: smallest deadlock-free router
  All power-of-2 buffer sizes up to (8,8) for the perf/power sweep
Bufferless deflection router: CHIPPER¹
Bufferless-buffered hybrid router: AFC²
  Has input buffers and deflection routing logic
  Performs coarse-grained (multi-cycle) mode switching
Common parameters
  2-cycle router latency, 1-cycle link latency
  2D-mesh topology (16-node: 4x4; 64-node: 8x8)
  Dual ejection assumed for baseline routers (for performance only)

¹ Fallin et al., "CHIPPER: A Low-complexity Bufferless Deflection Router", HPCA 2011.
² Jafri et al., "Adaptive Flow Control for Robust Performance and Energy", MICRO 2010.
128
Methodology: Power, Die Area, Crit. Path

Hardware modeling
  Verilog models for CHIPPER, MinBD, buffered control logic
  Synthesized with a commercial 65nm library
  ORION 2.0 for the datapath: crossbar, muxes, buffers, and links
Power
  Static and dynamic power from hardware models
  Based on event counts in cycle-accurate simulations
  Broken down into buffer, link, and other
129
Reduced Deflections & Improved Performance
1. All mechanisms individually reduce deflections
2. The side buffer alone is not sufficient for performance (the ejection bottleneck remains)
3. Overall, 5.8% over the baseline and 2.7% over dual-eject, by reducing deflections by 64% / 54%

[Figure: weighted speedup for Baseline, B (Side Buffer), D (Dual-Eject), S (Silver Flits), B+D,
and B+S+D (MinBD); the deflection rates annotated under the bars are 28%, 17%, 22%, 27%, 11%,
and 10%.]
130
Overall Performance Results
[Figure: weighted speedup vs. injection rate for Buffered (8,8), Buffered (4,4), Buffered (4,1),
CHIPPER, AFC (4,4), and MinBD-4.]

• Improves 2.7% over CHIPPER (8.1% at high load)
• Similar performance to Buffered (4,1) with 25% of the buffering space
• Within 2.7% of Buffered (4,4) (8.3% at high load)
131
Overall Power Results
[Figure: network power (W), broken into static/dynamic buffer, link, and other components, for
Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, AFC (4,4), and MinBD-4.]

• Buffers are a significant fraction of power in baseline routers
• Dynamic power increases with deflection routing
• Buffer power is much smaller in MinBD (4-flit buffer)
• Dynamic power reduces in MinBD relative to CHIPPER
132
Performance-Power Spectrum
[Figure: weighted speedup vs. network power (W) for Buf (8,8), Buf (4,4), Buf (4,1), Buf (1,1),
AFC, CHIPPER, and MinBD; up and to the left means more performance per watt.]

• MinBD: most energy-efficient (perf/power) of any evaluated network router design
133
Die Area and Critical Path
[Figure: normalized die area and normalized critical path for Buffered (8,8), Buffered (4,4),
Buffered (4,1), CHIPPER, and MinBD.]

• Die area: only a 3% increase over CHIPPER (4-flit buffer); a 36% reduction from Buffered (4,4)
• Critical path: increases by 7% over CHIPPER and 8% over Buffered (4,4)
134
Conclusions

Bufferless deflection routing offers reduced power & area, but a high deflection rate hurts performance at high load
MinBD (Minimally-Buffered Deflection Router) introduces:
  A side buffer to hold only flits that would have been deflected
  Dual-width ejection to address the ejection bottleneck
  Two-level prioritization to avoid unnecessary deflections
MinBD yields reduced power (31%) & reduced area (36%) relative to buffered routers
MinBD yields improved performance (8.1% at high load) relative to bufferless routers → closes half of the performance gap
MinBD has the best energy efficiency of all evaluated designs, with competitive performance
135