A Case for Bufferless Routing
in On-Chip Networks
Thomas Moscibroda, Microsoft Research
Onur Mutlu, CMU
On-Chip Networks (NoC)
[Figure: a multi-core chip in which CPU+L1 nodes, L2 cache banks, memory controllers, accelerators, etc. are connected by an on-chip network]
On-Chip Networks (NoC)
• Connect cores, caches, memory controllers, etc.
• Examples:
  • Intel 80-core Terascale chip
  • MIT RAW chip
• Design goals in NoC design:
  • High throughput, low latency
  • Fairness between cores, QoS, …
  • Low complexity, low cost
  • Low power, low energy consumption
Energy/Power in On-Chip Networks
• Power is a key constraint in the design of high-performance processors
• NoCs consume a substantial portion of system power
  • ~30% in Intel 80-core Terascale [IEEE Micro'07]
  • ~40% in MIT RAW Chip [ISCA'04]
• NoCs estimated to consume 100s of Watts [Borkar, DAC'07]
Current NoC Approaches
• Existing approaches differ in numerous ways:
  • Network topology [Kim et al, ISCA'07; Kim et al, ISCA'08; etc.]
  • Flow control [Michelogiannakis et al, HPCA'09; Kumar et al, MICRO'08; etc.]
  • Virtual channels [Nicopoulos et al, MICRO'06; etc.]
  • QoS & fairness mechanisms [Lee et al, ISCA'08; etc.]
  • Routing algorithms [Singh et al, CAL'04]
  • Router architecture [Park et al, ISCA'08]
  • Broadcast, multicast [Jerger et al, ISCA'08; Rodrigo et al, MICRO'08]

→ Existing work assumes the existence of buffers in routers!
A Typical Router
[Figure: a typical virtual-channel router with N input ports (each holding virtual-channel buffers VC1…VCv), routing computation, a VC arbiter, a switch arbiter and scheduler, an N x N crossbar, and credit flow back to upstream routers]
→ Buffers are an integral part of existing NoC routers
Buffers in NoC Routers
• Buffers are necessary for high network throughput
  → buffers increase the total available bandwidth in the network
[Figure: average packet latency vs. injection rate for small, medium, and large buffers]
Buffers in NoC Routers
• Buffers are necessary for high network throughput
  → buffers increase the total available bandwidth in the network
• Buffers consume significant energy/power
  • Dynamic energy on every read/write
  • Static energy even when not occupied
• Buffers add complexity and latency
  • Logic for buffer management
  • Virtual channel allocation
  • Credit-based flow control
• Buffers require significant chip area
  • E.g., in the TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al, ICCD'06]
Going Bufferless…?
• How much throughput do we lose? How is latency affected?
[Figure: latency vs. injection rate with buffers and with no buffers]
• Up to what injection rates can we use bufferless routing?
  → Are there realistic scenarios in which the NoC is operated at injection rates below this threshold? If so, how much…?
• Can we achieve energy reduction?
• Can we reduce area, complexity, etc.?
→ Answers in our paper!
Overview
• Introduction and Background
• Bufferless Routing (BLESS)
  • FLIT-BLESS
  • WORM-BLESS
  • BLESS with buffers
• Advantages and Disadvantages
• Evaluations
• Conclusions
BLESS: Bufferless Routing
• Always forward all incoming flits to some output port
• If no productive direction is available, send the flit in another direction
  → the packet is deflected
  → Hot-potato routing [Baran'64, etc.]
[Figure: instead of waiting in a buffer as in a buffered router, a flit that cannot take its productive output port is deflected to another free port]
BLESS: Bufferless Routing
[Figure: in the BLESS router, routing computation, the VC arbiter, and the switch arbiter are replaced by a single arbitration policy: Flit-Ranking plus Port-Prioritization]
• Flit-Ranking: 1. Create a ranking over all incoming flits
• Port-Prioritization: 2. For a given flit in this ranking, find the best free output port
→ Apply to each flit in order of ranking
FLIT-BLESS: Flit-Level Routing
• Each flit is routed independently
• Oldest-first arbitration (other policies evaluated in paper)
  • Flit-Ranking: 1. Oldest-first ranking
  • Port-Prioritization: 2. Assign flit to a productive port, if possible. Otherwise, assign it to a non-productive port.
• Network Topology:
  → Can be applied to most topologies (Mesh, Torus, Hypercube, Trees, …), as long as
  1) #output ports ≥ #input ports at every router
  2) every router is reachable from every other router
• Flow Control & Injection Policy:
  → Completely local, inject whenever the local input port is free
• Absence of Deadlocks: every flit is always moving
• Absence of Livelocks: guaranteed by oldest-first ranking
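A minimal sketch of one FLIT-BLESS arbitration cycle at a single router, assuming a 2D mesh; the port names, flit fields, and the productive_ports helper are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of one FLIT-BLESS arbitration cycle (assumptions:
# 2D-mesh port names, flit fields, at most one incoming flit per input port).

PORTS = ["N", "E", "S", "W"]

def productive_ports(router_xy, dest_xy):
    """Output ports that bring the flit closer to its destination."""
    (rx, ry), (dx, dy) = router_xy, dest_xy
    ports = []
    if dx > rx: ports.append("E")
    if dx < rx: ports.append("W")
    if dy > ry: ports.append("N")
    if dy < ry: ports.append("S")
    return ports

def flit_bless_arbitrate(router_xy, incoming_flits):
    """incoming_flits: list of dicts with 'dest' and 'injection_time'.
    Returns {flit_index: output_port}; every flit gets some port, so nothing
    is ever buffered. (Flits that have reached their destination would be
    ejected locally before this step, and a new flit may be injected only
    if an input port is free; both are omitted here for brevity.)"""
    # 1. Flit-Ranking: oldest first (ties broken deterministically).
    ranked = sorted(enumerate(incoming_flits),
                    key=lambda kv: kv[1]["injection_time"])
    free_ports = list(PORTS)
    assignment = {}
    # 2. Port-Prioritization, applied to each flit in order of ranking.
    for idx, flit in ranked:
        prods = [p for p in productive_ports(router_xy, flit["dest"])
                 if p in free_ports]
        port = prods[0] if prods else free_ports[0]   # else: deflection
        assignment[idx] = port
        free_ports.remove(port)
    return assignment
```

For example, flit_bless_arbitrate((1, 1), [{"dest": (3, 1), "injection_time": 5}]) returns {0: "E"}, assigning the single flit its productive port.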
WORM-BLESS: Wormhole Routing
• Potential downsides of FLIT-BLESS
  • Not energy-optimal (each flit needs header information)
  • Increase in latency (different flits take different paths)
  • Increase in receive buffer size
• BLESS with wormhole routing [Dally, Seitz'86]…?
• Problems:
  • Injection problem (it is not known when it is safe to inject a new worm)
  • Livelock problem (packets can be deflected forever)
WORM-BLESS: Wormhole Routing
• Flit-Ranking: 1. Oldest-first ranking
• Port-Prioritization: 2. If the flit is a head-flit:
    a) assign flit to an unallocated, productive port
    b) assign flit to an allocated, productive port
    c) assign flit to an unallocated, non-productive port
    d) assign flit to an allocated, non-productive port
  else (body-flit): assign flit to the port that is allocated to its worm
• At low congestion, packets travel routed as worms; worms are deflected and truncated if necessary
[Figure: a worm allocated to the West port is truncated and deflected when a higher-ranked head-flit headed West takes that port; the first body-flit of the truncated worm turns into a head-flit, while another worm stays allocated to North]
See paper for details…
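A simplified, illustrative sketch of the port-prioritization rules (a)–(d) for a head-flit, including truncation of a worm whose allocated port is taken; worm_id, the allocation table, and truncate_worm are hypothetical names, not the paper's implementation.

```python
# Simplified, hypothetical sketch of WORM-BLESS port-prioritization for one
# head-flit. allocation maps each output port to the worm currently holding
# it (or None); free_ports are ports not yet claimed by a higher-ranked flit
# in this cycle (at least one free port always remains in BLESS).

def prioritize_head_flit(worm_id, productive, free_ports, allocation):
    candidates = (
        [p for p in productive if p in free_ports and allocation[p] is None]            # (a)
        + [p for p in productive if p in free_ports and allocation[p] is not None]      # (b)
        + [p for p in free_ports if p not in productive and allocation[p] is None]      # (c)
        + [p for p in free_ports if p not in productive and allocation[p] is not None]  # (d)
    )
    port = candidates[0]
    if allocation[port] is not None:
        truncate_worm(allocation[port])   # displaced worm: its next body-flit
                                          # becomes a head-flit and may be deflected
    allocation[port] = worm_id            # the new worm now owns this output port
    return port

def truncate_worm(old_worm_id):
    """Placeholder: mark the worm so its next body-flit turns into a head-flit
    (it needs full header information and is routed on its own)."""
    pass

# Body-flits simply follow: they are assigned to the port already allocated
# to their worm (the 'else' branch above).
```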
BLESS with Buffers
• BLESS without buffers is the extreme end of a continuum
• BLESS can be integrated with buffers
  • FLIT-BLESS with buffers
  • WORM-BLESS with buffers
• Whenever a buffer is full, its first flit becomes must-schedule
  → must-schedule flits must be deflected if necessary
See paper for details…
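A small illustrative sketch of the must-schedule rule in buffered BLESS; the buffer representation, flit fields, and the waiting policy shown for ordinary flits are assumptions, not the paper's exact mechanism.

```python
# Hypothetical sketch of the must-schedule rule for BLESS with buffers.

def mark_must_schedule(input_buffer, capacity):
    """When an input buffer is full, its first (oldest) flit becomes
    must-schedule: it has to leave the router this cycle."""
    if len(input_buffer) >= capacity:
        input_buffer[0]["must_schedule"] = True

def schedule_buffered_flit(flit, productive_free, any_free):
    """Ordinary flits may wait in the buffer for a productive port;
    must-schedule flits are sent out, deflected if necessary."""
    if productive_free:
        return productive_free[0]
    if flit.get("must_schedule"):
        return any_free[0]        # deflect
    return None                   # stay buffered this cycle
```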
BLESS: Advantages & Disadvantages
Advantages:
• No buffers
• Purely local flow control
• Simplicity
  - no credit flows
  - no virtual channels
  - simplified router design
• No deadlocks, no livelocks
• Adaptivity
  - packets are deflected around congested areas!
• Router latency reduction
• Area savings

Disadvantages:
• Increased latency
• Reduced bandwidth
• Increased buffering at the receiver
• Header information at each flit

Impact on energy…?
Reduction of Router Latency
→ BLESS gets rid of input buffers and virtual channels
BW: Buffer Write, RC: Route Computation, VA: Virtual Channel Allocation, SA: Switch Allocation, ST: Switch Traversal, LT: Link Traversal, LA LT: Link Traversal of Lookahead
[Figure: router pipelines. Baseline speculative router [Dally, Towles'04]: head flit BW → RC/VA/SA → ST → LT, body flit BW → SA → ST → LT; router latency = 3 (can be improved to 2). Standard BLESS router: RC → ST → LT; router latency = 2. Optimized BLESS router with lookahead (LA LT): router latency = 1.]
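As a back-of-envelope illustration of why per-hop router latency matters, consider a simple zero-load latency model; the model and the example numbers below are assumptions, not results from the paper.

```python
# Assumed zero-load latency model: hops * (router latency + link latency)
# plus serialization of the remaining flits of the packet.

def zero_load_latency(hops, router_latency, link_latency=1, flits_per_packet=4):
    return hops * (router_latency + link_latency) + (flits_per_packet - 1)

for name, rl in [("baseline (3-cycle router)", 3),
                 ("BLESS standard (2-cycle router)", 2),
                 ("BLESS optimized (1-cycle router)", 1)]:
    print(name, zero_load_latency(hops=6, router_latency=rl), "cycles")
# e.g. 27, 21, and 15 cycles for a 6-hop, 4-flit packet under this model
```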
Evaluation Methodology
• 2D mesh network, router latency is 2 cycles
  o 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
  o 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
  o 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
  o 128-bit wide links, 4-flit data packets, 1-flit address packets
  o For baseline configuration: 4 VCs per physical input port, 1 packet deep
• Benchmarks
  o Multiprogrammed SPEC CPU2006 and Windows Desktop applications
  o Heterogeneous and homogeneous application mixes
  o Synthetic traffic patterns: UR, Transpose, Tornado, Bit Complement
• x86 processor model based on Intel Pentium M
  o 2 GHz processor, 128-entry instruction window
  o 64 Kbyte private L1 caches
  o Total 16 Mbyte shared L2 caches; 16 MSHRs per bank
  o DRAM model based on Micron DDR2-800
• Simulation is cycle-accurate
  → Models stalls in network and processors
  → Self-throttling behavior
  → Aggressive processor model
• Most of our evaluations with perfect L2 caches
  → Puts maximal stress on the NoC
Evaluation Methodology
• Energy model provided by Orion simulator [MICRO'02]
  o 70nm technology, 2 GHz routers at 1.0 Vdd
• For BLESS, we model
  o Additional energy to transmit header information
  o Additional buffers needed on the receiver side
  o Additional logic to reorder flits of individual packets at the receiver
• We partition network energy into buffer energy, router energy, and link energy, each having static and dynamic components
• Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms (DO, MIN-AD, ROMM)
Evaluation – Synthetic Traces
• First, the bad news…
• Uniform random injection
• BLESS has significantly lower saturation throughput compared to the best buffered baseline
[Figure: average latency vs. injection rate (flits per cycle per node, 0.07–0.49) for FLIT-2, WORM-2, FLIT-1, WORM-1, and MIN-AD (best buffered baseline); the BLESS variants saturate at lower injection rates]
Evaluation – Homogeneous Case Study
• milc benchmarks (moderately intensive)
• Perfect caches!
• Very little performance degradation with BLESS (less than 4% in the dense network)
• With router latency 1, BLESS can even outperform the baseline (by ~10%)
• Significant energy improvements (almost 40%)
[Figure: W-Speedup of Baseline, BLESS, and BLESS with RL=1, and normalized network energy (BufferEnergy, LinkEnergy, RouterEnergy) for the 4x4 8x milc, 4x4 16x milc, and 8x8 16x milc configurations]
Observations:
1) Injection rates are not extremely high → self-throttling!
2) For bursts and temporary hotspots, network links are used as buffers!
Evaluation – Further Results
• BLESS increases the buffer requirement at the receiver by at most 2x
  → overall, energy is still reduced
• Impact of memory latency
  → with real caches, very little slowdown! (at most 1.5%)
[Figure: W-Speedup of DO, MIN-AD, ROMM, FLIT-2, WORM-2, FLIT-1, and WORM-1 for the 4x4 8x matlab, 4x4 16x matlab, and 8x8 16x matlab configurations]
See paper for details…
• Heterogeneous application mixes
  (we evaluate several mixes of intensive and non-intensive applications)
  → little performance degradation
  → significant energy savings in all cases
  → no significant increase in unfairness across different applications
• Area savings: ~60% of network area can be saved!
Evaluation – Aggregate Results
• Aggregate results over all 29 applications
[Figure: normalized network energy (BufferEnergy, LinkEnergy, RouterEnergy) and W-Speedup for BASE, FLIT, and WORM, mean and worst-case]

Sparse Network          Perfect L2                Realistic L2
                        Average    Worst-Case     Average    Worst-Case
∆ Network Energy        -39.4%     -28.1%         -46.4%     -41.0%
∆ System Performance    -0.5%      -3.2%          -0.15%     -0.55%
Dense Network           Perfect L2                Realistic L2
                        Average    Worst-Case     Average    Worst-Case
∆ Network Energy        -32.8%     -14.0%         -42.5%     -33.7%
∆ System Performance    -3.6%      -17.1%         -0.7%      -1.5%
Conclusion
• For a very wide range of applications and network settings, buffers are not needed in NoC
  • Significant energy savings (32% even in dense networks with perfect caches)
  • Area savings of ~60%
  • Simplified router and network design (flow control, etc.)
  • Performance slowdown is minimal (performance can even increase!)
→ A strong case for rethinking NoC design!
• We are currently working on future research
  • Support for quality of service, different traffic classes, energy management, etc.