On-Chip Networks from a Networking Perspective:
Congestion and Scalability in Many-Core Interconnects

George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝, Srinivasan Seshan✝
✝ Carnegie Mellon University
★ Microsoft Research Asia

Presenter: Zhi Liu
What is the On-Chip Network?

[Figure: a 9-core multi-core processor. Each tile pairs a CPU with a
private L1 cache and a shared L2 cache bank; the cache banks and the
memory controllers are connected by network links, with a router at
each tile carrying traffic from source (S) to destination (D).]
Networking Challenges

• Classic networking problems now face the architecture community in
  the on-chip network, e.g.:
  - Reducing congestion
  - Scaling the network
  - Choosing an effective topology
  - Setting routing and buffer sizes
Characteristics of Bufferless NoCs

• Bufferless NoCs are a unique network design:
  - Routing: minimal complexity (X-Y routing, low latency; sketched
    below)
  - Coordination: global coordination is often practical for the
    known, fixed topology
  - Links: links cannot be over-provisioned
  - Removing buffers saves area (-60%) and power (-40%)
• And a unique style of traffic and flow:
  - Traffic is closed-loop and generated per-core by the instruction
    window: the architecture (instruction window) sits directly above
    the router at the network layer

[Figure: a 3x3 on-chip network, zoomed in on one node, showing the
instruction window (insn. i5-i9 with outstanding requests) feeding
the router.]
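To make the "minimal complexity" point concrete, here is a small
Python sketch of X-Y (dimension-order) routing; the coordinate scheme
and port names are illustrative assumptions, not taken from the talk.

# Sketch of X-Y (dimension-order) routing on a 2D mesh.
# Coordinates and port names are illustrative assumptions.

def xy_route(cur, dst):
    """Pick the productive output port for a packet at router
    cur = (x, y) headed to dst, resolving X fully before Y."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # arrived: eject to the local core

# Example: from (0, 0) to (2, 1) the packet first travels east, then north.
assert xy_route((0, 0), (2, 1)) == "EAST"
assert xy_route((2, 0), (2, 1)) == "NORTH"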
Traffic and Congestion

• Injection: a core injects a packet only when an output link is free
• Arbitration: oldest packet first (dead- and live-lock free); a
  packet's age is initialized at injection
• Congestion manifests in two ways (sketched below):
  1. Deflection: arbitration causes a non-optimal hop. When two
     packets contend for the same port (e.g., the top port), the
     oldest goes first and the newest is deflected.
  2. Starvation: a core cannot inject a packet without a free output
     port (but no packet is ever lost)
• Definition: the starvation rate is the fraction of cycles in which
  a core is starved
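A minimal Python sketch of the oldest-first arbitration and the
starvation-rate definition above; the data structures are mine (a
real router implements this in hardware, every cycle).

# Sketch of bufferless oldest-first arbitration with deflection,
# plus the starvation-rate definition. Structures are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    pid: int       # identifier
    age: int       # cycles since injection (initialized at injection)
    desired: str   # productive port chosen by X-Y routing

def arbitrate(packets, ports):
    """Oldest packet first: the oldest gets its desired port; losers
    are deflected to any remaining free port. Nothing is dropped, so
    len(packets) must not exceed len(ports)."""
    free = set(ports)
    out = {}
    for pkt in sorted(packets, key=lambda p: p.age, reverse=True):
        port = pkt.desired if pkt.desired in free else free.pop()
        free.discard(port)
        out[pkt.pid] = port
    return out

# Two packets contend for the top ("NORTH") port: the oldest wins,
# the newest is deflected to some other free port.
res = arbitrate([Packet(1, age=9, desired="NORTH"),
                 Packet(2, age=3, desired="NORTH")],
                ["NORTH", "SOUTH", "EAST", "WEST"])
assert res[1] == "NORTH" and res[2] != "NORTH"

def starvation_rate(starved_cycles, total_cycles):
    """Fraction of cycles in which a core could not inject."""
    return starved_cycles / total_cycles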
Outline

• Bufferless on-chip networks: congestion & scalability
  - Study of congestion at the network and application layers
  - Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of the congestion control mechanism
Congestion and Scalability Study

• Prior work: moderate-intensity workloads, small on-chip networks
  - Energy and area benefits of going bufferless
  - Throughput comparable to buffered designs
• This study: high-intensity workloads & a large network (4096 cores)
  - Is throughput still comparable, with the benefits of bufferless?
• Methodology: real application workloads (e.g., matlab, gcc, bzip2,
  perl)
Congestion at the Network Level

• Evaluated 700 different workloads (high, medium, and low intensity)
  in a 16-core system
• Finding: unlike traditional networks, average network latency
  remains stable under congestion/deflections; the increase in
  network latency under congestion is only ~5-6 cycles (~25%)

[Figure: avg. net latency (cycles) vs. average network utilization;
each point represents a single workload.]

• What about the starvation rate?
• Starvation increases significantly with congestion: up to a 700%
  increase
• Finding: starvation, not latency, is likely to impact performance;
  it is the indicator of congestion

[Figure: average starvation rate vs. average network utilization;
each point represents a single workload.]
Congestion at the Application Level

• Define system throughput as the sum of the instructions-per-cycle
  (IPC) of all applications in the system:
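Written out for a system of N applications:

  System Throughput = IPC_1 + IPC_2 + ... + IPC_N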
• Experiment: unthrottle apps in a single high-intensity workload in
  a 4x4 system
• Finding 1: throughput decreases under congestion; the unthrottled
  system is sub-optimal compared to throttled, but throughput does
  not collapse
• Finding 2: self-throttling of cores prevents collapse (modeled in
  the sketch below)
• Finding 3: static throttling can provide some gain (e.g., 14%)
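A toy model of the self-throttling loop in Finding 2, following the
closed-loop instruction-window behavior described earlier; the window
size and latencies below are made-up illustrative numbers.

# Toy model of self-throttling: a core can only have a limited number
# of outstanding misses (bounded by its instruction window), so its
# injection rate falls as the network round trip grows under
# congestion. Numbers are illustrative.

def injection_rate(window_slots, round_trip_cycles):
    """Upper bound on packets injected per cycle by one core:
    at most window_slots requests can be in flight at once."""
    return window_slots / round_trip_cycles

# As congestion stretches the round trip, injection naturally slows,
# which is what prevents congestion collapse:
assert injection_rate(16, 20) > injection_rate(16, 40)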
Impact of Congestion on Scalability

• Prior work: 16-64 cores. Our work: up to 4096 cores
• As we increase the system's size (high load, locality model,
  avg_hop = 1):
  - The starvation rate increases: a core can be starved up to 37% of
    all cycles!
  - Per-node throughput decreases with the system's size: up to a 38%
    reduction

[Figures: starvation rate and throughput (IPC/node) vs. number of
cores, from 16 to 4096.]
Summary of Congestion Study

• Network congestion limits scalability and performance
  - Due to the starvation rate, not increased network latency
  - The starvation rate is the indicator of congestion in on-chip
    networks
• The self-throttling nature of cores prevents congestion collapse
• Throttling reduces congestion and improves performance
• Motivation for congestion control that must be application-aware
Outline

• Bufferless on-chip networks: congestion & scalability
  - Study of congestion at the network and application layers
  - Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of the congestion control mechanism
Developing a Congestion Controller

• Traditional congestion controllers are designed to:
  - Improve network efficiency
  - Maintain fairness of network access
  - Provide stability (and avoid collapse)
  - Operate in a distributed manner
• When considering the on-chip network, a controller must also:
  - Have minimal complexity
  - Be area-efficient
  - Be a practical and simple controller
  - We show: be application-aware
Need For Application Awareness

• Throttling reduces congestion and improves system throughput, but
  under congestion, which core should be throttled?
• Setup: 16-core system, 8 instances of each app, alternately
  throttled at a 90% rate
• Finding 1: which app is throttled impacts system performance: +21%
  when throttling one app vs. -9% when throttling the other
• Finding 2: an application's own throughput does not dictate whom to
  throttle (per-app effects ranged from <1% to +33%)
• Finding 3: different applications respond differently to an
  increase in throttling; controllers (e.g., TCP) cannot be
  application-agnostic

[Figure: average instruction throughput of Gromacs (~2.5) and MCF
(~0.6-1.3) under three configurations: baseline (throttle nothing),
throttle Gromacs, and throttle MCF.]
Instructions-Per-Packet (IPP)

• Key insight: not all packets are created equal
  - More L1 misses mean more traffic is needed to make progress
• Instructions-Per-Packet (IPP) = I/P
  - IPP only depends on the L1 miss rate
  - It is independent of the level of congestion & execution rate
  - A low IPP value means the app injects more packets into the
    network
  - IPP provides a stable measure of an app's current network
    intensity
  - MCF's IPP: 0.583; Gromacs' IPP: 12.41
• Since the L1 miss rate varies over execution, IPP is dynamic

[Figure: IPP of deal.ii and xml over execution cycles (millions),
showing dynamic behavior across millions of cycles.]
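A small sketch of how IPP could be computed per measurement epoch
from two counters; the counter names are hypothetical, but the ratio
and the two example values are from the talk.

# Sketch: per-epoch IPP (instructions-per-packet) measurement.
# Counter names are hypothetical.

def ipp(instructions_retired, packets_injected):
    """IPP = I/P. Low IPP (e.g., mcf, ~0.583) means the app needs
    many packets to make progress: network-intensive. High IPP
    (e.g., gromacs, ~12.41) means it is network-light."""
    if packets_injected == 0:
        return float("inf")  # injected nothing this epoch
    return instructions_retired / packets_injected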
Instructions-Per-Packet (IPP)

• Since the L1 miss rate varies over execution, IPP is dynamic
  - Throttling during a "high" IPP phase will hurt performance
  - Throttling must therefore be dynamic
• The application-layer insight into whom to throttle:
  - Throttle applications with low IPP
  - Scale the throttling rate by the application's IPP
Summary of the Congestion Controller

• Run the congestion control algorithm every 100,000 cycles:
  - Detect congestion based on starvation rates
  - Determine the IPP of each application
  - Throttle applications with low IPP
  - Scale the throttling rate by each application's IPP: higher
    throttling for applications with lower IPP (sketched below)
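Putting the slide's policy together, a hedged Python sketch of one
controller epoch: the 100,000-cycle interval, the starvation trigger,
and IPP-scaled throttling come from the slide, while the threshold
value and the exact scaling function are illustrative assumptions.

# Sketch of one application-aware controller epoch. The interval and
# the policy shape follow the slide; the threshold and the scaling
# function are illustrative assumptions.

INTERVAL_CYCLES = 100_000
STARVATION_THRESHOLD = 0.1   # assumed trigger value, not from the paper

def control_epoch(apps):
    """apps maps name -> {'starvation_rate': float, 'ipp': float},
    measured over the last 100,000-cycle epoch. Returns a throttle
    rate in [0, 1) per app (fraction of injections blocked)."""
    congested = any(a["starvation_rate"] > STARVATION_THRESHOLD
                    for a in apps.values())
    if not congested:
        return {name: 0.0 for name in apps}  # leave everyone alone

    max_ipp = max(a["ipp"] for a in apps.values())
    throttle = {}
    for name, a in apps.items():
        # Low-IPP (network-intensive) apps are throttled hardest; the
        # rate falls as IPP grows (illustrative scaling form).
        throttle[name] = 0.9 * (1.0 - a["ipp"] / max_ipp)
    return throttle

# Example with the talk's two apps: mcf (low IPP) gets throttled,
# gromacs (high IPP) does not.
rates = control_epoch({"mcf":     {"starvation_rate": 0.3,  "ipp": 0.583},
                       "gromacs": {"starvation_rate": 0.05, "ipp": 12.41}})
assert rates["gromacs"] == 0.0 and rates["mcf"] > 0.8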
Outline

• Bufferless on-chip networks: congestion & scalability
  - Study of congestion at the network and application layers
  - Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of the congestion control mechanism
Evaluation of Improved Efficiency

• Evaluated with 875 real workloads (700 16-core, 175 64-core)
  - Generated a balanced set of CMP workloads (cloud computing)
• Improvement of up to 27% in system throughput under congested
  workloads
• Does not degrade non-congested workloads
  - Only 4/875 workloads have performance reduced by > 0.5%
• Does not unfairly throttle applications down (details in the paper)

[Figure: % improvement in system throughput vs. baseline average
network utilization with no congestion control.]
Evaluation of Improved Scalability

• Baseline bufferless: doesn't scale
• Buffered: area/power expensive
• Contribution: keep the area and power benefits of bufferless while
  achieving performance comparable to buffered

[Figure: throughput (IPC/node) vs. number of cores (16 to 4096) for
baseline bufferless, buffered, and throttling bufferless.]
Application-Aware Throttling

• Overall reduction in congestion
• Power consumption is reduced through the increase in network
  efficiency

[Figure: % reduction in power consumption vs. number of cores
(16 to 4096).]
Summary of Study, Results, and Conclusions

• Highlighted a traditional networking problem in a new context
  - The unique design requires a novel solution
• We showed that congestion limits efficiency and scalability, and
  that the self-throttling nature of cores prevents collapse
• Our study showed that congestion control would require
  application-awareness
• Our application-aware congestion controller provided:
  - A more efficient network layer (reduced latency)
  - Improvements in system throughput (up to 27%)
  - Effective scaling of the CMP (shown for up to 4096 cores)