On-Chip Networks from a Networking Perspective:
Congestion and Scalability in Many-Core Interconnects

George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝, Srinivasan Seshan✝
✝ Carnegie Mellon University    ★ Microsoft Research Asia
What is the On-Chip Network?
[Figure: a 9-core multi-core processor; each core (CPU) has a private L1 cache, and the cores share L2 cache banks and memory controllers]
What is the On-Chip Network?
[Figure: the same 9-core processor with a router at each node; routers are connected by network links that carry packets from a source (S) to a destination (D)]
Networking Challenges
• Familiar discussion in the architecture community, e.g.:
- How to reduce congestion
- How to scale the network
- Choosing an effective topology
- Routing and buffer sizing
All are historical problems in our (networking) field…
Can We Apply Traditional Solutions? (1)
1. Different constraints: unique network design
- Bufferless: area -60%, power -40%
- Routing: minimal complexity (X-Y routing), low latency
- Coordination: global coordination is often less expensive
- Links: links cannot be over-provisioned
2. Different workloads: unique style of traffic and flow
[Figure: a 3x3 on-chip network, zoomed in on a single router, with a source S and destination D]
Can We Apply Traditional Solutions? (2)
1. Different constraints: unique network design
- Bufferless: area -60%, power -40%
- Routing: minimal complexity (X-Y routing), low latency
- Coordination: global coordination is often less expensive
- Links: links cannot be over-provisioned
2. Different workloads: unique style of traffic and flow
- Closed-loop: the instruction window limits in-flight traffic per core
[Figure: zoomed view showing the architecture layer (instruction window) feeding the router at the network layer]
Traffic and Congestion
• Injection is possible only when an output link is free
• Arbitration: oldest packet first (dead- and live-lock free)
Manifestation of congestion:
1. Deflection: arbitration causing a non-optimal hop
[Figure: packets from S1 and S2 contend for the top port; age is initialized at injection; the oldest goes first, the newest is deflected]
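The oldest-first arbitration with deflection described above can be sketched as follows. This is an illustrative software model, not the paper's hardware implementation; the data structures and names are assumptions.

```python
def arbitrate(requests, ports):
    """Age-based deflection arbitration (illustrative sketch).

    requests: list of (age, preferred_port) for packets in the router.
    ports: list of output port names, e.g. ['N', 'E', 'S', 'W'].
    Returns {port: age}: the oldest packet wins its preferred port;
    losers are deflected to any remaining free port. No packet is
    dropped, which is what keeps the design bufferless and loss-free.
    """
    grants = {}
    for age, preferred in sorted(requests, reverse=True):  # oldest first
        if preferred not in grants:
            grants[preferred] = age          # wins the contended port
        else:
            for p in ports:                  # deflected: non-optimal hop
                if p not in grants:
                    grants[p] = age
                    break
    return grants
```

Because the oldest packet always wins, every packet eventually becomes the oldest in the network, which is why this arbitration is deadlock- and livelock-free.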
Traffic and Congestion
• Injection is possible only when an output link is free
• Arbitration: oldest packet first (dead- and live-lock free)
Manifestation of congestion:
1. Deflection: arbitration causing a non-optimal hop
2. Starvation: when a core cannot inject (no loss)
- Definition: the starvation rate is the fraction of cycles in which a core is starved (cannot inject a packet because no output port is free)
[Figure: a core unable to inject a packet without a free port]
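The starvation-rate definition above can be written down directly. The per-cycle log format below is a hypothetical stand-in for simulator state, used only to make the definition concrete.

```python
def starvation_rate(cycle_log):
    """Fraction of cycles in which a core was starved: it had a
    packet ready to inject but no output port was free.

    cycle_log: list of (wants_to_inject, port_free) booleans, one
    entry per cycle (a hypothetical trace format).
    """
    starved = sum(1 for wants, free in cycle_log if wants and not free)
    return starved / len(cycle_log) if cycle_log else 0.0
```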
Outline
• Bufferless on-chip networks: congestion & scalability
- Study of congestion at the network and application layers
- Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of the congestion control mechanism
- Able to effectively scale the network
Congestion and Scalability Study
• Prior work: moderate-intensity workloads, small on-chip networks
- Energy and area benefits of going bufferless
- Throughput comparable to buffered designs
• Our study: high-intensity workloads & a large network (4096 cores)
- Is throughput still comparable, with the benefits of bufferless?
• Use real application workloads (e.g., matlab, gcc, bzip2, perl)
- Simulate the whole chip, including the NoC
Congestion at the Network Level
• Evaluate 700 different application mixes in a 16-core system (each point represents a single workload)
• Finding: network latency remains stable with congestion/deflections
- The increase in network latency under congestion is only ~25% (~5-6 cycles), unlike traditional networks
[Figure: average network latency (cycles) vs. average network utilization]
• What about starvation rate?
• Starvation increases significantly with congestion (up to a ~700% increase)
[Figure: average starvation rate vs. average network utilization]
• Finding: starvation is likely to impact performance; it is the indicator of congestion
Congestion at the Application Level
• Define system throughput as the sum of instructions-per-cycle (IPC) of all applications in the system
• Unthrottle apps in a single workload
• Finding 1: throughput decreases (is sub-optimal) under congestion
• Finding 2: self-throttling of cores prevents congestion collapse
• Finding 3: static throttling can provide some gain (e.g., 14%), but we will show up to 27% gain with application-aware throttling
[Figure: system throughput, throttled vs. unthrottled; throughput does not collapse]
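The system-throughput metric defined above is just the sum of per-application IPC. A one-line sketch, where the per-application stats format is a hypothetical accounting convention:

```python
def system_throughput(app_stats):
    """Sum of instructions-per-cycle (IPC) over all applications.

    app_stats: list of (instructions_retired, cycles) per application
    (a hypothetical format for per-app counters).
    """
    return sum(insns / cycles for insns, cycles in app_stats)
```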
Impact of Congestion on Scalability
• Prior work: 16-64 cores; our work: up to 4096 cores
• As we increase the system's size:
- Starvation rate increases
• A core can be starved up to 37% of all cycles!
- Per-node throughput decreases
• Up to a 38% reduction
[Figures: starvation rate and per-node throughput (IPC/node) vs. number of cores, 16 to 4096]
Summary of Congestion Study
• Network congestion limits scalability and performance
- Due to starvation rate, not increased network latency
- Starvation rate is the indicator of congestion in the on-chip network
• The self-throttling nature of cores prevents congestion collapse
• Throttling: reduced congestion, improved performance
- Motivation for congestion control
Congestion control must be application-aware
Outline
• Bufferless on-chip networks: congestion & scalability
- Study of congestion at the network and application layers
- Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of the congestion control mechanism
- Able to effectively scale the network
Developing a Congestion Controller
• Traditional congestion controllers are designed to:
- Improve network efficiency
- Maintain fairness of network access
- Provide stability (and avoid collapse)
- Operate in a distributed manner
• When considering the on-chip network, a controller must also:
- Have minimal complexity
- Be area-efficient
- Be application-aware (as we show)
…in the paper: a global and simple controller
Need For Application Awareness
• Throttling reduces congestion and improves system throughput; but under congestion, which core should be throttled?
• Use a 16-core system; alternate a 90% throttle rate between the applications (gromacs and mcf)
[Figure: average instruction throughput in three cases: baseline, throttle gromacs, throttle mcf; gains range from -9% to +33% depending on which application is throttled]
• Finding 1: which app is throttled impacts system performance
• Finding 2: an application's own throughput does not dictate whom to throttle
• Finding 3: different applications respond differently to an increase in network throughput (unlike gromacs, mcf barely gains)
Unlike traditional congestion controllers (e.g., TCP): cannot be application-agnostic
Instructions-Per-Packet (IPP)
• Key insight: not all packets are created equal
- More L1 misses mean more traffic is needed to make the same progress
• Instructions-Per-Packet (IPP) = I/P
- A low value means the application needs many packets to make progress
• IPP depends only on the L1 miss rate
- Independent of the level of congestion
- Phase behavior on the order of millions of cycles
• Since the L1 miss rate varies over execution, IPP is dynamic
- Throttling during a "high" IPP phase will hurt performance
- IPP provides the application-layer insight needed
[Figure: IPP over execution cycles (millions) for deal.ii and xml, showing distinct phase behavior]
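Since IPP varies by phase, it is naturally measured per window rather than once per run. A minimal sketch, assuming a hypothetical per-window counter format:

```python
def ipp_trace(window_stats):
    """Per-window IPP = instructions / packets, capturing how IPP
    varies across execution phases.

    window_stats: list of (instructions_retired, packets_injected)
    per measurement window (hypothetical counter format). A window
    that injects no packets has effectively unbounded IPP.
    """
    return [i / p if p else float('inf') for i, p in window_stats]
```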
Instructions-Per-Packet (IPP)
• Since the L1 miss rate varies over execution, IPP is dynamic
- Throttling during a "high" IPP phase will hurt performance
- Dynamic IPP means throttling must be dynamic
• IPP provides application-layer insight into whom to throttle
• When congested: throttle applications with low IPP
• Fairness: scale the throttling rate by the application's IPP
- Details in the paper show fairness in throttling
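The policy on this slide (when congested, throttle low-IPP applications, scaling the rate by each application's IPP) might be sketched as follows. The normalization scheme and maximum rate here are illustrative assumptions, not the paper's exact controller.

```python
def throttle_rates(app_ipp, congested, max_rate=0.9):
    """Application-aware throttling sketch: under congestion, low-IPP
    apps (heavy injectors that gain little per packet) are throttled
    hardest; the rate falls as an app's IPP approaches the highest
    IPP in the mix.

    app_ipp: {app_name: measured IPP} (illustrative inputs).
    Returns {app_name: throttle rate in [0, max_rate]}.
    """
    if not congested:
        return {app: 0.0 for app in app_ipp}   # no throttling needed
    top = max(app_ipp.values())
    return {app: max_rate * (1.0 - ipp / top) for app, ipp in app_ipp.items()}
```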
Outline
• Bufferless on-chip networks: congestion & scalability
- Study of congestion at the network and application layers
- Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of the congestion control mechanism
- Able to effectively scale the network
Evaluation of Improved Efficiency
• Evaluate with 875 real workloads (700 16-core, 175 64-core)
- Generated as a balanced set of CMP workloads (cloud computing)
• Improvement in system throughput of up to 27% under congested workloads
• Does not degrade non-congested workloads
- Only 4 of 875 workloads have performance reduced by more than 0.5%
• Does not unfairly throttle applications down (in paper)
[Figure: % improvement in system throughput vs. baseline average network utilization (utilization with no congestion control)]
Evaluation of Improved Scalability
• Comparison points:
- Buffered: area/power expensive
- Baseline bufferless: doesn't scale
• Contribution: keep the area and power benefits of bufferless while achieving comparable performance
- Application-aware throttling
- Overall reduction in congestion
- Power consumption reduced through an increase in network efficiency
[Figures: per-node throughput (IPC/node) for baseline bufferless, buffered, and throttling bufferless; % reduction in power consumption; both vs. number of cores, 16 to 4096]
• Many other results in the paper, e.g., fairness, starvation, latency…
Summary of Study, Results, and Conclusions
• Highlighted a traditional networking problem in a new context
- The unique design requires a novel solution
• We showed that congestion limits efficiency and scalability, and that the self-throttling nature of cores prevents collapse
• Our study showed that congestion control requires application-awareness
• Our application-aware congestion controller provides:
- A more efficient network layer (reduced latency)
- Improvements in system throughput (up to 27%)
- Effective scaling of the CMP (shown for up to 4096 cores)
Discussion
• Congestion is just one of many similarities; more discussion in the paper, e.g.:
- Traffic engineering: "hotspots"
- Data centers: multi-threaded workloads with similar topology, dynamic routing & computation
- Coding: "XORs In-The-Air" adapted to the on-chip network
• i.e., instead of deflecting 1 of 2 packets, XOR the packets and forward the combination over the optimal hop