A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach

Asit K. Mishra
Onur Mutlu
Chita R. Das
Executive summary
•  Problem: Current-day NoC designs are agnostic to application requirements and are provisioned for the general case or the worst case, although applications have widely differing demands from the network
•  Our goal: To design a NoC that can satisfy the diverse dynamic performance requirements of applications
•  Observation: Applications can be divided into two general classes in terms of their requirements from the network: bandwidth-sensitive and latency-sensitive
- Not all applications are equally sensitive to bandwidth and latency
•  Key idea: Design two NoCs
- Each sub-network is customized for either BW or LAT sensitive applications
- Propose metrics to classify applications as BW or LAT sensitive
- Prioritize applications’ packets within the sub-networks based on their sensitivity
•  Network design: The BW optimized network has wider links but operates at a lower frequency; the LAT optimized network has narrower links but operates at a higher frequency
•  Results: Compared to an iso-resource monolithic network, our proposal improves weighted/instruction throughput by 5%/3% and reduces energy by 31%
Resource requirements of various applications - I
Impact of channel bandwidth on application performance
•  Channel bandwidth affects network latency, throughput and
energy/power
•  Increase in channel BW leads to
- Reduction in packet serialization
- Increase in router power
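The serialization effect can be illustrated with a small calculation. This is a minimal sketch: the 1024-bit packet size is an assumption chosen for illustration, not a figure from the slides, and `flits_per_packet` is a hypothetical helper name.

```python
import math

def flits_per_packet(packet_bits: int, link_width_bits: int) -> int:
    """Number of link-width transfers (flits) needed to send one packet."""
    return math.ceil(packet_bits / link_width_bits)

# Assumption: a 1024-bit data packet (illustrative, not from the slides).
PACKET_BITS = 1024

# Link widths studied in the slides.
serialization = {w: flits_per_packet(PACKET_BITS, w) for w in (64, 128, 256, 512)}
# 64b -> 16 flits, 128b -> 8, 256b -> 4, 512b -> 2
```

Under this assumption, each doubling of the link width halves the number of flits per packet, which is the serialization reduction the bullet refers to.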
Simulation settings:
•  8x8 multi-hop packet based mesh network
•  Each node in the network has an OoO processor (2GHz), private L1
cache and a router (2GHz)
•  Shared L2 cache, 1MB per core
•  6 virtual channels (VCs) per physical channel (PC), 2-stage router
[Figure: IT (norm. to 64b links) for each application with 64b, 128b, 256b and 512b links]
1.  For 18/30 (21/36 total) applications, performance is largely insensitive to channel BW (8x BW increase → less than 2x performance increase)
2.  For 12/30 (15/36 total) applications, performance scales with channel BW (8x BW increase → at least 2x performance increase)
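The 2x cut-off used in these two observations can be written as a simple test. This is a sketch only: the function name and the idea of feeding it per-application IT values at 64b and 512b links are assumptions for illustration.

```python
def scales_with_bandwidth(it_64b: float, it_512b: float,
                          min_speedup: float = 2.0) -> bool:
    """True if an 8x link-width increase (64b -> 512b) yields at least
    `min_speedup` times the instruction throughput (IT)."""
    if it_64b <= 0:
        raise ValueError("baseline IT must be positive")
    return it_512b / it_64b >= min_speedup

# Hypothetical normalized IT values (64b baseline = 1.0).
print(scales_with_bandwidth(1.0, 3.5))  # scales with channel BW
print(scales_with_bandwidth(1.0, 1.2))  # agnostic to channel BW
```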
Resource requirements of various applications - II
Impact of network latency on application performance
•  Reduction in router latency (by increasing frequency)
-  Reduction in packet latency
-  Increase in router power consumption
Simulation settings:
•  … same as last experiment
•  128b links
•  Added dummy stages (2-cycle and 4-cycle) to each router
[Figure: IT (norm. to 2-cycle router) for each application with 2-cycle, 4-cycle and 6-cycle routers]
1.  For 18/30 (21/36 total) applications, performance is sensitive to network latency (3x latency reduction → at least 25% performance improvement)
2.  For 12/30 (15/36 total) applications, performance is only marginally sensitive to network latency (3x latency reduction → less than 15% performance improvement)
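The two latency thresholds can be sketched the same way. The function name and the "unclassified" band between 15% and 25% are assumptions; the slides only state the 25% and 15% cut-offs.

```python
def latency_sensitivity(it_2cycle: float, it_6cycle: float) -> str:
    """Classify by the IT gain from a 3x router latency reduction
    (6-cycle router -> 2-cycle router)."""
    if it_6cycle <= 0:
        raise ValueError("IT with the 6-cycle router must be positive")
    gain = it_2cycle / it_6cycle - 1.0
    if gain >= 0.25:
        return "sensitive"
    if gain < 0.15:
        return "marginal"
    return "unclassified"  # assumption: the slides do not name this band

# Hypothetical IT values normalized to the 2-cycle router.
print(latency_sensitivity(1.0, 0.70))
print(latency_sensitivity(1.0, 0.95))
```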
Application-aware approach to designing multiple NoCs
[Figures: IT (norm. to 64b links) for 64b to 512b links, and IT (norm. to 2-cycle router) for 2-cycle, 4-cycle and 6-cycle routers, shown together for each application]
Based on the observations:
1.  Applications can be classified into distinct classes: typically LAT/BW sensitive
2.  LAT sensitive applications can benefit from low network latency
3.  BW sensitive applications can benefit from high network bandwidth
4.  Not all applications are equally sensitive to either LAT or BW
5.  Monolithic network cannot optimize both classes simultaneously
Solution
Two NoCs where each (sub)network is optimized for either LAT
or BW sensitive applications
Design methodology
Logical view of a multicore processor
Processors/L1$
Network
L2$ and Mem. Controllers
1.  Identify LAT/BW sensitive applications
- Propose a novel dynamic application classification scheme
2.  Design sub-networks based on applications’ demand
- This network architecture is better than a monolithic iso-resource network
[Diagram: the logical view with a DEMUX added]
Design: Dynamic classification of applications
[Figure: application life cycle, plotting outstanding network packets over time as alternating network episodes and compute episodes]
Network episode:
•  App. has at least one outstanding packet
•  Processor is likely stalling → low IPC
Compute episode:
•  App. has no outstanding packet
•  High IPC
Episode length = number of consecutive cycles in which the application has outstanding network packets
Episode height = average number of L1 packets injected during an episode
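The two metrics can be computed from a per-cycle trace. This is a minimal sketch: the trace format and the function name are assumptions, and episode height is read here as packets injected per network episode, averaged over episodes.

```python
from typing import List, Tuple

def episode_stats(trace: List[Tuple[int, int]]) -> Tuple[float, float]:
    """trace[i] = (outstanding packets in cycle i, packets injected in cycle i).

    A network episode is a maximal run of cycles with outstanding > 0.
    Returns (average episode length in cycles, average episode height in packets).
    """
    lengths: List[int] = []
    heights: List[int] = []
    run_len = run_pkts = 0
    for outstanding, injected in trace:
        if outstanding > 0:
            run_len += 1
            run_pkts += injected
        elif run_len:
            # A compute episode starts: close the current network episode.
            lengths.append(run_len)
            heights.append(run_pkts)
            run_len = run_pkts = 0
    if run_len:
        # The trace ended in the middle of a network episode.
        lengths.append(run_len)
        heights.append(run_pkts)
    if not lengths:
        return 0.0, 0.0
    return sum(lengths) / len(lengths), sum(heights) / len(heights)

# Hypothetical trace: a 3-cycle episode with 2 packets, then a 1-cycle episode with 1.
trace = [(1, 1), (2, 1), (1, 0), (0, 0), (0, 0), (1, 1), (0, 0)]
print(episode_stats(trace))  # (2.0, 1.5)
```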
Short episode height: low MLP, each request is critical (LAT sensitive)
Tall episode height: high MLP (BW sensitive)
Short episode length: packets are very critical (LAT sensitive)
Long episode length: latency tolerant (could be de-prioritized)
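These four rules can be turned into a small classifier. This is a sketch under stated assumptions: the cut-off values are hypothetical (the slides give none), and the rule "short length or short height means LAT, otherwise BW" is one plausible reading of the slide rather than the authors' exact scheme.

```python
from typing import Tuple

# Hypothetical cut-offs; the slides do not state numeric thresholds.
LENGTH_CUTS = (1000.0, 10000.0)   # cycles: short < 1000 <= medium < 10000 <= long
HEIGHT_CUTS = (2.0, 6.0)          # packets: short < 2 <= medium < 6 <= tall

def to_bin(value: float, cuts: Tuple[float, float],
           names: Tuple[str, str, str]) -> str:
    """Map a value to one of three named bins using two cut-offs."""
    low, high = cuts
    if value < low:
        return names[0]
    if value < high:
        return names[1]
    return names[2]

def classify(avg_length: float, avg_height: float) -> Tuple[str, str, str]:
    """Return (length bin, height bin, 'LAT' or 'BW')."""
    length_bin = to_bin(avg_length, LENGTH_CUTS, ("short", "medium", "long"))
    height_bin = to_bin(avg_height, HEIGHT_CUTS, ("short", "medium", "tall"))
    # Assumption: any "short" dimension marks the application as LAT sensitive.
    sensitivity = "LAT" if "short" in (length_bin, height_bin) else "BW"
    return length_bin, height_bin, sensitivity

print(classify(500.0, 8.0))     # ('short', 'tall', 'LAT')
print(classify(20000.0, 10.0))  # ('long', 'tall', 'BW')
```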
Classification and ranking
Classification: LAT/BW (by episode length: Long / Medium / Short, and episode height: Tall / Medium / Short)
Long length: gems, mcf
Medium length: sphinx, lbm, cactus, xalan omnetpp, apsi leslie ocean, sjbb, sap, bzip, sjas, soplex, tpc art, libq, milc, swim
Short length: sjeng, tonto applu, perl, barnes, gromacs, namd, calculix, gcc, povray, h264, gobmk, hmmer, astar wrf, deal
Ranking: Sensitivity to LAT/BW
Length \ Height    High      Medium    Short
Long               Rank-4    Rank-3    Rank-4
Medium             Rank-2    Rank-2    Rank-3
Short              Rank-1    Rank-2    Rank-1
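The ranking on this slide can be used as a lookup table when ordering packets. This is a sketch: the rank values are transcribed from the slide in the order they appear, and treating Rank-1 as the highest priority is an assumption, as are the function names and the packet tuple format.

```python
from typing import Dict, List, Tuple

# (episode length bin, episode height bin) -> rank, as read from the slide.
RANK: Dict[Tuple[str, str], int] = {
    ("long", "high"): 4,   ("long", "medium"): 3,   ("long", "short"): 4,
    ("medium", "high"): 2, ("medium", "medium"): 2, ("medium", "short"): 3,
    ("short", "high"): 1,  ("short", "medium"): 2,  ("short", "short"): 1,
}

def rank(length_bin: str, height_bin: str) -> int:
    """Look up the rank of an application from its two bins."""
    return RANK[(length_bin, height_bin)]

def order_by_priority(packets: List[Tuple[str, str, str]]) -> List[str]:
    """packets: (packet name, length bin, height bin) of the injecting application.

    Assumption: a lower rank number is served first. sorted() is stable,
    so packets of equal rank keep their arrival order.
    """
    ordered = sorted(packets, key=lambda p: rank(p[1], p[2]))
    return [name for name, _, _ in ordered]

queue = [("a", "long", "high"), ("b", "short", "short"), ("c", "medium", "medium")]
print(order_by_priority(queue))  # ['b', 'c', 'a']
```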
Network design
Configurations compared:
•  1N-128
•  2N-64x256-ST (Steering)
•  2N-64x256-ST-RK (Steering+Ranking)
•  2N-64x256-ST-RK(FS) (Steering+Ranking and Frequency Scaling)
•  1N-256
•  2N-128X128
•  1N-512 (High BW)
•  1N-320 (iso-BW)
•  1N-320(FS) (iso-resource)
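The iso-BW comparison point can be checked with a little arithmetic. This is a sketch that assumes the naming convention "kN-w1xw2" means k networks with link widths w1 and w2 in bits; the slides do not spell the convention out, and `total_link_width` is a hypothetical helper.

```python
import re

def total_link_width(config: str) -> int:
    """Sum of the link widths (in bits) encoded in a configuration name,
    e.g. '2N-64x256-ST' -> 64 + 256."""
    match = re.match(r"^(\d+)N-(\d+(?:[xX]\d+)*)", config)
    if match is None:
        raise ValueError(f"unrecognized configuration name: {config}")
    count = int(match.group(1))
    widths = [int(w) for w in re.split(r"[xX]", match.group(2))]
    if len(widths) != count:
        raise ValueError(f"expected {count} widths in {config}")
    return sum(widths)

for name in ("1N-128", "2N-64x256-ST-RK(FS)", "2N-128X128", "1N-320"):
    print(name, total_link_width(name))
```

Under this reading, the 64b and 256b sub-networks together offer 320 bits of link width, which matches the single 320b network labelled iso-BW.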
Analysis
Performance (25 WL with 50% BW and 50% LAT)
[Figure: weighted speedup of each network design; bar annotations: +18%, +7%, +5%, 5%, w. 2%, w. 2%]
Energy (25 WL with 50% BW and 50% LAT)
[Figure: normalized energy of each network design; bar annotations: -47%, -59%]
Best EDP across all designs
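For reference, the metrics named in this analysis can be computed as follows. This sketch uses the standard definitions of weighted speedup, instruction throughput and energy-delay product; the slides do not define them, so treating these as the definitions used in the evaluation is an assumption.

```python
from typing import Sequence

def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Sum over applications of IPC when sharing the system / IPC when running alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC per application")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def instruction_throughput(ipc_shared: Sequence[float]) -> float:
    """Sum of the IPCs of all applications in the workload."""
    return sum(ipc_shared)

def energy_delay_product(energy: float, delay: float) -> float:
    """EDP: lower is better."""
    return energy * delay

# Hypothetical two-application workload.
print(weighted_speedup([1.0, 0.5], [2.0, 1.0]))  # 1.0
print(instruction_throughput([1.0, 0.5]))        # 1.5
```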
Conclusions
•  Problem: Current-day NoC designs are agnostic to application requirements and are provisioned for the general case or the worst case, although applications have widely differing demands from the network
•  Our goal: To design a NoC that can satisfy the diverse dynamic performance requirements of applications
•  Observation: Applications can be divided into two general classes in terms of their requirements from the network: bandwidth-sensitive and latency-sensitive
- Not all applications are equally sensitive to bandwidth and latency
•  Key idea: Design two NoCs
- Each sub-network is customized for either BW or LAT sensitive applications
- Propose metrics to classify applications as BW or LAT sensitive
- Prioritize applications’ packets within the sub-networks based on their sensitivity
•  Network design: The BW optimized network has wider links but operates at a lower frequency; the LAT optimized network has narrower links but operates at a higher frequency
•  Results: Compared to an iso-resource monolithic network, our proposal improves weighted/instruction throughput by 5%/3% and reduces energy by 31%
Thank you
Q?
Asit Mishra
asit.k.mishra@intel.com
Backup Slides . . .
Other metrics considered for application classification
[Figure: L1 MPKI, L2 MPKI and slack (in cycles) for each application]
Analysis of network episode length and height
[Figure: avg. episode length (in cycles) and avg. episode height (network packets) for each application, with applications grouped as short length/height, medium length/height and long length/high height]
Groups are based on performance scaling sensitivity to bandwidth and frequency
Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
Why 9 clusters?
[Figure: within-group sum of squares versus number of clusters (0 to 35); annotation: 13x]
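A plot of within-group sum of squares against the number of clusters is what an elbow analysis produces. The sketch below assumes k-means clustering, which the slides do not name, with a deterministic farthest-point initialisation chosen only to make the example reproducible; the points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans_wgss(points: Sequence[Point], k: int, iterations: int = 50) -> float:
    """Run Lloyd's k-means and return the within-group sum of squares."""
    # Farthest-point initialisation: start at the first point, then keep
    # adding the point farthest from all centers chosen so far.
    centers: List[Point] = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iterations):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical data: two tight pairs of points far apart.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
for k in (1, 2, 4):
    print(k, kmeans_wgss(pts, k))  # 101.0, then 1.0, then 0.0
```

The number of clusters is usually picked where the curve flattens, that is, where adding a cluster no longer reduces the sum of squares by much.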
Analysis with varying workload combinations
[Figure: WS and IT (norm. to 1N-128 net.) for workload mixes of 0%, 25%, 50%, 75% and 100% BANDWIDTH, i.e. 100%, 75%, 50%, 25% and 0% LATENCY]
Comparison to prior works
[Figure: WS and IT (norm. to 1N-128 net.) for 2N-64x256-ST+RK(FS), 2N-64x256-W-LD-BAL, 2N-128x128-LD-BAL, 1N-128-ST+RK and 1N-128-STC]
Dynamic steering of packets
[Figure: % packets in each sub-network (latency-optimized network, bandwidth-optimized network) for applu, ocean, mcf, omnet, soplx, lbm, sphnx, sap, sjeng, calculix, leslie, hmmer, astar, libq, gcc, namd, barnes and art]
Design: Putting it all together
Logical view of a multicore processor
[Diagram labels: Episode LEN/HT, MUX, Processors/L1$, Network, L2$ and Mem. Controllers]
•  Classify applications based on sensitivity to network BW/LAT
•  Use network episode length/height to dynamically identify apps
•  Design LAT/BW optimized networks
•  Prioritization within networks
Summary
•  A NoC paradigm based on a top-down approach (application demand/requirement analysis)
•  An efficient design paradigm for future heterogeneous multicores
- Big core: latency critical
- Small core: throughput (BW) critical
- GPGPUs: throughput (BW) critical
- Accelerators/ASIC: latency critical (realtime constraints)
Providing all these guarantees in one network is hard
Multiple networks: each customized for one metric
[Diagram labels: Episode LEN/HT/??, MUX]
- Long haul comm.: butterfly/express channels
- Local communication: hybrid/fewer connectivity network
- Power: power efficient links/DVFS router
- Throughput: 1 cycle/high bandwidth
- Latency: 1 cycle/bufferless/faster routers
Share 2D space or 3D layers