Presentation Template

advertisement
A Heterogeneous Multiple
Network-On-Chip Design:
An Application-Aware Approach
Asit K. Mishra
Onur Mutlu
Chita R. Das
Executive summary
• Problem: Current day NoC designs are agnostic to application requirements and are
provisioned for the general case or worst case. Applications have widely differing
demands from the network
• Our goal: To design a NoC that can satisfy the diverse dynamic performance
requirements of applications
• Observation: Applications can be divided into two general classes in terms of their
requirements from the network: bandwidth-sensitive and latency-sensitive
- Not all applications are equally sensitive to bandwidth and latency
• Key idea: Design two NoC
- Each sub-network customized for either BW or LAT sensitive applications
- Propose metrics to classify applications as BW or LAT sensitive
- Prioritize applications’ packets within the sub-networks based on their sensitivity
• Network design: BW optimized network has wider link width but operates at a lower
frequency and LAT optimized network has narrow link width but operates at a higher
frequency
• Results: Our proposal is significantly better when compared to an iso-resource
monolithic network (5%/3% weighted/instruction throughput improvement and 31%
energy reduction)
2
Resource requirements of various applications - I
Impact of channel bandwidth on application performance
• Channel bandwidth affects network latency, throughput and
energy/power
• Increase in channel BW leads to
- Reduction in packet serialization
- Increase in router power
3
Resource requirements of various applications - I
Impact of channel bandwidth on application performance
Simulation settings:
• 8x8 multi-hop packet based mesh network
• Each node in the network has an OoO processor (2GHz), private L1
cache and a router (2GHz)
• Shared 1MB per core shared L2
• 6VC/PC, 2 stage router
4
mcf
gems
omnet
cacts
soplx
sjas
lbm
bzip
sphnx
256b links
xalan
sap
sjbb
swim
hmmer
128b links
milc
astar
gobmk
libq
tonto
64b links
pvray
gcc
h264
namd
6
grmcs
barnes
sjeng
deal
art
wrf
applu
IT (norm. to 64b links)
Resource requirements of various applications - I
Impact of channel bandwidth on application performance
7
512b links
5
4
3
2
1
0
5
Resource requirements of various applications - I
Impact of channel bandwidth on application performance
6
64b links
128b links
256b links
512b links
5
4
3
2
1
mcf
gems
omnet
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hmmer
milc
astar
gobmk
libq
tonto
pvray
gcc
h264
namd
grmcs
barnes
sjeng
deal
art
wrf
0
applu
IT (norm. to 64b links)
7
1. 18/30 (21/36 total) applications’ performance is agnostic to channel BW
(8x BW inc. → less than 2x performance inc.)
6
Resource requirements of various applications - I
Impact of channel bandwidth on application performance
6
64b links
128b links
256b links
512b links
5
4
3
2
1
mcf
gems
omnet
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hmmer
milc
astar
gobmk
libq
tonto
pvray
gcc
h264
namd
grmcs
barnes
sjeng
deal
art
wrf
0
applu
IT (norm. to 64b links)
7
1. 18/30 (21/36 total) applications’ performance is agnostic to channel BW
(8x BW inc. → less than 2x performance inc.)
2. 12/30 (15/36 total) applications’ performance scale with increase in
channel BW (8x BW inc. → at least 2x performance inc.)
7
Resource requirements of various applications - II
Impact of network latency on application performance
• Reduction in router latency (by increasing frequency)
- Reduction in packet latency
- Increase in router power consumption
8
Resource requirements of various applications - II
Impact of network latency on application performance
Simulation settings:
• … same as last experiment
• 128b links
• Added dummy stages (2-cycle and 4-cycle ) to each router
9
Resource requirements of various applications - II
1.1
2-cycle router
1.0
4-cycle router
6-cycle router
0.9
0.8
0.7
mcf
gems
omn…
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hm…
milc
astar
gob…
libq
tonto
pvray
gcc
h264
namd
grmcs
sjeng
deal
art
wrf
0.5
barn…
0.6
applu
IT (norm. to 2-cycle router)
Impact of network latency on application performance
10
Resource requirements of various applications - II
1.1
2-cycle router
1.0
4-cycle router
6-cycle router
0.9
0.8
0.7
mcf
gems
omn…
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hm…
milc
astar
gob…
libq
tonto
pvray
gcc
h264
namd
grmcs
sjeng
deal
art
wrf
0.5
barn…
0.6
applu
IT (norm. to 2-cycle router)
Impact of network latency on application performance
1. 18/30 (21/36 total) applications’ performance is sensitive to network
latency (3x latency reduction → at least 25% performance improvement)
11
Resource requirements of various applications - II
1.1
2-cycle router
1.0
4-cycle router
6-cycle router
0.9
0.8
0.7
mcf
gems
omn…
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hm…
milc
astar
gob…
libq
tonto
pvray
gcc
h264
namd
grmcs
sjeng
deal
art
wrf
0.5
barn…
0.6
applu
IT (norm. to 2-cycle router)
Impact of network latency on application performance
1. 18/30 (21/36 total) applications’ performance is sensitive to network
latency (3x latency reduction → at least 25% performance improvement)
2. 12/30 (15/36 total) applications’ performance is marginally sensitive to
network latency (3x latency increase → less than 15% performance
improvement)
12
mcf
ge…
gems
mcf
o…
cacts
cacts
omnet
soplx
bzip
sp…
soplx
0.5
sjas
0.7
sjas
0.9
lbm
6-cycle router
lbm
bzip
sphnx
4-cycle router
xal…
sap
sjbb
swim
h…
256b links
xalan
sap
sjbb
swim
hmmer
milc
astar
go…
libq
128b links
milc
2-cycle router
astar
gobmk
libq
tonto
pv…
gcc
h264
64b links
tonto
pvray
gcc
h264
1.1
na…
gr…
ba…
sje…
deal
art
wrf
ap…
6
namd
grmcs
barnes
sjeng
deal
art
wrf
IT (norm. to 64b
links)
0
applu
IT (norm. to 2-cycle
router)
Application-aware approach to designing multiple NoCs
512b links
4
2
13
6
64b links
128b links
256b links
512b links
4
mcf
ge…
cacts
cacts
gems
soplx
soplx
o…
sjas
sjas
omnet
lbm
bzip
sp…
xal…
sap
sjbb
swim
h…
milc
4-cycle router
lbm
2-cycle router
astar
go…
libq
tonto
pv…
gcc
h264
1.1
na…
gr…
ba…
sje…
deal
art
wrf
0
ap…
2
6-cycle router
0.9
0.7
mcf
bzip
sphnx
xalan
sap
sjbb
swim
hmmer
milc
astar
gobmk
libq
tonto
pvray
gcc
h264
namd
grmcs
barnes
sjeng
deal
art
wrf
0.5
applu
IT (norm. to 2-cycle
router)
IT (norm. to 64b
links)
Application-aware approach to designing multiple NoCs
Based on the observations:
1. Applications can be classified into distinct classes: typically LAT/BW sensitive
2. LAT sensitive applications can benefit from low network latency
3. BW sensitive applications can benefit from high network bandwidth
4. Not all applications are equally sensitive to either LAT or BW
5. Monolithic network cannot optimize both classes simultaneously
14
mcf
ge…
gems
mcf
o…
cacts
cacts
omnet
soplx
bzip
sp…
soplx
0.5
sjas
0.7
sjas
0.9
lbm
6-cycle router
lbm
bzip
sphnx
4-cycle router
xal…
sap
sjbb
swim
h…
256b links
xalan
sap
sjbb
swim
hmmer
milc
astar
go…
libq
128b links
milc
2-cycle router
astar
gobmk
libq
tonto
pv…
gcc
h264
64b links
tonto
pvray
gcc
h264
1.1
na…
gr…
ba…
sje…
deal
art
wrf
ap…
6
namd
grmcs
barnes
sjeng
deal
art
IT (norm. to 64b
links)
0
wrf
applu
IT (norm. to 2-cycle
router)
Application-aware approach to designing multiple NoCs
512b links
4
2
Solution
Two NoCs where each (sub)network is optimized for either LAT
or BW sensitive applications
15
Design methodology
Logical view of a multicore processor
Processors/L1$
Network
L2$ and Mem. Controllers
16
Design methodology
Logical view of a multicore processor
1
Processors/L1$
1
Network
L2$ and Mem. Controllers
Identify LAT/BW sensitive applications
- Proposes a novel dynamic application classification scheme
17
Design methodology
Logical view of a multicore processor
1
Processors/L1$
1
2
Network
2
L2$ and Mem. Controllers
Identify LAT/BW sensitive applications
- Proposes a novel dynamic application classification scheme
Design sub-networks based on applications’ demand
- This network architecture is better than a monolithic iso-resource
network
18
Design methodology
Logical view of a multicore processor
DEMUX
1
Processors/L1$
1
2
Network
2
L2$ and Mem. Controllers
Identify LAT/BW sensitive applications
- Proposes a novel dynamic application classification scheme
Design sub-networks based on applications’ demand
- This network architecture is better than a monolithic iso-resource
network
19
Outstanding network
packets
Design: Dynamic classification of applications
Application life cycle
time
Network episode Compute episode
20
Outstanding network
packets
Design: Dynamic classification of applications
Application life cycle
time
Network episode Compute episode
• App. has at least one outstanding packet
• Processor is likely stalling → low IPC
21
Outstanding network
packets
Design: Dynamic classification of applications
Application life cycle
time
Network episode Compute episode
• App. has at least one outstanding packet
• Processor is likely stalling → low IPC
• App. has no outstanding packet
• High IPC
22
Application life cycle
Episode length
Episode height
Outstanding network
packets
Design: Dynamic classification of applications
time
Network episode Compute episode
• App. has at least one outstanding packet
• Processor is likely stalling → low IPC
• App. has no outstanding packet
• High IPC
Episode length = Number of consecutive cycles there are net. packets
Episode height = Avg. number of L1 packets injected during an episode
23
Application life cycle
Episode length
Episode height
Outstanding network
packets
Design: Dynamic classification of applications
time
Network episode Compute episode
• App. has at least one outstanding packet
• Processor is likely stalling → low IPC
• App. has no outstanding packet
• High IPC
Short episode ht.: Low MLP, each request is critical (LAT sensitive)
Tall episode ht.: High MLP (BW sensitive)
Short episode len.: Packets are very critical (LAT sensitive)
Long episode len.: Latency tolerant (could be de-prioritized)
24
Classification and ranking
Classification
Tall
Height
Medium
Short
Long
gems, mcf
Length
Medium
sphinx, lbm, cactus, xalan
omnetpp, apsi
leslie
ocean, sjbb, sap, bzip,
sjas, soplex, tpc
art, libq, milc, swim
Short
sjeng, tonto
applu, perl, barnes,
gromacs, namd, calculix,
gcc, povray, h264,
gobmk, hmmer, astar
wrf, deal
Classification: LAT/BW
25
Classification and ranking
Classification
Tall
Height
Medium
Short
Long
gems, mcf
Length
Medium
sphinx, lbm, cactus, xalan
omnetpp, apsi
leslie
ocean, sjbb, sap, bzip,
sjas, soplex, tpc
art, libq, milc, swim
Short
sjeng, tonto
applu, perl, barnes,
gromacs, namd, calculix,
gcc, povray, h264,
gobmk, hmmer, astar
wrf, deal
Classification: LAT/BW
Ranking: Sensitivity to LAT/BW
Ranking
Height
High
Medium
Short
Long
Rank-4
Rank-3
Rank-4
Length
Medium
Rank-2
Rank-2
Rank-3
Short
Rank-1
Rank-2
Rank-1
26
Network design
1N-128
27
Network design
1N-128
2N-64x256-ST
(Steering)
28
Network design
1N-128
2N-64x256-ST
(Steering)
2N-64x256-ST-RK
(Steering+Ranking)
29
Network design
1N-128
2N-64x256-ST
(Steering)
2N-64x256-ST-RK 2N-64x256-ST-RK(FS)
(Steering+Ranking) (Steering+Ranking and
Frequency Scaling)
30
Network design
1N-128
2N-64x256-ST
(Steering)
1N-256
2N-128X128
2N-64x256-ST-RK 2N-64x256-ST-RK(FS)
(Steering+Ranking) (Steering+Ranking and
Frequency Scaling)
1N-512
(High BW)
1N-320
(iso-BW)
1N-320(FS)
(iso-resource)
31
Analysis
Weighted speedup
Performance (25 WL with 50% BW and 50% LAT)
60
50
40
30
20
10
0
32
Analysis
Weighted speedup
Performance (25 WL with 50% BW and 50% LAT)
60
50
40
30
20
10
0
33
Analysis
Weighted speedup
Performance (25 WL with 50% BW and 50% LAT)
60
50
+18%
40
30
20
10
0
34
Analysis
Weighted speedup
Performance (25 WL with 50% BW and 50% LAT)
60
50
+7%
+18%
40
30
20
10
0
35
Weighted speedup
Analysis
Performance (25 WL with 50% BW and 50% LAT)
60
+5%
+7%
+18%
50
40
30
20
10
0
36
Weighted speedup
Analysis
Performance (25 WL with 50% BW and 50% LAT)
60
+5%
5%
+7%
+18%
50
40
30
20
10
0
37
Weighted speedup
Analysis
Performance (25 WL with 50% BW and 50% LAT)
60
+5%
w. 2%
5%
+7%
+18%
50
40
30
20
10
0
38
Weighted speedup
Analysis
Performance (25 WL with 50% BW and 50% LAT)
60
+5%
w. 2%
w. 2%
5%
+7%
+18%
50
40
30
20
10
0
39
Performance (25 WL with 50% BW and 50% LAT)
Energy (25 WL with 50% BW and 50% LAT)
2
60
+5%
w.
2%
w. 2%
5%
+7%
+18%
50
1.6
-47%
40
30
20
Normalized energy
Weighted speedup
Analysis
- 59%
1.2
0.8
10
0.4
0
0
40
Performance (25 WL with 50% BW and 50% LAT)
Energy (25 WL with 50% BW and 50% LAT)
2
60
+5%
w.
2%
w. 2%
5%
+7%
+18%
50
1.6
-47%
40
30
20
Normalized energy
Weighted speedup
Analysis
- 59%
1.2
0.8
10
0.4
0
0
Best EDP across all designs
41
Conclusions
• Problem: Current day NoC designs are agnostic to application requirements and are
provisioned for the general case or worst case. Applications have widely differing
demands from the network
• Our goal: To design a NoC that can satisfy the diverse dynamic performance
requirements of applications
• Observation: Applications can be divided into two general classes in terms of their
requirements from the network: bandwidth-sensitive and latency-sensitive
- Not all applications are equally sensitive to bandwidth and latency
• Key idea: Design two NoC
- Each sub-network customized for either BW or LAT sensitive applications
- Propose metrics to classify applications as BW or LAT sensitive
- Prioritize applications’ packets within the sub-networks based on their sensitivity
• Network design: BW optimized network has wider link width but operates at a lower
frequency and LAT optimized network has narrow link width but operates at a higher
frequency
• Results: Our proposal is significantly better when compared to an iso-resource
monolithic network (5%/3% weighted/instruction throughput improvement and 31%
energy reduction)
42
Q?
Thank
you
Asit Mishra
asit.k.mishra@intel.com
43
Backup Slides . . .
44
120
L1MPKI
L2MPKI
Slack
80
1400
1200
60
1000
800
40
600
20
0
Slack (in cycles)
100
applu
wrf
art
deal
sjeng
barnes
grmcs
namd
h264
gcc
pvray
tonto
libq
gobmk
astar
milc
hmmer
swim
sjbb
sap
xalan
sphnx
bzip
lbm
sjas
soplx
cacts
omnet
gems
mcf
L1/L2 MPKI
Other metrics considered for application classification
2000
1800
1600
400
200
0
45
mcf
gems
mcf
gems
omnet
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hmmer
milc
astar
gobmk
libq
tonto
pvray
gcc
h264
namd
grmcs
barnes
sjeng
deal
art
wrf
applu
10K
omnet
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hmmer
milc
astar
gobmk
libq
tonto
pvray
gcc
h264
namd
grmcs
barnes
sjeng
deal
art
wrf
14
12
10
8
6
4
2
0
applu
Avg. episode length
(in cycles)
6000
5000
4000
3000
2000
1000
0
Avg. episode height
(network packets)
Analysis of network episode length and height
18K
0.3M 0.4M
Short length/height
Medium length/height
Long length/High height
46
Short length/height
Medium length/height
mcf
gems
mcf
gems
omnet
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hmmer
milc
astar
gobmk
libq
tonto
pvray
gcc
h264
namd
grmcs
barnes
sjeng
deal
art
wrf
applu
10K
omnet
cacts
soplx
sjas
lbm
bzip
sphnx
xalan
sap
sjbb
swim
hmmer
milc
astar
gobmk
libq
tonto
pvray
gcc
h264
namd
grmcs
barnes
sjeng
deal
art
wrf
14
12
10
8
6
4
2
0
applu
Avg. episode length
(in cycles)
6000
5000
4000
3000
2000
1000
0
Avg. episode height
(network packets)
Analysis of network episode length and height
18K
0.3M 0.4M
Based on performance scaling sensitivity to
bandwidth and frequency
Long length/High height
47
Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
48
Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
49
Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
50
Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
51
Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
Within group sum of
squares
Why 9 clusters?
225
200
175
150
125
100
75
50
25
0
0
5
10
15
20
Number of clusters
25
30
35
52
Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
Within group sum of
squares
Why 9 clusters?
225
200
175
150
125
100
75
50
25
0
13x
0
5
10
15
20
Number of clusters
25
30
35
53
Empirical results to support the classification
SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications
Within group sum of
squares
Why 9 clusters?
225
200
175
150
125
100
75
50
25
0
0
5
10
15
20
Number of clusters
25
30
35
54
Analysis with varying workload combinations
WS and IT (norm. to 1N-128 net.)
1.5
WS
IT
1.3
1.1
0.9
0.7
0% BANDWIDTH 25% BANDWIDTH 50% BANDWIDTH 75% BANDWIDTH
100%
100% LATENCY 75% LATENCY
50% LATENCY
25% LATENCY BANDWIDTH 0%
LATENCY
55
2N-64x256-ST+RK(FS)
WS
2N-64x256-W-LD-BAL
1.4
2N-128x128-LD-BAL
1N-128-ST+RK
1N-128-STC
WS and IT (norm. to 1N-128 net.)
Comparison to prior works
IT
1.2
1.0
0.8
56
ocean
mcf
omnet
soplx
lbm
sphnx
sap
sjeng
calculix
leslie
hmmer
astar
libq
gcc
namd
barnes
80%
art
applu
% packets in sub-network
Dynamic steering of packets
100%
Latency-optimized network
Bandwidth-optimized network
60%
40%
20%
0%
57
Design: Putting it all together
Logical view of a multicore processor
MUX
Processors/L1$
Network
L2$ and Mem. Controllers
58
Design: Putting it all together
Logical view of a multicore processor
MUX
Processors/L1$
Network
L2$ and Mem. Controllers
Classify applications based on
sensitivity to network BW/LAT
59
Design: Putting it all together
Episode
LEN/HT
Logical view of a multicore processor
MUX
Processors/L1$
Network
L2$ and Mem. Controllers
Classify applications based on
sensitivity to network BW/LAT
Use network episode
length/height to dynamically
identify apps
60
Design: Putting it all together
Episode
LEN/HT
Logical view of a multicore processor
MUX
Processors/L1$
Network
Classify applications based on
sensitivity to network BW/LAT
L2$ and Mem. Controllers
Design LAT/BW optimized
networks
Use network episode
length/height to dynamically
identify apps
61
Design: Putting it all together
Episode
LEN/HT
Logical view of a multicore processor
MUX
Processors/L1$
Network
Classify applications based on
sensitivity to network BW/LAT
L2$ and Mem. Controllers
Design LAT/BW optimized
networks
Use network episode
length/height to dynamically
identify apps
Prioritization within
networks
62
Summary
• A NoC paradigm based on top-down approach (application
demand/requirement analysis)
• An efficient design paradigm for future heterogeneous multicores
63
Summary
• A NoC paradigm based on top-down approach (application
demand/requirement analysis)
• An efficient design paradigm for future heterogeneous multicores
Big core
Latency
critical
Small core
Throughput
(BW) critical
GPGPUs
Throughput
(BW) critical
Accelerators/ ASIC
Latency
critical (realtime
constraints)
64
Summary
• A NoC paradigm based on top-down approach (application
demand/requirement analysis)
• An efficient design paradigm for future heterogeneous multicores
Big core
Latency
critical
Providing all these guarantees
in one network is hard
Small core
Throughput
(BW) critical
GPGPUs
Multiple networks: each
customized for one metric
Throughput
(BW) critical
Accelerators/ ASIC
Latency
critical (realtime
constraints)
65
Summary
• A NoC paradigm based on top-down approach (application
demand/requirement analysis)
• An efficient design paradigm for future heterogeneous multicore m/c
Episode
LEN/HT/??
Long haul comm.
Butterfly/express
channels
MUX
Local communication
Hybrid/ fewer
connectivity
network
Power
Power efficient
links/DVFS router
Throughput
1 cycle/ high
bandwidth
Latency
1 cycle/
bufferless/Faster
routers
Share 2D space or 3D layers
66
Download