A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary • Problem: Current day NoC designs are agnostic to application requirements and are provisioned for the general case or worst case. Applications have widely differing demands from the network • Our goal: To design a NoC that can satisfy the diverse dynamic performance requirements of applications • Observation: Applications can be divided into two general classes in terms of their requirements from the network: bandwidth-sensitive and latency-sensitive - Not all applications are equally sensitive to bandwidth and latency • Key idea: Design two NoC - Each sub-network customized for either BW or LAT sensitive applications - Propose metrics to classify applications as BW or LAT sensitive - Prioritize applications’ packets within the sub-networks based on their sensitivity • Network design: BW optimized network has wider link width but operates at a lower frequency and LAT optimized network has narrow link width but operates at a higher frequency • Results: Our proposal is significantly better when compared to an iso-resource monolithic network (5%/3% weighted/instruction throughput improvement and 31% energy reduction) 2 Resource requirements of various applications - I Impact of channel bandwidth on application performance • Channel bandwidth affects network latency, throughput and energy/power • Increase in channel BW leads to - Reduction in packet serialization - Increase in router power 3 Resource requirements of various applications - I Impact of channel bandwidth on application performance Simulation settings: • 8x8 multi-hop packet based mesh network • Each node in the network has an OoO processor (2GHz), private L1 cache and a router (2GHz) • Shared 1MB per core shared L2 • 6VC/PC, 2 stage router 4 applu 0 mcf gems omnet cacts soplx sjas lbm bzip sphnx 256b links xalan sap sjbb swim hmmer 128b links milc astar gobmk libq tonto 64b links pvray gcc h264 5 namd 6 grmcs barnes sjeng deal art wrf IT (norm. to 64b links) Resource requirements of various applications - I Impact of channel bandwidth on application performance 7 512b links 4 3 2 1 5 Resource requirements of various applications - I Impact of channel bandwidth on application performance 6 64b links 5 128b links 256b links 512b links 4 3 2 mcf gems omnet cacts soplx sjas lbm bzip sphnx xalan sap sjbb swim hmmer milc astar gobmk libq tonto pvray gcc h264 namd grmcs barnes sjeng deal art 0 wrf 1 applu IT (norm. to 64b links) 7 1. 18/30 (21/36 total) applications’ performance is agnostic to channel BW (8x BW inc. → less than 2x performance inc.) 6 Resource requirements of various applications - I Impact of channel bandwidth on application performance 6 64b links 5 128b links 256b links 512b links 4 3 2 mcf gems omnet cacts soplx sjas lbm bzip sphnx xalan sap sjbb swim hmmer milc astar gobmk libq tonto pvray gcc h264 namd grmcs barnes sjeng deal art 0 wrf 1 applu IT (norm. to 64b links) 7 1. 18/30 (21/36 total) applications’ performance is agnostic to channel BW (8x BW inc. → less than 2x performance inc.) 2. 12/30 (15/36 total) applications’ performance scale with increase in channel BW (8x BW inc. → at least 2x performance inc.) 7 Resource requirements of various applications - II Impact of network latency on application performance • Reduction in router latency (by increasing frequency) - Reduction in packet latency - Increase in router power consumption 8 Resource requirements of various applications - II Impact of network latency on application performance Simulation settings: • … same as last experiment • 128b links • Added dummy stages (2-cycle and 4-cycle ) to each router 9 0.5 mcf gems omne cacts soplx sjas lbm bzip sphnx 4-­‐cycle router xalan sap sjbb swim hmm milc astar 2-­‐cycle router gobm libq tonto pvray gcc h264 1.0 namd 1.1 grmcs barne sjeng deal art wrf applu IT (norm. to 2-­‐cycle router) Resource requirements of various applications - II Impact of network latency on application performance 6-­‐cycle router 0.9 0.8 0.7 0.6 10 Resource requirements of various applications - II 1.1 2-­‐cycle router 1.0 4-­‐cycle router 6-­‐cycle router 0.9 0.8 0.7 mcf gems omne cacts soplx sjas lbm bzip sphnx xalan sap sjbb swim hmm milc astar gobm libq tonto pvray gcc h264 namd grmcs barne sjeng deal art 0.5 wrf 0.6 applu IT (norm. to 2-­‐cycle router) Impact of network latency on application performance 1. 18/30 (21/36 total) applications’ performance is sensitive to network latency (3x latency reduction → at least 25% performance improvement) 11 Resource requirements of various applications - II 1.1 2-­‐cycle router 1.0 4-­‐cycle router 6-­‐cycle router 0.9 0.8 0.7 mcf gems omne cacts soplx sjas lbm bzip sphnx xalan sap sjbb swim hmm milc astar gobm libq tonto pvray gcc h264 namd grmcs barne sjeng deal art 0.5 wrf 0.6 applu IT (norm. to 2-­‐cycle router) Impact of network latency on application performance 1. 18/30 (21/36 total) applications’ performance is sensitive to network latency (3x latency reduction → at least 25% performance improvement) 2. 12/30 (15/36 total) applications’ performance is marginally sensitive to network latency (3x latency increase → less than 15% performance improvement) 12 IT (norm. to 2-­‐cycle router) 0.5 omnet gems omnet gems mcf cacts cacts mcf soplx soplx bzip sphnx sjas 0.7 sjas 0.9 lbm 6-­‐cycle router lbm bzip sphnx 4-­‐cycle router xalan sap sjbb swim hmmer 256b links xalan sap sjbb swim hmmer milc astar gobmk libq 128b links milc 2-­‐cycle router astar gobmk libq tonto pvray gcc h264 64b links tonto pvray gcc h264 namd grmcs 6 namd grmcs sjeng deal art wrf barnes 1.1 barnes sjeng deal art wrf applu 0 applu IT (norm. to 64b links) Application-aware approach to designing multiple NoCs 4 512b links 2 13 6 64b links 128b links 256b links 512b links 4 sjas soplx cacts omnet gems sjas soplx cacts omnet gems mcf lbm bzip sphnx xalan sap sjbb swim hmmer milc 4-­‐cycle router lbm 2-­‐cycle router astar libq tonto pvray gcc h264 namd grmcs gobmk 1.1 barnes sjeng deal art wrf 0 applu 2 6-­‐cycle router 0.9 mcf bzip sphnx xalan sap sjbb swim hmmer milc astar gobmk libq tonto pvray gcc h264 namd grmcs barnes sjeng deal art 0.5 wrf 0.7 applu IT (norm. to 2-­‐cycle router) IT (norm. to 64b links) Application-aware approach to designing multiple NoCs Based on the observations: 1. Applications can be classified into distinct classes: typically LAT/BW sensitive 2. LAT sensitive applications can benefit from low network latency 3. BW sensitive applications can benefit from high network bandwidth 4. Not all applications are equally sensitive to either LAT or BW 5. Monolithic network cannot optimize both classes simultaneously 14 IT (norm. to 2-­‐cycle router) 0.5 omnet gems omnet gems mcf cacts cacts mcf soplx soplx bzip sphnx sjas 0.7 sjas 0.9 lbm 6-­‐cycle router lbm bzip sphnx 4-­‐cycle router xalan sap sjbb swim hmmer 256b links xalan sap sjbb swim hmmer milc astar gobmk libq 128b links milc 2-­‐cycle router astar gobmk libq tonto pvray gcc h264 64b links tonto pvray gcc h264 namd grmcs 6 namd grmcs sjeng deal art wrf barnes 1.1 barnes sjeng deal art wrf applu 0 applu IT (norm. to 64b links) Application-aware approach to designing multiple NoCs 4 512b links 2 Solution Two NoCs where each (sub)network is optimized for either LAT or BW sensitive applications 15 Design methodology Logical view of a multicore processor Processors/L1$ Network L2$ and Mem. Controllers 16 Design methodology Logical view of a multicore processor 1 Processors/L1$ 1 Network L2$ and Mem. Controllers Identify LAT/BW sensitive applications - Proposes a novel dynamic application classification scheme 17 Design methodology Logical view of a multicore processor 1 Processors/L1$ Network 2 L2$ and Mem. Controllers 1 Identify LAT/BW sensitive applications - Proposes a novel dynamic application classification scheme 2 Design sub-networks based on applications’ demand - This network architecture is better than a monolithic iso-resource network 18 Design methodology Logical view of a multicore processor DEMUX 1 Processors/L1$ Network 2 L2$ and Mem. Controllers 1 Identify LAT/BW sensitive applications - Proposes a novel dynamic application classification scheme 2 Design sub-networks based on applications’ demand - This network architecture is better than a monolithic iso-resource network 19 Outstanding network packets Design: Dynamic classification of applications Application life cycle time Network episode Compute episode 20 Outstanding network packets Design: Dynamic classification of applications Application life cycle time Network episode Compute episode • App. has at least one outstanding packet • Processor is likely stalling → low IPC 21 Outstanding network packets Design: Dynamic classification of applications Application life cycle time Network episode Compute episode • App. has at least one outstanding packet • Processor is likely stalling → low IPC • App. has no outstanding packet • High IPC 22 Application life cycle Episode length Episode height Outstanding network packets Design: Dynamic classification of applications time Network episode Compute episode • App. has at least one outstanding packet • Processor is likely stalling → low IPC • App. has no outstanding packet • High IPC Episode length = Number of consecutive cycles there are net. packets Episode height = Avg. number of L1 packets injected during an episode 23 Application life cycle Episode length Episode height Outstanding network packets Design: Dynamic classification of applications time Network episode Compute episode • App. has at least one outstanding packet • Processor is likely stalling → low IPC • App. has no outstanding packet • High IPC Short episode ht.: Low MLP, each request is critical (LAT sensitive) Tall episode ht.: High MLP (BW sensitive) Short episode len.: Packets are very critical (LAT sensitive) Long episode len.: Latency tolerant (could be de-prioritized) 24 Classification and ranking Classifica(on Tall Height Medium Short Long gems, mcf Length Medium sphinx, lbm, cactus, xalan omnetpp, apsi leslie ocean, sjbb, sap, bzip, sjas, soplex, tpc art, libq, milc, swim Short sjeng, tonto applu, perl, barnes, gromacs, namd, calculix, gcc, povray, h264, gobmk, hmmer, astar wrf, deal Classification: LAT/BW 25 Classification and ranking Classifica(on Tall Height Medium Short Long gems, mcf Length Medium sphinx, lbm, cactus, xalan omnetpp, apsi leslie ocean, sjbb, sap, bzip, sjas, soplex, tpc art, libq, milc, swim Short sjeng, tonto applu, perl, barnes, gromacs, namd, calculix, gcc, povray, h264, gobmk, hmmer, astar wrf, deal Classification: LAT/BW Ranking: Sensitivity to LAT/BW Ranking Height High Medium Short Long Rank-­‐4 Rank-­‐3 Rank-­‐4 Length Medium Rank-­‐2 Rank-­‐2 Rank-­‐3 Short Rank-­‐1 Rank-­‐2 Rank-­‐1 26 Network design 1N-128 27 Network design 1N-128 2N-64x256-ST (Steering) 28 Network design 1N-128 2N-64x256-ST (Steering) 2N-64x256-ST-RK (Steering+Ranking) 29 Network design 1N-128 2N-64x256-ST (Steering) 2N-64x256-ST-RK 2N-64x256-ST-RK(FS) (Steering+Ranking) (Steering+Ranking and Frequency Scaling) 30 Network design 1N-128 2N-64x256-ST (Steering) 1N-256 2N-128X128 2N-64x256-ST-RK 2N-64x256-ST-RK(FS) (Steering+Ranking) (Steering+Ranking and Frequency Scaling) 1N-512 (High BW) 1N-320 (iso-BW) 1N-320(FS) (iso-resource) 31 Analysis Weighted speedup Performance (25 WL with 50% BW and 50% LAT) 60 50 40 30 20 10 0 32 Analysis Weighted speedup Performance (25 WL with 50% BW and 50% LAT) 60 50 40 30 20 10 0 33 Analysis Weighted speedup Performance (25 WL with 50% BW and 50% LAT) 60 50 +18% 40 30 20 10 0 34 Weighted speedup Analysis Performance (25 WL with 50% BW and 50% LAT) 60 +7% +18% 50 40 30 20 10 0 35 Weighted speedup Analysis Performance (25 WL with 50% BW and 50% LAT) 60 +5% +7% +18% 50 40 30 20 10 0 36 Weighted speedup Analysis Performance (25 WL with 50% BW and 50% LAT) 60 +5% 5% +7% +18% 50 40 30 20 10 0 37 Weighted speedup Analysis Performance (25 WL with 50% BW and 50% LAT) 60 +5% w. 2% 5% +7% +18% 50 40 30 20 10 0 38 Weighted speedup Analysis Performance (25 WL with 50% BW and 50% LAT) 60 +5% w. 2% w. 2% 5% +7% +18% 50 40 30 20 10 0 39 Performance (25 WL with 50% BW and 50% LAT) Energy (25 WL with 50% BW and 50% LAT) 2 60 +5% w. 2% w. 2% 5% +7% +18% 50 1.6 - 47% 40 30 20 Normalized energy Weighted speedup Analysis - 59% 1.2 0.8 10 0.4 0 0 40 Performance (25 WL with 50% BW and 50% LAT) Energy (25 WL with 50% BW and 50% LAT) 2 60 +5% w. 2% w. 2% 5% +7% +18% 50 1.6 - 47% 40 30 20 Normalized energy Weighted speedup Analysis - 59% 1.2 0.8 10 0.4 0 0 Best EDP across all designs 41 Conclusions • Problem: Current day NoC designs are agnostic to application requirements and are provisioned for the general case or worst case. Applications have widely differing demands from the network • Our goal: To design a NoC that can satisfy the diverse dynamic performance requirements of applications • Observation: Applications can be divided into two general classes in terms of their requirements from the network: bandwidth-sensitive and latency-sensitive - Not all applications are equally sensitive to bandwidth and latency • Key idea: Design two NoC - Each sub-network customized for either BW or LAT sensitive applications - Propose metrics to classify applications as BW or LAT sensitive - Prioritize applications’ packets within the sub-networks based on their sensitivity • Network design: BW optimized network has wider link width but operates at a lower frequency and LAT optimized network has narrow link width but operates at a higher frequency • Results: Our proposal is significantly better when compared to an iso-resource monolithic network (5%/3% weighted/instruction throughput improvement and 31% energy reduction) 42 Thank you Q? Asit Mishra asit.k.mishra@intel.com 43 Backup Slides . . . 44 120 0 L1MPKI L2MPKI Slack 80 1400 1200 60 1000 40 800 600 20 Slack (in cycles) 100 applu wrf art deal sjeng barnes grmcs namd h264 gcc pvray tonto libq gobmk astar milc hmmer swim sjbb sap xalan sphnx bzip lbm sjas soplx cacts omnet gems mcf L1/L2 MPKI Other metrics considered for application classification 2000 1800 1600 400 200 0 45 Long length/High height mcf gems mcf gems omnet cacts soplx sjas lbm bzip sphnx xalan sap sjbb swim hmmer milc astar gobmk libq tonto pvray gcc h264 namd grmcs barnes sjeng deal art wrf 10K omnet cacts soplx sjas lbm bzip sphnx xalan sap sjbb swim hmmer milc astar gobmk libq tonto pvray gcc h264 namd grmcs barnes sjeng deal art wrf applu 14 12 10 8 6 4 2 0 applu Avg. episode length (in cycles) 6000 5000 4000 3000 2000 1000 0 Avg. episode height (network packets) Analysis of network episode length and height 18K 0.3M 0.4M Short length/height Medium length/height 46 Short length/height Medium length/height Long length/High height mcf gems mcf gems omnet cacts soplx sjas lbm bzip sphnx xalan sap sjbb swim hmmer milc astar gobmk libq tonto pvray gcc h264 namd grmcs barnes sjeng deal art wrf 10K omnet cacts soplx sjas lbm bzip sphnx xalan sap sjbb swim hmmer milc astar gobmk libq tonto pvray gcc h264 namd grmcs barnes sjeng deal art wrf applu 14 12 10 8 6 4 2 0 applu Avg. episode length (in cycles) 6000 5000 4000 3000 2000 1000 0 Avg. episode height (network packets) Analysis of network episode length and height 18K 0.3M 0.4M Based on performance scaling sensitivity to bandwidth and frequency 47 Empirical results to support the classification SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications 48 Empirical results to support the classification SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications 49 Empirical results to support the classification SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications 50 Empirical results to support the classification SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications 51 Empirical results to support the classification SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications Within group sum of squares Why 9 clusters? 225 200 175 150 125 100 75 50 25 0 0 5 10 15 20 25 Number of clusters 30 35 52 Empirical results to support the classification SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications Within group sum of squares Why 9 clusters? 225 200 175 150 125 100 75 50 25 0 13x 0 5 10 15 20 25 Number of clusters 30 35 53 Empirical results to support the classification SPECjbb (sjbb) as cut-off for BW/LAT sensitive applications Within group sum of squares Why 9 clusters? 225 200 175 150 125 100 75 50 25 0 0 5 10 15 20 25 Number of clusters 30 35 54 Analysis with varying workload combinations WS and IT (norm. to 1N-128 net.) 1.5 WS IT 1.3 1.1 0.9 0.7 0% BANDWIDTH 25% BANDWIDTH 50% BANDWIDTH 75% BANDWIDTH 100% 100% LATENCY 75% LATENCY 50% LATENCY 25% LATENCY BANDWIDTH 0% LATENCY 55 WS and IT (norm. to 1N-128 net.) 0.8 2N-64x256-ST+RK(FS) WS 2N-64x256-W-LD-BAL 1.4 2N-128x128-LD-BAL 1N-128-ST+RK 1N-128-STC Comparison to prior works IT 1.2 1.0 56 0% applu ocean mcf omnet soplx lbm sphnx sap sjeng calculix leslie hmmer astar libq gcc namd barnes 80% art % packets in sub-network Dynamic steering of packets 100% Latency-optimized network Bandwidth-optimized network 60% 40% 20% 57 Design: Putting it all together Logical view of a multicore processor MUX Processors/L1$ Network L2$ and Mem. Controllers 58 Design: Putting it all together Logical view of a multicore processor MUX Processors/L1$ Network L2$ and Mem. Controllers Classify applications based on sensitivity to network BW/LAT 59 Design: Putting it all together Logical view of a multicore processor Episode LEN/HT MUX Processors/L1$ Network L2$ and Mem. Controllers Classify applications based on sensitivity to network BW/LAT Use network episode length/ height to dynamically identify apps 60 Design: Putting it all together Logical view of a multicore processor Episode LEN/HT MUX Processors/L1$ Network Classify applications based on sensitivity to network BW/LAT L2$ and Mem. Controllers Design LAT/BW optimized networks Use network episode length/ height to dynamically identify apps 61 Design: Putting it all together Logical view of a multicore processor Episode LEN/HT MUX Processors/L1$ Network Classify applications based on sensitivity to network BW/LAT L2$ and Mem. Controllers Design LAT/BW optimized networks Use network episode length/ height to dynamically identify apps Prioritization within networks 62 Summary • A NoC paradigm based on top-down approach (application demand/ requirement analysis) • An efficient design paradigm for future heterogeneous multicores 63 Summary • A NoC paradigm based on top-down approach (application demand/ requirement analysis) • An efficient design paradigm for future heterogeneous multicores Big core Latency critical Small core Throughput (BW) critical GPGPUs Throughput (BW) critical Accelerators/ ASIC Latency critical (realtime constraints) 64 Summary • A NoC paradigm based on top-down approach (application demand/ requirement analysis) • An efficient design paradigm for future heterogeneous multicores Big core Latency critical Providing all these guarantees in one network is hard Small core Throughput (BW) critical GPGPUs Multiple networks: each customized for one metric Throughput (BW) critical Accelerators/ ASIC Latency critical (realtime constraints) 65 Summary • A NoC paradigm based on top-down approach (application demand/ requirement analysis) • An efficient design paradigm for future heterogeneous multicore m/c Episode LEN/HT/?? Long haul comm. Butterfly/express channels MUX Local communication Hybrid/ fewer connectivity network Power Power efficient links/DVFS router Throughput 1 cycle/ high bandwidth Latency 1 cycle/ bufferless/Faster routers Share 2D space or 3D layers 66