On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects
George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝, Srinivasan Seshan✝
✝ Carnegie Mellon University   ★ Microsoft Research Asia

What is the On-Chip Network?
• The on-chip network connects the components of a multi-core processor: the cores, GPUs, cache banks, and memory controllers.
[Figure: a 9-core multi-core processor; each CPU has a private L1 cache and a bank of the shared L2 cache.]
• A router at each node forwards packets from a source (S) to a destination (D) over the network links.
[Figure: the same 9-core chip with a router and network links at each node, and a packet traveling from S to D.]

Networking Challenges
• A familiar discussion in the architecture community, e.g.:
- How to reduce congestion
- How to scale the network
- Choosing an effective topology
- Routing and buffer sizing
• All historical problems in our field…

Can We Apply Traditional Solutions?
1. Different constraints: a unique network design
- Bufferless: removing router buffers reduces area by 60% and power by 40%
- Routing: minimal complexity, e.g., X-Y routing for low latency
- Coordination: global coordination is often less expensive on-chip
- Links: cannot be over-provisioned, since link widths are fixed at chip design time
2. Different workloads: a unique style of traffic and flow
- Closed-loop: each core's instruction window limits its in-flight traffic
[Figure: a 3x3 on-chip network, zoomed in on one node: the instruction window (architecture layer) sits above the router (network layer), bounding the packets the node can have outstanding.]

Traffic and Congestion
• Injection is allowed only when an output link is free
• Arbitration: oldest packet first (deadlock- and livelock-free)
• Congestion manifests in two ways:
1. Deflection: arbitration forces a packet onto a non-optimal hop
[Figure: packets from S1 and S2 contend for the top port; age is initialized at injection, the oldest packet wins, and the newest is deflected.]
2. Starvation: a core cannot inject a packet because no output port is free (note: there is no packet loss)
• Definition: the starvation rate is the fraction of cycles in which a core is starved
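The arbitration and starvation mechanics above are simple enough to sketch. Below is a minimal, illustrative C model of oldest-first arbitration with deflection at one router, plus the starvation-rate bookkeeping from the definition; the types and names (Flit, arbitrate) are ours, not the paper's simulator, and a real router does this in parallel hardware rather than with a sorting loop.

```c
/* Minimal sketch of oldest-first deflection arbitration at one router.
 * Assumptions (ours, not the paper's simulator): a 2D-mesh router with
 * four output ports, single-flit packets, and at most NPORTS bidders. */
#include <stdio.h>
#include <stdbool.h>

#define NPORTS 4                /* N, E, S, W output ports */

typedef struct {
    int age;                    /* cycles since injection; oldest wins */
    int preferred_port;         /* productive (X-Y) hop toward the destination */
} Flit;

/* Oldest-first arbitration: the oldest flit gets its productive port;
 * losers are deflected to any free port.  Because the oldest packet in
 * the network always makes progress, the scheme is dead/live-lock free. */
void arbitrate(Flit in[], int n, int assigned_port[]) {
    int order[NPORTS];
    bool taken[NPORTS] = { false };

    for (int i = 0; i < n; i++) order[i] = i;
    /* rank bidders by age, oldest first (n is tiny, so selection sort) */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (in[order[j]].age > in[order[i]].age) {
                int t = order[i]; order[i] = order[j]; order[j] = t;
            }

    for (int i = 0; i < n; i++) {
        int p = in[order[i]].preferred_port;
        if (taken[p])                       /* deflection: non-optimal hop */
            for (p = 0; taken[p]; p++) ;    /* n <= NPORTS, so one is free */
        taken[p] = true;
        assigned_port[order[i]] = p;
    }
}

int main(void) {
    /* two flits contending for the same (top) output port */
    Flit in[2] = { { .age = 9, .preferred_port = 0 },
                   { .age = 2, .preferred_port = 0 } };
    int port[2];
    arbitrate(in, 2, port);
    printf("oldest -> port %d, newest -> port %d (deflected)\n",
           port[0], port[1]);

    /* Starvation rate, per the definition above: the fraction of cycles
     * in which a core wanted to inject but no output port was free.
     * (Counter values here are made up for illustration.) */
    long starved_cycles = 370, total_cycles = 1000;
    printf("starvation rate = %.2f\n", (double)starved_cycles / total_cycles);
    return 0;
}
```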
Outline
• Bufferless on-chip networks: a congestion & scalability study
- Congestion at the network and application layers
- Impact of congestion on scalability
• A novel application-aware congestion control mechanism
• Evaluation of the congestion control mechanism
- Able to effectively scale the network

Congestion and Scalability Study
• Prior work: moderate-intensity workloads on small on-chip networks
- Energy and area benefits of going bufferless
- Throughput comparable to buffered designs
• This study: high-intensity workloads and large networks (up to 4096 cores)
- Is throughput still comparable, with the benefits of bufferless?
• We use real application workloads (e.g., matlab, gcc, bzip2, perl) and simulate the whole chip, including the NoC

Congestion at the Network Level
• We evaluate 700 different application mixes in a 16-core system; each data point is a single workload
• Finding: unlike in traditional networks, average network latency remains stable under congestion and deflection: the increase is only ~5-6 cycles (25%)
[Figure: average network latency (cycles) vs. average network utilization.]
• What about the starvation rate? Starvation increases significantly with congestion: a 700% increase
• Finding: starvation is likely to impact performance, and it is the indicator of congestion
[Figure: average starvation rate vs. average network utilization.]

Congestion at the Application Level
• Define system throughput as the sum of the instructions-per-cycle (IPC) of all applications in the system: throughput = Σ_i IPC_i
• Experiment: unthrottle the applications in a single workload
• Finding 1: throughput decreases under congestion; the unthrottled system is sub-optimal
• Finding 2: the self-throttling of cores prevents congestion collapse; throughput does not collapse
• Finding 3: static throttling can provide some gain (e.g., 14%), but we will show up to 27% gain with application-aware throttling
[Figure: system throughput for throttled vs. unthrottled runs as congestion grows.]

Impact of Congestion on Scalability
• Prior work studied 16-64 cores; our work scales up to 4096 cores
• As we increase the system's size:
- The starvation rate increases: a core can be starved for up to 37% of all cycles
- Per-node throughput decreases: up to a 38% reduction
[Figure: starvation rate and throughput (IPC/node) vs. number of cores, 16 to 4096.]

Summary of Congestion Study
• Network congestion limits scalability and performance
- Due to the starvation rate, not increased network latency
- The starvation rate is the indicator of congestion in an on-chip network
• The self-throttling nature of cores prevents congestion collapse
• Throttling reduced congestion and improved performance: the motivation for congestion control
• Congestion control must be application-aware

Developing a Congestion Controller
• Traditional congestion controllers are designed to:
- Improve network efficiency
- Maintain fairness of network access
- Provide stability (and avoid collapse)
- Operate in a distributed manner
• When considering the on-chip network, a controller must also:
- Have minimal complexity
- Be area-efficient
- Be application-aware, as we show
• In the paper: the controller can be global and simple

Need for Application Awareness
• Throttling reduces congestion and improves system throughput, but under congestion, which core should be throttled?
• Experiment: in a 16-core system, alternately throttle each of two applications (gromacs and mcf) at a 90% throttling rate
• Finding 1: which application is throttled impacts system performance: average instruction throughput changes by +21% in one case and -9% in the other
• Finding 2: an application's own throughput does not dictate whom to throttle
• Finding 3: applications respond differently to an increase in network throughput: unlike gromacs (+33%), mcf barely gains (<1%)
[Figure: average instruction throughput for three cases: baseline, throttle gromacs, throttle mcf.]
• Unlike traditional congestion controllers (e.g., TCP), an on-chip controller cannot be application-agnostic (see the detector sketch below)
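To make these requirements concrete, here is a hedged sketch of what a global, interval-based congestion detector could look like: each core reports its starved-cycle count, and the controller compares the chip-wide average starvation rate, the congestion indicator identified earlier, against a threshold before selecting whom to throttle. The interval length, threshold, and counter plumbing are illustrative assumptions, not the paper's parameters.

```c
/* Hedged sketch of a global, interval-based congestion detector.
 * Assumptions: NCORES cores each expose a starved-cycle counter per
 * control interval; the interval length and threshold are made up. */
#include <stdio.h>

#define NCORES   16
#define INTERVAL 100000L            /* cycles per control interval */

/* Starvation rate, not latency, indicates congestion in a bufferless
 * NoC, so the controller reacts to the chip-wide average of it. */
int congested(const long starved[], double threshold) {
    long total = 0;
    for (int i = 0; i < NCORES; i++)
        total += starved[i];
    double avg_rate = (double)total / ((double)NCORES * INTERVAL);
    return avg_rate > threshold;
}

int main(void) {
    long starved[NCORES] = { 0 };
    starved[3] = 40000;             /* two heavily starved cores */
    starved[7] = 25000;

    if (congested(starved, 0.01))
        /* Whom to throttle cannot be read off network-layer state alone
         * (the wrong app-agnostic choice cost 9% above); the next step
         * uses the application-layer IPP signal introduced below. */
        printf("congested: choose the throttle set using per-app IPP\n");
    else
        printf("not congested\n");
    return 0;
}
```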
Instructions-Per-Packet (IPP)
• Key insight: not all packets are created equal; an application with more L1 misses needs more traffic to make the same progress
• Define instructions-per-packet as IPP = I/P: instructions retired per network packet injected
• IPP depends only on the L1 miss rate; its phase behavior, on the order of millions of cycles, is independent of the level of congestion and the rate of execution
• A low IPP value means the application needs many packets to make progress
[Figure: IPP over execution (millions of cycles) for deal.II and xml, showing multi-million-cycle phases.]
• Since the L1 miss rate varies over execution, IPP is dynamic, so throttling must be dynamic too
• Throttling during a high-IPP phase will hurt performance
• IPP provides the application-layer insight needed to decide whom to throttle:
- When congested: throttle applications with low IPP
- Fairness: scale each application's throttling rate by its IPP; details in the paper show that throttling remains fair
• A sketch of this policy follows below
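A minimal sketch of IPP-scaled throttling as described above, assuming per-core counters of retired instructions and injected packets per control interval. The linear scaling function, the IPP cutoff, and the sample counter values for mcf and gromacs are illustrative assumptions; the paper's exact scaling rule is not reproduced here.

```c
/* Sketch of IPP-scaled throttling.  Assumptions: per-core counters of
 * instructions retired and packets injected per interval; the scaling
 * function, cutoff, and sample values below are illustrative only. */
#include <stdio.h>

typedef struct {
    const char *name;
    long instructions;      /* retired this interval */
    long packets;           /* injected this interval (tracks L1 misses) */
} App;

/* IPP = I/P.  A low value means the app needs many packets per
 * instruction, so throttling it frees capacity at little cost to it. */
double ipp(const App *a) {
    return a->packets ? (double)a->instructions / (double)a->packets : 1e9;
}

/* Throttle low-IPP apps hardest and leave high-IPP apps alone; since
 * IPP is dynamic (it follows the L1 miss rate), this is recomputed
 * every control interval. */
double throttle_rate(const App *a, double max_rate, double ipp_cutoff) {
    double v = ipp(a);
    if (v >= ipp_cutoff)
        return 0.0;                            /* high-IPP phase: don't hurt it */
    return max_rate * (1.0 - v / ipp_cutoff);  /* scale the rate by IPP */
}

int main(void) {
    /* counter values are made-up examples, not measured mcf/gromacs data */
    App apps[] = { { "mcf",     1000000, 250000 },   /* low IPP:  4  */
                   { "gromacs", 1000000,  12500 } }; /* high IPP: 80 */
    for (int i = 0; i < 2; i++)
        printf("%-8s IPP = %5.1f  throttle = %4.1f%%\n", apps[i].name,
               ipp(&apps[i]),
               100.0 * throttle_rate(&apps[i], 0.90, 50.0));
    return 0;
}
```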
Evaluation of Improved Efficiency
• We evaluate with 875 real workloads (700 16-core, 175 64-core), generating a balanced set of CMP workloads (as in cloud computing)
• System throughput improves by up to 27% under congested workloads
• Non-congested workloads are not degraded: only 4 of the 875 workloads see performance reduced by more than 0.5%
• Applications are not unfairly throttled down (shown in the paper)
[Figure: % improvement in system throughput vs. baseline average network utilization with no congestion control.]

Evaluation of Improved Scalability
• Comparison points: a buffered network (expensive in area and power) and the baseline bufferless network (which does not scale)
• Contribution: keep the area and power benefits of bufferless while achieving performance comparable to buffered
• Application-aware throttling yields an overall reduction in congestion
• Power consumption is reduced through the increase in network efficiency
[Figure: throughput (IPC/node) vs. number of cores (16 to 4096) for baseline bufferless, buffered, and throttled bufferless; % reduction in power consumption vs. number of cores.]
• Many other results in the paper, e.g., fairness, starvation, latency…

Summary of Study, Results, and Conclusions
• We highlighted a traditional networking problem in a new context, whose unique design requires a novel solution
• We showed that congestion limits efficiency and scalability, and that the self-throttling nature of cores prevents collapse
• Our study showed that congestion control for this network must be application-aware
• Our application-aware congestion controller provides:
- A more efficient network layer (reduced latency)
- Improvements in system throughput (up to 27%)
- Effective scaling of the CMP (shown for up to 4096 cores)

Discussion
• Congestion is just one of many similarities with traditional networking; more discussion in the paper, e.g.:
- Traffic engineering: "hotspots"
- Data centers: multi-threaded workloads with a similar topology, dynamic routing & computation
- Coding: "XORs In-The-Air" adapted to the on-chip network, i.e., instead of deflecting one of two contending packets, XOR them and forward the combination over the optimal hop (toy illustration below)
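The XOR idea reduces to the identity (p1 ^ p2) ^ p1 == p2: a node that already holds (or overheard) one of the two packets can recover the other from the coded combination, so both payloads cross the contended hop at once instead of one being deflected. A toy illustration with arbitrary payload values:

```c
/* Toy illustration of XOR-combining two contending packets instead of
 * deflecting one of them.  Payload values and framing are arbitrary. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t p1 = 0xCAFEF00Du, p2 = 0x0C0FFEE0u;

    uint32_t coded = p1 ^ p2;       /* both payloads cross the hop at once */

    /* a receiver that already knows p1 decodes p2 (and vice versa) */
    uint32_t decoded = coded ^ p1;
    printf("recovered p2 = 0x%08X (%s)\n", (unsigned)decoded,
           decoded == p2 ? "ok" : "FAIL");
    return 0;
}
```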