On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects
George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝, Srinivasan Seshan✝
✝ Carnegie Mellon University   ★ Microsoft Research Asia
Presenter: Zhi Liu

What is the On-Chip Network?
• A multi-core processor (e.g., 9-core) places a CPU and private L1 cache at each node, with shared L2 cache banks and memory controllers
• The nodes are connected by on-chip routers and network links
(Figure: 9-core processor with per-core CPU + L1, L2 cache banks, memory controllers, routers, and network links.)

Networking Challenges
• The on-chip network faces problems well known to the architecture community, e.g.:
  - reduce congestion
  - scale the network
  - choose an effective topology
  - routing and buffer sizes

Characteristics of Bufferless NoCs
• Routers: a unique network design with minimal complexity: X-Y routing, low latency, and no packet buffers (saving router area and power)
• Traffic: a unique style of traffic and flow control; traffic is generated per core and bounded by the instruction window (closed-loop)
• Coordination: global coordination is often practical because the topology is known
• Links: links cannot be over-provisioned
(Figure: 3x3 on-chip network with a zoomed-in router; per-core architecture showing the instruction window, router, and network links.)

Traffic and Congestion
• Injection: a core can inject a packet only when an output link is free
• Arbitration: oldest packet first (deadlock- and livelock-free)
• Congestion manifests in two ways:
  1. Deflection: arbitration forces a non-optimal hop; when two packets contend for the same port, the oldest wins and the newer one is deflected (packet age is initialized at injection)
  2. Starvation: a core cannot inject a packet because no output port is free (there is no packet loss)
• Definition: the starvation rate is the fraction of cycles in which a core is starved

Outline
• Bufferless On-Chip Networks: Congestion & Scalability
  - Study of congestion at network and application layers
  - Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of congestion control mechanism

Congestion and Scalability Study
• Prior work: moderate-intensity workloads on small on-chip networks
  - Energy and area benefits of going bufferless
  - Throughput comparable to buffered designs
• This study: high-intensity workloads and large networks (up to 4096 cores)
  - Is throughput still comparable while keeping the benefits of bufferless?
• Methodology: real application workloads (e.g., matlab, gcc, bzip2, perl)

Congestion at the Network Level
• Evaluated 700 different workloads (high, medium, and low intensity) in a 16-core system
• Finding: network latency remains stable as congestion and deflections increase, unlike in traditional networks
(Figure: average network latency (cycles) vs. average network utilization; the latency increase under congestion is only ~5-6 cycles, roughly 25%.)
• What about starvation rate?
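As a concrete illustration of the starvation-rate metric defined earlier, the sketch below (hypothetical Python, not the simulator used in the study) counts a node as starved in any cycle in which it has a packet to inject but no output port is free, and reports the fraction of such cycles.

```python
# Illustrative sketch (not the paper's simulator): tracking per-node
# starvation rate in a bufferless NoC, where a packet may be injected
# only when at least one output port is free in that cycle.

class NodeStats:
    def __init__(self):
        self.starved_cycles = 0   # cycles with a pending packet but no free port
        self.total_cycles = 0

    def tick(self, wants_to_inject: bool, free_output_ports: int) -> None:
        self.total_cycles += 1
        if wants_to_inject and free_output_ports == 0:
            self.starved_cycles += 1

    def starvation_rate(self) -> float:
        # Fraction of cycles in which the core could not inject.
        return self.starved_cycles / self.total_cycles if self.total_cycles else 0.0
```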
Congestion at the Network Level (continued)
• Starvation rate increases significantly with congestion (a ~700% increase, versus only ~5-6 cycles of added latency)
• Finding: starvation, not latency, is likely to impact performance; starvation rate is the better indicator of congestion
(Figure: average starvation rate vs. average network utilization; each point represents a single workload.)

Congestion at the Application Level
• Define system throughput as the sum of the instructions-per-cycle (IPC) of all applications in the system
• Experiment: unthrottle the applications in a single high-intensity workload in a 4x4 system
• Finding 1: throughput decreases under congestion; the unthrottled throughput is sub-optimal
• Finding 2: throughput does not collapse, because the self-throttling of cores prevents collapse
• Finding 3: static throttling can provide some gain (e.g., 14%)

Impact of Congestion on Scalability
• Prior work: 16-64 cores; our work: up to 4096 cores
• As system size increases (high load, locality model, avg_hop = 1):
  - Starvation rate increases; a core can be starved for up to 37% of all cycles!
  - Per-node throughput decreases with system size, by up to 38%
(Figure: starvation rate and throughput (IPC/node) vs. number of cores, 16 to 4096.)

Summary of Congestion Study
• Network congestion limits scalability and performance
  - Due to the rising starvation rate, not increased network latency
  - Starvation rate is the indicator of congestion in the on-chip network
• The self-throttling nature of cores prevents congestion collapse
• Throttling reduced congestion and improved performance
• Motivation: congestion control should be application-aware

Outline
• Bufferless On-Chip Networks: Congestion & Scalability
  - Study of congestion at network and application layers
  - Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of congestion control mechanism

Developing a Congestion Controller
• Traditional congestion controllers are designed to:
  - Improve network efficiency
  - Maintain fairness of network access
  - Provide stability (and avoid collapse)
  - Operate in a distributed manner
• When considering an on-chip network, a controller must also:
  - Have minimal complexity and be area-efficient (a local and simple controller)
  - And, as we show, be application-aware
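The actuator used throughout is injection throttling: limiting how often a core may inject new packets into the network. The minimal sketch below (an assumed duty-cycle-style gate; the slides do not prescribe this exact mechanism) shows the idea, and the question taken up next is which application to apply it to.

```python
# Illustrative injection throttle: a core with throttle rate r is blocked
# from injecting in roughly a fraction r of cycles. This is a sketch of
# the general idea, not the paper's exact mechanism.

import random

class InjectionThrottle:
    def __init__(self, throttle_rate: float = 0.0):
        # throttle_rate = 0.0 -> never blocked; 0.9 -> blocked in ~90% of cycles
        self.throttle_rate = throttle_rate

    def may_inject_this_cycle(self) -> bool:
        return random.random() >= self.throttle_rate
```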
Need For Application Awareness
• Throttling reduces congestion and improves system throughput
  - Under congestion, which core should be throttled?
• Experiment: a 16-core system with 8 instances each of two applications (Gromacs and MCF), alternately throttling one application at 90%
• Finding 1: which application is throttled impacts system performance
(Figure: average instruction throughput with no throttling, throttling Gromacs, and throttling MCF; throttling MCF improves system throughput by ~21%, while throttling Gromacs degrades it by ~9%.)
• Finding 2: an application's own throughput does not dictate whom to throttle
• Finding 3: different applications respond differently to an increase in network throughput
• Implication: unlike traditional controllers (e.g., TCP), an on-chip congestion controller cannot be application-agnostic

Instructions-Per-Packet (IPP)
• Key insight: not all packets are created equal; some applications take more L1 misses, and therefore more traffic, to make the same progress
• Define instructions-per-packet: IPP = instructions / packets
  - IPP depends only on the L1 miss rate
  - It is independent of the level of congestion and of the execution rate
  - A low IPP value means the application injects more packets into the network per unit of progress
  - IPP therefore provides a stable measure of an application's current network intensity
• Example: MCF's IPP is 0.583; Gromacs' IPP is 12.41
(Figure: IPP of deal.II and xml over execution cycles, in millions of cycles.)
• Since the L1 miss rate varies over execution, IPP is dynamic
  - Throttling during a "high"-IPP phase will hurt performance
  - Throttling must therefore be dynamic
  - Application-layer insight into whom to throttle: throttle applications with low IPP, and scale the throttling rate by the application's IPP

Summary of the Congestion Control Algorithm
• Run the congestion control algorithm every 100,000 cycles
• Detect congestion based on starvation rates
• Determine the IPP of each application
• Throttle applications with low IPP, scaling the throttling rate by the application's IPP
  - More throttling for applications with lower IPP
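The interval-based controller summarized above can be written down compactly. The sketch below is an illustrative reconstruction from the slide's bullets: the 100,000-cycle interval and the use of starvation rate and IPP come from the slides, while the detection threshold, the low-IPP cutoff, and the exact throttling-rate formula are assumptions for illustration only.

```python
# Illustrative sketch of the application-aware controller: every 100,000
# cycles, detect congestion from starvation rates, then throttle the
# low-IPP (network-intensive) applications, throttling more aggressively
# the lower their IPP. Threshold, cutoff, and scaling formula are assumed
# examples, not the paper's exact functions.

INTERVAL_CYCLES = 100_000
STARVATION_THRESHOLD = 0.05   # assumed congestion-detection threshold
LOW_IPP_CUTOFF = 10.0         # assumed boundary for "low IPP" applications

def control_interval(apps):
    """apps: list of dicts with per-interval 'starvation_rate',
    'instructions', and 'packets' counters; sets 'throttle_rate'."""
    congested = max(a["starvation_rate"] for a in apps) > STARVATION_THRESHOLD
    for a in apps:
        ipp = a["instructions"] / max(a["packets"], 1)   # instructions-per-packet
        if congested and ipp < LOW_IPP_CUTOFF:
            # Lower IPP -> more network-intensive -> throttle harder.
            a["throttle_rate"] = min(0.95, 1.0 / (1.0 + ipp))
        else:
            a["throttle_rate"] = 0.0
```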
Outline
• Bufferless On-Chip Networks: Congestion & Scalability
  - Study of congestion at network and application layers
  - Impact of congestion on scalability
• Novel application-aware congestion control mechanism
• Evaluation of congestion control mechanism

Evaluation of Improved Efficiency
• Evaluated with 875 real workloads (700 16-core, 175 64-core), generated as a balanced set of CMP workloads (cloud computing)
• Improvement of up to 27% in system throughput under congested workloads
• Does not degrade non-congested workloads: only 4 of the 875 workloads see performance reduced by more than 0.5%
• Does not unfairly throttle applications down (details in the paper)
(Figure: % improvement in system throughput vs. baseline average network utilization with no congestion control.)

Evaluation of Improved Scalability
• Baseline bufferless: does not scale
• Buffered: expensive in area and power
• Contribution: keep the area and power benefits of bufferless while achieving performance comparable to buffered
(Figure: throughput (IPC/node) vs. number of cores, 16 to 4096, for baseline bufferless, buffered, and throttling bufferless designs.)
• Application-aware throttling also reduces power consumption: the overall reduction in congestion increases network efficiency
(Figure: % reduction in power consumption vs. number of cores, 16 to 4096.)

Summary of Study, Results, and Conclusions
• Highlighted a traditional networking problem in a new context
  - The unique design requires a novel solution
• Showed that congestion limits efficiency and scalability, and that the self-throttling nature of cores prevents collapse
• The study showed that congestion control would require application-awareness
• Our application-aware congestion controller provides:
  - A more efficient network layer (reduced latency)
  - Improvements in system throughput (up to 27%)
  - Effective scaling of the CMP (shown for up to 4096 cores)