Understanding Networks-on-Chip: A Decade and Beyond Radu Marculescu Carnegie Mellon University Pittsburgh, PA 15213, USA NOCS, Tempe, AZ April 23, 2013 Multicores are critical to IT revolution Embedded computing (cell phones, automotive, gaming) Cyber-physical systems (healthcare, transportation, smart grid) High performance computing (scientific applications, forecasting, data centers) 2 Multicore platforms are large scale distributed systems at nanoscale; they are dominated by communication costs Last level cache (LLC) Memory controller (MC) & channels I/O controller(s) QPI controller, Power control unit (PCU), etc. PCU Memory Controller Core* C R LLC C Packet LLC C R LLC LLC C C MC C R Q P I R LLC C R R LLC C R LLC R LLC C C LLC R P C I e C MC [U. Ogras, Tutorial ASPLOS 2012] 3 Need to understand the behavior of thousand core systems. Network (routers+links) is the missing link in understanding. “Knowing in part may make a fine tale, but wisdom comes from seeing the whole. “ The rest of this presentation focuses on this “whole”, i.e., a few emerging ideas from a decade of NoC research Structure Architecture and small world effects Dynamics Workloads and multiscale behavior Control Power and resource management 5 Our first insight into communication-based design came through architecture (topology, buffer, etc.) optimization R R R R R R R R R R R R R R R R R [K. Srinivasan et al. IEEE T-VLSI 2006] [Bolotin et al. DATE 2007] 6 [J. Kim et al. ISCA 2007] Can induce small-world effects in regular NoCs. This brings huge performance improvements Completely structured Complete graph Customized via LRL Longrange link m N nodes n b(n) = D(m, n) = m + n − 2 b(n) = D′(m, n) < m + n − 2 b(n) = log 2 N b(n): minimum number of broadcasting steps D: network diameter 7 This way, the fundamental idea of Small World networks (aka “six degrees of separation”) enters the multicore world Small world effects can also be exploited to reduce hop count in 3D wireless NoCs and improve performance Root Chip 3 0 1 2 3 4 5 6 Root’ 7 Chip 2 VC0 Chip 1 UltraSPARC L1 cache (I & D) L2 cache bank [H. Matsutani et al. ASPDAC 2013] 8 VC1 Chip 0 Wired and wireless NoCs can be used intra-chip, while interchip communication is based on wireless inductive-coupling. Flow-control mechanisms, reconfigurability, adaptive routing have all been used to improve performance EVCs Regional hub router [Q. Zhiliang et al. CODES-ISSS 2012] 9 Bi-channels One way or another, they all exploit small world effects… Optimum path-seeking behavior may be goal-oriented. Need to collect fitness information locally, at routing nodes Channel 2 Channel 2 H2 H2 Current channel Channel 1 fitnessi = α ⋅ min ⋅ win−1 m = avg. number of free slots w = avg. waiting time in channel 10 Channel 1 H1 1 2 1 1 1 3 5 2 2 The rest of this presentation focuses on this “whole”, i.e., a few emerging ideas from a decade of NoC research Structure Architecture and small world effects Dynamics Workloads and multiscale behavior Control Power and resource management 12 Packet inter-arrival times at interface queues play a fundamental part in network behavior Buffer 2 Buffer 1 PE 13 11 clock cycles Time Inter-event times Cumulative events PE Non-Markovian behavior Markovian behavior # events Fractals are geometrical objects or stochastic processes displaying self-similar behavior over multiple scales. Real world processes are not always smooth. Fractional dynamics needs a formalism stronger than integer calculus Integer calculus approximation f(t) Fractional behavior t f(t) Impact of power law memory kernel t f(t) = dαt /dtα Mandelbrot formalism (1975) H Δx ∝ Δt , 0 < H < 1 d α x(t ) dt α = lim Δt →0 1 Δt α [ t Δt ] α (−1) j x(t − jΔt ) j j =0 t 14 Main idea is to exploit workload variations and optimize system behavior (minimize power, latency, etc.) Buffer 1 Buffer 2 Buffer 3 PE queue dynamics (traffic variability) PE d α k xk (t ) dt α k utilization = a k (t )xk (t ) + bk (t ) f j (t ) − ck (t ) f l (t ) Write/Read x ( k ) = x ( k − 1) + Tλ1 ( k − 1) − Tμ1 ( k − 1) 15 Inter-arrival times dictate the fractal exponent (αk) of state equations; this is used to characterize the system dynamics. Performance analysis at high injection rates needs a nonequilibrium approach that accounts for fractal behavior 16 High injection rates cause inter-arrival times deviate from exponential distribution and exhibit power law correlations. Network understanding needs to evolve from deterministic and stochastic, to statistical physics type of approaches Deterministic Stochastic [P. Bogdan et al. FTnEDA 2009] 17 System optimization and resource management becomes all about time-varying dynamic processes running over networks. The rest of this presentation focuses on this “whole”, i.e., a few emerging ideas from a decade of NoC research Structure Architecture and small world effects Dynamics Workloads and multiscale behavior Control Power and resource management 18 8-Core Xeon® Processor has three clock domains and three voltage domains that help minimizing power. Core Domain (MCLK) Un-Core Domain (UCLK) QPI QPI I/O Domain (QCLK) QPI QPI QPI C3 QPI QPI QPI C4 Core Supply C2 Core Supply I/O Domain 1.1V fixed C5 Fuse C1 Un-Core Domain 0.9-1.1V fixed C6 C0 C7 SMI SMI Filter PLL IO DLLs IO Un - core PLL Core Core Supply Uncore Supply SMI BCLK PLLs Core Supply Core Domain 0.85-1.1V variable SMI PLLs Largest device count reported for a microprocessor 2.3B transistors. 19 [Rusu, ISSCC 2009] Fine-grain power management becomes possible by exploiting workload variations Globally asynchronous, locally synchronous (GALS) Clock domain 1 Volt./Freq. Island VFI 1 (V1, f1) Clock domain 2 VFI 2 (V2, f2) 20 Fine-grain power management can be implemented via control-theoretic approaches Applications (workload) Physical constraints + _ VoltageFrequency Controller ∑ V1 , f1 ∈ [ f1min , f1max ] … Utilization (reference) values for interface FIFOs VN r , f N r ∈ [ X ref = [ x1ref x2ref ... x refq ]T Nr VFI 3 f Nmin , r VFI 2 f Nmax ] r V/F interface FIFO d αk xk dt α k 21 VFI 1 (V1, f1, Vt1) ( = G xk , f j , f l , t ) State feedback Actual utilization of interface FIFOs X = [ x1 x2 ... x N q ]T r xkmin ≤ xk ≤ xkmax Workload characteristics and application deadlines are critical. Also, control signals (voltage, frequency) need to be constrained. V/F controller selects the minimum operating frequencies s.t. the queues reference values are satisfied Computational workload d yi (t ) αi αi Traffic variability d = ai (t ) yi (t ) + bi (t ) f i (t ) − ci (t ) f j (t ), x 4j xj yi (t ) xk (t ) (i,j)-th VFI for PE x1j (t ) PE S PE (i,j-1)th VFI Router Pdyn, Pleak, PT Write/Read = a k (t )xk (t ) + bk (t ) f j (t ) − ck (t ) f l (t ), dt α k 0 ≤ xkmin ≤ xk (t ) ≤ xkmax ≤ 1, k = 1 ÷ 4, j , l = 1 ÷ N r dt 0 ≤ yimin ≤ yi (t ) ≤ yimax ≤ 1, i = 1 ÷ N PE (i,j-1)th VFI for PE αk Router x 2j (t ) x 3j (t ) Pdyn, Pleak, PT S [P. Bogdan et al. NOCS 2012] tf N PE ( 1 min wi yi (t ) − yiref 2 ti i =1 22 ) 2 + 1 z i f i (t )2 + 2 f i min ≤ f i (t ) ≤ f i max , i = 1 ÷ N PE , Nr 1 2 r j =1 and j f j (t )2 + 1 2 q (x 4 k k =1 f jmin ≤ f j (t ) ≤ f jmax , k j 2 (t ) − xkref ) dt j = 1 ÷ Nr Accurate mathematical modeling and rigorous optimization can enable cross-layer power management Queues utilization at tiles (0,0), (1,1) and (1,2) for a 4×4 mesh NoC running Apache HTTP webserver application FOC keeps the utilization of all queues below 0.1 by adjusting the operating frequencies of all PEs and routers. FOC consumes less power compared to LQR controller 23 Workload analysis should not be an afterthought. In real applications, traffic is rarely Poisson or stationary Classical dynamics: Linear Dependence & Exponential Inter-Event Distribution dP(a, t ) ∝ P ( a, t ) dt dM 1 (t ) ∝ M 1 (t ) dt d α [tP (a, t )] Fractal dynamics: Linear Dependence & Power-Law Inter-Event Distribution dt α d α [tM 1 (t )] dt α ∝ P ( a, t ) Lat time Lat ∝ M 1 (t ) time Statistical properties of the workload have deep implications in resource allocation, architectural design, scheduling, etc. 24 For thousand core systems distributed approaches for power management are of crucial importance PerforPE-1 mance PerforPE-2 mance Power Power DVFS set DVFS set Performance Performance Power Power DVFS set DVFS set PE-3 15-router synch NoC that connects 22 processing units PE-4 [F. Clermidy et al., ISSCC’10] Distributed objective function 26 Local maximization algorithm Both centralized and distributed approaches have their own limitations Scalability issues due to cost and latency of long wires. Long synchronization times V1, f1 V2, f2 q1 Highly scalable but potential problems in control performance V1, f1 V2, f2 q1 q3 q2 q3 q2 V3, f3 V3, f3 Need ‘best-of-both-worlds’ between fully-centralized and fullydistributed solutions. This is true for thermal management too. 27 Custom feedback control offers a good compromise between fully centralized and fully distributed solutions Performance Small-world networks Fully-centralized Unexplored Design Space Fully-distributed Full (global) feedback (64 feedback channels) Implementation cost [S. Garg et al., ISLPED 2010] 28 Video Encoder Fully-centralized Local + partially global feedback Unexplored feedback channels) Custom(16 Feedback Design Space Control (CF) Fully-decentralized Only local feedback (8 feedback channels) Implementatio n Cost An hierarchy of globally distributed locally centralized control may help the system self-organize • Orders of magnitude! • Gets better with size System Size Flat Mesh (nJ) WiNoC (nJ) Factor 128 1319 22.57 58x 256 2936 24.02 122x 512 4992 37.48 133x [Ganguly et al. IEEE Trans. Comp., 2010] Local control w/ full state information, global control w/ partial information. Small world effects help convergence 29 On-chip communication is essential for multi-kernel OS Current: Monolithic (single kernel) OS (e.g., Windows) User + Applications + Services Future: Distributed (multi-kernel) OS (e.g., Barrelfish) User 1 + Applications 1 User p + Services Applications p Single shared state & lock OS Core 1 Core 2 Core 2 On-chip interconnect OS 1 OS 2 state state replica replica OS m state replica Core 1 Core 2 Core n On-chip interconnect OS execution as dynamic graphs. Fundamental design constraints (power, performance) depend on HW-OS-application interaction. 30 OS inter-event times [sec] OS inter-event times exhibit a self-similar structure over time. Hurst exponent for some real OS traces is 0.9 Hurst exp = 0.99 Hurst exp = 0.49 Time [sec] [P. Bogdan et al. TECHON 2012] 31 In summary, thousand core systems offer ample opportunities to bring science and engineering even closer Workload analysis should not be an afterthought, from chip all the way to the cloud Application/OS optimization and resources management represent the next frontier to conquer Optimization evolves towards selforganization; new theories and design paradigms are needed 32 Understanding networks (i.e. “whole”) is crucial for sci&eng. Lessons learned so far are valuable as we move forward. Finally… Collaborators - Paul Bogdan (USC) and Umit Y. Ogras (Intel) Siddharth Garg (Univ. of Waterloo), Mike Kishinevsky (Intel), Edward Ma (CMU), Diana Marculescu (CMU), Hiroki Matsutani (Keio Univ.), Chi-Ying Tsui (HK Univ. Sci. Tech), Qian Zhiliang (HK Univ. Sci. Tech), Simon Hollis (Univ. of Bristol), Relevant papers - www.ece.cmu.edu/~sld Sponsors Intel Corporation 33 PE 34 L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 Questions? L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 R R L2 R L2 R L2 R L2 PE L1 PE L1 PE L1 PE L2 R L2 R L2 R L2 PE L1 PE L1 PE L1 PE L2 R L2 R L2 R L2 R R L2 R L2 R L2 R L2 PE L1 PE L1 PE L1 PE L2 R L2 R L2 R L2 PE L1 PE L1 PE L1 PE R L2 R L2 R L2 L2 Thank you! R R L2 R L2 R PE L1 PE L1 L2 R L2 R PE L1 PE L1 L2 R L2 R