Design of Future Integrated Systems: A Cyber-physical Systems Approach Radu Marculescu Carnegie Mellon University Pittsburgh, PA 15213, USA PATMOS-VARI, Karlsruhe Sept 9, 2013 Integrated systems are critical to IT revolution Embedded computing (cell phones, automotive, gaming) Cyber-physical computing (healthcare, transportation, smart grid) High performance computing (scientific applications, forecasting, data centers) 2 Three paradigm shifts, i.e. low-power (~90s), network-centric (~2K), and cyber-physical design (~2010). Exciting times ahead… Low-power Design. Computation vs. communication [D. Greenfield, S. Moore, Computer J., 2008] Multicore platforms are large scale distributed systems at nanoscale; they are dominated by communication costs Last level cache (LLC) Memory controller (MC) & channels I/O controller(s) QPI controller, Power control unit (PCU), etc. PCU Memory Controller Core* C R LLC C Packet LLC C R LLC LLC C C MC C R Q P I R LLC C R R LLC C R LLC R LLC C C LLC R P C I e C MC [U. Ogras, Tutorial ASPLOS 2012] 4 Need to understand the behavior of thousand core systems. Network (routers+links) is the missing link in understanding. CPS consist of networks of computational devices and networks that monitor and control physical processes a(t) a(t) a(t) a(t) a(t) – CPS workload r(t) Control e(t) Multi-scale behavior has been observed in many technological, economical, and biological systems 5 MPSoCs as CPS: Current multicore platforms bring together computation, communication, and control Physical processes VFI2 (V2, f2) Sensing Fan Mixed clock/ mixed voltage FIFO Actuation VFI 3(V3, f3) System state (e.g., queue occupancy, temp.) Heat sink VFI1 (V1, f1) Heat spreader Fan speed control V/F, on/off control Die Heat sink state (e.g., temp.) Joint (c) Computing/Cooling Controller Processing cores interconnected via routers and links; they operate at various voltages/frequencies thus saving power 6 This presentation focuses on the new CPS paradigm and its impact on future integrated systems Structure Architecture and small world effects Dynamics Workloads and multiscale behavior Control Power and resource management 7 Our first insight into communication-based design came through architecture (topology, buffer, etc.) optimization R R R R R R R R R R R R R R R R R [K. Srinivasan et al. IEEE T-VLSI 2006] [Bolotin et al. DATE 2007] 8 [J. Kim et al. ISCA 2007] Can induce small-world effects in regular NoCs. This brings huge performance improvements Completely structured Complete graph Customized via LRL Longrange link m N nodes n b(n) = D(m, n) = m + n − 2 b(n) = D′(m, n) < m + n − 2 b(n) = log 2 N b(n): minimum number of broadcasting steps D: network diameter 9 This way, the fundamental idea of Small World networks (aka “six degrees of separation”) enters the multicore world Small world effects can also be exploited to reduce hop count in 3D wireless NoCs and improve performance Root Chip 3 0 1 2 3 4 5 6 Root’ 7 Chip 2 VC0 Chip 1 UltraSPARC L1 cache (I & D) L2 cache bank [H. Matsutani et al. ASPDAC 2013] 10 VC1 Chip 0 Wired and wireless NoCs can be used intra-chip, while interchip communication is based on wireless inductive-coupling. Flow-control mechanisms, reconfigurability, adaptive routing have all been used to improve performance Regional hub router EVCs Bi-channels [Q. Zhiliang et al. CODES-ISSS 2012] 11 One way or another, they all exploit the small world effects… Optimum path-seeking behavior may be goal-oriented. Need to collect fitness information locally, at routing nodes Channel 2 Current channel Channel 2 H2 H2 Channel 1 fitnessi = α ⋅ min ⋅ win−1 Channel 1 H1 m = avg. number of free slots w = avg. waiting time in channel 1 2 1 12 [T. Mak et al. IEEE Cir&Sys Mag.2011] 1 1 3 5 2 2 This presentation focuses on the new CPS paradigm and its impact on future integrated systems Structure Architecture and small world effects Dynamics Workloads and multiscale behavior Control Power and resource management 13 Packet inter-arrival times at interface queues play a fundamental part in network behavior o1(n) o2(n) Switch o4(n) o3(n) PE Non-Markovian behavior Markovian behavior 14 Network behavior at high injection rates needs a nonequilibrium approach that accounts for fractal behavior 15 High injection rates cause inter-arrival times deviate from exponential distribution and exhibit power law correlations. Fractals are geometrical objects or stochastic processes displaying self-similar behavior over multiple scales Space usec 9 8 Time Packet inter-arrival times 7 6 5 4 3 2 1 0 7.7 7.8 7.9 8 8.1 Time 17 Bellcore traffic 8.2 8.3 Seconds 8.4 8.5 5 x 10 How long is the coast of Britain? Answering this question involves statistical self-similarity and fractal dimension Fractal dimension can be computed via box counting N (R ) ≈ R − D Choose an increasing set R of edge lengths For each size ri in R – Super-impose a series of distinct squares (boxes) over the data – Count the minimum # of boxes needed to cover data and store it in vector N Compute fractal dimension by fitting the equation between R and N Box counting method can be applied to determine the fractal dimension of 1D, 2D , or 3D vectors 18 Real world processes are not always smooth. Fractional dynamics needs a formalism stronger than integer calculus Integer calculus approximation Fractional behavior f(t) f(t) t t Mandelbrot (1975) Impact of power law memory kernel f(t) = dαt /dtα Δx ∝ Δt H , 0 < H < 1 α d x(t ) dt α = lim Δt →0 1 Δt α [ t Δt ] α (−1) j x(t − jΔt ) j j =0 t 19 For CPS, time is an intrinsic component of the programming model. Consequently, system dynamics becomes essential. This presentation focuses on the new CPS paradigm and its impact on future integrated systems Structure Architecture and small world effects Dynamics Workloads and multiscale behavior Control Power and resource management 20 8-Core Xeon® Processor has three clock domains and three voltage domains that help minimizing power. Core Domain (MCLK) Un-Core Domain (UCLK) QPI QPI I/O Domain (QCLK) QPI QPI QPI C3 C4 C2 QPI QPI Core Supply QPI Core Supply I/O Domain 1.1V fixed C5 Fuse C1 Un-Core Domain 0.9-1.1V fixed C6 C0 C7 SMI SMI Filter PLL IO DLLs IO Un - core PLL Core PLLs Core Supply Uncore Supply SMI BCLK PLLs Core Supply Core Domain 0.85-1.1V variable SMI [Rusu, ISSCC 2009] Device count reported 2.3B transistors. The chip has 9 temperature sensors, one in each core hot spot and one in the die center. 21 Fine-grain power management becomes possible by exploiting workload variations Globally asynchronous, locally synchronous (GALS) Clock domain 1 Volt./Freq. Island VFI 1 (V1, f1) Clock domain 2 VFI 2 (V2, f2) 22 Minimizing energy consumption Total energy consumption to be minimized #VFIs n E Total = E App + E VFI (i ) i =1 Application (useful) Overhead of ith VFI energy consumption (comp+comm) u Integrated power controller – independent of CPUs Single inductor multiple output (SIMO) converters Adaptive-output converters Voltage regulator controller (VRC) on Intel SCC [D. Ma, Cir&Syst Mag, 2010] Fine-grain power management can be implemented via control-theoretic approaches Applications (workload) Physical constraints + ∑ _ VN r , f N r ∈ [ X ref = [ x1ref x2ref ... x refq ]T Nr VFI 2 VFI 3 f Nmin , r f Nmax ] r V/F interface FIFO d αk xk dt α k 24 VFI 1 (V1, f1, Vt1) … Utilization (reference) values for interface FIFOs V1 , f1 ∈ [ f1min , f1max ] VoltageFrequency Controller ( = G xk , f j , f l , t ) Actual utilization of interface FIFOs State feedback X = [ x1 x2 ... x N q ]T r xkmin ≤ xk ≤ xkmax Inter-arrival times dictate the fractal exponent (αk) of state equations; this is used to characterize the system dynamics. V/F controller selects the minimum operating frequencies s.t. the queues reference values are satisfied Computational workload d yi (t ) αi αi Traffic variability d = ai (t ) yi (t ) + bi (t ) f i (t ) − ci (t ) f j (t ), x 4j (t ) xj yi xk (t ) (i,j)-th VFI for PE x1j (t ) PE S PE (i,j-1)th VFI Router Pdyn, Pleak, PT Write/Read = a k (t )xk (t ) + bk (t ) f j (t ) − ck (t ) f l (t ), dt α k 0 ≤ xkmin ≤ xk (t ) ≤ xkmax ≤ 1, k = 1 ÷ 4, j , l = 1 ÷ N r dt 0 ≤ yimin ≤ yi (t ) ≤ yimax ≤ 1, i = 1 ÷ N PE (i,j-1)th VFI for PE αk Router x 2j (t ) x 3j (t ) Pdyn, Pleak, PT S [P. Bogdan et al. NOCS 2012] tf N PE ( 1 min wi yi (t ) − yiref 2 ti i =1 25 ) 2 + 1 z i f i (t )2 + 2 f i min ≤ f i (t ) ≤ f i max , i = 1 ÷ N PE , 1 2 r j f j (t ) + 2 j =1 Nr and 1 2 q (x 4 k k =1 f jmin ≤ f j (t ) ≤ f jmax , k j 2 (t ) − xkref ) j = 1 ÷ Nr dt Accurate mathematical modeling and rigorous optimization can enable cross-layer power management Queues utilization at tiles (0,0), (1,1) and (1,2) for a 4×4 mesh NoC running Apache HTTP webserver application FOC keeps the utilization of all queues below 0.1 by adjusting the operating frequencies of all PEs and routers. FOC consumes less power compared to a LQR controller 26 How about core-to-core variability? Use frequency asymmetry to reduce performance loss P0 M2 P4 M6 P8 M1 P1 M1 0 2 4 M0 P2 M4 P6 M8 P1 M1 P1 0 2 4 P1 M3 P5 M7 P9 M1 P1 M1 1 3 5 M9 P1 M1 P1 1 3 5 M1 P3 M5 P7 Proportion of samples M2 M6 P8 P1 M1 2 0 P2 P6 M0 M4 P1 P1 0 4 P5 M1 M1 1 5 M7 M1 P1 M5 P3 P7 M9 M1 4 M8 M1 P9 P1 M1 P1 P1 3 1 5 2 3 Fine-grained design Coarse-grained design 0.15 0.10 0.05 52.3% 47.7% 0.00 -0.6 -0.4 ← leakier 27 P4 M3 0.20 Bias towards lower V/F levels Greater power reduction Greater performance reduction P0 -0.2 0.0 peff, σ 0.2 0.4 0.6 Bias towards higher V/F levels Smaller power increase Smaller performance increase less leaky → Run leakier cores at lower voltages, but at higher than normal frequencies for those voltages. [Herbert et al, IEEE TVLSI 2012] For thousand core systems distributed approaches for power management are of crucial importance PerforPE-1 mance PerforPE-2 mance Power Power DVFS set DVFS set Performance Performance Power Power DVFS set DVFS set PE-3 15-router synch NoC that connects 22 processing units PE-4 [F. Clermidy et al., ISSCC’10] Distributed objective function Local maximization algorithm 28 Not only a power issue… Distributed and hierarchical controllers: • Energy Controller (EC) – Output Frequency fEC • fEC ,TCORE, TNEIGHBOURS – Output fEC i,k+1 • Core frequency (fTC) TCi ECN fEC N,k+1 TN-1,k ECi Ti-1,k CPI N,k+1 CPI i,k+1 • Minimize power – CPI based • Performance degradation < 5% • Temperature Controller (TC) – Distributed MPC – Inputs: controller node Ti+1,k TCN Tx,k CoreN fTC N,k+1 T i,k Corei Core1 T2,k T i,k multicore fTC i,k+1 [Slide courtesy of L. Benini, Bologna U.] Peak temperature is important. DTM cannot be achieved by considering power alone. Physical context is crucial… [T. Ebi et al. ICCAD 2009] Agent-based systems can help system components (re)configure and optimize their resource usage independently Both centralized and distributed approaches have their own limitations Scalability issues due to cost and latency of long wires. Long synchronization times V1, f1 V2, f2 q1 q2 V3, f3 Highly scalable but potential problems in control performance V1, f1 V2, f2 q1 q3 q3 q2 V3, f3 Need ‘best-of-both-worlds’ between fully-centralized and fullydistributed solutions. This is true for thermal management too. 31 An hierarchy of globally distributed locally centralized control may help the system self-organize • Orders of magnitude! • Gets better with size System Size Flat Mesh (nJ) WiNoC (nJ) Factor 128 1319 22.57 58x 256 2936 24.02 122x 512 4992 37.48 133x [Ganguly et al. IEEE Trans. Comp., 2010] 33 [Heisswolf et al. ACM Trans. Reconfig. Tech&Syst, 2013] Local control w/ full state information, global control w/ partial information. Small world effects help convergence This presentation focuses on the new CPS paradigm and its impact on future integrated systems Structure Architecture and small world effects Dynamics Workloads and multiscale behavior Control Power and resource management 36 A pacemaker analyzes the function of the heart; if necessary, sends signals to correct certain abnormalities 37 [S. Barold et al. 2004] 37 Typical ECG tracing of the cardiac cycle (heartbeat) consists of a P wave, a QRS complex, a T wave, and a U wave R-R interval: Time between an R wave and the next R wave. Normal resting heart rate is between 60-100 bpm (0.6-1.0sec) 38 Kurtosis Variance R-R Intervals [sec] Mean Heart rate is non-stationary. This means that its distribution and its higher order moments vary with shifts in time Beat Number Beat Number Statistics of R-R intervals vary with time as a function of various activities (sleep, running, etc.) 39 Heart rate datasets available at: http://www.physionet.org/ DFA can reveal the fractal dimension of R-R intervals ith inter-beat interval avg. inter-beat interval k y (k ) = [B (i ) − Bave ] i =1 F ( n) = n = 100 y100 (k ) box length n (# of intervals considered) [C.-K. Peng et al.] 40 1 N 2 N [ y (k ) − y (k )] n k =1 log F(n) slope log n Interpretation - slope ~ 0.5 white noise - slope ~ 1 LRD - slope >> 1 Brownian noise Breakdown of fractal physiological control mechanism can lead to highly periodic output (single scale) or randomness Atrial fibrillation R-R intervals R-R intervals Healthy Beat number Beat number 41 DFA can be applied to heart rate variability. Degradation of fractal correlations can be used as a quick diagnostic tool. α = 1.032 α = 1.13 F (n ) ≈ nα 42 Controller finds the pacing frequency which minimizes the deviation of ISE of heart rate from the reference value tj ( 1 min w y (t ) − y ref 2 t ) + 12 zf (t ) dt 2 2 ( ) y (t i ) = y 0 and y t f = y1 f min ≤ f (t ) ≤ f max i yref(t) Fractal Optimal Controller f(t) (pacing freq.) Model of Heart Rate Variability y(t) (HR activ) Sensing, State Estimation, Model Identification d α y (t ) 44 dt α = a (t ) y (t ) + b(t ) f (t ), y min ≤ y (t ) ≤ y max Healthy R-R intervals (0.66 to 1 sec) Dangerous R-R intervals (as low as 0.20 sec) [seconds] Atrial fibrillation is characterized by short R-R intervals. If left untreated, it can lead to congestive heart failure X = 0.096 Y = 0.7166 (corresponds to a heart rate of 83 beats per min) [seconds] 45 FOC brings the R-R interval from 0.40 to 0.80 secs (i.e., a healthy heart rate of about 75 beats per minute) Workload analysis should not be an afterthought. In real CPS, network traffic is neither Poisson, nor stationary Classical dynamics: Linear Dependence & Exponential Inter-Event Distribution dP(a, t ) ∝ P ( a, t ) dt dM 1 (t ) ∝ M 1 (t ) dt d α P(a, t ) Fractal dynamics: Linear Dependence & Power-Law Inter-Event Distribution dt α d α M 1 (t ) dt α ∝ P ( a, t ) ∝ M 1 (t ) Statistical properties of the workload have deep implications in resource allocation, architectural design, RT scheduling, etc. 46 In summary, thousand core systems offer ample opportunities to bring science and engineering even closer Software Programming 47 User/application/OS optimization and resource management is the next frontier to conquer. Finally… Contributors Paul Bogdan (Univ. of Southern Calif.), Umit Y. Ogras (Intel), Siddharth Garg (Univ. of Waterloo), Diana Marculescu (Carnegie Mellon Univ), Chen-Ling Chou (Qualcomm), Qian Zhiliang (Hong Kong Univ Sci&Tech), Chi-Ying Tsui (Hong Kong Univ Sci&Tech), Hiroki Matsutani (Keio Univ). Relevant papers - www.ece.cmu.edu/~sld Sponsors Intel Corporation 48 PE 49 L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L2 R L2 R L2 R PE L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 Questions? PE L1 PE L1 PE L1 PE L1 PE R L2 R L2 R L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R L2 R PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 PE L1 R R L2 R L2 R L2 R L2 PE L1 PE L1 PE L1 PE L2 R L2 R L2 R L2 PE L1 PE L1 PE L1 PE L2 R L2 R L2 R L2 R R L2 R L2 R L2 R L2 PE L1 PE L1 PE L1 PE L2 R L2 R L2 R L2 PE L1 PE L1 PE L1 PE R L2 R L2 R L2 L2 Thank you! R R L2 R L2 R PE L1 PE L1 L2 R L2 R PE L1 PE L1 L2 R L2 R