Design of Future Integrated Systems: A Cyber-physical Systems Approach Radu Marculescu

advertisement
Design of Future Integrated Systems:
A Cyber-physical Systems Approach
Radu Marculescu
Carnegie Mellon University
Pittsburgh, PA 15213, USA
PATMOS-VARI, Karlsruhe
Sept 9, 2013
Integrated systems are critical to IT revolution
Embedded computing (cell phones,
automotive, gaming)
Cyber-physical computing (healthcare,
transportation, smart grid)
High performance computing (scientific
applications, forecasting, data centers)
2
Three paradigm shifts, i.e. low-power (~90s), network-centric
(~2K), and cyber-physical design (~2010). Exciting times ahead…
Low-power Design. Computation vs. communication
[D. Greenfield, S. Moore, Computer J., 2008]
Multicore platforms are large scale distributed systems at
nanoscale; they are dominated by communication costs
Last level cache (LLC)
Memory controller (MC) &
channels
I/O controller(s)
QPI controller,
Power control unit (PCU),
etc.
PCU
Memory Controller
Core*
C
R
LLC
C
Packet
LLC
C
R
LLC
LLC
C
C
MC
C
R
Q
P
I
R
LLC
C
R
R
LLC
C
R
LLC
R
LLC
C
C
LLC
R
P
C
I
e
C
MC
[U. Ogras, Tutorial ASPLOS 2012]
4
Need to understand the behavior of thousand core systems.
Network (routers+links) is the missing link in understanding.
CPS consist of networks of computational devices and
networks that monitor and control physical processes
a(t)
a(t)
a(t)
a(t)
a(t) – CPS
workload
r(t)
Control
e(t)
Multi-scale behavior has been observed in many technological,
economical, and biological systems
5
MPSoCs as CPS: Current multicore platforms bring together
computation, communication, and control
Physical processes
VFI2 (V2, f2)
Sensing
Fan
Mixed clock/
mixed voltage FIFO
Actuation
VFI 3(V3, f3)
System state
(e.g., queue
occupancy, temp.)
Heat sink
VFI1 (V1, f1)
Heat spreader
Fan speed
control
V/F, on/off
control
Die
Heat sink state
(e.g., temp.) Joint
(c)
Computing/Cooling
Controller
Processing cores interconnected via routers and links; they
operate at various voltages/frequencies thus saving power
6
This presentation focuses on the new CPS paradigm and its
impact on future integrated systems
Structure
Architecture and small world effects
Dynamics
Workloads and multiscale behavior
Control
Power and resource management
7
Our first insight into communication-based design came
through architecture (topology, buffer, etc.) optimization
R
R
R
R
R
R
R
R
R
R
R
R
R
R
R
R
R
[K. Srinivasan et al. IEEE T-VLSI 2006]
[Bolotin et al. DATE 2007]
8
[J. Kim et al. ISCA 2007]
Can induce small-world effects in regular NoCs. This brings
huge performance improvements
Completely structured
Complete graph
Customized via LRL
Longrange link
m
N nodes
n
b(n) = D(m, n) = m + n − 2
b(n) = D′(m, n) < m + n − 2
b(n) = log 2 N
b(n): minimum number of broadcasting steps
D: network diameter
9
This way, the fundamental idea of Small World networks (aka
“six degrees of separation”) enters the multicore world
Small world effects can also be exploited to reduce hop
count in 3D wireless NoCs and improve performance
Root
Chip 3
0
1
2
3
4
5
6
Root’ 7
Chip 2
VC0
Chip 1
UltraSPARC
L1 cache (I & D)
L2 cache bank
[H. Matsutani et al. ASPDAC 2013]
10
VC1
Chip 0
Wired and wireless NoCs can be used intra-chip, while interchip communication is based on wireless inductive-coupling.
Flow-control mechanisms, reconfigurability, adaptive
routing have all been used to improve performance
Regional hub router
EVCs
Bi-channels
[Q. Zhiliang et al. CODES-ISSS 2012]
11
One way or another, they all exploit the small world effects…
Optimum path-seeking behavior may be goal-oriented.
Need to collect fitness information locally, at routing nodes
Channel 2
Current channel
Channel 2
H2
H2
Channel 1
fitnessi = α ⋅ min ⋅ win−1
Channel 1
H1
m = avg. number of free slots
w = avg. waiting time in channel
1
2
1
12
[T. Mak et al. IEEE Cir&Sys Mag.2011]
1
1
3
5
2
2
This presentation focuses on the new CPS paradigm and its
impact on future integrated systems
Structure
Architecture and small world effects
Dynamics
Workloads and multiscale behavior
Control
Power and resource management
13
Packet inter-arrival times at interface queues play a
fundamental part in network behavior
o1(n)
o2(n)
Switch
o4(n)
o3(n)
PE
Non-Markovian
behavior
Markovian
behavior
14
Network behavior at high injection rates needs a nonequilibrium approach that accounts for fractal behavior
15
High injection rates cause inter-arrival times deviate from
exponential distribution and exhibit power law correlations.
Fractals are geometrical objects or stochastic processes
displaying self-similar behavior over multiple scales
Space
usec
9
8
Time
Packet inter-arrival times
7
6
5
4
3
2
1
0
7.7
7.8
7.9
8
8.1
Time
17
Bellcore traffic
8.2
8.3
Seconds
8.4
8.5
5
x 10
How long is the coast of Britain? Answering this question
involves statistical self-similarity and fractal dimension
Fractal dimension can be
computed via box counting
N (R ) ≈ R − D
Choose an increasing set R of edge lengths
For each size ri in R
– Super-impose a series of distinct
squares (boxes) over the data
– Count the minimum # of boxes
needed to cover data and store it in
vector N
Compute fractal dimension by fitting the
equation between R and N
Box counting method can be applied to determine the fractal
dimension of 1D, 2D , or 3D vectors
18
Real world processes are not always smooth. Fractional
dynamics needs a formalism stronger than integer calculus
Integer calculus approximation
Fractional behavior
f(t)
f(t)
t
t
Mandelbrot (1975)
Impact of power law memory kernel
f(t) = dαt /dtα
Δx ∝ Δt H , 0 < H < 1
α
d x(t )
dt α
= lim
Δt →0
1
Δt α
[ t Δt ]
α 
(−1) j   x(t − jΔt )
 j
j =0

t
19
For CPS, time is an intrinsic component of the programming
model. Consequently, system dynamics becomes essential.
This presentation focuses on the new CPS paradigm and its
impact on future integrated systems
Structure
Architecture and small world effects
Dynamics
Workloads and multiscale behavior
Control
Power and resource management
20
8-Core Xeon® Processor has three clock domains and three
voltage domains that help minimizing power.
Core Domain
(MCLK)
Un-Core Domain
(UCLK)
QPI
QPI
I/O Domain
(QCLK)
QPI
QPI
QPI
C3
C4
C2
QPI
QPI
Core
Supply
QPI
Core
Supply
I/O Domain
1.1V
fixed
C5
Fuse
C1
Un-Core
Domain
0.9-1.1V
fixed
C6
C0
C7
SMI
SMI
Filter PLL
IO DLLs
IO
Un - core PLL
Core
PLLs
Core
Supply
Uncore
Supply
SMI
BCLK
PLLs
Core
Supply
Core Domain
0.85-1.1V
variable
SMI
[Rusu, ISSCC 2009]
Device count reported 2.3B transistors. The chip has 9 temperature
sensors, one in each core hot spot and one in the die center.
21
Fine-grain power management becomes possible by
exploiting workload variations
Globally asynchronous,
locally synchronous (GALS)
Clock
domain 1
Volt./Freq. Island VFI 1
(V1, f1)
Clock
domain 2
VFI 2 (V2, f2)
22
Minimizing energy consumption
Total energy consumption to be minimized
#VFIs
n
E Total = E App +  E VFI (i )
i =1
Application (useful)
Overhead of ith VFI
energy consumption
(comp+comm)
u
Integrated power controller
– independent of CPUs
 Single inductor multiple


output (SIMO) converters
Adaptive-output
converters
Voltage regulator
controller (VRC) on Intel
SCC
[D. Ma, Cir&Syst Mag, 2010]
Fine-grain power management can be implemented via
control-theoretic approaches
Applications
(workload)
Physical constraints
+
∑
_
VN r , f N r ∈ [
X ref = [ x1ref x2ref ... x refq ]T
Nr
VFI 2
VFI 3
f Nmin
,
r
f Nmax
]
r
V/F interface FIFO
d
αk
xk
dt α k
24
VFI 1 (V1, f1, Vt1)
…
Utilization
(reference) values
for interface FIFOs
V1 , f1 ∈ [ f1min , f1max ]
VoltageFrequency
Controller
(
= G xk , f j , f l , t
)
Actual utilization of
interface FIFOs
State
feedback
X = [ x1 x2 ... x N q ]T
r
xkmin ≤ xk ≤ xkmax
Inter-arrival times dictate the fractal exponent (αk) of state
equations; this is used to characterize the system dynamics.
V/F controller selects the minimum operating frequencies
s.t. the queues reference values are satisfied
Computational workload
d yi (t )
αi
αi
Traffic variability
d
= ai (t ) yi (t ) + bi (t ) f i (t ) − ci (t ) f j (t ),
x 4j
(t )
xj
yi
xk (t )
(i,j)-th
VFI for
PE
x1j (t )
PE
S
PE
(i,j-1)th VFI
Router
Pdyn, Pleak, PT
Write/Read
= a k (t )xk (t ) + bk (t ) f j (t ) − ck (t ) f l (t ),
dt α k
0 ≤ xkmin ≤ xk (t ) ≤ xkmax ≤ 1, k = 1 ÷ 4, j , l = 1 ÷ N r
dt
0 ≤ yimin ≤ yi (t ) ≤ yimax ≤ 1, i = 1 ÷ N PE
(i,j-1)th VFI
for PE
αk
Router
x 2j (t )
x 3j
(t )
Pdyn, Pleak, PT
S
[P. Bogdan et al. NOCS 2012]
tf
 N PE
(
 1
min   wi yi (t ) − yiref
2
ti 
 i =1 

25
)
2
+
1

z i f i (t )2  +
2

f i min ≤ f i (t ) ≤ f i max , i = 1 ÷ N PE ,
1
2
 r j f j (t ) +
2
j =1 

Nr

and
1
  2 q (x
4
k
k =1
f jmin ≤ f j (t ) ≤ f jmax ,
k
j
2 

(t ) − xkref )
j = 1 ÷ Nr
  dt
  
Accurate mathematical modeling and rigorous optimization
can enable cross-layer power management
Queues utilization at tiles (0,0), (1,1) and (1,2) for a 4×4
mesh NoC running Apache HTTP webserver application
FOC keeps the utilization of all queues below 0.1 by
adjusting the operating frequencies of all PEs and routers.
FOC consumes less power compared to a LQR controller
26
How about core-to-core variability? Use frequency
asymmetry to reduce performance loss
P0
M2
P4
M6
P8
M1
P1
M1
0
2
4
M0
P2
M4
P6
M8
P1
M1
P1
0
2
4
P1
M3
P5
M7
P9
M1
P1
M1
1
3
5
M9
P1
M1
P1
1
3
5
M1
P3
M5
P7
Proportion of samples
M2
M6
P8
P1
M1
2
0
P2
P6
M0
M4
P1
P1
0
4
P5
M1
M1
1
5
M7
M1
P1
M5
P3
P7
M9
M1
4
M8
M1
P9
P1
M1
P1
P1
3
1
5
2
3
Fine-grained design
Coarse-grained design
0.15
0.10
0.05
52.3%
47.7%
0.00
-0.6
-0.4
← leakier
27
P4
M3
0.20
Bias towards lower V/F levels
Greater power reduction
Greater performance reduction
P0
-0.2
0.0
peff, σ
0.2
0.4
0.6
Bias towards higher V/F levels
Smaller power increase
Smaller performance increase
less leaky →
Run leakier cores at lower voltages, but at higher than normal
frequencies for those voltages.
[Herbert et al, IEEE TVLSI 2012]
For thousand core systems distributed approaches for
power management are of crucial importance
PerforPE-1
mance
PerforPE-2
mance
Power
Power
DVFS set
DVFS set
Performance
Performance
Power
Power
DVFS set
DVFS set
PE-3
15-router synch NoC that
connects 22 processing units
PE-4
[F. Clermidy et al., ISSCC’10]
Distributed
objective
function
Local
maximization
algorithm
28
Not only a power issue…
Distributed and hierarchical controllers:
• Energy Controller (EC)
– Output Frequency fEC
• fEC ,TCORE, TNEIGHBOURS
– Output
fEC i,k+1
• Core frequency (fTC)
TCi
ECN
fEC N,k+1
TN-1,k
ECi
Ti-1,k
CPI N,k+1
CPI i,k+1
• Minimize power – CPI based
• Performance degradation < 5%
• Temperature Controller (TC)
– Distributed MPC
– Inputs:
controller
node
Ti+1,k
TCN
Tx,k
CoreN
fTC N,k+1
T i,k
Corei
Core1
T2,k
T i,k
multicore
fTC i,k+1
[Slide courtesy of L. Benini, Bologna U.]
Peak temperature is important. DTM cannot be achieved by
considering power alone. Physical context is crucial…
[T. Ebi et al. ICCAD 2009]
Agent-based systems can help system components (re)configure
and optimize their resource usage independently
Both centralized and distributed approaches have their
own limitations
Scalability issues due to cost and latency
of long wires. Long synchronization times
V1, f1
V2, f2
q1
q2
V3, f3
Highly scalable but potential
problems in control performance
V1, f1
V2, f2
q1
q3
q3
q2
V3, f3
Need ‘best-of-both-worlds’ between fully-centralized and fullydistributed solutions. This is true for thermal management too.
31
An hierarchy of globally distributed locally centralized
control may help the system self-organize
• Orders of magnitude!
• Gets better with size
System Size
Flat Mesh (nJ)
WiNoC (nJ)
Factor
128
1319
22.57
58x
256
2936
24.02
122x
512
4992
37.48
133x
[Ganguly et al. IEEE Trans. Comp., 2010]
33
[Heisswolf et al. ACM Trans. Reconfig. Tech&Syst, 2013]
Local control w/ full state information, global control w/
partial information. Small world effects help convergence
This presentation focuses on the new CPS paradigm and its
impact on future integrated systems
Structure
Architecture and small world effects
Dynamics
Workloads and multiscale behavior
Control
Power and resource management
36
A pacemaker analyzes the function of the heart; if
necessary, sends signals to correct certain abnormalities
37
[S. Barold et al. 2004]
37
Typical ECG tracing of the cardiac cycle (heartbeat) consists
of a P wave, a QRS complex, a T wave, and a U wave
R-R interval: Time between an R wave and the next R wave.
Normal resting heart rate is between 60-100 bpm (0.6-1.0sec)
38
Kurtosis
Variance
R-R Intervals [sec]
Mean
Heart rate is non-stationary. This means that its distribution
and its higher order moments vary with shifts in time
Beat Number
Beat Number
Statistics of R-R intervals vary with time as a function of various
activities (sleep, running, etc.)
39
Heart rate datasets available at: http://www.physionet.org/
DFA can reveal the fractal dimension of R-R intervals
ith inter-beat interval
avg. inter-beat interval
k
y (k ) =  [B (i ) − Bave ]
i =1
F ( n) =
n = 100
y100 (k )
box length n
(# of intervals considered)
[C.-K. Peng et al.]
40
1
N
2
N
 [ y (k ) − y (k )]
n
k =1
log F(n)
slope
log n
Interpretation
- slope ~ 0.5 white noise
- slope ~ 1 LRD
- slope >> 1 Brownian noise
Breakdown of fractal physiological control mechanism can
lead to highly periodic output (single scale) or randomness
Atrial fibrillation
R-R intervals
R-R intervals
Healthy
Beat number
Beat number
41
DFA can be applied to heart rate variability. Degradation of
fractal correlations can be used as a quick diagnostic tool.
α = 1.032
α = 1.13
F (n ) ≈ nα
42
Controller finds the pacing frequency which minimizes the
deviation of ISE of heart rate from the reference value
tj
(
1
min  w y (t ) − y ref
2
t

) + 12 zf (t ) dt
2
2
( )
y (t i ) = y 0 and y t f = y1
f min ≤ f (t ) ≤ f max
i
yref(t)
Fractal Optimal
Controller
f(t)
(pacing
freq.)
Model of Heart
Rate Variability
y(t) (HR activ)
Sensing, State Estimation, Model
Identification
d α y (t )
44
dt
α
= a (t ) y (t ) + b(t ) f (t ),
y min ≤ y (t ) ≤ y max
Healthy R-R intervals (0.66
to 1 sec)
Dangerous R-R intervals (as
low as 0.20 sec)
[seconds]
Atrial fibrillation is characterized by short R-R intervals. If
left untreated, it can lead to congestive heart failure
X = 0.096 Y = 0.7166
(corresponds to a heart
rate of 83 beats per min)
[seconds]
45
FOC brings the R-R interval from 0.40 to 0.80 secs (i.e., a
healthy heart rate of about 75 beats per minute)
Workload analysis should not be an afterthought. In real
CPS, network traffic is neither Poisson, nor stationary
Classical dynamics: Linear Dependence &
Exponential Inter-Event Distribution
dP(a, t )
∝ P ( a, t )
dt
dM 1 (t )
∝ M 1 (t )
dt
d α P(a, t )
Fractal dynamics: Linear Dependence &
Power-Law Inter-Event Distribution
dt α
d α M 1 (t )
dt α
∝ P ( a, t )
∝ M 1 (t )
Statistical properties of the workload have deep implications
in resource allocation, architectural design, RT scheduling, etc.
46
In summary, thousand core systems offer ample
opportunities to bring science and engineering even closer
Software Programming
47
User/application/OS optimization and resource management
is the next frontier to conquer.
Finally…
Contributors
Paul Bogdan (Univ. of Southern Calif.), Umit Y. Ogras (Intel),
Siddharth Garg (Univ. of Waterloo), Diana Marculescu (Carnegie
Mellon Univ), Chen-Ling Chou (Qualcomm), Qian Zhiliang (Hong
Kong Univ Sci&Tech), Chi-Ying Tsui (Hong Kong Univ Sci&Tech),
Hiroki Matsutani (Keio Univ).
Relevant papers - www.ece.cmu.edu/~sld
Sponsors
Intel Corporation
48
PE
49
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
Questions?
PE
L1
PE
L1
PE
L1
PE
L1
PE
R
L2
R
L2
R
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
R
R
L2
R
L2
R
L2
R
L2
PE
L1
PE
L1
PE
L1
PE
L2
R
L2
R
L2
R
L2
PE
L1
PE
L1
PE
L1
PE
L2
R
L2
R
L2
R
L2
R
R
L2
R
L2
R
L2
R
L2
PE
L1
PE
L1
PE
L1
PE
L2
R
L2
R
L2
R
L2
PE
L1
PE
L1
PE
L1
PE
R
L2
R
L2
R
L2
L2
Thank you!
R
R
L2
R
L2
R
PE
L1
PE
L1
L2
R
L2
R
PE
L1
PE
L1
L2
R
L2
R
Download