Understanding Networks-on-Chip: A Decade and Beyond Multicores are critical to IT revolution

advertisement
Understanding Networks-on-Chip:
A Decade and Beyond
Radu Marculescu
Carnegie Mellon University
Pittsburgh, PA 15213, USA
NOCS, Tempe, AZ
April 23, 2013
Multicores are critical to IT revolution
Embedded computing (cell phones,
automotive, gaming)
Cyber-physical systems (healthcare,
transportation, smart grid)
High performance computing (scientific
applications, forecasting, data centers)
2
Multicore platforms are large scale distributed systems at
nanoscale; they are dominated by communication costs
Last level cache (LLC)
Memory controller (MC) &
channels
I/O controller(s)
QPI controller,
Power control unit (PCU),
etc.
PCU
Memory Controller
Core*
C
R
LLC
C
Packet
LLC
C
R
LLC
LLC
C
C
MC
C
R
Q
P
I
R
LLC
C
R
R
LLC
C
R
LLC
R
LLC
C
C
LLC
R
P
C
I
e
C
MC
[U. Ogras, Tutorial ASPLOS 2012]
3
Need to understand the behavior of thousand core systems.
Network (routers+links) is the missing link in understanding.
“Knowing in part may make a fine tale, but wisdom
comes from seeing the whole. “
The rest of this presentation focuses on this “whole”, i.e., a
few emerging ideas from a decade of NoC research
Structure
Architecture and small world effects
Dynamics
Workloads and multiscale behavior
Control
Power and resource management
5
Our first insight into communication-based design came
through architecture (topology, buffer, etc.) optimization
R
R
R
R
R
R
R
R
R
R
R
R
R
R
R
R
R
[K. Srinivasan et al. IEEE T-VLSI 2006]
[Bolotin et al. DATE 2007]
6
[J. Kim et al. ISCA 2007]
Can induce small-world effects in regular NoCs. This brings
huge performance improvements
Completely structured
Complete graph
Customized via LRL
Longrange link
m
N nodes
n
b(n) = D(m, n) = m + n − 2
b(n) = D′(m, n) < m + n − 2
b(n) = log 2 N
b(n): minimum number of broadcasting steps
D: network diameter
7
This way, the fundamental idea of Small World networks (aka
“six degrees of separation”) enters the multicore world
Small world effects can also be exploited to reduce hop
count in 3D wireless NoCs and improve performance
Root
Chip 3
0
1
2
3
4
5
6
Root’ 7
Chip 2
VC0
Chip 1
UltraSPARC
L1 cache (I & D)
L2 cache bank
[H. Matsutani et al. ASPDAC 2013]
8
VC1
Chip 0
Wired and wireless NoCs can be used intra-chip, while interchip communication is based on wireless inductive-coupling.
Flow-control mechanisms, reconfigurability, adaptive
routing have all been used to improve performance
EVCs
Regional hub router
[Q. Zhiliang et al. CODES-ISSS 2012]
9
Bi-channels
One way or another, they all exploit small world effects…
Optimum path-seeking behavior may be goal-oriented.
Need to collect fitness information locally, at routing nodes
Channel 2
Channel 2
H2
H2
Current channel
Channel 1
fitnessi = α
⋅ min ⋅ win−1
m = avg. number of free slots
w = avg. waiting time in channel
10
Channel 1
H1
1
2
1
1
1
3
5
2
2
The rest of this presentation focuses on this “whole”, i.e., a
few emerging ideas from a decade of NoC research
Structure
Architecture and small world effects
Dynamics
Workloads and multiscale behavior
Control
Power and resource management
12
Packet inter-arrival times at interface queues play a
fundamental part in network behavior
Buffer 2
Buffer 1
PE
13
11 clock cycles
Time
Inter-event times
Cumulative events
PE
Non-Markovian
behavior
Markovian
behavior
# events
Fractals are geometrical objects or stochastic processes
displaying self-similar behavior over multiple scales.
Real world processes are not always smooth. Fractional
dynamics needs a formalism stronger than integer calculus
Integer calculus approximation
f(t)
Fractional behavior
t
f(t)
Impact of power law memory kernel
t
f(t) = dαt /dtα
Mandelbrot formalism (1975)
H
Δx ∝ Δt , 0 < H < 1
d α x(t )
dt α
= lim
Δt →0
1
Δt α
[ t Δt ]
α 
(−1) j   x(t − jΔt )
 j
j =0

t
14
Main idea is to exploit workload variations and optimize
system behavior (minimize power, latency, etc.)
Buffer 1
Buffer 2
Buffer 3
PE
queue dynamics
(traffic variability)
PE
d α k xk (t )
dt α k
utilization
= a k (t )xk (t ) + bk (t ) f j (t ) − ck (t ) f l (t )
Write/Read
x ( k ) = x ( k − 1) + Tλ1 ( k − 1) − Tμ1 ( k − 1)
15
Inter-arrival times dictate the fractal exponent (αk) of state
equations; this is used to characterize the system dynamics.
Performance analysis at high injection rates needs a nonequilibrium approach that accounts for fractal behavior
16
High injection rates cause inter-arrival times deviate from
exponential distribution and exhibit power law correlations.
Network understanding needs to evolve from deterministic
and stochastic, to statistical physics type of approaches
Deterministic
Stochastic
[P. Bogdan et al. FTnEDA 2009]
17
System optimization and resource management becomes all
about time-varying dynamic processes running over networks.
The rest of this presentation focuses on this “whole”, i.e., a
few emerging ideas from a decade of NoC research
Structure
Architecture and small world effects
Dynamics
Workloads and multiscale behavior
Control
Power and resource management
18
8-Core Xeon® Processor has three clock domains and three
voltage domains that help minimizing power.
Core Domain
(MCLK)
Un-Core Domain
(UCLK)
QPI
QPI
I/O Domain
(QCLK)
QPI
QPI
QPI
C3
QPI
QPI
QPI
C4
Core
Supply
C2
Core
Supply
I/O Domain
1.1V
fixed
C5
Fuse
C1
Un-Core
Domain
0.9-1.1V
fixed
C6
C0
C7
SMI
SMI
Filter PLL
IO DLLs
IO
Un - core PLL
Core
Core
Supply
Uncore
Supply
SMI
BCLK
PLLs
Core
Supply
Core Domain
0.85-1.1V
variable
SMI
PLLs
Largest device count reported for a microprocessor 2.3B transistors.
19
[Rusu, ISSCC 2009]
Fine-grain power management becomes possible by
exploiting workload variations
Globally asynchronous,
locally synchronous (GALS)
Clock
domain 1
Volt./Freq. Island VFI 1
(V1, f1)
Clock
domain 2
VFI 2 (V2, f2)
20
Fine-grain power management can be implemented via
control-theoretic approaches
Applications
(workload)
Physical constraints
+
_
VoltageFrequency
Controller
∑
V1 , f1 ∈ [ f1min , f1max ]
…
Utilization
(reference) values
for interface FIFOs
VN r , f N r ∈ [
X ref = [ x1ref x2ref ... x refq ]T
Nr
VFI 3
f Nmin
,
r
VFI 2
f Nmax
]
r
V/F interface FIFO
d
αk
xk
dt α k
21
VFI 1 (V1, f1, Vt1)
(
= G xk , f j , f l , t
)
State
feedback
Actual utilization of
interface FIFOs
X = [ x1 x2 ... x N q ]T
r
xkmin ≤ xk ≤ xkmax
Workload characteristics and application deadlines are critical.
Also, control signals (voltage, frequency) need to be constrained.
V/F controller selects the minimum operating frequencies
s.t. the queues reference values are satisfied
Computational workload
d yi (t )
αi
αi
Traffic variability
d
= ai (t ) yi (t ) + bi (t ) f i (t ) − ci (t ) f j (t ),
x 4j
xj
yi
(t )
xk (t )
(i,j)-th
VFI for
PE
x1j (t )
PE
S
PE
(i,j-1)th VFI
Router
Pdyn, Pleak, PT
Write/Read
= a k (t )xk (t ) + bk (t ) f j (t ) − ck (t ) f l (t ),
dt α k
0 ≤ xkmin ≤ xk (t ) ≤ xkmax ≤ 1, k = 1 ÷ 4, j , l = 1 ÷ N r
dt
0 ≤ yimin ≤ yi (t ) ≤ yimax ≤ 1, i = 1 ÷ N PE
(i,j-1)th VFI
for PE
αk
Router
x 2j (t )
x 3j
(t )
Pdyn, Pleak, PT
S
[P. Bogdan et al. NOCS 2012]
tf
 N PE
(
 1
min   wi yi (t ) − yiref
2
ti 
 i =1 

22
)
2
+
1

z i f i (t )2  +
2

f i min ≤ f i (t ) ≤ f i max , i = 1 ÷ N PE ,
Nr
1
  2 r
j =1
and

j
f j (t )2 +
1
  2 q (x
4
k
k =1
f jmin ≤ f j (t ) ≤ f jmax ,
k
j
2 

(t ) − xkref )
  dt
  
j = 1 ÷ Nr
Accurate mathematical modeling and rigorous optimization
can enable cross-layer power management
Queues utilization at tiles (0,0), (1,1) and (1,2) for a 4×4
mesh NoC running Apache HTTP webserver application
FOC keeps the utilization of all queues below 0.1 by
adjusting the operating frequencies of all PEs and routers.
FOC consumes less power compared to LQR controller
23
Workload analysis should not be an afterthought. In real
applications, traffic is rarely Poisson or stationary
Classical dynamics: Linear Dependence &
Exponential Inter-Event Distribution
dP(a, t )
∝ P ( a, t )
dt
dM 1 (t )
∝ M 1 (t )
dt
d α [tP (a, t )]
Fractal dynamics: Linear Dependence &
Power-Law Inter-Event Distribution
dt α
d α [tM 1 (t )]
dt α
∝ P ( a, t )
Lat
time
Lat
∝ M 1 (t )
time
Statistical properties of the workload have deep implications
in resource allocation, architectural design, scheduling, etc.
24
For thousand core systems distributed approaches for
power management are of crucial importance
PerforPE-1
mance
PerforPE-2
mance
Power
Power
DVFS set
DVFS set
Performance
Performance
Power
Power
DVFS set
DVFS set
PE-3
15-router synch NoC that
connects 22 processing units
PE-4
[F. Clermidy et al., ISSCC’10]
Distributed
objective
function
26
Local
maximization
algorithm
Both centralized and distributed approaches have their
own limitations
Scalability issues due to cost and latency
of long wires. Long synchronization times
V1, f1
V2, f2
q1
Highly scalable but potential
problems in control performance
V1, f1
V2, f2
q1
q3
q2
q3
q2
V3, f3
V3, f3
Need ‘best-of-both-worlds’ between fully-centralized and fullydistributed solutions. This is true for thermal management too.
27
Custom feedback control offers a good compromise
between fully centralized and fully distributed solutions
Performance
Small-world networks
Fully-centralized
Unexplored
Design Space
Fully-distributed
Full (global) feedback (64 feedback channels)
Implementation cost
[S. Garg et al., ISLPED 2010]
28
Video Encoder
Fully-centralized
Local + partially global feedback
Unexplored
feedback channels)
Custom(16
Feedback
Design Space
Control (CF)
Fully-decentralized
Only local feedback
(8 feedback channels)
Implementatio
n Cost
An hierarchy of globally distributed locally centralized
control may help the system self-organize
• Orders of magnitude!
• Gets better with size
System Size
Flat Mesh (nJ)
WiNoC (nJ)
Factor
128
1319
22.57
58x
256
2936
24.02
122x
512
4992
37.48
133x
[Ganguly et al. IEEE Trans. Comp., 2010]
Local control w/ full state information, global control w/
partial information. Small world effects help convergence
29
On-chip communication is essential for multi-kernel OS
Current: Monolithic
(single kernel) OS (e.g., Windows)
User + Applications + Services
Future: Distributed
(multi-kernel) OS (e.g., Barrelfish)
User 1 +
Applications 1
User p +
Services
Applications p
Single shared state & lock OS
Core 1 Core 2
Core 2
On-chip interconnect
OS 1 OS 2
state state
replica replica
OS m
state
replica
Core 1 Core 2
Core n
On-chip interconnect
OS execution as dynamic graphs. Fundamental design constraints
(power, performance) depend on HW-OS-application interaction.
30
OS inter-event times [sec]
OS inter-event times exhibit a self-similar structure over
time. Hurst exponent for some real OS traces is 0.9
Hurst exp = 0.99
Hurst exp = 0.49
Time [sec]
[P. Bogdan et al. TECHON 2012]
31
In summary, thousand core systems offer ample
opportunities to bring science and engineering even closer
Workload analysis should not be
an afterthought, from chip all the
way to the cloud
Application/OS optimization and
resources management represent
the next frontier to conquer
Optimization evolves towards selforganization; new theories and
design paradigms are needed
32
Understanding networks (i.e. “whole”) is crucial for sci&eng.
Lessons learned so far are valuable as we move forward.
Finally…
Collaborators - Paul Bogdan (USC) and Umit Y. Ogras (Intel)
Siddharth Garg (Univ. of Waterloo), Mike Kishinevsky (Intel),
Edward Ma (CMU), Diana Marculescu (CMU), Hiroki Matsutani
(Keio Univ.), Chi-Ying Tsui (HK Univ. Sci. Tech), Qian Zhiliang (HK
Univ. Sci. Tech), Simon Hollis (Univ. of Bristol),
Relevant papers - www.ece.cmu.edu/~sld
Sponsors
Intel Corporation
33
PE
34
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
Questions?
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
L2
R
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
PE
L1
R
R
L2
R
L2
R
L2
R
L2
PE
L1
PE
L1
PE
L1
PE
L2
R
L2
R
L2
R
L2
PE
L1
PE
L1
PE
L1
PE
L2
R
L2
R
L2
R
L2
R
R
L2
R
L2
R
L2
R
L2
PE
L1
PE
L1
PE
L1
PE
L2
R
L2
R
L2
R
L2
PE
L1
PE
L1
PE
L1
PE
R
L2
R
L2
R
L2
L2
Thank you!
R
R
L2
R
L2
R
PE
L1
PE
L1
L2
R
L2
R
PE
L1
PE
L1
L2
R
L2
R
Download