Performance Evaluation of NoC-based Multicore Systems: From Traffic Analysis to NoC Latency Modelling

Zhiliang Qian, Shanghai Jiao Tong University
Paul Bogdan, University of Southern California
Chi-Ying Tsui, The Hong Kong University of Science and Technology
Radu Marculescu, Carnegie Mellon University
In this survey, we review several approaches for predicting the performance of NoC-based multicore systems, starting from traffic models and moving to complex NoC models for latency evaluation. We first review typical traffic models used to represent application workloads in NoCs. Specifically, we review Markovian and non-Markovian (e.g., self-similar or long-range memory dependent) traffic models and discuss their application to multicore platform design. Then, we review the analytical techniques for predicting NoC performance under a given input traffic. We investigate analytical models for average as well as maximum delay evaluation. We also review the developments and design challenges of NoC simulators. One interesting research direction in NoC performance evaluation consists of combining simulation and analytical models in order to exploit their combined advantages. Towards this end, we discuss several newly proposed approaches that use hardware-based or learning-based techniques. Finally, we summarize several open problems and our perspective on addressing these challenges.
Categories and Subject Descriptors: A.1 [General and reference]: Introductory and survey; C.4 [Performance of systems]: Modeling techniques
Additional Key Words and Phrases: Performance evaluation, Network-on-Chips (NoCs), analytical model,
average and maximum delay, simulation, traffic models
1. INTRODUCTION
As IC technology continues to shrink, state-of-the-art computing platforms widely use architectures such as Multi-Processor Systems-on-Chip (MPSoCs) and Chip Multi-Processors (CMPs); Network-on-Chip (NoC) architectures have therefore been proposed in these designs as the future communication infrastructure to manage the information transfer among the Processing Elements (PEs) [Benini and De Micheli 2002; Dally and Towles 2001]. When designing NoC-based multicore platforms, the latency metric usually poses a major challenge during design space exploration [Ogras et al. 2010; Kiasari et al. 2013a]. In order to make a proper design choice, an accurate and fast evaluation of each design candidate, each with a different configuration, is needed [Bogdan and Marculescu 2009; Ogras et al. 2010]. More specifically, to evaluate the
This work was supported in part by the Hong Kong Research Grants Council (RGC) under Grant
GRF619813, in part by the HKUST Sponsorship Scheme for Targeted Strategic Partnerships. P.B. acknowledges the support by the US National Science Foundation (NSF) CAREER Award CPS-1453860 and CyberSEES CCF-1331610.
Authors' addresses: Z. L. Qian, Micro- and Nano- Department, Shanghai Jiao Tong University, Shanghai, 200240, email: qianzl@sjtu.edu.cn; P. Bogdan, Ming Hsieh Department of Electrical Engineering, University of Southern California, 90089, email: pbogdan@usc.edu; C. Y. Tsui, Electronic and Computer Engineering Department, Hong Kong University of Science and Technology, email: eetsui@ust.hk; R. Marculescu, Electrical and Computer Engineering Department, Carnegie Mellon University, 15213, email: radum@ece.cmu.edu.
ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.
Fig. 1. A typical NoC-based MPSoC design flow for the Dual Video Object Plane Decoder (DVOPD) application: a) application task graph for DVOPD [Pullini et al. 2007], where the numbers on the edges represent communication data volume; b) a typical optimization flow for design space exploration [Ogras et al. 2010; Kiasari et al. 2013a]
performance of each feasible candidate, the designer should first characterize the application to extract a specific traffic model [Varatkar and Marculescu 2002; Bogdan and Marculescu 2011]. The traffic model needs to abstract two key features: the source and destination of every flow, and the packet inter-arrival time distribution. For instance, a simple representation of an application uses the application task graph [Hu and Marculescu 2003; 2005] to capture these two features (Fig. 1-a shows an example for the Dual Video Object Plane Decoder (DVOPD) application from [Pullini et al. 2007; Bertozzi et al. 2005]). In this type of representation, a directed edge points from the source to the destination of each flow, while the weight on the edge represents the mean traffic rate under a Poisson inter-arrival time distribution. After pre-characterizing the application, designers need to schedule and allocate the tasks on the proper PEs, followed by the exploration of the PE core placement and the routing algorithm design. For each feasible design configuration, a performance analysis step is performed to evaluate the quality of the design [Ogras et al. 2010] (shown in Fig. 1-b). Towards this end, both simulation-based and analytical models are used in the NoC design flow.
In general, NoC simulators attempt to model the architectural implementation details of NoC routers (i.e., both the data and control paths) and can provide estimates with very high accuracy; such detailed performance evaluation is necessary before prototyping. One limitation is that simulating a large system usually takes a very long time. Moreover, it is not easy to identify the performance bottlenecks (e.g., the choice of buffer size for a specific router) from the simulation results, which makes simulations less powerful and efficient for design space exploration [Ogras et al. 2010; Kiasari et al. 2013a]. Because of these inefficiencies, analytical NoC models are also used in the early design stages to support fast exploration of a large design space. Compared to simulation, analytical models trade some accuracy for flexibility and speed [Ogras et al. 2010; Kiasari et al. 2013a].
In this survey, we review and summarize both the analytical and simulation approaches that have been used to evaluate NoC performance. We first note that NoC performance depends strongly on the traffic/workload applied to the system [Bogdan and Marculescu 2011]. Therefore, in section 2, we first review the types of workload that are widely adopted in the literature; we then review the mathematical approaches that capture the non-Poisson traffic behavior of applications. After obtaining a traffic model to characterize the input, the next step is to use an NoC model to predict the performance. In section 3, we review the analytical NoC models; both average and maximum latency models are discussed. In section 4, we summarize the simulation-based techniques for NoCs of large size. Then, in section 5, we discuss the research direction that combines simulation-based and analytical methods to exploit the advantages of both. Specifically, we review two types of approaches, using learning-based and hardware-based modelling techniques, that speed up the performance evaluation while maintaining prediction accuracy. In section 6, we summarize the open problems in NoC performance evaluation and present our perspective. Finally, we conclude this survey in section 7.

Fig. 2. NoC architectures overview [Dally and Towles 2003] and workload classification based on the traffic pattern (spatial characteristics) and injection method (temporal characteristics)
2. NOC TRAFFIC ANALYSIS
In this section, we first categorize the NoC workload models used in the performance evaluation. Then we summarize the methods to observe and characterize fractal
and non-stationary behaviors. In particular, several representative techniques are reviewed, including the Hurst-parameter based models [Varatkar and Marculescu 2002],
the phase type based models [Kuhn 2013], non-equilibrium statistical physics inspired
models [Bogdan and Marculescu 2011] and multifractal models [Bogdan 2015].
2.1. NoC workload categorization
In general, there are two important aspects when describing an NoC workload [Dally
and Towles 2003], i.e., the traffic patterns and the injection inter-arrival time (IAT) distributions [Bogdan and Marculescu 2011]. The former describes the distribution of the
source and destination nodes of every flow in the topology. The latter dictates when a packet is issued (injected) from the source Processing Element (PE) into the network, as well as the packet inter-arrival processes at the intermediate router buffer channels; it thus represents the temporal characteristics of the traffic [Soteriou et al. 2006]. In Fig. 2, we categorize various NoC workloads based on these two features (i.e., spatial patterns and temporal injection processes). The terminology in the figure follows the conventions and descriptions in [Dally and Towles 2003]. Specifically, for the traffic patterns,
we classify them into synthetic and application-driven patterns. The synthetic traffic patterns use mathematical methods to create the source and sink PE addresses, while application-driven patterns determine the destination of each flow from the mapping results of the applications running on the target CMP/MPSoC platforms. For the injection
processes, mathematical models (e.g., Poisson or fractal models) can be used to dictate the packet IAT distributions. Moreover, the packet IATs can also be extracted from real applications. For example, designers can use trace logs or fully execute the application to determine when to issue the next packet in a flow.
To evaluate an NoC design, the most accurate workloads describe both the traffic pattern and the injection method of real applications. In the table of Fig. 2, the crossed entry of realistic pattern (column) and injection method (row) is termed the application-driven workload. It can be further classified into execution-driven and trace-driven workloads. For the execution-driven workload, the traffic is produced by modelling the PEs together with the NoC. Specifically, the processor executes the whole application program and decides when to issue a packet at run time. Because such an injection method requires a detailed model of both the processors and the NoC, this type of evaluation is also known as full-system simulation. In summary, full-system simulation achieves the highest accuracy; however, it requires a very long evaluation time. Moreover, this methodology is not very flexible: if a parameter (e.g., the number of virtual channels or the buffer size) changes, the whole evaluation needs to be performed again, which introduces additional time overhead during design space exploration [Ogras et al. 2010]. To improve the speed, trace-based workloads have also been used. In the trace-based approach, after performing the full-system simulation once, the traffic traces (e.g., a file recording the exact time/cycle when a PE injects a cache-coherence or request/response packet) are logged. Then, during subsequent simulations, the packets are issued into the network following the trace logs; the execution models of the processors are no longer required.
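The trace-replay idea described above can be sketched as follows; the trace format (cycle, source PE, destination PE) is a simplified, hypothetical one, and real tools such as Netrace store much richer records:

```python
# Sketch of trace-driven injection. Each record logs the cycle at which a
# packet was injected during a full-system simulation, plus its source and
# destination PEs; the network-only simulation replays these records.

def replay_trace(trace, num_cycles):
    """Yield (cycle, packets_to_inject) pairs for a network-only simulation."""
    trace = sorted(trace, key=lambda rec: rec[0])  # order records by cycle
    idx = 0
    for cycle in range(num_cycles):
        injected = []
        while idx < len(trace) and trace[idx][0] == cycle:
            _, src, dst = trace[idx]
            injected.append((src, dst))
            idx += 1
        yield cycle, injected

# Hypothetical trace: (cycle, source PE, destination PE)
trace = [(0, 0, 5), (3, 2, 7), (3, 1, 4)]
schedule = {c: pkts for c, pkts in replay_trace(trace, 5) if pkts}
```

Note that a replay of this kind preserves only the logged injection times, not the packet dependencies; preserving dependencies requires the pre-analysis discussed above.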
One limitation of using application-driven workloads for MPSoC platforms is that users need to provide detailed application information at the very beginning. For general-purpose CMP platforms, there already exist a number of benchmark suites, such as Parsec [PARSEC 2009], Splash-2 [SPLASH-2 1995] and Spec [Spradling 2007], that provide representative applications in different domains (e.g., high performance computing, image/video processing). However, it is difficult to extrapolate the performance from these benchmarks to a new application class which may have very different characteristics (e.g., different levels of burstiness) [Soteriou et al. 2006; Gratz and Keckler 2010]. Another limitation of trace-based workloads is that the dependencies among the packets may change when the traces are applied to a new design [Hestness and Keckler 2010]. Therefore, researchers have found it better to pre-analyze the collected traces before applying them in new evaluations. The Netrace tool tries to address this dependency problem and improve the evaluation accuracy [Hestness and Keckler 2010]. This is achieved by performing a dependency analysis on the traces: the identified dependency relationships among packets are kept in the subsequent network-only simulation by preserving the order in which packets are injected into the NoC. Similarly, in [Mahadevan et al. 2005], the dependencies among the packets are kept by letting a new transaction happen only after the PE has received all responses from its dependent processors.
As shown in Fig. 2, in addition to application-driven workloads, mathematical (or
statistical) models are also widely used to describe the traffic patterns or packet inter-arrival times. For the traffic patterns, they can either be abstracted from applications
and represented in a graph or be synthetically created without using any application
information. For the packet inter-arrival times, they can be described using different
models (e.g., Poisson or self-similar models). One advantage of mathematical injection
methods is that by controlling a few parameters (e.g., the Hurst parameter) in the model, a variety of traffic inputs with user-desired properties (e.g., self-similarity) can be created. In the following, we review these traffic models. In particular, we elaborate on the memory-dependent models (e.g., fractal or self-similar) used in NoC traffic analysis.
2.2. Traffic pattern characterization
To describe the distributions of the traffic sources and destinations, the application
task graphs and synthetic traffic patterns are two widely used methods [Marculescu
and Bogdan 2009]. In Fig. 2, they belong to the application-driven and synthetic pattern, respectively. Specifically, the task graphs are structures used to represent some
profiled applications. In a task graph G = (C, A), the vertex set C represents all computation tasks and each directed edge ai,j in A represents the communication flow from
vertex ci to cj. The weights on the edges characterize the communication data volume (bits) between tasks [Hu and Marculescu 2003]. After mapping the tasks onto the proper PEs, the architecture characterization graph ARCG = (T, P) further describes the communications among the PEs in the NoC [Hu and Marculescu 2003]. There are several popular task-graph based benchmarks characterizing different multimedia and
communication applications in NoC, including the MMS (Multimedia system) [Hu and
Marculescu 2004b], PIP (Picture in picture), MWD (Multi-window detection), MPEG
(MPEG decoder) [Bertozzi et al. 2005], DVOPD (Dual Video Object Plane Decoder)
[Pullini et al. 2007], H264D (H264 decoder) [Liu et al. 2011] and LDPC (Low density
parity check encoder/decoder) [Liu et al. 2011; Wang et al. 2014] applications.
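As a small illustration of the task-graph abstraction, the weighted-edge structure of G = (C, A) can be sketched as below; the task names echo Fig. 1, but the weights are illustrative and not taken from any cited benchmark:

```python
# Sketch of a task graph G = (C, A): each key is a directed edge (c_i, c_j)
# and each value is the communication volume (bits) on that edge.
# Task names and volumes below are illustrative only.
task_graph = {
    ("vld", "rld"): 357,
    ("rld", "iquant"): 313,
    ("iquant", "idct"): 362,
}

def out_volume(graph, task):
    """Total communication volume (bits) sent by a task over all its out-edges."""
    return sum(w for (src, _dst), w in graph.items() if src == task)

def tasks(graph):
    """The vertex set C, recovered from the edge set A."""
    return {t for edge in graph for t in edge}
```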
In addition to using task graphs, synthetic traffic generators, which operate on the
source and target PE addresses to produce a variety of artificial patterns, are also
used. For example, in the random traffic pattern, every PE in the system chooses a destination with equal probability, which in general creates uniform traffic loads across the different channels in the system. For uneven traffic, permutations on the node addresses are used [Gratz and Keckler 2010]. For instance, the destination address can be obtained by exchanging the X- and Y-coordinates of the source address (i.e., transpose traffic); the permutation can also be performed at bit level, such as reversing the order of the bits (i.e., bit-reversal) or complementing each bit (i.e., bit-complement)
[Dally and Towles 2003]. Compared to task graphs and application-driven workloads,
synthetic traffic patterns provide more artificial traffic scenarios to evaluate a design
[Gratz and Keckler 2010].
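A minimal sketch of the address permutations mentioned above, for a network with 2^bits nodes (the function names are ours, not from the cited works):

```python
def transpose(src, bits):
    """Swap the X and Y halves of a node address (transpose traffic)."""
    half = bits // 2
    x = src & ((1 << half) - 1)   # low half of the address
    y = src >> half               # high half of the address
    return (x << half) | y

def bit_reversal(src, bits):
    """Reverse the order of the address bits (bit-reversal traffic)."""
    out = 0
    for i in range(bits):
        out = (out << 1) | ((src >> i) & 1)
    return out

def bit_complement(src, bits):
    """Complement every address bit (bit-complement traffic)."""
    return src ^ ((1 << bits) - 1)
```

For example, in a 16-node network (bits = 4), node 0b0111 sends to 0b1101 under transpose and node 0b0101 sends to 0b1010 under bit-complement.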
As shown in Fig. 2, for both task-graph based and synthetic-based patterns, their
injection processes can be memoryless or memory-dependent. To describe the packet
injection process, let a random variable x represent the packet inter-arrival time (IAT)
of the target flow. The simplest model to describe the distribution of x is the Poisson injection process, under which x follows a negative exponential distribution, i.e., Px(t) = λe^(−λt), where Px(t) is the probability density function and λ is the mean packet rate (packets/cycle). In general, Poisson traffic models are memoryless workloads: the probability distribution does not depend on previous states. One limitation is that they cannot reflect burstiness and packet dependencies [Wu et al. 2010; Kouvatsos et al. 2005]. Therefore, statistical techniques which consider non-Poisson IAT distributions are needed. Specifically, these models are built upon fractal/self-similar traffic models to capture the short-range and long-range memory dependencies among the IATs. In the following section, we survey these techniques.
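For reference, drawing packet IATs from the memoryless Poisson model above can be sketched as:

```python
import random

def poisson_iats(lam, n, seed=0):
    """Sample n inter-arrival times from Px(t) = lam * exp(-lam * t),
    i.e., a memoryless Poisson injection process with mean rate lam
    (packets/cycle) and mean IAT 1/lam (cycles)."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(n)]

iats = poisson_iats(0.1, 10000)    # mean rate 0.1 packets/cycle
mean_iat = sum(iats) / len(iats)   # should be close to 1/lam = 10 cycles
```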
2.3. Techniques to model and analyze mono- and multi-fractal traffic
In this section, we summarize the techniques to model fractal/self-similar NoC traffic.
To unify the symbols across the different approaches, the notations in Table I are used throughout this section.
Table I. Symbols and notations used in the traffic analysis

x : A random variable (e.g., representing the packet inter-arrival time (IAT))
Px(t) : Probability density function (PDF) of x, i.e., the probability density of x = t
Fx(T) : Cumulative distribution function (CDF) of x, i.e., the probability of x ≤ T
X : A random process, either continuous time (index variable t) or discrete time (index variable n)^1
X(n) : The nth indexed random variable in the random process X
X^m : A discrete time random process whose elements are the random variables averaged over m consecutive data in X
X^m(n) : The nth indexed random variable in the process X^m
E(X) : Expectation (mean) of the random process X
σ²(X) : Variance of the random process X; σ(X) is the standard deviation
σ²(X^m)/Var(m) : Variance of the random process X^m, which is a function of the parameter m
rX(l) : Autocorrelation function of the random process X at lag size l
fX(w)/fX(w, H) : Spectral density function of X; w is the frequency parameter; H is the Hurst parameter
n(t) : A random variable representing the number of packets generated during the interval [0, t)
Ni(t) : Average number of router channels in the NoC that have i packets at time t

^1 In section 2.3.1, we assume X is second-order stationary (wide-sense stationary), i.e., the first two moments of X (mean and covariance) do not depend on the specific index variable t (for continuous time) or n (for discrete time).
2.3.1. Review of LRD/fractal/self-similar traffic. Self-similarity, fractal behavior and long-range dependence (LRD) are concepts that have been developed from observing natural data sequences with memory dependencies between two different time instances or indexes [Bogdan et al. 2010; Yoshihara et al. 2001; Ryu and Lowen 2000]. There are several ways to formally characterize self-similar/fractal/LRD traffic flows. For example, suppose a set of random variables is used to represent the packet arrivals of the same flow over time (i.e., a sequence of random variables x denoting the packet inter-arrival times). This random variable sequence forms a discrete time random process X. The random variable X(n) with index n represents the IAT between the nth and
(n + 1)th packet. When analyzing such a random process X, we usually start from the wide-sense stationary (also called second-order stationary) assumption [Yoshihara et al. 2001], which means the first moment (i.e., the mean) and second moment (i.e., the covariance) of X do not depend on the specific index choices of the random variables. For example, for a discrete time random process X, by definition the mean of X is in general a function of n, where the nth element corresponds to the expected value of X(n). Under the wide-sense stationary condition, the means at different indexes are the same; therefore the mean can be represented by a single parameter E(X).
Based on the above assumption, long-range dependency is defined by observing the correlation between two random variables in X whose indexes are l lags apart. Formally, the autocorrelation function rX(l) of X is defined as [Varatkar and Marculescu 2004; Park and Willinger 2000]:
rX(l) = E[(X(n) − E(X))(X(n + l) − E(X))] / σ²(X)    (1)
where X(n) and X(n + l) are two random variables at indexes n and n + l, respectively. Under the second-order stationary condition, the autocorrelation rX(l) only depends on l and does not rely on the choice of the instance n. Using Eqn. 1, rX(l) can be computed for different choices of the lag l.
Intuitively, rX(l) reveals how the current data is affected by its neighbors that are l lags away. For short-range dependent (SRD) series, rX(l) decreases exponentially with l, while for long-range dependent (LRD) random processes, rX(l) decreases much more slowly. Theoretical derivations have shown a close-to-power-law decrease with respect to the lag length l [Varatkar and Marculescu 2004; Park and Willinger 2000].
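A sample estimate of the autocorrelation in Eqn. 1 for a finite trace can be sketched as:

```python
def autocorrelation(x, lag):
    """Sample estimate of r_X(l) (Eqn. 1) for a finite series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var

# An alternating series is perfectly anti-correlated at lag 1
# and perfectly correlated at lag 2.
alternating = [1.0, -1.0] * 50
r1 = autocorrelation(alternating, 1)
```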
Besides examining the autocorrelation function rX(l), the dependency among the data samples can also be captured by the concepts of "self-similarity" or "fractal" behavior. These two definitions come from observing the random process X^m at different scale levels [Varatkar and Marculescu 2004]. More precisely, for each scale level m, where m is an integer chosen by the observer, a new process X^m can be produced by setting each indexed random variable X^m(n) equal to:

X^m(n) = (1/m) Σ_{i=1}^{m} X((n − 1)m + i)

The variance of X^m is then calculated as a function of the scale level m. In Table I, this variance σ²(X^m) is also written Var(m). Of note, when calculating Var(m) for an independent or SRD sequence X, taking the average of every m samples in X smooths the data; consequently, the variance of X^m decreases quickly (for i.i.d. data, Var(m) decays as m^(−1)). On the other hand, for self-similar X, increasing the scale level m does not significantly reduce the variance of the new sequence X^m. This is reflected in Var(m), which decreases much more slowly, as Var(m) ∼ m^(−β), where β typically lies in the range (0, 1) [Varatkar and Marculescu 2002].
The third way to describe an LRD process X is to transform the series and observe it in the frequency domain. Formally, let f(w) represent the spectral density of the random process X; for fractal/self-similar X, it has been shown that f(w) ∼ b·w^(−γ) as w → 0 [Varatkar and Marculescu 2004].
Based on the above discussions, two methods have been widely used to examine whether a time series is long-range dependent or self-similar^1. The detailed derivations are provided in [Varatkar and Marculescu 2002; 2004; Min and Ould-Khaoua 2004]; here we just highlight the main ideas. The first method is the variance-time analysis, which plots the curve of log(Var(m)) against log(m). For an LRD/self-similar process X, it has been shown that log(Var(m)) decreases linearly with log(m) and the slope −β satisfies 0 < β < 1. The second method is based on the Hurst effect. Specifically, it calculates the "rescaled adjusted range" statistics (i.e., the R/S statistics [Qian and Rasheed 2004]) of the random process X; the Hurst parameter H is then calculated as the growth rate of the R/S statistics with respect to the data sequence size n. By examining the value of H, the dependencies among the time series can be described quantitatively: for memoryless time series, H = 0.5, while for LRD sequences, 0.5 < H < 1. Moreover, it can be observed that the Hurst parameter H from the R/S method equals 1 − β/2 from the variance-time method.
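The variance-time analysis above can be sketched as follows; this is a minimal estimator (a simple least-squares fit over a few scale levels), not the exact procedure of any cited work:

```python
import math
import random

def aggregated_variance(x, m):
    """Variance Var(m) of the level-m aggregated series X^m."""
    blocks = [sum(x[i * m:(i + 1) * m]) / m for i in range(len(x) // m)]
    mean = sum(blocks) / len(blocks)
    return sum((b - mean) ** 2 for b in blocks) / len(blocks)

def hurst_variance_time(x, scales=(1, 2, 4, 8, 16)):
    """Fit the slope of log Var(m) vs log m; the slope is -beta, so H = 1 - beta/2."""
    pts = [(math.log(m), math.log(aggregated_variance(x, m))) for m in scales]
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    slope = (sum((px - mx) * (py - my) for px, py in pts)
             / sum((px - mx) ** 2 for px, _py in pts))
    return 1.0 + slope / 2.0   # slope = -beta  =>  H = 1 - beta/2

# Demo: i.i.d. Gaussian noise is memoryless, so the estimate should be near 0.5.
rng = random.Random(1)
h_est = hurst_variance_time([rng.gauss(0.0, 1.0) for _ in range(20000)])
```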
2.3.2. Generative traffic models for self-similar traffic. The variance-time and R/S statistic methods are useful for analyzing time sequences; however, they cannot be used to generate self-similar traffic. To produce fractal traffic, two kinds of models are widely used. The first type utilizes only one parameter (e.g., the Hurst parameter H) to represent the dependency and self-similarity (e.g., [Varatkar and Marculescu 2004] and [Soteriou et al. 2006]), while the second type is based on the "phase method", which uses different phases to describe the packet generation process [Kuhn 2013]. Examples are the Generalized Exponential (GE) process [Wu et al. 2010], the Markov-Modulated Poisson Process (MMPP) [Kiasari et al. 2013b; Fischer and Meier-Hellstern 1993] and the more general Markovian Arrival Process (MAP) [Diamond and Alfa 2000; Klemm et al. 2002]. In the following, we summarize the principles of these two model types.
^1 For non-stationary random processes, the "detrended fluctuation analysis" (DFA) [Kantelhardt et al. 2002] should be applied, which uses a regression function to pre-fit the temporal trend over the data. The reader can find more details and a step-by-step tutorial on implementing DFA in MATLAB in [Ihlen 2012].
Fig. 3. Procedures of FGN-based synthetic traffic generation [Varatkar and Marculescu 2004]
1) Hurst-parameter-based modeling: In [Varatkar and Marculescu 2002; 2004], the authors were the first to propose capturing the dependencies in NoC traffic by using the Hurst parameter H. To produce a synthetic packet sequence with a user-specified self-similarity level H (0.5 < H < 1), the Fractional Gaussian Noise (FGN) model [J.Beran 1994] is used. The procedure of applying the FGN model to generate synthetic NoC traffic traces is shown in Fig. 3 and can be summarized as follows: the inputs to the generative model are the user-specified H value, the length n of the total sequence and the desired packet inter-arrival time (IAT) distribution Fx(t). The first step of the overall procedure is to produce a data series that has self-similarity level H; this is done by sampling the FGN spectrum f(w, H) (w is the frequency component), which is given by [Paxson 1997]:
f(w, H) = A(w, H) × [|w|^(−2H−1) + B(w, H)],  H ∈ (0, 1), w ∈ (−π, π)    (2)
In Eqn. 2, A(w, H) and B(w, H) are specific frequency functions whose closed forms are derived in [Paxson 1997]. Computing f(w, H) exactly requires a summation of infinitely many terms due to the function B(w, H). Following the steps in [Paxson 1997], a simple approximation of f(w, H) with finitely many summation terms is obtained and denoted g(w, H). The continuous power spectrum g(w, H) is then used as the input of the subsequent sampling procedure (i.e., Step 2 in Fig. 3). After sampling, a Discrete Fourier Transform (DFT) spectrum with n/2 points is obtained. Based on the symmetry of the spectrum, the DFT components can be mirrored and extended to size n. Transforming these n frequency-domain points back via the inverse Discrete Time Fourier Transform (IDTFT) creates an n-point time series X* (Step 4). This time series has the desired Hurst parameter H. However, since the FGN spectrum is used in the sampling step, the output data sequence has a zero-mean Gaussian distribution. Consequently, we need to match the distribution of X* with the desired cumulative distribution function Fx(t). To achieve this, the following transformation is used to create the nth data point X(n) from X*(n) [Varatkar and Marculescu 2004]:
to create the nth data X(n) from X ∗ (n) [Varatkar and Marculescu 2004]:
X(n) = Fx^(−1)(Fx*(X*(n)))    (3)

where Fx*(t) is the cumulative distribution function (CDF) of the sequence X* after step 4, and X is the newly created time series with X(n) denoting the nth packet IAT. The
generated traffic trace can then be fed into NoC simulators to evaluate the latency
performance under the impacts of fractal traffic inputs.
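As a partial sketch, the distribution-matching step of Eqn. 3 can be illustrated as below; the FGN spectrum sampling of Steps 1-3 is omitted, and we assume a zero-mean, unit-variance Gaussian X* and an exponential target IAT distribution (both assumptions are ours for illustration):

```python
import math
import random

def gaussian_to_iats(xstar, lam):
    """Eqn. 3 sketch: X(n) = Fx^{-1}(Fx*(X*(n))). Assuming X* is a zero-mean,
    unit-variance Gaussian series, Fx* is the standard normal CDF; taking an
    exponential target IAT distribution, Fx^{-1}(u) = -ln(1 - u) / lam.
    The mapping is monotonic, so the ordering (and hence the correlation
    structure) of X* is preserved while the marginal distribution is matched."""
    out = []
    for v in xstar:
        u = 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))  # standard normal CDF
        u = min(u, 1.0 - 1e-12)                          # guard against log(0)
        out.append(-math.log(1.0 - u) / lam)             # inverse exponential CDF
    return out

rng = random.Random(0)
xstar = [rng.gauss(0.0, 1.0) for _ in range(20000)]
iats = gaussian_to_iats(xstar, 0.5)   # target mean IAT = 1/lam = 2.0
```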
2) Phase-type random process based traffic models: Phase-type models are generative methods widely adopted in queuing analysis to reproduce the customer arrival or service process using multiple stages (phases) [Kleinrock 1975]. More precisely, each phase is a simpler random process (e.g., a Markovian
process) dictating the time interval distribution of the current stage. After each such time interval, the process may switch to a next stage (phase) with a certain probability, or directly enter the ending (absorption) state, which means the generative process (e.g., for the inter-arrival time or the service time) has finished for the current customer [Kuhn 2013]. In the following, we review three such models.
In [Kouvatsos et al. 2005; Wu et al. 2010], the Generalized Exponential (GE) traffic model is used in queuing network analysis. Specifically, a direct packet generation path with zero inter-arrival time is added to the Poisson IAT model. In this way, the specific time intervals of packets can come from the normal memoryless model with probability τ or, alternatively, from the direct path with probability 1 − τ [Wu et al. 2010]. By controlling the parameter τ (i.e., controlling the number of packets issued in a burst with zero intervals), different levels of arrival burstiness can be modelled.
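A minimal sampler for this two-path GE model can be sketched as below. The convention of using rate τλ on the exponential branch (so that the overall mean IAT stays at 1/λ) is an assumption consistent with the standard GE parameterization, not a detail stated in the text.

```python
import numpy as np

def ge_interarrival_times(n, lam, tau, seed=1):
    """Sketch of the GE arrival model described above: with probability
    1 - tau a packet follows the previous one back-to-back (zero IAT,
    i.e., part of a burst); with probability tau the IAT is drawn from
    the memoryless path. Rate tau * lam on the exponential branch keeps
    the overall mean IAT at 1/lam (an assumed convention)."""
    rng = np.random.default_rng(seed)
    bursty = rng.random(n) < (1.0 - tau)              # packets issued with zero interval
    exp_iat = rng.exponential(1.0 / (tau * lam), n)   # memoryless branch
    return np.where(bursty, 0.0, exp_iat)

# Mean IAT stays near 1/lam = 10 cycles while tau controls the burstiness.
iats = ge_interarrival_times(100000, lam=0.1, tau=0.5)
```

Lowering τ lengthens the zero-interval bursts while the long-run throughput is unchanged, which is exactly the knob the GE model exposes.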
In [Kiasari et al. 2013b; Yoshihara et al. 2001], the traffic burstiness is proposed to be modelled as a Markov Modulated Poisson Process (MMPP). In these works, an m-state MMPP is made up of m separate Markovian phases. Each stage i describes a negative exponential IAT distribution with a mean value of 1/λi (λi is the average packet generation rate in state i). To represent the transition rates (or, equivalently, transition probabilities) between any two states, a matrix Q is used. Specifically, the entry in row i and column j of Q (i.e., qij) corresponds to the rate (or probability) of changing from phase i to phase j [Kleinrock 1975].
In MMPP traffic models, a function n(t), which represents the total number of packets generated during the time interval (0, t], is used to determine the λi of each state and the qij in matrix Q [Yoshihara et al. 2001]. Specifically, to fit the traffic traces using a 2-state MMPP-based traffic model, the mean E(n(t)), the variance σ²(n(t)), as well as the Index of Dispersion for Counts (IDC) IDC(n(t)) of the random function n(t) are first measured for the application traffic; then, the fitting procedure is performed by matching the measured statistics with the closed-form formulas of these metrics derived from a stochastic 2-state MMPP model [Shahram and Tho 2000; Yoshihara et al. 2001].
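Once λ1, λ2 and the off-diagonal entries of Q have been fitted, a 2-state MMPP trace can be generated by simulating the underlying continuous-time Markov chain. The sketch below uses illustrative parameter values; the competing-exponentials step is valid because both the time to the next phase switch and the time to the next packet are memoryless.

```python
import numpy as np

def mmpp2_arrivals(t_end, lam, q01, q10, seed=2):
    """Sketch of a 2-state MMPP source: lam = (lam0, lam1) gives the Poisson
    packet rate in each phase, and q01 / q10 are the phase transition rates,
    i.e., the off-diagonal entries of matrix Q. All values are illustrative."""
    rng = np.random.default_rng(seed)
    switch_rate = (q01, q10)
    t, state, arrivals = 0.0, 0, []
    while t < t_end:
        # The next phase switch and the next packet are competing exponential
        # events; by memorylessness, whichever fires first decides.
        t_switch = rng.exponential(1.0 / switch_rate[state])
        t_pkt = rng.exponential(1.0 / lam[state])
        if t_pkt < t_switch:
            t += t_pkt
            if t < t_end:
                arrivals.append(t)      # emit a packet, stay in the same phase
        else:
            t += t_switch
            state = 1 - state           # change phase, no packet emitted
    return np.asarray(arrivals)

# A bursty phase (rate 0.5) alternating with a quiet phase (rate 0.05).
arrivals = mmpp2_arrivals(20000.0, lam=(0.5, 0.05), q01=0.01, q10=0.02)
```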
The 2-state MMPP traffic model has been widely used due to its simplicity, although it sometimes still introduces certain errors when fitting some applications [Shahram and Tho 1998]. To better capture traffic self-similarity, multiple-Markovian-state modelling techniques were developed in which the traffic is modelled by a mixture of several Interrupted Poisson Processes (IPPs) [Yoshihara et al. 2001; Min and Ould-Khaoua 2004].
In [Yoshihara et al. 2001], a multiple-state MMPP traffic model was developed which consists of d (d > 1) IPPs. More specifically, the d-state MMPP traffic model is obtained by aggregating several 2-state MMPP processes described previously. Compared to the case of a 2-state MMPP, there are more parameters that need to be derived in the d-state model. The traffic trace is pre-processed under d different time scales. For each scale level m (e.g., m = 1, 2, ..., d), a new data series is generated by averaging the original sequence with a time window of size m. Then, the parameter fitting procedure is performed on the obtained d sequences separately.
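The per-scale pre-processing step is simply a windowed average, sketched below for d = 3 scales:

```python
import numpy as np

def aggregate(series, m):
    """Average the original sequence over non-overlapping windows of size m,
    producing the scale-m series used when fitting the d-state model."""
    series = np.asarray(series, dtype=float)
    k = len(series) // m                   # drop a trailing partial window
    return series[:k * m].reshape(k, m).mean(axis=1)

x = np.arange(12, dtype=float)
scales = {m: aggregate(x, m) for m in (1, 2, 3)}   # d = 3 time scales
# scales[3] -> [1., 4., 7., 10.]
```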
Besides MMPP models, the more general Markov Arrival Process (MAP) has also been used in traffic analysis [Klemm et al. 2002]. In summary, the fitting algorithm of such a traffic model is still based on matching the statistical moments obtained analytically with the measurements from given traffic traces. For example, in [Casale et al. 2008], a MAP traffic model is obtained by matching the first three moments of the IAT as well as the correlation between X(n) and X(n + 1) (i.e., E[X(n) × X(n + 1)]) in the analytical model with the real traffic measurements.
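The measurement side of this moment-matching procedure is straightforward; the sketch below computes the four statistics named above from a trace (the closed-form MAP side of the match is beyond this snippet).

```python
import numpy as np

def map_fit_targets(iats):
    """Statistics matched in MAP fitting as in [Casale et al. 2008]: the
    first three raw moments of the IAT and the lag-1 joint moment
    E[X(n) * X(n+1)] measured from the trace."""
    x = np.asarray(iats, dtype=float)
    moments = [float(np.mean(x ** k)) for k in (1, 2, 3)]
    lag1 = float(np.mean(x[:-1] * x[1:]))   # E[X(n) X(n+1)]
    return moments, lag1

moments, lag1 = map_fit_targets([1.0, 2.0, 3.0, 4.0])
# moments -> [2.5, 7.5, 25.0]; lag1 -> (1*2 + 2*3 + 3*4) / 3 = 20/3
```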
Compared to Hurst-parameter-based traffic modeling, phase-type traffic models are compatible with many existing queuing models (such as the MMPP/G/1 and MAP/G/1 queues [Min and Ould-Khaoua 2004]). They can be used in a performance model to evaluate NoC latency under specific traffic characteristics. However, for highly LRD traffic, it is usually very difficult to obtain a closed-form solution in queuing theory. Moreover, most phase-type traffic models are based on a stationary or wide-sense stationary IAT assumption; therefore, they do not model non-stationary inter-arrival processes well. In the next sections, we survey the techniques proposed for multi-fractal NoC traffic analysis.

Fig. 4. An example illustrating the non-stationary traffic characteristics in a 4×4 mesh NoC (from [Bogdan and Marculescu 2010; 2011]): a) the packet header flit inter-arrival time distribution at a channel port, compared with a Poisson process; b) the third-order moment versus two time lags derived from the FGN model; c) the actual third-order moment calculated from the application traffic.
2.3.3. Motivations for multi-fractal traffic modeling. Many existing NoC traffic models assume that the random process X is wide-sense stationary (WSS). Consequently, the traffic models built upon techniques such as FGN (shown in Fig. 3) can be classified as mono-fractal analysis, where the traffic correlations between time instances t and t + δt can be represented using a single, unified parameter H [Bogdan and Marculescu 2010]: |X(t + δt) − X(t)|^q ∼ (δt)^(qH). Recently, profiling results of real NoC applications have shown a different, non-stationary behavior due to the dependencies among different PEs as well as the request-response correlations of the packets [Bogdan and Marculescu 2011; 2010]. As a result, many traffic traces are non-stationary and cannot be fully characterized by a single Hurst exponent. Instead, they are better characterized by a set of scaling exponents [Bogdan and Marculescu 2010]: |X(t + δt) − X(t)|^q ∼ (δt)^g(q), where q is the moment index described in [Bogdan and Marculescu 2011; 2010] and g(q) is a function of q which is non-linear in general. Based on these observations, one interesting and on-going research direction in NoC traffic modeling is to extend the existing single-exponent (i.e., mono-fractal) traffic models to more accurate multi-fractal models.
Fig. 4-a shows an example which compares the Poisson, mono-, and multi-fractal traffic models [Bogdan and Marculescu 2010; 2011]. In this example, a sequence of packet inter-arrival times (IAT) is used for the analysis². Fig. 4 shows that, when observing a specific router in the NoC, the IAT may have a long-tailed form, while the conventional Poisson model cannot capture this kind of long-range dependency. To capture the burstiness, the Fractal Gaussian Noise (FGN) model described in Section 2.3.1 can be employed. That model is usually accurate in matching the first two moments of the input traffic trace. However, [Bogdan and Marculescu 2010] further compared the

² The IAT distribution is plotted for the west input port of a node in a 4 × 4 mesh NoC under the execution of a multi-threaded NoC application [Bogdan and Marculescu 2010; 2011].
Fig. 5. The multifractal spectrum showing the multifractal nature of realistic NoC applications (from [Bogdan 2015]): a) the multifractal spectrum of 3 cores in the Intel SCC platform executing SPEC applications; b) the multifractal spectrum of 7 cores in an 8 × 8 NoC executing a PARSEC application.
third-order cumulant C3, which is defined as:

C3 = E[X(t1)X(t2)X(t3)] − E[X(t1)]E[X(t2)X(t3)] − E[X(t2)]E[X(t1)X(t3)] − E[X(t3)]E[X(t1)X(t2)] + 2E[X(t1)]E[X(t2)]E[X(t3)]    (4)
where t1, t2 and t3 are three time instances of the overall data sequence Xt. For an IAT distribution fitted by a Gaussian random process, the plot of C3 with respect to the two intervals t2 − t1 and t3 − t1 (i.e., time Lag 1 and Lag 2 in the figure) is shown in Fig. 4-b: C3 is zero across different combinations of t2 − t1 and t3 − t1 in the FGN model. The actual third-order moment of the IAT is plotted in Fig. 4-c. It can be observed that the C3 of the actual trace collected from simulations has several peaks and is non-uniform in general, which is not captured by the mono-fractal Gaussian model. This motivates the development of a new multi-fractal analysis framework for non-stationary traffic modelling.
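The comparison in Fig. 4-b/c can be reproduced empirically. The sketch below estimates the third-order cumulant of Eqn. 4 for one lag pair by averaging over all positions t1 (a stationarity assumption for the estimator itself), and checks the Gaussian baseline.

```python
import numpy as np

def third_cumulant(x, lag1, lag2):
    """Empirical estimate of Eqn. 4 with t2 - t1 = lag1 and t3 - t1 = lag2,
    averaging over all valid positions t1 (a stationarity assumption)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - max(lag1, lag2)
    a, b, c = x[:n], x[lag1:lag1 + n], x[lag2:lag2 + n]
    return (np.mean(a * b * c)
            - np.mean(a) * np.mean(b * c)
            - np.mean(b) * np.mean(a * c)
            - np.mean(c) * np.mean(a * b)
            + 2.0 * np.mean(a) * np.mean(b) * np.mean(c))

# For a Gaussian (FGN-like) series, C3 stays near zero at every lag pair,
# matching the flat surface of Fig. 4-b; peaks in real traces signal
# non-Gaussian, multi-fractal behavior.
rng = np.random.default_rng(4)
c3 = third_cumulant(rng.standard_normal(200000), 1, 2)
```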
2.3.4. Multifractal spectrum of real applications. One useful method to distinguish a multifractal data series from a mono-fractal sequence is to build the multifractal spectrum (also called the singularity spectrum), which plots the singularity dimension D(h) against the Hölder exponent h [Lopes and Betrouni 2009]. Intuitively, the Hölder exponent h dictates an additional (t − ti)^h term when using expansions to approximate the data series around a point ti, while D(h) is the fractal dimension of the set of points having the same h [Physionet 2004]. For mono-fractal or memoryless series, the multifractal spectrum has a narrow width. For multifractal sequences, the multifractal spectrum spans a variety of h values [Ihlen 2012]. In [Bogdan 2015], the author plotted the multifractal spectrum for two realistic NoC applications (shown in Fig. 5). Specifically, Fig. 5-a is the multifractal spectrum observed from three cores of the Intel Single-Chip Cloud Computer (SCC) platform executing a SPEC MPI application, and Fig. 5-b is the spectrum for an 8 × 8 mesh NoC running a PARSEC application. Similar observations have been reported for different PARSEC applications [Bogdan and Xue 2015]. From these spectra, we can conclude that many realistic NoC applications indeed have multifractal characteristics, and that a set of exponents characterizes the traffic behavior better than a single Hurst parameter.
2.3.5. Modeling the system dynamics using a mean-field approach. To address the non-stationary property, a mean-field (MF) traffic modelling technique is used in [Bogdan and Marculescu 2011; 2010]. In their approach, the packet transmission in the NoC is modelled using a random graph (RG). In an RG, the nodes represent the buffer channels in the routers and the edges represent the application packets that are transferred between the RG nodes. At run time, the IN and OUT degrees of each node in the RG correspond to the number of incoming and outgoing customers (packets), respectively. Of note, the randomness of the graph refers to the fact that the connections in the graph change dynamically depending on the arrival and departure of packets at the channels [Bogdan and Marculescu 2011]. Based on these concepts, the authors derive that the IN degree of the nodes in the RG satisfies the following equation [Bogdan and Marculescu 2010]:
∂Ni/∂t = [pηi(t)/Mi(t)] f1(Ni−1, Ni) − [rθi(t)/Zi(t)] f2(Ni, Ni+1)    (5)
where Ni(t) represents the average number of buffer channels at time t whose IN degree equals i (i.e., having i arrived packets). In Eqn. 5, the first term considers the effects of packets arriving at a buffer channel. Specifically, by the definition of the RG, an edge is connected to a node if a packet reaches the corresponding buffer node. Therefore, at time t, if a packet arrives at any one of the Ni−1 nodes which have an IN degree of i − 1, then Ni should increase by one at t + δt³. On the other hand, if a packet arrives at any one of the Ni nodes that already have i packets, Ni will decrease by one and Ni+1 is augmented by one as a result. Thus, the function f1(Ni−1, Ni) in Eqn. 5 models the packet arrival effects by first adding the number of nodes in the RG that enter the Ni state from Ni−1 and then subtracting the number of nodes that leave from Ni to Ni+1. In Eqn. 5, pηi(t)/Mi(t) is a time-dependent function which denotes the probability of a new packet arriving at a node in the RG at time instance t. Similar to the first part of Eqn. 5, the second term considers the effects of packets departing from the current router channel: rθi(t)/Zi(t) represents a time-dependent probability that a packet will leave the current node at time instance t. The function f2(Ni, Ni+1) considers two cases: if a packet leaves a node with IN degree i + 1, then Ni increases; on the contrary, if a packet leaves a node with IN degree i, then Ni decreases. In pηi(t)/Mi(t) and rθi(t)/Zi(t), the factors p and r represent the probabilities of a packet arriving at or departing from a buffer channel in the RG. They can be pre-characterized from the obtained packet injection traces and the routing algorithms; Mi(t) and Zi(t) are two normalization parameters for pηi(t) and rθi(t), which are given in [Bogdan and Marculescu 2011; 2010]. Finally, the two time-dependent fitness functions ηi(t) and θi(t) play the key role in fitting the final system traffic. In [Bogdan and Marculescu 2009], these two fitness functions are developed in analogy to energy fitness functions in statistical physics.
Based on Eqn. 5, given the pre-assumed functions η and θ, the time-dependent probability P(i, t|η, θ), which describes the probability of finding a node in the RG with IN degree i at time instance t, can be calculated as [Bogdan and Marculescu 2010]:
∂[t^β P(i, t|η, θ)]/∂t = [η/M + θ/Z] ∂²[iP(i, t|η, θ)]/∂i² + [θ/Z − η/M] ∂[iP(i, t|η, θ)]/∂i    (6)
For β = 0, Eqn. 6 reduces to the form of a conventional Poisson traffic model. On the other hand, when β ≠ 0, Eqn. 6 is able to represent a statistical process which displays multifractal characteristics.
³ Assume δt is the infinitesimal time interval.
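The population dynamics of Eqn. 5 can be integrated numerically once the time-dependent factors are specified. The sketch below is a toy explicit-Euler integration: the forms chosen for f1 and f2, and the constant arrival/departure factors, are illustrative assumptions following the balance argument in the text, not the exact choices of [Bogdan and Marculescu 2010].

```python
import numpy as np

def mean_field_step(N, arr, dep, dt):
    """One explicit-Euler step of Eqn. 5 for the IN-degree populations N[i].
    arr[i] and dep[i] stand for the lumped factors p*eta_i(t)/M_i(t) and
    r*theta_i(t)/Z_i(t). Following the balance argument in the text, we take
    f1(N[i-1], N[i]) = N[i-1] - N[i] and f2(N[i], N[i+1]) = N[i] - N[i+1];
    these forms are illustrative assumptions."""
    dN = np.zeros_like(N)
    for i in range(len(N)):
        below = N[i - 1] if i > 0 else 0.0           # channels with i-1 packets
        above = N[i + 1] if i < len(N) - 1 else 0.0  # channels with i+1 packets
        dN[i] = arr[i] * (below - N[i]) - dep[i] * (N[i] - above)
    return N + dt * dN

# 64 buffer channels, all initially empty (IN degree 0); iterate Eqn. 5.
N = np.array([64.0, 0.0, 0.0, 0.0])   # N[i]: number of channels with i packets
arr = np.full(4, 0.2)                 # toy arrival factors
dep = np.full(4, 0.1)                 # toy departure factors
for _ in range(200):
    N = mean_field_step(N, arr, dep, dt=0.1)
```

With time-varying fitness functions in place of the constant factors, the same loop would reproduce the non-stationary population dynamics the mean-field model is designed to track.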
Fig. 6. Summary of NoC traffic models. The short-range and long-range dependencies are identified by the autocorrelation function rX(l) defined in Table I: memoryless or short-range dependent traffic (e.g., Poisson traffic under random/transpose patterns with Poisson injection, or batch injection traffic) shows a rapidly decaying rX(l), whereas long-range dependent traffic shows a long tail. Mono- and multi-fractal traffic differ in the singularity spectrum D(h), a function of the singularity exponent h [Ihlen 2012]: monofractal traffic (e.g., MMPP and constant-Hurst-parameter models) has a narrow spectrum, while multifractal traffic (e.g., realistic benchmarks such as PARSEC and SPEC, captured by variable-Hurst-parameter models) spans a wide range of h.
By solving Eqns. 5 and 6, users can predict the system dynamics as they evolve with time. This consequently enables more accurate run-time control over the system. For example, in [Bogdan and Xue 2015] and [Bogdan 2015], the authors demonstrate a framework using model predictive control (MPC) for run-time power management. The multi-fractal workload model and the system dynamic equations are developed first. Then, based on the predictions of the system dynamics, the controller adjusts the voltage/frequency of each tile at run time to reduce the power consumption. The authors compared the multifractal control with Poisson and mono-fractal approaches and observed that the multifractal control framework accurately captures the system dynamics and therefore yields significant energy savings.
2.3.6. Summary of the traffic analysis techniques. In Fig. 6, we summarize the different NoC traffic models discussed in the previous sub-sections. In general, we can first compute the autocorrelation function rX(l) of a time sequence. For memoryless or short-range dependent traffic, rX(l) decreases rapidly as l increases. On the other hand, for long-range dependent traffic, rX(l) has a long tail. A synthetic workload with a Poisson injection process is an example of memoryless traffic; for such traffic, the conventional M/M/1 queuing model or diffusion approximation approaches [Kobayashi 1974] can be used in the performance analysis. For long-range dependent (LRD) traffic, we need to further consider the mono- and multi-fractal property by comparing the multifractal spectrum D(h). In summary, the Hurst-parameter-based and phase-type-based models discussed in Section 2 use a single exponent to characterize the workload; they belong to the mono-fractal class. On the other hand, many real applications, such as those in the PARSEC and SPEC benchmarks, have a more complex multi-fractal behavior, as shown in Fig. 5. One limitation of current multi-fractal traffic models is that most analytical performance models cannot directly take them as input. Therefore, a new performance analysis framework that supports multi-fractal traffic and can be embedded in design space exploration is required.
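The first step of the classification flow above, computing rX(l) and inspecting its decay, can be sketched as:

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Empirical autocorrelation r_X(l), the quantity used in Fig. 6 to
    separate traffic classes: it decays quickly for memoryless or
    short-range dependent traffic and has a long tail for LRD traffic."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x ** 2)
    return np.array([np.mean(x[:len(x) - l] * x[l:]) / var
                     for l in range(1, max_lag + 1)])

# Memoryless example: i.i.d. exponential IATs give r_X(l) near zero at
# every lag, so an M/M/1-style analysis is justified.
rng = np.random.default_rng(5)
r = autocorrelation(rng.exponential(1.0, 50000), 10)
```

For an LRD trace, the same function would return coefficients that stay well above zero out to large lags, pointing the designer to the mono- versus multi-fractal check of D(h).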
3. ANALYTICAL MODELS FOR NOC
In this section, we first investigate the queuing-theory-based models for mean end-to-end delay prediction in NoC. Then, we review the approaches to derive the maximum delay bounds. For a clear presentation, the parameters used in the models and the corresponding notations are summarized in Table II. Of note, these notations are similar to those used in previous works such as [Hu and Kleinrock 1997; Ben-Itzhak et al. 2011; Qian et al. 2015].

Table II. Parameters in the analytical performance models [Qian et al. 2015; Ben-Itzhak et al. 2011]

Ls,d : Delay of a flow from the Processing Element (PE) s to PE d
v/vs : Waiting time at the injection buffer of PE s
ηs,d : Total time spent to transfer a flit from PE s to PE d
hs,d : Accumulated contention delay for the flow from source s to sink d
l_i^f : The ith channel that flow f traverses sequentially during the routing
λf : The aggregated flow arrival rate (packets/cycle)
Fl : The set that contains all the flows routed over the link l
d(f, l, i) : The ith downstream channel which f traverses starting from the link l
s/sl : Average service time of all packets traversing the channel l
s_l^flit : Average flit service time in the buffer of channel l
R_l^2 : Second moment (SCV) of the packet service time for link l
u_{l_i^f} : Maximum duration for the packets of flow f to leave link l_i^f

Fig. 7. Extracting an equivalent queuing system for each channel in NoC routers: a) a typical single-channel wormhole router architecture [Dally and Towles 2003]; b) each input port channel can be treated as a queuing system by weighted-averaging the service times of the flows routed towards different output directions [Ogras et al. 2010; Hu and Kleinrock 1997; Kiasari et al. 2013b].
3.1. Average-case performance evaluation
For most multi-core systems which do not have a deadline requirement (e.g., a general-purpose computing platform that provides best-effort services), the average-case metric is used for design space exploration [Ogras et al. 2010]. To evaluate the mean latency, queuing-theory-based models [Lysne 1998; Ogras et al. 2010; Hu and Kleinrock 1997; Fischer and Fettweis 2013] are the most widely used. In the following, we summarize the basic ideas of queuing models.
3.1.1. The taxonomy of the queuing models. In Fig. 7, we first illustrate how to extract a queuing system from an NoC router during the queueing analysis. Specifically, each input buffer channel (e.g., the north input buffer highlighted in Fig. 7-a) is abstracted as a queuing system. The customers of this queueing system are the packets or flits being processed in the router. Because every customer at the current link may be routed to a different output direction, the service process is usually approximated by the weighted average of the service times towards each output direction, where the weights are proportional to the amount of traffic (shown in Fig. 7-b) [Ogras et al. 2010; Kiasari et al. 2013b]. For
such a queueing system, Kendall's notation [Kiasari et al. 2013a; G. Bolch and Trivedi 2006] uses the "A/B/m/K" abbreviation to characterize the system as follows:
— "A" represents the arrival process: For example, the abbreviation "M" of "A" used in
the queuing model stands for the Markovian arrival (i.e., Poisson arrival) process.
"Er" stands for Erlange arrival while "G" corresponds to a general independent IAT
distribution. Of note, some bursty arrival processes reviewed in previous sections
have also been considered in NoC queuing models. Examples are the MMPP arrival
process in [Min and Ould-Khaoua 2004] and the "GE" (generalized exponential distribution) arrival process in [Qian et al. 2014; Wu et al. 2010].
— "B" denotes the system service process. Similar to the arrival process, "B" can be
markovian with the abbreviation "M", deterministic with the abbreviation "D" or a
more general process with the abbreviation "G".
— "m" is the total number of available servers in the queuing system. For wormhole
NoCs with a single channel per port 4 , since there is only one physical channel in the
downstream router to serve the packets, the queuing system has only one server and
m = 1 [Hu and Kleinrock 1997]. On the other hand, for NoCs with multiple virtual
channels, the packets actually contend for one of the available VCs in the routing.
Therefore, the effective number of servers will be larger than one depending on the
available VCs [Ben-Itzhak et al. 2011].
— "K" represents the number of customers that can be held in the queuing system.
Existing NoC analytical models define the customer in the queuing system as either
a single packet or a flit. Consequently, "K" is calculated from the router buffer size
based on the customer granularity.
In queuing models, when computing the waiting time of a queue, the analysis procedure usually first computes the state probability Pi, which represents the probability of having i customers (packets or flits) in the queuing system [Kiasari et al. 2013a]. Then, the average number of customers N in a system with capacity K can be calculated as [Kleinrock 1975; Donald and Harris 2008]: N = Σ_{i=0}^{K} i × Pi. Finally, according to Little's Law [Kleinrock 1975], the average waiting time in the queue Ws equals [Kleinrock 1975; Donald and Harris 2008]: Ws = N/λ, where λ is the average customer arrival rate at the channel.
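For a concrete instance, the sketch below carries out this procedure for an M/M/1/K queue, whose state probabilities have a simple closed form. One standard refinement over the plain Ws = N/λ form given above: with a finite capacity K, Little's Law is applied with the accepted (effective) arrival rate λ(1 − P_K).

```python
def mm1k_delay(lam, mu, K):
    """The procedure above for an M/M/1/K queue: compute the state
    probabilities P_i, then N = sum(i * P_i), then apply Little's Law
    with the accepted arrival rate lam * (1 - P_K)."""
    rho = lam / mu
    norm = sum(rho ** i for i in range(K + 1))
    P = [rho ** i / norm for i in range(K + 1)]   # state probabilities
    N = sum(i * P[i] for i in range(K + 1))       # mean customers in system
    lam_eff = lam * (1.0 - P[K])                  # packets actually accepted
    return N, N / lam_eff                         # (N, mean time in system)

N, W = mm1k_delay(lam=0.5, mu=1.0, K=10)
# With rho = 0.5 and K = 10, N is just under 1 and W just under 2 cycles,
# approaching the infinite-buffer M/M/1 values rho/(1 - rho) = 1 and 2.
```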
In this survey, we summarize several representative NoC analytical models from the following aspects:
— The arrival process: Most NoC analytical models consider the router models under the assumption of a Poisson arrival process. For example, in [Guz et al. 2007], an M/M/1-based channel model is designed to analyze the impact of link capacity on flow latency. Similarly, in [Lai et al. 2009; Nikitin and Cortadella 2009; Ben-Itzhak et al. 2011], the
IAT distribution of the header flits in each flow is assumed to be Poisson. Recently, much research effort has been devoted to generalizing the arrival process model. In [Wu et al. 2010], the traffic burstiness as well as the short-range dependencies (SRD) are characterized using a generalized exponential (GE) distribution [Wu et al. 2010]. In [Kiasari et al. 2013b], instead of using a GE arrival model, a 2-state MMPP model is used to capture the bursty arrivals at the injection sources.

⁴ In this section, we use the notation "Single Channel" to represent the wormhole NoC routers with a single channel at each port, while the notation "Multiple VC" represents the NoC routers with multiple virtual channels.

Table III. Comparison of NoC analytical models [Qian et al. 2015; Qian et al. 2014]

[Guz et al. 2007]: M/M/1 queue; Poisson arrival; memoryless service; uniform traffic pattern; Single Channel; negligible buffer size; round-robin arbitration.
[Ogras et al. 2010; Lai et al. 2009]: M/G/1/K queue; Poisson arrival; general independent service; arbitrary traffic pattern; Single Channel; buffer of K packets (K > 1); round-robin arbitration.
[Kiasari et al. 2013b]: G/G/1/∞ queue; 2-state MMPP arrival; general independent service; arbitrary traffic pattern; Single Channel/Multiple VC; negligible buffer size; fixed-priority arbitration.
[Ben-Itzhak et al. 2011]: M/M/m/K queue; Poisson arrival; memoryless service; arbitrary traffic pattern; Single Channel/Multiple VC; buffer of B flits (B > 0); round-robin arbitration.
[Wu et al. 2010]: GE/G/1/∞ queue; generalized exponential arrival; general independent service; arbitrary traffic pattern; Single Channel; negligible buffer size; round-robin arbitration.
[Min and Ould-Khaoua 2004]: MMPP/G/1/∞ queue; MMPP arrival; general independent service; uniform traffic pattern; Single Channel/Multiple VC; negligible buffer size; round-robin arbitration.

Specifically, in that
work, a second-moment term, namely the "squared coefficient of variation" (SCV), is computed to characterize the bursty traffic. In [Min and Ould-Khaoua 2004], a multiple-state MMPP traffic model is employed to better represent the self-similarity in the application traffic; the MMPP/G/1 queuing model is then used to analyze the latency performance for a specific topology (hypercubes) in supercomputers.
— The service process: Under the assumption that packet sizes follow a negative exponential distribution, the service time of the packets can also be approximated as exponentially distributed [Kleinrock 1975]. Based on this assumption, several works use memoryless service time models to predict the flow delays under different link capacities or buffer sizes [Guz et al. 2007; Hu and Marculescu 2004a]. On the other hand, there are also many multi-core applications which use a constant packet size [Nikitin and Cortadella 2009] or a more general packet length distribution [Kiasari et al. 2013b]. To handle a constant packet length, a modified M/D/1 queuing model is proposed in [Nikitin and Cortadella 2009] to address the non-memoryless service time distribution. To model a general independent service time distribution, the M/G/1 and G/G/1 queue-based models are used in [Ogras et al. 2010] and [Kiasari et al. 2013b], respectively; more specifically, the second moments of the service times are calculated in these models before applying the M/G/1 and G/G/1 queuing formulas.
— The number of servers: Most works assume a single server at each input port. In multiple-VC router architectures, the upstream packets can request any one of the available VCs at the downstream node; therefore, instead of using a single-server queuing system, [Ben-Itzhak et al. 2011] proposed to use a multiple-server model.
— The system capacity calculation: Many existing analytical models assume a single-flit buffer [Ben-Itzhak et al. 2011] or that the buffer size is negligible [Fischer and Fettweis 2013; Guz et al. 2007]. Under these assumptions, a packet actually occupies the whole routing path during its transfer [Ben-Itzhak et al. 2011]. Therefore, for each intermediate link channel in the routing path, the contending packets at the link channel actually accumulate at the injection sources. Since the source queue is usually assumed to have infinite capacity [Ben-Itzhak et al. 2011], queueing formulas with infinite capacity (e.g., the M/M/1/∞ or M/G/1/∞ queue) are used to analyze the link channels [Ben-Itzhak et al. 2011]. On the other hand, there are also analytical models, such as [Lai et al. 2009; Ogras et al. 2010], which are developed assuming a single channel buffer can hold N packets at a time, where N is an integer larger than one; the queueing system capacity is then modelled as N. The derivation of the queuing capacity becomes more complicated if the NoC router has a finite-size buffer (e.g., several flits) and can only hold a portion of a packet. In [Hu and Kleinrock 1997; Kouvatsos et al. 2005], the authors develop techniques to calculate the queuing capacity under this situation. Specifically, in their approaches, for any link channel, the capacity is the summation of two parts: the first part represents the number of contending flows sending towards this link; the second part computes the average number of packets stored at the input port, which is further calculated from the packet length distribution.
In Table III, we summarize several representative queuing-theory-based models in
NoC. As discussed above, each model relies on its specific assumptions and is most
accurate when the assumptions capture the target application and router architecture
characteristics.
Based on the models proposed in [Qian et al. 2015; Ogras et al. 2010; Hu and Kleinrock 1997; Kiasari et al. 2013b], we summarize a typical procedure of applying queuing theory to predict the NoC average latency. For the sake of simplicity, as in [Nikitin and Cortadella 2009; Qian et al. 2014], we assume each packet has a fixed size of L flits. Interested readers can refer to [Hu and Kleinrock 1997; Kouvatsos et al. 2005; Arjomand and Sarbazi-Azad 2010] for the additional modifications needed to extend the model from the fixed-length assumption to a general packet size distribution.
3.1.2. Summary of latency calculation for an NoC flow. In typical NoC analytical models, the delay Ls,d (shown in Fig. 8-a and -b) is used to represent the overall packet transfer time from source s to destination d. It is broken down into three parts [Ben-Itzhak et al. 2011; Qian et al. 2014]: 1) the waiting time at the injection processor s (i.e., vs), 2) the packet transfer time (i.e., ηs,d) after the channel has been allocated, and 3) the path contention delay (i.e., hs,d). More specifically, it is expressed in these works as:

Ls,d = vs + ηs,d + hs,d    (7)
To compute the path contention delay hs,d, it is required to aggregate the delays of each specific channel along the path of f (Fig. 8-b illustrates the channels that should be considered for the flow f6,2). Therefore, in [Ben-Itzhak et al. 2011], hs,d = Σ_{i=1}^{df} h_{l_i^f}, where l_i^f is the ith link of flow f and df is the path length. Similarly, the packet transfer time ηs,d can be calculated as [Ben-Itzhak et al. 2011; Ogras et al. 2010]: ηs,d = Σ_{i=1}^{df} η_{l_i^f} + (L − 1), where the transfer time of the header flit is added across every link and the second term represents the absorption of the remaining (L − 1) flits at the sink PE.
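Assembling Eqn. 7 from these per-link terms is then a simple summation, sketched below with toy per-link values (the numbers are illustrative, not taken from any cited model):

```python
def flow_latency(v_s, eta_links, h_links, L):
    """Eqn. 7 assembled from per-link terms: L_sd = v_s + eta_sd + h_sd,
    with eta_sd = sum(eta_l) + (L - 1) for the tail flits absorbed at the
    sink and h_sd = sum(h_l) accumulated over the d_f links of the path."""
    eta_sd = sum(eta_links) + (L - 1)   # header transfer + tail absorption
    h_sd = sum(h_links)                 # accumulated contention delay
    return v_s + eta_sd + h_sd

# Toy 3-hop flow: 1 cycle per hop for the header, small per-link contention
# delays, and packets of L = 4 flits.
latency = flow_latency(v_s=2.0, eta_links=[1.0, 1.0, 1.0], h_links=[0.5, 0.25, 0.0], L=4)
# latency = 2.0 + (3.0 + 3) + 0.75 = 8.75 cycles
```

The real analytical effort lies in producing the per-link h and η values from the queuing models, which is what the channel-ordering and equivalent-queue extraction steps below address.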
In summary, to derive Ls,d, the packet competition times h and the packet transfer times η at every link channel should be computed. The queuing models should address two issues: i) first, we need to analyze the dependency among the channels and decide the order of the links for applying the queueing formulas [Hu and Kleinrock 1997; Qian et al. 2014]; ii) second, it is important to extract an equivalent queuing system for each link channel and clearly identify the arrival and service processes of the queueing system. In particular, when modeling the arrival and service processes, we need to differentiate between the Single Channel and Multiple VC routers, which are shown in Fig. 8-c and -d, respectively. In Section 3.1.3, we summarize the analytical procedures for Single
A:18
Z. Qian et al.
Fig. 8. An illustration of the queueing delay calculation [Qian et al. 2014]: a) an application represented
as an architecture characterization graph (ARCG), b) the link channels which form the routing path of flow
f6,2 , c) the illustration of path competition time h, the flit transfer delay η and d) flow sharing effects in
Multiple VC router architectures [Ben-Itzhak et al. 2011].
Fig. 9. An illustration of the channel service scenario for a packet in the east link channel of a Single
Channel wormhole router architecture (c.f. [Hu and Kleinrock 1997; Kouvatsos et al. 2005; Qian et al. 2014])
Channel routers. In section 3.1.4, we discuss the extensions that are required to model
Multiple VCs.
To resolve the first issue of channel dependencies, note that the dependency is enforced by the flow control between NoC routers. For example, as shown in Fig. 8, both links l_{6,7} and l_{7,8} are part of the routing path of flow f_{6,2}; the waiting times at l_{7,8} affect the packet transmission at channel l_{6,7}. There are several methods to address this link dependency issue. For example, in [Kiasari et al. 2013b], a channel indexing method is proposed based on the distances to the destinations. In [Lai et al. 2009], a channel sorting method is proposed based on the routing algorithm used. A more commonly used method is to build the link dependency graph (LDG) of the application. A proper link order ensures that when analyzing a vertex (i.e., a link channel) in the LDG, all its precedents have already been computed. For more details, readers may refer to [Hu and Kleinrock 1997; Foroutan et al. 2013; Arjomand and Sarbazi-Azad 2010; Qian et al. 2014], where the procedures for building the LDG and deriving the link orders from it are provided.
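The LDG-based ordering idea can be sketched as follows, assuming each flow is given as a sequence of link identifiers (the link names below are illustrative). Each flow path induces a dependency from a link to its downstream link, since backpressure propagates upstream; a topological sort then yields an order in which every link is analyzed after the links it depends on:

```python
from collections import defaultdict, deque

def link_analysis_order(flow_paths):
    """Order link channels so that every link appears after all links it
    depends on (its downstream links), as required by LDG-based analysis."""
    deps = defaultdict(set)    # link -> downstream links it waits on
    users = defaultdict(set)   # downstream link -> upstream links using it
    links = set()
    for path in flow_paths:
        links.update(path)
        for up, down in zip(path, path[1:]):
            deps[up].add(down)
            users[down].add(up)
    unresolved = {l: len(deps[l]) for l in links}
    ready = deque(sorted(l for l in links if unresolved[l] == 0))
    order = []
    while ready:
        link = ready.popleft()
        order.append(link)
        for up in users[link]:
            unresolved[up] -= 1
            if unresolved[up] == 0:
                ready.append(up)
    if len(order) != len(links):
        raise ValueError("cyclic channel dependency: needs cycle breaking")
    return order

# The path of flow f6,2 from Fig. 8 (link names are illustrative):
print(link_analysis_order([("l6_7", "l7_8", "l8_5", "l5_2")]))
# prints ['l5_2', 'l8_5', 'l7_8', 'l6_7']
```

Note that the sink-side link comes first, matching the intuition that the downstream waiting times must be known before an upstream channel can be analyzed.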
3.1.3. Single Channel Wormhole Router models. In the following, we summarize a typical procedure to calculate the competition delay h, the transfer delay η and the source queue waiting delay v_s for Single Channel wormhole router architectures.
1) Flit transmission delay queueing model: In [Ben-Itzhak et al. 2011; Qian et al. 2015], the flit transfer time η over link l corresponds to the time needed to send the flit from the input port buffer to the head of the buffer at the next router, assuming the flit has already been allocated that link. To be more specific, in Fig. 9, the flit transfer time
over the link connecting routers R1 and R2 is depicted as the time duration from Point A to Point D. In [Qian et al. 2014], it was further broken into two parts for calculation. The first part consists of the time needed to traverse the current router; in a pipelined NoC router, this time can be approximated by the depth of the router pipeline. After the flit leaves the router, the second component of η captures the transfer delay of moving to the next channel (i.e., moving from Point C to Point D in Fig. 9).
To calculate the transfer time from Point C to D, [Qian et al. 2015] proposed a queueing model whose capacity equals B + 1, where B is the router buffer size defined in Table II. The additional one in the capacity B + 1 accounts for the flit location at the upstream buffer where the request to the current link is generated and granted. In order to apply the queueing formula, the arrival and service processes of this equivalent queueing system are identified. The mean customer (i.e., flit) arrival rate is computed by aggregating that of every flow in set F_l [Ogras et al. 2010]: λ_l^{flit} = L × Σ_{f∈F_l} λ_f, where λ_f is the mean rate of flow f and L is the packet size. The mean service time is approximated as the weighted average over the different flows f passing through link l [Hu and Kleinrock 1997]. Specifically, it is calculated as [Qian et al. 2015]:
s_l^{flit} = ( Σ_{∀f∈F_l} [ λ_f × ( h_{d(f,l,1)}/L + 1/(1 − Pb_{d(f,l,1)}) ) ] ) / ( Σ_{∀f∈F_l} λ_f )    (8)
where d(f, l, 1) represents the downstream neighboring channel of link l with respect to flow f. To understand Eqn. 8, we can consider the service process of a specific packet during routing. This service process always starts with the downstream channel arbitration by the header flit of the packet. Eqn. 8 assumes the header flit takes h_{d(f,l,1)} cycles to be allocated the downstream link. Then, due to the Single Channel router architecture, the body and tail flits can use the link channel exclusively for transmission; they are therefore sent out smoothly without any further delay. The only exception that interrupts this smooth transmission is the flow control: if the downstream channel indicates its buffer is full, the body/tail flits have to be stalled. Let Pb_{d(f,l,1)} represent the probability of having a full buffer at link d(f, l, 1). Then, similar to the derivation in [Lai et al. 2009], the mean time to route a flit under Pb_{d(f,l,1)} can be calculated as 1/(1 − Pb_{d(f,l,1)}). Hence, the overall service time for a flit belonging to flow f is the average over the different flits of the same packet and can be computed as h_{d(f,l,1)}/L + 1/(1 − Pb_{d(f,l,1)}) in Eqn. 8.
After modelling the queuing system in the flit transfer process, the corresponding
queuing formula such as M/M/1/K in [Ben-Itzhak et al. 2011] or M/G/1/K in [Hu and
Kleinrock 1997; Lai et al. 2009] can be used to evaluate the system waiting time. The
derived waiting time then provides an estimation of flit transmission time η.
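Eqn. (8) can be evaluated directly once the per-flow rates, the downstream header delays h_{d(f,l,1)} and the full-buffer probabilities Pb_{d(f,l,1)} are known; the values below are illustrative, not taken from the cited works:

```python
def mean_flit_service_time(flows, L):
    """Eqn. (8): flows is a list of (lam_f, h_down, pb_down) tuples for f in F_l,
    where h_down = h_{d(f,l,1)} and pb_down = Pb_{d(f,l,1)}; L = flits/packet."""
    num = sum(lam * (h_down / L + 1.0 / (1.0 - pb)) for lam, h_down, pb in flows)
    den = sum(lam for lam, _, _ in flows)
    return num / den   # rate-weighted average service time per flit

# One flow, a 4-cycle downstream header delay, a half-full downstream buffer,
# and 8-flit packets (hypothetical numbers):
print(mean_flit_service_time([(0.5, 4.0, 0.5)], L=8))   # prints 2.5
```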
2) Path competition delay queueing model: The arbitration or flow competition delay
h is defined in [Ben-Itzhak et al. 2011] as the time for a packet header to be successfully
granted its next channel over other competitors. In general, we can also abstract a
queuing system for this arbitration process [Ben-Itzhak et al. 2011; Hu and Kleinrock
1997]. Various analytical models have been proposed. Examples are the G/G/1 [Kiasari
et al. 2013b], M/G/1/K [Lai et al. 2009; Arjomand and Sarbazi-Azad 2009], GE/G/1
[Wu et al. 2010] or MMPP/G/1 [Min and Ould-Khaoua 2004] queuing models. In these
queuing models, the capacity K of the queuing system is usually modelled by the total
number of flows contending for the same output direction [Ben-Itzhak et al. 2011; Hu
and Kleinrock 1997].
Similar to the procedure for deriving η, the mean customer arrival rate of the queue extracted for calculating h is obtained by accumulating the traffic rates of all flows passing through the channel. In addition to the first moment, higher moments of the arrival
process can also be computed, depending on the specific traffic model employed in the analysis. For example, if the G/G/1 [Kiasari et al. 2013b] or GE/G/1 [Qian et al. 2014] queueing model is used to derive the contention delay h, the second moment (i.e., the SCV) of the packet arrival process over the channel l is also required. The SCV of the input traffic can be calculated following the approximations in [Kiasari et al. 2013b; Qian et al. 2015].
To represent the service process of the queueing model, a widely used service time model is illustrated in Fig. 9 (cf. [Hu and Kleinrock 1997; Qian et al. 2014]). In Fig. 9, assuming we need to calculate the contention delay of a packet in the east input channel of router R1, the service of that packet begins right after its header flit at Point A is granted the next channel and finishes at the instant the tail flit leaves the buffer. This duration is denoted as the service time because after it, other flows requesting the usage of l can arbitrate again for the physical link. In this way, the waiting time of the extracted queueing system has the meaning of the contention delay to be allocated a channel. For a network under light/medium load, this service duration simply equals the packet size, because each flit can pass through the channel directly; at the other extreme, under heavy traffic load (e.g., when the downstream channels are severely congested), this time is calculated as the duration for the header flit to move to the location (i.e., Point B in Fig. 9) from which the entire packet can be stored between that location and the current buffer slot (i.e., Point A) [Hu and Kleinrock 1997;
Kouvatsos et al. 2005; Arjomand and Sarbazi-Azad 2010]. Let x_l^f represent this worst-case service time; then the mean queue service time s_l^f with respect to flow f is a value dictated by the packet length L (minimum value) and x_l^f (maximum value) [Hu and Kleinrock 1997]. In practice, x_l^f is calculated first by accumulating the waiting times along the channels from Point A to B; then, s_l^f is approximated based on L and x_l^f as in [Hu and Kleinrock 1997; Qian et al. 2015]. After computing every s_l^f, the mean service time for packets traversing l can be computed by averaging over s_l^f [Hu and Kleinrock 1997], i.e.: s_l = Σ_{∀f∈F_l}(λ_f × s_l^f) / Σ_{∀f∈F_l} λ_f.
Besides the mean value of the service time, for queueing models with a non-memoryless service time distribution, such as the M/G/1, G/G/1 and MMPP/G/1 models, the second moment, i.e., the SCV of the service process, is also required. One way to approximate the SCV is given in [Kiasari et al. 2013b; Qian et al. 2015]:

R_l^2 = \overline{(s_l^f)^2}/(s_l)^2 − 1 = ( Σ_{∀f∈F_l} λ_f × (s_l^f)^2 / Σ_{∀f∈F_l} λ_f ) / (s_l)^2 − 1    (9)
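The weighted mean s_l and the SCV of Eqn. (9) follow directly from the per-flow rates and service times; the numbers below are illustrative:

```python
def link_service_moments(flows):
    """flows: (lam_f, s_lf) pairs for every f in F_l. Returns the rate-weighted
    mean service time s_l and the SCV R_l^2 of Eqn. (9)."""
    lam_tot = sum(lam for lam, _ in flows)
    s_l = sum(lam * s for lam, s in flows) / lam_tot
    mean_sq = sum(lam * s * s for lam, s in flows) / lam_tot   # weighted mean of s^2
    return s_l, mean_sq / (s_l * s_l) - 1.0

# Two flows with hypothetical rates and per-flow service times:
s_l, scv = link_service_moments([(0.02, 8.0), (0.01, 14.0)])
# s_l is 10 cycles and the SCV is 0.08 for these inputs
```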
Based on the above discussion, after mathematically building the models for the arrival and service processes of the queueing system, the M/G/1/K [Ben-Itzhak et al. 2011], G/G/1 [Kiasari et al. 2013b] or GE/G/1 [Qian et al. 2015] queueing formula can be applied to evaluate the waiting time of the queueing system, i.e., the flow contention delay h for the target link channel.
3) Waiting times at the source node: Typically, the traffic injection queues at the NoC network interfaces (NIs) are modelled as systems with infinite capacity; this is because the inner memory bandwidth of the PEs is usually much higher than that of the router channels, which supports accumulating many more packets at the sources before injecting them into the network [Dally and Towles 2003]. To model the source waiting time, different queueing models can be applied. For example, [Ben-Itzhak et al. 2011] proposes to use an M/M/1/∞ queue and [Wu et al. 2010; Qian et al. 2014] use the GE/G/1/∞ queueing model. Of note, for the source queues, the
traffic arrival and service processes are calculated in a way similar to that discussed
in characterizing queuing systems for contention delay h.
3.1.4. Extensions to router architectures with multiple VCs. For Multiple VC router architectures, the physical channel bandwidth is time-division multiplexed (TDM) among several flows at the same port [Ben-Itzhak et al. 2011]. To model VC router architectures, two methods are widely used. The first is to employ a multiple-server queue to replace the single-server model of Single Channel routers. Specifically, in [Ben-Itzhak et al. 2011], when calculating the path contention delay h, an M/M/m/K queue is used; the number of servers m in the model equals the number of effective VCs at the downstream channel. Moreover, the authors also account for the fact that when multiple flows are granted downstream VCs of the same port, their flits actually share the same physical bandwidth. Therefore, the flits appearing on the same link can come from several flows, which differs from the Single Channel wormhole router case. To address this, the derivation of the flit transfer time η must also be modified. Towards this end, the concept of "effective channel bandwidth" of a particular flow is introduced to represent the equivalent bandwidth seen by a VC during the switch transmission. With these two modifications, the path contention delay h and flit transfer time η can incorporate the effects of VC sharing; they can then be used to predict the mean flow latency as in the Single Channel models.
The second way to model Multiple VC router architectures is to refine the mean packet waiting time obtained from the Single Channel model by a factor V̄ [Ould-Khaoua 1999; Dally 1992]. The parameter V̄ represents the mean "degree of VC multiplexing" at a specific link channel [Ould-Khaoua 1999]. With the scaling factor V̄, the mean flow delay from source tile s to destination tile d is rewritten as [Ould-Khaoua 1999]: L_{s,d} = (v_s + η_{s,d} + h_{s,d}) × V̄. To calculate V̄, a typical way is to employ the VC state transition diagram (STD) [Kiasari et al. 2008; Wu et al. 2010]. In the STD, each vertex V_i represents the state in which i VCs at the input port are already allocated to packets. Specifically, the weight on the STD edge pointing from state V_i to V_{i+1} denotes the probability that a new VC is allocated to another packet given i busy VCs. Similarly, the weight on the edge pointing from V_i to V_{i−1} reflects the process of finishing the service of a packet and releasing one VC for further routing [Kiasari et al. 2008; Wu et al. 2010].
The transition rate between V_i and V_{i+1} depends on the current channel state (i.e., the number of VCs being used) as well as the characterized arrival/service processes. In [Wu et al. 2010] and [Min and Ould-Khaoua 2004], the transition probability from V_i to V_{i+1} is derived according to an optimization algorithm that maximizes the overall entropy for GE and MMPP type traffic; the transition rate from V_i to V_{i−1} is dictated by the packet service rate, which is 1/s (s is the average time for serving flits at the current channel). After calculating the transition rates of the STD, the state probability P_v (0 ≤ v ≤ V), which represents the probability that v VCs are allocated, can be computed by solving the extracted Markov chain as in [Min and Ould-Khaoua 2004; Wu et al. 2010; Ould-Khaoua 1999]. Finally, the degree of VC multiplexing V̄ is given by [Dally 1992]: V̄ = ( Σ_{v=0}^{V} v^2 P_v ) / ( Σ_{v=0}^{V} v P_v ).
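The final step can be sketched with a birth-death chain over the VC-occupancy states 0..V. For simplicity the sketch below uses constant allocation and release rates, whereas the cited works derive state-dependent rates from the traffic model; the rates are hypothetical:

```python
def vc_multiplexing_degree(lam, mu, V):
    """Solve a birth-death STD with constant allocation rate lam (V_i -> V_{i+1})
    and release rate mu (V_i -> V_{i-1}), then apply Dally's formula
    V_bar = sum(v^2 P_v) / sum(v P_v). The constant rates are a simplification."""
    weights = [(lam / mu) ** v for v in range(V + 1)]   # p_v proportional to (lam/mu)^v
    total = sum(weights)
    P = [w / total for w in weights]
    num = sum(v * v * P[v] for v in range(V + 1))
    den = sum(v * P[v] for v in range(V + 1))
    return num / den

# With lam == mu and V = 2 VCs, all states are equally likely and V_bar = 5/3
```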
3.2. Worst-case latency prediction for NoC
For many real-time systems, such as healthcare and embedded-control applications, it is more important to guarantee that packets are received and the subsequent actions are taken in time. For NoC systems, since most of the resources (e.g., links and buffer channels) are shared among flows, the contentions introduce a large delay
variation, especially under heavy workload. Therefore, for these systems, one challenge is to predict the worst-case delay accurately, both to avoid over-design and to ensure a configuration can meet the hard/soft deadline requirements. In [Kiasari et al. 2013a], the authors survey three worst-case performance analysis approaches: the network calculus based method, the dataflow analysis framework and the schedulability analysis. In this section, we first present another approach used in the NoC community, i.e., real-time bound (RTB) analysis [Rahmati et al. 2013; Lee 2003; Rahmati et al. 2009]. For the sake of completeness, we also highlight the concepts and ideas of the other three approaches.
3.2.1. Real-time bound analysis for maximum delay prediction. The rationale behind real-time bound (RTB) analysis is to predict the target flow's latency under the scenario that provides the least support for the current packet transmission [Rahmati et al. 2009]. Specifically, this worst-case transmission of a flow f occurs when all the router channels located on f's routing path are full and flow f has the least priority to be allocated the routing resources when competing with other flows. Under this assumption, the packets of f have to wait until all other flows have been served. Therefore, in [Rahmati et al. 2009; 2013], the upper-bound latency D̄_f of flow f is written as:
D̄_f = t_{s1} + t_{s2} + Σ_{l_i^f ∈ P_f} u_{l_i^f}    (10)

where t_{s1} (t_{s2}) represents the delay of issuing (collecting) a packet with multiple flits into (out of) the NoC and u_{l_i^f} is the maximum delay required for the packets in flow f to leave the current channel l_i^f. Of note, to be consistent with the previous sections, and following the conventions in [Hu and Kleinrock 1997], we use l_i^f to represent the i-th hop channel in the routing path of flow f; l_{i+1}^f represents the next link channel in flow f's routing path.
u_{l_i^f} is made up of two parts [Rahmati et al. 2009; 2013]. First, it includes the residual (remaining) time to route the packets which are already being served at the channel buffer. These packets belong either to flow f or to a competitor f*. Due to backpressure (flow control), this residual time corresponds to the maximum of the waiting times of the downstream links: max(u_{l_{i+1}^f}, u_{l_{i+1}^{f*}}), ∀f* ∈ I_{l_i^f}(f), where I_{l_i^f}(f) denotes the set of flows that contend with f at the current link l_i^f; l_{i+1}^f and l_{i+1}^{f*} represent the downstream channels of flow f and f*, respectively.
The second part of u_{l_i^f} accounts for the additional delay due to the loss of arbitration under the least-priority assumption. More specifically, after the remaining packets leave the channel, all the flows compete to use that channel again. In the worst-case transfer situation, the packet of the current flow f has to wait for the packets of all other flows to finish transmission once (i.e., Σ_{f*} u_{l_{i+1}^{f*}}), since it always loses the arbitration. Finally, the overall delay u_{l_i^f} is given by adding these two parts together [Rahmati et al. 2009]:

u_{l_i^f} = max(u_{l_{i+1}^f}, u_{l_{i+1}^{f*}}) + Σ_{f*} u_{l_{i+1}^{f*}}, ∀f* ∈ I_{l_i^f}(f)    (11)
Since the destination nodes consume packets immediately, both u_{l_{i+1}^f} and u_{l_{i+1}^{f*}} in Eqn. 11 equal the packet length at the last hop. Therefore, in practice, u_{l_i^f} is calculated backwards, from the flow destinations to the sources [Rahmati et al. 2009]. After obtaining u_{l_i^f}, the worst-case delay bound D̄_f can be calculated accordingly.
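The backward recursion of Eqns. (10)-(11) can be sketched on a toy example; the flow paths, link names and packet length below are hypothetical. Values not yet computed are initialized to the packet length L and the recursion is iterated to a fixed point:

```python
def rtb_channel_delays(paths, L, iters=16):
    """paths: flow name -> list of link names from source to destination.
    L: packet length in flits. Returns u[(flow, link)], the worst-case time
    to leave each channel, computed backwards per Eqn. (11)."""
    u = {}

    def next_u(g, link):
        # u at flow g's downstream channel; at the last hop the destination
        # consumes the packet immediately, so the delay is the packet length L
        p = paths[g]
        j = p.index(link)
        return u.get((g, p[j + 1]), L) if j + 1 < len(p) else L

    for _ in range(iters):
        for f, path in paths.items():
            for link in reversed(path):
                contenders = [g for g in paths if g != f and link in paths[g]]
                residual = max([next_u(f, link)] + [next_u(g, link) for g in contenders])
                u[(f, link)] = residual + sum(next_u(g, link) for g in contenders)
    return u

# Two hypothetical flows sharing link "b", with 4-flit packets:
u = rtb_channel_delays({"f1": ["a", "b"], "f2": ["b", "c"]}, L=4)
```

Summing a flow's u values over its path (plus t_{s1} and t_{s2}) then gives the bound D̄_f of Eqn. (10).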
3.2.2. Network calculus based worst-case delay prediction. Following the discussions in [Boudec and Thiran 2004; Kiasari et al. 2013a; Chang 2000], the rationale of network calculus can be summarized as follows. For a queueing system, let I(t) and O(t) represent its input and output characteristic functions, respectively. More precisely, I(t) denotes the total number of packets received at a specific router port until time t, and O(t) represents the number of packets sent out of that port until the same time instance. Under these notations, b(t) = I(t) − O(t) represents the number of customers that have to wait in the buffer at time t, while d(t) = inf{τ ≥ 0 : I(t) ≤ O(t + τ)} describes the time a customer arriving at time t has to stay in the system before it is routed to the next node [Qian et al. 2009a; Kiasari et al. 2013a].
For the worst-case delay derivation, the network calculus approaches use two curves to bound the input and output characteristic functions, namely the arrival curve α(t) and the service curve β(t) [Kiasari et al. 2013a]. α(t) and β(t) bound the arrival function I(t) and output function O(t) as follows [Chang 2000; Qian et al. 2009a]:

I(t) − I(s) ≤ α(t − s), ∀s ≤ t    (12)

O(t) ≥ I(s) + β(t − s), ∀s ≤ t    (13)
Given the characterized α(t) and β(t) curves, the maximum delay of forwarding a customer is obtained by computing the maximum horizontal distance between the β(t) and α(t) curves over all time instances t [Kiasari et al. 2013a; Chang 2000].
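For the textbook special case of a token-bucket arrival curve α(t) = b + r·t and a rate-latency service curve β(t) = R·max(0, t − T), this horizontal distance has the closed form T + b/R (when R ≥ r). This is only a minimal illustration, not the contention-tree construction of the NoC-specific works cited below; the parameters are hypothetical:

```python
def delay_bound(b, r, R, T):
    """Maximum horizontal deviation between alpha(t) = b + r*t (token bucket)
    and beta(t) = R*max(0, t - T) (rate-latency service curve)."""
    if R < r:
        raise ValueError("service rate R must cover the arrival rate r")
    return T + b / R   # closed form of the maximum horizontal distance

# Burst of 16 flits, arrival rate 0.5, service rate 2, service latency 3 cycles:
print(delay_bound(b=16.0, r=0.5, R=2.0, T=3.0))   # prints 11.0
```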
To calculate the maximum routing delay of a specific flow f in NoC, all the routers
through which flow f passes are considered first. Then, an equivalent service curve βf
of this flow is derived by convolving each router’s service process [Qian et al. 2009a].
To resolve different contention scenarios that a flow may encounter, in [Qian et al.
2010b], the authors propose a systematic way to build a contention tree. The flows
competing with the current flow f at each router are recognized. Then, the contention
scenarios are classified into three typical cases. The equivalent service curve is derived
respectively for these three cases. The model is further optimized in [Qian et al. 2009b]
by considering the feedback between the neighboring routers using credits for flow
control.
Recently, the network-calculus based models have evolved in the following directions:
— Analyzing self-similar traffic: In [Qian et al. 2009c], the network calculus based techniques are applied to estimate the delay under self-similar traffic. Specifically, instead of using a deterministic arrival curve, the self-similar arrival curve is derived
by including a few additional parameters (namely the excess probability in that work)
to capture the LRD in the traffic. By using the new arrival curve model, the authors
demonstrate a more practical delay bound can be obtained for fractal applications.
— Modeling more complicated router architectures: In [Qian et al. 2010a], the network calculus approach is applied to VC router architectures by further considering the service process of VC allocation. In [Zhao and Lu 2013], the NoC elements are modelled using a formal description approach called eXecutable Micro-Architectural Specification (xMAS) [Chatterjee et al. 2012]. The xMAS NoC models are then analyzed using network calculus techniques to compute the delay bound of each flow. Compared to the previous models, the xMAS model takes advantage of the xMAS framework to integrate more control details of the NoC routers.
— Applying real-time calculus techniques: Real-time calculus [Thiele et al. 2000] extends the network calculus idea to real-time systems. Compared to conventional network calculus approaches, upper and lower bound curves, which are functions of the time interval, are used to characterize the arrival
and service processes, respectively [Kiasari et al. 2013a]. For instance, the upper-bound arrival curve with a parameter δ indicates the maximum number of arriving customers during any time interval of length δ. Based on the upper and lower bound curves, [Chakraborty et al. 2003] develops a framework to analyze the maximum delay for real-time embedded systems.
— Predicting statistical delay distribution: Recently, the stochastic network calculus technique [Jiang and Liu 2008] has been applied to predict the statistical delay bound of NoC systems [Lu et al. 2014]. Specifically, the original deterministic arrival and service curves are extended with an additional function describing the probability of bounding the actual arrivals and services [Kiasari et al. 2013a]. In this way, the delay distribution of a target flow can be predicted using the statistical arrival and service curves; the result is more useful for systems with soft-deadline requirements.
3.2.3. Schedulability and data flow analysis techniques. Besides the above two approaches, schedulability analysis and data flow analysis techniques have also been applied to NoC-based systems (e.g., [Shi and Burns 2008; Bekooij et al. 2004]). In general, these two approaches are more often used to evaluate whether a scheduling/priority assignment policy can satisfy the deadline requirements of real-time systems. For the schedulability analysis in [Shi and Burns 2008], the authors identified two types of interference on a target flow f. Specifically, the direct interfering flows share links with f and cause resource contention; the indirect interfering flows affect f only implicitly, by influencing the issuing times of packets of f's direct interfering flows. Based on the priority assignment and the preemption assumption, a worst-case delay can be derived accordingly. Such an analysis framework is used to explore an optimal (i.e., lowest hardware cost) task mapping and priority assignment for NoC systems with deadline constraints [Shi and Burns 2010]. For the data flow analysis, the data flow graph of the system is built first; it can then be used to evaluate whether a given buffer sizing or scheduling approach meets all the flow deadline requirements [Bekooij et al. 2004; Hansson et al. 2008].
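The flavor of such schedulability analysis can be sketched with the classical fixed-point response-time iteration, here simplified to direct interference only and ignoring jitter terms, so it is in the spirit of, rather than identical to, the cited analyses. The latencies and periods are hypothetical:

```python
import math

def worst_case_latency(C_f, higher_prio, deadline):
    """C_f: basic network latency of flow f; higher_prio: (C_j, T_j) pairs for
    higher-priority direct interfering flows (latency, period). Returns the
    converged worst-case latency, or None if the iteration exceeds the deadline."""
    R = C_f
    while True:
        # each higher-priority flow preempts ceil(R / T_j) times within window R
        R_next = C_f + sum(math.ceil(R / T_j) * C_j for C_j, T_j in higher_prio)
        if R_next == R:
            return R
        if R_next > deadline:
            return None   # flow f is not schedulable under this assignment
        R = R_next

# One interfering flow of latency 2 and period 10 (illustrative numbers):
worst_case_latency(4, [(2, 10)], deadline=20)   # converges to 6
```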
4. SIMULATION BASED EVALUATION METHODS
Besides mathematical models, NoC simulations are also important in performance evaluation due to their higher accuracy. In this section, we review the development of simulation-based evaluation methods for NoC.
4.1. NoC simulators design
When using an NoC simulator to evaluate a design, the user needs to choose one with a proper abstraction level. Generally, the most accurate simulators implement every component of the router (e.g., crossbar switch, switch allocators, etc.) exactly as it would be laid out in hardware. Netmaker [Netmaker 2009] is such a simulation platform; it is implemented in SystemVerilog and can be synthesized for ASIC or FPGA. It supports a variety of router and network configurations (e.g., routers and links with variable numbers of pipeline stages), as well as some advanced router structures such as the speculative router [Peh and Dally 2001]. Another example is CONNECT [Connect 2011], a flexible Bluespec SystemVerilog based NoC emulator supporting different router architectures (e.g., input-queued, virtual-output-queued or virtual-channel routers). Moreover, CONNECT users can also configure the network into different topologies such as Mesh, Torus, Fat Tree and Butterfly. In general, the hardware description language (HDL) based platforms are most suitable for final-stage system prototyping. When such simulators are used in design space exploration, it takes a significant time to re-synthesize the designs
Table IV. Summary of current NoC simulators

Simulator   | Topology              | Router         | Backpressure               | Traffic
Netmaker    | Mesh                  | VC; SPEC^a     | Credit                     | Synthetic
CONNECT     | Mesh/Torus/Fat Tree   | VC; IQ; VOQ^b  | Credit/Peek Flow Control   | Synthetic
Booksim     | Mesh/Torus etc.       | VC             | Credit                     | Synthetic/Task-graph/Traces
Noxim       | Mesh                  | WH^c           | ABP^d                      | Synthetic/Task-graph
WormSim     | Mesh/Torus            | WH             | Buffer availability        | Synthetic/Task-graph/Traces
HNoC        | Regular/Irregular     | HVC^e          | Credit                     | Synthetic/Task-graph/Traces
NNSE        | Mesh/Torus            | VC             | Stress Values              | Uniform/Locality/Task-graph
gpNoCsim    | Mesh/Torus/Fat Tree   | VC             | Credit/Buffer availability | Synthetic
TOPAZ       | Mesh/Torus/Midimew    | VC, VCT^f      | Credit/Stop and go         | Synthetic/Traces
Garnet      | Regular/Irregular     | VC, EVC        | Credit                     | Synthetic
gMemNoCsim  | Mesh/Ring/Custom      | VC             |                            | Traces
Xmulator    | Regular/Semi-regular  | VC             |                            | Synthetic
DARSIM      | Regular/Irregular     | VC             |                            | Synthetic/Traces
NoCTweak    | Mesh/Torus/Ring       | VC; BFL^g      |                            | Synthetic/Traces
Xpipes      | Application-specific  | VC             | ACK-NACK/Stall and go^h    | Task-graph
OCIN_TSIM   | Mesh/Star             | VC             | Credit                     | Synthetic/Traces
HORNET      | Regular/Irregular     | VC/BDL^i       | -                          | Synthetic/Traces

a SPEC: the speculative router architecture [Peh and Dally 2001]
b IQ: input-queued router; VOQ: virtual output-queued router
c WH: wormhole router without virtual channels
d ABP: Alternating Bit Protocol for flow control [Noxim 2011]
e HVC: heterogeneous virtual channel architecture (the number of VCs per port may differ)
f VCT: virtual cut-through router architecture
g BFL: bufferless router architecture
h Xpipes Lite [Stergiou et al. 2005] supports the stall-and-go protocol [Flich and Bertozzi 2010]
i BDL: NoC architecture with bi-directional links
and re-run the simulations, especially when the data- or control-path components still need to be explored extensively.
To speed up the evaluation, the other type of NoC simulator abstracts the router components in a higher-level language (e.g., SystemC or C++) and provides cycle- and flit-accurate evaluations of the target design. Compared to hardware description languages, more flexible data types and structures can be used in these models; therefore, many router components such as buffers and allocators can be implemented more efficiently. The simulations are then performed either cycle-by-cycle (i.e., cycle-driven) or using an event queue that maintains the next operation (i.e., event-driven) [Dally and Towles 2003]. Another feature of these simulators is their parametrized design, which allows users to explore a large design space (e.g., buffer sizes, link bandwidth, routing schemes) without re-compiling the simulators.
Tables IV and V summarize and compare several NoC simulation tools widely used in the research community: Netmaker [Netmaker 2009], CONNECT [Papamichael and Hoe 2012], Booksim [Jiang et al. 2013], Noxim [Noxim 2011], Wormsim [Wormsim 2008], HNoC [Ben-Itzhak et al. 2012; HNoC 2013], NNSE [Lu et al. 2005], gpNoCsim [Hossain et al. 2007], TOPAZ [Abad et al. 2012], Garnet [Agarwal et al. 2009], gMemNoCsim [gMemNoCsim 2011; Lodde and Flich 2012], Xmulator [Nayebi et al. 2007], DARSIM [Lis et al. 2010], NoCTweak [Tran and Baas 2012], Xpipes [Bertozzi and Benini 2004; Stergiou et al. 2005], OCIN_TSIM [Prabhu 2010] and HORNET [Lis et al. 2011]^5. Specifically, Table IV summarizes the topologies, the router architectures, the backpressure signal characteristics and the traffic
^5 More open-source simulators are available and have been used in NoC evaluations; examples are SICOSYS [Puente et al. 2002], OCCN [OCCN 2003], NIRGAM [NIRGAM 2007] and Atlas [Atlas 2011]. They share many features with those in Tables IV and V. Interested readers may refer to [Ababei et al. 2012], which summarizes the simulators used by different groups.
Table V. Summary of current NoC simulators (Cont.)

Simulator   | Parameters                          | Modeling Language       | Parallel sim | Embedded in FSS^a
Netmaker    | Packet/VC^b/Allocator^c etc.        | SystemVerilog           |              |
CONNECT     | VC/Allocator etc.                   | Bluespec SystemVerilog  |              |
Booksim     | Pipeline^d/VC/Allocator/Packet etc. | C++                     |              | √
Noxim       | Buffer/Packet etc.                  | SystemC                 |              | √
WormSim     | Buffer/Packet etc.                  | C++                     |              |
HNoC        | VC/Link etc.                        | OMNeT++                 | √            |
NNSE        | VC/Packet etc.                      | SystemC                 |              |
gpNoCsim    | Packet/VC etc.                      | Java                    |              |
TOPAZ       | Pipeline/Packet/VC etc.             | C++                     | √            | √
Garnet      | Pipeline/VC/Bandwidth etc.          | C++                     |              | √
gMemNoCsim  | VC/Packet etc.                      | C/C++                   |              | √
Xmulator    | VC/Message etc.                     | C#                      |              |
DARSIM      | VC/Packet/Bandwidth etc.            | C++                     | √            | √
NoCTweak    | VC/Pipeline/Allocator etc.          | SystemC                 |              |
Xpipes      | VC/Switches^e etc.                  | SystemC/Verilog         |              | √
OCIN_TSIM   | VC/Pipeline/Allocator etc.          | C++                     |              |
HORNET      | VC/Packet/Bandwidth etc.            | C/C++                   | √            | √

a FSS: full-system simulator. A √ indicates the simulator has been embedded in a full-system simulator, such as Simics [Simics 2012], Gem5 [Gem5 2009] or Graphite [Graphite 2010], for execution-driven simulation
b The virtual channel number and buffer depth can be specified by the user
c The allocator and arbiter type can be configured
d The pipeline depth can be specified
e The switch (crossbar) size can be configured
supported by these simulators. As shown in Table IV, most simulators support the exploration of regular or semi-regular topologies (e.g., mesh, torus and hypercube). These topologies are more commonly used in Chip Multiprocessor (CMP) architectures, where the processing elements are homogeneous. On the other hand, other simulation frameworks, such as Xpipes and HNoC, support arbitrary topologies and heterogeneous configurations of different routers. Therefore, they are suitable for exploring the design of Multiprocessor System-on-Chip (MPSoC) architectures. In Table V, we summarize the features of the different simulation engines, including their configurable parameters, the modeling language, and whether they provide native support for parallel simulation or can be directly embedded in a full-system simulator. For the configurable parameters, most simulators allow the user to specify the VC (e.g., number of VCs per port and VC buffer depth) and the packet (e.g., packet size, fixed- or variable-length distributions) configurations. Some simulators also provide options for users to explore the router architecture (e.g., allocator type and pipeline depth). For power evaluation, two methods are widely used. The first is to use a bit-level energy model that estimates the energy consumption of transferring a single bit over the switch and the link [Hu and Marculescu 2005]; the Noxim and WormSim simulators are two examples. The second method is to include Orion [Kahng et al. 2012], a detailed NoC power model, as a library in the simulator. For example, the Booksim and Garnet simulators have integrated Orion, and WormSim also provides an option to evaluate power using the Orion model.
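The first method can be made concrete with a short sketch in the spirit of the hop-based bit-energy formulation of [Hu and Marculescu 2005]: the energy of moving one bit between two tiles is the switch energy spent at every hop plus the link energy spent on every traversed link. The per-bit energy constants below are placeholders chosen for illustration, not measured values.

```python
E_S_BIT = 0.284e-12  # energy per bit per switch (J) -- illustrative value
E_L_BIT = 0.449e-12  # energy per bit per link (J) -- illustrative value

def bit_energy(n_hops):
    """Energy to move one bit across n_hops switches and n_hops - 1 links."""
    return n_hops * E_S_BIT + (n_hops - 1) * E_L_BIT

def packet_energy(n_bits, n_hops):
    """Total communication energy of an n_bits packet over an n_hops path."""
    return n_bits * bit_energy(n_hops)

# Energy of a 64-flit packet with 32-bit flits over a 3-hop path:
print(packet_energy(64 * 32, 3))
```

Summing this quantity over all flows of an application gives the aggregate communication energy that simulators such as Noxim and WormSim report.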
4.2. Addressing scalability challenge in NoC simulations
Flit-level NoC simulators are traditionally built on either cycle-based or event-driven models [Dally and Towles 2003]. Cycle-based simulators check and update the network status (e.g., the buffer occupancy and the allocation requests/responses) every cycle. Event-driven simulators, on the other hand, update the network status only at the specific instants at which an event/transaction has been scheduled in the event queue. Both types of simulators accurately predict the design performance at the system level. However, with the advances in IC integration, there may be hundreds or thousands of cores in future NoC systems [Borkar 2009]. An emerging challenge for both cycle-based and event-driven evaluation is to provide fast and efficient simulation of such large-scale NoC systems. In the following, we summarize several representative approaches:
— High-level abstraction for simulation: INSEE [Ridruejo Perez and Miguel-Alonso 2005] is a simulation framework that contains a functional simulator named FSIN. Unlike cycle-accurate simulators, FSIN omits some modeling details, such as the cycle-accurate behavior of the routers. The simulator is therefore simple and fast for early-phase design space exploration, and users can employ FSIN to simulate a much larger network than the SICOSYS [Puente et al. 2002] simulator can handle. However, accuracy is compromised in such functional simulators.
— Reducing simulation points by statistical sampling techniques: To compensate for the accuracy loss of functional simulation, one approach is to combine detailed and functional simulations. In general, statistical sampling techniques help to choose only a portion of the application that needs to be fully simulated [Dai and Jerger 2014]; for the other parts, either only functional simulation is conducted or they are simply skipped. In [Dai and Jerger 2014], the idea of earlier sampling-based full-system simulation (e.g., [Carlson et al. 2013], [Ardestani and Renau 2013] and [Wenisch et al. 2006]) is extended to NoC-based systems. Two schemes are proposed. The first uses sampling theory to determine the sample size based on the metric (e.g., latency) variation and the confidence interval requirements. The second identifies different phases in the application using a clustering/classification method; sampling is then done by choosing samples from each class. It has been demonstrated that both approaches achieve an order-of-magnitude speedup with a small error (less than 10%).
— Exploring capabilities for parallel simulation: To enable fast and large-scale simulation, instead of executing the simulator program sequentially on a host PC/workstation, parallel computing resources (e.g., multiple threads or cores on a CPU or GPU) are exploited to accelerate the overall evaluation process [Eggenberger and Radetzki 2013]. In [Lis et al. 2011], a highly flexible, multithreaded NoC simulator named HORNET is developed. Specifically, parallel evaluation is achieved by dividing the whole NoC system into several parts; each part is mapped and scheduled onto an individual thread. During the simulation, the threads are synchronized periodically to exchange the network status of the different regions. The synchronization frequency determines the tradeoff between simulation speedup and evaluation accuracy: on one hand, frequent synchronizations ensure that each region keeps pace with the state changes in the neighboring regions; on the other hand, frequent synchronizations create barriers and significantly reduce the overall speedup [Eggenberger and Radetzki 2013]. In [Zolghadr et al. 2011], an NoC simulator is developed on the Nvidia GPU platform based on the CUDA library. In this simulator, each GPU thread evaluates a router or a link, and every cycle the shared or global memory of the GPU is used to synchronize/update the transactions. This requires a significant synchronization effort among threads, so the overall speedup for very large network sizes is limited [Eggenberger and Radetzki 2013]. In [Pinto et al. 2011], a General-Purpose GPU (GPGPU) based simulator for large-scale NoC systems is designed. Each GPU thread evaluates a tile (the core and router), and the GPU global memory is used to store the packets to speed up data access across different threads. To simulate a large-scale network, instruction-level rather than cycle-level accuracy is provided. Recently, in [Eggenberger and Radetzki 2013], the authors propose a simulation method with high scalability. Specifically, the simulation tasks are
ordered based on their dependencies, and the maximum number of tasks that can be simulated simultaneously without interfering with each other is identified. An efficient synchronization scheme is then used, together with a load-balancing scheme, to dynamically distribute the simulation workload among the threads of the host PC. With these optimizations, a significant speedup can be achieved when executing the simulations on a multi-core host computer, even for an NoC with more than one thousand processors.
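The HORNET-style partitioned simulation with periodic barrier synchronization can be sketched in a few lines. This is a minimal software model with made-up region and cycle counts, not the actual HORNET implementation: each thread simulates one region of the mesh, and all threads meet at a barrier every SYNC_PERIOD cycles to exchange boundary state.

```python
import threading

NUM_REGIONS = 4      # the NoC is partitioned into 4 regions, one per thread
SYNC_PERIOD = 8      # cycles between synchronizations (speed/accuracy knob)
TOTAL_CYCLES = 32

barrier = threading.Barrier(NUM_REGIONS)
cycles_done = [0] * NUM_REGIONS  # per-region progress; each thread owns one slot

def simulate_region(tid):
    while cycles_done[tid] < TOTAL_CYCLES:
        # Simulate SYNC_PERIOD cycles of this region independently
        # (stand-in for the real per-cycle router/buffer updates).
        cycles_done[tid] += SYNC_PERIOD
        # Periodic barrier: exchange boundary state with neighboring regions.
        barrier.wait()

threads = [threading.Thread(target=simulate_region, args=(t,))
           for t in range(NUM_REGIONS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cycles_done)  # every region has advanced to TOTAL_CYCLES in lock-step
```

Raising SYNC_PERIOD reduces barrier overhead but lets regions drift further apart between exchanges, which is exactly the speed/accuracy tradeoff discussed above.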
4.3. NoC benchmarking efforts
Standardization efforts have been made to provide methodologies for NoC benchmarking. The Open Core Protocol International Partnership Association (OCP-IP) formed a working group on NoC benchmarking [Mackintosh 2008] to propose standard benchmarks and methods for evaluating and comparing NoC designs with different topology, traffic and router implementation characteristics [Grecu et al. 2007]. With a similar goal, the NoCBench project [NoCbench 2011] has also released benchmark traffic and router models for performance evaluation.
5. COMBINING ADVANTAGES OF ANALYTICAL MODELS AND SIMULATIONS
The NoC analytical models discussed in Section 3 provide a fast way to evaluate NoC performance in the early design stage. NoC simulators, on the other hand, closely model the real system behavior at run time but require significantly longer evaluation time. Therefore, one interesting research direction in NoC performance evaluation is to combine the advantages of the two approaches and explore new ways of estimating NoC performance. In this section, we review several promising developments toward this end. We first discuss the use of FPGAs to accelerate simulation and then introduce machine-learning-based techniques that improve analytical model accuracy using simulation training samples.
5.1. Hardware based NoC simulator
Instead of implementing software-based NoC simulators, one recent direction in NoC emulator design is to use a real hardware platform for performance evaluation. Of note, this is different from previous work [Wolkotte et al. 2007; Genko et al. 2005] that implements the NoC exactly, as an ASIC chip or an FPGA embedded system, for prototyping. The concept of a hardware-based NoC simulator is still to explore a higher-level abstraction of the router model (for example, using the FPGA to implement the mathematical models as in [Papamichael et al. 2011] or the behavioral models as in [Wang et al. 2011]). The motivation for implementing such a high-level NoC model in hardware is twofold [Wang et al. 2011]. First, a higher-level abstraction offers greater flexibility and is therefore suitable for exploring and comparing different design choices before the data- and control-paths are fixed. Second, compared to executing the simulation as a software program, a hardware-based simulator emulates the different PEs and routers in parallel every cycle; the simulator is therefore more scalable for evaluating large networks and reduces the overall evaluation time.
For example, in [Wang et al. 2011], a flexible and efficient FPGA-based simulation architecture is developed. The authors argue that instead of implementing the system exactly in hardware, it is more flexible to map the major data- and control-path components as generic library components on the FPGA. For each NoC architecture to be evaluated, users simply assemble the corresponding basic elements from the library (e.g., the specific arbiters and buffers). In this way, the FPGA simulation engine can be viewed as a virtualization of the architecture to be explored.
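A software analogy of this library-based assembly might look as follows. This is a sketch of the idea only, not the actual generic library of [Wang et al. 2011]; the component names, the five-port router default and the round-robin arbiter as the sole arbitration option are all hypothetical choices made for illustration.

```python
from collections import deque

def make_round_robin(n_ports):
    """Round-robin arbiter: grant the first requesting port after the last winner."""
    state = {"last": -1}
    def arbitrate(requests):  # requests: list of booleans, one per port
        for i in range(1, n_ports + 1):
            port = (state["last"] + i) % n_ports
            if requests[port]:
                state["last"] = port
                return port
        return None  # no port is requesting
    return arbitrate

def make_router(n_ports=5, buffer_depth=4, arbiter="round_robin"):
    """Assemble a router configuration from generic library components."""
    arbiters = {"round_robin": make_round_robin}
    return {
        "buffers": [deque(maxlen=buffer_depth) for _ in range(n_ports)],
        "arbiter": arbiters[arbiter](n_ports),
    }

router = make_router()
requests = [False, True, False, True, False]
print(router["arbiter"](requests), router["arbiter"](requests))  # 1 3
```

Exploring a new architecture then amounts to re-assembling components with different parameters rather than re-implementing the router, which is the flexibility argument made above.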
[Fig. 10 depicts two flows. Training flow: training traffic → Booksim 2.0 simulator → link dependency analysis → link order → NoC queuing model → features XCQ, XSQ with simulated waiting times as labels → SV regression → channel and source waiting-time models fCQ and fSQ. Application flow: new traffic ARCG(T,P) and application info → features XCQ, XSQ → channel waiting and source queuing delays → flow latency Ls,d.]

Fig. 10. The workflow of the SVR-NoC model [Qian et al. 2013; 2015]: the framework first trains on the collected data to obtain the waiting time models fCQ and fSQ. The obtained SVR models are then applied to the new traffic ARCG to predict the flow delay.
Moreover, by using hardware rather than a software program to conduct the simulations, an over 100× speedup compared to the Booksim simulator can be achieved without any degradation of the final simulation accuracy [Wang et al. 2011].
In [Papamichael et al. 2011], a fast and simple FPGA-based simulator (FIST) for evaluating NoC latency is developed. The objective of FIST is to implement an analytical router model in hardware so as to replace detailed full-system evaluations. The main idea of the FIST design is to model each router as a lookup table in the FPGA that corresponds to a load-delay curve identified from queuing-theory-based models. At run time, during every time window, the current load at an intermediate router is used to look up a delay value, which determines the waiting time of a packet in that router's buffer before being routed. By following the packet's routing path and summing the load-dependent delays at each intermediate router, the user obtains an end-to-end delay estimate for a specific packet/flow. Compared to conventional queuing models, the FIST simulator adjusts the delays dynamically based on each router's load; it therefore better reflects the temporal behavior (e.g., latency variations over different time periods) of a traffic flow. Compared to cycle-based full-system simulation, a 43× speedup with less than 6% error is reported for the FIST simulator [Papamichael et al. 2011].
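The load-delay-curve lookup at the heart of FIST can be sketched as follows. The curve values below are invented for illustration and are not taken from [Papamichael et al. 2011]; in FIST the table entries would be calibrated from a queuing model or offline simulation.

```python
import bisect

# Quantized load-delay curve for one router: per-hop delay (cycles) as a
# function of offered load. Delay rises sharply near saturation.
LOAD_POINTS = [0.0, 0.2, 0.4, 0.6, 0.8]
DELAY_CYCLES = [2, 3, 5, 9, 20]

def router_delay(load):
    """Look up the per-hop delay for the router's load in this time window."""
    i = bisect.bisect_right(LOAD_POINTS, load) - 1
    return DELAY_CYCLES[i]

def end_to_end_delay(path_loads):
    """Sum the load-dependent delays over all routers on the packet's path."""
    return sum(router_delay(load) for load in path_loads)

# A 3-hop path whose middle router is congested in the current window:
print(end_to_end_delay([0.1, 0.7, 0.3]))  # 2 + 9 + 3 = 14
```

Because the lookup is indexed by the load observed in each time window, the estimate tracks temporal load variations, which is what distinguishes this approach from a static queuing-model prediction.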
5.2. Learning based NoC performance evaluation
As discussed in Section 3, most NoC analytical models are based on queuing theory and use the M/M/1, M/D/1, M/G/1/K or G/G/1 formulas to estimate the channel waiting times. These models can achieve high prediction precision if certain assumptions are met, for example, that the injection process of the packet headers is Poisson [Ogras et al. 2010]. However, when the application or the NoC configuration does not satisfy these assumptions, the accuracy of the analytical model degrades. To address this problem, SVR-NoC, a support vector regression based latency model for NoC latency prediction, is proposed in [Qian et al. 2013; 2015]. The rationale of SVR-NoC is to refine the analytical models by applying learning techniques to collected simulation samples. More specifically, the workflow of SVR-NoC can be summarized as follows (shown in Fig. 10) [Qian et al. 2013; 2015]. First, based on the NoC architecture on which the application is mapped and executed, several synthetic traffic patterns are fed into a corresponding simulator to collect training data samples. Non-parametric support vector regression [Vapnik 1998] is then applied to the collected training data to estimate the relationship between the waiting times and their features. In the learning process, the user collects two sets of features, XCQ and XSQ, which correspond to the features of the waiting times at the buffer channels (i.e., the channel waiting delay CQ) and at the source queues (i.e., the source waiting delay SQ). Of note, the features in XCQ and XSQ include the factors that directly affect the channel and source waiting times. For example, the arrival-rate elements in XCQ and XSQ reflect the traffic load injected onto the channel, while the forwarding-probability elements in XCQ aim to capture the potential contentions among the flows within the same router. After applying support vector regression (SVR) for learning, two model functions, namely the channel queuing and source queuing models (fCQ and fSQ), are obtained from the training data. During training, cross-validation needs to be used to determine the best hyper-parameter combinations and thereby avoid overfitting [Vapnik 1998]. Finally, to apply the learned model to evaluate the NoC latency of a new application, the input traffic is first characterized as an architecture characterization graph (ARCG), as discussed in Section 2.1. Then, for each source channel and intermediate link channel, the corresponding feature vectors XCQ and XSQ are calculated from the extracted ARCG. By applying the regression functions fCQ and fSQ to the new feature sets, the waiting times at each link channel, as well as the overall flow latency, can be predicted.
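To make the train-then-predict loop concrete, the sketch below fits a waiting-time model from simulation samples and applies it to a new load. The paper uses support vector regression with cross-validated hyper-parameters and richer feature vectors; this self-contained example substitutes ordinary least squares, a single arrival-rate feature, and synthetic training labels.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for SVR training)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# "Training samples": channel arrival rate (feature) vs. waiting time in
# cycles (label). The labels would come from simulator runs; here they are
# synthetic values for illustration.
rates = [0.1, 0.2, 0.3, 0.4]
waits = [2.0, 3.1, 3.9, 5.0]

a, b = fit_line(rates, waits)  # learned channel-waiting-time model (like fCQ)

def predict_wait(rate):
    """Apply the learned model to a new traffic load."""
    return a * rate + b

print(round(predict_wait(0.25), 2))  # waiting-time estimate for a new flow
```

The same pattern scales to the full SVR-NoC setup: replace the one-dimensional fit with SVR over the feature vectors XCQ and XSQ, and replace the synthetic labels with waiting times collected from the simulator.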
6. OPEN PROBLEMS AND RESEARCH PERSPECTIVES
Based on the discussions in the previous sections, we identify several open problems that need to be addressed and present some potential research directions:
— Modeling non-stationary and time-dependent NoC traffic: The major problem of memoryless (i.e., Poisson or Markovian) and short-range dependent traffic models (e.g., GE, 2-state MMPP) is that they cannot capture time-dependent arrival processes. For NoCs under light or medium load, the temporal correlations among the packets are not pronounced, so these models may still provide good approximations. However, under heavy workload (i.e., high packet injection rates [Bogdan and Marculescu 2011]), the contention for shared resources introduces a long tail in the arrival and service time distributions, which indicates that earlier packets in the network may affect subsequent ones; the accuracy of delay prediction with such models is then greatly reduced. To overcome this limitation, there are two possible directions. The first is to generalize the current Poisson or 2-state MMPP traffic models. Specifically, in Internet traffic modeling and operations research, more general processes such as Batch Markovian Arrival Processes (BMAP) have been used to model long-range dependence [Klemm et al. 2002]; self-similarity and traffic burstiness can then be characterized more efficiently. The second direction is to use the multi-fractal traffic model. Currently, this model is complicated and requires a set of master equations to be solved; an efficient implementation that can be embedded in the whole synthesis loop is an interesting research topic.
— Developing corresponding performance models for multi-fractal traffic: Multi-fractal analysis improves the accuracy over previous models, especially when the system works near the saturation point. Moreover, when some of the exponent parameters are fixed, it has been observed that the multi-fractal model reduces to the previous memoryless or short-range dependent models. The multi-fractal analysis framework thus provides a more general solution for both traffic analysis and performance prediction. However, the complexity of solving the master equations is much higher than for other models. Also, current research efforts have focused only on workload characterization; performance evaluation models (e.g., for average-case, worst-case or statistical metrics) based on multi-fractal input are still rarely explored. One research direction is to solve the multi-fractal formalism presented in [Bogdan 2015] and derive the probability of large (rare) events in that formalism. These probabilities could then be used to compute real-time performance metrics such as delay bounds.
— Relaxing assumptions in current queuing models: The common limitation of analytical queuing models is that they rely on assumptions made before the derivation; their accuracy and applicability are constrained by how closely the real cases match these assumptions. For average-case prediction, as summarized in Table III, each queueing model is built for a specific traffic input and router architecture (e.g., buffer size and arbitration policy), and its accuracy degrades if it is applied to a very different situation. Recently, some generalization efforts have been made, such as assuming general independent arrival and service processes and removing the assumptions on buffer depths and packet sizes. However, these models still do not capture the time dependency among the inter-arrival and service processes. Therefore, one future direction is to extend the queueing models with long-range dependent arrival and service processes while, at the same time, relaxing the assumptions on the topology, buffer size and arbitration scheme.
— Transient analysis for Networks-on-Chip: Most analytical queuing models focus only on aggregate average performance (e.g., mean delay). However, the authors of [Ohmann et al. 2014] point out that, for non-stationary applications, since the system is unstable, transient behavioral analysis is more useful, and they propose a transient queueing model for a single router element. As they discuss, additional effort is required in this emerging direction to provide a working model for a network of routers. Moreover, methods for applying transient analysis to improve flow control or buffer sizing are also desired [Bogdan 2015].
— Efficient resource management strategies: NoC performance models have been widely used for buffer sizing [Du et al. 2014] and flow regulation [Jafari et al. 2010]. However, if the workload is time-dependent, the configuration decisions made may be optimistic and fail to satisfy the QoS requirements [Varatkar and Marculescu 2004]; this limits the use of such models in some real-time systems. On the other hand, if a multi-fractal traffic model is used, an efficient hardware implementation is required that can collect the traffic information and react to network congestion in time.
— Developing efficient approaches for modeling NoC resource contentions in real-time systems: In an NoC, resources (e.g., link channels) are shared among different flows. It has therefore been recognized that contentions are the major cause of delay variations, and efficiently modeling the effects of resource sharing is a key problem. Especially for real-time systems with QoS requirements, over-estimating the contention delays introduces additional resource overhead, while under-estimating them can cause system failure. In the worst-case analytical models, the real bound analysis makes the conservative assumption that the target flow always loses arbitration, so the tightness of the bound is compromised. The network calculus bounds depend on whether the arrival and service curves can capture the real-time or statistical behaviors; how to model a network with more complicated contention scenarios still needs to be explored. Schedulability and dataflow analyses make assumptions about the priority assignment of the traffic flows (different VCs in the same port may be allocated to different priority levels) and are therefore mostly suitable for priority-based architectures. For best-effort NoCs, or real-time NoCs with soft-deadline requirements, these two approaches cannot be directly applied.
— Addressing the scalability challenge in NoC simulator design: For a large-scale system with hundreds of cores or more, full-system simulation is unaffordable. Even with current parallel simulation techniques, efficiently partitioning the simulation among multiple threads and overcoming the synchronization barriers remain difficult. Statistical sampling techniques help to reduce the number of simulation points; however, the choice of sampling strategy is application-specific and relies on a pre-analysis of the application characteristics. Another direction is to use hardware-based modeling or learning-based techniques to accelerate the evaluation. For the learning-based modeling approach, the current model works only for average latency prediction; how to reduce the training data size and extend the learning method to more performance metrics (e.g., worst-case or statistical delay) remains open to the whole community.
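As a small illustration of the short-range dependent models discussed in the first item above, a 2-state MMPP packet-injection trace can be generated in a few lines. The rates and switching probabilities below are arbitrary illustrative values, not parameters fitted to any real workload.

```python
import random

random.seed(2015)
RATES = {"low": 0.05, "high": 0.6}     # per-cycle injection probability per state
SWITCH_P = {"low": 0.02, "high": 0.1}  # probability of leaving the current state

def mmpp_trace(cycles):
    """Return a 0/1 injection trace: bursty while the hidden state is 'high'."""
    state, trace = "low", []
    for _ in range(cycles):
        trace.append(1 if random.random() < RATES[state] else 0)
        if random.random() < SWITCH_P[state]:
            state = "high" if state == "low" else "low"
    return trace

trace = mmpp_trace(10000)
# The long-run injection rate lies between the two modulated rates:
print(RATES["low"] < sum(trace) / len(trace) < RATES["high"])
```

Because each state persists for a geometrically distributed number of cycles, the trace exhibits bursts, but the correlations decay exponentially; this is precisely the short-range dependence that BMAP or multi-fractal models are meant to generalize.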
7. CONCLUSIONS
In this survey, we have reviewed several techniques for NoC performance evaluation, ranging from building traffic models to developing analytical and simulation-based latency models. We first summarized the typical workloads employed in NoC-based system evaluation. Then, we reviewed traffic analysis approaches for capturing both short- and long-range dependence behaviors, introduced a multi-fractal approach for modeling non-stationary traffic, and concluded the traffic analysis discussion by summarizing the features of each traffic model. For NoC latency evaluation, we presented techniques for predicting the average-case and worst-case delay, and also reviewed simulation-based evaluation models. We further reviewed state-of-the-art progress that combines the advantages of analytical models and simulations to provide rapid and scalable performance evaluation. Performance evaluation for NoC-based multicore systems still faces many interesting challenges; we have discussed several open problems and potential research directions toward this end.
ACKNOWLEDGEMENT
The authors would like to thank the reviewers for their detailed and valuable comments and suggestions.
REFERENCES
C. Ababei, P. P. Pande, and S. Pasricha. 2012. Network-on-chips (NoC) Blog. (2012). http://networkonchip.
wordpress.com/
P. Abad, P. Prieto, L. G. Menezo, A. Colaso, V. Puente, and J. A. Gregorio. 2012. TOPAZ: An Open-Source
Interconnection Network Simulator for Chip Multiprocessors and Supercomputers. In Networks on Chip
(NoCS), 2012 Sixth IEEE/ACM International Symposium on. 99–106.
N. Agarwal, T. Krishna, L. S. Peh, and N. K. Jha. 2009. GARNET: A detailed on-chip network model inside
a full-system simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE
International Symposium on. 33 –42.
E.K. Ardestani and J. Renau. 2013. ESESC: A fast multicore simulator using Time-Based Sampling. In
High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on.
448–459.
M. Arjomand and H. Sarbazi-Azad. 2009. A comprehensive power-performance model for NoCs with multi-flit channel buffers. In Proceedings of the 23rd International Conference on Supercomputing (ICS '09). ACM, New York, NY, USA, 470–478.
M. Arjomand and H. Sarbazi-Azad. 2010. Power-Performance Analysis of Networks-on-Chip With Arbitrary
Buffer Allocation Schemes. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 29, 10 (Oct 2010), 1558–1571.
Atlas. 2011. Atlas environment for Network-on-Chips. (2011). http://corfu.pucrs.br/redmine/projects/atlas/
wiki
Marco Bekooij, Orlando Moreira, O Moreira, Peter Poplavko, Bart Mesman, Milan Pastrnak, and Jef Van
Meerbergen. 2004. Predictable Embedded Multiprocessor System Design. In In Proc. International
Workshop on Software and Compilers for Embedded Systems (SCOPES), LNCS 3199. Springer.
Y. Ben-Itzhak, I. Cidon, and A. Kolodny. 2011. Delay analysis of wormhole based heterogeneous NoC. In
Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on. 161–168.
Y. Ben-Itzhak, E. Zahavi, I. Cidon, and A. Kolodny. 2012. HNOCS: Modular open-source simulator for Heterogeneous NoCs. In Embedded Computer Systems (SAMOS), 2012 International Conference on. 51–57.
L. Benini and G. De Micheli. 2002. Networks on chips: a new SoC paradigm. Computer 35, 1 (jan 2002), 70
–78.
D. Bertozzi and L. Benini. 2004. Xpipes: a network-on-chip architecture for gigascale systems-on-chip. Circuits and Systems Magazine, IEEE 4, 2 (2004), 18–31.
D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli. 2005. NoC
synthesis flow for customized domain specific multiprocessor systems-on-chip. Parallel and Distributed
Systems, IEEE Transactions on 16, 2 (2005), 113–129.
Paul Bogdan. 2015. Mathematical Modeling and Control of Multifractal Workloads for Data-Center-on-a-Chip Optimization. In Networks-on-Chip (NoCS), 2015 Ninth IEEE/ACM International Symposium on.
P. Bogdan, M. Kas, R. Marculescu, and O. Mutlu. 2010. QuaLe: A Quantum-Leap Inspired Model for Nonstationary Analysis of NoC Traffic in Chip Multi-processors. In Networks-on-Chip (NOCS), 2010 Fourth
ACM/IEEE International Symposium on. 241–248.
P. Bogdan and R. Marculescu. 2009. Statistical physics approaches for network-on-chip traffic characterization. In Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign
and system synthesis (CODES+ISSS ’09). ACM, New York, NY, USA, 461–470.
P. Bogdan and R. Marculescu. 2010. Workload characterization and its impact on multicore platform design. In Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2010 IEEE/ACM/IFIP
International Conference on. 231–240.
P. Bogdan and R. Marculescu. 2011. Non-Stationary Traffic Analysis and Its Implications on Multicore
Platform Design. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 30,
4 (2011), 508–519.
Paul Bogdan and Yuankun Xue. 2015. Mathematical Models and Control Algorithms for Dynamic Optimization of Multicore Platforms: A Complex Dynamic Approach. In Computer-Aided Design, 2015. ICCAD
2015. IEEE/ACM International Conference on.
S. Borkar. 2009. Design perspectives on 22nm CMOS and beyond. In Design Automation Conference, 2009.
DAC ’09. 46th ACM/IEEE. 93–94.
J. Y. Le Boudec and P. Thiran. 2004. Network Calculus: A Theory of Deterministic Queuing Systems for the
Internet. Lecture Notes in Computer Science,Springer-Verlag, Berlin, Germany.
T.E. Carlson, W. Heirman, and L. Eeckhout. 2013. Sampled simulation of multi-threaded applications. In
Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. 2–
12.
G. Casale, E. Z. Zhang, and E. Smirni. 2008. KPC-Toolbox: Simple Yet Effective Trace Fitting Using Markovian Arrival Processes. In Quantitative Evaluation of Systems, 2008. QEST ’08. Fifth International
Conference on. 83–92.
S. Chakraborty, S. Kunzli, and L. Thiele. 2003. A general framework for analysing system properties in
platform-based embedded system designs. In Design, Automation and Test in Europe Conference and
Exhibition, 2003. 190–195.
C. S. Chang. 2000. Performance Guarantees in Communication Networks. Springer-Verlag, New York.
S. Chatterjee, M. Kishinevsky, and U.Y. Ogras. 2012. xMAS: Quick Formal Modeling of Communication
Fabrics to Enable Verification. Design Test of Computers, IEEE 29, 3 (2012), 80–88.
Connect. 2011. Configurable NEtwork Creation Tool. (2011). http://users.ece.cmu.edu/~mpapamic/connect/
Wenbo Dai and N.E. Jerger. 2014. Sampling-based approaches to accelerate network-on-chip simulation. In
Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on. 41–48.
W. Dally. 1992. Virtual-channel flow control. Parallel and Distributed Systems, IEEE Transactions on 3, 2
(1992), 194–205.
W.J. Dally and B. Towles. 2001. Route packets, not wires: on-chip interconnection networks. In Design Automation Conference, 2001. Proceedings. 684–689.
W. Dally and B. Towles. 2003. Principles and Practices of Interconnect Networks. Morgan Kaufmann, San
Francisco, CA.
J. Diamond and A. Alfa. 2000. On approximating higher-order MAPs with MAPs of order two. Queueing
systems 34 (2000), 269–288.
D. Gross and C. M. Harris. 2008. Fundamentals of Queueing Theory. Wiley.
Gaoming Du, Miao Li, Zhonghai Lu, Minglun Gao, and Chunhua Wang. 2014. An analytical model for worst-case reorder buffer size of multi-path minimal routing NoCs. In Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on. 49–56.
M. Eggenberger and M. Radetzki. 2013. Scalable parallel simulation of networks on chip. In Networks on
Chip (NoCS), 2013 Seventh IEEE/ACM International Symposium on. 1–8.
E. Fischer and G.P. Fettweis. 2013. An accurate and scalable analytic model for round-robin arbitration in
network-on-chip. In Networks on Chip (NoCS), 2013 Seventh IEEE/ACM International Symposium on.
1–8.
W. Fischer and K. Meier-Hellstern. 1993. The Markov-modulated Poisson process (MMPP) cookbook. Performance Evaluation, Elsevier 18, 2 (1993), 149–171.
Jose Flich and Davide Bertozzi (Eds.). 2010. Designing Network On-Chip Architectures in the Nanoscale Era.
Chapman and Hall/CRC.
S. Foroutan, Y. Thonnart, and F. Petrot. 2013. An Iterative Computational Technique for Performance Evaluation of Networks-on-Chip. Computers, IEEE Transactions on 62, 8 (Aug 2013), 1641–1655.
G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi. 2006. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications, 2nd Edition. John Wiley and Sons.
Gem5. 2009. Gem5 simulator. (2009). http://www.m5sim.org/
N. Genko, D. Atienza, G. De Micheli, J. M. Mendias, R. Hermida, and F. Catthoor. 2005. A Complete Network-On-Chip Emulation Framework. In Proceedings of the conference on Design, Automation and Test in
Europe - Volume 1 (DATE ’05). 246–251.
gMemNoCsim. 2011. gMemNoCsim simulator. (2011). http://www.gap.upv.es/index.php?option=com_
content&view=article&id=72&Itemid=108
Graphite. 2010. Graphite simulator. (2010). http://groups.csail.mit.edu/carbon/
P. Gratz and S. W. Keckler. 2010. Realistic Workload Characterization and Analysis for Networks-on-Chip
Design. In The 4th Workshop on Chip Multiprocessor Memory Systems and Interconnects (CMP-MSI).
Cristian Grecu, Andre Ivanov, Partha P. Pande, Axel Jantsch, Erno Salminen, and Radu Marculescu. 2007. An
Initiative towards Open Network-on-Chip Benchmarks.
Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. 2007. Network Delays and Link Capacities
in Application-Specific Wormhole NoCs. VLSI Design (2007).
A. Hansson, M. Wiggers, A. Moonen, K. Goossens, and M. Bekooij. 2008. Applying Dataflow Analysis to
Dimension Buffers for Guaranteed Performance in Networks on Chip. In Networks-on-Chip, 2008. NoCS
2008. Second ACM/IEEE International Symposium on. 211–212.
J. Hestness and S. W. Keckler. 2010. Netrace: Dependency-Driven, Trace-Based Network-on-Chip Simulation. In 3rd International Workshop on Network on Chip Architectures (NoCArc).
HNoC. 2013. HNoC simulator. (2013). http://hnocs.eew.technion.ac.il/
H. Hossain, M. Ahmed, A. Al-Nayeem, T.Z. Islam, and M.M. Akbar. 2007. Gpnocsim - A General Purpose
Simulator for Network-On-Chip. In Information and Communication Technology, 2007. ICICT ’07. International Conference on. 254–257.
Jingcao Hu and R. Marculescu. 2003. Energy-aware mapping for tile-based NoC architectures under performance constraints. In Design Automation Conference, 2003. Proceedings of the ASP-DAC 2003. Asia
and South Pacific. 233–239.
J. Hu and R. Marculescu. 2004a. Application-specific buffer space allocation for networks-on-chip router
design. In Computer Aided Design, 2004. ICCAD-2004. IEEE/ACM International Conference on. 354–
361.
J. Hu and R. Marculescu. 2004b. Energy-aware communication and task scheduling for network-on-chip
architectures under real-time constraints. In Design, Automation and Test in Europe Conference and
Exhibition, 2004. Proceedings, Vol. 1. 234–239 Vol.1.
Jingcao Hu and R. Marculescu. 2005. Energy- and performance-aware mapping for regular NoC architectures. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 24, 4 (April
2005), 551–562.
P. C. Hu and L. Kleinrock. 1997. An Analytical Model for Wormhole Routing with Finite Size Input Buffers.
In 15th International Teletraffic Congress,University of California, Los Angeles.
E.A.F. Ihlen. 2012. Introduction to Multifractal Detrended Fluctuation Analysis in Matlab. Frontiers in
Physiology 3 (2012), 141.
F. Jafari, Z. Lu, A. Jantsch, and M. H. Yaghmaee. 2010. Buffer Optimization in Network-on-Chip Through
Flow Regulation. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 29,
12 (2010), 1973–1986.
J. Beran. 1994. Statistics for Long-Memory Processes. Chapman and Hall.
Nan Jiang, D.U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D.E. Shaw, J. Kim, and W.J. Dally. 2013.
A detailed and flexible cycle-accurate Network-on-Chip simulator. In Performance Analysis of Systems
and Software (ISPASS), 2013 IEEE International Symposium on. 86–96.
Y. Jiang and Y. Liu. 2008. Stochastic Network Calculus. Springer-Verlag, London, UK.
A. B. Kahng, B. Li, L. S. Peh, and K. Samadi. 2012. ORION 2.0: A Power-Area Simulator for Interconnection
Networks. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 20, 1 (Jan. 2012), 191
–196.
J. W. Kantelhardt, S. A. Zschiegner, E. Koscielny-Bunde, S. Havlin, A. Bunde, and H. E. Stanley. 2002. Multifractal detrended fluctuation analysis of nonstationary time series. Physica A Statistical Mechanics
and its Applications 316 (Dec. 2002), 87–114.
A. E. Kiasari, A. Jantsch, and Z. Lu. 2013a. Mathematical formalisms for performance evaluation of
networks-on-chip. ACM Comput. Surv. 45, 3, Article 38 (July 2013), 41 pages.
A. E. Kiasari, Z. Lu, and A. Jantsch. 2013b. An Analytical Latency Model for Networks-on-Chip. Very Large
Scale Integration (VLSI) Systems, IEEE Transactions on 21, 1 (jan. 2013), 113 –123.
A. E. Kiasari, D. Rahmati, H. Sarbazi-Azad, and S. Hessabi. 2008. A Markovian Performance Model for
Networks-on-Chip. In Parallel, Distributed and Network-Based Processing, 2008. PDP 2008. 16th Euromicro Conference on. 157–164.
L. Kleinrock. 1975. Queueing Systems, Volume I: Theory. Wiley.
A. Klemm, C. Lindemann, and M. Lohmann. 2002. Traffic Modeling of IP Networks Using the Batch Markovian Arrival Process. In Computer Performance Evaluation: Modelling Techniques and Tools, Tony Field,
Peter G. Harrison, Jeremy Bradley, and Uli Harder (Eds.). Lecture Notes in Computer Science, Vol. 2324.
Springer Berlin Heidelberg, 92–110.
Hisashi Kobayashi. 1974. Application of the Diffusion Approximation to Queueing Networks I: Equilibrium
Queue Distributions. J. ACM 21, 2 (April 1974), 316–328.
D. D. Kouvatsos, A. SALAM, and M. Ould-Khaoua. 2005. Performance modeling of wormhole-routed hypercubes with bursty traffic and finite buffers. Int J Simul Pract Syst Sci Technol 6 (2005), 69–81.
P. J. Kuhn. 2013. Tutorial on Queuing Theory. University of Stuttgart.
M. C. Lai, L. Gao, N. Xiao, and Z. Y. Wang. 2009. An accurate and efficient performance analysis approach
based on queuing model for network on chip. In Computer-Aided Design - Digest of Technical Papers,
2009. ICCAD 2009. IEEE/ACM International Conference on. 563 –570.
S. Lee. 2003. Real-time wormhole channels. Journal Of Parallel And Distributed Computing 63 (2003), 299–
311.
M. Lis, P. Ren, M. H. Cho, K. S. Shim, C. W. Fletcher, O. Khan, and S. Devadas. 2011. Scalable, accurate
multicore simulation in the 1000-core era. In Performance Analysis of Systems and Software (ISPASS),
2011 IEEE International Symposium on. 175–185.
Mieszko Lis, Keun Sup Shim, Myong Hyon Cho, Pengju Ren, Omer Khan, and Srinivas Devadas. 2010.
DARSIM: a parallel cycle-level NoC simulator. In MoBS 2010 - Sixth Annual Workshop on Modeling, Benchmarking and Simulation, Lieven Eeckhout and Thomas Wenisch (Eds.). Saint Malo, France.
https://hal.inria.fr/inria-00492982
W. Liu, J. Xu, X. Wu, Y. Ye, X. Wang, W. Zhang, M. Nikdast, and Z. Wang. 2011. A NoC Traffic Suite Based
on Real Applications. In VLSI (ISVLSI), 2011 IEEE Computer Society Annual Symposium on. 66–71.
Mario Lodde and Jose Flich. 2012. Memory Hierarchy and Network Co-design through Trace-Driven Simulation. In Proc. of 7th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems.
R. Lopes and N. Betrouni. 2009. Fractal and multifractal analysis: A review. Medical Image Analysis 13, 4
(2009), 634 – 649.
Zhonghai Lu, Rikard Thid, Mikael Millberg, Erland Nilsson, and Axel Jantsch. 2005. NNSE: Nostrum
Network-on-Chip Simulation Environment. In Swedish System-on-Chip Conference (SSoCC). 1–4.
Zhonghai Lu, Yuan Yao, and Yuming Jiang. 2014. Towards stochastic delay bound analysis for Network-on-Chip. In Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on. 64–71.
O. Lysne. 1998. Towards a Generic Analytical Model of Wormhole Routing Networks. Microprocessors and Microsystems 21, 7-8 (1998), 491–498.
Ian R. Mackintosh. 2008. OCP-IP NoC Benchmarking WG activities. Design Test of Computers, IEEE 25, 5
(Sept 2008), 504–504.
S. Mahadevan, F. Angiolini, M. Storgaard, R. G. Olsen, J. Sparsoe, and J. Madsen. 2005. Network traffic
generator model for fast network-on-chip simulation. In Design, Automation and Test in Europe, 2005.
Proceedings. 780–785 Vol. 2.
R. Marculescu and P. Bogdan. 2009. The Chip Is the Network: Toward a Science of Network-on-Chip Design.
Foundations and Trends in Electronic Design Automation 2, 4 (2009), 371–461.
G. Min and M. Ould-Khaoua. 2004. A performance model for wormhole-switched interconnection networks
under self-similar traffic. Computers, IEEE Transactions on 53, 5 (2004), 601–613.
A. Nayebi, S. Meraji, A. Shamaei, and H. Sarbazi-Azad. 2007. XMulator: A Listener-Based Integrated Simulation Platform for Interconnection Networks. In Modelling Simulation, 2007. AMS ’07. First Asia
International Conference on. 128–132.
Netmaker. 2009. Netmaker interconnection networks simulator. (2009). http://www-dyn.cl.cam.ac.uk/
~rdm34/wiki/index.php?title=Main_Page
N. Nikitin and J. Cortadella. 2009. A performance analytical model for Network-on-Chip with constant
service time routers. In Computer-Aided Design - Digest of Technical Papers, 2009. ICCAD 2009.
IEEE/ACM International Conference on. 571–578.
NIRGAM. 2007. NIRGAM simulator. (2007). http://nirgam.ecs.soton.ac.uk/home.php
NoCbench. 2011. NoCbench. (2011). http://www.tkt.cs.tut.fi/research/nocbench/index.html
Noxim. 2011. Noxim simulator. (2011). http://noxim.sourceforge.net/
OCCN. 2003. OCCN modeling framework. (2003). http://occn.sourceforge.net/
U. Y. Ogras, P. Bogdan, and R. Marculescu. 2010. An Analytical Approach for Network-on-Chip Performance Analysis. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 29,
12 (2010), 2001–2013.
D. Ohmann, E. Fischer, and G. Fettweis. 2014. Transient queuing models for input-buffered routers in
Network-on-Chip. In Networks-on-Chip (NoCS), 2014 Eighth IEEE/ACM International Symposium on.
57–63.
M. Ould-Khaoua. 1999. A performance model for Duato’s fully adaptive routing algorithm in k-ary n-cubes.
Computers, IEEE Transactions on 48, 12 (1999), 1297–1304.
Michael K. Papamichael and James C. Hoe. 2012. CONNECT: Re-examining Conventional Wisdom for Designing Nocs in the Context of FPGAs. In Proceedings of the ACM/SIGDA International Symposium on
Field Programmable Gate Arrays (FPGA ’12). 37–46.
M. K. Papamichael, J. C. Hoe, and O. Mutlu. 2011. FIST: A fast, lightweight, FPGA-friendly packet latency estimator for NoC modeling in full-system simulations. In Networks on Chip (NoCS), 2011 Fifth
IEEE/ACM International Symposium on. 137–144.
K. Park and W. Willinger. 2000. Self-similar network traffic and performance evaluation. John Wiley and
Sons, New York.
PARSEC. 2009. PARSEC Benchmark Suite. (2009). http://parsec.cs.princeton.edu/
V. Paxson. 1997. Fast, approximate synthesis of fractional gaussian noise for generating self-similar network
traffic. Computer Communication Review 27 (1997), 5–18.
Li-Shiuan Peh and William J. Dally. 2001. A Delay Model and Speculative Architecture for Pipelined
Routers. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA ’01). IEEE Computer Society, Washington, DC, USA, 255–.
Physionet. 2004. A Brief Overview of Multifractal Time Series. (2004). http://www.physionet.org/tutorials/
multifractal/index.shtml
C. Pinto, S. Raghav, A. Marongiu, M. Ruggiero, D. Atienza, and L. Benini. 2011. GPGPU-Accelerated Parallel
and Fast Simulation of Thousand-Core Platforms. In Cluster, Cloud and Grid Computing (CCGrid),
2011 11th IEEE/ACM International Symposium on. 53–62.
Subodh Prabhu. 2010. OCIN_TSIM: A DVFS-aware simulator for NoC design space exploration and optimization. M.Sc. thesis. Texas A&M University.
V. Puente, J.A. Gregorio, and R. Beivide. 2002. SICOSYS: an integrated framework for studying interconnection network performance in multiprocessor systems. In Parallel, Distributed and Network-based
Processing, 2002. Proceedings. 10th Euromicro Workshop on. 15–22.
A. Pullini, F. Angiolini, P. Meloni, D. Atienza, S. Murali, L. Raffo, G. De Micheli, and L. Benini. 2007. NoC
Design and Implementation in 65nm Technology. In Networks-on-Chip, 2007. NOCS 2007. First International Symposium on. 273–282.
Bo Qian and Khaled Rasheed. 2004. Hurst Exponent and Financial Market Predictability.
(2004).
Y. Qian, Z. Lu, and Q. Dou. 2010a. QoS scheduling for NoCs: Strict Priority Queueing versus Weighted
Round Robin. In Computer Design (ICCD), 2010 IEEE International Conference on. 52–59.
Y. Qian, Z. Lu, and W. Dou. 2009a. Analysis of communication delay bounds for network on chips. In Design
Automation Conference, 2009. ASP-DAC 2009. Asia and South Pacific. 7–12.
Y. Qian, Z. Lu, and W. Dou. 2009b. Analysis of worst-case delay bounds for best-effort communication in
wormhole networks on chip. In Networks-on-Chip, 2009. NoCS 2009. 3rd ACM/IEEE International
Symposium on. 44–53.
Y. Qian, Z. Lu, and W. Dou. 2009c. Applying network calculus for performance analysis of self-similar
traffic in on-chip networks. In Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis (CODES+ISSS ’09). ACM, New York, NY, USA, 453–460.
Y. Qian, Z. Lu, and W. Dou. 2010b. Analysis of Worst-Case Delay Bounds for On-Chip Packet-Switching Networks. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 29, 5 (2010),
802–815.
Z. Qian, D. Juan, P. Bogdan, C. Tsui, D. Marculescu, and R. Marculescu. 2015. A Support Vector Regression (SVR) based Latency Model for Network-on-Chip (NoC) Architectures. Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on PP, 99 (2015), 1–1.
Zhiliang Qian, Da-Cheng Juan, P. Bogdan, Chi-Ying Tsui, D. Marculescu, and R. Marculescu. 2014. A comprehensive and accurate latency model for Network-on-Chip performance analysis. In Design Automation Conference (ASP-DAC), 2014 19th Asia and South Pacific. 323–328.
Z. L. Qian, D. C. Juan, P. Bogdan, C. Y. Tsui, D. Marculescu, and R. Marculescu. 2013. SVR-NoC: A Performance Analysis Tool for Network-on-Chips Using Learning-Based Support Vector Regression Model. In
ACM/IEEE Design Automation and Test in Europe (DATE).
D. Rahmati, S. Murali, L. Benini, F. Angiolini, G. De Micheli, and H. Sarbazi-Azad. 2009. A method for
calculating hard QoS guarantees for Networks-on-Chip. In Computer-Aided Design - Digest of Technical
Papers, 2009. ICCAD 2009. IEEE/ACM International Conference on. 579–586.
D. Rahmati, S. Murali, L. Benini, F. Angiolini, G. De Micheli, and H. Sarbazi-Azad. 2013. Computing Accurate Performance Bounds for Best Effort Networks-on-Chip. Computers, IEEE Transactions on 62, 3
(March 2013), 452–467.
F.J. Ridruejo Perez and J. Miguel-Alonso. 2005. INSEE: An Interconnection Network Simulation and Evaluation Environment. In Euro-Par 2005 Parallel Processing. Lecture Notes in Computer Science, Vol.
3648. Springer Berlin Heidelberg, 1014–1023.
B. Ryu and S. Lowen. 2000. Fractal traffic models for Internet simulation. In Computers and Communications, 2000. Proceedings. ISCC 2000. Fifth IEEE Symposium on. 200–206.
S. H. Shahram and L. N. Tho. 1998. Multiple-state MMPP models for multimedia ATM traffic. In Internat.
Conf. on Telecommunications (ICT’98) Proceedings. 435–439.
S. H. Shahram and L. N. Tho. 2000. MMPP models for multimedia traffic. Telecommunication Systems 15,
3-4 (2000), 273–293.
Zheng Shi and A. Burns. 2008. Real-Time Communication Analysis for On-Chip Networks with Wormhole
Switching. In Networks-on-Chip, 2008. NoCS 2008. Second ACM/IEEE International Symposium on.
161–170.
Zheng Shi and Alan Burns. 2010. Schedulability analysis and task mapping for real-time on-chip communication. Real-Time Systems 46, 3 (2010), 360–385.
Simics. 2012. Simics simulator. (2012). http://www.virtutech.com/
V. Soteriou, H. S. Wang, and L. S. Peh. 2006. A Statistical Traffic Model for On-Chip Interconnection Networks. In Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, 2006. MASCOTS 2006. 14th IEEE International Symposium on. 104–116.
SPLASH-2. 1995. SPLASH-2 Benchmark Suite. (1995). http://www.capsl.udel.edu/splash/
C. D. Spradling. 2007. SPEC CPU2006 Benchmark Tools. SIGARCH Computer Architecture News 35 (March
2007). Issue 1.
S. Stergiou, F. Angiolini, Salvatore Carta, L. Raffo, D. Bertozzi, and G. De Micheli. 2005. ×pipes Lite:
a synthesis oriented design library for networks on chips. In Design, Automation and Test in Europe,
2005. Proceedings. 1188–1193 Vol. 2. DOI:http://dx.doi.org/10.1109/DATE.2005.1
L. Thiele, S. Chakraborty, and M. Naedele. 2000. Real-time calculus for scheduling hard real-time systems.
In Circuits and Systems, 2000. Proceedings. ISCAS 2000 Geneva. The 2000 IEEE International Symposium on, Vol. 4. 101–104 vol.4.
Anh T. Tran and Bevan M. Baas. 2012. NoCTweak: a Highly Parameterizable Simulator for Early Exploration of Performance and Energy of Networks On Chip.
V. Vapnik. 1998. Statistical Learning Theory. John Wiley and Sons.
G. Varatkar and R. Marculescu. 2002. Traffic analysis for on-chip networks design of multimedia applications. In Design Automation Conference, 2002. Proceedings. 39th. 795–800.
G. V. Varatkar and R. Marculescu. 2004. On-chip traffic modeling and synthesis for MPEG-2 video applications. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 12, 1 (2004), 108–119.
D. Y. Wang, N. E. Jerger, and J. G. Steffan. 2011. DART: A programmable architecture for NoC simulation
on FPGAs. In Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on. 145 –152.
Zhe Wang, Weichen Liu, Jiang Xu, Bin Li, R. Iyer, R. Illikkal, Xiaowen Wu, Wai Ho Mow, and Wenjing Ye.
2014. A Case Study on the Communication and Computation Behaviors of Real Applications in NoC-Based MPSoCs. In VLSI (ISVLSI), 2014 IEEE Computer Society Annual Symposium on. 480–485.
T.F. Wenisch, R.E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J.C. Hoe. 2006. SimFlex: Statistical
Sampling of Computer System Simulation. Micro, IEEE 26, 4 (July 2006), 18–31.
P. T. Wolkotte, P. K. F. Holzenspies, and G. J. M. Smit. 2007. Fast, Accurate and Detailed NoC Simulations.
In Networks-on-Chip, 2007. NOCS 2007. First International Symposium on. 323–332.
Wormsim. 2008. Wormsim simulator. (2008). http://www.ece.cmu.edu/~sld/software/worm_sim.php
Y. Wu, G. Min, M. Ould-Khaoua, H. Yin, and L. Wang. 2010. Analytical modelling of networks in multicomputer systems under bursty and batch arrival traffic. The Journal of Supercomputing 51, 2 (2010),
115–130.
T. Yoshihara, S. Kasahara, and Y. Takahashi. 2001. Practical Time-Scale Fitting of Self-Similar Traffic with
Markov-Modulated Poisson Process. (2001).
X. Zhao and Z. Lu. 2013. Per-flow delay bound analysis based on a formalized microarchitectural model. In
Networks on Chip (NoCS), 2013 Seventh IEEE/ACM International Symposium on. 1–8.
M. Zolghadr, K. Mirhosseini, S. Gorgin, and A. Nayebi. 2011. GPU-based NoC simulator. In Formal Methods
and Models for Codesign (MEMOCODE), 2011 9th IEEE/ACM International Conference on. 83–88.