Wireless NoC and Dynamic VFI Co-Design:
Energy Efficiency without Performance Penalty
Ryan Gary Kim, Student Member, IEEE, Wonje Choi, Student Member, IEEE, Zhuo Chen, Student
Member, IEEE, Partha Pratim Pande, Senior Member, IEEE, Diana Marculescu, Fellow, IEEE, and
Radu Marculescu, Fellow, IEEE

Abstract—Multiple Voltage Frequency Island (VFI)-based
designs can reduce the energy dissipation in multicore platforms
by taking advantage of the varying nature of the application
workloads. Indeed, the voltage/frequency (V/F) levels of the VFIs
can be tailored dynamically by considering the workload-driven
variations in the application. Traditionally, mesh-based Networks-on-Chip (NoCs) have been used in VFI-based systems; however,
they have large latency and energy overheads due to the inherently
long multi-hop paths. Consequently, in this paper, we explore the
emerging paradigm of wireless Network-on-Chip (WiNoC) and
demonstrate that by incorporating WiNoC, VFI, and dynamic V/F
tuning in a synergistic manner, we can design energy-efficient
multicore platforms without introducing noticeable performance
penalty. Our experimental results show that for the benchmarks
considered, the proposed approach can achieve from 5.7% to
46.6% energy-delay product (EDP) savings over the state-of-the-art
system and from 26.8% to 60.5% EDP savings over a standard
baseline non-VFI mesh-based system. This opens up a new class
of co-design approaches that can make WiNoCs the
communication technology of choice for future multicore
platforms.
Index Terms—Dynamic Voltage and Frequency Scaling,
Network-on-Chip, Voltage Frequency Islands
I. INTRODUCTION
In recent years, multiple Voltage Frequency Island (VFI)
designs have increasingly made their way into commercial
and research multicore platforms. This is because, for VFI-based multicore systems, it is possible to implement efficient
power and thermal management via dynamically fine-tuning
the voltage and frequency (V/F) of each island under given
performance constraints. Moreover, dynamically tuned VFIs
(DVFI) reduce the area overhead associated with a fully
distributed, per-core dynamic voltage and frequency scaling
(DVFS) scheme. Hence, a hierarchy of globally distributed (inter-VFI)
and locally centralized (intra-VFI) control mechanisms can
provide the best trade-off between power and resource
management. However, DVFI necessitates that time-varying
core and traffic statistics are sent to a decision-making
controller. Due to the nature of a VFI-based system, we employ
a distributed control mechanism for reducing global
communication overhead. To reduce the time overhead
associated with the decision-making process (V/F tuning) and
the intra- and inter-VFI data exchanges, we need an efficient
communication backbone.

This work is in part supported by the US National Science Foundation (NSF) grants CCF-0845504, CNS-1059289, CNS-1128624, CCF-1162202, and CCF-1514206, as well as Army Research Office grant W911NF-12-1-0373. Ryan Gary Kim, Wonje Choi, and Partha Pratim Pande are with Washington State University, Pullman, WA 99164. E-mails: {rkim, wchoi, pande}@eecs.wsu.edu.

Most of the existing VFI-partitioned
designs use the conventional multi-hop, mesh-based NoC
architecture. However, for large-scale systems, the inter-VFI
data exchanges through traditional mesh NoCs introduce
unnecessary latency and energy overheads. Therefore, in this
work, we propose a new approach to designing a small-world
wireless NoC, which leads to a WiNoC-enabled DVFI-based
multicore system that can achieve significant energy savings
without paying a noticeable performance penalty. At the very
heart of this communication infrastructure lies the small-world
effect induced by the wireless links that enables the efficient
exchange of information among various cores.
The main contributions of this work are as follows:
• First, we propose a new VFI clustering methodology that
utilizes machine learning to: i) allow for non-uniform VFI
clusters and ii) take into account the temporal variations of
application workloads to support and enable DVFI. This
method generalizes previous VFI clustering approaches
that use average core-level statistics.
• Second, we design and implement a lightweight VFI
controller that determines suitable V/F values for each VFI
at runtime.
• Next, we design the WiNoC with knowledge of the VFI
structure in order to optimize intra-VFI, inter-VFI, and
core-to-controller communication.
• We demonstrate how the co-design of these three
paradigms (VFI, WiNoC, and dynamic V/F tuning) can
significantly reduce the Energy-Delay Product (EDP)
without performance penalty for commonly used CMP
benchmarks.
II. RELATED WORK AND NEW CONTRIBUTIONS
Multiple VFI designs have become commonplace for both
embedded and high performance multi-core platforms where
the optimization of energy dissipation while minimizing
performance degradation and area overhead is a must [1-4]. A
framework for the synthesis of VFIs has been proposed in [5],
where the system is partitioned into VFIs based on the
maximum number of VFIs allowed and the bandwidth
requirement of each task.
Zhuo Chen, Diana Marculescu, and Radu Marculescu are with Carnegie
Mellon University, Pittsburgh, PA 15213. E-mails: {tonychen, dianam,
radum}@cmu.edu.
Fig. 1. Illustration of the proposed VFI-partitioned multi-core co-design methodology. During the VFI clustering phase, each application is profiled in
order to obtain key time-varying core and network statistics. These statistics are used to find the optimal clustering, within particular size constraints, for
each application. During system design, these clusters are then mapped to physical cores. The NoC and the VFI Controller (VFI CTRL), the dedicated
hardware block that dynamically tunes the V/F of each VFI, are then placed to suit the traffic and cluster characteristics of each application. During
runtime, the VFI CTRL obtains the core and traffic data in order to determine optimal V/F levels for each VFI.
The limitations and design challenges associated with
existing NoC architectures are discussed in [6] and [7].
WiNoCs are seen as a new enabling architecture to achieve high
performance and energy efficiency in multicore platforms. A
comprehensive survey regarding various WiNoC architectures
and their design principles is presented in [6]. A WiNoC
architecture suitable for VFI-based systems has been presented
in [8] and [9]. However, [8] and [9] consider only static V/F
allocation and a clustering algorithm that works under equally-sized cluster constraints, and hence may be suboptimal.
Although exploiting small-world effects was initially aimed
at improving multicore performance [10], it was later
demonstrated that small-worldness can also benefit power
management via control-theoretic approaches [11]. More
recently, it has been shown that small-world WiNoCs can help
improve the temperature profile of the NoC switches and links
compared to a traditional mesh in the presence of DVFS [12].
However, prior work [12] considers distributed DVFS, where the V/F pair
of each NoC element is fine-tuned according to the traffic distribution.
Also, authors in [13] present a Model-Predictive Controller
(MPC) to implement DVFS on each core in order to optimize
energy dissipation within given thermal constraints. In [14], the
authors create specific NoC architectures that are tailored for
the use-cases of a System-on-Chip (SoC). That work also
includes preliminary DVFS investigations that tailor the operating
voltages of the NoC to varying workload requirements.
However, with VFI-based designs, the area overhead of
implementing per core DVFS can be reduced. Dynamic V/F
control in a VFI has been demonstrated in [15], [16], but the
focus is on the NoC and controlling inter-VFI queue occupancy,
as opposed to the full-system performance. Hardware-based
control has been demonstrated first in the context of application
specific systems [17], [18], but without considering the impact
of a NoC-based communication paradigm. Furthermore, with a
few recent exceptions [19], [20], [21], most approaches rely on
heuristics and do not employ machine learning-based
techniques for VFI clustering, system design, and runtime
management of power and performance as described in this
paper.
Consequently, in this work, we improve the state-of-the-art
by proposing a new co-design methodology that exploits
machine learning-based VFI partitioning, DVFI, and the
emerging WiNoC paradigm in order to reduce the energy
dissipation of a multicore chip without increasing the execution
time compared to traditional mesh-based architectures. Our
new VFI clustering methodology creates and exploits non-uniform
VFI clusters with time-varying computation and
communication statistics to better accommodate both WiNoC
and DVFI. The WiNoC is designed with knowledge of the
DVFI in order to provide efficient VFI communication, reduce
the utilization of inter-VFI links, and enable quicker
core-to-DVFI-controller communication. Lastly, DVFI is utilized to
lower the energy profile by reducing the frequency
while maintaining the performance of the cores and the network.
To illustrate the general design flow of our proposed
VFI-partitioned, dynamic V/F-enabled multicore platform, Fig. 1
outlines the key processes at each stage of the design flow. As
shown, we divide the design process into three stages, i.e., VFI
Clustering, System Design, and Runtime, executed as a single
sequential design flow. During VFI Clustering, we obtain
the benchmark-specific core and network data in order to create
optimal clustering for each application. In this stage we also
impose a minimum cluster size requirement in order to ensure
that we fully utilize the VFI paradigm. After VFI Clustering we
enter the System Design stage which includes the creation of
the WiNoC, the placement of threads, and the placement of the
VFI controller (VFI CTRL). Lastly, we design the VFI CTRL
in order to take advantage of the application’s workload
variation during the Runtime stage.
In the following sections we discuss each stage in Fig. 1 in
detail: VFI Clustering (Section III.A), System Design (Section
III.B), and Runtime (Section III.C). We will also demonstrate
how the information from each stage can be utilized to improve
the overall design.

TABLE I
CLUSTERING TERMINOLOGY
Variables         Definition
util              utilization (util) data
comm              communication (comm) data
num_util          number of util-based clusters
num_comm          number of comm-based clusters
min_util          min. number of cores in util-based clusters
min_comm          min. number of cores in comm-based clusters
clustering_util   clustering after util-based method
clustering        final clustering after comm-based method
|clst|            size of cluster clst

Algorithm 1: Pseudocode of util-based clustering
1:  Input: util, num_util, min_util
2:  Output: clustering_util
3:  clustering_util ← K-means(util, num_util, min_util) [Eq. 5]
4:  for each cluster clst in clustering_util do
5:    while |clst| < min_util do
6:      μ_clst ← centroid(clst) [Eq. 4]
7:      closest ← argmin_{x_i} D(x_i, μ_clst) [Eq. 3], x_i ∉ clst, and x_i has not yet moved
8:      Move closest core to clst
9:    end while
10: end for
11: return clustering_util
III. VFI ARCHITECTURE
In this section, we describe how we design the entire system
to support both VFIs and dynamic V/F tuning. First, we discuss
the VFI creation methodology by considering the cores' time-varying
busy utilization and inter-core traffic characteristics.
Then, we describe how we implement the dynamic V/F tuning
for each VFI. Next, we elaborate on the design of the WiNoC
architecture for the VFI-partitioned system. Finally, we
summarize how the three paradigms can be combined in a
co-design methodology in which each takes advantage of the others.
A. VFI Clustering
The proposed VFI clustering approach relies on the principle
of clustering together cores with similar behavior so as to
benefit from coordinated V/F tuning. For example, cores with
low utilization should be clustered together and tuned to a low
V/F level, while cores with high utilization should have their
V/F levels boosted together. To this end, we use the time traces
of the instructions per cycle (IPC) values of each core and the
traffic statistics to capture the utilization and communication
behavior of the cores, respectively. We propose to use unsupervised
machine learning techniques to cluster the cores with similar
behavioral patterns.
1) K-means Clustering: K-means is a well-known machine
learning algorithm that is able to identify and cluster data
into groups without training a parameterized model [22].
Assuming we have N data points and J clusters, K-means tries
to minimize the distortion measure:

Ψ = Σ_{i=1}^{N} Σ_{j=1}^{J} δ_ij · D(x_i, μ_j)    (1)

where δ_ij is an indicator function which is one if and only if
point x_i belongs to cluster j, and zero otherwise. D(x_i, μ_j)
measures the distance between the point x_i and the cluster
center μ_j. A cluster's center (centroid) is defined as the mean of
all points in that cluster:

μ_j = (Σ_{i=1}^{N} δ_ij · x_i) / (Σ_{i=1}^{N} δ_ij)    (2)

where j is the cluster number. For the distance measure,
D(x_i, μ_j), we use the squared L2-norm:

D(x_i, μ_j) = ‖x_i − μ_j‖₂²    (3)

The distortion measure is the sum of the intra-cluster distances;
therefore, minimizing it is equivalent to minimizing the intra-group
distortion and maximizing the inter-group distance. K-means
does this by iterating on two steps: (1) for each group j, assume
μ_j is fixed and label the closest points as belonging to group j;
(2) recalculate the centers of the groups, i.e., the μ_j's, while the
clustering of the points, i.e., the δ_ij values, is fixed. This
algorithm is guaranteed to converge, since each step decreases
the Ψ value [22].
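To make the two-step iteration concrete, the following is a minimal Python sketch of the K-means loop described above (Eqs. (1)-(3)); the function and variable names are ours, not the paper's:

```python
import random

def dist2(a, b):
    # Squared L2-norm, Eq. (3)
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, J, iters=50, seed=0):
    """Lloyd's algorithm: alternate the assignment step (labels) and the
    centroid update (Eq. 2), using Eq. (3) as the distance measure D."""
    rng = random.Random(seed)
    centroids = rng.sample(points, J)  # initialize centers from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Step 1: assign each point to its closest centroid (delta_ij = 1).
        labels = [min(range(J), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Step 2: recompute each centroid as the mean of its assigned points.
        for j in range(J):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids
```

Each iteration can only decrease the distortion Ψ, which is why the loop converges.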
In our method, we model each time trace as a multi-dimensional
point and then cluster the points using the above-mentioned
K-means algorithm. Therefore, cores with similar
time-dependent behaviors will be allocated to the same group,
while cores with very different patterns will be clustered into
different groups. In the following sections, we first
illustrate two clustering methods: one based on core utilization
and another based on inter-core communication.
Subsequently, we combine these metrics and introduce our
hybrid clustering method.
2) Utilization-based Clustering: The idea behind
utilization-based clustering is to group together the cores with
similar utilization patterns such that all cores in the same group
can benefit from dynamic V/F techniques. Traditionally,
average IPC values have been used to cluster the cores when
the sampled time traces are unavailable [9]. However, if time
traces have opposite program phases (e.g., cores that spawn
threads and cores that execute them), grouping
them together is actually undesirable for dynamic V/F
techniques.
To fully exploit the time-dependent information of the IPC
values, we propose to model each trace as a point in a
multi-dimensional space and cluster the points based on their squared
L2-norm. If the sampling period is φ and the total execution time
of the benchmark is Φ, then the time trace of each core i consists
of τ = Φ/φ sample points. Suppose we have N cores; we then have
N τ-dimensional points (u_i(1), u_i(2), …, u_i(τ)), i = 1, …, N, i.e.,
the IPC time traces for each core i. We can then cluster these
points (time traces) using the K-means algorithm. The center
of each cluster j, μ_j^u, is hence also a τ-dimensional point:

μ_j^u = (μ_j^u(1), μ_j^u(2), …, μ_j^u(τ)) = (Σ_{i=1}^{N} δ_ij · u_i) / (Σ_{i=1}^{N} δ_ij)    (4)

The distortion measure of utilization-based clustering is:

Ψ_u = Σ_{i=1}^{N} Σ_{j=1}^{J} Σ_{t=1}^{τ} δ_ij · ‖u_i(t) − μ_j^u(t)‖₂²    (5)
By minimizing Ψ𝑢 , we find the groups of cores with the most
similar time behavior for their IPC values. The pseudocode of
utilization-based clustering is shown in Algorithm 1. For
clarity, the meaning of each variable used in the clustering
algorithm is explained in Table I. Line 3 of Algorithm 1 uses
K-means to obtain the clusters based on utilization values.
However, it is possible that K-means generates very unbalanced
clusters since the algorithm only considers the similarity among
the points and is agnostic of other architectural constraints.
Therefore, we place a constraint on the minimum number of
cores in each cluster for implementation cost-efficiency.
If a cluster does not meet this constraint after the K-means
clustering, we evaluate the points outside the cluster to find the
point closest to the cluster (distance measure: Eq. (3) above).
This point is moved to the cluster and this process is repeated
until the minimum core constraint is satisfied (Lines 5-9 in
Algorithm 1). For example, if we require at least four cores in
each cluster and cluster 1 has only three cores after clustering,
we look for the closest point x_k outside of cluster 1 that
minimizes D(x_k, μ_1^u) and move it to cluster 1.
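The minimum-size repair loop of Algorithm 1 (Lines 5-9) can be sketched in Python as follows; `labels` is assumed to come from a prior K-means pass, and the helper names are ours:

```python
def centroid(members):
    # Mean of the member points (Eqs. 2 and 4)
    return [sum(c) / len(members) for c in zip(*members)]

def dist2(a, b):
    # Squared L2-norm (Eq. 3)
    return sum((x - y) ** 2 for x, y in zip(a, b))

def enforce_min_size(points, labels, J, min_size):
    """Algorithm 1 repair loop: while cluster j has fewer than min_size
    members, pull in the closest outside point that has not yet moved."""
    moved = set()
    for j in range(J):
        while labels.count(j) < min_size:
            mu = centroid([p for p, lab in zip(points, labels) if lab == j])
            # candidates: points outside cluster j that have not been moved
            cand = [i for i, lab in enumerate(labels)
                    if lab != j and i not in moved]
            closest = min(cand, key=lambda i: dist2(points[i], mu))
            labels[closest] = j
            moved.add(closest)
    return labels
```

The `moved` set mirrors the "has not yet moved" condition in the pseudocode, so a point is reassigned at most once.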
3) Communication-based Clustering: Following the same
idea, we model each communication traffic trace as a
multi-dimensional point: (f(1), f(2), …, f(τ)). The pseudocode of
communication-based clustering is shown in Algorithm 2.
However, instead of associating one multi-dimensional point
with each core as in the previous section, we need to associate
one point with each pair of cores, since communication traffic is
defined for exactly two cores. As a result, the traffic trace
between core k and core l defines a τ-dimensional point,
(f_kl(1), f_kl(2), …, f_kl(τ)), where f_kl(t) is the traffic volume
between core k and core l during window t, and the corresponding
distortion measure is:

Ψ_f = Σ_{k=1}^{N} Σ_{l=1}^{N} Σ_{j=1}^{J} Σ_{t=1}^{τ} δ_kj · δ_lj · ‖f_kl(t) − μ_j^f(t)‖₂²    (6)
where μ_j^f is the center of cluster j, defined as:

μ_j^f = (μ_j^f(1), μ_j^f(2), …, μ_j^f(τ)) = (Σ_{k=1}^{N} Σ_{l=1}^{N} δ_kj · δ_lj · f_kl) / (Σ_{k=1}^{N} Σ_{l=1}^{N} δ_kj · δ_lj)    (7)

Algorithm 2: Pseudocode of comm-based clustering
1:  Input: clustering_util, comm, num_comm, min_comm
2:  Output: clustering
3:  for each cluster clst in clustering_util do
4:    for cluster_num = num_comm : 1 do
5:      comm_partial = communication matrix of cores in clst
6:      clustering_temp ← K-means(comm_partial, cluster_num, min_comm) [Eq. 6]
7:      clst_Max = the cluster with the largest centroid value in clustering_temp
8:      while |clst_Max| < min_comm do
9:        μ_clstMax ← centroid(clst_Max) [Eq. 7]
10:       closest ← argmin_{x_i} D(x_i, μ_clstMax) [Eq. 3], x_i ∉ clst_Max, and x_i has not yet moved
11:       Move closest core to clst_Max
12:     end while
13:     while |clst − clst_Max| < (cluster_num − 1) · min_comm do
14:       μ_clstMax ← centroid(clst_Max) [Eq. 7]
15:       farthest ← argmax_{x_i} D(x_i, μ_clstMax) [Eq. 3], x_i ∈ clst_Max, and x_i has not yet moved
16:       Remove farthest core from clst_Max
17:     end while
18:     clustering.append(clst_Max)
19:     clst ← clst − clst_Max
20:   end for
21: end for
22: return clustering

Algorithm 3: Pseudocode of hybrid clustering
1:  Input: util, comm, num_util, num_comm, min_util, min_comm
2:  Output: clustering
3:  clustering_util ← cluster_util(util, num_util, min_util)
4:  clustering ← cluster_comm(clustering_util, comm, num_comm, min_comm)
5:  return clustering

Line 6 in Algorithm 2 uses K-means to minimize the Ψ_f value.
This approach not only minimizes the intra-cluster traffic
pattern differences, but also attempts to achieve a balance
between performance- and energy-driven partitioning.
In classic performance-driven, communication-based
clustering, one obtains low inter-cluster traffic magnitude and
low intra-cluster pattern similarity [9]; this clearly benefits
applications in which on-chip communication is a performance
bottleneck. In the case of energy-driven clustering, one can
obtain medium inter-cluster traffic magnitude but similar
intra-cluster traffic patterns. For example, if all the cores in cluster 1
have similar traffic patterns, then we can expect that the
intra-cluster traffic will be generated proportionally slower or faster
for all cores in the cluster, depending on the V/F levels used.
The K-means-based approach tries to find the best trade-off
between these cases, i.e., maximize the intra-cluster similarity
while keeping the inter-cluster traffic magnitude low. As in the
case of utilization-based clustering, it is possible to get
very unbalanced clusters. We use the same strategy as stated
previously (Lines 8-12) to move the closest points, and the
associated cores, into the clusters that cannot satisfy the
constraints. Similarly, if a cluster contains so many cores
that the remaining cluster would be unable to satisfy the
constraints, we remove the farthest cores from the cluster (Lines
13-17).
4) Hybrid Clustering: To combine the advantages of both
utilization-based and communication-based clustering, we
propose a hybrid clustering method as shown in Algorithm 3.
In hybrid clustering, we cannot simply combine Ψ_u and Ψ_f in a
weighted cost function (e.g., α·Ψ_u + (1 − α)·Ψ_f, with 0 ≤ α ≤
1), because their state-space dimensions are different (Ψ_u:
N×J×τ; Ψ_f: N×N×J×τ). Therefore, we propose to use a
hierarchical method, i.e., we first use the utilization-based
method to partition the cores into 𝑛𝑢𝑚𝑢𝑡𝑖𝑙 groups. Line 3
invokes Algorithm 1 for utilization-based clustering. In each
group, the cores have similar utilizations and hence can benefit
from the same V/F level tuning. Then, we deploy the
communication-based clustering to create 𝑛𝑢𝑚𝑐𝑜𝑚𝑚 clusters in
each of the 𝑛𝑢𝑚𝑢𝑡𝑖𝑙 groups generated in the first step. Line 4
invokes Algorithm 2 and uses the result from Algorithm 1,
𝑐𝑙𝑢𝑠𝑡𝑒𝑟𝑖𝑛𝑔𝑢𝑡𝑖𝑙 . Our data shows that the resulting Ψ𝑢 remains
stable through the communication clustering. Consequently, we
achieve similar utilization patterns and communication patterns
within each cluster. In our case, we generate four VFIs, hence
we first create two groups with utilization-based clustering and
then divide each of the groups into two through
communication-based clustering. This approach can, of course,
be used for any number of clusters.
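The hierarchical structure of Algorithm 3 can be sketched as follows; `cluster_fn` stands in for any constrained K-means routine, and the two-level decomposition mirrors Lines 3-4 of Algorithm 3 (the names are ours, for illustration only):

```python
def hybrid_clustering(util, comm, num_util, num_comm, cluster_fn):
    """Hierarchical scheme of Algorithm 3: a utilization-based split
    first, then a communication-based split inside each util group.

    cluster_fn(data, k) -> list of index groups; any (constrained)
    K-means implementation can be plugged in here.
    """
    vfis = []
    # Step 1: partition cores into num_util groups by IPC time traces.
    for group in cluster_fn(util, num_util):
        # Restrict the communication matrix to the cores of this group.
        sub_comm = [[comm[k][l] for l in group] for k in group]
        # Step 2: split each group into num_comm clusters by traffic.
        for sub in cluster_fn(sub_comm, num_comm):
            vfis.append([group[i] for i in sub])
    return vfis
```

With num_util = 2 and num_comm = 2 this yields the four VFIs used in the paper's configuration.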
5) Static V/F Levels: With the partitioned VFIs, we determine
the static V/F level of each VFI such that it minimizes the power
consumption under a 95% performance constraint. To estimate
the power and performance values under different V/F levels,
we use the method from [23], which proposed and validated a
power and performance model with a Root-Mean-Squared
Percentage Error (RMSPE) of only 4.37%. Using this model,
we optimally solve for the best V/F levels. These statically
tuned VFIs will be used as a comparison baseline for our proposed
dynamically tuned system.
6) VFI Interface: In this VFI-enabled system, each island can
operate at its own voltage and frequency. As such,
communication across different VFIs is achieved through
mixed-clock/mixed-voltage first-in first-out (FIFO)
interfaces. This provides the flexibility to scale the frequency
and voltage of the various VFIs in order to minimize the overall
energy consumption [24]. We present the latency and energy
models used in our simulations in Section IV.A.
B. WiNoC to Support VFIs
In this work, we design the WiNoC drawing inspiration from
small-world graphs [25]. Small-world graphs are characterized
by many short-distance links between neighboring nodes, as
well as a few relatively long-range (direct) shortcuts. Long-range
shortcuts implemented through mm-wave wireless links,
operating in the 10-100 GHz range, have been shown to
improve the energy dissipation profile and latency
characteristics of multicore chips [6]. Also, it has been seen that
characteristics of multicore chips [6]. Also, it has been seen that
by utilizing the wireless links, the network load can be
significantly reduced with respect to conventional mesh
topologies in a very flexible manner [26]. This allows us to
implement more aggressive dynamic VFI while maintaining the
required network throughput. Hence, in this work, we design
the WiNoC architecture to support efficient data exchanges
among various VFI domains. This is done by creating the
wireline network, physically arranging the cores, and placing
the wireless links using the knowledge of the VFI domains and
their traffic characteristics.
1) Wireless NoC Architecture: In WiNoC, the wireline links
are designed using a power-law model [27]. We assume an
average number of connections, ⟨𝑘⟩, from each NoC switch to
the other switches. The value of ⟨𝑘⟩ is chosen to be four so that
the WiNoC does not introduce any additional switch overhead
with respect to a conventional mesh. Also, an upper bound,
𝑘𝑚𝑎𝑥 , is imposed on the number of ports attached to a particular
switch so that no switch becomes unrealistically large. This also
reduces the skew in the distribution of links among the
switches. There is no specific lower bound on the number of
ports attached to a switch but a fully connected network implies
that this number must be at least 1. Neither ⟨k⟩ nor k_max
includes the local port connecting each NoC switch to its core.
Due to the nature of the VFI clustering, additional constraints
need to be applied to the connectivity of the WiNoC. The
distribution of links is divided into two steps: VFI intra-cluster
connections, which ensure each cluster's connectivity, and VFI
inter-cluster connections, which enable communication between the
clusters. This ensures that both intra-cluster and inter-cluster
communications have sufficient resources and that neither
becomes a bottleneck in the overall data exchange.
For each switch, ⟨k⟩ is divided into two parts, ⟨k_intra⟩ and
⟨k_inter⟩, the average numbers of intra-cluster and inter-cluster
connections to other switches, respectively. For the VFI
intra-cluster connections, each cluster is treated separately. A
network is created for each cluster such that the connectivity
follows the power-law model; the network cluster is fully
connected and has an average intra-cluster connectivity,
⟨𝑘𝑖𝑛𝑡𝑟𝑎 ⟩.
The VFI inter-cluster connections are created such that the
connectivity also follows the same power-law model as the
intra-cluster connections and has an average inter-cluster
connectivity, ⟨k_inter⟩. The number of links going from one
cluster to another is determined by the inter-VFI traffic: the
proportion of links allocated between two clusters matches the
proportion of the total inter-cluster traffic exchanged between
those two clusters.
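As a rough sketch of this proportional link-allocation rule, assuming a largest-remainder rounding scheme of our own choosing (the paper does not specify how fractional shares are rounded):

```python
def allocate_inter_links(traffic, total_links):
    """Split a budget of inter-cluster links among cluster pairs in
    proportion to each pair's share of the total inter-cluster traffic.
    At least one link per communicating pair is our own design choice."""
    total = sum(traffic.values())
    # Fractional share of the link budget for each cluster pair.
    raw = {pair: total_links * t / total for pair, t in traffic.items()}
    links = {pair: max(1, int(r)) for pair, r in raw.items()}
    # Largest-remainder rounding so the counts add up to total_links.
    leftover = total_links - sum(links.values())
    for pair in sorted(raw, key=lambda p: raw[p] - int(raw[p]), reverse=True):
        if leftover <= 0:
            break
        links[pair] += 1
        leftover -= 1
    return links
```

For example, traffic shares of 60/30/10 over a 10-link budget yield 6, 3, and 1 links, respectively.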
The two principal wireless interface (WI) components are the
antenna and the transceiver. The on-chip antenna for the
WiNoC has to provide the best power gain for the smallest area
overhead. A metal zigzag antenna has been demonstrated to
possess these characteristics, and hence it is considered in this
work [28]. To ensure high throughput and energy efficiency,
the WI transceiver circuitry has to provide a very wide
bandwidth, as well as low power consumption. The detailed
description of the transceiver circuit is out of the scope of this
paper. With a data rate of 16 Gbps, the wireless link dissipates
1.95 pJ/bit. The total area overhead per wireless transceiver is
0.25 mm2 [29].
2) Wireless Link and Core Placement: To facilitate
predominantly long-distance communication, we use mm-wave
wireless links to communicate among distant cores. These
wireless links along with the small-world wireline architecture,
aid in quick and efficient inter-core, inter-VFI, and core-to-
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) <
controller communication, especially for large VFI clusters. It
is possible to create three non-overlapping channels with on-chip mm-wave wireless links. Using these three channels, we
overlay the wireline small-world connectivity with the wireless
links such that a few switches get an additional wireless port.
Each of these wireless ports will have a wireless interface (WI)
tuned to one of the three wireless channels. The WI placement
is most energy-efficient when the distance between WIs is at
least 7.5 mm for the 65 nm technology node [6]. For a 64-core
system, the optimum number of WIs is twelve, with four WIs
assigned to each wireless channel [29].
In this work, we physically arrange the cores and place the
wireless links in order to minimize the traffic-weighted hop
count. We determine the physical locations of all the cores
running a particular thread in order to minimize the distance of
highly communicating cores. Then, the wireline network is
created as described in Section III.B.1. Simulated annealing is
finally used to find the optimal WI placements that minimize
the average traffic-weighted hop-count assuming the WI
constraints discussed earlier. More frequently-communicating
WIs are assigned to the same channel to optimize the overall
hop count.
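A simplified sketch of the simulated-annealing WI placement follows; the ring wireline topology, cooling schedule, and single-swap move are our assumptions for illustration, not the paper's exact setup:

```python
import itertools
import math
import random

def ring_dist(n, a, b):
    # Hop distance on an n-node ring (stand-in for the wireline network)
    return min((a - b) % n, (b - a) % n)

def hop_count(n, wis, src, dst):
    """Shortest hops when any two wireless interfaces (WIs) are one hop apart."""
    best = ring_dist(n, src, dst)
    for wa, wb in itertools.permutations(wis, 2):
        best = min(best, ring_dist(n, src, wa) + 1 + ring_dist(n, wb, dst))
    return best

def weighted_hops(n, wis, traffic):
    # Cost function: traffic-weighted total hop count
    return sum(w * hop_count(n, wis, s, d) for (s, d), w in traffic.items())

def anneal_wi_placement(n, num_wi, traffic, steps=2000, t0=2.0, seed=1):
    """Simulated annealing: swap one WI for a non-WI node each step and
    accept uphill moves with Boltzmann probability under linear cooling."""
    rng = random.Random(seed)
    wis = rng.sample(range(n), num_wi)
    cost = weighted_hops(n, wis, traffic)
    best, best_cost = wis[:], cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9
        cand = wis[:]
        cand[rng.randrange(num_wi)] = rng.choice(
            [v for v in range(n) if v not in wis])
        c = weighted_hops(n, cand, traffic)
        # Always accept improvements; accept worse moves probabilistically.
        if c < cost or rng.random() < math.exp((cost - c) / temp):
            wis, cost = cand, c
            if cost < best_cost:
                best, best_cost = wis[:], cost
    return best, best_cost
```

The cost function is the traffic-weighted hop count minimized in the text; a real implementation would replace the ring with the generated small-world wireline graph and honor the WI distance and channel constraints.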
3) Routing and Flow Control: Due to the irregular nature of
our WiNoC architecture, routing is done as presented in [8].
Adaptive Layered Shortest Path Routing (ALASH) is used as
the routing algorithm [30], allowing for messages to be routed
along the shortest path between the source and destination while
maintaining deadlock freedom. A wireless token passing
protocol is used to arbitrate access to the WIs where the WI
holding the token is given access to the wireless channel.
C. Dynamically Tuned VFIs
The application characteristics, i.e., core utilization and traffic
information, tend to vary throughout the runtime of every
benchmark. Therefore, static V/F tuning, although simple, tends
to be suboptimal. Here, we take advantage of the temporal
variations in the application by dynamically tuning the V/F of
each VFI.
For our VFI-enabled system, we create a dynamic V/F
controller for each VFI that determines how to tune the V/F
pairs every T cycles. The major difficulty in dynamically tuning
the V/F pairs for VFIs when compared to traditional single
core/router DVFS mechanisms lies in two parts: (i) the
determination of a suitable V/F to apply to all cores/routers
within the VFI and (ii) the transmission of core utilization and
traffic information from each core in the VFI to a local
controller.
In order to determine a suitable V/F to apply to all cores and
routers within a VFI, we obtain a metric that incorporates
information from all elements in the VFI. We start by defining
the core utilization of core 𝑖 (𝑢𝑖 (𝑡)) and link utilization for the
link between core 𝑘 and 𝑙 (𝑙𝑢𝑘𝑙 (𝑡)):
𝑢𝑖 (𝑡) =
𝐵𝑢𝑠𝑦(𝑡,𝑖)
𝐶𝑦𝑐𝑙𝑒𝑠(𝑡,𝑉𝐹𝐼𝑗)
𝐹𝑙𝑖𝑡𝑠(𝑡,𝑘,𝑙)
𝑙𝑢𝑘𝑙 (𝑡) =
, 𝑖 ∈ 𝑉𝐹𝐼𝑗
𝐶𝑦𝑐𝑙𝑒𝑠(𝑡,𝑉𝐹𝐼𝑗 )
, 𝑙 ∈ 𝑉𝐹𝐼𝑗
𝐶𝑦𝑐𝑙𝑒𝑠(𝑡, 𝑉𝐹𝐼𝑗 ) is the number of total cycles for window 𝑡 for
VFI 𝑗 and 𝐹𝑙𝑖𝑡𝑠(𝑡, 𝑘, 𝑙) is the number of flits received by core 𝑙
from core 𝑘 during window 𝑡. Each VFI j has its own V/F
controller that operates independently that calculates a metric,
𝑚(𝑡), using the core and link data; this metric is used in the V/F
determination based on the information during each window 𝑡:
𝑚(𝑡) = 𝜔𝑢 ∑∀𝑖∈𝑉𝐹𝐼𝑗
𝑢𝑖 (𝑡)
|𝑉𝐹𝐼𝑗 |
+ 𝜔𝑐 ∑∀𝑘∉𝑉𝐹𝐼𝑗
∀𝑙∈𝑉𝐹𝐼𝑗
𝑙𝑢𝑘𝑙 (𝑡)
𝐼𝑛𝑡𝑒𝑟𝐿𝑖𝑛𝑘𝑠(𝑉𝐹𝐼𝑗 )
(10)
where |𝑉𝐹𝐼𝑗 | is the number of cores in VFI j, 𝐼𝑛𝑡𝑒𝑟𝐿𝑖𝑛𝑘𝑠(𝑉𝐹𝐼𝑗 )
is the number of inter-VFI links connected to VFI j, and 𝜔𝑢 and
𝜔𝑐 are the weights for the utilization and communication parts
respectively. Intuitively, m(𝑡) is the weighted sum of the VFI’s
average core utilization and average incoming inter-VFI link
utilization. The weights 𝜔𝑢 and 𝜔𝑐 are calculated as the
proportion of core utilization to traffic during the current time
window t. Although each V/F controller operates
independently, the inter-VFI link utilization inherently carries
information from the other VFIs. We can model the
𝐹𝑙𝑖𝑡𝑠(𝑡, 𝑘, 𝑙) equation as:
𝐹𝑙𝑖𝑡𝑠(𝑡, 𝑘, 𝑙) = min (𝜆̅𝑘𝑙 (𝑡) ∗ 𝐶𝑦𝑐𝑙𝑒𝑠(𝑡, 𝑉𝐹𝐼𝑛 ) , 𝐶𝑦𝑐𝑙𝑒𝑠(𝑡, 𝑉𝐹𝐼𝑗 ))
(11)
where 𝜆̅𝑘𝑙 (𝑡) is the average arrival rate for the link going from
core k to core l during window 𝑡 and 𝐶𝑦𝑐𝑙𝑒𝑠(𝑡, 𝑉𝐹𝐼𝑛 ) is the
number of cycles for VFI 𝑛. A higher V/F level in the sending VFI 𝑛 would increase 𝐶𝑦𝑐𝑙𝑒𝑠(𝑡, 𝑉𝐹𝐼𝑛) and raise the inter-VFI link utilization; thus, we can infer the V/F level of connected VFIs from the level of inter-VFI link utilization.
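The window statistics and the metric of Eqs. (8)-(10) can be sketched in a few lines; the cycle and flit counts below are made-up illustrative values, and the helper names are ours, not from the controller implementation:

```python
# Sketch of the per-VFI metric of Eqs. (8)-(10).
# All window statistics below are made-up illustrative values.

def core_utilization(busy_cycles, vfi_cycles):
    # Eq. (8): fraction of the window's cycles core i spent busy.
    return busy_cycles / vfi_cycles

def link_utilization(flits, vfi_cycles):
    # Eq. (9): flits received over the window, normalized by cycles.
    return flits / vfi_cycles

def metric(core_utils, inter_link_utils, w_u, w_c):
    # Eq. (10): weighted sum of average core utilization and
    # average incoming inter-VFI link utilization.
    avg_u = sum(core_utils) / len(core_utils)
    avg_lu = sum(inter_link_utils) / len(inter_link_utils)
    return w_u * avg_u + w_c * avg_lu

# Example window for a 4-core VFI with two incoming inter-VFI links.
vfi_cycles = 10_000
core_utils = [core_utilization(b, vfi_cycles)
              for b in (9_000, 7_000, 8_000, 6_000)]
link_utils = [link_utilization(f, vfi_cycles) for f in (2_000, 1_000)]

# Weights reflect the proportion of core utilization to traffic.
m = metric(core_utils, link_utils, w_u=0.8, w_c=0.2)
print(round(m, 3))  # 0.63
```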
Based on the value of 𝑚(𝑡), a threshold mechanism is used
to calculate the predicted V/F for the next time window, 𝑡 + 1,
and the V/F is adjusted for the VFI accordingly. We propose the
following threshold mechanism shown in Eq. (12), where the
V/F is set to a value with respect to the nominal V/F value:
        { 1.0,  if 𝑚(𝑡) + Δ𝑀 > 0.9
        { 0.9,  if 0.9 ≥ 𝑚(𝑡) + Δ𝑀 > 0.8
𝑉/𝐹 =  {  ⋮
        { 0.6,  if 0.6 ≥ 𝑚(𝑡) + Δ𝑀 > 0.5
        { 0.5,  if 0.5 ≥ 𝑚(𝑡) + Δ𝑀          (12)
where ΔM is a threshold offset that adjusts the mapping of
the metric to a particular V/F by a fixed value. Fig. 2 illustrates
how various values of ΔM affect the mapping from m(t) to V/F.
Increasing ΔM shifts the effective metric 𝑚(𝑡) + Δ𝑀 upward, which maps to higher V/F values, while decreasing ΔM maps to lower V/F values. Therefore, higher values of ΔM typically lead to lower execution times and higher energy dissipation, and vice versa. We introduce ΔM as a way to statically compensate for tracking errors.

Fig. 2. Effects of ΔM on the mapping from 𝑚(𝑡) to V/F.

Fig. 3. Block diagram of the V/F controller.
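The threshold mechanism of Eq. (12) can be sketched as a simple lookup; the function name is ours, and the 0.1-wide bins follow the normalized V/F levels used in this work:

```python
# Sketch of the threshold mechanism of Eq. (12): the metric m(t),
# offset by dM, is mapped onto the discrete normalized V/F levels.

def select_vf(m, dM):
    # Normalized V/F levels 1.0 ... 0.5 in 0.1-wide metric bins.
    x = m + dM
    if x > 0.9:
        return 1.0
    for level in (0.9, 0.8, 0.7, 0.6):
        if x > level - 0.1:
            return level
    return 0.5

print(select_vf(0.72, 0.0))    # 0.8  (0.72 falls in (0.7, 0.8])
print(select_vf(0.72, -0.09))  # 0.7  (the offset pulls it one level down)
```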
One possible usage of ΔM is to help compensate for the tracking error: we can choose the ΔM value that minimizes the mean absolute error (MAE) between the predicted V/F and the metric from the following window, 𝑚(𝑡 + 1). We will refer to the ΔM that minimizes the tracking MAE as ΔMopt.
ΔM can also be used to compensate for intra-VFI core utilization variation. Since the V/F value decided for each VFI is based on the average utilization and traffic, a number of cores will incur performance penalties if there is significant variation within the cluster. Therefore, we augment ΔM by the average standard deviation of the core utilization within each VFI (ΔMstd) to account for the variability among the cores in the VFI.
In this work, we propose to use ΔMopt+std as a suitable ΔM,
where ΔMopt+std is defined as:
ΔMopt+std = ΔMopt + ΔMstd
(13)
Intuitively, the ΔMopt element allows the ΔM to compensate
for tracking errors created by following 𝑚(𝑡) while the ΔMstd
component augments this value in order to overcome the
shortcomings of using average utilization and traffic values in
𝑚(𝑡). Together, ΔMopt+std optimizes the energy by fitting closely to the application characteristics (using ΔMopt) while minimizing execution time penalties by accommodating more cores (using ΔMstd). In Section IV, we investigate the use of this ΔM value and show its usefulness for the benchmarks considered in this work.
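A sketch of how ΔMopt and ΔMstd of Eq. (13) could be computed offline is shown below; the metric trace and utilization windows are made-up example data, not traces from our simulations:

```python
# Offline sketch of choosing dM per Eq. (13): dM_opt minimizes the
# MAE between the predicted V/F and the next window's metric, while
# dM_std is the average per-window standard deviation of core
# utilization. The traces below are made-up example data.

def select_vf(m, dM):
    # Threshold mechanism of Eq. (12).
    x = m + dM
    if x > 0.9:
        return 1.0
    for level in (0.9, 0.8, 0.7, 0.6):
        if x > level - 0.1:
            return level
    return 0.5

def tracking_mae(metrics, dM):
    # MAE between the V/F predicted from m(t) and the metric m(t+1).
    errs = [abs(select_vf(metrics[t], dM) - metrics[t + 1])
            for t in range(len(metrics) - 1)]
    return sum(errs) / len(errs)

def mean_std(per_window_utils):
    # Average standard deviation of core utilization across windows.
    stds = []
    for utils in per_window_utils:
        mu = sum(utils) / len(utils)
        stds.append((sum((u - mu) ** 2 for u in utils) / len(utils)) ** 0.5)
    return sum(stds) / len(stds)

metrics = [0.62, 0.66, 0.71, 0.69, 0.75, 0.73]   # example m(t) trace
windows = [[0.6, 0.7, 0.8], [0.5, 0.7, 0.9]]     # example core utils
sweep = [x / 100 for x in range(-40, 41)]        # dM in [-0.4, 0.4]
dM_opt = min(sweep, key=lambda d: tracking_mae(metrics, d))
dM = dM_opt + mean_std(windows)                  # dM_opt+std, Eq. (13)
```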
Due to the simplicity of this controller, each core only needs
to send its core utilization and inter-VFI link utilization
information to its local V/F controller per decision-making
window. We leverage the overall NoC architecture to
communicate this information and analyze its impact on the
system in section IV.C. Fig. 3 presents a high-level block
diagram of the V/F controller: the Static Counter is a simple
counter that determines the V/F switching frequency, Metric is
the block that calculates m(𝑡), and V/F Calc. is the threshold
mechanism that calculates the proper V/F for time window 𝑡 +
1. In Section IV.C, we discuss how the frequency of the decision-making and the threshold values are calculated to optimize the system for each application. We aim to demonstrate how we can co-design the elements in this framework to achieve significant energy savings without losing performance; therefore, we adopt a flexible, lightweight controller that can be tuned using information about the VFI structure.
D. Overall Integration
Through the previous sections, we have described how the
VFI, WiNoC and dynamic V/F tuning can be co-designed to
create an optimized full-system design. Fig. 4 summarizes the key benefits exchanged among these three paradigms. When implementing VFIs, we can cluster highly communicating cores to aid the WiNoC, and cluster cores with similar application characteristic variations to aid dynamic V/F tuning.
WiNoCs can be designed to take into account the shape and
characteristics of the VFI in order to reduce the performance
degradation and reduce the delay between each core and its V/F
controller. Lastly, dynamic V/F tuning can take advantage of
the application slack present in both VFIs and WiNoCs to
optimize the energy with low performance penalties.
IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. Experimental Setup
In this section, we evaluate the performance and energy
dissipation of the WiNoC-enabled multicore chip compared to
conventional wireline mesh-based designs in the presence of
the hybrid VFI clustering (section III.A) and dynamic V/F
tuning (section III.C). We use GEM5 [31], a full system
simulator, to obtain detailed processor and network-level
information. In all experiments, we consider a system running
Linux within the GEM5 platform in full-system (FS) mode.
Since FS mode running Linux with Alpha cores is limited to a
maximum of 64 cores, all experiments are done on a 64-core
system. All full-system simulations were run with the default GEM5 packet size of six 128-bit flits. The
MOESI_CMP_directory memory setup is used with private
64KB L1 instruction and data caches and a shared 8MB (128
KB distributed per core) L2 cache. Four SPLASH-2
benchmarks, i.e., FFT, RADIX, LU, and WATER [32], and
four PARSEC benchmarks, i.e., CANNEAL, DEDUP,
FLUIDANIMATE (FLUID) and VIPS [33] are considered. The
Fig. 4. Inter-paradigm benefits among the WiNoC, VFI, and V/F tuning paradigms.
TABLE III
VFI SIZES FOR HETEROGENEOUS CLUSTERING AND THEIR RESPECTIVE STATIC V/F LEVELS.

Benchmark   VFI1       VFI2       VFI3      VFI4
CANNEAL     22 – 0.6   22 – 1.0   16 – 0.6   4 – 0.9
DEDUP       40 – 0.9   16 – 1.0    4 – 1.0   4 – 1.0
FFT         29 – 0.9   23 – 1.0    7 – 0.9   5 – 0.9
FLUID       40 – 0.9   16 – 1.0    4 – 0.7   4 – 0.8
LU          32 – 0.8   24 – 1.0    4 – 0.6   4 – 0.9
RADIX       37 – 1.0   19 – 0.9    4 – 0.9   4 – 0.8
VIPS        30 – 0.7   26 – 0.9    4 – 0.7   4 – 0.9
WATER       33 – 0.8   23 – 1.0    4 – 0.9   4 – 0.7
processor-level statistics generated by the GEM5 simulations
are incorporated into McPAT to determine the processor-level
power values [34].
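As a quick consistency check on Table III, the heterogeneous cluster sizes for every benchmark must account for all 64 simulated cores and respect the minimum cluster size of four; the sizes below are read directly from the table:

```python
# Consistency check for Table III: for every benchmark, the four
# heterogeneous VFI cluster sizes must sum to the 64 simulated cores,
# and each cluster must contain at least four cores.
het_sizes = {
    "CANNEAL": (22, 22, 16, 4),
    "DEDUP":   (40, 16, 4, 4),
    "FFT":     (29, 23, 7, 5),
    "FLUID":   (40, 16, 4, 4),
    "LU":      (32, 24, 4, 4),
    "RADIX":   (37, 19, 4, 4),
    "VIPS":    (30, 26, 4, 4),
    "WATER":   (33, 23, 4, 4),
}
for bench, sizes in het_sizes.items():
    assert sum(sizes) == 64, bench
    assert min(sizes) >= 4, bench
```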
In this work, we consider operation over the nominal voltage range. Hence, the adopted dynamic V/F strategy uses discrete V/F pairs that maintain a linear relationship. The considered V/F pairs are:
1V/2.5GHz, 0.9V/2.25GHz, 0.8V/2.0GHz, 0.7V/1.75GHz,
0.6V/1.5GHz, and 0.5V/1.25GHz. For the remainder of this
work these V/F pairs will be referred to as 1.0, 0.9, 0.8, 0.7, 0.6
and 0.5 respectively. By using on-chip voltage regulators with
fast transitions, latency penalties and energy overheads due to
voltage transitions can be kept low. We estimate the energy
overhead introduced by the regulators due to voltage transition
as:
𝐸𝑟𝑒𝑔𝑢𝑙𝑎𝑡𝑜𝑟 = (1 − 𝜂) ∙ 𝐶𝑓𝑖𝑙𝑡𝑒𝑟 ∙ |𝑉2² − 𝑉1²|
(14)
where, Eregulator is the energy dissipated by the voltage regulator
due to a voltage transition, η is the power efficiency of the
regulator, Cfilter is the regulator filter capacitance, and V2 and V1
are the two voltage levels. Both the regulator switching and
dynamic VFI controller energies, described in section IV.C, are
taken into account while analyzing entire system energy.
The synchronization delay associated with the mixed-clock/mixed-voltage FIFOs at the boundaries of each VFI has been incorporated into the simulations following [35]. The energy overhead for each VFI has been incorporated into the simulations as proposed in [24]:
𝐸𝑉𝐹𝐼 = 𝐸𝐶𝑙𝑘𝐺𝑒𝑛 + 𝐸𝑟𝑒𝑔𝑢𝑙𝑎𝑡𝑜𝑟 + 𝐸𝑀𝑖𝑥𝐶𝑙𝑘𝐹𝑖𝑓𝑜
(15)
where 𝐸𝐶𝑙𝑘𝐺𝑒𝑛 is the energy overhead of generating additional
clock signals [36] and 𝐸𝑀𝑖𝑥𝐶𝑙𝑘𝐹𝑖𝑓𝑜 is the energy overhead of the
mixed-clock/mixed-voltage FIFOs. We should also note that this overhead is the same irrespective of the interconnect architecture considered; indeed, it is a fixed overhead in all the VFI NoC architectures considered in this work.
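The overheads of Eqs. (14) and (15) can be illustrated with a short sketch; the regulator efficiency η and filter capacitance C_filter below are placeholder values for illustration, not the parameters used in our experiments:

```python
# Sketch of the regulator transition energy of Eq. (14). The
# efficiency eta and filter capacitance c_filter are illustrative
# placeholders, not the values used in the experiments.

def regulator_energy(v1, v2, eta=0.90, c_filter=10e-9):
    # Eq. (14): E_regulator = (1 - eta) * C_filter * |V2^2 - V1^2|.
    return (1 - eta) * c_filter * abs(v2 ** 2 - v1 ** 2)

# The six operating points (normalized level -> (volts, GHz)); voltage
# and frequency scale linearly over the nominal range.
vf_pairs = {1.0: (1.0, 2.5), 0.9: (0.9, 2.25), 0.8: (0.8, 2.0),
            0.7: (0.7, 1.75), 0.6: (0.6, 1.5), 0.5: (0.5, 1.25)}

# Energy for one transition from the 1.0 level down to the 0.8 level.
e_switch = regulator_energy(vf_pairs[1.0][0], vf_pairs[0.8][0])
print(e_switch)  # ~3.6e-10 J for these placeholder parameters
```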
B. VFI Parameters
Following the methodology described in section III.A.5, we
determine the V/F pairs for all clusters using the hybrid
clustering approach. In this work, we investigate two separate
configurations: homogeneous (Hom) clustering, where the whole system is divided into four equally sized clusters, and heterogeneous (Het) clustering, where we still have four clusters but they are not of equal size. In Het clustering, the clusters are only required to contain at least four cores. Tables II and III show each cluster size followed by its respective static V/F pair (VFI SIZE – V/F) for Hom and Het clustering
TABLE II
VFI SIZES FOR HOMOGENEOUS CLUSTERING AND THEIR RESPECTIVE STATIC V/F LEVELS.

Benchmark   VFI1       VFI2       VFI3       VFI4
CANNEAL     16 – 0.7   16 – 0.6   16 – 1.0   16 – 0.9
DEDUP       16 – 1.0   16 – 1.0   16 – 0.9   16 – 0.9
FFT         16 – 1.0   16 – 1.0   16 – 0.9   16 – 0.9
FLUID       16 – 0.8   16 – 0.8   16 – 1.0   16 – 1.0
LU          16 – 0.7   16 – 0.7   16 – 1.0   16 – 1.0
RADIX       16 – 1.0   16 – 0.9   16 – 1.0   16 – 0.9
VIPS        16 – 0.6   16 – 0.8   16 – 0.7   16 – 1.0
WATER       16 – 0.8   16 – 0.8   16 – 0.9   16 – 1.0

Fig. 5. Average number of inter-VFI interfaces traversed per message for VFI Hom Mesh, VFI Het Mesh, VFI Hom WiNoC and VFI Het WiNoC.

respectively. These configurations were obtained by the
clustering approach described in Section III.A for a
performance target of α = 95%, which means achieving at least
95% of performance of the system with all clusters running at
the nominal V/F, i.e. 1.0. For the dynamic V/F configurations,
the V/F values will be adjusted throughout the execution in
response to changing application characteristics. When creating
the WiNoC, we determine the parameters described in Section III.B.1: as shown in [9], the configuration ⟨𝑘𝑖𝑛𝑡𝑒𝑟⟩ = 1, ⟨𝑘𝑖𝑛𝑡𝑟𝑎⟩ = 3, 𝑘𝑚𝑎𝑥 = 7 optimizes the network latency and energy.
To further demonstrate how the WiNoC network is better
suited for VFIs, we analyze the number of inter-VFI interfaces
that a message needs to traverse on average in both mesh and
WiNoC networks (refer to Fig. 5). It can be seen that messages in WiNoC networks need to traverse fewer inter-VFI interfaces on average, since the connectivity and ALASH routing guarantee that each message will pass through at most one interface. On the other hand, mesh depends on the structure of the VFIs and the routing (standard X-Y is considered here) and
makes no such guarantees. Also, Het WiNoC is able to reduce
the number of interfaces traversed over Hom WiNoC by
clustering more traffic within each cluster.
C. Implementation of the Dynamic VFI Control Circuit
In this section, we discuss the implementation of the dynamic V/F controller as described in Section III.C. The dynamic V/F controller is synthesized from an RTL-level design using the TSMC 65-nm CMOS process and Synopsys™ Design Vision. The circuit has been designed to operate on a 1.25 GHz clock. The area overhead for each controller is 0.06 mm² in the worst-case scenario, i.e., a controller for a VFI containing all 64 cores.
Due to the simplicity of the application information used, as described in Section III.C, the core-to-controller traffic is insignificant when compared to the total amount of traffic traversing the system.
Fig. 6. Mean absolute error (MAE) between the predicted V/F value and the next window's metric for various values of ΔM for (a) FLUID and (b) RADIX. The optimal ΔM that minimizes the MAE for each VFI is also noted: for FLUID, ΔM = −0.09 for all four VFIs; for RADIX, ΔM = −0.05, −0.07, −0.06, and −0.06 for VFI1 through VFI4, respectively.

Fig. 7. A sample of the average core utilization for each window of a VFI running (a) FLUID and (b) RADIX. The error bars denote the minimum and maximum core utilization during that particular window.
Previously in Section III.C, we had discussed using two
different values to determine ΔM: ΔMopt, the value of ΔM that
minimizes the tracking error, and ΔMstd, the standard deviation
in the core utilization of the cluster. The combination of the two, ΔMopt+std (Eq. (13)), provides benefits from both measures. In order to determine ΔMopt, we swept the values
of ΔM per VFI for each benchmark to find the value of ΔM that
minimizes the tracking MAE. Fig. 6 shows how the MAE
changes while sweeping through values of ΔM. We consider
two benchmarks, FLUID and RADIX, as examples. The
optimal ΔM for each VFI is also shown here. Following the
same process, we determined the optimal ΔM of each VFI for
all the benchmarks under consideration. The parameter ΔMstd is
calculated as the average standard deviation of the core
utilization of all dynamic decision windows per VFI. To justify
our usage of ΔMstd, in Fig. 7 we show the average, minimum (denoted as negative error) and maximum (denoted as positive error) core utilization for a sample period of the FLUID and RADIX benchmarks for a VFI. From this plot, we can see that
there are cores that significantly deviate from the mean, which
could cause large execution time penalties if we calculate the
V/F simply from the average core utilization. Hence, we use
ΔMopt+std as the ΔM value for all DVFI configurations.
In order for the controller to make the best decision possible,
the ability to quickly receive the most up-to-date information
from each core in the VFI is essential. In this context, the
maximum delay between any core and its local V/F controller
is the most relevant metric. To demonstrate how the WiNoC
helps facilitate core to controller communication, we show in
Fig. 8 that the WiNoC reduces the maximum hop-count
between a core and its local V/F controller compared to
standard mesh architectures. Not only do lower hop-counts result in quicker communication, but they also reduce the core-to-controller communication's impact on standard inter-core traffic. It should be noted that the difference between the Mesh and WiNoC maximum hop-count values increases with the size of the largest VFI cluster, VFI1, as seen in Table III. For example, FLUID has the largest VFI1 cluster and the largest difference between the Mesh and WiNoC maximum hop-count values. On the other hand, CANNEAL has the smallest VFI1 cluster and the smallest difference. This is a result of the better connectivity of the small-world architecture and the long-range wireless shortcuts. By using the WiNoC, we provide a scalable and efficient communication backbone for DVFI.

The total traffic generated for the V/F controllers contributes less than 0.05% of the total traffic for all benchmarks considered. The decision-making delay for the V/F controller is 8.8 ns. In this work, we ensure that the V/F controller delay is less than 1% of the V/F switching window period, T; hence, the controller does not introduce any significant time overhead with respect to the overall switching window. Therefore, we set the lower bound 𝑇 ≥ 1𝜇𝑠. The switching window period 𝑇 is swept throughout the range 1𝜇𝑠 ≤ 𝑇 ≤ 1𝑚𝑠. The energy per switching window is, in the worst case, 39.1 𝑛𝐽. This energy is added to the VFI overhead of Eq. (15) for the dynamic V/F configurations. In Section IV.D, we present the results for the value of T that optimizes the full-system energy-delay product (EDP) for each benchmark. The optimal values of T were found to be between 9𝜇𝑠 and 1𝑚𝑠 depending on the benchmark considered.

Fig. 8. Maximum hop-count between any core and its respective V/F controller for Mesh and WiNoC architectures.

Fig. 10. Percentage of inter-VFI traffic of total traffic for homogeneous (HOM) and heterogeneous (HET) clustering.
D. Full System Performance Evaluation
In order to evaluate the performance of our proposed
framework for multicore chips, we consider the effects of the
different VFI clustering, network configurations and V/F
tuning. As the baseline to all our configurations, we consider
the commonly used non-VFI (NVFI) mesh architecture.
1) Effects of VFI Clustering: In this section, we consider two VFI configurations. The first is a statically tuned VFI (SVFI) with Hom clustering (SVFI-HOM), which makes four equal-sized VFI clusters with statically configured V/F values; this is used as a comparison to currently existing VFI work [9]. The
second configuration is SVFI with Het clustering (SVFI-HET)
which creates VFI clusters according to Section III.A.5 with
statically configured V/F values. We also include these
clustering configurations with both Mesh and WiNoC
topologies to analyze the effects of the network.
We compare the execution time for all of the configurations to the baseline NVFI Mesh design in Fig. 9. Since Het clustering places significantly fewer restrictions on the VFI clustering, this mechanism is able to obtain a better configuration than Hom clustering. Therefore, Het clustering is able to contain more of the traffic within each VFI cluster, which improves the overall performance of the platform.
Fig. 9. The execution time of the NVFI, SVFI-HOM, and SVFI-HET configurations using Mesh and WiNoC with respect to NVFI Mesh for all benchmarks considered.

Fig. 11. Average traffic-weighted hop-count for both Mesh and WiNoC architectures.

In Fig. 10 we can see that Het clustering is able to reduce the
amount of inter-VFI traffic when compared to Hom clustering
for all benchmarks considered. This helps Het clustering
maintain or improve the execution time compared to Hom
clustering. Also, as expected, in the presence of VFIs, traditional mesh-based designs suffer from degraded execution time. The WiNoC-enabled designs all outperform their Mesh counterparts due to the better connectivity of the WiNoC. We analyze the connectivity of the WiNoC by investigating the average traffic-weighted hop-count in Fig. 11: WiNoC significantly reduces the average traffic-weighted hop-count when compared to Mesh for all the benchmarks considered. Therefore, WiNoC is able to transfer inter-core communication much more quickly than Mesh.
The other important parameter in analyzing our VFI design is the energy dissipation profile. Since VFI designs principally save energy at the cost of performance, the most relevant metric to consider is the energy-delay product (EDP); here, delay refers to the execution time. Fig. 12 shows the EDP for all configurations with respect to NVFI Mesh. We can see that for all benchmarks considered, the WiNoC configurations are able to save EDP over their Mesh counterparts. This is due to WiNoC's ability to reduce the performance impact and lower network energy through better connectivity and low-power wireless links. Similarly to the execution time analysis above, Het clustering is able to more effectively group cores with similar utilization together compared to Hom clustering. As such, Het clustering is able to apply lower V/F levels without significantly impacting performance, resulting in better EDP
profiles. Based on this analysis, we utilize Het clustering when comparing statically tuned and dynamically tuned VFIs.

Fig. 12. The EDP of the NVFI, SVFI-HOM, and SVFI-HET configurations using Mesh and WiNoC with respect to NVFI Mesh for all benchmarks considered.

Fig. 14. Traffic intensity of all the benchmarks considered with respect to the highest injection rate benchmark (RADIX).
2) Static vs. Dynamic V/F Tuning: In this section, the two main configurations under consideration are SVFI and DVFI, both
with Het clustering. We also include the Mesh and WiNoC
configurations in order to give a final picture of all the
components acting in harmony.
We compare the execution time for all of the configurations
considered here to the baseline NVFI mesh in Fig. 13. Again
we can see that WiNoC outperforms Mesh in every
configuration. One important thing to note is that the DVFI
WiNoC is able to perform at or better than NVFI Mesh for a
majority of the benchmarks considered. The only exceptions are DEDUP and VIPS, where DVFI WiNoC operates at a 2.5% and 3.5% penalty, respectively. This is mostly due to the
low traffic injection rates of these benchmarks. Fig. 14 shows
the relative traffic intensity of each benchmark, measured as the
traffic injection rate normalized to the highest value, in this case
RADIX. It can be seen that VIPS is the lowest traffic intensity
benchmark, thus reducing the network’s ability to improve the
performance. The other extremely low traffic intensity
benchmarks DEDUP and FLUID also have limited benefits
from using the WiNoC compared to the other benchmarks.

Fig. 13. The execution time of the NVFI, SVFI, and DVFI configurations using Mesh and WiNoC with respect to NVFI Mesh for all benchmarks considered.

Fig. 15. Distance of communication characteristics of the benchmarks considered (fraction of traffic with source-destination distance x ≤ 2.5 mm, 2.5 mm < x ≤ 5 mm, 5 mm < x ≤ 7.5 mm, and x > 7.5 mm).

Also, the level of long-range communication has an effect on how much improvement can be gained through the network. Fig. 15
shows the proportion of total traffic between cores of specific
distance ranges for each benchmark. It can be seen that for WATER, the amount of traffic traversing over 7.5 mm is significantly lower than for the other benchmarks, resulting in lower performance gains with WiNoC. Traffic traversing over 7.5 mm
is considered to be long-range traffic since the wireless link
becomes more energy efficient than the wireline counterpart
beyond this distance as mentioned in Section III.B.2.
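A toy helper illustrating the 7.5 mm rule of thumb above; the threshold comes from the text, while the function itself is purely illustrative:

```python
# Toy helper for the 7.5 mm rule of thumb: beyond this distance the
# wireless link is the more energy-efficient choice. The threshold
# is from the text; the function itself is purely illustrative.

LONG_RANGE_MM = 7.5

def preferred_link(distance_mm):
    # Pick the more energy-efficient link type for a given distance.
    return "wireless" if distance_mm > LONG_RANGE_MM else "wireline"

print(preferred_link(3.0))  # wireline
print(preferred_link(9.0))  # wireless
```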
Like before, we analyze the EDP as the relevant metric. Fig.
16 shows the EDP for all configurations with respect to NVFI
Mesh. Again, we can see that for all benchmarks considered, the WiNoC configurations are able to save EDP over their Mesh counterparts. This is due to the same reasoning discussed in the
previous section. We also see that the DVFI configurations
outperform the other systems running the same network
configuration due to DVFI’s ability to reduce the V/F levels in
the system with little performance impact.
Fig. 17 demonstrates this capability for a snapshot of a VFI running FFT: DVFI is able to reduce its V/F to save more energy (e.g., time windows 93-106 and 113-130) and increase its V/F to reduce the execution time penalty (e.g., time
windows 23-87) when compared to the SVFI configuration.

Fig. 16. The EDP of the NVFI, SVFI, and DVFI configurations using Mesh and WiNoC with respect to NVFI Mesh for all benchmarks considered.

Fig. 17. Core average utilization and V/F calculated for SVFI and DVFI during a snapshot of the FFT benchmark.
The EDP improvement comes from three components: a better
NoC architecture, heterogeneous VFI clustering and
dynamically tuned VFI. The role of each can be analyzed from
Fig. 16. As an example, for the FFT benchmark, the better NoC
architecture improves the EDP by 14.7% (NVFI Mesh vs. NVFI
WiNoC), the VFI clustering improves the EDP 5.3% further
(NVFI WiNoC vs. SVFI WiNoC) and the dynamic V/F tuning
improves the EDP 7.9% further (SVFI WiNoC vs. DVFI
WiNoC) with respect to NVFI Mesh. By co-designing all three
methodologies we are able to save a total of 27.9% EDP for the
FFT benchmark. For the other benchmarks, we also see a
similar trend.
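The arithmetic behind the FFT breakdown can be checked directly, since each increment is expressed with respect to the NVFI Mesh baseline:

```python
# Arithmetic behind the FFT EDP breakdown: each step's savings are
# expressed with respect to the NVFI Mesh baseline, so the three
# increments add up to the total saving.
baseline = 1.0
nvfi_winoc = baseline - 0.147    # better NoC architecture: 14.7%
svfi_winoc = nvfi_winoc - 0.053  # heterogeneous VFI clustering: +5.3%
dvfi_winoc = svfi_winoc - 0.079  # dynamic V/F tuning: +7.9%
total_saving = (baseline - dvfi_winoc) / baseline
print(f"{total_saving:.1%}")     # 27.9%
```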
Fig. 18 shows the average V/F level for SVFI-HOM, SVFI-HET and DVFI Mesh across the system and application runtime. Here we discuss DVFI Mesh as an example; DVFI
WiNoC will exhibit a similar trend and show the same benefit
with respect to SVFI-HOM and SVFI-HET. As can be seen, DVFI Mesh lowers the average V/F for all benchmarks considered, allowing for significant EDP reduction. Our proposed DVFI WiNoC saves up to 46.6% EDP (VIPS) and an average of 17.4% EDP over the state-of-the-art static VFI system [9]. It should also be noted that the proposed DVFI design saves
up to 60.5% EDP (VIPS) and an average of 38.9% EDP with
respect to NVFI Mesh for all benchmarks considered.
V. CONCLUSION AND FUTURE WORK
In this paper, we have demonstrated that by incorporating WiNoCs, VFIs, and dynamic V/F tuning in a synergistic manner, we are able to create an energy-efficient design for multicore chips without noticeable performance loss. We
have shown that for all benchmarks considered, with the
exception of only two (DEDUP and VIPS), there is no
performance degradation for DVFI WiNoC with respect to
Fig. 18. Average V/F level for three VFI configurations: SVFI-HOM, SVFI-HET and DVFI Mesh.
NVFI Mesh. Along with this low impact on execution time, we
are able to save significant full-system energy-delay product
(EDP) over traditional non-VFI Mesh. As such, we have
demonstrated the importance of an integrated design approach
involving VFI, dynamic V/F tuning and wireless NoC to
achieve energy efficiency for multicore chips.
In this work we have mainly demonstrated an integrated
design approach for multicore chips. It is only natural that
future work would include further advancements of each
component of the design process including investigations to
increase the synergy between each component.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, R. Varada, M. Ratta
and S. Vora. , “A 45nm 8-core enterprise xeon processor,” in Proc. of ASSCC, 2009, pp. 9-12.
B. Stackhouse, S. Bhimji, C. Bostak, D. Bradley, B. Cherkauer, J. Desai,
E. Francom, M. Gowan, P. Gronowski, D. Krueger, C. Morganti and S.
Troyer, “A 65 nm 2-billion transistor quad-core itanium processor,” IEEE
J. Solid-States Circuits, vol. 44, no. 1, 2009, pp. 18-31.
J. Friedrich, B. McCredie, N. James, B. Huott, B. Curran, E. Fluhr, G.
Mittal, E. Chan, Y. Chan, D. Plass, S. Chu, H. Le, L. Clark, J. Ripley, S.
Taylor, J. Dilullo and M. Lanzerotti, “Design of the power6
microprocessor,” in Proc. of ISSCC, 2007, pp. 96-97.
H. Mair, A. Wang, G. Gammie, D. Scott, P. Royannez, S. Gururajarao,
M. Chau, R. Lagerquist, L. Ho, M. Basude, N. Culp, A. Sadate, D. Wilson,
F. Dahan, J. Song, B. Carlson and U. Ko, “A 65-nm mobile multimedia
applications processor with an adaptive power management scheme to
compensate for variations,” in Proc. of VLSIC, 2007, pp. 224-225.
N. Kapadia and S. Pasricha. “A framework for low power synthesis of
interconnection networks-on-chip with multiple voltage islands.”
Integration, the VLSI Journal, vol. 45, issue 3, June 2012.
S. Deb, K. Chang, X. Yu, S.P. Sah, M. Cosic, A. Ganguly, P.P. Pande, B.
Belzer and D. Heo, "Design of an Energy-Efficient CMOS-Compatible
NoC Architecture with Millimeter-Wave Wireless Interconnects,"
Computers, IEEE Transactions on, vol.62, no.12, pp.2382,2396, Dec.
2013
R. Marculescu, U. Ogras, L.-S. Peh, N.E. Jerger and Y. Hoskote,
“Outstanding research problems in NoC design: system,
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) <
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
microarchitecture, and circuit perspectives, “IEEE Trans. on CAD, vol.
28, Jan. 2009, pp. 3-21.
R. Kim, G. Liu, P. Wettin, R. Marculescu, D. Marculescu, and P.P. Pande,
"Energy-efficient VFI-partitioned multicore design using wireless NoC
architectures," Compilers, Architecture and Synthesis for Embedded
Systems (CASES), 2014 International Conference on, vol., no., pp.1,9.
R.G. Kim, W. Choi, G. Liu, E. Mohandesi, P.P. Pande, D. Marculescu
and R. Marculescu, "Wireless NoC for VFI-Enabled Multicore Chip
Design: Performance Evaluation and Design Trade-offs," Computers,
IEEE Transactions on, (in press)
U.Y. Ogras and R. Marculescu, “It’s a small world after all: NoC
performance optimization via long-range link insertion,” IEEE Trans.
Very Large Scale Integr. Syst., vol. 14, no. 7, 2006, pp. 693-706.
S. Garg, D. Marculescu, and R. Marculescu, “Custom feedback control:
enabling truly scalable on-chip power management for MPSoCs,” in Proc.
of ISLPED, 2010.
J. Murray, R. Kim, P. Wettin, P. P. Pande, and B. Shirazi. 2014.
“Performance Evaluation of Congestion-Aware Routing with DVFS on a
Millimeter-Wave Small-World Wireless NoC,” J. Emerg. Technol.
Comput. Syst. 11, 2, Article 17, November 2014.
A. Bartolini, M. Cacciari, A. Tilli, L. Benini, "Thermal and Energy
Management of High-Performance Multicores: Distributed and SelfCalibrating Model-Predictive Controller," Parallel and Distributed
Systems, IEEE Transactions on, vol.24, no.1, pp.170,183, Jan. 2013
S. Murali, M. Coenen, A. Radulescu, K. Goossens, and G. De Micheli,
"A Methodology for Mapping Multiple Use-Cases onto Networks on
Chips," in Design, Automation and Test in Europe, 2006. DATE '06.
Proceedings, vol.1, no., pp.1-6, 6-10 March 2006
U.Y. Ogras, R. Marculescu, D. Marculescu, "Variation-adaptive feedback
control for networks-on-chip with multiple clock domains," Design
Automation Conference (DAC) 2008. 45th ACM/IEEE, vol., no.,
pp.614,619
S. Garg, D. Marculescu, R. Marculescu, and U. Ogras, “Technologydriven Limits on DVFS Controllability of Multiple Voltage-Frequency
Island Designs” in Proc. of IEEE/ACM Design Automation Conference
(DAC), July 2009.
P. Choudhary, D. Marculescu, “Power Management of
Voltage/Frequency Island-Based Systems Using Hardware Based
Methods,” in IEEE Trans. on VLSI Systems, vol.17, no.3, pp. 427-438,
March 2009.
P. Choudhary, D. Marculescu, “Hardware based Frequency/Voltage
Control of Voltage Frequency Island Systems,” in Proc. IEEE/ACM Intl.
Conference on Hardware-Software Codesign and System Synthesis
(CODES-ISSS), Seoul, South Korea, Oct. 2006.
B.C. Lee, D.M. Brooks, B.R. de Supinski, M. Schulz, K. Singh, S.A.
McKee, "Methods of inference and learning for performance modeling of
parallel applications," In Proceedings of the 12th ACM SIGPLAN
symposium on Principles and practice of parallel programming (PPoP),
2007.
[20] Y. Tan, W. Liu, and Q. Qiu, "Adaptive power management using reinforcement learning," in Proc. of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2009.
[21] Z. Chen and D. Marculescu, "Distributed reinforcement learning for power limited many-core system performance optimization," in Proc. of the IEEE/ACM Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015.
[22] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[23] D.-C. Juan, S. Garg, J. Park, and D. Marculescu, "Learning the optimal operating point for many-core systems with extended range voltage/frequency scaling," in Proc. of Intl. Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2013.
[24] U. Y. Ogras, R. Marculescu, D. Marculescu, and E. G. Jung, "Design and Management of Voltage-Frequency Island Partitioned Networks-on-Chip," IEEE Trans. on VLSI Systems, vol. 17, no. 3, pp. 330-341, March 2009.
[25] D. J. Watts and S. H. Strogatz, "Collective Dynamics of 'Small-World' Networks," Nature, vol. 393, pp. 440-442, 1998.
[26] J. Murray, R. Kim, P. Wettin, P. P. Pande, and B. Shirazi, "Performance Evaluation of Congestion-Aware Routing with DVFS on a Millimeter-Wave Small-World Wireless NoC," J. Emerg. Technol. Comput. Syst., vol. 11, no. 2, Article 17, Nov. 2014.
[27] T. Petermann and P. De Los Rios, "Spatial small-world networks: a wiring cost perspective," arXiv:cond-mat/0501420v2.
[28] B. A. Floyd, C.-M. Hung, and K. K. O, "Intra-chip wireless interconnect for clock distribution implemented with integrated antennas, receivers, and transmitters," IEEE J. Solid-State Circuits, vol. 37, no. 5, pp. 543-552, May 2002.
[29] P. Wettin, R. Kim, J. Murray, X. Yu, P. P. Pande, A. Ganguly, and D. Heo, "Design Space Exploration for Wireless NoCs Incorporating Irregular Network Routing," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 33, no. 11, pp. 1732-1745, Nov. 2014.
[30] O. Lysne, T. Skeie, S.-A. Reinemo, and I. Theiss, "Layered routing in irregular networks," IEEE Trans. on Parallel and Distributed Systems, vol. 17, no. 1, pp. 51-65, 2006.
[31] N. Binkert, B. Beckmann, G. Black, S.K. Reinhardt, A. Saidi, A. Basu, J.
Hestness, D.R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M.
Shoaib, N. Vaish, M.D. Hill and D.A. Wood, “The GEM5 Simulator,”
ACM SIGARCH Computer Architecture News, 39(2), 2011, pp. 1-7.
[32] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proc. of ISCA, 1995, pp. 24-36.
[33] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. Dissertation,
Princeton Univ., Princeton NJ, Jan. 2011.
[34] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, and N.P.
Jouppi, “McPAT: an integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in Proc. of
MICRO, 2009, pp. 469-480.
[35] S. Beer, R. Ginosar, M. Priel, R. Dobkin, and A. Kolodny, "The Devolution of Synchronizers," in Proc. of IEEE Symposium on Asynchronous Circuits and Systems (ASYNC), pp. 94-103, May 2010.
[36] D. E. Duarte, N. Vijaykrishnan, and M. J. Irwin, "A clock power model to evaluate impact of architectural and technology optimizations," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 6, Dec. 2002.
Ryan Gary Kim is a fourth year PhD candidate in the
Electrical Engineering and Computer Science
Department, Washington State University, Pullman,
USA. His research interests include low-power
wireless NoC design through power management
techniques.
Wonje Choi received the B.S. degree in Computer
Engineering from Washington State University,
Pullman, WA, USA in 2013, where he is currently
working towards the PhD degree.
Zhuo Chen received the B.S. degree in Electronics Engineering and Computer Science from Peking University, Beijing, China in 2013. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA. His research interests are in the area of energy-aware computing. In particular, his research focuses on multi-core heterogeneous/homogeneous system optimization and low-power application-specific system design.
Partha Pratim Pande is a Professor and holder of the Boeing Centennial Chair in computer engineering at the School of Electrical Engineering and Computer Science, Washington State University, Pullman, USA. His current research interests are novel interconnect architectures for multicore chips, on-chip wireless communication networks, and hardware accelerators for biocomputing.
Diana Marculescu is a Professor of Electrical and
Computer Engineering at Carnegie Mellon University.
She has won several best paper awards in top
conferences and journals. Her research interests include
energy-, reliability-, and variability-aware computing
and CAD for non-silicon applications. She is an IEEE Fellow.
Radu Marculescu is a professor in the Department of
Electrical and Computer Engineering at Carnegie
Mellon University. He has received several Best Paper
Awards in top conferences and journals covering
design automation of integrated systems and embedded
systems. His current research focuses on modeling and
optimization of embedded and cyber-physical systems.
He is an IEEE Fellow.