Wireless NoC and Dynamic VFI Co-Design: Energy Efficiency without Performance Penalty

Ryan Gary Kim, Student Member, IEEE, Wonje Choi, Student Member, IEEE, Zhuo Chen, Student Member, IEEE, Partha Pratim Pande, Senior Member, IEEE, Diana Marculescu, Fellow, IEEE, and Radu Marculescu, Fellow, IEEE

Abstract—Multiple Voltage Frequency Island (VFI)-based designs can reduce the energy dissipation in multicore platforms by taking advantage of the varying nature of the application workloads. Indeed, the voltage/frequency (V/F) levels of the VFIs can be tailored dynamically by considering the workload-driven variations in the application. Traditionally, mesh-based Networks-on-Chip (NoCs) have been used in VFI-based systems; however, they have large latency and energy overheads due to the inherently long multi-hop paths. Consequently, in this paper, we explore the emerging paradigm of wireless Network-on-Chip (WiNoC) and demonstrate that by incorporating WiNoC, VFI, and dynamic V/F tuning in a synergistic manner, we can design energy-efficient multicore platforms without introducing a noticeable performance penalty. Our experimental results show that for the benchmarks considered, the proposed approach can achieve between 5.7% and 46.6% energy-delay product (EDP) savings over the state-of-the-art system and between 26.8% and 60.5% EDP savings over a standard baseline non-VFI mesh-based system. This opens up a new class of co-design approaches that can make WiNoCs the communication technology of choice for future multicore platforms.

Index Terms—Dynamic Voltage and Frequency Scaling, Network-on-Chip, Voltage Frequency Islands

I. INTRODUCTION

In recent years, multiple Voltage Frequency Island (VFI) designs have increasingly made their way into commercial and research multicore platforms.
This is because, for VFI-based multicore systems, it is possible to implement efficient power and thermal management by dynamically fine-tuning the voltage and frequency (V/F) of each island under given performance constraints. Moreover, dynamically tuned VFIs (DVFI) reduce the area overhead associated with fully distributed per-core dynamic voltage and frequency scaling (DVFS). Hence, a hierarchy of globally distributed (inter-VFI) and locally centralized (intra-VFI) control mechanisms can provide the best trade-off between power and resource management. However, DVFI requires that time-varying core and traffic statistics be sent to a decision-making controller. Due to the nature of a VFI-based system, we employ a distributed control mechanism to reduce the global communication overhead. To reduce the time overhead associated with the decision-making process (V/F tuning) and the intra- and inter-VFI data exchanges, we need an efficient communication backbone. Most existing VFI-partitioned designs use the conventional multi-hop, mesh-based NoC architecture. However, for large-scale systems, the inter-VFI data exchanges through traditional mesh NoCs introduce unnecessary latency and energy overheads. Therefore, in this work, we propose a new approach to designing a small-world wireless NoC, which leads to a WiNoC-enabled, DVFI-based multicore system that can achieve significant energy savings without paying a noticeable performance penalty.

This work is supported in part by the US National Science Foundation (NSF) grants CCF-0845504, CNS-1059289, CNS-1128624, CCF-1162202 and CCF-1514206, as well as Army Research Office grant W911NF-12-1-0373. Ryan Gary Kim, Wonje Choi and Partha Pratim Pande are with Washington State University, Pullman, WA, 99164. E-mails: {rkim, wchoi, pande}@eecs.wsu.edu
At the very heart of this communication infrastructure lies the small-world effect induced by the wireless links, which enables the efficient exchange of information among the various cores. The main contributions of this work are as follows: First, we propose a new VFI clustering methodology that utilizes machine learning to i) allow for non-uniform VFI clusters and ii) take into account the temporal variations of application workloads to support and enable DVFI. This method generalizes previous VFI clustering approaches that use average core-level statistics. Second, we design and implement a lightweight VFI controller that determines suitable V/F values for each VFI at runtime. Next, we design the WiNoC with knowledge of the VFI structure in order to optimize intra-VFI, inter-VFI, and core-to-dynamic-V/F-controller communication. Finally, we demonstrate how the co-design of these three paradigms (VFI, WiNoC, and dynamic V/F tuning) can significantly reduce the Energy-Delay Product (EDP) without performance penalty for commonly used CMP benchmarks.

II. RELATED WORK AND NEW CONTRIBUTIONS

Multiple VFI designs have become commonplace for both embedded and high-performance multicore platforms, where optimizing energy dissipation while minimizing performance degradation and area overhead is a must [1-4]. A framework for the synthesis of VFIs has been proposed in [5], where the system is partitioned into VFIs based on the maximum number of VFIs allowed and the bandwidth requirement of each task.

Zhuo Chen, Diana Marculescu and Radu Marculescu are with Carnegie Mellon University, Pittsburgh, PA, 15213. E-mails: {tonychen, dianam, radum}@cmu.edu

Fig. 1. Illustration of the proposed VFI-partitioned multi-core co-design methodology. During the VFI clustering phase, each application is profiled in order to obtain key time-varying core and network statistics.
These statistics are used to find the optimal clustering, within particular size constraints, for each application. During system design, these clusters are then mapped to physical cores. The NoC and the VFI Controller (VFI CTRL), the dedicated hardware block that dynamically tunes the V/F of each VFI, are then placed to suit the traffic and cluster characteristics of each application. During runtime, the VFI CTRL obtains the core and traffic data in order to determine the optimal V/F levels for the VFI.

The limitations and design challenges associated with existing NoC architectures are discussed in [6] and [7]. WiNoCs are seen as a new enabling architecture for achieving high performance and energy efficiency in multicore platforms. A comprehensive survey of various WiNoC architectures and their design principles is presented in [6]. A WiNoC architecture suitable for VFI-based systems has been presented in [8] and [9]. However, [8] and [9] consider only static V/F allocation and a clustering algorithm that works under equally-sized cluster constraints, and hence may be suboptimal. Although small-world effects were initially exploited to improve multicore performance [10], it has since been demonstrated that small-worldness can also benefit power management via control-theoretic approaches [11]. More recently, it has been shown that small-world WiNoCs can help improve the temperature profile of the NoC switches and links compared to a traditional mesh in the presence of DVFS [12]. However, prior work [12] considers distributed DVFS, where the V/F pair of each NoC element is fine-tuned according to the traffic distribution. Also, the authors of [13] present a Model-Predictive Controller (MPC) that implements DVFS on each core in order to optimize energy dissipation within given thermal constraints. In [14], the authors create specific NoC architectures that are tailored to the use-cases of a System-on-Chip (SoC).
That work also performs preliminary DVFS investigations to tailor the operating voltages of the NoC to varying workload requirements. However, with VFI-based designs, the area overhead of implementing per-core DVFS can be reduced. Dynamic V/F control in a VFI has been demonstrated in [15], [16], but the focus there is on the NoC and on controlling inter-VFI queue occupancy, as opposed to full-system performance. Hardware-based control was first demonstrated in the context of application-specific systems [17], [18], but without considering the impact of a NoC-based communication paradigm. Furthermore, with a few recent exceptions [19], [20], [21], most approaches rely on heuristics and do not employ machine learning-based techniques for VFI clustering, system design, and runtime management of power and performance as described in this paper.

Consequently, in this work, we improve the state-of-the-art by proposing a new co-design methodology that exploits machine learning-based VFI partitioning, DVFI, and the emerging WiNoC paradigm in order to reduce the energy dissipation of a multicore chip without increasing the execution time compared to traditional mesh-based architectures. Our new VFI clustering methodology creates and exploits non-uniform VFI clusters with time-varying computation and communication statistics to better accommodate both WiNoC and DVFI. The WiNoC is designed with knowledge of the DVFI in order to provide efficient VFI communication, reduce the utilization of inter-VFI links, and enable quicker core-to-DVFI-controller communication. Lastly, DVFI is utilized to lower the energy profile by reducing the frequency while maintaining the performance of the cores and the network.

To illustrate the general design flow of our proposed VFI-partitioned, dynamic V/F-enabled multicore platform, Fig. 1 outlines the key processes at each stage of the design flow. As shown, we divide the design process into three stages, i.e.,
VFI Clustering, System Design, and Runtime, processed as a single sequential design flow. During VFI Clustering, we obtain benchmark-specific core and network data in order to create an optimal clustering for each application. In this stage, we also impose a minimum cluster size requirement to ensure that we fully utilize the VFI paradigm. After VFI Clustering, we enter the System Design stage, which includes the creation of the WiNoC, the placement of threads, and the placement of the VFI controller (VFI CTRL). Lastly, we design the VFI CTRL to take advantage of the application's workload variation during the Runtime stage. In the following sections, we discuss each stage of Fig. 1 in detail: VFI Clustering (Section III.A), System Design (Section III.B), and Runtime (Section III.C). We also demonstrate how the information from each stage can be utilized to improve the overall design.

TABLE I
CLUSTERING TERMINOLOGY
Variable          Definition
util              utilization (util) data
comm              communication (comm) data
num_util          number of util-based clusters
num_comm          number of comm-based clusters
min_util          min. number of cores in util-based clusters
min_comm          min. number of cores in comm-based clusters
clustering_util   clustering after the util-based method
clustering        final clustering after the comm-based method
|clst|            size of cluster clst

Algorithm 1 Pseudocode of Util-based clustering
1: Input: util, num_util, min_util
2: Output: clustering_util
3: clustering_util ← K-means(util, num_util, min_util) [Eq. 5]
4: for each cluster clst in clustering_util do
5:   while |clst| < min_util do
6:     μ_clst ← centroid(clst) [Eq. 4]
7:     closest ← argmin_{x_i} D(x_i, μ_clst) [Eq. 3], where x_i ∉ clst and x_i has not yet moved
8:     Move the closest core to clst
9:   end while
10: end for
11: return clustering_util

For the distance measure D(x_i, μ_j), where j is the cluster number, we use the squared L2-norm:

D(x_i, μ_j) = ‖x_i − μ_j‖₂²   (3)

III.
VFI ARCHITECTURE

In this section, we describe how we design the entire system to support both VFIs and dynamic V/F tuning. First, we discuss the VFI creation methodology, which considers the cores' time-varying busy utilization and inter-core traffic characteristics. Then, we describe how we implement the dynamic V/F tuning for each VFI. Next, we elaborate on the design of the WiNoC architecture for the VFI-partitioned system. Finally, we summarize how the three paradigms can be combined in a co-design methodology in which they take advantage of each other.

A. VFI Clustering

The proposed VFI clustering approach relies on the principle of grouping together cores with similar behavior so that they benefit from coordinated V/F tuning. For example, cores with low utilization should be clustered together and tuned to a low V/F level, while cores with high utilization should have their V/F levels boosted together. To this end, we use the time traces of the instructions per cycle (IPC) values of each core and the traffic statistics to capture the utilization and communication behavior of the cores, respectively. We propose to use unsupervised machine learning techniques to cluster the cores with similar behavioral patterns.

1) K-means Clustering: K-means is a well-known machine learning algorithm that clusters data into groups without training a parameterized model [22]. Assuming we have N data points and J clusters, K-means tries to minimize the distortion measure:

Ψ = Σ_{i=1}^{N} Σ_{j=1}^{J} δ_ij · D(x_i, μ_j)   (1)

where δ_ij is an indicator function that is one if and only if point x_i belongs to cluster j, and zero otherwise. D(x_i, μ_j) measures the distance between the point x_i and the cluster center μ_j.
A cluster's center (centroid) is defined as the mean of all points in that cluster:

μ_j = (Σ_{i=1}^{N} δ_ij · x_i) / (Σ_{i=1}^{N} δ_ij)   (2)

The distortion measure is the sum of the intra-cluster distances; therefore, minimizing it is equivalent to minimizing the intra-group distortion and maximizing the inter-group distance. K-means does this by iterating over two steps: (1) for each group j, assume μ_j is fixed and label the closest points as belonging to group j; (2) recalculate the centers of the groups, i.e., the μ_j's, while the clustering of the points, i.e., the δ_ij values, is fixed. This algorithm is guaranteed to converge, since each step decreases the Ψ value [22].

In our method, we model each time trace as a multi-dimensional point and then cluster the points using the above-mentioned K-means algorithm. Therefore, cores with similar time-dependent behaviors will be allocated to the same group, while cores with very different patterns will be clustered into different groups. In the following sections, we first illustrate two clustering methods: one based on core utilization and another based on inter-core communication. Subsequently, we combine these metrics and introduce our hybrid clustering method.

2) Utilization-based Clustering: The idea behind utilization-based clustering is to group together the cores with similar utilization patterns such that all cores in the same group can benefit from dynamic V/F techniques. Traditionally, average IPC values have been used to cluster the cores when sampled time traces are unavailable [9]. However, if time traces have opposite program phases (e.g., for cores that spawn threads versus cores that execute them), grouping them together is actually undesirable for dynamic V/F techniques. To fully exploit the time-dependent information in the IPC values, we propose to model each trace as a point in a multi-dimensional space and cluster the points based on their squared L2-norm.
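As a concrete illustration of the two-step K-means iteration described above, the following is a minimal pure-Python sketch (ours, not the authors' implementation), where each element of `X` is one core's IPC time trace and `dist2`/`centroid` follow Eqs. (3) and (2):

```python
import random

def dist2(x, mu):
    # Squared L2-norm between two time traces (Eq. 3).
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def centroid(points):
    # Mean of all traces in a cluster (Eq. 2).
    n = len(points)
    return [sum(p[t] for p in points) / n for t in range(len(points[0]))]

def kmeans(X, J, iters=50, seed=0):
    # X: list of tau-dimensional points (one time trace per core).
    rng = random.Random(seed)
    mu = rng.sample(X, J)          # initial centers drawn from the data
    labels = [0] * len(X)
    for _ in range(iters):
        # Step 1: with centers fixed, assign each trace to its nearest center.
        labels = [min(range(J), key=lambda j: dist2(x, mu[j])) for x in X]
        # Step 2: with assignments fixed, recompute each center.
        for j in range(J):
            members = [x for x, l in zip(X, labels) if l == j]
            if members:
                mu[j] = centroid(members)
    return labels, mu
```

Since each step can only decrease Ψ, the loop converges; a fixed iteration budget is used here for simplicity.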
If the sampling period is φ and the total execution time of the benchmark is Φ, then the time trace of each core consists of τ = Φ/φ sample points. Suppose we have N cores; then we have N τ-dimensional points u_i = (u_i(1), u_i(2), …, u_i(τ)), i = 1, …, N, i.e., the IPC time traces of the cores, and we can cluster these points (time traces) using the K-means algorithm. The center of each cluster j, μ_j^u, is hence also a τ-dimensional point:

μ_j^u = (μ_j^u(1), μ_j^u(2), …, μ_j^u(τ)) = (Σ_{i=1}^{N} δ_ij · u_i) / (Σ_{i=1}^{N} δ_ij)   (4)

The distortion measure of utilization-based clustering is:

Ψ^u = Σ_{i=1}^{N} Σ_{j=1}^{J} Σ_{t=1}^{τ} δ_ij · ‖u_i(t) − μ_j^u(t)‖₂²   (5)

By minimizing Ψ^u, we find the groups of cores with the most similar time behavior of their IPC values. The pseudocode of utilization-based clustering is shown in Algorithm 1. For clarity, the meaning of each variable used in the clustering algorithms is given in Table I. Line 3 of Algorithm 1 uses K-means to obtain the clusters based on the utilization values. However, K-means may generate very unbalanced clusters, since the algorithm only considers the similarity among the points and is agnostic of other architectural constraints. Therefore, for implementation cost-efficiency, we place a constraint on the minimum number of cores in each cluster. If a cluster does not meet this constraint after K-means clustering, we evaluate the points outside the cluster to find the point closest to the cluster (distance measure: Eq. (3)). This point is moved into the cluster, and the process is repeated until the minimum-core constraint is satisfied (Lines 5-9 of Algorithm 1). For example, if we require at least four cores in each cluster and cluster 1 has only three cores after clustering, we look for the closest point x_k outside of cluster 1, i.e., the one that minimizes D(x_k, μ_1^u), and move it into cluster 1.
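The repair loop of Lines 5-9 of Algorithm 1 can be sketched as follows. `enforce_min_size` is a hypothetical name, the labels are assumed to come from a K-means pass that leaves every cluster non-empty, and each core is moved at most once, as in the pseudocode:

```python
def dist2(x, mu):
    # Squared L2-norm (Eq. 3).
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def centroid(points):
    # Mean of all traces in a cluster (Eq. 4).
    n = len(points)
    return [sum(p[t] for p in points) / n for t in range(len(points[0]))]

def enforce_min_size(X, labels, min_size):
    # Grow any undersized cluster by repeatedly pulling in the outside
    # trace closest to the cluster centroid (Algorithm 1, Lines 5-9).
    moved = set()
    J = max(labels) + 1
    for j in range(J):
        while sum(1 for l in labels if l == j) < min_size:
            mu = centroid([x for x, l in zip(X, labels) if l == j])
            candidates = [i for i, l in enumerate(labels)
                          if l != j and i not in moved]
            closest = min(candidates, key=lambda i: dist2(X[i], mu))
            labels[closest] = j      # move the core into cluster j
            moved.add(closest)       # each core moves at most once
    return labels
```

For instance, with one-dimensional traces [0.0], [0.1], [0.45], [0.9], [1.0], [1.1], initial labels [0, 0, 1, 1, 1, 1], and min_size = 3, the trace at 0.45 is the closest outsider to cluster 0's centroid and is pulled in.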
3) Communication-based Clustering: Following the same idea, we model each communication traffic trace as a multi-dimensional point: (f(1), f(2), …, f(τ)). The pseudocode of communication-based clustering is shown in Algorithm 2. However, instead of associating one multi-dimensional point with each core as in the previous section, we associate one point with each pair of cores, since communication traffic is defined between exactly two cores. As a result, the traffic trace between core k and core l defines a τ-dimensional point (f_kl(1), f_kl(2), …, f_kl(τ)), where f_kl(t) is the traffic volume between core k and core l during window t, and the corresponding distortion measure is:

Ψ^f = Σ_{k=1}^{N} Σ_{l=1}^{N} Σ_{j=1}^{J} Σ_{t=1}^{τ} δ_kj · δ_lj · ‖f_kl(t) − μ_j^f(t)‖₂²   (6)

where μ_j^f is the center of cluster j, defined as:

μ_j^f = (μ_j^f(1), μ_j^f(2), …, μ_j^f(τ)) = (Σ_{k=1}^{N} Σ_{l=1}^{N} δ_kj · δ_lj · f_kl) / (Σ_{k=1}^{N} Σ_{l=1}^{N} δ_kj · δ_lj)   (7)

Line 6 of Algorithm 2 uses K-means to minimize the Ψ^f value. This approach not only minimizes the intra-cluster traffic pattern differences, but also attempts to achieve a balance between performance-driven and energy-driven partitioning. In classic performance-driven, communication-based clustering, one obtains low inter-cluster traffic magnitude and low intra-cluster pattern similarity [9]; this clearly benefits applications in which on-chip communication is a performance bottleneck. In the case of energy-driven clustering, one can obtain medium inter-cluster traffic magnitude but similar intra-cluster traffic patterns. For example, if all the cores in cluster 1 have similar traffic patterns, then we can expect the intra-cluster traffic to be generated proportionally slower or faster for all cores in the cluster, depending on the V/F levels used. The K-means-based approach tries to find the best trade-off between these cases, i.e., to maximize the intra-cluster similarity while keeping the inter-cluster traffic magnitude low.

As in the case of utilization-based clustering, it is also possible to obtain very unbalanced clusters. We use the same strategy as before (Lines 8-12) to move the closest points, and their associated cores, into the clusters that cannot satisfy the constraints. Similarly, if a cluster contains so many cores that the remaining clusters would be unable to satisfy the constraints, we remove the farthest cores from the cluster (Lines 13-17).

Algorithm 2 Pseudocode of Comm-based clustering
1: Input: clustering_util, comm, num_comm, min_comm
2: Output: clustering
3: for each cluster clst in clustering_util do
4:   for cluster_num = num_comm : 1 do
5:     comm_partial = communication matrix of the cores in clst
6:     clustering_temp ← K-means(comm_partial, cluster_num, min_comm) [Eq. 6]
7:     clstMax = the cluster with the largest centroid value in clustering_temp
8:     while |clstMax| < min_comm do
9:       μ_clstMax ← centroid(clstMax) [Eq. 7]
10:      closest ← argmin_{x_i} D(x_i, μ_clstMax) [Eq. 3], where x_i ∉ clstMax and x_i has not yet moved
11:      Move the closest core to clstMax
12:    end while
13:    while |clst − clstMax| < (cluster_num − 1) · min_comm do
14:      μ_clstMax ← centroid(clstMax) [Eq. 7]
15:      farthest ← argmax_{x_i} D(x_i, μ_clstMax) [Eq. 3], where x_i ∈ clstMax and x_i has not yet moved
16:      Remove the farthest core from clstMax
17:    end while
18:    clustering.append(clstMax)
19:    clst ← clst − clstMax
20:  end for
21: end for
22: return clustering

Algorithm 3 Pseudocode of Hybrid clustering
1: Input: util, comm, num_util, num_comm, min_util, min_comm
2: Output: clustering
3: clustering_util ← cluster_util(util, num_util, min_util)
4: clustering ← cluster_comm(clustering_util, comm, num_comm, min_comm)
5: return clustering
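The two-level structure of Algorithm 3 (utilization clustering first, then communication clustering within each utilization group) can be sketched as below. Note that `split_by_util` and `split_by_comm` are simplified rank-and-cut stand-ins for the constrained K-means of Algorithms 1 and 2; only the pipeline composition is faithful to the paper:

```python
def split_by_util(cores, util, k):
    # Stand-in for Algorithm 1: rank cores by mean IPC and cut the
    # ranking into k equal-size groups (the paper instead runs
    # constrained K-means on the full IPC time traces).
    ranked = sorted(cores, key=lambda c: sum(util[c]) / len(util[c]))
    size = len(cores) // k
    return [ranked[i * size:(i + 1) * size] for i in range(k)]

def split_by_comm(group, comm, k):
    # Stand-in for Algorithm 2: rank cores by total intra-group traffic
    # and cut the ranking into k sub-clusters.
    ranked = sorted(group, key=lambda c: sum(comm[c][d] for d in group))
    size = len(group) // k
    return [ranked[i * size:(i + 1) * size] for i in range(k)]

def hybrid_clustering(cores, util, comm, num_util, num_comm):
    # Algorithm 3: utilization-based clustering first, then
    # communication-based clustering inside each utilization group,
    # yielding num_util * num_comm VFIs.
    vfis = []
    for group in split_by_util(cores, util, num_util):
        vfis.extend(split_by_comm(group, comm, num_comm))
    return vfis
```

With num_util = 2 and num_comm = 2, this yields the four VFIs used in the paper's evaluation.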
4) Hybrid Clustering: To combine the advantages of both utilization-based and communication-based clustering, we propose a hybrid clustering method, shown in Algorithm 3. In hybrid clustering, we cannot simply combine Ψ^u and Ψ^f in a weighted cost function (e.g., α·Ψ^u + (1 − α)·Ψ^f, with 0 ≤ α ≤ 1), because their state-space dimensions are different (Ψ^u: N×J×τ; Ψ^f: N×N×J×τ). Therefore, we propose a hierarchical method: we first use the utilization-based method to partition the cores into num_util groups. Line 3 invokes Algorithm 1 for utilization-based clustering. In each group, the cores have similar utilizations and hence can benefit from the same V/F-level tuning. Then, we deploy communication-based clustering to create num_comm clusters in each of the num_util groups generated in the first step. Line 4 invokes Algorithm 2, using the result of Algorithm 1, clustering_util. Our data show that the resulting Ψ^u remains stable through the communication clustering. Consequently, we achieve similar utilization patterns and communication patterns within each cluster. In our case, we generate four VFIs; hence, we first create two groups with utilization-based clustering and then divide each group into two through communication-based clustering. This approach can, of course, be used for any number of clusters.

5) Static V/F Levels: Given the partitioned VFIs, we determine the static V/F level of each VFI that minimizes the power consumption under a 95% performance constraint. To estimate the power and performance values under different V/F levels, we use the method from [23], which proposed and validated a power and performance model with a Root-Mean-Squared-Percentage-Error (RMSPE) of only 4.37%. Using this model, we optimally solve for the best V/F levels.
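A minimal sketch of the static V/F selection described above, assuming a discrete ladder of nominal-relative V/F levels. Here `perf_at` and `power_at` are caller-supplied stand-ins for the validated power/performance model of [23], which we do not reproduce:

```python
# Candidate (voltage, frequency) scales relative to nominal; this level
# set is an illustrative assumption, not taken from the paper.
LEVELS = [(1.0, 1.0), (0.9, 0.9), (0.8, 0.8),
          (0.7, 0.7), (0.6, 0.6), (0.5, 0.5)]

def pick_static_vf(perf_at, power_at, perf_floor=0.95):
    # Choose the level minimizing predicted power while retaining at
    # least perf_floor of the nominal performance (the paper's 95%
    # performance constraint).
    nominal = perf_at(LEVELS[0])
    feasible = [lv for lv in LEVELS if perf_at(lv) >= perf_floor * nominal]
    return min(feasible, key=power_at)
```

For a hypothetical memory-bound VFI whose performance degrades sub-linearly with frequency, e.g. `perf_at = lambda lv: 0.5 + 0.5 * lv[1]` with `power_at = lambda lv: lv[0] ** 2 * lv[1]`, the selected level is (0.9, 0.9): the lowest point on the ladder that still meets the 95% floor.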
These statically tuned VFIs will be used as a comparison point for our proposed dynamically tuned system.

6) VFI Interface: In this VFI-enabled system, each island can operate at its own voltage and frequency. As such, communication across different VFIs is achieved through mixed-clock/mixed-voltage first-in first-out (FIFO) interfaces. This provides the flexibility to scale the frequency and voltage of the various VFIs in order to minimize the overall energy consumption [24]. We present the latency and energy models used in our simulations in Section IV.A.

B. WiNoC to Support VFIs

In this work, we design the WiNoC drawing inspiration from small-world graphs [25]. Small-world graphs are characterized by many short-distance links between neighboring nodes, as well as a few relatively long-range (direct) shortcuts. Long-range shortcuts implemented through mm-wave wireless links, operating in the 10-100 GHz range, have been shown to improve the energy dissipation profile and latency characteristics of multicore chips [6]. It has also been shown that, by utilizing wireless links, the network load can be significantly reduced with respect to conventional mesh topologies in a very flexible manner [26]. This allows us to implement more aggressive dynamic VFI while maintaining the required network throughput. Hence, in this work, we design the WiNoC architecture to support efficient data exchanges among the various VFI domains. This is done by creating the wireline network, physically arranging the cores, and placing the wireless links using the knowledge of the VFI domains and their traffic characteristics.

1) Wireless NoC Architecture: In the WiNoC, the wireline links are designed using a power-law model [27]. We assume an average number of connections, ⟨k⟩, from each NoC switch to the other switches. The value of ⟨k⟩ is chosen to be four so that the WiNoC does not introduce any additional switch overhead with respect to a conventional mesh.
Also, an upper bound, k_max, is imposed on the number of ports attached to any particular switch so that no switch becomes unrealistically large. This also reduces the skew in the distribution of links among the switches. There is no specific lower bound on the number of ports attached to a switch, but a fully connected network implies that this number must be at least 1. Neither ⟨k⟩ nor k_max includes the local port connecting the NoC switch to its core.

Due to the nature of the VFI clustering, additional constraints need to be applied to the connectivity of the WiNoC. The distribution of links is divided into two steps: VFI intra-cluster connections, which ensure each cluster's connectivity, and VFI inter-cluster connections, which enable communication between the clusters. This ensures that both intra-cluster and inter-cluster communications have sufficient resources and that neither becomes a bottleneck in the overall data exchange. For each switch, ⟨k⟩ is divided into two parts, ⟨k_intra⟩ and ⟨k_inter⟩, the average numbers of intra-cluster and inter-cluster connections to other switches, respectively. For the VFI intra-cluster connections, each cluster is treated separately: a network is created for each cluster such that the connectivity follows the power-law model, the cluster network is fully connected, and the average intra-cluster connectivity is ⟨k_intra⟩. The VFI inter-cluster connections are created such that the connectivity also follows the same power-law model, with an average inter-cluster connectivity of ⟨k_inter⟩. The number of links going from one cluster to another is decided by the inter-VFI traffic: the proportion of links allocated between two clusters matches the fraction of the total inter-cluster traffic exchanged between those two clusters. The two principal wireless interface (WI) components are the antenna and the transceiver.
The on-chip antenna for the WiNoC has to provide the best power gain for the smallest area overhead. A metal zigzag antenna has been demonstrated to possess these characteristics, and hence it is adopted in this work [28]. To ensure high throughput and energy efficiency, the WI transceiver circuitry has to provide a very wide bandwidth as well as low power consumption. A detailed description of the transceiver circuit is beyond the scope of this paper. With a data rate of 16 Gbps, the wireless link dissipates 1.95 pJ/bit, and the total area overhead per wireless transceiver is 0.25 mm² [29].

2) Wireless Link and Core Placement: To facilitate predominantly long-distance communication, we use mm-wave wireless links to communicate among distant cores. These wireless links, along with the small-world wireline architecture, enable quick and efficient inter-core, inter-VFI, and core-to-controller communication, especially for large VFI clusters. It is possible to create three non-overlapping channels with on-chip mm-wave wireless links. Using these three channels, we overlay the wireline small-world connectivity with wireless links such that a few switches get an additional wireless port. Each of these wireless ports has a wireless interface (WI) tuned to one of the three wireless channels. The WI placement is most energy-efficient when the distance between WIs is at least 7.5 mm for the 65 nm technology node [6]. For a 64-core system, the optimal number of WIs is twelve, with four WIs assigned to each wireless channel [29]. In this work, we physically arrange the cores and place the wireless links in order to minimize the traffic-weighted hop count. We determine the physical locations of all the cores running a particular thread in order to minimize the distance between highly communicating cores. Then, the wireline network is created as described in Section III.B.1.
Simulated annealing is then used to find the optimal WI placements that minimize the average traffic-weighted hop count under the WI constraints discussed earlier. More frequently communicating WIs are assigned to the same channel to optimize the overall hop count.

3) Routing and Flow Control: Due to the irregular nature of our WiNoC architecture, routing is done as presented in [8]. Adaptive Layered Shortest Path Routing (ALASH) is used as the routing algorithm [30], allowing messages to be routed along the shortest path between source and destination while maintaining deadlock freedom. A wireless token-passing protocol is used to arbitrate access to the WIs, where the WI holding the token is given access to the wireless channel.

C. Dynamically Tuned VFIs

The application characteristics, i.e., the core utilization and traffic information, tend to vary throughout the runtime of every benchmark. Therefore, static V/F tuning, although simple, tends to be suboptimal. Here, we take advantage of the temporal variations in the application by dynamically tuning the V/F of each VFI. For our VFI-enabled system, we create a dynamic V/F controller for each VFI that determines how to tune the V/F pair every T cycles. Compared to traditional single-core/router DVFS mechanisms, the major difficulty in dynamically tuning the V/F pairs of VFIs lies in two parts: (i) determining a suitable V/F to apply to all cores and routers within the VFI, and (ii) transmitting the core utilization and traffic information from each core in the VFI to a local controller. In order to determine a suitable V/F to apply to all cores and routers within a VFI, we compute a metric that incorporates information from all elements in the VFI.
We start by defining the core utilization of core i, u_i(t), and the link utilization of the link between cores k and l, lu_kl(t):

    u_i(t) = Busy(t, i) / Cycles(t, VFI_j),   i ∈ VFI_j    (8)

    lu_kl(t) = Flits(t, k, l) / Cycles(t, VFI_j),   l ∈ VFI_j    (9)

where VFI_j is the set of cores in VFI cluster j, Busy(t, i) is the number of busy cycles of core i during window t, Cycles(t, VFI_j) is the total number of cycles of VFI j during window t, and Flits(t, k, l) is the number of flits received by core l from core k during window t. Each VFI j has its own independently operating V/F controller that calculates a metric, m(t), from the core and link data gathered during each window t; this metric is used in the V/F determination:

    m(t) = ω_u · ( Σ_{i ∈ VFI_j} u_i(t) ) / |VFI_j| + ω_c · ( Σ_{k ∉ VFI_j, l ∈ VFI_j} lu_kl(t) ) / InterLinks(VFI_j)    (10)

where |VFI_j| is the number of cores in VFI j, InterLinks(VFI_j) is the number of inter-VFI links connected to VFI j, and ω_u and ω_c are the weights of the utilization and communication components, respectively. Intuitively, m(t) is the weighted sum of the VFI's average core utilization and its average incoming inter-VFI link utilization. The weights ω_u and ω_c are set to the relative proportions of core utilization and traffic during the current window t.

Although each V/F controller operates independently, the inter-VFI link utilization inherently carries information from the other VFIs. We can model Flits(t, k, l) as:

    Flits(t, k, l) = min( λ̄_kl(t) · Cycles(t, VFI_n), Cycles(t, VFI_j) )    (11)

where λ̄_kl(t) is the average arrival rate of the link from core k to core l during window t and Cycles(t, VFI_n) is the number of cycles of the sending VFI n. A higher V/F level in the sending VFI n increases Cycles(t, VFI_n) and thus raises the inter-VFI link utilization; consequently, we can infer the V/F levels of connected VFIs from the observed inter-VFI link utilization.

Based on the value of m(t), a threshold mechanism calculates the predicted V/F for the next time window, t + 1, and the V/F of the VFI is adjusted accordingly. We propose the threshold mechanism shown in Eq. (12), where the V/F is set relative to the nominal V/F value:

    V/F = 1.0  if m(t) + ΔM > 0.9
          0.9  if 0.9 ≥ m(t) + ΔM > 0.8
          0.8  if 0.8 ≥ m(t) + ΔM > 0.7
          0.7  if 0.7 ≥ m(t) + ΔM > 0.6
          0.6  if 0.6 ≥ m(t) + ΔM > 0.5
          0.5  if 0.5 ≥ m(t) + ΔM    (12)

where ΔM is a threshold offset that shifts the mapping from the metric to a particular V/F by a fixed value. Fig. 2 illustrates how different values of ΔM affect the mapping from m(t) to V/F: increasing ΔM effectively raises m(t), yielding higher V/F values, while decreasing ΔM effectively lowers m(t), yielding lower V/F values. Therefore, higher values of ΔM typically lead to lower execution times and higher energy dissipation, and vice versa.

Fig. 2. Effects of ΔM on the mapping from m(t) to V/F.

One possible use of ΔM is to statically compensate for tracking error: we can choose the ΔM value that minimizes the mean absolute error (MAE) between the predicted V/F and the metric of the following window, m(t + 1). We refer to the ΔM that minimizes the tracking MAE as ΔMopt. ΔM can also compensate for intra-VFI core utilization variation: since the V/F decided for each VFI is based on average utilization and traffic, a number of cores will suffer performance penalties if there is significant variation within the cluster. Therefore, we should choose a ΔM that also takes into account the average standard deviation of the core utilization within each VFI (ΔMstd) to account for the variability among the cores. In this work, we propose to use ΔMopt+std as the ΔM value, defined as:

    ΔMopt+std = ΔMopt + ΔMstd    (13)

Intuitively, the ΔMopt component compensates for the tracking error incurred by following m(t), while the ΔMstd component augments this value to overcome the shortcomings of using average utilization and traffic values in m(t). Together, ΔMopt+std optimizes energy by fitting closely to the application characteristics (through ΔMopt) while minimizing execution time penalties by accommodating more cores (through ΔMstd). In Section IV, we investigate the use of this ΔM value and show its usefulness for the benchmarks considered in this work.

Due to the simplicity of this controller, each core only needs to send its core utilization and inter-VFI link utilization information to its local V/F controller once per decision-making window. We leverage the overall NoC architecture to communicate this information and analyze its impact on the system in Section IV.C. Fig. 3 presents a high-level block diagram of the V/F controller: the Static Counter is a simple counter that determines the V/F switching frequency, the Metric block calculates m(t), and the V/F Calc. block is the threshold mechanism that calculates the proper V/F for time window t + 1. In Section IV.C, we discuss how the decision-making frequency and the threshold values are calculated to optimize the system for each application. We aim to demonstrate how we can co-design the elements in this framework to achieve significant energy savings without losing performance; therefore, we adopt a flexible, lightweight controller that can be tuned using information about the VFI structure.

Fig. 3. Block diagram of the V/F controller.

D. Overall Integration

In the previous sections, we have described how the VFI partitioning, the WiNoC, and dynamic V/F tuning can be co-designed to create an optimized full-system design. Fig. 4 summarizes the key ideas that are traded among these three paradigms.
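The controller logic of Eqs. (10), (12) and (13) can be condensed into a short sketch. This is a minimal Python illustration: the weight rule in vfi_metric is one possible reading of the "proportion of core utilization to traffic" description, and all function names are ours, not the paper's.

```python
def vfi_metric(core_util, inter_link_util):
    """m(t) per Eq. (10): weighted sum of the VFI's average core
    utilization and average incoming inter-VFI link utilization.
    Weights are set to each component's share of the total
    (one reading of the paper's proportion rule -- an assumption)."""
    avg_u = sum(core_util) / len(core_util)
    avg_c = (sum(inter_link_util) / len(inter_link_util)
             if inter_link_util else 0.0)
    total = avg_u + avg_c
    w_u = avg_u / total if total > 0 else 1.0
    return w_u * avg_u + (1.0 - w_u) * avg_c

def select_vf(m, delta_m=0.0):
    """Threshold mechanism of Eq. (12): map m(t) + dM to a V/F level
    expressed relative to the nominal V/F (1.0)."""
    x = m + delta_m
    for threshold, vf in [(0.9, 1.0), (0.8, 0.9), (0.7, 0.8),
                          (0.6, 0.7), (0.5, 0.6)]:
        if x > threshold:
            return vf
    return 0.5

def tracking_mae(metrics, delta_m):
    """Mean absolute error between the V/F predicted from m(t) and the
    next window's metric m(t+1); sweeping delta_m and keeping the
    minimizer yields dM_opt (as done in Section IV.C)."""
    errors = [abs(select_vf(metrics[t], delta_m) - metrics[t + 1])
              for t in range(len(metrics) - 1)]
    return sum(errors) / len(errors)
```

For example, a metric of 0.55 maps to the 0.6 level, but with ΔM = 0.1 the same metric maps to 0.7, reproducing the "higher ΔM, higher V/F" behavior of Fig. 2.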
When implementing VFIs, we can cluster highly communicating cores to aid the WiNoC, and cluster cores with similar application characteristic variations to aid dynamic V/F tuning. The WiNoC can be designed to take into account the shape and characteristics of the VFIs in order to reduce performance degradation and reduce the delay between each core and its V/F controller. Lastly, dynamic V/F tuning can take advantage of the application slack present in both VFIs and the WiNoC to optimize energy with low performance penalties.

Fig. 4. Inter-paradigm benefits between the WiNoC, VFI and V/F tuning paradigms.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental Setup

In this section, we evaluate the performance and energy dissipation of the WiNoC-enabled multicore chip compared to conventional wireline mesh-based designs in the presence of hybrid VFI clustering (Section III.A) and dynamic V/F tuning (Section III.C). We use GEM5 [31], a full-system simulator, to obtain detailed processor- and network-level information. In all experiments, we consider a system running Linux within the GEM5 platform in full-system (FS) mode. Since FS mode running Linux with Alpha cores is limited to a maximum of 64 cores, all experiments are done on a 64-core system. All full-system simulations were run with the default GEM5 packet size of six 128-bit flits. The MOESI_CMP_directory memory setup is used with private 64KB L1 instruction and data caches and a shared 8MB L2 cache (128KB distributed per core). Four SPLASH-2 benchmarks, i.e., FFT, RADIX, LU, and WATER [32], and four PARSEC benchmarks, i.e., CANNEAL, DEDUP, FLUIDANIMATE (FLUID) and VIPS [33], are considered. The processor-level statistics generated by the GEM5 simulations are incorporated into McPAT to determine the processor-level power values [34].

In this work, we consider a nominal range of operation; hence, the adopted dynamic V/F strategy uses discrete V/F pairs that maintain a linear relationship. The considered V/F pairs are: 1V/2.5GHz, 0.9V/2.25GHz, 0.8V/2.0GHz, 0.7V/1.75GHz, 0.6V/1.5GHz, and 0.5V/1.25GHz; for the remainder of this work, these V/F pairs are referred to as 1.0, 0.9, 0.8, 0.7, 0.6 and 0.5, respectively. By using on-chip voltage regulators with fast transitions, latency penalties and energy overheads due to voltage transitions can be kept low. We estimate the energy overhead introduced by a regulator due to a voltage transition as:

    E_regulator = (1 - η) · C_filter · |V2² - V1²|    (14)

where E_regulator is the energy dissipated by the voltage regulator due to a voltage transition, η is the power efficiency of the regulator, C_filter is the regulator filter capacitance, and V2 and V1 are the two voltage levels. Both the regulator switching energy and the dynamic VFI controller energy, described in Section IV.C, are taken into account when analyzing the entire system energy. The synchronization delay associated with the mixed-clock/mixed-voltage FIFOs at the boundaries of each VFI has been incorporated into the simulations following [35]. The energy overhead for each VFI has been included in the simulations as proposed in [24]:

    E_VFI = E_ClkGen + E_regulator + E_MixClkFifo    (15)

where E_ClkGen is the energy overhead of generating additional clock signals [36] and E_MixClkFifo is the energy overhead of the mixed-clock/mixed-voltage FIFOs. We note that this overhead is the same irrespective of the interconnect architecture considered; indeed, it is a fixed overhead in all the VFI NoC architectures considered in this work.

B. VFI Parameters

Following the methodology described in Section III.A.5, we determine the V/F pairs for all clusters using the hybrid clustering approach. In this work, we investigate two separate configurations: homogeneous (Hom) clustering, where the whole system is divided into four equally sized clusters, and heterogeneous (Het) clustering, where we still have four clusters, but they are not of equal size. In Het clustering, the clusters are only required to contain at least four cores. Tables II and III show the cluster sizes followed by their respective static V/F pairs (VFI size - V/F) for Hom and Het clustering, respectively. These configurations were obtained using the clustering approach described in Section III.A for a performance target of α = 95%, i.e., achieving at least 95% of the performance of the system with all clusters running at the nominal V/F (1.0).

TABLE II
VFI SIZES FOR HOMOGENEOUS CLUSTERING AND THEIR RESPECTIVE STATIC V/F LEVELS.

Benchmark   VFI1       VFI2       VFI3       VFI4
CANNEAL     16 - 0.7   16 - 0.6   16 - 1.0   16 - 0.9
DEDUP       16 - 1.0   16 - 1.0   16 - 0.9   16 - 0.9
FFT         16 - 1.0   16 - 1.0   16 - 0.9   16 - 0.9
FLUID       16 - 0.8   16 - 0.8   16 - 1.0   16 - 1.0
LU          16 - 0.7   16 - 0.7   16 - 1.0   16 - 1.0
RADIX       16 - 1.0   16 - 0.9   16 - 1.0   16 - 0.9
VIPS        16 - 0.6   16 - 0.8   16 - 0.7   16 - 1.0
WATER       16 - 0.8   16 - 0.8   16 - 0.9   16 - 1.0

TABLE III
VFI SIZES FOR HETEROGENEOUS CLUSTERING AND THEIR RESPECTIVE STATIC V/F LEVELS.

Benchmark   VFI1       VFI2       VFI3      VFI4
CANNEAL     22 - 0.6   22 - 1.0   16 - 0.6   4 - 0.9
DEDUP       40 - 0.9   16 - 1.0    4 - 1.0   4 - 1.0
FFT         29 - 0.9   23 - 1.0    7 - 0.9   5 - 0.9
FLUID       40 - 0.9   16 - 1.0    4 - 0.7   4 - 0.8
LU          32 - 0.8   24 - 1.0    4 - 0.6   4 - 0.9
RADIX       37 - 1.0   19 - 0.9    4 - 0.9   4 - 0.8
VIPS        30 - 0.7   26 - 0.9    4 - 0.7   4 - 0.9
WATER       33 - 0.8   23 - 1.0    4 - 0.9   4 - 0.7

Fig. 5. Average number of inter-VFI interfaces traversed per message for VFI Hom Mesh, VFI Het Mesh, VFI Hom WiNoC and VFI Het WiNoC.
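The energy overheads of Eqs. (14) and (15) above are straightforward to evaluate; the following sketch shows the computation. All numeric values in the usage note are purely illustrative, since the paper's η and C_filter values are not given in this excerpt.

```python
def regulator_transition_energy(v1, v2, c_filter, efficiency):
    """Eq. (14): E_regulator = (1 - eta) * C_filter * |V2^2 - V1^2|.
    v1, v2 in volts, c_filter in farads, efficiency (eta) in [0, 1]."""
    return (1.0 - efficiency) * c_filter * abs(v2 ** 2 - v1 ** 2)

def vfi_overhead_energy(e_clk_gen, e_regulator, e_mix_clk_fifo):
    """Eq. (15): per-VFI energy overhead, the sum of the clock
    generation, regulator transition and mixed-clock FIFO energies."""
    return e_clk_gen + e_regulator + e_mix_clk_fifo
```

For example, with a hypothetical η = 0.9 and C_filter = 100 nF, a 0.8 V to 1.0 V transition dissipates (1 - 0.9) · 100 nF · |1.0² - 0.8²| = 3.6 nJ, which is then added to the clock generation and FIFO terms of Eq. (15).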
For the dynamic V/F configurations, the V/F values are adjusted throughout the execution in response to changing application characteristics. When creating the WiNoC, we determine the parameters described in Section III.B.1: as shown in [9], ⟨k_inter⟩ = 1, ⟨k_intra⟩ = 3, and k_max = 7 optimize the network latency and energy. To further demonstrate how the WiNoC is better suited for VFIs, we analyze the number of inter-VFI interfaces that a message needs to traverse on average in both the mesh and WiNoC networks (Fig. 5). Messages in the WiNoC need to traverse fewer inter-VFI interfaces on average, since the connectivity and ALASH routing guarantee that each message passes through at most one interface. The mesh, on the other hand, depends on the structure of the VFIs and the routing (standard X-Y is considered here) and makes no such guarantee. Also, Het WiNoC reduces the number of interfaces traversed compared to Hom WiNoC by containing more traffic within each cluster.

C. Implementation of the Dynamic VFI Control Circuit

In this section, we discuss the implementation of the dynamic V/F controller described in Section III.C. The controller is synthesized from an RTL-level design using the TSMC 65-nm CMOS process and Synopsys Design Vision, and is designed to operate on a 1.25 GHz clock. The area overhead for each controller is 0.06 mm² in the worst-case scenario, i.e., a controller for a VFI containing all 64 cores. Due to the simplicity of the application information used, as described in Section III.C, the core-to-controller traffic is insignificant compared to the total amount of traffic traversing the system.
It is measured that the total traffic generated for the V/F controllers contributes less than 0.05% of the total traffic for all benchmarks considered. The decision-making delay of the V/F controller is 8.8 ns. In this work, we ensure that the V/F controller delay is less than 1% of the V/F switching window period T; hence, the controller does not introduce any significant time overhead with respect to the overall switching window. Therefore, we set the lower bound T ≥ 1 μs and sweep the switching window period T throughout the range 1 μs ≤ T ≤ 1 ms. The energy per switching window is, in the worst case, 39.1 nJ; this energy is added to the VFI overhead of Eq. (15) for the dynamic V/F configurations. In Section IV.D, we present results for the value of T that optimizes the full-system energy-delay product (EDP) for each benchmark. The optimal values of T were found to be between 9 μs and 1 ms depending on the benchmark considered.

As discussed in Section III.C, we use two different values to determine ΔM: ΔMopt, the value of ΔM that minimizes the tracking error, and ΔMstd, the average standard deviation of the core utilization within the cluster. Their combination, ΔMopt+std (Eq. (13)), provides the benefits of both measures. To determine ΔMopt, we swept the values of ΔM per VFI for each benchmark to find the value that minimizes the tracking MAE. Fig. 6 shows how the MAE changes while sweeping through values of ΔM for two example benchmarks, FLUID and RADIX, along with the optimal ΔM for each VFI (FLUID: -0.09 for VFI1-VFI4; RADIX: -0.05, -0.07, -0.06 and -0.06 for VFI1-VFI4, respectively). Following the same process, we determined the optimal ΔM of each VFI for all the benchmarks under consideration. The parameter ΔMstd is calculated as the average standard deviation of the core utilization over all dynamic decision windows per VFI.

Fig. 6. Mean absolute error (MAE) between the predicted V/F value and the next window's metric for various values of ΔM for (a) FLUID and (b) RADIX. The optimal ΔM that minimizes the MAE for each VFI is also noted.

To justify our usage of ΔMstd, Fig. 7 shows the average, minimum (notated as negative error) and maximum (notated as positive error) core utilization of one VFI for a sample period of the FLUID and RADIX benchmarks. From this plot, we can see that some cores deviate significantly from the mean, which could cause large execution time penalties if the V/F were calculated simply from the average core utilization. Hence, we use ΔMopt+std as the ΔM value for all DVFI configurations.

Fig. 7. A sample of the average core utilization for each window of a VFI running (a) FLUID and (b) RADIX. The error bars notate the minimum and maximum core utilization during that particular window.

For the controller to make the best decision possible, the ability to quickly receive the most up-to-date information from each core in the VFI is essential; in this context, the maximum delay between any core and its local V/F controller is the most relevant metric. To demonstrate how the WiNoC facilitates core-to-controller communication, Fig. 8 shows that the WiNoC reduces the maximum hop count between a core and its local V/F controller compared to standard mesh architectures. Lower hop counts not only result in quicker communication, but also reduce the impact of core-to-controller communication on standard inter-core traffic. It should be noted that the difference between the mesh and WiNoC maximum hop-count values increases with the size of the largest VFI cluster, VFI1 (see Table III). For example, FLUID has the largest VFI1 cluster and the largest difference between the mesh and WiNoC maximum hop-count values, while CANNEAL has the smallest VFI1 cluster and the smallest difference. This is a result of the better connectivity of the small-world architecture and the long-range wireless shortcuts. By using the WiNoC, we provide a scalable and efficient communication backbone for DVFI.

Fig. 8. Maximum hop-count between any core and its respective V/F controller for Mesh and WiNoC architectures.

D. Full System Performance Evaluation

To evaluate the performance of our proposed framework for multicore chips, we consider the effects of the different VFI clusterings, network configurations, and V/F tuning schemes. As the baseline for all configurations, we consider the commonly used non-VFI (NVFI) mesh architecture.

1) Effects of VFI Clustering: In this section, we consider two VFI configurations. The first is a statically tuned VFI (SVFI) with Hom clustering (SVFI-HOM), which creates four equal-sized VFI clusters with statically configured V/F values; this serves as a comparison to existing VFI work [9].

Fig. 10. Percentage of inter-VFI traffic of total traffic for homogeneous (HOM) and heterogeneous (HET) clustering.
The second configuration is SVFI with Het clustering (SVFI-HET), which creates VFI clusters according to Section III.A.5 with statically configured V/F values. We include both clustering configurations with both mesh and WiNoC topologies to analyze the effects of the network. Fig. 9 compares the execution time of all configurations to the baseline NVFI Mesh design. Since Het clustering places significantly fewer restrictions on the VFI clustering, it is able to obtain a better configuration than Hom clustering: the Het clusters contain more of the traffic within each VFI, which improves the overall performance of the platform. Indeed, Fig. 10 shows that Het clustering reduces the amount of inter-VFI traffic compared to Hom clustering for all benchmarks considered, helping Het clustering maintain or improve the execution time relative to Hom clustering. Also, as expected, in the presence of VFIs, traditional mesh-based designs suffer execution time degradation. The WiNoC-enabled designs all outperform their mesh counterparts due to the better connectivity of the WiNoC. We analyze this connectivity by investigating the average traffic-weighted hop count in Fig. 11: the WiNoC significantly reduces the average traffic-weighted hop count compared to the mesh for all benchmarks considered, and can therefore transfer inter-core communication much more quickly. The other important parameter in analyzing our VFI design is the energy dissipation profile.

Fig. 9. The execution time of the NVFI, SVFI-HOM, and SVFI-HET configurations using Mesh and WiNoC with respect to NVFI Mesh for all benchmarks considered.

Fig. 11. Average traffic-weighted hop-count for both Mesh and WiNoC architectures.
Since VFI designs principally save energy at the cost of performance, the most relevant metric for analyzing the energy profile is the energy-delay product (EDP), where by delay we mean the execution time. Fig. 12 shows the EDP of all configurations with respect to NVFI Mesh. For all benchmarks considered, the WiNoC configurations save EDP over their mesh counterparts, due to the WiNoC's ability to reduce the performance impact and lower the network energy through better connectivity and low-power wireless links. Similarly to the execution time analysis above, Het clustering groups cores with similar utilization together more effectively than Hom clustering; as such, Het clustering can apply lower V/F levels without significantly impacting performance, resulting in better EDP profiles. Based on this analysis, we utilize Het clustering when comparing statically tuned and dynamically tuned VFIs.

Fig. 12. The EDP of the NVFI, SVFI-HOM, and SVFI-HET configurations using Mesh and WiNoC with respect to NVFI Mesh for all benchmarks considered.

2) Static vs. Dynamic V/F Tuning: In this section, the two main configurations under consideration are SVFI and DVFI, both with Het clustering. We also include the mesh and WiNoC configurations in order to give a final picture of all the components acting in harmony.

Fig. 14. Traffic intensity of all the benchmarks considered with respect to the highest injection rate benchmark (RADIX).
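Since EDP drives all remaining comparisons, a trivial sketch makes the metric and its normalization explicit. The function names and the numbers in the example are ours and purely illustrative, not the paper's measured results.

```python
def edp(energy_joules, exec_time_seconds):
    """Energy-delay product: full-system energy times execution time."""
    return energy_joules * exec_time_seconds

def edp_savings_percent(edp_config, edp_baseline):
    """Percent EDP savings of a configuration relative to a baseline
    (here, the NVFI Mesh design)."""
    return 100.0 * (1.0 - edp_config / edp_baseline)
```

For example, a configuration that dissipates 6 J over 1.25 s has an EDP of 7.5 J·s; against a baseline of 10 J over 1 s (EDP 10 J·s), it saves 25% EDP even though it runs slower, which is exactly the energy-versus-delay trade-off the metric is meant to capture.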
We compare the execution time of all configurations considered here to the baseline NVFI Mesh in Fig. 13. Again, the WiNoC outperforms the mesh in every configuration. Notably, DVFI WiNoC performs at or better than NVFI Mesh for the majority of the benchmarks considered; the only exceptions are DEDUP and VIPS, where DVFI WiNoC incurs a 2.5% and 3.5% penalty, respectively. This is mostly due to the low traffic injection rates of these benchmarks. Fig. 14 shows the relative traffic intensity of each benchmark, measured as the traffic injection rate normalized to the highest value, in this case RADIX. VIPS is the lowest-traffic-intensity benchmark, which limits the network's ability to improve performance; the other very low traffic intensity benchmarks, DEDUP and FLUID, also benefit less from the WiNoC than the other benchmarks.

Fig. 13. The execution time of the NVFI, SVFI, and DVFI configurations using Mesh and WiNoC with respect to NVFI Mesh for all benchmarks considered.

The level of long-range communication also affects how much improvement can be gained through the network. Fig. 15 shows the proportion of total traffic between cores within specific distance ranges for each benchmark. For WATER, the amount of traffic traversing over 7.5 mm is significantly lower than for the other benchmarks, resulting in lower performance gains with the WiNoC. Traffic traversing over 7.5 mm is considered long-range traffic, since the wireless link becomes more energy efficient than its wireline counterpart beyond this distance, as mentioned in Section III.B.2.

Fig. 15. Distance of communication characteristics of the benchmarks considered (traffic binned as x ≤ 2.5 mm, 2.5 < x ≤ 5 mm, 5 < x ≤ 7.5 mm, and x > 7.5 mm).

As before, we analyze the EDP as the relevant metric. Fig. 16 shows the EDP of all configurations with respect to NVFI Mesh. Again, for all benchmarks considered, the WiNoC configurations save EDP over their mesh counterparts, for the same reasons discussed in the previous section. We also see that the DVFI configurations outperform the other systems running the same network configuration due to DVFI's ability to reduce the V/F levels in the system with little performance impact. Fig. 17 demonstrates this capability for a snapshot of a VFI running FFT: compared to the SVFI configuration, DVFI reduces its V/F to save more energy (e.g., time windows 93-106 and 113-130) and increases its V/F to reduce the execution time penalty (e.g., time windows 23-87).

Fig. 16. The EDP of the NVFI, SVFI, and DVFI configurations using Mesh and WiNoC with respect to NVFI Mesh for all benchmarks considered.

Fig. 17. Core average utilization and V/F calculated for SVFI and DVFI during a snapshot of the FFT benchmark.

The EDP improvement comes from three components: a better NoC architecture, heterogeneous VFI clustering, and dynamically tuned VFIs. The role of each can be analyzed from Fig. 16. For example, for the FFT benchmark, the better NoC architecture improves the EDP by 14.7% (NVFI Mesh vs. NVFI WiNoC), the VFI clustering improves the EDP by a further 5.3% (NVFI WiNoC vs. SVFI WiNoC), and the dynamic V/F tuning improves the EDP by a further 7.9% (SVFI WiNoC vs. DVFI WiNoC), all with respect to NVFI Mesh. By co-designing all three methodologies, we save a total of 27.9% EDP for the FFT benchmark; the other benchmarks show a similar trend.

Fig. 18 shows the average V/F level across the system and application runtime for SVFI-HOM, SVFI-HET and DVFI Mesh. We discuss DVFI Mesh as an example; DVFI WiNoC exhibits a similar trend and the same benefit with respect to SVFI-HOM and SVFI-HET. As can be seen, DVFI Mesh lowers the average V/F for all benchmarks considered, allowing for significant EDP reduction. Our proposed DVFI WiNoC saves up to 46.6% EDP (VIPS) and an average of 17.4% EDP over the state-of-the-art static VFI system [9]. It should also be noted that the proposed DVFI design saves up to 60.5% EDP (VIPS) and an average of 38.9% EDP with respect to NVFI Mesh for all benchmarks considered.

Fig. 18. Average V/F level for three VFI configurations, SVFI-HOM, SVFI-HET and DVFI Mesh.

V. CONCLUSION AND FUTURE WORK

In this paper, we have demonstrated that by incorporating WiNoCs, VFIs, and dynamic V/F tuning in a synergistic manner, we can create an energy-efficient design for multicore chips without sacrificing noticeable performance. For all benchmarks considered, with the exception of only two (DEDUP and VIPS), there is no performance degradation for DVFI WiNoC with respect to NVFI Mesh. Along with this low impact on execution time, we save significant full-system energy-delay product (EDP) over the traditional non-VFI mesh. As such, we have demonstrated the importance of an integrated design approach involving VFIs, dynamic V/F tuning and wireless NoCs to achieve energy efficiency for multicore chips. Since this work has mainly demonstrated an integrated design approach, a natural direction for future work is to further advance each component of the design process, including investigations that increase the synergy between the components.
REFERENCES

[1] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, R. Varada, M. Ratta and S. Vora, "A 45nm 8-core enterprise Xeon processor," in Proc. of ASSCC, 2009, pp. 9-12.
[2] B. Stackhouse, S. Bhimji, C. Bostak, D. Bradley, B. Cherkauer, J. Desai, E. Francom, M. Gowan, P. Gronowski, D. Krueger, C. Morganti and S. Troyer, "A 65 nm 2-billion transistor quad-core Itanium processor," IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 18-31, 2009.
[3] J. Friedrich et al., "Design of the POWER6 microprocessor," in Proc. of ISSCC, 2007, pp. 96-97.
[4] H. Mair et al., "A 65-nm mobile multimedia applications processor with an adaptive power management scheme to compensate for variations," in Proc. of VLSIC, 2007, pp. 224-225.
[5] N. Kapadia and S. Pasricha, "A framework for low power synthesis of interconnection networks-on-chip with multiple voltage islands," Integration, the VLSI Journal, vol. 45, no. 3, June 2012.
[6] S. Deb, K. Chang, X. Yu, S.P. Sah, M. Cosic, A. Ganguly, P.P. Pande, B. Belzer and D. Heo, "Design of an energy-efficient CMOS-compatible NoC architecture with millimeter-wave wireless interconnects," IEEE Trans. Computers, vol. 62, no. 12, pp. 2382-2396, Dec. 2013.
[7] R. Marculescu, U. Ogras, L.-S. Peh, N.E. Jerger and Y. Hoskote, "Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives," IEEE Trans. CAD, vol. 28, pp. 3-21, Jan. 2009.
[8] R. Kim, G. Liu, P. Wettin, R. Marculescu, D. Marculescu, and P.P. Pande, "Energy-efficient VFI-partitioned multicore design using wireless NoC architectures," in Proc. of CASES, 2014, pp. 1-9.
[9] R.G. Kim, W. Choi, G. Liu, E. Mohandesi, P.P. Pande, D. Marculescu and R. Marculescu, "Wireless NoC for VFI-enabled multicore chip design: performance evaluation and design trade-offs," IEEE Trans. Computers, in press.
[10] U.Y. Ogras and R. Marculescu, "It's a small world after all: NoC performance optimization via long-range link insertion," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 7, pp. 693-706, 2006.
[11] S. Garg, D. Marculescu, and R. Marculescu, "Custom feedback control: enabling truly scalable on-chip power management for MPSoCs," in Proc. of ISLPED, 2010.
[12] J. Murray, R. Kim, P. Wettin, P.P. Pande, and B. Shirazi, "Performance evaluation of congestion-aware routing with DVFS on a millimeter-wave small-world wireless NoC," ACM J. Emerg. Technol. Comput. Syst., vol. 11, no. 2, Article 17, Nov. 2014.
[13] A. Bartolini, M. Cacciari, A. Tilli and L. Benini, "Thermal and energy management of high-performance multicores: distributed and self-calibrating model-predictive controller," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 1, pp. 170-183, Jan. 2013.
[14] S. Murali, M. Coenen, A. Radulescu, K. Goossens, and G. De Micheli, "A methodology for mapping multiple use-cases onto networks on chips," in Proc. of DATE, 2006, pp. 1-6.
[15] U.Y. Ogras, R. Marculescu and D. Marculescu, "Variation-adaptive feedback control for networks-on-chip with multiple clock domains," in Proc. of DAC, 2008, pp. 614-619.
[16] S. Garg, D. Marculescu, R. Marculescu, and U. Ogras, "Technology-driven limits on DVFS controllability of multiple voltage-frequency island designs," in Proc. of DAC, July 2009.
[17] P. Choudhary and D. Marculescu, "Power management of voltage/frequency island-based systems using hardware based methods," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 3, pp. 427-438, March 2009.
[18] P. Choudhary and D. Marculescu, "Hardware based frequency/voltage control of voltage frequency island systems," in Proc. of CODES+ISSS, Seoul, South Korea, Oct. 2006.
[19] B.C. Lee, D.M. Brooks, B.R. de Supinski, M. Schulz, K. Singh and S.A. McKee, "Methods of inference and learning for performance modeling of parallel applications," in Proc. of PPoPP, 2007.
[20] Y. Tan, W. Liu and Q. Qiu, "Adaptive power management using reinforcement learning," in Proc. of ICCAD, 2009.
[21] Z. Chen and D. Marculescu, "Distributed reinforcement learning for power limited many-core system performance optimization," in Proc. of DATE, 2015.
[22] C.M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[23] D.-C. Juan, S. Garg, J. Park and D. Marculescu, "Learning the optimal operating point for many-core systems with extended range voltage/frequency scaling," in Proc. of CODES+ISSS, 2013.
[24] U.Y. Ogras, R. Marculescu, D. Marculescu and E.G. Jung, "Design and management of voltage-frequency island partitioned networks-on-chip," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 3, pp. 330-341, March 2009.
[25] D.J. Watts and S.H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, pp. 440-442, 1998.
[26] J. Murray, R. Kim, P. Wettin, P.P. Pande, and B. Shirazi, "Performance evaluation of congestion-aware routing with DVFS on a millimeter-wave small-world wireless NoC," ACM J. Emerg. Technol. Comput. Syst., vol. 11, no. 2, Article 17, Nov. 2014.
[27] T. Petermann and P. De Los Rios, "Spatial small-world networks: a wiring cost perspective," arXiv:cond-mat/0501420v2.
[28] B.A. Floyd, C.-M. Hung, and K.K. O, "Intra-chip wireless interconnect for clock distribution implemented with integrated antennas, receivers, and transmitters," IEEE J. Solid-State Circuits, vol. 37, no. 5, pp. 543-552.
[29] P. Wettin, R. Kim, J. Murray, X. Yu, P.P. Pande, A. Ganguly and D. Heo, "Design space exploration for wireless NoCs incorporating irregular network routing," IEEE Trans. CAD, vol. 33, no. 11, pp. 1732-1745, Nov. 2014.
[30] O. Lysne, T. Skeie, S.-A. Reinemo and I. Theiss, "Layered routing in irregular networks," IEEE Trans. Parallel Distrib. Syst., vol. 17, no. 1, pp. 51-65, 2006.
[31] N. Binkert et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, 2011.
[32] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proc. of ISCA, 1995, pp. 24-36.
[33] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton Univ., Princeton, NJ, Jan. 2011.
[34] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen and N.P. Jouppi, "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proc. of MICRO, 2009, pp. 469-480.
[35] S. Beer, R. Ginosar, M. Priel, R. Dobkin and A. Kolodny, "The devolution of synchronizers," in Proc. of ASYNC, 2010, pp. 94-103.
[36] D.E. Duarte, N. Vijaykrishnan and M.J. Irwin, "A clock power model to evaluate impact of architectural and technology optimizations," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 6, Dec. 2002.

Ryan Gary Kim is a fourth-year Ph.D. candidate in the Electrical Engineering and Computer Science Department, Washington State University, Pullman, WA, USA. His research interests include low-power wireless NoC design through power management techniques.

Wonje Choi received the B.S. degree in Computer Engineering from Washington State University, Pullman, WA, USA, in 2013, where he is currently working towards the Ph.D. degree.

Zhuo Chen received the B.S. degree in Electronics Engineering and Computer Science from Peking University, Beijing, China, in 2013. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA. His research interests are in the area of energy-aware computing; in particular, his research focuses on multi-core heterogeneous/homogeneous system optimization and low-power application-specific system design.

Partha Pratim Pande is a Professor and holder of the Boeing Centennial Chair in Computer Engineering at the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA. His current research interests are novel interconnect architectures for multicore chips, on-chip wireless communication networks, and hardware accelerators for biocomputing.

Diana Marculescu is a Professor of Electrical and Computer Engineering at Carnegie Mellon University. She has won several best paper awards in top conferences and journals. Her research interests include energy-, reliability-, and variability-aware computing and CAD for non-silicon applications. She is an IEEE Fellow.

Radu Marculescu is a Professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University.
He has received several best paper awards in top conferences and journals covering design automation of integrated and embedded systems. His current research focuses on modeling and optimization of embedded and cyber-physical systems. He is an IEEE Fellow.