21st International Conference on VLSI Design

Exploiting Variable Cycle Transmission for Energy-Efficient On-Chip Interconnect Design

T. Venkata Kalyan, IIIT Hyderabad, Hyderabad-500032, India, kalyan tv@research.iiit.ac.in
Madhu Mutyam, IIT Madras, Chennai-600036, India, madhu@cs.iitm.ernet.in
P. Vijaya Sankara Rao, IIT Kharagpur, Kharagpur-721302, India, vijaysankar@ece.iitkgp.ernet.in

1063-9667/08 $25.00 © 2008 IEEE. DOI 10.1109/VLSI.2008.15

Abstract

As the on-chip interconnect in deep-submicron designs contributes to the system-wide power consumption, minimization of interconnect power consumption has become one of the important design issues in deep-submicron technologies. As transition activity mainly determines the interconnect power consumption, several bus encoding techniques have been proposed to minimize the activity. Unlike the existing low-power or energy-efficient bus encoding techniques, in this paper we propose a scheme which exploits both dynamic voltage scaling and variable cycle transmission mechanisms for minimizing on-chip interconnect energy consumption. We transmit data using the variable cycle transmission method and, based on the delay savings achieved through this method at regular intervals, scale the voltage and frequency to obtain significant energy savings. Using our technique for a 5mm interconnect wire, we achieved energy savings of 30% and 45% over the base case in the address bus and data bus, respectively. Our technique also reduces the energy-delay-product by 34% and 52% for address bus and data bus, respectively.

1. Introduction

With the scaling of process technology, system-wide power consumption is increasing. One of the contributors to the system-wide power consumption is the on-chip interconnect. Data transmission on the interconnect causes rail-to-rail voltage swing and charging/discharging of capacitance, which in turn result in dynamic (or switching) power consumption. As switching activity mainly determines the power consumption, several low-power or energy-efficient bus encoding techniques have been proposed in the literature to minimize it. In the coupling-driven bus encoding technique for on-chip buses [7], the data sequences that are closely placed are transformed to minimize the coupling effects, which in turn achieves power savings. In order to ensure data integrity, the authors proposed to use either extra control lines or extra clock cycles. A technique to minimize power consumption due to coupling transitions is proposed in [15], which modifies the transition profiles to reduce switching energy by 50%. An area and energy efficient coding technique is proposed in [17] to obtain nearly 46% reduction in delay along with 10% reduction in energy, but it requires 48 wires to encode 32-bit data. An adaptive bus encoding technique using weighted code mapping and the delayed bus technique is proposed in [3] to achieve significant reduction in power consumption.

In all the above mentioned techniques, a fixed clock period is considered for data transmission. But data can be transmitted using variable clock periods. As data transition patterns determine the necessary delay for data transmission, we can fix the necessary delay for each data transition pattern and transmit data with the delay corresponding to its worst-case transition pattern. A technique based on this idea is called the variable cycle transmission (VCT) technique [8]. It is shown that the VCT technique can achieve significant delay savings.

As system-wide power consumption is one of the critical issues in the VLSI community, several techniques have been proposed to minimize it. One of the well known techniques for reducing system-wide dynamic power consumption is dynamic voltage scaling (DVS) [5]. The DVS technique exploits the quadratic dependency between supply voltage and power consumption and the linear relationship between clock frequency and supply voltage to achieve significant dynamic power savings. Application of the DVS technique to on-chip interconnect is explored in [6]. By considering interconnect designs based on a double sampling latch, which detects and corrects timing errors, the DVS technique is applied to recover the available slack, which results in good power savings. By keeping frequency constant, supply voltage is scaled for power reduction. Voltage scaling at constant frequency can result in timing errors. Errors are detected and corrected using the double sampling latch, and the voltage scaling is controlled by the error recovery rate. Motivated by the fact that the interconnection network consumes a significant portion of the system-wide power budget, application of the DVS technique to interconnection network links is explored in [13]. Power minimization is achieved by adjusting both frequency and voltage of links.

In this paper, we propose an interconnect design on which data is transmitted using variable clock periods. Delay savings obtained by transmitting data using variable clock periods are exploited for applying the DVS technique. As part of the application of the DVS technique, we scale both supply voltage and frequency. We validate the effectiveness of our technique by focusing on the L1 cache address/data buses of a microprocessor using the SPEC CPU2000 benchmark suite and show that for a 5mm interconnect wire our technique achieves 30% and 45% energy savings over the base case in the address bus and data bus, respectively. In addition to the energy savings, our technique reduces the energy-delay-product by 34% and 52% for address bus and data bus, respectively.

Crosstalk class   Relative delay on the middle wire (×C_L R_T)   Transition patterns
1                 0                                              x − y
2                 1                                              ↑↑↑, ↓↓↓
3                 1 + λ                                          −↑↑, ↑↑−, −↓↓, ↓↓−
4                 1 + 2λ                                         −↑−, −↓−, ↓↓↑, ↑↓↓, ↑↑↓, ↓↑↑
5                 1 + 3λ                                         −↑↓, −↓↑, ↓↑−, ↑↓−
6                 1 + 4λ                                         ↓↑↓, ↑↓↑

Table 1.
Crosstalk classes (here $\lambda = C_I/C_L$; x, y : {a → b | a, b ∈ {0, 1}}).

2. Prerequisites

We first review the effects of voltage scaling on dynamic power consumption and delay. Each transition of a digital circuit consumes power because of charging and discharging of the digital circuit's capacitance. The dynamic power consumption ($P_{dynamic}$) is expressed as

$P_{dynamic} \propto V_{DD}^2 f$    (1)

where $V_{DD}$ is the supply voltage and $f$ is the clock frequency. It is clear from Equation (1) that the power consumption can be reduced quadratically by reducing $V_{DD}$, but supply voltage reduction increases the propagation delay. The propagation delay ($\tau$) of a CMOS transistor [12] is expressed as

$\tau \propto \frac{V_{DD}}{(V_G - V_T)^{\alpha}}$    (2)

where $V_T$ is the threshold voltage, $V_G$ is the input gate voltage, and $1 \le \alpha \le 2$. In deep-submicron designs, the value of $\alpha$ is nearly 1.3. The clock frequency is restricted by the propagation delay and it has to be reduced to tolerate the increased propagation delay.

Energy consumption per data transmission for an interconnect, which includes both the self capacitance ($C_L$) and the coupling capacitance between two adjacent lines ($C_I$), is given by [10]

$E_{bus} = (\alpha_s C_L + \alpha_c C_I) V_{DD}^2$    (3)

where $\alpha_s$ and $\alpha_c$ denote the rates at which each capacitance is switched. While $\alpha_s$ represents the self transitions, $\alpha_c$ is related to the coupling transitions on the interconnect. We know from Equation (3) that voltage scaling can significantly reduce the energy consumption.

We now review an analytical model for propagation delay in deep-submicron buses. By assuming an n-bit parallel bus in a single metal layer, we model a deep-submicron bus as a distributed RC network with coupling capacitance between adjacent wires. The delay of wire $l$ ($1 < l < n$) of the bus is given by [16]

$T_l = C_L R_T [(1 + 2\lambda)\Delta_l^2 - \lambda \Delta_l (\Delta_{l-1} + \Delta_{l+1})]$    (4)

where $R_T$ is the total resistance, $\lambda$ is the ratio of the inter-wire capacitance ($C_I$) to the wire-to-substrate capacitance ($C_L$), and $\Delta_l$ is the transition occurring on line $l$. $\Delta_l$ is equal to 1 (or ↑) for a 0-to-1 transition, −1 (or ↓) for a 1-to-0 transition, and 0 (or −) for no transition. As data transition patterns determine the propagation delay, they are classified into six different crosstalk classes [8, 16] based on the relative delay of a wire w.r.t. its adjacent wires. Table 1 shows the different crosstalk classes.

3. Our Approach

The basic idea of our approach is to transmit data using variable clock periods as proposed in the VCT technique [8] and exploit the delay savings obtained through the VCT technique to apply the DVS technique for significant power savings. As part of the application of the DVS technique, we scale both supply voltage and frequency. Figure 1 shows the basic mechanism used in our approach. The shaded portion in the figure represents the implementation of the VCT technique.

Figure 1. Basic mechanism used in our approach.

In general, data is transmitted using a fixed clock period, which is at least the delay of crosstalk class 6 (refer to Table 1). Instead of transmitting data using the fixed worst-case crosstalk class delay, we analyze the crosstalk class of the next data w.r.t. the present data and transmit the next data using the necessary delay. In order to determine the crosstalk class of the next data w.r.t. the present data on the bus, we use a Crosstalk Class Analyzer [8] (as shown in Figure 1). To support variable clock periods for data transmission, we consider $x$ as the unit clock period such that

$x \ge C_L R_T$
$2x \ge C_L R_T (1 + \lambda)$
$3x \ge C_L R_T (1 + 2\lambda)$
$4x \ge C_L R_T (1 + 3\lambda)$
$5x \ge C_L R_T (1 + 4\lambda)$

Hence, the unit clock period becomes $\frac{C_L R_T (1 + 4\lambda)}{5}$. Based on the crosstalk class of the next data w.r.t. the present data, we consider the delay as an integer multiple of the unit clock period. As variable clock periods are used for data transmission, we use an extra interconnect wire (for the Ready signal) to indicate the availability of data at the receiver side. The extra wire is separated from the actual data by a shield wire so that the Ready signal takes a single cycle to reach the receiver. The delay due to the Crosstalk Class Analyzer is overlapped with the propagation delay of the Ready signal [8]. Thus, we use two extra wires (i.e., the Ready signal and a shield wire) and hence the VCT technique requires 34 wires for a 32-bit bus.

Data transmission using variable clock periods can result in significant delay savings. It is shown in [8] that the VCT technique achieves 31.5% delay savings in the case of the L1 cache data bus for on-chip data transmission. The significant delay savings achieved by the VCT technique are due to the fact that on-chip data exhibit a high percentage of lower crosstalk class transitions (refer to Table 1), as shown in Figure 2.

Figure 2. Distribution of transition patterns.

As our main focus is on energy minimization rather than delay minimization, we exploit the delay savings obtained through the VCT mechanism for energy reduction. In order to apply the DVS technique for energy minimization, the delay savings are measured at regular intervals and, based on the delay savings obtained so far, we scale the voltage. As voltage scaling alters the propagation delay, we also scale the frequency. As shown in Figure 1, we consider a Delay Counter which consists of a Global Counter and a Local Counter. The Global Counter maintains the number of clock cycles that have been saved so far. The Local Counter maintains the number of clock cycles that have been saved within the current sampling interval. We consider the sampling period as 15000 cycles. The Crosstalk Class Analyzer provides values 3, 4, 5, and 6 for crosstalk classes 1-3, 4, 5, and 6, respectively, to the Local Counter. The Local Counter, upon receiving the value $i$ from the Crosstalk Class Analyzer, increments its count by $(6 - i)$, where $i \in \{3, 4, 5, 6\}$, and resets its value after every 15000 cycles. The Delay Counter updates the Global Counter value every 15000 cycles using the following formula:

$N_g = N_g + N_l - \left\lceil \frac{(x' - x) N_l}{x} \right\rceil$    (5)

where $N_g$ and $N_l$ are the values of the Global Counter and the Local Counter, respectively, $x$ is the original unit clock period (as defined earlier in the section), and $x'$ is the new unit clock period obtained due to frequency scaling.

B'mark    Address     Data        B'mark    Address     Data
Bzip      104103584   31711265    Crafty    160591861   45039063
Eon       180520910   48659025    Gap       134847111   37827729
Gcc       186621216   56015716    Gzip      181144637   40342073
Mcf       117417966   32797679    Parser    201532006   51105161
Perlbmk   129748403   36285808    Twolf     122384930   31389418
Vortex    114818866   42302045    Vpr       227654678   37423636
Applu     102778392   33394619    Apsi      102578114   28801754
Art       109836092   36995126    Fma3d     131143197   19278215
Galgel    106010236   5657714     Lucas     99952797    18938412
Mesa      130750176   33263659    Mgrid     101415579   32568604
Swim      100232496   24084939    Wupwise   99954963    8159589

Table 2. Number of 32-bit data items considered in different benchmarks.

Parameter   W (nm)   S (nm)   T (nm)   H (nm)   Dielectric constant
Value       205      205      430.5    398.5    3.3

Table 3. Device parameters for 90nm technology nodes based on the ITRS 2004 edition.

Method   # of wires   Codec overhead: Area (µm²)   Codec overhead: Energy (pJ)
Base     32           0                            0
VCT      34           3603.5                       0.458

Table 4. Codec overhead summary.

If the Global Counter value is more than $\delta_1$ and the energy savings due to voltage scaling by 0.1V are more than the energy overhead due to the voltage regulator (given by Equation (6)), we scale down the voltage by 0.1V. On the other hand, if the Global Counter value is less than $\delta_2$, we scale up the voltage by 0.1V.
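The bookkeeping described in this section, from the transition pattern on a wire to the voltage decision, can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the comparison of energy savings against the regulator overhead is passed in as a precomputed boolean, Equation (5) is taken in the form N_g + N_l − ⌈(x′ − x)N_l/x⌉, and only an interior wire with two neighbours is classified.

```python
import math

STEP_V = 0.1                 # voltage step per scaling decision (V)
V_MAX, V_MIN = 1.2, 0.8      # supply-voltage limits used in the experiments (V)
DELTA_1, DELTA_2 = 4999, 0   # Global Counter thresholds used in the experiments


def crosstalk_class(d_left, d_mid, d_right):
    """Class of a middle wire from transitions in {-1, 0, +1} (Eq. 4, Table 1)."""
    if d_mid == 0:
        return 1  # no transition on the middle wire
    # Eq. (4) gives a relative delay of 1 + c*lambda with c in {0, ..., 4}.
    c = 2 - d_mid * (d_left + d_right)
    return 2 + c


def analyzer_value(cls):
    """Value sent to the Local Counter: 3 for classes 1-3, else the class."""
    return max(3, cls)


def saved_cycles(analyzer_values):
    """Local Counter: each value i in {3, 4, 5, 6} contributes (6 - i)."""
    return sum(6 - i for i in analyzer_values)


def update_global_counter(n_g, n_l, x, x_new):
    """Eq. (5): N_g + N_l - ceil((x' - x) * N_l / x)."""
    return n_g + n_l - math.ceil((x_new - x) * n_l / x)


def next_voltage(v, n_g, saving_exceeds_overhead):
    """Greedy 0.1 V step, kept within [V_MIN, V_MAX]."""
    eps = 1e-9  # tolerance for floating-point voltage comparisons
    if n_g > DELTA_1 and saving_exceeds_overhead and v - STEP_V >= V_MIN - eps:
        return round(v - STEP_V, 1)
    if n_g < DELTA_2 and v + STEP_V <= V_MAX + eps:
        return round(v + STEP_V, 1)
    return v
```

For example, the pattern ↓↑↓ corresponds to `crosstalk_class(-1, 1, -1)` and falls in class 6, so it contributes no saved cycles, while ↑↑↑ falls in class 2 and contributes three. Passing the energy comparison in as a boolean keeps the sketch independent of any particular regulator model.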
We consider the upper and lower limits for the supply voltage as 1.2V and 0.8V, respectively, and $\delta_1$ and $\delta_2$ as 4999 and 0, respectively. We adjust the clock frequency in accordance with the supply voltage scaling. On-chip voltage regulators take a few µs to change from one voltage to another [11]. We assume that the original frequency of the bus is 1.5GHz, the voltage regulator takes 1µs to adjust the voltage by 10mV [6], and $t_{bus}$ (15) cycles are needed for a frequency transition [13]. Thus, the voltage regulator takes 15000 cycles (10µs at 1.5GHz) to adjust the voltage by 0.1V. As we use a greedy method when applying the DVS technique (i.e., voltage and frequency are scaled based on the clock cycles saved so far), even though the voltage regulator takes 15000 cycles to adjust the voltage by 0.1V, there is no need for a prediction technique to estimate the bus transition patterns. Although frequency transitions are much faster than voltage transitions, the bus is disabled during a frequency transition in order to avoid timing uncertainty when the receiver is tracking the input clock.

Energy overhead due to a transition from initial voltage $V_1$ to final voltage $V_2$ is calculated by [18]

$E_{VR}^{(V_1,V_2)} = (1 - \eta) C |V_2^2 - V_1^2|$    (6)

where $C$ is the filter capacitance of the power supply regulator and $\eta$ is the power efficiency. In our experimental setup, we assume $C$ as 5µF and $\eta$ to be 94%. It is clear from Equation (6) that $E_{VR}^{(V_1,V_2)} = E_{VR}^{(V_2,V_1)}$. Note that the energy overhead calculated using Equation (6) is included in the calculation of overall energy savings.

4. Experimental Validation

We validate our technique by simulating 22 SPEC2000 CPU benchmarks [2] using the Simplescalar 3.0 simulator [4]. For each benchmark, we fast forward 100 million instructions and then simulate the next 300 million instructions. Table 2 shows the benchmark-wise number of 32-bit data items transmitted on the L1 cache address/data bus.

We first discuss the energy, area, and latency overhead due to the extra circuitry used to implement our technique. As we transmit data using the VCT mechanism, we design the codecs used in the VCT technique [8] in Verilog and synthesize them using the Synopsys Design Compiler with the 90nm TSMC technology library. The Berkeley interconnect model [1] is used to calculate the ground capacitance and coupling capacitance of the interconnect. In our experimental results, we consider metal layer 4 wire parameters as shown in Table 3 and a wire length of 5mm. Energy and area overheads of the VCT codecs along with the base case are shown in Table 4. Note that there is no latency overhead due to the Crosstalk Class Analyzer used in the VCT technique, as its latency is overlapped with the propagation delay of the Ready signal [8], and the delay associated with performing activities related to voltage regulation is overlapped with the data transmission delay.

Figure 3. Energy savings of our technique w.r.t. the base case.

Figure 4. Delay savings of our technique w.r.t. the base case.

We now discuss the effect of voltage scaling on energy consumption. Energy consumption per data transmission in the base case is given by

$E_{base} = (\alpha_s C_L + \alpha_c C_I) V_{DD}^2$    (7)

where $\alpha_s$ and $\alpha_c$ represent the average self and coupling transitions, respectively. Energy consumption per data transmission in our technique is given by

$E_{VCT+DVS} = \frac{\sum_{i=1}^{n} (\alpha_{si} C_L + \alpha_{ci} C_I) V_i^2}{n} + E_{OH}$    (8)

where $E_{OH}$ is the energy overhead due to the voltage regulator and the VCT circuitry, $n$ is the number of samples considered, $\alpha_{si}$ and $\alpha_{ci}$ represent the average self and coupling transitions, respectively, and $V_i$ is the voltage at the $i$th sample period. As the voltage may change in each sample period depending on the Global Counter value, we consider different voltages ($V_i$) and use the average number of self ($\alpha_{si}$) and coupling ($\alpha_{ci}$) transitions that occurred during the $i$th sample period to calculate the energy consumption during the sample period.
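The energy expressions above translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, Equation (8) is taken as the sum over sample periods of (α_si·C_L + α_ci·C_I)·V_i², divided by n, plus E_OH, and any capacitance or transition-rate values passed in are placeholders rather than measured data. The defaults for the filter capacitance (5µF) and regulator efficiency (94%) are the values stated in the text.

```python
def regulator_overhead(v1, v2, c_filter=5e-6, eta=0.94):
    """Eq. (6): regulator energy (J) for a transition from v1 to v2."""
    return (1.0 - eta) * c_filter * abs(v2 ** 2 - v1 ** 2)


def energy_base(alpha_s, alpha_c, c_l, c_i, v_dd):
    """Eq. (7): base-case energy per data transmission."""
    return (alpha_s * c_l + alpha_c * c_i) * v_dd ** 2


def energy_vct_dvs(samples, c_l, c_i, e_oh):
    """Eq. (8): samples is a list of (alpha_si, alpha_ci, V_i) tuples."""
    total = sum((a_s * c_l + a_c * c_i) * v ** 2 for a_s, a_c, v in samples)
    return total / len(samples) + e_oh


def savings_percent(e_new, e_base):
    """Percentage saving of e_new relative to e_base."""
    return (1.0 - e_new / e_base) * 100.0
```

Because Equation (6) depends only on the absolute difference of the squared voltages, `regulator_overhead(1.2, 1.1)` and `regulator_overhead(1.1, 1.2)` return the same value, which mirrors the symmetry noted in the text.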
The total number of self (coupling) transitions during the entire execution is equal to the sum of the self (coupling) transitions over all sample periods, i.e., $\alpha_s = \sum_{i=1}^{n} \alpha_{si}$ and $\alpha_c = \sum_{i=1}^{n} \alpha_{ci}$. Energy overhead due to the voltage regulator and the VCT circuitry is given by

$E_{OH} = \frac{\sum_{i=0.8}^{1.1} n_{(i,i+0.1)} E_{VR}^{(i,i+0.1)}}{m} + t \times E_{Codec}$    (9)

where $n_{(i,j)}$ is the number of times the voltage is changed from $i$ to $j$, $m$ is the total number of data items transmitted, $E_{Codec}$ is the energy overhead due to the VCT circuitry (as shown in Table 4), and $t$ is a multiplicative factor, which is used to account for the energy overhead due to the comparator circuit. Note that we use a comparator circuit in our experiments to check whether or not the energy savings due to voltage scaling are more than the energy overhead due to voltage regulation. In our experiments we consider $t$ as 1.1. Using the above equations, we now give a formula to calculate the percentage of energy savings ($E_{save}$) obtained through our technique w.r.t. the base case as:

$E_{save} = \left(1 - \frac{E_{VCT+DVS}}{E_{base}}\right) \times 100$    (10)

Figure 3 shows benchmark-wise energy savings obtained through our technique w.r.t. the base case. Energy savings are almost uniform across different benchmarks in the address bus case, ranging from 27.53% to 31.66% with an average value of 30%. Variation in the energy savings across different benchmarks is small, as data transition pattern behavior is almost uniform across all benchmarks (refer to Figure 2). Energy savings in the data bus case are also almost uniform across different benchmarks except for benchmarks "Applu" and "Galgel". In the case of the "Applu" and "Galgel" benchmarks, our technique achieves 14% and 34% energy savings, respectively, while for all other benchmarks, our technique achieves nearly 45% energy savings.

As voltage and frequency scaling can affect data transmission delay, we now discuss the delay savings achieved by our technique w.r.t. the base case. We calculate the percentage of delay savings ($d_{save}$) obtained through our technique by using the following formula:

$d_{save} = \left(1 - \frac{\sum_{i=1}^{n} k_i - \sum_{i=0.8}^{1.1} n_{(i,i+0.1)} t_{bus}}{k}\right) \times 100$    (11)

where $k$ is the total number of cycles required for data transmission in the base case and $k_i$ is the total number of cycles required during the $i$th sample period for data transmission in our technique. In Equation (11), the overhead due to bus disabling, $t_{bus}$, is considered for every frequency transition. Figure 4 shows benchmark-wise delay savings obtained through our technique w.r.t. the base case. Delay savings in the address bus are within the range of 8.50% to 10.64% with an average value of 9.08%, while in the data bus, they range from −0.12% to 28.63% with an average value of 13.69%.

We now consider the normalized energy-delay-product (EDP) of our technique w.r.t. the base case, as shown in Figure 5. It is clear from the figure that our technique reduces the EDP by 34% and 52% for address and data buses, respectively. Note that it is easy to show that the EDP of our technique is less than that of the VCT technique. For instance, the VCT technique achieves 31.5% delay savings in the data bus [8] and incurs some energy penalty (due to the codec energy overhead of the VCT technique) as compared to the base case, and hence its EDP reduction is nearly 30%. The sensitivity analysis with a wire length of 4mm yielded energy savings of 23.68% and 42.70% and EDP reduction of 30.6% and 50.5% for address and data bus, respectively.

Figure 5. Energy-delay-product of our technique w.r.t. the base case.

5. Conclusion

In this paper, by exploiting variable cycle transmission and DVS mechanisms, we proposed a novel technique for energy-efficient on-chip interconnect design. As part of the application of the DVS technique, we scaled both supply voltage and frequency. The delay savings provided by the variable cycle transmission mechanism are exploited when applying voltage scaling, so that we obtained significant energy savings as well as delay savings without impacting the throughput. We validated the effectiveness of our technique by focusing on the L1 cache address/data buses of a microprocessor using the SPEC CPU2000 benchmark suite and showed that for a 5mm interconnect wire our technique achieves 30% and 45% energy savings over the base case in the address bus and data bus, respectively. We also demonstrated that our technique reduces the energy-delay-product significantly as compared to the base case.

References

[1] Berkeley predictive technology model. http://www-device.eecs.berkeley.edu/~ptm/interconnect.html
[2] SPEC CPU2000 Benchmark. http://www.spec.org
[3] A.R. Brahmbhatt, J. Zhang, Q. Wu, and Q. Qiu. "Low-power bus encoding using an adaptive hybrid algorithm". DAC, 2006, pp. 987-990.
[4] D.C. Burger and T.M. Austin. The SimpleScalar tool-set, version 2.0, Technical Report 1342, Department of Computer Science, UW, 1997.
[5] A. Chandrakasan, S. Sheng, and R. Brodersen. "Low-power CMOS digital design". JSSC, 27(4), 1992, pp. 473-484.
[6] H. Kaul, D. Sylvester, D. Blaauw, T. Mudge, and T. Austin. "DVS for on-chip bus designs based on timing error correction". DATE, 2005, pp. 80-85.
[7] K.-W. Kim, K.-H. Baek, N. Shanbhag, C.L. Liu, and S.M. Kang. "Coupling-driven signal encoding scheme for low-power interface design". ICCAD, 2000, pp. 317-321.
[8] L. Li, N. Vijaykrishnan, M. Kandemir, and M.J. Irwin. "A Crosstalk Aware Interconnect with Variable Cycle Transmission". DATE, 2004, pp. 102-107.
[9] D. Liu et al. "Power consumption estimation in CMOS VLSI chips". IEEE JSSC, 1994, 26, pp. 663-670.
[10] L. Macchiarulo, E. Macii, M. Poncino. "Low-Energy Encoding for Deep-Submicron Address Buses". ISLPED, 2001, pp. 176-181.
[11] M. Meijer, J. Pineda de Gyvez, and R. Otten. "On-Chip Digital Power Supply Control for System-on-Chip Applications". ISLPED, 2005, pp. 311-314.
[12] T. Sakurai and A.R. Newton. "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas". IEEE JSSC, 1990, 25(2), pp. 584-594.
[13] L. Shang, L.S. Peh, N.K. Jha. "Dynamic Voltage Scaling with Links for Power Optimization of Interconnection Networks". HPCA, 2003, pp. 91-102.
[14] P. Sotiriadis and A. Chandrakasan. "Low Power Bus Coding Techniques Considering Inter-wire Capacitances". CICC, 2000, pp. 507-510.
[15] P. Sotiriadis and A. Chandrakasan. "Bus energy minimization by transition pattern coding (TPC) in deep submicron technologies". ICCAD, 2000, pp. 322-327.
[16] P. Sotiriadis and A. Chandrakasan. "Reducing Bus Delay in Sub-micron Technology using Coding". ASPDAC, 2001, pp. 109-114.
[17] S.R. Sridhara, A. Ahmed, and N.R. Shanbhag. "Area and Energy-efficient Crosstalk Avoidance Codes for On-chip Buses". ICCD, 2004, pp. 12-17.
[18] A. Stratakos. High-efficiency low-voltage DC-DC conversion for portable applications. Ph.D. Thesis, University of California, Berkeley, 1998.