IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013 1797 Embedded Transition Inversion Coding With Low Switching Activity for Serial Links Ching-Te Chiu, Wen-Chih Huang, Chih-Hsing Lin, Wei-Chih Lai, and Ying-Fang Tsao Abstract— Serial link interconnection has been proposed for its advantages of reducing crosstalk and area. However, serializing parallel buses tends to increase bit transition and power dissipation. Several coding schemes, such as serial followed by encoding (SE) and transition inversion coding (TIC), have been proposed to reduce bit transition. TIC is capable of decreasing transitions by 15% compared to the SE scheme, but an extra indication bit is added in every data word to represent inversion occurrence. The extra bit increases the transmission overhead and the bit transitions. This paper proposes an embedded transition inversion (ETI) coding scheme that uses the phase difference between the clock and data in the transmitted serial data to tackle the problem of the extra indication bit. The ETI coding scheme reduces the transition by up to 31% compared to SE scheme. The analysis and simulation results indicate that the proposed coding scheme produces a low bit transition for different kinds of data patterns. Using the optimum degree of multiplexing, width, and spacing, the ETI coding scheme achieves 30%–60% energy reduction compared with the parallel bus without overhead. Taking circuit overhead into consideration, the power saving is up to 31.71% and 26.46% at a clock cycle of 250 ps for the 90- and 130-nm CMOS technology for m = 2 where m is the number of parallel wires multiplexed into a serial link. Index Terms— Coding techniques, low switching activity, phase detector, serial link. I. I NTRODUCTION A DVANCED silicon technology offers the possibility of integrating hundreds of millions of transistors into a single chip, which makes system-on-chip (SoC) design possible. With the continuous scaling of silicon technology, area and power dissipation of interconnects are one of the main bottlenecks for both on-chip and off-chip buses. Multiplexing parallel buses into a serial link enables an improvement in terms of reducing interconnect area, coupling capacitance, and crosstalk [1], but it may increase the overall switching activity factor (AF) and energy dissipation. Therefore, an efficient coding method that reduces the switching AF is an important issue in serial interconnect design. Many studies attempt to reduce the AF of parallel buses. For example, Stan and Burleson [2] introduced a bus-invert Manuscript received December 14, 2011; revised May 16, 2012; accepted August 31, 2012. Date of publication October 26, 2012; date of current version September 9, 2013. This work was supported by the National Science Council, Taiwan, under Contract NSC 101-2220-E-007-019 and Contract NSC99-2220E007-019. The authors are with the Department of Computer Science, Institute of Communications Engineering, National Tsing Hua University, Hsinchu 300, Taiwan (e-mail: ctchiu@cs.nthu.edu.tw; g9762594@oz.nthu.edu.tw; chihhsinglin@gmail.com; and y751026@hotmail.com; likeet171819@gmail. com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2012.2219888 method that transmits the original or inverted pattern to minimize the switching activity. Researchers have proposed many techniques to improve the bus-invert coding method, such as the partial bus-invert coding [3] and weight-based businvert coding methods [4]. The schemes mentioned above use an extra channel to send the inversion indication signal. Kuo et al. [5] proposed the serial coding technique to solve the extra channel problem. They append extra information bits to the back of the original data word. Although this approach resolves the area overhead problem, it increases data latency. Three level differential encoding is proposed for parallel bus [6] to enable multiple drivers at the transmitter and to recycle the same current and reduce power consumption [6]. Joint crosstalk avoidance code and error correction code are proposed to reduce the power in parallel bus [7]. Huang et al. [8] further proposed combining serializing bus with the joint crosstalk avoidance code and error correction code to reduce the power. Serialized low-energy transmission (SILENT) [1] is a coding method used in reducing the switching activity for serial links. This approach encodes every single bit in the parallel bus using the XOR gate, and multiplexes the encoded parallel buses into a serial link. The XOR operation sets an adjacent bit with the same value to zero. The greater the correlation is, the more zeros the encoder produces. This method is designed for data with strong correlation. Bharghava et al. [9] proposed the transition inversion coding (TIC) technique to reduce switching activity for random data and to detect errors. Their technique counts the transitions in the data word, and inverts the transition states if the number of transitions in a data word is more than half of the word length. The scheme sets the current bit in the serial stream to be the same as the previous encoded bit when there is a transition. Otherwise, it is set to the inversion of the previous encoded bit. A transition indication bit is added in every data word. This extra bit not only increases the number of transmitted bits, but also increases the transitions and latency. Lee et al. [10]– [12] used serial links as communication channels on an on-chip network architecture for SoC. The serial links reduce the area of communication channels by 57% compared with a nonserialized approach [10]. This approach also reduces the switch activity because the coupling capacitance of the interconnect wires decreases [10]–[12]. Forward error correction (FEC) code is used to reduce the serial link power by trading off the FEC coding gain with specifications on transmit swing, analog-to-digital converter precision, jitter tolerance, receive amplification, and by enabling higher signal constellations [13]. By combining the single and differential signals and 8B/10B coding, a high 1063-8210 © 2012 IEEE 1798 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013 speed phase tracking clock recovery is developed for serial link to reduce the power [14]. To utilize the relation between data in a link, the encoder reorders or shuffles the M bits in the parallel bus such that the encoded data are less correlated or silent [15]. Then the encoded data are serialized with one control bit. The silent coding with shuffling can reduce the power dissipation compared to the silent coding. A serial link on-chip bus architecture is proposed to lower interconnect power [16]. Serialization reduces the number of wires and leads to a larger interconnect width and spacing. A large interconnect spacing reduces the coupling capacitance, while the wider interconnects reduce the resistivity. A significant improvement in the interconnect energy dissipation is achieved by applying different coding schemes and their proposed multiplexing techniques. However, the power reduction decreases when the degree of multiplexing increases. This paper proposes the embedded transition inversion (ETI) coding scheme to solve the issue of the extra indication bit [17]. This scheme eliminates the need of sending an extra bit by embedding the inversion information in the phase difference between the clock and the encoded data. When there is an inversion in the data word, a phase difference is generated between the clock and data. Otherwise, the data word remains unchanged and there is no phase difference between the clock and the data. This ETI coding scheme reduces transition by 31% compared with the SE scheme. The improvement of transition reduction is 19% compared with that of the TIC. The receiver side adopts a phase detector (PD) to detect whether the received data word has been encoded or not. Statistical analysis and experimental results show that the proposed coding scheme has low transitions for different kinds of data patterns. Increasing the interconnect spacing reduces the coupling capacitance for on-chip buses. We also study optimization on degree of multiplexing, width, and spacing when serializing parallel buses. Using the optimum degree of multiplexing, width, and spacing, the proposed ETI coding scheme reduces energy consumption from 30% to 60% compared to the parallel buses. Compared the works in [17], we add the enhanced ETI architecture, link width and spacing selection scheme, word length and energy analyses, and detail switching activity analyses and new energy simulation results. The remainder of this paper is organized as follows. Section II presents the ETI coding scheme. We address the ETI architecture that includes the encoder and decoder in Section III. Section IV provides the AF analysis for ETI under degree of multiplexing and the word length. Section V presents the energy analysis by considering the interconnect width, spacing, and capacitance. Finally, Section VI presents experimental results, and Section VII gives a brief conclusion. II. ETI C ODING S CHEME A. Background AF analysis for different kinds of coding schemes appears in [16]. There are four types of bus schemes, including (a) (b) Fig. 1. (a) Four types of bus schemes: P, S, ES, and SE. (b) Overall AFav for m = 2 versus the sum of the average AFs from two parallel bitstreams (AF1,av = AF2,av ) [16]. parallel (P), serial (S), encoding followed by serial (ES), and serial followed by encoding (SE) as shown in Fig. 1(a). Their average AFs are AF P,av , AF S,av , AFES,av , and AFSE,av , respectively. The bitstream in this paper refers to the transmitted data in each wire of the input parallel bus. The AF analysis results of serializing two bitstreams into a single output is shown in Fig. 1(b). AF1,av and AF2,av are the AFs of the bitstream in the first and the second bitstreams. The AF of each bitstream AFi,av (i = 1, 2) can be viewed as the probability of bit switching within a cycle time. The bit switching in one bitstream within a cycle time could be 00, 01, 10, and 11. We find all the combinations of serializing two bitstreams within a cycle time (i.e., 0000, 0001,...etc., total combinations are 16). Then we find the AF of the serial link for all possible joint switching scenarios. For example, the AF of the serial link with bit pattern 0001 is one. The switching probabilities of the input bitstreams are assumed statistically independent. The probability of joint switching event in the serial link is the product of the probability of each of the individual events in each bitstream. The probabilities of two bitstreams both switching, only one switching, and both not switching are AF1,av AF2,av , AF1,av (1 − AF1,av ), AF2,av (1−AF1,av ), and (1−AF2,av )(1−AF2,av ) [16]. CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY The average AFs AF P,av , AF S,av , AFSE,av , and AFES,av of the four schemes are obtained by summing products of the AF after serialization and its probability for all the combinations. The average AF AF P,av equals to AFSE,av , and is the summation of the AFs of the first and the second bitstreams AF1,av + AF2,av . The average AF AF S,av equals to one. The average AF AFES,av is a downward parabolic curve. The average AFs of the four schemes AF P,av , AF S,av , AFES,av , and AFSE,av are considered under three cases. The summation of the first and second bitstream AF (AF1,av + AF2,av ) is smaller than, equal to, and greater than one. Note that the AFs of the ES, SE, and P schemes are smaller than S when the summation of the AF is smaller than one (AF1,av + AF2,av < 1). That means serializing bitstreams (S) may increase the overall AF. The SE and P schemes reduce more switching activity than ES when each individual AF is small. When the summation of the AF equals to one (AF1,av + AF2,av = 1), the AFav values of SE and ES schemes equal to AF S,av . This means that the reduction of average activity for both coding schemes is zero. When the summation of the AF is greater than one (AF1,av + AF2,av > 1), the AF of the ES scheme is smaller than SE and provides the lowest overall AF. Although the AFES,av is very low at the two boundaries (AF1,av + AF2,av = 0 or 2), the probability of AF1,av + AF2,av = 0 or 2 is approximately zero because it is rare that the bitstreams are always switching or always quiet in normal applications. The AF of individual bitstream depends on the application. For the pseudo random binary sequence bitstreams commonly used in transceiver design, the average AF of each bitstream is between 0.2 and 0.7. The average activity used in microprocessors or video processing varies from 0.1 to 0.3 [16] and 0.25 to 0.35 [1]. Assuming that the bitstreams are statistically independent, then the sum of the AFs of two individual bitstream is between 0.2 and 1.4. The ES scheme is a good choice for applications, such as microprocessors or video processing based on the comparison of the four schemes shown in Fig. 1. The ES and SE coding schemes in Fig. 1 are both based on XOR operation that is suitable for bistreams with distinct values in adjacent bits. By replacing the XOR operation with our proposed ETI coding in the SE scheme, the switching activity of our proposed ETI scheme is lower than the ES scheme for microprocessors or video processing. B. Proposed ETI Scheme Although many coding algorithms can reduce the switching AF, most of them are designed for specific applications, such as video streaming or strongly correlated data. The TIC is one of the methods developed for random data [9]. This method adds a transition indication bit to every data word to indicate if there is an inversion or not. This inversion coding is performed on every bit of two consecutive bits in the serial stream. The extra indication bit increases the switching activity. This paper proposes the ETI coding scheme that operates on a two-bit basis and removes all the transition indication bits. 1799 Fig. 2. n/m ETI serial links with n input bitstreams under degree of multiplexing m. Fig. 3. ETI coding scheme for one serial link, word length = WL, Nth = W L/2, and number of transition = Nt . An n/m ETI serial links with n input bitstreams under degree of multiplexing m is shown in Fig. 2. Each serial link has m input bitstreams that are multiplexed by a serializer, followed by the ETI encoding. The encoded stream is transmitted through the serial link and followed by the ETI decoding and a deserializer. The ETI coding scheme includes the inversion coding and phase coding as shown in Fig. 3. 1) Inversion Coding: Define the word length (WL) as the number of bits in a data word and a threshold Nth as half of WL. A transition is defined as a bit changing from zero to one or from one to zero. For example, the bitstream “0100” has two transitions while “0101” has three transitions. When the number of transitions Nt in a data word exceeds the threshold Nth , the bits in the data word should be encoded. Otherwise, the data word remains the same. When an encoding is needed in a data word, this method checks every two-bit in the data word, as Fig. 3 shows. Every two bit in the serial stream is combined as a base to be encoded. In this case, the b11 b21 is a base and the b31 b41 is another base. The 2-bit in a base is denoted as b1 b2 and the encoded output is denoted as be1 be2 . When the Nt in a data word is less than Nth , b1 b2 remains unchanged. Otherwise, we perform the inversion coding and the phase coding. For the inversion coding, the bitstreams “01” and “10” are mapped to “00” and “11,” respectively. The bitstreams “00” and “11” are mapped to “01” and “10,” respectively. For the phase coding, we embed the inversion information in the phase difference between the clock and the encoded data. 1800 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013 TABLE I A LL C OMBINATIONS OF T WO B ITSTREAMS FOR THE TIC, ETIpre , AND ETI C ODING S CHEMES Parallel Streams Stream1 Stream2 Serial Stream (TIC) Serial Stream (ETIpre ) Serial Stream (ETI) b11 b12 b21 b22 b11 b21 b12 b22 bex b11 b21 b12 b22 b11 b21 b12 b22 00 00 0000 0 0000 0000 00 01 0001 0 0001 0001 00 10 0001 1 0001 0001 00 11 0000 1 0000 1000 01 00 0111 1 0111 0111 01 01 0011 0 0011 0011 01 10 0011 1 0011 0011 01 11 0111 0 0111 0111 10 00 1000 0 1000 1000 10 01 1100 1 1100 1100 10 10 1100 0 1100 1100 10 11 1000 1 1000 1000 11 00 1111 1 1111 0111 11 01 1110 1 1110 1110 11 10 1110 0 1110 1110 11 11 1111 0 1111 1111 (a) (b) (c) (d) Fig. 4. Dout “1000” obtained from (a) Din “1000” without encoding, (b) from Din “1101” with encoding. Dout “0001” obtained from (c) Din “0001” without encoding, and (d) Din “0100” with encoding. The inversion encoding operation can be expressed as be1 = b1 b2 , with Nt < Nth be2 = !b2 , with Nt ≥ Nth . Fig. 5. The inversion decoding operation for the decoded output bd1bd2 is bd1 = be1 be2 , with Nt < Nth bd2 = !be2 , with Nt ≥ Nth . Waveform example for special data words (“0000” and “1111”). (1) (2) Since this operation is on a two-bit basis and only the second bit is inverted, it is called bit-two inversion (B2INV). 2) Phase Coding: The ETI coding uses the phase difference between the data and the clock to encode the indication information. Table I shows the corresponding output data word after TIC, ETIpre , and ETI. The ETIpre has the same data word as the TIC, except that it removes the extra bit bex . Removing the bex leaves eight sets of data words that are exactly the same. For example, there are two “1000” data words after the ETIpre coding. Within every data word duration, the phase difference between the data and the clock distinguishes these two data words, as Fig. 4 illustrates. Same Dout “1000” in Fig. 4(a) and (b) is obtained from Din “1000” and “1101” without and with inversion. A half clock cycle difference between Dout and CK is shown in Fig. 4(b), indicating that Din has been encoded. The Dout and CK are aligned in Fig. 4(a), indicating that Din has not changed. Dout “0001” is the same in Fig. 4(c) and (d) from Din “0001” and “0100” without and with inversion. This approach is able to identify whether Dout has been encoded or not as long as there is a half cycle delay between the Dout and CK. Although the phase difference can distinguish most of the data words of ETIpre , this method cannot be used for “0000” or “1111” because there is no transition inside the data word. Under the inversion condition for these two data words, the “0000” and “1111” change to “1000” and “0111”, as shown in the last column of Table I. The corresponding waveforms are shown in Fig. 5 for these two data words. The first bit of Dout in the “1000” and “0111” is aligned with CK and the duration of the bit is only half of the clock cycle. III. ETI A RCHITECTURE A. ETI Encoder The overall architecture of the ETI scheme is shown in Fig. 6(a). We add the ETIpre block and the TIC architecture for clarity. The ETIpre does not provide the decision bit information so it cannot be decoded in the receiver. The ETIpre encoder is shown by the dashed box in the ETI encoder in Fig. 6(a). The TIC counts the transitions in the data word then uses this information to perform encoding. The transition indication bit is added to every data word to indicate whether there is an inversion or not. The decoder adopts the transition indication bit to perform the decoding, as shown in Fig. 6(b). In the ETI encoder part, the input data Din are stored in the buffer to wait until the check transition operation is completed. The transition and threshold in a data word are used to set the decision bit. The decision bit is used to control the encoding process in the B2INV and the phase encoder block. When the decision bit is set to zero, the B2INV passes the noninverted bitstream. Otherwise, the bitstream is encoded. CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY 1801 (a) (b) Fig. 6. (a) Overall architecture of the ETI scheme. (b) Overall architecture of the TIC scheme. ETIpre encoder is marked by dashed line box. The decision bit is also adopted in the phase encoder block to select the phase encoded or un-encoded data word. In the ETI decoder part, the phase decoder checks the phase difference between the clock and the data. The phase difference information is then used to generate the decision bit. The decision bit is used in the B2INV to decode the data words. The ETI encoder includes the check transitions block, buffer, B2INV, and phase encoder. The check transition block is shown in Fig. 7(a). The WL indicator block counts the length of the data word and generates a high signal at the first bit of the data word. This signal is used to reset the adder and the D-flip-flop (D-FF). The D-FF stores the previous bit that is used to XOR with the current bit for transition checking. The adder block calculates the number of transition in a data word and sets the decision bit to high when the Nt ≥ Nth . The check transition block in Fig. 7(a), which is used to detect the number of the transitions, is part of the ETI encoder. Compared to the ETI encoder, the synthesis results show that the hardware area overhead of the check transition block are 32.34% and 34.96%, respectively, for serial link running at 500 MHz and 4 GHz under 90-nm CMOS technology. The power overhead are 23.41% and 20.28%, respectively. The architecture of the B2INV is shown in Fig. 7(b). The divider, which divides the clock by two, provides an indication signal for the first or second bit in a pair of bitstream b1 b2 . According to (1) and (2), if it is the first bit, then the bit passes through. Otherwise, the bit is inverted when the decision bit is high. Equations (1) and (2) show that the decoding block of ETI is exactly the same as the encoding block in Fig. 7(b). (a) Fig. 7. (b) (a) Check transition block. (b) B2INV block. Compared with the coding block in TIC, the B2INV has less gate count and power consumption. The TIC uses the states between any two consecutive bits to set the encoded bit. Two D-FFs are required to store the states of two consecutive bits. The TIC coding circuit includes two D-FFs, two XORs, and two inverters [9]. The B2INV block needs only one D-FF, two inverters, one AND, and one MUX. Only the second bit in the data word needs to be inverted so the ETI needs only one D-FF. A MUX is smaller than a D-FF so the B2INV consumes less gate count compared with the TIC. Using TSMC CMOS 130-nm technology, the area of TIC and B2INV coders are 2178 and 1232 μm2 . The B2INV implementation cuts the area by 43% compared with the coding block of TIC. As shown in Fig. 8, the phase generator is used to generate phase difference between the encoded data (Dpre ) and the clock (CK) at each data word. Depending on the encoded data, 1802 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013 (a) Fig. 8. Phase encoder block. TABLE II A REA AND P OWER OF ETI E NCODER I NDIVIDUAL (b) B LOCK U NDER 90-nm CMOS T ECHNOLOGY Area ETI Encoder Subblocks Fig. 9. (a) Alexander PD and (b) clock and phase difference for determining the coding state. Power 0.5G 4G 0.5G (μm2 ) (%) (μm2 ) (%) (μW) (%) 4G (μW) (%) Check transition 163 32.34 136 34.96 27.43 23.41 236.64 20.28 B2INV 84 16.7 92 23.65 25.65 21.9 277.22 23.76 Phase encoder 121 24 76 19.54 30.7 274.54 23.53 Buffer 136 26.98 85 21.85 33.37 28.48 Total 504 100 117.15 100 389 26.2 378.2 34.42 100 1166.62 100 there are three types of phase encoding: the one cycle delay [Fig. 4(a) and (c)], the half cycle delay [Fig. 4(b) and (d)], and the special data word (Fig. 5). The phase encoding of the one cycle delay is shown by the first path in Fig. 8. The half cycle delay and the special data word are shown by the second and the third path. In the special data word, the predefined Flag signal is used to present the special data pattern. The “check all 0 and all 1” block is used to identify the special data word when the encoded data (Dpre ) are “0000” or “1111.” If the encoded data are not the special data word, the second path is selected from the MUX1. Otherwise, the third path is selected. The decision bit then selects the data from the first path or the output of the MUX1. The summary of area and power in individual blocks in the ETI encoder is shown in Table II. B. ETI Decoder The ETI encoder generates the phase difference between the clock and the data word. Normally, a PD identifies an early or delayed phase. A variety of PDs could detect the phase difference. This paper adopts the commonly used Alexander PD [18]. The Alexander PD architecture is shown in Fig. 9(a), which uses three consecutive clock edges to generate four sampling signals (S0, S1, S2, and S3). The PD is controlled by the clock CK and input data Din . When the clock CK and TABLE III M EANING OF D IFFERENT P HASES S1 ⊕ S2 High Low Low High S2 ⊕ S3 Low Low High High Clock Early No transition Late Special case Coding state Has not been encoded Has been encoded input data Din are valid, the PD is activated to identify the phase relation between the clock and the data. The PD can determine whether a data transition exists from the condition that the clock leads or lags the data. The basic waveform is shown in Fig. 9(b) to judge the un-inverted, inverted, no transition, or the special data word. If the clock leads the data (early conditions), the signal S1 ⊕ S2 is high and the S2 ⊕ S3 is low. Conversely, if the clock lags the data (late conditions), the signal S1 ⊕ S2 is low and S2 ⊕ S3 is high. Thus, S1 ⊕ S2 and S2 ⊕ S3 could provide the clock and data relation as shown in Table III. The early signal is for the un-inverted data and the no transition represents the un-encoded the “all zero or all one” data word. The late signal represents the case in which data have been inverted in encoder. The last case is for the encoded the “all zero or all one” data word. The decision bit is generated based on the phase information on the S1 ⊕ S2 and S2 ⊕ S3. The decision bit is used in the B2INV for the decoding. The decoding operator in the B2INV is the same as that in the encoder. Two D-FFs are added in the front of the B2INV block for buffering and alignment. A larger bandwidth is needed in the ETI coding scheme due to phase shift. The last bit has half the pulse width of the other bits so that the interconnect has twice the bandwidth. It means that the serial link needs to CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY 1803 TABLE IV AF S FOR THE ETI B ITSTREAM Parallel Streams Stream1 Stream2 b11 b12 b21 b22 Fig. 10. Cycle-based AF analysis. run at a much frequency. The higher clock frequency leads to problems, such as buffering, clock synchronization, and design complexity. The other way is to wait an extra bit to check the transition information but that would lower the overall bit rate. The disadvantage of this method is added in the revised paper. We send both data and clock from TX to RX in our proposed scheme. In our simulation, we assume the channel length of the clock and data links is equal and the phase skew between them is negligible. A PLL or CDR is usually needed in the receiver to resynchronize the deskew data. Only one global PLL or CDR is needed in the serial link to recover the deskew data then perform the ETI decoding and demultiplexing. The estimated power overhead of PLL or CDR is about 10%–25% of the total receiver power [19]– [21]. IV. A NALYSIS OF AF A. AF Analysis With Degree of Multiplexing Consider the case of serializing m bitstreams into a serial link and denote the AF of each bitstream as AFi , where i ranges from 1 to m. The probability that a wire has a “rise” or “fall” switching activity is AFi,av /2, while the probability that a wire is quiet at 0 or 1 at (1−AFi,av )/2. The AF of the serial link is defined as the number of transitions per clock cycle, and can be obtained from the summation of the transition in T2 and the transitions between T1 and T2 as Fig. 10 shows. For example, the AF of the serial stream “0011” is one. The fourth column of Table IV shows the AF values obtained by the ETI encoder for m = 2, while the fifth column is the probability of the corresponding parallel streams. Note that the AF values of the serial stream “1000” and “0111” are set to one in Table IV. This AF value accounts for the extra transitions produced at T1 . The average AFav of the ETI stream can be obtained by summing the products of AF and its probability for all the combinations. The average AF of ETI coding AFETI,av for m = 2 is AF E T I,av = m 0.25 + 0.25AF2i,av , m = 2. (3) i=1 The average AFav of different coding schemes for m = 2 is shown in Fig. 11(a). As mentioned in the previous section, it is rare that AF1,av + AF2,av equal to zero or two. Although the ETIpre coding scheme has a lower AF than ETI, the ETIpre scheme fails in decoding due to lack of indication bit. Compared with AF S,av , the reduction of AFETI,av is from 40% to 50% when the AF of AF1,av + AF2,av is between Serial Stream (ETI) b11 b21 b12 b22 AF P(x0.25) 00 00 0000 0 (1 − AFi,av )(1 − AF i,av ) 00 01 0001 1 (1 − AFi,av )AFi,av 00 10 0001 1 (1 − AFi,av )AFi,av 00 11 1000 1 (1 − AFi,av )(1 − AF i,av ) 01 00 0111 0 AFi,av (1 − AFi,av ) 01 01 0011 1 AFi,av AF i,av 01 10 0011 1 AFi,av AF i,av 01 11 0111 0 AFi,av (1 − AFi,av ) 10 00 1000 0 AFi,av (1 − AFi,av ) 10 01 1100 1 AFi,av AF i,av 10 10 1100 1 AFi,av AF i,av 10 11 1000 0 AFi,av (1 − AFi,av ) 11 00 0111 1 (1 − AFi,av )(1 − AFi,av ) 11 01 1110 1 (1−AFi,av )AFi,av 11 10 1110 1 (1−AFi,av )AFi,av 11 11 1111 0 (1−AFi,av )(1 − AFi,av ) 0.4 and 1.4. Under these cases, the proposed ETI coding is more effective than others. The AFETI,av expressions for m = 4 and m = 8 are AFETI,av = m 0.4375 − 0.312AFi,av (4) i=1 +0.375AF2i,av − 0.25AF3i,av , AFETI,av = 2.75, m = 8. m=4 (5) The serial AFav values of various coding schemes for m = 4 and m = 8 are shown in Fig. 11(b) and (c). For m = 4, the reduction of AFETI,av ranges from 25% to 50% compared with AF S,av . For m = 8, the reduction of AFETI,av is approximately 37.5% compared with AF S,av . When AFi,av is between 0.2 and 0.7, the AFETI,av is always below AF S,av and AFSE,av schemes. When the individual bitstream is completely random, the ETI has the lowest average AF. The AFav reductions of ES, SE, TIC, ETIpre , and ETI with AF S,av are compared in Fig. 11(d)–(f) based on AFi,av = 0.3, 0.5, and 0.7. The TIC scheme achieves lower reduction than the ETI method when AFi,av = 0.3 and m ≤ 4 is shown in Fig. 11(d). However, the TIC scheme needs to transmit the extra bit for every data word. If AFi,av increases to 0.5 in Fig. 11(e), the reductions of SE and ES are almost 0% while the performance of the ETI scheme is not affected. When AFi,av is set to 0.7 in Fig. 11(e), the reduction of SE becomes negative on m = 2. These graphs show that the proposed ETI coding method achieves the highest reduction for different m and AFi,av . The reduction of AFav for different coding schemes in Fig. 11(e) and (f) is AFet ipre ,av > AFETI,av > AFt ic,av > AFes,av > AFse,av . However, in Fig. 11(d), the reduction order is AFet ipre ,av > AFt ic,av > AFETI,av > AFse,av > AFES,av for m = 2. This is because the ETI and ES schemes perform 1804 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013 (a) (b) (c) (d) (e) (f) Fig. 11. (a) Overall AFav for m = 2 versus the sum of the two bitstream average AFs (AF1,av = AF2,av ). (b) Overall AFav for m = 4 versus the sum of the four bitstream average AFs (AF1,av = AF 2,av = AF 3,av = AF4,av ). (c) Overall AFav for m = 8 versus the sum of the eight bitstream average AFs (AF1,av = AF2,av = AF3,av = AF4,av = AF5,av = AF6,av = AF7,av = AF8,av ). (d)–(f) Reduction of AFav versus m for AFi,av = 0.3, 0.5, 0.7 compared with AF S,av . better for signals with high switching probabilities. For m = 2 and AFi,av = 0.3 in Fig. 11(d), the data in the individual bitstreams and the serial link have less switching activities so the performance of the ETI and ES is slightly less than the TIC and SE. For m = 4 or 8 in Fig. 11(d) and all the m in Fig. 11(e) and (f), when AFi,av = 0.5 or 0.7, the switching activities of each bitstreams and the serial link are larger so the performance of the ETI and ES is better than that of the TIC and SE. The word length used in the above AF analysis is proportional to the multiplexing degree m. The larger the word length for a data word is selected, the higher threshold is used. Therefore, the reduction on the number of transition becomes smaller. Since the reduction of AFETI,av decreases when the number of serialized bitstream m increases, we only show the AF results for m from two to eight in Fig. 11. B. Word Length Analysis We discuss the effects of word length on the AF for three types of encoding methods: TIC, ETIpre , and ETI. The AFs of TIC, ETIpre , and ETI depend on the word length since transition checking uses half of the word length as a threshold. For these three coding schemes, the AF increases as the word length increases. An example of ETI encoding with four different word lengths is shown in Fig. 12(a). Assume that the serial input stream has 14 transitions. For a 32-bit word length with threshold Nth (=16), the number of transitions Nt (=14) is smaller than the threshold; thus, the serial input data are not encoded. For a word length of 16-bits, half of the input bitstream is encoded (the bold font part) and the other half is not encoded. The Nt in the ETI encoded data word is reduced to 12. The Nt in the ETI encoded data word using eight-bit and four-bit word lengths are 12 and 10, respectively. The smaller the word length is, the lower the AF is. Similar results appear in [9], where the number of transitions in the resulting data word increases as the word length increases from 8 to 32. In the TIC scheme, the AF of the data word using a word length of four is higher than that of a word length of eight. This is because the increase in transition caused by the extra bit becomes severe when the word length decreases to four bits. The extra bit not only increases the data latency, but also produces extra transitions, as Fig. 12(b) shows. The AF E T I,av expressions for word lengths of four, eight, and sixteen for m = 8 are shown as below AF E T I,av = 2.75, W L = 4 (6) ⎧ m ⎪ ⎪ ⎪ 0.38 − 0.32 AFi,av ⎪ ⎪ ⎨ i=1 AFFE T I,av = −0.15AF2 + 0.75AF3 W L = 8 (7) i,av i,av ⎪ ⎪ 4 5 ⎪ ⎪ ⎪ −1.40AFi,av + 1.43AFi,av ⎩ −0.87AF6i,av + 0.25AF7i,av ⎧ m ⎪ ⎪ ⎪ 0.38 − 0.03 AFi,av ⎪ ⎪ ⎨ i=1 W L = 16 (8) AFE T I,av = +0.46AF2 − 1.15AF3 i,av i,av ⎪ ⎪ 4 5 ⎪ +1.35AFi,av − 1.04AFi,av ⎪ ⎪ ⎩ +0.53AF6i,av − 0.15AF7i,av . CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY 1805 Fig. 13. AFav analysis of ETI method by using four, eight, and sixteen word length when the number of multiplexing lines (m = 8). TABLE V R EDUCTION OF THE T RANSITION FOR TIC, ETIpre , AND ETI (a) E NCODING S CHEMES W ITH D IFFERENT W ORD L ENGTH Word Length % Reduction in Transitions (for Random Data) TIC ETIpre ETI 4 12.59% 37.75% 8 14.95% 27.75% 31.25% 27.5% 16 13.31% 19.75% 19.75% 32 10.71% 13.77% 13.77% (b) Fig. 12. (a) Coding example of the ETI scheme with different word length (WL) where Nt is the number of transitions in the ETI encoded data word. (b) Example of the TIC scheme with word length of four. Under the ETI coding scheme, the AFav is close to a constant when the word length is four bits. The graph of AFav for ETI for m = 8 under different word lengths is shown in Fig. 13. These results show that the AFav is the smallest when the word length is four. This paper also conducts simulations to verify the analysis result. In these simulations, the TIC, ETIpre , and ETI schemes encode eight independent random streams under four kinds of word lengths. Table V shows the reduction of the transition for various encoding schemes with different word length. These simulation results are consistent with the analysis above and similar to the results in Fig. 13. V. E NERGY A NALYSIS A. Interconnect Capacitance and Optimum Selective Spacing The dynamic energy dissipation per unit length of an interconnect is MC F × Cc ) × V D2 D (9) E = 0.5AF × (2C g + where C g and Cc are the line-to-ground capacitance and coupling capacitance, and the MCF is the miller coupling factor. Based on the C g and Cc formulas modeled in [22], the C g is proportional to the wire width w and the Cc decreases as the wire spacing s increases. The total average interconnect capacitance Ct,av per unit length is (2C g + MC F × Cc ). To obtain the minimum C g , we choose wire width w as the minimum wire width w0 in various fabrication process. The w0 is 0.14 and 0.2 μm for TSMC 90- and 130-nm CMOS technologies, respectively. This paper fixes w to w0 and changes wire spacing s. The term soptct represents the wire spacing for obtaining the minimum total capacitance of two adjacent wires. Simulation results indicate that the soptct soptct occurs at 1.07 μm for 90-nm CMOS technologies. This soptct is much larger than the w0 . Define An as the total wire area of n parallel wires. We assume that the initial wire and space width are both equal to w0 . The total area of the n parallel wires An is nw0 +(n−1)w0 . As shown in (10) to (11), we propose a selective spacing (SS) scheme to choose the optimum wire spacing (sopt ) under the constraint of minimizing the total capacitance and total wire area. For n input bitstreams under multiplexing degree m, the number of serial link is n/m. The soptct is the wire spacing for obtaining the minimum total capacitance between two adjacent wires. Selecting the wire space as soptct for small m makes the An/m larger than An since the soptct > w0 . Therefore, for small m, we select a wire space such that An/m = An . The wire space is obtained by (An − n/mw0 )/n/m and equals 2mw0 − w0 . The m 0 is defined as the minimum multiplexing degree such that its corresponding wire spacing is larger than soptct . That is, m 0 is the minimum number that 1806 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013 Fig. 14. Normalized serial link capacitance as a function of the degree of multiplexing m with n = 64. 90 nm: w0 = 140 nm, s0 = 140 nm, h = 440 nm, t = 310 nm, and ε = 2.9. 130 nm: w0 = 200 nm, s0 = 200 nm, h = 550 nm, t = 370 nm, and ε = 3.7. Fig. 15. Area and RC delay reduction versus the m with n = 64. 90 nm: w0 = 140 nm, s0 = 140 nm, h = 440 nm, t = 310 nm, and ε = 2.9. 130 nm: w0 = 200 nm, s0 = 200 nm, h = 550 nm, t = 370 nm, and ε = 3.7. satisfies the wire spacing (2m 0 w0 − w0 > soptct ). For 90-nm CMOS technology, when the soptct equals to 1.07 μm and the w0 equals to 0.09 μm, the m 0 = 2. That means when multiplexing degree is larger or equals to two, the wire spacing obtained by (An − n/mw0 )/n/m is larger than soptct and under this case we choose sopt equals to soptct to minimize the total wire area. Under the case m < m 0 , we select the wire spacing as 2mw0 − w0 . When m > 2 (w0 = 0.2 μm) in 130 nm and m > 4 (w0 = 0.14 μm) in 90 nm, the selected optimum wire spacing is soptct wopt = w0 sopt = 2mw0 − w0 , m < m 0 soptct , m ≥ m 0, (10) 0 −w0 where 2m 0swoptct ≥ 1. (11) Under the 130- and 90-nm CMOS technologies, the total interconnect capacitor versus m is shown in Fig. 14. This total Fig. 16. Normalized serial link energy dissipation as a function of m for 130-nm CMOS process under AFi,av = 0.3, 0.5, and 0.7. capacitance is normalized to the interconnect capacitance of the n parallel wires. The reduction of the total capacitance is larger than that in [16] when m ≥ 4. The area and RC delay reduction versus m with n = 64 is shown in Fig. 15. The wire spacing sopt is based on (11). When m is greater than two for 130 nm or m is greater than four for 90 nm, the total wire area is smaller than that of the n wires. Since total capacitance after multiplexing is reduced, the propagation delay after multiplexing is also reduced with the resistance unchanged. The actual values of capacitance in Fig. 14 and area/RC values in Fig. 15 are summarized in Table VI. The initial spacing s0 is the same as w0 . The w0 values in ours and [16] for 130- and 90-nm technologies are (200, 240 nm) and (140, 160 nm), respectively. The total capacitance per wire is 2(Cc + C g ) times the wire length. For 130-nm technology, our total capacitance and RC values of the parallel bus are larger than theirs without multiplexing (m = 1). With our width and spacing selection scheme, the reduction of our RC values is better for m = 8. For 90-nm technology, our total capacitance and RC values of the parallel bus are smaller than theirs without multiplexing (m = 1). With our width and spacing selection scheme, the reduction of our RC values is better for m = 8. The RC reduction for 90 nm is larger than 130 nm. As for the area, since our scheme aims at minimizing the total area by using small spacing values, our areas are smaller than theirs for both 130- and 90-nm technologies. In [16], the criterion to choose the width and spacing of a wire in serial links is to minimize the total wire capacitance in the serial links. In order to minimize the ground capacitance, the minimum wire width w0 is chosen. The optimum spacing Sopt is the maximum possible spacing between wires in order to minimize the coupling capacitance. Since the maximum possible spacing is chosen, so the total area n/mw0 + (n/m − 1)Sopt is still large for a large multiplexing index m over n wires parallel bus. Therefore, the area reductions in [16] are limited for both 90- and 130-nm CMOS technologies. Compared to the width and spacing selection schemes in [16], our approach aims at reducing both the total area CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY 1807 TABLE VI VALUE OF C APACITANCE , RC, AND A REA AT T ECHNOLOGY OF 130 AND 90 nm 130 nm 90 nm 130 nm [16] 90 nm [16] Cc = 93.52 fF/mm Cg = 23.52 fF/mm Cc = 87.412 fF/mm Cg = 16.06 fF/mm Cc = 63.86 fF/mm Cg = 42.63 fF/mm Cc = 74.59 fF/mm Cg = 34.82 fF/mm R = 297.28 ohm/mm (m = 1) R = 506.9 ohm/mm (m = 1) R = 246.42 ohm/mm (m = 1) R = 505.5 ohm/mm (m = 1) Capacitance RC Area Capacitance RC Area Capacitance RC Area Capacitance RC Area (fF) (aS) (μm2 ∗ mm) (fF) (aS) (μm2 ∗ mm) (fF) (aS) (μm2 ∗ mm) (fF) (aS) (μm2 ∗ mm) m=1 702.52 6.26 28.19 620.86 9.44 16.54 638.91 4.72 33.75 656.45 9.96 16.45 m=2 396.13 3.53 27.75 320.14 4.87 16.28 360.35 2.66 33.22 338.47 5.13 16.19 m=4 363.77 3.24 25.2 279.67 4.25 15.75 330.96 2.45 32.15 295.67 4.48 15.67 m=8 363.77 3.24 11.88 279.47 4.25 8.0 343.73 2.54 30.03 304 4.61 14.64 TABLE VII E NERGY D ISSIPATION W ITH D IFFERENT AF i,av AND m AT 130- AND 90-nm T ECHNOLOGY 130 nm 0.3 This Paper 0.5 0.7 m=1 970.79 1618 m=2 463.51 m=4 621.73 m=8 576.22 90 nm 0.3 [16] 0.5 0.7 2265.2 883.23 1472 2060.9 531.55 633.61 485.78 809.6 1133.5 402 576.64 531.55 640.61 762.5 640.5 477.99 576.22 576.22 640.61 763.3 640.5 422.68 422.68 of serial links and wire capacitances. Therefore, our resulting areas are scaled down when a higher m is used. The RC delay for 90- and 130-nm technologies [16] does scale similar to the rest values is because the total capacitance values and the corresponding RC values are reduced based on the criterion of minimizing the total wire capacitance in the serial links. The average energy dissipation of a serial link is directly proportional to the total interconnect capacitance, the supply voltage, and average AF. The energy dissipation of n/m serial links normalized to that of the parallel bus (m = 1) is expressed as follows: n m × 0.5AF E T I,av (m) × Ct,av (m) × V D2 D n × 0.5AFi,av × Ct,av (1) × V D2 D 0.7 0.3 [16] 0.5 0.7 858.28 1430.5 2002.7 907.48 1512.5 2117.5 461 549.52 467.35 778.9 1090.5 443.32 408.66 571.9 680.9 571.7 422.68 571.9 680.9 571.7 proposed algorithm for the 130- and 90-nm technologies is similar so they are not depicted. For 130-nm technology, since our scheme has larger capacitance than theirs for m = 1, our initial energy is larger than theirs. With our coding scheme and width/spacing selection, our energy dissipation is smaller than theirs for all m > 1. The energy dissipation of our scheme is smaller than theirs for all m under 90-nm technology. VI. E XPERIMENTAL R ESULTS B. Energy Dissipation E av (m) = E av (1) 0.3 This Paper 0.5 . (12) The AF E T I,av for different m and AFi,av are shown in (3)–(5). The energy saving of the proposed ETI scheme and the method in [16] with AFi,av = 0.3, 0.5, and 0.7 under 130-nm technology is shown in Fig. 16. The actual values of energy dissipation in Fig. 16 are summarized in Table VII. For AFi,av = 0.3 and m = 2, the energy saving is the same as the reduction in average wire capacitance, because the encoding eliminates the AF penalty due to serialization. For AFi,av = 0.3 and m ≥ 4, encoding does not eliminate all the AF penalty, and a net penalty in AF remains. For AFi,av = 0.5 and 0.7, the reduction of AF dominates the energy saving. The minimum energy dissipation is achieved when m equals to four and AFi,av is 0.7 with a 75% energy reduction. It is because the average activity reduction increases when the AFi,av increases and the maximum reduction of wire capacitance occurs at m = 4. The energy saving of the A. AF Simulation Table VIII shows the simulation results by applying real video streams. There are three input video streams (For eman, News, and Mobile in compressed and uncompressed format) based on YUV CIF 4:2:0 video with 300 frames. According to the simulation results, the average AF of the input video streams is around 0.5. For the uncompressed video streams, the first ten frames are used in our simulation. The original number of transition in the uncompressed video stream depends on the content. Compared with the serial link scheme (S), the ETI scheme is more efficient than other schemes, achieving the reduction of 28.56%, 32.91%, and 33.4% for For eman, News, and Mobile videos, respectively. The compressed video streams are encoded by H.264 with 300 frames in our simulation. For the word length of four bits, the ETI reduces 31% transition compared with the S scheme for all three compression videos. No adjacent transition (NAT) coding techniques [23] adopt the limited weight coding scheme and transition signaling to minimize power and eliminate crosstalk. Table IX shows the simulation results with self generated bitstreams by the C code. The number of bistreams in the parallel bus is eight, and the AF of each bitstream is 0.5. Three points can be noticed. First, the result of the ETI scheme using word lengths of four and eight matches the 1808 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013 TABLE VIII S IMULATION R ESULTS FOR Foreman, N ews, AND Mobile I NPUT V IDEOS Input Video: Foreman, N ews, and Mobile Transition Reduction (%) Data Format Video Total Bit Number Original Transition NAT [23] TIC [9](WL = 8) ETI(WL = 8) ETI(WL = 4) SE Foreman Compression N ews (H.264, 300 frames) Mobile 16,868,992 6,229,312 45,933,568 8,425,162 3,106,504 22,952,388 −0.05% −0.02% −0.14% −0.12% −0.24% −0.05% 14.8% 14.7% 14.8% 27.12% 27% 27.1% 31.21% 31.1% 31.2% Foreman N ews Mobile 12,165,120 12,165,120 12,165,120 6,101,319 6,148,441 6,216,989 0.832% 6.52% 3.3% 1.738% 1.17% 1.81% 11.27% 16.63% 17.36% 20.03 27.17% 26.37% 28.56% 32.91% 33.4% Uncompressed (YUV, 10 frames)) TABLE IX S IMULATION R ESULTS W ITH S ELF G ENERATED I NPUT D ATA BY THE C C ODE Eight Random Input Bitstreams, (AFi,av = 0.5) Total Bits Number = 640 000 Transitions % Reduction (Compared with S) % Reduction (Compared with TIC) Serial stream (S) 319 909 - −17.55% ETI (WL = 4) 220 035 31.25% 19.1% ETI (WL = 8) 232 473 27.5% 14.57% Coding Method TIC [9] 272 142 15.00% - SILENT(ES) [1] 319 875 0.01% −17.53% SE 319 907 0.01% −17.55% 321 705 −0.007% −18.2% NAT (After serialized) [23] (a) analysis in Section IV. Second, the performance of SILENT is very weak since it reduces transitions by only 0.01%. This is because this coding scheme is only suitable for applications with strong correlation. The ETI scheme is more efficient than other schemes, achieving a reduction of 31.25%. (b) (c) B. Power Simulation Fig. 17. (a) Wire model, (b) serializer, and (c) deserializer architectures. We compare the energy dissipation of our proposed ETI serial link with the parallel bus using the TSMC 130- and 90-nm CMOS technology. A 64-bit parallel bus is constructed. The wire load model is shown in Fig. 17(a). The wire in the bus is capacitively coupled to its two neighbors. The capacitance of the wire is calculated using Berkeley Predictive Technology Model [24]. The clock period in the wire is denoted as Tcyc . The clock period is designed to meet the delay from the wire, including repeaters and the circuit component (such as D-FF). Denote PP as the maximum power of the parallel bus when the AFs (AFi,av , i = 1, . . . , n) of all the n wires equal to one. In our simulation, we adopt a 3-mm wire that is represented by a wire model. PP is obtained by HSPICE simulation with each wire with clock period Tcyc . The POP is the total power of the overhead circuits in the parallel bus. The overhead circuits include synchronizers. Assume that the AFi,av is the same for all the wires in the parallel bus. For different AF AFi,av , the total power of the parallel bus is AFi,av × PP + POP . For the ETI serial link architecture, there are n/m serial links. Equations (3)–(5) show the average accumulated AF AF E T I,av for an ETI serial link that is multiplexed from m wires. The AF E T I,av is a function of the AFi,av and m. The normalized AF E T I,av is AF E T I,av /m. The PE is the power of a serial link. The PO E is the power of the overhead circuits in an ETI serial link that includes the SerDes, ETI encoder, and decoder. The architectures of the SerDes are shown in Fig. 17(b) and (c). The architectures of ETI encoder and decoder are shown in Section III. The total power of the overhead circuit in all the ETI serial links PT O E is (n/m) × PO E . The power of the ETI serial link to the parallel bus is shown below AF E T I,av (n/m) × PE + PO E m E E T I,av (m) = . E P,av (1) AFi,av × PP + POP (13) CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY 1809 TABLE X VII. C ONCLUSION D IMENSIONS , PARAMETERS , AND P OWER S AVING FOR 3-mm W IRE This paper presented the ETI coding scheme to reduce the power dissipation of a serial link. The ETI scheme uses the phase difference between the clock and the data to reduce the switching activity of the serial link. This ETI reduces the number of transitions by up to 31% compared with the serial scheme. The analysis and simulation results indicated that the proposed coding scheme achieves fewer transitions for most data patterns. The analysis and simulation results indicated that the proposed coding scheme produces a low bit transition for different kinds of data patterns. Using the optimum degree of multiplexing, optimum width, and spacing, the ETI coding scheme achieves 30%–60% energy reduction compared with the parallel bus without overhead. The power saving is up to 31.71% and 26.46% at Tcyc of 250 ps for the 90-nm and 130-nm CMOS technology for m = 2 adding circuit overhead. A larger bandwidth was needed in the ETI coding scheme due to phase shift. The last bit has half the pulse width of the other bits so that the interconnect has twice the band width. It means that the serial link needs to run a much frequency. The higher clock frequency leads to problems such as buffering, clock synchronization, and design complexity. The other way is to wait an extra bit to check the transition information but that would lower the overall bit rate. L ENGTH AT 130- AND 90-nm T ECHNOLOGIES . (90 nm: h = 440 nm, t = 310 nm, AND ε = 2.9; 130 nm: h = 550 nm, t = 370 nm, AND ε = 3.7) 64-Parallel-Line Bus 64/32-Serial-Link Bus m=1 m=2 Tech (nm) W Cg Cc Tcyc W Cg Cc Power Tcyc (μm) (aF/μm) (aF/μm) (ps) (μm) (aF/μm) (aF/μm) (ps) Saving (%) 130 0.2 23.52 93.52 1000 0.6 40.38 25.64 500 29.14 130 0.2 23.52 93.52 500 0.6 40.38 25.64 250 26.46 90 0.14 16.06 87.41 1000 0.42 24.56 28.79 500 35.98 90 0.14 16.06 87.41 500 24.56 28.79 250 31.71 0.42 TABLE XI P OWER S AVING FOR D IFFERENT m U NDER THE S AME T HROUGHPUT m Tcycle ( ps ) S 130 nm 90 nm w = 0.2 μm w = 0.14 μm Cg Cc Power S Cg Cc Power (μm) (aF/μm) (aF/μm) Saving(%) (μm) (aF/μm) (aF/μm) Saving(%) 2 1000 0.6 40.38 26.64 29.14 0.42 28.8 24.56 35.29 4 500 1.3 52.15 8.47 38.57 0.98 39.17 7.44 48.03 8 250 1.3 52.15 8.47 26.38 1.07 40.09 6.49 34.66 R EFERENCES Table X shows the power saving results. Under a fixed m, the higher the frequency is, the lower the power saving is. It is because the increase of PO E is higher than PP and POP at higher frequencies. The more advanced the technology is, the higher power saving is. The power saving is up to 31.71% and 26.46% at Tcyc of 250 ps for the 90- and 130-nm CMOS technology for m = 2. Table XI shows the power saving for different m with the same total throughput at 90-nm and 130-nm technologies. To have the same throughput, the higher the m is, the higher the frequency in the serial link is. Although the increase of PO E is higher than PP and POP at higher frequencies, Table XI shows that m = 4 has higher power saving than m = 2. It is because m = 4 has much smaller capacitance than m = 2 under our selective spacing scheme. The clock frequency of the serial links is much higher than that of the parallel bus. The higher clock frequency leads to problems such as buffering, clock synchronization, and design complexity. As shown in Table X, our implemented ETI encoder and decoder achieve good power saving for operating the serial link up to 2 GHz. High frequency circuit design techniques are needed to overcome those problems when adopting the proposed ETI scheme to higher frequency serial link. As we know, serial link interface is common adopted on off-chip interconnect, such as USB and SATA. The serial link technologies have improved dramatically these years [25]–[27]. For example, the data rates of USB devices increase from 480 Mb/s to 5 Gb/s in just a few year. PDs that operate at 1∼10 GHz are reported [28]–[31]. With these advanced serial link technologies, our proposed ETI architecture is suitable for reducing link power. [1] K. Lee, S. J. Lee, and H. J. Yoo, “SILENT: Serialized low energy transmission coding for on-chip interconnection networks,” in Proc. IEEE Int. Conf. Comput.-Aided Design Conf., Nov. 2004, pp. 448–451. [2] M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 3, no. 1, pp. 49–58, Mar. 1995. [3] Y. Shin, S. I. Chae, and K. Choi, “Partial bus-invert coding for power optimization of application-specific systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 2, pp. 377–383, Apr. 2001. [4] R. B. Lin and C. M. Tsai, “Weight-based bus-invert coding for lowpower applications,” in Proc. Int. Conf. VLSI Design, Jan. 2002, pp. 121–125. [5] C. H. Kuo, W. B. Wu, Y. J. Wu, and J. H. Lin, “Serial low power bus coding for VLSI,” in Proc. IEEE Int. Conf. Commun., Circuits Syst., Jun. 2006, pp. 2449–2453. [6] S. Zogopoulos and W. Namgoong, “High-speed single-ended parallel link based on three-level differential encoding,” IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 549–558, Feb. 2009. [7] S. R. Sridhara and N. R. Shanbhag, “Coding for reliable on-chip buses: A class of fundamental bounds and practical codes,” IEEE Trans. Comput.Aided Design Integr. Circuits Syst., vol. 26, no. 5, pp. 977–983, May 2007. [8] P. T. Huang, W.-L. Fang, Y.-L. Wang, and W. Hwang, “Low power and reliable interconnection with self-corrected green coding scheme for network-on-chip,” in Proc. 2nd ACM/IEEE Int. Symp. Netw. Chip, Apr. 2008, pp. 77–84. [9] R. Abinesh, R. Bharghava, and M. B. Srinivas, “Transition inversion based low power data coding scheme for synchronous serial communication,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI Conf., May 2009, pp. 103–108. [10] S. J. Lee, S. J. Song, K. Lee, J. H. Woo, S. E. Kim, B. G. Nam, and H. J. Yoo, “An 800 MHz star-connected on-chip network for application to systems on a chip,” in IEEE Int. Solid-State Circuits Conf. Technol. Dig., Feb. 2003, pp. 468–469. [11] K. Lee, S. J. Lee, and S. E. Kim, “A 51 mW 1.6 GHz on-chip network for low-power heterogeneous SoC platform,” in Proc. IEEE Int. SolidState Circuits Conf., Feb. 2004, pp. 152–518. [12] K. Lee, S. J. Lee, and H. J. Yoo, “Low-power network-on-chip for highperformance SoC design,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 2, pp. 148–160, Feb. 2006. 1810 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013 [13] R. Narasimha and N. Shanbhag, “Design of energy-efficient high-speed links via forward error correction,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 5, pp. 359–364, May 2010. [14] E. Azadi, H. Ghasemizadeh, A. Khoei, and K. Hadidi, “5 Gb/s phase tracking clock recovery with new line coding in 0.35u CMOS process,” in Proc. Int. Symp. Telecommun., 2010, pp. 458–463. [15] D. N. Sarma and G. Lakshminarayanan, “Encoding technique for reducing power dissipation in network on chip serial links,” in Proc. Int. Conf. Comput. Intell. Commun. Syst. Conf., 2011, pp. 323–327. [16] M. Ghoneima, Y. Ismail, M. Khellah, J. Tschanz, and V. De, “Serial-link bus: A low-power on-chip bus architecture,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 2020–2032, Sep. 2009. [17] W. C. Huang, C. H. Lin, and C. T. Chiu, “Embedded transition inversion coding for low power serial link,” in Proc. IEEE Workshop Signal Process. Syst. Conf., Oct. 2011, pp. 102–105. [18] B. Razavi, “Challenges in the design of high-speed clock and data recovery circuits,” IEEE Commun. Mag., vol. 40, no. 8, pp. 94–101, Aug. 2002. [19] S. Kaeriyama, Y. Amamiya, H. Noguchi, Z. Yamazaki, T. Yamase, K. Hosoya, M. Okamoto, S. Tomari, H. Yamaguchi, H. Shoda, H. Ikeda, S. Tanaka, T. Takahashi, R. Ohhira, A. Noda, K. Hijioka, A. Tanabe, S. Fujita, and N. Kawahara, “A 40 Gb/s multi-data-rate CMOS transmitter and receiver chipset with SFI-5 interface for optical transmission systems,” IEEE J. Solid-State Circuits, vol. 44, no. 12, pp. 3568–3579, Dec. 2009. [20] K. Fukuda, H. Yamashita, G. Ono, R. Nemoto, E. Suzuki, N. Masuda, T. Takemoto, F. Yuki, and T. Saito, “A 12.3-mW 12.5-Gb/s complete transceiver in 65-nm CMOS process,” IEEE J. Solid-State Circuits, vol. 45, no. 12, pp. 2838–2850, Dec. 2010. [21] R. Reutemann, M. Ruegg, F. Keyser, J. Bergkvist, D. Dreps, T. Toifl, and M. Schmatz, “A 4.5 mW/Gb/s 6.4 Gb/s 22+1-lane source synchronous receiver core with optional cleanup PLL in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 45, no. 12, pp. 2850–2861, Dec. 2010. [22] P. P. Sotiriadis and A. Chandrakasan, “Bus energy minimization by transition pattern coding (TPC) in deep submicron technologies,” in Proc. IEEE/ACM Int. Comput.-Aided Design Conf., Nov. 2000, pp. 320– 327. [23] P. Subrahmanya, R. Manimegalai, V. Kamakoti, and M. Mutyam, “A bus encoding technique for power and cross-talk minimization,” in Proc. IEEE Conf. VLSI Design, Jun. 2004, pp. 443–448. [24] C. Hu. (1996). Berkeley Predictive Technology Model [Online]. Available: http://ww-w-device.eecs.berkeley.edu/∼ptm [25] J. Kim, J. K. Kim, B. J. Lee, and D. K. Jeong, “Design optimization of on-chip inductive peaking structures for 0.13 μm CMOS 40 Gb/s transmitter circuits,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 12, pp. 2544–2555, Dec. 2009. [26] S. Kaeriyama, Y. Amamiya, H. Noguchi, Z. Yamazaki, T. Yamase, K. Hosoya, M. Okamoto, S. Tomari, H. Yamaguchi, H. Shoda, H. Ikeda, S. Tanaka, T. Takahashi, R. Ohhira, A. Noda, K. Hijioka, A. Tanabe, S. Fujita, and N. Kawahara, “A 40 Gb/s multi-data-rate CMOS transmitter and receiver chipset with SFI-5 interface for optical transmission systems,” IEEE J. Solid-State Circuits, vol. 44, no. 12, pp. 3568–3579, Dec. 2009. [27] N. Nedovic, A. Kristensson, S. Parikh, S. Reddy, S. McLeod, N. Tzartzanis, K. Kanda, T. Yamamoto, S. Matsubara, M. Kibune, Y. Doi, S. Ide, Y. Tsunoda, T. Yamabana, T. Shibasaki, Y. Tomita, T. Hamada, M. Sugawara, T. Ikeuchi, N. Kuwata, H. Tamura, J. Ogawa, and W. Walker, “A 3 Watt 39.8–44.6 Gb/s dual-mode SFI5.2 SerDes chip set in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 45, no. 10, pp. 2016–2029, Oct. 2010. [28] J. E. Rogers and J. R. Long, “A 10-Gb/s CDR/DEMUX with LC delay line VCO in 0.18 μm CMOS,” IEEE J. Solid-State Circuits, vol. 37, no. 12, pp. 1781–1789, Dec. 2002. [29] M. Sorna, T. Beukema, K. Selander, S. Zier, B. Ji, P. Murfet, J. Mason, W. Rhee, H. Ainspan, and B. Parker, “A 6.4 Gb/s CMOS SerDes core with feedforward and decision-feedback equalization,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2005, pp. 62–63. [30] C. F. Liao and S. I. Liu, “A 40 Gb/s CMOS serial-link receiver with adaptive equalization and clock/data recovery,” IEEE J. Solid-State Circuits, vol. 43, no. 11, pp. 2492–2502, Nov. 2008. [31] S. H. Lin and S. I. Liu, “Full-rate bang-bang phase/frequency detectors for unilateral continuous-rate CDRs,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 12, pp. 1214–1218, Dec. 2008. Ching-Te Chiu received the B.S. and M.S. degrees from National Taiwan University, Taipei, Taiwan, in 1986 and 1988, respectively, and the Ph.D. degree from University of Maryland, College Park, MD, in 1992, all in electrical engineering. She was an Associate Professor with National Chung Cheng University, Chia-Yi, Taiwan, from 1993 to 1994. From 1994 to 1996, she was a Technical Staff Member with AT&T, Murray Hill, NJ, and Lucent Technologies, Murry Hill, from 1996 to 2000, and with Agere Systems, Santa Clara, CA, from 2000 to 2003. In 2004, she joined the Computer Science Department, Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan, as an Associate Professor. Her current research interests include high-Speed SerDes design, multichip interconnect, fault tolerance for network-on-chip, high dynamic range image and video processing, high definition television video decoder chip design, and SONET and SDH mapper and framer IC design. Dr. Chiu was the recipient of the First Prize, the Best Advisor Award and the Best Innovation Award of the Golden Silicon Award in 2006. She is a TC Member of the IEEE Circuits and Systems Society, Nanoelectronics and Gigascale Systems Group, and the IEEE Signal Processing Society, Design and Implementation of Signal Processing Systems Group. She is the Program Chair of the first IEEE Signal Processing Society Summer School at Hsinchu in 2011. Wen-Chih Huang received the B.S. degree from Chung Yuan Christian University, Jhongli City, Taiwan, and the M.S. degree from the National Tsing Hua University, Hsinchu, Taiwan, in 2008 and 2010, respectively, both in computer engineering. He joined MStar Semiconductor, Hsinchu, in 2011, where he is engaged in designing transport stream processors for the digital TV products. His current research interests include low-power techniques and circuit designs. Chih-Hsing Lin received the M.S. degree in electric engineering from Chung Yuan Christian University, Chung-Li, Taiwan, and the Ph.D. degree from the Institute of Communications Engineering, National Tsing-Hua University, Hsinchu, Taiwan, in 2002 and 2011, respectively. He is currently an Associate Researcher with the Advanced Technology Division, National Chip Implementation Center, Hsinchu. His current research interests include clock synchronous, frequency synthesizers, low power, very large-scale heterogeneous integration design, and image processes. Wei-Chih Lai received the B.S. degree from National Central University, Jhongli, Taiwan, and the M.S. degree from National Tsing Hua University, Hsinchu, Taiwan, in 2009 and 2011, respectively, both in computer science. He is currently with United Microelectronics Corporation. His current research interests include highspeed switch and data center networks. Ying-Fang Tsao received the B.S. degree in computer engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2011, where she is currently pursuing the M.S. degree. Her current research interests include design of integrated circuits and image and video signal processing.