Uploaded by murthi babu

eti Base paper

advertisement
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013
1797
Embedded Transition Inversion Coding With
Low Switching Activity for Serial Links
Ching-Te Chiu, Wen-Chih Huang, Chih-Hsing Lin, Wei-Chih Lai, and Ying-Fang Tsao
Abstract— Serial link interconnection has been proposed for
its advantages of reducing crosstalk and area. However, serializing parallel buses tends to increase bit transition and power
dissipation. Several coding schemes, such as serial followed by
encoding (SE) and transition inversion coding (TIC), have been
proposed to reduce bit transition. TIC is capable of decreasing
transitions by 15% compared to the SE scheme, but an extra
indication bit is added in every data word to represent inversion
occurrence. The extra bit increases the transmission overhead
and the bit transitions. This paper proposes an embedded
transition inversion (ETI) coding scheme that uses the phase
difference between the clock and data in the transmitted serial
data to tackle the problem of the extra indication bit. The ETI
coding scheme reduces the transition by up to 31% compared
to SE scheme. The analysis and simulation results indicate that
the proposed coding scheme produces a low bit transition for
different kinds of data patterns. Using the optimum degree of
multiplexing, width, and spacing, the ETI coding scheme achieves
30%–60% energy reduction compared with the parallel bus
without overhead. Taking circuit overhead into consideration,
the power saving is up to 31.71% and 26.46% at a clock cycle
of 250 ps for the 90- and 130-nm CMOS technology for m = 2
where m is the number of parallel wires multiplexed into a serial
link.
Index Terms— Coding techniques, low switching activity, phase
detector, serial link.
I. I NTRODUCTION
A
DVANCED silicon technology offers the possibility of
integrating hundreds of millions of transistors into a
single chip, which makes system-on-chip (SoC) design possible. With the continuous scaling of silicon technology, area
and power dissipation of interconnects are one of the main
bottlenecks for both on-chip and off-chip buses. Multiplexing
parallel buses into a serial link enables an improvement in
terms of reducing interconnect area, coupling capacitance, and
crosstalk [1], but it may increase the overall switching activity
factor (AF) and energy dissipation. Therefore, an efficient
coding method that reduces the switching AF is an important
issue in serial interconnect design.
Many studies attempt to reduce the AF of parallel buses.
For example, Stan and Burleson [2] introduced a bus-invert
Manuscript received December 14, 2011; revised May 16, 2012; accepted
August 31, 2012. Date of publication October 26, 2012; date of current version
September 9, 2013. This work was supported by the National Science Council,
Taiwan, under Contract NSC 101-2220-E-007-019 and Contract NSC99-2220E007-019.
The authors are with the Department of Computer Science, Institute of
Communications Engineering, National Tsing Hua University, Hsinchu 300,
Taiwan
(e-mail:
ctchiu@cs.nthu.edu.tw;
g9762594@oz.nthu.edu.tw;
chihhsinglin@gmail.com; and y751026@hotmail.com; likeet171819@gmail.
com).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2012.2219888
method that transmits the original or inverted pattern to
minimize the switching activity. Researchers have proposed
many techniques to improve the bus-invert coding method,
such as the partial bus-invert coding [3] and weight-based businvert coding methods [4]. The schemes mentioned above use
an extra channel to send the inversion indication signal. Kuo
et al. [5] proposed the serial coding technique to solve the
extra channel problem. They append extra information bits to
the back of the original data word. Although this approach
resolves the area overhead problem, it increases data latency.
Three level differential encoding is proposed for parallel bus
[6] to enable multiple drivers at the transmitter and to recycle
the same current and reduce power consumption [6]. Joint
crosstalk avoidance code and error correction code are proposed to reduce the power in parallel bus [7]. Huang et al. [8]
further proposed combining serializing bus with the joint
crosstalk avoidance code and error correction code to reduce
the power.
Serialized low-energy transmission (SILENT) [1] is a coding method used in reducing the switching activity for serial
links. This approach encodes every single bit in the parallel
bus using the XOR gate, and multiplexes the encoded parallel
buses into a serial link. The XOR operation sets an adjacent
bit with the same value to zero. The greater the correlation is,
the more zeros the encoder produces. This method is designed
for data with strong correlation.
Bharghava et al. [9] proposed the transition inversion coding
(TIC) technique to reduce switching activity for random data
and to detect errors. Their technique counts the transitions in
the data word, and inverts the transition states if the number of
transitions in a data word is more than half of the word length.
The scheme sets the current bit in the serial stream to be the
same as the previous encoded bit when there is a transition.
Otherwise, it is set to the inversion of the previous encoded
bit. A transition indication bit is added in every data word.
This extra bit not only increases the number of transmitted
bits, but also increases the transitions and latency.
Lee et al. [10]– [12] used serial links as communication
channels on an on-chip network architecture for SoC. The
serial links reduce the area of communication channels by 57%
compared with a nonserialized approach [10]. This approach
also reduces the switch activity because the coupling capacitance of the interconnect wires decreases [10]–[12].
Forward error correction (FEC) code is used to reduce
the serial link power by trading off the FEC coding
gain with specifications on transmit swing, analog-to-digital
converter precision, jitter tolerance, receive amplification, and
by enabling higher signal constellations [13]. By combining
the single and differential signals and 8B/10B coding, a high
1063-8210 © 2012 IEEE
1798
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013
speed phase tracking clock recovery is developed for serial
link to reduce the power [14]. To utilize the relation between
data in a link, the encoder reorders or shuffles the M bits in
the parallel bus such that the encoded data are less correlated
or silent [15]. Then the encoded data are serialized with one
control bit. The silent coding with shuffling can reduce the
power dissipation compared to the silent coding.
A serial link on-chip bus architecture is proposed to lower
interconnect power [16]. Serialization reduces the number of
wires and leads to a larger interconnect width and spacing.
A large interconnect spacing reduces the coupling capacitance, while the wider interconnects reduce the resistivity.
A significant improvement in the interconnect energy dissipation is achieved by applying different coding schemes
and their proposed multiplexing techniques. However, the
power reduction decreases when the degree of multiplexing
increases.
This paper proposes the embedded transition inversion (ETI)
coding scheme to solve the issue of the extra indication
bit [17]. This scheme eliminates the need of sending an
extra bit by embedding the inversion information in the phase
difference between the clock and the encoded data. When
there is an inversion in the data word, a phase difference is
generated between the clock and data. Otherwise, the data
word remains unchanged and there is no phase difference
between the clock and the data. This ETI coding scheme
reduces transition by 31% compared with the SE scheme. The
improvement of transition reduction is 19% compared with
that of the TIC. The receiver side adopts a phase detector (PD)
to detect whether the received data word has been encoded or
not. Statistical analysis and experimental results show that the
proposed coding scheme has low transitions for different kinds
of data patterns.
Increasing the interconnect spacing reduces the coupling
capacitance for on-chip buses. We also study optimization on
degree of multiplexing, width, and spacing when serializing
parallel buses. Using the optimum degree of multiplexing,
width, and spacing, the proposed ETI coding scheme reduces
energy consumption from 30% to 60% compared to the
parallel buses.
Compared the works in [17], we add the enhanced ETI
architecture, link width and spacing selection scheme, word
length and energy analyses, and detail switching activity
analyses and new energy simulation results.
The remainder of this paper is organized as follows.
Section II presents the ETI coding scheme. We address the
ETI architecture that includes the encoder and decoder in
Section III. Section IV provides the AF analysis for ETI
under degree of multiplexing and the word length. Section V
presents the energy analysis by considering the interconnect
width, spacing, and capacitance. Finally, Section VI presents
experimental results, and Section VII gives a brief conclusion.
II. ETI C ODING S CHEME
A. Background
AF analysis for different kinds of coding schemes appears
in [16]. There are four types of bus schemes, including
(a)
(b)
Fig. 1. (a) Four types of bus schemes: P, S, ES, and SE. (b) Overall AFav
for m = 2 versus the sum of the average AFs from two parallel bitstreams
(AF1,av = AF2,av ) [16].
parallel (P), serial (S), encoding followed by serial (ES),
and serial followed by encoding (SE) as shown in Fig. 1(a).
Their average AFs are AF P,av , AF S,av , AFES,av , and AFSE,av ,
respectively. The bitstream in this paper refers to the transmitted data in each wire of the input parallel bus. The AF analysis
results of serializing two bitstreams into a single output is
shown in Fig. 1(b). AF1,av and AF2,av are the AFs of the
bitstream in the first and the second bitstreams.
The AF of each bitstream AFi,av (i = 1, 2) can be viewed
as the probability of bit switching within a cycle time. The
bit switching in one bitstream within a cycle time could be
00, 01, 10, and 11. We find all the combinations of serializing
two bitstreams within a cycle time (i.e., 0000, 0001,...etc., total
combinations are 16). Then we find the AF of the serial link
for all possible joint switching scenarios. For example, the AF
of the serial link with bit pattern 0001 is one.
The switching probabilities of the input bitstreams are
assumed statistically independent. The probability of joint
switching event in the serial link is the product of the probability of each of the individual events in each bitstream. The probabilities of two bitstreams both switching, only one switching,
and both not switching are AF1,av AF2,av , AF1,av (1 − AF1,av ),
AF2,av (1−AF1,av ), and (1−AF2,av )(1−AF2,av ) [16].
CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY
The average AFs AF P,av , AF S,av , AFSE,av , and AFES,av of
the four schemes are obtained by summing products of the
AF after serialization and its probability for all the combinations. The average AF AF P,av equals to AFSE,av , and is the
summation of the AFs of the first and the second bitstreams
AF1,av + AF2,av . The average AF AF S,av equals to one. The
average AF AFES,av is a downward parabolic curve.
The average AFs of the four schemes AF P,av , AF S,av ,
AFES,av , and AFSE,av are considered under three cases.
The summation of the first and second bitstream AF
(AF1,av + AF2,av ) is smaller than, equal to, and greater than
one. Note that the AFs of the ES, SE, and P schemes are
smaller than S when the summation of the AF is smaller than
one (AF1,av + AF2,av < 1). That means serializing bitstreams
(S) may increase the overall AF. The SE and P schemes
reduce more switching activity than ES when each individual
AF is small. When the summation of the AF equals to one
(AF1,av + AF2,av = 1), the AFav values of SE and ES schemes
equal to AF S,av . This means that the reduction of average
activity for both coding schemes is zero. When the summation
of the AF is greater than one (AF1,av + AF2,av > 1), the AF
of the ES scheme is smaller than SE and provides the lowest
overall AF. Although the AFES,av is very low at the two
boundaries (AF1,av + AF2,av = 0 or 2), the probability of
AF1,av + AF2,av = 0 or 2 is approximately zero because it is
rare that the bitstreams are always switching or always quiet
in normal applications.
The AF of individual bitstream depends on the application.
For the pseudo random binary sequence bitstreams commonly
used in transceiver design, the average AF of each bitstream
is between 0.2 and 0.7. The average activity used in microprocessors or video processing varies from 0.1 to 0.3 [16] and
0.25 to 0.35 [1]. Assuming that the bitstreams are statistically
independent, then the sum of the AFs of two individual
bitstream is between 0.2 and 1.4.
The ES scheme is a good choice for applications, such as
microprocessors or video processing based on the comparison
of the four schemes shown in Fig. 1. The ES and SE coding
schemes in Fig. 1 are both based on XOR operation that is
suitable for bistreams with distinct values in adjacent bits. By
replacing the XOR operation with our proposed ETI coding
in the SE scheme, the switching activity of our proposed ETI
scheme is lower than the ES scheme for microprocessors or
video processing.
B. Proposed ETI Scheme
Although many coding algorithms can reduce the switching
AF, most of them are designed for specific applications, such
as video streaming or strongly correlated data. The TIC is
one of the methods developed for random data [9]. This
method adds a transition indication bit to every data word to
indicate if there is an inversion or not. This inversion coding
is performed on every bit of two consecutive bits in the serial
stream. The extra indication bit increases the switching activity. This paper proposes the ETI coding scheme that operates
on a two-bit basis and removes all the transition indication
bits.
1799
Fig. 2.
n/m ETI serial links with n input bitstreams under degree of
multiplexing m.
Fig. 3.
ETI coding scheme for one serial link, word length = WL,
Nth = W L/2, and number of transition = Nt .
An n/m ETI serial links with n input bitstreams under
degree of multiplexing m is shown in Fig. 2. Each serial link
has m input bitstreams that are multiplexed by a serializer, followed by the ETI encoding. The encoded stream is transmitted
through the serial link and followed by the ETI decoding and
a deserializer. The ETI coding scheme includes the inversion
coding and phase coding as shown in Fig. 3.
1) Inversion Coding: Define the word length (WL) as the
number of bits in a data word and a threshold Nth as half of
WL. A transition is defined as a bit changing from zero to one
or from one to zero. For example, the bitstream “0100” has
two transitions while “0101” has three transitions. When the
number of transitions Nt in a data word exceeds the threshold
Nth , the bits in the data word should be encoded. Otherwise,
the data word remains the same.
When an encoding is needed in a data word, this method
checks every two-bit in the data word, as Fig. 3 shows. Every
two bit in the serial stream is combined as a base to be
encoded. In this case, the b11 b21 is a base and the b31 b41
is another base. The 2-bit in a base is denoted as b1 b2 and the
encoded output is denoted as be1 be2 . When the Nt in a data
word is less than Nth , b1 b2 remains unchanged. Otherwise, we
perform the inversion coding and the phase coding. For the
inversion coding, the bitstreams “01” and “10” are mapped to
“00” and “11,” respectively. The bitstreams “00” and “11” are
mapped to “01” and “10,” respectively. For the phase coding,
we embed the inversion information in the phase difference
between the clock and the encoded data.
1800
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013
TABLE I
A LL C OMBINATIONS OF T WO B ITSTREAMS FOR THE
TIC, ETIpre , AND ETI C ODING S CHEMES
Parallel Streams
Stream1 Stream2
Serial Stream
(TIC)
Serial Stream
(ETIpre )
Serial Stream
(ETI)
b11 b12
b21 b22
b11 b21 b12 b22 bex
b11 b21 b12 b22
b11 b21 b12 b22
00
00
0000 0
0000
0000
00
01
0001 0
0001
0001
00
10
0001 1
0001
0001
00
11
0000 1
0000
1000
01
00
0111 1
0111
0111
01
01
0011 0
0011
0011
01
10
0011 1
0011
0011
01
11
0111 0
0111
0111
10
00
1000 0
1000
1000
10
01
1100 1
1100
1100
10
10
1100 0
1100
1100
10
11
1000 1
1000
1000
11
00
1111 1
1111
0111
11
01
1110 1
1110
1110
11
10
1110 0
1110
1110
11
11
1111 0
1111
1111
(a)
(b)
(c)
(d)
Fig. 4.
Dout “1000” obtained from (a) Din “1000” without encoding,
(b) from Din “1101” with encoding. Dout “0001” obtained from (c) Din
“0001” without encoding, and (d) Din “0100” with encoding.
The inversion encoding operation can be expressed as
be1 = b1
b2 , with Nt < Nth
be2 =
!b2 , with Nt ≥ Nth .
Fig. 5.
The inversion decoding operation for the decoded output
bd1bd2 is
bd1 = be1
be2 , with Nt < Nth
bd2 =
!be2 , with Nt ≥ Nth .
Waveform example for special data words (“0000” and “1111”).
(1)
(2)
Since this operation is on a two-bit basis and only the second
bit is inverted, it is called bit-two inversion (B2INV).
2) Phase Coding: The ETI coding uses the phase difference
between the data and the clock to encode the indication
information. Table I shows the corresponding output data word
after TIC, ETIpre , and ETI. The ETIpre has the same data word
as the TIC, except that it removes the extra bit bex . Removing
the bex leaves eight sets of data words that are exactly the
same. For example, there are two “1000” data words after the
ETIpre coding.
Within every data word duration, the phase difference
between the data and the clock distinguishes these two
data words, as Fig. 4 illustrates. Same Dout “1000” in
Fig. 4(a) and (b) is obtained from Din “1000” and “1101”
without and with inversion. A half clock cycle difference
between Dout and CK is shown in Fig. 4(b), indicating that Din
has been encoded. The Dout and CK are aligned in Fig. 4(a),
indicating that Din has not changed. Dout “0001” is the same
in Fig. 4(c) and (d) from Din “0001” and “0100” without
and with inversion. This approach is able to identify whether
Dout has been encoded or not as long as there is a half cycle
delay between the Dout and CK. Although the phase difference
can distinguish most of the data words of ETIpre , this method
cannot be used for “0000” or “1111” because there is no
transition inside the data word. Under the inversion condition
for these two data words, the “0000” and “1111” change to
“1000” and “0111”, as shown in the last column of Table I.
The corresponding waveforms are shown in Fig. 5 for these
two data words. The first bit of Dout in the “1000” and “0111”
is aligned with CK and the duration of the bit is only half of
the clock cycle.
III. ETI A RCHITECTURE
A. ETI Encoder
The overall architecture of the ETI scheme is shown in
Fig. 6(a). We add the ETIpre block and the TIC architecture
for clarity. The ETIpre does not provide the decision bit
information so it cannot be decoded in the receiver. The ETIpre
encoder is shown by the dashed box in the ETI encoder in
Fig. 6(a). The TIC counts the transitions in the data word
then uses this information to perform encoding. The transition
indication bit is added to every data word to indicate whether
there is an inversion or not. The decoder adopts the transition
indication bit to perform the decoding, as shown in Fig. 6(b).
In the ETI encoder part, the input data Din are stored in the
buffer to wait until the check transition operation is completed.
The transition and threshold in a data word are used to set the
decision bit. The decision bit is used to control the encoding
process in the B2INV and the phase encoder block.
When the decision bit is set to zero, the B2INV passes the
noninverted bitstream. Otherwise, the bitstream is encoded.
CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY
1801
(a)
(b)
Fig. 6.
(a) Overall architecture of the ETI scheme. (b) Overall architecture of the TIC scheme. ETIpre encoder is marked by dashed line box.
The decision bit is also adopted in the phase encoder block to
select the phase encoded or un-encoded data word. In the ETI
decoder part, the phase decoder checks the phase difference
between the clock and the data. The phase difference information is then used to generate the decision bit. The decision
bit is used in the B2INV to decode the data words.
The ETI encoder includes the check transitions block,
buffer, B2INV, and phase encoder. The check transition block
is shown in Fig. 7(a). The WL indicator block counts the
length of the data word and generates a high signal at the first
bit of the data word. This signal is used to reset the adder and
the D-flip-flop (D-FF). The D-FF stores the previous bit that is
used to XOR with the current bit for transition checking. The
adder block calculates the number of transition in a data word
and sets the decision bit to high when the Nt ≥ Nth . The check
transition block in Fig. 7(a), which is used to detect the number
of the transitions, is part of the ETI encoder. Compared to
the ETI encoder, the synthesis results show that the hardware
area overhead of the check transition block are 32.34% and
34.96%, respectively, for serial link running at 500 MHz and
4 GHz under 90-nm CMOS technology. The power overhead
are 23.41% and 20.28%, respectively.
The architecture of the B2INV is shown in Fig. 7(b). The
divider, which divides the clock by two, provides an indication
signal for the first or second bit in a pair of bitstream b1 b2 .
According to (1) and (2), if it is the first bit, then the bit passes
through. Otherwise, the bit is inverted when the decision bit
is high. Equations (1) and (2) show that the decoding block
of ETI is exactly the same as the encoding block in Fig. 7(b).
(a)
Fig. 7.
(b)
(a) Check transition block. (b) B2INV block.
Compared with the coding block in TIC, the B2INV has less
gate count and power consumption. The TIC uses the states
between any two consecutive bits to set the encoded bit. Two
D-FFs are required to store the states of two consecutive bits.
The TIC coding circuit includes two D-FFs, two XORs, and
two inverters [9]. The B2INV block needs only one D-FF, two
inverters, one AND, and one MUX. Only the second bit in
the data word needs to be inverted so the ETI needs only one
D-FF. A MUX is smaller than a D-FF so the B2INV consumes
less gate count compared with the TIC. Using TSMC CMOS
130-nm technology, the area of TIC and B2INV coders are
2178 and 1232 μm2 . The B2INV implementation cuts the
area by 43% compared with the coding block of TIC.
As shown in Fig. 8, the phase generator is used to generate
phase difference between the encoded data (Dpre ) and the
clock (CK) at each data word. Depending on the encoded data,
1802
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013
(a)
Fig. 8.
Phase encoder block.
TABLE II
A REA AND P OWER OF ETI E NCODER I NDIVIDUAL
(b)
B LOCK U NDER 90-nm CMOS T ECHNOLOGY
Area
ETI Encoder
Subblocks
Fig. 9. (a) Alexander PD and (b) clock and phase difference for determining
the coding state.
Power
0.5G
4G
0.5G
(μm2 ) (%) (μm2 ) (%) (μW) (%)
4G
(μW) (%)
Check transition
163
32.34
136
34.96 27.43 23.41 236.64 20.28
B2INV
84
16.7
92
23.65 25.65 21.9
277.22 23.76
Phase encoder
121
24
76
19.54 30.7
274.54 23.53
Buffer
136
26.98
85
21.85 33.37 28.48
Total
504
100 117.15 100
389
26.2
378.2
34.42
100 1166.62 100
there are three types of phase encoding: the one cycle delay
[Fig. 4(a) and (c)], the half cycle delay [Fig. 4(b) and (d)], and
the special data word (Fig. 5). The phase encoding of the one
cycle delay is shown by the first path in Fig. 8. The half cycle
delay and the special data word are shown by the second and
the third path. In the special data word, the predefined Flag
signal is used to present the special data pattern.
The “check all 0 and all 1” block is used to identify the
special data word when the encoded data (Dpre ) are “0000”
or “1111.” If the encoded data are not the special data word,
the second path is selected from the MUX1. Otherwise, the
third path is selected. The decision bit then selects the data
from the first path or the output of the MUX1. The summary
of area and power in individual blocks in the ETI encoder is
shown in Table II.
B. ETI Decoder
The ETI encoder generates the phase difference between the
clock and the data word. Normally, a PD identifies an early
or delayed phase. A variety of PDs could detect the phase
difference. This paper adopts the commonly used Alexander
PD [18]. The Alexander PD architecture is shown in Fig. 9(a),
which uses three consecutive clock edges to generate four
sampling signals (S0, S1, S2, and S3). The PD is controlled
by the clock CK and input data Din . When the clock CK and
TABLE III
M EANING OF D IFFERENT P HASES
S1 ⊕ S2
High
Low
Low
High
S2 ⊕ S3
Low
Low
High
High
Clock
Early
No transition
Late
Special case
Coding state
Has not been encoded
Has been encoded
input data Din are valid, the PD is activated to identify the
phase relation between the clock and the data. The PD can
determine whether a data transition exists from the condition
that the clock leads or lags the data. The basic waveform
is shown in Fig. 9(b) to judge the un-inverted, inverted, no
transition, or the special data word. If the clock leads the data
(early conditions), the signal S1 ⊕ S2 is high and the S2 ⊕ S3
is low. Conversely, if the clock lags the data (late conditions),
the signal S1 ⊕ S2 is low and S2 ⊕ S3 is high. Thus, S1 ⊕
S2 and S2 ⊕ S3 could provide the clock and data relation as
shown in Table III.
The early signal is for the un-inverted data and the no
transition represents the un-encoded the “all zero or all one”
data word. The late signal represents the case in which data
have been inverted in encoder. The last case is for the encoded
the “all zero or all one” data word. The decision bit is
generated based on the phase information on the S1 ⊕ S2
and S2 ⊕ S3.
The decision bit is used in the B2INV for the decoding.
The decoding operator in the B2INV is the same as that in the
encoder. Two D-FFs are added in the front of the B2INV block
for buffering and alignment. A larger bandwidth is needed in
the ETI coding scheme due to phase shift. The last bit has
half the pulse width of the other bits so that the interconnect
has twice the bandwidth. It means that the serial link needs to
CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY
1803
TABLE IV
AF S FOR THE ETI B ITSTREAM
Parallel Streams
Stream1 Stream2
b11 b12
b21 b22
Fig. 10.
Cycle-based AF analysis.
run at a much frequency. The higher clock frequency leads to
problems, such as buffering, clock synchronization, and design
complexity. The other way is to wait an extra bit to check the
transition information but that would lower the overall bit rate.
The disadvantage of this method is added in the revised paper.
We send both data and clock from TX to RX in our proposed
scheme. In our simulation, we assume the channel length of
the clock and data links is equal and the phase skew between
them is negligible. A PLL or CDR is usually needed in the
receiver to resynchronize the deskew data. Only one global
PLL or CDR is needed in the serial link to recover the deskew
data then perform the ETI decoding and demultiplexing. The
estimated power overhead of PLL or CDR is about 10%–25%
of the total receiver power [19]– [21].
IV. A NALYSIS OF AF
A. AF Analysis With Degree of Multiplexing
Consider the case of serializing m bitstreams into a serial
link and denote the AF of each bitstream as AFi , where i
ranges from 1 to m. The probability that a wire has a “rise” or
“fall” switching activity is AFi,av /2, while the probability that
a wire is quiet at 0 or 1 at (1−AFi,av )/2. The AF of the serial
link is defined as the number of transitions per clock cycle,
and can be obtained from the summation of the transition in
T2 and the transitions between T1 and T2 as Fig. 10 shows.
For example, the AF of the serial stream “0011” is one. The
fourth column of Table IV shows the AF values obtained by
the ETI encoder for m = 2, while the fifth column is the
probability of the corresponding parallel streams. Note that
the AF values of the serial stream “1000” and “0111” are
set to one in Table IV. This AF value accounts for the extra
transitions produced at T1 .
The average AFav of the ETI stream can be obtained by
summing the products of AF and its probability for all the
combinations. The average AF of ETI coding AFETI,av for
m = 2 is
AF E T I,av =
m
0.25 + 0.25AF2i,av ,
m = 2.
(3)
i=1
The average AFav of different coding schemes for m = 2
is shown in Fig. 11(a). As mentioned in the previous section,
it is rare that AF1,av + AF2,av equal to zero or two. Although
the ETIpre coding scheme has a lower AF than ETI, the
ETIpre scheme fails in decoding due to lack of indication bit.
Compared with AF S,av , the reduction of AFETI,av is from
40% to 50% when the AF of AF1,av + AF2,av is between
Serial Stream
(ETI)
b11 b21 b12 b22
AF
P(x0.25)
00
00
0000
0
(1 − AFi,av )(1 − AF i,av )
00
01
0001
1
(1 − AFi,av )AFi,av
00
10
0001
1
(1 − AFi,av )AFi,av
00
11
1000
1
(1 − AFi,av )(1 − AF i,av )
01
00
0111
0
AFi,av (1 − AFi,av )
01
01
0011
1
AFi,av AF i,av
01
10
0011
1
AFi,av AF i,av
01
11
0111
0
AFi,av (1 − AFi,av )
10
00
1000
0
AFi,av (1 − AFi,av )
10
01
1100
1
AFi,av AF i,av
10
10
1100
1
AFi,av AF i,av
10
11
1000
0
AFi,av (1 − AFi,av )
11
00
0111
1
(1 − AFi,av )(1 − AFi,av )
11
01
1110
1
(1−AFi,av )AFi,av
11
10
1110
1
(1−AFi,av )AFi,av
11
11
1111
0
(1−AFi,av )(1 − AFi,av )
0.4 and 1.4. Under these cases, the proposed ETI coding is
more effective than others.
The AFETI,av expressions for m = 4 and m = 8 are
AFETI,av =
m
0.4375 − 0.312AFi,av
(4)
i=1
+0.375AF2i,av − 0.25AF3i,av ,
AFETI,av = 2.75,
m = 8.
m=4
(5)
The serial AFav values of various coding schemes for m = 4
and m = 8 are shown in Fig. 11(b) and (c). For m = 4, the
reduction of AFETI,av ranges from 25% to 50% compared with
AF S,av . For m = 8, the reduction of AFETI,av is approximately
37.5% compared with AF S,av . When AFi,av is between 0.2
and 0.7, the AFETI,av is always below AF S,av and AFSE,av
schemes. When the individual bitstream is completely random,
the ETI has the lowest average AF.
The AFav reductions of ES, SE, TIC, ETIpre , and ETI with
AF S,av are compared in Fig. 11(d)–(f) based on AFi,av = 0.3,
0.5, and 0.7. The TIC scheme achieves lower reduction than
the ETI method when AFi,av = 0.3 and m ≤ 4 is shown in
Fig. 11(d). However, the TIC scheme needs to transmit the
extra bit for every data word.
If AFi,av increases to 0.5 in Fig. 11(e), the reductions of
SE and ES are almost 0% while the performance of the ETI
scheme is not affected. When AFi,av is set to 0.7 in Fig. 11(e),
the reduction of SE becomes negative on m = 2. These
graphs show that the proposed ETI coding method achieves
the highest reduction for different m and AFi,av .
The reduction of AFav for different coding schemes in
Fig. 11(e) and (f) is AFet ipre ,av > AFETI,av > AFt ic,av >
AFes,av > AFse,av . However, in Fig. 11(d), the reduction order
is AFet ipre ,av > AFt ic,av > AFETI,av > AFse,av > AFES,av
for m = 2. This is because the ETI and ES schemes perform
1804
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013
(a)
(b)
(c)
(d)
(e)
(f)
Fig. 11. (a) Overall AFav for m = 2 versus the sum of the two bitstream average AFs (AF1,av = AF2,av ). (b) Overall AFav for m = 4 versus the sum
of the four bitstream average AFs (AF1,av = AF 2,av = AF 3,av = AF4,av ). (c) Overall AFav for m = 8 versus the sum of the eight bitstream average AFs
(AF1,av = AF2,av = AF3,av = AF4,av = AF5,av = AF6,av = AF7,av = AF8,av ). (d)–(f) Reduction of AFav versus m for AFi,av = 0.3, 0.5, 0.7 compared
with AF S,av .
better for signals with high switching probabilities. For m = 2
and AFi,av = 0.3 in Fig. 11(d), the data in the individual
bitstreams and the serial link have less switching activities so
the performance of the ETI and ES is slightly less than the
TIC and SE. For m = 4 or 8 in Fig. 11(d) and all the m in
Fig. 11(e) and (f), when AFi,av = 0.5 or 0.7, the switching
activities of each bitstreams and the serial link are larger so
the performance of the ETI and ES is better than that of the
TIC and SE.
The word length used in the above AF analysis is proportional to the multiplexing degree m. The larger the word
length for a data word is selected, the higher threshold is used.
Therefore, the reduction on the number of transition becomes
smaller. Since the reduction of AFETI,av decreases when the
number of serialized bitstream m increases, we only show the
AF results for m from two to eight in Fig. 11.
B. Word Length Analysis
We discuss the effects of word length on the AF for three
types of encoding methods: TIC, ETIpre , and ETI. The AFs
of TIC, ETIpre , and ETI depend on the word length since
transition checking uses half of the word length as a threshold.
For these three coding schemes, the AF increases as the
word length increases. An example of ETI encoding with four
different word lengths is shown in Fig. 12(a). Assume that the
serial input stream has 14 transitions. For a 32-bit word length
with threshold Nth (=16), the number of transitions Nt (=14)
is smaller than the threshold; thus, the serial input data are
not encoded. For a word length of 16-bits, half of the input
bitstream is encoded (the bold font part) and the other half is
not encoded. The Nt in the ETI encoded data word is reduced
to 12. The Nt in the ETI encoded data word using eight-bit and
four-bit word lengths are 12 and 10, respectively. The smaller
the word length is, the lower the AF is. Similar results appear
in [9], where the number of transitions in the resulting data
word increases as the word length increases from 8 to 32.
In the TIC scheme, the AF of the data word using a word
length of four is higher than that of a word length of eight.
This is because the increase in transition caused by the extra
bit becomes severe when the word length decreases to four
bits. The extra bit not only increases the data latency, but also
produces extra transitions, as Fig. 12(b) shows.
The AF E T I,av expressions for word lengths of four, eight,
and sixteen for m = 8 are shown as below
AF E T I,av = 2.75, W L = 4
(6)
⎧ m
⎪
⎪
⎪
0.38 − 0.32 AFi,av
⎪
⎪
⎨ i=1
AFFE T I,av = −0.15AF2 + 0.75AF3 W L = 8
(7)
i,av
i,av
⎪
⎪
4
5
⎪
⎪
⎪ −1.40AFi,av + 1.43AFi,av
⎩
−0.87AF6i,av + 0.25AF7i,av
⎧ m
⎪
⎪
⎪
0.38 − 0.03 AFi,av
⎪
⎪
⎨ i=1
W L = 16 (8)
AFE T I,av = +0.46AF2 − 1.15AF3
i,av
i,av
⎪
⎪
4
5
⎪
+1.35AFi,av − 1.04AFi,av
⎪
⎪
⎩
+0.53AF6i,av − 0.15AF7i,av .
CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY
1805
Fig. 13. AFav analysis of ETI method by using four, eight, and sixteen word
length when the number of multiplexing lines (m = 8).
TABLE V
R EDUCTION OF THE T RANSITION FOR TIC, ETIpre , AND ETI
(a)
E NCODING S CHEMES W ITH D IFFERENT W ORD L ENGTH
Word Length
% Reduction in Transitions (for Random Data)
TIC
ETIpre
ETI
4
12.59%
37.75%
8
14.95%
27.75%
31.25%
27.5%
16
13.31%
19.75%
19.75%
32
10.71%
13.77%
13.77%
(b)
Fig. 12. (a) Coding example of the ETI scheme with different word length
(WL) where Nt is the number of transitions in the ETI encoded data word.
(b) Example of the TIC scheme with word length of four.
Under the ETI coding scheme, the AFav is close to a
constant when the word length is four bits. The graph of AFav
for ETI for m = 8 under different word lengths is shown in
Fig. 13. These results show that the AFav is the smallest when
the word length is four.
This paper also conducts simulations to verify the analysis
result. In these simulations, the TIC, ETIpre , and ETI schemes
encode eight independent random streams under four kinds of
word lengths. Table V shows the reduction of the transition for
various encoding schemes with different word length. These
simulation results are consistent with the analysis above and
similar to the results in Fig. 13.
V. E NERGY A NALYSIS
A. Interconnect Capacitance and Optimum Selective Spacing
The dynamic energy dissipation per unit length of an
interconnect is
MC F × Cc ) × V D2 D
(9)
E = 0.5AF × (2C g +
where C g and Cc are the line-to-ground capacitance and
coupling capacitance, and the MCF is the miller coupling
factor. Based on the C g and Cc formulas modeled in [22], the
C g is proportional to the wire width w and the Cc decreases
as the wire spacing s increases. The total average
interconnect
capacitance Ct,av per unit length is (2C g + MC F × Cc ).
To obtain the minimum C g , we choose wire width w as
the minimum wire width w0 in various fabrication process.
The w0 is 0.14 and 0.2 μm for TSMC 90- and 130-nm
CMOS technologies, respectively. This paper fixes w to w0
and changes wire spacing s. The term soptct represents the
wire spacing for obtaining the minimum total capacitance of
two adjacent wires. Simulation results indicate that the soptct
soptct occurs at 1.07 μm for 90-nm CMOS technologies. This
soptct is much larger than the w0 .
Define An as the total wire area of n parallel wires. We
assume that the initial wire and space width are both equal to
w0 . The total area of the n parallel wires An is nw0 +(n−1)w0 .
As shown in (10) to (11), we propose a selective spacing (SS)
scheme to choose the optimum wire spacing (sopt ) under the
constraint of minimizing the total capacitance and total wire
area.
For n input bitstreams under multiplexing degree m, the
number of serial link is n/m. The soptct is the wire spacing
for obtaining the minimum total capacitance between two adjacent wires. Selecting the wire space as soptct for small m makes
the An/m larger than An since the soptct > w0 . Therefore,
for small m, we select a wire space such that An/m =
An . The wire space is obtained by (An − n/mw0 )/n/m
and equals 2mw0 − w0 . The m 0 is defined as the minimum
multiplexing degree such that its corresponding wire spacing
is larger than soptct . That is, m 0 is the minimum number that
1806
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013
Fig. 14. Normalized serial link capacitance as a function of the degree of
multiplexing m with n = 64. 90 nm: w0 = 140 nm, s0 = 140 nm, h = 440 nm,
t = 310 nm, and ε = 2.9. 130 nm: w0 = 200 nm, s0 = 200 nm, h = 550 nm,
t = 370 nm, and ε = 3.7.
Fig. 15. Area and RC delay reduction versus the m with n = 64. 90 nm:
w0 = 140 nm, s0 = 140 nm, h = 440 nm, t = 310 nm, and ε = 2.9. 130 nm:
w0 = 200 nm, s0 = 200 nm, h = 550 nm, t = 370 nm, and ε = 3.7.
satisfies the wire spacing (2m 0 w0 − w0 > soptct ). For 90-nm
CMOS technology, when the soptct equals to 1.07 μm and
the w0 equals to 0.09 μm, the m 0 = 2. That means when
multiplexing degree is larger or equals to two, the wire spacing
obtained by (An − n/mw0 )/n/m is larger than soptct and
under this case we choose sopt equals to soptct to minimize
the total wire area. Under the case m < m 0 , we select the
wire spacing as 2mw0 − w0 . When m > 2 (w0 = 0.2 μm) in
130 nm and m > 4 (w0 = 0.14 μm) in 90 nm, the selected
optimum wire spacing is soptct
wopt = w0
sopt =
2mw0 − w0 , m < m 0
soptct ,
m ≥ m 0,
(10)
0 −w0
where 2m 0swoptct
≥ 1.
(11)
Under the 130- and 90-nm CMOS technologies, the total
interconnect capacitor versus m is shown in Fig. 14. This total
Fig. 16. Normalized serial link energy dissipation as a function of m for
130-nm CMOS process under AFi,av = 0.3, 0.5, and 0.7.
capacitance is normalized to the interconnect capacitance of
the n parallel wires. The reduction of the total capacitance is
larger than that in [16] when m ≥ 4.
The area and RC delay reduction versus m with n = 64
is shown in Fig. 15. The wire spacing sopt is based on (11).
When m is greater than two for 130 nm or m is greater than
four for 90 nm, the total wire area is smaller than that of the
n wires. Since total capacitance after multiplexing is reduced,
the propagation delay after multiplexing is also reduced with
the resistance unchanged. The actual values of capacitance
in Fig. 14 and area/RC values in Fig. 15 are summarized
in Table VI. The initial spacing s0 is the same as w0 . The
w0 values in ours and [16] for 130- and 90-nm technologies
are (200, 240 nm) and (140, 160 nm), respectively. The total
capacitance per wire is 2(Cc + C g ) times the wire length.
For 130-nm technology, our total capacitance and RC values
of the parallel bus are larger than theirs without multiplexing
(m = 1). With our width and spacing selection scheme, the
reduction of our RC values is better for m = 8. For 90-nm
technology, our total capacitance and RC values of the parallel
bus are smaller than theirs without multiplexing (m = 1). With
our width and spacing selection scheme, the reduction of our
RC values is better for m = 8. The RC reduction for 90 nm
is larger than 130 nm. As for the area, since our scheme aims
at minimizing the total area by using small spacing values,
our areas are smaller than theirs for both 130- and 90-nm
technologies.
In [16], the criterion to choose the width and spacing of a
wire in serial links is to minimize the total wire capacitance in
the serial links. In order to minimize the ground capacitance,
the minimum wire width w0 is chosen. The optimum spacing
Sopt is the maximum possible spacing between wires in order
to minimize the coupling capacitance. Since the maximum
possible spacing is chosen, so the total area n/mw0 +
(n/m − 1)Sopt is still large for a large multiplexing index
m over n wires parallel bus. Therefore, the area reductions in
[16] are limited for both 90- and 130-nm CMOS technologies.
Compared to the width and spacing selection schemes
in [16], our approach aims at reducing both the total area
CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY
1807
TABLE VI
VALUE OF C APACITANCE , RC, AND A REA AT T ECHNOLOGY OF 130 AND 90 nm
130 nm
90 nm
130 nm [16]
90 nm [16]
Cc = 93.52 fF/mm Cg = 23.52 fF/mm
Cc = 87.412 fF/mm Cg = 16.06 fF/mm
Cc = 63.86 fF/mm Cg = 42.63 fF/mm
Cc = 74.59 fF/mm Cg = 34.82 fF/mm
R = 297.28 ohm/mm (m = 1)
R = 506.9 ohm/mm (m = 1)
R = 246.42 ohm/mm (m = 1)
R = 505.5 ohm/mm (m = 1)
Capacitance
RC
Area
Capacitance
RC
Area
Capacitance
RC
Area
Capacitance
RC
Area
(fF)
(aS)
(μm2 ∗ mm)
(fF)
(aS)
(μm2 ∗ mm)
(fF)
(aS)
(μm2 ∗ mm)
(fF)
(aS)
(μm2 ∗ mm)
m=1
702.52
6.26
28.19
620.86
9.44
16.54
638.91
4.72
33.75
656.45
9.96
16.45
m=2
396.13
3.53
27.75
320.14
4.87
16.28
360.35
2.66
33.22
338.47
5.13
16.19
m=4
363.77
3.24
25.2
279.67
4.25
15.75
330.96
2.45
32.15
295.67
4.48
15.67
m=8
363.77
3.24
11.88
279.47
4.25
8.0
343.73
2.54
30.03
304
4.61
14.64
TABLE VII
E NERGY D ISSIPATION W ITH D IFFERENT AF i,av AND m AT 130- AND 90-nm T ECHNOLOGY
130 nm
0.3
This Paper
0.5
0.7
m=1
970.79
1618
m=2
463.51
m=4
621.73
m=8
576.22
90 nm
0.3
[16]
0.5
0.7
2265.2
883.23
1472
2060.9
531.55
633.61
485.78
809.6
1133.5
402
576.64
531.55
640.61
762.5
640.5
477.99
576.22
576.22
640.61
763.3
640.5
422.68
422.68
of serial links and wire capacitances. Therefore, our resulting
areas are scaled down when a higher m is used.
The RC delay for 90- and 130-nm technologies [16] does
scale similar to the rest values is because the total capacitance
values and the corresponding RC values are reduced based on
the criterion of minimizing the total wire capacitance in the
serial links.
The average energy dissipation of a serial link is directly
proportional to the total interconnect capacitance, the supply
voltage, and average AF. The energy dissipation of n/m
serial links normalized to that of the parallel bus (m = 1) is
expressed as follows:
n
m
× 0.5AF E T I,av (m) × Ct,av (m) × V D2 D
n × 0.5AFi,av × Ct,av (1) × V D2 D
0.7
0.3
[16]
0.5
0.7
858.28
1430.5
2002.7
907.48
1512.5
2117.5
461
549.52
467.35
778.9
1090.5
443.32
408.66
571.9
680.9
571.7
422.68
571.9
680.9
571.7
proposed algorithm for the 130- and 90-nm technologies is
similar so they are not depicted. For 130-nm technology, since
our scheme has larger capacitance than theirs for m = 1, our
initial energy is larger than theirs. With our coding scheme and
width/spacing selection, our energy dissipation is smaller than
theirs for all m > 1. The energy dissipation of our scheme is
smaller than theirs for all m under 90-nm technology.
VI. E XPERIMENTAL R ESULTS
B. Energy Dissipation
E av (m)
=
E av (1)
0.3
This Paper
0.5
. (12)
The AF E T I,av for different m and AFi,av are shown
in (3)–(5). The energy saving of the proposed ETI scheme
and the method in [16] with AFi,av = 0.3, 0.5, and 0.7 under
130-nm technology is shown in Fig. 16. The actual values of
energy dissipation in Fig. 16 are summarized in Table VII.
For AFi,av = 0.3 and m = 2, the energy saving is the
same as the reduction in average wire capacitance, because
the encoding eliminates the AF penalty due to serialization.
For AFi,av = 0.3 and m ≥ 4, encoding does not eliminate
all the AF penalty, and a net penalty in AF remains. For
AFi,av = 0.5 and 0.7, the reduction of AF dominates the
energy saving. The minimum energy dissipation is achieved
when m equals to four and AFi,av is 0.7 with a 75% energy
reduction. It is because the average activity reduction increases
when the AFi,av increases and the maximum reduction of
wire capacitance occurs at m = 4. The energy saving of the
A. AF Simulation
Table VIII shows the simulation results by applying real
video streams. There are three input video streams (For eman,
News, and Mobile in compressed and uncompressed format)
based on YUV CIF 4:2:0 video with 300 frames. According
to the simulation results, the average AF of the input video
streams is around 0.5. For the uncompressed video streams, the
first ten frames are used in our simulation. The original number
of transition in the uncompressed video stream depends on
the content. Compared with the serial link scheme (S), the
ETI scheme is more efficient than other schemes, achieving
the reduction of 28.56%, 32.91%, and 33.4% for For eman,
News, and Mobile videos, respectively. The compressed
video streams are encoded by H.264 with 300 frames in
our simulation. For the word length of four bits, the ETI
reduces 31% transition compared with the S scheme for all
three compression videos. No adjacent transition (NAT) coding
techniques [23] adopt the limited weight coding scheme and
transition signaling to minimize power and eliminate crosstalk.
Table IX shows the simulation results with self generated
bitstreams by the C code. The number of bistreams in the
parallel bus is eight, and the AF of each bitstream is 0.5.
Three points can be noticed. First, the result of the ETI
scheme using word lengths of four and eight matches the
1808
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013
TABLE VIII
S IMULATION R ESULTS FOR Foreman, N ews, AND Mobile I NPUT V IDEOS
Input Video: Foreman, N ews, and Mobile
Transition Reduction (%)
Data Format
Video
Total Bit Number Original Transition
NAT [23] TIC [9](WL = 8) ETI(WL = 8) ETI(WL = 4)
SE
Foreman
Compression
N ews
(H.264, 300 frames) Mobile
16,868,992
6,229,312
45,933,568
8,425,162
3,106,504
22,952,388
−0.05%
−0.02%
−0.14%
−0.12%
−0.24%
−0.05%
14.8%
14.7%
14.8%
27.12%
27%
27.1%
31.21%
31.1%
31.2%
Foreman
N ews
Mobile
12,165,120
12,165,120
12,165,120
6,101,319
6,148,441
6,216,989
0.832%
6.52%
3.3%
1.738%
1.17%
1.81%
11.27%
16.63%
17.36%
20.03
27.17%
26.37%
28.56%
32.91%
33.4%
Uncompressed
(YUV, 10 frames))
TABLE IX
S IMULATION R ESULTS W ITH S ELF G ENERATED
I NPUT D ATA BY THE C C ODE
Eight Random Input Bitstreams, (AFi,av = 0.5)
Total Bits Number = 640 000
Transitions
% Reduction
(Compared
with S)
% Reduction
(Compared
with TIC)
Serial stream (S)
319 909
-
−17.55%
ETI (WL = 4)
220 035
31.25%
19.1%
ETI (WL = 8)
232 473
27.5%
14.57%
Coding Method
TIC [9]
272 142
15.00%
-
SILENT(ES) [1]
319 875
0.01%
−17.53%
SE
319 907
0.01%
−17.55%
321 705
−0.007%
−18.2%
NAT
(After serialized) [23]
(a)
analysis in Section IV. Second, the performance of SILENT
is very weak since it reduces transitions by only 0.01%. This
is because this coding scheme is only suitable for applications
with strong correlation. The ETI scheme is more efficient than
other schemes, achieving a reduction of 31.25%.
(b)
(c)
B. Power Simulation
Fig. 17.
(a) Wire model, (b) serializer, and (c) deserializer architectures.
We compare the energy dissipation of our proposed ETI
serial link with the parallel bus using the TSMC 130- and
90-nm CMOS technology. A 64-bit parallel bus is constructed.
The wire load model is shown in Fig. 17(a). The wire in
the bus is capacitively coupled to its two neighbors. The
capacitance of the wire is calculated using Berkeley Predictive
Technology Model [24]. The clock period in the wire is
denoted as Tcyc . The clock period is designed to meet the delay
from the wire, including repeaters and the circuit component
(such as D-FF).
Denote PP as the maximum power of the parallel bus when
the AFs (AFi,av , i = 1, . . . , n) of all the n wires equal to one.
In our simulation, we adopt a 3-mm wire that is represented
by a wire model.
PP is obtained by HSPICE simulation with each wire with
clock period Tcyc . The POP is the total power of the overhead
circuits in the parallel bus. The overhead circuits include
synchronizers. Assume that the AFi,av is the same for all the
wires in the parallel bus. For different AF AFi,av , the total
power of the parallel bus is AFi,av × PP + POP .
For the ETI serial link architecture, there are n/m serial
links. Equations (3)–(5) show the average accumulated AF
AF E T I,av for an ETI serial link that is multiplexed from m
wires. The AF E T I,av is a function of the AFi,av and m. The
normalized AF E T I,av is AF E T I,av /m. The PE is the power
of a serial link. The PO E is the power of the overhead
circuits in an ETI serial link that includes the SerDes, ETI
encoder, and decoder. The architectures of the SerDes are
shown in Fig. 17(b) and (c). The architectures of ETI encoder
and decoder are shown in Section III. The total power of
the overhead circuit in all the ETI serial links PT O E is
(n/m) × PO E . The power of the ETI serial link to the parallel
bus is shown below
AF
E T I,av
(n/m)
× PE + PO E
m
E E T I,av (m)
=
.
E P,av (1)
AFi,av × PP + POP
(13)
CHIU et al.: ETI CODING WITH LOW SWITCHING ACTIVITY
1809
TABLE X
VII. C ONCLUSION
D IMENSIONS , PARAMETERS , AND P OWER S AVING FOR 3-mm W IRE
This paper presented the ETI coding scheme to reduce the
power dissipation of a serial link. The ETI scheme uses the
phase difference between the clock and the data to reduce
the switching activity of the serial link. This ETI reduces the
number of transitions by up to 31% compared with the serial
scheme. The analysis and simulation results indicated that the
proposed coding scheme achieves fewer transitions for most
data patterns. The analysis and simulation results indicated
that the proposed coding scheme produces a low bit transition
for different kinds of data patterns. Using the optimum degree
of multiplexing, optimum width, and spacing, the ETI coding
scheme achieves 30%–60% energy reduction compared with
the parallel bus without overhead. The power saving is up
to 31.71% and 26.46% at Tcyc of 250 ps for the 90-nm and
130-nm CMOS technology for m = 2 adding circuit overhead.
A larger bandwidth was needed in the ETI coding scheme due
to phase shift. The last bit has half the pulse width of the other
bits so that the interconnect has twice the band width. It means
that the serial link needs to run a much frequency. The higher
clock frequency leads to problems such as buffering, clock
synchronization, and design complexity. The other way is to
wait an extra bit to check the transition information but that
would lower the overall bit rate.
L ENGTH AT 130- AND 90-nm T ECHNOLOGIES . (90 nm: h = 440 nm,
t = 310 nm, AND ε = 2.9; 130 nm: h = 550 nm, t = 370 nm, AND ε = 3.7)
64-Parallel-Line Bus
64/32-Serial-Link Bus
m=1
m=2
Tech
(nm)
W
Cg
Cc
Tcyc
W
Cg
Cc
Power
Tcyc
(μm) (aF/μm) (aF/μm) (ps) (μm) (aF/μm) (aF/μm) (ps)
Saving
(%)
130
0.2
23.52
93.52
1000
0.6
40.38
25.64
500
29.14
130
0.2
23.52
93.52
500
0.6
40.38
25.64
250
26.46
90
0.14
16.06
87.41
1000 0.42
24.56
28.79
500
35.98
90
0.14
16.06
87.41
500
24.56
28.79
250
31.71
0.42
TABLE XI
P OWER S AVING FOR D IFFERENT m U NDER THE S AME T HROUGHPUT
m
Tcycle
( ps )
S
130 nm
90 nm
w = 0.2 μm
w = 0.14 μm
Cg
Cc
Power
S
Cg
Cc
Power
(μm) (aF/μm) (aF/μm) Saving(%) (μm) (aF/μm) (aF/μm) Saving(%)
2 1000
0.6
40.38
26.64
29.14
0.42
28.8
24.56
35.29
4
500
1.3
52.15
8.47
38.57
0.98
39.17
7.44
48.03
8
250
1.3
52.15
8.47
26.38
1.07
40.09
6.49
34.66
R EFERENCES
Table X shows the power saving results. Under a fixed m,
the higher the frequency is, the lower the power saving is. It
is because the increase of PO E is higher than PP and POP
at higher frequencies. The more advanced the technology is,
the higher power saving is. The power saving is up to 31.71%
and 26.46% at Tcyc of 250 ps for the 90- and 130-nm CMOS
technology for m = 2.
Table XI shows the power saving for different m with the
same total throughput at 90-nm and 130-nm technologies. To
have the same throughput, the higher the m is, the higher the
frequency in the serial link is. Although the increase of PO E
is higher than PP and POP at higher frequencies, Table XI
shows that m = 4 has higher power saving than m = 2. It
is because m = 4 has much smaller capacitance than m = 2
under our selective spacing scheme.
The clock frequency of the serial links is much higher
than that of the parallel bus. The higher clock frequency
leads to problems such as buffering, clock synchronization,
and design complexity. As shown in Table X, our implemented ETI encoder and decoder achieve good power saving
for operating the serial link up to 2 GHz. High frequency
circuit design techniques are needed to overcome those problems when adopting the proposed ETI scheme to higher
frequency serial link. As we know, serial link interface is
common adopted on off-chip interconnect, such as USB and
SATA. The serial link technologies have improved dramatically these years [25]–[27]. For example, the data rates
of USB devices increase from 480 Mb/s to 5 Gb/s in just
a few year. PDs that operate at 1∼10 GHz are reported
[28]–[31]. With these advanced serial link technologies, our
proposed ETI architecture is suitable for reducing link power.
[1] K. Lee, S. J. Lee, and H. J. Yoo, “SILENT: Serialized low
energy transmission coding for on-chip interconnection networks,” in
Proc. IEEE Int. Conf. Comput.-Aided Design Conf., Nov. 2004, pp.
448–451.
[2] M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 3, no. 1, pp.
49–58, Mar. 1995.
[3] Y. Shin, S. I. Chae, and K. Choi, “Partial bus-invert coding for power
optimization of application-specific systems,” IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 9, no. 2, pp. 377–383, Apr. 2001.
[4] R. B. Lin and C. M. Tsai, “Weight-based bus-invert coding for lowpower applications,” in Proc. Int. Conf. VLSI Design, Jan. 2002, pp.
121–125.
[5] C. H. Kuo, W. B. Wu, Y. J. Wu, and J. H. Lin, “Serial low power bus
coding for VLSI,” in Proc. IEEE Int. Conf. Commun., Circuits Syst.,
Jun. 2006, pp. 2449–2453.
[6] S. Zogopoulos and W. Namgoong, “High-speed single-ended parallel
link based on three-level differential encoding,” IEEE J. Solid-State
Circuits, vol. 44, no. 2, pp. 549–558, Feb. 2009.
[7] S. R. Sridhara and N. R. Shanbhag, “Coding for reliable on-chip buses: A
class of fundamental bounds and practical codes,” IEEE Trans. Comput.Aided Design Integr. Circuits Syst., vol. 26, no. 5, pp. 977–983, May
2007.
[8] P. T. Huang, W.-L. Fang, Y.-L. Wang, and W. Hwang, “Low power
and reliable interconnection with self-corrected green coding scheme
for network-on-chip,” in Proc. 2nd ACM/IEEE Int. Symp. Netw. Chip,
Apr. 2008, pp. 77–84.
[9] R. Abinesh, R. Bharghava, and M. B. Srinivas, “Transition inversion
based low power data coding scheme for synchronous serial communication,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI Conf., May
2009, pp. 103–108.
[10] S. J. Lee, S. J. Song, K. Lee, J. H. Woo, S. E. Kim, B. G. Nam, and H.
J. Yoo, “An 800 MHz star-connected on-chip network for application
to systems on a chip,” in IEEE Int. Solid-State Circuits Conf. Technol.
Dig., Feb. 2003, pp. 468–469.
[11] K. Lee, S. J. Lee, and S. E. Kim, “A 51 mW 1.6 GHz on-chip network
for low-power heterogeneous SoC platform,” in Proc. IEEE Int. SolidState Circuits Conf., Feb. 2004, pp. 152–518.
[12] K. Lee, S. J. Lee, and H. J. Yoo, “Low-power network-on-chip for highperformance SoC design,” IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol. 14, no. 2, pp. 148–160, Feb. 2006.
1810
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 10, OCTOBER 2013
[13] R. Narasimha and N. Shanbhag, “Design of energy-efficient high-speed
links via forward error correction,” IEEE Trans. Circuits Syst. II, Exp.
Briefs, vol. 57, no. 5, pp. 359–364, May 2010.
[14] E. Azadi, H. Ghasemizadeh, A. Khoei, and K. Hadidi, “5 Gb/s phase
tracking clock recovery with new line coding in 0.35u CMOS process,”
in Proc. Int. Symp. Telecommun., 2010, pp. 458–463.
[15] D. N. Sarma and G. Lakshminarayanan, “Encoding technique for
reducing power dissipation in network on chip serial links,” in Proc.
Int. Conf. Comput. Intell. Commun. Syst. Conf., 2011, pp. 323–327.
[16] M. Ghoneima, Y. Ismail, M. Khellah, J. Tschanz, and V. De, “Serial-link
bus: A low-power on-chip bus architecture,” IEEE Trans. Circuits Syst.
I, Reg. Papers, vol. 56, no. 9, pp. 2020–2032, Sep. 2009.
[17] W. C. Huang, C. H. Lin, and C. T. Chiu, “Embedded transition inversion
coding for low power serial link,” in Proc. IEEE Workshop Signal
Process. Syst. Conf., Oct. 2011, pp. 102–105.
[18] B. Razavi, “Challenges in the design of high-speed clock and data
recovery circuits,” IEEE Commun. Mag., vol. 40, no. 8, pp. 94–101,
Aug. 2002.
[19] S. Kaeriyama, Y. Amamiya, H. Noguchi, Z. Yamazaki, T. Yamase,
K. Hosoya, M. Okamoto, S. Tomari, H. Yamaguchi, H. Shoda, H.
Ikeda, S. Tanaka, T. Takahashi, R. Ohhira, A. Noda, K. Hijioka,
A. Tanabe, S. Fujita, and N. Kawahara, “A 40 Gb/s multi-data-rate
CMOS transmitter and receiver chipset with SFI-5 interface for optical
transmission systems,” IEEE J. Solid-State Circuits, vol. 44, no. 12,
pp. 3568–3579, Dec. 2009.
[20] K. Fukuda, H. Yamashita, G. Ono, R. Nemoto, E. Suzuki, N. Masuda,
T. Takemoto, F. Yuki, and T. Saito, “A 12.3-mW 12.5-Gb/s complete
transceiver in 65-nm CMOS process,” IEEE J. Solid-State Circuits, vol.
45, no. 12, pp. 2838–2850, Dec. 2010.
[21] R. Reutemann, M. Ruegg, F. Keyser, J. Bergkvist, D. Dreps, T. Toifl,
and M. Schmatz, “A 4.5 mW/Gb/s 6.4 Gb/s 22+1-lane source synchronous receiver core with optional cleanup PLL in 65 nm CMOS,”
IEEE J. Solid-State Circuits, vol. 45, no. 12, pp. 2850–2861, Dec.
2010.
[22] P. P. Sotiriadis and A. Chandrakasan, “Bus energy minimization by
transition pattern coding (TPC) in deep submicron technologies,” in
Proc. IEEE/ACM Int. Comput.-Aided Design Conf., Nov. 2000, pp. 320–
327.
[23] P. Subrahmanya, R. Manimegalai, V. Kamakoti, and M. Mutyam, “A
bus encoding technique for power and cross-talk minimization,” in Proc.
IEEE Conf. VLSI Design, Jun. 2004, pp. 443–448.
[24] C. Hu. (1996). Berkeley Predictive Technology Model [Online]. Available: http://ww-w-device.eecs.berkeley.edu/∼ptm
[25] J. Kim, J. K. Kim, B. J. Lee, and D. K. Jeong, “Design optimization
of on-chip inductive peaking structures for 0.13 μm CMOS 40 Gb/s
transmitter circuits,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56,
no. 12, pp. 2544–2555, Dec. 2009.
[26] S. Kaeriyama, Y. Amamiya, H. Noguchi, Z. Yamazaki, T. Yamase,
K. Hosoya, M. Okamoto, S. Tomari, H. Yamaguchi, H. Shoda, H.
Ikeda, S. Tanaka, T. Takahashi, R. Ohhira, A. Noda, K. Hijioka,
A. Tanabe, S. Fujita, and N. Kawahara, “A 40 Gb/s multi-data-rate
CMOS transmitter and receiver chipset with SFI-5 interface for optical
transmission systems,” IEEE J. Solid-State Circuits, vol. 44, no. 12, pp.
3568–3579, Dec. 2009.
[27] N. Nedovic, A. Kristensson, S. Parikh, S. Reddy, S. McLeod, N.
Tzartzanis, K. Kanda, T. Yamamoto, S. Matsubara, M. Kibune, Y. Doi,
S. Ide, Y. Tsunoda, T. Yamabana, T. Shibasaki, Y. Tomita, T. Hamada,
M. Sugawara, T. Ikeuchi, N. Kuwata, H. Tamura, J. Ogawa, and W.
Walker, “A 3 Watt 39.8–44.6 Gb/s dual-mode SFI5.2 SerDes chip set
in 65 nm CMOS,” IEEE J. Solid-State Circuits, vol. 45, no. 10, pp.
2016–2029, Oct. 2010.
[28] J. E. Rogers and J. R. Long, “A 10-Gb/s CDR/DEMUX with LC delay
line VCO in 0.18 μm CMOS,” IEEE J. Solid-State Circuits, vol. 37, no.
12, pp. 1781–1789, Dec. 2002.
[29] M. Sorna, T. Beukema, K. Selander, S. Zier, B. Ji, P. Murfet, J. Mason,
W. Rhee, H. Ainspan, and B. Parker, “A 6.4 Gb/s CMOS SerDes core
with feedforward and decision-feedback equalization,” in Proc. IEEE
Int. Solid-State Circuits Conf., Feb. 2005, pp. 62–63.
[30] C. F. Liao and S. I. Liu, “A 40 Gb/s CMOS serial-link receiver with
adaptive equalization and clock/data recovery,” IEEE J. Solid-State
Circuits, vol. 43, no. 11, pp. 2492–2502, Nov. 2008.
[31] S. H. Lin and S. I. Liu, “Full-rate bang-bang phase/frequency detectors
for unilateral continuous-rate CDRs,” IEEE Trans. Circuits Syst. II, Exp.
Briefs, vol. 55, no. 12, pp. 1214–1218, Dec. 2008.
Ching-Te Chiu received the B.S. and M.S. degrees
from National Taiwan University, Taipei, Taiwan, in
1986 and 1988, respectively, and the Ph.D. degree
from University of Maryland, College Park, MD, in
1992, all in electrical engineering.
She was an Associate Professor with National
Chung Cheng University, Chia-Yi, Taiwan, from
1993 to 1994. From 1994 to 1996, she was a
Technical Staff Member with AT&T, Murray Hill,
NJ, and Lucent Technologies, Murry Hill, from 1996
to 2000, and with Agere Systems, Santa Clara, CA,
from 2000 to 2003. In 2004, she joined the Computer Science Department,
Institute of Communications Engineering, National Tsing Hua University,
Hsinchu, Taiwan, as an Associate Professor. Her current research interests
include high-Speed SerDes design, multichip interconnect, fault tolerance
for network-on-chip, high dynamic range image and video processing, high
definition television video decoder chip design, and SONET and SDH mapper
and framer IC design.
Dr. Chiu was the recipient of the First Prize, the Best Advisor Award and
the Best Innovation Award of the Golden Silicon Award in 2006. She is a
TC Member of the IEEE Circuits and Systems Society, Nanoelectronics and
Gigascale Systems Group, and the IEEE Signal Processing Society, Design
and Implementation of Signal Processing Systems Group. She is the Program
Chair of the first IEEE Signal Processing Society Summer School at Hsinchu
in 2011.
Wen-Chih Huang received the B.S. degree from
Chung Yuan Christian University, Jhongli City, Taiwan, and the M.S. degree from the National Tsing
Hua University, Hsinchu, Taiwan, in 2008 and 2010,
respectively, both in computer engineering.
He joined MStar Semiconductor, Hsinchu, in 2011,
where he is engaged in designing transport stream
processors for the digital TV products. His current
research interests include low-power techniques and
circuit designs.
Chih-Hsing Lin received the M.S. degree in
electric engineering from Chung Yuan Christian
University, Chung-Li, Taiwan, and the Ph.D. degree
from the Institute of Communications Engineering,
National Tsing-Hua University, Hsinchu, Taiwan,
in 2002 and 2011, respectively.
He is currently an Associate Researcher with
the Advanced Technology Division, National
Chip Implementation Center, Hsinchu. His current
research interests include clock synchronous,
frequency synthesizers, low power, very large-scale
heterogeneous integration design, and image processes.
Wei-Chih Lai received the B.S. degree from
National Central University, Jhongli, Taiwan, and the
M.S. degree from National Tsing Hua University,
Hsinchu, Taiwan, in 2009 and 2011, respectively,
both in computer science.
He is currently with United Microelectronics Corporation. His current research interests include highspeed switch and data center networks.
Ying-Fang Tsao received the B.S. degree in computer engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2011, where she is
currently pursuing the M.S. degree.
Her current research interests include design of
integrated circuits and image and video signal
processing.
Download