IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO. 2, FEBRUARY 2018
A Parallel Stochastic Number Generator With
Bit Permutation Networks
Vikash Sehwag, Student Member, IEEE, N. Prasad, Student Member, IEEE,
and Indrajit Chakrabarti, Member, IEEE
Abstract—Stochastic computing (SC) is a promising paradigm
to realize low-complexity digital circuits that are tolerant to soft
errors. Stochastic circuits include a stochastic number generator (SNG) to generate a stochastic number that corresponds
to a given binary number. Conventional SNGs, which employ
linear feedback shift registers (LFSRs) to generate stochastic
numbers serially, incur a significant cost in time. In
this brief, a parallel SNG has been proposed, which can generate
stochastic numbers in parallel by transforming the input binary
number to a modified unary number and permuting it using a
bit permutation network. Further, a method to share a single
LFSR among multiple SNGs has been presented. Experimental
results show that the proposed SNG can achieve improvement in
SC correlation and energy-delay-product by 28.57% and 4.32×,
respectively, when compared to the existing shared LFSR-based
SNG. For applications, such as edge detector, multiplier, and
complex multiplication, the proposed SNG has achieved reduction in execution time and area-delay-product by up to 1000×
and 9×, respectively, as compared to others.
Index Terms—Bit permutation, low latency, omega-flip
network, stochastic computing, stochastic number.
I. INTRODUCTION
Stochastic computing (SC) is an emerging paradigm
for energy-efficient and error-tolerant computing [1], [2].
Unlike the conventional computing approach, it uses probability as the baseline measure for generating stochastic numbers,
on which arithmetic operations are then performed. Recent improvements in SC suggest that it can achieve higher accuracy in its
outputs with far fewer hardware resources than the
conventional computing approach [3]–[5].
Stochastic circuits include a stochastic number generator
(SNG), which generates a stochastic number corresponding to a given binary number. Traditional stochastic
circuits with conventional SNGs, which employ linear feedback
shift registers (LFSRs) to generate stochastic numbers serially, suffer from high computation latency. To resolve this,
various methods, such as binary stochastic and range segmentation, have been proposed to parallelize the stochastic
computations [6], [7]. However, these methods are not effective unless SNGs generate stochastic numbers in parallel and at
the same throughput as their binary counterparts. Though the
state-of-the-art SNGs improve the latency encountered in generating stochastic numbers [8], [9], a fully parallel SNG
helps in achieving higher throughput of the design, as well
as in leveraging the resources when multiple
SNGs are considered. However, parallelizing an SNG should
not increase the Stochastic Computing Correlation (SCC), a
metric that measures the correlation between two stochastic
numbers, or the resulting error in the output of a
stochastic circuit.
Manuscript received May 3, 2017; revised May 13, 2017; accepted
May 22, 2017. Date of publication May 25, 2017; date of current version January 29, 2018. This brief was recommended by Associate Editor
C.-T. Cheng. (Corresponding author: N. Prasad.)
The authors are with the Department of Electronics and Electrical
Communication Engineering, Indian Institute of Technology Kharagpur,
Kharagpur 721302, India (e-mail: sehwag.vikash@gmail.com;
nprasad@ece.iitkgp.ernet.in; indrajit@ece.iitkgp.ernet.in).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2017.2708128
In this brief, techniques for further reducing the time
complexity and improving other VLSI design metrics of SNGs are
addressed. Further, an SNG architecture with a latency of one
clock cycle is proposed. Bit permutation networks are
used in the proposed SNG to achieve more randomness
in the output bit streams, which further improves the SCC value compared to other recently proposed
SNGs [8], [9]. Moreover, an approach to use a single LFSR
for multiple SNGs is presented. The proposed SNG is also compared experimentally
with existing SNGs in terms of metrics such as SCC, area, power,
power-delay-product, and energy-delay-product.
The rest of the brief is organized as follows. Section II
discusses the background on existing SNGs and the omega-flip network. Section III presents the proposed SNG.
Section IV presents the experimental results. Section V compares the performance of the proposed SNG with that of other SNGs in applications.
Section VI concludes the brief and outlines future work.
II. BACKGROUND
A. Existing SNGs
In SC, the first step is to convert a binary number into a
stochastic number by generating a random bit stream in which the
probability of a one is proportional to the given binary number. SNGs
are employed to accomplish this. Few works address SNG
design. Gupta and Kumaresan [10]
proposed the conventional SNG, which employs a log(n)-bit LFSR and a weighted binary generator to generate an
n-bit stochastic number from a given log(n)-bit binary number (Fig. 1). It generates the stochastic number in a bit-serial
fashion with a latency of n clock cycles [2].
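As a rough illustration of the bit-serial generation described above, the following Python sketch models a 4-bit case. The feedback polynomial (x^4 + x^3 + 1), the seed, and the function names are assumptions chosen for illustration and are not taken from [10]. Because a maximal-length LFSR never visits the all-zero state, the sketch emits 15 bits rather than 16.

```python
def lfsr4_states(seed=0b0001):
    """Yield the 15 non-zero states of a 4-bit maximal-length LFSR.

    Assumed feedback polynomial: x^4 + x^3 + 1 (illustrative choice).
    The seed must be non-zero.
    """
    state = seed
    for _ in range(15):
        yield state
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0xF


def serial_sng(x, seed=0b0001):
    """Generate a stochastic bit stream for the 4-bit value x, one bit per cycle.

    Weighted binary generator: the leading 1 of the LFSR state selects
    which bit of x is emitted, so bit i of x is chosen in 2**i of the
    15 states and the stream contains exactly x ones.
    """
    stream = []
    for state in lfsr4_states(seed):
        select = state.bit_length() - 1  # position of the leading 1
        stream.append((x >> select) & 1)
    return stream


if __name__ == "__main__":
    for x in (3, 8, 13):
        bits = serial_sng(x)
        print(x, "".join(map(str, bits)), sum(bits))
```

Each output bit costs one LFSR step, which is the serial latency that a parallel SNG aims to remove.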
Later, Ichihara et al. [8] proposed SNG designs that
share an LFSR among multiple SNGs. The advantage of this architecture is a reduced SNG area compared
to the conventional one. Though it achieves low SCC values
for a few of the considered applications, its average SCC is shown
to be higher than that of the conventional one. Recently,
Kim et al. [9] proposed an energy-efficient SNG for
stochastic circuits with improved progressive precision. This
technique employs three sub-blocks: even-distribution
encoding, an inter-group randomizer, and an intra-group randomizer, with
LFSR inputs. Though the architecture is energy-efficient,
its sub-blocks cannot be shared among multiple SNGs.
This results in significant area overhead in complex stochastic
circuits where multiple SNGs are required.
© 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
1549-7747 See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1. A 4-bit conventional SNG.
Fig. 2. A four-stage omega-flip network with two omega (O) stages followed by two flip (F) stages.
Fig. 3. An N-to-(2^N − 1) binary to modified unary code encoder.
B. Omega-Flip Network and Its Application in SNGs
Bit permutation networks have been used in many applications, such as cryptography, to randomise a given bit stream.
Several bit permutation networks are described in [11]; the
butterfly, inverse butterfly, and shuffle networks are
widely used. Another popular bit permutation network is the
omega-flip network, which is a modified version of cascaded
butterfly/inverse butterfly and shuffle stages. Fig. 2 shows a
four-stage omega-flip network for an input (I1...I8) and output (O1...O8) size of eight bits. Given no repetitions,
the total number of possible permutations of a bit stream of size n is n!. Without loss of generality,
it is assumed that n is a power of 2. To achieve
all possible permutations, an omega-flip network requires log(n!) control bits [12].
Since n! < n^n, at most n log(n) bits
are needed to achieve all possible permutations of a
stream of n bits. Employing an omega-flip network in an
SNG increases the randomness of the stochastic number, as
it operates on a larger input bit stream than a
conventional SNG.
C. Stochastic Computing Correlation
Stochastic Computing Correlation (SCC), which takes a value
between −1 and 1, is a metric that measures the correlation
between stochastic numbers [13]. SCC is 0 for uncorrelated bit streams, positive if most of the ones are aligned,
and negative if the ones of one stream mostly coincide with the zeros of the other. In this brief, the
average SCC value of an SNG, similar to the one mentioned
in [8], is calculated as shown in (1).
$$\mathrm{SCC}_{avg} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \left| \mathrm{SCC}\left(S_{i/n}, S_{j/n}\right) \right|}{n^{2}} \tag{1}$$
To calculate the average SCC, all possible numbers
from 1, ..., n are first generated by an SNG. SCC(S_{i/n}, S_{j/n}) denotes
the SCC value for the two stochastic numbers i/n and j/n. As both
i and j can take any value in [0, n], the normalization factor n^2 is used to calculate the average value. The absolute
value of SCC is used to specifically answer the question of
whether the bit streams generated by a given SNG are correlated or not. Without the modulus in (1), the average value
of SCC will be ≈ 0 due to the cancellation of positive and
negative terms in the summation. The mean square error in the
output increases quadratically with SCC, which also motivates
the use of the absolute value of SCC when calculating
the average SCC of an SNG.
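The averaging in (1) can be sketched in a few lines of Python. The brief does not spell out the SCC formula itself, so the two-branch normalization below is an assumption based on the commonly used definition associated with [13]; the function names are hypothetical, and exact fractions are used only to keep the arithmetic free of rounding.

```python
from fractions import Fraction


def scc(x, y):
    """SCC of two equal-length bit streams given as lists of 0/1.

    Assumed definition: the covariance p_xy - p_x * p_y, normalized by
    its largest possible magnitude for the given p_x and p_y.
    """
    n = len(x)
    px = Fraction(sum(x), n)
    py = Fraction(sum(y), n)
    pxy = Fraction(sum(a & b for a, b in zip(x, y)), n)
    delta = pxy - px * py
    if delta > 0:
        denom = min(px, py) - px * py
    else:
        denom = px * py - max(px + py - 1, Fraction(0))
    # Constant streams make the denominator zero; treat them as uncorrelated.
    return Fraction(0) if denom == 0 else delta / denom


def average_abs_scc(streams):
    """Mean of |SCC| over all ordered pairs of streams, mirroring (1)."""
    total = sum(abs(scc(a, b)) for a in streams for b in streams)
    return total / (len(streams) ** 2)


if __name__ == "__main__":
    a = [1, 1, 0, 0]
    b = [0, 0, 1, 1]
    c = [1, 0, 1, 0]
    print(scc(a, a), scc(a, b), scc(a, c))
    print(average_abs_scc([a, c]))
```

Taking the absolute value inside the sum is what prevents positively and negatively correlated pairs from cancelling each other.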
III. PROPOSED STOCHASTIC NUMBER GENERATOR
This section presents the architecture of the proposed SNG.
In this brief, omega-flip permutation networks have been used
to permute the modified unary number generated from the
binary number.
A. Binary-to-Modified Unary Code Conversion
Fig. 3 shows the logic of the binary to modified unary
code (BMC) encoder. Position-based expansion of the binary
bits is used to obtain a non-random modified
unary number. The N-to-(2^N − 1) BMC encoder in the proposed
architecture is completely hardwired and therefore contributes no
gate delay. The most significant bit of the modified unary
stream can be tied to logic zero to make it a stream
of 2^N digits.
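A minimal Python sketch of such an encoder is given below. It assumes that "position-based expansion" means bit i of the input is wired to 2^i output positions, which is one plausible reading of Fig. 3 and not a confirmed detail of the design. The function name and the `pad` parameter are hypothetical.

```python
def bmc_encode(x, n_bits, pad=True):
    """Expand an n_bits-wide binary value into a modified unary stream.

    Assumed expansion rule: bit i of x is copied to 2**i output
    positions (pure wiring, MSB group first). With pad=True a leading 0
    is added so that the stream has 2**n_bits digits.
    """
    stream = []
    for i in reversed(range(n_bits)):
        stream.extend([(x >> i) & 1] * (1 << i))
    return ([0] + stream) if pad else stream


if __name__ == "__main__":
    print(bmc_encode(5, 3))  # [0, 1, 1, 1, 1, 0, 0, 1]
```

Under this rule the number of ones in the stream equals the input value, so the encoding is exact but not random; randomness comes only from the permutation that follows.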
B. The Proposed SNG
Figs. 4a and 4b show one stage of the omega and flip networks, respectively, for a 4-bit input. Each stage consists
of a pair of shufflers, which need two control bits. Similarly,
for the permutation of n-bit numbers, each stage of the omega-flip
network requires n/2 control bits. These control bits
serve as select bits for the multiplexers. As reported in [12],
n log(n) control bits are required to
achieve all possible permutations of n bits with a permutation network.
Thus, a total of 2 log(n) LFSRs, each n/2 bits wide, are required to generate
the control bits for the permutation network.
Fig. 5 shows the architecture of the proposed SNG. The
first step consists of encoding the input binary number to its
equivalent modified unary code, which is accomplished by
‘BMC’ block. Later, the modified unary code is passed through
the permutation network to obtain the corresponding stochastic
number. Here, 2 log(n) omega-flip stages have been employed
for permuting the input bits of size n in a single clock cycle.
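The permutation step can be sketched in Python as follows. The stage definitions are assumptions: an omega stage is modelled as a perfect shuffle followed by pairwise exchanges, and a flip stage as its mirror image (pairwise exchanges followed by an inverse shuffle), which is how such stages are commonly described; the exact wiring of Figs. 2 and 4 is not reproduced here. The function names are hypothetical, and Python's pseudo-random generator stands in for LFSR-derived control bits.

```python
import random


def perfect_shuffle(bits):
    """Interleave the first and second halves of the list."""
    half = len(bits) // 2
    out = []
    for a, b in zip(bits[:half], bits[half:]):
        out += [a, b]
    return out


def inverse_shuffle(bits):
    """Undo perfect_shuffle: even positions first, then odd positions."""
    return bits[0::2] + bits[1::2]


def exchange(bits, ctrl):
    """Swap each adjacent pair (2k, 2k+1) when control bit k is 1."""
    out = list(bits)
    for k, c in enumerate(ctrl):
        if c:
            out[2 * k], out[2 * k + 1] = out[2 * k + 1], out[2 * k]
    return out


def omega_flip(bits, controls):
    """Apply len(controls)//2 omega stages, then the remaining flip stages."""
    half = len(controls) // 2
    for ctrl in controls[:half]:   # omega stage: shuffle, then exchange
        bits = exchange(perfect_shuffle(bits), ctrl)
    for ctrl in controls[half:]:   # flip stage: exchange, then inverse shuffle
        bits = inverse_shuffle(exchange(bits, ctrl))
    return bits


if __name__ == "__main__":
    n = 8
    stages = 6                          # 2*log2(n) stages, as in the text
    unary = [0, 1, 1, 1, 1, 0, 0, 1]    # a modified unary stream holding five ones
    rng = random.Random(1)              # stand-in for LFSR-generated control bits
    controls = [[rng.randint(0, 1) for _ in range(n // 2)] for _ in range(stages)]
    out = omega_flip(unary, controls)
    print(out, sum(out))                # the permutation keeps the number of ones
```

Since every stage only moves bits, the number of ones, and hence the encoded value, is preserved regardless of the control bits; each stage consumes n/2 control bits, matching the count given above.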
SEHWAG et al.: PARALLEL SNG WITH BIT PERMUTATION NETWORKS
Fig. 4. A stage of 4-bit (a) Omega network and (b) Flip network.
Fig. 5. Proposed Stochastic Number Generator with single LFSR. Both outputs from k-bit circular shift blocks are the same.
Fig. 6. Average distribution of symbols after 2 log(n) permutations using (a) independent LFSRs and (b) single LFSR with hardwired circular shifters. Simulation has been done for 1024 symbols and 10 000 runs. Results have been plotted considering the mean value of bins of 64 symbols.
Fig. 7. LFSR sharing between two SNGs.
In the proposed SNG, one can use an omega-flip network
with a dedicated LFSR for each stage to generate a stochastic
number. This incurs a huge hardware overhead, as it requires
2 log(n) n/2-bit LFSRs to generate the control bits in a single
clock cycle. A good alternative is to obtain the
control bits for all stages of the bit permutation network from
a single LFSR. As an LFSR generates one value per clock
cycle, hard-wired k-bit circular shifters, one for each stage and
with no additional logic elements, are used to generate
the control bits for all the stages of the permutation network from the LFSR. To generate the control bits of the ith stage,
the output of the LFSR can be rotated by $\frac{(i-1) \times n}{2 \times 2\log(n)}$ bits. Thus, for
n = 1024, the control bits of each stage are circularly shifted by 50 bits
with respect to the control bits of the previous stage.
Note that the first stage has control bits that are
identical to the output of the LFSR. To establish that the controller with a single LFSR performs with accuracy similar to
the former approach with 2 log(n) LFSRs, Fig. 6 plots the
average distribution of all symbols for both approaches.
From the plots, it is evident that a single LFSR
with circular shifters can achieve a uniform distribution over all
possible permutations. For the proposed SNG, the critical path
runs from the output of the LFSR to the
stochastic number output via the Bit Shifter Network, owing
to the high fan-out of each of these nets.
C. LFSR Sharing Among Multiple SNGs
Although the use of an n/2-bit LFSR, rather than the log(n)-bit LFSR of the conventional SNG, may seem unnecessary, it provides
huge advantages due to its very long repetition cycles. As
mentioned in Section III-B, the circularly shifted output bits
of a single LFSR are used as the control bits of the permutation network. Similarly, a single LFSR can be shared
among multiple SNGs to generate multiple stochastic numbers. Fig. 7 shows the mechanism for using a single LFSR to
generate multiple stochastic numbers. The LFSR output is circularly shifted by a different value for each SNG. However, one
must ensure that the correlation between the outputs of
all SNGs does not increase as a result.
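The hardwired circular shifts can be modelled in Python as below. This is only a sketch: the shift amounts are left as parameters, the rotation direction (left) is an arbitrary choice, and `sng_offset` is a hypothetical name for the extra per-SNG shift applied when one LFSR is shared.

```python
def rotate_left(bits, k):
    """Circularly shift a list of bits left by k positions (pure rewiring in hardware)."""
    k %= len(bits)
    return bits[k:] + bits[:k]


def stage_controls(lfsr_bits, num_stages, shift_per_stage, sng_offset=0):
    """Derive one control word per stage from a single LFSR output word.

    Stage i (1-based) sees the LFSR word rotated by
    sng_offset + (i - 1) * shift_per_stage positions, so stage 1 of the
    SNG with offset 0 sees the LFSR output unchanged.
    """
    return [
        rotate_left(lfsr_bits, sng_offset + i * shift_per_stage)
        for i in range(num_stages)
    ]


if __name__ == "__main__":
    word = [1, 0, 0, 1, 1, 0, 1, 0]                    # toy 8-bit LFSR output
    first_sng = stage_controls(word, 3, 2)
    second_sng = stage_controls(word, 3, 2, sng_offset=3)
    for ctrl in first_sng + second_sng:
        print(ctrl)
```

Every control word is a rotation of the same LFSR output, so no logic is added; whether the resulting streams remain uncorrelated has to be checked separately, for example with the average SCC.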
To quantify the change in SCC of several SNGs with a
shared LFSR, Fig. 8a shows the average SCC between the
output bits of two SNGs for different shift values. It shows
that a circular shift by any number of places produces nearly
uncorrelated outputs, with an SCC value that is the
same as when different SNGs are used to generate
them (Fig. 9b). The achieved average SCC value is also better
than that of the shared LFSR architecture for the conventional SNG [8]. This is because the
LFSR used here is wider, and thus has a larger range, than the one used in
the conventional SNG.
Another advantage of sharing an LFSR among multiple
SNGs is the saving in the combined area
of all SNGs. Let the area of an LFSR be denoted as A_L and the
area of the other blocks of an SNG be denoted as A_O. If an application needs S SNGs, each with its own LFSR, the total area of all
SNGs is S × (A_L + A_O). However, if a single LFSR is
shared among the S SNGs, the total area consumed by the SNGs
becomes A_L + (S × A_O). Thus, the percentage area saving of SNGs with a shared LFSR, with respect to SNGs with
individual LFSRs, is given as
$$\left(1 - \frac{1}{S}\right) \times \frac{A_L}{A_L + A_O} \times 100\%.$$
Fig. 8b shows the percentage saving in area for various numbers of
SNGs when a single LFSR is shared among them.
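The saving expression follows directly from the two area totals. As a short check, with A_L, A_O, and S as defined in the text:

```latex
\begin{align*}
\text{saving}
  &= \frac{S\,(A_L + A_O) - \left(A_L + S\,A_O\right)}{S\,(A_L + A_O)} \times 100\% \\
  &= \frac{(S - 1)\,A_L}{S\,(A_L + A_O)} \times 100\% \\
  &= \left(1 - \frac{1}{S}\right) \times \frac{A_L}{A_L + A_O} \times 100\%
\end{align*}
```

For S = 1 the saving is zero, and as S grows it approaches A_L / (A_L + A_O), the LFSR's share of a single SNG's area.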
Fig. 8. (a) Average SCC of outputs of SNGs with shared LFSR. X-axis denotes the circular shift in the 512-bit LFSR output bit streams. (b) Percentage saving in area if a 512-bit LFSR is shared among S proposed SNGs. Size of stochastic number considered is 2^10.
D. Effect of Hardwired Logic
Though the proposed SNG can generate a stochastic number
in one clock cycle, much of its logic in the BMC and Bit Shifter
Network blocks is hardwired. Hardwired logic with moderate
to high fan-out contributes additional parasitic capacitance
and increases the net capacitance of the circuit,
so additional buffers are needed to drive nets with such high
load capacitance.
In the proposed SNG, instances such as the bit
shifter network and the BMC have a huge fan-out. For example,
the most significant bit (MSB) of the input binary number in
Fig. 3, B_{N−1}, has a fan-out of 2 × 2^{N−1}. To
support such a high fan-out, an appropriate input buffer with
sufficient driving capacity needs to be instantiated. A similar
analysis can be carried out for the k-bit circular shifter block
shown in Fig. 7. The fan-out of each flip-flop of the LFSR
for each SNG is 40. If an LFSR is shared among S SNGs,
the fan-out of each flip-flop of the LFSR becomes S × 40.
According to the synthesis results, one LFSR can
drive the entire omega-flip network of one SNG without the need for any
other buffer. However, when it also drives the omega-flip networks
of other SNGs, intermediate buffers need to be instantiated
according to the total load capacitance. Within the omega-flip network, the data path consists of multiplexers, each with
a fan-out of two. Since each multiplexer has inverters at
its input and output, these act as intermediate buffers that support
the desired fan-out, so
no additional buffers are needed.
IV. EXPERIMENTAL EVALUATION
Experimental evaluation of the proposed SNGs has been
conducted for parameters such as area, critical path delay, power
consumption, power- and energy-delay-products, and average
SCC. The proposed architectures have been synthesized using
Synopsys Design Compiler (DC) and the TSMC 90 nm standard-cell library. The proposed SNGs
have been compared with other SNGs from the literature.
A. Analysis of Average SCC
Fig. 9a shows the variation in the average SCC of the
proposed SNG with respect to the size of the input binary number. The average SCC is expected to drop as the input size of the
binary number increases. Fig. 9b compares
the average SCC of the proposed SNG (P) architecture
with that of the conventional (C) [10], shared LFSR based (S) [8],
and energy-efficient (E) [9] SNG architectures. The figure shows that the average SCC value of the proposed
SNG improves by 28.57% over that of
the shared LFSR based SNG. This is because the bit
permutation network used in the proposed SNGs can achieve more permutations for a given length of bits than
other generation methods.
Fig. 9. (a) Variation of average SCC with input size (in binary) in proposed SNGs. (b) Comparison of average SCC for a 2^10 stochastic number.
B. Hardware Comparison
Fig. 10 shows the performance comparison of the proposed
SNG with existing architectures for parameters such as area, critical path delay, power consumption, and power- and energy-delay-products. For a fair comparison, all cases are implemented
in a parallel manner. For the shared LFSR architecture, one
LFSR is assumed to be shared between two SNGs. For
the energy-efficient architecture, 32 bits are considered in a group. The conventional LFSR based architecture has
been replicated 2^10 times to realize a parallel SNG. In contrast,
the proposed SNG has been instantiated only once, as it can
generate a parallel stochastic number in a single cycle.
From the area plots (Fig. 10a), the relative area occupied by the proposed SNG to generate a parallel
stochastic number is less than that of the other SNG architectures. Since
the critical path of the proposed SNG comprises only 20 multiplexers, it achieves higher speed than the other SNGs
(Fig. 10b). In terms of power consumption (Fig. 10c), the
proposed SNG consumes less power than all other SNGs
except the energy-efficient (E) one. This is because
the stochastic number is generated in parallel in a single cycle.
Power-delay-product (PDP) for each case has been determined
as the product of the power consumed by each architecture
and its critical path delay. From the PDP plot (Fig. 10d), the proposed SNG has a lower PDP than the
other SNG architectures except the energy-efficient (E) SNG.
However, the energy-delay-product plots (Fig. 10e) show that the proposed SNG outperforms the other SNGs
by up to 7.84×. Thus, the proposed SNG provides a good trade-off among the considered parameters, making it a desirable
choice as an efficient SNG with low
latency.
Fig. 10. Performance comparison of proposed SNGs with others in terms of (a) area, (b) critical path delay, (c) power consumption, (d) Power Delay Product (PDP), and (e) Energy Delay Product (EDP). For fair comparison, values are scaled for the one-bit generation of a stochastic number.
V. PERFORMANCE ANALYSIS WITH APPLICATIONS
To compare the proposed SNG with other SNGs, a case
study for different applications is reported in Table I.
The reported results include both the stochastic circuits and the SNGs.
For the application of multiplication, where two different
SNGs are required to generate the inputs, the conventional
SNG, which takes 2^10 clock cycles to generate a 2^10-bit
stochastic number, has a much longer execution time
than the proposed one. As only two SNGs are required,
sharing an LFSR between them does not yield much saving in area. However, both the non-shared and shared
LFSR approaches in the proposed SNG achieve a reduction of
about 8× in area-delay-product (ADP) when compared with
the other approaches.
Similarly, in the edge detector circuit [14], the proposed SNG
architecture has a huge advantage in ADP. Further, sharing a
single LFSR among the five SNGs results in a 9% decrease in
the ADP when compared with the proposed SNG with separate
LFSRs. For the application of complex multiplication [15],
sharing a single LFSR results in 9.5% savings in both area and
ADP, compared to the proposed SNG with separate LFSRs.
From the above results, it becomes apparent that
as the stochastic circuit becomes more complex, that is, as
it requires a larger number of SNGs, shared LFSR based SNG
architectures can be used to reduce the area overhead without degrading the performance. Also, one can note that the
clock rate in all three aforementioned applications is governed by the critical path of the SNG rather than that of the stochastic
circuit.
TABLE I
PERFORMANCE COMPARISON OF SNGS FOR VARIOUS STOCHASTIC COMPUTING APPLICATIONS
VI. CONCLUSION
In this brief, a parallel stochastic number generator (SNG)
has been proposed, which can generate all bits of a stochastic
number in parallel. An omega-flip bit permutation (BP) network
has been employed to achieve parallel stochastic number generation. Employing BP networks also improves
the SCC of a stochastic number, as more permutations can
be realized for a given bit stream. Experimental analysis of the
proposed SNG has shown an improvement in average SCC
and energy-delay-product of 28.57% and 4.32×, respectively,
compared to the shared LFSR based SNG. Moreover, a
method of sharing a single LFSR among multiple SNGs has
also been presented, which brings advantages such as
reduced area when complex applications with multiple SNGs
are considered. Employing the proposed SNG in applications
such as multiplication, edge detection, and complex multiplication shows an improvement in execution time of up
to 1000× as well as an improvement in area-delay-product of up
to 9× when compared to other approaches. Future work will
consider improving the energy consumption of the proposed SNG
by considering alternate logic styles as well as by following
time-domain approaches to generating stochastic numbers.
REFERENCES
[1] B. Moons and M. Verhelst, “Energy-efficiency and accuracy of stochastic
computing circuits in emerging technologies,” IEEE J. Emerg. Sel. Topic
Circuits Syst., vol. 4, no. 4, pp. 475–486, Dec. 2014.
[2] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM
Trans. Embedded Comput. Syst., vol. 12, no. 2s, pp. 1–19, May 2013.
[3] B. Yuan, Y. Wang, and Z. Wang, “Area-efficient scaling-free DFT/FFT
design using stochastic computing,” IEEE Trans. Circuits Syst. II, Exp.
Briefs, vol. 63, no. 12, pp. 1131–1135, Dec. 2016.
[4] B. Yuan and K. K. Parhi, “Belief propagation decoding of polar
codes using stochastic computing,” in Proc. IEEE Int. Symp. Circuits
Syst. (ISCAS), Montreal, QC, Canada, May 2016, pp. 157–160.
[5] K. K. Parhi and Y. Liu, “Architectures for IIR digital filters using
stochastic computing,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS),
Melbourne, VIC, Australia, Jun. 2014, pp. 373–376.
[6] Y. Zhu, P. Suo, and K. Bazargan, “Binary stochastic implementation
of digital logic,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate
Arrays, Monterey, CA, USA, Feb. 2014, pp. 171–180.
[7] R. Saraiva, J. C. Ruzicki, A. de Souza, and R. I. Soares, “Range segmentation to improve latency in parallel stochastic computing,” in Proc.
IEEE 7th Latin Amer. Symp. Circuits Syst. (LASCAS), Florianópolis,
Brazil, Feb./Mar. 2016, pp. 307–310.
[8] H. Ichihara, S. Ishii, D. Sunamori, T. Iwagaki, and T. Inoue, “Compact
and accurate stochastic circuits with shared random number sources,”
in Proc. IEEE 32nd Int. Conf. Comput. Design (ICCD), Seoul,
South Korea, Oct. 2014, pp. 361–366.
[9] K. Kim, J. Lee, and K. Choi, “An energy-efficient random number generator for stochastic circuits,” in Proc. 21st Asia South Pac. Design Autom.
Conf. (ASP-DAC), Jan. 2016, pp. 256–261.
[10] P. K. Gupta and R. Kumaresan, “Binary multiplication with PN
sequences,” IEEE Trans. Acoust. Speech Signal Process., vol. 36, no. 4,
pp. 603–606, Apr. 1988.
[11] Bit Permutations. Accessed on Nov. 6, 2016. [Online]. Available:
http://programming.sirrida.de/bit_perm.html
[12] X. Yang and R. B. Lee, “Fast subword permutation instructions using
omega and flip network stages,” in Proc. Int. Conf. Comput. Design,
Austin, TX, USA, Sep. 2000, pp. 15–22.
[13] T.-H. Chen and J. P. Hayes, “Analyzing and controlling accuracy
in stochastic circuits,” in Proc. 32nd IEEE Int. Conf. Comput.
Design (ICCD), Seoul, South Korea, 2014, pp. 367–373.
[14] A. Alaghi and J. P. Hayes, “Fast and accurate computation using stochastic circuits,” in Proc. Autom. Test Europe Conf. Exhibit. Design (DATE),
Dresden, Germany, Mar. 2014, pp. 1–4.
[15] P.-S. Ting and J. P. Hayes, “Isolation-based decorrelation of stochastic circuits,” in Proc. IEEE 34th Int. Conf. Comput. Design (ICCD),
Scottsdale, AZ, USA, 2016, pp. 88–95.