IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 65, NO. 2, FEBRUARY 2018

A Parallel Stochastic Number Generator With Bit Permutation Networks

Vikash Sehwag, Student Member, IEEE, N. Prasad, Student Member, IEEE, and Indrajit Chakrabarti, Member, IEEE

Abstract—Stochastic computing (SC) is a promising paradigm to realize low-complexity digital circuits that are tolerant to soft errors. Stochastic circuits include a stochastic number generator (SNG) to generate a stochastic number that corresponds to a given binary number. Conventional SNGs, which employ linear feedback shift registers (LFSRs) to generate stochastic numbers in a serial manner, incur a significant cost in time. In this brief, a parallel SNG has been proposed, which can generate stochastic numbers in parallel by transforming the input binary number to a modified unary number and permuting it using a bit permutation network. Further, a method to share a single LFSR among multiple SNGs has been presented. Experimental results show that the proposed SNG can achieve improvement in SC correlation and energy-delay-product by 28.57% and 4.32×, respectively, when compared to the existing shared LFSR-based SNG. For applications such as edge detection, multiplication, and complex multiplication, the proposed SNG has achieved reduction in execution time and area-delay-product by up to 1000× and 9×, respectively, as compared to others.

Index Terms—Bit permutation, low latency, omega-flip network, stochastic computing, stochastic number.

I. INTRODUCTION

Stochastic Computing (SC) is an emerging paradigm for energy-efficient and error-tolerant computing [1], [2]. Unlike the conventional computing approach, it uses probability as a baseline measure in generating stochastic numbers and performs arithmetic operations on them. Recent improvements in SC promise that it can achieve higher accuracy in its outputs with much less usage of hardware resources than the conventional computing approach [3]–[5].
Manuscript received May 3, 2017; revised May 13, 2017; accepted May 22, 2017. Date of publication May 25, 2017; date of current version January 29, 2018. This brief was recommended by Associate Editor C.-T. Cheng. (Corresponding author: N. Prasad.)
The authors are with the Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, Kharagpur 721302, India (e-mail: sehwag.vikash@gmail.com; nprasad@ece.iitkgp.ernet.in; indrajit@ece.iitkgp.ernet.in).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2017.2708128

Stochastic circuits include a stochastic number generator (SNG), which is used to generate a stochastic number corresponding to a given binary number. Traditional stochastic circuits with conventional SNGs, which employ linear feedback shift registers (LFSRs) to generate stochastic numbers serially, suffer from high latency in the computation. To resolve this, various methods, such as binary stochastic and range segmentation, have been proposed to parallelize the stochastic computations [6], [7]. However, these methods are not effective unless SNGs generate stochastic numbers in parallel and at the same throughput as their binary counterparts. Though the state-of-the-art SNGs improve the latency encountered in generating stochastic numbers [8], [9], having a fully parallel SNG helps in achieving higher throughput of the design, as well as in leveraging the resources when the scenario of multiple SNGs is considered. However, parallelizing an SNG should not increase the Stochastic Computing Correlation (SCC), a metric that measures the correlation between two stochastic numbers, or the resulting error in the output of a stochastic circuit. In this brief, techniques for further reduction in the time complexity and other VLSI design metrics of SNGs have been addressed.
Further, an SNG architecture with a latency of one clock cycle has been proposed. Bit permutation networks have been used in the proposed SNG to achieve more randomness in the output bit streams, which helps in further improving the SCC value compared to the other recently proposed SNGs [8], [9]. Moreover, an approach to use a single LFSR for multiple SNGs has also been presented. Experimental evaluation has also been carried out to compare the proposed SNG with existing SNGs in terms of metrics like SCC, area, power, power-delay-product, and energy-delay-product.

The rest of the brief is organized as follows. Section II discusses the background on existing SNGs and the omega-flip network. Section III presents the proposed SNG. Section IV presents the experimental results. Section V analyses the performance of the proposed SNG with other SNGs. Section VI concludes the brief and mentions the future work.

1549-7747 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. BACKGROUND

A. Existing SNGs

In SC, the first step is to convert a binary number into a stochastic number, by generating a random bit stream whose probability is proportional to the given binary number. SNGs are employed in accomplishing this process. Few works exist in this area. Gupta and Kumaresan [10] have proposed the conventional SNG, which employs a log(n)-bit LFSR and a weighted binary generator to generate an n-bit stochastic number from a given log(n)-bit binary number (Fig. 1). It generates the stochastic number in a bit-serial fashion with a latency equal to n clock cycles [2]. Later, Ichihara et al. [8] have proposed SNG designs that include sharing of an LFSR among multiple SNGs. The advantage of this architecture is the reduced area of the SNG, compared to the conventional one. Though it achieves low SCC values for a few of the considered applications, its average SCC is shown to be higher than that of the conventional one. Recently, Kim et al. [9] have proposed an energy-efficient SNG for stochastic circuits with improved progressive precision. This technique employs three sub-blocks, namely even-distribution encoding and inter-group and intra-group randomizers with LFSR inputs. Though the architecture is an energy-efficient one, its sub-blocks cannot be shared among multiple SNGs. This results in significant area overhead in complex stochastic circuits where multiple SNGs are required.

Fig. 1. A 4-bit conventional SNG.

B. Omega-Flip Network and Its Application in SNGs

Bit permutation networks have been used in many applications, such as cryptography, to randomise a given bit stream. Several bit permutation networks are described in [11]; butterfly, inverse butterfly, and shuffle networks are widely used. Another popular bit permutation network is the omega-flip network, which is a modified version of cascaded butterfly/inverse butterfly and shuffle stages. Fig. 2 shows a four-stage omega-flip network, for an input (I1...I8) and output (O1...O8) size of eight bits. For an omega-flip network, given no repetitions, the total number of possible permutations of a bit stream of size n is n!. Without loss of generality, it is assumed that n is a power of 2. However, to achieve all possible permutations, it requires log n! control bits [12]. Since n! < $n^n$, one can say that a maximum of n log(n) bits are necessary to achieve all possible permutations for a given stream of n bits. Employing an omega-flip network in an SNG increases the randomness in the stochastic number, as it operates with a larger input bit stream, as compared to a conventional SNG.

Fig. 2. A four-stage omega-flip network with two omega (O) stages followed by two flip (F) stages.

C. Stochastic Computing Correlation

Stochastic Computing Correlation (SCC), which has a value between -1 and 1, is a metric to measure the correlation between stochastic numbers [13]. SCC will be 0 for uncorrelated bit streams, positive if the ones of the two streams tend to overlap, and negative if the ones of the two streams tend not to overlap. In this brief, the average SCC value of an SNG, similar to the one mentioned in [8], is calculated as shown in (1).

$$\mathrm{SCC}_{\mathrm{avg}} = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| \mathrm{SCC}\left(S_i^{n}, S_j^{n}\right) \right| \tag{1}$$

To calculate the average SCC, firstly, all possible numbers from 1, ..., n are generated by an SNG. $\mathrm{SCC}(S_i^{n}, S_j^{n})$ defines the SCC value for two stochastic numbers $n_i$ and $n_j$. As both i and j can have any values between [0, n], the normalization factor, $n^2$, is used to calculate the average value. The absolute value of SCC is used to specifically answer the question of whether the bit streams generated using a given SNG are correlated or not. Without the modulus in (1), the average value of SCC will be ≈ 0 due to the cancellation of positive and negative terms in the summation. The mean square error in the output increases quadratically with SCC, which also motivates the use of the absolute value of SCC as the measure to calculate the average SCC of an SNG.

III. PROPOSED STOCHASTIC NUMBER GENERATOR

This section presents the architecture of the proposed SNG. In this brief, omega-flip permutation networks have been used to permute the modified unary number generated from the binary number.

A. Binary-to-Modified Unary Code Conversion

Fig. 3. An N-to-($2^N - 1$) binary to modified unary code encoder.

Fig. 3 shows the logic of the binary to modified unary code (BMC) encoder. Position-based expansion of the binary bits has been considered in obtaining a non-random modified unary number. An N-to-($2^N - 1$) BMC encoder in the proposed architecture is completely hardwired, thus not accounting for any gate delay.
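As an illustration of the position-based expansion, the following Python sketch models a BMC encoder in software. It assumes that bit k of the input is replicated $2^k$ times, starting from the least significant bit. This ordering and the function name `bmc_encode` are assumptions made here for illustration and are not taken from Fig. 3.

```python
def bmc_encode(value, n_bits):
    """Software model of an N-to-(2**N - 1) binary to modified unary encoder.

    Assumption: bit k of `value` is copied 2**k times (position-based
    expansion), least significant bit first. In hardware this is pure wiring.
    """
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in n_bits")
    stream = []
    for k in range(n_bits):
        bit = (value >> k) & 1
        stream.extend([bit] * (1 << k))  # 2**k copies of bit k
    return stream


if __name__ == "__main__":
    code = bmc_encode(5, 3)
    # The stream has 2**3 - 1 = 7 bits and contains exactly 5 ones.
    print(code, len(code), sum(code))
```

Under this assumption the number of ones in the output always equals the input value, so the stream is a non-random unary-style representation that a later permutation step can shuffle without changing its value.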
The most significant bit of the modified unary stream can be connected to a logic zero to make it a stream of $2^N$ digits.

B. The Proposed SNG

Figs. 4a and 4b show a stage of each of the corresponding networks for a 4-bit input binary number. Each stage consists of a pair of shufflers, which need two control bits. Similarly, for the permutation of n-bit numbers, each stage of the omega-flip network requires n/2 control bits. These control bits serve as select bits for the multiplexers. As reported in [12], for an n-bit binary number, it requires n log(n) control bits to achieve all possible permutations with a permutation network. Thus a total of 2 log(n) n/2-bit LFSRs are required to generate the control bits for the permutation network.

Fig. 4. A stage of 4-bit (a) Omega network and (b) Flip network.

Fig. 5 shows the architecture of the proposed SNG. The first step consists of encoding the input binary number to its equivalent modified unary code, which is accomplished by the 'BMC' block. Later, the modified unary code is passed through the permutation network to obtain the corresponding stochastic number. Here, 2 log(n) omega-flip stages have been employed for permuting the input bits of size n in a single clock cycle.

Fig. 5. Proposed Stochastic Number Generator with single LFSR. Both outputs from k-bit circular shift blocks are the same.

Fig. 6. Average distribution of symbols after 2 log(n) permutations using (a) independent LFSRs and (b) single LFSR with hardwired circular shifters. Simulation has been done for 1024 symbols and 10 000 runs. Results have been plotted considering the mean value of bins of 64 symbols.

In the proposed SNG, one can use an omega-flip network with dedicated LFSRs for each stage, to generate a stochastic number. This incurs huge hardware overhead, as it requires 2 log(n) n/2-bit LFSRs to generate the control bits in a single clock cycle.
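The data flow through the permutation network can be sketched in software as follows. This is a minimal model under stated assumptions, not the synthesized design: an omega stage is modelled as a perfect shuffle followed by n/2 controlled pair swaps, a flip stage as the mirror image (pair swaps followed by an inverse shuffle), and log(n) omega stages are followed by log(n) flip stages. Python's `random` module stands in for the dedicated LFSRs, and all function names are hypothetical.

```python
import random


def perfect_shuffle(bits):
    """Interleave the lower and upper halves: a0 b0 a1 b1 ..."""
    half = len(bits) // 2
    out = []
    for lo, hi in zip(bits[:half], bits[half:]):
        out.extend((lo, hi))
    return out


def inverse_shuffle(bits):
    """Undo perfect_shuffle: even positions first, then odd positions."""
    return bits[0::2] + bits[1::2]


def exchange(bits, controls):
    """Swap the pair (2i, 2i+1) when control bit i is 1 (a 2:1 mux pair)."""
    out = list(bits)
    for i, c in enumerate(controls):
        if c:
            out[2 * i], out[2 * i + 1] = out[2 * i + 1], out[2 * i]
    return out


def omega_stage(bits, controls):
    # Assumed wiring: shuffle, then controlled swaps.
    return exchange(perfect_shuffle(bits), controls)


def flip_stage(bits, controls):
    # Assumed wiring: controlled swaps, then inverse shuffle.
    return inverse_shuffle(exchange(bits, controls))


def permute(bits, stage_controls):
    """Apply log2(n) omega stages followed by log2(n) flip stages."""
    n = len(bits)
    log_n = n.bit_length() - 1
    if n != 1 << log_n:
        raise ValueError("stream length must be a power of 2")
    if len(stage_controls) != 2 * log_n:
        raise ValueError("expected 2*log2(n) sets of control bits")
    for s, controls in enumerate(stage_controls):
        if len(controls) != n // 2:
            raise ValueError("each stage needs n/2 control bits")
        stage = omega_stage if s < log_n else flip_stage
        bits = stage(bits, controls)
    return bits


if __name__ == "__main__":
    rng = random.Random(0)  # stand-in for the dedicated LFSRs
    n = 16
    unary = [1] * 5 + [0] * (n - 5)  # value 5 in a 16-bit unary-style stream
    controls = [[rng.getrandbits(1) for _ in range(n // 2)] for _ in range(8)]
    stream = permute(unary, controls)
    print(stream, sum(stream))  # the number of ones is preserved
```

Because every stage only rearranges bits, the number of ones, and hence the encoded value, is unchanged by the network; only the positions of the ones depend on the control bits.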
A good alternative to this is to obtain the control bits for all stages in the bit permutation network using a single LFSR. As an LFSR generates one value in one clock cycle, hard-wired k-bit circular shifters, one for each stage and with no additional logic elements, have been used to generate the control bits from the LFSR for all the stages in the permutation network. To generate the control bits of the ith stage, the output of the LFSR can be rotated by $\frac{(i-1)\times n}{2\times 2\log(n)}$. Thus, for n = 1024, the control bits of each stage will be 50-bit circularly shifted with respect to the control bits of the corresponding previous stage. Note that the first stage will have control bits that are the same as the output of the LFSR. To establish that the controller with a single LFSR performs with similar accuracy as the former approach that has 2 log(n) LFSRs, Fig. 6 plots the average distribution of all symbols for both the approaches. From the plots, it is evident that employing a single LFSR with circular shifters can achieve uniform distribution for all possible permutations. For the proposed SNG, the critical path will correspond to the path from the output of the LFSR to the stochastic number output via the Bit Shifter Network, owing to the high fan-out of each of these nets.

C. LFSR Sharing Among Multiple SNGs

Although the use of an n/2-bit LFSR, rather than the log(n)-bit LFSR of the conventional SNG, may seem unnecessary, it provides huge advantages due to its very long repetition cycles. As mentioned in Section III-B, the circularly shifted output bits of a single LFSR are used as the control bits of the permutation network. Similarly, a single LFSR can be shared among multiple SNGs to generate multiple stochastic numbers. Fig. 7 shows the mechanism to utilize a single LFSR to generate multiple stochastic numbers. The LFSR output is circularly shifted by a different value for each SNG.

Fig. 7. LFSR sharing between two SNGs.
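The derivation of all control words from one LFSR output can be sketched as below. This is an illustrative model only: the rotation direction, the per-stage shift `shift_per_stage`, and the per-SNG offset `sng_offset` are hypothetical parameters chosen for the example, and the function names are not from the brief. In hardware each rotation is a fixed rewiring of the LFSR outputs.

```python
def rotate_left(word, k):
    """Circularly shift a list of bits left by k places."""
    k %= len(word)
    return word[k:] + word[:k]


def stage_controls(lfsr_word, num_stages, shift_per_stage):
    """Control bits for every stage, all taken from one LFSR word.

    Stage index 0 (the first stage) uses the LFSR word unrotated; each later
    stage is rotated by a further `shift_per_stage` places (assumed value).
    """
    return [rotate_left(lfsr_word, i * shift_per_stage)
            for i in range(num_stages)]


def sng_controls(lfsr_word, num_stages, shift_per_stage, sng_offset):
    """Control bits for one of several SNGs sharing the same LFSR.

    Each SNG sees the LFSR word rotated by its own `sng_offset` (assumed
    parameter) before the per-stage rotations are applied.
    """
    return stage_controls(rotate_left(lfsr_word, sng_offset),
                          num_stages, shift_per_stage)


if __name__ == "__main__":
    word = [1, 0, 0, 0, 0, 0, 0, 0]  # toy 8-bit LFSR state
    for row in stage_controls(word, num_stages=3, shift_per_stage=2):
        print(row)
    print(sng_controls(word, 3, 2, sng_offset=1)[0])
```

Since a rotation neither adds nor removes bits, every stage and every SNG receives a full-width control word each clock cycle without any extra logic elements, which matches the hardwired-shifter idea described above.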
However, one needs to observe that the correlation between the outputs of all SNGs should not increase due to this. To quantify the change in SCC of several SNGs with a shared LFSR, Fig. 8a shows the average SCC between the output bits of two SNGs for different shift values. It shows that a circular shift by any number of places produces nearly uncorrelated outputs, which results in an SCC value that is the same as the SCC when different SNGs are used to generate them (Fig. 9b). The achieved average SCC value is also better than the SCC value of the shared LFSR architecture for the conventional SNG [8]. This is because the LFSR used here has a larger size, and hence a higher range, compared to that used in the conventional SNG.

Another advantage of sharing an LFSR among multiple SNGs is the savings observed in terms of the combined area of all SNGs. Let the area of an LFSR be denoted as $A_L$ and the area of the other blocks of an SNG be denoted as $A_O$. If an application needs S SNGs, each with an LFSR, the total area of all SNGs will be $S \times (A_L + A_O)$. However, if a single LFSR is shared among S SNGs, the total area consumed by the SNGs will now be $A_L + (S \times A_O)$. Thus, the percentage area savings of SNGs with a shared LFSR, with respect to SNGs with individual LFSRs, is then given as $\left(1 - \frac{1}{S}\right) \times \frac{A_L}{A_L + A_O} \times 100\%$. Fig. 8b shows the percentage savings in area for various numbers of SNGs when a single LFSR is shared among them.

Fig. 8. (a) Average SCC of outputs of SNGs with shared LFSR. X-axis denotes the circular shift in the 512-bit LFSR output bit streams. (b) Percentage saving in area if a 512-bit LFSR is shared among S proposed SNGs. Size of stochastic number considered is $2^{10}$.

D. Effect of Hardwired Logic

Though the proposed SNG can generate a stochastic number in one clock cycle, much of its logic in the BMC and Bit Shifter Network blocks is hardwired.
Hardwired logic with moderate to high fan-out contributes additional parasitic capacitance and increases the net capacitance of the circuit, so additional buffers are needed to drive nets with such high load capacitance values. In the proposed SNG, one can note that instances like the bit shifter network and the BMC have huge fan-out. For example, the most significant bit (MSB) of the input binary number in Fig. 3, which is $B_{N-1}$, will have a fan-out of $2 \times 2^{N-1}$. To support such a high fan-out, an appropriate input buffer with sufficient driving capacity needs to be instantiated. A similar analysis can be carried out for the k-bit circular shifter block shown in Fig. 7. The fan-out of each flip-flop of the LFSR for each SNG is 40. If each LFSR is shared among S SNGs, then the fan-out of each flip-flop of the LFSR would be S × 40. From the results obtained after the synthesis, one LFSR can drive the entire omega-flip network without the need for any other buffer. However, when driving the omega-flip networks of other SNGs, intermediate buffers need to be instantiated according to the total load capacitance. Within the omega-flip network, the data path consists of multiplexers, each with a fan-out of two. Since each multiplexer has inverters at its input and output, these act as intermediate buffers that support the desired fan-out of the multiplexers, so no additional buffers are needed.

IV. EXPERIMENTAL EVALUATION

Experimental evaluation of the proposed SNGs has been conducted for parameters like area, critical path delay, power consumption, power- and energy-delay-products, and average SCC. The proposed architectures have been synthesized using Synopsys Design Compiler (DC) and the TSMC 90 nm standard-cell library. Performance comparison of the proposed SNGs has been done with other SNGs existing in the literature.

A. Analysis of Average SCC

Fig. 9a shows the variation in the average SCC metric of the proposed SNG with respect to the size of the input binary number. It is expected that the average SCC drops as the input size of the binary number increases. Fig. 9b compares the average SCC metric of the proposed SNG (P) architecture with that of the conventional (C) [10], shared LFSR based (S) [8], and energy-efficient (E) [9] SNG architectures. From the figure, one can find that the average SCC value of the proposed SNG shows an improvement of 28.57% when compared with the shared LFSR based SNG. This is because, with the bit permutation network used in the proposed SNGs, more permutations can be achieved for a given length of bits, compared to other generation methods.

Fig. 9. (a) Variation of average SCC with input size (in binary) in proposed SNGs. (b) Comparison of average SCC for a $2^{10}$ stochastic number.

B. Hardware Comparison

Fig. 10 shows the performance comparison of the proposed SNG with existing architectures for parameters like area, critical path delay, power consumption, and power- and energy-delay-products. For fair comparison, all cases are implemented in a parallel manner. For the shared LFSR architecture, one LFSR is assumed to be shared between two SNGs. For the case of the energy-efficient architecture, 32 bits are considered in a group. The conventional LFSR based architecture has been replicated $2^{10}$ times to realize a parallel SNG. In contrast, the proposed SNG has been considered only once, as it can generate a parallel stochastic number in a single cycle.

From the area plots (Fig. 10a), one can find that the relative area occupied by the proposed SNG to generate a parallel stochastic number is less than that of the other SNG architectures. Since the critical path of the proposed SNG comprises only 20 multiplexers, it can achieve higher speed compared to other SNGs (Fig. 10b). In the case of power consumption (Fig. 10c), the proposed SNG consumes less power than all other SNGs, except the energy-efficient (E) one. This is due to the fact that the stochastic number is generated in parallel in a single cycle. The power-delay-product (PDP) for each case has been determined as the product of the power consumed by each architecture and its critical path delay. From the PDP plot (Fig. 10d), one can find that the proposed SNG enjoys low PDP compared to other SNG architectures except the energy-efficient (E) SNG. However, from the plots of energy-delay-product (Fig. 10e), one can note that the proposed SNG outperforms other SNGs by up to 7.84×. Thus, the proposed SNG provides a good tradeoff among the considered parameters, making it a desirable and efficient choice for an SNG with low latency.

Fig. 10. Performance comparison of proposed SNGs with others in terms of (a) area, (b) critical path delay, (c) power consumption, (d) Power Delay Product (PDP), and (e) Energy Delay Product (EDP). For fair comparison, values are scaled for the one-bit generation of a stochastic number.

V. PERFORMANCE ANALYSIS WITH APPLICATIONS

To compare the proposed SNG with other SNGs, a case study for different applications has been reported in Table I. The reported results include both stochastic circuits and SNGs. For the application of multiplication, where two different SNGs are required to generate the inputs, the conventional SNG, which takes $2^{10}$ clock cycles to generate a $2^{10}$-bit stochastic number, has an execution time much larger than that of the proposed one. As only two SNGs are required, sharing an LFSR between them does not attain much savings in the area. However, both the non-shared and shared LFSR approaches in the proposed SNG achieve a reduction of about 8× in area-delay-product (ADP), when compared with the other approaches. Similarly, in the edge detector circuit [14], the proposed SNG architecture has a huge advantage in ADP. Further, sharing a single LFSR among the five SNGs results in a 9% decrease in the ADP when compared with the proposed SNG with separate LFSRs. For the application of complex multiplication [15], sharing a single LFSR results in 9.5% savings in both area and ADP, compared to the proposed SNG with separate LFSRs. From the above mentioned results, it becomes apparent that as the stochastic circuit becomes more complex, that is, as it requires a larger number of SNGs, shared LFSR based SNG architectures can be used to reduce the area overheads without degrading the performance. Also, one can note that the clock rate in all three aforementioned applications is governed by the critical path of the SNG rather than the stochastic circuit.

TABLE I. Performance Comparison of SNGs for Various Stochastic Computing Applications

VI. CONCLUSION

In this brief, a parallel stochastic number generator (SNG) has been proposed, which can generate all bits of a stochastic number in parallel. An omega-flip bit permutation (BP) network has been employed to achieve parallel stochastic number generation. Employing BP networks also results in improving the SCC of a stochastic number, as more permutations can be done for a given bit stream. Experimental analysis of the proposed SNG has shown an improvement in average SCC and energy-delay-product by 28.57% and 4.32×, respectively, as compared to the shared LFSR based SNG. Moreover, a method of sharing a single LFSR among multiple SNGs has also been presented, which brings advantages such as reduced area when complex applications with multiple SNGs are considered. Employing the proposed SNG in applications such as multiplication, edge detection, and complex multiplication shows an improvement in execution time by up to 1000× as well as an improvement in area-delay-product by up to 9×, when compared to other approaches. Future work will consider improving the energy consumption of the proposed SNG by considering alternate logic styles as well as by following time domain approaches in generating stochastic numbers.

REFERENCES

[1] B. Moons and M. Verhelst, “Energy-efficiency and accuracy of stochastic computing circuits in emerging technologies,” IEEE J. Emerg. Sel. Topic Circuits Syst., vol. 4, no. 4, pp. 475–486, Dec. 2014.
[2] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM Trans. Embedded Comput. Syst., vol. 12, no. 2s, pp. 1–19, May 2013.
[3] B. Yuan, Y. Wang, and Z. Wang, “Area-efficient scaling-free DFT/FFT design using stochastic computing,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 12, pp. 1131–1135, Dec. 2016.
[4] B. Yuan and K. K. Parhi, “Belief propagation decoding of polar codes using stochastic computing,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Montreal, QC, Canada, May 2016, pp. 157–160.
[5] K. K. Parhi and Y. Liu, “Architectures for IIR digital filters using stochastic computing,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Melbourne, VIC, Australia, Jun. 2014, pp. 373–376.
[6] Y. Zhu, P. Suo, and K. Bazargan, “Binary stochastic implementation of digital logic,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Monterey, CA, USA, Feb. 2014, pp. 171–180.
[7] R. Saraiva, J. C. Ruzicki, A. de Souza, and R. I. Soares, “Range segmentation to improve latency in parallel stochastic computing,” in Proc. IEEE 7th Latin Amer. Symp. Circuits Syst. (LASCAS), Florianópolis, Brazil, Feb./Mar. 2016, pp. 307–310.
[8] H. Ichihara, S. Ishii, D. Sunamori, T. Iwagaki, and T. Inoue, “Compact and accurate stochastic circuits with shared random number sources,” in Proc. IEEE 32nd Int. Conf. Comput. Design (ICCD), Seoul, South Korea, Oct. 2014, pp. 361–366.
[9] K. Kim, J. Lee, and K. Choi, “An energy-efficient random number generator for stochastic circuits,” in Proc. 21st Asia South Pac. Design Autom. Conf. (ASP-DAC), Jan. 2016, pp. 256–261.
[10] P. K. Gupta and R. Kumaresan, “Binary multiplication with PN sequences,” IEEE Trans. Acoust. Speech Signal Process., vol. 36, no. 4, pp. 603–606, Apr. 1988.
[11] Bit Permutations. Accessed on Nov. 6, 2016. [Online]. Available: http://programming.sirrida.de/bit_perm.html
[12] X. Yang and R. B. Lee, “Fast subword permutation instructions using omega and flip network stages,” in Proc. Int. Conf. Comput. Design, Austin, TX, USA, Sep. 2000, pp. 15–22.
[13] T.-H. Chen and J. P. Hayes, “Analyzing and controlling accuracy in stochastic circuits,” in Proc. 32nd IEEE Int. Conf. Comput. Design (ICCD), Seoul, South Korea, 2014, pp. 367–373.
[14] A. Alaghi and J. P. Hayes, “Fast and accurate computation using stochastic circuits,” in Proc. Autom. Test Europe Conf. Exhibit. Design (DATE), Dresden, Germany, Mar. 2014, pp. 1–4.
[15] P.-S. Ting and J. P. Hayes, “Isolation-based decorrelation of stochastic circuits,” in Proc. IEEE 34th Int. Conf. Comput. Design (ICCD), Scottsdale, AZ, USA, 2016, pp. 88–95.