IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 55, NO. 1, JANUARY 2008

A Timing-Driven Approach to Synthesize Fast Barrel Shifters

Sabyasachi Das and Sunil P. Khatri

Abstract—In modern digital signal processing and graphics applications, the shifter is an important module that contributes a significant amount of delay. This brief presents an architectural optimization approach to synthesize a faster barrel shifter block, which can be useful to reduce the delay of the design without significantly increasing the area. We have divided the problem of generating the shifter into two steps: i) timing-driven selection of multiple stages for merging, and ii) the design of the merged stage. In our proposed method, we define the notion of a dual merged stage, where two stages are merged, and a triple merged stage, where three stages are merged, into a single composite stage. These merged stages are identified by using a timing-driven algorithm and are used in conjunction with some single stages of the traditional barrel shifter. The use of these merged stages helps reduce the depth of the proposed barrel shifter architecture, thereby improving the delay. The timing-driven nature of our algorithm helps produce a faster implementation for the overall shifter block. We have evaluated the performance of our design by using a number of technology libraries, timing constraints and shifter bit-widths. Our experimental data show that the shifter block generated by our algorithm is significantly faster (10.19% on average) than the shifter block generated by a commercially available datapath synthesis tool. These improvements were verified on placed-and-routed designs as well.

I. INTRODUCTION

AS WE MIGRATE toward ultra deep sub-micron feature sizes, digital designs are becoming increasingly complex, with very aggressive performance goals. Arithmetic components are typically highly computation-intensive, and are widely used in modern integrated circuits (ICs).
The shifter is an integral part of many digital designs. A barrel shifter is a combinational logic block that can shift a data word by any given number of bits in a single operation. Many applications require shift operations, including CPUs, floating-point operations (such as normalization), variable-length coding, word packing/unpacking, bit indexing, address generation and field extraction. Shifters are also essential in the digital signal processing field. The barrel shifter is a commonly used shifter architecture. One important reason for its widespread use is that it can perform multi-bit shifts in a single operation (within one clock cycle). In addition, the area of the barrel shifter is reasonably small, which helps keep the area of the design under control.

Manuscript received May 22, 2007; revised July 11, 2007. This paper was recommended by Associate Editor L. Lavagno. S. Das is with Synplicity Inc, Sunnyvale, CA 94087 USA (e-mail: sabya@synplicity.com). S. P. Khatri is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 30332 USA (e-mail: sunilkhatri@tamu.edu). Digital Object Identifier 10.1109/TCSII.2007.908951

Several techniques have been proposed to design efficient barrel shifters in different contexts. The basic architecture of a barrel shifter was introduced in [1]. High-speed pipelined architectures using TSPC were discussed in [2] and [3]. A high-performance and area-efficient CMOS 32-bit barrel switch and its physical design were presented in [4]. In [5], the number of stages in a shifter was reduced, resulting in significantly faster speed. A multilevel barrel shifter structure in the context of the CORDIC design was introduced in [6]. In [7], different design tradeoffs in the context of barrel shifters were analyzed. Timing-driven layout techniques for cyclic shifters were proposed in [8] and [9].
In [10], data-driven dynamic logic is used to generate a faster and more power-efficient barrel shifter than a domino-logic based design. A 4-bit barrel shifter in the QCA computing paradigm was introduced in [11]. A mixed-signal 32-bit rotator/shifter circuit design with short latency was discussed in [12]. Several low-power architectures for barrel shifters have been presented in [13]–[16]. The energy-delay evaluation of a low-power barrel switch is discussed in [17].

In this brief, we propose a timing-driven technique to synthesize a faster barrel shifter block. In our approach, we merge two (or three) stages of the shift operation into a single stage, leading to a reduction in the total number of stages. These stages are referred to as dual merged and triple merged stages. The decision to merge stages is made in a timing-driven fashion, so that the overall delay of the shifter is minimized. The optimizations involved in our approach are orthogonal to the ideas previously presented in this section.

We have organized the rest of the brief as follows: In Section II, we present some background information about the barrel shifter architecture. In Section III, we discuss our proposed approach in detail. Section IV presents the experimental results. Conclusions are drawn in Section V.

II. PRELIMINARIES

In this section, we briefly explain the concept of a barrel shifter and discuss how it is typically synthesized [18]. In a barrel shifter, if the data input signal is $n$ bits wide, then the shift signal is typically $\lceil \log_2 n \rceil$ bits wide. The width of the output signal of the shifter is typically the same as the input width ($n$). The shifter is divided into $\lceil \log_2 n \rceil$ stages, where each stage $i$ performs a single shift of 0 or $2^i$ bits, depending on the value of the $i$th bit of the shift signal. Each bit of the shift signal controls exactly one barrel shifter stage. The input data is shifted (or not shifted) by each of the stages in sequence.
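The staged operation described above can be illustrated with a small functional model. The following Python sketch is not part of the original brief: the function name `barrel_shift_left` and the use of a Python integer as a bit vector are assumptions made purely for illustration. It applies one stage per shift bit, where stage i shifts by 0 or 2^i positions and vacated positions are filled with logic-0.

```python
def barrel_shift_left(data: int, shift: int, n: int) -> int:
    """Functional model of an n-bit left barrel shifter (illustrative only)."""
    # Number of stages: ceil(log2(n)) for n >= 2.
    m = max(1, (n - 1).bit_length())
    mask = (1 << n) - 1
    value = data & mask
    for i in range(m):
        # Stage i is controlled by bit i of the shift signal
        # and shifts by either 0 or 2**i positions.
        if (shift >> i) & 1:
            value = (value << (1 << i)) & mask
    return value


if __name__ == "__main__":
    # 8-bit example with a 3-bit shift signal (3 stages).
    print(format(barrel_shift_left(0b00000101, 3, 8), "08b"))
```

In this model the result always agrees with a direct shift by the full amount, which is the property that lets the stages be applied in any order.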
To implement this staged shifting, multiplexers (or an equivalent logic circuit built from technology library cells) are used in each stage. Fig. 1 shows the block-level diagram of a 3-stage barrel shifter (left shifter), where each row represents a stage. In this figure, the logic-0 input signal is denoted by 1'b0 (Verilog notation). In this diagram, the data input signal is 8 bits wide and the output signal is also 8 bits wide. The shift signal has 3 bits and hence the shifter consists of 3 stages. Similarly, a shifter with a 128-bit input data signal would require 7 stages.

Fig. 1. Traditional left barrel shifter with 3 stages.

1549-7747/$25.00 © 2007 IEEE

III. OUR APPROACH

Throughout the rest of the brief, we will assume the data input to the barrel shifter is $n$ bits wide and the shift input signal $s$ is $m$ bits wide. The output is also $n$ bits wide. We denote $\lceil \log_2 n \rceil$ by $m$. Let $t_{s_i}$ be the arrival time of the $i$th bit of the shift signal $s$. In the traditional barrel shifter architecture, this block has $m$ stages and each stage consists of $n$ 2-to-1 multiplexers (MUXes). The timing-critical path of the shifter traverses through $m$ 2-to-1 MUXes.

To estimate the delay of a traditional barrel shifter stage, we identify the fastest 2-to-1 MUX cell from the provided technology library. For a MUX with data inputs $a$ and $b$, select input $\mathit{sel}$ and output $z$, the functionality of a 2-to-1 MUX cell can equivalently be implemented by one of the following two logic expressions:

$$z = (a \cdot \overline{\mathit{sel}}) + (b \cdot \mathit{sel}) \qquad (1)$$

$$z = \overline{\overline{(a \cdot \overline{\mathit{sel}})} \cdot \overline{(b \cdot \mathit{sel})}} \qquad (2)$$

In some technology libraries, the delay of the built-in 2-to-1 MUX cell is larger than the delay of a MUX built from basic gates using the functionality presented in (1) (AND-OR operation) or (2) (NAND-NAND operation).
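The equivalence of the AND-OR and NAND-NAND realizations of a 2-to-1 MUX can be checked exhaustively. The short Python sketch below is only an illustration: the signal names `a`, `b`, `sel` and the polarity (sel = 1 selects `b`) are assumptions, since the brief's own symbol names are not reproduced here.

```python
from itertools import product


def mux_and_or(a: int, b: int, sel: int) -> int:
    """AND-OR form: (a AND NOT sel) OR (b AND sel)."""
    return (a & (1 - sel)) | (b & sel)


def nand(x: int, y: int) -> int:
    """2-input NAND on single-bit values."""
    return 1 - (x & y)


def mux_nand_nand(a: int, b: int, sel: int) -> int:
    """NAND-NAND form: NAND(NAND(a, NOT sel), NAND(b, sel))."""
    return nand(nand(a, 1 - sel), nand(b, sel))


def forms_agree() -> bool:
    """Compare both forms against the MUX definition on all 8 input patterns."""
    for a, b, sel in product((0, 1), repeat=3):
        expected = b if sel else a
        if mux_and_or(a, b, sel) != expected:
            return False
        if mux_nand_nand(a, b, sel) != expected:
            return False
    return True


if __name__ == "__main__":
    print(forms_agree())
```

Both forms compute the same function, so the choice between them (and the library's built-in MUX cell) is purely a matter of delay and area.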
We consider the smallest of these three delays (the built-in MUX cell, the AND-OR realization and the NAND-NAND realization) as the delay of a single stage of the traditional barrel shifter. We denote this delay as $Del_1$.

In this brief, we introduce a technique to implement a faster barrel shifter. The key idea is to merge multiple (two or three) stages of the barrel shifter into one stage. We define mergeable stages as those which can be merged to create a hybrid stage (leading to faster performance of the shifter block). To identify mergeable stages, we design a timing-driven algorithm, so that the overall delay of the shifter block is minimized. In the following two subsections, we discuss each of the two steps (the design and identification of merged stages) in detail.

A. Design of the Merged Stages

To facilitate the explanation, we discuss the left shifter only. The same concept applies to the right shifter as well. In our approach, we attempt to merge two or three stages of the shifter into a single stage. If two stages are merged, we call the newly created stage a dual merged stage. If three stages are merged, we call the new stage a triple merged stage. Note that the stages to be merged are not necessarily consecutive.

In the case of dual merged stages, let us assume that we merge the stages corresponding to the $i$th bit and the $j$th bit of the shift signal $s$, where $0 \le i \le m-1$, $0 \le j \le m-1$ and $i \ne j$. Note that $s_i$ and $s_j$ are not required to be two consecutive bits of the shift signal. Our newly created dual merged stage will perform one of the following four operations:
1) no shifting operation (if $s_i = 0$ and $s_j = 0$);
2) shift by $2^i$ bits (if $s_i = 1$ and $s_j = 0$);
3) shift by $2^j$ bits (if $s_i = 0$ and $s_j = 1$);
4) shift by ($2^i + 2^j$) bits (if $s_i = 1$ and $s_j = 1$).

Let $a_p$ denote bit $p$ of the stage input and $z_p$ bit $p$ of the stage output. The functionality of each bit-slice of our dual merged stage for a left shifter is as follows:

$$z_p = \bar{s}_i\,\bar{s}_j\,d_0 + s_i\,\bar{s}_j\,d_1 + \bar{s}_i\,s_j\,d_2 + s_i\,s_j\,d_3 \quad \text{for } 0 \le p \le n-1,$$

where $d_0 = a_p$, $d_1 = a_{p-2^i}$, $d_2 = a_{p-2^j}$ and $d_3 = a_{p-2^i-2^j}$, and any data value with a negative index is logic-0.

Even if no merging is performed, for the left shifter, the functionality of a few bitslices near the least significant bits (LSB) of the shifter gets simplified, because some of the data values in the above expression become logic-0 (1'b0). For example, in the middle stage of Fig. 1, two bitslices near the LSB have simplified functionality. When merging is performed, this simplification can be exploited more aggressively.

The above expression indicates that the timing-critical path of each of our dual merged stages consists of a single inverter, a single 3-input NAND gate and a single 4-input NAND gate (in a NAND-NAND realization, each three-literal product term maps to a 3-input NAND gate and the four terms are combined by a 4-input NAND gate). We denote this delay as $Del_2$. The functionality of the dual merged stage can also be implemented by two individual stages of the barrel shifter placed one after the other. In all the technology libraries that we have explored, the delay of the dual merged stage ($Del_2$) is less than the delay of two cascaded stages of the traditional barrel shifter ($2 \times Del_1$).

In a similar manner, we can formulate the output equations of each bitslice of a triple merged stage. Let us assume that we merge the stages corresponding to the $i$th bit, the $j$th bit and the $k$th bit of the shift signal $s$, where $0 \le i, j, k \le m-1$ and $i$, $j$ and $k$ are pairwise distinct. Note that $s_i$, $s_j$ and $s_k$ are not required to be three consecutive bits of the shift signal. Our newly created triple merged stage will perform one of the following eight operations:
1) no shifting operation (if $s_i = 0$, $s_j = 0$ and $s_k = 0$);
2) shift by $2^i$ bits (if $s_i = 1$, $s_j = 0$ and $s_k = 0$);
3) shift by $2^j$ bits (if $s_i = 0$, $s_j = 1$ and $s_k = 0$);
4) shift by $2^k$ bits (if $s_i = 0$, $s_j = 0$ and $s_k = 1$);
5) shift by ($2^i + 2^j$) bits (if $s_i = 1$, $s_j = 1$ and $s_k = 0$);
6) shift by ($2^i + 2^k$) bits (if $s_i = 1$, $s_j = 0$ and $s_k = 1$);
7) shift by ($2^j + 2^k$) bits (if $s_i = 0$, $s_j = 1$ and $s_k = 1$);
8) shift by ($2^i + 2^j + 2^k$) bits (if $s_i = 1$, $s_j = 1$ and $s_k = 1$).

The functionality of each bit-slice of our triple merged stage for a left shifter is as follows:

$$z_p = \bar{s}_i\bar{s}_j\bar{s}_k\,d_0 + s_i\bar{s}_j\bar{s}_k\,d_1 + \bar{s}_i s_j\bar{s}_k\,d_2 + \bar{s}_i\bar{s}_j s_k\,d_3 + s_i s_j\bar{s}_k\,d_4 + s_i\bar{s}_j s_k\,d_5 + \bar{s}_i s_j s_k\,d_6 + s_i s_j s_k\,d_7 \quad \text{for } 0 \le p \le n-1,$$

where $d_0 = a_p$, $d_1 = a_{p-2^i}$, $d_2 = a_{p-2^j}$, $d_3 = a_{p-2^k}$, $d_4 = a_{p-2^i-2^j}$, $d_5 = a_{p-2^i-2^k}$, $d_6 = a_{p-2^j-2^k}$ and $d_7 = a_{p-2^i-2^j-2^k}$.

Similar to the dual merged stages, for the triple merged stage, the functionality of a few bitslices near the LSB (for a left shifter) or the MSB (for a right shifter) gets simplified, because some of the data values in the above expression become logic-0. This fact is aggressively exploited while merging stages. By decomposing the functionality of each bitslice, we find that the timing-critical path of each triple merged stage consists of a single inverter, a single 4-input NAND gate, a single 3-input NAND gate and a single 3-input OR gate. Based on the available cells in a technology library, there may be other more efficient ways of implementing the functionality of each bitslice as well. A general-purpose technology mapper is able to identify the most efficient implementation of the triple merged stage of a shifter. We denote the best possible delay of the triple merged stage as $Del_3$.

B. Identification of the Mergeable Stages

In addition to the design of the merged stages, the technique to identify the mergeable stages plays a key role in determining the performance of our proposed shifter architecture.

Algorithm 1: Identification of the Mergeable Stages

    MergeableStageList = NULL
    SelPriorityQueue = Store s_0, s_1, ..., s_{m-1} in ascending order of arrival time
    while SelPriorityQueue is not empty do
        (i, j, k) = Select first (earliest-arriving) three elements of the shift signal from SelPriorityQueue
        // If 3 stages are not remaining, then the algorithm takes a simpler route
        Tsingle1 = t_si + Del1
        Tdual    = t_sj + Del2
        Tsingle2 = Max((t_si + Del1), t_sj) + Del1
        Ttriple  = t_sk + Del3
        Tsingle3 = Max(Tsingle2, t_sk) + Del1
        if (Ttriple < Tsingle3) and (Ttriple < (Tdual + Del2/2)) then
            Create a new node (triplestage) with three elements i, j and k   // Select three stages for triple merging
            triplestage.element0 = i; triplestage.element1 = j; triplestage.element2 = k
            Add triplestage into MergeableStageList
            Remove (Deque) i, j and k from SelPriorityQueue
        else
            if (Tdual < Tsingle2) then
                Create a new node (dualstage) with two elements i and j   // Select two stages for dual merging
                dualstage.element0 = i; dualstage.element1 = j; dualstage.element2 = -1
                Add dualstage into MergeableStageList
                Remove (Deque) i and j from SelPriorityQueue
            else
                Create a new node (singlestage) with only one element i   // Not suitable for any merging
                singlestage.element0 = i; singlestage.element1 = -1; singlestage.element2 = -1
                Add singlestage into MergeableStageList
                Remove (Deque) i from SelPriorityQueue
            end if
        end if
    end while
    return MergeableStageList   // The list of all stages

The algorithm to identify the mergeable stages is presented in Algorithm 1. A detailed explanation is provided below.

Our algorithm uses the following timing-driven analysis to find two or three stages for merging: we store all the bits of the shift signal in ascending order of arrival time. To perform this operation in an efficient way, we use a priority queue data structure. Let us assume that the six earliest arriving signals are, in order, $s_i$, $s_j$, $s_k$, $s_u$, $s_v$ and $s_w$. For the signals $s_i$ and $s_j$, if we construct a dual merged stage, then the output of the dual merged stage will be available at time

$$T_{dual} = t_{s_j} + Del_2.$$

On the other hand, if we construct two individual stages, then the output of the second stage will be available at time

$$T_{single2} = \max(t_{s_i} + Del_1,\ t_{s_j}) + Del_1.$$

Similarly, for the signals $s_i$, $s_j$ and $s_k$, if we construct a triple merged stage, then the output of the triple merged stage will be available at time

$$T_{triple} = t_{s_k} + Del_3.$$

On the other hand, if we construct three individual stages and cascade them, then the output of the third stage will be available at time

$$T_{single3} = \max(T_{single2},\ t_{s_k}) + Del_1.$$

Now, if ($T_{triple} < T_{single3}$) and ($T_{triple} < T_{dual} + Del_2/2$), then we designate the three stages ($i$, $j$ and $k$) of the shifter as mergeable stages. If the two conditions above are not both true and if ($T_{dual} < T_{single2}$), then we designate the two stages ($i$ and $j$) as mergeable stages. If neither test succeeds (the triple-merge test fails and $T_{dual} \ge T_{single2}$), then we do not select stage $i$ for merging and implement a single stage for stage $i$.
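The selection procedure can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' C++ implementation: the function name `identify_mergeable_stages`, the tuple-based result and the delay values in the usage example are hypothetical. The handling of fewer than three remaining stages (try a dual merge if two remain, otherwise emit a single stage) is an assumption, because the brief only states that the algorithm "takes a simpler route" in that case. A sorted list stands in for the priority queue.

```python
def identify_mergeable_stages(arrival, del1, del2, del3):
    """Group shift-bit indices into single, dual or triple merged stages.

    arrival[i] is the arrival time of shift bit s_i; del1, del2 and del3 are
    the delays of a single, dual merged and triple merged stage.
    Returns a list of tuples of stage indices.
    """
    # Stage indices in ascending order of arrival time (priority-queue stand-in).
    queue = sorted(range(len(arrival)), key=lambda idx: arrival[idx])
    stages = []
    while queue:
        if len(queue) >= 3:
            i, j, k = queue[:3]
            t_dual = arrival[j] + del2
            t_single2 = max(arrival[i] + del1, arrival[j]) + del1
            t_triple = arrival[k] + del3
            t_single3 = max(t_single2, arrival[k]) + del1
            if t_triple < t_single3 and t_triple < t_dual + del2 / 2:
                stages.append((i, j, k))   # triple merged stage
                del queue[:3]
                continue
        if len(queue) >= 2:
            i, j = queue[:2]
            t_dual = arrival[j] + del2
            t_single2 = max(arrival[i] + del1, arrival[j]) + del1
            if t_dual < t_single2:
                stages.append((i, j))      # dual merged stage
                del queue[:2]
                continue
        stages.append((queue[0],))         # unmerged single stage
        del queue[:1]
    return stages


if __name__ == "__main__":
    # Hypothetical delays (arbitrary units); all shift bits arrive together.
    print(identify_mergeable_stages([0.0, 0.0, 0.0], 1.0, 1.5, 2.2))
```

With these hypothetical delays, three simultaneously arriving shift bits are merged into one triple stage, whereas a late third bit leaves only the first two merged.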
Next, we perform the same analysis with the three stages corresponding to the next three earliest arriving shift bits. For example, if we implemented a single stage for the earliest-arriving stage (in the previous analysis), then we would select the second, third and fourth earliest-arriving stages in this step of the algorithm. On the other hand, if we implemented a dual merged stage for the two earliest-arriving stages (in the previous analysis), then we would select the third, fourth and fifth earliest-arriving stages in this step of the algorithm. This analysis and identification of mergeable stages continues until all the stages are analyzed. At the end of this algorithm, our approach produces a list of all the mergeable stages.

TABLE I. AREA AND DELAY COMPARISON OF SHIFTER BLOCKS GENERATED BY A COMMERCIAL SYNTHESIS TOOL AND BY OUR APPROACH

Note that during technology mapping in our approach, the mapper sizes the gate driving each node based on its load capacitance. Also, the delay analysis for each configuration considers the actual capacitance of the output node, using a load-dependent delay model. Also, note that our merged nodes do not have high fanouts (a maximum of 4 for dual merged shifter stages and a maximum of 8 for triple merged shifter stages).

In terms of the execution of the flow, we first execute the API in Algorithm 1 to identify the mergeable stages. Once the configurations of all the stages (dual merged, triple merged and unmerged) are identified, we execute the second API to implement the merged stages (as well as single stages) in the netlist with proper connectivity between the stages (as described in Section III-A).

By using the dual merged, triple merged and unmerged stages, there can be several ways to design a barrel shifter.
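The design space mentioned above can be explored with a small functional sketch. The Python code below is illustrative only: the names `groupings`, `merged_stage` and `shifter` are hypothetical, and it assumes that a configuration is an unordered grouping of the stage indices into groups of one, two or three stages. It enumerates all such groupings and models each group as one stage that shifts by the sum of the selected powers of two.

```python
def groupings(stages):
    """Yield every split of `stages` into groups of 1, 2 or 3 stage indices."""
    if not stages:
        yield []
        return
    first, rest = stages[0], stages[1:]
    # `first` stays an unmerged single stage.
    for tail in groupings(rest):
        yield [(first,)] + tail
    # `first` is dual merged with one other stage.
    for a in range(len(rest)):
        remaining = rest[:a] + rest[a + 1:]
        for tail in groupings(remaining):
            yield [(first, rest[a])] + tail
    # `first` is triple merged with two other stages.
    for a in range(len(rest)):
        for b in range(a + 1, len(rest)):
            remaining = [s for idx, s in enumerate(rest) if idx not in (a, b)]
            for tail in groupings(remaining):
                yield [(first, rest[a], rest[b])] + tail


def merged_stage(value, shift, group, n):
    """Model one single, dual or triple merged left-shift stage on n bits."""
    amount = sum(1 << i for i in group if (shift >> i) & 1)
    return (value << amount) & ((1 << n) - 1)


def shifter(value, shift, config, n):
    """Apply the stages of one configuration in sequence."""
    for group in config:
        value = merged_stage(value, shift, group, n)
    return value


if __name__ == "__main__":
    configs = list(groupings([0, 1, 2, 3]))   # 4 stages, 16-bit data
    print(len(configs))
    # Every configuration computes the same left shift.
    print(all(shifter(0x1234, s, c, 16) == (0x1234 << s) & 0xFFFF
              for c in configs for s in range(16)))
```

Under this counting assumption the configurations differ only in stage depth and delay, not in function, which is why the choice among them can be made purely on timing.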
For example, a barrel shifter having 4 stages (with a 16-bit wide data input) can be designed in 14 different configurations. As the bit-width of the shift input of the shifter increases, the number of possible configurations also increases in a non-linear fashion. The timing-driven analysis in our algorithm enables us to identify a timing-efficient merging configuration. In summary, our approach can potentially reduce the number of stages of the shifter module to as little as one-third (33.33%) of the original number of stages, by selecting different groups of three or two stages and merging each group into a single stage.

IV. EXPERIMENTAL RESULTS

We have implemented our proposed algorithm in the C++ programming language. For all our experiments, we used a Linux workstation running RedHat 7.1 with dual 2.2-GHz processors and 4 GB of memory. To test the effectiveness of our approach under varying design conditions, we used the following design constraints.
• Shifter designs of different input widths: We used the Sh-8 block, where the input data and output signals are 8 bits wide and the shift signal is 3 bits wide. Following the same naming convention, we used the Sh-16, Sh-32, Sh-64 and Sh-128 blocks.
• Different technologies and libraries:
  - Two commercial libraries for 0.13 µm.
  - Two commercial libraries for 0.09 µm.
  - Two commercial libraries for 0.065 µm.
• Different input arrival time constraints: In many real-life designs, we have noticed that the shift input of a shifter comes from either the register banks or the outputs of multiple datapath blocks (Multiplier, Squarer, Multiply-Accumulator, Sum-of-Products, etc.). To test the effectiveness of our algorithm in different realistic situations, we used the following scenarios, which generate different arrival time constraints for the shift input signal of the shifter block.
1) A timing constraint where the shift input ($s$) of the shifter design comes from the output (or a selected set of bits of the output) of a multiplier.
2) A timing constraint where the shift input of the shifter design comes from the output of an adder.
3) A timing constraint where the shift input of the shifter comes from the output of bus-based multiplexers.
4) A timing constraint where the shift input of the shifter design comes from the output of register banks.

Let us denote $t_{s_i}$ as the arrival time of the signal $s_i$; each timing constraint is then expressed as a relation among these arrival times.

We have compared our algorithm against a well-known commercially available datapath synthesis tool. The synthesis tool generates arithmetic-optimized architectures for all the arithmetic blocks (like shifters) and then performs general-purpose operations such as technology-independent optimizations, constant propagation, redundancy removal, technology mapping, timing-driven optimization and area-driven optimization. While running the synthesis tool, we turned on all the above-mentioned optimizations. Due to licensing agreements, we are unable to mention the name of the commercial tool we used.

In Table I, we report the worst-case delay and the total area results obtained for the shifter block from the commercial synthesis tool and from our algorithm. In this table, we report 28 sets of data points involving different combinations of shifter blocks, timing constraints and technology libraries. Averaged over all 28 data points presented in Table I, our algorithm results in an approximately 10.19% faster implementation of the shifter block, with a 4.11% area penalty.
State-of-the-art designs have very strict timing goals; hence, most designers would be willing to accept a 10.19% delay improvement at the expense of a 4.11% area penalty of the shifter block only. We also note that the most frequently occurring timing situation is the one in which all the bits of the shift signal are available at the same time. In that case, our approach produces the largest improvement in speed (11.23% on average) with an average area overhead of 4.47%. This is in accordance with our expectation, because the availability of all the bits of the shift signal at the same time enables our approach to perform maximal merging of stages, which results in the largest performance improvement.

For reference purposes, we implemented the traditional barrel shifter and measured its delay and area numbers across all our shifter blocks, technology libraries and timing constraints. The experimental data showed that, on average, our proposed shifter is about 13.84% faster than the traditional barrel shifter, with a 7.29% area penalty.

To verify the correlation of post-synthesis experimental data with post place-and-route data, we performed placement and routing on 12 shifter blocks. For these test cases, the average post-routing worst-case delay of the shifter generated by our proposed approach is 0.91 (normalized to the worst-case delay of the shifter generated by the commercial synthesis tool). Similarly, the post-routing total area of the shifter generated by our proposed approach is 1.03 (normalized to the total area of the shifter generated by the commercial synthesis tool). These results after place and route confirm our conclusion about the efficiency of our approach.

In addition to stand-alone shifter blocks, we applied our algorithm to several large industrial designs in which the critical path traverses a shifter module. In such designs as well, our architecture reduces the delay of the shifter by 8% to 14%.
Our delay improvement is consistent across multiple sizes of shifters, timing constraints and technology libraries. This underscores the strength of our algorithm. Since the shifter is one of the key datapath operations in modern digital design, we believe that the timing-critical portions of many real-life designs can significantly benefit from our algorithm. Since our proposed approach works on general-purpose shifter blocks, it can also be used for rotators and shift-rotate blocks.

V. CONCLUSION

In this brief, we have presented a new approach to implement a faster shifter block, which is very useful when the critical path of the design traverses the shifter block. Our timing-driven algorithm to identify mergeable shifter stages, coupled with our architecture based on the merging of these stages, works seamlessly with different types of shifter blocks and arrival timing constraints, and across different technology domains (0.13 µm, 0.09 µm, 0.065 µm). Our experimental results indicate that our implementation of the shifter is significantly faster (with a modest area penalty) than shifters generated by a commercially available datapath synthesis tool.

REFERENCES

[1] R. S. Lim, “A barrel switch design,” Computer Design, pp. 76–78, 1972.
[2] R. Pereira, J. A. Michell, and J. M. Solana, “Pipelined TSPC barrel shifter with scan test facilities for VLSI implementation of high speed DSP applications,” in Proc. Euro ASIC ’92, 1992, p. 405.
[3] R. Pereira, J. A. Michell, and J. M. Solana, “Fully pipelined TSPC barrel shifter for high-speed applications,” IEEE J. Solid-State Circuits, vol. 30, no. 3, pp. 686–690, Jun. 1995.
[4] S. M. Kang, “Domino-CMOS barrel switch for 32-bit VLSI processors,” IEEE Circuits Devices Mag., vol. 3, no. 3, pp. 3–8, Mar. 1987.
[5] G. M. Tharakan and S. M. Kang, “A new design of a fast barrel switch network,” IEEE J. Solid-State Circuits, pp. 217–221, 1992.
[6] S.-J. Yih, M. Cheng, and W. S. Feng, “Multilevel barrel shifter for CORDIC design,” Electron. Lett., vol. 32, no. 13, pp. 1178–1179, 1996.
[7] V. Milutinovic, M. Bettinger, and W. Helbig, “Multiplier/shifter design tradeoffs in a 32-bit microprocessor,” IEEE Trans. Comput., vol. 38, no. 8, pp. 874–880, Aug. 1989.
[8] P. M. Seidel and K. Fazel, “Two dimensional folding strategies for improved layouts of cyclic shifters,” in Proc. IEEE Comput. Soc. Ann. Symp. VLSI, 2004, pp. 277–278.
[9] M. A. Hillebrand, T. Schurger, and P. M. Seidel, “How to half wire lengths in the layout of cyclic shifters,” in Proc. IEEE Int. Conf. VLSI Design, 2001, pp. 339–344.
[10] R. Rafati, S. M. Fakhraie, and K. C. Smith, “A 16-bit barrel-shifter implemented in data-driven dynamic logic (D3L),” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 10, pp. 2194–2202, Oct. 2006.
[11] A. Vetteth, K. Walus, V. S. Dimitrov, and G. A. Jullien, “Quantum dot cellular automata carry-look-ahead adder and barrel shifter,” in Proc. IEEE Emerging Telecommunications Technologies Conf., Dallas, TX, Sep. 2002.
[12] A. P. Singh, M. Barany, and D. J. Deleganes, “A mixed signal rotator/shifter for 8 GHz Intel® Pentium® 4 integer core,” in Proc. Symp. VLSI Circuits, 2004, pp. 394–397.
[13] K. P. Acken, M. J. Irwin, and R. M. Owens, “Power comparisons for barrel shifters,” in Proc. Int. Symp. Low Power Electron. Design, 1996, pp. 209–212.
[14] P. A. Beerel, S. Kim, P.-C. Yeh, and K. Kim, “Statistically optimized asynchronous barrel shifters for variable length codecs,” in Proc. Int. Symp. Low Power Electronics and Design, 1999, pp. 261–263.
[15] R. Ramadoss, “A new breed of power-aware hybrid shifters,” in Proc. IEEE Int. SOC Conf., 2005, pp. 143–146.
[16] K. H. Abed and R. E. Siferd, “CMOS VLSI implementation of a low-power logarithmic converter,” IEEE Trans. Comput., vol. 52, pp. 1421–1433, 2003.
[17] R. V. K. Pillai, D. Al-Khalili, and A. J. Al-Khalili, “Energy delay measures of barrel switch architectures for pre-alignment of floating point operands for addition,” in Proc. Int. Symp. Low Power Electronics and Design, 1997, pp. 235–238.
[18] M. D. Ercegovac and T. Lang, Digital Arithmetic, ser. Computer Architecture and Design. New York: Morgan Kaufmann, 2003.