A VLSI Implementation of the Blowfish Encryption/Decryption Algorithm† Michael C.-J. Lin, Youn-Long Lin Department of Computer Science National Tsing Hua University Hsin-Chu, Taiwan 30043, R.O.C. Cryptography is widely applied to protect digital data. Nowadays, there are many kinds of cryptography and most of them require a secret key to encode digital data. After applying a cryptography algorithm to our digital data, others can’t regain the original data easily without the secret key. Then, the private data are under protection. The Blowfish algorithm was designed by Bruce Schneier in 1993. It is a symmetric block cipher and each block is 64 bits. The secret key of Blowfish cryptography ranges from 32 bits to 448 bits. Blowfish has been examined for five years. Serge Vaudenay has examined weak keys in Blowfish. Vincent Rijmen's Ph.D. paper includes a second-order differential attack on 4-round Blowfish [2]. The key of the Blowfish algorithm is 448 bits, so it re-quires 2448 combinations to examine all keys. The Blowfish algorithm has many advantages. It is suitable and efficient for hardware implementation. Besides, it is unpatented and no license is required. The proposed architecture can produce 4-bit data per clock. The scan chains are also included in this architecture for testing. The die size of the chip is 5.7x6.1 mm2, and the The elementary operators of Blowfish algorithm include table-lookup, addition and XOR. The table includes four S-boxes (256x32bits) and a P-array (18x32bits). The Blowfish algorithm consists of four steps including table initialization, key initialization, data encryption and data decryption. Fig. 1 shows the Blowfish encryption algorithm III. THE PROPOSED ARCHITECTURE XL XR 8 bits 8 bits 8 bits S-box1 S-box2 S-box3 32 bits a b 32 bits 8 bits c Divide XL into four eight-bit quarters: a, b, c, and d F(XL) = ( ( S1[a] + S2[b] ) S3[c] ) + S4[d] Fig. 1 Blowfish algorithm maximum frequency is up to 50MHz. † Supported in part by the National Science Council, R.O.C, under contract no. NSC 88-2215-E-007-025 32 bits d 32 bits e f XOR XOR 32 bits XOR XOR XOR 32 bits XOR 32 bits XOR r1 32 bits Result Fig 2. DFG of the loop body XL XR 8 bits 8 bits 8 bits S-box1 S-box2 S-box3 a 32 bits 32 bits b c 32 bits CG XOR 32 bits 8 bits P-array 32 bits 32 bits S-box4 d 32 bits e f XOR XOR XOR CG Divide X into two 32-bit halves: XL, XR For i = 1 to 16: XL = XL Pi XR = F(XL) XR Swap XL and XR Swap XL and XR (Undo the last swap.) XR = XR P17 XL = XL P18 Concatenate XL and XR P-array 32 bits S-box4 32 bits CG I. INTRODUCTION II. BLOWFISH ALGORITHM CG Abstract We propose an efficient hardware architecture for the Blowfish algorithm [1]. The speed is up to 4 bit/clock, which is 9 times faster than a Pentium. By applying operator-rescheduling method, the critical path delay is improved by 21.7%. We have successfully implemented it using Compass cell library targeted at a 0.6 m TSMC SPTM CMOS process. The die size is 5.7x6.1 mm2 and the maximum frequency is 50MHz. 32 bits r2 XOR 32 bits Result Fig. 3 DFG after rescheduling A. Operator Rescheduling When calculating “s = a + b”, the i-th bit of s is equal to ai bi ci , where ci is the carry-in of i-th bit. Fig. 2 shows the original DFG of the loop body after replacing the add operation with CG and XOR function. The operators include only carry generators and XOR, so we can use operator-rescheduling method to reduce the critical path delay. Fig. 3 shows the result of operator rescheduling. The gray line in these figures shows the critical path. The original critical path delay is two CG delay plus five XOR delay. After rescheduling, the critical path delay is reduced to two CG delay plus two XOR delay. Three 2-input XOR delays are hidden. According to a synthesizer’s report, the improvement of critical path delay is about 21.7%. B. Fast Carry Generator The fast carry generator is based on a carry-lookahead adder [3]. We construct the carry generator using hierarchical 4-bit carry generators. The eight 4-bit carry generators in the first level act like the equations we mention above. C. The System Configuration The system consists of the controller and the datapath. Controller The controller is implemented as a finite state machine and described in a behavioral Verilog model. See Fig. 4. !reset start clear load e1 mode=0 mode=1 idle mode=3 initial e2 e4 mode=2 e3 decrypt encrypt Fig. 4 FSM of the controller Datapath register under DataIn to expand 4-bit input to 64-bit input and a shift register over DataOut to reduce 64-bit output to 4-bit output. CORE implements the loop of the 16-round iteration. A pipeline stage is added to the output of the SRAM modules. The pipeline stage will double the performance of the Blowfish hardware but lead to the overhead of area. D. DFT Consideration The testing circuit of the controller is done by adding scan registers to store the signals of the controller and scan out the contents of the registers in test mode. The datapath is described by Verilog RTL model. All of the flip-flops of the datapath are replaced by scan flip-flops. IV. EXPERIMENTAL RESULTS Table 1 shows the feature of this chip. The maximum frequency of this Blowfish cipher chip is 50MHz. Fig. 6 shows the photomicrograph. V. CONCLUSION The proposed hardware architecture of the Blowfish algorithm can achieve high-speed data transfer up to 4 bits per clock, which is 9 times faster than a Pentium. In this design, we avoid I/O limited constraint by modifying the I/O from 64 bits to 4 bits. By applying operator-rescheduling method, the critical path delay is improved about 21.7%. Besides, DFT is also taken into consideration. Specially, the chip is cascadable that means if two chips are used, the performance is double. The test results show that the maximum frequency of this Blowfish cipher chip is 50MHz. The proposed architecture has satisfied the need of high-speed data transfer and can be applied to security device of a system. It includes ROM modules, SRAM modules, and the main arithmetic units of Blowfish. Fig. 5 shows the datapath architecture. addr_p[3:0] clk sel5 sel6 oe_p we_p oe_s we_s[3:0] clk DataIn 4 clk Shift Register XOR s3 32 Mux 32 Mux s2 Mux 32 so_din s1_din Mux s1 p_din s2_din Mux 32 FF s0 ROM_P FF 32 FF p Mux 32 s3_din FF FF FF p_dout s0_dout SRAM_Sbox ROM_Sbox FF SRAM_P 32 FF 0 0 Mux Mux Mux Mux cnl0 sel0 sel1 s1_dout s2_dout clk en_de sel2 s3_dout CORE addr_s0 Mux addr_s1 Mux FF Mux addr_s3 Mux Mux addr_s2 FF 32 cnl1 Fig. 6 Photomicrograph 32 Shift Register clk 4 addr_s[7:0] sel4 sel3 DataOut Fig. 5 The architecture of the datapath The string is mapped to ROM_P and ROM_S-box. The P-array is mapped to SRAM_P, and the four S-boxes are mapped to SRAM_Sbox. Because the size of SRAM module is 2n words, P1 and P18 are implemented as registers, and the others are mapped to 16x32 bits SRAM. We use a shift REFERENCE [1] Bruce Schneier, “Applied Cryptography”, John Wiley & Sons, Inc. 1996 [2] The homepage of description of a new variable-length key, 64-bit block cipher http://www.counterpane.com/bfsverlag.html [3] Patterson and Hennessy, “Computer Organization & Design: The Hardware/ Software Interface”, Morgan Kaufmann, Inc. 1994