Word file

advertisement
A VLSI Implementation of the Blowfish Encryption/Decryption Algorithm†
Michael C.-J. Lin, Youn-Long Lin
Department of Computer Science
National Tsing Hua University
Hsin-Chu, Taiwan 30043, R.O.C.
Cryptography is widely applied to protect digital data.
Nowadays, there are many kinds of cryptography and most
of them require a secret key to encode digital data. After
applying a cryptography algorithm to our digital data, others
can’t regain the original data easily without the secret key.
Then, the private data are under protection.
The Blowfish algorithm was designed by Bruce Schneier
in 1993. It is a symmetric block cipher and each block is 64
bits. The secret key of Blowfish cryptography ranges from 32
bits to 448 bits.
Blowfish has been examined for five years. Serge
Vaudenay has examined weak keys in Blowfish. Vincent
Rijmen's Ph.D. paper includes a second-order differential
attack on 4-round Blowfish [2]. The key of the Blowfish
algorithm is 448 bits, so it re-quires 2448 combinations to
examine all keys.
The Blowfish algorithm has many advantages. It is
suitable and efficient for hardware implementation. Besides,
it is unpatented and no license is required.
The proposed architecture can produce 4-bit data per
clock. The scan chains are also included in this architecture
for testing. The die size of the chip is 5.7x6.1 mm2, and the
The elementary operators of Blowfish algorithm include
table-lookup, addition and XOR. The table includes four
S-boxes (256x32bits) and a P-array (18x32bits).
The Blowfish algorithm consists of four steps including
table initialization, key initialization, data encryption and
data decryption. Fig. 1 shows the Blowfish encryption
algorithm
III. THE PROPOSED ARCHITECTURE
XL
XR
8 bits
8 bits
8 bits
S-box1
S-box2
S-box3
32 bits
a
b
32 bits
8 bits
c
Divide XL into four eight-bit quarters: a, b, c, and d
F(XL) = ( ( S1[a] + S2[b] )  S3[c] ) + S4[d]
Fig. 1 Blowfish algorithm
maximum frequency is up to 50MHz.
†
Supported in part by the National Science Council, R.O.C,
under contract no. NSC 88-2215-E-007-025
32 bits
d
32 bits
e
f
XOR
XOR
32 bits
XOR
XOR
XOR
32 bits
XOR
32 bits
XOR
r1
32 bits
Result
Fig 2. DFG of the loop body
XL
XR
8 bits
8 bits
8 bits
S-box1
S-box2
S-box3
a
32 bits
32 bits
b
c
32 bits
CG
XOR
32 bits
8 bits
P-array
32 bits
32 bits
S-box4
d
32 bits
e
f
XOR
XOR
XOR
CG
Divide X into two 32-bit halves: XL, XR
For i = 1 to 16:
XL = XL Pi
XR = F(XL)  XR
Swap XL and XR
Swap XL and XR (Undo the last swap.)
XR = XR  P17
XL = XL  P18
Concatenate XL and XR
P-array
32 bits
S-box4
32 bits
CG
I. INTRODUCTION
II. BLOWFISH ALGORITHM
CG
Abstract We propose an efficient hardware architecture for
the Blowfish algorithm [1]. The speed is up to 4 bit/clock,
which is 9 times faster than a Pentium. By applying
operator-rescheduling method, the critical path delay is
improved by 21.7%. We have successfully implemented it
using Compass cell library targeted at a 0.6 m TSMC
SPTM CMOS process. The die size is 5.7x6.1 mm2 and the
maximum frequency is 50MHz.
32 bits
r2
XOR
32 bits
Result
Fig. 3 DFG after rescheduling
A. Operator Rescheduling
When calculating “s = a + b”, the i-th bit of s is equal to ai
 bi  ci , where ci is the carry-in of i-th bit.
Fig. 2 shows the original DFG of the loop body after
replacing the add operation with CG and XOR function.
The operators include only carry generators and XOR, so
we can use operator-rescheduling method to reduce the
critical path delay. Fig. 3 shows the result of operator
rescheduling.
The gray line in these figures shows the critical path. The
original critical path delay is two CG delay plus five XOR
delay. After rescheduling, the critical path delay is reduced to
two CG delay plus two XOR delay. Three 2-input XOR
delays are hidden. According to a synthesizer’s report, the
improvement of critical path delay is about 21.7%.
B. Fast Carry Generator
The fast carry generator is based on a carry-lookahead
adder [3]. We construct the carry generator using hierarchical
4-bit carry generators. The eight 4-bit carry generators in the
first level act like the equations we mention above.
C. The System Configuration
The system consists of the controller and the datapath.
Controller
The controller is implemented as a finite state machine
and described in a behavioral Verilog model. See Fig. 4.
!reset
start
clear
load
e1
mode=0
mode=1
idle
mode=3
initial
e2
e4
mode=2
e3
decrypt
encrypt
Fig. 4 FSM of the controller
Datapath
register under DataIn to expand 4-bit input to 64-bit input
and a shift register over DataOut to reduce 64-bit output to
4-bit output.
CORE implements the loop of the 16-round iteration. A
pipeline stage is added to the output of the SRAM modules.
The pipeline stage will double the performance of the
Blowfish hardware but lead to the overhead of area.
D. DFT Consideration
The testing circuit of the controller is done by adding
scan registers to store the signals of the controller and scan
out the contents of the registers in test mode.
The datapath is described by Verilog RTL model. All of
the flip-flops of the datapath are replaced by scan flip-flops.
IV. EXPERIMENTAL RESULTS
Table 1 shows the feature of this chip. The maximum
frequency of this Blowfish cipher chip is 50MHz. Fig. 6
shows the photomicrograph.
V. CONCLUSION
The proposed hardware architecture of the Blowfish
algorithm can achieve high-speed data transfer up to 4 bits
per clock, which is 9 times faster than a Pentium. In this
design, we avoid I/O limited constraint by modifying the I/O
from 64 bits to 4 bits. By applying operator-rescheduling
method, the critical path delay is improved about 21.7%.
Besides, DFT is also taken into consideration. Specially, the
chip is cascadable that means if two chips are used, the
performance is double. The test results show that the
maximum frequency of this Blowfish cipher chip is 50MHz.
The proposed architecture has satisfied the need of
high-speed data transfer and can be applied to security device
of a system.
It includes ROM modules, SRAM modules, and the main
arithmetic units of Blowfish. Fig. 5 shows the datapath
architecture.
addr_p[3:0]
clk
sel5
sel6
oe_p we_p oe_s we_s[3:0]
clk
DataIn
4
clk
Shift Register
XOR
s3
32
Mux
32
Mux
s2
Mux
32
so_din
s1_din
Mux
s1
p_din
s2_din
Mux
32
FF
s0
ROM_P
FF
32
FF
p
Mux
32
s3_din
FF
FF
FF
p_dout
s0_dout
SRAM_Sbox
ROM_Sbox
FF
SRAM_P
32
FF
0
0
Mux
Mux
Mux
Mux
cnl0
sel0
sel1
s1_dout
s2_dout
clk
en_de
sel2
s3_dout
CORE
addr_s0
Mux
addr_s1
Mux
FF
Mux
addr_s3
Mux
Mux
addr_s2
FF
32
cnl1
Fig. 6 Photomicrograph
32
Shift Register
clk
4
addr_s[7:0]
sel4
sel3
DataOut
Fig. 5 The architecture of the datapath
The  string is mapped to ROM_P and ROM_S-box. The
P-array is mapped to SRAM_P, and the four S-boxes are
mapped to SRAM_Sbox. Because the size of SRAM module
is 2n words, P1 and P18 are implemented as registers, and the
others are mapped to 16x32 bits SRAM. We use a shift
REFERENCE
[1] Bruce Schneier, “Applied Cryptography”, John Wiley & Sons, Inc. 1996
[2] The homepage of description of a new variable-length key, 64-bit block
cipher http://www.counterpane.com/bfsverlag.html
[3] Patterson and Hennessy, “Computer Organization & Design: The
Hardware/ Software Interface”, Morgan Kaufmann, Inc. 1994
Download