Low-Latency Digit-Serial and Digit-Parallel Systolic
Multipliers for Large Binary Extension Fields
Jeng-Shyang Pan, Senior Member, IEEE, Chiou-Yng Lee, Senior Member, IEEE, and Pramod Kumar Meher,
Senior Member, IEEE
Abstract—For cryptographic applications, such as the elliptic curve digital signature algorithm (ECDSA) and pairing algorithms, crypto-processors are required to perform a large number of additions and multiplications over finite fields of large order. To achieve a balanced trade-off between space complexity and time complexity, this paper presents novel digit-serial and digit-parallel systolic structures for computing multiplication over $GF(2^m)$. Based on a novel decomposition algorithm, we have derived an efficient digit-serial systolic architecture, which involves a latency of $O(\sqrt{m/d})$ clock cycles, while the existing digit-serial systolic multipliers involve at least $O(m/d)$ latency for digit-size $d$. The proposed digit-serial design can be used for AESP-based fields with the same digit-size as for trinomial-based fields, with a small increase in area. We have also proposed a digit-parallel systolic architecture employing the n-term Karatsuba-like method, where the latency can be reduced from $O(\sqrt{m/d})$ to $O(\sqrt{m/(nd)})$. This feature would be a major advantage for implementing multiplication for fields of large order. Synthesis results show that the proposed architectures have significantly lower time complexity, lower area-delay product, and higher bit-throughput than the existing digit-serial multipliers.
Index Terms—Karatsuba-like multiplication, elliptic curve digital signature algorithm, least-significant digit first (LSD-first) multiplication, pairing algorithm, almost equally spaced polynomial (AESP).
I. INTRODUCTION
Finite field arithmetic is widely used in cryptography and error control coding [1], [2]. For cryptographic applications, such as the elliptic curve digital signature algorithm (ECDSA) [3], [4], elliptic curve cryptography (ECC) has been used in many security-sensitive applications in various contexts. Recently, cryptographic pairing has been used extensively to derive various security protocols, such as identity-based cryptography [5] and short signature schemes [6]. For such protocols, the Weil and Tate pairings based on elliptic curve arithmetic require thousands of additions and multiplications over large finite fields, and have drawn the attention of many researchers [30], [31], [32]. From the point of view of VLSI implementation, the realization of pairing computation in resource-constrained applications is highly challenging due to its high computational demand compared to classical ECC-based crypto-systems. On the one hand, the latency of implementations and the logic complexity of arithmetic operations increase with the field order; on the other hand, the number of arithmetic operations also increases in this case. For high-speed architectures, the area and power complexities of pairing implementations become too high. Therefore, based on the requirements of the application, a trade-off needs to be reached between speed performance and power/energy complexity.

J.-S. Pan is with Innovative Information Industry Research Center (IIIRC), Shenzhen Graduate School, Harbin Institute of Technology, China (e-mail: jengshyangpan@gmail.com).
C.-Y. Lee is with the Department of Computer Information and Network Engineering, Lunghwa University of Science and Technology, Taoyuan 33306, Taiwan (e-mail: pp010@mail.lhu.edu.tw).
P. K. Meher is with Institute for Infocomm Research, 138632 Singapore (e-mail: pkmeher@i2r.a-star.edu.sg).
Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Systolic designs have several advantages such as regularity and modularity of design, simplicity of their processing elements (PEs), local interconnections and high throughput rates. Therefore, several systolic architectures have been proposed for finite field multipliers. Systolic multipliers for binary extension fields are mainly of two types: bit-parallel and bit-serial multipliers. The systolic multipliers over the extended binary field $GF(2^m)$ [7], [8], [11], [21], [25] usually employ either the least-significant bit first (LSB-first) or the most-significant bit first (MSB-first) algorithm. Bit-serial multipliers require less hardware and less power, but they are slow. Bit-parallel multipliers are fast but involve high hardware and power complexities. To obtain systolic multipliers with less hardware complexity, the field is usually constructed from special polynomials, such as all-one polynomials, pentanomials, and trinomials. Fully bit-parallel systolic multipliers based on the Toeplitz matrix-vector product approach are proposed in [9] and [10]. However, these architectures for the polynomial basis of $GF(2^m)$ require $O(m^2)$ XOR gates, $O(m^2)$ AND gates, $O(m^2)$ 1-bit latches, and $O(m)$ latency. Recently, by using the special properties of the reduction polynomial for trinomials, a super systolic multiplier [22] has been proposed to reduce the latency from $O(m)$ to $O(\sqrt{m})$ clock cycles. Bit-serial systolic array multipliers, on the other hand, require only $O(m)$ space complexity, but they involve longer computational delays.
To have a trade-off between speed and area complexities, digit-serial multipliers have been proposed in the literature. Digit-serial multiplier designs are classified into digit-in digit-out (DIDO) designs, digit-in parallel-out (DIPO) designs, and scalable designs. The digit-serial polynomial basis multiplier with the DIPO structure is proposed in [13], [14] and [23]. A scalable and systolic multiplier using a fixed $d \times d$ bit-parallel Hankel matrix-vector multiplier has been proposed in [15] and [12], whose latency is $(d + \lceil m/d \rceil (\lceil m/d \rceil - 1))$ clock cycles. Digit-serial systolic multipliers using the DIDO architecture are presented in [16], [17] and [24]. The latency of these digit-serial systolic multipliers is $2\lceil m/d \rceil - 1$ clock cycles. As mentioned above, the complexity of digit-serial systolic finite field multipliers depends on the selected irreducible polynomial and the chosen basis representation.
In this paper, we present a novel decomposition scheme for digit-serial multiplication, and based on that we have derived a low-latency digit-serial systolic multiplier. The proposed digit-serial systolic architecture achieves a latency as low as $2\lceil\sqrt{m/d}\rceil$ clock cycles, where $d$ is the selected digit-size. Using a fixed digit-size, say $d = 4$, our proposed digit-serial systolic multiplier for $GF(2^{409})$ requires 22 clock cycles, while the existing digit-serial multipliers presented in [13] and [14] need 103 clock cycles. When the selected digit-size is one bit, the latency of the proposed multiplier is the same as that of Meher's multiplier [22]. The divide-and-conquer algorithm of Karatsuba-Ofman [19] is used to reduce the space complexity of the multiplier in [27] and [28]. Recently, Montgomery has proposed the Karatsuba-like function [20]. Based on that, we have proposed a digit-parallel systolic multiplier to achieve a trade-off between time and area complexities in the proposed systolic multipliers for large binary extension fields.
The rest of the paper is organized as follows. Section II briefly reviews the classic LSD-first digit-serial multiplier over $GF(2^m)$. In Section III, we propose the novel digit-serial multiplication algorithm to develop a digit-serial systolic multiplier. We utilize the Karatsuba-like method to derive a digit-parallel systolic multiplier in Section IV. In Section V, time and space complexities of the proposed multipliers and the corresponding existing works are presented and compared. Finally, we conclude the paper in Section VI.

II. TRADITIONAL DIGIT-SERIAL MULTIPLIER OVER $GF(2^m)$
In this section, we briefly review the digit-serial multiplication algorithm [14]. Let the field $GF(2^m)$ be constructed from an irreducible polynomial $F(x)$ of degree $m$, and let two elements $A$ and $B$ be represented in the polynomial basis of $GF(2^m)$, i.e.,
$$A = a_0 + a_1 x + \cdots + a_{m-1}x^{m-1}$$
$$B = b_0 + b_1 x + \cdots + b_{m-1}x^{m-1}$$
where $a_i$ and $b_i$ for $0 \le i \le m-1$ are 0 or 1. Finite field multiplication of two elements $A$ and $B$ is given by
$$C = AB \bmod F(x) \tag{1}$$
Various schemes have been reported in the literature to achieve low-cost hardware implementation of (1) in resource-constrained environments. The digit-serial multiplier, introduced in [14], provides a trade-off between speed and area complexities. In the following we discuss the least significant digit (LSD) first multiplication to derive a digit-serial multiplier architecture.
Let $A$, $B$, and $C$ be three elements in $GF(2^m)$ generated by the irreducible polynomial $F(x)$. The three elements are given in polynomial basis representation, where $C = AB \bmod F(x)$. Let us assume that $q = \lceil m/d \rceil$, where $d$ is the selected digit size. If $m$ is not a multiple of $d$, then a field element can be padded by $(qd - m)$ zero bits as $A = (a_0, a_1, \cdots, a_{m-1}, \underbrace{0, \ldots, 0}_{qd-m \text{ bits}})$. Accordingly, an element $A$ can be represented by $A = \sum_{i=0}^{q-1} A_i x^{id}$, where $A_i = a_{id} + a_{id+1}x + \cdots + a_{id+d-1}x^{d-1}$. By using the LSD-first multiplication scheme, the product $C$ can be rewritten as
$$C = AB \bmod F(x) = B(A_0 + A_1 x^d + \cdots + A_{q-1}x^{(q-1)d}) \bmod F(x) = (\overline{C}_0 + \overline{C}_1 + \cdots + \overline{C}_{q-1}) \bmod F(x) \tag{2}$$
where
$$\overline{C}_i = B^{(i)} A_i \tag{3}$$
$$B^{(i)} = Bx^{di} \bmod F(x) = x^d B^{(i-1)} \bmod F(x), \quad B^{(0)} = B. \tag{4}$$
As mentioned above, the traditional LSD-first multiplication given by (2) can be described by Algorithm 1. Fig. 1 shows a digit-serial multiplier over $GF(2^m)$ based on Algorithm 1. It consists of one multiplier core, two registers for the two reduction operations ($Bx^d \bmod F(x)$ and $C \bmod F(x)$), and one $(m+d)$-bit adder. The multiplier core computes the term $A_iB$ of step 3. In the initial step, the register <B> is initialized with the element $B$, and the register <C> is initialized with zero. According to the LSD-first multiplication of (2), after $\lceil m/d \rceil$ clock cycles, the register <C> provides $C = \overline{C}_0 + \overline{C}_1 + \cdots + \overline{C}_{q-1}$. The final reduction in step 6 then computes $C = C \bmod F(x)$ to obtain the complete multiplication. Thus, the architecture of Fig. 1 for the LSD-based digit-serial multiplier requires $\lceil m/d \rceil + 1$ clock cycles.

Algorithm 1 Traditional LSD-first multiplication scheme [14]
Input: $A$ and $B$ are two elements in $GF(2^m)$
Output: $C = AB \bmod F(x)$
1. $C = 0$; $A = A_0 + A_1x^d + \cdots + A_{q-1}x^{d(q-1)}$, where $A_i = \sum_{j=0}^{d-1} a_{id+j}x^j$
2. For $i = 0$ to $q-1$
3.   $C = C + A_iB$;
4.   $B = Bx^d \bmod F(x)$;
5. endfor
6. $C = C \bmod F(x)$
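The LSD-first scheme of Algorithm 1 can be sketched in software as follows. This is only a behavioural model under stated assumptions, not the hardware of Fig. 1: polynomials over GF(2) are held in Python integers (bit i is the coefficient of $x^i$), and the helper names `poly_mod`, `clmul` and `lsd_first_mul` are hypothetical names introduced for this illustration. The $GF(2^8)$ modulus `0x11B` in the usage lines is the AES polynomial, chosen only because its products are well known.

```python
def poly_mod(c: int, f: int) -> int:
    """Reduce the GF(2) polynomial c modulo f (bit i = coefficient of x^i)."""
    deg_f = f.bit_length() - 1
    while c.bit_length() - 1 >= deg_f:
        # Cancel the leading term of c with a shifted copy of f.
        c ^= f << (c.bit_length() - 1 - deg_f)
    return c


def clmul(a: int, b: int) -> int:
    """Carry-less product of a and b in GF(2)[x], without reduction."""
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    return r


def lsd_first_mul(a: int, b: int, f: int, d: int) -> int:
    """Behavioural model of Algorithm 1: C = A*B mod F, one d-bit digit per step."""
    m = f.bit_length() - 1
    q = -(-m // d)                    # q = ceil(m / d) digits of A
    mask = (1 << d) - 1
    c = 0
    for i in range(q):                # step 2
        a_i = (a >> (i * d)) & mask   # digit A_i (zero-padded at the top)
        c ^= clmul(a_i, b)            # step 3: C = C + A_i * B
        b = poly_mod(b << d, f)       # step 4: B = B * x^d mod F
    return poly_mod(c, f)             # step 6: final reduction


# Usage: {57} * {83} = {c1} in GF(2^8) with F(x) = x^8 + x^4 + x^3 + x + 1.
assert lsd_first_mul(0x57, 0x83, 0x11B, 4) == 0xC1
```

Each loop iteration corresponds to one clock cycle of the digit-serial datapath, so the model performs $\lceil m/d \rceil$ iterations followed by one final reduction.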
Figure 1. Traditional LSD-first digit-serial multiplier [14]. (Blocks shown: register <B> with the $Bx^d \bmod F(x)$ reduction, digit inputs $A_{q-1}, \ldots, A_1, A_0$, the multiplier core, and register <C> with the $C \bmod F(x)$ reduction.)

III. PROPOSED DIGIT-SERIAL SYSTOLIC MULTIPLIER
In order to derive a new digit-serial systolic multiplier, we briefly review the basic properties of the reduction polynomial. Let $B = \sum_{i=0}^{m-1} b_i x^i$ be an element in $GF(2^m)$, and let the field $GF(2^m)$ be constructed from an irreducible polynomial of the form $F(x) = 1 + x^l + x^m$. Let us represent $xB \bmod F(x)$ as
$$B^{(1)} = xB \bmod F(x) = \sum_{i=0}^{m-1} b_i x^{i+1} \bmod F(x) = \Big(b_{m-1} + \sum_{i=1}^{m-1} b_{i-1}x^i\Big) + b_{m-1}x^l = \overline{B}^{(1)} + b_{m-1}x^l \tag{5}$$
where
$$\overline{B}^{(1)} = b_{m-1} + b_0x + \cdots + b_{m-2}x^{m-1} \tag{6}$$
Generalizing (5), $x^dB \bmod F(x)$ can be rewritten as
$$B^{(d)} = x^dB \bmod F(x) = xB^{(d-1)} \bmod F(x) = \overline{B}^{(d)} + \sum_{i=1}^{d} b_{m-i}x^{l+d-i} \tag{7}$$
where $\overline{B}^{(d)}$ denotes $B$ cyclically shifted by $d$ positions, and $d \le m - l$ is assumed so that a single reduction step suffices.
Let $p$ and $k$ be two integers which satisfy $kp = q = \lceil m/d \rceil$, where $d$ is the selected digit-size. Note that if $q$ is not divisible by $k$, one needs to append zeros at the most significant end of $A$ to satisfy $q = kp$. Then the element $A$ can be rewritten as $A = \sum_{i=0}^{q-1} A_i x^{id}$, where $A_i = \sum_{j=0}^{d-1} a_{id+j}x^j$. The product $C$ can be represented by
$$C = AB \bmod F(x) = \sum_{i=0}^{q-1} A_i x^{id} B \bmod F(x) = \sum_{i=0}^{p-1} C_i x^{dki} \bmod F(x) \tag{8}$$
where
$$C_i = \sum_{j=0}^{k-1} A_{ik+j}x^{jd}B = \sum_{j=0}^{k-1} A_{ik+j}B^{(jd)} = \sum_{j=0}^{k-1} C_{ij}, \tag{9}$$
$$C_{ij} = A_{ik+j}B^{(jd)} \tag{10}$$
Note that the partial product of (10) is not in reduced form. To simplify the product $C$ in (8), the partial product $C_i x^{dki}$ is rewritten as
$$\overline{C}_i = C_i x^{dki} = \sum_{j=0}^{k-1} A_{ik+j}x^{jd}x^{dki}B = \sum_{j=0}^{k-1} A_{ik+j}B_i^{(jd)} \tag{11}$$
where
$$B_i = x^{kid}B \bmod F(x) = x^{kd}B_{i-1} \bmod F(x)$$
As mentioned above, the product $C$ can also be represented as
$$C = \sum_{i=0}^{p-1} \overline{C}_i \bmod F(x) \tag{12}$$
The proposed digit-serial multiplication scheme based on (12) is described in Algorithm 2.

Algorithm 2 Proposed digit-serial multiplication algorithm
Inputs: $A$ and $B$ are two elements in $GF(2^m)$.
Output: $C = AB \bmod F(x)$.
1. Initialization step
   $C = 0$.
   $A = \sum_{i=0}^{kp-1} A_i x^{id}$, where $A_i = a_{id} + a_{id+1}x + \cdots + a_{id+d-1}x^{d-1}$.
2. Multiplication step
   2.1. for $i = 0$ to $p-1$
   2.2.   $D = B$
   2.3.   $B = x^{kd}B \bmod F(x)$
   2.4.   for $j = 0$ to $k-1$
   2.5.     $C = C + DA_{ik+j}$
   2.6.     $D = x^dD \bmod F(x)$
   2.7.   endfor
   2.8. endfor
3. Final polynomial reduction step
   3.1. $C = C \bmod F(x)$

Figure 2. The proposed digit-serial systolic multiplier architecture for k = 4. (Blocks shown: register <B> with reduction module R3, processing elements PE1 to PE4 fed with the digits $A_0, \ldots, A_7$, and the RAC module containing reduction module R2 and register <C>.)

The proposed architecture for the new digit-serial multiplier based on Algorithm 2 is shown in Fig. 2. It consists of $k$ processing elements (PEs), one reduction module R3, one reduction-accumulator (RAC), and register <B>. Each PE (shown in Fig. 3) performs the computations of steps 2.5 and 2.6. Each PE is comprised of one multiplier core, one reduction module R1, one $(m+d)$-bit adder and a pair of $(m+d)$-bit latches. The multiplier core computes the partial product $A_{in}B_{in}$, where $B_{in}$ is an $m$-bit word, and $A_{in}$ is a $d$-bit digit. The R1 module performs the reduction $x^dB_{in} \bmod F(x)$. The R3 module performs the reduction $x^{kd}B_{in} \bmod F(x)$. The RAC module (shown in Fig. 2) consists of one reduction module R2, one $m$-bit adder and one $m$-bit register <C>. It performs the final reduction given by step 3.1 of Algorithm 2.

Figure 3. The detailed circuit of the processing element (PE). (Blocks shown: $m$-bit input $B_{in}$, reduction module R1, multiplier core, $(m+d)$-bit input $C_{in}$, $d$-bit input $A_{in}$, latches L, and outputs $B_{out}$ and $C_{out}$.)

In the proposed digit-serial systolic multiplier of Fig. 2, the register <B> is initialized with the element $B$, and the register <C> is initialized with zeros. It performs the LSD-first multiplication according to the proposed scheme based on (12) to compute the product $C = \overline{C}_0 + \overline{C}_1 + \cdots + \overline{C}_{p-1}$. In the first clock cycle, the content of register <B> is fed from the left as the input to the proposed multiplier for computing the partial result $\overline{C}_0$. Concurrently, the reduction module R3 performs $B_1 = x^{kd}B \bmod F(x)$ and stores the reduced operand in register <B>. In the next clock cycle, $B_1$ is used as input to the multiplier for computing the partial result $\overline{C}_1$, and the reduction module R3 is used for computing $B_2 = x^{kd}B_1 \bmod F(x)$ to store the result in the register <B>, and so on. Each of the partial results $\overline{C}_i$ passes through $k$ PEs followed by the RAC module. The result is stored in register <C> after $k+1$ clock cycles. The RAC module finally computes $C = C + \overline{C}_i \bmod F(x)$. Therefore, the proposed digit-serial systolic multiplier computes the multiplication $C = AB \bmod F(x)$ in $k + p$ clock cycles.

Theorem 1. The proposed digit-serial systolic multiplier (as seen in Fig. 2) is composed of $k$ PEs, one RAC, one register <B>, and one R3 module. The latency of the derived architecture is at most $2k$ clock cycles, where $d$ is the selected digit-size, and $k = \lceil\sqrt{m/d}\rceil$.
Proof: Assume that the digit-size is $d$ and the multiplication is decomposed into $q$-term computations, where $q = \lceil m/d \rceil$. Given the proposed digit-level systolic architecture in Fig. 2, suppose we have $k$ PEs and one RAC, where $q = kp$. Thus, the multiplication can also be segmented into $p$-term partial results, i.e., $C = \overline{C}_0 + \overline{C}_1 + \cdots + \overline{C}_{p-1}$. The sub-product $\overline{C}_i$ along with $A_i$ and $B_i$ are used as inputs to the PEs of the systolic array multiplier. Based on the feature of fully-pipelined systolic array architectures and as mentioned before, the complete multiplication requires $k + p$ clock cycles. For a given $q = kp$, if $k$ (the number of PEs) is smaller, then $p$ (the number of digits provided as input to the systolic array) becomes large, and consequently, the latency ($= k + p$) becomes large. To have low latency for finite field multiplications for large values of $m$, we need to minimize the latency $k + p$. Hence, we select $k = p = \lceil\sqrt{q}\rceil$. The proposed multiplier can then have the latency of $2\lceil\sqrt{m/d}\rceil$ clock cycles.
For clarity of the above discussion regarding the proposed digit-serial systolic multiplier, we use the following example to illustrate the PE operations in different clock cycles.
Example 1. Let $A = \sum_{i=0}^{26} a_i x^i$ and $B = \sum_{i=0}^{26} b_i x^i$ be two elements in $GF(2^{27})$ generated by the irreducible polynomial $F(x)$. Let us assume that the selected digit-size is $d = 3$. Then, we have $k = \lceil\sqrt{27/3}\rceil = 3$, and the element $A$ is decomposed as $A = \sum_{i=0}^{8} A_i x^{3i}$, where $A_i = \sum_{j=0}^{2} a_{3i+j}x^j$. Considering (8), the product $C$ can be represented as $C = \overline{C}_0 + \overline{C}_1 + \overline{C}_2 \bmod F(x)$, where $\overline{C}_i = \sum_{j=0}^{2} A_{3i+j}B_i x^{dj}$ for $i = 0, 1, 2$, with $B_0 = B$ and $B_i = x^9B_{i-1} \bmod F(x)$ for $i = 1, 2$. Table I lists each PE operation in every clock cycle. We note that for this case, the proposed digit-serial systolic multiplier requires 6 clock cycles.

IV. PROPOSED DIGIT-PARALLEL SYSTOLIC MULTIPLIER USING KARATSUBA-LIKE SCHEME
In this section, we use the Karatsuba-like function to realize the digit-parallel systolic multiplier.
A. Review of Karatsuba-like function
The divide-and-conquer algorithm for high-precision multiplication was introduced by Karatsuba and Ofman. As a modification of the Karatsuba function, the Karatsuba-like formulae were suggested by Montgomery [20]. Here we briefly review the Karatsuba-like function.
In the finite field $GF(2^m)$, each field element $A$ can be represented as $A = \sum_{i=0}^{m-1} a_i x^i$, where $a_i \in GF(2)$. The element $A$ can also be rewritten as
$$A = A_L^2 + A_H^2x \tag{13}$$
where
$$A_L = \sum_{j=0}^{\lceil m/2 \rceil - 1} a_{2j}x^j, \qquad A_H = \sum_{j=0}^{\lceil m/2 \rceil - 1} a_{2j+1}x^j$$
Therefore, for the two-term Karatsuba-like function, $A = A_L^2 + xA_H^2$ and $B = B_L^2 + xB_H^2$ are two polynomials of degree less than $m$, where $A_L$, $A_H$, $B_L$, $B_H$ are $\lceil m/2 \rceil$-bit polynomials. The product of $A$ and $B$ can be rewritten as
$$\begin{aligned} C = AB &= (A_L^2 + xA_H^2)(B_L^2 + xB_H^2) \\ &= A_L^2B_L^2 + x(A_LB_H + A_HB_L)^2 + A_H^2B_H^2x^2 \\ &= A_L^2B_L^2(1 + x) + (A_L + A_H)^2(B_L + B_H)^2x + A_H^2B_H^2(x^2 + x) \\ &= C_0^2(1 + x) + C_1^2(x^2 + x) + C_2^2x, \end{aligned} \tag{14}$$
Table I. Contents of the components in the digit-serial systolic multiplier for $GF(2^{27})$ in each clock cycle.

| Cycle | Register B | $PE_1$ | $PE_2$ | $PE_3$ | Register C |
|---|---|---|---|---|---|
| initial | $B_0 = B$ | | | | |
| 1 | $B_1 = x^9B_0 \bmod F(x)$ | $B^{11} = x^3B_0 \bmod F(x)$; $C^{11} = A_0B_0$ | | | |
| 2 | $B_2 = x^9B_1 \bmod F(x)$ | $B^{21} = x^3B_1 \bmod F(x)$; $C^{21} = A_3B_1$ | $B^{22} = x^3B^{11} \bmod F(x)$; $C^{22} = C^{11} + A_1B^{11}$ | | |
| 3 | | $B^{31} = x^3B_2 \bmod F(x)$; $C^{31} = A_6B_2$ | $B^{32} = x^3B^{21} \bmod F(x)$; $C^{32} = C^{21} + A_4B^{21}$ | $B^{33} = x^3B^{22} \bmod F(x)$; $C^{33} = C^{22} + A_2B^{22}$ | |
| 4 | | | $B^{42} = x^3B^{31} \bmod F(x)$; $C^{42} = C^{31} + A_7B^{31}$ | $B^{43} = x^3B^{32} \bmod F(x)$; $C^{43} = C^{32} + A_5B^{32}$ | $C_0 = C^{33} \bmod F(x)$ |
| 5 | | | | $B^{53} = x^3B^{42} \bmod F(x)$; $C^{53} = C^{42} + A_8B^{42}$ | $C_1 = C_0 + C^{43} \bmod F(x)$ |
| 6 | | | | | $C_2 = C_1 + C^{53} \bmod F(x)$ |

Note: "$X^{ij}$" denotes the output element "X" of $PE_j$ at the i-th clock cycle.
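The schedule of Table I corresponds to the two nested loops of Algorithm 2 (outer index over the $p$ groups, inner index over the $k$ PEs). A minimal software sketch of that algorithm is given below, assuming integers as GF(2) bit-vectors; the function names are hypothetical, and the degree-27 modulus $x^{27} + x^5 + x^2 + x + 1$ in the usage lines is an assumption made for illustration only (the equality check against the reference product does not depend on its irreducibility). The returned cycle count is simply $k + p$ as stated in the text, not a cycle-accurate simulation.

```python
from math import isqrt


def poly_mod(c: int, f: int) -> int:
    """Reduce the GF(2) polynomial c modulo f (bit i = coefficient of x^i)."""
    deg_f = f.bit_length() - 1
    while c.bit_length() - 1 >= deg_f:
        c ^= f << (c.bit_length() - 1 - deg_f)
    return c


def clmul(a: int, b: int) -> int:
    """Carry-less product of a and b in GF(2)[x], without reduction."""
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    return r


def proposed_digit_serial_mul(a: int, b: int, f: int, d: int):
    """Model of Algorithm 2. Returns (A*B mod F, k + p)."""
    m = f.bit_length() - 1
    q = -(-m // d)                       # q = ceil(m / d) digits
    k = isqrt(q - 1) + 1                 # k = ceil(sqrt(q)) processing elements
    p = -(-q // k)                       # A is zero-padded so that it has k*p digits
    mask = (1 << d) - 1
    c = 0
    for i in range(p):                   # step 2.1
        dd = b                           # step 2.2: D = B
        b = poly_mod(b << (k * d), f)    # step 2.3: B = x^(kd) B mod F  (module R3)
        for j in range(k):               # step 2.4: one pass per PE
            digit = (a >> ((i * k + j) * d)) & mask
            c ^= clmul(dd, digit)        # step 2.5: C = C + D * A_{ik+j}
            dd = poly_mod(dd << d, f)    # step 2.6: D = x^d D mod F     (module R1)
    return poly_mod(c, f), k + p         # step 3.1 and the k + p cycle count


# Usage with the parameters of Example 1 (m = 27, d = 3, hence k = p = 3).
F27 = (1 << 27) | 0x27                   # assumed modulus, for illustration only
product, cycles = proposed_digit_serial_mul(0x5A5A5A5, 0x3C3C3C3, F27, 3)
assert product == poly_mod(clmul(0x5A5A5A5, 0x3C3C3C3), F27)
assert cycles == 6
```

Choosing $k = \lceil\sqrt{q}\rceil$ in the sketch mirrors the selection made in the proof of Theorem 1; any other factor pair $(k, p)$ with $kp \ge q$ would give the same product with a different $k + p$.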
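In (14), the three subword products are
$$C_0 = A_LB_L, \qquad C_1 = A_HB_H, \qquad C_2 = (A_L + A_H)(B_L + B_H).$$
Similarly, based on the two-term splitting method, the two polynomials $A = A_L^2 + xA_H^2$ and $B = B_L^2 + xB_H^2$ can again be partitioned into four-term polynomials such as $A = A_{LL}^4 + xA_{HL}^4 + x^2A_{LH}^4 + x^3A_{HH}^4$ and $B = B_{LL}^4 + xB_{HL}^4 + x^2B_{LH}^4 + x^3B_{HH}^4$, where $A_L = A_{LL}^2 + xA_{LH}^2$, $A_H = A_{HL}^2 + xA_{HH}^2$, $B_L = B_{LL}^2 + xB_{LH}^2$, and $B_H = B_{HL}^2 + xB_{HH}^2$. The product of $A$ and $B$ using the four-term Karatsuba-like function can be obtained by applying (14) recursively, which gives the following nine subword products:
$$\begin{aligned} AB &= (A_{LL}^4 + xA_{HL}^4 + x^2A_{LH}^4 + x^3A_{HH}^4)(B_{LL}^4 + xB_{HL}^4 + x^2B_{LH}^4 + x^3B_{HH}^4) \\ &= A_{LL}^4B_{LL}^4(1 + x + x^2 + x^3) \\ &\quad + (A_{LL}^4 + A_{HL}^4)(B_{LL}^4 + B_{HL}^4)(x + x^3) + (A_{LL}^4 + A_{LH}^4)(B_{LL}^4 + B_{LH}^4)(x^2 + x^3) \\ &\quad + A_{HL}^4B_{HL}^4(x + x^2 + x^3 + x^4) + A_{LH}^4B_{LH}^4(x^2 + x^3 + x^4 + x^5) \\ &\quad + (A_{LL}^4 + A_{LH}^4 + A_{HL}^4 + A_{HH}^4)(B_{LL}^4 + B_{LH}^4 + B_{HL}^4 + B_{HH}^4)x^3 \\ &\quad + (A_{HL}^4 + A_{HH}^4)(B_{HL}^4 + B_{HH}^4)(x^3 + x^4) + (A_{LH}^4 + A_{HH}^4)(B_{LH}^4 + B_{HH}^4)(x^3 + x^5) \\ &\quad + A_{HH}^4B_{HH}^4(x^3 + x^4 + x^5 + x^6). \end{aligned} \tag{15}$$
B. Digit-parallel systolic multiplier
In this section we use the two-term Karatsuba-like function of (14) to develop a digit-parallel systolic multiplier. From the structure of (14), we can find that the product $C = AB \bmod F(x)$ includes three partial product-squarings, namely $(A_LB_L)^2$, $(A_HB_H)^2$ and $(A_L + A_H)^2(B_L + B_H)^2$. To simplify the subword representation, let us define
$$\overline{A}_i = \sum_{j=0}^{\lceil m/2 \rceil - 1} \overline{a}_{i,j}x^j = \begin{cases} A_L & \text{for } i = 0 \\ A_H & \text{for } i = 1 \\ A_L + A_H & \text{for } i = 2 \end{cases} \qquad \overline{B}_i = \sum_{j=0}^{\lceil m/2 \rceil - 1} \overline{b}_{i,j}x^j = \begin{cases} B_L & \text{for } i = 0 \\ B_H & \text{for } i = 1 \\ B_L + B_H & \text{for } i = 2 \end{cases}$$
where
$$\overline{a}_{i,j} = \begin{cases} a_{2j+i} & \text{for } i = 0, 1 \\ a_{2j} + a_{2j+1} & \text{for } i = 2 \end{cases} \qquad \overline{b}_{i,j} = \begin{cases} b_{2j+i} & \text{for } i = 0, 1 \\ b_{2j} + b_{2j+1} & \text{for } i = 2 \end{cases}$$
Thus, the product $C$ is represented as:
$$C = (\overline{A}_0\overline{B}_0)^2(1 + x) + (\overline{A}_1\overline{B}_1)^2(x^2 + x) + (\overline{A}_2\overline{B}_2)^2x \bmod F(x) \tag{16}$$
Assume that $d$ is the selected digit size; each subword $\overline{A}_i$ can be rewritten as $\overline{A}_i = \sum_{j=0}^{\lceil m/2d \rceil - 1} A_{i,j}x^{jd}$, where $A_{i,j} = \sum_{l=0}^{d-1} \overline{a}_{i,jd+l}x^l$. According to Theorem 1, assuming $k$ to be an integer that satisfies $k = \lceil\sqrt{m/2d}\rceil$, each partial product $\overline{A}_i\overline{B}_i$ can be represented by
$$\begin{aligned} \overline{A}_i\overline{B}_i &= \sum_{j=0}^{\lceil m/2d \rceil - 1} A_{i,j}\overline{B}_ix^{jd} \bmod F(x) \\ &= \sum_{j=0}^{k-1} \overline{C}_{i,j}x^{kdj} \bmod F(x) \\ &= (\cdots((\overline{C}_{i,k-1}x^{dk} + \overline{C}_{i,k-2})x^{dk} + \cdots)x^{dk} + \overline{C}_{i,0} \bmod F(x) \end{aligned} \tag{17}$$
where
$$\overline{C}_{i,j} = \sum_{l=0}^{k-1} A_{i,jk+l}\overline{B}_ix^{ld} = (\cdots((A_{i,jk+k-1}\overline{B}_ix^d + A_{i,jk+k-2}\overline{B}_i)x^d + \cdots)x^d + A_{i,jk}\overline{B}_i. \tag{18}$$
Based on (16), (17) and (18), we can derive the digit-parallel multiplication scheme as stated in Algorithm 3.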
Fig. 4 shows the proposed digit-parallel systolic array architecture using the two-term Karatsuba-like function. It consists of three main parts, namely the pre-processing unit, the subword product computation (SPC) unit and the post-processing unit. In the pre-processing unit, we use a $GF(2^m)$ adder to realize the two addition operations $A_{2,j} = A_{0,j} + A_{1,j}$ and $\overline{B}_2 = \overline{B}_0 + \overline{B}_1$ corresponding to steps 4 and 5, respectively. The SPC unit consists of three partial product computation arrays (PCAs) to compute ($\overline{C}_0 = \overline{B}_0A_{0,j}$, $\overline{C}_1 = \overline{B}_1A_{1,j}$, $\overline{C}_2 = \overline{B}_2A_{2,j}$). Each PCA is a digit-serial systolic array with $k$ modified processing elements (PEs), where $k = \lceil\sqrt{m/2d}\rceil$. It performs partial product computation according to (18). Fig. 5 shows the detailed circuit of the PE. The post-processing unit consists of three accumulation (AC) modules and one final polynomial reduction (FPR) module. Each AC module performs accumulation of partial products. The FPR module performs step 13 to obtain the final result, as shown in Fig. 6.

Algorithm 3 Digit-parallel multiplication algorithm based on two-term Karatsuba-like function
Inputs: $A = A_L^2 + A_H^2x$ and $B = B_L^2 + B_H^2x$ are two elements in $GF(2^m)$.
Output: $C = AB \bmod F(x)$.
1. $C_0 = 0$, $C_1 = 0$, $C_2 = 0$.
2. $\overline{A}_0 = A_L$, $\overline{A}_1 = A_H$, $\overline{B}_0 = B_L$ and $\overline{B}_1 = B_H$.
3. for $j = k-1$ to $0$
   /* initialization step */
4.   $A_{2,j} = A_{0,j} + A_{1,j}$, where $A_{i,j} = (\overline{a}_{i,dkj}, \overline{a}_{i,dkj+1}, \cdots, \overline{a}_{i,dkj+dk-1})$.
5.   $\overline{B}_2 = \overline{B}_0 + \overline{B}_1$.
   /* subword product computation step */
6.   $\overline{C}_0 = \overline{B}_0A_{0,j}$.
7.   $\overline{C}_1 = \overline{B}_1A_{1,j}$.
8.   $\overline{C}_2 = \overline{B}_2A_{2,j}$.
9.   $C_0 = C_0x^{kd} + \overline{C}_0$.
10.  $C_1 = C_1x^{kd} + \overline{C}_1$.
11.  $C_2 = C_2x^{kd} + \overline{C}_2$.
12. endfor
   /* final polynomial reduction step */
13. $C = C_0^2(1 + x) + C_1^2(x^2 + x) + C_2^2x \bmod F(x)$.

Figure 4. The proposed digit-parallel systolic multiplier architecture based on two-term Karatsuba-like function for k = 4. (Blocks shown: pre-processing unit with $GF(2^m)$ adders producing $A_{2,i}$ and $\overline{B}_2$; subword product computation unit with digit-serial systolic arrays [0], [1] and [2], each built from $PE_{i,1}$ to $PE_{i,4}$; post-processing unit with three AC modules and the FPR module producing $C$.)

Theorem 2. For the finite field $GF(2^m)$ constructed from irreducible polynomials, the latency of the proposed digit-parallel systolic multiplier using the two-term Karatsuba-like function is $(2\lceil\sqrt{m/2d}\rceil + 2)$ clock cycles.
Proof: Let $k$ be a positive integer satisfying $k = \lceil\sqrt{m/2d}\rceil$, where $d$ is the selected digit-size. In the proposed digit-parallel systolic multiplier architecture of Fig. 4, the three main parts (pre-processing unit, subword product computation unit, and post-processing unit) require 1, $k$, and 2 clock cycles, respectively. Thus, computing each sub-product $\overline{C}_{i,j}$ and storing it in the AC module requires $k + 2$ clock cycles. According to Algorithm 3, the main computation requires $k$ iterations of the for loop of multiplications, which demands $2k + 1$ clock cycles. Finally, the final reduction in step 13 performs the summation of the three partial results followed by reduction (i.e., $C = C_0^2(1 + x) + C_1^2(x^2 + x) + C_2^2x \bmod F(x)$) to obtain the product word. Thus, one complete multiplication requires $2k + 2$ clock cycles.
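The data flow of Algorithm 3 can be modelled in a few lines of software. The sketch below is an illustration under stated assumptions rather than a description of the hardware: polynomials are Python integers used as GF(2) bit-vectors, all function names are hypothetical, the three subword arrays are evaluated one after another instead of in parallel, and the returned latency is the $2k + 2$ figure of Theorem 2 rather than a cycle-accurate count.

```python
from math import isqrt


def poly_mod(c: int, f: int) -> int:
    """Reduce the GF(2) polynomial c modulo f (bit i = coefficient of x^i)."""
    deg_f = f.bit_length() - 1
    while c.bit_length() - 1 >= deg_f:
        c ^= f << (c.bit_length() - 1 - deg_f)
    return c


def clmul(a: int, b: int) -> int:
    """Carry-less product of a and b in GF(2)[x], without reduction."""
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    return r


def split_even_odd(a: int, m: int):
    """Return (A_L, A_H) with A = A_L^2 + x * A_H^2, as in (13)."""
    a_l = a_h = 0
    for j in range((m + 1) // 2):
        a_l |= ((a >> (2 * j)) & 1) << j        # even-indexed coefficients
        a_h |= ((a >> (2 * j + 1)) & 1) << j    # odd-indexed coefficients
    return a_l, a_h


def digit_parallel_mul(a: int, b: int, f: int, d: int):
    """Model of Algorithm 3. Returns (A*B mod F, 2k + 2)."""
    m = f.bit_length() - 1
    digits = -(-((m + 1) // 2) // d)            # d-bit digits per subword
    k = isqrt(digits - 1) + 1                   # k = ceil(sqrt(m / (2d)))
    a_l, a_h = split_even_odd(a, m)
    b_l, b_h = split_even_odd(b, m)
    a_sub = (a_l, a_h, a_l ^ a_h)               # pre-processing: A_0, A_1, A_2
    b_sub = (b_l, b_h, b_l ^ b_h)               # pre-processing: B_0, B_1, B_2
    block = (1 << (k * d)) - 1
    c = [0, 0, 0]
    for j in range(k - 1, -1, -1):              # step 3
        for i in range(3):                      # the three PCAs
            a_ij = (a_sub[i] >> (j * k * d)) & block
            c[i] = (c[i] << (k * d)) ^ clmul(b_sub[i], a_ij)   # steps 6 to 11
    sq = [clmul(x, x) for x in c]               # squarings of C_0, C_1, C_2
    total = clmul(sq[0], 0b11) ^ clmul(sq[1], 0b110) ^ (sq[2] << 1)  # step 13
    return poly_mod(total, f), 2 * k + 2


# Usage: {57} * {83} = {c1} in GF(2^8) with the AES polynomial, d = 2 (k = 2).
assert digit_parallel_mul(0x57, 0x83, 0x11B, 2) == (0xC1, 6)
```

Since the three subword products only meet in step 13, they are independent of each other, which is what allows the three systolic arrays of Fig. 4 to run concurrently.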
As mentioned above, in the proposed digit-parallel systolic multiplier (Fig. 4), the subword product computation unit consists of three digit-serial systolic arrays [i] for computing $\overline{A}_i\overline{B}_i$ for $i = 0$, 1 and 2. Observing the structure of Fig. 4, each PCA calculates the subword product of two $(m/2)$-bit polynomials. Since the three PCAs in Fig. 4 operate fully in parallel, the latency of the proposed multiplier is at most $(2\lceil\sqrt{m/2}\rceil + 2)$ clock cycles if the selected digit-size is $d = 1$.
In this regard, we employ recursions of the two-term Karatsuba-like function to derive the proposed digit-parallel systolic architecture. The proposed multiplier has the following properties.
Theorem 3. Assume that we use the n-term Karatsuba-like function to construct the digit-parallel systolic multiplier, where $n = 2^i$. Then, the subword product computation unit is required to have the following configuration.
• Based on the two-way splitting Toeplitz matrix-vector product scheme [29], if $n = 2^i$, the subword product computation unit comprises $n^{\log_2 3}$ digit-serial systolic arrays.
• Each digit-serial systolic array consists of $\lceil\sqrt{m/(nd)}\rceil$ PEs.
• In the structure of the PE in Fig. 5, the multiplier core performs the multiplication of $B_{i,in}$ and $A_{i,in}$, where $B_{i,in}$ and $A_{i,in}$ are $\lceil m/n \rceil$-bit and $d$-bit polynomials, respectively.
• The output result of each digit-serial systolic array for each loop computation requires a word size of $\lceil m/n \rceil + d\lceil\sqrt{m/(nd)}\rceil$ bits.
Theorem 4. The latency of the proposed digit-parallel systolic multiplier with the n-term Karatsuba-like function is $(2\lceil\sqrt{m/(nd)}\rceil + 2)$ clock cycles.
According to Theorem 3, each digit-serial systolic array calculates the product of two $\lceil m/n \rceil$-bit subword polynomials. For example, with $m = 409$ and $n = 4$, we use the four-term Karatsuba-like function in (15) to build the digit-parallel multiplier. We require 9 digit-serial systolic arrays for computing $A_0B_0$, $A_{01}B_{01}$, $A_{02}B_{02}$, $A_1B_1$, $A_2B_2$, $A_{0123}B_{0123}$, $A_{13}B_{13}$, $A_{23}B_{23}$, $A_3B_3$, respectively, to construct the subword product computation unit. Each of the subwords $A_i$ and $B_j$ is a $\lceil 409/4 \rceil = 103$-bit polynomial. Assuming that the selected digit-size is four bits, we use the previously proposed digit-serial systolic multiplier architecture (shown in Fig. 2) to construct each digit-serial systolic array with $\lceil\sqrt{103/4}\rceil = 6$ PEs. Therefore, based on Theorem 4, the proposed digit-parallel systolic multiplier using the four-term Karatsuba-like function can have the latency of $2\lceil\sqrt{m/(nd)}\rceil + 2 = 14$ clock cycles.

Figure 5. The modified processing element (PE). (Blocks shown: $(m/2)$-bit input $B_{i,in}$, $d$-bit input $A_{i,in}$, multiplier core with $(m/2 + d)$-bit output, $x^d$ shift, adder, latches L, $(m/2 + kd)$-bit input $C_{i,in}$, and outputs $B_{i,out}$ and $C_{i,out}$.)

Figure 6. The final polynomial reduction module (FPR). (Blocks shown: $(\cdot)^2(x + 1) \bmod F(x)$ on $C_{0,in}$, $(\cdot)^2(x + x^2) \bmod F(x)$ on $C_{1,in}$, $(\cdot)^2(x) \bmod F(x)$ on $C_{2,in}$, and the summed output $C_{out}$.)

V. TIME AND SPACE COMPLEXITIES
A. Complexities of digit-serial systolic multiplier
Let us consider the following properties to analyze the time and space complexities of the proposed digit-serial systolic multiplier.
Remark 1. Let $F(x)$ be an irreducible trinomial of the form $F(x) = 1 + x^l + x^m$. The computation of $x^dB \bmod F(x)$ then requires $d$ XOR gates and involves one XOR gate delay.
Remark 2. Let $F(x)$ be an irreducible pentanomial of the form $F(x) = 1 + x^{l_1} + x^{l_2} + x^{l_3} + x^m$ with $l_1 \approx m/4$, $l_2 - l_1 \approx m/4$, and $l_3 - l_2 \approx m/4$. It is shown in [18] that such a pentanomial exists in $GF(2^m)$ for $m > 9$. This polynomial is called an almost equally spaced pentanomial (AESP). In this case, the computation of $x^dB \bmod F(x)$ requires $3d$ XOR gates and involves one XOR gate delay.
Remark 3 (multiplier core). Let $A$, $B$, $C$ be represented by polynomials of $d$, $m$, and $m + d$ bits, respectively. Then, the computation of $BA + C$ by the traditional grade-school technique requires $dm$ XOR and $dm$ AND gates, and involves $T_A + (\log_2(d + 1))T_X$ gate-delay time.
Remark 4 (Final polynomial reduction). Let $C$ be represented by an $(m + d)$-bit polynomial. Then, computing $C \bmod F(x)$ has the following time and space complexities.
• If $F(x)$ is an irreducible trinomial, $C \bmod F(x)$ requires $2d$ XOR gates and involves one XOR gate-delay.
• If $F(x)$ is an irreducible AESP, $C \bmod F(x)$ requires $6d$ XOR gates and involves one XOR gate-delay.
As shown in the structure of Fig. 2, the proposed multiplier consists of $k$ PEs, one $m$-bit register <B>, one reduction module R3, and one RAC module. Each PE consists of one multiplier core, one reduction module R1, one $m$-bit $GF(2^m)$ adder, and a $(2m + d)$-bit register. The RAC module is comprised of one reduction module R2, $m$ XOR gates, and an $m$-bit register <C>, where the reduction module R2 performs the final reduction operation of step 3.1 of Algorithm 2. Based on Remarks 1 and 4, we have estimated the space complexity of our proposed architecture for trinomials and AESPs, and listed the results in Tables II and III. From these tables, we can find that the AESP-based digit-serial multiplier (Fig. 2) requires $(4\sqrt{md} + 2d)$ more XOR gates than the trinomial-based digit-serial multiplier. The multiplier has a latency of $2\lceil\sqrt{m/d}\rceil$ clock cycles if $d$ is the selected digit-size. In [22], Meher has proposed the 2-D super systolic multiplier for trinomials, having $2\lceil\sqrt{m}\rceil$ clock cycles of latency. When the selected digit-size is one bit, the latency of our proposed multiplier in Fig. 2 is the same as that of Meher's multiplier. In this case, the R1 module in Fig. 3 can be reduced to one XOR gate, and the duration of the clock cycle to $T_A + T_X + T_L$, where $T_A$, $T_X$ and $T_L$ denote the propagation delays of a 2-input AND gate, a 2-input XOR gate and a 1-bit latch, respectively. Therefore, the latency of the proposed digit-serial multiplier ranges from $2\lceil\sqrt{m}\rceil$ down to $2\lceil\sqrt{m/d}\rceil$ clock cycles, depending on the selected digit-size $d$.
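The latency expressions quoted in the text ($\lceil m/d \rceil + 1$ for the LSD-first multiplier of Fig. 1, $2\lceil\sqrt{m/d}\rceil$ from Theorem 1, and $2\lceil\sqrt{m/(nd)}\rceil + 2$ from Theorem 4) are easy to tabulate. The following sketch evaluates them with exact integer arithmetic; the function names are hypothetical and introduced only for this illustration.

```python
from math import isqrt


def ceil_div(a: int, b: int) -> int:
    """ceil(a / b) for positive integers."""
    return -(-a // b)


def ceil_sqrt_ratio(num: int, den: int) -> int:
    """Smallest integer k with k*k >= num/den (i.e. ceil(sqrt(num/den)))."""
    q = ceil_div(num, den)        # k*k is an integer, so k*k >= num/den iff k*k >= q
    return isqrt(q - 1) + 1 if q > 0 else 0


def latency_lsd_first(m: int, d: int) -> int:
    """Clock cycles of the LSD-first multiplier of Fig. 1: ceil(m/d) + 1."""
    return ceil_div(m, d) + 1


def latency_digit_serial(m: int, d: int) -> int:
    """Theorem 1: 2 * ceil(sqrt(m/d)) clock cycles."""
    return 2 * ceil_sqrt_ratio(m, d)


def latency_digit_parallel(m: int, d: int, n: int) -> int:
    """Theorem 4: 2 * ceil(sqrt(m/(n*d))) + 2 clock cycles."""
    return 2 * ceil_sqrt_ratio(m, n * d) + 2


# Usage with the figures quoted in the text for GF(2^409) and d = 4.
assert latency_digit_serial(409, 4) == 22
assert latency_digit_parallel(409, 4, 4) == 14
print(latency_lsd_first(409, 4), latency_digit_serial(409, 4),
      latency_digit_parallel(409, 4, 2), latency_digit_parallel(409, 4, 4))
```

Integer square roots are used instead of floating-point `sqrt` so that the ceilings stay exact for large field sizes.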
7
of our proposed system is much lower than the existing
multipliers. Amongst all the existing digit-serial multipliers,
the non-systolic multiplier of [14] has the minimum timecomplexity. But as shown in Fig .7, the proposed multiplier
of Fig.2 involves nearly 6 to 27 times less time-complexity
compared with those of [14] as digit-size increases from 2
to 32. The time-complexity of proposed digit-parallel systolic
multiplier using 8-term KA is 1.6 to 2.6 times less than the
proposed digit-serial systolic architecture of Fig.2. It is found
that proposed multiplier using KA involves the lowest timecomplexity amongst the digit-wise systolic multipliers [14],
[16], [17]. As shown in Fig. 8, our proposed architectures have
higher area-complexity compared to the existing digit-serial
multipliers, but as shown in Fig. 9, proposed architectures
involve less area-delay product (ADP) than other digit-serial
multipliers [14], [16], [17].
For clarity of comparisons, in Table IV, we have listed area,
normalized power consumption per GHz and energy per output
bit (EOB) of the proposed and the corresponding existing
digit-serial multipliers for digit-size d = 8. The estimation
of EOB is explained in the following.
• The multiplier in a given clock period computes “L” bits
of the product word, which could be considered as bitthroughput (BT).
• The multiplier in a given clock period consumes “E”
amount of energy.
We can compute the energy consumed per cycle as E= (power
consumption) × (clock period). Then the EOB is defined as
B. Complexities of digit-parallel systolic multiplier
Remark 5. By using the n-term Karatsuba algorithm, in the
pre-processing unit, the $GF(2^m)$ adder requires
$(n^{\log_2 3} - n)\left(\frac{m}{n} + \sqrt{\frac{md}{n}}\right)$ XOR gates and $\log_2 n$ XOR gate delays.
The digit-parallel systolic multiplier of Fig.4 is comprised of
three main parts. For simplicity of discussion, let us consider
the two-term KA to estimate the time- and space-complexities
of the proposed digit-parallel systolic multiplier. According to
Remark 5, the pre-processing unit involves space-complexity
of $\left(\frac{m}{2} + \sqrt{\frac{md}{2}}\right)$ XOR gates and $m$ 1-bit latches, and
requires $T_X + T_L$ gate-delay to complete the computation. The
subword product computation unit is constructed from
$3\sqrt{\frac{m}{2d}}$ PEs. Each PE involves space complexity
of $d\,\frac{m}{2}$ AND gates, $d\,\frac{m}{2}$ XOR gates, and $m + \sqrt{\frac{md}{2}}$ latches,
and requires $T_A + (\log_2(d + 1))T_X + T_L$ gate-delay for
completing its computation. The post-processing unit consists
of an AC module and an FPR module. The AC module consists
of $3\left(\frac{m}{2} + \sqrt{\frac{md}{2}}\right)$ XOR gates and $3m$ latches. Assuming
that the field is constructed from an irreducible trinomial,
the FPR module in Fig. 6 has space-complexity of
$\left(7m - 3\,\frac{m-n}{2} + 5\right)$ XOR gates to perform the computation of (16)
(here $n$ is the middle exponent of the trinomial $x^m + x^n + 1$).
The critical-path of the proposed architecture for trinomials is
$T_A + (\log_2(d + 1))T_X + T_L$. Based on the above discussion, we
have calculated the time- and space-complexities of the digit-parallel systolic multiplier for trinomials and listed them in Tables
II and III. Similarly, we can estimate the complexity of the
proposed architecture based on the n-term KA.
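As an illustration of how the component counts above combine, the following sketch evaluates the pre-processing XOR count of Remark 5 and the total number of AND gates in the PEs for the two-term KA. The function names are hypothetical, and the values $m = 512$, $d = 4$ are chosen only so that the square roots are whole numbers; they are not a field used in the paper.

```python
import math


def preprocessing_xor(m: float, d: float, n: int) -> float:
    """XOR gates of the pre-processing adder for n-term KA (Remark 5)."""
    return (n ** math.log2(3) - n) * (m / n + math.sqrt(m * d / n))


def two_term_pe_count(m: float, d: float) -> float:
    """Number of PEs in the subword product computation unit (two-term KA)."""
    return 3 * math.sqrt(m / (2 * d))


def two_term_and_total(m: float, d: float) -> float:
    """Total AND gates: PE count times d * m / 2 AND gates per PE."""
    return two_term_pe_count(m, d) * (d * m / 2)


def two_term_ac_xor(m: float, d: float) -> float:
    """XOR gates of the AC module for two-term KA."""
    return 3 * (m / 2 + math.sqrt(m * d / 2))


if __name__ == "__main__":
    m, d = 512, 4  # illustrative values only (assumption)
    print("pre-processing XOR:", preprocessing_xor(m, d, 2))
    print("PEs:", two_term_pe_count(m, d))
    print("AND gates in all PEs:", two_term_and_total(m, d))
    print("closed form 1.5*m*sqrt(m*d/2):", 1.5 * m * math.sqrt(m * d / 2))
    print("AC module XOR:", two_term_ac_xor(m, d))
```

Multiplying the PE count by the AND gates per PE gives $1.5m\sqrt{md/2}$, which is the form of the #AND entry for the two-term KA in Table II.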
$$\mathrm{EOB} = \frac{(\text{power consumption}) \times (\text{clock period})}{\text{the number of output bits produced per cycle}} = \frac{E}{L}. \qquad (19)$$
The clock period mentioned in expression (19) is the clock
period used for estimating the power consumption. As shown
in Table IV, the structure of [14] has the lowest critical path
among the existing digit-serial designs, yet its latency is 6.2 times that of
the proposed digit-serial structure of Fig.2, and 7.7 times and
11 times those of the proposed digit-parallel structures of Fig.4 for 2-term and 4-term KAs, respectively. The proposed digit-parallel
systolic multiplier using four-term KA can save about 55.2%
ADP and 56.04% EOB over the best of the existing digit-serial multipliers [14], [16], [17]. Moreover, the digit-parallel
systolic multiplier using four-term KA can save about 36.84%
EOB over the digit-parallel systolic multiplier using two-term
KA. In Table IV, it is shown that the latencies of our proposed
architectures are lower than those of the existing multipliers.
For digit-size d = 8, our proposed architectures can have
BT > 94, while the existing multipliers have BT ≤ 8.
Therefore, the proposed digit-parallel systolic multipliers using
KA with different numbers of terms could be used to have the
desired trade-off among speed, ADP/EOB, and BT of digit-wise multipliers for large fields.
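The EOB definition in (19) can be checked against the power and BT entries of Table IV. The sketch below is a minimal illustration with hypothetical function names. It assumes that a normalized power of 1 µW/GHz corresponds to an energy of 1 fJ (0.001 pJ) per clock cycle, which is an interpretation of the table's units rather than a statement from the paper.

```python
# (power in uW/GHz, BT in bits per cycle) as listed in Table IV for d = 8
TABLE_IV = {
    "Fig.2": (629700.9, 94.08),
    "Fig.4 (2-term KA)": (801322.9, 122.3),
    "Fig.4 (4-term KA)": (722910.9, 174.71),
    "Kumar [14]": (74756.56, 7.94),
}


def energy_per_cycle_pj(power_uw_per_ghz: float) -> float:
    """Assumed conversion: 1 uW/GHz = 1e-6 W / 1e9 Hz = 1e-15 J = 1e-3 pJ per cycle."""
    return power_uw_per_ghz * 1e-3


def eob_pj(power_uw_per_ghz: float, bits_per_cycle: float) -> float:
    """Energy per output bit, EOB = E / L, in pJ."""
    return energy_per_cycle_pj(power_uw_per_ghz) / bits_per_cycle


def saving_percent(new: float, reference: float) -> float:
    """Relative saving of `new` over `reference`, in percent."""
    return 100.0 * (1.0 - new / reference)


if __name__ == "__main__":
    for name, (power, bt) in TABLE_IV.items():
        print(f"{name}: EOB = {eob_pj(power, bt):.3f} pJ")
    # Savings computed from the tabulated EOB values
    print(f"4-term KA vs [14]: {saving_percent(4.138, 9.413):.2f}%")
    print(f"4-term KA vs 2-term KA: {saving_percent(4.138, 6.552):.2f}%")
```

Under this unit assumption the computed EOB values agree with the tabulated ones to within rounding, and the two savings reproduce the 56.04% and 36.84% figures quoted in the text.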
C. Comparisons
Table II lists the hardware components used by our proposed
multipliers and the existing digit-serial multipliers [14], [16],
[17]. The latency and critical-path of the proposed multipliers are
compared with those of the existing multipliers in Table III. From this table
we can find that the proposed digit-serial and digit-parallel
systolic multipliers have latencies of $2\sqrt{m/d}$ and $(2\sqrt{m/(dn)} + 2)$
clock cycles, respectively, while traditional digit-serial non-systolic and systolic multipliers involve latencies of $\lceil m/d \rceil + 1$
and $2\lceil m/d \rceil$ clock cycles, respectively. We note that, as shown
in this table, the proposed AESP-based multiplier also has the same
latency as the proposed trinomial-based multiplier. In Table II,
we have listed the bit-throughput (BT) as a measure of speed
performance.
The BT for our proposed architectures is more
than $\sqrt{dm}$, which depends on the selected digit-size $d$.
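To make the latency expressions concrete, the following sketch evaluates them for the field degree $m = 1223$ and digit-size $d = 8$ used in the synthesis results. The function names are hypothetical, and the proposed latencies are left as unrounded values of the square-root expressions, since any rounding convention would be an assumption here.

```python
import math


def latency_nonsystolic(m: int, d: int) -> int:
    """Traditional digit-serial non-systolic latency: ceil(m/d) + 1 cycles."""
    return math.ceil(m / d) + 1


def latency_systolic(m: int, d: int) -> int:
    """Traditional digit-serial systolic latency: 2 * ceil(m/d) cycles."""
    return 2 * math.ceil(m / d)


def latency_proposed_serial(m: int, d: int) -> float:
    """Proposed digit-serial expression 2 * sqrt(m/d), unrounded."""
    return 2 * math.sqrt(m / d)


def latency_proposed_parallel(m: int, d: int, n: int) -> float:
    """Proposed digit-parallel expression 2 * sqrt(m/(d*n)) + 2, unrounded."""
    return 2 * math.sqrt(m / (d * n)) + 2


if __name__ == "__main__":
    m, d = 1223, 8
    print("non-systolic:", latency_nonsystolic(m, d))
    print("systolic:", latency_systolic(m, d))
    print("proposed digit-serial:", round(latency_proposed_serial(m, d), 2))
    for n in (2, 4):
        print(f"proposed digit-parallel, n={n}:",
              round(latency_proposed_parallel(m, d, n), 2))
```

The two traditional expressions evaluate to 154 and 306 cycles, the latencies listed in Table IV for [14] and [17], while the proposed expressions grow only with the square root of $m/d$.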
The applications of Tate and Weil pairing algorithms involve
additions and multiplications over very large finite fields. Therefore, we select the field $GF(2^{1223})$ constructed by the trinomial $x^{1223} + x^{155} + 1$ to estimate critical-path, area complexity,
and area-delay product for various digit-serial multipliers.
We have used the NanGate’s Library Creator and the 45-nm FreePDK Base Kit from North Carolina State University
(NCSU) [26] to synthesize the proposed and the corresponding
existing digit-serial multipliers and obtained time and area
complexities. From Fig. 7, it is shown that the computation
time of our proposed architectures is significantly lower than
that of the existing multipliers. Therefore, time-complexity
VI. CONCLUSIONS
We have presented two novel low-latency digit-serial and
digit-parallel systolic multipliers over $GF(2^m)$ of large orders. The proposed digit-serial architecture for trinomials and
AESPs has latency of $2\sqrt{m/d}$ clock cycles, which is much
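Table II. Comparison of space complexities of multipliers

| Multipliers | #AND | #XOR | #MUX | #Latch | BT (bits per cycle) |
|---|---|---|---|---|---|
| Fig.2 for $x^m + x^n + 1$ | $m\sqrt{md}$ | $\sqrt{md}\,(2 + m) + d$ | − | $\sqrt{m/d}\,(2m + d - 1) + 2m$ | $\sqrt{dm}$ |
| Fig.2 for AESPs | $m\sqrt{md}$ | $\sqrt{md}\,(6 + m) + 3d$ | − | $\sqrt{m/d}\,(2m + d - 1) + 2m$ | $\sqrt{dm}$ |
| Fig.4 for $x^m + x^n + 1$ (two-term KA) | $1.5m\sqrt{md/2}$ | $8m + (1.5m + 3)\sqrt{md/2} + \frac{n}{2} + 5$ | − | $4.5m + 1.5m\sqrt{m/(2d)}$ | $\sqrt{2dm}$ |
| Fig.4 for $x^m + x^n + 1$ (four-term KA) | $\frac{9m}{8}\sqrt{md}$ | $\frac{29m}{4} + \left(\frac{9m}{8} + 4.5\right)\sqrt{md} + n$ | − | $\frac{31m}{4} + 2\sqrt{md} + \frac{9m}{4}\sqrt{m/(4d)}$ | $\sqrt{4dm}$ |
| Kumar et al. [14] | $(m + k + 1)d + (k + 1)(d - 1)$ | $(m + k)d + (k + 1)(d - 1)$ | − | $2m + d + k$ | $\frac{m}{\lceil m/d \rceil + 1}$ |
| Talapatra et al. [17] | $md$ | $md + 2d$ | $2m$ | $4m + 3d + 1$ | $d$ |
| Kim et al. [16] | $2md + m$ | $2md$ | $2m$ | $\frac{m}{d}\left(10d + 1 + \frac{9sd}{2} + s\right)$ | $d$ |

Note: $s + 1$ is the number of pipelined stages per basic cell, $d$ is the selected digit-size, and $k$ is the degree of the second-highest term of the irreducible polynomial.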
Table IV. Comparison of various digit-serial multipliers over $GF(2^{1223})$ in terms of latency, ADP, critical path delay
$T_{CPD}$ (ns), area (µm²), power (µW/GHz), EOB (pJ), and BT (bits per cycle) for digit-size d = 8.

| Multipliers | Latency (cycles) | Area (µm²) | $T_{CPD}$ (ns) | ADP (µm²·ns) | Power (µW/GHz) | BT (bits per cycle) | EOB (pJ) |
|---|---|---|---|---|---|---|---|
| Fig.2 | 25 | 383,282.6 | 0.21 | 2,012,233.6 | 629,700.9 | 94.08 | 6.693 |
| Fig.4 (2-term KA) | 18 | 456,965.1 | 0.21 | 1,727,328.1 | 801,322.9 | 122.3 | 6.552 |
| Fig.4 (4-term KA) | 13 | 433,796.5 | 0.21 | 1,184,264.5 | 722,910.9 | 174.71 | 4.138 |
| Kumar multiplier [14] | 154 | 44,034.97 | 0.21 | 1,424,090.9 | 74,756.56 | 7.94 | 9.413 |
| Kim multiplier [16] | 459 | 165,145.8 | 0.47 | 35,626,910.7 | 216,488 | 8 | 27.061 |
| Talapatra multiplier [17] | 306 | 52,840.1 | 0.25 | 4,042,267.8 | 81,174.9 | 8 | 20.294 |
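The ADP entries of Table IV appear to be consistent with the product of area, latency and critical path delay. This reading is inferred from the tabulated numbers and is not stated explicitly in the text; the sketch below, with a hypothetical function name, simply recomputes the product for each row so the relation can be checked.

```python
# name: (latency in cycles, area in um^2, T_CPD in ns, tabulated ADP in um^2*ns)
TABLE_IV_ADP = {
    "Fig.2": (25, 383282.6, 0.21, 2012233.6),
    "Fig.4 (2-term KA)": (18, 456965.1, 0.21, 1727328.1),
    "Fig.4 (4-term KA)": (13, 433796.5, 0.21, 1184264.5),
    "Kumar [14]": (154, 44034.97, 0.21, 1424090.9),
    "Kim [16]": (459, 165145.8, 0.47, 35626910.7),
    "Talapatra [17]": (306, 52840.1, 0.25, 4042267.8),
}


def area_delay_product(latency: int, area: float, t_cpd: float) -> float:
    """Assumed relation: ADP = area * (latency * critical path delay)."""
    return area * latency * t_cpd


if __name__ == "__main__":
    for name, (latency, area, t_cpd, tabulated) in TABLE_IV_ADP.items():
        computed = area_delay_product(latency, area, t_cpd)
        rel_err = abs(computed - tabulated) / tabulated
        print(f"{name}: computed {computed:.1f}, tabulated {tabulated:.1f}, "
              f"relative difference {rel_err:.1e}")
```

Each recomputed product agrees with the tabulated ADP to within rounding of the listed inputs.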
Table III. Comparison of time complexities of multipliers

| Multipliers | Latency | Critical path |
|---|---|---|
| Fig.2 | $2\sqrt{m/d}$ | $T_A + (\log_2(d + 1))T_X + T_L$ |
| Fig.4 (two-term KA) | $2\sqrt{m/(2d)} + 2$ | $T_A + (\log_2(d + 1))T_X + T_L$ |
| Fig.4 (four-term KA) | $2\sqrt{m/(4d)} + 2$ | $T_A + (\log_2(d + 1))T_X + T_L$ |
| Kumar et al. [14] | $\lceil m/d \rceil + 1$ | $T_A + \log_2(d + 1)T_X + T_L$ |
| Talapatra et al. [17] | $2\lceil m/d \rceil$ | $T_A + (\log_2 d)T_X + T_{MUX} + T_L$ |
| Kim et al. [16] | $3\lceil m/d \rceil$ | $d(T_A + T_X + T_{MUX})/(s + 1) + T_L$ |

Note: (1) $s + 1$ is the number of pipelined stages per basic cell, and $d$ is
the selected digit-size. (2) $T_A$, $T_X$, $T_L$ and $T_{MUX}$ denote the propagation
delays of a 2-input AND gate, a 2-input XOR gate, a 1-bit latch and a 2 × 1
MUX gate, respectively.
Figure 8. Comparison of area complexity for various digit-serial multipliers over $GF(2^{1223})$ (area complexity in µm² versus digit-size from 2 to 32; curves for Fig.2, Fig.4 with 2-term KA, Fig.4 with 4-term KA, Kumar [14], Talapatra [17], and Kim [16]).
like method increases, it provides significantly higher bit-throughput and lower critical-path, ADP and EOB. The analytical results provide a valuable reference for implementing
the pairing algorithm and the elliptic curve digital signature algorithm
(ECDSA) in resource-constrained embedded systems and
smart phones. Moreover, our proposed systolic architectures
have the features of regularity, modularity, and concurrency,
and are suitable for VLSI chip designs on hardware platforms
such as ASIC and FPGA.
Figure 7. Comparison of computation time (ns) for various
digit-serial multipliers over $GF(2^{1223})$ (total computation time in ns, with axis ticks from 1 to 512 in powers of two, versus digit-size from 2 to 32; curves for Fig.2, Fig.4 with two-term, four-term and eight-term KA, Kumar [14], Talapatra [17], and Kim [16]).
REFERENCES
less than the best of existing digit-serial architectures. For exploring the area-time trade-off for large field arithmetic architectures, we have used both two-term and four-term Karatsuba
schemes to implement the digit-parallel systolic multiplier
over $GF(2^{1223})$. As the number of terms in the Karatsuba-
[1] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their
Applications, Cambridge University Press, 1994.
[2] I. S. Reed and G. Solomon, “Polynomial Codes over Certain Finite
Fields,” SIAM J. Appl. Math., pp. 300–304, 1960.
[3] National Institute of Standards and Technology, “Digital Signature
Standard,” FIPS 186-2, January 2000.
[22] P. K. Meher, “Systolic and Super-Systolic Multipliers for Finite Field
$GF(2^m)$ Based on Irreducible Trinomials,” IEEE Trans. Circuits and
Systems I, vol. 55, no. 4, pp. 1031–1040, 2008.
[23] C.-Y. Lee, “Super Digit-Serial Systolic Multiplier over $GF(2^m)$,” The
Sixth International Conference on Genetic and Evolutionary Computing,
August 25–28, 2012, Kitakyushu, Japan.
[24] J.-H. Guo and C.-L. Wang, “Digit-serial systolic multiplier for finite
fields $GF(2^m)$,” IEE Proc. Comput. Digit. Tech., vol. 145, no. 2, pp.
143–148, Mar. 1998.
[25] S. Kwon, C. H. Kim, and C. P. Hong, “A systolic multiplier with LSB
first algorithm over $GF(2^m)$ which is as efficient as the one with MSB
first algorithm,” in Proc. Int. Symp. Circuits Syst., vol. 5, pp. 633–636,
May 2003.
[26] NanGate Standard Cell Library, http://www.si2.org/openeda.si2.org/projects/nangatelib/.
[27] M. Ernst, M. Jung, F. Madlener, S. Huss, and R. Blumel, “A Reconfigurable
System on Chip Implementation for Elliptic Curve Cryptography over
$GF(2^n)$,” CHES 2002, LNCS 2523, pp. 381–399, 2003.
[28] L. S. Cheng, A. Miri, and T. H. Yeap, “Improved FPGA implementations
of parallel Karatsuba multiplication over $GF(2^n)$,” in 23rd Biennial
Symposium on Communications, 2006.
[29] H. Fan and M. A. Hasan, “A New Approach to Subquadratic Space
Complexity Parallel Multipliers for Extended Binary Fields,” IEEE
Trans. Computers, vol. 56, no. 2, pp. 224–233, Feb. 2007.
[30] J. A. Solinas, “Efficient Arithmetic on Koblitz Curves,” Designs, Codes
and Cryptography, vol. 19, pp. 195–249, 2000.
[31] D. F. Aranha, J.-L. Beuchat, J. Detrey, and N. Estibals, “Optimal Eta
Pairing on Supersingular Genus-2 Binary Hyperelliptic Curves,” in The
Cryptographers’ Track at the RSA Conference 2012 (CT-RSA 2012),
LNCS, pp. 98–115, Springer, 2012.
[32] J.-L. Beuchat, J. Detrey, N. Estibals, E. Okamoto, and F. Rodríguez-Henríquez, “Fast Architectures for the $\eta_T$ Pairing over Small-Characteristic Supersingular Elliptic Curves,” IEEE Trans.
Computers, vol. 60, no. 2, pp. 266–281, 2011.
Figure 9. Comparison of area-delay products for various digit-serial multipliers over $GF(2^{1223})$ (area-delay product in µm²·ns, with axis ticks from 640000 to 5120000, versus digit-size from 2 to 32; curves for Fig.2, Fig.4 with 2-term KA, Fig.4 with 4-term KA, Kumar [14], and Talapatra [17]).
[4] IEEE Std 1363-2000, “IEEE Standard Specifications for Public-Key
Cryptography,” January 2000.
[5] D. Boneh and M. K. Franklin, “Identity-Based Encryption from the Weil
Pairing,” SIAM Journal on Computing, vol. 32, no. 3, pp. 586–615, 2003.
[6] D. Boneh, B. Lynn, and H. Shacham, “Short Signatures from the Weil
Pairing,” Journal of Cryptology, vol. 17, no. 4, pp. 297–319, 2004.
[7] C. S. Yeh, I. S. Reed, and T. K. Truong, “Systolic Multipliers for Finite
Fields $GF(2^m)$,” IEEE Trans. Computers, vol. 33, no. 4, pp. 357–360,
Apr. 1984.
[8] C. L. Wang, “Bit-Level Systolic Array for Fast Exponentiation in
$GF(2^m)$,” IEEE Trans. Computers, vol. 43, no. 7, pp. 838–841, July
1994.
[9] C.-Y. Lee, C. W. Chiou, and J.-M. Lin, “A Unified Parallel Systolic
Multiplier over $GF(2^m)$,” Journal of Computer Science and Technology,
vol. 22, no. 1, pp. 28–38, Jan. 2007.
[10] C.-Y. Lee, J.-S. Horng, and I-C. Jou, “Low-complexity bit-parallel
systolic Montgomery multipliers for special classes of $GF(2^m)$,” IEEE
Trans. Computers, vol. 54, no. 9, pp. 1061–1070, Sep. 2005.
[11] P. K. Meher, “Systolic and Non-Systolic Scalable Modular Designs of
Finite Field Multipliers for Reed-Solomon Codec,” IEEE Trans. Very
Large Scale Integration (VLSI) Systems, vol. 17, no. 6, pp. 747–757,
Jun. 2009.
[12] L. H. Chen, P. L. Chang, C.-Y. Lee, and Y. K. Yang, “Scalable and
Systolic Dual Basis Multiplier over $GF(2^m)$,” Int. Journal of Innovative
Computing, Information and Control, vol. 7, no. 3, pp. 1193–1208, Mar.
2011.
[13] A. Hariri and A. Reyhani-Masoleh, “Digit-Serial Structures for the
Shifted Polynomial Basis Multiplication over Binary Extension Fields,”
in Proc. Int. Workshop on Arithmetic of Finite Fields (WAIFI), ser.
LNCS, vol. 5130, pp. 103–116, 2008.
[14] S. Kumar, T. Wollinger, and C. Paar, “Optimum Digit Serial $GF(2^m)$
Multipliers for Curve-Based Cryptography,” IEEE Trans. Computers,
vol. 55, no. 10, pp. 1306–1311, Oct. 2006.
[15] C.-Y. Lee, C. W. Chiou, J. M. Lin, and C. C. Chang, “Scalable and Systolic Montgomery Multiplier over $GF(2^m)$ Generated by Trinomials,”
IET Circuits, Devices & Systems, vol. 1, no. 6, pp. 477–484, 2007.
[16] C. H. Kim, C. P. Hong, and S. Kwon, “A digit-serial multiplier for
finite field $GF(2^m)$,” IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol. 13, no. 4, pp. 476–483, Apr. 2005.
[17] S. Talapatra, H. Rahaman, and S. K. Saha, “Unified Digit Serial
Systolic Montgomery Multiplication Architecture for Special Classes of
Polynomials over $GF(2^m)$,” Euromicro Conference on Digital System
Design: Architectures, Methods and Tools, pp. 427–432, 2010.
[18] J. Rajski and J. Tyszer, “Primitive Polynomials over $GF(2)$ of Degree
up to 660 with Uniformly Distributed Coefficients,” Journal of Electronic
Testing: Theory and Applications, vol. 19, pp. 645–657, 2003.
[19] A. Karatsuba and Y. Ofman, “Multiplication of Multidigit Numbers on
Automata,” Soviet Physics-Doklady (English translation), vol. 7, no. 7,
pp. 595–596, 1963.
[20] P. L. Montgomery, “Five, Six, and Seven-Term Karatsuba-Like Formulae,” IEEE Trans. Computers, vol. 54, no. 3, 2005.
[21] J. Xie, P. K. Meher, and J. He, “Low-Complexity Multiplier for
$GF(2^m)$ Based on All-One Polynomials,” to appear in IEEE Trans.
Very Large Scale Integr. (VLSI) Syst.
Jeng-Shyang Pan received the B. S. degree in
Electronic Engineering from the National Taiwan
University of Science and Technology in 1986, the
M. S. degree in Communication Engineering from
the National Chiao Tung University, Taiwan in 1988,
and the Ph.D. degree in Electrical Engineering from
the University of Edinburgh, U.K. in 1996. Currently, he is the Doctoral advisor in Harbin Institute
of Technology and Professor in the Department of
Electronic Engineering, National Kaohsiung University of Applied Sciences, Taiwan. He has published
more than 400 papers, of which 110 are indexed by SCI. He is an
IET Fellow, UK, and the Tainan Chapter Chair of the IEEE Signal Processing
Society. He was awarded the Gold Prize in the International Micro Mechanisms
Contest held in Tokyo, Japan in 2010. He was also awarded the Gold Medal in the
Pittsburgh Invention & New Product Exposition (INPEX) in 2010, the Gold Medal
in the International Exhibition of Geneva Inventions in 2011 and the Gold Medal
of the IENA, International “Ideas – Inventions – New Products”, Nuremberg,
Germany. He was offered Thousand-Elite-Project in China. He is on the
editorial board of International Journal of Innovative Computing, Information
and Control, LNCS Transactions on Data Hiding and Multimedia Security, and
Journal of Information Hiding and Multimedia Signal Processing. His current
research interests include soft computing, robot vision and cloud computing.
Chiou-Yng Lee received the Bachelor’s degree
(1986) in Medical Engineering and the M.S. degree
in Electronic Engineering (1992), both from the
Chung Yuan Christian University, Taiwan, and the
Ph.D. degree in Electrical Engineering from Chang
Gung University, Taiwan, in 2001. From 1988 to
2005, he was a research associate with Chunghwa
Telecommunication Laboratory in Taiwan, where he joined
the department of project planning. He has also taught
courses in related fields at Ching Yun University. Currently, he is a professor in the Department of
Computer Information and Network Engineering at Lunghwa University of
Science and Technology. His research interests include computations in finite
fields, error-control coding, signal processing, and digital transmission system.
He is a senior member of the IEEE and the IEEE Computer Society.
He also became an honorary member of Phi Tau Phi in 2001.
Pramod Kumar Meher (SM’03) received the M.Sc.
degree in physics and the Ph.D. degree in science
from Sambalpur University, India, in 1978, and
1996, respectively.
Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore, and Adjunct
Professor with the School of Electrical Sciences,
Indian Institute of Technology Bhubaneswar, India.
Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to
2002, and a Reader in electronics with Berhampur
University, India, from 1993 to 1997. His research interest includes design
of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication,
bio-informatics and intelligent computing. He has contributed more than 200
technical papers to various reputed journals and conference proceedings.
Dr. Meher has served as a speaker for the Distinguished Lecturer Program
(DLP) of IEEE Circuits Systems Society during 2011 and 2012 and Associate
Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS
II: EXPRESS BRIEFS during 2008 to 2011. Currently, he is serving as
Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY
LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and Journal of Circuits,
Systems, and Signal Processing. Dr. Meher is a Fellow of the Institution of
Electronics and Telecommunication Engineers, India. He was the recipient of
the Samanta Chandrasekhar Award for excellence in research in engineering
and technology for 1999.