Low-Latency Digit-Serial and Digit-Parallel Systolic Multipliers for Large Binary Extension Fields

Jeng-Shyang Pan, Senior Member, IEEE, Chiou-Yng Lee, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

Abstract—Cryptographic applications such as the elliptic curve digital signature algorithm (ECDSA) and pairing algorithms require crypto-processors to perform a large number of additions and multiplications over finite fields of large order. To obtain a balanced trade-off between space complexity and time complexity, this paper presents novel digit-serial and digit-parallel systolic structures for computing multiplication over $GF(2^m)$. Based on a novel decomposition algorithm, we derive an efficient digit-serial systolic architecture with a latency of $O(\sqrt{m/d})$ clock cycles, whereas the existing digit-serial systolic multipliers involve at least $O(m/d)$ latency for digit-size $d$. The proposed digit-serial design can be used for AESP-based fields with the same digit-size as for trinomial-based fields, with a small increase in area. We also propose a digit-parallel systolic architecture employing an $n$-term Karatsuba-like method, which reduces the latency from $O(\sqrt{m/d})$ to $O(\sqrt{m/(nd)})$. This feature is a major advantage when implementing multiplication for fields of large order. Synthesis results show that the proposed architectures have significantly lower time complexity, lower area-delay product, and higher bit-throughput than the existing digit-serial multipliers.

Index Terms—Karatsuba-like multiplication, elliptic curve digital signature algorithm, least-significant digit first (LSD-first) multiplication, pairing algorithm, almost equally spaced polynomial (AESP).

I. INTRODUCTION

Finite field arithmetic is widely used in cryptography and error control coding [1], [2].
Elliptic curve cryptography (ECC), for example in the elliptic curve digital signature algorithm (ECDSA) [3], [4], has been used in many security-sensitive applications in various contexts. Recently, cryptographic pairing has been used extensively to derive various security protocols, such as identity-based cryptography [5] and short signature schemes [6]. For such protocols, the Weil and Tate pairings based on elliptic curve arithmetic require thousands of additions and multiplications over large finite fields, and have drawn the attention of many researchers [30], [31], [32]. From the point of view of VLSI implementation, the realization of pairing computation in resource-constrained applications is highly challenging because of its high computational demand compared to classical ECC-based crypto-systems. On the one hand, the latency and the logic complexity of each arithmetic operation increase with the field order; on the other hand, the number of arithmetic operations also increases. For high-speed architectures, the area and power complexities of pairing implementations become too high. Therefore, depending on the requirements of the application, a trade-off must be reached between speed performance and power/energy complexity.

J.-S. Pan is with Innovative Information Industry Research Center (IIIRC), Shenzhen Graduate School, Harbin Institute of Technology, China (e-mail: jengshyangpan@gmail.com). C.-Y. Lee is with the Department of Computer Information and Network Engineering, Lunghwa University of Science and Technology, Taoyuan 33306, Taiwan (e-mail: pp010@mail.lhu.edu.tw). P. K. Meher is with Institute for Infocomm Research, 138632 Singapore (e-mail: pkmeher@i2r.a-star.edu.sg). Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

Systolic designs have several advantages, such as regularity and modularity of design, simplicity of their processing elements (PEs), local interconnections, and high throughput rates. Therefore, several systolic architectures have been proposed for finite field multiplication. Systolic multipliers for binary extension fields are mainly of two types: bit-parallel and bit-serial. The systolic multipliers over the extended binary field $GF(2^m)$ [7], [8], [11], [21], [25] usually employ either the least-significant bit first (LSB-first) or the most-significant bit first (MSB-first) algorithm. Bit-serial multipliers require less hardware and less power, but they are slow. Bit-parallel multipliers are fast but involve high hardware and power complexities. To obtain systolic multipliers with less hardware complexity, the field is usually defined by special polynomials, such as all-one polynomials, pentanomials, and trinomials. Fully bit-parallel systolic multipliers based on the Toeplitz matrix-vector product approach are proposed in [9] and [10]. However, these architectures for the polynomial basis of $GF(2^m)$ require $O(m^2)$ XOR gates, $O(m^2)$ AND gates, $O(m^2)$ 1-bit latches, and $O(m)$ latency. Recently, by using the special properties of the reduction polynomial for trinomials, a super systolic multiplier [22] has been proposed that reduces the latency from $O(m)$ to $O(\sqrt{m})$ clock cycles. Bit-serial systolic array multipliers, on the other hand, require only $O(m)$ space complexity, but they involve longer computational delays. To obtain a trade-off between speed and area complexities, digit-serial multipliers have been proposed in the literature. Digit-serial multiplier designs are classified into digit-in digit-out (DIDO) designs, digit-in parallel-out (DIPO) designs, and scalable designs. Digit-serial polynomial basis multipliers with the DIPO structure are proposed in [13], [14] and [23].
A scalable and systolic multiplier using a fixed $d \times d$ bit-parallel Hankel matrix-vector multiplier has been proposed in [15] and [12], whose latency is $(d + \lceil m/d \rceil(\lceil m/d \rceil - 1))$ clock cycles. Digit-serial systolic multipliers using the DIDO architecture are presented in [16], [17] and [24]. The latency of these digit-serial systolic multipliers is $2\lceil m/d \rceil - 1$ clock cycles. As mentioned above, the complexity of digit-serial systolic finite field multipliers depends on the selected irreducible polynomial and the chosen basis representation. In this paper, we present a novel decomposition scheme for digit-serial multiplication, and based on it we derive a low-latency digit-serial systolic multiplier. The proposed digit-serial systolic architecture achieves a latency as low as $2\lceil\sqrt{m/d}\,\rceil$ clock cycles, where $d$ is the selected digit-size. For a fixed digit-size, say $d = 4$, our proposed digit-serial systolic multiplier for $GF(2^{409})$ requires 22 clock cycles, while the existing digit-serial multipliers presented in [13] and [14] need 103 clock cycles. When the selected digit-size is one bit, the latency of the proposed multiplier is the same as that of Meher's multiplier [22]. The divide-and-conquer algorithm of Karatsuba-Ofman [19] is used to reduce the space complexity of the multipliers in [27] and [28]. Recently, Montgomery has proposed Karatsuba-like formulae [20]. Based on these, we propose a digit-parallel systolic multiplier to achieve a trade-off between time and area complexities in systolic multipliers for large binary extension fields.

The rest of the paper is organized as follows. Section II briefly reviews the classic LSD-first digit-serial multiplier over $GF(2^m)$. In Section III, we propose the novel digit-serial multiplication algorithm and use it to develop a digit-serial systolic multiplier. In Section IV, we use the Karatsuba-like method to derive a digit-parallel systolic multiplier. In Section V, the time and space complexities of the proposed multipliers and the corresponding existing works are presented and compared. Finally, we conclude the paper in Section VI.

II. TRADITIONAL DIGIT-SERIAL MULTIPLIER OVER $GF(2^m)$

In this section, we briefly review the digit-serial multiplication algorithm [14]. Let the field $GF(2^m)$ be constructed from an irreducible polynomial $F(x)$ of degree $m$, and let two elements $A$ and $B$ be represented in the polynomial basis of $GF(2^m)$, i.e.,

$$A = a_0 + a_1 x + \cdots + a_{m-1} x^{m-1}$$
$$B = b_0 + b_1 x + \cdots + b_{m-1} x^{m-1}$$

where $a_i$ and $b_i$ for $0 \le i \le m-1$ are 0 or 1. Finite field multiplication of the two elements $A$ and $B$ is given by

$$C = AB \bmod F(x) \qquad (1)$$

Various schemes have been reported in the literature to achieve low-hardware implementations of (1) in resource-constrained environments. The digit-serial multiplier, introduced in [14], provides a trade-off between speed and area complexities. In the following we discuss the least significant digit (LSD) first multiplication to derive a digit-serial multiplier architecture.

Let $A$, $B$, and $C$ be three elements in $GF(2^m)$ generated by the irreducible polynomial $F(x)$, represented in the polynomial basis, where $C = AB \bmod F(x)$. Let us assume that $q = \lceil m/d \rceil$, where $d$ is the selected digit size. If $m$ is not a multiple of $d$, then a field element can be padded with $(qd - m)$ zero bits as

$$A = (a_0, a_1, \cdots, a_{m-1}, \underbrace{0, \ldots, 0}_{qd-m \text{ bits}}).$$

Accordingly, an element $A$ can be represented by $A = \sum_{i=0}^{q-1} A_i x^{id}$, where $A_i = a_{id} + a_{id+1} x + \cdots + a_{id+d-1} x^{d-1}$. By using the LSD-first multiplication scheme, the product $C$ can be rewritten as

$$C = AB \bmod F(x) = B(A_0 + A_1 x^d + \cdots + A_{q-1} x^{(q-1)d}) \bmod F(x) = (\overline{C}_0 + \overline{C}_1 + \cdots + \overline{C}_{q-1}) \bmod F(x) \qquad (2)$$

where

$$\overline{C}_i = B^{(i)} A_i \qquad (3)$$

$$B^{(i)} = B x^{di} \bmod F(x) = x^d B^{(i-1)} \bmod F(x), \quad B^{(0)} = B. \qquad (4)$$

The traditional LSD-first multiplication given by (2) can be described by Algorithm 1.

Algorithm 1 Traditional LSD-first multiplication scheme [14]
Input: $A$ and $B$ are two elements in $GF(2^m)$
Output: $C = AB \bmod F(x)$
1. $C = 0$; $A = A_0 + A_1 x^d + \cdots + A_{q-1} x^{d(q-1)}$, where $A_i = \sum_{j=0}^{d-1} a_{id+j} x^j$
2. For $i = 0$ to $q-1$
3. $C = C + A_i B$;
4. $B = B x^d \bmod F(x)$;
5. endfor
6. $C = C \bmod F(x)$

Figure 1. Traditional LSD-first digit-serial multiplier [14]

Fig. 1 shows a digit-serial multiplier over $GF(2^m)$ based on Algorithm 1. It consists of one multiplier core, two registers for the two reduction operations ($B x^d \bmod F(x)$ and $C \bmod F(x)$), and one $(m+d)$-bit adder. The multiplier core computes the term $A_i B$ of Step 3. In the initial step, the register <B> is initialized with the element $B$, and the register <C> is initialized with zero. According to the LSD-first multiplication of (2), after $\lceil m/d \rceil$ clock cycles the register <C> provides $C = \overline{C}_0 + \overline{C}_1 + \cdots + \overline{C}_{q-1}$. The final reduction in Step 6 then computes $C = C \bmod F(x)$ to complete the multiplication. Thus, the architecture of Fig. 1 for the LSD-based digit-serial multiplier requires $\lceil m/d \rceil + 1$ clock cycles.

III. PROPOSED DIGIT-SERIAL SYSTOLIC MULTIPLIER

In order to derive a new digit-serial systolic multiplier, we briefly review the basic properties of the reduction polynomial. Let $B = \sum_{i=0}^{m-1} b_i x^i$ be an element in $GF(2^m)$, and let the field $GF(2^m)$ be constructed from an irreducible polynomial of the form $F(x) = 1 + x^l + x^m$. Let us represent $xB \bmod F(x)$ as

$$B^{(1)} = xB \bmod F(x) = \sum_{i=0}^{m-1} b_i x^{i+1} \bmod F(x) = \Big(b_{m-1} + \sum_{i=1}^{m-1} b_{i-1} x^i\Big) + b_{m-1} x^l = \widehat{B}^{(1)} + b_{m-1} x^l \qquad (5)$$

where

$$\widehat{B}^{(1)} = b_{m-1} + b_0 x + \cdots + b_{m-2} x^{m-1}. \qquad (6)$$

Generalizing (5), $x^d B \bmod F(x)$ can be rewritten as

$$B^{(d)} = x^d B \bmod F(x) = x B^{(d-1)} \bmod F(x) = \widehat{B}^{(d)} + \sum_{i=1}^{d-1} b_{m-i} x^{l+i-1}. \qquad (7)$$

Let $p$ and $k$ be two integers which satisfy $kp = q = \lceil m/d \rceil$, where $d$ is the selected digit-size. Note that if $q$ is not divisible by $k$, one needs to append zeros at the most significant end of $A$ to satisfy $q = kp$. Then the element $A$ can be rewritten as $A = \sum_{i=0}^{q-1} A_i x^{id}$, where $A_i = \sum_{j=0}^{d-1} a_{id+j} x^j$. The product $C$ can be represented by

$$C = AB \bmod F(x) = \sum_{i=0}^{q-1} A_i x^{id} B \bmod F(x) = \sum_{i=0}^{p-1} C_i x^{dki} \bmod F(x) \qquad (8)$$

where

$$C_i = \sum_{j=0}^{k-1} A_{ik+j} x^{jd} B = \sum_{j=0}^{k-1} A_{ik+j} B^{(jd)} = \sum_{j=0}^{k-1} C_{ij}, \qquad (9)$$

$$C_{ij} = A_{ik+j} B^{(jd)}. \qquad (10)$$

Note that the partial product of (10) is not in reduced form. To simplify the product $C$ in (8), the partial product $C_i x^{dki}$ is rewritten as

$$\overline{C}_i = C_i x^{dki} = \sum_{j=0}^{k-1} A_{ik+j} x^{jd} x^{dki} B = \sum_{j=0}^{k-1} A_{ik+j} B_i^{(jd)} \qquad (11)$$

where $B_i = x^{kid} B \bmod F(x) = x^{kd} B_{i-1} \bmod F(x)$. The product $C$ can therefore also be represented as

$$C = \sum_{i=0}^{p-1} \overline{C}_i \bmod F(x). \qquad (12)$$

The proposed digit-serial multiplication scheme based on (12) is described in Algorithm 2.

Algorithm 2 Proposed digit-serial multiplication algorithm
Inputs: $A$ and $B$ are two elements in $GF(2^m)$.
Output: $C = AB \bmod F(x)$.
1. Initialization step
   $C = 0$. $A = \sum_{i=0}^{kp-1} A_i x^{id}$, where $A_i = a_{id} + a_{id+1} x + \cdots + a_{id+d-1} x^{d-1}$.
2. Multiplication step
   2.1. for $i = 0$ to $p-1$
   2.2.   $D = B$
   2.3.   $B = x^{kd} B \bmod F(x)$
   2.4.   for $j = 0$ to $k-1$
   2.5.     $C = C + D A_{ik+j}$
   2.6.     $D = x^d D \bmod F(x)$
   2.7.   endfor
   2.8. endfor
3. Final reduction polynomial step
   3.1. $C = C \bmod F(x)$

Figure 2. The proposed digit-serial systolic multiplier architecture for $k = 4$

Figure 3. The detailed circuit of processing element (PE)

The proposed architecture for the new digit-serial multiplier based on Algorithm 2 is shown in Fig. 2. It consists of $k$ processing elements (PEs), one reduction module R3, one reduction-accumulator (RAC), and register <B>. Each PE (shown in Fig. 3) performs the computations of Steps 2.5 and 2.6. Each PE comprises one multiplier core, one reduction module R1, one $(m+d)$-bit adder, and a pair of $(m+d)$-bit latches. The multiplier core computes the partial products $A_{in} B_{in}$, where $B_{in}$ is an $m$-bit word and $A_{in}$ is a $d$-bit digit. The R1 module performs the reduction $x^d B_{in} \bmod F(x)$. The R3 module performs the reduction $x^{kd} B_{in} \bmod F(x)$. The RAC module (shown in Fig. 2) consists of one reduction module R2, one $m$-bit adder, and one $m$-bit register <C>. It performs the final reduction given by Step 3.1 of Algorithm 2.

In the proposed digit-serial systolic multiplier of Fig. 2, the register <B> is initialized with the element $B$, and the register <C> is initialized with zeros. It performs the LSD-first multiplication according to the proposed scheme based on (12) to compute the product $C = \overline{C}_0 + \overline{C}_1 + \cdots + \overline{C}_{p-1}$. In the first clock cycle, the content of register <B> is fed from the left as the input to the multiplier for computing the partial result $\overline{C}_0$. Concurrently, the reduction module R3 computes $B_1 = x^{kd} B \bmod F(x)$ and stores the reduced operand in register <B>. In the next clock cycle, $B_1$ is used as the input to the multiplier for computing the partial result $\overline{C}_1$, and the reduction module R3 computes $B_2 = x^{kd} B_1 \bmod F(x)$ and stores the result in register <B>, and so on. Each of the partial results $\overline{C}_i$ passes through $k$ PEs followed by the RAC module. The result is stored in register <C> after $k+1$ clock cycles. The RAC module finally computes $C = C + \overline{C}_i \bmod F(x)$. Therefore, the proposed digit-serial systolic multiplier computes the multiplication $C = AB \bmod F(x)$ after $k + p$ clock cycles.

Theorem 1. The proposed digit-serial systolic multiplier (as seen in Fig. 2) is composed of $k$ PEs, one RAC, one register <B>, and one R3 module. The latency of the derived architecture is at most $2k$ clock cycles, where $d$ is the selected digit-size and $k = \lceil\sqrt{m/d}\,\rceil$.

Proof: Assume that the digit-size is $d$ and the multiplication is decomposed into $q$-term computations, where $q = \lceil m/d \rceil$. Given the proposed digit-level systolic architecture in Fig. 2, suppose we have $k$ PEs and one RAC, where $q = kp$. Thus, the multiplication can also be segmented into $p$ partial results, i.e., $C = \overline{C}_0 + \overline{C}_1 + \cdots + \overline{C}_{p-1}$. The sub-product $\overline{C}_i$ along with $A_i$ and $B_i$ are used as inputs to the PEs of the systolic array multiplier. Because the systolic array is fully pipelined, and as mentioned before, the complete multiplication requires $k + p$ clock cycles. For a given $q = kp$, if $k$ (the number of PEs) is smaller, then $p$ (the number of digits provided as input to the systolic array) becomes large, and consequently the latency ($= k + p$) becomes large. To have low latency for finite field multiplication with large values of $m$, we need to minimize the latency $k + p$. Hence, we select $k = p = \lceil\sqrt{q}\,\rceil$. The proposed multiplier then has a latency of $2\lceil\sqrt{m/d}\,\rceil$ clock cycles.

For clarity of the above discussion, we use the following example to illustrate the PE operations in different clock cycles.

Example 1. Let $A = \sum_{i=0}^{26} a_i x^i$ and $B = \sum_{i=0}^{26} b_i x^i$ be two elements in $GF(2^{27})$ generated by the irreducible polynomial $F(x)$. Let us assume that the selected digit-size is $d = 3$. Then we have $k = \lceil\sqrt{27/3}\,\rceil = 3$, and the element $A$ is decomposed as $A = \sum_{i=0}^{8} A_i x^{3i}$, where $A_i = \sum_{j=0}^{2} a_{3i+j} x^j$. Considering (8), the product $C$ can be represented as $C = \overline{C}_0 + \overline{C}_1 + \overline{C}_2 \bmod F(x)$, where $\overline{C}_i = \sum_{j=0}^{2} A_{3i+j} B_i x^{dj}$ and $B_i = x^9 B_{i-1} \bmod F(x)$ for $i = 0, 1, 2$, with $B_0 = B$. Table I lists each PE operation in every clock cycle. We note that for this case, the proposed digit-serial systolic multiplier requires 6 clock cycles.

Table I. Contents of the components in the digit-serial systolic multiplier for $GF(2^{27})$ in each clock cycle.

| Cycle | Register B | $PE_1$ | $PE_2$ | $PE_3$ | Register C |
|---|---|---|---|---|---|
| initial | $B_0 = B$ | | | | |
| 1 | $B_1 = x^9 B_0 \bmod F(x)$ | $B_{11} = x^3 B_0 \bmod F(x)$; $C_{11} = A_0 B_0$ | | | |
| 2 | $B_2 = x^9 B_1 \bmod F(x)$ | $B_{21} = x^3 B_1 \bmod F(x)$; $C_{21} = A_3 B_1$ | $B_{22} = x^3 B_{11} \bmod F(x)$; $C_{22} = C_{11} + A_1 B_{11}$ | | |
| 3 | | $B_{31} = x^3 B_2 \bmod F(x)$; $C_{31} = A_6 B_2$ | $B_{32} = x^3 B_{21} \bmod F(x)$; $C_{32} = C_{21} + A_4 B_{21}$ | $B_{33} = x^3 B_{22} \bmod F(x)$; $C_{33} = C_{22} + A_2 B_{22}$ | |
| 4 | | | $B_{42} = x^3 B_{31} \bmod F(x)$; $C_{42} = C_{31} + A_7 B_{31}$ | $B_{43} = x^3 B_{32} \bmod F(x)$; $C_{43} = C_{32} + A_5 B_{32}$ | $C_0 = C_{33} \bmod F(x)$ |
| 5 | | | | $B_{53} = x^3 B_{42} \bmod F(x)$; $C_{53} = C_{42} + A_8 B_{42}$ | $C_1 = C_0 + C_{43} \bmod F(x)$ |
| 6 | | | | | $C_2 = C_1 + C_{53} \bmod F(x)$ |

Note: "$X_{ij}$" denotes the output element "$X$" of $PE_j$ at the $i$-th clock cycle.

IV. PROPOSED DIGIT-PARALLEL SYSTOLIC MULTIPLIER USING KARATSUBA-LIKE SCHEME

In this section, we use the Karatsuba-like function to realize the digit-parallel systolic multiplier.

A. Review of Karatsuba-like function

The divide-and-conquer algorithm for high-precision multiplication was introduced by Karatsuba and Ofman. As a modification of the Karatsuba function, Karatsuba-like formulae were suggested by Montgomery [20]. Here we briefly review the Karatsuba-like function.

In the finite field $GF(2^m)$, each field element $A$ can be represented as $A = \sum_{i=0}^{m-1} a_i x^i$, where $a_i \in GF(2)$. The element $A$ can also be rewritten as $A = A_L^2 + A_H^2 x$, where

$$A_L = \sum_{j=0}^{\lceil m/2 \rceil - 1} a_{2j} x^j, \qquad A_H = \sum_{j=0}^{\lceil m/2 \rceil - 1} a_{2j+1} x^j. \qquad (13)$$

Therefore, for the two-term Karatsuba-like function, $A = A_L^2 + x A_H^2$ and $B = B_L^2 + x B_H^2$ are two polynomials of degree at most $m - 1$, where $A_L$, $A_H$, $B_L$, $B_H$ are $(m/2)$-bit polynomials. The product of $A$ and $B$ can be rewritten as

$$\begin{aligned}
C = AB &= (A_L^2 + x A_H^2)(B_L^2 + x B_H^2) \\
&= A_L^2 B_L^2 + x (A_L B_H + A_H B_L)^2 + A_H^2 B_H^2 x^2 \\
&= A_L^2 B_L^2 (1 + x) + (A_L + A_H)^2 (B_L + B_H)^2 x + A_H^2 B_H^2 (x^2 + x) \\
&= C_0^2 (1 + x) + C_1^2 (x^2 + x) + C_2^2 x,
\end{aligned} \qquad (14)$$

where

$$C_0 = A_L B_L, \qquad C_1 = A_H B_H, \qquad C_2 = (A_L + A_H)(B_L + B_H).$$
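The two-term identity in (14) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes polynomials over $GF(2)$ are stored as Python integers (bit $i$ holds the coefficient of $x^i$), and the helper names `clmul`, `split_even_odd`, `karatsuba_like_2term` and `check_identity` are hypothetical names introduced only for this illustration. No reduction modulo $F(x)$ is applied, since (14) holds for the unreduced product.

```python
import random


def clmul(a, b):
    """Carry-less (GF(2)[x]) product of two polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def split_even_odd(a, m):
    """Return (A_L, A_H) of eq. (13): even- and odd-indexed coefficients."""
    lo = hi = 0
    for j in range((m + 1) // 2):
        lo |= ((a >> (2 * j)) & 1) << j
        hi |= ((a >> (2 * j + 1)) & 1) << j
    return lo, hi


def karatsuba_like_2term(a, b, m):
    """Unreduced product via eq. (14), using three half-size products."""
    al, ah = split_even_odd(a, m)
    bl, bh = split_even_odd(b, m)
    c0 = clmul(al, bl)            # C0 = A_L * B_L
    c1 = clmul(ah, bh)            # C1 = A_H * B_H
    c2 = clmul(al ^ ah, bl ^ bh)  # C2 = (A_L + A_H)(B_L + B_H)
    sq0, sq1, sq2 = clmul(c0, c0), clmul(c1, c1), clmul(c2, c2)
    # C = C0^2 (1 + x) + C1^2 (x^2 + x) + C2^2 x
    return clmul(sq0, 0b11) ^ clmul(sq1, 0b110) ^ clmul(sq2, 0b10)


def check_identity(m=27, trials=200, seed=1):
    """Compare eq. (14) against the schoolbook product on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.getrandbits(m), rng.getrandbits(m)
        if karatsuba_like_2term(a, b, m) != clmul(a, b):
            return False
    return True
```

Calling `check_identity()` compares the Karatsuba-like form with the direct product for random 27-bit operands. The even/odd split is what makes the squarings appear: squaring a polynomial over $GF(2)$ only spreads its coefficients to even positions, so $A_L^2$ and $x A_H^2$ recover the even and odd coefficients of $A$.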
Similarly, based on the two-term splitting method, the two polynomials $A = A_L^2 + x A_H^2$ and $B = B_L^2 + x B_H^2$ can be partitioned again into four-term polynomials $A = A_{LL}^4 + x A_{HL}^4 + x^2 A_{LH}^4 + x^3 A_{HH}^4$ and $B = B_{LL}^4 + x B_{HL}^4 + x^2 B_{LH}^4 + x^3 B_{HH}^4$, where $A_L = A_{LL}^2 + x A_{LH}^2$, $A_H = A_{HL}^2 + x A_{HH}^2$, $B_L = B_{LL}^2 + x B_{LH}^2$, and $B_H = B_{HL}^2 + x B_{HH}^2$. The product of $A$ and $B$ using the four-term Karatsuba-like function can be obtained as follows.

$$\begin{aligned}
AB ={}& (A_{LL}^4 + x A_{HL}^4 + x^2 A_{LH}^4 + x^3 A_{HH}^4) \cdot (B_{LL}^4 + x B_{HL}^4 + x^2 B_{LH}^4 + x^3 B_{HH}^4) \\
={}& A_{LL}^4 B_{LL}^4 (1 + x^2) + \big((A_{LL}^4 + A_{HL}^4)(B_{LL}^4 + B_{HL}^4) + (A_{LL}^4 + A_{LH}^4)(B_{LL}^4 + B_{LH}^4)\big)(x + x^2) \\
&+ (A_{HL}^4 B_{HL}^4 + A_{LH}^4 B_{LH}^4)(x + x^3) + (A_{LL}^4 + A_{LH}^4 + A_{HL}^4 + A_{HH}^4) \cdot (B_{LL}^4 + B_{LH}^4 + B_{HL}^4 + B_{HH}^4) x^2 \\
&+ \big((A_{HL}^4 + A_{HH}^4)(B_{HL}^4 + B_{HH}^4) + (A_{LH}^4 + A_{HH}^4)(B_{LH}^4 + B_{HH}^4)\big)(x^2 + x^3) + A_{HH}^4 B_{HH}^4 (x^2 + x^4).
\end{aligned} \qquad (15)$$

B. Digit-parallel systolic multiplier

In this section we use the two-term Karatsuba-like function of (14) to develop a digit-parallel systolic multiplier. From the structure of (14), we can see that the product $C = AB \bmod F(x)$ includes three squared partial products, namely $(A_L B_L)^2$, $(A_H B_H)^2$ and $(A_L + A_H)^2 (B_L + B_H)^2$. To simplify the subword representation, let us define

$$\overline{A}_i = \sum_{j=0}^{\lceil m/2 \rceil - 1} a_{i,j} x^j = \begin{cases} A_L & \text{for } i = 0 \\ A_H & \text{for } i = 1 \\ A_L + A_H & \text{for } i = 2 \end{cases}
\qquad
\overline{B}_i = \sum_{j=0}^{\lceil m/2 \rceil - 1} b_{i,j} x^j = \begin{cases} B_L & \text{for } i = 0 \\ B_H & \text{for } i = 1 \\ B_L + B_H & \text{for } i = 2 \end{cases}$$

where

$$a_{i,j} = \begin{cases} a_{2j+i} & \text{for } i = 0, 1 \\ a_{2j} + a_{2j+1} & \text{for } i = 2 \end{cases}
\qquad
b_{i,j} = \begin{cases} b_{2j+i} & \text{for } i = 0, 1 \\ b_{2j} + b_{2j+1} & \text{for } i = 2 \end{cases}$$

Thus, the product $C$ is represented as

$$C = (\overline{A}_0 \overline{B}_0)^2 (1 + x) + (\overline{A}_1 \overline{B}_1)^2 (x^2 + x) + (\overline{A}_2 \overline{B}_2)^2 x \bmod F(x). \qquad (16)$$

Assume that $d$ is the selected digit size. Each subword $\overline{A}_i$ can be rewritten as $\overline{A}_i = \sum_{j=0}^{\lceil m/2d \rceil - 1} A_{i,j} x^{jd}$, where $A_{i,j} = \sum_{l=0}^{d-1} a_{i,jd+l} x^l$. According to Theorem 1, assuming $k$ to be an integer that satisfies $k = \lceil\sqrt{m/2d}\,\rceil$, each partial product $\overline{A}_i \overline{B}_i$ can be represented by

$$\begin{aligned}
\overline{A}_i \overline{B}_i &= \sum_{j=0}^{\lceil m/2d \rceil - 1} A_{i,j} \overline{B}_i x^{jd} \bmod F(x) \\
&= \sum_{j=0}^{k-1} \overline{C}_{i,j} x^{kdj} \bmod F(x) \\
&= \big(\cdots\big((\overline{C}_{i,k-1} x^{dk} + \overline{C}_{i,k-2}) x^{dk} + \cdots\big) x^{dk} + \overline{C}_{i,0}\big) \bmod F(x)
\end{aligned} \qquad (17)$$

where

$$\overline{C}_{i,j} = \sum_{l=0}^{k-1} A_{i,jk+l} \overline{B}_i x^{ld} = \big(\cdots\big((A_{i,jk+k-1} \overline{B}_i x^d + A_{i,jk+k-2} \overline{B}_i) x^d + \cdots\big) x^d + A_{i,jk} \overline{B}_i\big). \qquad (18)$$

Based on (16), (17) and (18), we can derive the digit-parallel multiplication scheme stated in Algorithm 3.

Algorithm 3 Digit-parallel multiplication algorithm based on two-term Karatsuba-like function
Inputs: $A = A_L^2 + A_H^2 x$ and $B = B_L^2 + B_H^2 x$ are two elements in $GF(2^m)$.
Output: $C = AB \bmod F(x)$.
1. $C_0 = 0$, $C_1 = 0$, $C_2 = 0$.
2. $\overline{A}_0 = A_L$, $\overline{A}_1 = A_H$, $\overline{B}_0 = B_L$ and $\overline{B}_1 = B_H$.
3. for $j = k-1$ to $0$
   /* initialization step */
4.   $A_{2,j} = A_{0,j} + A_{1,j}$, where $A_{i,j} = (a_{i,dkj}, a_{i,dkj+1}, \cdots, a_{i,dkj+dk-1})$.
5.   $\overline{B}_2 = \overline{B}_0 + \overline{B}_1$.
   /* subword product computation step */
6.   $\overline{C}_0 = \overline{B}_0 A_{0,j}$.
7.   $\overline{C}_1 = \overline{B}_1 A_{1,j}$.
8.   $\overline{C}_2 = \overline{B}_2 A_{2,j}$.
9.   $C_0 = C_0 x^{kd} + \overline{C}_0$.
10.  $C_1 = C_1 x^{kd} + \overline{C}_1$.
11.  $C_2 = C_2 x^{kd} + \overline{C}_2$.
12. endfor
   /* final polynomial reduction step */
13. $C = C_0^2 (1 + x) + C_1^2 (x^2 + x) + C_2^2 x \bmod F(x)$.

Figure 4. The proposed digit-parallel systolic multiplier architecture based on two-term Karatsuba-like function for $k = 4$

Fig. 4 shows the proposed digit-parallel systolic array architecture using the two-term Karatsuba-like function. It consists of three main parts: the pre-processing unit, the subword product computation (SPC) unit, and the post-processing unit. In the pre-processing unit, we use a $GF(2^m)$ adder to realize the two addition operations $A_{2,j} = A_{0,j} + A_{1,j}$ and $\overline{B}_2 = \overline{B}_0 + \overline{B}_1$, corresponding to Steps 4 and 5, respectively. The SPC unit consists of three partial product computation arrays (PCAs) to compute $\overline{C}_0 = \overline{B}_0 A_{0,j}$, $\overline{C}_1 = \overline{B}_1 A_{1,j}$, and $\overline{C}_2 = \overline{B}_2 A_{2,j}$. Each PCA is a digit-serial systolic array with $k$ modified processing elements (PEs), where $k = \lceil\sqrt{m/2d}\,\rceil$. It performs the partial product computation according to (18). Fig. 5 shows the detailed circuit of the PE. The post-processing unit consists of three accumulation (AC) modules and one final polynomial reduction (FPR) module. Each AC module performs accumulation of partial products. The FPR module performs Step 13 to obtain the final result, as shown in Fig. 6.

Theorem 2. For a finite field $GF(2^m)$ constructed from an irreducible polynomial, the latency of the proposed digit-parallel systolic multiplier using the two-term Karatsuba-like function is $(2\lceil\sqrt{m/2d}\,\rceil + 2)$ clock cycles.

Proof: Let $k$ be a positive integer satisfying $k = \lceil\sqrt{m/2d}\,\rceil$, where $d$ is the selected digit-size. In the proposed digit-parallel systolic multiplier architecture of Fig. 4, the three main parts (pre-processing unit, subword product computation unit, and post-processing unit) require 1, $k$, and 2 clock cycles, respectively. Thus, computing each sub-product $\overline{C}_{i,j}$ and storing it in the AC module requires $k + 2$ clock cycles. According to Algorithm 3, the main computation requires $k$ iterations of the for loop, which demands $2k + 1$ clock cycles. Finally, the final reduction in Step 13 performs the summation of the three partial results followed by reduction (i.e., $C = C_0^2 (1 + x) + C_1^2 (x^2 + x) + C_2^2 x \bmod F(x)$) to obtain the product word. Thus, one complete multiplication requires $2k + 2$ clock cycles.

As mentioned above, in the proposed digit-parallel systolic multiplier (Fig. 4), the subword product computation unit consists of three digit-serial systolic arrays [i] for computing $\overline{A}_i \overline{B}_i$ for $i = 0$, 1 and 2. Observing the structure of Fig. 4, each PCA calculates the subword product of two $\frac{m}{2}$-bit polynomials.
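The data flow of Algorithm 3 can be mirrored in a small bit-level software model. This is a sketch under stated assumptions rather than the hardware design: polynomials are Python integers (bit $i$ is the coefficient of $x^i$), `f` is the reduction polynomial with its $x^m$ term included, and all function names are hypothetical. The model follows the loop of Steps 3 to 12 with $kd$-bit chunks, evaluates each chunk product digit by digit as in (18), and precomputes the sum $A_L + A_H$ once instead of per chunk, which is equivalent to Step 4 over $GF(2)$.

```python
import math


def clmul(a, b):
    """Carry-less (GF(2)[x]) product of two polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod(c, f, m):
    """Remainder of c modulo f, where f has degree m."""
    while c.bit_length() > m:
        c ^= f << (c.bit_length() - 1 - m)
    return c


def split_even_odd(a, m):
    """Even- and odd-indexed coefficients of an m-bit polynomial, eq. (13)."""
    lo = hi = 0
    for j in range((m + 1) // 2):
        lo |= ((a >> (2 * j)) & 1) << j
        hi |= ((a >> (2 * j + 1)) & 1) << j
    return lo, hi


def pca_partial_product(b_sub, chunk, k, d):
    """Eq. (18): sum of k digit products, one d-bit digit per PE."""
    acc = 0
    for l in range(k):
        digit = (chunk >> (l * d)) & ((1 << d) - 1)
        acc ^= clmul(b_sub, digit) << (l * d)
    return acc


def algorithm3_model(a, b, f, m, d):
    """Software model of Algorithm 3 (two-term Karatsuba-like scheme)."""
    half = (m + 1) // 2                  # bits per subword
    q = -(-half // d)                    # digits per subword (ceiling)
    k = math.isqrt(q - 1) + 1            # ceil(sqrt(q)) PEs per array
    al, ah = split_even_odd(a, m)
    bl, bh = split_even_odd(b, m)
    sub_a = [al, ah, al ^ ah]            # pre-processing unit
    sub_b = [bl, bh, bl ^ bh]
    acc = [0, 0, 0]                      # the three AC modules
    mask = (1 << (k * d)) - 1
    for j in range(k - 1, -1, -1):       # Steps 3-12, most significant chunk first
        for i in range(3):
            chunk = (sub_a[i] >> (j * k * d)) & mask
            acc[i] = (acc[i] << (k * d)) ^ pca_partial_product(sub_b[i], chunk, k, d)
    sq = [clmul(c, c) for c in acc]      # Step 13: squarings and recombination
    c = clmul(sq[0], 0b11) ^ clmul(sq[1], 0b110) ^ clmul(sq[2], 0b10)
    return reduce_mod(c, f, m)
```

For example, with $m = 409$, $d = 4$ and the trinomial $x^{409} + x^{87} + 1$, the model uses $k = 8$ and its result can be compared against `reduce_mod(clmul(a, b), f, 409)`. The model checks functional equivalence only; it says nothing about clock cycles or gate counts.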
Since three PCAs in Fig.4 are fully parallelism computations, lp m the latency of the proposed multiplier is at most m/2 + 2) clock cycles, if the selected digit-size is (2 d = 1. In this regarding, we employ the recursions of two-term Karatsuba-like function to derive the proposed digit-serial systolic architecture. The proposed multiplier can have the following properties. Theorem 3. Assume that we use n-term Karatsuba-like function to construct the digit-parallel systolic multiplier, where n = 2i , then, the subword product computation unit is required 6 ( Bi,in multiplier core ( m ) bits 2 V. T IME AND S PACE C OMPLEXITIES m d ) bits 2 d Ci,in d-bits Bi,out L A. Complexities of digit-serial systolic multiplier L adder ( m kd ) bits 2 Let us consider the following properties to analyze the time and space complexities of the proposed digit-serial systolic multiplier. Remark 1. Let F (x) be an irreducible trinomials of the form F (x) = 1 + xl + xm . The computation of xd B mod F (x) then requires d XOR gates and involves one XOR gate delay. Ci,out Ai,in Figure 5. The modified processing element (P E) C0,in () 2 ( x 1) mod F ( x ) C1,in ( ) 2 ( x x 2 ) mod F ( x ) C2,in Remark 2. Let F (x) be an irreducible pentanomial of the form m F (x) = 1+xl1 +xl2 +xl3 +xm with l1 u m 4 , l2 −l1 u 4 , and m l3 − l2 u 4 . It is shown in [18] that such type of pentanomial exists in GF (2m ) for m > 9. This polynomial is called an almost equally spaced pentanomial (AESP). In this case, the computation of xd B mod F (x) requires 3d XOR gates and involves one XOR gate delay. Remark 3 (multiplier core). Let A, B, C be represented by polynomials given by d, m, m + d bits, respectively. Then, the computation of BA+C by traditional grade-school technique, it requires dm XOR and dm AND gates, and involves TA + (log 2 (d + 1))TX ) gate-delay time. Remark 4 (Final polynomial reduction). Let C be represented by m + d bit polynomial. 
Then, computing C mod F (x) has the following time and space complexities. Cout ( ) 2 ( x ) mod F ( x ) Figure 6. The finial polynomial reduction module (FPR) to have the following configuration. • • • • Based on two-way splitting Toeplitz matrix-vector product scheme [29], if n = 2i , the subword product computation unit then produces nlog 2 3 digit-serial systolic p m arrays. Each digit-serial systolic array consists of nd P Es. In the structure of P E in Fig. 5, the multiplier core performs the multiplication of Bi,in and Ai,in , where Bi,in and Ai,in are m n -bit and d -bit polynomials, respectively. The output result of each digit-serial array systolic p m for each loop computation demands m + d n nd digit sizes. • • If F (x) is an irreducible trinomial, C mod F (x) requires 2d XOR gates and involves one XOR gate-delay. If F (x) is an irreducible AESP, C mod F (x) requires 6d XOR gates and involves one XOR gate-delay. As shown in the structure of Fig.2, the proposed multiplier consists of k PEs, one m-bit register < B >, one reduction module R3, and one RAC module. Each PE consists of one multiplier core, one reduction module R1, one m-bit GF (2m ) adder, and a (2m + d)-bit register. RAC module is comprised of one reduction module R2, one m XOR gates, and m-bit register < C >, where the reduction module R2 performs the final reduction operation of step 3.1 according to Algorithm 2. Based on Remarks 1 and 4, we have estimated the space complexity of our proposed architecture for trinomials and AESPs, and listed in Tables II and III. From these tables, we can find √ that AESP-based digit-serial multiplier (Fig.2) requires (4 md + 2d) number of more XOR gates compared to the trinomial-based multiplier. The multiplier lp digit-serial m has the latency of 2 m/d clock cycles if d is the selected digit-size. In [22], Meher has proposed√the 2-D super systolic multiplier for trinomials, having 2 d me clock cycles of latency. 
When the selected digit-size is one bit, the latency of our proposed multiplier in Fig. 2 is the same as that of Meher's multiplier. In this case, the R1 module in Fig. 3 reduces to one XOR gate, and the duration of the clock cycle reduces to $T_A + T_X + T_L$, where $T_A$, $T_X$ and $T_L$ denote the propagation delays of a 2-input AND gate, a 2-input XOR gate and a 1-bit latch, respectively. Therefore, the latency of the proposed digit-serial multiplier ranges from $2\lceil\sqrt{m}\rceil$ to $2\lceil\sqrt{m/d}\rceil$ clock cycles, depending on the selected digit-size $d$.

Theorem 4. The latency of the proposed digit-parallel systolic multiplier with n-term Karatsuba-like function is $(2\sqrt{m/(nd)} + 2)$ clock cycles.

According to Theorem 3, each digit-serial systolic array calculates the product of two $\lceil m/n\rceil$-bit subword polynomials. As an example, for $m = 409$ and $n = 4$, we use the four-term Karatsuba-like function in (15) to build the digit-serial multiplier. We require 9 digit-serial systolic arrays for computing $A_0B_0$, $A_{01}B_{01}$, $A_{02}B_{02}$, $A_1B_1$, $A_2B_2$, $A_{0123}B_{0123}$, $A_{13}B_{13}$, $A_{23}B_{23}$ and $A_3B_3$, respectively, to construct the subword product computation unit. Each of the subwords $A_i$ and $B_j$ is a $\lceil 409/4\rceil = 103$-bit polynomial. Assuming that the selected digit-size is four bits, we use the previously proposed digit-serial systolic multiplier architecture (shown in Fig. 2) to construct each digit-serial systolic array with $\lceil\sqrt{103/4}\rceil = 6$ PEs. Therefore, based on Theorem 4, the proposed digit-parallel systolic multiplier using the four-term Karatsuba-like function has a latency of $2\lceil\sqrt{m/(nd)}\rceil + 2 = 14$ clock cycles.

B. Complexities of digit-parallel systolic multiplier

Remark 5. By using the n-term Karatsuba algorithm, in the pre-processing unit, the $GF(2^m)$ adder requires $(n^{\log_2 3} - n)\left(\frac{m}{n} + \sqrt{\frac{md}{n}}\right)$ XOR gates and $\log_2 n$ XOR gate delays.

The digit-parallel systolic multiplier of Fig. 4 comprises three main parts. For simplicity of discussion, let us consider the two-term KA to estimate the time- and space-complexities of the proposed digit-parallel systolic multiplier. According to Remark 5, the pre-processing unit involves a space-complexity of $\left(\frac{m}{2} + \sqrt{\frac{md}{2}}\right)$ XOR gates and $m$ 1-bit latches, and requires a gate-delay of $T_X + T_L$ to complete the computation. The subword product computation unit is constructed from $3\sqrt{m/(2d)}$ PEs. Each PE involves a space complexity of $d\frac{m}{2}$ AND gates, $d\frac{m}{2}$ XOR gates, and $m + \sqrt{\frac{md}{2}}$ latches, and requires a gate-delay of $T_A + (\log_2(d+1))T_X + T_L$ to complete its computation. The post-processing unit consists of an AC module and an FPR module. The AC module consists of $3\left(\frac{m}{2} + \sqrt{\frac{md}{2}}\right)$ XOR gates and $3m$ latches. Assuming that the field is constructed from an irreducible trinomial, the FPR module in Fig. 6 has a space-complexity of $\left(7m - 3\frac{m-n}{2} + 5\right)$ XOR gates to perform the computation of (16). The critical-path of the proposed architecture for trinomials is $T_A + (\log_2(d+1))T_X + T_L$. Based on the above discussion, we have calculated the time- and space-complexities of the digit-parallel systolic multiplier for trinomials, and listed them in Tables II and III. Similarly, we can estimate the complexity of the proposed architecture based on n-term KA.

C. Comparisons

Table II lists the hardware components used by our proposed multipliers and the existing digit-serial multipliers [14], [16], [17]. The latency and critical-path of the proposed multipliers are compared with those of the existing multipliers in Table III. From this table, we find that the proposed digit-serial and digit-parallel systolic multipliers have latencies of $2\sqrt{m/d}$ and $(2\sqrt{m/(dn)} + 2)$ clock cycles, respectively, while traditional digit-serial non-systolic and systolic multipliers involve latencies of $\lceil m/d\rceil + 1$ and $2\lceil m/d\rceil$ clock cycles, respectively. We note that, as shown in this table, the proposed AESP-based multiplier has the same latency as the proposed trinomial-based multiplier. In Table II, we have listed the bit-throughput (BT) as a measure of speed performance. The BT of our proposed architectures is more than $\sqrt{dm}$, which depends on the selected digit-size $d$.

The applications of Tate and Weil pairing algorithms involve additions and multiplications over very large finite fields. Therefore, we select the field $GF(2^{1223})$ constructed by the trinomial $x^{1223} + x^{155} + 1$ to estimate the critical-path, area complexity, and area-delay product of various digit-serial multipliers. We have used NanGate's Library Creator and the 45nm FreePDK Base Kit from North Carolina State University (NCSU) [26] to synthesize the proposed and the corresponding existing digit-serial multipliers and to obtain their time and area complexities. Fig. 7 shows that the computation time of our proposed architectures is significantly lower than that of the existing multipliers. Therefore, the time-complexity of our proposed system is much lower than that of the existing multipliers. Amongst all the existing digit-serial multipliers, the non-systolic multiplier of [14] has the minimum time-complexity. But, as shown in Fig. 7, the proposed multiplier of Fig. 2 involves nearly 6 to 27 times less time-complexity than that of [14] as the digit-size increases from 2 to 32. The time-complexity of the proposed digit-parallel systolic multiplier using 8-term KA is 1.6 to 2.6 times less than that of the proposed digit-serial systolic architecture of Fig. 2. It is found that the proposed multiplier using KA involves the lowest time-complexity amongst the digit-wise systolic multipliers [14], [16], [17]. As shown in Fig. 8, our proposed architectures have higher area-complexity than the existing digit-serial multipliers, but, as shown in Fig. 9, the proposed architectures involve less area-delay product (ADP) than the other digit-serial multipliers [14], [16], [17].

For clarity of comparison, in Table IV we have listed the area, the normalized power consumption per GHz and the energy per output bit (EOB) of the proposed and the corresponding existing digit-serial multipliers for digit-size $d = 8$. The estimation of EOB is explained in the following.
• The multiplier in a given clock period computes “L” bits of the product word, which could be considered as the bit-throughput (BT).
• The multiplier in a given clock period consumes “E” amount of energy. We can compute the energy consumed per cycle as E = (power consumption) × (clock period).
Then the EOB is defined as

$$\mathrm{EOB} = \frac{E}{L} = \frac{(\text{power consumption}) \times (\text{clock period})}{\text{the number of output bits produced per cycle}}. \qquad (19)$$

The clock period mentioned in expression (19) is the clock period used for estimating the power consumption. As shown in Table IV, the structure of [14], which has the lowest critical path among the existing digit-serial designs, requires 6.2 times the computation time of the proposed digit-serial structure of Fig. 2, and 7.7 times and 11 times that of the proposed digit-parallel structures of Fig. 4 for 2-term and 4-term KAs, respectively. The proposed digit-parallel systolic multiplier using four-term KA can save about 55.2% ADP and 56.04% EOB over the best of the existing digit-serial multipliers [14], [16], [17]. Moreover, the digit-parallel systolic multiplier using four-term KA can save about 36.84% EOB over the digit-parallel systolic multiplier using two-term KA. Table IV also shows that the latencies of our proposed architectures are lower than those of the existing multipliers. For digit-size $d = 8$, our proposed architectures can have BT > 94, while the existing multipliers have BT ≤ 8. Therefore, the proposed digit-parallel systolic multipliers using KA with different numbers of terms could be used to obtain the desired trade-off among speed, ADP/EOB, and BT of digit-wise multipliers for large fields.

VI. CONCLUSIONS

We have presented two novel low-latency digit-serial and digit-parallel systolic multipliers over $GF(2^m)$ of large orders.
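As a rough numerical check of the latency expression of Theorem 4 and of the EOB definition in (19), a minimal Python sketch is given here. The helper names are hypothetical and are not part of the proposed design. Ceilings are applied as in the $m = 409$ worked example, and the power normalized per GHz is assumed to be equivalent to the energy per clock cycle (1 µW/GHz = 1 fJ per cycle).

```python
import math


def ceil_sqrt_ratio(num: int, den: int) -> int:
    """Return ceil(sqrt(num / den)) using integer arithmetic only."""
    q = -(-num // den)            # ceil(num / den); k*k >= num/den iff k*k >= q
    return math.isqrt(q - 1) + 1  # ceil(sqrt(q)) for any integer q >= 1


def digit_serial_latency(m: int, d: int) -> int:
    """Latency (clock cycles) of the digit-serial design: 2*ceil(sqrt(m/d))."""
    return 2 * ceil_sqrt_ratio(m, d)


def digit_parallel_latency(m: int, d: int, n: int) -> int:
    """Latency (clock cycles) with n-term KA, following the m = 409 example."""
    subword_bits = -(-m // n)     # each subword has ceil(m/n) bits
    return 2 * ceil_sqrt_ratio(subword_bits, d) + 2


def energy_per_output_bit_pj(power_uw_per_ghz: float, bits_per_cycle: float) -> float:
    """EOB of expression (19), in pJ.

    Assumption: power normalized per GHz (uW/GHz) is read as energy per
    clock cycle in fJ, so dividing by 1000 gives pJ per cycle.
    """
    return power_uw_per_ghz / 1000.0 / bits_per_cycle


if __name__ == "__main__":
    print(digit_parallel_latency(409, 4, 4))   # 14, as in the worked example
    print(digit_serial_latency(409, 1))        # d = 1: 2*ceil(sqrt(409)) = 42
    print(round(energy_per_output_bit_pj(629_700.9, 94.08), 3))  # 6.693
```

With the Table IV entries for Fig. 2 (629,700.9 µW/GHz and BT = 94.08), the sketch returns about 6.693 pJ, which agrees with the listed EOB under the stated unit assumption.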
The proposed digit-serial architecture for trinomials and AESPs has a latency of $2\sqrt{m/d}$ clock cycles, which is much less than that of the best of the existing digit-serial architectures. For exploring the area-time trade-off for large field arithmetic architectures, we have used both two-term and four-term Karatsuba schemes to implement the digit-parallel systolic multiplier over $GF(2^{1223})$. As the number of terms in the Karatsuba-like method increases, it provides significantly higher bit-throughput and less critical-path, ADP and EOB. The analytical results provide a valuable reference for implementing the pairing algorithm and the elliptic curve digital signature algorithm (ECDSA) in resource-constrained embedded systems and smart phones. Moreover, our proposed systolic architectures have the features of regularity, modularity, and concurrency, and are suitable for VLSI chip designs on hardware platforms such as ASIC and FPGA.

Table II. Comparison of space complexities of multipliers
Multipliers Fig.2 for xm + xn + 1 Fig.2 for AESPs Fig.4 for xm + xn + 1 (two-term KA) Fig.4 for xm + xn + 1 (four-term KA) #AND √ m m √ m m q 1.5m md 2 9m 8 √ √ #XOR #MUX #Latch md(2 + m) + d √ md(6 + m) + 3d q − 8m + (1.5m + 3) md 2 29m 4 md n 2 q m (2m d q m (2m d BT(bits per cycle) √ dm √ dm √ 2dm + d − 1) + 2m + d − 1) + 2m q m 4.5m + 1.5m 2d q √ 31m m + 2md + 9m 4 4 4d +5 √ + ( 9m + 4.5) md + n 8 √ 4dm (m + k + 1)d m (m + k)d + (k + 1)(d − 1) 2m + d + k dm/de+1 +(k + 1)(d − 1) Talapatra et al. [17] md md + 2d 2m 4m + 3d + 1 d m 9sd Kim et al.[16] 2md + m 2md 2m (10d + 1 + + s) d d 2 Kumar et al. [14]
Note: $s + 1$ is the number of pipelined stages per basic cell, $d$ is the selected digit-size, and $k$ is the second highest bit number of the irreducible polynomial.

Table III. Comparison of time complexities of multipliers
Multipliers | Latency | Critical path
Fig.2 | $2\sqrt{m/d}$ | $T_A + (\log_2(d+1))T_X + T_L$
Fig.4 (two-term KA) | $2\sqrt{m/(2d)} + 2$ | $T_A + (\log_2(d+1))T_X + T_L$
Fig.4 (four-term KA) | $2\sqrt{m/(4d)} + 2$ | $T_A + (\log_2(d+1))T_X + T_L$
Kumar et al. [14] | $\lceil m/d\rceil + 1$ | $T_A + \log_2(d+1)T_X + T_L$
Talapatra et al. [17] | $2\lceil m/d\rceil$ | $T_A + (\log_2 d)T_X + T_{MUX} + T_L$
Kim et al. [16] | $3\lceil m/d\rceil$ | $d(T_A + T_X + T_{MUX})/(s+1) + T_L$
Note: (1) $s + 1$ is the number of pipelined stages per basic cell, and $d$ is the selected digit-size. (2) $T_A$, $T_X$, $T_L$ and $T_{MUX}$ denote the propagation delays of a 2-input AND gate, a 2-input XOR gate, a 1-bit latch and a 2 × 1 MUX gate, respectively.

Table IV. Comparison of various digit-serial multipliers over $GF(2^{1223})$ in terms of latency, ADP, critical path delay $T_{CPD}$ (ns), area (µm²), power (µW/GHz), EOB (pJ), and BT (bits/cycle) for digit-size $d = 8$.
Multipliers | Latency (cycles) | Area (µm²) | $T_{CPD}$ (ns) | ADP (µm²·ns) | Power (µW/GHz) | BT (bits per cycle) | EOB (pJ)
Fig.2 | 25 | 383,282.6 | 0.21 | 2,012,233.6 | 629,700.9 | 94.08 | 6.693
Fig.4 (2-term KA) | 18 | 456,965.1 | 0.21 | 1,727,328.1 | 801,322.9 | 122.3 | 6.552
Fig.4 (4-term KA) | 13 | 433,796.5 | 0.21 | 1,184,264.5 | 722,910.9 | 174.71 | 4.138
Kumar multiplier [14] | 154 | 44,034.97 | 0.21 | 1,424,090.9 | 74,756.56 | 7.94 | 9.413
Kim multiplier [16] | 459 | 165,145.8 | 0.47 | 35,626,910.7 | 216,488 | 8 | 27.061
Talapatra multiplier [17] | 306 | 52,840.1 | 0.25 | 4,042,267.8 | 81,174.9 | 8 | 20.294

Figure 7. Comparison of computation time (ns) for various digit-serial multipliers over $GF(2^{1223})$. (Total computation time (ns) versus digit-size from 2 to 32; curves: Fig.2, Fig.4 two-term KA, Fig.4 four-term KA, Fig.4 eight-term KA, Kumar [14], Talapatra [17], Kim [16].)

Figure 8. Comparison of area complexity for various digit-serial multipliers over $GF(2^{1223})$. (Area complexity (µm²) versus digit-size from 2 to 32; curves: Fig.2, Fig.4 2-term KA, Fig.4 4-term KA, Kumar [14], Talapatra [17], Kim [16].)

Figure 9. Comparison of area-delay products for various digit-serial multipliers over $GF(2^{1223})$. (Area-delay product (µm²·ns) versus digit-size from 2 to 32; curves: Fig.2, Fig.4 2-term KA, Fig.4 4-term KA, Kumar [14], Talapatra [17].)

REFERENCES
[1] R. Lidl and H. Niederreiter, “Introduction to Finite Fields and Their Applications,” Cambridge University Press, 1994.
[2] I. S. Reed and G. Solomon, “Polynomial Codes over Certain Finite Fields,” SIAM J. Appl. Math., pp. 300–304, 1960.
[3] National Institute of Standards and Technology, “Digital Signature Standard,” FIPS 186-2, January 2000.
[4] IEEE Std 1363-2000, “IEEE Standard Specifications for Public-Key Cryptography,” January 2000.
[5] D. Boneh and M. K. Franklin, “Identity-Based Encryption from the Weil Pairing,” SIAM Journal on Computing, vol. 32, no. 3, pp. 586–615, 2003.
[6] D. Boneh, B. Lynn, and H. Shacham, “Short Signatures from the Weil Pairing,” Journal of Cryptology, vol. 17, no. 4, pp. 297–319, 2004.
[7] C.S. Yeh, S. Reed, and T.K. Truong, “Systolic Multipliers for Finite Fields $GF(2^m)$,” IEEE Trans. Computers, vol. 33, no. 4, pp. 357–360, Apr. 1984.
[8] C.L. Wang, “Bit-Level Systolic Array for Fast Exponentiation in $GF(2^m)$,” IEEE Trans. Computers, vol. 43, no. 7, pp. 838–841, July 1994.
[9] C.-Y. Lee, C. W. Chiou, and J.-M. Lin, “A Unified Parallel Systolic Multiplier over $GF(2^m)$,” Journal of Computer Science and Technology, vol. 22, no. 1, pp. 28–38, Jan. 2007.
[10] C.-Y. Lee, J.-S. Horng, and I-C. Jou, “Low-complexity bit-parallel systolic Montgomery multipliers for special classes of $GF(2^m)$,” IEEE Trans. Computers, vol. 54, no. 9, pp. 1061–1070, Sep. 2005.
[11] P.K. Meher, “Systolic and Non-Systolic Scalable Modular Designs of Finite Field Multipliers for Reed-Solomon Codec,” IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 17, no. 6, pp. 747–757, Jun. 2009.
[12] L. H. Chen, P. L. Chang, C.-Y. Lee, and Y. K. Yang, “Scalable and Systolic Dual Basis Multiplier over $GF(2^m)$,” Int. Journal of Innovative Computing, Information and Control, vol. 7, no. 3, pp. 1193–1208, Mar. 2011.
[13] A. Hariri and A. Reyhani-Masoleh, “Digit-Serial Structures for the Shifted Polynomial Basis Multiplication over Binary Extension Fields,” in Proc. Intl. Workshop on Arithmetic of Finite Fields (WAIFI), ser. LNCS, vol. 5130, pp. 103–116, 2008.
[14] S. Kumar, T. Wollinger, and C. Paar, “Optimum Digit Serial $GF(2^m)$ Multipliers for Curve-Based Cryptography,” IEEE Trans. Computers, vol. 55, no. 10, pp. 1306–1311, Oct. 2006.
[15] C.-Y. Lee, C. W. Chiou, J. M. Lin, and C. C. Chang, “Scalable and Systolic Montgomery Multiplier over $GF(2^m)$ Generated by Trinomials,” IET Circuits, Devices & Systems, vol. 1, no. 6, pp. 477–484, 2007.
[16] C. H. Kim, C. P. Hong, and S. Kwon, “A digit-serial multiplier for finite field $GF(2^m)$,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 4, pp. 476–483, Apr. 2005.
[17] S. Talapatra, H. Rahaman, and S. K. Saha, “Unified Digit Serial Systolic Montgomery Multiplication Architecture for Special Classes of Polynomials over $GF(2^m)$,” Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 427–432, 2010.
[18] J. Rajski and J. Tyszer, “Primitive Polynomials over $GF(2)$ of Degree up to 660 with Uniformly Distributed Coefficients,” Journal of Electronic Testing: Theory and Applications, vol. 19, pp. 645–657, 2003.
[19] A. Karatsuba and Y. Ofman, “Multiplication of Multidigit Numbers on Automata,” Soviet Physics-Doklady (English translation), vol. 7, no. 7, pp. 595–596, 1963.
[20] P. L. Montgomery, “Five, Six, and Seven-Term Karatsuba-Like Formulae,” IEEE Trans. Computers, vol. 54, no. 3, 2006.
[21] J. Xie, P. K. Meher, and J. He, “Low-Complexity Multiplier for $GF(2^m)$ Based on All-One Polynomials,” to appear in IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
[22] P. K. Meher, “Systolic and Super-Systolic Multipliers for Finite Field $GF(2^m)$ Based on Irreducible Trinomials,” IEEE Trans. Circuits and Systems I, vol. 55, no. 4, pp. 1031–1040, 2008.
[23] C.-Y. Lee, “Super Digit-Serial Systolic Multiplier over $GF(2^m)$,” The Sixth International Conference on Genetic and Evolutionary Computing, August 25–28, 2012, Kitakyushu, Japan.
[24] J.-H. Guo and C.-L. Wang, “Digit-serial systolic multiplier for finite fields $GF(2^m)$,” IEE Proc. Comput. Digit. Tech., vol. 145, no. 2, pp. 143–148, Mar. 1998.
[25] S. Kwon, C. H. Kim, and C. P. Hong, “A systolic multiplier with LSB first algorithm over $GF(2^m)$ which is as efficient as the one with MSB first algorithm,” in Proc. Int. Symp. Circuits Syst., vol. 5, pp. 633–636, May 2003.
[26] NanGate Standard Cell Library, http://www.si2.org/openeda.si2.org/projects/nangatelib/.
[27] M. Ernst, M. Jung, F. Madlener, S. Huss, and R. Blumel, “A Reconfigurable System on Chip Implementation for Elliptic Curve Cryptography over $GF(2^n)$,” CHES 2002, LNCS 2523, pp. 381–399, 2003.
[28] L.S. Cheng, A. Miri, and T.H. Yeap, “Improved FPGA implementations of parallel Karatsuba multiplication over $GF(2^n)$,” in 23rd Biennial Symposium on Communications, 2006.
[29] H. Fan and M. A. Hasan, “A New Approach to Subquadratic Space Complexity Parallel Multipliers for Extended Binary Fields,” IEEE Trans. Computers, vol. 56, no. 2, pp. 224–233, Feb. 2007.
[30] J. A. Solinas, “Efficient Arithmetic on Koblitz Curves,” Designs, Codes and Cryptography, vol. 19, pp. 195–249, 2000.
[31] D.F. Aranha, J.-L. Beuchat, J. Detrey, and N. Estibals, “Optimal Eta Pairing on Supersingular Genus-2 Binary Hyperelliptic Curves,” in The Cryptographers' Track at the RSA Conference 2012 (CT-RSA 2012), LNCS, pp. 98–115, Springer, 2012.
[32] J.-L. Beuchat, J. Detrey, N. Estibals, E. Okamoto, and F. Rodríguez-Henríquez, “Fast Architectures for the $\eta_T$ Pairing over Small-Characteristic Supersingular Elliptic Curves,” IEEE Trans. Computers, vol. 60, no. 2, pp. 266–281, 2011.

Jeng-Shyang Pan received the B. S.
degree in Electronic Engineering from the National Taiwan University of Science and Technology in 1986, the M. S. degree in Communication Engineering from the National Chiao Tung University, Taiwan, in 1988, and the Ph.D. degree in Electrical Engineering from the University of Edinburgh, U.K., in 1996. Currently, he is a doctoral advisor at Harbin Institute of Technology and a Professor in the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Taiwan. He has published more than 400 papers, of which 110 are indexed by SCI. He is an IET Fellow (UK) and the Tainan Chapter Chair of the IEEE Signal Processing Society. He was awarded the Gold Prize in the International Micro Mechanisms Contest held in Tokyo, Japan, in 2010. He was also awarded a Gold Medal at the Pittsburgh Invention & New Product Exposition (INPEX) in 2010, a Gold Medal at the International Exhibition of Geneva Inventions in 2011, and a Gold Medal at IENA, the International “Ideas – Inventions – New Products” exhibition, Nuremberg, Germany. He was offered the Thousand-Elite-Project in China. He is on the editorial boards of the International Journal of Innovative Computing, Information and Control, LNCS Transactions on Data Hiding and Multimedia Security, and the Journal of Information Hiding and Multimedia Signal Processing. His current research interests include soft computing, robot vision and cloud computing.

Chiou-Yng Lee received the Bachelor's degree (1986) in Medical Engineering and the M.S. degree in Electronic Engineering (1992), both from Chung Yuan Christian University, Taiwan, and the Ph.D. degree in Electrical Engineering from Chang Gung University, Taiwan, in 2001. From 1988 to 2005, he was a research associate with Chunghwa Telecommunication Laboratory in Taiwan, where he joined the department of project planning. He taught courses in related fields at Ching Yun University. Currently, he is a professor in the Department of Computer Information and Network Engineering at Lunghwa University of Science and Technology. His research interests include computations in finite fields, error-control coding, signal processing, and digital transmission systems. He is a senior member of the IEEE and the IEEE Computer Society. He was also named an honor member of Phi Tau Phi in 2001.

Pramod Kumar Meher (SM'03) received the M.Sc. degree in physics and the Ph.D. degree in science from Sambalpur University, India, in 1978 and 1996, respectively. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore, and an Adjunct Professor with the School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, India. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 200 technical papers to various reputed journals and conference proceedings. Dr. Meher served as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society during 2011 and 2012, and as Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS from 2008 to 2011. Currently, he is serving as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the Journal of Circuits, Systems, and Signal Processing. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999.