Unified Architectures for Efficient and Compact Crypto-Processing
Erkay Savaş
Sabancı University
7/27/2016

Outline
- Research Motivation
- Public Key Cryptography
- Unified Arithmetic
- High-Radix Multiplication
- Dual-Radix Multiplication
- Support for GF(3^n) Arithmetic
- Implementation Results
- Future Research

Motivation
- Compatibility
- Savings in area
- Support for fast arithmetic in different finite fields and groups
- Improve the {time × area} metric
- Algorithm agility: NTRU, ECC

Public Key Cryptography (PKC)
- Each user has a pair of keys:
  - Private key: known only to the owner
  - Public key: known to everyone in the system, with assurance
- Encryption: performed with the public key of the receiver
- Decryption: only the receiver can decrypt the message, using her/his private key

Public Key Cryptography in Use
- Integer factorization, square roots modulo a composite number: RSA, Rabin's scheme
- Discrete logarithm based algorithms: Diffie-Hellman key exchange, ElGamal
- Discrete logarithm over elliptic curves: elliptic curve DH key exchange, ECDSA
- Pairings over elliptic curve points: IBE

RSA
- Most popular PKC
- Invented by Rivest, Shamir, and Adleman in 1977 at MIT
- Its patent expired in 2000
- Based on the integer factorization problem
- Each user has a public and private key pair

RSA Encryption & Decryption
- Encryption is done using the public key: $y = x^e \bmod n$, where x, y < n
- Decryption is done using the private key: $x = y^d \bmod n$

DL Based Cryptosystems
- Fundamental operation: $g^x \bmod p$, where x, g < p and g is primitive

Elliptic Curve Cryptography 1/2
- Emerging public key cryptography standard for constrained devices
- A 160-bit ECC key is equivalent in cryptographic strength to 1024-bit RSA
- 313-bit ECC is equivalent to 4096-bit RSA
- Rich and deep theory suitable for cryptography
- As algebraic/geometric entities, elliptic curves have been studied extensively for the past 150 years
- First proposed for cryptographic use in 1985, independently by Neal Koblitz and Victor Miller

Elliptic Curve Cryptography 2/2
- Dominant fundamental operation: multiplication in GF(q), where q = p^k and p is prime
- Alternatives:
  - GF(p): k = 1
  - GF(2^k): p = 2
  - GF(p^k)
  - GF(3^k): p = 3

Identity Based Encryption (IBE)
- The public key can be any string: e-mail address, name, etc.
- No need for certificates
- Anonymity is achieved: users can choose any public key without revealing their ID, and can easily change it

IBE: Bilinear Mapping
- $e(xP, yQ) = e(P, Q)^{xy} = e(yP, xQ) = g$, where g is in (an extension of) the underlying field
- Bilinear mappings over elliptic curves:
  - Weil pairing
  - Tate pairing
- Resource consuming
- The most efficient bilinear mappings are defined on curves over GF(3^k)

An Introduction to Unified Arithmetic
- Three types of finite fields are heavily used:
  1. Prime fields, GF(p)
  2. Binary extension fields, GF(2^k)
  3. Ternary extension fields, GF(3^k) (recently, due to IBE schemes)
- These finite fields have dissimilar properties
- Different implementations on specialized hardware

Unified Arithmetic
A unified hardware design methodology requires:
1. A single (unified) datapath
2. A single (unified) control
3. Insignificant overhead in area
4. Insignificant overhead in time complexity (e.g. critical path delay)
5. A good {time × area} metric

Unified Arithmetic (GF(p) + GF(2^k))
A unified hardware design methodology for both fields is possible since:
1. The elements of either field are represented using almost the same data structures in digital systems
2. The algorithms for basic arithmetic operations in both fields have structural similarities (i.e.
   the steps of the algorithms are almost identical)
- Hence, unified arithmetic is possible

Finite Field Operations in ECC
- Addition in GF(p) and GF(2^k)
  - Relatively inexpensive in area and time complexity
- Multiplicative inversion in GF(p) and GF(2^k)
  - Prohibitively expensive in terms of time
  - Possible to avoid some of them
- Multiplication in GF(p) and GF(2^k)
  - Expensive in terms of time and area
  - Usually the most important operation
  - Our focus

Montgomery Multiplication
A very efficient way of doing multiplication in GF(p) and GF(2^k) (now also in GF(3^k)):
1. Faster (replaces division by shifts)
2. Suitable for unified design
3. Suitable for scalable design
4. Highly parallel
5. Suitable for pipelining

Montgomery Multiplication
Definition: Given $a, b \in \mathrm{GF}(p)$, $\mathrm{MonMul}(a, b) = a \cdot b \cdot R^{-1} \bmod p$, where $R = 2^k \bmod p$ and $k = \lceil \log_2 p \rceil$.
Algorithm:
1. c := 0
2. for i = 0 to k-1
3.     c := c + a_i · b
4.     c := (c + c_0 · p) / 2
5. if c > p then c := c - p (final subtraction)

Algorithm for GF(2^k)
Input: $a(x), b(x) \in \mathrm{GF}(2^k)$, p(x), and k
Output: $c(x) = a(x) \cdot b(x) \cdot x^{-k} \in \mathrm{GF}(2^k)$
1. c(x) := 0
2. for i = 0 to k-1
3.     c(x) := c(x) ⊕ a_i · b(x)
4.     c(x) := (c(x) ⊕ c_0 · p(x)) / x
5. No final subtraction
Note that c/2 and c(x)/x are implemented in an identical way in SW and HW.

Representation
- Addition
  - Unified addition
  - Atomic operation: multiplication is performed as repeated addition
  - Most efficient when carry-save representation is used for elements of GF(p)
- Carry-save representation
  - An integer is represented as the sum of two other integers: x := x_s + x_c (sum and carry parts, respectively)

Scalability
- The original Montgomery multiplication algorithm performs full-precision integer additions
- Not scalable
- Instead, long integers are divided into words
- Additions of words are handled separately on word adders
- The choice of word length depends on the precision, area, and speed requirements

Word-Based Multiplication
[Figure: two adjacent processing units, PU_i and PU_{i+1}, taking multiplier bits a_i and a_{i+1} and the words b^(j), p^(j), and c^(j)]

Dependency Graph
[Figure: dependency graph of the word-based algorithm, with inputs a_0, a_1, B^(0), B^(1), p^(0), p^(1)]

Processing Unit (PU) with w = 2
[Figure: processing unit built from dual-field adders, with inputs a_i, c_0, p^(j), B^(j), C^(j) and control input FSEL]

Dual-Field Adder (DFA) 1/2
- Almost identical to a full adder (FA)
- Difference: it has an additional control input (FSEL), which suppresses the carry output of the adder when it is set to logic 0
- Namely, when FSEL = 0 the adder operates in GF(2^k) mode; otherwise it becomes a regular FA

DFA 2/2
[Figure: DFA circuit with inputs A, B, C, and FSEL and outputs S and Cout]

Pipeline Organization with two PUs
[Figure: pipeline with shift register SR-a, memories RAM-a, RAM-b, and RAM-p, processing units PU-1 and PU-2, and shift register SR-C]
Length of the shift register SR-C:
$$
L_{\text{SR-C}} = \begin{cases}
e + 2 - 2s & \text{if } (e + 2) > 2s \\
1 & \text{otherwise}
\end{cases}
$$
s: the number of PUs

Total Computation Time (in clock cycles)
$$
T = \begin{cases}
\lceil k/s \rceil \cdot 2s + e - 1 & \text{if } (e + 1) \le 2s \\
\lceil k/s \rceil \cdot (e + 1) + 2(s - 1) & \text{otherwise}
\end{cases}
$$
w: word size, k: precision, e := ⌈k/w⌉, s: the number of PUs

Example Execution Times
Example: k = 1024, w = 32
  s     T
  17    2105
  15    2305
  10    3415
  1     33792
Example: k = 2048, w = 32
  s     T
  33    4221
  30    4543
  10    13343
  1     133120

Comparison to the single-field (GF(p)) design
                          GF(p)    Unified   Overhead
  Cell area               47.2w    48.5w     2.75%
  Cell propagation time   11 ns    11 ns     0%
w: word size; 1.2 µm CMOS technology

Design Alternatives: Higher Radix
- The original design is radix 2: multiplier bits are scanned one bit per clock cycle
- It is possible to scan two or more bits of the multiplier a per clock cycle
  - Radix-4: two bits
  - Radix-8: three bits
- More complex design: lower clock frequency, larger area
- Lower clock cycle count
- Faster execution of
  multiplication

Comparison: Higher Radix vs. Single Radix
- Metric: area × time
- For a small total area (i.e. < 10,000 equivalent NAND gates), the performance of radix-2 and radix-8 is comparable
- The radix-8 multiplier outperforms the radix-2 multiplier by more than 3 times when the total area is around 25,000 NAND gates

Dual-Radix Multiplier
- Radix-2 for GF(p) and radix-4 for GF(2^k)
[Figure: dual-radix processing unit with MUX-1, MUX-2, selection logic, and a 3x2 dual-field adder; signals a_i, a_{i+1}, b_0, b_1, c_0, c_1, p_1, and FSEL]

Dual-Radix Multiplier
- Three multipliers
  - A1: GF(p)-only multiplier
  - A2: single-radix unified multiplier (with precomputation)
  - A3: dual-radix multiplier
- Performance (area × time)
  - A3 performs slightly worse than A1 and A2 (by 7% to 19%) in GF(p) mode
  - A3 outperforms A2 by 38% to 46% in GF(2^k) mode

Unified Arithmetic?
- Unified multiplier: carry-save adders are used in the multiplier
- With carry-save representation, it is not easy to perform other arithmetic operations such as subtraction and comparison (essential in inversion)

New Redundant Representation
- Recall: carry-save representation, X = x_s + x_c
- New redundant representation: redundant signed digit (RSD) representation, X = x_p - x_n
- Subtraction is equivalent to addition:
  X - Y = (x_p - x_n) - (y_p - y_n) = (x_p - x_n) + (y_n - y_p)
- Comparison is relatively easy

RSD
- All previous multipliers require a reverse transformation to non-redundant form after each multiplication
- There are thousands of multiplications in ECC
- With RSD, all the computation can be done in RSD form without any reverse transformation
- A single transformation is necessary only if the result is needed in non-redundant form
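
The RSD subtraction identity can be checked with a minimal value-level sketch in Python. It assumes each part of an RSD number is held as a non-negative integer; the names `RSD`, `to_rsd`, `negate`, `add`, `sub`, `to_int`, and `is_negative` are hypothetical and do not come from the slides. The sketch models only the arithmetic identity, not the digit-level adder that a hardware implementation would use.

```python
from typing import NamedTuple


class RSD(NamedTuple):
    """Value-level model of X = xp - xn (assumption: both parts are non-negative ints)."""
    xp: int
    xn: int


def to_rsd(x: int) -> RSD:
    # Forward transformation: a non-negative value goes in xp, a negative one in xn.
    return RSD(x, 0) if x >= 0 else RSD(0, -x)


def negate(x: RSD) -> RSD:
    # -(xp - xn) = xn - xp, so negation only swaps the two parts.
    return RSD(x.xn, x.xp)


def add(x: RSD, y: RSD) -> RSD:
    # (xp - xn) + (yp - yn) = (xp + yp) - (xn + yn).
    # Plain integer addition stands in for the hardware adder here.
    return RSD(x.xp + y.xp, x.xn + y.xn)


def sub(x: RSD, y: RSD) -> RSD:
    # X - Y = (xp - xn) + (yn - yp): subtraction is addition of the negated operand.
    return add(x, negate(y))


def to_int(x: RSD) -> int:
    # Reverse transformation to non-redundant form: a single subtraction.
    return x.xp - x.xn


def is_negative(x: RSD) -> bool:
    # Comparison against zero reduces to comparing the two parts.
    return x.xp < x.xn


if __name__ == "__main__":
    d = sub(to_rsd(13), to_rsd(20))
    print(d, to_int(d), is_negative(d))
```

For example, `sub(to_rsd(13), to_rsd(20))` yields the pair (13, 20), which converts back to -7 with one reverse transformation at the end; intermediate results never leave the redundant form.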

Support for GF(3^n) Arithmetic
- RSD lends itself to a unified arithmetic architecture that efficiently supports GF(3^n) arithmetic

Analysis
- A1: GF(p)-only architecture
- A2: GF(2^k)-only architecture
- A3: GF(3^n)-only architecture
- A4: Unified architecture (GF(p) + GF(2^k))
- A5: Unified architecture (GF(p) + GF(2^k) + GF(3^n))
- A1 + A2: hypothetical architecture that has separate datapaths for GF(p) and GF(2^k)

Analysis
Metric: area × time
- A4 over A1 + A2: 7.94%
- A5 over A1 + A2 + A3: 33.54%
- A5 over A4 + A3: 28.36%

Implementation Results
2.38 GHz, 0.13 µm CMOS
  # of PUs   160-bit ECC   1024-bit RSA   Tate pairing GF(3^97)
             ms            s              s
  4          315           21.0           508
  8          210           10.5           334
  16         189           5.25           334
  32         189           2.12           334
4 PUs: ~11,000 NAND gates; 8 PUs: ~15,000 NAND gates

Research Directions
- Embed the unified architectures into common general-purpose processors
- Unified inversion using RSD
- Unified architectures for other PKC

Ending…
Questions
Contact: Erkay Savaş
erkays@sabanciuniv.edu
http://people.sabanciuniv.edu/~erkays