FourQ on FPGA: New Hardware Speed Records for Elliptic Curve Cryptography over Large Prime Characteristic Fields K. Järvinen1 , A. Miele2 , R. Azarderakhsh3 , and P. Longa4 1 2 3 4 Aalto University Intel Corporation Rochester Institute of Technology Microsoft Research Contact: kimmo.jarvinen@aalto.fi, plonga@microsoft.com CHES 2016, Santa Barbara, CA, USA, August 17–19, 2016 Introduction FourQ: I I FourQ is a high-performance elliptic curve with very good SW performance (2–3⇥ faster than Curve25519) FourQ has been shown to offer the fastest scalar multiplications on a wide range of software platforms: I I I On several 32-bit ARM microarchitectures (SAC 2016) On several 64-bit Intel/AMD processors, low and high-end (ASIACRYPT 2015) FourQ employs four-dimensional scalar decompositions, requires extensive precomputation, complex control, etc. ) Not clear how well it suits for HW implementation FourQ on FPGA CHES 2016 2/17 Introduction Contributions: I The first FPGA-based implementations of FourQ I FourQ offers 2–2.5⇥ faster performance than Curve25519 I Speed-area tradeoff is the primary optimization goal I Protected against timing and SPA attacks I We present three implementations: single-core, multi-core, and Montgomery ladder variant FourQ on FPGA CHES 2016 3/17 FourQ Costello, Longa, ASIACRYPT’15 E/Fp2 : x2 + y 2 = 1 + dx2 y 2 I Twisted Edwards curve with #E(Fp2 ) = 392 · ⇠ where ⇠ is a 246-bit prime I Defined over Fp2 with the Mersenne prime p = 2127 I Complete addition formulas over extended twisted Edwards coordinates (Hisil et al. ASIACRYPT’08) 1 FourQ on FPGA CHES 2016 4/17 FourQ Costello, Longa, ASIACRYPT’15 E/Fp2 : x2 + y 2 = 1 + dx2 y 2 I Twisted Edwards curve with #E(Fp2 ) = 392 · ⇠ where ⇠ is a 246-bit prime I Defined over Fp2 with the Mersenne prime p = 2127 I Complete addition formulas over extended twisted Edwards coordinates (Hisil et al. ASIACRYPT’08) I Two efficiently-computable endomorphisms I Four-dimensional decomposition for the 256-bit scalar m with (a1 , a2 , a3 , a4 ) such that ai 2 [0, 264 ): 1 and [m]P = [a1 ]P + [a2 ] (P ) + [a3 ] (P ) + [a4 ] ( (P )) FourQ on FPGA CHES 2016 4/17 Scalar Multiplication 1 2 3 4 5 6 Input: Point P , integer m 2 [0, 2256 ) Output: [m]P Decompose and recode m Precompute lookup table T Q T [v64 ] for i = 63 to 0 do Q [2]Q Q Q + mi T [vi ] FourQ on FPGA CHES 2016 5/17 Scalar Multiplication 1 2 3 4 5 6 Input: Point P , integer m 2 [0, 2256 ) Output: [m]P Decompose and recode m Precompute lookup table T Q T [v64 ] for i = 63 to 0 do Q [2]Q Q Q + mi T [vi ] Scalar decompose and recode I Decompose to a multi-scalar (a1 , a2 , a3 , a4 ) I Sign-aligned so that a1 [j] 2 {±1} and ai [j] 2 {0, a1 [j]} for 2 j 4 I Recode to signs mi 2 { 1, 1} and values vi 2 [0, 7] (point index) FourQ on FPGA CHES 2016 5/17 Scalar Multiplication 1 2 3 4 5 6 Input: Point P , integer m 2 [0, 2256 ) Output: [m]P Decompose and recode m Precompute lookup table T Q T [v64 ] for i = 63 to 0 do Q [2]Q Q Q + mi T [vi ] Precomputation I I I Precompute 8 points: T [u] = P + [u0 ] (P ) + [u1 ] (P ) + [u2 ] ( (P )) for u = (u2 , u1 , u0 ) 2 [0, 7] Store them with 5 coordinates (X + Y, Y X, 2Z, 2dT, 2dT ) ) +T [u] : (X + Y, Y X, 2Z, 2dT ) T [u] : (Y X, X + Y, 2Z, 2dT ) 68M + 27S and several additions FourQ on FPGA CHES 2016 5/17 Scalar Multiplication 1 2 3 4 5 6 Input: Point P , integer m 2 [0, 2256 ) Output: [m]P Decompose and recode m Precompute lookup table T Q T [v64 ] for i = 63 to 0 do Q [2]Q Q Q + mi T [vi ] Main for-loop I Fully regular and constant-time I Only 64 double-and-adds I Doubling: (X, Y, Z, Ta , Tb ) I (X, Y, Z) Addition: (X, Y, Z, Ta , Tb ) (X, Y, Z, Ta , Tb ) ⇥ (X + Y, Y X, 2Z, 2dT ) FourQ on FPGA CHES 2016 5/17 General Architecture Scalar Decomposition and Recoding Unit I Decomposes and recodes the scalar I Mainly multiplications with constants Field Arithmetic Unit (“the core”) I Precomputation and the main for-loop I Highly optimized for Fp with the Mersenne prime FourQ on FPGA CHES 2016 6/17 Scalar Unit I I Decomposition is computed with a truncated multiplier (mainly multiplications with constants) The main component is a 17⇥264-bit row multiplier built by using 11 DSPs I Recoding is bit manipulations and 64-bit additions I Outputs (m0 , v0 ) first, scalar multiplication begins with (m64 , v64 ) ) Store in a LIFO buffer Y X 264 195 17 264 17⇥264-bit multiplier 264 FSM 281 + 281 17 64 ZH 64 ZL FourQ on FPGA CHES 2016 7/17 Field Arithmetic Unit commands, responses do di 64 2 64 Interface logic 16 127 2 Dual-port RAM 18 127 Control 127 16 127 Datapath FourQ on FPGA CHES 2016 8/17 Field Arithmetic Unit commands, responses do di 64 2 64 Interface logic 16 256 ⇥ 127-bit RAM (128 Fp2 elements) 4 BRAM 127 2 Dual-port RAM 18 127 Control 127 16 127 Datapath FourQ on FPGA CHES 2016 8/17 Field Arithmetic Unit commands, responses do di 64 2 64 Interface logic 16 127-bit datapath, optimized for p = 2127 1 127 2 Dual-port RAM 18 127 Control 127 16 127 Datapath FourQ on FPGA CHES 2016 8/17 Field Arithmetic Unit commands, responses do di 64 2 64 FSM + Program ROM (6 BRAMs) Interface logic 16 127 2 Dual-port RAM 18 127 Control 127 16 127 Datapath FourQ on FPGA CHES 2016 8/17 Field Arithmetic Unit: Datapath 128 63 64 127 63 64 64 64 64 ⇥ 64-bit multiplier (pipelined) + 129 128 127 127 c a b 127 127 1 127 127 127 +/ 127 0 r 127 0 1 c FourQ on FPGA CHES 2016 9/17 Field Arithmetic Unit: Datapath Multiplier path 128 63 64 127 63 64 64 64 64 ⇥ 64-bit multiplier (pipelined) + 129 128 127 127 c a b 127 127 1 127 127 127 +/ 127 0 r 127 0 1 c FourQ on FPGA CHES 2016 9/17 Field Arithmetic Unit: Datapath 128 63 64 127 63 64 64 64 64 ⇥ 64-bit multiplier (pipelined) + 129 128 127 127 c a b 127 127 1 127 127 127 +/ 127 0 r 127 0 1 c Adder path FourQ on FPGA CHES 2016 9/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) Multiplier pipeline Adders Dual-port RAM Input regs FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) R R 1 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) 2 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) 3 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) R R 4 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) R R 5 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) 6 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + 7 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) R % R 8 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) W + 9 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) % 10 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) clr W 11 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + 12 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # 13 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + 14 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 R a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + R 15 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # 16 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + 17 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) clr + 18 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + % 19 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # W 20 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + 21 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 R a 1 · b1 ) + R (1) 22 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # (2) 23 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + (3) 24 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) clr R R + (4) 25 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 R a 1 · b1 ) + % R (5) 26 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # W (6) 27 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + + (7) 28 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 R a 1 · b1 ) + % R (8) 29 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # W + (9) 30 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 R a 1 · b1 ) + % R (10) 31 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) clr W + (11) 32 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + % (12) 33 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # W (13) 34 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + (14) 35 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 R a 1 · b1 ) + % R (15) 36 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # W (16) 37 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 R a 1 · b1 ) + R (17) 38 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) clr + (18) 39 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + % R (19) 40 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # W (20) 41 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + % (21) 42 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 R a 1 · b1 ) + R (1,22) 43 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) # % (2,23) 44 FourQ on FPGA CHES 2016 10/17 Example: Multiplication in Fp2 3 multiplications, 2 additions and 3 subtractions in Fp : a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 ) = (a0 · b0 a1 · b1 , (a0 + a1 ) · (b0 + b1 ) a 0 · b0 a 1 · b1 ) + W (3,24) 45 FourQ on FPGA CHES 2016 10/17 Latencies Field operations Addition Multiplication Squaring Inversion in Fp 6 (2) clocks 20 (7) clocks 20 (7) clocks 2760 clocks in Fp2 8 (4) clocks 38/45 (31/21) clocks 28 (16) clocks 2817 clocks In practice, almost all additions in parallel with multiplications FourQ on FPGA CHES 2016 11/17 Latencies Field operations Addition Multiplication Squaring Inversion in Fp 6 (2) clocks 20 (7) clocks 20 (7) clocks 2760 clocks in Fp2 8 (4) clocks 38/45 (31/21) clocks 28 (16) clocks 2817 clocks In practice, almost all additions in parallel with multiplications Operations for scalar multiplication Precomputation Scalar decomposition and recoding Double-and-add (64 times) Affine conversion Scalar multiplication 4185 clocks 1984 (0) clocks 354 clocks 2869 clocks 29739 clocks FourQ on FPGA CHES 2016 11/17 Multi-Core Architecture di, do commands responses Read/write control Scalar unit LIFO 1 LIFO 2 LIFO 3 LIFO 4 LIFO N Core 1 Core 2 Core 3 Core 4 Core N FourQ on FPGA CHES 2016 12/17 Area Results on Zynq-7020 Single-Core Architecture 100% 53,200 106,400 13,300 140 220 80% 60% 40% 20% 7.9 % 0% 12.7 % 4,217 4.1 % 4,413 1,691 LUTs Regs Slices 7.1 % 10 BRAMs 12.3 % 27 DSPs FourQ on FPGA CHES 2016 13/17 Area Results on Zynq-7020 Single-Core Architecture 100% 53,200 106,400 13,300 140 220 80% 60% 40% 20% 7.9 % 0% 12.7 % 7.1 % 12.3 % 6.3 % 4.1 % 2.6 % 9.2 % 0.0 % 5.0 % LUTs Regs Slices BRAMs DSPs Total Scalar unit FourQ on FPGA CHES 2016 13/17 Area Results on Zynq-7020 Multi-Core Architecture (N = 11) 100% 53,200 106,400 13,300 140 220 85.0 % 78.6 % 80% 187 110 60% 42.8 % 40% 5,697 25.6 % 20% 13,595 7.9 % 0% 19.7 % 20,924 12.7 % 7.1 % 12.3 % 6.3 % 4.1 % 2.6 % 9.2 % 0.0 % 5.0 % LUTs Regs Slices BRAMs DSPs Total Scalar unit FourQ on FPGA CHES 2016 13/17 Area Results on Zynq-7020 Multi-Core Architecture (N = 11) 100% 53,200 106,400 13,300 140 220 85.0 % 78.6 % 80% 187 110 60% 42.8 % 40% 5,697 25.6 % 20% 13,595 7.9 % 0% 19.7 % 20,924 12.7 % 7.1 % 12.3 % 6.3 % 6.4 4.1 % 2.6 % 2.8 9.2 % 9.0 0.0 % 5.0 % LUTs Regs Slices BRAMs DSPs Total Scalar unit FourQ on FPGA CHES 2016 13/17 Performance Results on Zynq-7020 VHDL for Xilinx Zynq-7020 with Vivado 2015.4 I One scalar multiplication takes 29,739 clock cycles I Single-core: 190 MHz ) 157 µs or 6,389 ops I I Multi-core: 175 MHz (⇥11) ) 170 µs or 64,730 ops Point validation (124 clocks), cofactor killing (1760 clocks) FourQ on FPGA CHES 2016 14/17 Performance Results on Zynq-7020 VHDL for Xilinx Zynq-7020 with Vivado 2015.4 I One scalar multiplication takes 29,739 clock cycles I Single-core: 190 MHz ) 157 µs or 6,389 ops I I Multi-core: 175 MHz (⇥11) ) 170 µs or 64,730 ops Point validation (124 clocks), cofactor killing (1760 clocks) Variant using Montgomery ladder I No scalar unit (saves 11 DSPs), no precomputations, simpler control, etc. I 522 slices, 7 BRAMs, 16 DSP I 58967 clocks at 190 MHz ) 310 µs or 3,222 ops FourQ on FPGA CHES 2016 14/17 Comparison I Many implementations for ECC over prime fields I Comparison is extremely difficult because of different FPGAs, different optimization goals, etc. I Best match with Sasdrich & Güneysu’s Curve25519 design, both on Xilinx Zynq-7020 I See the paper for further comparisons FourQ on FPGA CHES 2016 15/17 FourQ vs. Curve25519 Single-Core Architectures 25% 13,300 140 220 10,000 8,000 20% 15% 10% 2.54 ⇥ 6389 12.7 % 1.88 ⇥ 12.3 % 1691 27 7.7 % 1029 5% 7.1 % 236.6 9.1 % 6,000 4,000 20 10 2519 126.0 2,000 1.4 % 0% 2 Slices BRAMs Our FourQ DSPs Throughput Tput/DSP 0 Sasdrich & Güneysu’s Curve25519 FourQ on FPGA CHES 2016 16/17 FourQ vs. Curve25519 Montgomery Ladder 25% 13,300 140 220 10,000 20% 8,000 15% 6,000 10% 5% 7.7 % 4.2 % 1029 565 0% 9.1 % 7.3 % 16 5.0 % 7 20 1.60 ⇥ 1.28 ⇥ 201.4 3222 2519 126.0 BRAMs Our FourQ 2,000 1.4 % 2 Slices 4,000 DSPs Throughput Tput/DSP 0 Sasdrich & Güneysu’s Curve25519 FourQ on FPGA CHES 2016 16/17 FourQ vs. Curve25519 Multi-Core Architectures (N = 11) 100% 13,300 140 220 100,000 220 85.0 % 84.8 % 80% 11277 78.6 % 80,000 187 110 2.00 ⇥ 60% 64730 2.36 ⇥ 42.8 % 40% 346.2 60,000 40,000 5697 32304 20% 0% 146.8 10.0 % 22 Slices BRAMs Our FourQ DSPs Throughput Tput/DSP 20,000 0 Sasdrich & Güneysu’s Curve25519 FourQ on FPGA CHES 2016 16/17 Conclusions I We showed that FourQ is very efficient also on FPGAs I FourQ is significantly more efficient in terms of speed-area ratio than the closest counterpart FourQ on FPGA CHES 2016 17/17 Conclusions I We showed that FourQ is very efficient also on FPGAs I FourQ is significantly more efficient in terms of speed-area ratio than the closest counterpart Future Work I Low-latency implementation I Better side-channel protection: e.g., against DPA and advanced horizontal attacks FourQ on FPGA CHES 2016 17/17 Conclusions I We showed that FourQ is very efficient also on FPGAs I FourQ is significantly more efficient in terms of speed-area ratio than the closest counterpart Future Work I Low-latency implementation I Better side-channel protection: e.g., against DPA and advanced horizontal attacks Thank you! Questions? FourQ on FPGA CHES 2016 17/17