FourQ on FPGA: New Hardware Speed Records for Elliptic Curve

advertisement
FourQ on FPGA:
New Hardware Speed Records for Elliptic Curve
Cryptography over Large Prime Characteristic Fields
K. Järvinen1 , A. Miele2 , R. Azarderakhsh3 , and P. Longa4
1
2
3
4
Aalto University
Intel Corporation
Rochester Institute of Technology
Microsoft Research
Contact: kimmo.jarvinen@aalto.fi, plonga@microsoft.com
CHES 2016, Santa Barbara, CA, USA, August 17–19, 2016
Introduction
FourQ:
I
I
FourQ is a high-performance elliptic curve with very good
SW performance (2–3⇥ faster than Curve25519)
FourQ has been shown to offer the fastest scalar
multiplications on a wide range of software platforms:
I
I
I
On several 32-bit ARM microarchitectures (SAC 2016)
On several 64-bit Intel/AMD processors, low and high-end
(ASIACRYPT 2015)
FourQ employs four-dimensional scalar decompositions,
requires extensive precomputation, complex control, etc.
) Not clear how well it suits for HW implementation
FourQ on FPGA
CHES 2016
2/17
Introduction
Contributions:
I
The first FPGA-based implementations of FourQ
I
FourQ offers 2–2.5⇥ faster performance than Curve25519
I
Speed-area tradeoff is the primary optimization goal
I
Protected against timing and SPA attacks
I
We present three implementations:
single-core, multi-core, and Montgomery ladder variant
FourQ on FPGA
CHES 2016
3/17
FourQ
Costello, Longa, ASIACRYPT’15
E/Fp2 :
x2 + y 2 = 1 + dx2 y 2
I
Twisted Edwards curve with #E(Fp2 ) = 392 · ⇠
where ⇠ is a 246-bit prime
I
Defined over Fp2 with the Mersenne prime p = 2127
I
Complete addition formulas over extended twisted
Edwards coordinates (Hisil et al. ASIACRYPT’08)
1
FourQ on FPGA
CHES 2016
4/17
FourQ
Costello, Longa, ASIACRYPT’15
E/Fp2 :
x2 + y 2 = 1 + dx2 y 2
I
Twisted Edwards curve with #E(Fp2 ) = 392 · ⇠
where ⇠ is a 246-bit prime
I
Defined over Fp2 with the Mersenne prime p = 2127
I
Complete addition formulas over extended twisted
Edwards coordinates (Hisil et al. ASIACRYPT’08)
I
Two efficiently-computable endomorphisms
I
Four-dimensional decomposition for the 256-bit scalar m
with (a1 , a2 , a3 , a4 ) such that ai 2 [0, 264 ):
1
and
[m]P = [a1 ]P + [a2 ] (P ) + [a3 ] (P ) + [a4 ] ( (P ))
FourQ on FPGA
CHES 2016
4/17
Scalar Multiplication
1
2
3
4
5
6
Input: Point P ,
integer m 2 [0, 2256 )
Output: [m]P
Decompose and recode m
Precompute lookup table T
Q
T [v64 ]
for i = 63 to 0 do
Q
[2]Q
Q
Q + mi T [vi ]
FourQ on FPGA
CHES 2016
5/17
Scalar Multiplication
1
2
3
4
5
6
Input: Point P ,
integer m 2 [0, 2256 )
Output: [m]P
Decompose and recode m
Precompute lookup table T
Q
T [v64 ]
for i = 63 to 0 do
Q
[2]Q
Q
Q + mi T [vi ]
Scalar decompose and recode
I
Decompose to a multi-scalar
(a1 , a2 , a3 , a4 )
I
Sign-aligned so that a1 [j] 2 {±1}
and ai [j] 2 {0, a1 [j]} for 2  j  4
I
Recode to signs mi 2 { 1, 1}
and values vi 2 [0, 7] (point index)
FourQ on FPGA
CHES 2016
5/17
Scalar Multiplication
1
2
3
4
5
6
Input: Point P ,
integer m 2 [0, 2256 )
Output: [m]P
Decompose and recode m
Precompute lookup table T
Q
T [v64 ]
for i = 63 to 0 do
Q
[2]Q
Q
Q + mi T [vi ]
Precomputation
I
I
I
Precompute 8 points: T [u] = P +
[u0 ] (P ) + [u1 ] (P ) + [u2 ] ( (P ))
for u = (u2 , u1 , u0 ) 2 [0, 7]
Store them with 5 coordinates
(X + Y, Y X, 2Z, 2dT, 2dT ) )
+T [u] : (X + Y, Y X, 2Z, 2dT )
T [u] : (Y X, X + Y, 2Z, 2dT )
68M + 27S and several additions
FourQ on FPGA
CHES 2016
5/17
Scalar Multiplication
1
2
3
4
5
6
Input: Point P ,
integer m 2 [0, 2256 )
Output: [m]P
Decompose and recode m
Precompute lookup table T
Q
T [v64 ]
for i = 63 to 0 do
Q
[2]Q
Q
Q + mi T [vi ]
Main for-loop
I
Fully regular and constant-time
I
Only 64 double-and-adds
I
Doubling:
(X, Y, Z, Ta , Tb )
I
(X, Y, Z)
Addition:
(X, Y, Z, Ta , Tb )
(X, Y, Z, Ta , Tb ) ⇥
(X + Y, Y X, 2Z, 2dT )
FourQ on FPGA
CHES 2016
5/17
General Architecture
Scalar Decomposition and Recoding Unit
I
Decomposes and recodes the scalar
I
Mainly multiplications with constants
Field Arithmetic Unit (“the core”)
I
Precomputation and the main for-loop
I
Highly optimized for Fp with the Mersenne prime
FourQ on FPGA
CHES 2016
6/17
Scalar Unit
I
I
Decomposition is computed
with a truncated multiplier
(mainly multiplications with
constants)
The main component is a
17⇥264-bit row multiplier built
by using 11 DSPs
I
Recoding is bit manipulations
and 64-bit additions
I
Outputs (m0 , v0 ) first, scalar
multiplication begins with
(m64 , v64 )
) Store in a LIFO buffer
Y
X
264
195
17
264
17⇥264-bit multiplier
264
FSM
281
+
281
17
64
ZH
64
ZL
FourQ on FPGA
CHES 2016
7/17
Field Arithmetic Unit
commands,
responses
do
di
64
2
64
Interface logic
16
127
2
Dual-port
RAM
18
127
Control
127
16
127
Datapath
FourQ on FPGA
CHES 2016
8/17
Field Arithmetic Unit
commands,
responses
do
di
64
2
64
Interface logic
16
256 ⇥ 127-bit RAM
(128 Fp2 elements)
4 BRAM
127
2
Dual-port
RAM
18
127
Control
127
16
127
Datapath
FourQ on FPGA
CHES 2016
8/17
Field Arithmetic Unit
commands,
responses
do
di
64
2
64
Interface logic
16
127-bit datapath,
optimized for
p = 2127 1
127
2
Dual-port
RAM
18
127
Control
127
16
127
Datapath
FourQ on FPGA
CHES 2016
8/17
Field Arithmetic Unit
commands,
responses
do
di
64
2
64
FSM + Program ROM
(6 BRAMs)
Interface logic
16
127
2
Dual-port
RAM
18
127
Control
127
16
127
Datapath
FourQ on FPGA
CHES 2016
8/17
Field Arithmetic Unit: Datapath
128
63
64
127
63
64
64
64
64 ⇥ 64-bit
multiplier
(pipelined)
+
129
128
127
127
c
a
b
127
127
1
127
127
127
+/
127
0
r
127
0 1
c
FourQ on FPGA
CHES 2016
9/17
Field Arithmetic Unit: Datapath
Multiplier path
128
63
64
127
63
64
64
64
64 ⇥ 64-bit
multiplier
(pipelined)
+
129
128
127
127
c
a
b
127
127
1
127
127
127
+/
127
0
r
127
0 1
c
FourQ on FPGA
CHES 2016
9/17
Field Arithmetic Unit: Datapath
128
63
64
127
63
64
64
64
64 ⇥ 64-bit
multiplier
(pipelined)
+
129
128
127
127
c
a
b
127
127
1
127
127
127
+/
127
0
r
127
0 1
c
Adder path
FourQ on FPGA
CHES 2016
9/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
Multiplier pipeline
Adders
Dual-port RAM
Input regs
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
R
R
1
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
2
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
3
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
R
R
4
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
R
R
5
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
6
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
7
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
R
%
R
8
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
W
+
9
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
%
10
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
clr
W
11
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
12
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
13
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
14
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
R
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
R
15
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
16
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
17
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
clr
+
18
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
%
19
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
W
20
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
21
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
R
a 1 · b1 )
+
R
(1)
22
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
(2)
23
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
(3)
24
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
clr
R
R
+
(4)
25
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
R
a 1 · b1 )
+
%
R
(5)
26
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
W
(6)
27
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
+
(7)
28
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
R
a 1 · b1 )
+
%
R
(8)
29
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
W
+
(9)
30
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
R
a 1 · b1 )
+
%
R
(10)
31
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
clr
W
+
(11)
32
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
%
(12)
33
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
W
(13)
34
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
(14)
35
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
R
a 1 · b1 )
+
%
R
(15)
36
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
W
(16)
37
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
R
a 1 · b1 )
+
R
(17)
38
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
clr
+
(18)
39
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
%
R
(19)
40
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
W
(20)
41
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
%
(21)
42
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
R
a 1 · b1 )
+
R
(1,22)
43
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
#
%
(2,23)
44
FourQ on FPGA
CHES 2016
10/17
Example: Multiplication in Fp2
3 multiplications, 2 additions and 3 subtractions in Fp :
a ⇥ b = (a0 , a1 ) ⇥ (b0 , b1 )
= (a0 · b0
a1 · b1 , (a0 + a1 ) · (b0 + b1 )
a 0 · b0
a 1 · b1 )
+
W
(3,24)
45
FourQ on FPGA
CHES 2016
10/17
Latencies
Field operations
Addition
Multiplication
Squaring
Inversion
in Fp
6 (2) clocks
20 (7) clocks
20 (7) clocks
2760 clocks
in Fp2
8 (4) clocks
38/45 (31/21) clocks
28 (16) clocks
2817 clocks
In practice, almost all additions in parallel with multiplications
FourQ on FPGA
CHES 2016
11/17
Latencies
Field operations
Addition
Multiplication
Squaring
Inversion
in Fp
6 (2) clocks
20 (7) clocks
20 (7) clocks
2760 clocks
in Fp2
8 (4) clocks
38/45 (31/21) clocks
28 (16) clocks
2817 clocks
In practice, almost all additions in parallel with multiplications
Operations for scalar multiplication
Precomputation
Scalar decomposition and recoding
Double-and-add (64 times)
Affine conversion
Scalar multiplication
4185 clocks
1984 (0) clocks
354 clocks
2869 clocks
29739 clocks
FourQ on FPGA
CHES 2016
11/17
Multi-Core Architecture
di, do
commands responses
Read/write control
Scalar unit
LIFO 1
LIFO 2
LIFO 3
LIFO 4
LIFO N
Core 1
Core 2
Core 3
Core 4
Core N
FourQ on FPGA
CHES 2016
12/17
Area Results on Zynq-7020
Single-Core Architecture
100%
53,200
106,400
13,300
140
220
80%
60%
40%
20%
7.9 %
0%
12.7 %
4,217
4.1 %
4,413
1,691
LUTs
Regs
Slices
7.1 %
10
BRAMs
12.3 %
27
DSPs
FourQ on FPGA
CHES 2016
13/17
Area Results on Zynq-7020
Single-Core Architecture
100%
53,200
106,400
13,300
140
220
80%
60%
40%
20%
7.9 %
0%
12.7 %
7.1 %
12.3 %
6.3 %
4.1 %
2.6 %
9.2 %
0.0 %
5.0 %
LUTs
Regs
Slices
BRAMs
DSPs
Total
Scalar unit
FourQ on FPGA
CHES 2016
13/17
Area Results on Zynq-7020
Multi-Core Architecture (N = 11)
100%
53,200
106,400
13,300
140
220
85.0 %
78.6 %
80%
187
110
60%
42.8 %
40%
5,697
25.6 %
20%
13,595
7.9 %
0%
19.7 %
20,924
12.7 %
7.1 %
12.3 %
6.3 %
4.1 %
2.6 %
9.2 %
0.0 %
5.0 %
LUTs
Regs
Slices
BRAMs
DSPs
Total
Scalar unit
FourQ on FPGA
CHES 2016
13/17
Area Results on Zynq-7020
Multi-Core Architecture (N = 11)
100%
53,200
106,400
13,300
140
220
85.0 %
78.6 %
80%
187
110
60%
42.8 %
40%
5,697
25.6 %
20%
13,595
7.9 %
0%
19.7 %
20,924
12.7 %
7.1 %
12.3 %
6.3 %
6.4
4.1 %
2.6 %
2.8
9.2 %
9.0
0.0 %
5.0 %
LUTs
Regs
Slices
BRAMs
DSPs
Total
Scalar unit
FourQ on FPGA
CHES 2016
13/17
Performance Results on Zynq-7020
VHDL for Xilinx Zynq-7020 with Vivado 2015.4
I
One scalar multiplication takes 29,739 clock cycles
I
Single-core: 190 MHz ) 157 µs or 6,389 ops
I
I
Multi-core: 175 MHz (⇥11) ) 170 µs or 64,730 ops
Point validation (124 clocks), cofactor killing (1760 clocks)
FourQ on FPGA
CHES 2016
14/17
Performance Results on Zynq-7020
VHDL for Xilinx Zynq-7020 with Vivado 2015.4
I
One scalar multiplication takes 29,739 clock cycles
I
Single-core: 190 MHz ) 157 µs or 6,389 ops
I
I
Multi-core: 175 MHz (⇥11) ) 170 µs or 64,730 ops
Point validation (124 clocks), cofactor killing (1760 clocks)
Variant using Montgomery ladder
I
No scalar unit (saves 11 DSPs), no precomputations,
simpler control, etc.
I
522 slices, 7 BRAMs, 16 DSP
I
58967 clocks at 190 MHz ) 310 µs or 3,222 ops
FourQ on FPGA
CHES 2016
14/17
Comparison
I
Many implementations for ECC over prime fields
I
Comparison is extremely difficult because of different
FPGAs, different optimization goals, etc.
I
Best match with Sasdrich & Güneysu’s Curve25519
design, both on Xilinx Zynq-7020
I
See the paper for further comparisons
FourQ on FPGA
CHES 2016
15/17
FourQ vs. Curve25519
Single-Core Architectures
25%
13,300
140
220
10,000
8,000
20%
15%
10%
2.54 ⇥
6389
12.7 %
1.88 ⇥
12.3 %
1691
27
7.7 %
1029
5%
7.1 %
236.6
9.1 %
6,000
4,000
20
10
2519
126.0
2,000
1.4 %
0%
2
Slices
BRAMs
Our FourQ
DSPs
Throughput Tput/DSP
0
Sasdrich & Güneysu’s Curve25519
FourQ on FPGA
CHES 2016
16/17
FourQ vs. Curve25519
Montgomery Ladder
25%
13,300
140
220
10,000
20%
8,000
15%
6,000
10%
5%
7.7 %
4.2 %
1029
565
0%
9.1 %
7.3 %
16
5.0 %
7
20
1.60 ⇥
1.28 ⇥
201.4
3222
2519
126.0
BRAMs
Our FourQ
2,000
1.4 %
2
Slices
4,000
DSPs
Throughput Tput/DSP
0
Sasdrich & Güneysu’s Curve25519
FourQ on FPGA
CHES 2016
16/17
FourQ vs. Curve25519
Multi-Core Architectures (N = 11)
100%
13,300
140
220
100,000
220
85.0 %
84.8 %
80%
11277
78.6 %
80,000
187
110
2.00 ⇥
60%
64730
2.36 ⇥
42.8 %
40%
346.2
60,000
40,000
5697
32304
20%
0%
146.8
10.0 %
22
Slices
BRAMs
Our FourQ
DSPs
Throughput Tput/DSP
20,000
0
Sasdrich & Güneysu’s Curve25519
FourQ on FPGA
CHES 2016
16/17
Conclusions
I
We showed that FourQ is very efficient also on FPGAs
I
FourQ is significantly more efficient in terms of speed-area
ratio than the closest counterpart
FourQ on FPGA
CHES 2016
17/17
Conclusions
I
We showed that FourQ is very efficient also on FPGAs
I
FourQ is significantly more efficient in terms of speed-area
ratio than the closest counterpart
Future Work
I
Low-latency implementation
I
Better side-channel protection:
e.g., against DPA and advanced horizontal attacks
FourQ on FPGA
CHES 2016
17/17
Conclusions
I
We showed that FourQ is very efficient also on FPGAs
I
FourQ is significantly more efficient in terms of speed-area
ratio than the closest counterpart
Future Work
I
Low-latency implementation
I
Better side-channel protection:
e.g., against DPA and advanced horizontal attacks
Thank you! Questions?
FourQ on FPGA
CHES 2016
17/17
Download