Kummer strikes back: new DH speed records

advertisement
Kummer strikes back: new DH speed records
Chitchanok Chuengsatiansup
9th December 2014
Joint work with Daniel J. Bernstein, Tanja Lange, and Peter Schwabe
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
1
Diffie–Hellman Key Exchange
a
b
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
aP
b
bP
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
b
aP
a(bP)
u
bP
)
b(aP)
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
b
d
aP
a(bP)
c
u
bP
)
b(aP)
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
b
aP
a(bP)
u
bP
)
c
cP
d
dP
b(aP)
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
b
aP
a(bP)
a(cP)
c
bP
)
u
cP
d
dP
b(aP)
u
Chitchanok Chuengsatiansup
)
c(aP)
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
b
aP
a(bP)
a(cP)
c
bP
)
u
cP
d
dP
b(aP)
u
a(dP) u
Chitchanok Chuengsatiansup
)
c(aP)
* d(aP)
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
b
aP
a(bP)
a(cP)
c
bP
)
u
cP
dP
b(aP)
)
u
a(dP) u
d
b(cP)
Chitchanok Chuengsatiansup
c(aP)
c(bP)
* d(aP)
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
b
aP
a(bP)
a(cP)
c
bP
)
u
u
a(dP) u
d
cP
dP
b(aP)
b(dP)
u
b(cP)
Chitchanok Chuengsatiansup
)
c(aP)
c(bP)
)
d(bP)
* d(aP)
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
b
aP
a(bP)
a(cP)
c
bP
)
u
u
a(dP) u
cP
b(aP)
b(dP)
d
c(dP)
u
b(cP)
Chitchanok Chuengsatiansup
)
c(aP)
c(bP)
dP
)
u
)
d(cP)
d(bP)
* d(aP)
Kummer strikes back: new DH speed records
2
Diffie–Hellman Key Exchange
a
aP
a(bP)
a(cP)
u
u
a(dP) u
main DH challenge: make variable-base scalar mult as fast as possible
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
3
Diffie–Hellman Key Exchange
b
bP
)
b(aP)
b(dP)
u
b(cP)
main DH challenge: make variable-base scalar mult as fast as possible
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
3
Diffie–Hellman Key Exchange
c
cP
c(dP)
)
u
c(aP)
c(bP)
main DH challenge: make variable-base scalar mult as fast as possible
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
3
Diffie–Hellman Key Exchange
d
dP
)
)
d(cP)
d(bP)
* d(aP)
main DH challenge: make variable-base scalar mult as fast as possible
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
3
Ladder
Input: Q, n = (ni−1 , . . . , n0 )2
Output: R0 = nQ
R0 ← 0Q; R1 ← 1Q
if ni = 0 then
R0 ← 2R0 ; R1 ← R0 + R1
else
R0 ← R0 + R1 ; R1 ← 2R1
Return R0
Chitchanok Chuengsatiansup
Example: Compute 9Q
9 = 10012
(R0 , R1 ) = (0Q, 1Q)
n3 = 1 :
(1Q, 2Q)
n2 = 0 :
(2Q, 3Q)
n1 = 0 :
(4Q, 5Q)
n0 = 1 :
(9Q, 10Q)
Kummer strikes back: new DH speed records
4
Elliptic VS Hyperelliptic
Sandy Bridge at high security level and (claimed) constant time
cycles
194036, g=1
153000?, g=1
“?”: not verified by SUPERCOP
2011
2012
2013
Chitchanok Chuengsatiansup
2014
2015
Kummer strikes back: new DH speed records
5
Elliptic VS Hyperelliptic
Sandy Bridge at high security level and (claimed) constant time
cycles
194036, g=1
153000?, g=1
137000?, g=1 (GLV+GLS)
“?”: not verified by SUPERCOP
2011
2012
2013
Chitchanok Chuengsatiansup
2014
2015
Kummer strikes back: new DH speed records
5
Elliptic VS Hyperelliptic
Sandy Bridge at high security level and (claimed) constant time
cycles
194036, g=1
153000?, g=1
137000?, g=1 (GLV+GLS)
122716, g=2
“?”: not verified by SUPERCOP
2011
2012
2013
Chitchanok Chuengsatiansup
2014
2015
Kummer strikes back: new DH speed records
5
Elliptic VS Hyperelliptic
Sandy Bridge at high security level and (claimed) constant time
cycles
194036, g=1
153000?, g=1
137000?, g=1 (GLV+GLS)
122716, g=2
119904, g=1 (binary GLS)
“?”: not verified by SUPERCOP
2011
2012
2013
Chitchanok Chuengsatiansup
2014
2015
Kummer strikes back: new DH speed records
5
Elliptic VS Hyperelliptic
Sandy Bridge at high security level and (claimed) constant time
cycles
194036, g=1
153000?, g=1
137000?, g=1 (GLV+GLS)
122716, g=2
119904, g=1 (binary GLS)
96000?, g=1 (GLV+GLS)
“?”: not verified by SUPERCOP
2011
2012
2013
Chitchanok Chuengsatiansup
2014
2015
Kummer strikes back: new DH speed records
5
Elliptic VS Hyperelliptic
Sandy Bridge at high security level and (claimed) constant time
cycles
194036, g=1
153000?, g=1
137000?, g=1 (GLV+GLS)
122716, g=2
119904, g=1 (binary GLS)
96000?, g=1 (GLV+GLS)
92000?, g=1 (GLV+GLS)
“?”: not verified by SUPERCOP
2011
2012
2013
Chitchanok Chuengsatiansup
2014
2015
Kummer strikes back: new DH speed records
5
Elliptic VS Hyperelliptic
Sandy Bridge at high security level and (claimed) constant time
cycles
194036, g=1
153000?, g=1
137000?, g=1 (GLV+GLS)
122716, g=2
119904, g=1 (binary GLS)
96000?, g=1 (GLV+GLS)
92000?, g=1 (GLV+GLS)
88916, g=2
(new)
“?”: not verified by SUPERCOP
2011
2012
2013
Chitchanok Chuengsatiansup
2014
2015
Kummer strikes back: new DH speed records
5
Elliptic-Hyperelliptic Analogy
ECC
−→
x-line
represented as
(X : Z )
−→
Kummer surface
represented as
(X : Y : Z : T )
y 2 = x 3 + ax + b
HECC
v 2 = u 5 + f4 u 4 + f3 u 3 + f2 u 2 + f1 u 1 + f0
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
6
Hyperelliptic Advantages
smaller field size
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
7
Hyperelliptic Advantages
smaller field size
need field size ≈ 2128 for group size ≈ 2256
(ECC would need field size ≈ 2256 )
field arithmetic 2–4 times faster
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
7
Hyperelliptic Advantages
smaller field size
need field size ≈ 2128 for group size ≈ 2256
(ECC would need field size ≈ 2256 )
field arithmetic 2–4 times faster
improvement to HECC formulas
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
7
Hyperelliptic Advantages
smaller field size
need field size ≈ 2128 for group size ≈ 2256
(ECC would need field size ≈ 2256 )
field arithmetic 2–4 times faster
improvement to HECC formulas
only 25 multiplications with Gaudry ladder for Kummer surface
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
7
Hyperelliptic Advantages
smaller field size
need field size ≈ 2128 for group size ≈ 2256
(ECC would need field size ≈ 2256 )
field arithmetic 2–4 times faster
improvement to HECC formulas
only 25 multiplications with Gaudry ladder for Kummer surface
discovery of small constants for Kummer surface
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
7
Hyperelliptic Advantages
smaller field size
need field size ≈ 2128 for group size ≈ 2256
(ECC would need field size ≈ 2256 )
field arithmetic 2–4 times faster
improvement to HECC formulas
only 25 multiplications with Gaudry ladder for Kummer surface
discovery of small constants for Kummer surface
6 of 25 multiplications are by small constant
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
7
Hyperelliptic Advantages
smaller field size
need field size ≈ 2128 for group size ≈ 2256
(ECC would need field size ≈ 2256 )
field arithmetic 2–4 times faster
improvement to HECC formulas
only 25 multiplications with Gaudry ladder for Kummer surface
discovery of small constants for Kummer surface
6 of 25 multiplications are by small constant
new: formulas well suited for vectorization
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
7
Vectorization
without vector
a
+
b
=
a+b
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
8
Vectorization
without vector
a
+
b
=
a+b
with vector
a0
+
b0
=
a0 + b0
a1
+
b1
=
a1 + b1
Chitchanok Chuengsatiansup
a2
+
b2
=
a2 + b2
a3
+
b3
=
a3 + b 3
Kummer strikes back: new DH speed records
8
Vectorization
without vector
a
+
b
=
a+b
with vector
a0
+
b0
=
a0 + b0
a1
+
b1
=
a1 + b1
a2
+
b2
=
a2 + b2
a3
+
b3
=
a3 + b 3
single instruction performing n independent operations on
aligned inputs
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
8
ECC Montgomery Ladder (original)
x2
v
+
×
w
×
z2
x3
( +
v
−
Chitchanok Chuengsatiansup
×
A−4
2
−
×
×
z4
' +
,×
w
−
×g
x4
−
×
' +
( ' ×
z3
x1
x5
Kummer strikes back: new DH speed records
/×
z5
9
ECC Montgomery Ladder (2-way vectorization)
x2
v
+
move × down
×
z2
x3
( +
v
−
w
Chitchanok Chuengsatiansup
z4
×
A−4
2
−
×
×
x4
' +
,×
+
×g
−
×
−
~
( ' ×
' ×
z3
x1
x5
Kummer strikes back: new DH speed records
/×
z5
10
ECC Montgomery Ladder (2-way vectorization)
x2
require permutation
move × down
v
+
×
z2
x3
( +
v
−
w
Chitchanok Chuengsatiansup
z4
×
A−4
2
−
×
×
x4
' +
,×
+
×g
−
×
−
~
( ' ×
' ×
z3
x1
x5
Kummer strikes back: new DH speed records
/×
z5
10
ECC Montgomery Ladder (2-way vectorization)
x2
require permutation
v
+
move × down
×
z2
x3
( +
v
−
~
w
+
Chitchanok Chuengsatiansup
z4
−
×
A−4
2
' ×
×
x4
,×
+
×g
−
×
−
×
( ' ×
' require permutation and
leave + idle
z3
x1
x5
Kummer strikes back: new DH speed records
/×
z5
10
ECC Montgomery Ladder (2-way vectorization)
x2
require permutation
v
+
move × down
×
x3
( +
~
w
Chitchanok Chuengsatiansup
z4
−
×
A−4
2
' ×
×
x4
,×
+
+
−
×
×g
( ' ×
−
×
z3
v
−
' require permutation and
leave + idle
slow down mult. by
constant
z2
x1
x5
Kummer strikes back: new DH speed records
/×
z5
10
ECC Montgomery Ladder (2-way vectorization)
x2
require permutation
v
+
move × down
×
x3
( +
~
w
Chitchanok Chuengsatiansup
z4
−
×
A−4
2
' ×
×
x4
,×
+
+
nothing to match +
−
×
×g
( ' ×
−
×
z3
v
−
' require permutation and
leave + idle
slow down mult. by
constant
z2
x1
x5
Kummer strikes back: new DH speed records
/×
z5
10
ECC Montgomery Ladder (2-way vectorization)
x2
require permutation
v
+
move × down
×
x3
( +
~
w
Chitchanok Chuengsatiansup
z4
−
×
A−4
2
x4
' ×
×
require permutation
,×
+
+
nothing to match +
−
×
×g
( ' ×
−
×
z3
v
−
' require permutation and
leave + idle
slow down mult. by
constant
z2
x1
x5
Kummer strikes back: new DH speed records
/×
z5
10
ECC Montgomery Ladder (4-way vectorization)
x2
v
match + with −
+
×
z2
x3
( +
v
−
w
Chitchanok Chuengsatiansup
z4
×
A−4
2
−
×
×
x4
' +
,×
+
×g
−
×
−
~
( ' ×
' ×
z3
x1
x5
Kummer strikes back: new DH speed records
/×
z5
11
ECC Montgomery Ladder (4-way vectorization)
x2
v
match + with −
+
slow down squaring
×
z2
x3
( +
v
−
w
Chitchanok Chuengsatiansup
z4
×
A−4
2
−
×
×
x4
' +
,×
+
×g
−
×
−
~
( ' ×
' ×
z3
x1
x5
Kummer strikes back: new DH speed records
/×
z5
11
ECC Montgomery Ladder (4-way vectorization)
x2
v
match + with −
+
slow down squaring
×
z2
x3
( +
v
−
~
w
Chitchanok Chuengsatiansup
z4
×
A−4
2
−
×
×
x4
' +
,×
+
×g
−
×
−
×
( ' ×
' nothing to match and
match + and −
z3
x1
x5
Kummer strikes back: new DH speed records
/×
z5
11
ECC Montgomery Ladder (4-way vectorization)
x2
v
match + with −
+
slow down squaring
×
x3
( +
~
w
Chitchanok Chuengsatiansup
z4
−
×
A−4
2
' ×
×
x4
,×
+
+
−
×
×g
( ' ×
−
×
z3
v
−
' nothing to match and
match + and −
slow down mult. by
constant and squaring
z2
x1
x5
Kummer strikes back: new DH speed records
/×
z5
11
ECC Montgomery Ladder (4-way vectorization)
x2
v
match + with −
+
slow down squaring
×
x3
( +
~
w
Chitchanok Chuengsatiansup
z4
−
×
A−4
2
' ×
×
x4
,×
+
+
−
×
×g
nothing to match +
( ' ×
−
×
z3
v
−
' nothing to match and
match + and −
slow down mult. by
constant and squaring
z2
x1
x5
Kummer strikes back: new DH speed records
/×
z5
11
ECC Montgomery Ladder (4-way vectorization)
x2
v
match + with −
+
slow down squaring
×
x3
( +
~
w
Chitchanok Chuengsatiansup
z4
−
×
A−4
2
×
x4
' ×
nothing to match ×
,×
+
+
−
×
×g
nothing to match +
( ' ×
−
×
z3
v
−
' nothing to match and
match + and −
slow down mult. by
constant and squaring
z2
x1
x5
Kummer strikes back: new DH speed records
/×
z5
11
Squared Kummer Surface Ladder
x2
y2
z2
t2
x3
y3
z3
t3
H
H
×
×
×
×
4×
4×
4×
4×
·−833 ·2499 ·1617 ·561
·−833 ·2499 ·1617 ·561
H
H
×
×
×
×
×
×
×
×
·−114 ·57
·66
·418
· xy11
· xz11
· xt11
z4
t4
y5
z5
t5
x4
y4
Chitchanok Chuengsatiansup
x5
Kummer strikes back: new DH speed records
12
Making It Run Really Fast
maximize usage of available vector multipliers
minimize cost from carries
use redundant representation
use non-integer radix, e.g., 225.4 on Cortex-A8
do not perform full carry
do parallel carry chain
eliminate redundancy inside field operations
precompute to reuse values 2f1 , 2f2 , 2f3 , 2f4
minimize overhead from permutations
organize data to fit instruction format
minimize permutations in H
use different order of input/output and change sign
e.g. we reduced 144 Sandy Bridge permutations to 36
schedule instructions to keep CPU as busy as possible
see paper for details
Chitchanok Chuengsatiansup
Kummer strikes back: new DH speed records
13
Fastest Diffie–Hellman Ever!!!
Arch
A8-slow
A8-slow
A8-fast
A8-fast
Sandy
Sandy
Sandy
Sandy
Sandy
Sandy
Sandy
Sandy
Haswell
Haswell
Haswell
Haswell
Cycles
497389
305395
460200
273349
194036
153000?
137000?
122716
119904
96000?
92000?
88916
161648
110740
61712
60556
g
1
2
1
2
1
1
1
2
1
1
1
2
1
2
1
2
Field
2255 − 19
2127 − 1
2255 − 19
2127 − 1
2255 − 19
2252 − 2232 − 1
(2127 − 5997)2
2127 − 1
2254
(2127 − 5997)2
(2127 − 5997)2
2127 − 1
2255 − 19
2127 − 1
2254
2127 − 1
Chitchanok Chuengsatiansup
Source of software
BeSc
CHES
new (this paper)
BeSc
CHES
new (this paper)
BeDuLaScYa CHES
Hamburg
LoSi
Asiacrypt
BoCoHiLa Eurocrypt
OlLóArRo
CHES
FaLoSá
CT-RSA
FaLoSá
July
new (this paper)
BeDuLaScYa CHES
BoCoHiLa Eurocrypt
OlLóArRo
CHES
new (this paper)
2012
2012
2011
2012
2013
2013
2014
2014
2011
2013
2013
Kummer strikes back: new DH speed records
14
Download