Kummer strikes back: new DH speed records Chitchanok Chuengsatiansup 9th December 2014 Joint work with Daniel J. Bernstein, Tanja Lange, and Peter Schwabe Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 1 Diffie–Hellman Key Exchange a b Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a aP b bP Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a b aP a(bP) u bP ) b(aP) Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a b d aP a(bP) c u bP ) b(aP) Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a b aP a(bP) u bP ) c cP d dP b(aP) Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a b aP a(bP) a(cP) c bP ) u cP d dP b(aP) u Chitchanok Chuengsatiansup ) c(aP) Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a b aP a(bP) a(cP) c bP ) u cP d dP b(aP) u a(dP) u Chitchanok Chuengsatiansup ) c(aP) * d(aP) Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a b aP a(bP) a(cP) c bP ) u cP dP b(aP) ) u a(dP) u d b(cP) Chitchanok Chuengsatiansup c(aP) c(bP) * d(aP) Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a b aP a(bP) a(cP) c bP ) u u a(dP) u d cP dP b(aP) b(dP) u b(cP) Chitchanok Chuengsatiansup ) c(aP) c(bP) ) d(bP) * d(aP) Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a b aP a(bP) a(cP) c bP ) u u a(dP) u cP b(aP) b(dP) d c(dP) u b(cP) Chitchanok Chuengsatiansup ) c(aP) c(bP) dP ) u ) d(cP) d(bP) * d(aP) Kummer strikes back: new DH speed records 2 Diffie–Hellman Key Exchange a aP a(bP) a(cP) u u a(dP) u main DH challenge: make variable-base scalar mult as fast as possible Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 3 Diffie–Hellman Key Exchange b bP ) b(aP) b(dP) u b(cP) main DH challenge: make variable-base scalar mult as fast as possible Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 3 Diffie–Hellman Key Exchange c cP c(dP) ) u c(aP) c(bP) main DH challenge: make variable-base scalar mult as fast as possible Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 3 Diffie–Hellman Key Exchange d dP ) ) d(cP) d(bP) * d(aP) main DH challenge: make variable-base scalar mult as fast as possible Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 3 Ladder Input: Q, n = (ni−1 , . . . , n0 )2 Output: R0 = nQ R0 ← 0Q; R1 ← 1Q if ni = 0 then R0 ← 2R0 ; R1 ← R0 + R1 else R0 ← R0 + R1 ; R1 ← 2R1 Return R0 Chitchanok Chuengsatiansup Example: Compute 9Q 9 = 10012 (R0 , R1 ) = (0Q, 1Q) n3 = 1 : (1Q, 2Q) n2 = 0 : (2Q, 3Q) n1 = 0 : (4Q, 5Q) n0 = 1 : (9Q, 10Q) Kummer strikes back: new DH speed records 4 Elliptic VS Hyperelliptic Sandy Bridge at high security level and (claimed) constant time cycles 194036, g=1 153000?, g=1 “?”: not verified by SUPERCOP 2011 2012 2013 Chitchanok Chuengsatiansup 2014 2015 Kummer strikes back: new DH speed records 5 Elliptic VS Hyperelliptic Sandy Bridge at high security level and (claimed) constant time cycles 194036, g=1 153000?, g=1 137000?, g=1 (GLV+GLS) “?”: not verified by SUPERCOP 2011 2012 2013 Chitchanok Chuengsatiansup 2014 2015 Kummer strikes back: new DH speed records 5 Elliptic VS Hyperelliptic Sandy Bridge at high security level and (claimed) constant time cycles 194036, g=1 153000?, g=1 137000?, g=1 (GLV+GLS) 122716, g=2 “?”: not verified by SUPERCOP 2011 2012 2013 Chitchanok Chuengsatiansup 2014 2015 Kummer strikes back: new DH speed records 5 Elliptic VS Hyperelliptic Sandy Bridge at high security level and (claimed) constant time cycles 194036, g=1 153000?, g=1 137000?, g=1 (GLV+GLS) 122716, g=2 119904, g=1 (binary GLS) “?”: not verified by SUPERCOP 2011 2012 2013 Chitchanok Chuengsatiansup 2014 2015 Kummer strikes back: new DH speed records 5 Elliptic VS Hyperelliptic Sandy Bridge at high security level and (claimed) constant time cycles 194036, g=1 153000?, g=1 137000?, g=1 (GLV+GLS) 122716, g=2 119904, g=1 (binary GLS) 96000?, g=1 (GLV+GLS) “?”: not verified by SUPERCOP 2011 2012 2013 Chitchanok Chuengsatiansup 2014 2015 Kummer strikes back: new DH speed records 5 Elliptic VS Hyperelliptic Sandy Bridge at high security level and (claimed) constant time cycles 194036, g=1 153000?, g=1 137000?, g=1 (GLV+GLS) 122716, g=2 119904, g=1 (binary GLS) 96000?, g=1 (GLV+GLS) 92000?, g=1 (GLV+GLS) “?”: not verified by SUPERCOP 2011 2012 2013 Chitchanok Chuengsatiansup 2014 2015 Kummer strikes back: new DH speed records 5 Elliptic VS Hyperelliptic Sandy Bridge at high security level and (claimed) constant time cycles 194036, g=1 153000?, g=1 137000?, g=1 (GLV+GLS) 122716, g=2 119904, g=1 (binary GLS) 96000?, g=1 (GLV+GLS) 92000?, g=1 (GLV+GLS) 88916, g=2 (new) “?”: not verified by SUPERCOP 2011 2012 2013 Chitchanok Chuengsatiansup 2014 2015 Kummer strikes back: new DH speed records 5 Elliptic-Hyperelliptic Analogy ECC −→ x-line represented as (X : Z ) −→ Kummer surface represented as (X : Y : Z : T ) y 2 = x 3 + ax + b HECC v 2 = u 5 + f4 u 4 + f3 u 3 + f2 u 2 + f1 u 1 + f0 Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 6 Hyperelliptic Advantages smaller field size Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 7 Hyperelliptic Advantages smaller field size need field size ≈ 2128 for group size ≈ 2256 (ECC would need field size ≈ 2256 ) field arithmetic 2–4 times faster Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 7 Hyperelliptic Advantages smaller field size need field size ≈ 2128 for group size ≈ 2256 (ECC would need field size ≈ 2256 ) field arithmetic 2–4 times faster improvement to HECC formulas Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 7 Hyperelliptic Advantages smaller field size need field size ≈ 2128 for group size ≈ 2256 (ECC would need field size ≈ 2256 ) field arithmetic 2–4 times faster improvement to HECC formulas only 25 multiplications with Gaudry ladder for Kummer surface Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 7 Hyperelliptic Advantages smaller field size need field size ≈ 2128 for group size ≈ 2256 (ECC would need field size ≈ 2256 ) field arithmetic 2–4 times faster improvement to HECC formulas only 25 multiplications with Gaudry ladder for Kummer surface discovery of small constants for Kummer surface Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 7 Hyperelliptic Advantages smaller field size need field size ≈ 2128 for group size ≈ 2256 (ECC would need field size ≈ 2256 ) field arithmetic 2–4 times faster improvement to HECC formulas only 25 multiplications with Gaudry ladder for Kummer surface discovery of small constants for Kummer surface 6 of 25 multiplications are by small constant Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 7 Hyperelliptic Advantages smaller field size need field size ≈ 2128 for group size ≈ 2256 (ECC would need field size ≈ 2256 ) field arithmetic 2–4 times faster improvement to HECC formulas only 25 multiplications with Gaudry ladder for Kummer surface discovery of small constants for Kummer surface 6 of 25 multiplications are by small constant new: formulas well suited for vectorization Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 7 Vectorization without vector a + b = a+b Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 8 Vectorization without vector a + b = a+b with vector a0 + b0 = a0 + b0 a1 + b1 = a1 + b1 Chitchanok Chuengsatiansup a2 + b2 = a2 + b2 a3 + b3 = a3 + b 3 Kummer strikes back: new DH speed records 8 Vectorization without vector a + b = a+b with vector a0 + b0 = a0 + b0 a1 + b1 = a1 + b1 a2 + b2 = a2 + b2 a3 + b3 = a3 + b 3 single instruction performing n independent operations on aligned inputs Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 8 ECC Montgomery Ladder (original) x2 v + × w × z2 x3 ( + v − Chitchanok Chuengsatiansup × A−4 2 − × × z4 ' + ,× w − ×g x4 − × ' + ( ' × z3 x1 x5 Kummer strikes back: new DH speed records /× z5 9 ECC Montgomery Ladder (2-way vectorization) x2 v + move × down × z2 x3 ( + v − w Chitchanok Chuengsatiansup z4 × A−4 2 − × × x4 ' + ,× + ×g − × − ~ ( ' × ' × z3 x1 x5 Kummer strikes back: new DH speed records /× z5 10 ECC Montgomery Ladder (2-way vectorization) x2 require permutation move × down v + × z2 x3 ( + v − w Chitchanok Chuengsatiansup z4 × A−4 2 − × × x4 ' + ,× + ×g − × − ~ ( ' × ' × z3 x1 x5 Kummer strikes back: new DH speed records /× z5 10 ECC Montgomery Ladder (2-way vectorization) x2 require permutation v + move × down × z2 x3 ( + v − ~ w + Chitchanok Chuengsatiansup z4 − × A−4 2 ' × × x4 ,× + ×g − × − × ( ' × ' require permutation and leave + idle z3 x1 x5 Kummer strikes back: new DH speed records /× z5 10 ECC Montgomery Ladder (2-way vectorization) x2 require permutation v + move × down × x3 ( + ~ w Chitchanok Chuengsatiansup z4 − × A−4 2 ' × × x4 ,× + + − × ×g ( ' × − × z3 v − ' require permutation and leave + idle slow down mult. by constant z2 x1 x5 Kummer strikes back: new DH speed records /× z5 10 ECC Montgomery Ladder (2-way vectorization) x2 require permutation v + move × down × x3 ( + ~ w Chitchanok Chuengsatiansup z4 − × A−4 2 ' × × x4 ,× + + nothing to match + − × ×g ( ' × − × z3 v − ' require permutation and leave + idle slow down mult. by constant z2 x1 x5 Kummer strikes back: new DH speed records /× z5 10 ECC Montgomery Ladder (2-way vectorization) x2 require permutation v + move × down × x3 ( + ~ w Chitchanok Chuengsatiansup z4 − × A−4 2 x4 ' × × require permutation ,× + + nothing to match + − × ×g ( ' × − × z3 v − ' require permutation and leave + idle slow down mult. by constant z2 x1 x5 Kummer strikes back: new DH speed records /× z5 10 ECC Montgomery Ladder (4-way vectorization) x2 v match + with − + × z2 x3 ( + v − w Chitchanok Chuengsatiansup z4 × A−4 2 − × × x4 ' + ,× + ×g − × − ~ ( ' × ' × z3 x1 x5 Kummer strikes back: new DH speed records /× z5 11 ECC Montgomery Ladder (4-way vectorization) x2 v match + with − + slow down squaring × z2 x3 ( + v − w Chitchanok Chuengsatiansup z4 × A−4 2 − × × x4 ' + ,× + ×g − × − ~ ( ' × ' × z3 x1 x5 Kummer strikes back: new DH speed records /× z5 11 ECC Montgomery Ladder (4-way vectorization) x2 v match + with − + slow down squaring × z2 x3 ( + v − ~ w Chitchanok Chuengsatiansup z4 × A−4 2 − × × x4 ' + ,× + ×g − × − × ( ' × ' nothing to match and match + and − z3 x1 x5 Kummer strikes back: new DH speed records /× z5 11 ECC Montgomery Ladder (4-way vectorization) x2 v match + with − + slow down squaring × x3 ( + ~ w Chitchanok Chuengsatiansup z4 − × A−4 2 ' × × x4 ,× + + − × ×g ( ' × − × z3 v − ' nothing to match and match + and − slow down mult. by constant and squaring z2 x1 x5 Kummer strikes back: new DH speed records /× z5 11 ECC Montgomery Ladder (4-way vectorization) x2 v match + with − + slow down squaring × x3 ( + ~ w Chitchanok Chuengsatiansup z4 − × A−4 2 ' × × x4 ,× + + − × ×g nothing to match + ( ' × − × z3 v − ' nothing to match and match + and − slow down mult. by constant and squaring z2 x1 x5 Kummer strikes back: new DH speed records /× z5 11 ECC Montgomery Ladder (4-way vectorization) x2 v match + with − + slow down squaring × x3 ( + ~ w Chitchanok Chuengsatiansup z4 − × A−4 2 × x4 ' × nothing to match × ,× + + − × ×g nothing to match + ( ' × − × z3 v − ' nothing to match and match + and − slow down mult. by constant and squaring z2 x1 x5 Kummer strikes back: new DH speed records /× z5 11 Squared Kummer Surface Ladder x2 y2 z2 t2 x3 y3 z3 t3 H H × × × × 4× 4× 4× 4× ·−833 ·2499 ·1617 ·561 ·−833 ·2499 ·1617 ·561 H H × × × × × × × × ·−114 ·57 ·66 ·418 · xy11 · xz11 · xt11 z4 t4 y5 z5 t5 x4 y4 Chitchanok Chuengsatiansup x5 Kummer strikes back: new DH speed records 12 Making It Run Really Fast maximize usage of available vector multipliers minimize cost from carries use redundant representation use non-integer radix, e.g., 225.4 on Cortex-A8 do not perform full carry do parallel carry chain eliminate redundancy inside field operations precompute to reuse values 2f1 , 2f2 , 2f3 , 2f4 minimize overhead from permutations organize data to fit instruction format minimize permutations in H use different order of input/output and change sign e.g. we reduced 144 Sandy Bridge permutations to 36 schedule instructions to keep CPU as busy as possible see paper for details Chitchanok Chuengsatiansup Kummer strikes back: new DH speed records 13 Fastest Diffie–Hellman Ever!!! Arch A8-slow A8-slow A8-fast A8-fast Sandy Sandy Sandy Sandy Sandy Sandy Sandy Sandy Haswell Haswell Haswell Haswell Cycles 497389 305395 460200 273349 194036 153000? 137000? 122716 119904 96000? 92000? 88916 161648 110740 61712 60556 g 1 2 1 2 1 1 1 2 1 1 1 2 1 2 1 2 Field 2255 − 19 2127 − 1 2255 − 19 2127 − 1 2255 − 19 2252 − 2232 − 1 (2127 − 5997)2 2127 − 1 2254 (2127 − 5997)2 (2127 − 5997)2 2127 − 1 2255 − 19 2127 − 1 2254 2127 − 1 Chitchanok Chuengsatiansup Source of software BeSc CHES new (this paper) BeSc CHES new (this paper) BeDuLaScYa CHES Hamburg LoSi Asiacrypt BoCoHiLa Eurocrypt OlLóArRo CHES FaLoSá CT-RSA FaLoSá July new (this paper) BeDuLaScYa CHES BoCoHiLa Eurocrypt OlLóArRo CHES new (this paper) 2012 2012 2011 2012 2013 2013 2014 2014 2011 2013 2013 Kummer strikes back: new DH speed records 14