Unified Architectures for Efficient and Compact Crypto-Processing Erkay Savaş

Unified Architectures for
Efficient and Compact
Crypto-Processing
Erkay Savaş
Sabancı University
7/27/2016
Erkay Savaş
1
Outline

Research Motivation

Public Key Cryptography

Unified Arithmetic

High-Radix Multiplication

Dual-Radix Multiplication

Support for GF(3^n) Arithmetic

Implementation Results

Future Research
Motivation

Compatibility
  support for fast arithmetic in different finite fields and groups

Saving in Area
  Improve {time × area} metric

Algorithm Agility
  NTRU, ECC
Public Key Cryptography (PKC)

Each user has a pair of keys:
  Private Key - known only to the owner
  Public Key - known to everyone in the system, with assurance

Encryption:
  Encryption with the Public Key of the receiver

Decryption:
  Only the receiver can decrypt the message with her/his Private Key
Public Key Cryptography in Use

RSA, Rabin's scheme
  Integer factorization, square roots modulo a composite number

Discrete Logarithm Based Algorithms
  Diffie-Hellman Key Exchange, ElGamal

Elliptic curve DH Key Exchange, ECDSA
  Discrete logarithm over elliptic curves

IBE
  pairings over elliptic curve points
RSA





Most popular PKC
Invented by Rivest/Shamir/Adleman in 1977
at MIT.
Its patent expired in 2000.
Based on Integer Factorization problem
Each user has a public and private key pair.
RSA Encryption & Decryption

Encryption done by using public key
$y = x^e \bmod n$, where $x, y < n$

Decryption done by using private key
$x = y^d \bmod n$
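
As a minimal sketch of the two operations above, the following uses Python's built-in `pow` with deliberately tiny parameters. The primes `p`, `q` and the exponent `e = 17` are illustrative assumptions, not values from the talk; real keys use moduli of 1024 bits or more.

```python
# Toy RSA round trip (illustrative parameters only).
p, q = 61, 53                # assumed small primes for the example
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

x = 65                       # plaintext, must satisfy x < n
y = pow(x, e, n)             # encryption: y = x^e mod n
recovered = pow(y, d, n)     # decryption: x = y^d mod n
assert recovered == x
```

Both directions are modular exponentiations, which is why modular multiplication dominates the cost.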
DL Based Cryptosystems

Fundamental operation
$g^x \bmod p$, where $x, g < p$ and g is a primitive element
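
The exponentiation above is typically computed with the binary (square-and-multiply) method. The sketch below is one common way to do it, not necessarily the method used in the presented hardware; the function name `mod_exp` is hypothetical.

```python
def mod_exp(g, x, p):
    """Left-to-right square-and-multiply computation of g**x mod p."""
    result = 1
    for bit in bin(x)[2:]:               # scan exponent bits, most significant first
        result = (result * result) % p   # square for every bit
        if bit == "1":
            result = (result * g) % p    # multiply when the bit is set
    return result
```

Every step is a modular multiplication or squaring, so a k-bit exponent costs between k and 2k modular multiplications.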
Elliptic Curve Cryptography 1/2




Emerging public key cryptography standard for constrained devices.

A 160-bit key length is equivalent in cryptographic strength to 1024-bit RSA.
  313-bit ECC is equivalent to 4096-bit RSA

Rich and deep theory suitable to cryptography
  Elliptic curves, as algebraic/geometric entities, have been studied extensively for the past 150 years.

First proposed for cryptographic usage in 1985 independently by Neal Koblitz and Victor Miller
Elliptic Curve Cryptography 2/2

Dominant fundamental operations
  Multiplication in GF(q), where q = p^k and p is prime
  Alternatives
    GF(p): k = 1
    GF(2^k): p = 2
    GF(3^k): p = 3
    GF(p^k)

Identity Based Encryption (IBE)

Public key can be any string
  e-mail address, name, etc.

No need for certificates

Anonymity achieved
  users can choose any public key without revealing their ID
  they can easily change it
IBE – Bilinear Mapping



$e(xP, yQ) = e(P, Q)^{xy} = e(yP, xQ) = g$
g is an element of (an extension of) the underlying field.

Bilinear mapping over elliptic curves
  Weil pairing
  Tate pairing
  Resource consuming

Most efficient bilinear mappings
  defined on curves over GF(3^k)
An Introduction to Unified
Arithmetic

Three types of finite fields are heavily used
1. Prime fields, GF(p)
2. Binary extension fields, GF(2^k)
3. Ternary extension fields, GF(3^k) (recently, due to IBE schemes)

These finite fields feature dissimilar properties
  Different implementations on specialized hardware
Unified Arithmetic

Unified hardware design methodology requires
1. A single (unified) datapath
2. A single (unified) control
3. Insignificant overhead in the area
4. Insignificant overhead in the time complexity (e.g. critical path delay)
5. Good {time × area} metric
Unified Arithmetic (GF(p) + GF(2^k))


A unified hardware design methodology for both fields is possible since:
1. the elements of either field are represented using almost the same data structures in digital systems
2. the algorithms for basic arithmetic operations in both fields have structural similarities (i.e. the steps of the algorithms are almost identical)

Hence, unified arithmetic is possible
Finite Field Operations in ECC

Addition in GF(p) and GF(2^k)
  Relatively inexpensive in area and time complexity

Multiplicative inversion in GF(p) and GF(2^k)
  Prohibitively expensive in terms of time
  Possible to avoid some of them

Multiplication in GF(p) and GF(2^k)
  Expensive in terms of time and area
  Usually most important operation
  Our focus
Montgomery Multiplication

Very efficient way of doing multiplication in GF(p) and GF(2^k) (now also in GF(3^k))
1. Faster (replaces division by shifts)
2. Suitable for unified design
3. Suitable for scalable design
4. Highly parallel
5. Suitable for pipelining
Montgomery Multiplication

Definition:
  Given $a, b \in GF(p)$, $\mathrm{MonMul}(a, b) = a \cdot b \cdot R^{-1} \bmod p$,
  where $R = 2^k \bmod p$ and $k = \lceil \log_2 p \rceil$.

Algorithm
1. c := 0
2. for i = 0 to k-1
3.     c := c + a_i · b
4.     c := (c + c_0 · p)/2
5. if c ≥ p then c := c - p (final subtraction)
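
The bit-serial algorithm above can be sketched in a few lines of Python. This is a software illustration under the assumptions that p is odd and a, b < p; the function name `mon_mul` is hypothetical, and the hardware described in the talk operates on words and carry-save values rather than full-precision integers.

```python
def mon_mul(a, b, p):
    """Radix-2 Montgomery product a*b*2^(-k) mod p for odd p and a, b < p."""
    k = p.bit_length()             # k = ceil(log2 p) for odd p > 1
    c = 0
    for i in range(k):
        a_i = (a >> i) & 1         # scan one multiplier bit per iteration
        c += a_i * b
        c0 = c & 1                 # least significant bit decides whether to add p
        c = (c + c0 * p) >> 1      # the sum is even, so the shift is an exact division
    if c >= p:                     # final subtraction keeps the result below p
        c -= p
    return c
```

To use it for ordinary modular multiplication, operands are first mapped to a·R mod p; the product of two such values is again in that form.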
Algorithm for GF(2^k)

Input: a(x), b(x) ∈ GF(2^k), p(x) and k
Output: c(x) = a(x)·b(x)·x^{-k} mod p(x), with c(x) ∈ GF(2^k)
1. c(x) := 0
2. for i = 0 to k-1
3.     c(x) := c(x) + a_i · b(x)
4.     c(x) := (c(x) + c_0 · p(x))/x
5. No final subtraction

Additions here are polynomial additions over GF(2), i.e. bitwise XOR.

Note that c/2 and c(x)/x are implemented in an identical way in SW and HW
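
A sketch of the GF(2^k) variant, assuming polynomials are stored as Python integers (bit i holds the coefficient of x^i) and that p(x) has degree k with constant term 1. The names `mon_mul_gf2` and `gf2_mulmod` are hypothetical; the second is only a plain reference multiplication used for checking.

```python
def mon_mul_gf2(a, b, p, k):
    """Montgomery product a(x)*b(x)*x^(-k) mod p(x) over GF(2)."""
    c = 0
    for i in range(k):
        if (a >> i) & 1:
            c ^= b             # addition in GF(2^k) is XOR, no carries
        if c & 1:
            c ^= p             # clears the constant term so c is divisible by x
        c >>= 1                # division by x is a right shift, as c/2 is in GF(p)
    return c                   # degree is already below k: no final subtraction

def gf2_mulmod(a, b, p, k):
    """Reference shift-and-add multiplication a(x)*b(x) mod p(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= p             # reduce as soon as the degree reaches k
    return r
```

Multiplying the Montgomery result by x^k mod p(x) should give back the ordinary product, which is a convenient consistency check.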
Representation

Addition
  Atomic operation: multiplication is performed as a repeated addition

Unified addition
  most efficient when carry-save representation is used for elements of GF(p)

Carry-save representation
  an integer is represented as the sum of two other integers
  x := xs + xc (sum and carry parts, resp.)
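
To illustrate the carry-save idea, the sketch below adds a third operand to a pair (xs, xc) using only bitwise operations, so no carry ripples across bit positions. The function name `carry_save_add` is hypothetical, and Python integers stand in for fixed-width registers.

```python
def carry_save_add(xs, xc, y):
    """Add y to the carry-save pair (xs, xc); returns a new (sum, carry) pair."""
    s = xs ^ xc ^ y                                  # per-bit sum, carries ignored
    c = ((xs & xc) | (xs & y) | (xc & y)) << 1       # per-bit carries, moved up one position
    return s, c

# Accumulate a few operands without ever resolving the carries.
xs, xc = 0, 0
for operand in (13, 250, 7, 1023):
    xs, xc = carry_save_add(xs, xc, operand)
total = xs + xc                                      # single carry-propagate addition at the end
```

The value is always xs + xc, so a full carry-propagating addition is needed only once, when a non-redundant result is required.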
Scalability

Original Montgomery multiplication algorithm performs full-precision integer additions
  Not scalable

Instead,
  long integers are divided into words
  Additions of words are handled separately by word adders.
  Choice of word length depends on the precision, area and speed requirements
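
A small sketch of word-serial addition, assuming operands are split into e words of w bits each, least significant word first. The helper names `to_words`, `add_words` and `from_words` are hypothetical; the point is only that a fixed w-bit adder can process operands of any precision by iterating over words.

```python
def to_words(x, w, e):
    """Split x into e words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * j)) & mask for j in range(e)]

def add_words(a_words, b_words, w):
    """Add two equal-length word vectors with a w-bit adder, one word per step."""
    mask = (1 << w) - 1
    carry = 0
    out = []
    for a_j, b_j in zip(a_words, b_words):
        t = a_j + b_j + carry      # fits in w + 1 bits
        out.append(t & mask)       # low w bits form the result word
        carry = t >> w             # carry into the next word
    return out, carry

def from_words(words, w):
    """Reassemble an integer from its w-bit words."""
    return sum(word << (w * j) for j, word in enumerate(words))
```

Changing the precision only changes the number of loop iterations, not the adder width.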
Word-Based Multiplication
[Figure: word-based multiplication. Processing units PU_i and PU_{i+1} take multiplier bits a_i and a_{i+1} and operate on the words b^(j), p^(j) and c^(j), with bit positions w-1 down to 0.]
Dependency Graph
[Figure: dependency graph. Each node takes a multiplier bit (a_0, a_1, ...) and the words B^(j) and p^(j) (shown for j = 0, 1).]
Processing Unit (PU) with w = 2
[Figure: processing unit built from four dual-field adders, with inputs a_i, c_0 and FSEL and the word bits p_1^(j), p_0^(j), B_1^(j), B_0^(j), C_1^(j), C_0^(j).]
Dual-Field Adder (DFA) 1/2


Almost identical to a full-adder (FA)

Difference
  it has an additional (control) input (FSEL) which suppresses the carry output of the adder when it is set to logic-0
  Namely, when FSEL = 0 the adder operates in GF(2^k) mode; otherwise it behaves as a regular FA
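
The behaviour described above can be modelled at the bit level as follows. This is a functional sketch only; the function name `dual_field_adder` is hypothetical and the gate-level structure of the actual cell may differ.

```python
def dual_field_adder(a, b, c, fsel):
    """One-bit dual-field adder: full adder when fsel = 1, carry-free XOR when fsel = 0."""
    s = a ^ b ^ c                                  # sum bit is the same in both modes
    cout = fsel & ((a & b) | (a & c) | (b & c))    # FSEL = 0 forces the carry output to 0
    return s, cout
```

With FSEL = 1 the outputs satisfy a + b + c = s + 2·cout; with FSEL = 0 the cell computes the GF(2) sum of its three inputs.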
DFA 2/2
[Figure: dual-field adder circuit with inputs A, B, C and FSEL, and outputs S and Cout.]
Pipeline Organization with two PUs
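[Figure: pipeline with two processing units, PU-1 and PU-2. Multiplier bits a_i and a_{i+1} come from shift register SR-a (loaded from RAM-a); operand words come from RAM-b and RAM-p; the partial result c passes through shift register SR-C.]

Length of shift register SR-C:
$$L_{SR\text{-}C} = \begin{cases} e + 2 - 2s & \text{if } (e + 2) > 2s \\ 1 & \text{otherwise} \end{cases}$$
s: the number of PUs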
Total Computation Time (in clock cycles)

$$T = \begin{cases} \lceil k/s \rceil \cdot 2s + e - 1 & \text{if } (e + 1) < 2s \\ \lceil k/s \rceil \, (e + 1) + 2(s - 1) & \text{otherwise} \end{cases}$$

w: word size, k: precision, $e := \lceil k/w \rceil$,
s: the number of PUs
Example Execution Times

Example: k = 1024, w = 32
  s = 17 → T = 2105
  s = 15 → T = 2305
  s = 10 → T = 3415
  s = 1 → T = 33792

Example: k = 2048, w = 32
  s = 33 → T = 4221
  s = 30 → T = 4543
  s = 10 → T = 13343
  s = 1 → T = 133120
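
The cycle counts can be recomputed with a short function. The sketch assumes T = ⌈k/s⌉·2s + e - 1 when (e + 1) < 2s and T = ⌈k/s⌉(e + 1) + 2(s - 1) otherwise, with e = ⌈k/w⌉; the function name `total_cycles` is hypothetical. Under that reading it reproduces the figures listed above, except for k = 1024, s = 10, where it yields 3417 rather than 3415.

```python
def total_cycles(k, w, s):
    """Clock cycles for one k-bit multiplication with word size w and s processing units."""
    e = -(-k // w)              # e = ceil(k / w), number of words per operand
    rounds = -(-k // s)         # ceil(k / s), passes through the s-stage pipeline
    if e + 1 < 2 * s:
        # Pipeline is longer than the operand: each pass costs 2s cycles.
        return rounds * 2 * s + e - 1
    # Operand is long enough to keep all PUs busy.
    return rounds * (e + 1) + 2 * (s - 1)

for k, s_values in ((1024, (17, 15, 10, 1)), (2048, (33, 30, 10, 1))):
    for s in s_values:
        print(f"k = {k}, s = {s}: T = {total_cycles(k, 32, s)}")
```

Both branches give the same value when e + 1 = 2s, so the choice of strict or non-strict inequality at the boundary does not affect the result.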
Comparison to the single-field (GF(p)) design

                         GF(p)    Unified    Overhead
Cell Area                47.2w    48.5w      2.75%
Cell Propagation Time    11 ns    11 ns      0%

w: word size
1.2 µm CMOS technology
Design Alternatives

Higher Radix
  Original design is radix 2
    Namely, multiplier bits are scanned one bit in each clock cycle
  Possible to scan two or more bits of the multiplier in each clock cycle
    Radix-4: two bits
    Radix-8: three bits
  More complex design: lower clock frequency, higher area
  Lower clock cycle count → faster execution of multiplication
Comparison

Higher radix vs. single radix

Metric
  area × time

For small total area (i.e. fewer than 10,000 equivalent NAND gates) the performances of radix-2 and radix-8 are comparable

The radix-8 multiplier outperforms the radix-2 multiplier by more than a factor of 3 when the total area is around 25,000 NAND gates
Dual-Radix Multiplier

Radix-2 for GF(p) and radix-4 for GF(2^k)
[Figure: dual-radix processing unit. Selection logic driven by a_i, a_{i+1} and FSEL controls MUX-1 and MUX-2, which select among 0, B^(j), P^(j), (P + B)^(j) and 2B^(j), 2P^(j), 2(P + B)^(j) as inputs, together with C^(j), to a 3x2 dual-field adder.]
Dual-Radix Multiplier

Three multipliers
  A1: GF(p)-only multiplier
  A2: single-radix unified multiplier (with precomp.)
  A3: dual-radix multiplier

Performance (area × time)
  A3 performs slightly worse than A1 and A2 (between 7% and 19%) in GF(p) mode
  A3 outperforms A2 by 38% to 46% in GF(2^k) mode
Unified Arithmetic?

Unified multiplier
  carry-save adders used in multiplier
  With carry-save representation it is not easy to perform other arithmetic operations, such as subtraction and comparison (essential in inversion)
New Redundant Representation

Recall:
  Carry-save representation
    X = xs + xc.

New redundant representation
  Redundant signed digit (RSD) representation
    X = xp - xn.
  Subtraction is equivalent to addition
    X - Y = (xp - xn) - (yp - yn) = (xp - xn) + (yn - yp)
  Comparison is relatively easy
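
The identity above means that negation in RSD form is just a swap of the two components. The sketch below shows this at the integer level, with hypothetical helper names; an actual RSD adder works digit by digit with limited carry propagation, which this model does not attempt to capture.

```python
def rsd_value(x):
    """Integer represented by the RSD pair x = (xp, xn)."""
    xp, xn = x
    return xp - xn

def rsd_negate(x):
    """Negation swaps the positive and negative components."""
    xp, xn = x
    return (xn, xp)

def rsd_add(x, y):
    """Component-wise addition keeps the result in RSD form."""
    return (x[0] + y[0], x[1] + y[1])

def rsd_sub(x, y):
    """X - Y = (xp - xn) + (yn - yp): add the negated operand."""
    return rsd_add(x, rsd_negate(y))
```

For example, with X = (9, 2) and Y = (4, 10), `rsd_sub(X, Y)` is (19, 6), which represents 7 - (-6) = 13.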
RSD



All previous multipliers require a reverse transformation to non-redundant form after each multiplication

There are thousands of multiplications in ECC

With RSD, all the computation can be done in RSD form without any reverse transformation
  a single transformation is necessary if the result is needed in non-redundant form.
Support for GF(3^n) Arithmetic

RSD lends itself to a unified arithmetic architecture that efficiently supports GF(3^n) arithmetic
Analysis

A1: GF(p)-only architecture

A2: GF(2^k)-only architecture

A3: GF(3^n)-only architecture

A4: Unified architecture (GF(p) + GF(2^k))

A5: Unified architecture (GF(p) + GF(2^k) + GF(3^n))

A1 + A2: Hypothetical architecture that has separate datapaths for GF(p) and GF(2^k)
Analysis




Metric: area × time
A4 over A1 + A2: 7.94%
A5 over A1 + A2 + A3: 33.54%
A5 over A4 + A3: 28.36%
Implementation Results

2.38 GHz, 0.13 µm CMOS

# of PUs    160-bit ECC (ms)    1024-bit RSA (s)    Tate pairing GF(3^97) (s)
4           315                 21.0                508
8           210                 10.5                334
16          189                 5.25                334
32          189                 2.12                334

4 PUs → ~11,000 NAND gates, 8 PUs → ~15,000 NAND gates
Research Directions



Embed the unified architectures into
common general-purpose processors
Unified inversion using RSD
Unified architectures for other PKC
Ending…

Questions

Contact



Erkay Savaş
erkays@sabanciuniv.edu
http://people.sabanciuniv.edu/~erkays