Implementation of SCA-Resistant CPU and an ECDLP Engine on
FPGA Platform
Suvarna H. Mane
Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Engineering
Patrick Schaumont, Chair
Leyla Nazhandali
Lynn Abbott
April 30, 2012
Blacksburg, Virginia
Keywords: Side-Channel Analysis (SCA), Elliptic Curve Discrete Logarithm Problem
(ECDLP), Pollard rho, Prime-field arithmetic, Hardware-software co-design, FPGA.
Copyright 2012, Suvarna H. Mane
ABSTRACT
The rapid increase in the use of embedded systems for performing secure transactions has
proportionally increased the security threats faced by such devices. Security threats are
an issue of concern at both the software and hardware levels. The field of cryptography has
been intensively researched for secure implementation techniques, methods to attack secure
systems, and countermeasures to avoid such attacks. In this thesis, we provide solutions for
two interesting problems in the field of hardware security using reconfigurable hardware.
First, we discuss a countermeasure to prevent side-channel analysis (SCA) attacks on an embedded system. We present an SCA-resistant processor design in the context of an embedded
design flow for FPGA. It integrates an SCA-resistant custom instruction set on a soft-core
CPU and derives its SCA resistance from the dual-rail precharge principle. The resulting countermeasure applies to a broad class of block ciphers that consist of lookup tables and logical
operations. While many countermeasures have been proposed previously, we show that our
solution achieves an excellent trade-off between SCA resistance, (software and hardware)
design complexity, performance, and circuit area cost.
Second, we present a system to attack a special type of cryptography called Elliptic Curve
Cryptography (ECC). It targets the Elliptic Curve Discrete Logarithm Problem (ECDLP)
for a NIST-standardized ECC curve over a 112-bit prime field. We implement a successful demonstration of an ECC cryptanalytic engine using the Pollard rho algorithm on a
hardware-software co-integrated platform. We propose a novel, generalized architecture for
polynomial-basis multiplication over a prime field and its extension to a dedicated square
module. Its design strategy is portable to other prime field moduli.
This work received support from the National Science Foundation, Grant no. 477634.
Acknowledgments
First, I would like to express my sincere thanks to my adviser, Dr. Patrick Schaumont, under
whose guidance I have completed my graduate studies. It has been a privilege working with
him and I am extremely grateful for his faith in me as a student. His example of dedication,
punctuality, work ethics and enthusiasm has always been an inspiration for me and will
continue to be so.
This section will not be complete without mentioning my family, who have always supported
and encouraged me throughout my life. Their love has always been the strongest source of
motivation for me and I simply wouldn’t be where I am today without them. Thank you
Aai and Papa for dedicating your lives to create a bright future for your children. My sister
Supriya and brother Deepak, deserve a special note of thanks for being my best friends all
through my life.
I would also like to thank my friends, who have always supported and cared for me in my
good as well as difficult times. Thank you everyone - Pooja P, Divya, Pooja A, Shubhangi,
Praveen, Sharayu, Aarti, Amrapali, Meeta, Ambuj, Rajat, Abhranil and Aditya.
I also deeply appreciate the help from my coworkers and lab mates, Abhranil Maiti, Xu Guo,
Zhimin Chen, Srikrishna Iyer, Lyndon Judge and Mostafa Taha. Without help from Lyndon
and Mostafa, my research would not have been so smooth.
Contents
1 Introduction
  1.1 Side Channel Analysis (SCA) Secure System
    1.1.1 Introduction
    1.1.2 Motivation
    1.1.3 Contribution
  1.2 ECDLP: Introduction
    1.2.1 Motivation
    1.2.2 Contribution
  1.3 Organization
  1.4 Related Articles
2 SCA-resistant CPU: Preliminaries
  2.1 Side Channel Attacks (SCA)
    2.1.1 SCA concept
    2.1.2 Differential Power Analysis (DPA)
    2.1.3 Measurements to Disclosure (MTD)
  2.2 SCA Countermeasures
    2.2.1 Principle of Dual-rail Precharge logic (DPL)
  2.3 Block Ciphers and AES Algorithm
    2.3.1 AES T-Box
  2.4 Customized Processor and Custom Instructions
3 SCA-resistant CPU: Implementation
  3.1 SCA-resistant Data organization
  3.2 Memory Organization for Lookup Tables
  3.3 System Integration
    3.3.1 SCA-resistant Custom Instruction Set
    3.3.2 System on Chip Configuration (SOPC)
4 SCA-resistant CPU: Results Analysis
  4.1 Experimental Setup
  4.2 Single T-Box Experiment
  4.3 AES Prototype
  4.4 Impact of PCB Layout
  4.5 Related Implementations
  4.6 Summary of Contribution
5 ECDLP Engine: Background
  5.1 Elliptic Curve Cryptography (ECC)
  5.2 Pollard-rho Algorithm
    5.2.1 Parallelization
  5.3 Modular Arithmetic
  5.4 Related Work
6 ECDLP Engine: Implementation
  6.1 Modular Multiplication Architecture
    6.1.1 Modular Multiplication
    6.1.2 Dedicated Square Unit
    6.1.3 Vectorized Inversion
  6.2 System Architecture
    6.2.1 Nallatech Platform
    6.2.2 Software Implementation
    6.2.3 Hardware Implementation
  6.3 Implementation Results
    6.3.1 Overall Performance
    6.3.2 Comparison with Previous Software Implementations
    6.3.3 Comparison with Hardware Implementation
  6.4 Summary of Contributions
7 Conclusion
List of Figures
1.1 SCA resistant design by (a) C source code transformation, (b) Dedicated secure logic styles and (c) Customized CPU
2.1 SCA Concept
2.2 Correlation based DPA attack: Example
2.3 Comparison between CMOS standard AND gate and DPL AND gate; (a) A standard AND has data-dependent power dissipation; (b) A DPL AND gate has a data-independent power dissipation
2.4 128-bit AES Dataflow. Shaded operations belong to single T-Box Operation
2.5 Customized Processor Architecture: NiosII
3.1 Balanced Interleaved data format
3.2 Balanced-Interleaved T-Box Organization
3.3 System-on-chip Configuration (SOPC)
4.1 Setup for SCA
4.2 Single T-Box Experiment
4.3 Security Improvement: Single T-Box test
4.4 AES-TBOX implementation: NiosII/s + SDRAM
4.5 Attack results on secure implementation: NiosII/s + SRAM. Trace of correct key guess (here, first key byte) is plotted in black, while all other key guesses are in yellow (gray). The buried trace means unsuccessful attack.
4.6 AES trace on oscilloscope (SDRAM configuration).
4.7 AES trace on oscilloscope (SSRAM configuration).
4.8 Impact of PCB Layout on Residual Leakage
5.1 Standard Pollard-rho attack
5.2 Reduction with prime p, where p = 2^N − r
6.1 Polynomial Representation of Data
6.2 Standard Multiplication Method
6.3 Reduction with Adder-Chain
6.4 Modular Multiplication Architecture
6.5 Dedicated Square Architecture
6.6 Nallatech System
6.7 System Architecture
List of Tables
3.1 SCA-resistant Instruction Set for AES
4.1 AES Implementation: Security
4.2 AES Implementation: Area and Performance
4.3 Related work: Comparison
6.1 ECC Arithmetic Operations
6.2 Micro-instruction Sequence for Point addition: P + Q
6.3 Comparison with Software Implementations
6.4 Comparison with Hardware Implementations (per ECC core)
Chapter 1
Introduction
Modern security systems use cryptographic algorithms to provide confidentiality, integrity
and authentication of data. These cryptographic algorithms fall into two broad categories:
symmetric cryptography and asymmetric cryptography. They use mathematically complex
and difficult operations to achieve the desired level of security. Another research field in security deals with cryptanalytic techniques, which attack secure systems to extract secret
keys by exploiting a weakness of a cryptographic algorithm, or a weakness of its implementation at the hardware/software level. In this thesis, we discuss two interesting problems in the
field of hardware security.
1.1
Side Channel Analysis (SCA) Secure System
1.1.1
Introduction
Today, many embedded electronic systems, including RFIDs, smart cards, wireless car keys,
smart phones, tablets, etc., are used to represent personal identification, to store private
information, and to do confidential communications. As a result, information security on
embedded systems has become crucially important. Although the security of such systems
relies on the computational complexity of the underlying cryptographic algorithms,
implementing them in either software or hardware usually leaks additional information,
which can be used to break the cryptographic systems easily. One such technique, the side-channel analysis (SCA) attack [24], extracts the secret keys from cryptographic
algorithms within a very short time by exploiting side-channel information such as execution
time, power consumption, or electromagnetic emissions.
Over time, the field of SCA has been intensively researched and the literature shows a long
string of attacks and counterattacks against countermeasures [30]. Several recent results
include the use of SCA to extract keys from Virtex-II FPGA [28] and Virtex-4/5 FPGA
[29] bitstream encryption, from the Mifare DESFire contactless card [22], from the Keeloq
keyless entry system [14], and from the Atmel Cryptomemory non-volatile memory [4]. It is,
thus, crucially important to consider SCA attacks as a part of the threat model of a design
and to introduce suitable SCA countermeasures to hamper these attacks.
With an embedded processor at the heart of these systems, it is desirable to have an easy-to-design SCA-resistant embedded processor. As illustrated in Figure 1.1, the approaches to
implement side-channel countermeasures in the context of embedded processors broadly fall
into three categories. In the first category, leakage is handled at the software level by transforming
the crypto-algorithm into a side-channel leakage-free implementation, for example [11].
This countermeasure is usually algorithm-specific, and requires in-depth understanding of
cryptographic operations. The second approach targets the underlying hardware technology and
implements the CPU in an SCA-resistant circuit style. Past research has shown that these
techniques are very expensive in hardware, costing 3 to 15 times the original circuit area [30],
and thus are not applicable to a complete CPU. The third approach uses a customized CPU, where
the cryptographic operations are implemented in custom hardware using a secure logic style.
This approach does not require conversion of the complete processor to a secure logic style,
which saves on the area cost. Hence, the third approach shows a decent trade-off between
security and performance, while keeping the cost within a reasonable limit.
Figure 1.1: SCA resistant design by (a) C source code transformation, (b) Dedicated secure
logic styles and (c) Customized CPU (custom instructions and memory organization). Each
option is rated on performance, circuit area, SCA resistance and C complexity.
1.1.2
Motivation
The design and implementation of a side-channel countermeasure is a very complex and
error-prone process because side-channel leakage is a byproduct of the implementation of
a cryptographic algorithm. Predicting the amount of side-channel leakage from, say, cryptographic software in C is difficult. Nonetheless, embedded system design needs to satisfy
several constraints such as low area, low power consumption, small software footprint and
low cost, while having to maintain high performance. Our work is motivated by the need
for an easy-to-use countermeasure, applicable to a wide range of designs and usable within a
standard FPGA design flow. The objective is to systematically remove side-channel leakage
while keeping a reasonable cost in circuit area and performance degradation. We consider
protection of a general class of block ciphers that use logic operations and lookup tables.
This includes AES, DES, and many others. We propose our methodology in the context
of embedded designs with a CPU, and develop side-channel resistance for cryptographic
software executing on the processor.
1.1.3
Contribution
This thesis presents a secure embedded system design based on a customized CPU, which uses
an SCA-resistant custom instruction set and an optimized memory organization. The design
configuration is supported by a soft-core CPU in mainstream FPGA families, and its SCA
resistance is derived from dual-rail precharge logic (DPL). The solution uses a balanced-interleaved data format, combined with a novel memory organization to support both logic
operations and lookup tables. The resulting countermeasure applies to a broad class
of block ciphers. We demonstrate our results for a 128-bit Advanced Encryption Standard
(AES) T-box implementation and show an SCA-resistance improvement of more than 400X
for a system-wide electromagnetic attack that covers both the FPGA and off-chip memory
(SSRAM). This comes at an overhead of 2.7X in performance and 1.15X in area.
Our work is not the first to suggest a customized CPU for side-channel resistant implementations; previous proposals have included masking-based [37, 5] and hiding-based [33, 12]
designs. However, using comparisons with related work, we demonstrate that our solution
represents an excellent trade-off between SCA resistance, (software and hardware) design
complexity, performance, and circuit area cost.
1.2
ECDLP: Introduction
The security of symmetric and asymmetric ciphers is usually determined by their security
parameters, for example, the size of the key. This is because the computational complexity of an
algorithm is higher for a longer key size. Elliptic curve cryptosystems (ECC), independently
introduced by Miller [26] and Koblitz [23], have now found a significant place in the
academic literature and practical applications. ECC is a type of public-key cryptography based
on the algebraic structure of elliptic curves over finite fields (either binary or prime). Its
popularity is mainly due to its shorter key sizes, which offer the same level of security
as other conventional cryptosystems such as RSA.
The security of ECC relies on the difficulty of the Elliptic Curve Discrete Logarithm Problem (ECDLP) [7]. It refers to the ability to compute a point multiplication
and the inability to compute the multiplicand given the original and product points. By
definition, ECDLP is to find an integer n for two points P and Q on an elliptic curve E such
that
Q = [n]P    (1.1)
Here, [n] denotes the scalar multiplication with n. The Pollard rho method [32], [10] is
the strongest known attack against ECC today. This method solves ECDLP by generating
points on the curve iteratively, each of which has the form X = [a]P + [b]Q. When the
same point is encountered twice for different a and b, a collision occurs and the ECDLP
is solved.
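To make the collision argument concrete, the following is a minimal Python sketch on a toy curve, not the 112-bit curve or the hardware architecture of this thesis. If [a1]P + [b1]Q = [a2]P + [b2]Q with b1 ≠ b2 modulo the group order, then n = (a1 − a2)/(b2 − b1) modulo that order. The toy curve y^2 = x^3 + 2x + 2 over F_17 with P = (5, 1) of prime order 19, the three-way partition in `step`, Floyd cycle detection, and all function names are assumptions chosen for illustration.

```python
import random

# Toy parameters (assumption for illustration): y^2 = x^3 + 2x + 2 over F_17,
# base point G = (5, 1) generating a group of prime order 19.
P_MOD, A = 17, 2
G, ORDER = (5, 1), 19
INF = None  # point at infinity


def add(p1, p2):
    """Affine point addition on the toy curve."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def mul(n, pt):
    """Scalar multiplication [n]pt by double-and-add."""
    acc = INF
    while n:
        if n & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        n >>= 1
    return acc


def step(X, a, b, Q):
    """One pseudo-random walk step that keeps X = [a]G + [b]Q."""
    s = 0 if X is INF else X[0] % 3
    if s == 0:
        return add(X, G), (a + 1) % ORDER, b
    if s == 1:
        return add(X, X), (2 * a) % ORDER, (2 * b) % ORDER
    return add(X, Q), a, (b + 1) % ORDER


def pollard_rho(Q, rng):
    """Find n with Q = [n]G from a collision of two walk positions."""
    while True:
        a, b = rng.randrange(ORDER), rng.randrange(ORDER)
        tortoise = hare = (add(mul(a, G), mul(b, Q)), a, b)
        while True:
            tortoise = step(*tortoise, Q)
            hare = step(*step(*hare, Q), Q)
            if tortoise[0] == hare[0]:
                break
        (_, a1, b1), (_, a2, b2) = tortoise, hare
        if (b2 - b1) % ORDER:  # otherwise the collision is useless; restart
            return (a1 - a2) * pow((b2 - b1) % ORDER, -1, ORDER) % ORDER


secret = 7
Q = mul(secret, G)
recovered = pollard_rho(Q, random.Random(1))
print(recovered, mul(recovered, G) == Q)
```

The walk restarts from a fresh random point whenever the collision gives b1 = b2, since that case carries no information about n.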
1.2.1
Motivation
There have been different approaches to implement Pollard rho algorithm on software and
hardware platforms. Most of the solutions are implemented on software platforms using
general purpose workstations, such as clusters of PlayStation3 [9], Cell CPUs [8], GPUs [3].
These software approaches are inherently limited by the sequential nature of software on
the target platform. Programmable hardware platforms are an attractive alternative to the
above because they efficiently support parallelization. However, most of the FPGA-based
solutions that have been proposed, do not deal well with the control complexity of ECDLP.
Instead, they focus on the efficient implementation of datapath operations, and ignore the
system integration aspect of the solution.
There has been little work in the area of supporting or accelerating a full Pollard rho algorithm on a hardware-software platform [25]. Our solution, therefore, goes one step further as
we demonstrate the parallelized Pollard rho algorithm on FPGA along with its integration
to a software driver.
1.2.2
Contribution
We start with a reference software implementation [6], and demonstrate an efficient, parallel
implementation of the ECDLP machine over a prime field of the form (2^k − q)/m. We use a
novel, generalized architecture for polynomial-basis multiplication over a prime field. The
resulting modular multiplier completes the multiplication within 14 clock cycles, which is a
2.5X lower latency than earlier work [19]. The complete system design takes 151 cycles per
Pollard rho step at 100 MHz and performs up to 660K point additions per second per ECC
core. A single ECC core occupies 4773 slices on a Virtex-5 FPGA device. With a multi-core
implementation of our design, the performance can be comparable with that of the software
implementation on a Cell processor [6]. Our work also shows that implementation of prime
field arithmetic in hardware can be as feasible as binary field arithmetic.
1.3
Organization
This thesis is organized to cover the details of two problems and their solutions. Chapters 2, 3 and
4 discuss the solution for the first problem on hardware security, i.e. the implementation of an
SCA countermeasure on a soft-core CPU. The details of the second solution, i.e. the implementation
of the ECDLP engine, are explained in Chapters 5 and 6.
The individual chapters are structured as follows: Chapter 2 introduces preliminary knowledge that is needed by the first solution, such as SCA attacks and countermeasures, the Dual-rail
Precharge Logic (DPL) principle, the Advanced Encryption Standard (AES) algorithm and custom
instruction support for a CPU. Chapter 3 presents the implementation details of our SCA-resistant processor design. We evaluate our solution and analyze the results in Chapter
4.
Chapter 5 presents preliminaries required for the second problem, i.e. the ECDLP engine implementation. It covers the background of ECC, ECDLP and the Pollard rho method, and
discusses modular arithmetic operations over a prime field. The implementation details
of the ECDLP engine are provided in Chapter 6, where the results are also compared with previous
implementations.
Finally, Chapter 7 summarizes the contributions of our work and identifies potential future
targets.
1.4
Related Articles
Our work is described in the following papers:
• S. Mane, L. Judge, P. Schaumont, "An Integrated Prime-field ECDLP Hardware Accelerator with High-performance Modular Arithmetic Units", ReConFig 2011.
• S. Mane, M. Taha, P. Schaumont, "Efficient and Side-Channel-Secure Block Cipher
Implementation with Custom Instructions on FPGA", under review, FPL'12.
• L. Judge, S. Mane, P. Schaumont, "A Hardware Accelerated ECDLP with High-performance Modular Multiplication", special issue of International Journal of Reconfigurable Computing (IJRC), 2011 (under review).
Chapter 2
SCA-resistant CPU: Preliminaries
In this chapter we give a brief overview of SCA attacks and countermeasures, the AES
algorithm and architecture of customized processors.
2.1
Side Channel Attacks (SCA)
Cryptographic algorithms are considered secure because of their inherent computational
complexity. A conventional brute-force key search needs a huge attack effort and thousands
of years to break them. However, these algorithmic security features alone are
not enough to guarantee security. Passive attacks on cryptographic devices, such as SCA
attacks, exploit the weakness of the implementation platform of a cryptographic algorithm
and break it within a few hours. They reveal the secret key part by part, using the side-channel
leakage of a device such as execution time, power consumption and electromagnetic radiation. Hence, SCA poses a serious threat to secure embedded systems. The concept of
SCA can be explained as follows.
Figure 2.1: SCA Concept. Inside the trust boundary, the cryptographic device runs the
encryption routine (r) on the plaintext (P) with the key (K) to produce the cipher-text (C).
Its electromagnetic radiation (side-channel leakage) is captured by an oscilloscope, and the
digitized traces are passed to an analysis algorithm that outputs the secret key.
2.1.1
SCA concept
In the passive type of cryptanalytic attacks, the cryptographic device is operated largely or
entirely within its specification and the physical properties of the device are observed to find
the secret key. These observable physical properties are referred to as side-channel leakage and
the attacks are known as side-channel attacks (SCA).
The basic idea can be explained as shown in Figure 2.1, where a cryptographic device is
executing an encryption algorithm (r). It takes the plaintext (P) as an input and converts
it into ciphertext (C) by using the internal secret key (K), i.e. C = r(P, K). As the
secret key K is stored in the device hardware, it is not observable on the input/output
ports. The goal of SCA is to find this K. Here, we assume that an attacker knows the
underlying cryptographic algorithm and has access to the input/output data (plaintext
and ciphertext).
Suppose, in this example, the cryptographic algorithm r is the AES with a 128-bit key
and the side-channel property observed is the electromagnetic (EM) emission. A conventional brute-force attack needs to search through all 2^128 guesses to find the correct key. On the contrary,
SCA targets the secret key in pieces instead of trying to break all 128 bits together. It takes
advantage of the fact that not all key bits are used simultaneously in a cryptographic
algorithm. Some intermediate values depend only on a part of the key. For example, the SubBytes
operation of the AES works at byte level and hence, each output byte of this operation is
calculated based on one byte of plaintext and one byte of the secret key. Consequently, the
EM radiation strongly relates to one key byte at a specific point of time in the AES execution
window. SCA uses this information to compare the recorded EM trace against 256 possible
key guesses and reveals the correct key byte. The procedure is repeated for all key bytes.
This breaks down the search space to 16 × 2^8, which is much less than that for a brute-force
attack.
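The reduction in search space can be checked with a few lines of Python. This is only a back-of-the-envelope illustration of the arithmetic in the paragraph above; the variable names are chosen for this sketch.

```python
KEY_BITS = 128   # AES-128 key length
BYTE_BITS = 8    # SubBytes works on one byte at a time

# Brute force: every 128-bit key is a separate guess.
brute_force_guesses = 2 ** KEY_BITS

# SCA: 16 key bytes, each attacked independently with 2^8 hypotheses.
key_bytes = KEY_BITS // BYTE_BITS
sca_guesses = key_bytes * 2 ** BYTE_BITS

print(sca_guesses)                                      # 4096
print(brute_force_guesses // sca_guesses == 2 ** 116)   # True
```

In other words, attacking the key byte by byte needs 2^12 hypotheses in total, a factor of 2^116 fewer than the exhaustive search.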
The EM traces are captured with the help of an oscilloscope and are then processed by
analysis algorithms. In the general case, an attacker has to select an algorithm-specific
intermediate value, which depends on part of the secret key and is also observable through
side-channel leakage. In the above example, the output of a SubBytes operation is chosen as the
intermediate value.
2.1.2
Differential Power Analysis (DPA)
There are different ways to analyze the recorded side-channel information to extract secret
keys, such as simple power analysis (SPA), differential power analysis (DPA) [34], and mutual
information analysis. The goal of the DPA attack is to reveal the secret keys of a cryptographic
device based on a large number of power traces that have been recorded while the device
encrypts or decrypts different plaintexts.
The first step is to choose an intermediate result v of a cryptographic algorithm, which is a
function of the plaintext (P) and the key (K). In the second step, a large number of measurements
of the power consumption (or electromagnetic radiation) of a device are recorded while it
encrypts (or decrypts) D different plaintexts. These traces can be written as a matrix A of size
D × T, where T is the length of a trace.
The next step of the attack is to calculate a hypothetical intermediate value for every possible
choice of the key. These choices form a vector k = (k_1, k_2, ..., k_N), where N is the total number of possible keys. We
refer to the elements of this vector as key hypotheses. Knowing the plaintext, an attacker can calculate
intermediate values for all D encryptions and for all N key hypotheses, which results in a
matrix V of size D × N. In the next step, these hypothetical intermediate values (matrix
V) are mapped to hypothetical power consumption values (matrix H) using a power model.
An attacker can obtain a power model using knowledge of the analyzed device and simulation
techniques. The most commonly used power models are Hamming weight and Hamming
distance. We use the Hamming weight power model in our experiments.
In the final step, each column h_i of the matrix H is compared with each column a_j of
A using a correlation technique. This means that the attacker correlates the hypothetical
power consumption values of each key hypothesis with the recorded traces at every position.
The result of this comparison is a matrix R of size N × T, where each element r_{i,j} represents
a correlation coefficient. The element of the matrix R with the highest correlation value
reveals the correct key.
Figure 2.2 shows an example of 256 correlation coefficient traces. The trace shown in red
has a distinctly higher correlation than the other 255 key guesses and thus it corresponds to the
correct key byte. This attack is also referred to as a correlation power attack (CPA). We collect
electromagnetic radiation and analyze it with the CPA technique in our experiments.
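As an illustration only, the attack steps above can be sketched in Python on simulated data. This is not the measurement setup or analysis code used in this thesis: the trace generator, the Gaussian noise level, the trace length, the leaking sample position and all function names are assumptions made for the sketch. The intermediate value is the AES S-box output, and the power model is the Hamming weight.

```python
import math
import random


def build_aes_sbox():
    """Generate the AES S-box: GF(2^8) inverse followed by the affine map."""
    def gmul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11B  # AES polynomial x^8 + x^4 + x^3 + x + 1
            b >>= 1
        return r

    inverse = [0] * 256
    for x in range(1, 256):
        inverse[x] = next(y for y in range(1, 256) if gmul(x, y) == 1)
    sbox = []
    for x in range(256):
        b = inverse[x]
        s = b
        for i in range(1, 5):  # XOR of the four left rotations of b
            s ^= ((b << i) | (b >> (8 - i))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_aes_sbox()


def hamming_weight(x):
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def simulate_traces(key_byte, num_traces, trace_len, leak_at, sigma, rng):
    """Simulated D x T matrix A: noise everywhere, HW leakage at one sample."""
    plaintexts = [rng.randrange(256) for _ in range(num_traces)]
    traces = []
    for p in plaintexts:
        row = [rng.gauss(0.0, sigma) for _ in range(trace_len)]
        row[leak_at] += hamming_weight(SBOX[p ^ key_byte])
        traces.append(row)
    return plaintexts, traces


def cpa(plaintexts, traces):
    """Return the best key hypothesis and the N x T correlation matrix R."""
    columns = [[row[j] for row in traces] for j in range(len(traces[0]))]
    R = []
    for k in range(256):
        h = [hamming_weight(SBOX[p ^ k]) for p in plaintexts]  # column of H
        R.append([pearson(h, col) for col in columns])
    best = max(range(256), key=lambda k: max(R[k]))
    return best, R


rng = random.Random(2012)
pts, A = simulate_traces(key_byte=0x3C, num_traces=300, trace_len=8,
                         leak_at=5, sigma=1.0, rng=rng)
guess, R = cpa(pts, A)
print(hex(guess), R[guess].index(max(R[guess])))
```

With this amount of simulated noise, the row of R belonging to the correct key byte should peak clearly at the leaking sample, which mirrors the single distinct trace in Figure 2.2.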
Figure 2.2: Correlation based DPA attack: Example
2.1.3
Measurements to Disclosure (MTD)
There is no direct way to quantify the side-channel leakage. Hence, it needs to be expressed using indirect parameters such as correlation coefficient values or the number of
measurements required for a successful attack. The most widely used approach is to use
Measurements to Disclosure (MTD). It represents the number of measurements required to
attack a cryptographic device successfully. A higher MTD implies a more secure design.
SCA uses statistical methods to analyze the acquired samples, which give better results as
the number of samples increases. Needing more samples means that the attacker has to invest more
time and more attack effort. We use MTD to evaluate the security gain in our experiments.
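One possible way to estimate MTD from recorded data is sketched below in Python. It is a toy model, not the evaluation procedure of this thesis: it assumes a 4-bit key, an XOR intermediate value with Hamming-weight leakage, Gaussian noise, and a check every 20 traces. MTD is taken here as the smallest trace count from which the correct key stays ranked first. All names are hypothetical.

```python
import math
import random


def hw(x):
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def best_guess(plaintexts, leaks):
    """Key hypothesis whose predicted leakage correlates best with the data."""
    return max(range(16),
               key=lambda k: pearson([hw(p ^ k) for p in plaintexts], leaks))


def measurements_to_disclosure(plaintexts, leaks, true_key, step=20):
    """Smallest trace count from which the attack keeps succeeding, or None."""
    mtd = None
    for n in range(step, len(leaks) + 1, step):
        if best_guess(plaintexts[:n], leaks[:n]) == true_key:
            if mtd is None:
                mtd = n
        else:
            mtd = None  # the attack failed again, so disclosure is not stable
    return mtd


def simulate(true_key, sigma, count, seed):
    """Toy leakage: Hamming weight of (plaintext XOR key) plus Gaussian noise."""
    rng = random.Random(seed)
    pts = [rng.randrange(16) for _ in range(count)]
    leaks = [hw(p ^ true_key) + rng.gauss(0.0, sigma) for p in pts]
    return pts, leaks


for sigma in (0.5, 2.0):
    pts, leaks = simulate(true_key=0xA, sigma=sigma, count=1000, seed=1)
    print(sigma, measurements_to_disclosure(pts, leaks, 0xA))
```

Running the same estimate on an unprotected and a protected design, and taking the ratio of the two MTD values, gives the kind of security gain figure quoted in this thesis.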
2.2
SCA Countermeasures
Power attacks and electromagnetic radiation attacks are based on the dependency between
power consumption (or electromagnetic radiation) and intermediate values in a cryptographic
algorithm. The goal of SCA countermeasures is to obtain data-independent power consumption (or electromagnetic radiation). To prevent SCA, there
are two broad categories of countermeasures: masking and hiding. In this thesis, we base
our secure solution on hiding.
Masking can be implemented at the algorithm level, without changing the low-level hardware,
to make the power characteristics of a device independent of the processed data. It achieves SCA-resistance by randomizing the intermediate values of the cryptographic algorithm. The
idea is to mask an intermediate value with a randomly generated mask m, where m needs
to remain secret to achieve the expected SCA-resistance. This countermeasure is usually
algorithm-specific, and requires in-depth understanding of cryptographic operations. Moreover, masking becomes very complex under advanced SCA techniques [18].
The goal of hiding is to suppress side-channel leakage by designing the cryptographic devices such that they have the same power consumption and electromagnetic radiation regardless of the
sensitive data they manipulate. Hiding allocates the task of eliminating data dependency to
the device hardware. Because it does not depend on the statistical characteristics of the intermediate values, hiding does not suffer from higher-order attacks. There are several different
approaches to implement hiding. One of them is differential logic, also called Dual-Rail
Precharge Logic (DPL). The SCA-resistance in our solution is based on the principle of DPL.
2.2.1 Principle of Dual-rail Precharge Logic (DPL)
The cause of side-channel leakage is data-dependent processing. In CMOS logic, such processing gives data-dependent signal transitions, which in turn results in data-dependent power
consumption or radiation. The idea of Dual-rail Precharge Logic (DPL) is to eliminate
side-channel leakage at the level of the implementation.
Figure 2.3: Comparison between CMOS standard AND gate and DPL AND gate; (a) A
standard AND gate has data-dependent power dissipation; (b) A DPL AND gate has data-independent power dissipation. (P: precharge phase, E: evaluation phase)
The concept of DPL is explained as follows. First, every bit in the circuit is stored and
processed in complementary form. For example, as shown in Figure 2.3, for the logic operation
AND (a and b), there is a matching complementary operation OR (not(a) or not(b)).
Both of these gates are evaluated simultaneously. This ensures that the load of the AND gate and
that of the OR gate are identical. Thus, 0 → 1 and 1 → 0 transitions on a signal c cannot be
distinguished.
However, a transition on a signal can still be distinguished from a non-transition. To address this
issue, a pre-charge phase is introduced before each evaluation phase. Every complementary
data pair (a, not(a)) is pre-charged to (0,0) before every evaluation. Combined, the
evaluation and precharge phases result in a constant power consumption: every
evaluation has exactly one active 0 → 1 transition, either on the true net or on the complementary
net.
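The precharge and evaluation behaviour can be checked with a small software model. The sketch below is only a logical illustration under stated assumptions (the type and function names are hypothetical, and no electrical effects are modelled): it counts the rising edges on the output pair of a DPL AND gate over one precharge/evaluate cycle.

```c
#include <stdint.h>

/* One wire pair: t is the true net, f is the complementary net. */
typedef struct { uint8_t t, f; } dpl_bit;

/* Encode a logic value as a complementary pair (evaluation-phase value). */
static dpl_bit dpl_encode(uint8_t v) {
    dpl_bit b = { (uint8_t)(v & 1u), (uint8_t)((v & 1u) ^ 1u) };
    return b;
}

/* DPL AND: the true net computes a AND b, the complementary net
 * computes not(a) OR not(b). Precharged inputs (0,0) give output (0,0). */
static dpl_bit dpl_and(dpl_bit a, dpl_bit b) {
    dpl_bit c = { (uint8_t)(a.t & b.t), (uint8_t)(a.f | b.f) };
    return c;
}

/* Count 0->1 transitions on the output pair over one precharge/evaluate cycle. */
static int dpl_and_rising_edges(uint8_t a, uint8_t b) {
    dpl_bit zero = { 0, 0 };
    dpl_bit pre  = dpl_and(zero, zero);                    /* precharge phase  */
    dpl_bit eval = dpl_and(dpl_encode(a), dpl_encode(b));  /* evaluation phase */
    return (int)(!pre.t && eval.t) + (int)(!pre.f && eval.f);
}
```

For all four input combinations the function returns 1, which is the data-independent transition count that the precharge scheme aims for.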
DPL has been applied in many different forms since it was first proposed, including ASIC,
FPGA, and software [38, 18, 12]. Authors have also identified sources of residual leakage,
including early evaluation and imbalance between complementary pairs [17]. However, DPL
has demonstrated substantial reduction of side-channel leakage in prototypes.
2.3 Block Ciphers and AES Algorithm
Block ciphers use symmetric-key cryptographic algorithms to encrypt a block of plaintext into
ciphertext through successive round transformations. The majority of modern block ciphers
are built out of a limited set of operations, including substitutions with lookup tables (S-Boxes), and operations such as XOR, modular addition, rotation, and shift. Furthermore,
round transformations have a common structure, and use either a substitution-permutation
network (SPN) or a Feistel network. Of course, within this framework, there are important
differences among block ciphers as well, such as the number and size of lookup tables used,
and the detailed configuration of the operations.
The Advanced Encryption Standard (AES) [13] is a widely used symmetric block cipher,
which supports cryptographic keys of 128, 192, and 256 bits. The 128-bit AES
algorithm takes a block of 128 bits of plaintext and, using a 128-bit key, iterates that data
through 10 rounds to produce the ciphertext. An AES state (i.e. a block of 128 bits of data) is
organized as a 4x4 matrix of 16 bytes and is processed by the AES round. Each AES round
includes four operations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. Figure 2.4
describes the dataflow of AES.
2.3.1 AES T-Box
There are two different ways to implement the AES algorithm, which differ in how they perform the
substitution operations. The basic implementation, AES S-Box, performs these substitutions
using a 256-entry, 8-bit wide lookup table (S-Box). The more optimized design, AES T-Box [13], computes
one complete round of AES using only lookup tables followed by a large XOR network.
Figure 2.4: 128-bit AES Dataflow. Shaded operations belong to a single T-Box operation.
Figure 2.4 illustrates the dataflow within a single AES round. Each of the S-operations in
Figure 2.4 is a 256-entry, 8-bit wide substitution table. The S-Box operation on an input byte is calculated
by obtaining its finite-field inverse (over $GF(2^8)$) followed by an affine transformation [13].
Each AES-128 round consists of 16 such S-Box substitutions. This leads to byte-oriented computations over the entire round, making it inefficient for 32-bit architectures.
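The S-Box definition (field inversion followed by an affine transformation) can be sketched directly in C. This is an unoptimized reference model with hypothetical function names, not the table-based implementation used on the processor; the inverse is found by exhaustive search for clarity.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1u) p ^= a;
        uint8_t carry = a & 0x80u;
        a = (uint8_t)(a << 1);
        if (carry) a ^= 0x1Bu;
        b >>= 1;
    }
    return p;
}

/* Multiplicative inverse by exhaustive search; 0 maps to 0 by convention. */
static uint8_t gf_inv(uint8_t a) {
    if (a == 0) return 0;
    for (int x = 1; x < 256; x++)
        if (gf_mul(a, (uint8_t)x) == 1) return (uint8_t)x;
    return 0; /* not reached for nonzero a */
}

/* Rotate an 8-bit value left by n positions (0 < n < 8). */
static uint8_t rotl8(uint8_t v, int n) {
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* S-Box: field inversion followed by the AES affine transformation. */
static uint8_t aes_sbox(uint8_t a) {
    uint8_t x = gf_inv(a);
    return (uint8_t)(x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63u);
}
```

Evaluating aes_sbox for all 256 inputs yields the content of the substitution table.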
A more efficient way to implement AES on 32-bit architectures is to use a T-Box table, which
merges the shaded operations in Figure 2.4 into a single lookup operation. The SubBytes and
MixColumns operations are reformulated together and implemented as a 32-bit wide, 256-deep
T-Box table. A T-Box maps byte-wide operations over four bytes, each of which represents
part of a MixColumns result. In each AES round, four T-Box operations can be combined
to obtain a single column of the AES state matrix. This requires four different T-Box
lookup tables. This approach results in a better utilization of the 32-bit datapath of a
processor. We implement the 128-bit AES T-Box algorithm to investigate our solution.
Figure 2.5: Customized Processor Architecture: NiosII (ALU with operands A and B, a result output, and attached custom logic)
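The following sketch shows how a T-Box entry can be derived from an S-Box output and how four lookups combine into one 32-bit word of the next state (a column in the standard AES state layout). The byte packing order and all function names are assumptions for illustration; the other three tables are modelled here as byte rotations of the first one.

```c
#include <stdint.h>

/* Multiply by x (i.e. by 2) in GF(2^8) with the AES reduction polynomial. */
static uint8_t xtime(uint8_t s) {
    return (uint8_t)((s << 1) ^ ((s & 0x80u) ? 0x1Bu : 0x00u));
}

/* T0 entry for an S-Box output s: the MixColumns contribution (2s, s, s, 3s),
 * packed most-significant byte first (packing order is an assumption). */
static uint32_t tbox0_entry(uint8_t s) {
    uint8_t s2 = xtime(s);
    uint8_t s3 = (uint8_t)(s2 ^ s);
    return ((uint32_t)s2 << 24) | ((uint32_t)s << 16) | ((uint32_t)s << 8) | s3;
}

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t w, unsigned n) {
    return (w >> n) | (w << (32u - n));
}

/* One output word of a round: four lookups, three XORs, plus the round key.
 * t0 is a 256-entry T0 table; b0..b3 are the state bytes selected by ShiftRows. */
static uint32_t round_column(const uint32_t *t0, uint8_t b0, uint8_t b1,
                             uint8_t b2, uint8_t b3, uint32_t round_key) {
    return t0[b0] ^ rotr32(t0[b1], 8) ^ rotr32(t0[b2], 16)
         ^ rotr32(t0[b3], 24) ^ round_key;
}
```

Each lookup returns a full 32-bit word, which is why this formulation suits a 32-bit datapath better than sixteen separate byte substitutions.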
2.4 Customized Processor and Custom Instructions
The demands of embedded system design, such as high performance, low cost, a short time-to-market window, and design flexibility, have resulted in the emergence of Instruction-set
Extensible Processors, or Customized Processors. Such a processor consists of an existing processor core
that is extended with application-specific custom instructions. These custom instructions
execute on reconfigurable custom hardware and allow the user to reduce a complex sequence
of standard instructions to a single instruction implemented in hardware.
The idea is to implement the desired custom logic in custom hardware attached to an existing processor, and to call these functions from software using custom instructions. Figure 2.5
shows a general architecture of a customized processor. An optimized hardwired custom logic implementation helps to improve performance through parallelism and chaining of operations.
At the same time, custom instructions result in compact code size as well as a smaller number of
instruction fetches and decodes.
Altera provides the instruction-set extensible softcore processor NiosII, which is compatible with their
CycloneII FPGA devices. NiosII supports a user-friendly software interface to access custom hardware using built-in user-defined macros. These custom instructions can be
invoked directly from C code. We use the softcore NiosII processor for our experiments.
Chapter 3
SCA-resistant CPU: Implementation
We implement a side-channel resistant block cipher by creating DPL versions of both the
lookup tables and the logic operations in hardware. These modules are efficiently
integrated into the soft-core processor using the custom-instruction interface. This way,
SCA-resistant block ciphers can be executed as a sequence of custom instructions. Non-crypto software, on the other hand, is written using the regular instruction set without a
performance hit. The custom-instruction hardware for lookup tables is built from on-chip
RAM macros. Research has demonstrated that such dedicated structures increase side-channel resistance [35], and we further improve this technique.
In this section, we discuss the components of our solution: the organization of data, the
memory organization for lookup tables, and the system integration of SCA-resistant block
cipher hardware into software.
3.1 SCA-resistant Data Organization
We need a data format that is compatible with the requirements of DPL and uses the word-level organization of an embedded system. Figure 3.1 shows our data arrangement. Each
32-bit word is split into two balanced half-words, and each bit from the original word is
interleaved with an associated complementary bit. We call this representation a balanced-interleaved (BI) format. The logical and physical proximity of complementary bits improves
symmetry between the bits (e.g. similar electrical loads), and in turn, this improves SCA
resistance.
Figure 3.1: Balanced Interleaved data format. The unbalanced full word (b31..b0) is split into a balanced-interleaved lower half-word (b15..b0, each bit paired with its complement) and a balanced-interleaved upper half-word (b31..b16, each bit paired with its complement).
Indeed, at the logical level, adjacent bits will share adjacent storage locations. In embedded
architectures, the storage organization may use a wordlength which is different from the processor wordlength; a 32-bit memory may be organized, for example, as two half-word banks.
Keeping complementary bits adjacent ensures that they will share the same physical storage
bank. Furthermore, at the physical level, adjacent bits in a bus structure have a better
chance of being routed with equal wire lengths on the FPGA and PCB. This is important
because we are building a generic design flow that does not assume special place-and-route scripts. Equal or similar routing paths for complementary bits keep the
electrical imbalance between them low, which reduces the side-channel leakage.
A consequence of using a balanced-interleaved format is that each 32-bit operation from the
original, unprotected block cipher requires expansion into two balanced operations, each
processing a balanced half-word.
Figure 3.2: Balanced-Interleaved T-Box Organization (16-bit balanced address consisting of the address and its complement; two 256x32 RAMs, TBOX_H and TBOX_L; 32-bit balanced data (H, L))
3.2 Memory Organization for Lookup Tables
Because lookup tables are so common in block ciphers, we use a dedicated approach to
implement side-channel resistant lookup tables using the RAM macros of the FPGA fabric.
We use the AES T-Box implementation as a case study. The T-Box is a lookup table with
8 input bits and 32 output bits and is defined by grouping several steps of the AES round
transformation. For the purpose of explaining our method, we treat the T-Box simply as an
8x32 lookup table. The complete AES algorithm requires five different T-Box tables.
The secure T-Box design shown in Figure 3.2 uses a balanced-interleaved data organization.
An 8x32 T-Box thus needs two 8x32 balanced-interleaved tables, each storing a half-word of
the original T-Box with its complementary bits. Each balanced-interleaved table is stored
in a separate RAM macro. In order to achieve balancing in the address decoding logic,
we follow the storage order suggested in [35], namely that complementary RAM macros
require complementary addresses. The difference with our design, however, is that the
complementary RAMs do not store complementary data: the data within each RAM is
already balanced.
Table 3.1: SCA-resistant Instruction Set for AES
Instruction    Return Value
CONV_INV(a)    0, 0, .., a[2], a[0]
CONV_BIL(a)    not(a[15]), a[15], .., not(a[0]), a[0]
CONV_BIH(a)    not(a[31]), a[31], .., not(a[16]), a[16]
B_XOR(a,b)     balanced-interleaved xor(a, b)
B_TBx_L(a)     balanced-interleaved lookup-table (lower)
B_TBx_H(a)     balanced-interleaved lookup-table (upper)
Each AES T-Box has its own B_TBx_H(a) and B_TBx_L(a); x=0,1,2,3,4
In summary, our proposed memory organization for lookup tables achieves side-channel
resistance by combining three elements. First, the use of RAM cells reduces side-channel
leakage because of the increased logic density they offer. Second, the overall lookup table uses balanced-interleaved
addressing. Third, the lookup table content uses balanced-interleaved data
storage.
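The memory organization can be modelled in software as follows. This is a sketch under stated assumptions rather than the hardware description used in the thesis: the structure and function names are hypothetical, the true bit of each pair is assumed to sit at the even bit position, and the lower-half RAM is assumed to be the one that receives the complemented address.

```c
#include <stdint.h>

/* Spread 16 bits into balanced-interleaved form: bit i goes to bit 2i,
 * its complement to bit 2i+1 (bit placement is an assumption). */
static uint32_t to_bi(uint16_t h) {
    uint32_t w = 0;
    for (int i = 0; i < 16; i++) {
        uint32_t bit = (h >> i) & 1u;
        w |= (bit << (2 * i)) | ((bit ^ 1u) << (2 * i + 1));
    }
    return w;
}

typedef struct {
    uint32_t tbox_h[256]; /* upper half-words, indexed by the address             */
    uint32_t tbox_l[256]; /* lower half-words, indexed by the complemented address */
} bi_tbox;

/* Fill both RAMs from an ordinary 256-entry, 32-bit table. */
static void bi_tbox_init(bi_tbox *m, const uint32_t *plain) {
    for (int a = 0; a < 256; a++) {
        m->tbox_h[a] = to_bi((uint16_t)(plain[a] >> 16));
        m->tbox_l[(uint8_t)~a] = to_bi((uint16_t)(plain[a] & 0xFFFFu));
    }
}

/* A lookup presents addr to one RAM and not(addr) to the other, so the
 * two address decoders always see complementary bit patterns. */
static void bi_tbox_lookup(const bi_tbox *m, uint8_t addr,
                           uint32_t *hi, uint32_t *lo) {
    *hi = m->tbox_h[addr];
    *lo = m->tbox_l[(uint8_t)~addr];
}
```

Both RAMs hold data that is already balanced; only the storage order differs, so that the two address buses carry complementary values during every lookup.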
3.3 System Integration
An important, but often overlooked, aspect of side-channel countermeasures is the system
integration. On an embedded processor, SCA-resistant encryption is just one of the many
tasks handled by software. We have integrated our countermeasures as custom instructions
into a soft-core processor. A custom-instruction interface offers the ability to introduce
custom-hardware modules in the execution stage of a RISC pipeline.
3.3.1 SCA-resistant Custom Instruction Set
The NiosII softcore CPU provides support for custom instructions and an interface to integrate
them with software. It can support up to 128 custom instructions. We have implemented
three types of custom instructions to support the AES routine: format conversion, balanced XOR, and balanced table lookup. Table 3.1 shows the side-channel
resistant instruction set for AES.
CONV_INV(a) extracts the even bits from a word, and thus converts balanced-interleaved
format into direct form. CONV_BIL(a) and CONV_BIH(a) generate the balanced-interleaved form
from the lower and the higher halfword of the input argument a, respectively. These instructions are
used to initialize the data variables of a program into balanced-interleaved form before it
starts executing the cryptographic routine. This ensures that the dataflow of these variables
through the program always offers SCA resistance. When the encryption routine is finished, the
output variables are converted back into the standard 32-bit form.
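A software model of the three conversion instructions helps to make the format concrete. The functions below are hypothetical models, not the custom-instruction hardware; they assume that the true bit of each pair occupies the even bit position, which follows from the statement that the inverse conversion extracts the even bits.

```c
#include <stdint.h>

/* CONV_BIL model: interleave the lower half-word with its complement. */
static uint32_t conv_bil(uint32_t a) {
    uint32_t w = 0;
    for (int i = 0; i < 16; i++) {
        uint32_t bit = (a >> i) & 1u;
        w |= (bit << (2 * i)) | ((bit ^ 1u) << (2 * i + 1));
    }
    return w;
}

/* CONV_BIH model: the same conversion applied to the upper half-word. */
static uint32_t conv_bih(uint32_t a) { return conv_bil(a >> 16); }

/* CONV_INV model: collect the even bits back into a 16-bit direct value. */
static uint32_t conv_inv(uint32_t w) {
    uint32_t a = 0;
    for (int i = 0; i < 16; i++)
        a |= ((w >> (2 * i)) & 1u) << i;
    return a;
}

/* Number of set bits; 16 for every well-formed balanced-interleaved word. */
static int hamming_weight32(uint32_t w) {
    int n = 0;
    for (; w; w &= w - 1u) n++;
    return n;
}
```

Because every pair contains exactly one 1, the Hamming weight of a balanced-interleaved word is 16 regardless of the data, which is the property a Hamming-weight based attack model relies on being violated.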
The round function for a T-Box based AES only requires a balanced XOR, which can be
supported through a single custom instruction B_XOR(a,b). It takes two 32-bit balanced-interleaved operands as inputs, performs the XOR operation in a DPL way, and produces a balanced-interleaved 32-bit output. Move, shift and rotate operations are compatible with balanced-interleaved arguments, so that no custom instruction is needed for those. Custom logic takes
care of shift and rotate manipulations.
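The logical result of a balanced XOR, and the reason why rotations need no dedicated instruction, can be sketched as follows. This models only the data values, not the precharged dual-rail evaluation in hardware; the names are hypothetical and the complement bits are assumed to sit at the odd positions.

```c
#include <stdint.h>

#define BI_COMPLEMENT_MASK 0xAAAAAAAAu /* odd bit positions hold complements (assumed) */

/* Software model of a balanced XOR on two balanced-interleaved words.
 * Even bits: a ^ b. Odd bits: not(a) ^ not(b) equals a ^ b, so the odd
 * positions are inverted once more to restore the complement. */
static uint32_t b_xor_model(uint32_t a_bi, uint32_t b_bi) {
    return (a_bi ^ b_bi) ^ BI_COMPLEMENT_MASK;
}

/* Rotating a balanced-interleaved word left by 2*k bit positions
 * corresponds to rotating the underlying 16-bit value left by k. */
static uint32_t bi_rotl(uint32_t w, unsigned k) {
    unsigned n = (2u * k) % 32u;
    return n ? ((w << n) | (w >> (32u - n))) : w;
}
```

A plain XOR of two balanced words would leave the complement positions equal to the true positions, which is why the XOR needs its own balanced instruction, whereas a rotation by an even number of bit positions keeps every pair intact.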
The AES T-Box implementation has 5 different T-Box tables. There is a B_TBx_L(a) and a B_TBx_H(a) instruction to
access the lower and the higher half of each T-Box table, respectively. These instructions are specific to
the AES block cipher; a different block cipher would need to use different lookup tables.
However, it is perfectly feasible to make the lookup tables fully reconfigurable, so that they
can be programmed with the S-Box content required for a specific block cipher. The approach
of implementing lookup tables in the processor is an important difference with earlier work by
Chen [12], and we will show how this brings a considerable performance gain.
The AES T-Box algorithm can be written in C by making use of custom instructions embedded as inline assembly macros. The pre-charge operation can be supported from C as
well, as illustrated in the snippet below. Note the use of volatile to prevent the removal
of precharge by an optimizing compiler.
volatile int t1, t2, t3;
t1 = 0;              // precharge
t1 = B_TB0_L(in);    // T-Box0 lower word
t2 = 0;              // precharge
t2 = B_TB1_L(in);    // T-Box1 lower word
t3 = 0;              // precharge
t3 = B_XOR(t1, t2);  // XOR
A strong feature of this approach is that it is fully compatible with the existing memory hierarchy of an embedded system. Variables can be stored into RAM in balanced-interleaved
form, and they will maintain their low side-channel leakage provided that pre-charge is properly implemented. Thus, our approach is independent of the number of processor registers;
it will not run out of foreground storage (in contrast to e.g. [36]).
Storing the balanced-interleaved format in background memory may still cause side-channel
leakage due to asymmetry in the physical layout of the background memory. We will analyze
this later in this thesis.
3.3.2 System on Chip Configuration (SOPC)
We use the QuartusII SOPC Builder tool to configure our system on the FPGA. It provides an interface
to choose the configuration of the processor and of the different on-board peripheral components to be
included in the system. It also allows the user to define new custom instructions and to select
any of them for inclusion in the system. The SOPC Builder assigns address space to all
these components and integrates them to generate the desired system on chip. It
also assigns address space to the custom instructions, which are used to call them through
assembly macros.
Figure 3.3: System-on-chip Configuration (SOPC). The system-on-chip comprises the NiosII/s processor, onchip memory, an SDRAM/SRAM controller, a JTAG debug module, an RS232 interface (19200 bps), and an expansion header (GPIO); the offchip memories are SDRAM-1 and SDRAM-2 (16-bit, 32MB each) and SSRAM (2MB).
Figure 3.3 illustrates the block diagram of the system used in our experiments. It incorporates a 32-bit NiosII/s (50MHz, pipelined) processor, an offchip memory (SDRAM or
SSRAM), a GPIO parallel port (trigger), and communication peripherals (RS232 and JTAG-UART). NiosII/s is a 32-bit RISC softcore CPU with a pipelined architecture. The SDRAM is a
64MB memory arranged in two 16-bit wide memory banks, whereas the SSRAM is a 32-bit wide
memory with a capacity of 2MB. A GPIO pin is configured as an output to generate a trigger
for the oscilloscope. The RS232 serial bus handles the communication between NiosII and the PC.
Chapter 4
SCA-resistant CPU: Results Analysis
To demonstrate that our solution improves the resistance against SCA, this chapter presents
experimental results based on real attacks. The first section describes the setup we have used
in our experiments. The experiment is divided into two parts: a proof-of-concept experiment
(single T-Box attack) and a real-world implementation (128-bit AES T-Box attack) of an AES
prototype. These are detailed in the subsequent sections. Next, we analyze the possible causes
of residual side-channel leakage in the SDRAM system. We then
compare our work with related published secure implementations, and finally we summarize
the contributions of this work.
4.1 Experimental Setup
Our designs are implemented on an Altera DE2-70 evaluation board, which has a CycloneII EP2C70F896C6 FPGA device and a NiosII softcore processor. To uncover the secret key
with SCA, we build a setup whose block diagram and photograph are shown in Figure
4.1. The setup contains the cryptographic device (Altera DE2-70), an oscilloscope (Agilent
DSO5032A) and a PC. The three parts of the setup are connected in a circular fashion. An
RS232 cable connects the cryptographic device and the PC. A USB cable connects the oscilloscope and
the PC; through it, the PC sends commands to the oscilloscope and retrieves the sampled
waveforms.
Figure 4.1: Setup for SCA. The PC (measurement and analysis) sends plaintext to the Altera DE2-70 board over RS-232 and receives waveforms from the oscilloscope over USB; the oscilloscope receives EM traces from the EM probe and a trigger from a GPIO pin of the board.
An electromagnetic (EM) probe (ETS-LINDGREN Model 7405-903) is used to capture the electromagnetic radiation emitted by the cryptographic device. We use this radiation as a
proxy for the power consumption of the embedded system. The EM traces are sampled
on an Agilent DSO5032A oscilloscope (300MHz bandwidth, 2GSa/s sampling rate). The oscilloscope is configured to average 32 consecutive traces so as to reduce the noise in the
acquired traces.
Side-channel analysis requires a number of measurements with different inputs (plaintexts for
AES). In this example, the result of one measurement is the EM trace of the cryptographic
device and the corresponding random plaintext block for encryption. Each measurement
consists of the following 4 steps. A side-channel analysis that requires n measurements
needs to repeat these 4 steps n times.
• Step 1: The PC sends a random plaintext block (16 bytes) to the embedded platform
(DE2-70 board) through the RS232 cable.
• Step 2: The embedded processor (NiosII) in the platform receives the plaintext and
encrypts it with the AES software. The encryption is repeated 32 times so that an
averaged trace can be sampled.
• Step 3: After sending out one block of plaintext, the PC sends a command to the oscilloscope to sample the EM trace while NiosII is running the encryption.
• Step 4: After sampling is done, the oscilloscope averages the 32 samples and sends one
EM trace back to the PC for side-channel analysis.
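In the setup described above the oscilloscope performs the averaging. For reference, the equivalent computation can be sketched as a point-wise mean over repeated traces; the function name and the row-major trace layout are assumptions for illustration.

```c
#include <stddef.h>

/* Average n_traces traces of n_samples points each.
 * traces is row-major: traces[t * n_samples + s]. */
static void average_traces(const double *traces, size_t n_traces,
                           size_t n_samples, double *avg) {
    for (size_t s = 0; s < n_samples; s++) {
        double sum = 0.0;
        for (size_t t = 0; t < n_traces; t++)
            sum += traces[t * n_samples + s];
        avg[s] = sum / (double)n_traces;
    }
}
```

Averaging repeated encryptions of the same plaintext attenuates uncorrelated noise while preserving the data-dependent part of the signal.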
After obtaining the measurements, we move on to the analysis phase. The result of the first AES
round is selected as the intermediate value IV, and a Correlation Power Attack (CPA) [34]
using the Hamming weight model is used to analyze the acquired EM traces.
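The core of such a correlation attack can be sketched for a single sample point and a single key byte. This is a simplified illustration, not the analysis scripts used for the experiments: the function names are hypothetical, the intermediate value is assumed to be a table lookup indexed by plaintext XOR key, and the squared correlation is used so that no square root is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hamming weight of a 32-bit value. */
static int hamming_weight(uint32_t v) {
    int n = 0;
    for (; v; v &= v - 1u) n++;
    return n;
}

/* Signed squared Pearson correlation between x and y (r * |r|).
 * It orders candidates in the same way as |r| and avoids sqrt. */
static double corr_sq(const double *x, const double *y, size_t n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = n * sxy - sx * sy;
    double vx = n * sxx - sx * sx, vy = n * syy - sy * sy;
    if (vx <= 0.0 || vy <= 0.0) return 0.0;
    double r2 = (cov * cov) / (vx * vy);
    return cov < 0 ? -r2 : r2;
}

/* CPA on one sample point: for each key guess, predict HW(table[pt ^ k])
 * and correlate the prediction with the measured samples.
 * hyp is caller-provided scratch space of n elements. Returns the best guess. */
static int cpa_best_key(const uint8_t *pt, const double *samples, size_t n,
                        const uint32_t *table, double *hyp) {
    int best = 0;
    double best_score = -1.0;
    for (int k = 0; k < 256; k++) {
        for (size_t i = 0; i < n; i++)
            hyp[i] = hamming_weight(table[pt[i] ^ k]);
        double c = corr_sq(hyp, samples, n);
        if (c < 0) c = -c;
        if (c > best_score) { best_score = c; best = k; }
    }
    return best;
}
```

In a full attack this search is repeated for every sample point of the trace and for every key byte, and the MTD is read off as the number of traces at which the correct guess starts to stand out.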
4.2 Single T-Box Experiment
Figure 4.2: Single T-Box Experiment (an 8-bit plaintext and an 8-bit secret key feed an XOR, followed by the AES T-Box lookup with a 32-bit output)
We perform this test as a proof-of-concept experiment to verify the security achieved through
the use of a specialized memory organization and the balanced-interleaved data format. In this
experiment, we target an attack on a single T-Box operation and evaluate its security gain. As
illustrated in Figure 4.2, this test design incorporates the essential components of a block cipher
(AES), i.e. a logical XOR and a T-Box lookup. The SCA-resistant XOR and lookup table operations
are implemented in custom hardware and are accessed through custom instructions.
We perform this experiment in steps and change the number of balanced bits in a balanced-interleaved dataword for each step. We reconfigure the XOR operation and change the format
of the T-Box table contents to have the required number of balancing bits. The number of balanced
bits is varied from 0 (unsecure) to 16 (fully secure), and an SCA attack is performed for
each of these steps to evaluate its SCA-resistance. The attack targets the output of the lookup
operation.
Figure 4.3 shows the results, where the maximum correlation value for the correct key guess
is plotted against the number of balanced bits present in a dataword. This correlation is
calculated for 2000 traces, at its best attack point. It can be seen that the correlation
of the correct key guess decreases as the number of balanced bits increases. For the fully
balanced case, the correlation value reduces to 0.11 at 2000 traces. The unbalanced (0-bit)
implementation is attacked with an MTD of 50, whereas the 14-bit balanced version takes 1200
traces for a successful attack. We could not attack the fully (16-bit) balanced design successfully
with 170000 averaged traces. This shows that our countermeasure achieves a significant
security improvement.
Figure 4.3: Security Improvement: Single T-Box test (x-axis: number of balanced bits; y-axis: maximum correlation value; data points are marked as successful or unsuccessful attacks)
In this implementation, though the sensitive contents are stored in onchip memories (RAM
macros), the program data and stack are placed in offchip memories. It should be taken
into account that the sensitive data variables flow back and forth between the offchip memory
and the FPGA while the program is being executed. These offchip data-bus transfers contribute
significantly to the side-channel leakage. All of the unbalanced steps of this experiment have
at least 1 bit of sensitive data that is not balanced by its complementary part. This provides
useful side-channel leakage to perform a successful attack. However, for the fully balanced case,
none of the data bits contribute to the side-channel leakage and hence, we could not attack
that configuration successfully.
4.3 AES Prototype
In the second part of our experiment, we implement an SCA-resistant AES T-Box (128-bit)
prototype to evaluate its efficiency in terms of security, performance and cost. We use the
same platform as for the single T-Box experiment, with two different configurations of offchip
memory (SDRAM and SSRAM). The T-Box lookup tables are implemented in onchip RAM
macros, and offchip memory is used for program execution and stack. The software uses
custom instructions for secure operations and includes a hardcoded secret key in balanced-interleaved format. All intermediate variables are precharged to 0 before they are used for
the next operation. We attack the first round of AES and conduct a CPA to evaluate its
SCA-resistance.
Figure 4.4: AES-TBOX implementation: NiosII/s + SDRAM (x-axis: number of traces, 0 to 50000; y-axis: number of revealed key bytes; curves for the unsecured and secured implementations)
The SCA attacks are performed on the unsecure and secure implementations for both the SDRAM and
SSRAM configurations. We have used several different secret keys, and Table 4.1 lists the
average security gain over this set of keys. It considers the MTD at a 75% success rate, i.e. the MTD
to disclose 75% of the key bytes. As we use a 128-bit (16-byte) secret key for AES, a 75% success rate
corresponds to revealing 12 key bytes.
An unsecure AES implementation on NiosII/s with SDRAM offchip memory reveals 12 key
bytes at around 1600 traces, whereas the secure implementation needs 40000 traces to reveal
12 key bytes. This results in an overall security gain of 25x at a 75% success rate. Figure
4.4 plots the number of key bytes revealed as a function of the number of traces for the SDRAM
configuration.
Figure 4.5: Attack results on secure implementation: NiosII/s + SSRAM. The trace of the correct key
guess (here, the first key byte) is plotted in black, while all other key guesses are in yellow (gray).
The buried trace means an unsuccessful attack. (x-axis: time samples, 0 to 1000; y-axis: correlation, −0.08 to 0.08)
Table 4.1: AES Implementation: Security
Configuration       MTD, unsecure (# revealed key bytes)   MTD, secure (# revealed key bytes)   Security gain
NiosII/s + SDRAM    1600 (12)                              40000 (12)                           25
NiosII/s + SSRAM    633 (12)                               300000 (0)                           >474
For the SSRAM configuration, the unsecure implementation achieves a 75% success rate at an
average of 633 traces, whereas we could not attack the secure implementation with 300000 traces.
Figure 4.5 shows the correlation trace of the correct key byte for 300000 traces. This results in a
security gain of at least 474, which differs significantly from that of the SDRAM configuration.
We investigate the possible reasons for this difference later in this chapter. Figure 4.6 shows
the waveforms of the secure and unsecure AES traces for the SDRAM configuration, and Figure
4.7 shows the same for the SSRAM configuration.
We achieve this security improvement at the cost of performance and area overhead. The
area overhead of the secure implementation is due to the extra logic for the customized hardware.
The software implementation needs to split every 32-bit sensitive dataword into two 32-bit
balanced words. Additionally, all variables need to be precharged before they can be reused.
These additional instructions cause a performance degradation. Our
secure implementation is 2.7 times slower than the unsecure implementation and takes 15% more
area. Table 4.2 lists these results.
Table 4.2: AES Implementation: Area and Performance
Configuration       Area, unsecure (LEs, M4K)   Area, secure (LEs, M4K)   Cycle count, unsecure   Cycle count, secure
NiosII/s + SDRAM    3452, 143                   3889, 161                 13839                   36977
NiosII/s + SSRAM    2814, 31                    3252, 49                  7375                    19980
Area of a system with CPU, memory controller and custom hardware.
Area is expressed in terms of Logic Elements (LEs) occupied and M4K memory blocks
(RAM macros) used.
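As a cross-check, the ratios quoted in this section can be recomputed from the values tabulated in Tables 4.1 and 4.2. The labels below are ours; the numbers are taken from the tables.

```latex
\begin{align*}
\text{Security gain (SDRAM)} &= \frac{40000}{1600} = 25 \\
\text{Security gain (SSRAM)} &> \frac{300000}{633} \approx 474 \\
\text{Slowdown (SDRAM)} &= \frac{36977}{13839} \approx 2.67 \\
\text{Slowdown (SSRAM)} &= \frac{19980}{7375} \approx 2.71 \\
\text{LE overhead (SDRAM)} &= \frac{3889 - 3452}{3452} \approx 12.7\% \\
\text{LE overhead (SSRAM)} &= \frac{3252 - 2814}{2814} \approx 15.6\%
\end{align*}
```

The SSRAM figures correspond to the 2.7 times slowdown and the roughly 15% area increase stated in the text; both configurations use 18 additional M4K blocks in the secure version.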
Figure 4.6: AES trace on oscilloscope (SDRAM configuration). (a) Unsecure Implementation; (b) Secure Implementation
Figure 4.7: AES trace on oscilloscope (SSRAM configuration). (a) Unsecure Implementation; (b) Secure Implementation
4.4 Impact of PCB Layout
The location of peripheral chips on a PCB has a significant impact on the security of the
overall system. Figure 4.8 depicts the layout of the DE2-70 board. We can see that the SSRAM
has a more symmetric location with respect to the CycloneII FPGA than the SDRAM.
The 32-bit SDRAM is configured as two 16-bit memory banks, whereas the SSRAM is a single 32-bit
memory chip. With this layout, the SDRAM does not always offer adjacent data-pin locations
for a complementary bit-pair. This creates an imbalance between direct and complementary
bitlines irrespective of their balanced format. On the other hand, the SSRAM has a more symmetric data-pin pattern, which routes complementary bit lines together and thus reduces
the residual side-channel leakage.
Figure 4.8: Impact of PCB Layout on Residual Leakage (SSRAM: symmetric pin positions, no leakage; SDRAM: asymmetric pin positions, leakage)
This highlights that the implementation of an SCA countermeasure in the FPGA alone does not
guarantee the security of the overall system. Its integration with other offchip peripherals
(that might be SCA-sensitive) is equally important. In addition, it shows that the balanced-interleaved data format is portable to different memory organizations
(even-bit databus, e.g. 8/16/32-bit) because it provides SCA-resistance at the bit
level. Our solution with the SSRAM configuration demonstrates how a
given PCB can be used to implement an efficient secure system.
4.5 Related Implementations
In this section, we compare our solution with other secure implementations. As shown in
Table 4.3, these implementations target different technologies, different countermeasures and
different cryptographic algorithms. As the attack methods are not standardized, it is not
straightforward to compare them on the same scale.
The masking-based secure processor SecretBlaze, presented by Barthe et al. [5], does not
provide a satisfactory level of security. The authors report a successful attack on their secure
implementation with a significant correlation peak. Another custom-instruction based secure
processor uses a masking countermeasure [36]. It needs a mechanism for mask generation, mask storage and mask management, which increases its design complexity.
Chen et al. [12] present a custom-instruction based Virtual Secure Circuit (VSC), which
uses a bit-slicing technique to implement a hiding countermeasure. Despite a security improvement of 20X, this work suffers from a non-trivial performance penalty due to its bit-slicing
technique. It also needs specialized coding to handle bit-slicing at the software level, adding to
the design complexity. MUTE-AES [2] is another hiding countermeasure that uses a multiprocessor platform. However, it has security vulnerabilities, as discussed in [1]. Regazzoni
et al. present a CAD-based design flow and evaluation framework to implement secure
applications in ASIC [33]. Their results are based on simulations of the PRESENT
cipher, mainly targeted at ASIC.
Our design is implemented systematically using a state-of-the-art design methodology,
which makes the design phase simpler than in the above-mentioned implementations. It compares favorably with the above-mentioned solutions in terms of security, area and design flexibility, and its performance degradation is lower than that of the other hiding-based processor [12].
Table 4.3: Related work: Comparison
Work         Technology, base processor    Implementation             Security gain / Area overhead / Performance degradation
[5]          Spartan-3, MicroBlaze         Masking, DES               2X / 1.34X (1) / –
[36]         Virtex-4, Leon3               Masking, AES               3.5X / – / 2X
[12]         Spartan-3E, Leon3             Hiding (VSC), AES          20X / 3.3X / 6.5X
[33]         ASIC 180nm, OpenRISC1000      Hiding (MCML), PRESENT     – / 2.65X / –
Our Design   CycloneII, NiosII/s           Hiding, AES T-Box          >474X / 1.15X (2) / 2.7X
(1) This number represents area overhead in terms of slice-count.
(2) Area of a system with only processor and SRAM memory controller.
4.6
Summary of Contribution
This work reports an efficient and secure embedded system design on FPGA that uses an industrial
design flow. We use a novel memory organization technique and an interleaved data
format in combination with a hiding countermeasure. Although we have demonstrated our
results on an Altera FPGA for an AES T-Box implementation, the methodology is portable to other
FPGA platforms and to the majority of block ciphers. We discuss how the location of peripheral
off-chip components on the PCB plays an important role in the overall security evaluation.
Our experimental results establish the feasibility of the proposed methodology for implementing an
embedded system that achieves the desired security at reasonable cost.
Chapter 5
ECDLP Engine: Background
In this chapter, we give a brief overview of Elliptic Curve Cryptography (ECC), the Pollard rho
algorithm and modular arithmetic, and then discuss the related work in this area.
5.1
Elliptic Curve Cryptography (ECC)
In a public-key cryptographic scheme, a key pair is selected such that the problem of deriving
the private key from the corresponding public key is equivalent to solving a computational
problem that is believed to be intractable. Elliptic curve cryptography uses elliptic curves to
design public-key cryptographic systems [26, 23]. The idea can be explained as follows [21].
Let $p$ be a prime number, and let $\mathbb{F}_p$ denote the field of integers modulo $p$. An elliptic curve
$E$ over $\mathbb{F}_p$ is defined by an equation of the form
$$y^2 = x^3 + ax + b, \qquad (5.1)$$
where $a, b \in \mathbb{F}_p$ satisfy $4a^3 + 27b^2 \not\equiv 0 \pmod{p}$. A pair $(x, y)$, where $x, y \in \mathbb{F}_p$, is a point on
the curve if $(x, y)$ satisfies equation (5.1). The point at infinity ($\infty$) is also said to be on the
curve. The set of all the points on $E$ is denoted by $E(\mathbb{F}_p)$.
Using a point addition rule defined with modular arithmetic, two elliptic curve points are added
to produce a third point on the same elliptic curve. Repeatedly adding a point to itself
generates a cyclic subgroup of points in $E(\mathbb{F}_p)$.
Such subgroups are used to implement ECC cryptosystems. For example, if $P$ is a point in
$E(\mathbb{F}_p)$ with prime order $n$, then the cyclic subgroup of $E(\mathbb{F}_p)$ that it generates is
$$\langle P \rangle = \{\infty, P, 2P, 3P, \ldots, (n-1)P\}. \qquad (5.2)$$
Here, the elliptic curve $E$, the prime $p$, the point $P$ and its order $n$ are the public domain
parameters. A private key $d$ is an integer randomly selected from the interval $[1, n-1]$, and the
corresponding public key is generated as $Q = [d]P$, where $[d]$ denotes scalar multiplication
by $d$, that is, adding $P$ to itself $d$ times.
ECC encryption takes place as follows [21]. First, a plaintext $m$ is represented as a point $M$
on the curve, and then it is added to $[k]Q$ to get the encrypted data. Here, $k$ is a randomly
selected integer. The sender transmits the points $C_1 = [k]P$ and $C_2 = M + [k]Q$, and the recipient
uses the private key $d$ to compute
$$[d]C_1 = [d]([k]P) = [k]([d]P) = [k]Q \qquad (5.3)$$
and recovers $M = C_2 - [k]Q$. An attacker needs to know $d$ to compute $[d]C_1$ and recover the
message $m$.
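The key generation and encryption steps above can be traced on a toy curve. The following Python sketch is illustrative only: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, the base point $P = (5, 1)$ of order 19, the fixed values of d, k and M, and all function names are assumptions chosen for readability, not parameters or code from this work.

```python
# Toy parameters, assumed for illustration: y^2 = x^3 + 2x + 2 over F_17,
# base point P = (5, 1), which has prime order n = 19.
p, a, b = 17, 2, 2
P, n = (5, 1), 19
INF = None  # the point at infinity


def on_curve(pt):
    """Check equation (5.1) for an affine point (INF counts as on the curve)."""
    if pt is INF:
        return True
    x, y = pt
    return (y * y - (x * x * x + a * x + b)) % p == 0


def neg(pt):
    """Negate a point: (x, y) -> (x, -y)."""
    return INF if pt is INF else (pt[0], -pt[1] % p)


def add(p1, p2):
    """Affine point addition, including doubling and the identity."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF  # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def mul(k, pt):
    """Scalar multiplication [k]pt by double-and-add."""
    acc = INF
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc


d = 7                                   # private key (fixed here; random in practice)
Q = mul(d, P)                           # public key Q = [d]P
M = mul(5, P)                           # message already encoded as a curve point
k = 3                                   # ephemeral scalar (fixed here; random in practice)
C1, C2 = mul(k, P), add(M, mul(k, Q))   # ciphertext pair
recovered = add(C2, neg(mul(d, C1)))    # M = C2 - [d]C1
```

On a curve this small the private key can be found by trying all 18 candidates, which is why practical curves use subgroups of much larger prime order.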
5.2
Pollard-rho Algorithm
The security of ECC relies on the difficulty of the Elliptic Curve Discrete Logarithm Problem
(ECDLP): determining the secret key $d$ given the domain parameters
and the public key $Q$. Formally, the ECDLP is to find $d$ when $P$ and $Q$
are known and $Q = [d]P$, where $P, Q \in E(\mathbb{F}_p)$.
The Pollard rho method [32], [10] is the strongest known attack against ECC today. This
method solves the ECDLP by iteratively generating points on the targeted curve, each of which
has the form $X = [a]P + [b]Q$. When the same point is encountered twice with different
coefficients $a$ and $b$, a collision occurs, which means that the ECDLP is solved. Here, $X$ denotes any
point on the elliptic curve, i.e. $X \in E(\mathbb{F}_p)$.

Figure 5.1: Standard Pollard-rho attack. [Figure: a walk that starts at the seed point X0 and
proceeds through X1, X2, ..., where each arrow is a single Pollard-rho step f(X); the walk enters
a cycle at the point of collision X4 = X10.]
The Pollard rho algorithm [32] uses a pseudo-random iteration function $f : \langle X \rangle \to \langle X \rangle$ and
calculates a finite number of points on the curve. It starts a walk from a random point (the seed
point) on the curve of the form $X_0 = [a_0]P + [b_0]Q$, where $a_0$ and $b_0$ are generated from a
random seed $S$. It then iteratively computes $X_{i+1} = f(X_i)$ until it encounters a point on the
curve twice; eventually, the walk ends in a cycle. The name rho refers to the
Greek letter $\rho$, whose shape depicts a walk ending in a cycle, as shown in Figure 5.1. The function
$f$ is also referred to as a Pollard rho step, or simply an iteration.
The collision point is located where the cycle starts. Therefore, the underlying idea of this
algorithm is to search for two distinct coefficient pairs $(a_i, b_i)$ and $(a_j, b_j)$ that
describe the same point on the curve, such that
$$[a_i]P + [b_i]Q = [a_j]P + [b_j]Q.$$
The iteration function is constructed in such a way that the coefficients $a_d$ and $b_d$ of any
point $X_d$ on the walk can be computed from $a_0$ and $b_0$. A Pollard rho step is often defined as a
point addition, i.e. $X_{i+1} = X_i + x$, where $x$ is a precomputed linear combination of $P$ and $Q$,
for example $[c_i]P + [d_i]Q$.
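When a collision occurs, two different linear combinations for the same point $X_d$ are available,
with coefficients $(a_{d1}, b_{d1})$ and $(a_{d2}, b_{d2})$ taken from the two colliding walks.
Since $Q = [d]P$, equating them gives $a_{d1} + b_{d1} d \equiv a_{d2} + b_{d2} d \pmod{n}$, so
the secret scalar is obtained as
$$d = \frac{a_{d1} - a_{d2}}{b_{d2} - b_{d1}} \bmod n, \qquad (5.4)$$
where $n$ is the order of $P$ and $b_{d1} \not\equiv b_{d2} \pmod{n}$.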
Due to the birthday paradox, the expected length of a walk before a collision is found is
proportional to $\sqrt{|\langle X \rangle|}$. The function $f$ generally corresponds to a modular point addition,
where two points on the curve are added together to generate the next point on the
same curve.
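The walk and the collision step described above can be sketched in Python on a toy group. This is a minimal illustration under stated assumptions, not the engine of this work: the curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with base point $(5, 1)$ of order 19 is an assumed example, the branch selection rule and all names are hypothetical, and every visited point is stored in a dictionary, which is only practical at toy sizes.

```python
import random

# Toy parameters, assumed for illustration (not the thesis curve):
# y^2 = x^3 + 2x + 2 over F_17, base point P = (5, 1) of prime order n = 19.
p, a = 17, 2
P, n = (5, 1), 19
INF = None  # the point at infinity


def add(p1, p2):
    """Affine point addition, including doubling and the identity."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def mul(k, pt):
    """Scalar multiplication [k]pt by double-and-add."""
    acc = INF
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc


def pollard_rho(Q, branches=4, max_tries=100):
    """Return d with Q = [d]P using an additive walk, or None if every try fails."""
    for attempt in range(max_tries):
        rng = random.Random(attempt)               # reproducible toy randomness
        steps = []                                 # precomputed R_j = [c_j]P + [e_j]Q
        for _ in range(branches):
            c, e = rng.randrange(1, n), rng.randrange(1, n)
            steps.append((c, e, add(mul(c, P), mul(e, Q))))
        a_i, b_i = rng.randrange(n), rng.randrange(n)
        X = add(mul(a_i, P), mul(b_i, Q))          # seed point X0 = [a0]P + [b0]Q
        seen = {}                                  # point -> coefficients (a, b)
        while X not in seen:
            seen[X] = (a_i, b_i)
            j = 0 if X is INF else X[0] % branches  # branch depends on X only
            c, e, R = steps[j]
            X, a_i, b_i = add(X, R), (a_i + c) % n, (b_i + e) % n
        a_j, b_j = seen[X]                         # earlier coefficients of the same point
        if (b_j - b_i) % n:                        # usable collision
            # [a_i]P + [b_i]Q = [a_j]P + [b_j]Q  =>  d = (a_i - a_j) / (b_j - b_i) mod n
            return (a_i - a_j) * pow((b_j - b_i) % n, -1, n) % n
    return None


Q = mul(7, P)          # public key for the toy secret d = 7
d_found = pollard_rho(Q)
```

A collision with equal b coefficients carries no information about d, so the sketch simply restarts with fresh step points in that case.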
5.2.1
Parallelization
Van Oorschot [39] described a parallelization technique that enables parallel walks of the
Pollard rho algorithm on a single curve to speed up the computation of the ECDLP. The idea is
to define a subset of $\langle X \rangle$ as distinguished points (DPs), points which have a distinguishing
characteristic. For example, a DP could be defined as a point with a given number of
leading zero bits in its x-coordinate. This method allows the random walks to be distributed
among multiple processing clients, which share the DPs they find with a central server
that performs the collision search. This technique results in a linear speed-up as the
number of clients increases.
Multiple random walks are continued until two different seed points reach the same distinguished
point $X_d$. The expected number of DPs required to find a collision is a fraction of
the expected path length. This depends on the density of DPs in the point set $\langle X \rangle$, which in
turn depends on the chosen distinguishing property.

Figure 5.2: Reduction with prime $p$, where $p = 2^N - r$. [Figure: two N-bit operands op1 and op2
are multiplied into a 2N-bit product with upper part a1 and lower part a0; in each reduction
step the upper carry bits are multiplied by r and added to the lower N bits, and the steps repeat
until the reduced N-bit output A mod P remains. Annotation: for $P = 2^{128} - r$, if
$A = a_1 \cdot 2^{128} + a_0$, then $A \bmod P = a_1 \cdot r + a_0$.]
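The distinguishing property mentioned in the parallelization discussion can be made concrete with a short Python sketch. It is a stand-in under stated assumptions: the 8-bit threshold, the function names, and the use of fresh random 128-bit values in place of real curve points are all hypothetical choices made so that the example runs quickly.

```python
import random

COORD_BITS = 128   # coordinate width used for illustration
DP_BITS = 8        # hypothetical number of leading zero bits that marks a DP


def is_distinguished(x, dp_bits=DP_BITS, coord_bits=COORD_BITS):
    """True if the top dp_bits bits of the x-coordinate are all zero."""
    return x >> (coord_bits - dp_bits) == 0


def walk_length_until_dp(rng, dp_bits=DP_BITS):
    """Count steps of a stand-in walk (a fresh random x each step) until a DP."""
    steps = 1
    while not is_distinguished(rng.getrandbits(COORD_BITS), dp_bits):
        steps += 1
    return steps


rng = random.Random(1)
lengths = [walk_length_until_dp(rng) for _ in range(2000)]
average = sum(lengths) / len(lengths)   # expected to be near 2**DP_BITS = 256
```

With k leading zero bits, roughly one point in 2^k is distinguished, so a client reports one DP per 2^k steps on average; a larger k lowers the communication and storage load at the cost of longer walks after the actual collision.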
5.3
Modular Arithmetic
Elliptic curve operations depend on finite field arithmetic on the coordinates of the points of
the elliptic curve. All these arithmetic operations (addition, subtraction, multiplication) are
computed modulo the prime $p$. This can be achieved by performing
them as standard integer arithmetic operations and then reducing the
results mod $p$. A straightforward way to perform the reduction is to divide the result by
$p$ and keep the remainder. As cryptographic operations involve large numbers and the prime $p$ is
generally chosen to be large, reduction by division proves to be inefficient: divisions are costly
operations in terms of hardware area and performance. However, there are more efficient
algorithms available to perform reductions with a prime $p$, which break down the costly division
operation into simple addition operations.
We explain an example of hardware-friendly modular multiplication over an $N$-bit field, as
shown in Figure 5.2. Let $p$ be an $N$-bit prime of the form $2^N - r$. The inputs of the
multiplier are two $N$-bit operands, op1 and op2, and the result of the modular multiplication is
$A \bmod p$ with $A = \text{op1} \cdot \text{op2}$. The $2N$-bit product $A$ can be written as
$A = a_1 \cdot 2^N + a_0$. Since $2^N \equiv r \pmod{p}$,
$$A \bmod p = (a_1 \cdot r + a_0) \bmod p. \qquad (5.5)$$
To perform the modular multiplication, op1 and op2 are multiplied using the conventional schoolbook
method to get a $2N$-bit product. This $2N$-bit result is then reduced to an $N$-bit
output using additions. As shown in Figure 5.2, the upper bits
($a_1$) of the result are multiplied by $r$ and added to the lower bits ($a_0$). The
reduced result might still have residual carry bits ($a_1'$), so the fold is repeated
iteratively until the final $N$-bit result $A \bmod p$ is obtained.
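The folding reduction of equation (5.5) can be checked with a few lines of Python. This is a software sketch of the arithmetic only, not of the hardware datapath; the function name, the operand values and the final conditional subtraction are assumptions added for illustration, and the sketch assumes $0 < r < 2^{N-1}$.

```python
def reduce_fold(A, N, r):
    """Reduce A modulo p = 2**N - r by repeatedly folding the upper bits."""
    p = (1 << N) - r
    mask = (1 << N) - 1
    steps = 0
    while A >> N:                    # residual upper bits a1 remain
        a1, a0 = A >> N, A & mask
        A = a1 * r + a0              # uses 2**N = r (mod p)
        steps += 1
    if A >= p:                       # final correction into the range [0, p)
        A -= p
    return A, steps


N, r = 128, 3
op1 = 0x0123456789ABCDEF0123456789ABCDEF
op2 = 0xFEDCBA9876543210FEDCBA9876543210
result, steps = reduce_fold(op1 * op2, N, r)
small, _ = reduce_fold(250 * 250, 8, 5)   # p = 251: 62500 -> 1256 -> 252 -> 1
```

Each fold replaces a division by a small multiplication and an addition, and the value shrinks at every step, so only a few folds are needed when r is small.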
5.4
Related Work
Different solutions to solve ECDLP have been proposed in recent years, using software and
hardware platforms. These solutions target different curves with different primes and perform
arithmetic operations either using prime field arithmetic or binary field arithmetic. We
discuss related solutions in this section.
The software solution proposed by Bernstein on the Cell platform is the fastest software solution
at present for solving the ECDLP over the secp112r1 curve [6]. It uses the negation map
and non-integer polynomial-basis arithmetic to achieve a speedup over a similar solution by
Bos [9]. Both of these software solutions use prime field arithmetic in an affine coordinate
system, and they exploit the SIMD architecture and rich instruction set of the platform. Another
software solution by Bos [8] describes the implementation of the parallel Pollard rho algorithm on
the Synergistic Processor Units of the Cell Broadband Engine Architecture to approach the ECC2K-130
Certicom challenge.
Among hardware-platform-based solutions, Bulens et al. propose an FPGA solution to
attack the ECC Certicom challenge for GF($2^{79}$) [31]. Though it discusses the hardware-software
integration aspect of the solution, the authors did not confirm whether their system was
operational. Fan proposes the use of a normal-basis, binary field implementation to solve
ECC2K-130 [6].
Another binary field solution, for the COPACOBANA platform, targets a 160-bit curve
[16]. Since a curve of this size would require a single COPACOBANA platform to run for
$7.62 \times 10^9$ years, the authors did not demonstrate a collision that can validate their design.
Guneysu et al. propose an architecture to solve the ECDLP over prime fields using FPGAs
and analyze its estimated performance for different ECC curves [20]. A three-layer hybrid
distributed system is described by Piotr et al. to solve the ECDLP over a binary field [25]. It uses
general-purpose computers equipped with FPGAs and integrates them with a main server at the
top level. Our solution is based on a hardware-software co-integrated platform.
Chapter 6
ECDLP Engine: Implementation
This chapter discusses the architecture of the ECDLP engine, describes the experimental setup and
analyzes the results of our solution.
6.1
Modular Multiplication Architecture
Point addition is the most basic operation in the Pollard rho algorithm, and the overall
performance of an ECDLP engine depends strongly on how efficiently it performs point addition.
The point addition routine is a sequence of modular arithmetic operations
such as addition, subtraction, multiplication and inversion. Modular multiplication dominates
these operations with an 80% share. A highly
efficient modular multiplier (modmul) unit therefore results in faster point addition and,
consequently, in a more efficient ECDLP engine. For this reason, we optimize the modular
multiplication architecture.
We target prime field arithmetic using a polynomial representation in an affine coordinate
system (Equation 6.1). Typically, hardware solutions use binary field arithmetic, primarily
because of the assumption that binary fields avoid costly carry propagation. Our representation
for prime-field arithmetic is based on [6], which has similar properties. Although the
target curve considered in our solution is the 112-bit curve secp112r1 over GF($(2^{128} - 3)/76439$), all
the arithmetic operations are performed over a 128-bit field. After each iteration, the result
is mapped to 112 bits to check it against the distinguishing property. This 128-bit to 112-bit
conversion is obtained by canonicalization, i.e. multiplying the result by 76439 [6].
Table 6.1: ECC Arithmetic Operations

Operation                 Cycle cost
Addition/Subtraction      2
Modular Multiplication    14
Square                    11
Inversion                 1594
The point addition operation corresponds to a Pollard rho step that consists of four subtractions, one addition, four modular multiplications, and one inversion. Table 6.1 lists these
arithmetic operations along with their cycle costs to compute a single operation. Subsequent
sections explain the architecture of arithmetic modules in detail.
6.1.1
Modular Multiplication
Bernstein [6] uses a non-integer basis for the polynomial representation of data to achieve an
efficient software implementation. We choose a 16-bit coefficient representation to make the
partial product computation uniform across all the coefficients, which also makes the design
scalable to larger fields. The coefficient multipliers are mapped onto the dedicated DSP48 cores
available in the Virtex-5 devices for an efficient implementation. As depicted in Figure
6.1, the 128-bit data is represented as
$$X = \sum_{i=0}^{n_A - 1} x_i \cdot 2^{i \cdot l_A}, \quad \text{where } n_A = l / l_A, \qquad (6.1)$$
$l = 128$ and $l_A = 16$.

Figure 6.1: Polynomial Representation of Data. [Figure: a 128-bit word split into eight 16-bit
coefficients A0 to A7; coefficient Ai (index i) starts at bit 16i, from bit 0 for A0 to bit 112
for A7.]
The two operands are represented in the above-mentioned polynomial format and are multiplied
with the conventional schoolbook method using word-aligned coefficients. Figure 6.2 illustrates
the dataflow. The result consists of 15 partial products ($S_0, S_1, \ldots, S_{14}$). These partial
products are accumulated in a way that achieves the first level of reduction, where higher-order
coefficients are folded and added to lower-order coefficients. This results in eight 36-bit wide
partial products (Preg0, Preg1, ..., Preg7), which are then passed through a second level of
reduction to get eight 16-bit coefficients ($C_0, C_1, \ldots, C_7$). Figure 6.3 shows how an adder
chain achieves this second level of reduction.
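The two reduction levels described above can be mimicked in software. The Python sketch below is an arithmetic model under stated assumptions, not a description of the actual datapath: it uses the modulus $2^{128} - 3$, folds column sum $S_{i+8}$ onto $S_i$ with a factor of 3, iterates a carry chain until every coefficient fits in 16 bits, and applies a final conditional subtraction that the text does not mention. Function and variable names are hypothetical, and Python integers hide the register widths.

```python
L_A = 16                   # coefficient width in bits
N_A = 8                    # number of coefficients (128 / 16)
R = 3                      # the modulus is 2**128 - R
MASK = (1 << L_A) - 1
MODULUS = (1 << (L_A * N_A)) - R


def to_limbs(x):
    """Split a 128-bit integer into eight 16-bit coefficients, lowest first."""
    return [(x >> (L_A * i)) & MASK for i in range(N_A)]


def modmul_limbs(op1, op2):
    """Multiply two 128-bit operands modulo 2**128 - 3 using 16-bit coefficients."""
    a, b = to_limbs(op1), to_limbs(op2)
    # Schoolbook multiplication: 64 products accumulated into 15 column sums S0..S14.
    s = [0] * (2 * N_A - 1)
    for i in range(N_A):
        for j in range(N_A):
            s[i + j] += a[i] * b[j]
    # First level: 2**128 = R (mod 2**128 - R), so S[i+8] folds onto S[i] times R.
    preg = [s[i] + R * s[i + N_A] if i + N_A < len(s) else s[i] for i in range(N_A)]
    # Second level: carry chain; the carry out of the top coefficient wraps
    # around to coefficient 0, multiplied by R.
    while any(v > MASK for v in preg):
        carry = 0
        for i in range(N_A):
            v = preg[i] + carry
            preg[i] = v & MASK
            carry = v >> L_A
        preg[0] += R * carry
    result = sum(v << (L_A * i) for i, v in enumerate(preg))
    if result >= MODULUS:          # final correction into the range [0, MODULUS)
        result -= MODULUS
    return result


x = 0x0123456789ABCDEF0123456789ABCDEF
y = 0xFEDCBA9876543210FEDCBA9876543210
product = modmul_limbs(x, y)
```

Because every fold and every wrapped carry replaces a factor of $2^{128}$ by 3, the value represented by the coefficients stays congruent to the true product throughout.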
Figure 6.2: Standard Multiplication Method. [Figure: the coefficients a0 to a7 of op1 are
multiplied by the coefficients b0 to b7 of op2 over 8 cycles; the products are accumulated into
the partial products S0 to S14 of op1*op2, and S8 to S14 are added to S0 to S6 to form Preg0 to
Preg7. Each word = 36 bits.]

Figure 6.3: Reduction with Adder-Chain. [Figure: each of Preg0 to Preg7 is split into its upper
20 bits and lower 16 bits; an adder chain combines the upper bits of one word with the lower bits
of the next to produce the coefficients C0 to C7 of the 128-bit result.]

Figure 6.4: Modular Multiplication Architecture. [Figure: a multiplication stage with eight DSP
multipliers MULT 0 to MULT 7 fed by the coefficients Op1_0 to Op1_7 and Op2_0 to Op2_7,
accumulation registers Reg 0 to Reg 7, and two addition/reduction stages that combine the lower
16 bits with the upper carry bits, including multiplications by the constant P, to produce the
16-bit outputs S0 to S7 of the 128-bit modular multiplication result.]
Figure 6.4 shows the architecture of the modular multiplication module in hardware. It
takes two 128-bit inputs (Op1 and Op2) broken into 16-bit coefficients and gives a 128-bit
reduced output. Eight DSP48 multipliers compute the partial products of the
16-bit coefficients. As $n_A^2$ (64 in our case) partial products are needed for the
multiplication result, it takes eight multiplication cycles to get the unreduced multiplication
result.
A similar architecture was presented earlier by Guneysu [19]. However, Figure 6.4 presents
an important optimization of the reduction step. Since we are computing Op1*Op2 mod
($2^{128} - p$), where $p = 3$ in our case, the reduction adds to the cycle cost of a modular
multiplication. By multiplying the shifting operand Op2_7 by 3, we perform the reduction in
parallel with the multiplication.
For the 128-bit data field with eight DSP multipliers, a modular multiplication takes eight cycles
of multiplication and 12 iterations of reduction, i.e. a cycle cost of 20 per modular
multiplication when the two are performed one after the other. As
depicted in Figure 6.4, this cost has been reduced to 14 cycles by overlapping the reduction with
the partial multiplications. This is a significant improvement in terms of latency over Guneysu's
architecture [19], which takes 70 clock cycles for a 256-bit modular multiplication.
The reduction is performed by an adder chain, which adds the lower 16 bits of the ith partial
product to the upper carry bits of the (i-1)th one. The carry bits of the highest coefficient
Reg7 are multiplied by 3 before adding them to the lowest coefficient Reg0. The final 128-bit
result is obtained by concatenating these eight 16-bit reduced outputs. A two-stage
pipeline of adder chains is employed to reduce the critical path of the reduction stage.
This architecture supports generalized modular multiplication for any p over a 128-bit
prime field. It can easily be extended to larger prime fields by adding multiplier-adder
columns, in which case the performance may vary with the number of multiplier-adder columns.
6.1.2
Dedicated Square Unit
Table 6.1 shows that inversion is the most expensive operation in a Pollard rho step.
Optimizing this operation therefore improves the overall system performance.
An inversion involves a total of 137 modular multiplications, of which about 75% are
squarings. A dedicated, optimized squaring unit reduces the effective
time of an inversion and consequently accelerates the point addition operation.
The squaring operation needs only about half as many partial products as multiplication, so we
modified the multiplication architecture to obtain an optimized square module, as shown in
Figure 6.5. It involves only 5 multiplication iterations instead of 8. The reduction stage
of the square module is similar to that of the multiplier. With this architecture, a square
operation is completed in only 11 cycles, a speed-up of about 1.27 over the 14-cycle
multiplication architecture. Since the majority of the operations required for an inversion are
squares, this translates to a significant reduction in the cycle cost of an inversion.

Figure 6.5: Dedicated Square Architecture. [Figure: eight multipliers Mult 0 to Mult 7 fed with
the operand coefficients op0 to op7 and doubled coefficients (op1 * 2 to op7 * 2), with a
multiplication by the constant P; the products go to the reduction stage.]
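The saving in partial products can be illustrated with a short Python sketch. It only models the arithmetic: each symmetric product a[i]*a[j] is computed once and doubled, which is the usual way to exploit symmetry in squaring and is assumed here rather than taken from Figure 6.5. The function names and coefficient values are hypothetical.

```python
def square_column_sums(a):
    """Column sums of a*a, computing each symmetric product a[i]*a[j] only once."""
    n = len(a)
    s = [0] * (2 * n - 1)
    products = 0
    for i in range(n):
        s[2 * i] += a[i] * a[i]              # diagonal term
        products += 1
        for j in range(i + 1, n):
            s[i + j] += 2 * a[i] * a[j]      # doubled off-diagonal term
            products += 1
    return s, products


def schoolbook_column_sums(a, b):
    """Column sums of a*b with all n*n products, for comparison."""
    n = len(a)
    s = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            s[i + j] += a[i] * b[j]
    return s


limbs = [0x1234, 0xFFFF, 0x0001, 0xBEEF, 0x8000, 0x7FFF, 0xCAFE, 0x0F0F]
sq, count = square_column_sums(limbs)
```

For eight coefficients this gives 36 products instead of 64. With eight multipliers working in parallel, 36 products need at least five passes, which is consistent with the five multiplication iterations quoted above, although the exact scheduling of the hardware is not reproduced here.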
6.1.3
Vectorized Inversion
To reduce the effective cost of an inversion, we use the optimizations described below. From
Fermat's little theorem, it follows that the modular
inverse of a nonzero element $x \in \mathbb{F}_p$ can be obtained by computing $x^{p-2} \bmod p$. For the
secp112r1 curve in 128-bit arithmetic, a straightforward inversion requires 112 squarings and 59
multiplications.
Windowing Optimization
A windowing method reduces the number of squarings and multiplications needed to
invert an input. For a window size of four, an inversion takes 108 squarings
and 29 modular multiplications. With the cycle costs of Table 6.1, this amounts to 1594 clock
cycles (108 × 11 + 29 × 14), as opposed to 2058 cycles (112 × 11 + 59 × 14) without windowing,
a speed-up of about 1.3.
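A fixed-window exponentiation for the Fermat inversion can be sketched in Python as follows. This is a generic textbook variant written for illustration, with hypothetical names; the table construction and the addition chain of the actual design are not given in the text, so the multiplication count of this sketch is not expected to match the 29 multiplications quoted above.

```python
P_FIELD = (2**128 - 3) // 76439          # secp112r1 field prime, as given in the text


def invert_windowed(x, p=P_FIELD, w=4):
    """Compute x**(p-2) mod p with fixed w-bit windows, counting operations."""
    e = p - 2
    table = [1, x % p]
    mults = 0
    for _ in range(2, 1 << w):           # precompute x^2 .. x^(2^w - 1)
        table.append(table[-1] * x % p)
        mults += 1
    digits = []
    while e:                             # split the exponent into w-bit windows
        digits.append(e & ((1 << w) - 1))
        e >>= w
    digits.reverse()                     # most significant window first
    acc = table[digits[0]]
    squarings = 0
    for digit in digits[1:]:
        for _ in range(w):               # w squarings per window
            acc = acc * acc % p
            squarings += 1
        if digit:                        # one multiplication per nonzero window
            acc = acc * table[digit] % p
            mults += 1
    return acc, squarings, mults


x = 0x123456789ABCDEF
inverse, squarings, mults = invert_windowed(x)
```

The 112-bit exponent splits into 28 windows of four bits, and every window after the first costs four squarings, which gives the 108 squarings quoted above.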
Montgomery trick
To further reduce the cost of an inversion, we use Montgomery's trick [27], which
trades M inversions for 3(M − 1) multiplications and one inversion. It allows
multiple random walks to be vectorized and run simultaneously on a single ECC core, so that they
share the inversion cost.
With a large vector size, the inversion cost per iteration becomes small compared to other
operations such as multiplication. We select a vector size of 16; optimizing this
parameter is left as future work.
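Montgomery's trick can be written compactly in Python. The sketch below is a standard software formulation given for illustration, with hypothetical names and toy inputs; it uses Python's built-in modular inverse for the single shared inversion rather than the windowed exponentiation of the hardware.

```python
def batch_inverse(values, p):
    """Invert every element of values mod p with one inversion (Montgomery's trick)."""
    m = len(values)
    mults = 0
    prefix = [values[0] % p]                 # prefix[i] = v0 * v1 * ... * vi
    for v in values[1:]:
        prefix.append(prefix[-1] * v % p)
        mults += 1
    inv = pow(prefix[-1], -1, p)             # the single shared inversion
    out = [0] * m
    for i in range(m - 1, 0, -1):
        out[i] = inv * prefix[i - 1] % p     # 1 / v_i
        inv = inv * values[i] % p            # now 1 / (v0 * ... * v_{i-1})
        mults += 2
    out[0] = inv
    return out, mults


p = (2**128 - 3) // 76439                    # field prime quoted in the text
values = [3 + 7 * i for i in range(16)]      # 16 nonzero toy inputs (vector size 16)
inverses, mults = batch_inverse(values, p)
```

For a vector of 16 values the sketch performs 45 multiplications, that is 3(M − 1) with M = 16, plus one inversion, so each walk carries only one sixteenth of the inversion cost.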
6.2
System Architecture
The system is implemented on a Nallatech computing platform. The FPGA performs the
computationally expensive Pollard rho iterations, whereas the host processor manages the
central database and executes collision search. The communication between software and
hardware is carried out only for the exchange of seed points and distinguished points, which
reduces the communication overhead.
6.2.1
Nallatech Platform
Figure 6.6 depicts the architecture of the Nallatech system. It consists of one quad-core Xeon
E7310 processor and three Virtex-5 FPGAs (one xc5vlx110 and two xc5vsx240t). A fast North
Bridge integrates the high-speed components, including the Xeon, the FPGAs and main memory. A
slower South Bridge integrates peripherals into the system, including the hard disk. Both
the Xeon and the FPGA can directly access system memory using the Front Side Bus (FSB).
A Field Programmable Gate Array (FPGA) is used for computing additive random walks on
the elliptic curve, while the host Xeon processor executes the software driver.
Figure 6.6: Nallatech System. [Figure: block diagram with a four-core Xeon processor (C0 to C3),
the system memory workspace, the North Bridge, the South Bridge, the hard disk and the FPGA
accelerator with its interface; link labels 21GB/s (peak), 3.2GB/s, 3.2GB/s and 100MB/s.]
6.2.2
Software Implementation
The Xeon processor executes a software driver (in C) and manages the software interface to the FSB.
The software driver mainly handles the communication interface with the FPGA, seed point (SP)
generation, and the storage and sorting of DPs. As shown in Figure 6.7, two-way communication
between the Xeon and the FPGA takes place over the FSB.
When program execution starts, the software calls APIs to configure the FPGA card, to
initialize the FSB link, and to allocate the workspace memory. It then generates random
SPs on the curve E and starts an attack by sending them to the FPGA over the FSB. Every point
has x- and y-coordinates of 128 bits each.
The hardware computes a DP for each SP received and sends it to the software along with
its corresponding SP. When the software receives an SP-DP pair from the FPGA, it performs
a collision search among all the received DPs. Once a collision is detected, it computes the
secret scalar. As the software takes care of the central database of DPs, the collision search is
conducted in parallel with the hardware computations.
6.2.3
Hardware Implementation
On the hardware side, as shown in Figure 6.7, the FPGA edge core provides an interface between
the FSB and the ECC core. It consists of control logic and two 256-bit wide FIFOs. The RX
FIFO buffers the incoming SPs, and the TX FIFO stores the DPs found so that they can be sent back
to the Xeon.
The ECC core performs a random walk by computing the point addition operation iteratively
until it finds a DP, which it stores in the TX FIFO. We define a DP as a point whose
y-coordinate has 16 zeros. The probability of a point being distinguished is almost
exactly $2^{-16}$. The distinguishing property allows the ECC core to send only a few
points back to the Xeon, which reduces the communication overhead and minimizes the
storage requirement in the hardware. The required bandwidth of the communication bus is
around 8 Kbit/s, which is well within the range of the FSB. The key components of the design
are detailed below.
IO controller
The IO controller manages the read/write interfaces of TX-RX FIFOs and controls the ECC
core operation. It feeds the SPs from RX FIFO to the ECC core and initiates a Pollard rho
walk. When a DP is found, the IO controller halts the ECC core operation until a new SP
is loaded from the RX FIFO. The computed DPs are buffered in the TX FIFO and then
transferred to the Xeon along with corresponding SPs.
Figure 6.7: System Architecture. [Figure: on the software side, the C driver on the 4-core Xeon
processor (1.6GHz) sends seed points (X) and receives distinguished points (Y) over the FSB bus;
on the hardware side (Virtex-5 FPGA), the FPGA edge module with read and write interfaces, the
input/output controller with the RX FIFO (SP) and TX FIFO (DP), and the ECC core, which comprises
the vector sequencer (next-address logic and micro-instruction ROM) and the PA module datapath
(RAM and arithmetic modules).]
ECC Core
The ECC core consists of a micro-instruction sequencer and the point-addition (PA) datapath. It
performs a random walk by computing point additions iteratively until it finds a DP
or crosses the iteration limit (which is currently set to $2^{20}$).
Sequencer
This secondary controller executes the micro-instruction sequence stored in a ROM and
issues the corresponding control signals to the PA datapath module. In this way, it controls the
execution flow of the low-level arithmetic operations of a point addition. The micro-coded
architecture makes the micro-instruction sequence flexible enough to support different
vector sizes (N). It also supports scalability of the design by providing a mechanism to
control multiple ECC cores in a SIMD (Single Instruction Multiple Data) fashion. The
vector size is a generic parameter of the design.
Table 6.2 summarizes the micro-instruction sequence for a single point addition. Here, t1,
t2, t3, t4, Px, Py, Qx and Qy represent N-entry register files. The register files Px, Py, Qx
and Qy hold the x- and y-coordinates of the points to be added, and the remaining register
files store the intermediate results.
Table 6.2: Micro-instruction Sequence for Point addition: P + Q

Instruction               Function performed
Canonicalization          Py * 76439
Subtraction               t1 = Py - Qy
Subtraction               t2 = Px - Qx
Inversion                 t2 = invert(t2)
Addition                  t3 = Px + Qx
Modular Multiplication    t4 = t2 * t1
Modular Square            t1 = t4 * t4
Subtraction               t1 = t1 - t3 : Px (i+1)
Subtraction               t2 = Px - t1
Modular Multiplication    t3 = t2 * t4
Subtraction               t3 = t3 - Py : Py (i+1)
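The register-level sequence of Table 6.2 can be followed step by step in Python. The sketch below is a scalar model under stated assumptions: it handles a single point addition rather than a vector of N walks, omits the canonicalization step, computes the inversion directly with Fermat's little theorem, and is checked on an assumed toy curve ($y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$) rather than secp112r1. The function name is hypothetical.

```python
def point_add_sequence(Px, Py, Qx, Qy, p):
    """Affine P + Q (for P != +-Q) following the register steps of Table 6.2."""
    t1 = (Py - Qy) % p          # Subtraction
    t2 = (Px - Qx) % p          # Subtraction
    t2 = pow(t2, p - 2, p)      # Inversion, here via Fermat: t2^(p-2)
    t3 = (Px + Qx) % p          # Addition
    t4 = t2 * t1 % p            # Modular multiplication: slope of the chord
    t1 = t4 * t4 % p            # Modular square
    t1 = (t1 - t3) % p          # Subtraction: x-coordinate of the next point
    t2 = (Px - t1) % p          # Subtraction
    t3 = t2 * t4 % p            # Modular multiplication
    t3 = (t3 - Py) % p          # Subtraction: y-coordinate of the next point
    return t1, t3


# Toy check: (5, 1) + (6, 3) on y^2 = x^3 + 2x + 2 over F_17 (assumed example).
x3, y3 = point_add_sequence(5, 1, 6, 3, 17)
```

The sequence needs no conditional step because a random walk adds two distinct points with different x-coordinates in all but exceptional cases, which this sketch does not handle.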
The next-address logic (NAL) controls the execution of micro-instructions from the ROM. Every
micro-instruction consists of N phases, corresponding to the N entries of a vector. The
same micro-operation is applied to every element of the vector. Furthermore, every
micro-operation can have a variable execution time. This allows the micro-instruction concept to
be applied to all the operations required for point addition regardless of latency.
The NAL module reads a micro-instruction i and issues the corresponding control signals
to the PA datapath N times. These control signals include a start pulse, an operand select, the
opcode of the micro-operation, a write pulse to store the results, a destination-register select
and the vector phase $n_i$ to which the operation belongs.
For an inversion operation, the NAL module extends the start and write signals for N clock cycles,
each with a different vector phase $n_i$. This makes it possible to load N inputs into the
inversion module and write N results into the corresponding register files. An inversion is
performed only once per vector of N entries, whereas the other instructions are executed N times.
As the datapath of a PA iteration is fixed, there is no need for conventional conditional
micro-instructions such as jump or check flag. This makes the instruction set simpler.
Datapath
The datapath consists of modular arithmetic operators and memory. We carefully designed
each of these sub-blocks to support vectorized point additions. A vectorized point addition
allows the execution of multiple random walks simultaneously on a single ECC core, with
only one inversion per N point-additions.
As shown in Table 6.2, we need eight registers (t1, t2, t3, t4, Px, Py, Qx, Qy) to hold the
operands and intermediate results of a single point addition. For a vector size of N, we use
N-entry register files. For an efficient implementation, these N-entry register files are mapped
to the distributed RAMs available in the FPGA. Each of these memories is 128 bits wide and has a
depth equal to the vector size N. The address of an entry in the memory corresponds to the vector
number to which the content belongs.
6.3
Implementation Results
Though our work is dedicated to the prime $p = (2^{128} - 3)/76439$, the same solution can work
with minor modifications for any curve of the form $y^2 = x^3 - 3x + b$. For demonstration
purposes, the seed points that we generate are carefully chosen to be of order $2^{50}$ [6], which
means that we need only about $2^{25}$ steps to solve the ECDLP. This allows us to demonstrate
collisions, proving that our solution works.
6.3.1
Overall Performance
The whole system runs at 100 MHz and, with a single ECC core, utilizes 4773 slices, which is 12.7%
of the area of the Virtex-5 device xq5vsx240t. It takes 1.5 microseconds per Pollard rho step
and can perform up to 660K iterations per second per ECC core. With 16 ECC cores working
in parallel, our system would need 176 years to solve the secp112r1 ECDLP.
Guneysu's architecture described in [19] targets 256-bit prime arithmetic over two fixed
NIST primes. Our solution improves on it in terms of latency for the important ECC arithmetic
operations. Assuming that the cycle cost of 256-bit arithmetic is twice
that of 128-bit arithmetic (worst-case scenario), our architecture has a cycle
cost of 28 for a modular multiplication and 302 for a point addition. This is 2.5X and 3X
lower latency for a modular multiplication and a point addition, respectively,
than that of the design in [19].
A performance comparison among various implementations is not straightforward, as different
solutions target different curves and different coordinate systems. The
performance figures are also specific to the target platform, the size of the target curve and the
underlying arithmetic (i.e. binary or prime field).
Table 6.3: Comparison with Software Implementations

Platform                                                  Time/PA in ns       Iterations/sec
Cell processor @3.192GHz, secp112r1 curve [6]             113 (362 cycles)    8.81M
Cell processor @3.192GHz, secp112r1 curve [9]             142 (453 cycles)    7.04M
Cell processor @3.192GHz, ECC2K-130 Binary Field [8]      233 (745 cycles)    4.28M
Our system, secp112r1                                     1500                660K: single core
                                                                              10.56M: 16 cores
6.3.2
Comparison with Previous Software Implementations
Table 6.3 compares our solution with software implementations. It shows that our
results are comparable with those of software if multiple ECC cores run in parallel.
The listed multi-core performance is an estimate; our measured results are for a single ECC
core only.
6.3.3
Comparison with Hardware Implementation
As shown in Table 6.4, the solution reported in [15] claims to have 111M iterations per second.
It also claims to solve ECC2K-130 within a year with five COPACOBANA machines, but
the system demonstration is not reported.
Table 6.4: Comparison with Hardware Implementations (per ECC core)

| Platform | Arithmetic | Iterations/sec | Area | Demonstrated |
|---|---|---|---|---|
| Spartan-3 [16] | Prime (160-bit) | 47.28K | 3034 slices | Unclear |
| Spartan-3 [20] | Prime (160-bit) | 50.12K | 2660 slices | Unclear |
| Spartan-3 [15] | Binary (130-bit) | claims 111M | 26731 slices | Paper design |
| Virtex-4 [31] | Binary (79-bit) | claims 100M | 22236 slices, 30 BRAMs | Paper design |
| Virtex-5, our system | Prime (112-bit) | 660K | 4773 slices, 9 BRAMs, 62 DSP48 | Demonstrated |

Similarly, the solution reported in [31] claims 100M iterations per second based on a
paper design. We assume that the difference in performance figures is due to factors such
as binary field arithmetic, different curve sizes, and the use of pipelined architectures. We can
see that ours is currently the only solution that successfully demonstrates the Pollard rho algorithm
on a hardware-software integrated platform.
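For readers who want to see the algorithm in miniature, the following Python sketch runs Pollard rho on a toy curve. Every parameter here is an assumption chosen for illustration: the curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1) of prime order 19, a three-way partition of the walk, and Floyd cycle detection on a single walk. It is not our hardware design, which targets secp112r1 and multiple ECC cores.

```python
# Toy Pollard rho for the ECDLP (requires Python 3.8+ for pow(x, -1, m)).
# All parameters are illustrative assumptions, not the thesis configuration:
# curve y^2 = x^3 + 2x + 2 over F_17, base point G = (5, 1) of prime order 19.
P_MOD, A = 17, 2
G, ORDER = (5, 1), 19


def ec_add(p1, p2):
    """Add two affine points; None stands for the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # P + (-P) is the point at infinity
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def ec_mul(k, pt):
    """Double-and-add scalar multiplication."""
    acc = None
    while k:
        if k & 1:
            acc = ec_add(acc, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return acc


def step(r, a, b, q):
    """One walk iteration, keeping the invariant R = a*G + b*Q."""
    branch = 0 if r is None else r[0] % 3
    if branch == 0:
        return ec_add(r, G), (a + 1) % ORDER, b
    if branch == 1:
        return ec_add(r, r), 2 * a % ORDER, 2 * b % ORDER
    return ec_add(r, q), a, (b + 1) % ORDER


def pollard_rho(q):
    """Return k with Q = k*G, restarting from a new point after a useless collision."""
    for a0 in range(ORDER):
        for b0 in range(1, ORDER):
            tortoise = hare = (ec_add(ec_mul(a0, G), ec_mul(b0, q)), a0, b0)
            while True:
                tortoise = step(*tortoise, q)
                hare = step(*step(*hare, q), q)
                if tortoise[0] == hare[0]:
                    break
            db = (hare[2] - tortoise[2]) % ORDER
            if db:  # solve a1 + k*b1 = a2 + k*b2 (mod ORDER) for k
                return (tortoise[1] - hare[1]) * pow(db, -1, ORDER) % ORDER
    return None


if __name__ == "__main__":
    secret = 7
    print(pollard_rho(ec_mul(secret, G)) == secret)
```

A collision between the two walkers gives two representations of the same point, a1\*G + b1\*Q = a2\*G + b2\*Q, from which the discrete logarithm follows whenever b1 and b2 differ modulo the group order.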
6.4 Summary of Contributions
We successfully demonstrate a complete ECC cryptanalytic machine to solve the ECDLP on a
hardware-software co-integrated platform. We also implement a novel hardware architecture to perform modular multiplication over a prime field, and this is the most efficient implementation reported at present for prime field multiplication. This architecture is further
extended to a dedicated square module. This work also demonstrates the use of micro-instruction based sequencing logic to support different vector sizes and to control multiple
ECC cores in SIMD fashion. We compare our performance results with the previous hardware
implementations and show that our solution can achieve comparable performance with a
multi-core implementation. The proposed system can further be used to demonstrate
prime field arithmetic over other curves of different sizes.
Chapter 7
Conclusion
In this thesis, we have addressed two problems in the area of hardware security. We used
reconfigurable hardware to demonstrate our results successfully. The first solution presents
an efficient implementation of an SCA countermeasure on an FPGA platform using an
industrial design flow. It proposes a novel memory organization technique and an interleaved
data format in combination with a hiding countermeasure. The methodology is portable to
other FPGA platforms for the majority of block ciphers. A comparison with related solutions
shows that our solution offers a very good trade-off between security gain, performance,
circuit area and design complexity. Our experimental results establish the feasibility of
the proposed methodology for implementing an embedded system that achieves a desired level of
security at reasonable cost.
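The dual-rail precharge principle that underlies this hiding countermeasure can be illustrated with a few lines of Python. The sketch below is a generic illustration only: the 16-bit packing of a byte with its complement is an assumption made for clarity and is not the interleaved data format used in our design.

```python
# Illustration of the dual-rail precharge principle on 8-bit values.
# The 16-bit layout is an assumption for illustration only.

PRECHARGE = 0  # all rails are cleared before every evaluation


def hamming_weight(value: int) -> int:
    """Number of set bits in value."""
    return bin(value).count("1")


def dual_rail(byte: int) -> int:
    """Pack a byte (direct rail) with its bitwise complement (complementary rail)."""
    return (byte << 8) | (byte ^ 0xFF)


def toggles_with_precharge(byte: int) -> int:
    """Bits that switch when moving from the precharge state to the encoded value."""
    return hamming_weight(PRECHARGE ^ dual_rail(byte))


def toggles_single_rail(prev: int, curr: int) -> int:
    """Bits that switch in an unprotected single-rail update."""
    return hamming_weight(prev ^ curr)


if __name__ == "__main__":
    # Dual-rail with precharge: the toggle count does not depend on the data.
    print({toggles_with_precharge(b) for b in range(256)})
    # Single-rail: the toggle count varies with the data.
    print(sorted({toggles_single_rail(0x00, b) for b in range(256)}))
```

Because exactly one rail of each pair switches in every evaluation, the number of bit transitions is independent of the data being processed, which is the property a hiding countermeasure relies on.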
In the second part of this thesis, we successfully demonstrate a complete ECC cryptanalytic
machine to solve the ECDLP on a hardware-software co-integrated platform. We use a novel
hardware architecture to perform modular multiplication over a prime field. This work also
demonstrates the use of micro-instruction based sequencing logic to support different vector
sizes and to control multiple ECC cores in SIMD fashion. We compare our performance
results with the previous hardware implementations and show that our solution achieves
comparable performance with a multi-core implementation.
Bibliography
[1] Virtual secure circuit: Porting dual-rail pre-charge technique into software on multicore.
2010. chenzm@vt.edu 14739 received 10 May 2010.
[2] J.A. Ambrose, S. Parameswaran, and A. Ignjatovic. Mute-aes: A multiprocessor architecture to prevent power analysis based side channel attack of the aes algorithm. pages
678 –684, nov. 2008.
[3] Daniel V. Bailey, Lejla Batina, Daniel J. Bernstein, Peter Birkner, Joppe W. Bos,
Hsieh-Chung Chen, Chen-Mou Cheng, Gauthier Van Damme, Giacomo de Meulenaer,
Luis J. Dominguez Perez, Junfeng Fan, Tim Güneysu, Frank K. Gürkaynak, Thorsten
Kleinjung, Tanja Lange, Nele Mentens, Ruben Niederhagen, Christof Paar, Francesco
Regazzoni, Peter Schwabe, Leif Uhsadel, Anthony Van Herrewege, and Bo-Yin Yang.
Breaking ecc2k-130. IACR Cryptology ePrint Archive, 2009:541, 2009.
[4] Josep Balasch, Benedikt Gierlichs, Roel Verdult, Lejla Batina, and Ingrid Verbauwhede.
Power analysis of atmel cryptomemory - recovering keys from secure eeproms. pages
19–34, 2012.
[5] Lyonel Barthe, Pascal Benoit, and Lionel Torres. Investigation of a masking countermeasure against side-channel attacks for risc-based processor architectures. pages
139–144, 2010.
[6] Daniel J. Bernstein, Tanja Lange, and Peter Schwabe. On the correct use of the negation
map in the pollard rho method. pages 128–146, 2011.
[7] I. Blake, G. Seroussi, N. Smart, and J. W. S. Cassels. Advances in Elliptic Curve Cryptography (London Mathematical Society Lecture Note Series). Cambridge University
Press, New York, NY, USA, 2005.
[8] Joppe W. Bos, Thorsten Kleinjung, Ruben Niederhagen, and Peter Schwabe. Ecc2k-130
on cell cpus. pages 225–242, 2010.
[9] J. W. Bos, M. E. Kaihara, and P. L. Montgomery. Pollard rho on the playstation 3. SHARCS,
pages 35–50, 2009.
[10] R. P. Brent and J. M. Pollard. Factorization of the eighth fermat number. Math. Comp.,
36:627–630, 1981.
[11] Suresh Chari, Charanjit S. Jutla, Josyula R. Rao, and Pankaj Rohatgi. Towards sound
approaches to counteract power-analysis attacks. pages 398–412, 1999.
[12] Zhimin Chen, Ambuj Sinha, and Patrick Schaumont. Implementing virtual secure circuit
using a custom-instruction approach. pages 57–66, 2010.
[13] Joan Daemen and Vincent Rijmen. The Design of Rijndael. Springer-Verlag New York,
Inc., Secaucus, NJ, USA, 2002.
[14] Thomas Eisenbarth, Timo Kasper, Amir Moradi, Christof Paar, Mahmoud Salmasizadeh, and Mohammad T. Manzuri Shalmani. On the power of power analysis in
the real world: A complete break of the keeloq code hopping scheme. pages 203–220,
2008.
[15] Junfeng Fan, Daniel V. Bailey, Lejla Batina, Tim Güneysu, Christof Paar, and Ingrid
Verbauwhede. Breaking elliptic curve cryptosystems using reconfigurable hardware.
pages 133–138, 2010.
[16] Tim Güneysu, Gerd Pfeiffer, Christof Paar, and Manfred Schimmler. Three years of
evolution: Cryptanalysis with copacobana, 2009.
[17] Sylvain Guilley, Laurent Sauvage, Florent Flament, Vinh-Nga Vong, Philippe
Hoogvorst, and Renaud Pacalet. Evaluation of power constant dual-rail logics countermeasures against dpa with design time security metrics. IEEE Trans. Computers,
59(9):1250–1263, 2010.
[18] Sylvain Guilley, Laurent Sauvage, Philippe Hoogvorst, Renaud Pacalet, Guido Marco
Bertoni, and Sumanta Chaudhuri. Security evaluation of wddl and seclib countermeasures against power attacks. IEEE Trans. Computers, 57(11):1482–1497, 2008.
[19] Tim Güneysu and Christof Paar. Ultra high performance ecc over nist primes on commercial fpgas. pages 62–78, 2008.
[20] Tim Güneysu, Christof Paar, and Jan Pelzl. Special-purpose hardware for solving the
elliptic curve discrete logarithm problem. TRETS, 1(2), 2008.
[21] Darrel Hankerson, Alfred J. Menezes, and Scott Vanstone. Guide to Elliptic Curve
Cryptography. Springer Publishing Company, Incorporated, 2010.
[22] Timo Kasper, David Oswald, and Christof Paar. Side-channel analysis of cryptographic
rfids with analog demodulation. pages 61–77, 2011.
[23] Neal Koblitz. Constructing elliptic curve cryptosystems in characteristic 2. pages 156–
167, 1990.
[24] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential Power Analysis. CRYPTO
1999, LNCS 1666:pp. 388–397, 1999.
[25] Piotr Majkowski, Mariusz Rawski, Tomasz Wojciechowski, Zbigniew Kotulski, and Maciej Wojtynski. Heterogenic distributed system for cryptanalysis of elliptic curve based
cryptosystems. pages 300–305, 2008.
[26] Victor S Miller. Use of elliptic curves in cryptography. pages 417–426, 1986.
[27] Peter L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization.
Mathematics of Computation, 48:243–264, 1987. URL:
http://links.jstor.org/sici?sici=0025-5718(198701)48:177<243:STPAEC>2.0.CO;2-3.
[28] Amir Moradi, Alessandro Barenghi, Timo Kasper, and Christof Paar. On the vulnerability of fpga bitstream encryption against power analysis attacks: extracting keys from
xilinx virtex-ii fpgas. pages 111–124, 2011.
[29] Amir Moradi, Markus Kasper, and Christof Paar. Black-box side-channel attacks highlight the importance of countermeasures - an analysis of the xilinx virtex-4 and virtex-5
bitstream encryption mechanism. pages 1–18, 2012.
[30] Amir Moradi and Axel Poschmann. Lightweight cryptography and dpa countermeasures:
A survey. pages 68–79, 2010.
[31] Philippe Bulens, Guerric Meurice de Dormale, and Jean-Jacques Quisquater. Hardware for
collision search on elliptic curve over GF(2^m). SHARCS, April, 2006.
[32] J. M. Pollard. Monte carlo methods for index computation (mod p). pages 918–924,
1978.
[33] Francesco Regazzoni, Alessandro Cevrero, François-Xavier Standaert, Stéphane Badel,
Theo Kluter, Philip Brisk, Yusuf Leblebici, and Paolo Ienne. A design flow and evaluation framework for dpa-resistant instruction set extensions. pages 205–219, 2009.
[34] S. Mangard, E. Oswald, and T. Popp. Power Analysis Attacks: Revealing the Secrets of
Smart Cards. Springer-Verlag New York, Inc., 2002.
[35] Shaunak Shah, Rajesh Velegalati, Jens-Peter Kaps, and David Hwang. Investigation
of DPA resistance of Block RAMs in cryptographic implementations on FPGAs. pages
274–279, Dec 2010.
[36] Stefan Tillich, Mario Kirschbaum, and Alexander Szekely. Sca-resistant embedded processors: the next generation. pages 211–220, 2010.
[37] Stefan Tillich, Mario Kirschbaum, and Alexander Szekely. Implementation and evaluation of an sca-resistant embedded processor. pages 151–165, 2011.
[38] Kris Tiri and Ingrid Verbauwhede. A digital design flow for secure integrated circuits.
IEEE Trans. on CAD of Integrated Circuits and Systems, 25(7):1197–1208, 2006.
[39] Paul C. van Oorschot and Michael J. Wiener. Parallel collision search with cryptanalytic
applications. J. Cryptology, 12(1):1–28, 1999.