Implementation of SCA-Resistant CPU and an ECDLP Engine on FPGA Platform

Suvarna H. Mane

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Master of Science in Computer Engineering

Patrick Schaumont, Chair
Leyla Nazhandali
Lynn Abbott

April 30, 2012
Blacksburg, Virginia

Keywords: Side-Channel Analysis (SCA), Elliptic Curve discrete logarithmic algorithm (ECDLP), Pollard rho, Prime-field arithmetic, Hardware software co-design, FPGA.

Copyright 2012, Suvarna H. Mane

ABSTRACT

The rapid increase in the use of embedded systems for secure transactions has proportionally increased the security threats faced by such devices. These threats are a concern at both the software and the hardware level. The field of cryptography has been intensively researched for secure implementation techniques, methods to attack secure systems, and countermeasures against such attacks. In this thesis, we provide solutions for two interesting problems in the field of hardware security using reconfigurable hardware.

First, we discuss a countermeasure to prevent side-channel analysis (SCA) attacks on an embedded system. We present an SCA-resistant processor design in the context of an embedded design flow for FPGA. It integrates an SCA-resistant custom instruction set on a soft-core CPU and derives its SCA resistance from the dual-rail precharge principle. The resulting countermeasure applies to a broad class of block ciphers that consist of lookup tables and logical operations. While many countermeasures have been proposed previously, we show that our solution achieves an excellent trade-off between SCA resistance, (software and hardware) design complexity, performance, and circuit area cost.
Second, we present a system to attack a type of cryptography called Elliptic Curve Cryptography (ECC). It targets the Elliptic Curve Discrete Logarithmic Problem (ECDLP) for a NIST-standardized ECC curve over a 112-bit prime field. We implement a successful demonstration of an ECC cryptanalytic engine using the Pollard rho algorithm on a hardware-software co-integrated platform. We propose a novel, generalized architecture for polynomial-basis multiplication over a prime field and its extension to a dedicated square module. Its design strategy is portable to other prime field moduli.

This work received support from the National Science Foundation, Grant no. 477634.

Acknowledgments

First, I would like to express my sincere thanks to my adviser, Dr. Patrick Schaumont, under whose guidance I have completed my graduate studies. It has been a privilege working with him and I am extremely grateful for his faith in me as a student. His example of dedication, punctuality, work ethic and enthusiasm has always been an inspiration for me and will continue to be so.

This section would not be complete without mentioning my family, who have always supported and encouraged me throughout my life. Their love has always been the strongest source of motivation for me and I simply wouldn't be where I am today without them. Thank you Aai and Papa for dedicating your lives to create a bright future for your children. My sister Supriya and brother Deepak deserve a special note of thanks for being my best friends all through my life.

I would also like to thank my friends, who have always supported and cared for me in my good as well as difficult times. Thank you everyone - Pooja P, Divya, Pooja A, Shubhangi, Praveen, Sharayu, Aarti, Amrapali, Meeta, Ambuj, Rajat, Abhranil and Aditya.

I also deeply appreciate the help from my coworkers and lab mates, Abhranil Maiti, Xu Guo, Zhimin Chen, Srikrishna Iyer, Lyndon Judge and Mostafa Taha.
Without help from Lyndon and Mostafa, my research would not have been so smooth.

Contents

1 Introduction
  1.1 Side Channel Analysis (SCA) Secure System
    1.1.1 Introduction
    1.1.2 Motivation
    1.1.3 Contribution
  1.2 ECDLP: Introduction
    1.2.1 Motivation
    1.2.2 Contribution
  1.3 Organization
  1.4 Related Articles
2 SCA-resistant CPU: Preliminaries
  2.1 Side Channel Attacks (SCA)
    2.1.1 SCA concept
    2.1.2 Differential Power Analysis (DPA)
    2.1.3 Measurements to Disclosure (MTD)
  2.2 SCA Countermeasures
    2.2.1 Principle of Dual-rail Precharge logic (DPL)
  2.3 Block Ciphers and AES Algorithm
    2.3.1 AES T-Box
  2.4 Customized Processor and Custom Instructions
3 SCA-resistant CPU: Implementation
  3.1 SCA-resistant Data organization
  3.2 Memory Organization for Lookup Tables
  3.3 System Integration
    3.3.1 SCA-resistant Custom Instruction Set
    3.3.2 System on Chip Configuration (SOPC)
4 SCA-resistant CPU: Results Analysis
  4.1 Experimental Setup
  4.2 Single T-Box Experiment
  4.3 AES Prototype
  4.4 Impact of PCB Layout
  4.5 Related Implementations
  4.6 Summary of Contribution
5 ECDLP Engine: Background
  5.1 Elliptic Curve Cryptography (ECC)
  5.2 Pollard-rho Algorithm
    5.2.1 Parallelization
  5.3 Modular Arithmetic
  5.4 Related Work
6 ECDLP Engine: Implementation
  6.1 Modular Multiplication Architecture
    6.1.1 Modular Multiplication
    6.1.2 Dedicated Square Unit
    6.1.3 Vectorized Inversion
  6.2 System Architecture
    6.2.1 Nallatech Platform
    6.2.2 Software Implementation
    6.2.3 Hardware Implementation
  6.3 Implementation Results
    6.3.1 Overall Performance
    6.3.2 Comparison with Previous Software Implementations
    6.3.3 Comparison with Hardware Implementation
  6.4 Summary of Contributions
7 Conclusion

List of Figures

1.1 SCA resistant design by (a) C source code transformation, (b) Dedicated secure logic styles and (c) Customized CPU
2.1 SCA Concept
2.2 Correlation based DPA attack: Example
2.3 Comparison between CMOS standard AND gate and DPL AND gate; (a) A standard AND has data-dependent power dissipation; (b) A DPL AND gate has a data-independent power dissipation
2.4 128-bit AES Dataflow. Shaded operations belong to single T-Box Operation
2.5 Customized Processor Architecture: NiosII
3.1 Balanced Interleaved data format
3.2 Balanced-Interleaved T-Box Organization
3.3 System-on-chip Configuration (SOPC)
4.1 Setup for SCA
4.2 Single T-Box Experiment
4.3 Security Improvement: Single T-Box test
4.4 AES-TBOX implementation: NiosII/s + SDRAM
4.5 Attack results on secure implementation: NiosII/s + SRAM. Trace of correct key guess (here, first key byte) is plotted in black, while all other key guesses are in yellow(gray). The buried trace means unsuccessful attack.
4.6 AES trace on oscilloscope (SDRAM configuration).
4.7 AES trace on oscilloscope (SSRAM configuration).
4.8 Impact of PCB Layout on Residual Leakage
5.1 Standard Pollard-rho attack
5.2 Reduction with prime p, where $p = 2^N - r$
6.1 Polynomial Representation of Data
6.2 Standard Multiplication Method
6.3 Reduction with Adder-Chain
6.4 Modular Multiplication Architecture
6.5 Dedicated Square Architecture
6.6 Nallatech System
6.7 System Architecture

List of Tables

3.1 SCA-resistant Instruction Set for AES
4.1 AES Implementation: Security
4.2 AES Implementation: Area and Performance
4.3 Related work: Comparison
6.1 ECC Arithmetic Operations
6.2 Micro-instruction Sequence for Point addition: P + Q
6.3 Comparison with Software Implementations
6.4 Comparison with Hardware Implementations (per ECC core)

Chapter 1

Introduction

Modern security systems use cryptographic algorithms to provide confidentiality, integrity and authentication of data. These cryptographic algorithms fall into two broad categories: Symmetric cryptography and asymmetric cryptography.
They use mathematically complex operations to achieve the desired level of security. Another research field in security deals with cryptanalytic techniques, which attack secure systems to extract secret keys by exploiting a weakness of a cryptographic algorithm, or a weakness of its implementation at the hardware/software level. In this thesis, we discuss two interesting problems in the field of hardware security.

1.1 Side Channel Analysis (SCA) Secure System

1.1.1 Introduction

Today, many embedded electronic systems, including RFIDs, smart cards, wireless car keys, smart phones and tablets, are used to represent personal identification, to store private information, and to carry out confidential communications. As a result, information security on embedded systems has become crucially important. Although the security of such systems relies on the computational complexity of the underlying cryptographic algorithms, implementing them in either software or hardware usually leaks additional information, which can be used to break the cryptographic system with far less effort. One such technique, the side-channel analysis (SCA) attack [24], extracts the secret keys from cryptographic algorithms within a very short time by exploiting side-channel information such as execution time, power consumption, or electromagnetic emissions.

Over time, the field of SCA has been intensively researched and the literature shows a long string of attacks and counterattacks against countermeasures [30]. Several recent results include the use of SCA to extract keys from Virtex-II FPGA [28] and Virtex-4/5 FPGA [29] bitstream encryption, from the Mifare DESFire contactless card [22], from the Keeloq keyless entry system [14], and from the Atmel Cryptomemory non-volatile memory [4].
It is thus crucially important to consider SCA attacks as a part of the threat model of a design and to introduce suitable SCA countermeasures to hamper these attacks.

With an embedded processor at the heart of these systems, it is desirable to have an easy-to-design SCA-resistant embedded processor. As illustrated in Figure 1.1, the approaches to implement side-channel countermeasures in the context of embedded processors broadly fall into three categories. In the first category, the countermeasure is handled at the software level by transforming the crypto-algorithm into a side-channel leakage-free implementation, for example [11]. This countermeasure is usually algorithm-specific, and requires in-depth understanding of cryptographic operations. The second approach targets the underlying hardware technology and implements the CPU in an SCA-resistant circuit style. Past research has shown that these techniques are very expensive in hardware - costing 3 to 15 times the original circuit area [30] - and thus not applicable to a complete CPU. The third approach uses a customized CPU, where the cryptographic operations are implemented in custom hardware using a secure logic style. This approach does not require conversion of the complete processor to a secure logic style, which saves on the area cost. Hence, the third approach shows a decent trade-off between security and performance, while keeping the cost within a reasonable limit.

[Figure 1.1: three panels (a), (b), (c); each panel rates the approach on CPU performance, circuit area, SCA resistance and C complexity]

Figure 1.1: SCA resistant design by (a) C source code transformation, (b) Dedicated secure logic styles and (c) Customized CPU

1.1.2 Motivation

The design and implementation of a side-channel countermeasure is a very complex and error-prone process because side-channel leakage is a byproduct of the implementation of a cryptographic algorithm. Predicting the amount of side-channel leakage from, say, cryptographic software in C is difficult. At the same time, an embedded system design needs to satisfy several constraints such as low area, low power consumption, small software footprint and low cost, while maintaining high performance.

Our work is motivated by the need for an easy-to-use countermeasure, applicable to a wide range of designs and usable within a standard FPGA design flow. The objective is to systematically remove side-channel leakage while keeping a reasonable cost in circuit area and performance degradation. We consider protection of a general class of block ciphers that use logic operations and lookup tables. This includes AES, DES, and many others. We propose our methodology in the context of embedded designs with a CPU, and develop side-channel resistance for cryptographic software executing on the processor.

1.1.3 Contribution

This thesis presents a secure embedded system design based on a customized CPU, which uses an SCA-resistant custom instruction set and an optimized memory organization. The design configuration is supported by a soft-core CPU in mainstream FPGA families, and its SCA resistance is derived from dual-rail precharge logic (DPL).
The solution uses a balanced-interleaved data format, combined with a novel memory organization to support both logic operations and lookup tables. The resulting countermeasure applies to a broad class of block ciphers. We demonstrate our results for a 128-bit Advanced Encryption Standard (AES) T-box implementation and show an SCA-resistance improvement of more than 400X for a system-wide electromagnetic attack that covers both the FPGA and the off-chip memory (SSRAM). This comes at an overhead of 2.7X in performance and 1.15X in area.

Our work is not the first to suggest a customized CPU for side-channel resistant implementations; previous proposals have included masking-based [37, 5] and hiding-based [33, 12] designs. However, using comparisons with related work, we demonstrate that our solution represents an excellent trade-off between SCA resistance, (software and hardware) design complexity, performance, and circuit area cost.

1.2 ECDLP: Introduction

The security of symmetric and asymmetric ciphers is usually determined by their security parameters, for example the size of the key. This is because the computational complexity of breaking an algorithm is higher for a longer key. Elliptic curve cryptosystems (ECC), independently introduced by Miller [26] and Koblitz [23], have found a significant place in both the academic literature and practical applications. ECC is a type of public-key cryptography based on the algebraic structure of elliptic curves over finite fields (either binary or prime). Its popularity is mainly due to its shorter key sizes, which offer the same level of security as conventional cryptosystems such as RSA.

The security of ECC relies on the difficulty and complexity of the Elliptic Curve Discrete Logarithmic Problem (ECDLP) [7]. It refers to the ability to compute a point multiplication and the inability to compute the multiplicand given the original and product points.
By definition, the ECDLP is to find an integer n for two points P and Q on an elliptic curve E such that

$Q = [n]P$    (1.1)

Here, [n] denotes the scalar multiplication with n. The Pollard rho method [32], [10] is the strongest known attack against ECC today. This method solves the ECDLP by iteratively generating points on the curve, each of which has the form $X = [a]P + [b]Q$. When the same point is encountered twice with different coefficient pairs (a, b), a collision occurs and the ECDLP is solved.

1.2.1 Motivation

There have been different approaches to implement the Pollard rho algorithm on software and hardware platforms. Most of the solutions are implemented in software on general-purpose platforms, such as clusters of PlayStation3 [9], Cell CPUs [8] and GPUs [3]. These software approaches are inherently limited by the sequential nature of software on the target platform. Programmable hardware platforms are an attractive alternative because they efficiently support parallelization. However, most of the FPGA-based solutions that have been proposed do not deal well with the control complexity of the ECDLP. Instead, they focus on the efficient implementation of datapath operations, and ignore the system integration aspect of the solution. There has been little work in the area of supporting or accelerating a full Pollard rho algorithm on a hardware-software platform [25]. Our solution, therefore, goes one step further as we demonstrate the parallelized Pollard rho algorithm on FPGA along with its integration with a software driver.

1.2.2 Contribution

We start with a reference software implementation [6], and demonstrate an efficient, parallel implementation of the ECDLP machine over a prime field of the form $(2^k - q)/m$. We use a novel, generalized architecture for polynomial-basis multiplication over a prime field.
The resulting modular multiplier completes the multiplication within 14 clock cycles, which is a 2.5X lower latency than earlier work [19]. The complete system design takes 151 cycles per Pollard rho step at 100 MHz and performs up to 660K point additions per second per ECC core. A single ECC core occupies 4773 slices on a Virtex-5 FPGA device. With a multi-core implementation of our design, the performance can be comparable with that of the software implementation on a Cell processor [6]. Our work also shows that implementation of prime field arithmetic in hardware can be as feasible as binary field arithmetic.

1.3 Organization

This thesis is organized to cover the details of two problems and their solutions. Chapters 2, 3 and 4 discuss the solution for the first problem on hardware security, i.e. the implementation of an SCA countermeasure on a soft-core CPU. The details of the second solution, i.e. the implementation of an ECDLP engine, are explained in Chapters 5 and 6. The individual chapters are structured as follows:

Chapter 2 introduces preliminary knowledge that is needed by the first solution, such as SCA attacks and countermeasures, the Dual-rail Precharge Logic (DPL) principle, the Advanced Encryption Standard (AES) algorithm and custom instruction support for a CPU. Chapter 3 presents the implementation details of our SCA-resistant processor design. We evaluate our solution and analyze the results in Chapter 4.

Chapter 5 presents preliminaries required for the second problem, i.e. the ECDLP engine implementation. It covers the background of ECC, the ECDLP and the Pollard rho method, and discusses modular arithmetic operations over a prime field. The implementation details of the ECDLP engine are provided in Chapter 6, where the results are also compared with previous implementations. Finally, Chapter 7 summarizes the contributions of our work and identifies potential future targets.

1.4 Related Articles

Our work is described in the following papers:

• S. Mane, L. Judge, P. Schaumont, "An Integrated Prime-field ECDLP Hardware Accelerator with High-performance Modular Arithmetic Units", ReConFig 2011.
• S. Mane, M. Taha, P. Schaumont, "Efficient and Side-Channel-Secure Block Cipher Implementation with Custom Instructions on FPGA", under review, FPL'12.
• L. Judge, S. Mane, P. Schaumont, "A Hardware Accelerated ECDLP with High-performance Modular Multiplication", special issue of International Journal of Reconfigurable Computing (IJRC), 2011 (under review).

Chapter 2

SCA-resistant CPU: Preliminaries

In this chapter we give a brief overview of SCA attacks and countermeasures, the AES algorithm and the architecture of customized processors.

2.1 Side Channel Attacks (SCA)

Cryptographic algorithms are considered secure because of their inherent computational complexity. Breaking them with a conventional brute-force key search requires a huge attack effort and thousands of years. However, these algorithmic security features alone are not enough to guarantee security. Passive attacks on cryptographic devices, such as SCA attacks, exploit the weakness of the implementation platform of a cryptographic algorithm and can break it within a few hours. Such an attack reveals the secret key part by part, using the side-channel leakage of a device such as execution time, power consumption and electromagnetic radiation. Hence, SCA poses a serious threat to secure embedded systems. The concept of SCA can be explained as follows.

2.1.1 SCA concept

[Figure 2.1 block labels: Trust Boundary, Cryptographic Device, Key (K), Plaintext (P), Cipher-text (C), Encryption Routine (r), Electromagnetic radiations (Side-channel leakage), Oscilloscope, Digitized traces, Analysis Algorithm, Secret Key]

Figure 2.1: SCA Concept

In the passive type of cryptanalytic attacks, the cryptographic device is operated largely or entirely within its specification and the physical properties of the device are observed to find the secret key.
These observable physical properties are referred to as side-channel leakage and the attacks are known as side-channel attacks (SCA). The basic idea is shown in Figure 2.1, where a cryptographic device is executing an encryption algorithm (r). It takes the plaintext (P) as an input and converts it into ciphertext (C) by using the internal secret key (K), i.e. $C = r(P, K)$. As the secret key K is stored in the device hardware, it is not observable on the input/output ports. The goal of SCA is to find this K. Here, we assume that an attacker knows the underlying cryptographic algorithm and has access to the input/output data (plaintext and ciphertext).

Suppose, in this example, that the cryptographic algorithm r is AES with a 128-bit key and that the observed side-channel property is the electromagnetic (EM) emission. A conventional brute-force attack will need to search through all $2^{128}$ guesses to find the correct key. In contrast, SCA targets the secret key in pieces instead of trying to break all 128 bits together. It takes advantage of the fact that not all key bits are used simultaneously in a cryptographic algorithm. Some intermediate values depend only on a part of the key. For example, the SubBytes operation of AES works at byte level, and hence each output byte of this operation is calculated based on one byte of plaintext and one byte of the secret key. Consequently, the EM radiation strongly relates to one key byte at a specific point of time in the AES execution window. SCA uses this information to compare the recorded EM trace against 256 possible key guesses and reveals the correct key byte. The procedure is repeated for all key bytes. This breaks down the search space to $16 \times 2^8$, which is much less than that for a brute-force attack. The EM traces are captured with the help of an oscilloscope and are then analyzed using analysis algorithms.
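The reduction in search space described above can be written out explicitly. The following worked arithmetic only restates the figures from the text (16 key bytes, 256 guesses per byte, a 128-bit key); the decimal approximation of $2^{128}$ is added for scale.

```latex
\[
\underbrace{16 \times 2^{8}}_{\text{SCA: one key byte at a time}} = 2^{12} = 4096
\qquad \text{versus} \qquad
\underbrace{2^{128}}_{\text{brute force}} \approx 3.4 \times 10^{38}
\]
```

In other words, the byte-wise attack needs at most 4096 key hypotheses in total, provided that each key byte can be tested independently through its side-channel leakage.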
In the general case, an attacker has to select an algorithm-specific intermediate value, which depends on a part of the secret key and is also observable through side-channel leakage. In the above example, the output of the SubBytes operation is chosen as the intermediate value.

2.1.2 Differential Power Analysis (DPA)

There are different ways to analyze the recorded side-channel information to extract secret keys, such as simple power analysis (SPA), differential power analysis (DPA) [34], and mutual information analysis. The goal of the DPA attack is to reveal the secret keys of a cryptographic device based on a large number of power traces that have been recorded while the device encrypts or decrypts different plaintexts.

The first step is to choose an intermediate result v of the cryptographic algorithm, which is a function of the plaintext (P) and the key (K). In the second step, a large number of measurements of the power consumption (or electromagnetic radiation) of the device are recorded while it encrypts (decrypts) D different plaintexts. These traces can be written as a matrix A of size $D \times T$, where T is the length of a trace.

The next step of the attack is to calculate a hypothetical intermediate value for every possible choice of the key. The candidates are written as a vector $k = (k_1, k_2, \ldots, k_N)$, where N is the total number of possible keys. We refer to the elements of this vector as key hypotheses. Knowing the plaintext, an attacker can calculate intermediate values for all D encryptions and for all N key hypotheses, which results in a matrix V of size $D \times N$.

In the next step, these hypothetical intermediate values (matrix V) are mapped to hypothetical power consumption values (matrix H) using a power model. An attacker can obtain a power model using knowledge of the analyzed device and simulation techniques. The most commonly used power models are Hamming weight and Hamming distance. We use the Hamming weight power model in our experiments.
In the final step, each column $h_i$ of the matrix H is compared with each column $a_j$ of A using a correlation technique. This means that the attacker correlates the hypothetical power consumption values of each key hypothesis with the recorded traces at every position. The result of this comparison is a matrix R of size $N \times T$, where each element $r_{i,j}$ represents a correlation coefficient. The element of the matrix R with the highest correlation value reveals the correct key. Figure 2.2 shows an example of 256 correlation coefficient traces. The trace shown in red has a distinctly higher correlation than the other 255 key guesses and thus corresponds to the correct key byte. This attack is also referred to as correlation power analysis (CPA). In our experiments, we collect electromagnetic radiation and analyze it with the CPA technique.

Figure 2.2: Correlation based DPA attack: Example

2.1.3 Measurements to Disclosure (MTD)

There is no direct way to quantify side-channel leakage. Hence, it needs to be expressed using indirect parameters, such as correlation coefficient values or the number of measurements required for a successful attack. The most widely used metric is Measurements to Disclosure (MTD). It represents the number of measurements required to attack a cryptographic device successfully. A higher MTD implies a more secure design. SCA uses statistical methods to analyze the acquired samples, which give better results as the number of samples increases. Needing more samples means that the attacker has to invest more time and more attack effort. We use MTD to evaluate the security gain in our experiments.

2.2 SCA Countermeasures

Power attacks and electromagnetic radiation attacks are based on the dependency between power consumption (or electromagnetic radiation) and intermediate values in a cryptographic algorithm.
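The dependency just described is what the correlation attack of Section 2.1.2 exploits. The following is a minimal sketch of that attack on simulated traces, not the measurement setup used in this thesis. The trace model (Hamming weight of the SubBytes output plus Gaussian noise at one sample point), the key byte 0x3C, the trace count, the noise level and all function names are assumptions made for illustration.

```python
import math
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def build_sbox():
    """Generate the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def hamming_weight(v):
    return bin(v).count("1")

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def simulate_traces(plaintexts, key_byte, leak_index, trace_len, sigma, rng):
    """Matrix A (D x T): Gaussian noise everywhere, HW leakage at one sample."""
    traces = []
    for p in plaintexts:
        row = [rng.gauss(0.0, sigma) for _ in range(trace_len)]
        row[leak_index] += hamming_weight(SBOX[p ^ key_byte])
        traces.append(row)
    return traces

def cpa_attack(plaintexts, traces):
    """Return the best key guess and the correlation matrix R (N x T)."""
    trace_len = len(traces[0])
    columns = [[row[j] for row in traces] for j in range(trace_len)]
    R = []
    for k in range(256):
        # One column of H: hypothetical Hamming weights for key hypothesis k.
        h = [hamming_weight(SBOX[p ^ k]) for p in plaintexts]
        R.append([pearson(h, col) for col in columns])
    best = max(range(256), key=lambda k: max(abs(r) for r in R[k]))
    return best, R

rng = random.Random(2012)            # fixed seed so the sketch is repeatable
plaintexts = [rng.randrange(256) for _ in range(400)]
traces = simulate_traces(plaintexts, key_byte=0x3C, leak_index=2,
                         trace_len=4, sigma=1.0, rng=rng)
recovered, R = cpa_attack(plaintexts, traces)
print("recovered key byte:", hex(recovered))
```

Rerunning the attack on growing prefixes of the trace set and noting the smallest prefix for which the correct key byte ranks first would give a rough MTD estimate for this simulated device.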
The goal of SCA countermeasures is to obtain data-independent power consumption (or electromagnetic radiation). For the purpose of completely preventing SCA, there are two broad categories of countermeasures: masking and hiding. In this thesis, we base our secure solution on hiding. Masking can be implemented at the algorithm level without changing the low-level hardware to make their power characteristics independent of the processed data. It achieves SCAresistance by randomizing the intermediate values of the cryptographic algorithms. The idea is to mask an intermediate value with randomly generated mask m, where m needs to remain secret to achieve the expected SCA-resistance. This countermeasure is usually algorithm-specific, and requires in-depth understanding of cryptographic operations. Moreover, masking becomes very complex under advanced SCA techniques [18]. The goal of hiding is to suppress side-channel leakage by designing the cryptographic devices such that they have the same power consumption and electromagnetic radiation while manipulating sensitive data. Hiding allocates the task of eliminating data dependency to the device hardware. Due to its non-dependency on statistical characteristics of the intermediate values, hiding does not suffer from higher-order attacks. There are several different approaches to implement hiding. One of them is the differential logic, also called Dual-Rail Precharge Logic (DPL). The SCA-resistance in our solution is based on principle of DPL. 2.2.1 Principle of Dual-rail Precharge logic (DPL) The cause of side-channel leakage is data-dependent processing. In CMOS logic, such processing gives data-dependent signal transitions, which in turn results in data-dependent power consumption or radiation. The idea of Dual-rail Precharge Logic (DPL) is to eliminate Suvarna H. Mane Chapter 2. 
side-channel leakage at the level of the implementation. The concept of DPL is explained as follows. First, every bit in the circuit is stored and processed in complementary form. For example, as shown in Figure 2.3, for the logic operation AND (a and b), there is a matching complementary operation OR (not(a) or not(b)). Both of these gates are evaluated simultaneously. This ensures that the load of the AND and that of the OR are identical. Thus, 0 → 1 and 1 → 0 transitions on a signal c cannot be distinguished. However, a transition on a signal can still be distinguished from a non-transition. To address this issue, a precharge phase is introduced before each evaluation phase. Every complementary data pair (a, not(a)) is precharged to (0,0) before every evaluation. Combined, the precharge and evaluation phases result in a constant power consumption: every evaluation has exactly one active 0 → 1 transition, either on the true net or on the complementary net.

(a) Standard AND gate

    Transition on c    Power consumed
    0 → 0              0
    0 → 1              +P
    1 → 0              -P
    1 → 1              0

(b) DPL AND gate (P: precharge phase, E: evaluation phase)

    c: before, P, E    not(c): before, P, E    Power consumed
    0, 0, 0            1, 0, 1                 constant
    0, 0, 1            1, 0, 0                 constant
    1, 0, 0            0, 0, 1                 constant
    1, 0, 1            0, 0, 0                 constant

Figure 2.3: Comparison between CMOS standard AND gate and DPL AND gate; (a) A standard AND gate has data-dependent power dissipation; (b) A DPL AND gate has data-independent power dissipation

DPL has been applied in many different forms since it was first proposed, including ASIC, FPGA, and software [38, 18, 12]. Authors have also identified sources of residual leakage, including early evaluation and imbalance between complementary pairs [17]. However, DPL has demonstrated a substantial reduction of side-channel leakage in prototypes.
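The precharge and evaluate behaviour of Figure 2.3 can be modelled in a few lines of C. This is a behavioural sketch only: using the count of 0 → 1 transitions as a stand-in for dynamic power is a simplifying assumption, and the type `dual_rail` and the functions `dpl_and`, `rising_edges`, `dpl_and_cycle_edges` and `cmos_and_edges` are illustrative names, not part of our hardware design.

```c
#include <assert.h>

/* A dual-rail signal: t is the true rail, f the complementary rail.
 * (0,0) is the precharge value, (1,0) encodes logic 1, (0,1) logic 0. */
typedef struct { unsigned t, f; } dual_rail;

static const dual_rail DPL_PRECHARGE = { 0u, 0u };

static dual_rail dpl_encode(unsigned bit) {
    assert(bit <= 1u);
    dual_rail d = { bit, bit ^ 1u };
    return d;
}

/* DPL AND: an AND on the true rails and the matching OR on the
 * complementary rails. Precharged inputs give a precharged output. */
static dual_rail dpl_and(dual_rail a, dual_rail b) {
    dual_rail c = { a.t & b.t, a.f | b.f };
    return c;
}

/* Number of rails that rise from 0 to 1 between two values. */
static int rising_edges(dual_rail from, dual_rail to) {
    return (int)((~from.t & to.t) & 1u) + (int)((~from.f & to.f) & 1u);
}

/* Rising edges on the DPL output during one precharge + evaluate cycle. */
static int dpl_and_cycle_edges(unsigned a, unsigned b) {
    dual_rail pre = dpl_and(DPL_PRECHARGE, DPL_PRECHARGE);  /* precharge  */
    dual_rail out = dpl_and(dpl_encode(a), dpl_encode(b));  /* evaluation */
    return rising_edges(pre, out);
}

/* Rising edges on a single-rail AND output without precharge:
 * the count depends on the previous output and on the data. */
static int cmos_and_edges(unsigned prev_c, unsigned a, unsigned b) {
    unsigned c = a & b;
    return (int)((~prev_c & c) & 1u);
}
```

For all four input combinations `dpl_and_cycle_edges` returns 1, whereas `cmos_and_edges` returns 0 or 1 depending on the data, which is the difference between panels (b) and (a) of Figure 2.3.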
2.3 Block Ciphers and AES Algorithm

Block ciphers use symmetric-key cryptographic algorithms to encrypt a block of plaintext into ciphertext through successive round transformations. The majority of modern block ciphers are built out of a limited set of operations, including substitutions with lookup tables (S-Boxes), and operations such as XOR, modular addition, rotation, and shift. Furthermore, round transformations have a common structure, and use either a substitution-permutation network (SPN) or a Feistel network. Within this framework, there are important differences among block ciphers as well, such as the number and size of the lookup tables used, and the detailed configuration of the operations.

The Advanced Encryption Standard (AES) [13] is a widely used symmetric block cipher, which can use cryptographic keys of 128, 192, and 256 bits. The 128-bit AES algorithm takes a block of 128 bits of plaintext and, using a 128-bit key, iterates that data through 10 rounds to produce the ciphertext. An AES state (i.e. a block of 128 bits of data) is organized as a 4x4 matrix of 16 bytes and is processed by the AES round. Each AES round includes four operations: SubBytes, ShiftRows, MixColumns, AddRoundKey. Figure 2.4 describes the dataflow of AES.

Figure 2.4: 128-bit AES Dataflow. Shaded operations belong to a single T-Box operation

2.3.1 AES T-Box

There are two different ways to implement the AES algorithm, based on how the substitution operations are performed. The basic implementation, AES S-Box, performs these substitutions using an 8x256 lookup table (S-Box). The more optimized design, AES T-Box [13], computes one complete round of AES using only lookup tables followed by a large XOR network. Figure 2.4 illustrates the dataflow within a single AES round. Each of the s-operations in Figure 2.4 is an 8x256 substitution table.
This S-Box operation on an input byte is calculated by obtaining its finite-field inverse (over GF(2^8)), followed by an affine transformation [13]. Each of the AES-128 rounds consists of 16 such S-Box substitutions. This leads to byte-oriented computations over the entire round, making it inefficient for 32-bit architectures.

A more efficient way to implement AES on 32-bit architectures is to use a T-Box table, which merges the shaded operations in Figure 2.4 into a single lookup operation. The SubBytes and MixColumns operations are reformulated together and implemented as a 32-bit wide, 256-deep T-Box table. A T-Box maps byte-wide operations over four bytes, each of which represents part of a MixColumns result. In each AES round, four T-Box operations can be combined to obtain a single row of the AES state matrix. This requires four different T-Box lookup tables. This approach results in a better utilization of the 32-bit datapath of a processor. We implement the 128-bit AES T-Box algorithm to investigate our solution.

2.4 Customized Processor and Custom Instructions

The crucial demands of embedded system design, such as high performance, low cost, a short time-to-market window, and design flexibility, have resulted in the emergence of Instruction-set Extensible Processors, or Customized Processors. Such a processor consists of an existing processor core that is extended with application-specific custom instructions. These custom instructions execute on reconfigurable custom hardware and allow the user to reduce a complex sequence of standard instructions to a single instruction implemented in hardware. The idea is to implement the desired custom logic in the custom hardware attached to an existing processor, and to call these functions from software using custom instructions.

Figure 2.5: Customized Processor Architecture: NiosII (the ALU and the custom logic share operands A and B and the result path)
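Returning to the T-Box of Section 2.3.1, the following sketch shows how one T-Box entry can be derived from the S-Box definition (finite-field inversion over GF(2^8) followed by the affine transformation). It is an illustration rather than the table generator used in this work: the function names are hypothetical, and the byte order inside the 32-bit entry (02·S in the most significant byte, then S, S, 03·S) is an assumption that follows a common software convention.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 0x02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t v) {
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1B : 0x00));
}

/* Shift-and-add multiplication in GF(2^8). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

/* Multiplicative inverse as a^254; maps 0 to 0 as AES requires. */
static uint8_t gf_inv(uint8_t a) {
    uint8_t r = 1;
    for (int i = 0; i < 254; i++) r = gf_mul(r, a);
    return r;
}

static uint8_t rotl8(uint8_t v, int n) {
    assert(n > 0 && n < 8);
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* AES S-Box: field inversion followed by the affine transformation. */
static uint8_t aes_sbox(uint8_t x) {
    uint8_t b = gf_inv(x);
    return (uint8_t)(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63);
}

/* One T-Box entry: SubBytes merged with one MixColumns column,
 * packed as (02*S, S, S, 03*S) from most to least significant byte. */
static uint32_t aes_t0(uint8_t x) {
    uint8_t s  = aes_sbox(x);
    uint8_t s2 = xtime(s);
    uint8_t s3 = (uint8_t)(s2 ^ s);
    return ((uint32_t)s2 << 24) | ((uint32_t)s << 16) | ((uint32_t)s << 8) | (uint32_t)s3;
}
```

The remaining round tables can be obtained from `aes_t0` by byte rotation; in practice all tables are precomputed once and stored, which is what makes the lookup-table memory organization relevant for SCA resistance.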
Figure 2.5 shows a general architecture of a customized processor. An optimized, hardwired custom logic implementation helps to improve performance through parallelism and chaining of operations. At the same time, custom instructions result in compact code size as well as fewer instruction fetches and decodes.

Altera provides the instruction-set extensible softcore processor NiosII, which is compatible with their CycloneII FPGA device. NiosII supports a user-friendly software interface to access custom hardware using built-in user-defined macros. These custom instructions can simply be written as C statements. We use the softcore NiosII processor for our experiments.

Chapter 3 SCA-resistant CPU: Implementation

We implement a side-channel resistant block cipher by creating DPL versions of both the lookup tables and the logic operations in hardware. These modules are efficiently integrated into the soft-core processor using the custom-instruction set interface. This way, SCA-resistant block ciphers can be executed as a sequence of custom instructions. Non-crypto software, on the other hand, is written using the regular instruction set without a performance hit. The custom-instruction hardware for lookup tables is built from on-chip RAM macros. Research has demonstrated that such dedicated structures increase side-channel resistance [35], and we further improve this technique.

In this chapter, we discuss the components of our solution: the organization of data, the memory organization for lookup tables, and the system integration of SCA-resistant block cipher hardware into software.

3.1 SCA-resistant Data Organization

We need a data format that is compatible with the requirements of DPL and uses the word-level organization of an embedded system. Figure 3.1 shows our data arrangement. Each
32-bit word is split into two balanced half-words, and each bit from the original word is interleaved with an associated complementary bit. We call this representation a balanced-interleaved (BI) format. The logical and physical proximity of complementary bits improves symmetry between the bits (e.g. similar electrical loads), and in turn, this improves SCA resistance. Indeed, at the logical level, adjacent bits will share adjacent storage locations. In embedded architectures, the storage organization may use a wordlength which is different from the processor wordlength; a 32-bit memory may be organized, for example, as two half-word banks. Keeping complementary bits adjacent ensures that they will share the same physical storage bank. Furthermore, at the physical level, adjacent bits in a bus structure have a better chance of being routed with equal wire lengths on the FPGA and PCB. This is important because we are building a generic design flow that does not assume special place-and-route scripts. Equal or similar routing paths for complementary bits keep the electrical imbalance between them very low, which reduces the side-channel leakage.

A consequence of using a balanced-interleaved format is that each 32-bit operation from the original, unprotected block cipher requires expansion into two balanced operations, each processing a balanced half-word.

Figure 3.1: Balanced Interleaved data format (the unbalanced full word b31..b0 is split into a balanced-interleaved lower half-word holding b15..b0 and a balanced-interleaved upper half-word holding b31..b16, each bit stored next to its complement)
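The balanced-interleaved format can be illustrated with a small software model. The sketch below is not the hardware implementation; the function names are hypothetical, and the bit order inside each pair (true bit at the even position, complement at the odd position) is an assumption. The property it demonstrates is that every balanced-interleaved word has a Hamming weight of exactly 16, independent of the data it carries.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the 16 bits of h to the even bit positions of a 32-bit word. */
static uint32_t spread16(uint16_t h) {
    uint32_t v = h;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Balanced-interleaved form of one half-word:
 * true bits at even positions, complements at odd positions (assumed). */
static uint32_t bi_encode_half(uint16_t h) {
    uint32_t t = spread16(h);
    uint32_t c = (~t & 0x55555555u) << 1;
    return t | c;
}

static uint32_t bi_low(uint32_t w)  { return bi_encode_half((uint16_t)(w & 0xFFFFu)); }
static uint32_t bi_high(uint32_t w) { return bi_encode_half((uint16_t)(w >> 16)); }

/* A word is balanced if every (true, complement) pair holds one 1. */
static int bi_is_balanced(uint32_t bi) {
    return (((bi >> 1) ^ bi) & 0x55555555u) == 0x55555555u;
}

/* Collect the even (true) bits back into a 16-bit half-word. */
static uint16_t bi_decode(uint32_t bi) {
    uint32_t v = bi & 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return (uint16_t)v;
}

/* Rebuild the original 32-bit word from its two balanced halves. */
static uint32_t bi_join(uint32_t lo, uint32_t hi) {
    assert(bi_is_balanced(lo) && bi_is_balanced(hi));
    return ((uint32_t)bi_decode(hi) << 16) | bi_decode(lo);
}

static int popcount32(uint32_t v) {
    int n = 0;
    while (v) { v &= v - 1; n++; }
    return n;
}
```

Because the all-zero word is not a valid balanced-interleaved value, it remains available as the precharge value between two operations.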
Figure 3.2: Balanced-Interleaved T-Box Organization (a 16-bit balanced address selects entries 0x00..0xFF of one 256x32 RAM and, in reverse order, 0xFF..0x00 of the other; TBOX_H and TBOX_L together deliver the balanced data (H,L), 32 bits each)

3.2 Memory Organization for Lookup Tables

Because lookup tables are so common in block ciphers, we use a dedicated approach to implement side-channel resistant lookup tables using the RAM macros of the FPGA fabric. We use the AES T-Box implementation as a case study. The T-Box is a lookup table with 8 input bits and 32 output bits, and is defined by grouping several steps of the AES round transformation. For the purpose of explaining our method, we treat the T-Box simply as an 8x32 lookup table. The complete AES algorithm requires five different T-Box tables.

The secure T-Box design shown in Figure 3.2 uses a balanced-interleaved data organization. An 8x32 T-Box thus needs two 8x32 balanced-interleaved tables, each storing a half-word of the original T-Box with its complementary bits. Each balanced-interleaved table is stored in a separate RAM macro. In order to achieve balancing in the address decoding logic, we follow the storage order suggested in [35], namely that complementary RAM macros require complementary addresses. The difference with our design, however, is that the complementary RAMs do not store complementary data: the data within each RAM is already balanced.

Table 3.1: SCA-resistant Instruction Set for AES

    Instruction    Return Value
    CONV_INV(a)    0, 0, .., a[2], a[0]
    CONV_BIL(a)    a[15]..a[0], each bit interleaved with its complement
    CONV_BIH(a)    a[31]..a[16], each bit interleaved with its complement
    B_XOR(a,b)     balanced-interleaved xor(a, b)
    B_TBx_L(a)     balanced-interleaved lookup-table (lower)
    B_TBx_H(a)     balanced-interleaved lookup-table (upper)

    Each AES T-Box has its own B_TBx_H(a) and B_TBx_L(a); x = 0,1,2,3,4
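One way to picture this memory organization is the following software model of the two RAM macros. It is a sketch under several assumptions that go beyond the text: the true address bits are taken to select the entry in the lower table while the complementary address bits select the entry in the upper table (which therefore stores its entries in reverse order), true bits are placed at even positions, and the names `tbox_l`, `tbox_h`, `balanced_lut_init` and `balanced_lut_read` as well as the table contents are purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Balanced-interleaved encoding of a half-word: true bit at the even
 * position, complement at the odd position (assumed order). */
static uint32_t bi_encode_half(uint16_t h) {
    uint32_t out = 0;
    for (int i = 0; i < 16; i++) {
        uint32_t bit = (uint32_t)(h >> i) & 1u;
        out |= bit << (2 * i);
        out |= (bit ^ 1u) << (2 * i + 1);
    }
    return out;
}

/* Software stand-ins for the two 256x32 RAM macros. */
static uint32_t tbox_l[256];
static uint32_t tbox_h[256];

/* Fill both RAMs from a plain 8x32 table. The upper table is stored at
 * complementary addresses; the data in each RAM is already balanced. */
static void balanced_lut_init(const uint32_t plain[256]) {
    for (unsigned x = 0; x < 256; x++) {
        tbox_l[x]          = bi_encode_half((uint16_t)(plain[x] & 0xFFFFu));
        tbox_h[~x & 0xFFu] = bi_encode_half((uint16_t)(plain[x] >> 16));
    }
}

/* Read one entry using a 16-bit balanced-interleaved address. */
static void balanced_lut_read(uint16_t addr16, uint32_t *lo, uint32_t *hi) {
    unsigned a = 0, na = 0;
    for (int i = 0; i < 8; i++) {
        a  |= ((unsigned)(addr16 >> (2 * i)) & 1u) << i;      /* true bits  */
        na |= ((unsigned)(addr16 >> (2 * i + 1)) & 1u) << i;  /* complement */
    }
    assert((a ^ na) == 0xFFu);  /* the address itself must be balanced */
    *lo = tbox_l[a];
    *hi = tbox_h[na];
}

/* Round trip with arbitrary stand-in table contents; returns 1 on success. */
static int balanced_lut_demo(void) {
    uint32_t plain[256];
    for (unsigned x = 0; x < 256; x++)
        plain[x] = x * 0x01000193u + 0x9E3779B9u;
    balanced_lut_init(plain);
    uint32_t lo = 0, hi = 0;
    uint16_t addr = (uint16_t)(bi_encode_half(0x3C) & 0xFFFFu);
    balanced_lut_read(addr, &lo, &hi);
    return lo == bi_encode_half((uint16_t)(plain[0x3C] & 0xFFFFu))
        && hi == bi_encode_half((uint16_t)(plain[0x3C] >> 16));
}
```

In this model both address decoders always see complementary inputs, and both data outputs always carry sixteen 1-bits, which is the balancing effect the organization aims for.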
In summary, our proposed memory organization for lookup tables achieves side-channel resistance by combining three elements. First, the use of RAM cells reduces side-channel leakage because of the increased logic density they offer. Second, balanced-interleaved addressing is used for the overall lookup table. Third, balanced-interleaved data storage is used for the lookup table content.

3.3 System Integration

An important, but often overlooked, aspect of side-channel countermeasures is the system integration. On an embedded processor, SCA-resistant encryption is just one of the many tasks handled by software. We have integrated our countermeasures as custom instructions into a soft-core processor. A custom-instruction interface offers the ability to introduce custom-hardware modules in the execution stage of a RISC pipeline.

3.3.1 SCA-resistant Custom Instruction Set

The NiosII softcore CPU provides support for custom instructions and an interface to integrate them with software. It can support up to 128 custom instructions. We have implemented three different custom instructions to support the AES routine. Table 3.1 shows the side-channel resistant instruction set for AES.

CONV_INV(a) extracts the even bits from a word, and thus converts balanced-interleaved format into direct form. CONV_BIL(a) and CONV_BIH(a) generate the balanced-interleaved form from the lower resp. higher halfword of the input argument a. These instructions are used to initialize the data variables of a program into balanced-interleaved form before it starts executing the cryptographic routine. This ensures that the dataflow of these variables through the program always offers SCA resistance. When the encryption routine is finished, the output variables are converted back into the standard 32-bit form.

The round function for a T-Box based AES only requires a balanced XOR, which can be supported through a single custom instruction B_XOR(a,b).
It takes two 32-bit balanced-interleaved operands as inputs, performs the XOR operation in DPL fashion and produces a balanced-interleaved 32-bit output. Move, shift and rotate operations are compatible with balanced-interleaved arguments, so that no custom instruction is needed for those. Custom logic takes care of shift and rotate manipulations.

The AES T-Box implementation has 5 different T-Box tables. There is a B_TBx_L(a) and a B_TBx_H(a) to access the lower resp. higher half of each T-Box table. These instructions are specific to the AES block cipher; a different block cipher would need different lookup tables. However, it is perfectly feasible to make the lookup tables fully reconfigurable, so that they can be programmed with the S-Box content required for a specific block cipher. The approach to implement lookup tables in the processor is an important difference with earlier work by Chen [12], and we will show how this brings a considerable performance gain.

The AES T-Box algorithm can be written in C by making use of custom instructions embedded as inline assembly macros. The precharge operation can be supported from C as well, as illustrated in the snippet below. Note the use of volatile to prevent the removal of the precharge by an optimizing compiler.

    volatile int t1, t2, t3;
    t1 = 0;               // precharge
    t1 = B_TB0_L(in);     // T-Box0 lower word
    t2 = 0;               // precharge
    t2 = B_TB1_L(in);     // T-Box1 lower word
    t3 = 0;               // precharge
    t3 = B_XOR(t1, t2);   // XOR

A strong feature of this approach is that it is fully compatible with the existing memory hierarchy of an embedded system. Variables can be stored into RAM in balanced-interleaved form, and they will maintain their low side-channel leakage provided that the precharge is properly implemented. Thus, our approach is independent of the number of processor registers; it will not run out of foreground storage (in contrast to e.g. [36]).
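The behaviour of the B_XOR instruction and of the precharge sequence described above can be modelled in portable C for host-side testing. This is a hedged sketch and not the custom-instruction hardware: `b_xor_model`, `precharged_xor` and `bi_encode_half` are hypothetical names, the true bit of each pair is assumed to sit at the even position, and the AND/OR formulation of the two rails is one possible DPL-style realization chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BI_TRUE_MASK 0x55555555u  /* even positions: true bits (assumed) */

/* Balanced-interleaved encoding of a half-word, for test vectors. */
static uint32_t bi_encode_half(uint16_t h) {
    uint32_t out = 0;
    for (int i = 0; i < 16; i++) {
        uint32_t bit = (uint32_t)(h >> i) & 1u;
        out |= bit << (2 * i);
        out |= (bit ^ 1u) << (2 * i + 1);
    }
    return out;
}

/* Balanced XOR built only from AND and OR on the two rails.
 * Precharged (all-zero) inputs give a precharged (all-zero) output. */
static uint32_t b_xor_model(uint32_t a, uint32_t b) {
    uint32_t at = a & BI_TRUE_MASK, ac = (a >> 1) & BI_TRUE_MASK;
    uint32_t bt = b & BI_TRUE_MASK, bc = (b >> 1) & BI_TRUE_MASK;
    uint32_t t = (at & bc) | (ac & bt);  /* true rail:       a xor b  */
    uint32_t c = (at & bt) | (ac & bc);  /* complement rail: a xnor b */
    return t | (c << 1);
}

/* Mirror of the precharge pattern: clear the destination first, then
 * store the balanced result. volatile keeps both stores in the code. */
static void precharged_xor(volatile uint32_t *dst, uint32_t a, uint32_t b) {
    assert(dst != NULL);
    *dst = 0;                  /* precharge  */
    *dst = b_xor_model(a, b);  /* evaluation */
}

/* Small wrapper for checking: returns the balanced XOR of two half-words. */
static uint32_t b_xor_demo(uint16_t x, uint16_t y) {
    volatile uint32_t t = 0xFFFFFFFFu;
    precharged_xor(&t, bi_encode_half(x), bi_encode_half(y));
    return t;
}
```

Such a model can be useful for checking the C code of a cipher on a PC before the custom instructions are available; it says nothing about the electrical balance of the real datapath.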
Storing the balanced-interleaved format in background memory may still cause side-channel leakage due to asymmetry in the physical layout of the background memory. We will analyze this later in this thesis.

3.3.2 System on Chip Configuration (SOPC)

We use the QuartusII SOPC Builder tool to configure our system on the FPGA. It provides an interface to choose the configuration of the processor and of the different peripheral components on the board that are to be included in the system. It also allows the user to define new custom instructions and select any of them for inclusion in the system. The SOPC Builder assigns address space to all these components and integrates them to generate the desired system on chip. It also assigns address space to the custom instructions, which is used to call them through assembly macros.

Figure 3.3: System-on-chip Configuration (SOPC) (NiosII/s, on-chip memory, SDRAM-1 and SDRAM-2 (16-bit, 32MB each), SSRAM (2MB), SDRAM/SRAM controller, RS232 (19200 bps), JTAG debug module, expansion header (GPIO))

Figure 3.3 illustrates the block diagram of the system used in our experiments. It incorporates a 32-bit NiosII/s (50MHz, pipelined) processor, an offchip memory (SDRAM or SSRAM), a GPIO parallel port (trigger), and communication peripherals (RS232 and JTAG UART). NiosII/s is a 32-bit RISC softcore CPU with a pipelined architecture. The SDRAM is a 64MB memory arranged in two 16-bit wide memory banks, whereas the SSRAM is a 32-bit wide memory with a capacity of 2MB. A GPIO pin is configured as an output to generate a trigger for the oscilloscope. The RS232 serial bus handles the communication between NiosII and the PC.

Chapter 4 SCA-resistant CPU: Results Analysis

To demonstrate that our solution improves the resistance against SCA, this chapter presents experimental results based on real attacks. The first section describes the setup we have used in our experiments.
The experiment is divided into two parts: a proof-of-concept experiment (single T-Box attack) and a real-world implementation (128-bit AES T-Box attack) of an AES prototype. These are detailed in the subsequent sections. Next, we analyze the possible causes of residual side-channel leakage in the case of the SDRAM system. We then compare our work with related published secure implementations, and finally we summarize the contributions of this work.

4.1 Experimental Setup

Our designs are implemented on an Altera DE2-70 evaluation board, which has a CycloneII EP2C70F896C6 FPGA device and a NiosII softcore processor. To uncover the secret key with SCA, we build a setup whose block diagram and photograph are shown in Figure 4.1. The setup contains the cryptographic device (Altera DE2-70), an oscilloscope (Agilent DSO5032A) and a PC. The three parts of the setup are connected in a circular fashion. An RS232 cable connects the cryptographic device and the PC. Between the oscilloscope and the PC is a USB cable, through which the PC is able to send commands to the oscilloscope and retrieve the sampled waveform from it.

Figure 4.1: Setup for SCA (the PC sends plaintext to the Altera DE2-70 over RS-232; an EM probe and a GPIO trigger connect the board to the oscilloscope; the oscilloscope returns waveforms to the PC over USB for measurement analysis)

An electromagnetic (EM) probe (ETS-LINDGREN Model 7405-903) is used to capture the electromagnetic radiation emitted by the cryptographic device. We use this radiation to represent the power consumption of the embedded system. The EM traces are sampled on an Agilent DSO5032A oscilloscope (300MHz bandwidth, 2GSa/s sampling rate). The oscilloscope is configured to average 32 consecutive traces so as to reduce the noise in the acquired traces.
Side-channel analysis requires a number of measurements with different inputs (plaintexts for AES). In this example, the result of one measurement is the EM trace of the cryptographic device and the corresponding random plaintext block for encryption. Each measurement consists of the following 4 steps. A side-channel analysis that requires n measurements needs to repeat these 4 steps n times.

• Step 1: The PC sends a random plaintext block (16 bytes) to the embedded platform (DE2-70 board) through the RS232 cable.
• Step 2: The embedded processor (NiosII) in the platform receives the plaintext and encrypts it with the AES software. The encryption is repeated 32 times to obtain an averaged trace.
• Step 3: After sending out one block of plaintext, the PC sends a command to the oscilloscope to sample the EM trace while NiosII is running the encryption.
• Step 4: After sampling is done, the oscilloscope averages the 32 samples and sends one EM trace back to the PC for side-channel analysis.

After obtaining the measurements, we move on to the analysis phase. The result of the first AES round is selected as the intermediate value IV, and a Correlation Power Attack (CPA) [34] using the Hamming weight model is used to analyze the acquired EM traces.

Figure 4.2: Single T-Box Experiment (an 8-bit plaintext is XORed with an 8-bit secret key and passed through the AES T-Box to produce a 32-bit output)

4.2 Single T-Box Experiment

We perform this test as a proof-of-concept experiment to verify the security achieved through the use of a specialized memory organization and the balanced-interleaved data format. In this experiment, we target a single T-Box operation and evaluate its security gain. As illustrated in Figure 4.2, this test design incorporates the essential components of a block cipher (AES), i.e. a logical XOR and a T-Box lookup. The SCA-resistant XOR and lookup table operations are implemented in custom hardware and are accessed through custom instructions.
We perform this experiment in steps and change the number of balanced bits in a balanced-interleaved dataword for each step. We reconfigure the XOR operation and change the format of the T-Box table contents to have the required number of balancing bits. The number of balanced bits is varied from 0 (unsecure) to 16 (fully secure), and an SCA attack is performed for each of these steps to evaluate its SCA-resistance. In our analysis, the attack targets the output of the lookup operation.

Figure 4.3 shows the results, where the maximum correlation value for the correct key guess is plotted against the number of balanced bits present in a dataword. This correlation is calculated for 2000 traces, at its best attack point. It can be seen that the correlation of the correct key guess reduces with an increasing number of balanced bits. For the completely secure case, the correlation value reduces to 0.11 at 2000 traces. The unbalanced (0-bit) implementation is attacked with an MTD of 50, whereas the 14-bit balanced version takes 1200 traces for a successful attack. We could not attack the fully (16-bit) balanced design successfully with 170000 averaged traces. This shows that our countermeasure achieves a significant security improvement.

Figure 4.3: Security Improvement: Single T-Box test (maximum correlation value versus number of balanced bits, with successful and unsuccessful attacks marked)

In this implementation, although the sensitive contents are stored in onchip memories (RAM macros), the program data and stack are configured in offchip memories. It should be taken into account that the sensitive data variables flow back and forth between offchip memory and the FPGA while the program is being executed. These offchip data-bus transfers contribute significantly to the side-channel leakage.
All of the unbalanced steps of this experiment have at least 1 bit of sensitive data that is not balanced by its complementary part. This provides useful side-channel leakage to perform a successful attack. However, for the fully balanced case, none of the data bits contribute to the side-channel leakage and hence, we could not attack that configuration successfully.

4.3 AES Prototype

In the second part of our experiment, we implement an SCA-resistant AES T-Box (128-bit) prototype to evaluate its efficiency in terms of security, performance and cost. We use the same platform as that of the single T-Box experiment, with two different configurations of offchip memory (SDRAM and SSRAM). The T-Box lookup tables are implemented in onchip RAM macros, and offchip memory is used for program execution and the stack. The software uses custom instructions for secure operations and includes a hardcoded secret key in balanced-interleaved format. All intermediate variables are precharged to 0 before they are used for the next operation.

Figure 4.4: AES-TBOX implementation: NiosII/s + SDRAM (number of revealed key bytes versus number of traces, for the unsecured and secured implementations)

We attack the first round of AES and conduct a CPA analysis to evaluate its SCA-resistance. The SCA attacks are performed on the unsecure and secure implementations for the SDRAM and SSRAM configurations. We have used several different secret keys, and Table 4.1 lists the average security gain for this set of keys. It considers the MTD at a 75% success rate, i.e. the MTD to disclose 75% of the key bytes. As we use a 128-bit secret key for AES, a 75% success rate corresponds to revealing 12 key bytes. An unsecure AES implementation on NiosII/s with SDRAM offchip memory reveals 12 key bytes at around 1600 traces, whereas the secure implementation needs 40000 traces to reveal 12 key bytes.
This results in an overall security gain of 25x at a 75% success rate. Figure 4.4 plots the number of key bytes revealed as a function of the number of traces for the SDRAM configuration.

Table 4.1: AES Implementation: Security

    Configuration       MTD (# revealed key bytes)      Security gain
                        Unsecure      Secure
    NiosII/s + SDRAM    1600 (12)     40000 (12)        25
    NiosII/s + SSRAM    633 (12)      300000 (0)        >474

Figure 4.5: Attack results on the secure implementation: NiosII/s + SSRAM (correlation versus time samples). The trace of the correct key guess (here, the first key byte) is plotted in black, while all other key guesses are in yellow (gray). A buried trace means an unsuccessful attack.

In the case of the SSRAM configuration, the unsecure implementation achieves a 75% success rate at an average of 633 traces, whereas we could not attack the secure implementation with 300000 traces. Figure 4.5 shows the correlation trace of the correct key byte for 300000 samples. This results in a security gain of at least 474, which differs significantly from that of the SDRAM configuration. We investigate the possible reasons for this difference later in this chapter. Figure 4.6 shows the waveforms of the secure and unsecure AES traces for the SDRAM configuration, and Figure 4.7 shows the same for the SSRAM configuration.

Table 4.2: AES Implementation: Area and Performance

    Configuration       Area (LEs, M4K)              Cycle count
                        Unsecure      Secure         Unsecure    Secure
    NiosII/s + SDRAM    3452, 143     3889, 161      13839       36977
    NiosII/s + SSRAM    2814, 31      3252, 49       7375        19980

    Area of a system with CPU, memory controller and custom hardware. Area is expressed in terms of Logic Elements (LE) occupied and M4K memory blocks (RAM macros) used.

We achieve this security improvement at the cost of performance and area overhead. An
area overhead for the secure implementation is due to the extra logic for the customized hardware. The software implementation needs to split every 32-bit sensitive dataword into two 32-bit balanced words. Additionally, all variables need to be precharged before they can be reused. The overhead of these additional instructions causes a performance degradation. Our secure implementation is 2.7 times slower than the unsecure implementation and takes 15% more area. Table 4.2 lists these results.

Figure 4.6: AES trace on oscilloscope (SDRAM configuration). (a) Unsecure Implementation (b) Secure Implementation

Figure 4.7: AES trace on oscilloscope (SSRAM configuration). (a) Unsecure Implementation (b) Secure Implementation

4.4 Impact of PCB Layout

The location of peripheral chips on a PCB has a significant impact on the security of the overall system. Figure 4.8 depicts the layout of the DE2-70 board. We can see that the SSRAM has a more symmetric location with respect to the CycloneII FPGA than the SDRAM. The 32-bit SDRAM is configured as two 16-bit memory banks, whereas the SSRAM is a single 32-bit memory chip. With this layout, the SDRAM does not always offer adjacent data-pin locations for a complementary bit pair. This creates an imbalance between direct and complementary bit lines irrespective of their balanced format. The SSRAM, on the other hand, has a more symmetric data-pin pattern, which routes complementary bit lines together and thus reduces the residual side-channel leakage.

Figure 4.8: Impact of PCB Layout on Residual Leakage (SSRAM: symmetric pin positions, no leakage; SDRAM: asymmetric pin positions, leakage)

This highlights that the implementation of an SCA countermeasure in the FPGA alone does not guarantee the security of the overall system.
Its integration with other offchip peripherals (that might be SCA-sensitive) is equally important. In addition, it shows that the balanced-interleaved data format is portable to different memory organizations (databuses with an even number of bits, e.g. 8/16/32-bit) because it provides SCA-resistance at the bit level. Our solution with the SSRAM configuration demonstrates how a given PCB can be used to implement an efficient secure system.

4.5 Related Implementations

In this section, we compare our solution with other secure implementations. As shown in Table 4.3, these implementations target different technologies, different countermeasures and different cryptographic algorithms. As the attack methods are not standardized, it is not straightforward to compare them on the same scale.

SecretBlaze, a masking-based secure processor presented by Barthe et al. [5], does not provide a satisfactory level of security. The authors report a successful attack on their secure implementation with a significant correlation peak. Another custom-instruction based secure processor uses a masking countermeasure [36]. It needs a mechanism for mask generation, mask storage and mask management, which increases its design complexity. Chen et al. [12] present a custom-instruction based Virtual Secure Processor (VSC), which uses a bit-slicing technique to implement a hiding countermeasure. Despite a security improvement of 20X, this work suffers from a non-trivial performance penalty due to its bit-slicing technique. It also needs specialized coding to handle bit-slicing at the software level, adding to the design complexity. MUTE-AES [2] is another hiding countermeasure, which uses a multiprocessor platform. However, it has security vulnerabilities, as discussed in [1]. Regazzoni et al. present a CAD-based design flow and evaluation framework to implement secure applications in ASIC [33].
Its results are based on simulations of the PRESENT cipher, mainly targeted at ASIC. Our design is implemented using a state-of-the-art methodology in a systematic way, making the design phase simpler than in the above-mentioned implementations. It compares favorably with the above-mentioned solutions in terms of security, performance, area and design flexibility.

Table 4.3: Related work: Comparison

    Work          Technology, base processor    Implementation             Security gain / Area overhead / Performance degradation
    [5]           Spartan-3, MicroBlaze         Masking, DES               2X / 1.34X (1) / -
    [36]          Virtex-4, Leon3               Masking, AES               3.5X / - / 2X
    [12]          Spartan-3E, Leon3             Hiding (VSC), AES          20X / 3.3X / 6.5X
    [33]          ASIC 180nm, OpenRISC1000      Hiding (MCML), PRESENT     - / 2.65X / -
    Our design    CycloneII, NiosII/s           Hiding, AES T-Box          >474X / 1.15X (2) / 2.7X

    (1) This number represents area overhead in terms of slice count.
    (2) Area of a system with only the processor and the SRAM memory controller.

4.6 Summary of Contribution

This work reports an efficient and secure embedded system design on FPGA using an industrial design flow. We use a novel memory organization technique and an interleaved data format in combination with a hiding countermeasure. Although we have demonstrated our results on an Altera FPGA for an AES T-Box implementation, the methodology is portable to other FPGA platforms and to the majority of block ciphers. We discuss how the location of peripheral offchip components on the PCB plays an important role in the overall security evaluation. Our experimental results establish the feasibility of the proposed methodology for implementing an embedded system that achieves the desired security at a reasonable cost.

Chapter 5 ECDLP Engine: Background

In this chapter, we give a brief overview of Elliptic Curve Cryptography (ECC), the Pollard rho algorithm and modular arithmetic, and then we discuss the related work done in this area.
5.1 Elliptic Curve Cryptography (ECC)

In a public-key cryptographic scheme, a key pair is selected such that the problem of deriving the private key from the corresponding public key is equivalent to solving a computational problem that is believed to be intractable. Elliptic curve cryptography uses elliptic curves to design public-key cryptographic systems [26, 23]. The idea can be explained as follows [21].

Let $p$ be a prime number, and let $F_p$ denote the field of integers modulo $p$. An elliptic curve $E$ over $F_p$ is defined by an equation of the form

\[ y^2 = x^3 + ax + b, \tag{5.1} \]

where $a, b \in F_p$ satisfy $4a^3 + 27b^2 \not\equiv 0 \pmod{p}$. A pair $(x, y)$, where $x, y \in F_p$, is a point on the curve if $(x, y)$ satisfies equation (5.1). The point at infinity ($\infty$) is also said to be on the curve. The set of all the points on $E$ is denoted by $E(F_p)$.

Using a special point addition rule built from modular arithmetic, two elliptic curve points are added to produce a third point on the same elliptic curve. Repeatedly adding a point to itself generates a cyclic subgroup of points in $E(F_p)$. Such subgroups are used to implement ECC cryptosystems. For example, if $P$ is a point in $E(F_p)$ with prime order $n$, then its cyclic subgroup of $E(F_p)$ is

\[ \langle P \rangle = \{\infty, P, 2P, 3P, \ldots, (n-1)P\}. \tag{5.2} \]

Here, the elliptic curve $E$, the prime $p$, the point $P$ and the order $n$ are known as public domain parameters. A private key $d$ is an integer randomly selected from the interval $[1, n-1]$, and the corresponding public key is generated as $Q = [d]P$, where $[d]$ denotes scalar multiplication by $d$, i.e. a modular point multiplication.

ECC encryption takes place as follows [21]. First, a plaintext $m$ is represented as a point $M$ on the curve, and then it is added to $[k]Q$ to get the encrypted data. Here, $k$ is a randomly selected integer.
The sender transmits the points $C_1 = [k]P$ and $C_2 = M + [k]Q$, and the recipient uses the private key $d$ to compute

\[ [d]C_1 = [d]([k]P) = [k]([d]P) = [k]Q \tag{5.3} \]

and recovers $M = C_2 - [k]Q$. An attacker needs to know $d$ to compute $[d]C_1$ and recover the message $m$.

5.2 Pollard-rho Algorithm

The security of ECC relies on the difficulty of the Elliptic Curve Discrete Logarithmic Problem (ECDLP), the problem of determining the secret key $d$ given the domain parameters and the public key $Q$. Mathematically, the ECDLP is defined as: find $d$ when $P$ and $Q$ are known and $Q = [d]P$, where $P, Q \in E(F_p)$.

The Pollard rho method [32], [10] is the strongest known attack against ECC today. This method solves the ECDLP by iteratively generating points on the targeted curve, each of which has the form $X = [a]P + [b]Q$. When the same point is encountered twice with different $a$ and $b$, a collision occurs, which means that the ECDLP is solved. Here, $X$ denotes any point on the elliptic curve, i.e. $X \in E(F_p)$.

[Figure 5.1: Standard Pollard-rho attack. A walk starts at the seed point X0 and advances by single Pollard-rho steps f(X) through X1, X2, ... until it enters a cycle at the point of collision X4 = X10.]

The Pollard rho algorithm [32] uses a pseudo-random iteration function $f : \langle X \rangle \to \langle X \rangle$ and calculates a finite number of points on the curve. It starts a walk from a random point (the seed point) on the curve of the form $X_0 = [a_0]P + [b_0]Q$, where $a_0$ and $b_0$ are generated from a random seed $S$. It then iteratively computes $X_{i+1} = f(X_i)$ until it encounters a point on the curve twice: eventually, the walk ends in a cycle. The name of the algorithm refers to the Greek letter $\rho$, whose shape resembles a walk ending in a cycle, as shown in Figure 5.1. The function $f$ is also referred to as a Pollard rho step, or simply an iteration.

The collision point is located where the cycle starts. Therefore the underlying idea of this algorithm is to search for two distinct points on the walk such that $[a_i]P + [b_i]Q = [a_j]P + [b_j]Q$. The iteration function is constructed in such a way that the coefficients $a_d$ and $b_d$ of any point on the walk can be computed from $a_0$ and $b_0$. A Pollard rho step is often defined as a point addition, i.e. $X_{i+1} = X_i + x$, where $x$ is a precomputed linear combination of $P$ and $Q$, for example $[c_i]P + [d_i]Q$. When a collision occurs, the coefficients $(a_{d1}, b_{d1})$ and $(a_{d2}, b_{d2})$ of the collided points give two different linear combinations of the same point $X_d$. Since $[a_{d1}]P + [b_{d1}]Q = [a_{d2}]P + [b_{d2}]Q$ and $Q = [d]P$, the secret scalar is obtained as

\[ d = \frac{a_{d1} - a_{d2}}{b_{d2} - b_{d1}} \bmod n, \tag{5.4} \]

where $n$ is the order of $P$. Due to the birthday paradox, the expected length of a walk before a collision is found is proportional to $\sqrt{|\langle X \rangle|}$.

5.2.1 Parallelization

Van Oorschot [39] described a parallelization technique that enables parallel walks of the Pollard rho algorithm on a single curve to speed up the computation of the ECDLP. The idea is to define a subset of $\langle X \rangle$ as distinguished points (DPs), points which have a distinguishing characteristic. For example, a DP could be defined as a point with a given number of leading zero bits in its x-coordinate. This method distributes the random walks among multiple processing clients, which share the DPs they find with a central server; the server then performs a collision search. This technique results in a linear speed-up as the number of clients increases. Multiple random walks are continued until two different seed points reach the same distinguished point $X_d$. The expected number of DPs required to find a collision is a fraction of the expected path length. This depends on the density of DPs in the point set $\langle X \rangle$, which in turn depends on the chosen distinguishing property.

[Figure 5.2: Reduction with prime p, where p = 2^N − r. For P = 2^128 − r, if A = a1·2^128 + a0, then A mod P = a1·r + a0. The 2N-bit product of the N-bit operands op1 and op2 is split into a1 and a0; in each reduction step the carry-over above N bits is multiplied by r and added to the lower N bits, until an N-bit reduced output A mod P remains.]

5.3 Modular Arithmetic

Elliptic curve operations depend on finite field arithmetic on the coordinates of the curve points. All these arithmetic operations (addition, subtraction, multiplication) are computed as modular operations modulo the prime $p$. This can be achieved by performing the operations as standard integer arithmetic and then reducing the results mod $p$. A straightforward way to perform the reduction is to divide the result by $p$ and keep the remainder. As cryptographic operations involve large numbers and $p$ is generally chosen as a big prime, reduction by division proves to be inefficient: divisions are costly operations in terms of hardware area and performance. However, there are more efficient algorithms for reduction modulo $p$, which break the costly division down into simple addition operations.

We explain an example of hardware-friendly modular multiplication over an $N$-bit field, as shown in Figure 5.2. Let $p$ be an $N$-bit prime of the form $2^N - r$. The inputs of the multiplier are two $N$-bit operands, op1 and op2, and the result of the modular multiplication is $A \bmod p$, where $A = \mathrm{op1} \cdot \mathrm{op2}$ is the $2N$-bit product. $A$ can be written as $A = a_1 \cdot 2^N + a_0$. Since $2^N \equiv r \pmod{p}$,

\[ A \equiv a_1 \cdot r + a_0 \pmod{p}. \tag{5.5} \]

To perform the modular multiplication, op1 and op2 are multiplied using the conventional school method to get the $2N$-bit multiplication result.
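The folding identity of Equation (5.5) can be illustrated with a short software sketch. It is not the hardware datapath of this thesis: the function name `reduce_pseudo_mersenne` is hypothetical, Python integers stand in for the N-bit registers, and the final conditional subtraction is an assumption added here so that the result is the canonical residue.

```python
def reduce_pseudo_mersenne(a, n_bits, r):
    """Reduce a modulo p = 2**n_bits - r by repeated folding (Eq. 5.5).

    Hypothetical helper for illustration; assumes 0 < r < 2**n_bits.
    """
    p = (1 << n_bits) - r
    mask = (1 << n_bits) - 1
    # Fold while there are bits above position n_bits:
    # a = a1 * 2**n_bits + a0  is congruent to  a1 * r + a0  (mod p).
    while a >> n_bits:
        a1, a0 = a >> n_bits, a & mask
        a = a1 * r + a0
    # Assumption: bring the n_bits-wide value into the range [0, p).
    while a >= p:
        a -= p
    return a


# Toy example with N = 8 and r = 5, i.e. p = 251.
print(reduce_pseudo_mersenne(200 * 200, 8, 5))  # prints 91
```

For the toy input, the first fold turns 40000 = 156·256 + 64 into 156·5 + 64 = 844, and the second fold turns 844 = 3·256 + 76 into 3·5 + 76 = 91, which equals 40000 mod 251.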
Now, this $2N$-bit result is reduced to an $N$-bit output using reduction additions. As shown in Figure 5.2, the higher carry bits ($a_1$) of the result are multiplied by $r$ and added to the lower significant bits ($a_0$). This reduced result might still have residual carry bits ($a_1'$), so the addition is repeated iteratively until it yields the final $N$-bit result $A \bmod p$.

5.4 Related Work

Different solutions to the ECDLP have been proposed in recent years, using software and hardware platforms. These solutions target different curves with different primes and perform arithmetic operations using either prime field arithmetic or binary field arithmetic. We discuss related solutions in this section.

The software solution proposed by Bernstein on the Cell platform is the fastest software solution at present for the ECDLP over the secp112r1 curve [6]. It uses the negation map and non-integer polynomial-basis arithmetic, and reports a speed-up over a similar solution by Bos [9]. Both of these software solutions use prime field arithmetic in an affine coordinate system, and they exploit the SIMD architecture and rich instruction set. Another software solution by Bos [8] describes the implementation of the parallel Pollard rho algorithm on the Synergistic Processor Units of the Cell Broadband Engine Architecture to approach the ECC2K-130 Certicom challenge.

Among hardware-platform-based solutions, Bulens et al. propose an FPGA solution to attack the ECC Certicom challenge for GF($2^{79}$) [31]. Though the paper discusses the hardware-software integration aspect of the solution, the authors did not confirm whether their system was operational. Fan proposes the use of a normal-basis, binary field implementation to solve ECC2K-130 [6]. Another binary field solution, for the COPACOBANA platform, targets the 160-bit curve [16].
Since a curve of this size would require a single COPACOBANA platform to run for $7.62 \times 10^9$ years, the authors did not demonstrate a collision that could validate their design. Guneysu et al. propose an architecture to solve the ECDLP over prime fields using FPGAs and analyze its estimated performance for different ECC curves [20]. A three-layer hybrid distributed system is described by Piotr et al. to solve the ECDLP over a binary field [25]. It uses general-purpose computers with FPGAs and integrates them with a main server at the top level. Our solution is based on a hardware-software co-integrated platform.

Chapter 6

ECDLP Engine: Implementation

This chapter discusses the architecture of the ECDLP engine and the experimental setup, and analyzes the results of our solution.

6.1 Modular Multiplication Architecture

Point addition is the most basic operation in the Pollard rho algorithm. The overall performance of an ECDLP engine depends heavily on how efficiently it performs point additions. The point addition routine is a sequence of modular arithmetic operations: addition, subtraction, multiplication and inversion. Modular multiplication dominates among these operations with an 80% share. The design of a highly efficient modular multiplier (modmul) unit therefore results in faster point addition and, consequently, in a more efficient ECDLP engine. For this reason, we optimize the modular multiplication architecture.

We target prime field arithmetic using a polynomial representation in an affine coordinate system (Equation 6.1). Typically, hardware solutions use binary field arithmetic, primarily because of the assumption that a binary field avoids costly carry propagation. Our representation for prime-field arithmetic is based on [6], which has similar properties.
Although the target curve considered in our solution is secp112r1, defined over the 112-bit prime field GF($(2^{128} - 3)/76439$), all the arithmetic operations are performed over a 128-bit field. After each iteration, the result is mapped to 112 bits to check it against the distinguishing property. This 128-bit to 112-bit conversion is obtained by canonicalization, i.e. multiplying a result by 76439 [6].

Table 6.1: ECC Arithmetic Operations

Operation | Cycle cost
Addition/Subtraction | 2
Modular Multiplication | 14
Square | 11
Inversion | 1594

The point addition operation corresponds to a Pollard rho step that consists of four subtractions, one addition, four modular multiplications, and one inversion. Table 6.1 lists these arithmetic operations along with the cycle cost of a single operation. Subsequent sections explain the architecture of the arithmetic modules in detail.

6.1.1 Modular Multiplication

Bernstein [6] uses a non-integer basis for the polynomial representation of data to achieve an efficient software implementation. We choose a 16-bit coefficient representation to make the partial product computation uniform across all the coefficients, which also makes the design scalable to larger fields. The coefficient multipliers are mapped onto the dedicated DSP48 cores available in the Virtex-5 devices for an efficient implementation.

[Figure 6.1: Polynomial Representation of Data. The 128-bit value is split into eight 16-bit coefficients A0 to A7, where coefficient Ai starts at bit 16·i (A0 at bit 0, A7 at bit 112).]

As depicted in Figure 6.1, the 128-bit data is represented as

\[ X = \sum_{i=0}^{n_A - 1} x_i \cdot 2^{i \cdot l_A}, \quad \text{where } n_A = l / l_A, \tag{6.1} \]

with $l = 128$ and $l_A = 16$. The two operands are represented in this polynomial format and are multiplied with the conventional school method using word-aligned coefficients. Figure 6.2 illustrates the dataflow. The result consists of 15 partial products ($S_0, S_1, \ldots, S_{14}$).
These partial products are accumulated so as to achieve the first level of reduction, in which the higher-order coefficients are folded and added to the lower-order coefficients. This results in eight 36-bit wide partial products (Preg0, Preg1, ..., Preg7), which are then passed through a second level of reduction to get eight 16-bit coefficients (C0, C1, ..., C7). Figure 6.3 shows how an adder chain achieves this second level of reduction.

[Figure 6.2: Standard Multiplication Method. The coefficients a0..a7 of op1 are multiplied by the coefficients b0..b7 of op2 over 8 cycles, giving the column sums S0..S14 of op1*op2. S8..S14 are folded onto S0..S6 and S7 is kept, producing Preg0..Preg7; each word is 36 bits.]

[Figure 6.3: Reduction with Adder-Chain. The upper 20 bits of each Preg word are added to the lower 16 bits of the next higher word, producing the 128-bit result C7..C0.]

[Figure 6.4: Modular Multiplication Architecture. Multiplication stage: eight DSP multipliers MULT 0 to MULT 7 with operand coefficients Op1_0..Op1_7 and Op2_0..Op2_7 and accumulation registers Reg 0 to Reg 7, with multiplication by P on the wrap-around paths. Addition reduction stages 1 and 2: adder chains that add the upper carry bits to the lower 16 bits of the neighboring word, producing the 16-bit outputs S0..S7 of the 128-bit modular multiplication result.]

Figure 6.4 shows the architecture of the modular multiplication module in hardware. It takes two 128-bit inputs (Op1 and Op2), broken into 16-bit coefficients, and gives a 128-bit reduced output. Eight DSP48 multipliers compute the partial products of the 16-bit coefficients. As $n_A^2$ (64 in our case) partial products are needed for the multiplication result, it takes eight multiplication cycles to get the unreduced result. A similar architecture was presented earlier by Guneysu [19].
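The dataflow described above (16-bit coefficients, 15 column sums, folding of the upper columns, and a carry chain with wrap-around) can be modeled in software as follows. This is a behavioral sketch only, assuming the modulus $2^{128} - 3$; all function names are hypothetical, Python integers replace the fixed-width registers, and the cycle-level scheduling on the DSP48 blocks is not modeled.

```python
LIMB_BITS = 16                     # l_A in Equation (6.1)
N_LIMBS = 8                        # n_A = 128 / 16
R = 3                              # 2**128 is congruent to 3 mod (2**128 - 3)
MOD = (1 << 128) - R
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x):
    """Split a 128-bit integer into eight 16-bit coefficients (Eq. 6.1)."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(N_LIMBS)]


def from_limbs(limbs):
    """Concatenate 16-bit coefficients back into one integer."""
    return sum(c << (LIMB_BITS * i) for i, c in enumerate(limbs))


def column_sums(a, b):
    """School-method multiplication: column sums S0..S14."""
    s = [0] * (2 * N_LIMBS - 1)
    for i in range(N_LIMBS):
        for j in range(N_LIMBS):
            s[i + j] += a[i] * b[j]
    return s


def fold(s):
    """First reduction level: column S(k+8) has weight 2**128 * 2**(16k),
    which is congruent to 3 * 2**(16k), so it is added to column k."""
    preg = s[:N_LIMBS]             # copy of S0..S7
    for k in range(N_LIMBS, 2 * N_LIMBS - 1):
        preg[k - N_LIMBS] += R * s[k]
    return preg


def carry_chain(preg):
    """Second reduction level: pass upper bits to the next coefficient;
    the carry out of the top coefficient is multiplied by 3 and wraps
    around to coefficient 0. Repeats until every coefficient fits."""
    limbs = list(preg)
    while any(c > MASK for c in limbs):
        carry = 0
        for i in range(N_LIMBS):
            total = limbs[i] + carry
            limbs[i] = total & MASK
            carry = total >> LIMB_BITS
        limbs[0] += R * carry
    return limbs


def modmul(x, y):
    """Return a 128-bit value congruent to x*y mod 2**128 - 3.
    The value is not necessarily the canonical residue below MOD."""
    return from_limbs(carry_chain(fold(column_sums(to_limbs(x), to_limbs(y)))))
```

Comparing `modmul(x, y) % MOD` with `(x * y) % MOD` for a few operands is a quick sanity check of the folding and wrap-around logic.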
Compared with that architecture, Figure 6.4 presents an important optimization of the reduction step. Since we are computing Op1*Op2 mod ($2^{128} - p$), where $p = 3$ in our case (the constant called $r$ in Section 5.3), the reduction adds to the cycle cost of a modular multiplication. By multiplying the shifting operand Op2_7 by 3, we perform the reduction in parallel with the multiplication. For the 128-bit data field with eight DSP multipliers, a sequential implementation takes eight multiplication cycles and 12 reduction iterations, i.e. a cycle cost of 20 per modular multiplication. As depicted in Figure 6.4, this cost has been reduced to 14 cycles by overlapping the reduction with the partial multiplications. This is a significant improvement in latency over Guneysu's architecture [19], which takes 70 clock cycles for a 256-bit modular multiplication.

The reduction is performed by an adder chain, which adds the lower 16 bits of the i-th partial product to the upper carry bits of the (i-1)-th one. The carry bits of the highest coefficient Reg7 are multiplied by 3 before they are added to the lowest coefficient Reg0. The final 128-bit result is obtained by concatenating the eight 16-bit reduced outputs. A two-stage pipeline of adder chains is employed to shorten the critical path of the reduction stage.

This architecture supports generalized modular multiplication for any $p$ over a 128-bit prime field. It can easily be extended to larger prime fields by adding multiplier-adder columns, in which case the performance might vary with the number of multiplier-adder columns.

6.1.2 Dedicated Square Unit

Table 6.1 shows that an inversion is the most expensive operation in a Pollard rho step. Therefore, optimizing this operation improves the overall system performance. An inversion involves a total of 137 modular multiplications, of which more than 75% are squaring operations.
Having a dedicated, optimized squaring unit reduces the effective time of an inversion and consequently accelerates the point addition operation. The squaring operation needs only about half as many partial products as a multiplication, so we modified the multiplication architecture to get an optimized square module, as shown in Figure 6.5. It involves only 5 multiplication iterations instead of 8. The reduction stage of the square module is similar to that of the multiplier. With this architecture, a square operation is completed in only 11 cycles, which is a speed-up of 1.2 over the multiplication architecture. Since the majority of operations required for inversion are squares, this translates to a significant reduction in the cycle cost of an inversion.

[Figure 6.5: Dedicated Square Architecture. Eight multipliers Mult 0 to Mult 7 are fed with operand coefficients op0..op7 and their doubled values (op × 2), with multiplication by P on the wrap-around path; the outputs go to the reduction stage.]

6.1.3 Vectorized Inversion

To reduce the effective cost of an inversion, we use a few optimizations. From Fermat's little theorem, it follows that the modular inverse of a field element $x \in F_p$ can be obtained by computing $x^{p-2}$. For the secp112r1 curve in 128-bit arithmetic, an inversion requires 112 squarings and 59 multiplications. We use the following techniques to optimize the inversion operation.

Windowing Optimization

A windowing method reduces the number of squarings and multiplications needed to invert an input. For a window size of four, we achieve an inversion in 108 squarings and 29 modular multiplications. It takes 1594 clock cycles for an inversion as opposed to 2058 cycles without windowing, providing a speed-up of 1.3.
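The inversion cycle counts quoted above follow directly from the per-operation costs in Table 6.1, as the following sketch checks. It also illustrates Fermat inversion with Python's built-in modular exponentiation; this only demonstrates the identity $x^{p-2} \equiv x^{-1} \pmod{p}$ and does not reproduce the windowed square-and-multiply schedule used in hardware. The function names are hypothetical.

```python
# secp112r1 prime as given in the text: p = (2**128 - 3) / 76439.
P = ((1 << 128) - 3) // 76439

CYCLES_MUL = 14   # modular multiplication, Table 6.1
CYCLES_SQR = 11   # dedicated square, Table 6.1


def inversion_cycles(n_squarings, n_multiplications):
    """Cycle cost of an exponentiation made of squarings and multiplications."""
    return n_squarings * CYCLES_SQR + n_multiplications * CYCLES_MUL


def fermat_inverse(x, p=P):
    """Inverse of a nonzero field element x via Fermat's little theorem."""
    return pow(x, p - 2, p)


print(inversion_cycles(112, 59))   # without windowing: prints 2058
print(inversion_cycles(108, 29))   # window size four:  prints 1594
```

The ratio 2058 / 1594 is about 1.29, consistent with the reported speed-up of 1.3.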
Montgomery trick

To further reduce the cost of an inversion, we use Montgomery's trick [27], which trades M inversions for 3(M − 1) multiplications and one inversion. It allows us to vectorize multiple random walks and run them simultaneously on a single ECC core, so that they share the inversion cost. With a large vector size, the inversion cost per iteration becomes small compared to other operations such as multiplication. We select a vector size of 16; optimizing this parameter is future work.

6.2 System Architecture

The system is implemented on a Nallatech computing platform. The FPGA performs the computationally expensive Pollard rho iterations, whereas the host processor manages the central database and executes the collision search. Communication between software and hardware is needed only to exchange seed points and distinguished points, which keeps the communication overhead low.

6.2.1 Nallatech Platform

Figure 6.6 depicts the architecture of the Nallatech system. It consists of one quad-core Xeon processor E7310 and three Virtex-5 FPGAs (1 xc5vlx110, 2 xc5vsx240t). A fast North Bridge integrates the high-speed components, including the Xeon, the FPGA and main memory. A slower South Bridge integrates peripherals into the system, including the hard disk. Both the Xeon and the FPGA can directly access system memory using the Front Side Bus (FSB). The FPGA is used for computing additive random walks on the elliptic curve, while the host Xeon processor executes the software driver.

[Figure 6.6: Nallatech System. Xeon processor (cores C0 to C3), system memory workspace, North Bridge, South Bridge, hard disk and FPGA accelerator; the labeled link rates are 21 GB/s (peak), 3.2 GB/s and 100 MB/s.]

6.2.2 Software Implementation

The Xeon processor executes a software driver (in C) and manages the software interface to the FSB.
The software driver mainly handles the communication interface with the FPGA, seed point (SP) generation, and the storage and sorting of DPs. As shown in Figure 6.7, two-way communication between the Xeon and the FPGA takes place over the FSB. When program execution starts, the software calls APIs to configure the FPGA card, to initialize the FSB link, and to allocate the workspace memory. It then generates random SPs on the curve E and starts an attack by sending them to the FPGA over the FSB. Every point has x- and y-coordinates of 128-bit length each. The hardware computes a DP for each SP received and sends it to the software along with its corresponding SP. When the software receives an SP-DP pair from the FPGA, it performs a collision search among all the received DPs. Once a collision is detected, it computes the secret scalar. As the software takes care of the central database of DPs, the collision search is conducted in parallel with the hardware computations.

6.2.3 Hardware Implementation

On the hardware side, as shown in Figure 6.7, the FPGA edge core provides an interface between the FSB and the ECC core. It consists of control logic and two 256-bit wide FIFOs. The RX FIFO buffers the incoming SPs and the TX FIFO stores the DPs found, to send them back to the Xeon. The ECC core performs a random walk by computing the point addition operation iteratively until it finds a DP, which it stores in the TX FIFO. We define a DP as a point whose y-coordinate has 16 zeros. The probability of a point being distinguished is almost exactly $2^{-16}$. The distinguishing property allows the ECC core to send only a few points back to the Xeon, which reduces the communication overhead and minimizes the storage requirement in hardware. The required bandwidth of the communication bus is around 8 Kbits/sec, which is well within the range of the FSB. The key components of the design are detailed below.
IO controller

The IO controller manages the read/write interfaces of the TX and RX FIFOs and controls the ECC core operation. It feeds the SPs from the RX FIFO to the ECC core and initiates a Pollard rho walk. When a DP is found, the IO controller halts the ECC core operation until a new SP is loaded from the RX FIFO. The computed DPs are buffered in the TX FIFO and then transferred to the Xeon along with the corresponding SPs.

[Figure 6.7: System Architecture. Software: C driver on the 4-core Xeon processor (1.6 GHz), exchanging seeds (X) and distinguished points (Y) over the FSB. Hardware (Virtex-5 FPGA): FPGA edge module with read and write interfaces, input/output controller, SP RX FIFO and DP TX FIFO, and the ECC core with the vector sequencer (next-address logic, micro-instruction ROM) and the PA module (datapath RAM, arithmetic modules).]

ECC Core

The ECC core consists of a micro-instruction sequencer and the point-addition (PA) datapath. It performs a random walk by computing point additions iteratively until it finds a DP or crosses the iteration limit (which is currently set to $2^{20}$).

Sequencer

This secondary controller executes the micro-instruction sequence stored in a ROM and issues the corresponding control signals to the PA datapath module. In this way, it controls the execution flow of the low-level arithmetic operations of a point addition. The micro-coded architecture makes the micro-instruction sequence flexible enough to support different vector sizes (N). It also supports scalability of the design by providing a mechanism to control multiple ECC cores in a SIMD (Single Instruction Multiple Data) fashion. The vector size is a generic parameter of the design.

Table 6.2 summarizes the micro-instruction sequence for a single point addition. Here, t1, t2, t3, t4, Px, Py, Qx and Qy represent N-entry register files.
The register files Px, Py, Qx and Qy hold the x- and y-coordinates of the points to be added, and the remaining register files store the intermediate results.

Table 6.2: Micro-instruction Sequence for Point addition: P + Q

Instruction | Function performed
Canonicalization | Py * 76439
Subtraction | t1 = Py - Qy
Subtraction | t2 = Px - Qx
Inversion | t2 = invert(t2)
Addition | t3 = Px + Qx
Modular Multiplication | t4 = t2 * t1
Modular Square | t1 = t4 * t4
Subtraction | t1 = t1 - t3 : Px(i+1)
Subtraction | t2 = Px - t1
Modular Multiplication | t3 = t2 * t4
Subtraction | t3 = t3 - Py : Py(i+1)

The next-address logic (NAL) controls the execution of micro-instructions from the ROM. Every micro-instruction consists of N phases, corresponding to the N entries of a vector. The same micro-operation is applied to every element of the vector. Furthermore, every micro-operation can have a variable execution time. This allows the micro-instruction concept to be applied to all the operations required for point addition regardless of their latency.

The NAL module reads a micro-instruction i and issues the corresponding control signals to the PA datapath N times. These control signals include a start pulse, an operand select, the opcode of the micro-operation, a write pulse to store the results, a destination-register select and the vector phase $n_i$ to which the operation belongs. For an inversion operation, the NAL module extends the start and write signals for N clock cycles, each with a different vector phase $n_i$. This makes it possible to load N inputs into the inversion module and write N results into the corresponding register files. An inversion is performed only once per N-entry vector, whereas the other instructions are executed N times.

As the datapath of a PA iteration is fixed, there is no need for conventional conditional micro-instructions such as jump or check flag. This makes the instruction set simpler.
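The sequence of Table 6.2, applied to N walks at once with a single shared inversion, can be sketched in software as follows. This is an illustrative model, not the micro-coded hardware: the function names are hypothetical, the canonicalization step is omitted, the shared inversion uses Python's built-in `pow` instead of the windowed exponentiation, and every lane is assumed to have Px different from Qx. The toy curve $y^2 = x^3 + 2x + 2$ over $F_{17}$ with base point (5, 1) is an assumption chosen only to keep the numbers small.

```python
def batch_invert(values, p):
    """Montgomery's trick: invert every value with one modular inversion
    and 3(M - 1) multiplications (the first prefix product is trivial)."""
    prefix, acc = [], 1
    for v in values:
        acc = acc * v % p
        prefix.append(acc)
    inv_acc = pow(acc, p - 2, p)          # the single inversion (p prime)
    inverses = [0] * len(values)
    for i in range(len(values) - 1, 0, -1):
        inverses[i] = inv_acc * prefix[i - 1] % p
        inv_acc = inv_acc * values[i] % p
    inverses[0] = inv_acc
    return inverses


def vector_point_add(px, py, qx, qy, p):
    """Affine point addition P + Q on N lanes, following Table 6.2.
    Assumes px[i] != qx[i] in every lane; canonicalization is omitted."""
    n = len(px)
    t1 = [(py[i] - qy[i]) % p for i in range(n)]   # t1 = Py - Qy
    t2 = [(px[i] - qx[i]) % p for i in range(n)]   # t2 = Px - Qx
    t2 = batch_invert(t2, p)                       # t2 = invert(t2), shared
    t3 = [(px[i] + qx[i]) % p for i in range(n)]   # t3 = Px + Qx
    t4 = [t2[i] * t1[i] % p for i in range(n)]     # t4 = slope
    t1 = [t4[i] * t4[i] % p for i in range(n)]     # t1 = slope squared
    t1 = [(t1[i] - t3[i]) % p for i in range(n)]   # new x-coordinate
    t2 = [(px[i] - t1[i]) % p for i in range(n)]   # t2 = Px - new x
    t3 = [t2[i] * t4[i] % p for i in range(n)]     # t3 = t2 * slope
    t3 = [(t3[i] - py[i]) % p for i in range(n)]   # new y-coordinate
    return t1, t3


# Three lanes on the toy curve: P + 2P, 2P + 3P and P + 3P,
# with P = (5, 1), 2P = (6, 3), 3P = (10, 6).
new_x, new_y = vector_point_add([5, 6, 5], [1, 3, 1],
                                [6, 10, 10], [3, 6, 6], 17)
print(new_x, new_y)   # prints [10, 9, 3] [6, 16, 1]
```

The three results are 3P = (10, 6), 5P = (9, 16) and 4P = (3, 1), and only one call to `pow` is made for all three lanes, which mirrors how the inversion cost is shared across the entries of a vector.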
Datapath

The datapath consists of modular arithmetic operators and memory. We carefully designed each of these sub-blocks to support vectorized point additions. A vectorized point addition allows the execution of multiple random walks simultaneously on a single ECC core, with only one inversion per N point additions.

As shown in Table 6.2, we need eight registers (t1, t2, t3, t4, Px, Py, Qx, Qy) to hold the operands and intermediate results of a single point addition. For a vector size of N, we use N-entry register files. For an efficient implementation, these N-entry register files are mapped to the distributed RAMs available in the FPGA. Each of these memories is 128 bits wide and has a depth equal to the vector size N. The address of an entry in the memory corresponds to the vector number to which the content belongs.

6.3 Implementation Results

Though our work is dedicated to the prime $p = (2^{128} - 3)/76439$, the same solution can work with small modifications for any curve of the form $y^2 = x^3 - 3x + b$. For demonstration purposes, the seed points that we generate are carefully chosen to be of order $2^{50}$ [6], which means we need only about $2^{25}$ steps to solve the ECDLP. This allows us to demonstrate collisions, proving that our solution works.

6.3.1 Overall Performance

The whole system runs at 100 MHz and utilizes 4773 slices, which is 12.7% of the area of the Virtex-5 device xq5vsx240t, with a single ECC core. It takes 1.5 microseconds per Pollard rho step and can perform up to 660K iterations per second per ECC core. With 16 ECC cores working in parallel, our system would need 176 years to solve the secp112r1 ECDLP.

Guneysu's architecture described in [19] targets 256-bit prime arithmetic over two fixed NIST primes. Our solution shows an improvement over it in terms of latency for the important ECC arithmetic operations.
Assuming that the cycle cost of 256-bit arithmetic is twice that of 128-bit arithmetic (a worst-case scenario), our architecture has a cycle cost of 28 for a modular multiplication and 302 for a point addition. This is 2.5X and 3X lower latency for a modular multiplication and a point addition, respectively, than the design in [19].

Comparing the performance of various implementations is not straightforward, as different solutions target different curves and different coordinate systems. The performance figures are also specific to the target platform, the size of the target curves and the underlying arithmetic (i.e. binary or prime field).

Table 6.3: Comparison with Software Implementations

Platform | Time/PA in ns | Iterations/sec
Cell processor @3.192GHz, secp112r1 curve [6] | 113 (362 cycles) | 8.81M
Cell processor @3.192GHz, secp112r1 curve [9] | 142 (453 cycles) | 7.04M
Cell processor @3.192GHz, ECC2K-130 binary field [8] | 233 (745 cycles) | 4.28M
Our system, secp112r1 | 1500 | 660K (single core), 10.56M (16 cores)

6.3.2 Comparison with Previous Software Implementations

Table 6.3 compares our solution with other software implementations. It shows that our results can be comparable with those of software if multiple ECC cores run in parallel. The listed multi-core performance is an estimate; our measured results are for a single ECC core only.

6.3.3 Comparison with Hardware Implementation

As shown in Table 6.4, the solution reported in [15] claims 111M iterations per second. It also claims to solve ECC2K-130 within a year with five COPACOBANA machines, but no system demonstration is reported. Similarly, the solution reported in [31] claims 100M iterations per second based on a paper design.
We assume that the difference in performance figures is due to factors such as binary field arithmetic, different curve sizes and the use of pipelined architectures. We can see that ours is the only solution at present that demonstrates the Pollard rho algorithm successfully on a hardware-software integrated platform.

Table 6.4: Comparison with Hardware Implementations (per ECC core)

Platform | Arithmetic | Iterations/sec | Area | Demonstrated
Spartan-3 [16] | Prime (160-bit) | 47.28K | 3034 slices | Unclear
Spartan-3 [20] | Prime (160-bit) | 50.12K | 2660 slices | Unclear
Spartan-3 [15] | Binary (130-bit) | claims 111M | 26731 slices | Paper design
Virtex-4 [31] | Binary (79-bit) | claims 100M | 22236 slices, 30 BRAMs | Paper design
Virtex-5, our system | Prime (112-bit) | 660K | 4773 slices, 9 BRAMs, 62 DSP48 | Demonstrated

6.4 Summary of Contributions

We successfully demonstrate a complete ECC cryptanalytic machine to solve the ECDLP on a hardware-software co-integrated platform. We also implement a novel hardware architecture for modular multiplication over a prime field; to our knowledge, this is the most efficient implementation reported at present for prime field multiplication. This architecture is further extended to a dedicated square module. This work also demonstrates the use of micro-instruction-based sequencing logic to support different vector sizes and to control multiple ECC cores in SIMD fashion. We compare our performance results with the previous hardware implementations and show that our solution can achieve comparable performance with a multi-core implementation. The proposed system can further be used to demonstrate prime arithmetic over other curves of different sizes.

Chapter 7

Conclusion

In this thesis we have addressed two problems in the hardware security area. We used reconfigurable hardware to demonstrate our results successfully.
The first solution presents an efficient implementation of an SCA countermeasure on an FPGA platform using an industrial design flow. It proposes a novel memory organization technique and an interleaved data format in combination with a hiding countermeasure. The methodology is portable to other FPGA platforms for the majority of block ciphers. A comparison with related solutions shows that our solution offers a very good trade-off between security gain, performance, circuit area and design complexity. Our experimental results establish the feasibility of the proposed methodology for implementing an embedded system that achieves a desired level of security at reasonable cost.

In the second part of this thesis, we successfully demonstrate a complete ECC cryptanalytic machine that solves the ECDLP on a hardware-software co-integrated platform. We use a novel hardware architecture to perform modular multiplication over a prime field. This work also demonstrates the use of microinstruction-based sequencing logic to support different vector sizes and to control multiple ECC cores in SIMD fashion. We compare our performance results with previous hardware implementations and show that a multi-core implementation of our solution can achieve comparable performance.

Bibliography

[1] Virtual secure circuit: Porting dual-rail pre-charge technique into software on multicore. 2010. chenzm@vt.edu 14739 received 10 May 2010.

[2] J.A. Ambrose, S. Parameswaran, and A. Ignjatovic. Mute-aes: A multiprocessor architecture to prevent power analysis based side channel attack of the aes algorithm. pages 678–684, Nov. 2008.

[3] Daniel V. Bailey, Lejla Batina, Daniel J. Bernstein, Peter Birkner, Joppe W. Bos, Hsieh-Chung Chen, Chen-Mou Cheng, Gauthier Van Damme, Giacomo de Meulenaer, Luis J. Dominguez Perez, Junfeng Fan, Tim Güneysu, Frank K.
Gürkaynak, Thorsten Kleinjung, Tanja Lange, Nele Mentens, Ruben Niederhagen, Christof Paar, Francesco Regazzoni, Peter Schwabe, Leif Uhsadel, Anthony Van Herrewege, and Bo-Yin Yang. Breaking ecc2k-130. IACR Cryptology ePrint Archive, 2009:541, 2009.

[4] Josep Balasch, Benedikt Gierlichs, Roel Verdult, Lejla Batina, and Ingrid Verbauwhede. Power analysis of atmel cryptomemory - recovering keys from secure eeproms. pages 19–34, 2012.

[5] Lyonel Barthe, Pascal Benoit, and Lionel Torres. Investigation of a masking countermeasure against side-channel attacks for risc-based processor architectures. pages 139–144, 2010.

[6] Daniel J. Bernstein, Tanja Lange, and Peter Schwabe. On the correct use of the negation map in the pollard rho method. pages 128–146, 2011.

[7] I. Blake, G. Seroussi, N. Smart, and J. W. S. Cassels. Advances in Elliptic Curve Cryptography (London Mathematical Society Lecture Note Series). Cambridge University Press, New York, NY, USA, 2005.

[8] Joppe W. Bos, Thorsten Kleinjung, Ruben Niederhagen, and Peter Schwabe. Ecc2k-130 on cell cpus. pages 225–242, 2010.

[9] Montgomery P.L. Bos J.W., Kaihara M.E. Pollard rho on the playstation 3. SHARCS, pages 35–50, 2009.

[10] R. P. Brent and J. M. Pollard. Factorization of the eighth fermat number. Math. Comp., 36:627–630, 1981.

[11] Suresh Chari, Charanjit S. Jutla, Josyula R. Rao, and Pankaj Rohatgi. Towards sound approaches to counteract power-analysis attacks. pages 398–412, 1999.

[12] Zhimin Chen, Ambuj Sinha, and Patrick Schaumont. Implementing virtual secure circuit using a custom-instruction approach. pages 57–66, 2010.

[13] Joan Daemen and Vincent Rijmen. The Design of Rijndael. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2002.

[14] Thomas Eisenbarth, Timo Kasper, Amir Moradi, Christof Paar, Mahmoud Salmasizadeh, and Mohammad T. Manzuri Shalmani. On the power of power analysis in the real world: A complete break of the keeloq code hopping scheme.
pages 203–220, 2008.

[15] Junfeng Fan, Daniel V. Bailey, Lejla Batina, Tim Güneysu, Christof Paar, and Ingrid Verbauwhede. Breaking elliptic curve cryptosystems using reconfigurable hardware. pages 133–138, 2010.

[16] Tim Güneysu, Gerd Pfeiffer, Christof Paar, and Manfred Schimmler. Three years of evolution: Cryptanalysis with copacobana, 2009.

[17] Sylvain Guilley, Laurent Sauvage, Florent Flament, Vinh-Nga Vong, Philippe Hoogvorst, and Renaud Pacalet. Evaluation of power constant dual-rail logics countermeasures against dpa with design time security metrics. IEEE Trans. Computers, 59(9):1250–1263, 2010.

[18] Sylvain Guilley, Laurent Sauvage, Philippe Hoogvorst, Renaud Pacalet, Guido Marco Bertoni, and Sumanta Chaudhuri. Security evaluation of wddl and seclib countermeasures against power attacks. IEEE Trans. Computers, 57(11):1482–1497, 2008.

[19] Tim Güneysu and Christof Paar. Ultra high performance ecc over nist primes on commercial fpgas. pages 62–78, 2008.

[20] Tim Güneysu, Christof Paar, and Jan Pelzl. Special-purpose hardware for solving the elliptic curve discrete logarithm problem. TRETS, 1(2), 2008.

[21] Darrel Hankerson, Alfred J. Menezes, and Scott Vanstone. Guide to Elliptic Curve Cryptography. Springer Publishing Company, Incorporated, 2010.

[22] Timo Kasper, David Oswald, and Christof Paar. Side-channel analysis of cryptographic rfids with analog demodulation. pages 61–77, 2011.

[23] Neal Koblitz. Constructing elliptic curve cryptosystems in characteristic 2. pages 156–167, 1990.

[24] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential Power Analysis. CRYPTO 1999, LNCS 1666:pp. 388–397, 1999.

[25] Piotr Majkowski, Mariusz Rawski, Tomasz Wojciechowski, Zbigniew Kotulski, and Maciej Wojtynski. Heterogenic distributed system for cryptanalysis of elliptic curve based cryptosystems. pages 300–305, 2008.

[26] Victor S Miller. Use of elliptic curves in cryptography. pages 417–426, 1986.

[27] Peter L. Montgomery. Speeding the Pollard and elliptic curve methods of factorization. Mathematics of Computation, 48:243–264, 1987. URL: http://links.jstor.org/sici?sici=0025-5718(198701)48:177<243:STPAEC>2.0.CO;2-3.

[28] Amir Moradi, Alessandro Barenghi, Timo Kasper, and Christof Paar. On the vulnerability of fpga bitstream encryption against power analysis attacks: extracting keys from xilinx virtex-ii fpgas. pages 111–124, 2011.

[29] Amir Moradi, Markus Kasper, and Christof Paar. Black-box side-channel attacks highlight the importance of countermeasures - an analysis of the xilinx virtex-4 and virtex-5 bitstream encryption mechanism. pages 1–18, 2012.

[30] Amir Moradi and Axel Poschmann. Lightweight cryptography and dpa countermeasures: A survey. pages 68–79, 2010.

[31] Jean-Jacques Quisquater Philippe Bulens, Guerric Meurice de Dormale. Hardware for collision search on elliptic curve over gf(2^m). SHARCS, April, 2006.

[32] J. M. Pollard. Monte carlo methods for index computation (mod p). pages 918–924, 1978.

[33] Francesco Regazzoni, Alessandro Cevrero, François-Xavier Standaert, Stéphane Badel, Theo Kluter, Philip Brisk, Yusuf Leblebici, and Paolo Ienne. A design flow and evaluation framework for dpa-resistant instruction set extensions. pages 205–219, 2009.

[34] T. Popp S. Mangard, E. Oswald. Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer-Verlag New York, Inc., 2002.

[35] Shaunak Shah, Rajesh Velegalati, Jens-Peter Kaps, and David Hwang. Investigation of DPA resistance of Block RAMs in cryptographic implementations on FPGAs. pages 274–279, Dec 2010.

[36] Stefan Tillich, Mario Kirschbaum, and Alexander Szekely. Sca-resistant embedded processors: the next generation. pages 211–220, 2010.

[37] Stefan Tillich, Mario Kirschbaum, and Alexander Szekely. Implementation and evaluation of an sca-resistant embedded processor. pages 151–165, 2011.

[38] Kris Tiri and Ingrid Verbauwhede. A digital design flow for secure integrated circuits. IEEE Trans. on CAD of Integrated Circuits and Systems, 25(7):1197–1208, 2006.

[39] Paul C. van Oorschot and Michael J. Wiener. Parallel collision search with cryptanalytic applications. J. Cryptology, 12(1):1–28, 1999.