1 >> Kristin Lauter: Okay. So today, we're very pleased to have Yosi Oren visiting us from Tel Aviv University. He'll speak to us about the mechanical cryptographer. Thank you. >> Yossef Oren: Okay. So I'm going to talk to you about the mechanical cryptographer. It's a bit of a pretty exciting thing we did in our group in the previous few months, and I think it's really interesting and I hope to share the excitement with you, okay? So first of all, a bit about me. I am part of the Cryptography and Network Security Lab at Tel Aviv University, and among other things we do in our lab, we research [indiscernible] systems. We do foundations of cryptography work, RFID work. Personally, my interests are secure hardware, which means power analysis, ways to attack using power analysis, both applied work, lab work and [indiscernible] work. And low-resource cryptography for lightweight computers, particularly RFID attacks. And other things I do are cryptography in the real world, which are voter privacy, web application security and so on. Specifically, today, I'm going to talk about power analysis and other hardware attacks and specifically by the type of power analysis attack which is very unique in its application; specifically, it can be used where previous forms of power analysis were unusable. It needs much less data and it can be much more versatile. Okay. Things which I did which are not related to the academic but might be interesting to you, I did a lot of coding in my life. I was both a technical leader and a project manager. I know it's different things in Microsoft. And I also wrote some comedy. Might be apparent in the talk, depending on you. Okay. So I just want to calibrate people here that might be from different backgrounds. Who here knows what a flip-flop is? Wow. Not sure, right? Who here has ever touched a scope? Okay. Who knows what AES looks like from the inside? >>: AES? >> Yossef Oren: AES. Okay. So okay. So here is the AES cipher, okay? The 2 structure of the AES cipher, the uses of the AES cipher. You get plaintext on the top. You get ciphertext on the bottom. You get key and what's inside? Bit flipping, bit shifting, permutations and so on. This is a very efficient algorithm, okay? Modern CPUs can do 2 to the 31 AES operations per second. Very efficient. So if I give you the plaintext and the key, it's very easy to do the encryption. If I give you the ciphertext and the key, very difficult to do decryption. Yes? >>: What is two billion. >> Yossef Oren: >>: Two gigahertz. AES [indiscernible] multicore. Multicore, okay. >> Yossef Oren: Yeah. You have to do the -- you have to cheat a bit. terms of how many times per second the AES core is involved. But in Okay. So if I give you the plaintext and the key, cipher tex is easy to calculate. On the other hand, if I give you the plaintext and the ciphertext and the key is not given, it's difficult to calculate, okay? Why is this so hard? Well, this is what cryptography, this is why the cipher was designed this way. It's designed to be difficult to do cryptanalysis. Plaintext and ciphertext do not lead to the key. And what that means, essentially, is that there is no efficient way to represent the key as a function of the plaintext and the ciphertext, okay. In fact, most random functions which have 128 bit inputs or 256 bit inputs and 128 bit outputs are very difficult to represent most random functions. So if you have a very difficult to represent function, you can do either two things. You either spend a lot of space, you create a huge lookup table, or you can try the efficient thing, which is known key and known plaintext, ciphertext, and just go over it a lot of times until you find the pair you're looking for. This takes a lot of time. So it's either a huge space, a long time or some sort of trade-off between the two, which is inefficient. But anyway, there is no efficient way to represent the key of the function of the plaintext and the ciphertext, okay? I just want 3 to put this to the side. of the talk, okay? We're going to go into this a bit more in the future So let's put AES to the side for a moment and I want to talk to you about a nice software tool or a nice software machine called a solver. Who here as ever dealt into solver, tried to use a solver in your work? I know that MSRC is actually writing a solver. So you can talk to them. A solver, as it's defined, is designed to solve stuff, okay? How do I use a solver? I input into the solver a set of statements over variables written in some sort of logic language. There's no real restriction on what language I can use. I can use sat statements. I can use conditional logic statements. I can use English. I don't know. And the solver, upon receiving the set of statements, runs for a short while or a long while, and then it outputs a satisfying assignment for the set of statements. It outputs a set of variables which can be assigned to the statement such that all of them are satisfied. This is a very useful tool, it's very general purpose, very versatile. One thing, one example I can give to you, which it is used for is scheduling, for example, doctors' assignments. Okay. Let's assume I have a set of doctors and a set of assignments and the doctors have constraints. What are the constraints? For example, a doctor cannot be present at two different stations at the same time, okay, right? Because it's a classical doctor not a quantum doctor. And the doctor would like to sleep six hours per night, okay, perhaps he would like a day free or this doctor can't do Saturdays or so on. So the set of statements is given to you and the variables is Dr. Cohen is doing ophthalmology on Thursday and so on. So the software outputs a satisfying assignment, which is the way to assign the doctors to rooms. Similarly, the way to assign post-docs to universities and so on. Another thing which is a very common application of solvers is verifying hardware designs. Hardware design, you start with a high level language and you end up with actually a picture which is burned on to silicon. And some parts of this process leading from the hardware design to the silicon are manual. Some places are prone to errors. 4 So what you do is you give the solver both the high level definition and the actual net list, which is so-called -- this is a low level [indiscernible], and you tell the solver, is there an assignment of inputs which is different, which leads to a different output if I feed it to the higher level design, if I feed it to the silicon, okay? These are solvers, okay? Solvers are very versatile, so one of the things people thought of doing is let's try to do cryptanalysis with solvers, okay? And I'm just going to tell you what cryptanalysis is again. I am going to take this crypto algorithm, for example, AES. I am going to write the set of statements which is as follows, okay? The plaintext is P. The ciphertext is C. The transformation between plaintext and ciphertext is the set of logical expressions. And please give me the cryptographic key -- this is the assignment I want as variables. The cryptographic key will satisfy the set of statements. What this means is if I take the plaintext and the crypto with the key, I will arrive at the ciphertext. This is exactly cryptanalysis, okay? And that would be really cute if it could just give this problem to the computer and come back tomorrow and get the key, right. So this was tried in the year 2000 by Massaci and Marraro, 2000 AES was not live yet. They used a [indiscernible]. And what they found out was that not surprisingly, of course, modern crypto was strong enough to resist solvers. Obviously. You would have heard about it if it was successful. So what does this mean? You give the solver the plaintext and the ciphertext and the key, and the solver runs for an intractable time, okay, and the solver takes an unreasonable amount of memory. Just ran it for a while and then they said okay, we give up. We don't have enough hardware. Why is this so? Okay. We feel that it should be so, but why is it so actually? So if any of you have done some symmetric cipher design, you know that one of the most crucial elements of the symmetric cipher is what's called the diffusion property or the avalanche property. What this means is that if I change one bit in one of the inputs, very quickly it's going to affect half of the outputs, okay. One bit flips on the input, half of the bits flip in the output. And this happens very quickly. So AES has ten rounds. If I change a bit in one of the rounds, in two or three rounds, all of the bits are going to be affected. 5 So effectively what the solver is trying to do is trying to find a candidate assignment so it's guessing, okay, let's say one of the bits of the key is one. And then it has to propagate this belief and see if it makes all of the rest of the statements satisfiable. And as soon as it changes one of the statements, all of this other assignments become invalid because of this avalanche property. So the solver is reduced to actually brute forcing all of the keys, trying to find if one of them satisfies the relationship between the plaintext and ciphertext, okay? So it's established at this point that classical cryptanalysis, using a solver, is difficult. And the reason is diffusion. So okay. I'm an engineering. And when I'm faced with a difficult problem, when I do typically is I solve a different problem, right? This is what engineers do. So let's try to change the problem in a way which makes solvers more efficient, okay? So this is a definition of a cryptanalysis in the sense that I'm going to use today. I'm giving a description of an algorithm, plaintext and ciphertext. I want to output the key. So what can I do to make this easier for me, the engineer? So first of all, I'm going to take this AES algorithm and replace it with a concrete AES device, okay? This could be a smart card. This could be a software implementation. This could be a server running somewhere, okay. But this device, of course, implements AES, but it does so in a way which has -- in a physical way. And what happens is as it does its work, it keeps giving hints about what it's doing, about its internal structure. Specifically, in my case, I'm going to look at the power consumption, the instantaneous power consumption of this device every micro second and I'm going to output a power trace. A power trace is a sequence of, let's say, a million points and each one of these points is instantaneous power consumption of the device while it's doing encryption at a specific time. And now I have more information, which I can try to use for my attack, okay? And if I write this as a formal definition of power analysis, as I'm going to use it in this work, now I have a description of a cryptographic device, plaintext and ciphertext and power traces, okay? And I'm going to try to find the cryptographic key. 6 What's nice about power traces? As I told you previously, what really causes problems for me, the attacker, when I only have the plaintext and the ciphertext, is the diffusion property, the avalanche property. What happens, if I'm using traces, traces, if I'm looking at two very adjacent periods in the power trace, they do not have the avalanche property. The diffusion between two close in time spots in the power trace are close enough together for me to be able to make an hypothesis at times T and check it at time T plus one without compromising all of the rest of the assumptions I have about the system. So power traces are very good for me, the attacker, when I'm trying to attack this device. Power analysis has been demonstrated in 1996. It's well known that it works, okay? So what I'm trying to do now is trying to use this power trace in the scope of solvers. So what am I going to do? Okay. Oh, before I do that, I'd like to say a few things about power analysis. I understand that a few of you have touched it? No? Nobody's ever done it here? So power analysis, in a nutshell. Power analysis builds on the assumption that power consumption is variable. And not only is it variable, different instructions cause different power consumptions and different data cause different power consumption. And this means that the logical leap here means that I can run this backwards. If I analyze a power consumption, I can make some inferences about the instruction in the data. What does this mean? Inference about the instructions mean I can do reverse engineering. I can output the list of the instructions the device is performing and inferences about data, specifically in my case, is key recovery. Okay. So because power consumption depends on instructions in data, I can look at the power consumption and learn about the instructions in the data. Now, at this point, I'm doing this talk in variance for a few years now, and I would have a few animations and pictures of transistors where I would use to prove to you that power consumption which is something that is dependent on the instructions. But now fortunately, there is a revolution in the world and it's much easier for me to prove this to you. And this is called a smartphone. proof that power analysis works. Probably you have one. And this is living The smartphone has a CPU. It has a lot of 7 sensors. It has radios. It has accelerometers and so on. Let's ignore the sensors and the radios. It has a CPU, okay? And sometimes, we take our smartphones and perform a CPU intensive operation, which has no IO and no sensor. For example, we play a game. Some of us play games on our smartphones. We can admit it. And as we are playing this game on our smartphone, we feel that our smartphone gets hot, right? And the battery runs down, okay? Yes. >>: I saw the freebie games an incredible amount of battery and power downloading. >> Yossef Oren: Yes, about three times the amount of -- three times the power they spend on the games, they spend on the ads, yes. But it's kind of scary to think about that. But the whole thing only runs out to about a dollar a year, if you just charge your computer all the year, it costs you about a dollar in electrics. So now I can give you a living proof that power consumption is variable, okay? I just want to show this picture which I found. So power consumption is variable. Okay. So I have this fact, it's established and I'm going to do side channel analysis with solvers. And okay. Now I'm trying to do cryptanalysis. And two groups, one of them in Belgium, one of them in Princeton, thought of doing it. They just used a cipher and instead of feeding only the inputs and the outputs of the cipher, they also fed intermediate data, which is related to the power consumption. And, of course, I'm standing here and I'm talking to you so, of course, they were successful. Okay. The result. The key can be recovered from the side channel data. Right? Okay. But as you see, there is a small but. And let's see if we can try and find the mine in the statement together. The key can be recovered from the side channel data if there are no errors in the side channel trace, okay? So many of you are scope jockeys and have tried to measure physical phenomena using your scope. And as you know, there are never no errors in your scope measurement, okay? So this is the harsh reality of power analysis. Here is a very small device. 8 Can anybody guess what this device is? Yes, this is an inverter. It takes a square wave input here and it outputs a square wave output here, which is the inverse. This is an anti-transistor and this is a P-time transistor and you can just go through it in your head and see that a logical one here causes the zero to run out here and a logical zero here causes the one to run out here. Okay. So now I'm going to feed the square input into here and I'm going to take my oscilloscope and I'm going to look at the output. And how am I going to measure the power trace? I'm going to take a very small resistor, a one ohm resistor, and I'm going to put it in series between the device and the ground and I'm going to measure the voltage drop on this resistor. And the voltage drop is related to the power consumption, okay? Now, what do we see here? When nothing is happening on the circuit, there is no power consumption at all. This is something which is a property of CMOS transistors and you can really like it. This is why we can use our small battery to power our devices for a long time, okay? If something happens, there is practically no power consumption. And when this device switches from one to zero, this is a peak, okay. Sometimes it's a big peak, sometimes it's a smaller peak. It's related to the fact that these capacitors, passive capacitors get charged and discharged, okay? Okay. So if I was able, if I was somehow tasked with performing power analysis of this device, this power trace would be beautiful for me, okay. But the problem is that it never looks like this, right? What happens to the power trace on the [indiscernible] measurement, okay. So there are several things which are going on in this device which cause problems for me. First of all, I'm not measuring only the cryptographic operation, okay. I am attacking a system, a chip or a decoder or a video set top box. And this system is doing other stuff other than doing my encryptions. It might be threads and under test, this. This doing, I don't know, IO. Could be doing all sorts of CPU tasks, so on. What happens is that other stuff is going on in this device and I'm obviously measuring not only my calculations, but also is called switching noise. Another thing that happens -- oh, sorry. I gave you too much errors here. I'll bring it up slowly. Every piece of conducting wire on my device and connecting my device to the world is both a transmitting and a receiving 9 antenna. And the electrons which are moving through the air modulate themselves on to these cables. This is called electronic noise or thermal noise, because when you cool the device down, it gets weaker. I'm obviously measuring this thermal noise as well, because I'm measuring what's going on on the cable. And the third thing which is causing problems for me is that I'm using the physical device called an oscilloscope, which is connected using a physical device called a probe. The scope has its limitations. It has a digital, analog to digital converter, which run certain times of amount of times per second. It has its own impulse function. It has its own sensitivities. It has its quantization. What happens at the end of the day is instead of having this pretty, exact representation of power consumption, I have this disgusting errored thing which we feel is difficult for us to perform power analysis on. And let's try to give this a rigorous treatment, okay? And this is called the information robustness trade-off. Let's take a rigorous look at this phenomena we just saw. So let's assume I'm going to perform my solver attack. The solver needs a set of statements and each one of these statements is calculated from a measurement, okay. Takes the trace, power trace and a measurement. Let's say I measure it at 100 different times and I feed these measurements into my solver system. Okay. So I have a measurement space, which is 100 dimensional and each one of these axis is the value of a certain measurement, okay? Is this okay? So here is the precise measurement, okay? If there were no errors, no artifacts, nothing, this is what I would have been able to measure using my test setup. And it has been shown in the works of the Belgians and the Princeton guys that this is enough to recover the key with actually very good accuracy and very high speed, okay? And here is what I get on my scope, okay? If this was the case, I would be very happy. Because my actual measurement would be the precise measurement and I would be able to output the key quickly and efficiently. But what happens is that these errors and things, or noise -- you don't have to treat it as something bad. Error means that it's bad. Noise means that it's a 10 fact of life, okay? So what happens is if the circle moves aside. Now what happens, okay? I'm running, I'm providing this measurement to the solver, and the solver looks at the equations and looks at this measurement and what does the solver output? Noise. Unsatisfiable. There is no key which can give out this measurement, okay? Okay? The measurement stays -- has a code word here, it doesn't have a code word here. So now looking at this picture, what do we immediately feel inclined to do to make this work again? We want to do this, right? We increase the robustness of the equation system, okay. So we increase the robustness. We now say the equation set could be okay at time T1, the leak is three. Now we say it's not three. It could be two or three or four or somewhere, but it can't be seven or something of the sort. And we increase this robustness until we get into -until the set of valid measurements includes the precise measurement. But what happens here is that the measurement space does not look like this, but rather looks like this. And now, what happens is that inside the circle of measurements which are valid, which are validly represented by the side channel measurement, there are exponentially many satisfying keys. What happens when the solver has exponentially many satisfying keys? We know what it does. It just brute forces all of these keys. Because all of these keys are legal. What's happened is that the solver is again reduced to brute forcing. And this results, again, in an intractable running time. So this is what we call the robustness information trade-off. It's a very, very cruel trade-off. If we use an equation set which has errors in it, then the solver returns unsatisfiability, because the correct solution is not inside the set of measurement which I can accommodate. But if I increase the set of measurements if I add robustness, then the solver runs for intractable time, okay? So this is where we were stuck at 2009. Either one of these is not good for us. We can't use solvers for power analysis, cryptanalysis in the real world. Okay. So it's a hard problem so again, I'm an engineer. So if I have a hard problem, then I solve a different problem. Okay. So how can I change the problem to make it more acceptable for me? 11 Okay. Here is the solver. This is the tool I use previously. A solver receives a set of statements and it outputs a satisfying assignment. Now, instead of a solver, I can use a built of a more elaborate tool called an optimizer. Now, an optimizer works more or less the same way as a solver, but it doesn't receive only a set of statements. It also receives something called a goal function. And the solver, the optimizer chooses, among all of the possible satisfying assignments, the optimal assignment. The one that gives us the best value for the goal function. Let's say I want to minimize the goal function, give me the minimal value of the goal function. Again, this is a general purpose tool. It's a bit more elaborate and heavy. It takes more time to run than the solver, but it's also very useful. One example where it's used, actually, IBM has a group which rolled their own solver and all day they sell usage of the solver to companies around the world, transportation companies, shipping companies who want to optimize their routes. Here is a classical application for optimizers which actually is in the field today. I want to take the Russian railway system and I want to save fuel. I want to serve all of the stations in the railway system. I have constraints. A train cannot be present at two places at once. Two trains cannot be present at the same railway at the same time, okay. Train cannot travel so on and so forth before needing fuel. But I also have a goal function. The goal function is please serve the schedule of the Russian railway system in a way which minimizes the distance traveled by the trains. Minimizes the gas consumption of the trains and so on. And when you save a lot of money using this tool, suddenly, it becomes very, very important. So this is an optimizer. And similar to stat solvers, which are also developed in universities and there's competition between every year, there's something called an SAT race, where people will run their SAT travels and see which one is the fastest, there's also something called the pseudo-Boolean competition, the solver competition, optimizer competition. Every year, people compare their optimizers. It's an active work. People are actually researching them, trying to find the best way of doing it. So okay. How do I use this in my context? So if I go to the previous slide, I had this big circle, and my assumption was that all of the points inside the circle are equally eligible to be the correct key. So now, my insight is that 12 some mistakes are more expensive than others. What do I mean? I am going to give a price to each mistake I make, okay. There is the correct measurement, which is not the correct measurement. It's the measurement I got. Every time I deviate from this measurement, I'm going to pay a price, and the optimization would be let's try to pay the minimum price. And what's really nice is that I gave you this idea that there's a continuum of points, and the decoder which receives this trace outputs a point on this continuum. This is not correct. Actually, the decoder is a bit more elaborate. It's something called a soft decoder. The decoder for each one of the points along its axis outputs an [indiscernible] probability, which means how likely it is that this specific point is the one that was transmitted conditioned only the fact that this trace was the one received, okay. If somebody here that did some single processing work, this is a Bayesian decoder, a naive Bayesian decoder. Other decoders also exist. Okay. So I can give a price to every mistake I make, and I want to do the best I can. So here is my new definition of side channel analysis based on solvers. I have a description of a device. I have plaintext, I have ciphertext and now I have -- I need to find the key that minimizes the estimated error. So once I output a key, I can run this backwards and say that if this is the key, this is the power trace it should have been if there were no errors. I want the power trace that should have been to be closest as possible to the power trace that I did see with my scope. Okay? So now let's take a look at what really happens if I do this, okay? Again, I'm going to look at the measure measurement space. And this is a real measurement from my lab. Actually, it's based on a simulation from a device. I'm going to look at two dimensions of this measurement space, and the brightness of each point is correlated to the likelihood that this is the point which is really received, okay? So there are lots of points which are non-zero, okay. And if you would turn down the light, you would see them also less likely points. 13 Where is the precise measurement? Wow. Yeah, you see it? Great. Can I have lights again, Matt? Wow. Matt, you're awesome. So let's just take a look a bit closer at the points here. And where is the precise measurement? Here it is. Okay? This measurement, it's the 680th most possible of the 65,000 points here. So it's pretty likely, but it's still exponentially difficult to find it, okay? Now, I just gave you two measurements, okay? I don't give you two measurements. I'm going to give you 100 measurements, okay. So if I would try to brute force over all of these measurements, I would have to do an exponentially crazy amount of evaluation to see if it's a correct one. But now if I have a ranking by probability, I still have to do a crazy large amount of exponential work. Practically, in my measurements, the average rank of the correct byte out of 256 possible candidates is number 14, okay? So 14 to the power of 100 is still a lot of work. But the thing is that these points now are not IID. And this is the really -- this is the real trick here, because I also give to my solver the description of the cryptographic process which is creating these bytes to be output. So if one of these measurements is the input to sub-bytes, and one of them is the output to sub-bytes, as soon as I assume one of them, I also have to assume the other, okay? So what this means is that this point might be quite unlikely, but it's really, really powerfully suggested by other information I have. And as I said, because of the slow diffusion property, what I saw a minute before, what I'm going to see a minute after is going to really affect my choice at this stage, okay? So I'm just going to go a bit into the tool I use, which might be useful for you in other applications. The specific optimizing system I use is called a Pseudo-Boolean Optimizer. And this is how it works. The objective of the Pseudo-Boolean Optimizer is to output the vector X variable, which minimizes this goal function, okay. This is a constraint function called C. I multiply it by X, I get a number. I'm going to minimize this number subject to this matrix of linear constraints. A times X has to be greater than or equal to B. And why is this called a Pseudo-Boolean Optimizer. Because the variables 14 themselves are Boolean, but the coefficients of A are assigned integers, okay. So if you delved a bit into logic systems and so on, you know that the simplest logical system we've used are SAT statements. SAT statements are Boolean and Boolean. And on the other side, there's something called integer programming where the variables themselves are integers and the coefficients of integers. So this is somewhere in between. It's a compromise which is very good for me, because this simplification makes the solver quicker and it makes the implementation more easy to use. Okay. So using very simple to describe gadgets, I can turn a linear Psuedo-Boolean Optimizer into a nonlinear Psuedo-Boolean Optimizer. And this is using the classical linearization technique. So if I have a constraint, let's say, Y is X1, X2. Remember that Y and X1, X2 are Boolean, okay. So what this actually means is that X1 plus X2, both of them are 1, then Y has to be -- okay. More or less this. Okay? I'm not sure I wrote this correctly, but there is a linearization system and you can use nonlinear and they also can multiply these variables together and you can also use your inverts. So this language is very expressive. I can very easily describe to you a Pseudo-Boolean instance. You can look at it and understand what it does. And this is in contrast to SAT systems, which really you need somebody to understand assembly language to understand the SAT system. Here is a sample Pseudo-Boolean instant written in the open Pseudo-Boolean programming language. What am I trying to do here? I'm trying to find the vector X1, X2, X3. Which minimizes this goal function, okay, and it has this constraint, okay. X1 plus X2 plus X3 is greater than two, okay? So let's try to see if we can find in our heads the optimal assignment. Let's try the all zeroes assignment, okay? The all zeroes assignment gives us the goal function of zero, which is really great. But zero plus zero plus zero is not -- does not satisfy the constraint, okay. So that's not good. Let's try the all 1s assignment. This gives me four. Four is more than two. This is great. And the goal function now is going to pay the price of five. Okay. Can anybody looking at this equation system give me the optimal assignment? 15 >>: One, zero, one. >> Yossef Oren: Precisely. One, zero, one as well gives me a price to pay of two and one, zero, one is two. Two is greater than or equal to two. Great, okay. This language is very expressive and it's very rich, okay? What do I mean rich? It's really useful for my application. Specifically, I'm trying to find the variables, my variables which are either the inputs, the outputs or all sorts of internal state of illusions of my device. They're all flip-flops. Flip-flops can either be zero or one. So it's really great that the variables are zeroes and ones. On the other hand, the measurements, which are the things I pride to my systems, they're stamped on some continuous axis, but I can sample them into integers so the constraints is really great that they're integers, okay? And the non-linear notation is rich and expressive enough for me to be able to write very elegantly and succinctly the things which are -- you can see in crypto devices. For example, here are some very simple Pseudo-Boolean statements. Here is a negative OR, you see. Out is equal to not X1 times not X2, okay? Pretty trivial, okay? You can see this is a negative OR gate. Here is an exclusive OR gate. This is a bit cute, okay? I wanted to put this here, because you see that I'm actually using here a nonlinear -- I'm using here the Pseudo-Boolean, okay. This integer. If you would write the through table for this thing, you would see that this is exactly the exclusive OR statement, okay? You can just play with it in your head. Here is a bit of a more disgusting function. This is the Keelog ciphers nonlinear feedback function. It's five inputs, one bit output. There's no efficient algebraic representation of it. But as you can see, I just, I wrote something here which is Pseudo-Boolean representation of it, okay? And also, the goal function is really good for me, okay? The goal function, let's say I'm trying to determine the costs, the price of my -- I'm going to guess I'm trying to do a side channel analysis. I guess that the key is a certain vector, okay. Now, choosing the key forces, of course, all of the internal values of the state to be also strictly defined. So now, I have for each one of these 16 internal values, a chosen value and I know the aposteriori probability of each one of these values. I want to find the overall probability that this is the correct value. What do I do? I multiply all of these together and I get a single value which is -- yes? >>: There's a negate missing someplace. negate something. A max doesn't become a min unless you >> Yossef Oren: The max is less than one. All of these products, all of the aposteriori probabilities are less than one. So the log of a number less than one is negative. So I want the minimum sum. It's the minimum -- it's like entropy. >>: The maximum, it's a negative number. >> Yossef Oren: I want to maximize the negative, yes. Thank you. Right. So here is another example of a Pseudo-Boolean instance. Here I have a vector X, okay, you can just look at it. It's so apparent to see what I've written down here. X can be either zero, one, two or three. This statement means that one and only one of these variables can be true, okay? And X is an eight bit value. Here are the eight values of X. X sub 1, 2, X sub 7. And as soon as one of these events happens, it forces all of the other events to get their value. And these are the constraints and here is the goal function, okay? The goal function means that, okay, if you're going to choose zero, you're going to pay a pretty low price. If you're going to go ahead and say that X is 3, you're going to have to be pretty sure of it based on other evidence before you can go ahead and choose X is equal to 3. So as the solver, as the optimizer works, it would probably try first X is zero and then move down this list, okay. But again, because X might be determined by the previous value in the decryption, the former value in the decryption, it will do a bit more efficiently. But this is how my equation system looks. Any questions about this? Because now I'm going to describe my workflow and my results. >>: So I'm wondering how you get the expected trace. simulator, or do you run ->> Yossef Oren: Okay. So do you have a The way I prepare my decoder, it's a method that's well 17 known in the art. It's a paper from 2005 called template to text, which is very interesting. It basically you take your device and your cryptivity, and you force it to output values which you would like it to output. And you measure, let's say you want that value number two would be five, okay. You make it output value -- many traces in which value number two is equal to five and you find an interesting point in this trace, or a few interesting points, which are highly correlated with the value we are measuring. There's statistical tools which assist you in finding this point. And then for this point, this interesting point, you create a mean and standard deviation. And the mean and standard deviation are the mean and standard deviation conditioned on the fact that you were expecting five. Now, for all possible values, 256 values, you create 256 means and standard deviations, okay. If you're using, for example, two interesting points, you also have a covariance matrix for each of these. So now I have, if I get a trace which I don't know, I can use these variables, these means and variances to create 256 probabilities. Probability of this trace conditioned on one, conditioned on two, conditioned on three. And then using Bayesian inversion, I can flip this around and get the probability I'm looking for, okay? It's really interesting single processing and the paper describes it much better than I just did, okay? >>: I get it. Thank you. >> Yossef Oren: So here is the work flow. I have this device under test. I'm going to do something novel to it, which will end up in the optimizer outputting the secret key, okay? So what is my work flow? TASCA, by the way, is tolerant algebraic side channel attack. Tolerant, because I can tolerate errors, okay? So first of all, I'm going to take my device under test and subject it to reverse engineering. This can be based on prior knowledge. This could be based on really probing and so on. And the output of this would be the power model. The power model means that I am going to write a set of constraints which say is the device is doing this at time, is doing something at this time, then I expect its power, its precise power to be this, okay? This is output from reverse engineering. 18 I'm also going to take traces or, let's say, one power trace, which is output from the device, as I am attacking it and I'm going to put it into this decoder, this Bayesian decoder, which I just described, and this decoder is going to output the vector of aposteriori probabilities for each one of the measurements I'm going. Let's take a look at the amount of data I'm using. The data complexity. I have a single trace, for instance, and in this trace I'm going to take 100 interesting points, and each one of these interesting points is going to calls the output of a vector of aposteriori probabilities, okay? So from this one trace I'm going to get say 100 times 256 aposteriori probabilities, and then I'm going to put this into the optimizer, these together, and it's going to run and it's going to end up in a paper and I'm very happy, okay? We know how it works right? Our objective function is very well defined as researchers. So let's talk about a real attack, okay, and the results. By the way, this is going to appear in CHES 2012 a month from now. So the solver is SCIP. SCIP is an open source solver written by the Berlin university. It keeps running in the SCIP -- in these Pseudo-Boolean competitions and winning so it's pretty good. And the device, Belgian outputs claims. cryptosystem I'm attacking is a simulation. It's not a real lab but rather based on measurements which have been performed by my colleagues. So it's real data, but it's created by a tool which simulated data, okay? I just don't want to make any ground-breaking I didn't break a physical device yet, okay? And what I did, I took a single power trace and I extracted 100 from this single trace. So single power trace, data complexity signal to noise ratio is reasonable at 10 DB. It means there's amount of power in the signal than there is in the noise. This signal to noise ratio. measurements is one. The ten times the is a reasonable I put these instances, I created a lot of these instances. I said I have a tool that creates them. I took 200 of these instances and, on average, in about less than ten minutes, the key was recovered with perfect success rate, okay? One trace, ten minutes, 100 percent success rate, okay? 19 What does this mean? Before I describe what it means, I want to tell you a bit about the field of power analysis. Power analysis has been something which people in the academic world know about since '96, and it's well assumed that the government agencies and so on and so forth know about it from World War II and on, okay? So people know that power analysis attacks work. But what happens at the most power analysis attacks until this day are based on statistical methods. What that means, that you need many traces, and you try to make a hypothesis on these traces which will cause them to behave in a way which, for example, you can split them into two bins. If your assignment is correct, if your hypothesis is correct, then these two bins will be statistically significant. Or there will be a correlation between your hypothesis and the traces at a certain amount of time. So another thing which all of the previous -- not all of them, but many of the previous attacks assume is that there is a linear correlation between the power consumption of the device and the data it's processing, which we know because it's a CMOS device, okay? So how do you break power analysis? How do you make a device resist power analysis? You attack both of these venues. First of all, you try to make your protocol, you try to make your environment so that the keys are changed with high frequency. They're always fresh. So you won't be able to get a lot of measurements so your statistical message will fail. Or you can just inject noise to make the statistically correlation weaker. Another thing you do is you try to make electronic engineering tricks to break the linear correlation between the power consumption and the amount of bits which are flipping at a certain time. You can do this by, for example, what's called dual array logic. You carry your circuit and then you create a mirror image of the circuit and they always do the opposite from one another so you really don't know what's going on, okay? But what happens is that both of these assumptions, which make today's resisted devices resistant do not hold when I'm doing this attack. First of all, crypto devices don't need high data complexity to be attacked with this thing, okay? The data complexity is one. And if I really have a lot of noise, I can average two traces together so that the data complexity is two. Averaging really, really hits the signal to noise 20 ratio, okay? So I don't really need to average. If I average 16 traces together, I can do a great work in reducing the signal to noise ratio. So the data complexity is very low, which is something which previous countermeasures assumed would make them resistant to power analysis. Another thing is that I don't need anything particular about the leak. As I told you, I make no assumption about the leak. I leave this problem to the author of the decoder, okay. The guy who writes the decoder, he can, as long as he can output the vector of outputs to varied probabilities, that's enough for me. So if the leak is linear, that's fine. If it's not linear, okay. The device I showed you has a nonlinear correlation between the [indiscernible] rate of the byte and the power consumption, but if there is a relation, okay, so anything that I can write a soft decoder for and it doesn't even have to be power consumption. It can be anything exotic, anything else. As long as I can write a soft decoder, I can do this attack, okay? So these two facts together call into question the security of previously safe devices. So if you say a device is resistant to power analysis, you might have to go and check this claim again, okay? So I think this is pretty exciting. I still don't have any practical devices I've attacked, because it's a very fresh result, but I feel there's a lot of things that will have to be questioned if this really turns out to be practical. So where do I go from now? Future work. First of all, I used only power analysis decoder. So I feel that anything that leaks can be attacked. So a nice thing to do would be try to do different things that leak and try to feed them into my solver. What I do know, at least I feel it, that if I throw garbage at my solver, which means low quality data, it knows to ignore it, more or less. So if there is a measurement where all of the values in these measurements have the same probability, the solver will just ignore it. So it really feels that the more data I throw at the solve every, the more effective it will be. The second thing is different leakage models. Again, I can try different things. For example, I can try to use a cache timing leak, electronic leakage and so on and so forth. 21 And the third thing, I touched on it briefly. The decoders we currently use have this very elaborate pre-processing phase. We have a captive device and they do profiling and they create [indiscernible]. And then I have a very, very small data complexity. I only need one trace, and I use this trace and I get the key. So it might be that I have limited time with my device. There might be -- it might be possible to use less profiling time and more traces for the attack and still get an attack with low data complexity both in the online and offline phases, okay? And the fourth thing, which is really apparent is to get a real smart card or a car or a computer and attack it using this method, which is really something interesting to do, but difficult. Okay. So this time, I thank you. And you can get the paper from this website, and I'd really welcome any questions or comments. Thank you. >>: I don't know if this is really within your mandate, what you're interested in, but what would be the characteristics of the device that could defeat this attack? >> Yossef Oren: Let's see. I would say that what I really need is diffusion. I need low diffusion. So a device which has lots of diffusion would defeat the system, okay. So if, for example, if I make a guess and to check this guess I would have to go over a whole space of solutions, that would make the device difficult. One example is, for example, a device which does a lot of things in parallel. So I'd have to make a lot of guesses at the same time, okay? Anything else? >>: How does the attack scale with higher complexity crypto systems? >> Yossef Oren: What do you mean complexity? >>: So say you make the bit size larger or something. or ->> Yossef Oren: So a much bigger keys AES doesn't use a different round structure for larger keys. 22 It just does more rounds. So AES with a larger key is not more -- not more powerful when I attack it using this way. >>: So it just doesn't change at all? >> Yossef Oren: Yes. >>: With the exception of TV decoder cards, who in his right mind puts his secrets without the other guy get within two inches of them in the scope? >> Yossef Oren: Oh, yeah, that's true. If -- you can always say that if you have physical access to a device, then you're screwed anyway, okay? But first of all, TV decoders are a very interesting market. The second thing is that now we are taking our cell phones with all of the secrets and touching them to all sorts of -- at least once to do it once the software and hardware is in place, we're going to go and touch, okay, places you wouldn't touch with your hand, suddenly you're going to touch with your phone. And who knows what's on the other side. And the third thing I can say is if you're using electromagnetic leakage, suddenly your range is much larger, okay? Power analysis does always have this comment you have to make that to do power analysis, you need a power trace. Yes. Okay? Anything else? >> Kristin Lauter: >> Yossef Oren: Let's thank Yossef. Thank you.