>> Bryan Parno: So hello, everybody. Thank you for coming out to
another talk in what I know is a very busy time. I'm here to introduce
Jonathan McCune from Carnegie Mellon where Jon has done all kinds of
interesting projects around trusted computing involving talking to
TPMs, and I've heard tell that he can talk to TPMs through his
fingertips and occasionally writes hypervisors in his sleep. So some of
his previous work was on systems like Flicker and TrustVisor for
creating minimal trusted computing bases, and today he's looking at
what else might be untrustworthy about our systems.
>> Jonathan McCune: So thank you. It's my pleasure to be here. I do
appreciate your attention. Please feel free to interrupt me. I would be
happy to go down, you know, discussion paths. I think there are design
questions that get raised by some of these problems that might be fun
to talk about. So I also want to credit graduate student Yanlin Li and
then Professor Adrian Perrig; they were collaborators in this.
So maybe you guys know what this is but just a block diagram with more
than one processor and a bunch of different types of devices. What does
this look like? Maybe a chip set? This is the kind of architecture that
we tend to find in all of our PC-style devices today. And these aren't
just dumb ASICs anymore, right? These are basically entire computer
systems. And so a reasonable question to ask is, "Well, what code runs
on all these other processors, and what are the security properties of
that code?" And in particular for the purposes of this talk, I think
it's useful to think about it from the perspective of the network card.
The reason I want to start there is because that's generally the most
exposed surface of the system. So if a bad guy's coming in over the
internet, there's a good chance he encounters your network card first.
So I just want to go through a few anecdotes that I think are sort of
fun or sort of scary depending on what kind of mood I'm in. But this is
an off-the-shelf network card. And this guy Arrigo Triulzi, you know,
he likes to reverse engineer things but he's a human, a mortal, and he
bought a ten-pack of these network cards and just wanted to see what he
could figure out. And before breaking the tenth card, he was able to
successfully replace the firmware on the card with code of his
choosing. Now he did this with physical possession of these devices so
this wasn't necessarily, you know, tickling some latent vulnerability,
but it sort of shows what's possible. And he decided to use peer-to-peer
bus communication -- something that it turns out is legal but not
very commonly used -- to inject other code into the graphics card. And,
you know, this gives rise to this scenario where you have maybe a bot
without actually having a compromised operating system. There's a lot
of memory and compute power inside of our graphics cards today, and so
they can easily, you know, generate malicious traffic at a rate that
can saturate the network card or something like that.
So, you know, there are some very sophisticated subsystems inside our
computers now. Another example: Apple aluminum keyboards, like the one
on the left, run firmware, and there are firmware updates for these
keyboards. I'm not sure what is so complicated about that. But
there's a vulnerability in the firmware update mechanism and so, you
know, you can infect the keyboard and subsequently, potentially infect
the host. And to make the story from the previous slide real, there are
now actual known vulnerabilities in the remote-facing interfaces of
certain network cards, you know, these manageability features where, you
know, maliciously crafted packets can overflow buffers.
So, you know, it's important to recognize what are the root issues
here. So malware on peripherals can readily eavesdrop on any data that
they actually handle. Especially, you know, good prudent practice says
that you don't trust the network anyway, but having a man in the
network card certainly gives them control over a lot of aspects of your
network. You know, if the IOMMU, right, if there's not some intelligent
configuration of what memory a peripheral device can access with a DMA
transaction, then you can run into
problems with, you know, unfettered access to memory.
Other peripherals can be infected; or not even necessarily infected,
right, subverted to perform malicious work like the NIC GPU example.
And I think this bottom thing sort of drives it home. Your system can
still be a bot even if there's nothing wrong with the operating system
and the applications on top.
So what's the state of the art in trying to keep the, you know,
firmware-level portions of our systems in a state that we're happy
with? So signed firmware, signed BIOS updates, you know, digital
signatures -- these make us happy. You know, you put a public key
fingerprint in some immutable ROM location, and you make sure that any
code that comes in purporting to be a firmware update has a signature
that checks. That doesn't say anything about how new that firmware is,
so it could be the old version with a known vulnerability pretty
readily.
An unfortunate recent example is this Intel disclosure. Their, you
know, brand new security feature, Trusted Execution Technology,
something that I do personally think is on the right track,
unfortunately had a pretty serious vulnerability in what is legitimately
signed code that's out there in the wild; systems will run it. And
unfortunately the fix for this is pretty drastic. Every SINIT module
ever released happened to have this vulnerability at the time it was
disclosed, so they all needed to be updated. It turns out that there's
some CPU microcode problems. And so, you know, microcode is
ephemeral in our modern processors, so that means every power-cycle
forever we hope that the right microcode patch gets applied.
And in order to ensure that there aren't rollback attacks to a previous
version, there are vendor-specific BIOS changes. So that means every
vendor that shipped a system that's capable of doing this should
technically be updating their BIOS to make sure that it blacklists
these known bad modules. And, I mean, that's just something that the
commodity ecosystem can't really stomach today.
There are a lot of legitimate reasons to roll back BIOSes as well. Right?
If you're an enterprise with ten thousand PCs and the BIOS update
breaks something important, you can believe that the vendor's going to
find a way to roll it back. So I'm not very happy with the state of the
art. I think it's a good idea to do signed code. At least you know
where it comes from, but it certainly doesn't mean that we're done.
So basically it's an open challenge to detect whether or not there's
malware running on our peripherals. And peripheral devices are
interesting because they tend to have some pretty significant resource
constraints, you know, limited memory. Hardware-based protection
mechanisms might be quite expensive relative to the cost of the device
itself. You know, most keyboard micro-controllers can't do public key
cryptography.
And so what we're trying to do here is find a way to actually verify
the integrity of the peripherals' firmware. That means learn for sure
what version of the firmware is running in there. You know, and
hopefully you can cross-check that and find out that it's a legitimate
version from the legitimate vendor and that it's recent. And so just to
drive this home, hopefully you get this, but we want full system
security. Trustworthy execution just on the platform's primary
processor is not enough. Right? We want to know that all these other
peripherals that are basically full computer systems in their own right
are behaving as intended.
And so this brings up the question, how do you even approach this
problem? And which one of these things should we verify first? Can we
assume that once it's been verified it's not going to be subverted
while we verify another one? And so there are a lot of different
heuristics that come to mind about, "Well, should we start with the
primary CPU?" It's the most powerful by some metrics but not
necessarily all metrics. Maybe we care about proximity to the
processor, maybe fewest hops is a useful thing. You know, so there are
all these different metrics that you can dream up, and you'd like to
find an answer that says that one of these metrics is superior to the
others.
So hopefully at this point I've made the case that there can be malware
on peripherals, that it can be a significant threat, and that it's an
important problem to look into and maybe find a way to do something about.
We're going to propose VIPER, a way to verify the integrity of
peripheral devices. We do this using a modified form of a software-based
attestation protocol. I have some background information on
that that I'll come to shortly. And then, we actually prototype this on
an off-the-shelf network card that happens to have open source firmware
that we could modify without also doing reverse engineering.
So our attacker is a remote attacker. He's coming over the network or
something similar. For the purposes of this implementation, we're going
to consider physical attacks to be out of scope. We're actually going
to assume that the host CPU is trustworthy. I mean that's a strong
assumption, but we wanted a place to stand. And the hardware changes
that are in place and coming down the pipe for our PC platforms -- at
least hardware changes for security -- have so far been focused around
the primary processor. So it's an assumption; that's where we're going
to try to start.
In any attestation-style system, the thing that's going to serve as the
verifier needs to have expectations about what it's going to try and
verify. And so we're assuming that this verifier program knows
something about the peripheral and what's supposed to be there. The
attacker model is that the firmware can get compromised. We're not
going to prevent it. We're going to detect it if it happens. We assume
the attacker has fairly immense resources at his disposal at some
remote location. Right, they can co-opt EC2 or whatever. But we're going
to assume that standard cryptographic primitives hold up.
So there we go. So this is a basic motivation for attestation. We want
to get code integrity of the firmware that's running inside our
peripherals. If we can reference a cryptographic hash of that firmware
that we have faith is accurate, then we can cross-reference that with
a golden database of sorts and convince ourselves that the right
firmware is in place. And so this usually looks like some kind of basic
challenge-response protocol. You know, the verifier sends some kind of
nonce to the target environment and back comes a signed -- or maybe a
message authentication code makes sense under certain conditions --
statement of sorts that the verifier can then cross-reference with this
database of known-good things.
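To make that concrete, here is a minimal C sketch of the verifier side of such a challenge-response exchange. The helpers send_challenge, recv_response, fresh_nonce, and the known_good table are hypothetical stand-ins, not an API from the talk:

    /* A minimal sketch (not VIPER's real API) of the verifier side of
     * this challenge-response pattern. send_challenge, recv_response,
     * fresh_nonce, and the known_good table are hypothetical helpers. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define DIGEST_LEN 20                      /* e.g., a SHA-1 digest */

    extern void send_challenge(uint32_t nonce);
    extern void recv_response(uint8_t digest[DIGEST_LEN]);
    extern uint32_t fresh_nonce(void);
    extern const uint8_t known_good[][DIGEST_LEN];  /* "golden" database */
    extern size_t num_known_good;

    bool attest_peripheral(void)
    {
        uint8_t resp[DIGEST_LEN];
        send_challenge(fresh_nonce());         /* nonce prevents replay */
        recv_response(resp);
        for (size_t i = 0; i < num_known_good; i++)
            if (memcmp(resp, known_good[i], DIGEST_LEN) == 0)
                return true;                   /* known-good firmware */
        return false;                          /* unknown or tampered */
    }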
So that basic model we're going to apply here too. But we're going to
apply it using software-based attestation. And are you guys familiar
with that? Has anyone not heard of software-based attestation? So there
were a few head-shakes.
So the idea is to provide the type of root of trust that a hardware
mechanism can provide but without any hardware support. So we assume
explicitly that our peripheral device does not have a secure
coprocessor; it does not have a TPM. This is reasonable because
there's an immense population of devices out there that have no such
support. And it will never make sense for certain price-points to add
support, so it's always going to be something that's in scope.
At a high level, this is glossing over a lot of detail, but the
difference between regular attestation where you have some hardware
root of trust that protects a secret like a private asymmetric key or a
symmetric shared MAC key is that we actually have no secret on the
untrusted device. But we know a lot about the micro-architecture of the
untrusted device, so we can do things like maybe we're going to be able
to have a cycle-accurate simulation of what should happen on that
untrusted device. So we want to try to make a combination cryptographic
hash function and benchmark, right, where if it gets the answer on time
then it has the properties of a cryptographic hash function. And if an
adversary tampers with it then it will either return the wrong answer
or take too long.
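As a rough illustration of that "right answer, on time" check, the verifier logic might look like this C sketch; run_checksum_challenge and expected_checksum are assumed helpers, and max_ns would come from that cycle-accurate model plus communication time:

    /* A rough sketch of the "right answer, on time" check; the helper
     * names are assumptions, and max_ns would come from a cycle-accurate
     * model of the untrusted device plus communication time. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    extern uint64_t run_checksum_challenge(uint32_t nonce); /* on the device */
    extern uint64_t expected_checksum(uint32_t nonce);      /* simulated locally */

    bool verify_on_time(uint32_t nonce, uint64_t max_ns)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t resp = run_checksum_challenge(nonce);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        int64_t elapsed = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                        + (t1.tv_nsec - t0.tv_nsec);
        /* Tampering must either change the answer or blow the deadline. */
        return resp == expected_checksum(nonce)
            && elapsed >= 0 && (uint64_t)elapsed <= max_ns;
    }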
I don't know how to summarize this in 30 seconds very effectively. But the
main thing is if you get the right answer on time then you feel good
about the answer. You consider it to be authentic to have come from
this device with assumptions about, you know, what's the clock
frequency at which that device runs. You know, if you do some kind of
very sophisticated overclocking then you can, you know, cause this
thing to execute more rapidly. Question?
>> : So do we need to assume the communication channel can uniquely
identify the device?
>> Jonathon McCune: You do. So we tend to call that -- I mean, you
described it well, but we use I think endpoint origin authenticity if
you read some of the software-based attestation literature. So that has
been a problem for software-based attestation mechanisms to date. You
have to have this assumption that you know which devices are talking,
so that automatically rules it out across the Internet. I'm getting
ahead of the slides a little bit. But the other big problem with
software-based attestation is something like a proxy attack. You know
if this is our network card and it has an internal microcontroller at
200 megahertz, you can [inaudible] spoof its response time with a, you
know, a big powerful system. So especially if we assume the adversary
to have multiple data centers at his disposal, you know, we're going to
have to do more than just worry about the time it takes to execute this
checksum. Question?
>> : So I have a question. If the trusted device knows that I'm
answering attestation and all I need to do is [inaudible] condition
check [inaudible]. If I'm under attestation, I just run the benign code
so I can finish the computation within the requested amount of time.
And if the condition check is just some very quick check, how can it
differentiate the trusted device is not really a malicious code?
>> Jonathon McCune: Yeah, so this is software attestation background.
And so what you describe is a real problem. Right? The fact that like
let's say we have our legitimate checksum function. You know, and it
gets us our right answer in 100 milliseconds. And the best known attack
changes one if condition and takes, you know, 100 milliseconds plus 10
microseconds or some tiny, infinitesimal adjustment, that's very hard
to detect especially in an environment where you aren't 100% sure who's
talking right now. So we're going to get to that. So that's one of the
things that I like about this application of software-based attestation
to peripherals that I think can overcome that problem. But that is a
problem for, you know, your vanilla run the checksum function once over
all of memory. And, you know, legitimately it either takes a second or
a second plus one millisecond, and you're supposed to distinguish
between these. And so that's been a real limitation of, you know, prior
software-based attestation mechanisms.
That was kind of clunky. Are we happy? All right. Great. So when you
actually go to implement this, what you need on the device of
questionable repute, right, on the peripherals that we want to verify
is this checksum function that I've alluded to as a hybrid hash
function benchmark, and then you need some ability to send information
back, and you actually need a more traditional cryptographic hash
function besides this fancy checksum function, to create what we're
calling here a software-only root of trust. So if you want to draw the
analogy between a hardware root of trust where you have, you know,
something like a TPM that has a signing key in it and it just won't let
the key out, right, there's no API for that. The only thing it'll do is
sign things. So if you get a signature from it and you trust that the
hardware hasn't been compromised, you know it came from a particular
device. And so that's the property that we want to attain in a parallel
way with a software-only mechanism.
And so on the verifier, for example the host CPU in this context,
then you have to have some kind of checksum simulator so that it knows
what the right answer was supposed to be. Now remember this is a
challenge response protocol with a nonce, so what the checksum function
is going to do is actually contingent on that challenge. It's not so
simple as to return the same answer every time. And then you have to
have your golden image, right, your expected firmware, and the ability to
measure time. Hopefully this is consistent with what we've previously
seen. The novelty of this VIPER system is actually with respect to
communication latencies and things. And so the clunky bits of software-based
attestation, you know, stood before I talked to you today
and will remain after we're done. But hopefully in environments where
you can make statements about communication latency, you'll see that
some neat things can happen.
>> : [Inaudible] the malicious code will always [inaudible] resolving
[inaudible] checksum problem. I could design the code in a way that will,
I mean, so that it leaves the checksum function intact.
>> Jonathon McCune: Yeah, so I maybe should've provided a little bit
more background. But another one of the assumptions that goes with one
of these checksum functions is that its implementation is optimal. So
the idea is that it's a very small thing. The ones that have been
developed to date don't look like cryptographic hash functions; they
look more like -- well, one such thing is a T-function; I think it does adds
and XORs. So it's a very simple function. So my big concern about
the practical ones is that you might be able to shave a few
instructions off their implementation, but I'm more concerned that
there's some kind of major algebraic failure where it's just not a
sophisticated cryptographic function like you really want for your real
hash function.
So this is still a limitation of the checksum functions that have been
proposed to date for software-based attestation. The kinds of questions
that you ask are sort of open questions in terms of, you know, getting
like a reduction-proof like we're used to having in sort of more
traditional crypto.
Okay. Animation time. So maybe one last step: we talked about a nonce
coming across; that's the challenge. Right? This checksum function does
its thing. It sends back a checksum. The minimal way to implement this
is it's really only verifying itself, and that's not really of value
alone. Right? That's only our software-based root of trust. What we
really want is a root of trust that allows us to get a high integrity
hash of the code of interest. And that's why the last step here is to
actually invoke a cryptographic hash function. In that scenario it
looks a lot more like sort of TPM style integrity measurement and
attestation.
Okay. So we already sort of described the proxy attack in response to
questions, but the risk is you have this peripheral device with some
kind of wimpy processor. It's already been corrupted by an adversary,
so he forwards the challenge to something powerful but fakes it in
time. So the adversary's able to get the correct checksum on time and
fool the verifier. So we call that a proxy attack, and that's been the
most significant, probably practical barrier even if you had a perfect
checksum algorithm this would still be fatal.
Now I want to talk about the differences for peripherals, and this is
where it hopefully starts to get interesting, because I think we have
some properties that are not quite as far out as getting a checksum
function that adheres to all these desirable things. So earlier I only
really talked about CPU performance when I talked about these
checksums. Communication overhead for this proxy attack -- this
communication overhead was a problem between the verifier and the
intended target device, because that's slop in your ability to measure
exactly how long that checksum function took to execute.
You know, and earlier we mentioned that the real checksum functions
that have been proposed have only a minimal additional overhead under
the best-known attacks. So, you know, if your network latency is too
long, it just won't work. You can't tell the difference between a small
change in network latency and the legitimate overhead induced by an
attack on the checksum function. So that was the past. Now when we
start to look at peripheral devices then suddenly, you know, the
latency is a lot more comprehensible.
So inside of our system we have buses as our communication mechanism.
Right? If we have a gigabit NIC then there's a PCI express bus or
something that connects it to the processor or to memory and then
Ethernet goes from the network card out to the rest of the world. And
although you could build a system that violates this property, in the
common case especially for OEM-provided systems they're not going to
put a peripheral device in the system where the buses aren't able to
keep up. Dell wouldn't spend the money to put a gigabit NIC in a system
if the bus can't go at that speed because they're wasting their money
on the network card.
So I think it's a pretty reasonable assumption that in many, many cases
the throughput is higher or the latency is lower between the main
processor and the peripheral device than it is between the peripheral
device and this proxy helper that might be out over the network. You
look unhappy.
>> : Yeah, because network traffic is [inaudible]
>> Jonathon McCune: So I'm making this claim even with whatever the
attacker's best case scenario is for network traffic. So even if the
attacker is, you know, saturating a gigabit link, the PCI bus isn't
necessarily saturated. Or whatever the latency is on the gigabit link,
you know, from a host to two hops down the network is a lot higher
latency than that between the NIC itself and your, you know, primary
processor.
>> : It comes down to throughput, though. Doesn't it?
>> Jonathon McCune: Well, it depends how you build your verification
system. So what I'm going to talk about in a slide or two here is one
that exploits the latency advantage of the local processor as verifier.
I think you could probably build something similar that takes advantage
of throughput.
>> : Okay.
>> Jonathon McCune: So maybe I'll come -- Let's discuss this further in
a couple more minutes if it doesn't help. Another neat thing about
peripherals is, you know, in any kind of sort of dynamic or on-demand
integrity measurement scenario, you may have data in memory. And
forming expectations about what is the right value of data like, "What
should be on my stack at this particular instant?" is a hard question.
And one of the things that's nice about peripheral devices is periodic
reset isn't necessarily disruptive. Peripheral devices get powered down
all the time. You know, our modern systems have a lot of power-management
functionality built in, and so there tends to be pretty good
support already for reverting that peripheral device back to a
relatively known state.
Okay. So with peripherals we have this stable communication pathway,
and you generally have a better connection from the main processor to
the peripheral device than that peripheral device might have to any
proxy helper. Some of the asymmetries that we've considered so far are
latency, throughput, and then also, you know, the relative rates of
variance or jitter in either of these values, and the loss
rate as well. You know, packets do get dropped. Bus errors happen
but they're comparatively quite rare. Yeah, so I think I made this
point.
So let's take a zoomed in view of how time elapses if you do a naïve
software-based attestation protocol at first. So we have time going
from left to right. We have a host processor and a peripheral. And
let's not worry about an attack yet; let's just look at the benign
case. So the host CPU to initiate one of these verification protocols
is going to send a nonce down to the peripheral device. It's going to
compute its checksum and then it's going to send back an answer. And
although it's fast on modern systems, it's not instantaneous. All
right? So some amount of time passes as a nonce travels from, you know,
your Core i7 down to your Broadcom NIC. Likewise, some amount of time
passes as an answer comes back. Now if you look at the proxy attack
scenario, the peripheral device has been compromised already. And
instead of legitimately computing the checksum, it's going to forward
that challenge to some helper, right, to some malicious proxy who's
going to presumably have immense compute resources. And in the limit,
we can assume he knows the answer. Let's just assume he actually broke
some of our cryptographic assumptions and immediately knows what answer
to send back.
He's still going to incur some latency getting the message from the
network card even if it's, you know, a one-foot crossover cable. It's
not instantaneous. And so what you end up with is, you know, these red
arrows add up to overhead, and that's communication overhead that's
actually in the defender's favor here. You know, this is overhead
that's only incurred under an attack scenario. In the benign case, that
overhead is not in scope, so we don't have to -- You know, when we
figure out what's our threshold -- Where on this line do we need to
receive the answer in order to conclude that the answer came in on
time? -- we don't have to take this variance into consideration. And so
that amount of overhead actually is useful in constructing a protocol, you
know, to make it a lot more difficult for the adversary to get the
right answer on time.
So the question is, what do these various parameters need to look like
in order to get this property? In order for these asymmetries to be in
the defender's favor? So the most conservative assumption that we
wanted to make for the time that it takes the malicious proxy helper to
compute the right answer is that it's instantaneous. So we do assume
that he needs to receive the challenge before he knows which response
to provide, but we assume that there's no computation time. That as
soon as he gets the challenge he just sends back the right response. So
we're going to assume that that time is zero.
The communication time to the proxy is this Tproxy communication here.
That's these red arrows, the time that it takes the information to get
to the proxy helper and back. The legitimate checksum computation time
is this. The legitimate peripheral is going to take some amount of time
to execute this checksum function. And then because we are talking
about maybe even nanoseconds here, being able to accurately measure
these times isn't necessarily a given. You know, if you're executing on
the main processor and maybe you're just going to use RDTSC or
something like that as your timing mechanism, there's some, you know,
quantum that is the shortest interval of time that can be accurately
measured. And that actually comes into play when you think about
putting a protocol like this together.
So what are our requirements? The proxy communication needs to take
longer than the legitimate checksum computation. Right? If it doesn't
then the adversary, you know, in this conservative environment where he
already knows the answer as soon as he receives the challenge is going
to have an advantage. So we need the property that the proxy
communication latency is greater than the legitimate checksum
computation time on the peripheral. And the implication that that's
going to have is the peripheral doesn't have a lot of time to sit
there and run this checksum. And some of the existing proposals for the
software attestation checksums were to do a pseudo-random memory
traversal of the entire memory space of the target device. So that's
roughly n log n pseudo-random memory accesses. And that takes too long.
If you do that for any appreciable amount of memory, it will quickly
take longer than it takes to exchange an Ethernet packet with your
next-door neighbor, for example.
So the overhead that the proxy actually causes from the perspective of
the verifiers being able to try to detect something is the time spent
in communication with the malicious proxy but minus the legitimate
computation because the verifier doesn't know. He thinks the thing is
sitting there computing legitimately. And finally whatever this
overhead turns out to be, it needs to be big enough to measure because
if it's too small to measure, you know, we can't tell.
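Putting those requirements together, one can write the feasibility condition as a small check; the struct and parameter names below are illustrative assumptions, not values or code from the talk:

    /* The constraints written out as a feasibility check. The adversary's
     * computation time is taken to be zero, per the conservative
     * assumption above. Names and the struct are illustrative. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t t_proxy_comm_ns;  /* round trip to the fastest plausible proxy  */
        uint64_t t_checksum_ns;    /* legitimate checksum time on the peripheral */
        uint64_t t_quantum_ns;     /* smallest reliably measurable interval      */
    } timing_params;

    bool latency_attestation_feasible(const timing_params *p)
    {
        /* Proxy communication must take longer than the honest checksum... */
        if (p->t_proxy_comm_ns <= p->t_checksum_ns)
            return false;
        /* ...and the attack-only overhead must be big enough to measure. */
        uint64_t overhead = p->t_proxy_comm_ns - p->t_checksum_ns;
        return overhead > p->t_quantum_ns;
    }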
So how do we operate under these constraints? What does a protocol or a
checksum function look like that can meet these requirements? So the
basic mechanism is, well, we can't check all the memory in one go
because that is too much execution time on the peripheral device. So
we're going to use multiple nonce-checksum pairs. All right? Each nonce
is going to result in the peripheral doing one of these checksum
functions over only some small amount of memory. So you're going to
need more than one of these in order to get good coverage of the memory
space on your peripheral device. And so what you're going to end up
with is the host CPU acting as verifier sends the first nonce to the
peripheral device. It computes. The answer comes back. You know,
naively it sends the second nonce, but by the time it sends the second
nonce there's some idle time here.
And if you have to do this many hundreds or even thousands of times to
get good coverage of your peripheral device's memory then these idle
times add up. And so that ends up serving as another source of slop in
the types of expectations that a verifier can set for run time. So, you
know, a simpler way to say this is we want the utilization of our
peripheral device's processor to be 100%. And we'd also sort of like
the utilization of the PCI bus between our peripheral and the processor
to be 100%. We want that thing to be working as hard as it can so that
any interference by an attacker attempting to change something is going
to cause some kind of overhead.
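One way to picture that 100% utilization goal is a pipelined verifier loop that keeps the next nonce in flight while the current checksum is still being computed; the bus helpers below are hypothetical:

    /* A sketch of the pipelined loop: nonce i+1 goes out while the
     * peripheral is still answering nonce i, so neither the bus nor the
     * NIC microcontroller sits idle. The bus helpers are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>

    extern void bus_send_nonce(uint32_t nonce);
    extern uint32_t bus_read_checksum(void);   /* polls until result ready */
    extern bool response_ok(int round, uint32_t nonce, uint32_t resp);

    bool verify_all_rounds(const uint32_t *nonces, int rounds)
    {
        bus_send_nonce(nonces[0]);
        for (int i = 0; i < rounds; i++) {
            if (i + 1 < rounds)
                bus_send_nonce(nonces[i + 1]); /* next nonce already in flight */
            uint32_t resp = bus_read_checksum();
            if (!response_ok(i, nonces[i], resp))
                return false;                  /* wrong answer (or too late) */
        }
        return true;
    }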
>> : [Inaudible] slop or error [inaudible]?
>> Jonathon McCune: Because if we go back -- Let me see.
>> : I mean, it's not during the measurement time, so what are you
concerned about?
>> Jonathon McCune: So it -- Okay, so there's a question of whether the
measurement time is -- You know, are you measuring this and then this
and then this, or are you just measuring something at the end? And
certainly if you just measure something at the end then all these idle
times add up to cause you problems. But even if you have timing
expectations for each one, the little processor inside the network card
is idle here and, you know, if the checksum function was a perfect
cryptographic hash function then maybe you wouldn't have to care as
much. But this time is time that an adversary might be able to be
crunching on something. You know, it might increase the chances that
he's able to correctly guess an input to the checksum function. You
know...
>> : All right. He knows nothing about nonce two yet at that time. If
pre-calculating were helpful, he has all the time in the world to
pre-calculate before nonce one.
>> Jonathon McCune: So it also turns out to be a practical limitation
for some devices that sending, for example, 128 bits as a nonce is
problematic because you can't do it in one bus transaction. And so if
you want to, at least for the device we considered, we wanted to have
one bus-send and one bus-receive. And we ended up using a 32-bit nonce.
You know, I'm not going to say that this is the only way that you could
design this. I mean I guess the short answer is if your nonce is not a
strong cryptographic nonce then you really don't want to give the bad
guy time to compute. And even if you do use a strong cryptographic
nonce, nobody's proven any of these checksum functions to look like the
kinds of cryptographic functions you'd like them to be either.
So I guess it's a little bit hand-wavy but we found it to be a
conservative design. Another question? Jay?
>> : Are you going to require that every single one of the responses
be correct or are you going to accept 99.9% [inaudible]?
>> Jonathon McCune: So that's a good question. We were able to get
correct responses for hundreds of roundtrips on the device that we used
before things got out of sync. So the final design that we used
eliminates this idle time. I was going to talk about it in a minute
but....
>> : [Inaudible] idle time, are you going to expect 100%? So if even
one of the nonce-checksum responses is wrong you're going to say
[inaudible]?
>> Jonathon McCune: I'd be willing to accept a design where 99.9 was a
tolerable parameter if we understood all the sources of variance. You
know, we didn't use the bus analyzer when we set this up and ran it and
explored with it. I couldn't tell you why every once in a while the
latency on the bus goes up. You know, maybe it's another device having
an interrupt. I just don't know. And so it...
>> : Yeah, that's my concern. I'm concerned that, yeah, occasional
interrupts will cause it to be disrupted. But also another source of
disruption is one of your nonce's actually testing the location where
they've stored their malware and that one is the one that it will fail
on.
>> Jonathon McCune: Right.
>> : So I'm concerned that if you have to accept the occasional glitch
because of bus interference
>> Jonathon McCune: Right.
>> : you're also going to [inaudible].
>> Jonathon McCune: Right. And I mean you'd have to figure out what the
probability is that, you know, it lands right on the attacker's
critical instruction only once.
>> : [Inaudible] the other way around, right? If the attacker touches
one critical instruction then that means that you only infrequently
fail the test and that might just look like bus noise. Right?
>> Jonathon McCune: So you do want to run this enough times that you
cover every memory location with very high probability.
>> : Right. But if every location is only covered in a few of the
samples then those few samples might look like your few bus errors,
right? Or, I mean, I'm assuming here that the attacker is in the
comfortable position of only having to slow down when you're having to
touch his memory as opposed to when you touch any memory. But if you
give me that assumption then what Jay's saying is that that attacker is
now in a position where the fact that he only changed bytes of code to
[inaudible] vulnerability looks indistinguishable from sampling error.
>> Jonathon McCune: Yes. So I don't want to defend any particular
parameter choices on a particular device on a particular host. You're
going to have to analyze exactly which parameters make sense and are
acceptable. We sort of approach this as a bit of a reverse engineering
exercise as well. So if you're the designer of one of these devices,
you want to, you know, have a checksum function or an entire protocol
that makes a lot of sense for the, you know, particular microcontroller
in your NIC. For example, this one -- or the one that we ended up
using for our prototype was MIPS architecture but certain instructions
just weren't there. You know, so it was this stripped-down version of
MIPS that was the bare minimum amount of functionality that they
needed. So it's difficult for me to make any general statement, I guess
is what I want to say. I think our main goal is to raise the awareness
of this, you know, possibility for a solution. And I do think that the
latency characteristics are in the defender's favor, and that's a big
step from previous types of software-based attestation, which were sort of
all or nothing.
>> : You assumed that [inaudible] devices idle, not doing anything
useful?
>> Jonathon McCune: Yes. So we need to reset the device into some kind
of known-state in order to even make sense out of the measurements that
come back, right out of the checksum hashes that come back.
>> : [Inaudible] be smart knowing that you are doing the checksum and
then these remain [inaudible].
>> Jonathon McCune: So certainly. I mean in any type of detector, where
the bad guy sees the detector coming, you know, he can at least leave
the system.
>> : Right.
>> Jonathon McCune: You know, maybe not -- It's sort of system specific
whether he can hide and reinfect. I mean presumably the root
vulnerability is still there. So I mean as a detection mechanism, it's
sort of something we suffer from.
>> : So you just said that in order to check you have to reset first.
But...
>> Jonathon McCune: Yeah. Well, you need to know what state you expect
the device to be in. A trivial way to do that is to reset it to some
known-state.
>> : Well, if you're willing to admit a reset, this is [inaudible] way
back to the production slides. Feel free to
>> Jonathon McCune: No, it's okay.
>> : delay this question [inaudible]. But if you're willing to reset it
seems like you can just use a -- I'm not sure what the correct phrase
is but trust the boot path. I mean [inaudible] firmware. Why check the
firmware when you've just [inaudible] firmware down? I mean this
predicates the idea that we do have to trust the firmware boot-loader.
>> Jonathon McCune: Right.
>> : But...
>> Jonathon McCune: And I guess that has, you know, some of these
motivational examples were to show that the trust is maybe not a good
idea.
>> : I mean, I guess I certainly can believe that the whole firmware
for the device is too big to want to [inaudible]. The thing that
accepts the firmware? I mean that's also...
>> Jonathon McCune: Well, so I mean this sort of comes back to the
example in the beginning...
>> : [Inaudible] this thing, right?
>> Jonathon McCune: Say again.
>> : The little boot-loader this thing on the device waiting to accept
the firmware that you know you're running when you wiggle the reset
line on the PCI bus.
>> Jonathon McCune: Right.
>> : That piece of coding seems like it's a lot smaller than all this
system we're talking about here.
>> Jonathon McCune: But you're also assuming you have some kind of, you
know, signature checker. And not all devices necessarily...
>> : Well, I'm assuming that I've reset the device and now it is in a
mode where the only thing it does is accepts from the CPU the firmware.
And so you don't have to check it. You're writing it, right? You're
controlling the boot sequence.
>> Jonathon McCune: So I think that's a reasonable design for a
peripheral device, but in practice there are a lot of peripheral
devices that aren't designed like that.
>> : Well, but you're presupposing here you're going to alter the
device, right?
>> Jonathon McCune: No, this is only changing the firmware. We don't
have to change anything in the hardware. This is just a software
update.
>> : Oh, I see. This is only changing the firmware and not the boot-loader,
the [inaudible]?
>> Jonathon McCune: Yeah. But even that's a gray area. Right, that
Apple aluminum keyboard, it is incredibly impoverished. The signature
checker for its firmware updates is in the part that runs on the host
environment.
>> : But you don't need a signature checker. You need a trusted way to
make sure that when you send the firmware it's actually acceptable
firmware. Now assuming that the reset line might do that, I'm assuming
that there is a designer who can have that -- If the designer of the
device were to add one thing, they might add that as [inaudible].
>> Jonathon McCune: Sure. Yes. They might. I mean if you wanted to, you
know...
>> : Which one would be better? I guess is what I'm trying to ask. Is
there a place where this approach dominates, is what I'm trying to
understand?
>> Jonathon McCune: I mean, a piece of hardware that already exists and
where the architecture you described cannot be applied, there's not
some way to reset it to load, you know, certain code.
>> : [Inaudible] that there were constraints on what you were able to
do. And you said, "Well, if you were the designer of this device you
would fix those constraints." I mean, I thought you were assuming that
we were going to understand this approach and then designers would
build future systems to make this approach practical. And so it seems
if you're admitting the designer into the loop then you can also just
ask the designer whether they want to do a trusted boot path instead.
This is what I'm trying to understand, is if the designer is involved
for future devices -- And I guess...
>> Jonathon McCune: So my intuition to date is that the space is so
heterogeneous that there isn't necessarily a simple, single answer to
getting, you know, integrity of the code that loads on any of our
devices. I certainly agree that the simplest solution is the best. You
know, if you could get some kind of start up procedure where you had a
strong guarantee that it always loaded firmware "from blah" then great.
But I mean even the assumption I'm making here that the main processor
is a good place from which to do all these checks, that's pretty
strong. You know, I don't think it's a given that the best way to build
our future system is, you know, like micro-code patches for our main
processors where we squirt the firmware into each peripheral device as
we bring up the system.
I mean it's a model that one could conceive, but I can't make a case
that it should be exclusively adopted.
>> : Okay.
>> Jonathon McCune: All right. So what we really did is try to keep the
bus utilized at the same time as the processor was computing the
checksum, the processor inside the NIC. So we basically wanted nonce
two to be in flight before the computation that was done in response to
nonce one had completed. And in fact, you know, again just trying to
close the window on the time interval that the adversary has to try to
manipulate this process, we allow nonce two to select which part of
what we actually have inside -- a larger checksum vector -- you know,
which part of that checksum vector to return. And again on our device
it was either a 32 or a 64-bit limitation per bus transaction. And so
we weren't able to return our full checksum state in every single one
of these.
And it basically is obligating any type of proxy helper to pre-load
more data into the network card. So you know, again even if we had a
full sort of cryptographically-strong nonce, if the checksum state was
512 bits and we could only send back 64 then which 64 bits to send back
would be selected by the arrival of nonce two.
So this is just the same model repeated. Each subset of the full
internal checksum state gets selected by the arrival of the incoming
nonce.
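A tiny sketch of that selection idea, with illustrative sizes (a 512-bit internal state returned 32 bits at a time; this is not the device's actual firmware):

    /* Illustrative sizes: 512 bits of internal checksum state, returned
     * 32 bits per bus transaction, with the incoming nonce picking the
     * slice. A sketch of the idea, not the device's firmware. */
    #include <stdint.h>

    #define STATE_WORDS 16           /* 16 x 32 bits = 512-bit state */

    static uint32_t checksum_state[STATE_WORDS];

    /* Runs on the peripheral when nonce i+1 arrives: its low bits select
     * which word of the ongoing checksum state answers nonce i. */
    uint32_t select_response(uint32_t incoming_nonce)
    {
        return checksum_state[incoming_nonce % STATE_WORDS];
    }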
>> : [Inaudible] interrupts and cause a variance in latency?
>> Jonathon McCune: So for this particular device -- It's complicated.
Sending to the device and...
>> : The relationship between the device and the CPU itself.
>> Jonathon McCune: Yeah, so it's not so simple as, you know, two ends
of a network connection where one says, "Send," and the other one
happens to get it. I'm trying to remember the exact specifics. But it's
basically the host CPU polls a shared address space. So when the host
CPU wants to transmit a nonce to the peripheral, it just writes memory, right,
and the magic happens in the hardware. But when the host CPU wants to
receive a checksum, it doesn't actually just sit there and wait for an
interrupt from the device. It reads a location in memory-mapped I/O
space that happens to correspond to where that checksum's going to be
written. And...
>> : But when it writes to memory that means it is going to -- Right,
it's going to go over the shared memory bus. Won't that cause an
interrupt on the device?
>> Jonathon McCune: So this device that we used doesn't get an
interrupt. It also polls. So it has this kind of mailbox abstraction,
the details of which I would have to punt to Yanlin for. But it's not
the simple interrupt-driven, event-driven model that we're used to
seeing on our main processor. When, you know, a new packet arrived and
the NIC sends off an interrupt to the host CPU, I mean, it's a NIC. It
supports that mode of operation. The mode of operation where we were
able to get good control of the timing behavior of everything was this
memory-mapped I/O with carefully synchronized writes and reads from the host
processor. So like I don't go into this here but I talked about the
granularity with which the host processor can measure time. That's not
just how many cycles elapse between two consecutive RDTSC
instructions or something like that. Because this results in a bus
transaction in order to get back the answer from the memory read, it
takes a lot more than one cycle to fully execute. And so it turns out
that if that's your granularity of time measurement, you run into a lot
of challenges of being able to be like a one-hop gigabit Ethernet
attacker.
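A sketch of what that host-side timed round over memory-mapped I/O might look like; the mailbox layout and NOT_READY sentinel are assumptions for illustration:

    /* Host-side timing over memory-mapped I/O, sketched with assumed
     * mailbox addresses and an assumed NOT_READY sentinel. Each poll is
     * a full bus transaction, so the effective measurement quantum is
     * one MMIO read, far coarser than one CPU cycle. */
    #include <stdint.h>
    #include <x86intrin.h>                  /* __rdtsc() */

    #define NOT_READY 0xFFFFFFFFu

    volatile uint32_t *mmio_nonce;          /* mapped mailbox: nonce in   */
    volatile uint32_t *mmio_checksum;       /* mapped mailbox: result out */

    uint64_t timed_round(uint32_t nonce, uint32_t *resp_out)
    {
        uint64_t start = __rdtsc();
        *mmio_nonce = nonce;                /* one bus write starts the round */
        uint32_t r;
        do {
            r = *mmio_checksum;             /* each poll costs a bus round trip */
        } while (r == NOT_READY);
        *resp_out = r;
        return __rdtsc() - start;           /* elapsed time in TSC ticks */
    }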
So what we opted to do for our prototype was... [ Silence ] Sorry. I
confused myself a little bit. But, yeah, I'm sorry. I lost exactly
which part of your question I was going for. Yeah. Do you want to
re-ask anything or shall I just...
>> : I'll wait until the end.
>> Jonathon McCune: Okay. So we gave the name VIPER to this latency-based
attestation protocol where, you know, we do a lot of nonce-checksum
pairs between the host CPU acting as verifier and the peripheral
device, you know, acting as the target where we'd like to do some
verification. In a real system a proxy attacker isn't necessarily just
a bad guy that's connected over the Ethernet link. You know, in a real
system you might have an attacker that's another peripheral device. Now
this comes back to the question from the early slide on, "In what order
should we verify the different devices inside our system?" And so, you
know, the sort of intuition is that you want to check faster
peripherals before you check slower peripherals because a faster
peripheral could masquerade as a slower one. Now it comes down
to the details of how a particular checksum function is implemented.
You know, the right kind of checksum function for a graphics card with
lots of parallel threads of execution is very different from the right
kind of checksum function for a little NIC microcontroller.
And I would say that the right kind of things to do for graphics
cards are still unsolved; you know, there are still open questions.
So we did prototype this on this particular Netgear GA620. We were able
to get firmware for it -- that's why we chose it -- because we could actually
have the full source code to the firmware and understand things like
how its memory-mapped I/O works, so we don't just have to use
interrupt-driven communication. The prototype that Yanlin put together
implements the checksum and communication code in just hundreds of
instructions, so it's not a huge thing. We also had as our hash
function component just an off-the-shelf SHA-1 implementation. This
particular card was PCI-X; that's sort of a weird bus standard that was
a high end standard for a short period of time before PCI express came
out. So we ended up having to use a somewhat unusual machine as our test
platform just because it happened to have one of these slots. You know,
this ancient Linux [inaudible] version is just because the firmware
build environment for this NIC happened to work there, and we didn't
undertake the effort of porting it forward. Yeah?
>> : Just I have a delayed question for your previous slide.
>> Jonathon McCune: Okay.
>> : Is it conceivable that one peripheral is faster than the other
for its own function but slower than the other for the other's
function?
>> Jonathon McCune: I don't see why it wouldn't be possible.
>> : Because then it wouldn't be clear how to order them because you
might do the one that finishes faster first but then...
>> Jonathon McCune: So...
>> : But then another one that is considered slower could do it faster?
>> Jonathon McCune: So another problem -- I grant that problem, and
another problem is the difference between the fastest and the slowest
is immense. If you think about a GPU compared to the microcontroller in
your keyboard if you're talking about peripherals that go all the way
out that far. So I think two actual fallouts that are recommendations
for hardware design are the main processor or whatever ends up being
the verifier needs to be able to strongly address who it's talking to,
just enumeration of what hardware exists in this system. And maybe you
can pre-configure that, but then once you know that, "I want to talk to
the graphics card and only the graphics card right now," my
understanding is that the newest PCI Express specifications do allow
some of that. But there's still a huge legacy, you know, all of your
legacy I/O in the south bridge tends to be this one integrated super
I/O chip these days, and so on.
>> : Unless you got the proxy attack problem. Even if you know who
you're talking to, you still have the proxy problem.
>> Jonathon McCune: Yes, but [inaudible] knowing who you're talking to
might also be making it...
>> : [Inaudible] talking to somebody else. Yeah, that's helpful
[inaudible] too.
>> Jonathon McCune: Yes. So I think PCI Express could do it. It's a
much more hub and spoke-looking architecture. And I think the new
versions have some access control mechanisms. But all the legacy stuff,
right, once you hop a bridge to old PCI, I don't know how you could fix
that. But maybe -- Moving on to my second point about the massive
difference from fastest to slowest: something like a keyboard that's so
important for the security of your system, right? You type your
passwords into it. All new sensitive information enters through it in
some sense. I think it makes sense to actually have some amount of
security-specific hardware functionality in keyboards if we're going to
architect a whole system where we try to worry about these kinds of
problems because it's just too wimpy, otherwise, to make any kind of
statement about it. If you can compromise anything, right -- The
firmware in your little flash drives that you plug in that makes it
look like a, you know, SCSI device could easily be compromised and keep
up with the microcontroller in your keyboard, right, the firmware in a
USB hub or something like that.
So I think, you know, regardless of whether software-based attestation
is the answer or not, strong addressing of devices and some amount of
security functionality in the human input devices seems like things
that make a lot of sense.
Okay. So I have some detail about this particular network card and what
it looks like. I'm going to proceed. So this actually is a dual-core
network card. It has two microcontrollers in it. They are MIPS
architecture, although some instructions are deprecated or just not
there. They have a little bit of private memory which actually goes a
very, very long way to helping us with our implementation. How to do
one of these checksum functions for concurrent execution with shared
memory? I mean, we already make a lot of assumptions for these checksum
functions. I'd hate to add that. It turns out that the amount of
scratch-pad memory in these things is not the same, one has twice as
much as the other.
>> : [Inaudible] intended functions for the CPUs? Do they do
[inaudible] or...?
>> Jonathon McCune: Yeah, good question. CPU A does just about
everything but when it has a worker-type of task where in a more
traditional context you might think, "Oh, I'll fork a thread and just
send it off," it defers that kind of work to CPU B. Examples of those
particular tasks? I don't want to say something wrong. So I'm going to
say I'm not sure.
>> : Do they have one [inaudible], right?
>> Jonathon McCune: They do. This is RAM. This scratch-pad memory is
random access memory that, I think, powers up [inaudible] zero. So the
obvious, naïve design is if you put -- Let me finish. They also share a
much larger buffer that's SRAM that they can both access, and it's also
memory-mapped to host world. So this is sort of the shared internal to
the NIC buffer that the host processor can talk to. These scratch-pad
memories are not addressable from anything except the CPU's to which
they are connected. So the obvious bad design is to put your checksum
function in the shared memory because one CPU can be evil while the
other one's good and make changes to the memory. So this is a flavor of
the problem that Jay mentioned about two different -- you mentioned
heterogeneous devices at the same time, but here, though, it's very
similar.
So what we actually did is implement the checksum and the hash function
in CPU A because there was more memory and just the checksum in CPU B
and again let CPU A be the authority on how execution proceeds. So, it
doesn't really matter where you start, but without loss of generality
one of them goes first, does its own verification and remains in the
verified code. It just sits there and waits for the other one to
finish, lets the other one finish, and then, you know, that gives us
our root of trust from which the hash function can be invoked to
compute a hash over the actual firmware image. I'm not sure if it's
mapped into this same address space or not, but the flash that holds
the executable firmware is either part of this address space or some
other shared address space. You know, without going through the
firmware update process, it's not read-only but it takes a long time to
write to it.
Okay. So I think in our prototype we do CPU B first. The checksum runs.
The program counter stays inside the verified code until CPU A has done
both the checksum and the checksum has covered the hash function, and
then the hash function can be used to hash whatever.
Okay. So our checksum function was pretty unremarkable. I'm very sorry
to say that we still don't have a checksum function that I can make
strong mathematical statements about. If anybody has the background or
the interest in doing something like that, I'd be delighted to talk to
them. Maybe the interesting attribute of it is we took care to fit each
block of this checksum function into the cache in the processor very
carefully so that if the adversary tries to modify one of these blocks
it'll overflow the cache and you'll get hopefully, you know,
pathological overhead, not the overhead of incrementally one more
instruction but the overhead of a bunch of cache misses as well.
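For flavor, a checksum round in this style might look like the following; this is only an illustrative stand-in with an arbitrary linear-congruential traversal, not VIPER's actual checksum function:

    /* An illustrative stand-in for one checksum round, not VIPER's actual
     * function: simple adds and XORs over a pseudo-random traversal, with
     * the understanding that each unrolled block is sized to exactly fit
     * the cache so that any inserted instruction overflows it. */
    #include <stdint.h>

    #define MEM_WORDS 4096u          /* hypothetical scratch-pad size in words */

    uint32_t checksum_round(const uint32_t *mem, uint32_t nonce, int iters)
    {
        uint32_t sum = nonce;
        uint32_t idx = nonce;
        for (int i = 0; i < iters; i++) {
            idx = (idx * 1103515245u + 12345u) % MEM_WORDS;  /* PRNG traversal */
            sum ^= mem[idx];                 /* mix in memory contents */
            sum += (sum << 3) ^ idx;         /* cheap, order-sensitive mixing */
        }
        return sum;
    }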
So we did mock this up. I mean, we prototyped it, you know, to the best
of our ability without doing mathematical proofs of the checksum
function. The attacks that we wanted to consider were an Ethernet-based
proxy attack where we had like a cross-over cable, like the
shortest Ethernet link we could come up with. And in fact we actually
had the other end of the Ethernet cable connected to the same network
card which knew the answer and responded without even interrupting the
OS. So we worked very hard to make the proxy attacker send back a
response as quickly as we possibly could by doing it exclusively within
the firmware of the attacker NIC.
The other two attacks are sort of the classical best attacks on the
types of checksum functions that people have proposed to date. The
checksum functions are position dependent, so if you move the whole
thing and run it then you need to forge both the data pointer and the
program counter. You know, if you run from a malicious location and
compute over the benign, unmodified checksum function image in its
intended location then you're only really forging the program counter
because the data pointer values are what you would've expected them to
be.
But anyway these are the two attacks that are most commonly, you know,
the convention in the literature is that they're the best ones because
they only add an instruction or two. And then an Ethernet-based proxy
attack. So despite our efforts to make that Ethernet proxy attack as
fast as we possibly could between these two network interfaces, we
still couldn't get it faster than 43 microseconds. I don't know where
that overhead comes from. If you do the math on gigabit per second and
make, you know, optimistic assumptions about latency, if Yanlin did the
math right it's 1.2 microseconds.
So this is a lot slower than what we thought the best possible Ethernet
proxy attack might be. So I think a bad guy who really understands the
lower levels can probably do better than we did. Both of our attacks
for the data pointer and program counter forging attacks turn out to
add five extra instructions, and the way they fit in the various
blocks of our checksum implementation cause two cache misses.
So what does this look like? Our benign case ended up taking 2200
nanoseconds measured with RDTSC. So we can't necessarily measure just
one nanosecond. But we did have a nice actual grouping of the two
forging attacks, the green and blue lines there. And then the red line
is sort of theoretical, analytical expectation of the best that
Ethernet might be able to do. So that's the, you know, 1.2. I think
that's twice 1.2. But the actual implemented Ethernet attack, despite
our efforts to make it as fast as possible, is way off the top. So I
don't know what to infer from that. You know, whether our
implementation was bad or we don't understand Ethernet well enough to
recognize what the minimum latency is even for a small packet.
But they did group nicely, you know, inside of our ability to measure
passage of these short intervals of time. And we were able to set a
threshold. We used 4.5% over the benign case, and that worked reliably
for us for the, you know, sizes of scratch-pad memory and our checksum
function implementation for this particular card. So we were able to
detect, you know, one-bit changes in the checksum function
implementation. I have my --
>> : Why'd you choose the 4.5 and not 4.6?
>> Jonathon McCune: No good reason. I mean, you know, it's in between
them and it's conservatively closer to the other one. And we could run
a lot more experiments. We only had two or three of these devices, and
so there's some amount of inter-device variance but not a ton. I think
if you tried to roll this out on a large scale that might be a problem
that comes up. You might have to tweak this threshold based on how much
variance you see between devices.
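That threshold works out to simple arithmetic; the 2200 ns benign time and 4.5% margin are the numbers from the talk:

    /* The threshold from the talk as arithmetic: a benign time of about
     * 2200 ns plus a 4.5% margin gives a cutoff near 2299 ns. */
    #include <stdbool.h>
    #include <stdint.h>

    bool within_threshold(uint64_t measured_ns)
    {
        const uint64_t benign_ns = 2200;
        const uint64_t limit_ns = benign_ns + (benign_ns * 45) / 1000; /* +4.5% */
        return measured_ns <= limit_ns;
    }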
>> : I guess when I see 4.5 it looks carefully chosen. If you say 5% it
would convey the impression that, "Oh, we just chose this."
>> Jonathon McCune: Yeah. Guilty, I guess.
[ Laughing ]
>> Jonathon McCune: All right. So we sort of talked about this but I'm
worried about -- You know, even if we can overcome a lot of these
assumptions I'm worried about the really resource-impoverished devices
that have a big footprint in the security properties of a system like a
keyboard. I don't think that even if this system worked incredibly well
that it would necessarily be ready to protect your keyboard from, you
know, compromised other subsystems in your device. And then the other
obvious practical challenge for any kind of attestation system is
managing expectations. What's the white list? What's supposed to be in
the device? Certainly reverse engineering this for more than just a
small handful of devices would be a large undertaking.
So a lot of this related work is either the existing literature on the
software-based attestation checksum functions or existing literature on
the DMA peer-to-peer attacks or these firmware reverse engineering
types of attacks.
In particular the first one, Loic Duflot has done a lot of work on, you
know, what happens if the firmware goes bad in your system. He proposed
the more -- sort of maybe getting towards John's earlier point -- the
Broadcom NIC that he was looking at had a debug interface, where if you
did trust the main processor, you could just use that debug interface
to measure what firmware's running in the device and convince yourself
whether it's what you expected or not. You know, that worked for that
device.
So I hope I made the point that peripheral firmware integrity is an
important problem. I had a bunch more slides on these attacks, but I
made the assumption that you guys would appreciate the risks. I think
compared to other domains for software-based attestation using it on
peripheral devices is a relatively good fit. The communication latency,
instead of being, you know, noise to be overcome actually looks like an
asset to the design of some new protocols. You know, we showed that it
was fairly practical to prototype on a device if you have access to the
internals of that device. And I do think that if this stuff is going to
make the leap into a practical application that this is one place that
makes sense. We've had a few conversations with some of the people in
the room about, you know, maybe doing it in a data center where you
want to verify the hypervisor layer and you know how your fiber's
routed in your data center or something like that.
That's a scenario with maybe a lot more compute power available. I
really appreciate your attention. Thank you for all the questions. I
don't mind being pushed. This is the paper that describes the gory
details of what I tried to introduce today. We had an earlier sort of
more preliminary version of this where we tried to apply software-based
attestation to that keyboard with the firmware problem. This sort of
lays out some of the basics in there too. So I'm always happy to
discuss. Thank you.
[ Applause ]
>> : So traditionally I think the vendor of the peripheral tries to protect
the firmware; the [inaudible] to a system is very, very narrow. It'd
probably expose a few [inaudible] that the driver [inaudible]. But your
approach basically widens the interface, right? You allow the system
to actually send computation tasks to the firmware. Well, that
actually could be inviting [inaudible] more attacks. Right?
>> Jonathon McCune: Yeah, so...
>> : [Inaudible] think about the [inaudible] attack, malware that
jeopardizes the system just keeps on sending computation tasks to
the peripheral [inaudible] for example.
>> Jonathon McCune: So just to be clear, the only thing that gets sent
in is, you know, nonces and requests to this. It's not opening up more
-- The host processor is not just blindly writing out executable bytes
into the memory of this. It's just sending input to some portion that
was already there. Yes, it's another function that runs inside this
device, so in that sense the attack surface may increment. Depending on
the device, it may increment a lot or only a small amount. Regarding
denial of service, you know, you can tell the device to power down or
to enter a low power state or something like this. I think as a
practical consideration, a piece of software in the system with the
ability to, you know, just to send it down the "run this checksum
function forever so as not to send or receive packets" can achieve that
goal in many other ways. I'm not sure we make things worse in that
regard.
>> Bryan Parno: Okay. Let's thank Jon again.
[ Applause ]