>> Bryan Parno: So hello, everybody. Thank you for coming out to
another talk in what I know is a very busy time. I'm here to introduce
Jonathan McCune from Carnegie Mellon where Jon has done all kinds of
interesting projects around trusted computing involving talking to
TPMs, and I've heard tell that he can talk to TPMs through his
fingertips and occasionally writes hypervisors in his sleep. So some of
his previous work was on systems like Flicker and TrustVisor for
creating minimal trusted computing bases, and today he's looking at
what else might be untrustworthy about our systems.
>> Jonathan McCune: So thank you. It's my pleasure to be here. I do
appreciate your attention. Please feel free to interrupt me. I would be
happy to go down, you know, discussion paths. I think there are design
questions that get raised by some of these problems that might be fun
to talk about. So I also want to credit graduate student Yanlin Li and
then Professor Adrian Perrig; they were collaborators in this.
So maybe you guys know what this is but just a block diagram with more
than one processor and a bunch of different types of devices. What does
this look like? Maybe a chip set? This is the kind of architecture that
we tend to find in all of our PC-style devices today. And these aren't
just dumb ASICs anymore, right? These are basically entire computer
systems. And so a reasonable question to ask is, "Well, what code runs
on all these other processors, and what are the security properties of
that code?" And in particular for the purposes of this talk, I think
it's useful to think about it from the perspective of the network card.
The reason I want to start there is because that's generally the most
exposed surface of the system. So if a bad guy's coming in over the
internet, there's a good chance he encounters your network card first.
So I just want to go through a few anecdotes that I think are sort of
fun or sort of scary depending on what kind of mood I'm in. But this is
an off-the-shelf network card. And this guy Arrigo Triulzi, you know,
he likes to reverse engineer things but he's a human, a mortal, and he
bought a ten-pack of these network cards and just wanted to see what he
could figure out. And before breaking the tenth card, he was able to
successfully replace the firmware on the card with code of his
choosing. Now he did this with physical possession of these devices so
this wasn't necessarily, you know, tickling some latent vulnerability,
but it sort of shows what's possible. And he decided to use peer-to-peer
bus communication -- something that it turns out is legal but not
very commonly used -- to inject other code into the graphics card. And,
you know, this gives rise to this scenario where you have maybe a bot
without actually having a compromised operating system. There's a lot
of memory and compute power inside of our graphics cards today, and so
they can easily, you know, generate malicious traffic at a rate that
can saturate the network card or something like that.
So, you know, there are some very sophisticated subsystems inside our
computers now. Another example: Apple aluminum keyboards, like the one
on the left, run firmware, and there are firmware updates for these
keyboards. I'm not sure what is so complicated about that. But
there's a vulnerability in the firmware update mechanism and so, you
know, you can infect the keyboard and subsequently, potentially infect
the host. And to make the story from the previous slide real, there are
now actual known vulnerabilities in the remote-facing interfaces of
certain network cards, you know, these manageability features where, you
know, maliciously crafted packets can overflow buffers.
So, you know, it's important to recognize what are the root issues
here. So malware on peripherals can readily eavesdrop on any data that
they actually handle. Especially, you know, good prudent practice says
that you don't trust the network anyway, but having a man in the
network card certainly gives them control over a lot of aspects of your
network. You know, if the IOMMU, right, if there's not some intelligent
configuration of what memory a peripheral device can access with a DMA
transaction, then you can run into
problems with, you know, unfettered access to memory.
Other peripherals can be infected; or not even necessarily infected,
right, subverted to perform malicious work like the NIC GPU example.
And I think this bottom thing sort of drives it home. Your system can
still be a bot even if there's nothing wrong with the operating system
and the applications on top.
So what's the state of the art in trying to keep the, you know,
firmware-level portions of our systems in a state that we're happy
with? So signed firmware, signed BIOS updates, you know, digital
signatures -- these make us happy. You know, you put a public key
fingerprint in some immutable ROM location, and you make sure that any
code that comes in purporting to be a firmware update has a signature
that checks. That doesn't say anything about how new that firmware is,
so it could be the old version with a known vulnerability pretty
readily.
An unfortunate recent example is this Intel disclosure. Their, you
know, brand new security feature, Trusted Execution Technology,
something that I do personally think is on the right track,
unfortunately had a pretty serious vulnerability in what is legitimately
signed code that's out there in the wild; systems will run it. And
unfortunately the fix for this is pretty drastic. Every SINIT module
ever released happened to have this vulnerability at the time it was
disclosed, so they all needed to be updated. It turns out that there's
some CPU microcode problems. And so, you know, microcode is
ephemeral in our modern processors, so that means every power-cycle
forever we hope that the right microcode patch gets applied.
And in order to ensure that there aren't rollback attacks to a previous
version, there are vendor-specific BIOS changes. So that means every
vendor that shipped a system that's capable of doing this should
technically be updating their BIOS to make sure that it blacklists
these known bad modules. And, I mean, that's just something that the
commodity ecosystem can't really stomach today.
There are a lot of legitimate reasons to roll back BIOSes as well. Right?
If you're an enterprise with ten thousand PCs and the BIOS update
breaks something important, you can believe that the vendor's going to
find a way to roll it back. So I'm not very happy with the state of the
art. I think it's a good idea to do signed code. At least you know
where it comes from, but it certainly doesn't mean that we're done.
So basically it's an open challenge to detect whether or not there's
malware running on our peripherals. And peripheral devices are
interesting because they tend to have some pretty significant resource
constraints, you know, limited memory. Hardware-based protection
mechanisms might be quite expensive relative to the cost of the device
itself. You know, most keyboard micro-controllers can't do public key
cryptography.
And so what we're trying to do here is find a way to actually verify
the integrity of the peripherals' firmware. That means learn for sure
what version of the firmware is running in there. You know, and
hopefully you can cross-check that and find out that it's a legitimate
version from the legitimate vendor and that it's recent. And so just to
drive this home, hopefully you get this, but we want full system
security. Trustworthy execution just on the platform's primary
processor is not enough. Right? We want to know that all these other
peripherals that are basically full computer systems in their own right
are behaving as intended.
And so this brings up the question, how do you even approach this
problem? And which one of these things should we verify first? Can we
assume that once it's been verified it's not going to be subverted
while we verify another one? And so there are a lot of different
heuristics that come to mind about, "Well, should we start with the
primary CPU?" It's the most powerful by some metrics but not
necessarily all metrics. Maybe we care about proximity to the
processor, maybe fewest hops is a useful thing. You know, so there are
all these different metrics that you can dream up, and you'd like to
find an answer that says that one of these metrics is superior to the
others.
So hopefully at this point I've made the case that there can be malware
on peripherals, that it can be a significant threat, and that it's an
important problem to look into and maybe find a way to do something about.
We're going to propose VIPER, a way to verify the integrity of
peripheral devices. We do this using a modified form of a software-based
attestation protocol. I have some background information on
that that I'll come to shortly. And then, we actually prototype this on
an off-the-shelf network card that happens to have open source firmware
that we could modify without also doing reverse engineering.
So our attacker is a remote attacker. He's coming over the network or
something similar. For the purposes of this implementation, we're going
to consider physical attacks to be out of scope. We're actually going
to assume that the host CPU is trustworthy. I mean that's a strong
assumption, but we wanted a place to stand. And the hardware changes
that are in place and coming down the pipe for our PC platforms -- at
least hardware changes for security -- have so far been focused around
the primary processor. So it's an assumption; that's where we're going
to try to start.
In any attestation-style system, the thing that's going to serve as the
verifier needs to have expectations about what it's going to try and
verify. And so we're assuming that this verifier program knows
something about the peripheral and what's supposed to be there. The
attacker model is that the firmware can get compromised. We're not
going to prevent it. We're going to detect it if it happens. We assume
the attacker has fairly immense resources at his disposal at some
remote location. Right, they can co-opt EC2 or whatever. But we're going
to assume that standard cryptographic primitives hold up.
So there we go. So this is a basic motivation for attestation. We want
to get code integrity of the firmware that's running inside our
peripherals. If we can reference a cryptographic hash of that firmware
that we have faith is accurate, then we can cross-reference that with
a golden database of sorts and convince ourselves that the right
firmware is in place. And so this usually looks like some kind of basic
challenge-response protocol. You know, the verifier sends some kind of
nonce to the target environment and back comes a signed -- or maybe a
message authentication code makes sense under certain conditions --
statement of sorts that the verifier can then cross-reference with this
database of known-good things.
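To make that concrete, here is a minimal C sketch of the verifier side of such a challenge-response exchange. The helpers send_challenge, recv_response, fresh_nonce, and the known_good table are hypothetical stand-ins, not an API from the talk:

    /* A minimal sketch (not VIPER's real API) of the verifier side of
     * this challenge-response pattern. send_challenge, recv_response,
     * fresh_nonce, and the known_good table are hypothetical helpers. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define DIGEST_LEN 20                      /* e.g., a SHA-1 digest */

    extern void send_challenge(uint32_t nonce);
    extern void recv_response(uint8_t digest[DIGEST_LEN]);
    extern uint32_t fresh_nonce(void);
    extern const uint8_t known_good[][DIGEST_LEN];  /* "golden" database */
    extern size_t num_known_good;

    bool attest_peripheral(void)
    {
        uint8_t resp[DIGEST_LEN];
        send_challenge(fresh_nonce());         /* nonce prevents replay */
        recv_response(resp);
        for (size_t i = 0; i < num_known_good; i++)
            if (memcmp(resp, known_good[i], DIGEST_LEN) == 0)
                return true;                   /* known-good firmware */
        return false;                          /* unknown or tampered */
    }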
So that basic model we're going to apply here too. But we're going to
apply it using software-based attestation. And are you guys familiar
with that? Has anyone not heard of software-based attestation? So there
were a few head-shakes.
So the idea is to provide the type of root of trust that a hardware
mechanism can provide but without any hardware support. So we assume
explicitly that our peripheral device does not have a secure
coprocessor; it does not have a TPM. This is reasonable because
there's an immense population of devices out there that have no such
support. And it will never make sense for certain price-points to add
support, so it's always going to be something that's in scope.
At a high level, this is glossing over a lot of detail, but the
difference between regular attestation where you have some hardware
root of trust that protects a secret like a private asymmetric key or a
symmetric shared MAC key is that we actually have no secret on the
untrusted device. But we know a lot about the micro-architecture of the
untrusted device, so we can do things like maybe we're going to be able
to have a cycle-accurate simulation of what should happen on that
untrusted device. So we want to try to make a combination cryptographic
hash function and benchmark, right, where if it gets the answer on time
then it has the properties of a cryptographic hash function. And if an
adversary tampers with it then it will either return the wrong answer
or take too long.
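As a rough illustration of that "right answer, on time" check, the verifier logic might look like this C sketch; run_checksum_challenge and expected_checksum are assumed helpers, and max_ns would come from that cycle-accurate model plus communication time:

    /* A rough sketch of the "right answer, on time" check; the helper
     * names are assumptions, and max_ns would come from a cycle-accurate
     * model of the untrusted device plus communication time. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    extern uint64_t run_checksum_challenge(uint32_t nonce); /* on the device */
    extern uint64_t expected_checksum(uint32_t nonce);      /* simulated locally */

    bool verify_on_time(uint32_t nonce, uint64_t max_ns)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t resp = run_checksum_challenge(nonce);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        int64_t elapsed = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                        + (t1.tv_nsec - t0.tv_nsec);
        /* Tampering must either change the answer or blow the deadline. */
        return resp == expected_checksum(nonce)
            && elapsed >= 0 && (uint64_t)elapsed <= max_ns;
    }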
I don't know how to summarize this in 30 seconds very effectively. But the
main thing is if you get the right answer on time then you feel good
about the answer. You consider it to be authentic to have come from
this device with assumptions about, you know, what's the clock
frequency at which that device runs. You know, if you do some kind of
very sophisticated overclocking then you can, you know, cause this
thing to execute more rapidly. Question?
>> : So do we need to assume the communication channel can uniquely
identify the device?
>> Jonathon McCune: You do. So we tend to call that -- I mean, you
described it well, but we use I think endpoint origin authenticity if
you read some of the software-based attestation literature. So that has
been a problem for software-based attestation mechanisms to date. You
have to have this assumption that you know which devices are talking,
so that automatically rules it out across the Internet. I'm getting
ahead of the slides a little bit. But the other big problem with
software-based attestation is something like a proxy attack. You know
if this is our network card and it has an internal microcontroller at
200 megahertz, you can [inaudible] spoof its response time with a, you
know, a big powerful system. So especially if we assume the adversary
to have multiple data centers at his disposal, you know, we're going to
have to do more than just worry about the time it takes to execute this
checksum. Question?
>> : So I have a question. If the trusted device knows that I'm
answering attestation and all I need to do is [inaudible] condition
check [inaudible]. If I'm under attestation, I just run the benign code
so I can finish the computation within the requested amount of time.
And if the condition check is just some very quick check, how can it
differentiate the trusted device is not really a malicious code?
>> Jonathon McCune: Yeah, so this is software attestation background.
And so what you describe is a real problem. Right? The fact that like
let's say we have our legitimate checksum function. You know, and it
gets us our right answer in 100 milliseconds. And the best known attack
changes one if condition and takes, you know, 100 milliseconds plus 10
microseconds or some tiny, infinitesimal adjustment, that's very hard
to detect especially in an environment where you aren't 100% sure who's
talking right now. So we're going to get to that. So that's one of the
things that I like about this application of software-based attestation
to peripherals that I think can overcome that problem. But that is a
problem for, you know, your vanilla run the checksum function once over
all of memory. And, you know, legitimately it either takes a second or
a second plus one millisecond, and you're supposed to distinguish
between these. And so that's been a real limitation of, you know, prior
software-based attestation mechanisms.
That was kind of clunky. Are we happy? All right. Great. So when you
actually go to implement this, what you need on the device of
questionable repute, right, on the peripherals that we want to verify
is this checksum function that I've alluded to as a hybrid hash
function benchmark, and then you need some ability to send information
back, and you actually need a more traditional cryptographic hash
function besides this fancy checksum function, to create what we're
calling here a software-only root of trust. So if you want to draw the
analogy between a hardware root of trust where you have, you know,
something like a TPM that has a signing key in it and it just won't let
the key out, right, there's no API for that. The only thing it'll do is
sign things. So if you get a signature from it and you trust that the
hardware hasn't been compromised, you know it came from a particular
device. And so that's the property that we want to attain in a parallel
way with a software-only mechanism.
And so on the verifier, for example the host CPU in this context,
then you have to have some kind of checksum simulator so that it knows
what the right answer was supposed to be. Now remember this is a
challenge response protocol with a nonce, so what the checksum function
is going to do is actually contingent on that challenge. It's not so
simple as to return the same answer every time. And then you have to
have your golden image, right, your expected firmware, and the ability to
measure time. Hopefully this is consistent with what we've previously
seen. The novelty of this VIPER system is actually with respect to
communication latencies and things. And so the clunky bits of software-based
attestation, you know, stood before I talked to you today
and will remain after we're done. But hopefully in environments where
you can make statements about communication latency, you'll see that
some neat things can happen.
>> : [Inaudible] the malicious code will always [inaudible] resolving
[inaudible] checksum problem. I could design the code in a way that will,
I mean, so that it leaves the checksum function intact.
>> Jonathon McCune: Yeah, so I maybe should've provided a little bit
more background. But another one of the assumptions that goes with one
of these checksum functions is that its implementation is optimal. So
the idea is that it's a very small thing. The ones that have been
developed to date don't look like cryptographic hash functions; they
look more like -- well, one such thing is a T-function; I think it does adds
and XORs. So it's a very simple function. So my big concern about
the practical ones is that you might be able to shave a few
instructions off their implementation, but I'm more concerned that
there's some kind of major algebraic failure where it's just not a
sophisticated cryptographic function like you really want for your real
hash function.
So this is still a limitation of the checksum functions that have been
proposed to date for software-based attestation. The kinds of questions
that you ask are sort of open questions in terms of, you know, getting
like a reduction-proof like we're used to having in sort of more
traditional crypto.
Okay. Animation time. So maybe one last step: we talked about a nonce
coming across; that's the challenge. Right? This checksum function does
its thing. It sends back a checksum. The minimal way to implement this
is it's really only verifying itself, and that's not really of value
alone. Right? That's only our software-based root of trust. What we
really want is a root of trust that allows us to get a high integrity
hash of the code of interest. And that's why the last step here is to
actually invoke a cryptographic hash function. In that scenario it
looks a lot more like sort of TPM style integrity measurement and
attestation.
Okay. So we already sort of described the proxy attack in response to
questions, but the risk is you have this peripheral device with some
kind of wimpy processor. It's already been corrupted by an adversary,
so he forwards the challenge to something powerful but fakes it in
time. So the adversary's able to get the correct checksum on time and
fool the verifier. So we call that a proxy attack, and that's been the
most significant, probably practical barrier even if you had a perfect
checksum algorithm this would still be fatal.
Now I want to talk about the differences for peripherals, and this is
where it hopefully starts to get interesting, because I think we have
some properties that are not quite as far out as getting a checksum
function that adheres to all these desirable things. So earlier I only
really talked about CPU performance when I talked about these
checksums. Communication overhead for this proxy attack -- this
communication overhead was a problem between the verifier and the
intended target device, because that's slop in your ability to measure
exactly how long that checksum function took to execute.
You know, and earlier we mentioned that the real checksum functions
that have been proposed have only a minimal additional overhead under
the best-known attacks. So, you know, if your network latency is too
long, it just won't work. You can't tell the difference between a small
change in network latency and the legitimate overhead induced by an
attack on the checksum function. So that was the past. Now when we
start to look at peripheral devices then suddenly, you know, the
latency is a lot more comprehensible.
So inside of our system we have buses as our communication mechanism.
Right? If we have a gigabit NIC then there's a PCI express bus or
something that connects it to the processor or to memory and then
Ethernet goes from the network card out to the rest of the world. And
although you could build a system that violates this property, in the
common case especially for OEM-provided systems they're not going to
put a peripheral device in the system where the buses aren't able to
keep up. Dell wouldn't spend the money to put a gigabit NIC in a system
if the bus can't go at that speed because they're wasting their money
on the network card.
So I think it's a pretty reasonable assumption that in many, many cases
the throughput is higher or the latency is lower between the main
processor and the peripheral device than it is between the peripheral
device and this proxy helper that might be out over the network. You
look unhappy.
>> : Yeah, because network traffic is [inaudible]
>> Jonathon McCune: So I'm making this claim even with whatever the
attacker's best case scenario is for network traffic. So even if the
attacker is, you know, saturating a gigabit link, the PCI bus isn't
necessarily saturated. Or whatever the latency is on the gigabit link,
you know, from a host to two hops down the network is a lot higher
latency than that between the NIC itself and your, you know, primary
processor.
>> : It comes down to throughput, though. Doesn't it?
>> Jonathon McCune: Well, it depends how you build your verification
system. So what I'm going to talk about in a slide or two here is one
that exploits the latency advantage of the local processor as verifier.
I think you could probably build something similar that takes advantage
of throughput.
>> : Okay.
>> Jonathon McCune: So maybe I'll come -- Let's discuss this further in
a couple more minutes if it doesn't help. Another neat thing about
peripherals is, you know, in any kind of sort of dynamic or on-demand
integrity measurement scenario, you may have data in memory. And
forming expectations about what is the right value of data like, "What
should be on my stack at this particular instant?" is a hard question.
And one of the things that's nice about peripheral devices is periodic
reset isn't necessarily disruptive. Peripheral devices get powered down
all the time. You know, our modern systems have a lot of power-management
functionality built in, and so there tends to be pretty good
support already for reverting that peripheral device back to a
relatively known state.
Okay. So with peripherals we have this stable communication pathway,
and you generally have a better connection from the main processor to
the peripheral device than that peripheral device might have to any
proxy helper. Some of the asymmetries that we've considered so far are
latency, throughput, and then also, you know, the relative rates of
variance or jitter in either of these values, and the loss
rate as well. You know, packets do get dropped. Bus errors happen
but they're comparatively quite rare. Yeah, so I think I made this
point.
So let's take a zoomed in view of how time elapses if you do a naïve
software-based attestation protocol at first. So we have time going
from left to right. We have a host processor and a peripheral. And
let's not worry about an attack yet; let's just look at the benign
case. So the host CPU to initiate one of these verification protocols
is going to send a nonce down to the peripheral device. It's going to
compute its checksum and then it's going to send back an answer. And
although it's fast on modern systems, it's not instantaneous. All
right? So some amount of time passes as a nonce travels from, you know,
your Core i7 down to your Broadcom NIC. Likewise, some amount of time
passes as an answer comes back. Now if you look at the proxy attack
scenario, the peripheral device has been compromised already. And
instead of legitimately computing the checksum, it's going to forward
that challenge to some helper, right, to some malicious proxy who's
going to presumably have immense compute resources. And in the limit,
we can assume he knows the answer. Let's just assume he actually broke
some of our cryptographic assumptions and immediately knows what answer
to send back.
He's still going to incur some latency getting the message from the
network card even if it's, you know, a one-foot crossover cable. It's
not instantaneous. And so what you end up with is, you know, these red
arrows add up to overhead, and that's communication overhead that's
actually in the defender's favor here. You know, this is overhead
that's only incurred under an attack scenario. In the benign case, that
overhead is not in scope, so we don't have to -- You know, when we
figure out what's our threshold -- Where on this line do we need to
receive the answer in order to conclude that the answer came in on
time? -- we don't have to take this variance into consideration. And so
that amount of overhead actually is useful in constructing a protocol, you
know, to make it a lot more difficult for the adversary to get the
right answer on time.
So the question is, what do these various parameters need to look like
in order to get this property? In order for these asymmetries to be in
the defender's favor? So the most conservative assumption that we
wanted to make for the time that it takes the malicious proxy helper to
compute the right answer is that it's instantaneous. So we do assume
that he needs to receive the challenge before he knows which response
to provide, but we assume that there's no computation time. That as
soon as he gets the challenge he just sends back the right response. So
we're going to assume that that time is zero.
The communication time to the proxy is this Tproxy communication here.
That's these red arrows, the time that it takes the information to get
to the proxy helper and back. The legitimate checksum computation time
is this. The legitimate peripheral is going to take some amount of time
to execute this checksum function. And then because we are talking
about maybe even nanoseconds here, being able to accurately measure
these times isn't necessarily a given. You know, if you're executing on
the main processor and maybe you're just going to use RDTSC or
something like that as your timing mechanism, there's some, you know,
quantum that is the shortest interval of time that can be accurately
measured. And that actually comes into play when you think about
putting a protocol like this together.
So what are our requirements? The proxy communication needs to take
longer than the legitimate checksum computation. Right? If it doesn't
then the adversary, you know, in this conservative environment where he
already knows the answer as soon as he receives the challenge is going
to have an advantage. So we need the property that the proxy
communication latency is greater than the legitimate checksum
computation time on the peripheral. And the implication that that's
going to have is the peripheral doesn't have a lot of time to sit
there and run this checksum. And some of the existing proposals for the
software attestation checksums were to do a pseudo-random memory
traversal of the entire memory space of the target device. So that's
roughly n log n pseudo-random memory accesses. And that takes too long.
If you do that for any appreciable amount of memory, it will quickly
take longer than it takes to exchange an Ethernet packet with your
next-door neighbor, for example.
So the overhead that the proxy actually causes from the perspective of
the verifiers being able to try to detect something is the time spent
in communication with the malicious proxy but minus the legitimate
computation because the verifier doesn't know. He thinks the thing is
sitting there computing legitimately. And finally whatever this
overhead turns out to be, it needs to be big enough to measure because
if it's too small to measure, you know, we can't tell.
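Putting those requirements together, one can write the feasibility condition as a small check; the struct and parameter names below are illustrative assumptions, not values or code from the talk:

    /* The constraints written out as a feasibility check. The adversary's
     * computation time is taken to be zero, per the conservative
     * assumption above. Names and the struct are illustrative. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t t_proxy_comm_ns;  /* round trip to the fastest plausible proxy  */
        uint64_t t_checksum_ns;    /* legitimate checksum time on the peripheral */
        uint64_t t_quantum_ns;     /* smallest reliably measurable interval      */
    } timing_params;

    bool latency_attestation_feasible(const timing_params *p)
    {
        /* Proxy communication must take longer than the honest checksum... */
        if (p->t_proxy_comm_ns <= p->t_checksum_ns)
            return false;
        /* ...and the attack-only overhead must be big enough to measure. */
        uint64_t overhead = p->t_proxy_comm_ns - p->t_checksum_ns;
        return overhead > p->t_quantum_ns;
    }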
So how do we operate under these constraints? What does a protocol or a
checksum function look like that can meet these requirements? So the
basic mechanism is, well, we can't check all the memory in one go
because that is too much execution time on the peripheral device. So
we're going to use multiple nonce-checksum pairs. All right? Each nonce
is going to result in the peripheral doing one of these checksum
functions over only some small amount of memory. So you're going to
need more than one of these in order to get good coverage of the memory
space on your peripheral device. And so what you're going to end up
with is the host CPU acting as verifier sends the first nonce to the
peripheral device. It computes. The answer comes back. You know,
naively it sends the second nonce, but by the time it sends the second
nonce there's some idle time here.
And if you have to do this many hundreds or even thousands of times to
get good coverage of your peripheral device's memory then these idle
times add up. And so that ends up serving as another source of slop in
the types of expectations that a verifier can set for run time. So, you
know, a simpler way to say this is we want the utilization of our
peripheral device's processor to be 100%. And we'd also sort of like
the utilization of the PCI bus between our peripheral and the processor
to be 100%. We want that thing to be working as hard as it can so that
any interference by an attacker attempting to change something is going
to cause some kind of overhead.
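One way to picture that 100% utilization goal is a pipelined verifier loop that keeps the next nonce in flight while the current checksum is still being computed; the bus helpers below are hypothetical:

    /* A sketch of the pipelined loop: nonce i+1 goes out while the
     * peripheral is still answering nonce i, so neither the bus nor the
     * NIC microcontroller sits idle. The bus helpers are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>

    extern void bus_send_nonce(uint32_t nonce);
    extern uint32_t bus_read_checksum(void);   /* polls until result ready */
    extern bool response_ok(int round, uint32_t nonce, uint32_t resp);

    bool verify_all_rounds(const uint32_t *nonces, int rounds)
    {
        bus_send_nonce(nonces[0]);
        for (int i = 0; i < rounds; i++) {
            if (i + 1 < rounds)
                bus_send_nonce(nonces[i + 1]); /* next nonce already in flight */
            uint32_t resp = bus_read_checksum();
            if (!response_ok(i, nonces[i], resp))
                return false;                  /* wrong answer (or too late) */
        }
        return true;
    }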
>> : [Inaudible] slop or error [inaudible]?
>> Jonathon McCune: Because if we go back -- Let me see.
>> : I mean, it's not during the measurement time, so what are you
concerned about?
>> Jonathon McCune: So it -- Okay, so there's a question of whether the
measurement time is -- You know, are you measuring this and then this
and then this, or are you just measuring something at the end? And
certainly if you just measure something at the end then all these idle
times add up to cause you problems. But even if you have timing
expectations for each one, the little processor inside the network card
is idle here and, you know, if the checksum function was a perfect
cryptographic hash function then maybe you wouldn't have to care as
much. But this time is time that an adversary might be able to be
crunching on something. You know, it might increase the chances that
he's able to correctly guess an input to the checksum function. You
know...
>> : All right. He knows nothing about nonce two yet at that time. If
pre-calculating were helpful, he has all the time in the world to
pre-calculate before nonce one.
>> Jonathon McCune: So it also turns out to be a practical limitation
for some devices that sending, for example, 128 bits as a nonce is
problematic because you can't do it in one bus transaction. And so if
you want to, at least for the device we considered, we wanted to have
one bus-send and one bus-receive. And we ended up using a 32-bit nonce.
You know, I'm not going to say that this is the only way that you could
design this. I mean I guess the short answer is if your nonce is not a
strong cryptographic nonce then you really don't want to give the bad
guy time to compute. And even if you do use a strong cryptographic
nonce, nobody's proven any of these checksum functions to look like the
kinds of cryptographic functions you'd like them to be either.
So I guess it's a little bit hand-wavy but we found it to be a
conservative design. Another question? Jay?
>> : Are you going to require that every single one of the responses
be correct or are you going to accept 99.9% [inaudible]?
>> Jonathon McCune: So that's a good question. We were able to get
correct responses for hundreds of roundtrips on the device that we used
before things got out of sync. So the final design that we used
eliminates this idle time. I was going to talk about it in a minute
but....
>> : [Inaudible] idle time, are you going to expect 100%? So if even
one of the nonce-checksum responses is wrong you're going to say
[inaudible]?
>> Jonathon McCune: I'd be willing to accept a design where 99.9 was a
tolerable parameter if we understood all the sources of variance. You
know, we didn't use the bus analyzer when we set this up and ran it and
explored with it. I couldn't tell you why every once in a while the
latency on the bus goes up. You know, maybe it's another device having
an interrupt. I just don't know. And so it...
>> : Yeah, that's my concern. I'm concerned that, yeah, occasional
interrupts will cause it to be disrupted. But also another source of
disruption is one of your nonce's actually testing the location where
they've stored their malware and that one is the one that it will fail
on.
>> Jonathon McCune: Right.
>> : So I'm concerned that if you have to accept the occasional glitch
because of bus interference
>> Jonathon McCune: Right.
>> : you're also going to [inaudible].
>> Jonathon McCune: Right. And I mean you'd have to figure out what the
probability is that, you know, it lands right on the attacker's
critical instruction only once.
>> : [Inaudible] the other way around, right? If the attacker touches
one critical instruction then that means that you only infrequently
fail the test and that might just look like bus noise. Right?
>> Jonathon McCune: So you do want to run this enough times that you
cover every memory location with very high probability.
>> : Right. But if every location is only covered in a few of the
samples then those few samples might look like your few bus errors,
right? Or, I mean, I'm assuming here that the attacker is in the
comfortable position of only having to slow down when you're having to
touch his memory as opposed to when you touch any memory. But if you
give me that assumption then what Jay's saying is that that attacker is
now in a position where the fact that he only changed bytes of code to
[inaudible] vulnerability looks indistinguishable from sampling error.
>> Jonathon McCune: Yes. So I don't want to defend any particular
parameter choices on a particular device on a particular host. You're
going to have to analyze exactly which parameters make sense and are
acceptable. We sort of approach this as a bit of a reverse engineering
exercise as well. So if you're the designer of one of these devices,
you want to, you know, have a checksum function or an entire protocol
that makes a lot of sense for the, you know, particular microcontroller
in your NIC. For example, this one -- or the one that we ended up
using for our prototype was MIPS architecture but certain instructions
just weren't there. You know, so it was this stripped-down version of
MIPS that was the bare minimum amount of functionality that they
needed. So it's difficult for me to make any general statement, I guess
is what I want to say. I think our main goal is to raise the awareness
of this, you know, possibility for a solution. And I do think that the
latency characteristics are in the defender's favor, and that's a big
step from previous types of software-based attestation, which were sort of
all or nothing.
>> : You assumed that [inaudible] devices idle, not doing anything
useful?
>> Jonathon McCune: Yes. So we need to reset the device into some kind
of known-state in order to even make sense out of the measurements that
come back, right out of the checksum hashes that come back.
>> : [Inaudible] be smart knowing that you are doing the checksum and
then these remain [inaudible].
>> Jonathon McCune: So certainly. I mean in any type of detector, where
the bad guy sees the detector coming, you know, he can at least leave
the system.
>> : Right.
>> Jonathon McCune: You know, maybe not -- It's sort of system specific
whether he can hide and reinfect. I mean presumably the root
vulnerability is still there. So I mean as a detection mechanism, it's
sort of something we suffer from.
>> : So you just said that in order to check you have to reset first.
But...
>> Jonathon McCune: Yeah. Well, you need to know what state you expect
the device to be in. A trivial way to do that is to reset it to some
known-state.
>> : Well, if you're willing to admit a reset, this is [inaudible] way
back to the production slides. Feel free to
>> Jonathon McCune: No, it's okay.
>> : delay this question [inaudible]. But if you're willing to reset it
seems like you can just use a -- I'm not sure what the correct phrase
is but trust the boot path. I mean [inaudible] firmware. Why check the
firmware when you've just [inaudible] firmware down? I mean this
predicates the idea that we do have to trust the firmware boot-loader.
>> Jonathon McCune: Right.
>> : But...
>> Jonathon McCune: And I guess that has, you know, some of these
motivational examples were to show that the trust is maybe not a good
idea.
>> : I mean, I guess I certainly can believe that the whole firmware
for the device is too big to want to [inaudible]. The thing that
accepts the firmware? I mean that's also...
>> Jonathon McCune: Well, so I mean this sort of comes back to the
example in the beginning...
>> : [Inaudible] this thing, right?
>> Jonathon McCune: Say again.
>> : The little boot-loader this thing on the device waiting to accept
the firmware that you know you're running when you wiggle the reset
line on the PCI bus.
>> Jonathon McCune: Right.
>> : That piece of coding seems like it's a lot smaller than all this
system we're talking about here.
>> Jonathon McCune: But you're also assuming you have some kind of, you
know, signature checker. And not all devices necessarily...
>> : Well, I'm assuming that I've reset the device and now it is in a
mode where the only thing it does is accepts from the CPU the firmware.
And so you don't have to check it. You're writing it, right? You're
controlling the boot sequence.
>> Jonathon McCune: So I think that's a reasonable design for a
peripheral device, but in practice there are a lot of peripheral
devices that aren't designed like that.
>> : Well, but you're presupposing here you're going to alter the
device, right?
>> Jonathon McCune: No, this is only changing the firmware. We don't
have to change anything in the hardware. This is just a software
update.
>> : Oh, I see. This is only changing the firmware and not the boot-loader,
the [inaudible]?
>> Jonathon McCune: Yeah. But even that's a gray area. Right, that
Apple aluminum keyboard, it is incredibly impoverished. The signature
checker for its firmware updates is in the part that runs on the host
environment.
>> : But you don't need a signature checker. You need a trusted way to
make sure that when you send the firmware it's actually acceptable
firmware. Now assuming that the reset line might do that, I'm assuming
that there is a designer who can have that -- If the designer of the
device were to add one thing, they might add that as [inaudible].
>> Jonathon McCune: Sure. Yes. They might. I mean if you wanted to, you
know...
>> : Which one would be better? I guess is what I'm trying to ask. Is
there a place where this approach dominates, is what I'm trying to
understand?
>> Jonathon McCune: I mean, a piece of hardware that already exists and
where the architecture you described cannot be applied, there's not
some way to reset it to load, you know, certain code.
>> : [Inaudible] that there were constraints on what you were able to
do. And you said, "Well, if you were the designer of this device you
would fix those constraints." I mean, I thought you were assuming that
we were going to understand this approach and then designers would
build future systems to make this approach practical. And so it seems
if you're admitting the designer into the loop then you can also just
ask the designer whether they want to do a trusted boot path instead.
This is what I'm trying to understand, is if the designer is involved
for future devices -- And I guess...
>> Jonathon McCune: So my intuition to date is that the space is so
heterogeneous that there isn't necessarily a simple, single answer to
getting, you know, integrity of the code that loads on any of our
devices. I certainly agree that the simplest solution is the best. You
know, if you could get some kind of start up procedure where you had a
strong guarantee that it always loaded firmware "from blah" then great.
But I mean even the assumption I'm making here that the main processor
is a good place from which to do all these checks, that's pretty
strong. You know, I don't think it's a given that the best way to build
our future system is, you know, like micro-code patches for our main
processors where we squirt the firmware into each peripheral device as
we bring up the system.
I mean it's a model that one could conceive, but I can't make a case
that it should be exclusively adopted.
>> : Okay.
>> Jonathon McCune: All right. So what we really did is try to keep the
bus utilized at the same time as the processor was computing the
checksum, the processor inside the NIC. So we basically wanted nonce
two to be in flight before the computation that was done in response to
nonce one had completed. And in fact, you know, again just trying to
close the window on the time interval that the adversary has to try to
manipulate this process, we allow nonce two to select which part of
what we actually have inside -- a larger checksum vector -- you know,
which part of that checksum vector to return. And again on our device
it was either a 32 or a 64-bit limitation per bus transaction. And so
we weren't able to return our full checksum state in every single one
of these.
And it basically is obligating any type of proxy helper to pre-load
more data into the network card. So you know, again even if we had a
full sort of cryptographically-strong nonce, if the checksum state was
512 bits and we could only send back 64 then which 64 bits to send back
would be selected by the arrival of nonce two.
So this is just the same model repeated. Each subset of the full
internal checksum state gets selected by the arrival of the incoming
nonce.
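A tiny sketch of that selection idea, with illustrative sizes (a 512-bit internal state returned 32 bits at a time; this is not the device's actual firmware):

    /* Illustrative sizes: 512 bits of internal checksum state, returned
     * 32 bits per bus transaction, with the incoming nonce picking the
     * slice. A sketch of the idea, not the device's firmware. */
    #include <stdint.h>

    #define STATE_WORDS 16           /* 16 x 32 bits = 512-bit state */

    static uint32_t checksum_state[STATE_WORDS];

    /* Runs on the peripheral when nonce i+1 arrives: its low bits select
     * which word of the ongoing checksum state answers nonce i. */
    uint32_t select_response(uint32_t incoming_nonce)
    {
        return checksum_state[incoming_nonce % STATE_WORDS];
    }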
>> : [Inaudible] interrupts and cause a variance in latency?
>> Jonathon McCune: So for this particular device -- It's complicated.
Sending to the device and...
>> : The relationship between the device and the CPU itself.
>> Jonathon McCune: Yeah, so it's not so simple as, you know, two ends
of a network connection where one says, "Send," and the other one
happens to get it. I'm trying to remember the exact specifics. But it's
basically the host CPU polls a shared address space. So when the host
CPU wants to transmit a nonce to the peripheral, it just writes memory, right,
and the magic happens in the hardware. But when the host CPU wants to
receive a checksum, it doesn't actually just sit there and wait for an
interrupt from the device. It reads a location in memory-mapped I/O
space that happens to correspond to where that checksum's going to be
written. And...
>> : But when it writes to memory that means it is going to -- Right,
it's going to go over the shared memory bus. Won't that cause an
interrupt on the device?
>> Jonathon McCune: So this device that we used doesn't get an
interrupt. It also polls. So it has this kind of mailbox abstraction,
the details of which I would have to punt to Yanlin for. But it's not
the simple interrupt-driven, event-driven model that we're used to
seeing on our main processor. When, you know, a new packet arrived and
the NIC sends off an interrupt to the host CPU, I mean, it's a NIC. It
supports that mode of operation. The mode of operation where we were
able to get good control of the timing behavior of everything was this
memory-mapped I/O with carefully synchronized writes and reads from the host
processor. So like I don't go into this here but I talked about the
granularity with which the host processor can measure time. That's not
just how many cycles elapse between two consecutive RDTSC
instructions or something like that. Because this results in a bus
transaction in order to get back the answer from the memory read, it
takes a lot more than one cycle to fully execute. And so it turns out
that if that's your granularity of time measurement, you run into a lot
of challenges of being able to be like a one-hop gigabit Ethernet
attacker.
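A sketch of what that host-side timed round over memory-mapped I/O might look like; the mailbox layout and NOT_READY sentinel are assumptions for illustration:

    /* Host-side timing over memory-mapped I/O, sketched with assumed
     * mailbox addresses and an assumed NOT_READY sentinel. Each poll is
     * a full bus transaction, so the effective measurement quantum is
     * one MMIO read, far coarser than one CPU cycle. */
    #include <stdint.h>
    #include <x86intrin.h>                  /* __rdtsc() */

    #define NOT_READY 0xFFFFFFFFu

    volatile uint32_t *mmio_nonce;          /* mapped mailbox: nonce in   */
    volatile uint32_t *mmio_checksum;       /* mapped mailbox: result out */

    uint64_t timed_round(uint32_t nonce, uint32_t *resp_out)
    {
        uint64_t start = __rdtsc();
        *mmio_nonce = nonce;                /* one bus write starts the round */
        uint32_t r;
        do {
            r = *mmio_checksum;             /* each poll costs a bus round trip */
        } while (r == NOT_READY);
        *resp_out = r;
        return __rdtsc() - start;           /* elapsed time in TSC ticks */
    }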
So what we opted to do for our prototype was... [ Silence ] Sorry. I
confused myself a little bit. But, yeah, I'm sorry. I lost exactly
which part of your question I was going for. Yeah. Do you want to
re-ask anything or shall I just...
>> : I'll wait until the end.
>> Jonathon McCune: Okay. So we gave the name VIPER to this latency-based
attestation protocol where, you know, we do a lot of nonce-checksum
pairs between the host CPU acting as verifier and the peripheral
device, you know, acting as the target where we'd like to do some
verification. In a real system a proxy attacker isn't necessarily just
a bad guy that's connected over the Ethernet link. You know, in a real
system you might have an attacker that's another peripheral device. Now
this comes back to the question from the early slide on, "In what order
should we verify the different devices inside our system?" And so, you
know, the sort of intuition is that you want to check faster
peripherals before you check slower peripherals because a faster
peripheral could masquerade as a slower one. Now it comes down
to the details of how a particular checksum function is implemented.
You know, the right kind of checksum function for a graphics card with
lots of parallel threads of execution is very different from the right
kind of checksum function for a little NIC microcontroller.
And I would say that the right kind of things to do for graphics
cards are still unsolved; you know, there are still open questions.
So we did prototype this on this particular Netgear GA620. We were able
to get firmware for it -- that's why we chose it -- because we could actually
have the full source code to the firmware and understand things like
how its memory-mapped I/O works, so we don't just have to use
interrupt-driven communication. The prototype that Yanlin put together
implements the checksum and communication code in just hundreds of
instructions, so it's not a huge thing. We also had as our hash
function component just an off-the-shelf SHA-1 implementation. This
particular card was PCI-X; that's sort of a weird bus standard that was
a high end standard for a short period of time before PCI express came
out. So we ended up having to use a somewhat unusual machine as our test
platform just because it happened to have one of these slots. You know,
this ancient Linux [inaudible] version is just because the firmware
build environment for this NIC happened to work there, and we didn't
undertake the effort of porting it forward. Yeah?
>> : Just I have a delayed question for your previous slide.
>> Jonathon McCune: Okay.
>> : Is it conceivable that one peripheral is faster than the other
for its own function but slower than the other for the other's
function?
>> Jonathon McCune: I don't see why it wouldn't be possible.
>> : Because then it wouldn't be clear how to order them because you
might do the one that finishes faster first but then...
>> Jonathon McCune: So...
>> : But then another one that is considered slower could do it faster?
>> Jonathon McCune: So another problem -- I grant that problem, and
another problem is the difference between the fastest and the slowest
is immense. If you think about a GPU compared to the microcontroller in
your keyboard if you're talking about peripherals that go all the way
out that far. So I think two actual fallouts that are recommendations
for hardware design are the main processor or whatever ends up being
the verifier needs to be able to strongly address who it's talking to,
just enumeration of what hardware exists in this system. And maybe you
can pre-configure that, but then once you know that, "I want to talk to
the graphics card and only the graphics card right now," my
understanding is that the newest PCI Express specifications do allow
some of that. But there's still a huge legacy, you know, all of your
legacy I/O in the south bridge tends to be this one integrated super
I/O chip these days, and so on.
>> : Unless you got the proxy attack problem. Even if you know who
you're talking to, you still have the proxy problem.
>> Jonathon McCune: Yes, but [inaudible] knowing who you're talking to
might also be making it...
>> : [Inaudible] talking to somebody else. Yeah, that's helpful
[inaudible] too.
>> Jonathon McCune: Yes. So I think PCI Express could do it. It's a
much more hub and spoke-looking architecture. And I think the new
versions have some access control mechanisms. But all the legacy stuff,
right, once you hop a bridge to old PCI, I don't know how you could fix
that. But maybe -- Moving on to my second point about the massive
difference from fastest to slowest: something like a keyboard that's so
important for the security of your system, right? You type your
passwords into it. All new sensitive information enters through it in
some sense. I think it makes sense to actually have some amount of
security-specific hardware functionality in keyboards if we're going to
architect a whole system where we try to worry about these kinds of
problems because it's just too wimpy, otherwise, to make any kind of
statement about it. If you can compromise anything, right -- The
firmware in your little flash drives that you plug in that makes it
look like a, you know, SCSI device could easily be compromised and keep
up with the microcontroller in your keyboard, right, the firmware in a
USB hub or something like that.
So I think, you know, regardless of whether software-based attestation
is the answer or not, strong addressing of devices and some amount of
security functionality in the human input devices seems like things
that make a lot of sense.
Okay. So I have some detail about this particular network card and what
it looks like. I'm going to proceed. So this actually is a dual-core
network card. It has two microcontrollers in it. They are MIPS
architecture, although some instructions are deprecated or just not
there. They have a little bit of private memory which actually goes a
very, very long way to helping us with our implementation. How to do
one of these checksum functions for concurrent execution with shared
memory? I mean, we already make a lot of assumptions for these checksum
functions. I'd hate to add that. It turns out that the amount of
scratch-pad memory in these things is not the same, one has twice as
much as the other.
>> : [Inaudible] intended functions for the CPUs? Do they do
[inaudible] or...?
>> Jonathon McCune: Yeah, good question. CPU A does just about
everything but when it has a worker-type of task where in a more
traditional context you might think, "Oh, I'll fork a thread and just
send it off," it defers that kind of work to CPU B. Examples of those
particular tasks? I don't want to say something wrong. So I'm going to
say I'm not sure.
>> : Do they have one [inaudible], right?
>> Jonathon McCune: They do. This is RAM. This scratch-pad memory is
random access memory that, I think, powers up [inaudible] zero. So the
obvious, naïve design is if you put -- Let me finish. They also share a
much larger buffer that's SRAM that they can both access, and it's also
memory-mapped to host world. So this is sort of the shared internal to
the NIC buffer that the host processor can talk to. These scratch-pad
memories are not addressable from anything except the CPU's to which
they are connected. So the obvious bad design is to put your checksum
function in the shared memory because one CPU can be evil while the
other one's good and make changes to the memory. So this is a flavor of
the problem that Jay mentioned about two different -- you mentioned
heterogeneous devices at the same time, but here, though, it's very
similar.
So what we actually did is implement the checksum and the hash function
in CPU A because there was more memory and just the checksum in CPU B
and again let CPU A be the authority on how execution proceeds. So, it
doesn't really matter where you start, but without loss of generality
one of them goes first, does its own verification and remains in the
verified code. It just sits there and waits for the other one to
finish, lets the other one finish, and then, you know, that gives us
our root of trust from which the hash function can be invoked to
compute a hash over the actual firmware image. I'm not sure if it's
mapped into this same address space or not, but the flash that holds
the executable firmware is either part of this address space or some
other shared address space. You know, without going through the
firmware update process, it's not read-only but it takes a long time to
write to it.
Okay. So I think in our prototype we do CPU B first. The checksum runs.
The program counter stays inside the verified code until CPU A has done
both the checksum and the checksum has covered the hash function, and
then the hash function can be used to hash whatever.
Okay. So our checksum function was pretty unremarkable. I'm very sorry
to say that we still don't have a checksum function that I can make
strong mathematical statements about. If anybody has the background or
the interest in doing something like that, I'd be delighted to talk to
them. Maybe the interesting attribute of it is we took care to fit each
block of this checksum function into the cache in the processor very
carefully so that if the adversary tries to modify one of these blocks
it'll overflow the cache and you'll get hopefully, you know,
pathological overhead, not the overhead of incrementally one more
instruction but the overhead of a bunch of cache misses as well.
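For flavor, a checksum round in this style might look like the following; this is only an illustrative stand-in with an arbitrary linear-congruential traversal, not VIPER's actual checksum function:

    /* An illustrative stand-in for one checksum round, not VIPER's actual
     * function: simple adds and XORs over a pseudo-random traversal, with
     * the understanding that each unrolled block is sized to exactly fit
     * the cache so that any inserted instruction overflows it. */
    #include <stdint.h>

    #define MEM_WORDS 4096u          /* hypothetical scratch-pad size in words */

    uint32_t checksum_round(const uint32_t *mem, uint32_t nonce, int iters)
    {
        uint32_t sum = nonce;
        uint32_t idx = nonce;
        for (int i = 0; i < iters; i++) {
            idx = (idx * 1103515245u + 12345u) % MEM_WORDS;  /* PRNG traversal */
            sum ^= mem[idx];                 /* mix in memory contents */
            sum += (sum << 3) ^ idx;         /* cheap, order-sensitive mixing */
        }
        return sum;
    }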
So we did mock this up. I mean, we prototyped it, you know, to the best
of our ability without doing mathematical proofs of the checksum
function. The attacks that we wanted to consider were an Ethernet-based
proxy attack where we had like a cross-over cable, like the
shortest Ethernet link we could come up with. And in fact we actually
had the other end of the Ethernet cable connected to the same network
card which knew the answer and responded without even interrupting the
OS. So we worked very hard to make the proxy attacker send back a
response as quickly as we possibly could by doing it exclusively within
the firmware of the attacker NIC.
The other two attacks are sort of the classical best attacks on the
types of checksum functions that people have proposed to date. The
checksum functions are position dependent, so if you move the whole
thing and run it then you need to forge both the data pointer and the
program counter. You know, if you run from a malicious location and
compute over the benign, unmodified checksum function image in its
intended location then you're only really forging the program counter
because the data pointer values are what you would've expected them to
be.
But anyway these are the two attacks that are most commonly, you know,
the convention in the literature is that they're the best ones because
they only add an instruction or two. And then an Ethernet-based proxy
attack. So despite our efforts to make that Ethernet proxy attack as
fast as we possibly could between these two network interfaces, we
still couldn't get it faster than 43 microseconds. I don't know where
that overhead comes from. If you do the math on gigabit per second and
make, you know, optimistic assumptions about latency, if Yanlin did the
math right it's 1.2 microseconds.
So this is a lot slower than what we thought the best possible Ethernet
proxy attack might be. So I think a bad guy who really understands the
lower levels can probably do better than we did. Both of our attacks
for the data pointer and program counter forging attacks turn out to
add five extra instructions, and the way they fit in the various
blocks of our checksum implementation cause two cache misses.
So what does this look like? Our benign case ended up taking 2200
nanoseconds measured with RDTSC. So we can't necessarily measure just
one nanosecond. But we did have a nice actual grouping of the two
forging attacks, the green and blue lines there. And then the red line
is sort of theoretical, analytical expectation of the best that
Ethernet might be able to do. So that's the, you know, 1.2. I think
that's twice 1.2. But the actual implemented Ethernet attack, despite
our efforts to make it as fast as possible, is way off the top. So I
don't know what to infer from that. You know, whether our
implementation was bad or we don't understand Ethernet well enough to
recognize what the minimum latency is even for a small packet.
But they did group nicely, you know, inside of our ability to measure
passage of these short intervals of time. And we were able to set a
threshold. We used 4.5% over the benign case, and that worked reliably
for us for the, you know, sizes of scratch-pad memory and our checksum
function implementation for this particular card. So we were able to
detect, you know, one-bit changes in the checksum function
implementation. I have my --
>> : Why'd you choose the 4.5 and not 4.6?
>> Jonathon McCune: No good reason. I mean, you know, it's in between
them and it's conservatively closer to the other one. And we could run
a lot more experiments. We only had two or three of these devices, and
so there's some amount of inter-device variance but not a ton. I think
if you tried to roll this out on a large scale that might be a problem
that comes up. You might have to tweak this threshold based on how much
variance you see between devices.
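That threshold works out to simple arithmetic; the 2200 ns benign time and 4.5% margin are the numbers from the talk:

    /* The threshold from the talk as arithmetic: a benign time of about
     * 2200 ns plus a 4.5% margin gives a cutoff near 2299 ns. */
    #include <stdbool.h>
    #include <stdint.h>

    bool within_threshold(uint64_t measured_ns)
    {
        const uint64_t benign_ns = 2200;
        const uint64_t limit_ns = benign_ns + (benign_ns * 45) / 1000; /* +4.5% */
        return measured_ns <= limit_ns;
    }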
>> : I guess when I see 4.5 it looks carefully chosen. If you say 5% it
would convey the impression that, "Oh, we just chose this."
>> Jonathon McCune: Yeah. Guilty, I guess.
[ Laughing ]
>> Jonathon McCune: All right. So we sort of talked about this but I'm
worried about -- You know, even if we can overcome a lot of these
assumptions I'm worried about the really resource-impoverished devices
that have a big footprint in the security properties of a system like a
keyboard. I don't think that even if this system worked incredibly well
that it would necessarily be ready to protect your keyboard from, you
know, compromised other subsystems in your device. And then the other
obvious practical challenge for any kind of attestation system is
managing expectations. What's the white list? What's supposed to be in
the device? Certainly reverse engineering this for more than just a
small handful of devices would be a large undertaking.
So a lot of this related work is either the existing literature on the
software-based attestation checksum functions or existing literature on
the DMA peer-to-peer attacks or these firmware reverse engineering
types of attacks.
In particular the first one, Loic Duflot has done a lot of work on, you
know, what happens if the firmware goes bad in your system. He proposed
the more -- sort of maybe getting towards John's earlier point -- the
Broadcom NIC that he was looking at had a debug interface, where if you
did trust the main processor, you could just use that debug interface
to measure what firmware's running in the device and convince yourself
whether it's what you expected or not. You know, that worked for that
device.
So I hope I made the point that peripheral firmware integrity is an
important problem. I had a bunch more slides on these attacks, but I
made the assumption that you guys would appreciate the risks. I think
compared to other domains for software-based attestation using it on
peripheral devices is a relatively good fit. The communication latency,
instead of being, you know, noise to be overcome actually looks like an
asset to the design of some new protocols. You know, we showed that it
was fairly practical to prototype on a device if you have access to the
internals of that device. And I do think that if this stuff is going to
make the leap into a practical application that this is one place that
makes sense. We've had a few conversations with some of the people in
the room about, you know, maybe doing it in a data center where you
want to verify the hypervisor layer and you know how your fiber's
routed in your data center or something like that.
That's a scenario with maybe a lot more compute power available. I
really appreciate your attention. Thank you for all the questions. I
don't mind being pushed. This is the paper that describes the gory
details of what I tried to introduce today. We had an earlier sort of
more preliminary version of this where we tried to apply software-based
attestation to that keyboard with the firmware problem. This sort of
lays out some of the basics in there too. So I'm always happy to
discuss. Thank you.
[ Applause ]
>> : So traditionally I think the vendor of the peripheral tries to protect
the firmware; the [inaudible] to a system is very, very narrow. It'd
probably expose a few [inaudible] that the driver [inaudible]. But your
approach basically widens the interface, right? You allow the system
to actually send computation tasks to the firmware. Well, that
actually could be inviting [inaudible] more attacks. Right?
>> Jonathon McCune: Yeah, so...
>> : [Inaudible] think about the [inaudible] attack, malware that
jeopardizes the system just keeps on sending computation tasks to
the peripheral [inaudible] for example.
>> Jonathon McCune: So just to be clear, the only thing that gets sent
in is, you know, nonces and requests to this. It's not opening up more
-- The host processor is not just blindly writing out executable bytes
into the memory of this. It's just sending input to some portion that
was already there. Yes, it's another function that runs inside this
device, so in that sense the attack surface may increment. Depending on
the device, it may increment a lot or only a small amount. Regarding
denial of service, you know, you can tell the device to power down or
to enter a low power state or something like this. I think as a
practical consideration, a piece of software in the system with the
ability to, you know, just to send it down the "run this checksum
function forever so as not to send or receive packets" can achieve that
goal in many other ways. I'm not sure we make things worse in that
regard.
>> Bryan Parno: Okay. Let's thank Jon again.
[ Applause ]