>> Tom Ball: I'm Tom Ball, and it is my great pleasure to welcome Brandon Lucia back. He was
a researcher at Microsoft Research, and now he is an assistant professor at Carnegie Mellon
University. He does amazingly cool stuff with hardware and runtimes and he's about to launch
into outer space via SpaceX some cool little circuits, and I guess we're going to hear a little bit
about what the challenge might be if you want to put something in outer space that doesn't
have its own battery pack. Welcome back.
>> Brandon Lucia: Thanks for the introduction. This talk won't be about space entirely. It's
going to be about a general problem that we run into in space which is intermittently available
energy and energy harvesting. So there's a lot of different design challenges that you run into
when you're trying to build devices that you want to run off of energy that you're borrowing
from your environment. So I'll talk about what the challenges are, and then I'm going to talk
about some system support that we built for that, which kind of dropped last year, and then I'll
talk about some stuff that we have going on right now which is kind of the state-of-the-art in
debugging devices that are doing intermittent execution in energy harvesting.
I'm from Carnegie Mellon University, the electrical and computer engineering department, and
the stuff I'm talking about is what I've been doing and my students have been doing together so
it's kind of a big happy family group effort from over there. So I probably don't need to tell you
guys too much. The motivation for this kind of work is that we have little devices everywhere.
So your medicine bottle isn’t just a bottle anymore. There’s like a computer in there that’s
telling you how many times you've opened the lid and where did that come from and where is
it going so you know if your medicine has been somewhere that you don't want it to be or in
the hands of someone you don't want to have it; and beyond the medical domain, we can put
these devices into space and they can do things like try and find resources extraterrestrially
that we want to mine or bring back to earth, and lots of other
really cool stuff.
So emerging devices are everywhere and a lot of these, especially if we are looking at kind of
medical applications, we need the devices to have high reliability. In particular, things that are
expensive or dangerous need to be reliable. So for the most part, when we build devices using
little embedded systems like this we are tied to a power source like two AA batteries or maybe
you have to plug into a USB or maybe have some big chunky power pack or something like that.
So this is problematic because you can’t put these devices everywhere. They're not tiny,
it's hard to put them inside of your body for medical applications, and if you're going to launch
something into space you probably want something more sophisticated than 4 AA’s that you're
going to go and shoot into space or if you want to monitor a volcano your AA’s are just going to
catch fire and go away so that won't work.
So we're looking at the result of kind of years of research on something called energy
harvesting. And energy harvesting is a technology that effectively untethers the device. So
here's how it works, this is the basics. You have an antenna which is an energy receiver; you
have a capacitor which is an energy buffer. It’s just like a bucket you’re pouring electrons into,
and then you have the downstream application hardware. So this is a computer, it could be
a microcontroller, you can have sensors, you can have a radio so that you can send
communications from your device, but the basic model of operation is you have some energy
source over here like this big square antenna that I have on the slide, you have some empty
space and this can be 20 meters, 40 meters, that kind of range, it's not infinite range especially
if you're using like an RFID reader antenna as your energy source, and then you have a target
device like this kind of janky breadboard mockup prototype that we built in my lab
that is implementing some application.
So the future what we hope for is that someday you can cram one of these things in your head
and it can help you to understand, say you have schizophrenia or say you have autism or
something, this can help you understand why you're behaving a certain way on a certain day.
These kinds of applications, I mean these are the exciting applications, doing things in especially
medicine that we can't do today because you don't want to be cutting someone's head open to
replace batteries or to power it and that's literally how these things work today is you’ll have a
power pack and you need to charge it. So this is a big shift. This would be a kind of a dramatic
shift from the way things are done today.
>>: Is this for illustrative purposes or-
>> Brandon Lucia: That is for illustrative purposes. This is effectively clip art. This
is some existing medical procedure. I think that thing in the center is actually a battery. I'm
pretty sure that's a coin cell so I think this is a probe that someone put in for a short-term
experiment. This is a probe. That's some computation, probably a radio, and I don’t know
about this particular image because this is from Google, but-
>>: So we don't know the provenance of this image. I was just curious.
>> Brandon Lucia: We don't. But in general, experiments like this they’re short-lived because
you have to take them out and replace the battery. That's the big problem here. So if we go to
this energy harvesting that I was just describing about where we can take energy from the
environment, we can use radio waves, we can use solar energy and we can power these
devices, there's a caveat and that is that the energy that we get is intermittently available. We
don't always have energy available. If you're near a radio tower then you have lots of energy;
you can soak it up and fill your capacitor and run your device, but if you go far away from your
radio tower suddenly you don't have anything available, your device turns off. Even if you're
near a radio tower, you charge up a capacitor and then you start running your device, that
drains the capacitor, and the capacitor is going to drain much more quickly than it can charge
up. So when you run your device, unless you’re very careful in software, that's actually an open
problem- how to be careful in software, you drain the capacitor and so actually the device just
kills itself at the cadence of the charge and discharge of the capacitor. I'm going to tell you
more about that, but that should give you an idea that these devices are inherently unreliable.
This is an important point. They're inherently unreliable because you don't have a continuous
execution model because the energy is intermittent.
>>: I’m wondering can I solve it by adding a switch between two capacitors and charging up
[inaudible]? If I do have the energy.
>> Brandon Lucia: You can do that but it's not going to entirely solve the problem because then
you have two things that are going to be charging unreliably and you have to power the logic
that switches between them. What if we had three? Let’s add three capacitors. Now we can
switch between the three, but we may not have charge in any of them that's adequate to run our
device and now we have to have more sophisticated reasoning about which one we are going
to use. So we can try and fine tune the power system.
I think we should attack the problem in software and so that's what I'm going to talk about
today is actually how we can take this on in software, and there's a little bit of hardware stuff
I'll talk about later, but primarily I'm going to talk about how we can solve the problems of the
intermittent execution model, which I was just describing, and how we can think differently
about building systems, and in particular building software for these kinds of systems. I'm
going to talk about some system support that we've built for making intermittent executions
reliable, and then I'm later going to talk about some very recent work that we've been doing on
a hardware software platform for debugging intermittent devices. So this is filling in the gap,
the absence of a toolchain for understanding intermittently executed devices.
So to give you a clear idea of exactly what it looks like when we are running on intermittent
energy we have a plot here. The Y axis is the available energy and we have time going across
the bottom, and here we have a lightning bolt indicating we have some radio frequency energy
being transmitted over to our device and so we charge, charge, charge and then we hit the
lifeline. This is an important point in the timeline of this device. When you hit that point it
turns on, you start operating, the device can use its sensors, communicate whatever, and when
we are operating you see we precipitously drop off in the amount of energy that we have. The
capacitor is being drained very quickly because maybe we're being naive about how we are
using the energy in the software.
Then we have the death line. We run out of energy. By the way, the lifeline is around 2.4 volts
for a lot of microcontrollers; three volts is kind of the standard operating level for
microcontrollers, so 2.4 volts, just to give a concrete idea of the parameters here, is kind of what
that lifeline is, and 1.8 volts is the death line. And so if we are looking at kind of 100 microwatt
operation that should give you a ballpark of the kinds of devices we can support with this
model, with radiofrequency charging in particular.
So we hit this death line crossover and then we are not charging, we are dead, we are far away
from the radio antenna and we can't get any energy. A little while later maybe we get some
energy and we're charging and then we hit the lifeline again and okay cool, we can start
operating again. This is a complicated plot because there's a bunch of continuously varying
parameters. We have energy level and we have to find a way of measuring that accurately and
time is moving forward.
So part of the work that we've done in defining intermittent execution is abstracting some of
the details away. The way we abstract it is by noting that here the computer shuts down and
here the computer reboots. So really, instead of charging and discharging, all we need to see
from the software's perspective is that we have a sequence of periods of reboot and
computation. The red parts are when the device is turned off; the green part is when the
device is operating. This is what we call the intermittent execution model and this is stuff that
we published at PLDI in 2015. The intermittent execution model says you compute for a while,
then you drop out, and then some arbitrary period of time later, you don't know how long it's
been, you don't know what environmental changes have happened, that’s the intermittent
part, you turn it back on and you start computing again. And our goal is to make computations
that go longer than one green box. That's the hard part.
So just to look at what it looks like to run a computation, if we have some code like this you
want to append some things to a list. We have a main function at the top that's going to
append ten things and then we have the append function which bumps the pointer in your list
and then puts a character in the location that pointed to in the array. So let's run this on
intermittent energy and just see what happens. We could maybe run for a couple of loop
iterations and then we get a reboot. We’re going to start over then and we're going to do one
iteration, start appending, and we're going to get a reboot. We can do this all day and we’ll
maybe make it as far as three loop iterations or four depending on the size of our capacitor and
the amount of energy we have, but we are not going to be really successful at doing any useful
computation if that's our model.
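To make the example concrete, here is a minimal C sketch of the program as described; the slide code isn't reproduced in the transcript, so the names and the __nv placement macro are assumptions:

```c
#define BUF_LEN 10

/* size and buf live in nonvolatile memory (FRAM); the __nv macro
   stands in for whatever placement mechanism the toolchain offers. */
#define __nv __attribute__((section(".persistent")))

__nv int  size = -1;      /* -1 means the list is empty */
__nv char buf[BUF_LEN];   /* the character list itself   */

static void append(char c) {
    size++;               /* bump the list pointer...            */
    buf[size] = c;        /* ...then store c at the new position */
}

int main(void) {
    for (int i = 0; i < BUF_LEN; i++) {
        append('a');      /* append ten things */
    }
    return 0;
}
```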
We noted that we can model intermittence as a control flow problem. So this is a control flow
graph and it maps to the program I just showed you and we can add edges to the control flow
graph that essentially go back in time. So you're at some arbitrary point in the program and
according to an external environmental condition, it’s not data-driven control flow, instead it's
kind of an environmental one, you go back in time. So these are implicit control flows. That's
one thing that's really hard about this. There's nothing in the code that says from here you
could go back to the beginning of main. The programmer has to reason from each point in the
program that they may be able to go back in time. This is hard. Even if you're very clever and
you decided to use some kind of checkpointing system, which I'll talk about in a minute, this
problem doesn't go away. So a big challenge to doing intermittent execution successfully is
dealing with the fact that you can go back in time.
These devices that I mentioned, intermittently executing energy harvesting devices, typically
have a hybrid memory system. So what do I mean by hybrid? The hybrid is between volatile
and nonvolatile memory. So a small amount of volatile memory is typical, around 2K, and then a
larger amount of byte-addressable nonvolatile memory is typical. And a lot of microcontrollers
these days are using something called FRAM which has reasonably low access latency and
energy although it's still much higher than SRAM or DRAM. So we can use the nonvolatile
memory a lot like we use the normal RAM in the computer, and of course we can use the
volatile memory just like we would normally except that the volatile memory has this peculiar
property which is that its values disappear. So if we want to span periods of failure then we
have to deal with the fact that our memory gets erased all the time. And the nonvolatile
memory doesn't disappear. It keeps its values. This is straightforward but I'll show you in a
minute that that gets problematic if you're using the non-volatile memory just like RAM and we
are also using some volatile memory like RAM.
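As a rough illustration of that hybrid split, here is how data placement might look on an MSP430FR-class part; the attribute spelling is an assumption about the toolchain (TI's compiler offers #pragma PERSISTENT, msp430-gcc a persistent attribute):

```c
#include <stdint.h>

/* FRAM: survives every power failure, whether you wanted it to or not. */
__attribute__((persistent)) uint32_t samples_logged = 0;

/* SRAM: erased on every power failure, so this value must be
   recomputed or restored from a checkpoint after each reboot. */
uint32_t scratch;
```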
So zooming out for a minute the big idea is that the tools we have and the mental models we
have for building software don't align with the intermittent execution model. And in fact,
something that is correct when you have continuous power gets really weird and sometimes
incorrect in unintuitive ways when you do intermittent execution, especially when you start
mixing together non-volatile and volatile memory. So I'm going to talk about a new category of
bugs which are related to intermittence and the hybrid memory system. I thought this was a
really gross comic which is why I included it.
I'm going to talk about intermittence bugs now and you guys can stop me if you have any
questions along the way. I'd be happy to keep taking questions. The first intermittence bug
that I'm going to talk about is an out of thin air value that you can get and this shows up, so the
set up here is we have the same program we were looking at before. Now we could say that
there is a checkpoint at the beginning of that loop and we can go back to that checkpoint. Now
the set up for the checkpoint is capturing the volatile state of the program and when we fail we
are going to reboot, repopulate the volatile state. We're not going to do anything to the rest of
the non-volatile state because it seems like we don't need to checkpoint that, it’s not volatile;
we can just keep that around and so we'll go back to this checkpoint at the beginning of that
loop and we'll execute forward again.
So if we do that we can see that size and buf are in gray, that means they’re in the non-volatile
memory, and they're initialized to buf is an empty buffer of characters and size is negative one
meaning we have nothing in the list right now. So if we run the program like I showed you
before we get through incrementing size and then putting a character in the buffer. This is
pretty straightforward. This is easy to understand how this state gets manipulated there and
then we fail. So we were at this point and let's say we hadn’t made it around this loop. Like I
said, there was a checkpoint there, so each time we go around the loop we are going to
checkpoint and say we captured that we did one, two, three loop iterations but we died right at
the end.
So now we're going to restart. We are on loop iteration one again but that nonvolatile state
doesn't get erased when we turn the power off. That's going to stick around and so what we
see is we have an out of thin air value. A gets appended twice for the i equals one iteration. It's
a buffer state that should never exist. It's impossible. In a continuous execution we won't be
able to see this behavior. In fact, we can run this out to its kind of natural conclusion. Let's
keep going around this implicit loop. It's still loop iteration number one but if we just turn the
power off repeatedly like that we’re going to keep updating that buffer and eventually we get
the size to 11. That's a value we could never see before. We have this array. It has 11 A’s in it.
That's a value we could never see before. So we can produce values in the execution of a
program that if you run on continuous power you would never encounter. And the important
thing about this is the failure is dependent on the availability of energy and the rate of charge
and discharge of the capacitor. That should be concerning because that means if I'm using
radiofrequency energy to charge my device and the transmitter is here and the device is here I
can do this and change what the software does. That's a system behavior you probably don't
want.
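Spelled out as a schedule on the earlier sketch (an illustrative reconstruction, not the talk's slide):

```c
/* The checkpoint at the top of the loop captures only the volatile
   loop counter i; size and buf are nonvolatile and survive the
   reboot untouched.

   i = 1:  size++        -> size = 0
           buf[0] = 'a'
   -- power failure, reboot, restore checkpoint: i = 1 again --
   i = 1:  size++        -> size = 1   (reads the surviving value!)
           buf[1] = 'a'                ('a' appended twice for i = 1)

   Repeat the failure often enough and size reaches 11: a buffer
   state no continuously powered execution can ever produce.       */
```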
There's another kind of bug that we encounter. This is an atomicity violation in the
conventional sense of an atomicity violation. Some things that should happen all at the same
time don't necessarily happen all at the same time and you get values that end up being weird.
In this example program we want the increment of size and the population of the buffer to be
atomic with one another and we can clearly see that if we kind of execute up to the point
where we are incrementing size and then we have a failure and then we restart, well we
incremented size twice and we only put one character into the buffer so now we have this
weird atomicity violation manifesting as corruption of the state. You have the size equals one
and you have the first entry of the buffer is unpopulated. So atomicity violations are another
problem we run into. We can get memory corruption in the nonvolatile memory. Remember,
we're using checkpointing here. We’re preserving all the volatile state in a checkpoint and
we're going back and repopulating it.
Another problem that we can run into is if we have some high energy cost piece of code we
could get into a situation where we started a checkpoint and try to run forward and regardless
of how hard we try we just can't make it through the region of code between the two
checkpoints, and this is something that could vary with environmental conditions, the rate of
charge of the capacitor, and the programmer has no visibility into whether their program is
likely to or with what likelihood will succeed and so this is problematic. We eventually end up
in an infinite loop and this program works correctly if you plug your device in. That's one of the
big problems here is the programmer is probably going to do most of their initial development
using continuous power and they'll say my program works, all I have to do is foo and then highcost and then whatever, and in fact when you go over to the intermittent power supply you're
going to see that this program doesn't actually appear to do anything.
So I've given you some evidence that these devices are too unreliable to put inside your body
right next to your heart. I wouldn’t want to do that. We need better programming abstractions
to make these things reliable, and I think we need system support to go with the programming
abstractions to enforce some reasonable guarantees on the behavior of software that's running
on intermittently executing devices.
So we came up with a couple, and I'm going to not go into a lot of detail because I've given
some rendition of this talk at Microsoft before so I’ll skimp on some of the details here, but I'm
going to give you an overview of the kind of system support that we have in mind to make
these devices more reliable.
So in PLDI 2015 we had a system called DINO. DINO is an acronym for Death Is Not an Option.
And this addresses the problem that I pointed out before that checkpointing doesn't get us out
of this mire. If you have non-volatile memory in the system and you have volatile memory in
the system you can end up with these problematic inconsistent memory states where your
nonvolatile memory has values in it that you don't expect. So if we take a program like this
one, we can convert it into a collection of tasks. Tasks are demarcated by static boundaries in
the program; and tasks are dynamically defined: as you flow through the statements in a
program you may traverse a task boundary that ends the previous task and begins the
next task. Our system supports task-atomic semantics, meaning that when we start a task, if it
completes then it’s as though that task executed without being interrupted by any power
failures. So we eliminate the atomicity violation problem and we eliminate the out of thin air
value problem as well.
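Continuing the earlier append sketch, here is roughly what a task boundary looks like in the code; the DINO_task() spelling follows the paper's style, but treat the exact API as an assumption:

```c
extern int  size;        /* from the earlier sketch, in FRAM        */
extern char buf[];
void DINO_task(void);    /* boundary call provided by the runtime   */

void append(char c) {
    DINO_task();   /* task boundary: checkpoint volatile state and
                      version the nonvolatile data (size, buf) that
                      the upcoming task both reads and writes       */
    size++;
    buf[size] = c; /* if power fails before the next boundary, we
                      resume at DINO_task() with size and buf
                      restored, so both updates commit atomically   */
}
```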
The way that it works is by selectively preserving not just the volatile memory but also grabbing
pieces of the non-volatile memory that through program analysis we realize that we need to
preserve otherwise we'll end up producing those funky memory states that I showed you
before. Another important thing is that we eliminate these effective control flow edges, these
implicit backward control flow edges, because we have these task boundaries that have
checkpoints and [indiscernible] with regard to control flow. So when we have a failure we go
back in time to not an arbitrary point in the past where maybe a checkpoint was collected or to
main or something like that but rather to a point in the program that the programmer statically
defined. This is helpful because now we can understand if we look at the program from a
certain point in a function there's a small set of points backwards to which we could go on a
failure. So we've eliminated this kind of arbitrary implicit control flow problem that we had
before.
>>: It seems like all of the bugs are related, they’re very similar to concurrency bugs. This has a
feel of transactions, in some sense, where you have this [inaudible]. But transactions are really
hard. So are there things that people have solved in concurrency they can be applied in this
domain?
>> Brandon Lucia: That's a great question. This is a lot like concurrency. And in fact, in the
paper that I have cited here and you can go and take a look at it, we framed this problem as not
just the control flow model I showed before with implicit backward control flow edges but
also as preemption. It turns out that if you have two pre-emptible tasks that access the same
state which is like a data race then you can have one of the problems that I showed you
before.
Now with regard to the similarity to transactions I think you're right on the money with that.
We want transactional semantics. We want tasks to be like transactions. There's a few
differences that make this model maybe a little weird but also really implementable. One of
those is that we have kind of statically bounded but dynamically defined regions. So you can
have, this is kind of a weird thing. Transactions are typically scoped with a beginning and an
end and each beginning has one end. In our case you could have an initial boundary and then
some statements going forward and then a branch and you can have a terminal boundary on
either side of the branch. So now there's, for that given start task boundary, you have two
potential end task boundaries so that's just kind of a mechanical difference from transactions.
We do have some problems that transactions also have like open nesting and IO and the fact
that you have a world that you're operating in and the effects of that world whether it's
through other software modules or through IO interfaces, can make your life more difficult
when preserving the atomicity of your tasks. Great question though.
We want a lot of the same guarantees transactional memory provides and so going forward
we've actually been looking at how we can kind of take some of the big ideas there and boil
them down. So we are actually constrained in a way that a lot of transactional systems aren't
and that is doing rollback is expensive. We need to consume memory to store a log of
everything that we've done and if we encounter a conflict with some previous failed incarnation
of the same task that we are trying to execute we need to go and repopulate from that log or
maybe replay parts of the execution. We have a flavor of that here, going forward though we
need to move away from that model. So that part of transactions I think is similar now but
should not be similar in the future. It's too expensive in this domain. You can’t spend the
energy to go and repopulate the entire state of memory after a failure based on some log or
something.
>>: It also seems like, correct me if I’m wrong, another difference is that there's no such thing
as a local access because every single memory access you do is in the transaction whereas
chances [inaudible] only going to be read locally. There's no such thing as local when
everything is [inaudible].
>> Brandon Lucia: That's an interesting point. So something I’m not going to talk about today
but we can talk about some other time is distinguishing between volatile memory and
nonvolatile memory and treating volatile memory more like local memory and treating
nonvolatile memory more like global memory because you have this nice property that when
you fail your volatile values all go away so you don't have to worry about partial results being
preserved across a failure like you do with nonvolatile memory so in a way that’s sort of like
your local memory. Nothing else can mess with it. No previous executions can mess with it,
but you need to make sure that you reinitialize it every time you start a task and then run
forward. So if you apply that programming discipline to your volatile memory it begins to look a
little bit more like the local memory in transactional memory systems.
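A small sketch of that discipline, with made-up names: volatile locals are reinitialized at the top of the task, and the only write that survives a failure is the commit to nonvolatile memory:

```c
#include <stdint.h>

#define N_TAPS 8
#define __nv __attribute__((persistent))   /* assumed, as before */

static const int32_t coef[N_TAPS] = {1, 2, 3, 4, 4, 3, 2, 1};
__nv int32_t window[N_TAPS];  /* sensor samples, kept across failures */
__nv int32_t nv_result;       /* the committed output                 */

void task_filter_sample(void) {
    /* Volatile local: reinitialized every time the task (re)starts,
       so a partial sum from a failed earlier attempt can never leak
       into this execution. */
    int32_t acc = 0;
    for (int i = 0; i < N_TAPS; i++)
        acc += coef[i] * window[i];
    nv_result = acc;  /* the single commit that outlives a failure */
}
```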
So I'm not going to dig into this in excruciating detail because the example is pretty simple, but
this figure shows how our system that I just described, with tasks and checkpointing and
selective versioning of the nonvolatile memory, eliminates bugs. So here's how it eliminates the
out of thin air value problem. We capture the checkpoint and this captures those nonvolatile
memory locations and it should be straightforward to see that if we preserve those here and
fail before we hit another task boundary when we restart, as long as we restore the values of
buf and size that we had before, we can go forward and we have a consistent state of memory.
This should be pretty straightforward. So we're essentially adding to the checkpoint is one way
of thinking about it. It’s important to point out that prior work on this overlooked the need to
do this versioning of nonvolatile memory. So there are systems out there, software systems
and hardware systems, actually there's another paper, and I'm not just picking on
[indiscernible] here, there's a system called QuickRecall that implemented volatile-only
checkpointing in hardware even, and if you add nonvolatile memory to the mix, which I think their
devices even probably assume, it's problematic. You get the wrong result. So I can pick on this
one because my co-author on the paper that I'm talking about here is a co-author on that paper
too. So he happily welcomed the correction.
So the question though is which parts of the nonvolatile memory do we have to capture in the
checkpoint? Which parts do we need to version? We need to make conservative assumptions
because we want to do this statically. We want insight in advance. We can’t do any runtime
reasoning; that's too expensive. So if we have a task boundary we have to say which parts of
this program do we have to capture? We can look at how data flows. So if we look at our
control flow graph we can see how data flows along the control flow arcs in that graph and we
can see that on the implicit edges that we added if there is no flow through nonvolatile data
then we don't need to capture that nonvolatile data as part of the checkpoint. So you can see
here we’re just assigning into this non-volatile array, we're putting i into bin i in that
nonvolatile array and so we don't need to preserve it because when we restart we're going to
go back to the task boundary and we're going to just blow all those things away. This is just
straightforward that we don't have to capture those.
However, if we change this program a little bit and we say that we're going to be incrementing
each of the bins in this nonvolatile array now the story is different. Now we do need to
checkpoint that because we have a data flow around the loop that’s created by that implicit
control flow edge. So if you store something and then you restart you’re going to see the result
of what you did before, you're going to see the incremented value, and you’ll get wacky values
all over the place. So the key insight here is that we have to version the data if we both read
and write the value. That's something that we can look for in the compiler and that's actually
what we implemented. We look for situations where we have both kinds of operations and
that’s actually a superset of what you need to look for. That was conservative. We could
have trimmed it down. That's what we did in our compiler because it was easy to implement
and it works to keep things correct.
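The two cases from the example, side by side; a sketch of the rule the compiler applies (the __nv spelling is an assumption as before):

```c
#define NBINS 16
#define __nv __attribute__((persistent))

__nv int bins[NBINS];

void write_only_task(void) {
    /* Write-only across the implicit edge: a restart simply
       overwrites every bin again, so no versioning is needed. */
    for (int i = 0; i < NBINS; i++)
        bins[i] = i;
}

void read_modify_write_task(void) {
    /* Read-then-write: a replayed iteration re-reads the already
       incremented value and bumps it again, so bins must be
       versioned at the task boundary. */
    for (int i = 0; i < NBINS; i++)
        bins[i]++;
}
```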
>>: [inaudible] you really need to see a data flow around [inaudible]. You need to have a use
[inaudible] which means there’s a path to use from the beginning of the task.
>> Brandon Lucia: But some of those are okay because if you have an [indiscernible] use-
>>: You have to have a definition that-
>> Brandon Lucia: Yeah, yeah, yeah. Definition later. I think that produces a data flow around
the implicit control flow edge.
>>: So you can make it more precise that way?
>> Brandon Lucia: Yeah. We're being conservative. We're saying if we see both of these on
the path to a task boundary then we would preserve that in a [indiscernible] way.
>>: [inaudible] results let me take a shot at summarizing the exact requirement, if it is exactly
[inaudible]? Anything [inaudible] is free, anything [inaudible]?
>> Brandon Lucia: That's true. If you put a task boundary at a point that defines all forward
tasks to be [indiscernible] tasks, in other words encloses, it never cuts one of those edges, then
every checkpoint is free and every one of those boundaries that you put in is free.
>>: [inaudible]. And once you know the checkpoint then you have [inaudible]?
>> Brandon Lucia: Maybe. There's another factor and that is sometimes you don't want to put
a checkpoint or sometimes you want to put more checkpoints both of which could mess with
[indiscernible]. The reason you might want to do that is if you have IO and you want to, for
example, put task boundaries tightly around IO or if you need several IO operations to be part
of the same task because you're correlating values from two sensors and if you collect one
sensor value now and one 10 minutes from now then it doesn't make any sense. So subject to
those other constraints which by the way are not edge cases, they’re very real in this
environment like this environment is all about IO, you're doing sensing and you’re processing
and you’re transmitting and you may be able to get those [indiscernible] things but not always.
It's a good observation.
So this example reveals a fundamental limitation of checkpointing and this really sucks. And this
is something that we are trying to get away from because it's very expensive to implement.
This is what I was saying to you just a second ago. If we can't know in advance where we might
be accessing we need to capture the whole thing. That's a huge drag especially when your task
boundaries could be subject to external constraints like the position of IO in your code. You
could be really hosed. This is a really contrived example, we have a million entry array and we
are choosing a random element of it. That’s probably not what we're going to be doing, but
you can imagine how this manifests as control flow deciding which things you’re going to be
accessing. So my view is that in the future checkpointing is not a viable solution for solving the
intermittence problem. We need to do something else that preserves state, keeps it consistent,
gives good guarantees, but isn't reloading values every time we restart.
So we built the prototype, this is sort of an unremarkable prototype. It's the way that you
would expect it to be if you have done this kind of work before, and I know most of you have. So
we start with a program, you have to put these annotations in, our compiler does a data flow
analysis and finds the situations we were just talking about, and then the compiler links in a
runtime library that does the checkpointing and the data versioning stuff dynamically as
needed, as you cross task boundaries. That’s the implementation.
We evaluate on a bunch of different benchmarks that we got from kind of various places. We
have MIDI, the musical instrument digital interface, we have sensor logs and various stuff, we implemented
for a custom breadboard prototype and for the WISP 5 platform, which are two different
radiofrequency energy harvesting devices, and if we don't use any support for these things we
have errors and the systems don't work. All of these applications produce the wrong result or
they have a hard-stop failure that we can't recover from because, for example, the
histogram that we were storing sensor log readings into is corrupted. One of the bins got
unlinked and so now what are we going to do? We have no way of recovering the value.
So if we look at one benchmark, like an activity recognition benchmark that takes
accelerometer readings and decides using a nearest neighbor classifier whether you're having a
tremor which is kind of a medical condition that we can identify pretty easily, it’s a wearable
medical device kind of application, then we can see how much error do we really get if we are
running at say 10 centimeters or 20 centimeters, pretty close to the power source. This is a
radiofrequency power source and these distances are fairly short because we wanted high
fidelity in the experiments without having to do 26 gazillion measurements. As you go further
away the variance becomes higher in the measurements so we would have had to do more.
So we went out to 60 centimeters, which is not kind of notably far, but we saw that even out
there we had edging toward 10 percent error in our measurements and that's pretty close. So
if we want to use this device in a real environment where say we have instrumented rooms in a
hospital with RFID readers and we’re going to be going in and out of range say 10 to 20 meters
of these things we are really going to have to improve the fidelity of this because the error
seems to scale maybe with diminishing trend toward higher error at higher distance.
>>: Suppose that I have that exact scenario where I'm in a hospital and one question is are the
errors uncorrelated because you could imagine having two devices, for instance if the errors are
uncorrelated, if it's really particularly like a blip in the air then the probability that actually both
the devices, for instance, have an error when I enter a room goes down exponentially.
>> Brandon Lucia: That's a good idea. I think we should probably do that. The issue is going to
be-
>>: It depends on the correlation of the error. I'm kind of asking about the error.
>> Brandon Lucia: I don't know if the errors are correlated. I suspect, part of me wants to say
yes because you’ll have more errors if you have more failures and so if you're further away
from the power supply you're going to be dropping out more frequently so that means that
both systems are more likely to be failing and so they're more likely to have errors, but then on
the other hand it’s a function of the software and we don't know. They might not be in the
same place, they might be just out of phase enough, it could also be a function of the hardware,
maybe you used a different memory technology in one device than the other to try and be
more robust, and an issue that I see in implementing this kind of system is that you have a
distributed system now and if they're computing the same value they need to agree on which
value they computed or you need to build your application in a restricted way so that they can
produce an arbitrary string of values and you know that whatever the strings are that the two
devices compute you can put them together and make something sensible that your application
cares about. In either case I think there's some distributed systems coordination stuff. You're
not going to run Paxos on these things just because there's not enough energy to take
a checkpoint, so good luck with Paxos. I think you would have to do some coordination stuff.
They can communicate. They could actually communicate with one another and that might be
one way out of this.
So with our system, no error; that’s because we eliminate it by construction. So the point here
is that we can compile programs differently and get better reliability guarantees for these
inherently unreliable devices. That's the punchline. We eliminate the error because we
compile the program differently and put runtime support in to prevent these things from
happening. Now I'm going to switch-
>>: [inaudible]?
>> Brandon Lucia: The cost, I don't have it in the slides but in the application we were just
looking at it’s hard to measure the overhead but it's about 2x slowdown on the baseline
implementation with the caveat that the baseline implementation is incorrect so it takes about
twice as long to complete the computation, and that's not just because we add more cycles to
each loop, but because if you arrive at a task boundary and your bucket is half full you have to
do more work and so you're more likely to fail and then you have to go back to the beginning of
the task and spend a quarter of your bucket to repopulate the memory and then you can
complete the task that you were doing. So it's not just the additional work per task overhead
it’s the additional work of wasted work in failing in tasks that you wouldn't have otherwise
failed in. So that's why it's 2x. To me this is a surprisingly high overhead number for kind of
sparse checkpointing because we are manually placing checkpoints. Those checkpoints can be
as sparse as we want them to be. That's why it’s 2x though is because we have not just more
work per task but because we have more failed tasks.
>>: I'm not sure if my question is fully formed yet but I'll try it. So you kind of made this
comment that checkpointing is something you guys want to move away from. You had this
very concrete example of [inaudible] an array that's a billion elements. It seems like that's
from, your programming model is very low level [inaudible]. So one thing you could think
about is a set of abstractions where you still allow it to have checkpointing but you've put the
right kind of data structures, whatever those things are, that let people do certain class of
operations easily where you can still [inaudible] checkpointing. So I guess to kind of come back
to the question is have you thought more about [inaudible] abstractions that help people do this
work as opposed to checkpointing? Maybe that's stealing the rest of the thunder.
>> Brandon Lucia: No. That's actually a talk I'll be giving three months from now. The basic
idea is that we can do it by sort of decomposing checkpoints into the individual
memory locations that we want to be accessing and if we force the programmer to adhere to a
discipline in the way that they use their nonvolatile memory then we can eliminate the need for
checkpointing but still have idempotent restart. So we can talk more about this offline. It's
basically roughly what you said except that we are not using any kind of ADT abstraction. It's
more of kind of general, you can think of it like a key value store abstraction, and then working
forward from that you can do arbitrary computation, build up arbitrary data structures. The
task abstraction is pretty similar though. That's roughly the same. It's a little bit different but
it's roughly the same.
Cool. So if we want to debug these things, we have all these weird bugs, and we need to
change the game a little bit. I'll show you why we need to change the game a little bit.
Debugging is a particularly hard problem because the intermittent behavior that we are seeing
not only causes conventional software bugs like the ones we're already fairly familiar with, but it also can
to kind of the hardware behaving in ways that are apparently invisible so both of these are
problematic. I'm focusing on software here but I wanted to mention that there is also the
possibility for the hardware to go wonky too. So debugging is something we need to be paying
a lot of attention to going forward.
So I showed you before how we can get intermittence bugs and these can lead to application
errors. Like if we have 11 then this is obviously a problem. We don't want 11 entries in our 10
entry buffer. And we note that this failure only manifests when we're running on intermittently
available energy, when we're running on intermittent power. So if we have the radiofrequency
harvester pointed at that device we’re going to get this bug manifesting when we have a
reboot here halfway through our loop and then we start again and we execute and we
increment; now we are at 11, now we have this problem. So we don't want this to happen
because this doesn't correspond to any continuously powered execution. Our group is
approaching this set of problems with that as a correctness definition so when I say we're
debugging we are trying to move toward a more correct program. We want to permit
behaviors that are permitted in continuously powered execution. This is a reasonable
definition of correctness that we are pushing towards.
So what do we need to do to understand why this bug happens and how we can fix it? What
we need is to look at how the energy is changing, like on this curve that I showed you before,
and what's going on in the program at each of those points when the energy is changing. So
here we have size incremented and buf incremented and then we had a failure on that curve
and we saw that when I showed you the example and explained it we saw the why that was
happening. But if you're just writing an application and one out of every 10,000 executions fails,
good luck figuring out that this was happening. You have events A, B, and C on the downward
slope of your discharge curve; then you restart and you have A and B again and that leaves you
with a corrupt state. This is difficult to reason about because we don't have any visibility into
the device. We want to be able to answer the question what was happening when that weird
thing happened and we saw our application produce the wrong value. What was going on
inside the device?
So the problem with trying to diagnose this type of failure is that typically we are in this kind of
cartoon setup. You attach a big chunky plastic box to a power supply like a USB port and you
attach the other end to your target device. You have visibility into your device. This is great.
You can do JTAG debugging, you can get all the logic bugs out of your program, but you're
providing power so your discharge curve looks like this. You're at full voltage the whole time.
This grayed-out part never happens. You never see the bathtub of kind of discharging and then
recharging again. So we'll never get 11. You can run as many times as you want to. You’re only
ever going to see 10. So that's a real drag. We can see the failure behavior run on intermittent
energy but not be able to look inside the device and use JTAG or use any kind of debugging
console, or we can power the device and look at all the registers and everything as the
thing is executing but that doesn't help us very much because we'll never see the failure. This is
the problem that we have. This problem is called the energy interference problem. No device
that is on the market today or from research solves this or addresses this.
>>: [inaudible]. Couldn’t you just, again taking the idea that this is an instance of a concurrency
related bug, use concurrency-related tools in an emulator? So for instance, I could imagine-
>> Brandon Lucia: You use an emulator. The emulator is the problem there. You have to be
able to model the behavior that is causing your bug in the emulator. You could just-
>>: Fine. But it's a nontrivial mapping. Maybe I could say something like, maybe I should take
this offline because I'm not sure how to ask my question. But the point is that it seems like you
could come up with [inaudible], for instance of a comparing execution that gives you a value of
11 for your [inaudible] bug. You could change things to be global in memory.
>> Brandon Lucia: That's interesting. We actually have a short-lived project that we haven’t
followed up on which is essentially simulating arbitrarily timed power failures and hoping that a
single power failure leads to the corruption that we are looking for. We found that while
sometimes it is in other cases we needed multiple power failures to get something really bad
to happen otherwise we wouldn't end up getting the wrong value.
>>: [inaudible]?
>> Brandon Lucia: It’s a package. It's one of these implicitly defined packages. So we could go
and instrument the program-
>>: To add the package.
>> Brandon Lucia: Let's get together and build that thing. We need the hands. That's what we
need. That's a good idea. The connection to concurrency is deep. Like I said in the PLDI
paper we kind of identified this and you can use the same kind of theoretical framework for
reasoning about this. Data races in particular are a nice way of thinking about it.
So then here we can also use logging. Logging is great. If we have an intermittently executing
system like this then we want to understand what the behavior was. We could maybe log the
energy periodically or we could blink an LED or something. The peripherals that we have are
pretty limited. Printf we could implement over like a UART and that's great, but all of
these things cost energy and so we are effectively changing the discharge curve by putting this
stuff in. It's an observer effect and in some cases it's pretty rough.
If you want to blink an LED when you have some condition, like maybe you blink your LED
when you're adding an element to the list and you want to see and then maybe you're blinking
another LED when you get down to 1.8 volts so you know when you're about to die, if you can
see really fast when those LEDs are blinking at the same time then that might be when your
failure condition happens. So think about that as your kind of optimistic hopefully this works
kind of debugging model. This is actually what people do. People do things like this when
they're doing embedded development. It gives you a situation like this though where you're
killing the device by doing a log and trying to blink the LED. So even if that would work you're
changing the system's behavior and you get this energy interference gap where the behavior
that you're looking at is different by some constant amount of energy from the behavior that
you are trying to investigate. So you may mask that. There's also the problem that you need to
know more or less what you're looking for in order to log it and try to find it. It’s the same as
doing any kind of tracing, printf-based debugging. So this is pretty bleak. This isn't the way
forward. This is a stopgap at best if you can make it work, if you can find some signals to get off
your board.
You can use an oscilloscope. That's a nice idea. I just sent a proposal for one of these and it
came out to about 29,000 dollars including all of the probes if you want digital probes to grab a
few signals coming out, you want to look at energy. By the way, the correlation between what
your program is doing and what the energy is doing is totally invisible to the oscilloscope
because you can't probe the internal logic of your microcontroller. So if you want to get a good
measurement of what the energy is you can do that with the oscilloscope. This isn't the way
forward either. Maybe combining the oscilloscope and logging and doing logging while you
have an oscilloscope trace that's how we developed what I'm about to tell you about and I can
tell you from that experience that that's extremely difficult and time-consuming and it wastes a
lot of time with false starts because the instruments begin to lie to you after a while. So it's
hard to debug these things. Oscilloscopes, logging, direct measurement, none of these work.
So we developed a new thing which is called EDB. This is the Energy Interference-free
Debugger. This does the things that I described us wanting before. We can correlate program
events and failures and certain value conditions being satisfied and certainly energy conditions
being satisfied with the energy behavior of the system. The device that we designed, I have
one right here; it's real hardware, we built this in my lab, we laid out the board; an undergrad
actually did a really awesome job laying out the board, I was impressed. It can not just monitor
the program state and the energy state; it can also manipulate that program state and the
energy state. We support all the kind of conventional familiar debugging operations like
assertions and breakpoints on values or energy. We have an interactive shell, a
Python library with an interactive shell, so it's really easy to use. We’ve kind of tried to make it as
simple as possible so that you can really debug these energy harvesting devices in the same
way that you debug normal embedded systems.
So as for monitoring we have access to all of the program state, we implemented a thin client
runtime layer that we can ask questions of and then it sends us values back, and we can read
the target energy level which is nice because then we can correlate those points in the program
that I just described identifying using that client runtime layer with direct measurements of the
capacitor on the device. We can read IO messages, we can actually decode things as they go by
through the IO layer by capturing on this board and then proxying the communications out, so
we intercept and then forward is one way of doing it or we can passively just monitor the
communications as they go by. So we can grab IO, we can grab program state, we can grab
energy, this is all passive and it's all energy interference free. I'll tell you some numbers at the
end, but I think here we were like nano amps of current, and so if you think about two-ish volts
this is a negligible amount of power. This is an amount of power that doesn't make any
difference over the long-term.
>>: I don't quite understand why it's so cheap. [inaudible] make it cheaper than [inaudible]?
Why is it so cheap?
>> Brandon Lucia: Why is it so cheap?
>>: [inaudible] target energy level and the [inaudible] state. Why is that so cheap?
>> Brandon Lucia: Because all we're doing is using the ADC on this board and we don't need to
use the ADC on the target board if we connect wires at the right place. So that's why there's
this six pin header on the bottom here. If we pop this onto the header that we designed to fit
onto, and in our case we targeted the WISP 5 because it's kind of the most mature platform
that's out there right now, then the pins that we connect to they expose the power over a
couple of lines then we can read them.
>>: [inaudible]?
>> Brandon Lucia: In this USB port we are not really plugged in. I should've mentioned that.
We are continuously powered so we don't get any interruptions. So we have a continuous UART
connection to a Python console, we can give commands all the time, and then this thing is
blinking on and off like it normally would be a thousand times a second or something and we
can reliably get state out.
Now say we want to do some more expensive monitoring actions than just reading the energy
level or reading target IO or hitting a breakpoint or something. To do that we need to be able
to be a little bit more sophisticated. We need to be able to manipulate the energy level and
what we do is we lie to the target about what was happening. We can charge up to
compensate for energy that the target uses doing debugging operations. So if there's a whole
bunch of kind of ad hoc breakpoint debugging operations we assess how much energy that took
and we can top off the capacitor. If we want to go into an interactive debugging mode we can
just power the device, record what the energy level was, power the device and then go into
arbitrary debugging activity and then reset the energy level to what it was before we went into
debug mode. And we've implemented all of that and we can get, again it's a variation of kind of
I think 2 to 4 percent of our capacitor is the error that we insert in the energy level of the device. So if
you're operating and you want realistic conditions we're within 2 to 4 percent.
So I'm nearing the end of my talk here, but there's a couple slides that I think are cool examples
of how you can use energy interference free debugging support. So say we have a program and
we want to insert elements into a list and then we want to do an expensive check to make sure
that the list is consistent, so each element’s next pointer's previous pointer points back to the
element itself. This is a pretty simple list consistency check. If we do that we might see these
awfully colored magenta circles that correspond to the consistency check happening. If we have
one element on the list then maybe we do some main loop work after we did the check for the
one element. That's easy. But then later we add two checks here with two elements on the list
and then three checks here and then we kill the device. Now we are not making any progress in
the main loop because the assertion is toasting us. So if you want this kind of either beta test
or development time assertion checking support to make sure your program logic is correct you
can't use it. The check itself is what's going to kill you. So instead what we did is we use EDB's
ability to manipulate the energy level of the device, continuously power it, then reset the
energy level to what it would have been and then do the main loop work. So by running it on
continuous power we can make forward progress under realistic conditions despite the fact
that we are doing these invasive checks.
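For reference, the consistency check in that example is roughly this shape; a hedged sketch with an assumed node layout:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node layout for the example's list. */
typedef struct node {
    struct node *next, *prev;
    char value;
} node_t;

/* The expensive assertion: every element's next pointer's prev (and
   prev pointer's next) must point back at the element itself. The
   cost grows with list length, which is why the check alone can
   drain the capacitor before the main loop ever gets to run. */
bool list_is_consistent(const node_t *head) {
    for (const node_t *n = head; n != NULL; n = n->next) {
        if (n->next && n->next->prev != n) return false;
        if (n->prev && n->prev->next != n) return false;
    }
    return true;
}
```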
So the last thing is we've integrated EDB with other hardware. So Tom mentioned earlier that
we are launching a satellite. So the satellite on one side has an application processor that has a
magnetometer and gyro. We're going to send this into space, it's going to orbit the Earth at
6000 miles per hour and collect a bunch of radio measurements, but then on the other side we
have this, essentially; and this is going to be operating as a beta test diagnostics collection
platform so that we can monitor the behavior of the system while it's winging around in space
under conditions that you can really only get if you integrate this hardware with your
application and run under realistic conditions. It's not even like we can emulate this in a lab.
We’re going to be in space, there's going to be all sorts of wacky errors; there’s going to be all
sorts of intermittently available energy as the thing is tumbling and not in view of the sun. So
by integrating this together we have a reliable test framework and we can do this kind of
realistic beta testing and get useful information to help us fix our application if it’s not doing
quite what we want.
So I'm really excited about the space launch, March 23 is when we're going to send it up. We
already shipped it to the launch company, so assuming everything goes well and there are no
surprises along the way we should be in the sky at the end of March.
>>: [inaudible]?
>> Brandon Lucia: I don’t know. I'll send around a link for sure. So that's pretty much what I
wanted to talk about today. I wanted to give you guys an idea about what intermittent
execution is and why it's tough to reason about. I showed you that we have some system
support for making unreliable executions reliable despite the intermittence, and I hope that I
convinced you that debugging intermittent devices is hard and that this is a good way to do it. I
have these available. If you guys think you'll be using intermittently executing hardware I can
send you some devices too. We have a whole big bag in my lab. Cool. Any other questions?