>> Tom Ball: I'm Tom Ball, and it is my great pleasure to welcome Brandon Lucia back. He was
a researcher at Microsoft Research, and now he is an assistant professor at Carnegie Mellon
University. He does amazingly cool stuff with hardware and runtimes and he's about to launch
into outer space via SpaceX some cool little circuits, and I guess we're going to hear a little bit
about what the challenge might be if you want to put something in outer space that doesn't
have its own battery pack. Welcome back.
>> Brandon Lucia: Thanks for the introduction. This talk won't be about space entirely. It's
going to be about a general problem that we run into in space which is intermittently available
energy and energy harvesting. So there's a lot of different design challenges that you run into
when you're trying to build devices that you want to run off of energy that you're borrowing
from your environment. So I'll talk about what the challenges are, and then I'm going to talk
about some system support that we built for that, which kind of dropped last year, and then I'll
talk about some stuff that we have going on right now which is kind of the state-of-the-art in
debugging devices that are doing intermittent execution in energy harvesting.
I'm from Carnegie Mellon University, the electrical and computer engineering department, and
the stuff I'm talking about is what I've been doing and my students have been doing together so
it's kind of a big happy family group effort from over there. So I probably don't need to tell you
guys too much. The motivation for this kind of work is that we have little devices everywhere.
So your medicine bottle isn’t just a bottle anymore. There’s like a computer in there that’s
telling you how many times you've opened the lid and where did that come from and where is
it going so you know if your medicine has been somewhere that you don't want it to be or in
the hands of someone you don't want to have it; and beyond the medical domain, we can put
these devices into space and they can do things like try and find resources extraterrestrially
that we want to mine or bring back to earth, and lots of other
really cool stuff.
So emerging devices are everywhere and a lot of these, especially if we are looking at kind of
medical applications, we need the devices to have high reliability. In particular, things that are
expensive or dangerous need to be reliable. So for the most part, when we build devices using
little embedded systems like this we are tied to a power source like two AA batteries or maybe
you have to plug into a USB or maybe have some big chunky power pack or something like that.
So this is problematic because you can’t put these devices everywhere. They're not tiny,
it's hard to put them inside of your body for medical applications, and if you're going to launch
something into space you probably want something more sophisticated than 4 AA’s that you're
going to go and shoot into space or if you want to monitor a volcano your AA’s are just going to
catch fire and go away so that won't work.
So we're looking at the result of kind of years of research on something called energy
harvesting. And energy harvesting is a technology that effectively untethers the device. So
here's how it works, this is the basics. You have an antenna which is an energy receiver; you
have a capacitor which is an energy buffer. It’s just like a bucket you’re pouring electrons into,
and then you have the downstream application hardware. So this is a computer, it could be
a microcontroller, you can have sensors, you can have a radio so that you can send
communications from your device, but the basic model of operation is you have some energy
source over here like this big square antenna that I have on the slide, you have some empty
space and this can be 20 meters, 40 meters, that kind of range, it's not infinite range especially
if you're using like an RFID reader antenna as your energy source, and then you have a target
device like this kind of janky breadboard mockup prototype that we built in my lab
that is implementing some application.
So the future what we hope for is that someday you can cram one of these things in your head
and it can help you to understand, say you have schizophrenia or say you have autism or
something, this can help you understand why you're behaving a certain way on a certain day.
These kinds of applications, I mean these are the exciting applications, doing things in especially
medicine that we can't do today because you don't want to be cutting someone's head open to
replace batteries or to power it and that's literally how these things work today is you’ll have a
power pack and you need to charge it. So this is a big shift. This would be a kind of a dramatic
shift from the way things are done today.
>>: Is this for illustrative purposes or-
>> Brandon Lucia: That is for illustrative purposes. This is effectively clip art. This
is some existing medical procedure. I think that thing in the center is actually a battery. I'm
pretty sure that's a coin cell so I think this is a probe that someone put in for a short-term
experiment. This is a probe. That's some computation, probably a radio, and I don’t know
about this particular image because this is from Google, but-
>>: So we don't know the provenance of this image. I was just curious.
>> Brandon Lucia: We don't. But in general, experiments like this they’re short-lived because
you have to take them out and replace the battery. That's the big problem here. So if we go to
this energy harvesting that I was just describing about where we can take energy from the
environment, we can use radio waves, we can use solar energy and we can power these
devices, there's a caveat and that is that the energy that we get is intermittently available. We
don't always have energy available. If you're near a radio tower then you have lots of energy;
you can soak it up and fill your capacitor and run your device, but if you go far away from your
radio tower suddenly you don't have anything available, your device turns off. Even if you're
near a radio tower, you charge up a capacitor and then you start running your device, that
drains the capacitor, and the capacitor is going to drain much more quickly than it can charge
up. So when you run your device, unless you’re very careful in software, that's actually an open
problem- how to be careful in software, you drain the capacitor and so actually the device just
kills itself at the cadence of the charge and discharge of the capacitor. I'm going to tell you
more about that, but that should give you an idea that these devices are inherently unreliable.
This is an important point. They're inherently unreliable because you don't have a continuous
execution model because the energy is intermittent.
>>: I’m wondering can I solve it by adding a switch between two capacitors and charging up
[inaudible]? If I do have the energy.
>> Brandon Lucia: You can do that but it's not going to entirely solve the problem because then
you have two things that are going to be charging unreliably and you have to power the logic
that switches between them. What if we had three? Let’s add three capacitors. Now we can
switch between the three, but we may not have charge in any of them that's adequate to run our
device and now we have to have more sophisticated reasoning about which one we are going
to use. So we can try and fine tune the power system.
I think we should attack the problem in software and so that's what I'm going to talk about
today is actually how we can take this on in software, and there's a little bit of hardware stuff
I'll talk about later, but primarily I'm going to talk about how we can solve the problems of the
intermittent execution model, which I was just describing, and how we can think differently
about building systems, and in particular building software for these kinds of systems. I'm
going to talk about some system support that we've built for making intermittent executions
reliable, and then I'm later going to talk about some very recent work that we've been doing on
a hardware software platform for debugging intermittent devices. So this is filling in the gap,
the absence of a toolchain for understanding intermittently executed devices.
So to give you a clear idea of exactly what it looks like when we are running on intermittent
energy we have a plot here. The Y axis is the available energy and we have time going across
the bottom, and here we have a lightning bolt indicating we have some radio frequency energy
being transmitted over to our device and so we charge, charge, charge and then we hit the
lifeline. This is an important point in the timeline of this device. When you hit that point it
turns on, you start operating, the device can use its sensors, communicate whatever, and when
we are operating you see we precipitously drop off in the amount of energy that we have. The
capacitor is being drained very quickly because maybe we're being naive about how we are
using the energy in the software.
Then we have the death line. We run out of energy. By the way, the lifeline is around 2.4 volts
for a lot of microcontrollers; three volts is kind of the standard operating level for
microcontrollers, so 2.4 volts, just to give a concrete idea of the parameters here, is kind of what
that lifeline is, and 1.8 volts is the death line. And so if we are looking at kind of 100 microwatt
operation that should give you a ballpark of the kinds of devices we can support with this
model, with radiofrequency charging in particular.
So we hit this death line crossover and then we are not charging, we are dead, we are far away
from the radio antenna and we can't get any energy. A little while later maybe we get some
energy and we're charging and then we hit the lifeline again and okay cool, we can start
operating again. This is a complicated plot because there's a bunch of continuously varying
parameters. We have energy level and we have to find a way of measuring that accurately and
time is moving forward.
So part of the work that we've done in defining intermittent execution is abstracting some of
the details away. The way we abstract it is by noting that here the computer shuts down and
here the computer reboots. So really, instead of charging and discharging, all we need to see
from the software's perspective is that we have a sequence of periods of reboot and
computation. The red parts are when the device is turned off; the green part is when the
device is operating. This is what we call the intermittent execution model and this is stuff that
we published at PLDI in 2015. The intermittent execution model says you compute for a while,
then you drop out, and then some arbitrary period of time later, you don't know how long it's
been, you don't know what environmental changes have happened, that’s the intermittent
part, you turn it back on and you start computing again. And our goal is to make computations
that go longer than one green box. That's the hard part.
So just to look at what it looks like to run a computation, if we have some code like this you
want to append some things to a list. We have a main function at the top that's going to
append ten things and then we have the append function which bumps the pointer in your list
and then puts a character in the location that pointed to in the array. So let's run this on
intermittent energy and just see what happens. We could maybe run for a couple of loop
iterations and then we get a reboot. We’re going to start over then and we're going to do one
iteration, start appending, and we're going to get a reboot. We can do this all day and we’ll
maybe make it as far as three loop iterations or four depending on the size of our capacitor and
the amount of energy we have, but we are not going to be really successful at doing any useful
computation if that's our model.
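To make the example concrete, here is a minimal C sketch of the program as described; the slide code isn't reproduced in the transcript, so the names and the __nv placement macro are assumptions:

```c
#define BUF_LEN 10

/* size and buf live in nonvolatile memory (FRAM); the __nv macro
   stands in for whatever placement mechanism the toolchain offers. */
#define __nv __attribute__((section(".persistent")))

__nv int  size = -1;      /* -1 means the list is empty */
__nv char buf[BUF_LEN];   /* the character list itself   */

static void append(char c) {
    size++;               /* bump the list pointer...            */
    buf[size] = c;        /* ...then store c at the new position */
}

int main(void) {
    for (int i = 0; i < BUF_LEN; i++) {
        append('a');      /* append ten things */
    }
    return 0;
}
```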
We noted that we can model intermittence as a control flow problem. So this is a control flow
graph and it maps to the program I just showed you and we can add edges to the control flow
graph that essentially go back in time. So you're at some arbitrary point in the program and
according to an external environmental condition, it’s not data-driven control flow, instead it's
kind of an environmental one, you go back in time. So these are implicit control flows. That's
one thing that's really hard about this. There's nothing in the code that says from here you
could go back to the beginning of main. The programmer has to reason from each point in the
program that they may be able to go back in time. This is hard. Even if you're very clever and
you decided to use some kind of checkpointing system, which I'll talk about in a minute, this
problem doesn't go away. So a big challenge to doing intermittent execution successfully is
dealing with the fact that you can go back in time.
These devices that I mentioned, intermittently executing energy harvesting devices, typically
have a hybrid memory system. So what do I mean by hybrid? The hybrid is between volatile
and nonvolatile memory. So a small amount of volatile memory is typical, around 2K, and then a
larger amount of byte-addressable nonvolatile memory is typical. And a lot of microcontrollers
these days are using something called FRAM which has reasonably low access latency and
energy although it's still much higher than SRAM or DRAM. So we can use the nonvolatile
memory a lot like we use the normal RAM in the computer, and of course we can use the
volatile memory just like we would normally except that the volatile memory has this peculiar
property which is that its values disappear. So if we want to span periods of failure then we
have to deal with the fact that our memory gets erased all the time. And the nonvolatile
memory doesn't disappear. It keeps its values. This is straightforward but I'll show you in a
minute that that gets problematic if you're using the non-volatile memory just like RAM and we
are also using some volatile memory like RAM.
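As a rough illustration of that hybrid split, here is how data placement might look on an MSP430FR-class part; the attribute spelling is an assumption about the toolchain (TI's compiler offers #pragma PERSISTENT, msp430-gcc a persistent attribute):

```c
#include <stdint.h>

/* FRAM: survives every power failure, whether you wanted it to or not. */
__attribute__((persistent)) uint32_t samples_logged = 0;

/* SRAM: erased on every power failure, so this value must be
   recomputed or restored from a checkpoint after each reboot. */
uint32_t scratch;
```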
So zooming out for a minute the big idea is that the tools we have and the mental models we
have for building software don't align with the intermittent execution model. And in fact,
something that is correct when you have continuous power gets really weird and sometimes
incorrect in unintuitive ways when you do intermittent execution, especially when you start
mixing together non-volatile and volatile memory. So I'm going to talk about a new category of
bugs which are related to intermittence and the hybrid memory system. I thought this was a
really gross comic which is why I included it.
I'm going to talk about intermittence bugs now and you guys can stop me if you have any
questions along the way. I'd be happy to keep taking questions. The first intermittence bug
that I'm going to talk about is an out of thin air value that you can get and this shows up, so the
set up here is we have the same program we were looking at before. Now we could say that
there is a checkpoint at the beginning of that loop and we can go back to that checkpoint. Now
the set up for the checkpoint is capturing the volatile state of the program and when we fail we
are going to reboot, repopulate the volatile state. We're not going to do anything to the rest of
the non-volatile state because it seems like we don't need to checkpoint that, it’s not volatile;
we can just keep that around and so we'll go back to this checkpoint at the beginning of that
loop and we'll execute forward again.
So if we do that we can see that size and buf are in gray, that means they’re in the non-volatile
memory, and they're initialized to buf is an empty buffer of characters and size is negative one
meaning we have nothing in the list right now. So if we run the program like I showed you
before we get through incrementing size and then putting a character in the buffer. This is
pretty straightforward. This is easy to understand how this state gets manipulated there and
then we fail. So we were at this point and let's say we hadn’t made it around this loop. Like I
said, there was a checkpoint there, so each time we go around the loop we are going to
checkpoint and say we captured that we did one, two, three loop iterations but we died right at
the end.
So now we're going to restart. We are on loop iteration one again but that nonvolatile state
doesn't get erased when we turn the power off. That's going to stick around and so what we
see is we have an out of thin air value. A gets appended twice for the i equals one iteration. It's
a buffer state that should never exist. It's impossible. In a continuous execution we won't be
able to see this behavior. In fact, we can run this out to its kind of natural conclusion. Let's
keep going around this implicit loop. It's still loop iteration number one but if we just turn the
power off repeatedly like that we’re going to keep updating that buffer and eventually we get
the size to 11. That's a value we could never see before. We have this array. It has 11 A’s in it.
That's a value we could never see before. So we can produce values in the execution of a
program that if you run on continuous power you would never encounter. And the important
thing about this is the failure is dependent on the availability of energy and the rate of charge
and discharge of the capacitor. That should be concerning because that means if I'm using
radiofrequency energy to charge my device and the transmitter is here and the device is here I
can do this and change what the software does. That's a system behavior you probably don't
want.
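Spelled out as a schedule on the earlier sketch (an illustrative reconstruction, not the talk's slide):

```c
/* The checkpoint at the top of the loop captures only the volatile
   loop counter i; size and buf are nonvolatile and survive the
   reboot untouched.

   i = 1:  size++        -> size = 0
           buf[0] = 'a'
   -- power failure, reboot, restore checkpoint: i = 1 again --
   i = 1:  size++        -> size = 1   (reads the surviving value!)
           buf[1] = 'a'                ('a' appended twice for i = 1)

   Repeat the failure often enough and size reaches 11: a buffer
   state no continuously powered execution can ever produce.       */
```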
There's another kind of bug that we encounter. This is an atomicity violation in the
conventional sense of an atomicity violation. Some things that should happen all at the same
time don't necessarily happen all at the same time and you get values that end up being weird.
In this example program we want the increment of size and the population of the buffer to be
atomic with one another and we can clearly see that if we kind of execute up to the point
where we are incrementing size and then we have a failure and then we restart, well we
incremented size twice and we only put one character into the buffer so now we have this
weird atomicity violation manifesting as corruption of the state. You have the size equals one
and you have the first entry of the buffer is unpopulated. So atomicity violations are another
problem we run into. We can get memory corruption in the nonvolatile memory. Remember,
we're using checkpointing here. We’re preserving all the volatile state in a checkpoint and
we're going back and repopulating it.
Another problem that we can run into is if we have some high energy cost piece of code we
could get into a situation where we started a checkpoint and try to run forward and regardless
of how hard we try we just can't make it through the region of code between the two
checkpoints, and this is something that could vary with environmental conditions, the rate of
charge of the capacitor, and the programmer has no visibility into whether their program is
likely to or with what likelihood will succeed and so this is problematic. We eventually end up
in an infinite loop and this program works correctly if you plug your device in. That's one of the
big problems here is the programmer is probably going to do most of their initial development
using continuous power and they'll say my program works, all I have to do is foo and then highcost and then whatever, and in fact when you go over to the intermittent power supply you're
going to see that this program doesn't actually appear to do anything.
So I've given you some evidence that these devices are too unreliable to put inside your body
right next to your heart. I wouldn’t want to do that. We need better programming abstractions
to make these things reliable, and I think we need system support to go with the programming
abstractions to enforce some reasonable guarantees on the behavior of software that's running
on intermittently executing devices.
So we came up with a couple, and I'm going to not go into a lot of detail because I've given
some rendition of this talk at Microsoft before so I’ll skimp on some of the details here, but I'm
going to give you an overview of the kind of system support that we have in mind to make
these devices more reliable.
So in PLDI 2015 we had a system called DINO. DINO is an acronym for Death Is Not an Option.
And this addresses the problem that I pointed out before that checkpointing doesn't get us out
of this mire. If you have non-volatile memory in the system and you have volatile memory in
the system you can end up with these problematic inconsistent memory states where your
nonvolatile memory has values in it that you don't expect. So if we take a program like this
one, we can convert it into a collection of tasks. Tasks are demarcated by static boundaries in
the program; and tasks are dynamically defined: as you flow through the statements in a
program you may traverse a task boundary that ends the previous task and begins the
next task. Our system supports task-atomic semantics, meaning that when we start a task, if it
completes then it’s as though that task executed without being interrupted by any power
failures. So we eliminate the atomicity violation problem and we eliminate the out of thin air
value problem as well.
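Continuing the earlier append sketch, here is roughly what a task boundary looks like in the code; the DINO_task() spelling follows the paper's style, but treat the exact API as an assumption:

```c
extern int  size;        /* from the earlier sketch, in FRAM        */
extern char buf[];
void DINO_task(void);    /* boundary call provided by the runtime   */

void append(char c) {
    DINO_task();   /* task boundary: checkpoint volatile state and
                      version the nonvolatile data (size, buf) that
                      the upcoming task both reads and writes       */
    size++;
    buf[size] = c; /* if power fails before the next boundary, we
                      resume at DINO_task() with size and buf
                      restored, so both updates commit atomically   */
}
```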
The way that it works is by selectively preserving not just the volatile memory but also grabbing
pieces of the non-volatile memory that through program analysis we realize that we need to
preserve otherwise we'll end up producing those funky memory states that I showed you
before. Another important thing is that we eliminate these effective control flow edges, these
implicit backward control flow edges, because we have these task boundaries that have
checkpoints and [indiscernible] with regard to control flow. So when we have a failure we go
back in time to not an arbitrary point in the past where maybe a checkpoint was collected or to
main or something like that but rather to a point in the program that the programmer statically
defined. This is helpful because now we can understand if we look at the program from a
certain point in a function there's a small set of points backwards to which we could go on a
failure. So we've eliminated this kind of arbitrary implicit control flow problem that we had
before.
>>: It seems like all of the bugs are related, they’re very similar to concurrency bugs. This has a
feel of transactions, in some sense, where you have this [inaudible]. But transactions are really
hard. So are there things that people have solved in concurrency they can be applied in this
domain?
>> Brandon Lucia: That's a great question. This is a lot like concurrency. And in fact, in the
paper that I have cited here and you can go and take a look at it, we framed this problem as not
just the control flow model I showed before with implicit backward control flow edges but
also as preemption. It turns out that if you have two pre-emptible tasks that access the same
state which is like a data race then you can have one of the problems that I showed you
before.
Now with regard to the similarity to transactions I think you're right on the money with that.
We want transactional semantics. We want tasks to be like transactions. There's a few
differences that make this model maybe a little weird but also really implementable. One of
those is that we have kind of statically bounded but dynamically defined regions. So you can
have, this is kind of a weird thing. Transactions are typically scoped with a beginning and an
end and each beginning has one end. In our case you could have an initial boundary and then
some statements going forward and then a branch and you can have a terminal boundary on
either side of the branch. So now there's, for that given start task boundary, you have two
potential end task boundaries so that's just kind of a mechanical difference from transactions.
We do have some problems that transactions also have like open nesting and IO and the fact
that you have a world that you're operating in and the effects of that world whether it's
through other software modules or through IO interfaces, can make your life more difficult
when preserving the atomicity of your tasks. Great question though.
We want a lot of the same guarantees transactional memory provides and so going forward
we've actually been looking at how we can kind of take some of the big ideas there and boil
them down. So we are actually constrained in a way that a lot of transactional systems aren't
and that is doing rollback is expensive. We need to consume memory to store a log of
everything that we've done and if we encounter a conflict with some previous failed incarnation
of the same task that we are trying to execute we need to go and repopulate from that log or
maybe replay parts of the execution. We have a flavor of that here, going forward though we
need to move away from that model. So that part of transactions I think is similar now but
should not be similar in the future. It's too expensive in this domain. You can’t spend the
energy to go and repopulate the entire state of memory after a failure based on some log or
something.
>>: It also seems like, correct me if I’m wrong, another difference is that there's no such thing
as a local access because every single memory access you do is in the transaction whereas
chances [inaudible] only going to be read locally. There's no such thing as local when
everything is [inaudible].
>> Brandon Lucia: That's an interesting point. So something I’m not going to talk about today
but we can talk about some other time is distinguishing between volatile memory and
nonvolatile memory and treating volatile memory more like local memory and treating
nonvolatile memory more like global memory because you have this nice property that when
you fail your volatile values all go away so you don't have to worry about partial results being
preserved across a failure like you do with nonvolatile memory so in a way that’s sort of like
your local memory. Nothing else can mess with it. No previous executions can mess with it,
but you need to make sure that you reinitialize it every time you start a task and then run
forward. So if you apply that programming discipline to your volatile memory it begins to look a
little bit more like the local memory in transactional memory systems.
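A small sketch of that discipline, with made-up names: volatile locals are reinitialized at the top of the task, and the only write that survives a failure is the commit to nonvolatile memory:

```c
#include <stdint.h>

#define N_TAPS 8
#define __nv __attribute__((persistent))   /* assumed, as before */

static const int32_t coef[N_TAPS] = {1, 2, 3, 4, 4, 3, 2, 1};
__nv int32_t window[N_TAPS];  /* sensor samples, kept across failures */
__nv int32_t nv_result;       /* the committed output                 */

void task_filter_sample(void) {
    /* Volatile local: reinitialized every time the task (re)starts,
       so a partial sum from a failed earlier attempt can never leak
       into this execution. */
    int32_t acc = 0;
    for (int i = 0; i < N_TAPS; i++)
        acc += coef[i] * window[i];
    nv_result = acc;  /* the single commit that outlives a failure */
}
```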
So I'm not going to dig into this in excruciating detail because the example is pretty simple, but
this figure shows how our system that I just described, with tasks and checkpointing and
selective versioning of the nonvolatile memory, eliminates bugs. So here's how it eliminates the
out of thin air value problem. We capture the checkpoint and this captures those nonvolatile
memory locations and it should be straightforward to see that if we preserve those here and
fail before we hit another task boundary when we restart, as long as we restore the values of
buf and size that we had before, we can go forward and we have a consistent state of memory.
This should be pretty straightforward. So we're essentially adding to the checkpoint is one way
of thinking about it. It’s important to point out that prior work on this overlooked the need to
do this versioning of nonvolatile memory. So there are systems out there, software systems
and hardware systems, actually there's another paper, and I'm not just picking on
[indiscernible] here, there's a system called QuickRecall that implemented volatile-only
checkpointing in hardware even, and if you add nonvolatile memory to the mix, which I think their
devices even probably assume, it's problematic. You get the wrong result. So I can pick on this
one because my co-author on the paper that I'm talking about here is a co-author on that paper
too. So he happily welcomed the correction.
So the question though is which parts of the nonvolatile memory do we have to capture in the
checkpoint? Which parts do we need to version? We need to make conservative assumptions
because we want to do this statically. We want insight in advance. We can’t do any runtime
reasoning; that's too expensive. So if we have a task boundary we have to say which parts of
this program do we have to capture? We can look at how data flows. So if we look at our
control flow graph we can see how data flows along the control flow arcs in that graph and we
can see that on the implicit edges that we added if there is no flow through nonvolatile data
then we don't need to capture that nonvolatile data as part of the checkpoint. So you can see
here we’re just assigning into this non-volatile array, we're putting i into bin i in that
nonvolatile array and so we don't need to preserve it because when we restart we're going to
go back to the task boundary and we're going to just blow all those things away. This is just
straightforward that we don't have to capture those.
However, if we change this program a little bit and we say that we're going to be incrementing
each of the bins in this nonvolatile array now the story is different. Now we do need to
checkpoint that because we have a data flow around the loop that’s created by that implicit
control flow edge. So if you store something and then you restart you’re going to see the result
of what you did before, you're going to see the incremented value, and you’ll get wacky values
all over the place. So the key insight here is that we have to version the data if we both read
and write the value. That's something that we can look for in the compiler and that's actually
what we implemented. We look for situations where we have both kinds of operations and
that’s actually a superset of what you need to look for. That was conservative. We could
have trimmed it down. That's what we did in our compiler because it was easy to implement
and it works to keep things correct.
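The two cases from the example, side by side; a sketch of the rule the compiler applies (the __nv spelling is an assumption as before):

```c
#define NBINS 16
#define __nv __attribute__((persistent))

__nv int bins[NBINS];

void write_only_task(void) {
    /* Write-only across the implicit edge: a restart simply
       overwrites every bin again, so no versioning is needed. */
    for (int i = 0; i < NBINS; i++)
        bins[i] = i;
}

void read_modify_write_task(void) {
    /* Read-then-write: a replayed iteration re-reads the already
       incremented value and bumps it again, so bins must be
       versioned at the task boundary. */
    for (int i = 0; i < NBINS; i++)
        bins[i]++;
}
```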
>>: [inaudible] you really need to see a data flow around [inaudible]. You need to have a use
[inaudible] which means there’s a path to use from the beginning of the task.
>> Brandon Lucia: But some of those are okay because if you have an [indiscernible] use-
>>: You have to have a definition that-
>> Brandon Lucia: Yeah, yeah, yeah. Definition later. I think that produces a data flow around
the implicit control flow edge.
>>: So you can make it more precise that way?
>> Brandon Lucia: Yeah. We're being conservative. We're saying if we see both of these on
the path to a task boundary then we would preserve that in a [indiscernible] way.
>>: [inaudible] results let me take a shot at summarizing the exact requirement, if it is exactly
[inaudible]? Anything [inaudible] is free, anything [inaudible]?
>> Brandon Lucia: That's true. If you put a task boundary at a point that defines all forward
tasks to be [indiscernible] tasks, in other words encloses, it never cuts one of those edges, then
every checkpoint is free and every one of those boundaries that you put in is free.
>>: [inaudible]. And once you know the checkpoint then you have [inaudible]?
>> Brandon Lucia: Maybe. There's another factor and that is sometimes you don't want to put
a checkpoint or sometimes you want to put more checkpoints both of which could mess with
[indiscernible]. The reason you might want to do that is if you have IO and you want to, for
example, put task boundaries tightly around IO or if you need several IO operations to be part
of the same task because you're correlating values from two sensors and if you collect one
sensor value now and one 10 minutes from now then it doesn't make any sense. So subject to
those other constraints which by the way are not edge cases, they’re very real in this
environment like this environment is all about IO, you're doing sensing and you’re processing
and you’re transmitting and you may be able to get those [indiscernible] things but not always.
It's a good observation.
So this example reveals a fundamental limitation of checkpointing and this really sucks. And this
is something that we are trying to get away from because it's very expensive to implement.
This is what I was saying to you just a second ago. If we can't know in advance where we might
be accessing we need to capture the whole thing. That's a huge drag especially when your task
boundaries could be subject to external constraints like the position of IO in your code. You
could be really hosed. This is a really contrived example, we have a million entry array and we
are choosing a random element of it. That’s probably not what we're going to be doing, but
you can imagine how this manifests as control flow deciding which things you’re going to be
accessing. So my view is that in the future checkpointing is not a viable solution for solving the
intermittence problem. We need to do something else that preserves state, keeps it consistent,
gives good guarantees, but isn't reloading values every time we restart.
So we built the prototype, this is sort of an unremarkable prototype. It's the way that you
would expect it to be if you have done this kind of work before, and I know most of you have. So
we start with a program, you have to put these annotations in, our compiler does a data flow
analysis and finds the situations we were just talking about, and then the compiler links in a
runtime library that does the checkpointing and the data versioning stuff dynamically as
needed, as you cross task boundaries. That’s the implementation.
We evaluate on a bunch of different benchmarks that we got from kind of various places. We
have MIDI, the musical instrument digital interface, we have sensor logs and various stuff, we implemented
for a custom breadboard prototype and for the WISP 5 platform, which are two different
radiofrequency energy harvesting devices, and if we don't use any support for these things we
have errors and the systems don't work. All of these applications produce the wrong result or
they have a hard-stop failure that we can't recover from because, for example, the
histogram that we were storing sensor log readings into is corrupted. One of the bins got
unlinked and so now what are we going to do? We have no way of recovering the value.
So if we look at one benchmark, like an activity recognition benchmark that takes
accelerometer readings and decides using a nearest neighbor classifier whether you're having a
tremor which is kind of a medical condition that we can identify pretty easily, it’s a wearable
medical device kind of application, then we can see how much error do we really get if we are
running at say 10 centimeters or 20 centimeters, pretty close to the power source. This is a
radiofrequency power source and these distances are fairly short because we wanted high
fidelity in the experiments without having to do 26 gazillion measurements. As you go further
away the variance becomes higher in the measurements so we would have had to do more.
So we went out to 60 centimeters, which is not kind of notably far, but we saw that even out
there we had edging toward 10 percent error in our measurements and that's pretty close. So
if we want to use this device in a real environment where say we have instrumented rooms in a
hospital with RFID readers and we’re going to be going in and out of range say 10 to 20 meters
of these things we are really going to have to improve the fidelity of this because the error
seems to scale maybe with diminishing trend toward higher error at higher distance.
>>: Suppose that I have that exact scenario where I'm in a hospital and one question is are the
errors uncorrelated because you could imagine having two devices, for instance if the errors are
uncorrelated, if it's really particularly like a blip in the air then the probability that actually both
the devices, for instance, have an error when I enter a room goes down exponentially.
>> Brandon Lucia: That's a good idea. I think we should probably do that. The issue is going to
be-
>>: It depends on the correlation of the error. I'm kind of asking about the error.
>> Brandon Lucia: I don't know if the errors are correlated. I suspect, part of me wants to say
yes because you’ll have more errors if you have more failures and so if you're further away
from the power supply you're going to be dropping out more frequently so that means that
both systems are more likely to be failing and so they're more likely to have errors, but then on
the other hand it’s a function of the software and we don't know. They might not be in the
same place, they might be just out of phase enough, it could also be a function of the hardware,
maybe you used a different memory technology in one device than the other to try and be
more robust, and an issue that I see in implementing this kind of system is that you have a
distributed system now and if they're computing the same value they need to agree on which
value they computed or you need to build your application in a restricted way so that they can
produce an arbitrary string of values and you know that whatever the strings are that the two
devices compute you can put them together and make something sensible that your application
cares about. In either case I think there's some distributed systems coordination stuff. You're
not going to run Paxos on these things just because there's not enough energy to take
a checkpoint, so good luck with Paxos. I think you would have to do some coordination stuff.
They can communicate. They could actually communicate with one another and that might be
one way out of this.
So with our system, no error; that’s because we eliminate it by construction. So the point here
is that we can compile programs differently and get better reliability guarantees for these
inherently unreliable devices. That's the punchline. We eliminate the error because we
compile the program differently and put runtime support in to prevent these things from
happening. Now I'm going to switch-
>>: [inaudible]?
>> Brandon Lucia: The cost, I don't have it in the slides but in the application we were just
looking at it’s hard to measure the overhead but it's about 2x slowdown on the baseline
implementation with the caveat that the baseline implementation is incorrect so it takes about
twice as long to complete the computation, and that's not just because we add more cycles to
each loop, but because if you arrive at a task boundary and your bucket is half full you have to
do more work and so you're more likely to fail and then you have to go back to the beginning of
the task and spend a quarter of your bucket to repopulate the memory and then you can
complete the task that you were doing. So it's not just the additional work per task overhead
it’s the additional work of wasted work in failing in tasks that you wouldn't have otherwise
failed in. So that's why it's 2x. To me this is a surprisingly high overhead number for kind of
sparse checkpointing because we are manually placing checkpoints. Those checkpoints can be
as sparse as we want them to be. That's why it’s 2x though is because we have not just more
work per task but because we have more failed tasks.
>>: I'm not sure if my question is fully formed yet but I'll try it. So you kind of made this
comment that checkpointing is something you guys want to move away from. You had this
very concrete example of [inaudible] an array that's a billion elements. It seems like that's
from, your programming model is very low level [inaudible]. So one thing you could think
about is a set of abstractions where you still allow it to have checkpointing but you've put the
right kind of data structures, whatever those things are, that let people do certain class of
operations easily where you can still [inaudible] checkpointing. So I guess to kind of come back
to the question is have you thought more about [inaudible] abstractions that help people do this
work as opposed to checkpointing? Maybe that's stealing the rest of the thunder.
>> Brandon Lucia: No. That's actually a talk I'll be giving three months from now. The basic
idea is that we can do it by sort of decomposing checkpoints into the individual
memory locations that we want to be accessing and if we force the programmer to adhere to a
discipline in the way that they use their nonvolatile memory then we can eliminate the need for
checkpointing but still have idempotent restart. So we can talk more about this offline. It's
basically roughly what you said except that we are not using any kind of ADT abstraction. It's
more of kind of general, you can think of it like a key value store abstraction, and then working
forward from that you can do arbitrary computation, build up arbitrary data structures. The
task abstraction is pretty similar though. That's roughly the same. It's a little bit different but
it's roughly the same.
Cool. So if we want to debug these things, we have all these weird bugs, and we need to
change the game a little bit. I'll show you why we need to change the game a little bit.
Debugging is a particularly hard problem because the intermittent behavior that we are seeing
not only causes conventional software bugs like the ones we're already fairly familiar with, but it also can
to kind of the hardware behaving in ways that are apparently invisible so both of these are
problematic. I'm focusing on software here but I wanted to mention that there is also the
possibility for the hardware to go wonky too. So debugging is something we need to be paying
a lot of attention to going forward.
So I showed you before how we can get intermittence bugs and these can lead to application
errors. Like if we have 11 then this is obviously a problem. We don't want 11 entries in our 10
entry buffer. And we note that this failure only manifests when we're running on intermittently
available energy, when we're running on intermittent power. So if we have the radiofrequency
harvester pointed at that device we’re going to get this bug manifesting when we have a
reboot here halfway through our loop and then we start again and we execute and we
increment; now we are at 11, now we have this problem. So we don't want this to happen
because this doesn't correspond to any continuously powered execution. Our group is
approaching this set of problems with that as a correctness definition so when I say we're
debugging we are trying to move toward a more correct program. We want to permit
behaviors that are permitted in continuously powered execution. This is a reasonable
definition of correctness that we are pushing towards.
So what do we need to do to understand why this bug happens and how we can fix it? What
we need is to look at how the energy is changing, like on this curve that I showed you before,
and what's going on in the program at each of those points when the energy is changing. So
here we have size incremented and buf incremented and then we had a failure on that curve
and we saw that when I showed you the example and explained it we saw the why that was
happening. But if you're just writing an application and one out of every 10,000 executions fails,
good luck figuring out that this was happening. You have events A, B, and C on the downward
slope of your discharge curve; then you restart and you have A and B again and that leaves you
with a corrupt state. This is difficult to reason about because we don't have any visibility into
the device. We want to be able to answer the question what was happening when that weird
thing happened and we saw our application produce the wrong value. What was going on
inside the device?
So the problem with trying to diagnose this type of failure is that typically we are in this kind of
cartoon setup. You attach a big chunky plastic box to a power supply like a USB port and you
attach the other end to your target device. You have visibility into your device. This is great.
You can do JTAG debugging, you can get all the logic bugs out of your program, but you're
providing power so your discharge curve looks like this. You're at full voltage the whole time.
This grayed-out part never happens. You never see the bathtub of kind of discharging and then
recharging again. So we'll never get 11. You can run as many times as you want to. You’re only
ever going to see 10. So that's a real drag. We can see the failure behavior run on intermittent
energy but not be able to look inside the device and use JTAG or use any kind of debugging
console, or we can power the device and look at all the registers and everything as the
thing is executing but that doesn't help us very much because we'll never see the failure. This is
the problem that we have. This problem is called the energy interference problem. No device
that is on the market today or from research solves this or addresses this.
>>: [inaudible]. Couldn’t you just, again taking the idea that this is an instance of a concurrency
related bug, use concurrency-related tools in an emulator? So for instance, I could imagine-
>> Brandon Lucia: You use an emulator. The emulator is the problem there. You have to be
able to model the behavior that is causing your bug in the emulator. You could just-
>>: Fine. But it's a nontrivial mapping. Maybe I could say something like, maybe I should take
this offline because I'm not sure how to ask my question. But the point is that it seems like you
could come up with [inaudible], for instance of a comparing execution that gives you a value of
11 for your [inaudible] bug. You could change things to be global in memory.
>> Brandon Lucia: That's interesting. We actually have a short-lived project that we haven’t
followed up on which is essentially simulating arbitrarily timed power failures and hoping that a
single power failure leads to the corruption that we are looking for. We found that while
sometimes it is in other cases we needed multiple power failures to get something really bad
to happen otherwise we wouldn't end up getting the wrong value.
>>: [inaudible]?
>> Brandon Lucia: It’s a package. It's one of these implicitly defined packages. So we could go
and instrument the program-
>>: To add the package.
>> Brandon Lucia: Let's get together and build that thing. We need the hands. That's what we
need. That's a good idea. The connection to concurrency is deep. Like I said in the PLDI
paper we kind of identified this and you can use the same kind of theoretical framework for
reasoning about this. Data races in particular are a nice way of thinking about it.
So then here we can also use logging. Logging is great. If we have an intermittently executing
system like this then we want to understand what the behavior was. We could maybe log the
energy periodically or we could blink an LED or something. The peripherals that we have are
pretty limited. Printf we could implement over like a UART and that's great, but all of
these things cost energy and so we are effectively changing the discharge curve by putting this
stuff in. It's an observer effect and in some cases it's pretty rough.
If you want to blink an LED when you have some condition, like maybe you blink your LED
when you're adding an element to the list and you want to see and then maybe you're blinking
another LED when you get down to 1.8 volts so you know when you're about to die, if you can
see really fast when those LEDs are blinking at the same time then that might be when your
failure condition happens. So think about that as your kind of optimistic hopefully this works
kind of debugging model. This is actually what people do. People do things like this when
they're doing embedded development. It gives you a situation like this though where you're
killing the device by doing a log and trying to blink the LED. So even if that would work you're
changing the system's behavior and you get this energy interference gap where the behavior
that you're looking at is different by some constant amount of energy from the behavior that
you are trying to investigate. So you may mask that. There's also the problem that you need to
know more or less what you're looking for in order to log it and try to find it. It’s the same as
doing any kind of tracing, printf-based debugging. So this is pretty bleak. This isn't the way
forward. This is a stopgap at best if you can make it work, if you can find some signals to get off
your board.
You can use an oscilloscope. That's a nice idea. I just sent a proposal for one of these and it
came out to about 29,000 dollars including all of the probes if you want digital probes to grab a
few signals coming out, you want to look at energy. By the way, the correlation between what
your program is doing and what the energy is doing is totally invisible to the oscilloscope
because you can't probe the internal logic of your microcontroller. So if you want to get a good
measurement of what the energy is you can do that with the oscilloscope. This isn't the way
forward either. Maybe combining the oscilloscope and logging and doing logging while you
have an oscilloscope trace that's how we developed what I'm about to tell you about and I can
tell you from that experience that that's extremely difficult and time-consuming and it wastes a
lot of time with false starts because the instruments begin to lie to you after a while. So it's
hard to debug these things. Oscilloscopes, logging, direct measurement, none of these work.
So we developed a new thing which is called EDB. This is the Energy Interference-free
Debugger. This does the things that I described us wanting before. We can correlate program
events and failures and certain value conditions being satisfied and certainly energy conditions
being satisfied with the energy behavior of the system. The device that we designed, I have
one right here; it's real hardware, we built this in my lab, we laid out the board; an undergrad
actually did a really awesome job laying out the board, I was impressed. It can not just monitor
the program state and the energy state; it can also manipulate that program state and the
energy state. We support all the kind of conventional familiar debugging operations like
assertions and breakpoints on values or energy. We have an interactive shell, a
Python library with an interactive shell, so it's really easy to use. We’ve kind of tried to make it as
simple as possible so that you can really debug these energy harvesting devices in the same
way that you debug normal embedded systems.
So as for monitoring we have access to all of the program state, we implemented a thin client
runtime layer that we can ask questions of and then it sends us values back, and we can read
the target energy level which is nice because then we can correlate those points in the program
that I just described identifying using that client runtime layer with direct measurements of the
capacitor on the device. We can read IO messages, we can actually decode things as they go by
through the IO layer by capturing on this board and then proxying the communications out, so
we intercept and then forward is one way of doing it or we can passively just monitor the
communications as they go by. So we can grab IO, we can grab program state, we can grab
energy, this is all passive and it's all energy interference free. I'll tell you some numbers at the
end, but I think here we were like nano amps of current, and so if you think about two-ish volts
this is a negligible amount of power. This is an amount of power that doesn't make any
difference over the long-term.
>>: I don't quite understand why it's so cheap. [inaudible] make it cheaper than [inaudible]?
Why is it so cheap?
>> Brandon Lucia: Why is it so cheap?
>>: [inaudible] target energy level and the [inaudible] state. Why is that so cheap?
>> Brandon Lucia: Because all we're doing is using the ADC on this board and we don't need to
use the ADC on the target board if we connect wires at the right place. So that's why there's
this six pin header on the bottom here. If we pop this onto the header that we designed to fit
onto, and in our case we targeted the WISP 5 because it's kind of the most mature platform
that's out there right now, then the pins that we connect to they expose the power over a
couple of lines then we can read them.
>>: [inaudible]?
>> Brandon Lucia: In this USB port we are not really plugged in. I should've mentioned that.
We are continuously powered so we don't get any interruptions. So we have a continuous UART
connection to a Python console, we can give commands all the time, and then this thing is
blinking on and off like it normally would be a thousand times a second or something and we
can reliably get state out.
Now say we want to do some more expensive monitoring actions than just reading the energy
level or reading target IO or hitting a breakpoint or something. To do that we need to be able
to be a little bit more sophisticated. We need to be able to manipulate the energy level and
what we do is we lie to the target about what was happening. We can charge up to
compensate for energy that the target uses doing debugging operations. So if there's a whole
bunch of kind of ad hoc breakpoint debugging operations we assess how much energy that took
and we can top off the capacitor. If we want to go into an interactive debugging mode we can
just power the device, record what the energy level was, power the device and then go into
arbitrary debugging activity and then reset the energy level to what it was before we went into
debug mode. And we've implemented all of that and we can get, again it's a variation of kind of
I think 2 to 4 percent of our capacitor is the error that we insert in the energy level of the device. So if
you're operating and you want realistic conditions we're within 2 to 4 percent.
So I'm nearing the end of my talk here, but there's a couple slides that I think are cool examples
of how you can use energy interference free debugging support. So say we have a program and
we want to insert elements into a list and then we want to do an expensive check to make sure
that the list is consistent, so each element’s next pointer's previous pointer points back to the
element itself. This is a pretty simple list consistency check. If we do that we might see these
awfully colored magenta circles that correspond to the consistency check happening. If we have
one element on the list then maybe we do some main loop work after we did the check for the
one element. That's easy. But then later we add two checks here with two elements on the list
and then three checks here and then we kill the device. Now we are not making any progress in
the main loop because the assertion is toasting us. So if you want this kind of either beta test
or development time assertion checking support to make sure your program logic is correct you
can't use it. The check itself is what's going to kill you. So instead what we did is we use EDB's
ability to manipulate the energy level of the device, continuously power it, then reset the
energy level to what it would have been and then do the main loop work. So by running it on
continuous power we can make forward progress under realistic conditions despite the fact
that we are doing these invasive checks.
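For reference, the consistency check in that example is roughly this shape; a hedged sketch with an assumed node layout:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node layout for the example's list. */
typedef struct node {
    struct node *next, *prev;
    char value;
} node_t;

/* The expensive assertion: every element's next pointer's prev (and
   prev pointer's next) must point back at the element itself. The
   cost grows with list length, which is why the check alone can
   drain the capacitor before the main loop ever gets to run. */
bool list_is_consistent(const node_t *head) {
    for (const node_t *n = head; n != NULL; n = n->next) {
        if (n->next && n->next->prev != n) return false;
        if (n->prev && n->prev->next != n) return false;
    }
    return true;
}
```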
So the last thing is we've integrated EDB with other hardware. So Tom mentioned earlier that
we are launching a satellite. So the satellite on one side has an application processor that has a
magnetometer and gyro. We're going to send this into space, it's going to orbit the Earth at
6000 miles per hour and collect a bunch of radio measurements, but then on the other side we
have this, essentially; and this is going to be operating as a beta test diagnostics collection
platform so that we can monitor the behavior of the system while it's winging around in space
under conditions that you can really only get if you integrate this hardware with your
application and run under realistic conditions. It's not even like we can emulate this in a lab.
We’re going to be in space, there's going to be all sorts of wacky errors; there’s going to be all
sorts of intermittently available energy as the thing is tumbling and not in view of the sun. So
by integrating this together we have a reliable test framework and we can do this kind of
realistic beta testing and get useful information to help us fix our application if it’s not doing
quite what we want.
So I'm really excited about the space launch, March 23 is when we're going to send it up. We
already shipped it to the launch company, so assuming everything goes well and there are no
surprises along the way we should be in the sky at the end of March.
>>: [inaudible]?
>> Brandon Lucia: I don’t know. I'll send around a link for sure. So that's pretty much what I
wanted to talk about today. I wanted to give you guys an idea about what intermittent
execution is and why it's tough to reason about. I showed you that we have some system
support for making unreliable executions reliable despite the intermittence, and I hope that I
convinced you that debugging intermittent devices is hard and that this is a good way to do it. I
have these available. If you guys think you'll be using intermittently executing hardware I can
send you some devices too. We have a whole big bag in my lab. Cool. Any other questions?