
>> Jim Larus: Hello. It's my pleasure today to have two visitors, both of whom are faculty members at the University of Illinois at Champaign-Urbana and are part of the UPCRC project there. And they're going to be giving a tag team lecture on time travel, which is obviously a subject of great interest to all of us, particularly those of us getting older. [laughter]. So let me introduce Professor Josep Torrellas and Sam King. And I think Josep is going to go first.

>> Josep Torrellas: Thank you, Jim. And thank you everybody for coming to the lecture. I appreciate your attendance. So Sam and I are part of the UPCRC Center at Illinois. We thought it would be perhaps interesting for us to stop by and give you an overview of some of the work we are doing in trying to understand time travel in multiprocessors.

So feel free to stop me and stop us any time during the talk. I'm just going to give a brief part, say a third of the talk, and then Sam will take over. So the motivation, of course, is that we're moving to multiprocessors. The number of cores per processor will continue to grow, and the major issue is programmability. Programmability should become a top priority, arguably as important as performance and power efficiency.

What we're doing is trying to figure out good primitives that can help programmability. And one such primitive is time travel. What is time travel? Time travel is the ability to visit and recreate past state and events in the computer.

Why do we want this? We can do this for debugging a program. So ideally we would want to know how we got to this point, what brought us to this bug.

But there are other applications of this. One of them is security: it would tell us how we got attacked, how the attacker got into the system. Another one is fault tolerance. This thing doesn't seem to work. With fault tolerance, for example, you can have two machines, one of them executing and sending a trace of what it's doing to the other, and this other machine is kind of a backup: if the first one disappears or dies, the second can continue.

How do we accomplish this? One technology is the deterministic replay of execution. That's what we're going to focus on here.

So how does deterministic replay work? It works in two steps.

First, you have the initial execution, also called recording. And the idea here is that you execute and try to record any nondeterministic events in a log. This could be, say, the memory access interleaving if you have a multi-threaded program, or a set of interrupts, when they occur and at what time.

And then, in a second step, you replay the program: you go back to a previous checkpoint and then you use the log to force the software down the same path.

So that's the idea of the deterministic replay.
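
To make the two steps concrete, here is a minimal C sketch of the record/replay idea, assuming a single nondeterministic input (a timestamp) and an in-memory log; the names and structure are illustrative only, not those of any system described in this talk.

    #include <stdio.h>
    #include <time.h>

    #define MODE_RECORD 0
    #define MODE_REPLAY 1

    /* Log of nondeterministic values captured during the initial execution. */
    static long event_log[1024];
    static int  log_len, log_pos;

    /* During recording, capture the nondeterministic value and append it to
     * the log; during replay, return the logged value so execution is forced
     * down the same path. */
    static long nondet_time(int mode)
    {
        if (mode == MODE_RECORD) {
            long v = (long)time(NULL);
            event_log[log_len++] = v;
            return v;
        }
        return event_log[log_pos++];
    }

    int main(void)
    {
        long recorded = nondet_time(MODE_RECORD);  /* step 1: initial execution */
        long replayed = nondet_time(MODE_REPLAY);  /* step 2: replay from log   */
        printf("recorded=%ld replayed=%ld\n", recorded, replayed);
        return recorded == replayed ? 0 : 1;
    }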

There's been a lot of work on software support for deterministic replay, some of it done here. It can be done at the compiler level, the virtual machine level, the operating system level, many levels. We know it's very nice. It works. It's flexible. It integrates well with the rest of the software stack.

The problem is that it doesn't work so well for multiprocessors, at least that's what we think. And the reason is that it would be too slow to do it in a multiprocessor, because we would need to record the interleaving of all the shared memory accesses to know exactly how this program interleaved, how the memory accesses interleaved. You would need to instrument all these things. This is an opportunity for the hardware. And that's the motivation for hardware-based deterministic replay systems.

The idea here is that the hardware can record the ordering of shared memory accesses efficiently with some extra support. There's no existing hardware yet, but the way the proposals work is as follows. Suppose you have two threads here. One of them, at instruction N1, writes to location A, and then later on, at instruction M1, this one here reads location A. So then, as I execute this, I need to log an event that says: when I get to instruction M1, I need to check P1 to make sure that P1 has already passed N1. That's the idea. The log will have, in the simplest case, an entry for each arrow that, when I replay, I will read to make sure that when I get to M1 I wait for P1 to execute N1 first. And that's going to enforce the same interleaving.
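
As a rough illustration of this per-arrow log, the sketch below shows one possible entry format and the replay-side check, assuming hypothetical per-processor retired-instruction counters; it is not the format of any of the proposed hardware schemes.

    #include <stdint.h>
    #include <stdbool.h>

    #define NCPUS 8

    /* Dynamic instruction count already replayed on each processor. */
    static volatile uint64_t retired[NCPUS];

    /* One log entry per dependency "arrow": P2 must not execute its
     * instruction M1 until P1 has passed its instruction N1. */
    struct dep_entry {
        uint8_t  src_cpu;      /* P1 */
        uint64_t src_icount;   /* N1 */
        uint8_t  dst_cpu;      /* P2 */
        uint64_t dst_icount;   /* M1 */
    };

    /* During replay, a CPU about to execute instruction number `icount`
     * checks its pending entry and stalls until the source CPU has gone
     * far enough. */
    static bool may_proceed(const struct dep_entry *e, uint8_t cpu,
                            uint64_t icount)
    {
        if (cpu != e->dst_cpu || icount < e->dst_icount)
            return true;                               /* no constraint here    */
        return retired[e->src_cpu] >= e->src_icount;   /* wait for P1 to pass N1 */
    }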

This seems appealing; however, there are a bunch of problems. First, you would need a large log, because you may need an entry for each arrow or group of arrows.

The second thing is that the replay occurs using sequential consistency. We're going to have threads going on, and we pick an instruction from each thread and replay them assuming that each instruction occurred atomically. In reality, processors have reordered things, and they may not follow sequential consistency. That's another problem with existing systems.

Finally, and most important, people who have looked at this have focused mostly on the hardware primitive without caring about how you integrate it into an operating system.

And that's the subject of our work. So what we're going to do here is try to figure out how one would build a system that works. Very briefly, I'm going to present our hardware design called DeLorean, which appeared in ISCA 2008. This is a very efficient way of recording and replaying. The bulk of the talk, however, will be on a new software-hardware interface for replay.

This appeared in [inaudible] a month ago. And the idea is: how do you design an operating system to make hardware-based replay systems useful? So what's DeLorean? DeLorean, the motivation for this work, is a hardware-based MP replay scheme that uses chunk-based execution.

The highlights are that it requires a very, very small log, less than 1 percent of that of currently proposed hardware systems. It replays at very high speed, which could be particularly important for the fault tolerance environment.

And then it has a knob that allows us to trade off speed versus logging requirements; depending on what stage of your debugging process you are at, you may want to use one or the other value of the knob.

To explain DeLorean, first I need to explain what I mean by chunk-based execution. So here we are looking at a system where processors execute chunks of instructions at a time.

These are consecutive dynamic instructions that we call chunks: a bunch of loads and stores. And these chunks the machine executes atomically, so they're executed in a way that any side effects are not visible until the chunk commits. Think of a transaction. So imagine a machine that works on transactions all the time.

So whenever I commit a chunk, only then do I make the state of this chunk visible to the rest of the world. And the same thing here. Moreover, these chunks execute in isolation, like transactions, which means that if I have this chunk here that has read location X, and then this chunk here executes, writes location X and then commits, this one has to be squashed and restarted because it has seen an old value.

So we will squash this thing and restart it. If we use this model, the system appears logically to execute in a total order of chunks, right? There's a chunk from processor one, one from processor four, one from processor zero, in some order.

So this is the model. And this is interesting because memory access interleaving happens only at chunk boundaries. Imagine we have a machine that supports transactions all the time, and in this case I'm going to assume that they're invisible to the software.

The interesting thing here is that inside the chunk, memory accesses can be reordered and overlapped in any way I want, because they won't be visible. Examples of systems that use this are TCC, Transactional Coherence and Consistency, from Stanford, and BulkSC from Illinois.

So what is the implication of chunk-based execution for deterministic replay? If you use this model, deterministic replay becomes very simple. All we need to do is generate the same chunks during replay as in the initial execution, the same chunks, and commit them in the same order.

That's all. So we have this initial execution where we have these chunks and this order of commit. That's all I need to remember. I don't need to have the same exact timing between these two chunks, as long as they commit in the same order, that's enough.

Also, I don't need to follow the same order of instructions within a chunk. They can actually execute in any order. What matters is that in the end the chunk stops at the same point and commits in the same order.

>>: How do you choose chunk boundaries?

>> Josep Torrellas: How do you choose chunk boundaries? You would want to cut them deterministically; for example, every M instructions would be a design point, with M being a thousand, 2,000 or so.

So the result of this is two major implications. One, the log is very small. Second, I can have very high-speed replay. Why is the log very small? Because I don't need to store arrows, I don't need to store dependencies; all I need to store is the chunk commits in this transactional approach.

Therefore, the log is simply the total order of chunk commits and their sizes. So each entry will be very short: a processor ID and the number of instructions, or the number of loads and stores, however I want to measure the size of a chunk.

>>: If that's deterministic, why do you need the chunk sizes?

>> Josep Torrellas: This would be the naive case; that would not be deterministic, so I would need to store the size. The log would be updated infrequently, only every 2,000 instructions or whatever. This is the log: processor ID, size; processor ID, size.
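
A sketch of what one entry of such a log could look like, with illustrative field widths rather than the actual DeLorean encoding:

    #include <stdint.h>

    /* One entry per committed chunk: who committed and how big the chunk
     * was (measured here in retired instructions). */
    struct chunk_commit {
        uint8_t  cpu;    /* processor ID                           */
        uint16_t size;   /* chunk size, e.g. on the order of 2,000 */
    };

    /* The memory log is simply the total order of these commits.  On
     * replay, commit_log[i].cpu must be the i-th processor allowed to
     * commit, and it must cut its chunk after commit_log[i].size
     * instructions. */
    static struct chunk_commit commit_log[1u << 20];
    static unsigned long commit_entries;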

>>: If there is synchronization inside one of those deterministic, let's say fixed, boundaries, then it's still deterministic, but it would have to adapt to that, because, for example, if I want to release a lock to some other computation, I would need to end my chunk.

>> Josep Torrellas: You don't have to. You can continue, right? All that's going to do is delay the time when the other can acquire the lock.

>>: But if I were to release it up here.

>> Josep Torrellas: Inside.

>>: Yeah.

>> Josep Torrellas: If you can acquire that guy, fine. Otherwise you may have to spin. And if you spin because you cannot acquire the lock, that counts as instructions. So eventually you'll cut the chunk.

These are not static chunks. They're dynamic. Therefore you don't have this deadlock that you could have.

Anyway, so the idea then would be, in this case, to have a set of processors. I'm going to have an arbiter module here in hardware. This is kind of irrelevant; you could have a bus or whatever. The memory log is implemented as two different structures: the processor interleaving log, the one that has the PIDs, processor IDs, and then the chunk size log, which I'm going to distribute across the machine.

So the chunk size log will be in each processor, and the arbiter will centralize the processor ID log. Whenever processor zero, for example, finishes a chunk, it will send a request to commit to the arbiter. Perhaps processor one will also send a request to commit, and then the arbiter will select one of them, using a certain algorithm, and say okay, you're committed.

This one will store the size here and this one will store the processor ID. And then, perhaps concurrently, this one will send a request and be told that it's also committed, and this one will store the size and the processor ID.

So kind of the idea here is that this log doesn't have to be centralized; it can be distributed. The combination of the processor ID history and the chunk size history is the log of the program.
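
The split between the two structures might be sketched as below, with the arbiter appending to the processor interleaving (PI) log and each processor appending to its own chunk size (CS) log; this is an illustrative assumption, not the hardware design itself.

    #include <stdint.h>

    #define NCPUS 8

    /* Arbiter side: the processor-interleaving (PI) log, i.e. the global
     * order in which chunks committed. */
    static uint8_t  pi_log[1u << 20];
    static unsigned pi_len;

    /* Processor side: each CPU keeps its own chunk-size (CS) log. */
    static uint16_t cs_log[NCPUS][1u << 16];
    static unsigned cs_len[NCPUS];

    /* Called when the arbiter grants a commit request from `cpu` for a chunk
     * of `size` instructions.  In the real design the two appends happen in
     * different places (arbiter vs. processor); they are shown together here. */
    static void commit_chunk(uint8_t cpu, uint16_t size)
    {
        pi_log[pi_len++] = cpu;                /* who committed, in order */
        cs_log[cpu][cs_len[cpu]++] = size;     /* how big the chunk was   */
    }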

Now, how do we reduce the size of the log? It's very simple. So suppose we have only two threads.

This is the history of the processor IDs that commit. This is the history of the chunk sizes.

The first thing I can do is use large chunks. The larger the chunks, the shorter the log, right? So I use, say, close to 2,000 instructions, and this gets shorter. The second thing I do is use fixed-size chunking or, rather, deterministic chunking.

If I have, for example, a page fault or whatever that requires me to stop the chunk, right, as long as this thing is deterministic, I don't need to log the chunk size. So I'm going to have 2,000 and just remove this thing. Making the chunk size deterministic.

The third thing I can do is predefine the interleaving. So rather than having the arbiter take the chunks and record the order in which they came, let's have a round-robin algorithm that says I'm going to take one chunk from each processor. And if somebody comes and says I want to commit and it's not his turn, I'm going to say: just wait.

Then I have 0, 1, 0, 1, 0, 1, and I can get rid of this. In a sense I'm eliminating most of the logging. Of course, there are trade-offs in all cases. If I use large chunks I have fewer entries in the log, but I may have more collisions: the longer the chunks, the more collisions, more overflows and so forth.

If I have deterministic chunking, then I can remove the CS log that used to keep the sizes of the chunks. But I may have nondeterministic events, such as cache overflows, where I need to cut the chunk. So I'm going to have a small CS log that keeps only the sizes of those chunks that were cut nondeterministically. There are some events that still need to be logged.

If I use predefined interleaving I don't need to have a PI log but I may have some performance degradation because I fix the order of commits and I may have load imbalance and it may be okay.
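
A sketch of the predefined-interleaving idea, assuming a simple round-robin arbiter; with this policy the commit order never needs to be written out, which is why the PI log can disappear.

    #include <stdbool.h>
    #include <stdint.h>

    #define NCPUS 4

    static uint8_t next_cpu;   /* whose turn it is to commit */

    /* A processor asking to commit out of turn is simply told to wait, so
     * the commit order never has to be written to a PI log at all. */
    static bool try_commit_round_robin(uint8_t cpu)
    {
        if (cpu != next_cpu)
            return false;                          /* not your turn yet          */
        next_cpu = (uint8_t)((next_cpu + 1) % NCPUS);
        return true;                               /* committed, nothing logged  */
    }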

All right. So to summarize, then, there are multiple -- there's basically a dial that says I can either use the nano or the pico log. Both cases are large chunks. Both cases use deterministic chunking so very tiny log, and this one uses no PI log, predefined interleaving.

So perhaps what you want to do, when you initially debug a program, is to use this one here, which has very, very little logging but may have some kind of slowdown. And then with this one here, you may want to focus on this particular bug.

Now, the other advantage I said is that this is a high-speed replay system, and the reason is that you don't need any special support in the processors as you replay. Basically, the processors are still executing in parallel normally. All they do is check their CS log to see when they have a chunk that is special and they need to cut it.

Otherwise, they continue executing regularly performing this deterministic chunking, and then the arbiter simply uses the PI log that it has from the initial execution to decide in which order it can pick the chunk commits.

So say you have a replay machine that uses this approach; it can execute at approximately the same speed as, or a speed comparable to, the initial machine.
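
On the replay side, the arbiter's job essentially reduces to walking the PI log recorded earlier, roughly as in this sketch (again an illustration, not the actual hardware logic):

    #include <stdbool.h>
    #include <stdint.h>

    /* The PI log captured during the initial execution. */
    static uint8_t  pi_log[1u << 20];
    static unsigned pi_len, replay_pos;

    /* During replay the arbiter grants a commit only when the requesting CPU
     * matches the next entry in the recorded order; everything else about the
     * processors' execution proceeds normally and in parallel. */
    static bool replay_try_commit(uint8_t cpu)
    {
        if (replay_pos >= pi_len || pi_log[replay_pos] != cpu)
            return false;       /* not this CPU's turn in the recorded order */
        replay_pos++;
        return true;
    }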

>>: Are there any guarantees or unusual choices you have to make so that you are guaranteed, during replay, to generate chunks of the same length as the originally logged chunks? I ran a thousand instructions the first time in the chunk. Now I'm trying to rerun it. I have to be able to get through a thousand instructions without a cache overflow or whatever. Does that just follow?

>> Josep Torrellas: Well, the cache and branch predictor state may be different between the initial execution and replay. You may have events like a cache overflow that you didn't have in the original case. In that case you need to stop the chunk now, but you tell the arbiter: wait, don't commit anything else, let me finish what I need to do.

>>: So that would turn my replay chunk into two chunks?

>> Josep Torrellas: It could be. But it has to --

>>: That's all right because I'm still holding off.

>> Josep Torrellas: It has to be deterministic. Okay. But the other advantage is high replay speed: we're not interpreting the log, if you want, as the other schemes do, but instead we are using it to basically prove that we are following the same pattern.

Okay. So I have an evaluation of this thing which I'm going to skip in the interest of time. The only thing I want to say is that in this model of execution the logs can become very, very small. We get to the point where you could have, say, a machine that runs for a whole day, a whole day, and have only a few gigabytes per core.

So if you have an eight-processor machine, you would have perhaps about 20 gigabytes in one day. And that would allow you, in this nano or pico mode, to reexecute the code again.

Anyway, so this is the hardware aspect of the talk. Sam will continue and look at the second issue, which is how to make this usable.

>> Sam King: Okay. Thanks. So in the first part of our talk Josep showed you a very interesting hardware level mechanism that lets you record the state of an entire computer system very efficiently.

So what I'm going to talk about today is how we can take this mechanism and include it into an overall software system. So can we take this hardware and integrate it into an operating system to make it much more practical?

Now, one of the problems with current hardware-based replay schemes is that they're largely impractical. So there are three key problems with the current proposals. So first there's not really a notion of separating software that's being recorded from software that's not.

So in the typical hardware-based scheme, they'll view the state as the entire computer system.

And they'll record absolutely everything that's running on the system, which people have shown can be made effective. But the problem is, as Josep showed, we're storing gigabytes of data a day.

So how do you store these gigabytes of data? You write it out to disk. This is where a little bit of software could help out. But now that software is part of the state that's being replayed, and what happens when you replay something that is also doing the logging ends up being a very difficult situation to sort out.

Second is many of the hardware-based replay schemes require either a specialized VMM or simulator or a completely separate machine. But if you look at many of the common uses for replays such as debugging, this type of configuration doesn't make sense.

So you've got a student and they're trying to debug their program, and you ask them to use two separate computers, it's just not going to work. And finally, and perhaps most importantly, you can't mix normal execution with recorded and replayed execution on the same set of hardware at the same time.

So what this means is if on your computer you want to listen to an MP3 while you're programming, it's not going to work. So what we wanted to do is see if we could address these shortcomings. But fundamentally, in order to do this, we have to redesign the hardware mechanisms that we use for doing replay and we'd have to integrate it into a real operating system and a real software system.

So what I'll be talking about today is Capo, and this is our practical hardware-assisted deterministic replay system. Really, what Capo is, is a hardware-software interface for allowing software to access underlying replay hardware.

Now, one part of the Capo interface is this key abstraction called a replay sphere. A replay sphere is our way of reasoning about chunks of software that are being recorded or replayed.

And so what the replay sphere does is it allows us to isolate different entities on the same computer. So if you've got three different replay spheres, one might be recording and one might be replaying.

And the replay sphere abstraction helps us do that. Second, it gives us a very clean way to divide the responsibilities of the hardware and the software, things inside the sphere hardware needs to take care of. Things outside of the sphere are the software's responsibility.

Now, using these abstractions and this hardware software interface we built a system called

Capo One, which is the first implementation of this interface. And to me one of the really interesting aspects about this project is Josep and I together we have a lot of experience with deterministic replay. We've been looking at this for a number of years now.

He from the hardware side and me from the software side. But one of the interesting things we found is this isn't just a matter of taking a hardware scheme and a software scheme and combining them together. Because there are very subtle and very fundamental interactions when you combine these systems together. That's what I'm going to talk about today.

So overall for my section of the talk, I'll describe the internals of Capo including our replay spheres and hardware and software interface and talk about the fundamental interactions between the two and include our evaluation.

In addition, at the end I reserved a little bit of time to talk about a few research projects my group is working on, some of which are associated with UPCRC. Is there a question, James?

>>: At the beginning of the whole talk there's this idea maybe we could use some of this replay stuff for security, figure out -- are you going to talk about what your threat model is for the underlying replay hardware how an attacker might try to elude detection?

>> Sam King: So, I think that this really boils down to a software problem. So it's kind of similar to a threat model of a processor. Right? So we're assuming that we've got a replay system implemented inside an operating system. If an attacker breaks into our operating system, then we're not going to be able to record things deterministically. So the hardware doesn't necessarily work into the threat model.

I think it's still the traditional threat model we would have. It's just a very useful tool for making what the software would do much more efficient.

So the first topic I'm going to discuss today is the replay spheres. And the first thing I'm going to talk about is how we can use replay spheres to isolate processes. So shown here on this figure we've got a 4 CPU system with some replay hardware and an operating system running up above.

Now we have our first process, which is a blue process that includes two different threads that are running on CPUs 1 and 3. We've also got a second process, our gray process, that has a thread running on CPU 2, and finally a third process here running on CPU 4.

So now what we want to do is we want to take these processes and we want to record them and replay them as their own unit. So one of the early distinctions we made is that we're only going to record user mode threads. So the hardware is only responsible for things that are happening up here inside user mode. When the system transitions to supervisor mode and into the operating system, the replay hardware stops recording dependencies.

So what this allows us to do is it allows us to exclude the operating system from our replay sphere and it gives us much more flexibility so that our operating system can have nondeterminism. Things running inside user mode have to be deterministic but we have much more flexibility with the operating system, because of this distinction.

Now, if we want to record and replay processes, we've got shown here replay sphere 1 where we're encompassing two different threads from the one process together, and then we've got another replay sphere on the right where we have another thread that's being encompassed. Ed.

>>: Is there anything you have to worry about when you're splitting out the system, for example, two threads entering the kernel and writing to a FIFO or a pipe, where that interaction clearly has to be deterministic and you have to force that in the replay?

>> Sam King: Yeah. So if you're talking about outputs from system calls, you really only have to worry about that if you're reexecuting the system calls and using the results. So when we're replaying, nominally, we just throw the results away and it doesn't matter.

But there's an optimization that you can do where you can add another process inside the replay sphere and then use the outputs from one system call to feed into the other one. It's a way of optimizing. That's when you have to start worrying about these types of issues.

>>: [inaudible]

>> Sam King: Inputs we have to record. I'll spend a lot of time talking about that. So one of the key ways that we move away from the hardware-based schemes is this notion of an R thread. For the previous hardware-based approaches, everything was processor based: you had processor one committing a chunk, processor two committing a chunk. But we wanted to express a slightly higher level of abstraction to the hardware. That's why we use an R thread. An R thread is analogous to a software thread, but what this allows us to do is express this software-level thread to the replay hardware directly. So this enables us to take a thread and replay it among multiple different processors, and gives us this extra layer of indirection, which gives us much more flexibility.

Now, one of the requirements of our replay spheres is that any R threads that share memory have to run inside the same replay sphere. Now, as I mentioned when I was answering Ed's question, there are some optimizations that are possible. So you could also include additional processes within the same replay sphere, and this would mean that these two processes are recorded and replayed as a unit, and there are some opportunities to optimize log space.

Now -- I guess it's deterministic. So we have a software component that manages this entire system. You can't really read this, but it's called our replay sphere manager, or RSM.

So in many ways our replay sphere manager is like a traditional operating system, where it's going to manage our resources below and it's going to provide abstractions for everything running up above.

Now one thing that I haven't really talked about explicitly is in what type of system does this work?

So for this talk I'm going to talk about an operating system that records and replays processes.

But the concepts we found here we believe translate to any type of layered system where you have a lower layer that's recording and replaying something running up above. We believe you can fit these techniques into something like a virtual machine monitor and use it to record and replay virtual machines. But for this talk I'll refer to an operating system recording and replaying processes.

So in order to separate responsibilities, the replay sphere provides a barrier between hardware and software, and so what it does is it says anything within the replay sphere is the hardware's responsibility. So this means that all of the R threads that are running within a replay sphere have to have the exact same interleaving. The hardware is responsible for recording this in what we call an interleaving log.

When we go to replay, the hardware is responsible for enforcing this exact same interleaving within the sphere itself. Now, things that are outside of the replay sphere are the software's responsibility. So any type of nondeterminism that could affect your process has to be recorded by the RSM, the software-level RSM, so it can be reinjected during replay.

So some examples of things that we record are system calls, signals, the RDTSC instruction, and any other source of nondeterminism that could potentially affect the process we have to record in our sphere input log and then later reinject this into the sphere as it's replayed. Ed.

>>: RDTSC, how are you recording that?

>> Sam King: That's right. RDTSC and RDTSCP: we set up the processor so it will trap when it's executed from user mode, and we emulate it. We found that most of the software that executes RDTSC is the operating system. There's this little thing in CRT0 that we still don't understand that seems to call it, but it's just once at the beginning. From a performance perspective we didn't have any trouble with it. That's what we do.

>>: Asynchronous DMA, memory-mapped I/O stuff, do you handle that?

>> Sam King: We don't handle it specifically in this project, mainly because we don't have to. It's rare for a process to do memory-mapped I/O or DMA, but the key insight is that if a process were to do these things it would have to go through the operating system. Because it would have to go through the operating system, we can record it. Memory-mapped I/O we can do like we do with RDTSC, where we trap and emulate; other people have done that. DMA takes a little bit more work, where what you can do is create a separate buffer, do the DMA into that buffer, and copy that.

So there's some overhead associated with it. But these are things that still fit within our conceptual model.

So overall the hardware-level log and the software-level log create our total replay log. In addition to separating responsibilities, our RSM is also responsible for making sure that any threads that share memory run within the same replay sphere. So this is something that software has to guarantee.

If you have a thread that's sharing memory that's not within a replay sphere you're not going to be able to deterministically replay those processes. For the most part our operating system is allowed to be nondeterministic, but the RSM has to do a little bit of work to make sure a few key functions of the OS are deterministic. And these are around virtual memory and process creation and R thread ID allocation. I'll talk about that particular point in more detail later.

And in addition, the RSM has to manage these different input logs. So when you're recording it stores them in a file and spits it back at the hardware and software during replay and then finally it has to manage the underlying hardware.
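
A sketch of what an entry in the software-side sphere input log might look like, with illustrative fields; the talk does not spell out Capo's actual log format.

    #include <stdint.h>

    enum input_kind { IN_SYSCALL, IN_SIGNAL, IN_RDTSC };

    /* One entry per source of input nondeterminism that reached the sphere. */
    struct sphere_input {
        enum input_kind kind;
        uint32_t rthread;      /* which R thread received the input        */
        uint64_t seq;          /* position in that R thread's input stream */
        uint64_t value;        /* syscall return value, signal number, TSC */
    };

    static struct sphere_input input_log[1u << 16];
    static unsigned long input_len;

    /* Appended during recording; read back in the same order during replay
     * so the same values are reinjected into the sphere. */
    static void record_input(enum input_kind kind, uint32_t rthread,
                             uint64_t seq, uint64_t value)
    {
        struct sphere_input *e = &input_log[input_len++];
        e->kind = kind;
        e->rthread = rthread;
        e->seq = seq;
        e->value = value;
    }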

So Capo's hardware interface, we design this interface to be independent of the underlying replay hardware we're using. So our hope is that the hardware and software interface and abstractions we developed would work regardless of what you have running underneath. Now the hardware we have running underneath is DeLorean for obvious reasons, but our hope is that we came up with an independent hardware and software interface. In order to handle logs we use a traditional interrupt buffering system like you would see with a network card.

And we've introduced two different data structures that are shared between the software and hardware and these make up the state of our replay system. We have a per processor R thread control block. And what this does is it allows us to specify to each individual processor which R thread and which replay sphere are currently running on that particular processor.

Now, we also have a per sphere replay sphere control block, which allows us to tell the hardware if a particular sphere is recording or replaying.
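
These two shared structures might be laid out roughly as follows; the fields shown are assumptions based on the description above, not Capo's actual definitions.

    #include <stdint.h>
    #include <stdbool.h>

    /* Per-processor R thread control block: which R thread and which
     * replay sphere are currently running on this CPU. */
    struct rthread_control_block {
        uint32_t sphere_id;      /* 0 = no sphere active on this CPU         */
        uint32_t rthread_id;     /* software-assigned R thread identifier    */
        bool     in_user_mode;   /* hardware records only user-mode activity */
    };

    enum sphere_mode { SPHERE_IDLE, SPHERE_RECORD, SPHERE_REPLAY };

    /* Per-sphere replay sphere control block: whether the sphere is
     * recording or replaying, and where its interleaving log lives. */
    struct replay_sphere_control_block {
        uint32_t sphere_id;
        enum sphere_mode mode;
        uint64_t interleaving_log_base;   /* buffer the hardware spills into */
        uint64_t interleaving_log_size;
    };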

Now, as I mentioned before, one of the jobs of our RSM is to virtualize this replay hardware. Now, in many respects this is similar to what an operating system does with processes. So if you have a process that's running, it thinks that it has its own copy of the underlying processor.

Now, we all know that the operating system is multiplexing things, but it's this illusion of an infinite number of processors that the operating system provides that we're trying to copy here, where we provide the illusion of an infinite amount of replay hardware. So what that can mean is, if we have the software-level replay sphere manager and two different hardware replay spheres beneath, we could have one sphere that's currently running, occupying one of our replay sphere control blocks, and as it's running it's being recorded, and the RSM will take the logs from software and hardware and store them into an overall replay log.

We can have a second sphere that's replaying that can then read data back in from the log and inject it into the hardware and inject it into the replay sphere at the appropriate times. We can have also a third sphere that's ready to run and is going to record but there's not enough hardware available.

So we put it in a queue just like you would with a traditional OS scheduler.

Now, one of the really interesting parts about this project were these three key challenges that we found when combining this type of hardware and software system. So the first part, the first challenge is when you copy data into a sphere, you have to be careful to make sure that there's a deterministic interleaving.

Second, when you're handling system calls, there are certain -- most of the system calls you get to emulate, others you have to reexecute and you have to make sure that there's enough determinism so it can be useful to the hardware.

Finally, when we have fewer processors available during replay than recording the software has to handle that.

So the first challenge I mentioned was copying data into a sphere. So in this particular figure we have an operating system with a replay sphere that has two different threads inside the replay sphere. And these threads have a shared buffer called buff.

So what happens is R thread 1 might issue a read system call, which will be handled by the operating system. The operating system will grab the appropriate data and it will use a function such as copy_to_user to copy data into the process.

Now, before the actual copy takes place, the RSM will make a copy of the data that's about to be loaded into the process and then will go ahead and carry out the copy operation changing the data.

Now, the problem is: what happens if R thread 2 accesses this buffer at the same time? Now remember, things inside the replay sphere aren't a problem, because the hardware will make sure we have a deterministic interleaving; but now we have the operating system making this type of change, and the operating system is not part of the sphere. Now we have a source of nondeterminism we have to handle.

So there are many different ways we could have solved this problem, but the solution we came up with is to include the copy_to_user function within the replay sphere. This is a very elegant solution, because the underlying hardware is already pretty good at recording these types of interleavings, so we use it.

Now, one of the trade-offs we make is that our copy_to_user function now has to be deterministic. So we have to guarantee that when this copy happens it's going to execute the exact same instructions given a set of inputs.

In practice we didn't find that to be too difficult, but it's one of the requirements. Another, more stylistic, issue is that now we've slightly blurred the boundary between what's in a replay sphere and what's not. Before, we had this very clean, elegant notion where only user mode code is inside the sphere; now we've got a little bit of the kernel being included as part of our replay sphere.

And there aren't any problems with this, necessarily; it's just more of a stylistic thing, one of the trade-offs that we felt we had to make.
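
A sketch of the interposed copy, assuming hypothetical helpers for saving the injected bytes and for telling the hardware to track the copy as part of the sphere; the real copy_to_user path in the kernel is of course more involved.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-ins for the RSM's helpers: saving the injected bytes into the
     * sphere input log, and telling the hardware that the copy itself should
     * be tracked as part of the sphere. */
    static void log_sphere_bytes(const void *src, size_t n) { (void)src; (void)n; }
    static void hw_track_as_sphere(int on) { (void)on; }

    /* The interposed copy: remember what is being injected (for replay), then
     * perform the copy while the replay hardware is recording its interleaving
     * against the sphere's R threads. */
    static void rsm_copy_to_user(void *user_dst, const void *kernel_src, size_t n)
    {
        log_sphere_bytes(kernel_src, n);
        hw_track_as_sphere(1);             /* copy now runs "inside" the sphere */
        memcpy(user_dst, kernel_src, n);   /* stands in for copy_to_user itself */
        hw_track_as_sphere(0);
    }

    int main(void)
    {
        char buf[16] = {0};
        rsm_copy_to_user(buf, "read() data", 12);
        printf("%s\n", buf);
        return 0;
    }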

So another issue that came up was this idea of emulating versus reexecuting system calls when we replay. Now, during replay the RSM emulates most system calls. So that means when the system call is issued, instead of passing it to the operating system, the RSM will just return whatever was returned previously.

But we do have a number of system calls that we have to reexecute. And the reason we reexecute these is efficiency. So one class of functions that we reexecute is thread management functions.

So we want to have something that's going to integrate cleanly with the OS scheduler, and the best way to do that is to actually create a thread. So we reexecute thread management calls. Another example is address space modification calls. Certainly we could emulate loads and stores if we wanted to, or we could rely on the underlying hardware to do this for us, and in order to do that we have to reexecute some of our address space modification calls.

So to make this a little bit more clear, I've got a quick example of both emulating and reexecuting system calls.

>>: I thought I had the bug. I'm trying to figure out what RSM -- why we only see the bottom third of RSM.

>> Sam King: Well, yeah, I --

>>: That's all right. Carry on.

>> Sam King: Fortunately you have a good enough imagination that you're able to fill it in for us.

Shown here on this figure we have the operating system below and this white chunk is the RSM.

We also have a replay sphere running up here above and two different R threads. And one of the system calls that we emulate is the read system call. So if thread 1 were to execute the read system call, it traps down to the RSM, which will then grab data from a log and then reinject this data back into the process. So it doesn't ever pass control to the operating system. This type of emulation is a pretty classic technique that other people have used with many replay systems.

Now, where things get a little bit more interesting is where we have to reexecute system calls. So let's say that thread 2 issues a fork system call. What we do is the RSM will trap it and pass control to the operating system and let it create a thread.

Now, part of the difficulty in this is that this thread is something that has to be visible to the hardware; the hardware has to know about OS-level threads. And this thread No. 674 is not going to be something that's deterministic. So what we can do is, in the RSM, trap control of the fork call before it returns, include this thread within our replay sphere, and then give it an R thread ID deterministically. So now we can guarantee that this individual thread has the exact same ID that it had before.
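
A sketch of the emulate-versus-reexecute split during replay, assuming Linux syscall numbers and hypothetical helpers for reading the input log and binding R thread IDs; it only illustrates the distinction described above.

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/types.h>

    /* Stand-ins for reading the sphere input log and for binding an OS thread
     * to its recorded R thread ID. */
    static long     next_logged_return(void) { return 0; }
    static void     inject_logged_bytes(void *buf, long n) { (void)buf; (void)n; }
    static uint32_t next_logged_rthread_id(void) { return 1; }
    static void     bind_rthread(pid_t tid, uint32_t rid) { (void)tid; (void)rid; }

    /* Replay-time dispatch: read() is emulated straight from the log and never
     * enters the OS; fork() is actually reexecuted, and the (nondeterministic)
     * OS-level child is rebound to the R thread ID recorded earlier. */
    static long replay_syscall(long nr, void *buf)
    {
        switch (nr) {
        case SYS_read: {
            long ret = next_logged_return();
            if (ret > 0)
                inject_logged_bytes(buf, ret);   /* same bytes as when recorded */
            return ret;
        }
        case SYS_fork: {
            pid_t child = fork();                /* let the OS create the thread */
            if (child > 0)
                bind_rthread(child, next_logged_rthread_id());
            return child;
        }
        default:
            return next_logged_return();         /* most other calls: emulated */
        }
    }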

Now, one more subtle issue that I want to talk about is something we found called implicit dependencies. So in this particular figure we've got a replay sphere with two different R threads and they're sharing the same page table. So one of the things that we found could happen is, if you have one thread that issues an mprotect system call, what this is going to do is cause a modification of the page table, and at some point later in time the second CPU is going to see this modification and it's going to affect how this software runs. Namely, it's going to cause a segmentation fault.

Now, one of the difficulties is that we have to guarantee that it happens at the exact same loop iteration, or else the replay isn't deterministic. So what you can see happening here is that, in effect, we've got a dependency being formed between these two threads in a way that's not direct.

Loads and stores we know how to handle. But this type of dependency is something that has to be deterministic.

So our solution is similar to the other, to the previous solution, where we use the underlying hardware to track these dependencies for us.

So the operating system has to know about these types of address space manipulations and it expresses this implicit dependency to the hardware explicitly so the hardware can make sure it can track any of these types of interactions.
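
A sketch of making such an implicit dependency explicit, assuming a hypothetical hardware hook that records an ordering point for the sphere before the page-table change is applied.

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Stand-in for asking the replay hardware to insert an explicit ordering
     * event for the sphere, so the address-space change is serialized against
     * the sphere's chunk order and lands at the same point during replay. */
    static void hw_record_sphere_event(uint32_t sphere_id) { (void)sphere_id; }

    /* Wrapper the RSM applies around address-space changes such as mprotect():
     * the implicit dependency is made explicit to the hardware before the page
     * tables are actually modified. */
    static int rsm_mprotect(uint32_t sphere_id, void *addr, size_t len, int prot)
    {
        hw_record_sphere_event(sphere_id);    /* ordering point for the sphere    */
        return mprotect(addr, len, prot);     /* then apply the page-table change */
    }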

So we built the Capo One system using modified DeLorean hardware that we simulated in Simics. We modified a 2.6-series version of the Linux kernel.

We have a user mode replay sphere manager portion that interacts with the kernel mode replay sphere portion and we record and replay processes. One slide that Josep didn't show you in the first part was what the DeLorean hardware looked like for the ISCA paper.

One thing I'd like to point out is that there's an interrupt log and an I/O log and all these different entities for recording any of the things you might see on a computer system.

So what we found is by pushing some of the complexity up to the software, we're able to make the hardware simpler. So now there are fewer things that the hardware has to worry about and it can just focus on what it does best and push some of this complexity up into the software.

Ed.

>>: What about delivery of asynchronous signals --

>> Sam King: How do we handle signal delivery?

>>: What's it called?

>> Sam King: So we pull the same trick that many different people, such as yourself, play, where we just deliver them at a system call boundary. We took a page out of your playbook.

However, this is a problem that actually we could solve quite well with DeLorean. The problem Ed's pointing out is that we have an asynchronous event that happens at some place in the middle of the instruction stream, and in an operating system these are signals. So to simplify our implementation we push delivery of asynchronous signals to system call boundaries.

It's still within the semantics of Unix signals but it's a trick that other people have played. The right way of doing this is to have ways to identify the precise place in the instruction stream to deliver this signal.

So x86 performance counters are one way of doing this but the DeLorean hardware provides a very nice second way of doing this because we know how many chunks have executed. This is a fundamental part of the system.

In order to evaluate Capo, we use two different environments. We use a simulation environment that has simulated DeLorean hardware and the full-blown operating system, and it's simulating a four-way SMP system. And we also have a real hardware configuration that obviously doesn't include DeLorean hardware, but it gives us an idea of how much overhead the software is going to add.

In this figure we have the log size we create as a result of our system. So on the Y axis we have log size expressed in bits per kilo-instruction, and on the X axis we have the three different benchmarks we ran: SPLASH-2, which is a traditional parallel benchmark suite; Apache, which is a web server that we ran some web performance workloads on; and then make, where we compile the kernel.

Now, there are two interesting results from this particular set of evaluations. So, first, in many respects it was a validation of the DeLorean design, because what we found is that the hardware-level log sizes were roughly equivalent to what they were in the full-system case. Now, the reason this is interesting is that when you integrate with software, there are more opportunities to cut these chunks short. Page faults and exceptions and all these different things could potentially cut chunks short, and what we found is that in the common case these don't occur often enough to affect our log size.

Now the second interesting result was that the software portion, which is shown up here in yellow, is much smaller than the hardware portion of the log.

So we were planning on doing a number of different optimizations to decrease log size, but what we found is that we didn't have to. Because everything's being dominated by the hardware, that simplified our software implementation as well.

Now, in this graph what I have shown here is the performance overhead during logging of our replay system. So on the left we have the normalized runtime where it's all normalized to one.

And anything higher is slower. So higher is bad.

And we break down the performance by the interposition layer and then everything else. So one of the interesting things we found here was that we made a very early design decision to use ptrace to prototype our system. So I've developed a rant in my class about ptrace based on this experience, where it gives you the allure of user mode code: great, I can implement the RSM in user mode, everything's going to be so great and so simple. Your first implementation is. But the problem is it's really, really slow. Slower than you would ever guess. So now you're stuck with an implementation that works and you've got to make it fast.

So you start shoehorning optimizations into the kernel, and it gets to a point where the optimizations to make the ptrace performance bearable end up being much more complex than if we had just implemented it inside the operating system to begin with.

So in hindsight, I guess the big lessons learned here are twofold. First, the performance overhead is really going to be small, and it's going to be even smaller with a kernel-based implementation; I would say around 20 percent overhead for a kernel make is about what we expect.

But another lesson is: don't use ptrace. So resist the temptation. Now, finally, the last result I want to show is replay performance. So on the Y axis here we have normalized runtime. And shown here we have the replay performance of three different benchmarks: Apache serving 1 K pages, Apache serving 10 K pages, and Apache serving 100 K pages. In this particular case lower is better; if you're lower than one, that means we're able to replay faster than when we logged.

And this particular set of results is interesting for a couple of reasons. One is we can replay a lot faster than we logged. And you know this is very related to a concept called idle time compression that was discovered by other researchers. We had a slightly different take on it, where our compression was coming from the network, but the key idea is that by emulating these actions instead of actually carrying them out, things go much faster.

So to me the implication here is if we do implement hardware support for replay, obviously the replay speed needs to be fast. The faster you make replay speed the better. But what these results seem to indicate is that we might have a little more design slack on replay performance than we have on logging overhead.

So if we have a potential trade-off to make, it might be better to optimize for logging or recording performance as opposed to replay performance.

>>: Can you start replay from the middle of execution or do you have to start from the beginning?

>> Sam King: The question is: can you start in the middle of execution? And really this is not something we covered explicitly in this research, but other people have looked at it. So the key idea is we've got these deterministic replay logs where you can recreate past states. That's only one version of time travel. Checkpointing, when combined with replay, makes a very powerful combination.

So if you wanted to replay from the middle you would take checkpoints as you're going, roll back to a previous check point and start from there. We didn't do it with this project because other people -- I've looked at this in the past with some of my colleagues from Michigan.

>>: [inaudible] you should be able to add checkpointing, is that a problem?

>> Sam King: Yes, so the amount of state added by the replay hardware is relatively small; using off-the-shelf techniques should work effectively.

So the conclusions are in the first part of the talk what we showed was a very clever and very efficient way of recording thread interleavings. This provides some hardware support for doing something that's just very difficult to do in software.

Now, what we did is we took this very clever mechanism and included it within an overall replay system. And one of the things we found is there are some very subtle and fundamental interactions that happen when you combine these two things. What we presented was Capo One, which is an overall replay system that includes both the hardware and software components. And that is the end of my talk. Are there any questions? I'm actually going to skip the additional research from my group, mainly in the interest of time. But it's profound, let me tell you.

So are there any final questions? Great. Thank you all for having us.

[applause]
