>> Tom Ball: Okay. I'm Tom Ball and it's my pleasure to welcome Charles Zhang
here. He's a professor in the Department of Computer Science and Engineering
at the Hong Kong University of Science and Technology, and his major area of
work is software engineering. He's been focusing lately on debugging of
concurrent systems, and he holds a Ph.D., M.Sc. and B.Sc. with honors, all
from the University of Toronto. So welcome, Charles.
>> Charles Zhang: Thank you. Thank you very much. All right, so I'd like to thank
Tom first for giving me this opportunity to talk about my work. So I'm very, very
happy to be here, to be the home [inaudible], to talk about debugging concurrent
programs, and very happy that [inaudible] came to give us some opinions.
So before I start to talk about research itself, let me just first introduce where I
come from. There are some old friends who know where I come from, but there are
many new faces. So I come from a small university in Hong Kong. It's located right here
on the shore of the ocean. It's quite small. It's got in total about 7,000 students
but it's quite specialized. So this is where I come from on the map. And my
research group is called the Prism Research Group. It was basically established
in 2009. I joined the university in 2008, so we're a pretty young research group.
That's why we're happy to be here to learn from the experts. I have four Ph.D.
students in the group. So the research focus of our research group -- we like
program analysis and we want to use it to help programmers debug and develop
large systems. Our current focus is concurrent systems as Tom has already said.
So we have a few major research projects. The topic of the talk today is one of
them: deterministic replay of concurrent software, so we'll of course talk about
that in this talk. Other than that we also work on automatic fixing of
concurrency bugs; we have two papers there. We also work on causal analysis,
or what people call predictive analysis, over traces, and we have a few interesting
techniques there. We also work on something at a much lower level, program analysis
itself: we have a technique that computes pointers more accurately, and
we're happy that it's been adopted by Soot. And we also work on some database
approach for program analysis; that's our ongoing work. Okay, so this is an
overview of the type of research we do in the group.
Of course today I want to talk about a specific area that we have made a few
interesting contributions to. It's called Deterministic User-level Replay of
Concurrent Programs. It's a joint work with Jeff who is also here and together
with pretty much everybody in the group. So this is something that everybody in
the group has invested some amount of time in. Okay?
I want to give you an outline of the talk. For experts I guess I don't have to
explain what replay is. But for people who're not very familiar with replay, I want
to define it in a very simple manner so that we know what we're talking about.
And then, I want to use a very simple example to illustrate some of the main
challenges of performing user-level deterministic replay. Then I'll talk about some
prior art to allow you to better understand the contribution of our techniques. And
I'll talk about our techniques, specifically the three techniques LEAP, STRIDE
and CLAP. LEAP was presented at FSE '10; STRIDE at ICSE this year; and
CLAP at the PLDI Student Research Competition.
We know that FSE and ICSE are software-engineering-centric conferences, so I
all the more welcome the opinions of this audience on the work that we did.
Then, I'll make some conclusions.
So to begin I'd like to first define what we mean by replay. First of all, when
I talk about replay in this talk I want to talk about programs that use threads on a
multiprocessor platform and communicate through shared memory. So these are
the very common kind of threaded programs that we deal with. So I'm not talking
about functional parallel programs like those in Haskell. These are shared-memory
multi-threaded programs, so this is the kind of program we want to replay.
So I'd like to offer a definition. I'd like to point out that this is by no means an
authoritative definition of what replay is. This is our working definition, so that we
have an easier way to position our contribution. So I define replay to be a
technique that repeats a previously exercised execution according to some kind of
diagnostic criterion. So that means that sometimes we replay the program to be
able to time travel so that we can go back to any point in a program's execution
and start to execute from that point. So we want to replay for full execution. In
some other areas we want to just browse the computed values of a previous
execution, but we do not care how the values are computed. Okay, so in some
other scenarios we replay just to get the computed values.
And a third category, into which I argue most replay techniques fall, is bug
reproduction, that is, reproducing a particular failure. Okay? So these
are just to show you some different purposes or different criteria that each replay
technique would try to achieve. As for the basic mechanism of replay -- the common
denominator, whatever the goal of the particular replay technique -- you always
monitor the application to produce some log, and you use the log to
try to reconstruct an earlier execution. So this is probably the basic mechanism
of any replay technique.
In terms of research, replay has been active for many years. The earliest
paper we can find -- we're not sure it is truly the earliest -- oops, I guess
Microsoft is checking on me. All right. So the earliest paper we can find is a
TOCS paper from 1986, and the line of work runs all the way to this year's ICSE
with our own replay work; these are just examples showing that this research
has been going on for many, many years. And publications on replay span many
different areas, all the way from software engineering conferences to hardware
conferences like MICRO or VEE.
The good news is that if we conduct research in this area, we have many different
venues where we can publish papers. Of course the bad news is that we have to
read the papers from all these conferences first to understand the state of the art.
So it has been a long-standing research area, active in many different
communities. From these conference venues we can probably already sense that a
replay technique can be implemented in hardware, such as by modifying
hardware cache-coherence messages, or in software at the user level through
program instrumentation, or at the system level where we modify the virtual
machine to enable logging of the low-level system events. So it can be
implemented in hardware or it can be implemented in software.
A replay technique is what we call deterministic if it guarantees to reproduce the
earlier run. So we call it a deterministic replay. We call it a probabilistic replay if it
only tries to reproduce the earlier run with the best effort, okay, with high
probability. So these are just some common terminologies that we use in this
area.
Our focus in our research group is what I call deterministic user-level replay for
bug reproduction. So there are some terms here so let me just explain a little bit
better. We like deterministic replay because we think it makes the diagnostic
process more effective. In addition, many sorts of client analyses require
deterministic replay, such as cyclic debugging, fault localization of concurrent
programs, or bug classification, where we have to run the concurrent program
again and again. Okay? So in that case determinism is a very, very good thing to
have. We also like user-level replay because we believe it's very easy to use;
it's very accessible. With user-level replay, the technique takes a regular user
program, instruments or transforms it, and just runs it. Right? So the technique
is going to monitor the application, produce a log,
and replay it. So this would not affect any other applications and this would not
require any OS-level, system-level or hardware modification. So it's very easy to
deploy, so we like this characteristic as well.
And we like bug reproduction. We think that's probably the most common use of
replay. The reason why we replay is to try to troubleshoot, try to find bugs. And in
particular we focus on reproducing the order of races. I'm going to illustrate
that; races are basically one of the major causes of concurrency bugs. So
we focus on reproducing the race order so that the programmer can have an easier
time diagnosing the bug. And by doing that, we can relax on reproducing the
original schedule or even the computed values. So we focus on the race order
and relax on reproducing the schedule and the computed values, so this is sort of
the focus of our technique.
>>: [inaudible]. So you can tell me if this is more appropriately asked somewhere
else, but it seems like those goals are a little bit kind of against each other.
Right? Like if you do it in the software there're a lot of benefits to that, but you're
kind of getting into the world of the program. Like if you truly want to be
deterministic, you know, it would seem to me that you need to remain outside of
the world that the program lives in. So it just kind of seems like software is really
good but it kind of encourages probabilistic; deterministic is really good but it kind
of, you know, encourages other decisions that are made.
>> Charles Zhang: So yeah, well, I'll illustrate later that the software approaches
have some problems. But it's good for producing many different concurrency
bugs. It has its shortcomings but it depends on how you define...
>>: [inaudible] probabilistic, right, like because it's practical. Right? I mean
deterministic? You always want deterministic but you'll settle for probabilistic
if it's practical. Right?
>> Charles Zhang: You have a point. So I'll illustrate later on for the challenges.
Probably I can answer your question after that.
>>: Okay, thank you.
>> Charles Zhang: Okay. All right, so before I actually talk about the details of our
techniques, I just want us all together to go through a replay exercise so that we
can get familiar with some of the challenges and also some of the terminologies
that I'm going to use later on. So our replay exercise actually is very simple. It's
probably the simplest concurrent program you'll ever see. So we have two
threads that both read a global variable, G, increment it locally, and assign
the result back to G.
So these are the two threads. And if we execute the threads following this order
meaning that we execute entire computations of T2 after T1, we will compute 2
for the global variable, G, at the end. Okay? So G is 2.
But if for some reason Statement 4 is executed before Statement 3 due to a
different kind of scheduling, we'll have G equal to 1 at the end. Okay? So this is a
very simple example. And I want to illustrate that for a concurrent program the
order of the race -- whether Statement 4 is executed before Statement 3 or after
it -- is the key to deterministic replay. We have to restore a particular
race order to be able to produce the same program output. So this is the
essence of deterministic replay: we need to capture and restore the order of the
race, as illustrated by this example; otherwise, the computation result will be
different.
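To make the running example concrete, here is a minimal Java sketch of the two threads; the statement numbers in the comments follow the slide's numbering, and the class name is made up for illustration:

    public class RacyIncrement {
        static int G = 0; // shared global variable

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> {
                int r1 = G;      // (1) T1 reads G
                r1 = r1 + 1;     // (2) increments locally
                G = r1;          // (3) writes back
            });
            Thread t2 = new Thread(() -> {
                int r2 = G;      // (4) T2 reads G
                r2 = r2 + 1;     // (5) increments locally
                G = r2;          // (6) writes back
            });
            t1.start(); t2.start();
            t1.join(); t2.join();
            // G == 2 if (4) ran after (3); G == 1 if (4) ran before (3).
            System.out.println("G = " + G);
        }
    }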
All right, so let me just illustrate the first solution which is a very intuitive solution
in that we can just record the order of race directly. Right? So if the order is
important, let's just record the order. So here suppose that the order is that
Statement 4 happens before Statement 3. And since we're doing user-level
recording we're going to insert some instrumentation code and try to record this
race order. By inserting this instrumentation we hope to produce a
log of 4, 3. Right? So the order is that 4 executed first, followed by 3. But again,
since we're doing user-level instrumentation, our instrumentation code also
suffers from the [inaudible] of the scheduling, so that our instrumentation code
can be executed later than the other instrumentation code on T1, on Thread 1.
This would produce a log of 3, 4. And if we use that log to replay the whole
program, we know, as shown by my earlier illustration, that we're going to
produce a different result. So this is a very simple illustration of the problem with
user-level instrumentation.
So the common practice is that we have to somehow make the instrumentation
and the actual statement execute atomically, that is, we have to use some kind of
synchronization. And for people who know about concurrent programming, we
know that adding additional synchronization is really, really bad.
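To make the problem concrete, here is a rough Java sketch of that common practice, assuming one global lock that makes each log append atomic with the access it guards (the names here are mine, not from any of the tools discussed):

    import java.util.ArrayList;
    import java.util.List;

    class LockedRecorder {
        static int G = 0;
        static final Object LOCK = new Object();
        static final List<Integer> raceLog = new ArrayList<>(); // e.g. [4, 3]

        // The log append and the actual access happen atomically, so the
        // logged order cannot diverge from the real access order...
        static int loggedRead(int stmtId) {
            synchronized (LOCK) {
                raceLog.add(stmtId);
                return G;
            }
        }

        // ...but every shared access now takes a lock, which both slows the
        // program down and perturbs the very schedules we want to observe.
        static void loggedWrite(int stmtId, int value) {
            synchronized (LOCK) {
                raceLog.add(stmtId);
                G = value;
            }
        }
    }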
Here I want to illustrate another solution to solve the problem: we do not use any
synchronization but instead I'm going to record the values that we read or write
for each of the shared variables. Okay? So in this case we can record that T1
reads G and the value is zero and writes to G the value is 1. So from these
values we can sort of reason that the first statement with T1 cannot happen after
this last statement of T2 because the value is zero, otherwise, the value would be
1. Right? So this is a very simple example to illustrate that it's possible for us to
find a schedule given the thread-local load/store values just by searching.
We just have to make sure we find a schedule that is correct, that is, sequentially
consistent. Gibbons et al. proved many years ago that this problem is NP-complete
(NPC). Okay? So we can do it, but it's an NP-complete problem.
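A sketch of this value-logging alternative, with purely thread-local logs and no locking at all (again, the names are illustrative):

    import java.util.ArrayList;
    import java.util.List;

    class ValueRecorder {
        static int G = 0;
        // Each thread keeps its own load/store log; no synchronization.
        static final ThreadLocal<List<String>> log =
                ThreadLocal.withInitial(ArrayList::new);

        static int loggedRead() {
            int v = G;                        // unsynchronized read
            log.get().add("read G -> " + v);  // e.g. T1 logs "read G -> 0"
            return v;
        }

        static void loggedWrite(int v) {
            log.get().add("write G <- " + v); // e.g. T1 logs "write G <- 1"
            G = v;
        }
        // Offline, a search must find a sequentially consistent schedule
        // that explains these logs -- the step Gibbons et al. showed to be
        // NP-complete.
    }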
So let me summarize what the example is trying to say. The example shows you
an order-based solution where we try to record the thread-access orders. We
explicitly track all the read-write dependencies, and we can guarantee to replay
the execution. But we need to use locks, a lot of them. So this is really, really
bad.
Or we can try the search-based solution. We can infer the read-write
dependencies from the load/store values, requiring no locks at all, but it's an
NP-complete problem, which means it doesn't scale: we cannot always produce
a schedule in reasonable time, so we're losing the determinism.
>>: So help me understand why it's NPC.
>> Charles Zhang: Well, it's not...
>>: The reconstruction is NP-complete?
>> Charles Zhang: So as I stated here, given a thread-local load/store value
trace like the ones that I show here, deciding whether there is a schedule that's
sequentially consistent is an NP-complete problem. So this is shown in...
>>: Okay, so...
>> Charles Zhang: ...the [inaudible] paper.
>>: ...[inaudible] that 1, 2, 4, 3, 5, 6. That's [inaudible].
>> Charles Zhang: Yeah, so if you want to assign a schedule order -- which one
executes first -- if you want to assign a total order to represent a schedule for
these statements, it's an NP-complete problem if you only have the value trace. Okay?
So I illustrated an order-based solution. I illustrated a search-based solution. And
the prior art for user-level deterministic replay is primarily order-based. Okay?
And the representative techniques are called Dejavu and Recplay, and let me
just briefly give you some information about these related works so that you can
understand our contribution a little bit better.
So to talk about the related work and also our contribution, I'm going to use a
slightly more sophisticated example, the one shown here, that involves two branches.
What's interesting about this example is that only one schedule, the one shown,
leads to both branches evaluating to true, okay, so that's the only schedule
we can follow. This just shows that we need to reproduce the schedule precisely
to compute the same result. So this is the example we're going to use.
And the first representative work for user-level replay is called Dejavu. It
was developed by J.D. Choi, who is also a well-known name in concurrency
work. His proposal is very similar to the first order-based example that I gave
earlier which is simply if we can identify all the locations in the program where the
shared-access, shared-variable or shared-object is involved in a computation, we
can actually record the exact order of the threads that exercise these shared
computations. So we can record all of them in the global log; so we can record a
global sequence of the thread schedule that computes the shared-variable
states.
So this proposal obviously works, right, but it's pretty inefficient because it
involves synchronizing all the threads onto a global log. So in this simple
example, producing this log involves six global syncs. By logging into a global log
we essentially tie all the threads together, which incurs a lot of
runtime overhead.
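A sketch of this global-order recording, assuming one global log guarded by a single global lock (illustrative names, not Dejavu's actual code):

    import java.util.ArrayList;
    import java.util.List;

    class GlobalOrderRecorder {
        static int G = 0;
        // One global log of thread IDs: the full schedule of shared accesses.
        static final List<Long> globalLog = new ArrayList<>();

        static int recordedRead() {
            synchronized (globalLog) {  // every thread serializes here
                globalLog.add(Thread.currentThread().getId());
                return G;
            }
        }

        static void recordedWrite(int v) {
            synchronized (globalLog) {  // six global syncs in the example
                globalLog.add(Thread.currentThread().getId());
                G = v;
            }
        }
    }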
To improve upon that, Recplay, another representative approach presented in
TOCS in 2003, uses the Lamport clock. From distributed computing [inaudible],
we know that the Lamport clock is a pretty efficient way of ordering distributed
events. So we model each thread as a distributed process and we model objects
as the sources of distributed messages. Right? So if you know about Lamport
clocks, this is a pretty straightforward mapping from that problem to this problem.
So what they do is associate each thread and each synchronized object
with a distinct clock. And this is the standard Lamport clock algorithm for
updating these counters, these clocks. For our example, the details of how
we arrive at these counters, how we compute them, are not important.
But basically the technique is to associate each shared variable and each thread
with a different clock, a different counter. And at the end each thread
maintains a local history of all the counter values for all its accesses to the shared
variables. So in this way the thread only --
So we call it a local sync: a thread only needs to synchronize the monitoring,
the observation, the logging, if the thread is involved with the particular
shared variable. There's no need to synchronize across all the
threads; that's why we call it local synchronization. So Recplay in this
example involves six local synchronizations plus six compare-and-updates for
updating the Lamport clock counters.
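A rough sketch of that Lamport-clock bookkeeping, assuming each shared object carries its own clock and each thread keeps a local clock plus a history log (my paraphrase of the standard algorithm, not Recplay's code):

    import java.util.ArrayList;
    import java.util.List;

    class ClockedObject {
        long clock = 0;  // the object's Lamport clock
        int value = 0;
    }

    class RecplayStyleAccessor {
        long myClock = 0;                             // this thread's clock
        final List<Long> history = new ArrayList<>(); // thread-local log

        int access(ClockedObject o) {
            synchronized (o) { // local sync: only threads touching o contend
                long c = Math.max(myClock, o.clock) + 1; // compare-and-update
                myClock = c;
                o.clock = c;
                history.add(c); // enough to reconstruct the order later
                return o.value;
            }
        }
    }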
So our contribution compared to the prior art is to lower the logging
overhead further while staying deterministic, and we also want to simplify the
replay process. Let me talk about our techniques now in detail. I'm still going
to use the same example, where only by following this particular schedule can we
produce this error.
So I want to first talk about LEAP, which is our very simple idea that improves
upon the Recplay work. The idea is very simple: instead of associating each
thread with a clock, we associate each shared variable with a sort of access
vector. So as the program executes, we just record the sequence of threads that
each shared variable sees during the execution. Okay? So we maintain the
access vector for each of the shared variables. So here, similar to Recplay, we
require local syncs. So each shared variable is required to only synchronize with
the threads involved in the computation or race against that variable. We do not
need to synchronize all the threads. So it's still the same; it's the six local syncs
in this example. So the way to replay is to actually think of these logs as stacks.
And we're just going to read the logs one by one or pop the item one by one and
use that value to direct a user-level scheduler. The user-level scheduler checks
whether it's a thread's turn to read or write that variable by checking the
top of the stack.
So here we just check if it's T1's turn to execute. Next we check the
top of the stack for y; it shows T2, so we switch to T2 and continue to
execute. And then we continue to check the log. Proceeding in this very simple
fashion, when we have consumed all the logs we can guarantee that our user-level
scheduler has followed the same execution path as the recorded run and finally
produces this error.
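Here is a minimal sketch of both halves of LEAP as just described -- recording into a per-variable access vector, and a replay-side check that consumes the vector like a stack; the names are illustrative, not the tool's actual code:

    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    class LeapSketch {
        static int G = 0;
        // Recording: one access vector per shared variable, locked locally.
        static final List<Long> accessVectorG = new ArrayList<>();

        static int recordedRead() {
            synchronized (accessVectorG) { // local sync on G's vector only
                accessVectorG.add(Thread.currentThread().getId());
                return G;
            }
        }

        // Replay: a thread may touch G only when its ID is at the front of
        // the recorded vector for G.
        static void waitForTurn(Deque<Long> replayLogG) {
            long me = Thread.currentThread().getId();
            while (true) {
                synchronized (replayLogG) {
                    Long head = replayLogG.peekFirst();
                    if (head != null && head == me) {
                        replayLogG.pollFirst(); // my turn: consume and go
                        return;
                    }
                }
                Thread.yield(); // not my turn yet
            }
        }
    }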
Okay? It's -- Yeah?
>>: I have two questions. First, how do you identify the shared variable
[inaudible]?
>> Charles Zhang: Yeah. We use a static analysis prior to the run. So there is a
phase of static analysis for identifying the shared variables.
>>: Based on [inaudible], right?
>> Charles Zhang: Yeah, we do it conservatively. Yeah.
>>: And the second thing is, the main reason why your recording phase is faster
than the global one is that you're basically allowing parallel recording.
Is that right?
>> Charles Zhang: That's right. That's right. That's right. It's a very simple idea.
We wonder why nobody thought of it earlier. But, nonetheless, it is a state-of-the-art
order-based deterministic replay technique. It's more lightweight
compared to existing approaches. And we use static analysis and bytecode
transformation to achieve it, as the questions you just asked touched on. We
have a formal proof of replay correctness, which I haven't found in any of the earlier
papers, and we contributed the first automated replay tool available to the public,
which is very surprising: we couldn't find any other automated tool for
replay, but our tool is publicly available for downloading.
The weaknesses of this approach are also apparent. There are still too many
synchronizations -- we have six, although they are local ones.
And when we lock these access vectors, we essentially eliminate all of
the low-level data races, even though many of them are benign
or intentional; essentially we're forcing a sequentially consistent execution of the
program. So these are the weaknesses. Yeah?
>>: In your tool would you do anything to try to optimize away conditionals?
Because you talked about how you execute a statement and then do, you know,
"If I should keep going then execute next statement." Is there anything you do to
try to, you know, say, "Oh, I should just do the next five off the stack?"
>> Charles Zhang: You mean for the replay part or for the...
>>: For the replay part.
>> Charles Zhang: For the replay part, whenever we reach a shared variable we
actually check the log to see if it's this thread's turn to execute.
So just as I illustrated...
>>: So you're doing that each time?
>> Charles Zhang: Yeah.
>>: You haven't found a need to put, like, different tricks in there?
>> Charles Zhang: Yeah, it's pretty -- well, in other words, it's pretty slow. We
really do force each computation to check the log during replay.
>>: Yeah.
>> Charles Zhang: Yeah, we haven't done any optimization on it.
>>: Thank you.
>> Charles Zhang: But our OOPSLA paper actually does some kind of
optimization on it, but I probably don't have time to talk about it. So given all
these weaknesses of LEAP, we have improved upon it, and we propose a
hybrid approach to address some of them. This is an approach in which
we require the recorder to synchronize on the writes but not on the
reads. So this is the main contribution of this technique, STRIDE. I call it a
hybrid approach because it is order-based in that we record the write order --
actually the same as the approach of LEAP -- and search-based in that we record
the values of reads and match those values with the writes
during replay. Okay? So there is a search element in the algorithm as
well. And we use the write order to bound the search complexity, so we don't
have the NPC problem; it's actually polynomial time. Okay? The total
number of steps to search is bounded by KN, where K is the number of threads and
N is the total number of operations. So it's a polynomial search.
So let me give you some details of how STRIDE works, still on the same
example. First of all, the difference is that we record the access vector for writes
only. All right? So for the shared variables x and y, we record the sequence of
threads that perform writes to these variables. At the same time we record
two values for each read.
I'm going to zoom in on this later, but right now this is the log produced by
STRIDE compared to LEAP. The write log is identical to LEAP's, and we have
two values logged for each read, so we call it a double log.
So let me just zoom in on this double-log feature of STRIDE; this is actually
the key contribution of STRIDE. Suppose that we have different instances of
writes on the same variable; in my example the series of writes writes a
sequence of prime numbers to a particular variable. And in the recorded
execution the read, say, actually reads the second write, which is the value 3.
As I showed you earlier, if we want to record the exact order of the race, we
need to make the recording and the actual operation happen atomically using a
lock. And I've shown you that this lock is a bad idea, right? We would have to
insert a lot of locks to do this. In STRIDE the big contribution is that we do not
need this lock anymore. Instead I'm going to read twice. The first read reads
the actual value committed by the remote write. And the second read reads the
version counter; whatever version that second read returns when it commits, we
call the bound.
So how do we use these values to restore the race order? Our purpose is to
restore the order of the race, right? It turns out that our bound, version number
four here, gives us a search bound: we only need to match against the values
written and committed by writes before version four. Okay? So this bounds our
search; that's why we're polynomial. And the rest is very simple. We just scan
backwards from the bound and find the match by value, because we have the
value of each read recorded.
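A sketch of the double-logged read and the bounded backward search, under the simplifying assumption that every write increments a global version counter inside its locked region; the names and layout are mine:

    class StrideSketch {
        static volatile int G = 0;
        static volatile int version = 0; // bumped inside each locked write

        // Recording a read: no lock at all, just two loads.
        // [0] is the value seen; [1] is the version bound.
        static int[] recordedRead() {
            int value = G;       // first load: the actual value
            int bound = version; // second load: bounds the producing write
            return new int[] { value, bound };
        }

        // Replay: writeValues[i] is the value committed by the i-th write to
        // G, known from the recorded write order. Scan backwards from the
        // bound and link the read to the first write with a matching value.
        // The link may differ from the original one, but the read returns
        // the same value, which is what the correctness proof needs.
        static int matchWrite(int value, int bound, int[] writeValues) {
            for (int i = Math.min(bound, writeValues.length) - 1; i >= 0; i--) {
                if (writeValues[i] == value) {
                    return i;
                }
            }
            return -1; // the initial value, before any write
        }
    }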
It's very apparent that it's possible we can establish a very different read-write
link than the original one. And we prove that this does not affect the replay
correctness.
>>: Are you going to define that more? "We prove this does not affect replay
correctness?"
>> Charles Zhang: Yeah, the full proof is in the paper but I can give you more
information if...
>>: Is the idea basically that it has to read the same value?
>> Charles Zhang: Yeah.
>>: Even if it's from a different write.
>> Charles Zhang: Yeah. It returns the same value.
>>: Okay.
>> Charles Zhang: Even if it's from a different write.
>>: Okay.
>>: But if the user is debugging the code, that's still going to kind of bring -- I
mean, I'm just thinking about this from a practical standpoint. If they're debugging
the code and they're trying to find the line in their source code that's causing
the problem, if it's still going to bring them there, I consider the proof correct,
right? Just practically.
>> Charles Zhang: So for...
>>: So if you tell me that's the case, I'm happy.
>> Charles Zhang: So if it's a failure -- this guarantees to reproduce the failure.
But our focus is to try to restore the exact race order so that the programmer
can actually understand how that failure happened. And this, in theory, does not
guarantee the original order but gives you one of the orders that can actually lead
to the same failure.
>>: Okay. So at least it gives them something to debug....
>> Charles Zhang: Yeah, it gives you at least, for example, another possible order
that leads...
>>: Yeah, that's [inaudible].
>> Charles Zhang: ...to the same failure. Yeah, so it doesn't guarantee to give you
the original order. So let me summarize: compared to LEAP, on this particular
example we only need to perform four synchronizations and we need to log two
actual values and two versions. The big change is that with double logging we
are not required to synchronize the reads. Right? So we only need four locks
instead of six. This doesn't seem like a lot of improvement compared to LEAP or
compared to Recplay, but in practice it brings a huge performance improvement.
So it's very intuitive. What we found is that in practice, on the
benchmarks we evaluated, a large majority of the shared-variable accesses are
reads. For example, in some of the benchmarks, such as Derby and SpecJBB,
more than 80% of the shared-variable accesses are reads. Okay, so think about
the degree of lock reduction that we can achieve using STRIDE compared to
LEAP. And the writes to shared variables in our experience are mostly already
carefully protected by programmers because they are more cautious about
writing to a shared variable. Our investigation reveals that, for example, in the
Derby benchmark more than 95% of accesses potentially incur some kind of
race; however, if we only count write-write races, it's only 15%, meaning many
of the writes to these variables are already protected by the programmer. This
means we do not need to insert additional locks for those writes; we can
leverage the existing ones, so the performance penalty is very limited because the
programmer already has a lock around that write. We can just insert our
instrumentation inside the lock region without adding locks.
Another amazing finding -- well, probably others have already observed the
same thing -- is that a context switch between the two reads of a double log is
actually not that frequent. Remember, we do not use a lock to synchronize those
two reads. In our experiments more than 90% of the reads can be resolved in the
first comparison; that means that even though I read the value twice, there is no
context switch in between. Only 58 out of 400 million reads in our experiments
required 4 comparisons, which means 4 different versions of the value were
committed in between those two reads.
>>: And those numbers are based on those facts, those...
>> Charles Zhang: Those numbers are based on the subjects that we evaluated,
so we cannot say this generalizes to any program. But intuitively I think it's not
that surprising. Okay? So the context switch is not that frequent for these accesses.
We do log extra values, but the value logs are separate from the version logs,
which means we can take advantage of locality -- long continuous runs of the
same value -- so it's very friendly to compression. In our experiments we don't
find that we generate a lot of log data either.
So this is actually a summary of one of the evaluations we did for LEAP and for
STRIDE. On the left side is an array of popular multi-threaded Java programs.
And the second column actually shows the percentage of reads of all the traces
that we collected. So a large portion of these operations are reads, so that's
where the technique with STRIDE can actually help a lot.
So the third column shows the overhead of LEAP compared to the original
execution. In some of the subjects, such as Derby, you can see that LEAP
already has only 10% overhead for recording. And the last column shows the
overhead of STRIDE; you can observe that it's significantly faster, that is, it has
significantly lowered the recording overhead compared to LEAP.
In particular, on the subject Moldyn, LEAP incurs 100 times overhead whereas
STRIDE drives it down to 1.5 times. Okay? So this shows that relaxing the
synchronization on all the reads has a huge impact on the recording overhead.
So the last technique I want to talk about is called CLAP. It's a solver-based
approach in which we go a step further to drive down the recording overhead. In
this approach we do not require any synchronization, okay, and we record only
the branch choices; in other words, we do path profiling. And we compute the
schedule by solving constraints. This is ongoing work -- I do not have full-blown
details, and our evaluation might still be premature -- but the core idea was
presented at the PLDI Student Research Competition and actually won first
place.
So the basic idea of CLAP is not to use any synchronization at all. We just
record thread-local paths at runtime; that's all we record, the local path. Then
we can construct the execution constraints on all the shared-variable
computations using symbolic execution along the path, and we use SMT solvers
to compute the schedule. Okay? More specifically we have path constraints,
program order constraints, partial order constraints, and read-write constraints,
and we feed them to a constraint solver to compute a global order of
shared-variable accesses.
And the good news is that path profiling is pretty lightweight. It's a classic
technique that people have worked really hard to optimize; for example,
Ball-Larus path profiling incurs only around 30% overhead.
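As a toy illustration of the recording side, each thread appends its own branch decisions with no synchronization whatsoever; real CLAP does Ball-Larus-style path profiling on LLVM bitcode, so this Java fragment only shows the flavor of the idea:

    import java.util.BitSet;

    class PathRecorder {
        // One branch-decision log per thread; never shared, never locked.
        static final ThreadLocal<BitSet> path =
                ThreadLocal.withInitial(BitSet::new);
        static final ThreadLocal<int[]> next =
                ThreadLocal.withInitial(() -> new int[1]);

        // Wraps every branch condition: records taken/not-taken and passes
        // the condition through unchanged.
        static boolean branch(boolean taken) {
            int[] n = next.get();
            path.get().set(n[0]++, taken);
            return taken;
        }
    }

    // usage: if (PathRecorder.branch(x > 0)) { ... } else { ... }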
So let me just give you some basic high-level steps of how we approach
recording in CLAP. So same example, we first record all the branch decisions.
Okay? So in this case we record the branch decisions and we collect the
execution path for each thread. And the next step is that we try to encode the
constraints based on this path. We have different kinds of variables. We have a
symbolic variable for each shared variable, for all the different instances of
shared-variable accesses. We have an S variable that symbolically represents
the value returned by a remote read. And we have order variables, corresponding
to each of the shared-variable access instances, that represent each access's
position in the global schedule. Okay?
So on the bottom, I'm showing how we can construct the constraints. The first
order constraint is just program execution order. It's very straightforward. And the
read-write constraint is basically saying that I'm trying to match the symbolic
variable that represents a remote read with the actual writes. But there are
several possible write instances to match to. For example this one, the one I
circled, can potentially be matched to its local write or to a remote write. Okay?
So depending on the actual pairing, I construct a different order constraint over
the order variables. If you don't bother with the details, this basically says, "If
that variable corresponds to the local write, the computation of the other thread
must happen either completely before that write or completely after the actual
read." Okay?
>>: So in this example, I mean the only difference in the programs that you're
observing are whether or not the conditionals evaluate true or false, right?
>> Charles Zhang: Right.
>>: So then you can actually have -- So, okay, there are certain executions in
which, for example, y is equal to 1 on the right-hand side. So you're saying by
observing the control flow, you are getting constraints on the interthread
execution? Because if...
>> Charles Zhang: So this example...
>>: ...if six executed and then seven executed then you know something about y
changing, and so that would require some sort of race [inaudible].
>> Charles Zhang: So we use the control flow to compute -- this example doesn't
show it explicitly, but it actually computes the constraints for each shared variable
to be executed first.
>>: But you're not...
>> Charles Zhang: Local constraints...
>>: ...recording the values of variables.
>> Charles Zhang: Not recording.
>>: So the problem is if you don't branch on some shared variables value, how
will you be able to reproduce the race on it?
>> Charles Zhang: We use -- probably you can answer this [inaudible], but we use
a load-store --.
>>: Well, maybe I just don't understand. You do have orderings on the shared
variables? Or is that what you want to compute? You don't have an order?
>> Charles Zhang: So the purpose of the computation is to compute values for these
O's, for these orders.
>>: Right.
>> Charles Zhang: To assign a global array of integers to these variables
[inaudible]....
>>: Okay, so I guess my point is if you just have straight line code, no
conditionals then there's only one path on each side. So what do you do in that
case?
>> Charles Zhang: Then we don't have the path condition. We only have the
order constraints and the read-write constraints. So it's easier...
>>: Actually so in that case you basically say any interleaving is possible. Like in
that case you just observe these two threads executing, but actually all the
interleaving's are possible in that case?
>> Charles Zhang: No, basically you have...
>>: I mean all interleavings that are consistent...
>> Charles Zhang: Yeah.
>>: ...with the values.
>> Charles Zhang: For example, you have a bad property, right? So you have to
satisfy this bad property, so you have another constraint here.
>>: Right. I'm just saying that in the case where you just have straight-line code
with no conditionals and just a bunch of assignments to shared variables and then
reads, you have a lot of possible interleavings.
>> Charles Zhang: Right.
>>: So [inaudible] is sort of a predicate.
>>: Okay. So I'd assume...
>>: [Inaudible].
>>: ...there's some predicate. Okay.
>> Charles Zhang: There's some predicate that we symbolically execute, form all
the constraints and [inaudible]....
>>: Okay, so there's at least one predicate that tells you something.
>> Charles Zhang: Yeah, there's one predicate.
>>: Okay.
>> Charles Zhang: Yeah, that's right. So [inaudible], I think this is a simplified
representation of what we did. But the core idea is to encode the different
constraints symbolically. On top I'm showing all the constraints for this particular
example, and we feed them to the solver. At the end the solver returns an integer
assignment to all the order variables, representing the position of each particular
access to a shared variable in the global schedule. So these are the order
variables, and we would just use them with the user-level scheduler to replay
this program.
So let me just summarize the characteristics of CLAP. We think the most
important contribution we make is to reduce multiprocessor replay to two
well-known problems: one is thread-local profiling and the other is automatic
theorem proving, both of which are very well-developed areas, very well-developed
techniques. And we also have a formal proof that the replay is correct, so that
the schedule computed by CLAP is guaranteed to reproduce the bug.
>>: If the bug doesn't depend on [inaudible] then it's not right. For example, if
you have, like, [inaudible], you have two data races, then since you have not
recorded any values, I think your schedule does not say which order
occurred.
>>: But that's a bug too.
>>: Yeah.
>>: I mean that's an observable bug under [inaudible].
>>: Sure, sure.
>>: But I mean generally I think you're also pushing off alias analysis to your
constraint solver, because in your simple case with scalars it was great. But when
you have pointers and indirection, if you were recording values you could use
value differences to disambiguate pointers. And generally there's a rule that if star
P is not equal to star Q then P and Q don't alias...
>> Charles Zhang: That's right.
>>: ...which you can use to your advantage when you're recording values. But
you don't even have that. So you're also going to have to put all the alias or
non-alias constraints...
>> Charles Zhang: Right.
>>: ...into your theorem prover too and ask it to also resolve potential alias,
yeah?
>>: Yes, that's true.
>> Charles Zang: That's true.
>>: So you're pushing off a lot -- you're making your runtime cheap but you're
pushing a lot of problems onto the theorem prover, which is not bad, but I just sort
of wonder. I mean it's not just about the -- you're going to have to have some
memory model to reason about aliasing. Because, like in his case, you may or
may not get the error depending on whether some alias occurred or didn't occur
earlier.
>> Charles Zhang: Yeah, the aliasing is definitely a problem. And I think KLEE
gives us a symbolic representation of the pointer values. But there are also
some problems with races, with stuff like that. And, yeah, we admit that's a
problem for us too. By having the order variables, another advantage of our
approach is that we can represent different relaxed memory models very easily --
for example, the TSO/PSO models -- and we just encode different order
constraints into our constraint solver to be able to address bugs under these
relaxed memory models. Another interesting characteristic is that, the same as
in the work on CHESS, we can encode a bound on the search. To give the
constraint solver an easier time, we use a context-switch bound, a preemption
bound, to reduce the search space. Basically we can provide a constraint on the
sum of the deltas between consecutive thread-local order variables: we sum them
and put that as a constraint into the constraint solver, so we can ask the solver
to give us only schedules within a certain preemption bound. So this is another
way for us to make the computation tractable.
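Continuing the same hypothetical Z3 encoding from earlier, one plausible way to sketch this bound is to sum, for each thread, the gaps between its consecutive order variables (each gap counts how many other-thread operations were scheduled in between) and constrain that sum:

    // Reusing ctx, solver and the order variables from the earlier sketch.
    // Consecutive operations of T1 are (o1, o3); of T2, (o4, o6).
    IntExpr[][] perThread = { { o1, o3 }, { o4, o6 } };
    ArithExpr foreignOps = ctx.mkInt(0);
    for (IntExpr[] ops : perThread) {
        for (int i = 0; i + 1 < ops.length; i++) {
            // (O_next - O_prev - 1) = how many other-thread operations ran
            // in between; summing these deltas bounds the preemptions.
            foreignOps = ctx.mkAdd(foreignOps,
                ctx.mkSub(ops[i + 1], ops[i], ctx.mkInt(1)));
        }
    }
    solver.add(ctx.mkLe(foreignOps, ctx.mkInt(2))); // e.g. a bound of 2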
This is a preliminary evaluation result for CLAP, and these are some of the
subjects that we used. CLAP is implemented entirely on LLVM, so on C
programs, different from my earlier subjects. What we observe is that the
recording overhead ranges from 10% to around 269%. For evaluation purposes
we also implemented a version of LEAP on LLVM, and we observed a significant
reduction in overhead, which is as expected.
All right, so let me just make some conclusions. I hope I have highlighted the
challenges of user-level deterministic replay: we want to reduce the recording
overhead and avoid the Heisenbug effect, where the instrumentation makes a
certain type of bug disappear. We also want to reduce replay complexity,
although in this talk I didn't have time to talk about that. I presented three
techniques: LEAP, STRIDE and CLAP. As future work, we'd like to focus on
things like event-driven programs; we very much want to address how to replay
long-running systems and also how to replay bugs in distributed systems.
Okay? So thank you very much, and thanks a lot for your time and attention. For
more information about our research, this is my home page. And I think I took a
little bit long but...
>> Tom Ball: Thank you very much.
[applause]
>> Charles Zhang: Maybe some feedback and suggestions.
>> Tom Ball: Yeah, we have plenty of time for some more questions.
>>: So I'm curious about your plans for future work. Can you say a little bit more
about what you're planning to do with event driven programs? I mean what kind
of programs are you referring to when you say event driven programs?
>> Charles Zhang: So I think about two types of event-driven programs. One is
low-level event-driven programs; the most famous example is the Linux kernel,
things like that. And given my background I'm more interested in addressing
event-driven [inaudible] systems, for example, or these messaging systems, where
debugging is very, very difficult. So I have some preliminary ideas, but my
surprise is that -- maybe I just don't know yet, but I don't know of any replay
work that addresses this kind of system. So that's why I think it's worth
investigating.
>>: I think that one challenge here, [inaudible] thought about this problem, is that
when we started looking at, you know, multi-threaded systems, there were
standard libraries that programmers could use to express multi-threaded
computation. So in the open-source or UNIX world we had these pthreads.
On Windows there is the [inaudible] library, and Java and C sharp all have
these standard APIs. But in this [inaudible] world, it seems to me that there's just
a huge variety of primitives and libraries. So it's kind of hard to imagine which
one you should target. You see what I'm saying?
>> Charles Zhang: Well, we start with -- well, I think the world is not completely
chaotic. For example, there are also well-defined type systems, type-based
messaging, so that we can have some kind of information about the semantics of
the messages. And there are also standards like the Java Message Service;
it's a standard. So I think we have some ground to start from. It's not really
completely chaotic.
>>: Okay.
>> Charles Zhang: But, unfortunately, I don't have further experience than that. I
just feel there's a great need for it.
>>: Yeah, I agree with that.
>>: [Inaudible] share the, like, actual analysis time for the offline part. How was
that? I mean, was it really, like, scalable to [inaudible] solve the scheduling
problems?
>> Charles Zhang: So the complexity is the same as CHESS: it's exponential in
the number of threads.
>>: Yeah. [Inaudible]...
>> Charles Zhang: To enumerate the...
>>: I can understand, like, the complexity in theory, but I'm just asking, in your
practice have you really re-executed or replayed the [inaudible] in a reasonable
amount of time?
>> Charles Zhang: Yeah, for all these subjects I list here we can -- yeah, I think
I wanted to race against the time, so I didn't explain it. The second column
is actually the time taken to reproduce the -- to trigger the error for these
subjects.
>>: Including the [inaudible] analysis?
>> Charles Zhang: No, this is only...
>>: [Inaudible]...
>> Charles Zhang: ...recording the runtime overhead.
>>: ...[inaudible]. Yeah.
>> Charles Zhang: And the time taken -- we have a table, but I should have put
it on a backup slide.
>>: [Inaudible] for the first six programs, we finished it. We didn't [inaudible]. And
for the last one [inaudible].
>>: What about the last ones? Did they, like, run overnight?
>>: It would have been difficult because there are too many [inaudible] needed to
explore.
>>: Isn't that really like a special benchmark, that one, two...
>>: That's like all it does.
>>: Yeah, all it does. And it's just kind of like a [inaudible] testing where a replay
[inaudible].
>>: Which one is that?
>>: That is racey.
>> Charles Zhang: The last one, it's...
>>: It's the worst one.
>> Charles Zhang: It's basically...
>>: [Inaudible].
>> Charles Zhang: ...wastes all the time.
>>: I see.
>> Charles Zhang: So for that, I think it's special -- well, that's the only
benchmark for which we are unable to produce the schedule. But hopefully in real-world
programs no race condition is like that.
>>: It is not a real [inaudible].
>> Charles Zhang: So all these subjects are small, but they are carefully
constructed to produce a concurrency error. So in terms of replay difficulty I don't
think they're any easier than the real or larger benchmarks. But we're still
working on larger benchmarks.
>>: You said you're using KLEE, right?
>> Charles Zhang: Yeah.
>>: So you're actually using KLEE then to do exhaustive enumeration to find the
path? So you're not really truly doing symbolic...
>> Charles Zhang: We modified KLEE not to duplicate states. Because we have
the paths, we just use it as an engine to give us the constraints along the path.
>>: Right, so you're using KLEE to give you the intra-thread path constraints? But
then to that you have to add all these other constraints.
>>: Right.
>>: Right? So then those are being added in.
>> Charles Zhang: Right.
>>: I see. So you're going to let KLEE execute to some point or are you just -- I
don't quite understand. So are you going to just take all -- Okay, you have the
separate thread traces. You're just going to make one pass with KLEE on each to
get a formula per thread?
>>: Right.
>>: And then you make one big call to the solver or --?
>>: Well, basically after we collect the path constraints for each thread...
>>: Right.
>>: And then we [inaudible] and then we encode all the other execution
constraints.
>>: So there's one big call.
>>: There's going to be one big call, yes....
>>: There's one big call, right? Okay. And that call, if it comes back, will give you
a model for the order variables.
>>: Right.
>>: That's interesting. So actually that trace then -- So a typical use of KLEE or
Sage is that you let go until the first branch and then you say, "Can I go a
different way?" You're not using it...
[simultaneous inaudible comments]
>>: ...long execution and I just want the path constraints for...
>>: That's right.
>>: ...the whole thing.
>>: That's right.
>>: And furthermore, that's why you get into this issue of symbolic pointers, too,
because KLEE is not doing everything symbolically and you would like it to make
more things symbolic probably. Is that true?
>>: So basically we need to mark the [inaudible] to be symbolic.
>>: Right. Because KLEE is -- I mean, as I understand it with Sage, you sort of
mark some bit of state as symbolic. Is that correct? But then there's all this other
stuff that you actually want to be symbolic that, if KLEE were to treat it in its
default way, would not be symbolic.
>>: [Inaudible]?
>>: Because it would be concrete.
>>: Yeah, we have to address the symbolic [inaudible] problem.
>>: So you need to make more things symbolic. KLEE is essentially like trying to
find changes to the input byte string that will cause the sequential program to go
a different way. Right? But in their case, they have all these global memory
accesses. Like for a single thread it's just like a function of the [inaudible]....
>>: It's basically just extra data because they're analyzing each thread
separately. There's this extra data [inaudible]...
>>: They want to make symbolic.
>>: ...in symbolic. Yeah, okay.
>>: Is that correct?
>> Charles Zhang: It has its own symbolic memory representation, and then it
actually emulates sort of the memory assignments.
>>: Yeah, that's right.
>> Charles Zhang: As well. So it gives us some kind of [inaudible] information but
not for all the variables.
>>: This is interesting.
>>: Yeah, so this is...
>> Charles Zhang: It doesn't...
>>: ...quite interesting.
>> Charles Zhang: Yeah, it doesn't solve arrays, for example.
>>: You definitely want to do partial -- I mean you definitely want to do reduction
on this trace so that you don't like look at all the possible...
>>: So there is...
>>: ...points for preemption.
>>: ...the research group at NEC Labs.
>>: Yeah.
>>: They do stuff like this, right?
>>: Yeah, I was telling them about the -- I mean...
>>: Like [inaudible]?
>> Charles Zhang: Yeah, Chao Wang.
>>: But it's different...
>>: And Chao Wang, yeah.
>>: But it's different. They don't have this local -- They're doing something
different because of the local stuff. Generally there what happens is, you record a
partial order of a parallel execution then you search for other schedules that obey
the same constraints but might find the bug.
>> Charles Zhang: I see.
>>: Right?
>> Charles Zhang: Okay.
>>: So...
>> Charles Zhang: So, yeah, this is different.
>>: So this is different. It's different, yeah. It's a different take.
>> Charles Zhang: Definitely a lot of room for optimization, but the key idea is:
do path profiling and then do symbolic execution on that path and solve the
constraints. And...
>>: You shouldn't call it path profiling, though. I mean you're really doing a full
instruction trace.
>> Charles Zhang: Well, path...
>>: I mean...
>> Charles Zhang: [Inaudible].
>>: I know, the path decisions, right. But it's like -- yeah, I would talk about
instruction [inaudible]. Okay. Cool, well thanks again.
>> Charles Zhang: Thanks a lot. [applause] Thank you.