>> Madan Musuvathi: Hi everyone. I'm Madan Musuvathi from the Research
and Software Engineering Group, and it's my utmost pleasure to introduce
Junfeng Yang. He is an assistant professor at Columbia University, and we go a
long way back.
We met in graduate school. So he worked on model checking as well, and he
applied model checking to find lots of errors in the Linux file system, and then he
moved on to apply those techniques to distributed systems.
And recently he's been working on reliable multithreading, how do you get
deterministic semantics to multithreaded programs. He recently received an NSF
CAREER Award for this work. So it will be exciting to hear more about it. Junfeng.
>> Junfeng Yang: Thank you for the introduction. And thank you guys for being
here. Since this is a small group, if you guys have any questions, feel free to
interrupt me. I tend to explain things better if it's more interactive.
So this talk is about our recent work on making multithreaded programs more
reliable. And the key idea here is called schedule memoization, which I'll explain
in detail in this talk. This talk is based on the work coming from two papers. One
paper and the system is called TERN, and another paper, the system is called
Peregrine. And it's joint work with my students Heming Cui, Jian Wu, [phonetic],
John Gallagher, Huayang Guo and Chia-Che Tsai. These are great students. I
enjoy working with them. If you guys are looking for interns, talk to these
students.
Okay. So let's look at a few trends that shape our computing today. The first
trend is a hardware trend. The machines are getting more and more cores. So
this figure shows the number of cores for AMD Opteron processors. You
can see in 2010 we have 12 cores on the Magny-Cours processor. How do we use all
these cores? Developers are writing more and more multithreaded code.
The second trend is a trend in software: the coming storm of cloud computing.
So we have more and more users connecting to the cloud of servers, and more and
more devices connecting to the servers as well.
And this creates lots of computation. And at the same time the computation is
actually aggregated on the servers in the datacenter. How do we get good
performance in this setting? For performance, virtually all these servers employ
multithreaded programs.
So at the same time, multithreaded programs are known to be difficult to get right.
They very often contain tricky hidden bugs, data races, for example, that
sometimes show up and sometimes disappear. It's really hard to write, test, and
debug these multithreaded programs.
And the key reason, I believe, and other researchers also believe, is that
these programs are nondeterministic. We run them, and it's like doing a coin toss.
Sometimes you get heads; sometimes you get tails. Success or failure.
Specifically, today, when we run a multithreaded program, even when we run
this program on the same input repeatedly, each time we may actually
use a different thread interleaving, or schedule. This rectangle represents the
input. Each time we run it, we may get a different schedule. There's essentially
a one-to-many mapping from inputs to schedules.
This nondeterminism actually complicates a lot of things. For example, testing
becomes a lot less assuring, right? Many of these schedules are
good and do not contain bugs, but some of the schedules contain tricky errors such as
data races and other concurrency errors. When we test a program, we may run
into only the good schedules, so the bugs do not show up in the testing
lab. But once we ship the code to end users, when they run the program, they may
hit the interleavings where the bugs show up, and get crashes. Similarly,
debugging becomes very complicated, because in order to track down a bug,
right, we have to reproduce the same buggy interleaving, which can be hard.
So recently researchers have proposed a good idea called deterministic
multithreading, which conceptually forces the program to always use the same
schedule on the same input. Essentially there's a one-to-one mapping for
existing DMP systems.
You know, when we run the program on the same input, regardless of how many
times we run it, each time -- we always run into the same schedule, right?
So this idea is great. It solves a lot of issues caused by nondeterminism. But at
the same time these existing systems actually suffer a limitation. Basically they
make the schedule tightly dependent on the input. If you change the input slightly,
let's say you just change the input by one bit, for example, these systems
may force the program to run into a very different schedule, which may also contain
bugs.
So there's this tight input sensitivity, this input dependency,
for existing DMP systems.
As I just mentioned, some of these schedules also contain bugs. And this will
complicate testing as well, because we can test some inputs in the testing lab but
cannot cover all the inputs. We may ship the program to end users, and when they
run it on some slightly different inputs, they may actually run into bugs, right? Similarly,
debugging becomes challenging because in order to reproduce the buggy
execution we have to reproduce exactly the same input, the input coming from
the environment. All the bits have to match, which can be hard, for technical
reasons or privacy reasons.
By the way, we actually observed this behavior: if you change the input slightly,
the bug shows up; if you change the input more, the bug disappears, with existing
DMP systems.
So to address this instability problem, to reduce this unstable behavior, we
propose an idea called schedule memoization. The idea is to memoize the
schedules that worked in the past, right, and in the future, when an input comes, we
can try to reuse these schedules.
Therefore, we can avoid unknown, potentially buggy
schedules. And we call this stability, right? Hang on to the good, tested schedules;
in the future, if a new input comes, we'll try to reuse these
schedules that are shown to work. Therefore, we can avoid bugs in the unknown
schedules.
Do you guys have questions? So I guess a typical question I get here is: if you do
this, you sort of live with a program which contains bugs, right, because you hang
on to the schedules which do not contain bugs, and the program itself is still not
correct, right?
My answer is that this technique can be turned on for
production systems. When we actually deploy the system, we ship the code to users,
hang on to the good schedules, and avoid the schedules we have not tested.
In testing, you can turn this technique off and use tools such as CHESS to
actually explore different schedules and get a set of correct schedules.
You can then ship the schedules together with the program to end users.
Question? Right.
>>: So are you going to explain what you mean by the same schedule?
>> Junfeng Yang: I'm going to explain that, define what a schedule is, right, in the
next few slides. So we actually call the first system that implements this schedule
memoization idea TERN, because the idea actually matches natural
tendencies in animals and humans. We repeat familiar things to avoid errors
in the unknown. This slide shows migration routes for migrant
birds. They follow fixed routes. They don't run into random routes. The first system that
implements memoization is TERN, named after the Arctic tern, the bird that
migrates the farthest among known animals. This is a picture of an Arctic tern.
Now we can reuse schedules for different inputs. A challenge remains, which is
determinism or efficiency: you can only pick one, you cannot pick both. I'm going to
explain what I mean by determinism and efficiency in the next few slides and also
define what a schedule is.
At the language level, where does nondeterminism come from? It comes from
mainly two sources. First we have nondeterministic synchronization, for example,
locks. Locks can be acquired in different orders.
This shows a bug in Apache that's caused by nondeterministic synchronization.
The first thread acquires a lock and uses a shared object. The
second thread frees the object. And they both compete for the same lock. If the code is
[inaudible], this code is correct. However, if it's run in the reverse order, we have
this nondeterministic order in terms of lock acquisition, and the code will run into a
use-after-free crash. So here, this piece of code crashes or runs correctly
depending on synchronization order.
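The lock-order bug described above can be mimicked with a minimal Python sketch. The names `SharedObject` and `run` are hypothetical and are not taken from Apache; the two lock acquisition orders are replayed explicitly instead of being left to the OS scheduler, so both outcomes are visible.

```python
class SharedObject:
    """Stand-in for the shared object guarded by the lock (hypothetical)."""

    def __init__(self):
        self.freed = False

    def use(self):
        if self.freed:
            raise RuntimeError("use after free")
        return "used"

    def free(self):
        self.freed = True


def run(lock_order):
    """Replay the two critical sections in the given lock acquisition order.

    "T1" uses the object and "T2" frees it. Each critical section runs
    as one step, as if it held the lock for its whole duration.
    """
    obj = SharedObject()
    actions = {"T1": obj.use, "T2": obj.free}
    try:
        for tid in lock_order:
            actions[tid]()
        return "ok"
    except RuntimeError as err:
        return str(err)


print(run(["T1", "T2"]))  # T1 acquires the lock first
print(run(["T2", "T1"]))  # T2 acquires the lock first
```

The same code gives a clean run or a use-after-free depending only on which thread gets the lock first.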
Another source of nondeterminism is data races. This slide shows a data
race example from one of the benchmarks we use, an implementation in
SPLASH-2, a popular parallel benchmark. So here, thread 0 prints
out some results and thread 1 actually computes the results.
This is the computation, and this is the print. If the race is resolved this way, we
get the correct result.
However, if the race is resolved the other way, we actually print out wrong results.
Here, this piece of code prints out correct or wrong results depending on how the
races are resolved, the order of memory accesses.
So given these two types of nondeterminism, researchers have proposed two types
of schedules, right? The first is called a sync schedule. Basically it's a deterministic
total order of synchronization operations such as lock and unlock operations. And the
second type is a deterministic order of shared memory access operations. So
you enforce the order of load and store operations to the same memory location. Both
types of schedules have pros and cons. So with a sync schedule, for this
particular example, if you just enforce this order, right, or we enforce this order,
either way we can make this code deterministic: the bug always shows up or the
bug never shows up. The advantage here is that it's actually very efficient.
We only enforce a total order of synchronization operations. Most of the code does
not do synchronization.
Therefore it can still run in parallel. The overhead reported is 16 percent on average.
The downside is that if there are data races, this approach won't work. Let's look at this
race again. We can enforce deterministic orders on the [inaudible]. But at the
same time this race can be resolved in different ways. Therefore you can still
have nondeterministic behaviors.
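One way to picture enforcing a sync schedule is a turn-based wrapper around each synchronization operation. The following is a minimal sketch under stated assumptions, not the mechanism of any system named in the talk: `SyncSchedule`, `sync_op`, and `run_once` are hypothetical names, and the schedule is assumed to list exactly as many operations per thread as the thread performs.

```python
import threading


class SyncSchedule:
    """Enforces a fixed total order of synchronization operations."""

    def __init__(self, order):
        self.order = list(order)   # thread ids, one entry per sync operation
        self.next = 0              # index of the next operation allowed to run
        self.cond = threading.Condition()

    def sync_op(self, tid, action):
        """Run `action` only when the schedule says it is `tid`'s turn."""
        with self.cond:
            self.cond.wait_for(
                lambda: self.next < len(self.order)
                and self.order[self.next] == tid)
            action()               # stand-in for a lock acquire or release
            self.next += 1
            self.cond.notify_all()


def run_once(order):
    """Run one thread per id in `order` and log the order actually taken."""
    schedule = SyncSchedule(order)
    log = []

    def worker(tid, count):
        for _ in range(count):
            schedule.sync_op(tid, lambda: log.append(tid))

    threads = [threading.Thread(target=worker, args=(tid, order.count(tid)))
               for tid in sorted(set(order))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


order = ["T1", "T0", "T0", "T1"]
print(run_once(order) == order)
```

Code outside `sync_op` still runs in parallel, which is why this kind of schedule is cheap; it also says nothing about unsynchronized memory accesses, which is the downside just described.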
So the second way to make a program deterministic is to enforce a memory
access schedule, right? For this particular example, if we always enforce this
order, or this order, either way, we enforce the same consistent order
and make the behavior consistent. Either the bug always shows up or the bug
never shows up.
This is fully deterministic; that is the advantage. The problem is that
a normal program does lots of shared memory
accesses. If you enforce these orders, you can slow down the program by a
lot -- the number reported in previous systems is from 1.2x to 10x slowdown.
All right.
>>: So there are two kinds of overheads. So one is the instrumentation
overhead: just by having to do some extra work at each memory operation, we are
going to pay some cost.
>> Junfeng Yang: Slowdown, right.
>>: And then there's this other cost of actually enforcing the determinism, wherein
the hardware has more parallelism but you're not able to use that because
of the data.
>> Junfeng Yang: Right.
>>: Do you know how you separate the overhead?
>> Junfeng Yang: That's a good question. Actually I do not have specific
breakdowns for the instrumentation overhead for memory access. But for the
synchronization order overhead, like this -- so right here, if you enforce this
synchronization order, the overhead actually comes mostly from waiting, because
there are few synchronization operations in the code. And here -- so for memory
access, I think the instrumentation alone gives you maybe 5x, 10x slowdown.
It's reasonable to say the waiting time is not that critical here.
So the challenge here is: either we pick the sync schedule and enforce synchronization
order, or we enforce memory access order. If you use sync order, we do not get full
determinism, because the data races are not deterministic, but at the same time the
code actually runs efficiently. If you pick memory access order, we get full
determinism: the races are resolved deterministically. But at the
same time we get this efficiency problem because the slowdown can be quite a
lot. Can we actually get both?
With Peregrine, you can actually get both. You can get both determinism and
efficiency. I'll actually explain how we do that. Now one key idea in
Peregrine is that we can enforce hybrid schedules instead of mem schedules or
sync schedules. And the insight here is that races actually rarely occur.
Most programs have races, either benign or harmful, but these races in
practice actually do not occur that often. Intuitively, if your program has lots of
races and you run it in the testing lab, you probably have detected these races already.
Empirically, we ran six programs that contain races. We ran some
standard workloads, such as compressing a file for a parallel compression utility, and
there are millions of memory accesses, [inaudible] operations. But we detected
only up to 10 races. The races actually do not occur that frequently. If you look
at the races in an execution, most of the execution is fine and does not contain
races. Only a minor portion contains races.
>>: What does it mean by [inaudible].
>> Junfeng Yang: Containing races means there are concurrent memory
accesses with no enforced order.
>>: Happening at the same time, or using happens-before --
>> Junfeng Yang: Using happens-before. Using happens-before. You have an
access here and another access there; if you look at absolute time, they're not concurrent.
But they could happen in the other order if you run it again.
>>: So these are up to 10 happens-before races.
>> Junfeng Yang: Right. They're happens-before races. This is tied to the
synchronization order we enforce. I'll show more details when I talk about the
example. If you run a regular detector, the detector can detect a lot more races,
but here we assume that we enforce a total synchronization order, which prunes a
lot of races.
So based on this idea, right, on this insight, we propose to use hybrid
schedules to combine the benefits of both sync schedules and mem schedules.
The idea is that for this portion, which is race free, let's use sync
schedules and enforce a total synchronization order to get efficiency. And for
this portion, the minor portion that contains races, we can enforce the more expensive
memory access schedule.
Here we have more overhead, but this portion is only a small portion of the
entire execution, so the overall overhead wouldn't be that high, right?
And we implement this idea in a system called Peregrine. This picture
actually shows the peregrine falcon, which is the fastest flying bird.
Specifically, how do we compute hybrid schedules? The idea of schedule
memoization actually makes it very easy to compute hybrid schedules. First, we
run the program on a new input, one we have not seen in the past.
We record an execution trace which contains both synchronization operations and
memory access operations, and their order.
We do that because we do not know if this execution will run into races or not.
Once we've done that, we have this execution trace, and we run some offline
analysis on the trace to relax it into a hybrid schedule.
This hybrid schedule contains a synchronization order for the
race-free part, and memory access schedules for the racy part. With this
schedule, in the future if an input compatible with this schedule comes,
we can reuse this schedule on the new input with both determinism and
efficiency. So when can we actually reuse schedules, right? We actually asked this
question. Think of a simple parallel program such as parallel compression,
a tool in our benchmark. It splits the file into multiple blocks and
then compresses the blocks with multiple threads. As long as the
number of blocks and the number of threads remain the same, we can compress the
file using the same compression schedule regardless of the file contents.
It's often the case that you can actually reuse these schedules. Our evaluation
actually found that for programs such as Apache, we can reuse schedules
over 90 percent of the time, and for many programs, not just Apache.
And also, the reuse part requires some analysis. In the
Peregrine system we made this part automatic. It does not require user
annotations. Question?
>>: So would you need to know which is the data-race-free portion and which is --
>> Junfeng Yang: So depending on what you mean by new input. Once we
figure out one schedule, we're going to compute, okay, for this set of inputs you
can totally reuse this schedule without introducing new races at runtime, right?
Therefore, these inputs would be covered by this schedule. If there is some new
input that is not covered by your existing schedules [inaudible], that
input requires a new analysis.
>>: And you're going to have details on --
>> Junfeng Yang: Right. I'm going to show you an example explaining how this
works, right. Question?
>>: You probably are going to talk about it later, but where do you get the inputs
that you tried for Apache?
>> Junfeng Yang: For Apache, the input comes from the network receive method,
the receive system call to receive network data.
>>: I'm just wondering, like, because it really depends on how different the
inputs are.
>> Junfeng Yang: Right.
>>: I'm just wondering, you probably issued a number of requests, how
you generated those requests.
>> Junfeng Yang: So we generate the requests using two methods. One
method is to run a synthetic testing tool, which launches a bunch of HTTP
requests that fetch different URLs. And the second workload
we got is a trace from the Columbia website.
So we collected a one-month trace from the website, looking at all
the HTTP requests issued to the Columbia website. In the evaluation we
actually launched these HTTP requests against our system and got the reuse rate. Does
this answer your question?
>>: Yes. Mostly static, or was it static and a combination?
>> Junfeng Yang: Mostly static, those HTTP requests. So our system also
handles Pthread synchronization, runs on Linux, and for reliability we actually
made our system run in user space instead of modifying the kernel. And it also
works with server programs like Apache.
So I'm going to show you the summary of results before I talk about the details of
the system. We evaluated Peregrine on a diverse set of 18 programs, including
programs such as the Apache web server and the parallel compression utility, the
tool I just talked about.
And also aget, a parallel file downloader, and pfscan, a parallel file scanner.
We also evaluated our system on 13 scientific programs, programs
implementing scientific algorithms, ten from the popular SPLASH-2 benchmark and
three from the popular PARSEC benchmark. We also tried a program called Racey,
which contains tons of races. This program is actually a stress testing tool
for deterministic replay and deterministic execution, because if any race is resolved
differently, you get different results from this Racey benchmark.
And Peregrine can deterministically resolve all the races in these programs. We
get determinism. As for the overhead, it sometimes can actually speed up the
program by up to 54 percent. The overhead is up to 49 percent, and it can
also frequently reuse schedules for nine of the 18 programs.
>>: The speed up.
>> Junfeng Yang: The speedup, basically -- so there are two key reasons
why we can actually speed things up. The first is that once we enforce
synchronization order, we can replace expensive synchronization operations with
cheaper ones. For example, there are operations such as barrier wait, which
causes an unconditional context switch that can cause a lot of overhead, right? But
once we enforce the order, we can use cheaper ones that just do not do the context
switch. That can make it faster.
And also some of the benchmarks actually use usleep for
synchronization. So to synchronize, they do not use explicit synchronization; they use
usleep to sleep for a while. And once you enforce the order, you
can skip the sleep. That's why we can speed things up. Okay. So let me show you
an overview, then talk about how the system is implemented using a
detailed example, then show the evaluation and conclude.
So Peregrine actually uses both runtime components and compile-time
components. Given a program, first we'll instrument the program using our
instrumenter, which runs in the LLVM compiler infrastructure. And then at runtime,
we maintain a key data structure called the schedule cache, which contains a
set of tuples <C, S>. S
is a hybrid schedule and C are the preconditions required for us to reuse the
schedule on new input.
So I just talked about this parallel compression example, where if we maintain the
same number of blocks and the same number of threads, we can reuse the
same schedule all the time, right?
However, what if the input actually changes? For example, if we want to compress a file
using more threads or fewer threads, we need to determine that this
schedule is not compatible with the input and that we need to use a new schedule. How do we
do that? We actually capture these constraints in these preconditions. When we
get a new input, we can check the preconditions and determine whether or not we
can reuse the schedule.
So let's say the input comes. We first do a lookup in the schedule cache to figure
out if the input matches the preconditions. If they do not match, right, we forward this
input to a component called the recorder, which will record a detailed execution
trace including synchronization operations and memory access operations. Then we
run the offline analyzer to extract a hybrid schedule and the
preconditions required to reuse the hybrid schedule on new inputs, create a new
tuple, and insert it into the schedule cache.
And if the input actually matches an entry in the schedule cache, right, it's simple: we just
forward the input and the schedule to the replayer, and the replayer will run the
program while enforcing this schedule S.
And because the analyzer computes correct preconditions, we can
be sure that this input can always be processed by the program while enforcing this
schedule S.
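The lookup, record, and replay flow just described can be sketched as follows. This is only an illustration under stated assumptions: `ScheduleCache`, `record_and_analyze`, and `run` are hypothetical names, the schedule is a placeholder string, and the precondition is the thread-count and block-count check from the parallel compression example rather than anything computed by a real analyzer.

```python
class ScheduleCache:
    """Holds <C, S> tuples: precondition C and hybrid schedule S."""

    def __init__(self):
        self.entries = []

    def lookup(self, inp):
        # Return the first schedule whose precondition accepts this input.
        for precondition, schedule in self.entries:
            if precondition(inp):
                return schedule
        return None

    def insert(self, precondition, schedule):
        self.entries.append((precondition, schedule))


def record_and_analyze(inp):
    """Hypothetical stand-in for the recorder plus the offline analyzer."""
    nthreads, nblocks = inp["nthreads"], inp["nblocks"]
    schedule = f"hybrid schedule for {nthreads} threads, {nblocks} blocks"

    def precondition(new_inp):
        # The file contents are deliberately not part of the precondition.
        return (new_inp["nthreads"] == nthreads
                and new_inp["nblocks"] == nblocks)

    return precondition, schedule


def run(cache, inp):
    schedule = cache.lookup(inp)
    if schedule is not None:
        return "replay", schedule          # hit: reuse the memoized schedule
    precondition, schedule = record_and_analyze(inp)
    cache.insert(precondition, schedule)   # miss: memoize a new <C, S> tuple
    return "record", schedule


cache = ScheduleCache()
print(run(cache, {"nthreads": 2, "nblocks": 4, "data": b"aaaa"})[0])
print(run(cache, {"nthreads": 2, "nblocks": 4, "data": b"zzzz"})[0])
print(run(cache, {"nthreads": 3, "nblocks": 4, "data": b"aaaa"})[0])
```

The second call has different file contents but the same thread and block counts, so it reuses the schedule; the third changes the thread count and falls back to recording.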
So, okay, this may sound abstract, just a bunch of
conceptual ideas. Let me show you how this works using a real example.
So this code is taken from PBZip2 and also the FFT program in our
benchmark, simplified to fit within this slide. It first reads the input, and then it
spawns a bunch of worker threads to process the data. If you look at
the worker thread, it does some memory allocation, reads data, and does some
computation. This is simplified. The real code is more complicated than that.
And the workers update the result by grabbing a mutex, updating this shared
variable result, and releasing the mutex. The main thread also participates in this
computation; it calls worker as well. And optionally, the main thread also updates
the result based on argv[3]. And then it prints out the result.
This code actually contains a bug. Can you see the bug here?
>>: Oh, yeah.
>> Junfeng Yang: Right. So there's a missing pthread_join, so this code
can run in parallel with this update code. Therefore we can have races,
write-write or read-write races, for example.
So given this piece of code, we'll first prepare it for use at runtime and run it
through the instrumenter. The instrumenter will first annotate all the lines that
actually read inputs, for example, the lines that access argv; command line
arguments are considered inputs to our program. And for this function, the read system
call to read file data, we also instrument the system call to mark that data as
input. Given this input data at runtime, we can track how the input data is used,
and therefore we can compute the preconditions required for reusing
certain schedules. The instrumenter also instruments synchronization
operations such as lock operations and pthread_create so that we can record schedules.
Let's say after instrumentation we get the instrumented program, and let's say we
run this program with this particular set of arguments, 2 2 0, which means we're
going to spawn two threads, we're going to process a data set of two bytes,
and 0 means we do not actually update the result optionally.
And the recorder will record a detailed execution trace. It records the statements that
read inputs, and records the execution of this loop as well.
Since the number of threads is two, the first time this i is compared to nthread,
we get true, right? And the second time, when we compare 2 to nthread,
we get false and the loop terminates. So this corresponds to the execution of this loop.
And we run this worker thread, and we also record its
statements, the instructions executed. Also this gray area corresponds to the loop.
Similarly, we do the same thing for the worker called by main. And then this
main function will update the result by grabbing this lock, and the child thread, the
worker thread, will also do the same thing. We record the outcome of this
comparison against the flag, argv[3]. And then the last statement is to print out the
result. So once we get this trace, I'm going to show you how we can actually
extract hybrid schedules and also constraints.
First we're going to copy the synchronization operations to the hybrid schedule.
And we also enforce their order. So if the order in the execution trace is like this,
right, in the future, when we reuse the schedule, we're going to enforce the same
order.
And next we're going to detect races with respect to this synchronization order.
There are three accesses to the shared variable result. If you look at the first pair,
they're not a race, right? Because there's a synchronization order
constraint here making them not concurrent. Even if these two threads
grabbed totally different locks, right, we would not flag this as a race either.
Even though the locks are different, we're going to enforce this total order of
synchronization operations at runtime. Therefore, these two accesses never happen
concurrently. Therefore they're not a race. Does this answer your question?
Okay. And let's look at the other two accesses, these two accesses. There's
no synchronization operation in between [inaudible], so a happens-before
race is detected. As a result of this race, we're going to add the two
memory accesses into the schedule. We'll enforce an order between these two
accesses so in the future when we reuse the schedule, it's totally deterministic.
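The extraction step can be illustrated with a small sketch. It is an assumption-laden simplification, not the analysis from the paper: the trace format, the `hybrid_schedule` function, and the explicit "begin" event for the worker thread are all hypothetical. Two conflicting accesses are treated as ordered when a synchronization operation after the first (in its thread) comes before a synchronization operation preceding the second (in its thread) in the recorded total order.

```python
import math


def hybrid_schedule(trace):
    """trace: list of (tid, kind, what) tuples in recorded order.

    kind is "sync", "read", or "write"; `what` names the operation or variable.
    Returns (indices of sync ops in order, list of racy access index pairs).
    """
    def next_sync(i):
        # First sync op by the same thread after event i.
        tid = trace[i][0]
        return next((k for k in range(i + 1, len(trace))
                     if trace[k][0] == tid and trace[k][1] == "sync"),
                    math.inf)

    def last_sync(j):
        # Last sync op by the same thread before event j.
        tid = trace[j][0]
        return next((k for k in range(j - 1, -1, -1)
                     if trace[k][0] == tid and trace[k][1] == "sync"),
                    -math.inf)

    sync_order = [i for i, e in enumerate(trace) if e[1] == "sync"]
    accesses = [i for i, e in enumerate(trace) if e[1] in ("read", "write")]
    mem_order = []
    for a, i in enumerate(accesses):
        for j in accesses[a + 1:]:
            (t1, k1, v1), (t2, k2, v2) = trace[i], trace[j]
            conflict = t1 != t2 and v1 == v2 and "write" in (k1, k2)
            # i precedes j in the trace, so only i-before-j needs checking.
            ordered = next_sync(i) < last_sync(j)
            if conflict and not ordered:
                mem_order.append((i, j))   # race: pin the recorded order
    return sync_order, mem_order


trace = [
    ("main", "sync", "create worker"),   # 0
    ("worker", "sync", "begin"),         # 1
    ("main", "sync", "lock"),            # 2
    ("main", "write", "result"),         # 3
    ("main", "sync", "unlock"),          # 4
    ("worker", "sync", "lock"),          # 5
    ("worker", "write", "result"),       # 6
    ("worker", "sync", "unlock"),        # 7
    ("main", "read", "result"),          # 8  print without a join
]
print(hybrid_schedule(trace))
```

On this trace the two locked updates are ordered by the synchronization order, while the worker's update and main's unsynchronized read form the one racy pair whose order is added to the schedule.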
So once we get the hybrid schedule, we need to compute the preconditions
required for reusing this schedule. So how do we do that? One naive approach
is we can -- okay. So there are a bunch of challenges here. First, the
preconditions need to ensure that all these events are reached when we rerun
the program on future inputs. That ensures feasibility of these events, that the
events can be reached at runtime. And also for determinism, when we reuse
schedules, we want to make sure we don't get new data races, because if there
are races, execution results can be different.
So this slide actually shows an example of a race that may show up if we just
reuse this schedule, right? Let's say we enforce this schedule, but this argument,
argv[3], becomes true, right, so that this becomes 1. This flag is
1, so this particular check is satisfied. Therefore, we're going to run the true branch
of this statement. We're going to access result, right, which actually races
with other accesses.
There are new races coming up if we reuse the schedule on new inputs.
These are things we want to avoid as well. We want to avoid new races.
How do we do that? So one approach is we can look at the conditionals
recorded in the trace, and if a conditional depends on input, that means if the
input changes, the result of the conditional may change. We can grab
all these conditionals, right, and use them as the precondition. So
in the future, if the input satisfies these conditionals, right, we're guaranteed
that all threads will go down the same path as in the recorded execution.
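A minimal sketch of this naive approach, under stated assumptions: the branch structure below (a thread-spawning loop starting at 1, a per-byte processing loop, and a flag check) is only a guess at the shape of the example on the slide, and `record_preconditions` and `satisfies` are hypothetical names.

```python
def record_preconditions(args):
    """Run the assumed branch structure once and collect branch outcomes."""
    nthread, size, flag = args
    constraints = []

    def branch(label, predicate):
        # Record every input-dependent conditional together with its outcome.
        outcome = predicate(nthread, size, flag)
        constraints.append((label, predicate, outcome))
        return outcome

    i = 1
    while branch(f"{i} < nthread", lambda n, s, f, i=i: i < n):
        i += 1                      # thread-spawning loop
    j = 0
    while branch(f"{j} < size", lambda n, s, f, j=j: j < s):
        j += 1                      # data-processing loop
    branch("flag == 1", lambda n, s, f: f == 1)   # optional update
    return constraints


def satisfies(constraints, args):
    """A new input matches if every recorded branch has the same outcome."""
    return all(pred(*args) == outcome for _, pred, outcome in constraints)


recorded = record_preconditions((2, 2, 0))
for label, _, outcome in recorded:
    print(label, "->", outcome)
print(satisfies(recorded, (2, 2, 0)))
print(satisfies(recorded, (2, 2, 1)))
```

With the recorded input 2 2 0, the collected conditionals pin the thread count to 2, the data size to 2, and the flag to not 1, so an input that sets the flag is rejected and has to be recorded afresh.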
>>: [inaudible] concentrate.
>> Junfeng Yang: So input -- here we consider as input the arguments passed
on the command line, and also inputs read from the network and inputs
read from files.
>>: [inaudible] lifestyle and system load and -- although I'm getting like signal
altering execution and that -- [inaudible].
>> Junfeng Yang: Right. Right.
>>: Determinism.
>> Junfeng Yang: So the signal part we currently cannot handle. Getting
timestamps we currently do not handle, right. If the code gets a timestamp, gets back a
value, and that value affects operations, that's something that we do not handle.
The things that we handle are, for example, the return values of random, right? So if
the program calls random and gets back a value, we can just mark it as input and
do the same thing.
And for gettimeofday, we could have used the same approach. But the thing is,
the gettimeofday result is not an integer, so it cannot be checked by
our system. That's a limitation of our system.
>>: [inaudible] what you've done.
>> Junfeng Yang: Set the same random value.
>>: Is that a generator --
>> Junfeng Yang: If it has the same random seed, you get the same random
value. It could also be reading things from /dev/random, which is totally random
depending on the system setting.
So these are the constraints. If the input matches all these constraints, we go
down the same path. Therefore, we address all these challenges. And these
constraints actually can be further simplified to this. So these are the simplified
constraints. The problem with this naive approach is that it actually computes
overconstraining preconditions. For example, let's look at this size-equals-2
constraint, right. It says that in order to reuse this schedule, the hybrid schedule we
just computed, you have to make sure that the data size is 2, right? But
actually this schedule -- this synchronization schedule, if you look at the
pthread_create calls and all the memory accesses, you can actually reuse it for a lot
of different data sizes. Whether the data size is 2 doesn't actually matter. In some
sense this size-equals-2 constraint is overconstraining. It precludes a lot of the
opportunities where we can actually reuse the same schedule.
And one way to solve this problem would be to throw away the constraints that
do not matter, right? How do we actually figure out what constraint matters and
what constraints do not matter, right? That's one of the challenges when you
solve here. And turns out this problem is very hard to solve. Stop most of our
efforts in this project, because then we come up with some program analysis
techniques. And the first techniques to slice out operations that do not matter to
affect schedules.
The second technique is to improve the precision of the first technique. We
actually need to simplify the program toward a schedule. Therefore, analyze the
simplified program so that we get better precision. So the details are actually
described in our paper, and I'm also happy to discuss the techniques with you
guys off line. But it's just to complicate, to include in the talk.
And intuitively, here's the intuition: if you look at the computation loop, it
contains, at least, thread-local variables. So local data. For example, this data is
allocated locally. In some sense it's private data for this thread. It's on the private --
And this computation reads files, does the computation; but it doesn't touch
the data accessed by other threads. So it's local computation as well. These local
computations can basically be sliced out.
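A rough illustration of the slicing intuition above, not the analysis from the paper: the sketch assumes a recorded trace of operations and treats any memory location touched by only one thread as thread-local. The `Op` record, the labels, and `slice_trace` are hypothetical names; a real slicer would also follow data and control dependencies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    thread: int   # id of the thread that executed the operation
    kind: str     # "sync", "read", or "write"
    target: str   # lock name or memory location (hypothetical labels)

def slice_trace(trace):
    """Keep sync operations and accesses to locations touched by 2+ threads."""
    owners = {}
    for op in trace:
        if op.kind != "sync":
            owners.setdefault(op.target, set()).add(op.thread)
    shared = {loc for loc, threads in owners.items() if len(threads) > 1}
    return [op for op in trace if op.kind == "sync" or op.target in shared]

trace = [
    Op(0, "write", "buf0"), Op(1, "write", "buf1"),   # thread-private buffers
    Op(0, "sync", "lock"), Op(0, "write", "result"),
    Op(1, "sync", "lock"), Op(1, "write", "result"),  # shared result
]
kept = slice_trace(trace)
```

Here the two private-buffer writes are dropped, while the lock operations and the writes to the shared `result` stay in the slice.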
>>: I have a question.
>> Junfeng Yang: Okay.
>>: So what -- what makes a relaxation appropriate for you? Say if I have
relaxed spectrums you need to do something, how do you know that relaxation
is still okay? Does it mean that you know for any state in there, this particular
schedule that you computed is still feasible? Is that what you want?
>> Junfeng Yang: Still feasible, and does not introduce new races. Who can be --
>>: Not introduce new races.
>> Junfeng Yang: Right. Because remember there's this conditional update of
the results. So depending on the third argument. So if you run on a new input
and there's a new race coming up, right, the results may be nondeterministic
depending on how the race is resolved. Those are the two conditions. All the
events are feasible, and at the same time no new races come up. So there's
some past slicing work.
And those algorithms can be used to solve part of the problem, within a sequential
execution. Here we need to look at inter-thread dependencies. If I read a
particular variable here, that may be defined by another thread. There can be
def-use chains across threads. It can make things complicated, right? That's part of the
problem we solve here, make it [indiscernible] aware, right? Question?
>>: Do you do all the analysis statically?
>> Junfeng Yang: We do this analysis statically in the sense that you get the
program, you get the execution trace and you do the analysis.
It's offline. But it uses dynamic information because we have this dynamic
trace. It's a combined dynamic and static analysis.
>>: [inaudible] doing all the intuitions up front as part of the training phase. So
when the program is deployed all this stuff is not being done. This allocation
is not --
>> Junfeng Yang: You can imagine we could do this at runtime as well, and
update the schedule cache, right? Right now the analyzer runs for a long time,
right. So doing it online does not make a whole lot of sense because it's slow,
right. So the ideal model would be even after you deploy the system, it
can still collect schedules, and still run the analyzer offline instead of updating online.
>>: Symbolic execution at all, how would that help or not help?
>> Junfeng Yang: What?
>>: Symbolic execution, for --
>> Junfeng Yang: Symbolic execution?
>>: Thinking about what is effective, what condition --
>> Junfeng Yang: So basically once you slice the program, like once this
statement is sliced out, we get a smaller slice, right. We actually run symbolic
execution to get the constraints. So we look at what is input-dependent, track the
conditions on the inputs and collect constraints. That part is done by symbolic
execution. We leverage symbolic execution.
Okay. So once we slice things out, like all these statements that have no relevance
to the schedule; we slice them out. We get the preconditions; we separate them.
These are the final preconditions. This is the hybrid schedule we get from the
trace, and these are the preconditions we compute for reusing this schedule.
And if we run this program on the new input, let's say 202 arrays, we want two
threads, process the dataset, the dataset of 100 elements, and then the third
argument is three. And this input data, the arguments match the preconditions.
Therefore, we can reuse this schedule, enforce the schedule to get determinism.
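The reuse step can be sketched as a cache lookup. This is a minimal illustration under assumptions, not Peregrine's implementation: the argument names (`nthreads`, `size`, `flag`), the schedule encoding, and both functions are hypothetical. The point it shows is that the precondition constrains the thread count and the third argument but leaves the data size free.

```python
def make_cache():
    # Each entry pairs a precondition over the arguments with a schedule.
    # Assumed precondition: two threads and third argument equal to three;
    # the data size is deliberately unconstrained.
    return [
        (lambda a: a["nthreads"] == 2 and a["flag"] == 3,
         ["T0:lock", "T0:unlock", "T1:lock", "T1:unlock"]),
    ]

def lookup(cache, args):
    """Return the first memoized schedule whose precondition holds, else None."""
    for precondition, schedule in cache:
        if precondition(args):
            return schedule
    return None

cache = make_cache()
hit_small = lookup(cache, {"nthreads": 2, "size": 100, "flag": 3})
hit_large = lookup(cache, {"nthreads": 2, "size": 5000, "flag": 3})
miss = lookup(cache, {"nthreads": 4, "size": 100, "flag": 3})
```

Both data sizes map to the same schedule, while a run with four threads misses the cache and would need a newly recorded schedule.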
So there are a few benefits of this approach. First, we can actually
deterministically resolve these races. Also, the
preconditions we computed ensure there will be no new races coming up when we
reuse a schedule. So we get determinism. And also it's very efficient: you can see
we enforce the order to get determinism on only two accesses, and the major
computation part can still run in parallel.
And also this schedule can be reused across a bunch of different inputs. We
don't care about the data content at all, right, and don't care about data sizes.
We can give the same dataset, we can reuse the same schedule. So questions
so far?
>>: I had a question.
>> Junfeng Yang: Okay.
>>: [inaudible] how do you enforce the memory?
>> Junfeng Yang: Access? So we actually instrument the program. So when
there's an access here, we put a semaphore up operation here. Before this access
we put a semaphore down operation there, right. So that enforces the constraint. It's like --
>>: Dynamically asked for an instrument. I'm talking about what schedule to use.
>> Junfeng Yang: Issued by the program, that's right.
>>: But someone made this point that if you instrument programs, then there will be
reactions; even if the instrumentation does nothing, just a no-op, you automatically
start seeing five to ten times overhead, and how do you avoid that?
>> Junfeng Yang: If you do it for all the loads and stores, it can slow down even more. Here
we instrument only the specific accesses we know are involved in races; all the other
memory accesses run in parallel without instrumentation, because they are not
involved in any races. But for these two accesses, you get some small
slowdown there. But because we instrument only a few memory accesses, we get
away with smaller overhead.
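One way to picture the instrumentation described in this answer, assuming a semaphore-style signal/wait pair placed around just the two racing accesses. The sketch uses Python's standard `threading.Semaphore`; the function and variable names are illustrative, and the real system instruments compiled C programs rather than Python.

```python
import threading

def run_once():
    result = []
    edge = threading.Semaphore(0)    # one execution-order edge, initially closed

    def first():
        result.append("A")           # racing access that the schedule orders first
        edge.release()               # signal placed right after the access

    def second():
        edge.acquire()               # wait placed right before the access
        result.append("B")           # racing access that the schedule orders second

    # Start the threads in the "wrong" order to show the edge still wins.
    threads = [threading.Thread(target=second), threading.Thread(target=first)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

runs = [run_once() for _ in range(50)]
```

Every run produces the same order of the two accesses, and nothing else in either thread is instrumented.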
>>: So there's one other area. What you could do is you essentially use one of
the existing data race detectors.
>> Junfeng Yang: Exactly.
>>: Those accesses that [inaudible] in data races.
>> Junfeng Yang: Right.
>>: They need some of the affect to, even though the result that particular
program point, let's say if it's empty, because you're instrumenting you're going to
pay the cost for instrumentation.
>>: That's a good observation.
>>: I think that might be accessing these.
>> Junfeng Yang: Actually, you guys have good aesthetic, easy [indiscernible].
So this is an idea we should pursue in the future. We can statically do
race detection, take those points and schedule them together with the synchronization
operations, and therefore get deterministic multithreading that way as well. But we
detect with recent programs, and it has a read of how to function. For each
function, you get maybe a hundred reports. Do you guys have like good static
race detectors we can leverage, we can use?
>>: [inaudible] automatically come to me I think. But when there's annotation --
>> Junfeng Yang: Notice it with annotation.
>>: And, yes, and things like that.
>> Junfeng Yang: I see. Maybe you can be clever. Interesting using such a
race detector right now. Another transfer we're looking at is not just for C
programs, for Java programs, there's a race detector developed by Aicon
[phonetic] and his students, I think. They've got good precision, and I think
they're second [indiscernible], right.
We're looking at whether or not we can actually do this for Java, with that
powerful [inaudible]. We can get the SQL work or C plus work that would be a
great thing to do.
>>: Why do you need to also order the locks and unlocks synchronization, can
you just do memory; you'll be doing pretty much the same thing, right?
>> Junfeng Yang: So if you just do memory, you can get determinism. We
know that, right? First, you enforce the deterministic access order for shared memory.
At the same time, you pay overhead, because you have
instrumentation overhead, instrumenting loads and stores. And the wait overhead:
you cannot do a memory access if all the ones prior to the access are not done,
right. So there's a wait overhead there as well, right, and the overhead can be
up to 10 times, 11 times based on previous papers.
>>: But this lock and unlock --
>> Junfeng Yang: This one? Okay?
>>: It has the same overhead because after, if that lock comes first, you have to
wait until the unlock is done first. So if you order that result between the lock and
unlock on the left, on the right, then you won't need to unlock and lock.
>> Junfeng Yang: Okay. So, first of all, there are two questions here. The first
question is the overhead of instrumenting and controlling all the synchronization
operations. And the second is, if we do this, enforce memory access orders, do
we get roughly the same overhead. For the first question, intercepting
synchronization operations tends to be cheaper, right, because
these are already library calls. Easily done, right. Whereas for load
and store operations, for each load and store in optimized code
we would do a function call.
>>: But you're already doing that, right, you're already doing that for the first one,
and then --
>> Junfeng Yang: Right. For the first one we do that, but for the future ones we
don't do that. We can avoid the overhead for the future ones; the overhead is as
much as the existing tools that enforce memory access orders.
Really, those do not give you a lot of overhead. Second, for these
two particular accesses, if we enforce this constraint, right, you probably will
get roughly the same overhead as enforcing this constraint, right? But at the
same time there are lots of memory accesses here as well, right, and how do you know
whether or not you want to enforce orders for those memory accesses?
If you do enforce orders, you get slowdown, right, if you do not --
>>: You need between that is opened, it was all down there. Can't you figure the
rest of them out the same way?
>> Junfeng Yang: So, we actually -- here we know there are no races between
these two pieces of code, therefore we can slice them out, right? And are you
suggesting that for this particular access, this result access, we
would just add this instead of adding this edge, right? But these two accesses do not
race, right? If you're using your algorithm, they do not race, therefore we will not
add an edge here, right. Therefore we can get nondeterministic synchronization,
>>: But you're already adding the lock and unlock edge, which means that they
would -- there is a race if that didn't exist, right?
>> Junfeng Yang: We do this, right, actually to resolve the race on the lock. So
this guy grabs one lock. This guy grabs the same lock. They can grab the lock
in different orders. Therefore you can get different results. We add this edge to say,
okay, this one always gets the lock first, right. So therefore you do not have this
nondeterministic lock contention, right? It doesn't matter -- it doesn't involve these
two accesses.
>>: I agree, but what I'm trying to say is that what you're doing there is
fundamentally basically trying to have an edge between the memory location of
the lock itself, right? Kind of like -- so if you didn't have lock and unlock and you
had custom synchronization, add what -- you had to do that you had to do have
the lock memory and have one of these arrows continue with them.
>> Junfeng Yang: Maybe a better way to look at this would be the source
thing. The nondeterministic lock operations translate to nondeterministic access
orders for memory accesses. Maybe a better way to look at this would be: here it
shows an example with just one access, right? You can have a large region with lots
of accesses, right, and you can ignore synchronization operations and enforce orders
for memory accesses. You first lock edges and therefore get better performance.
But right now we just look at synchronization, we know the synchronization, and we
enforce this synchronization order, which automatically covers all the memory accesses
in the critical region. Therefore it's going to be more efficient.
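The point about ordering lock acquisitions rather than individual accesses can be sketched like this. It is an illustrative turn-based scheme built on Python's `threading.Condition`, not the mechanism the system actually uses; `SyncSchedule`, `run`, and the thread ids are hypothetical. Only the acquisitions are ordered, and every access inside the critical region inherits that order without being instrumented.

```python
import threading

class SyncSchedule:
    """Enforces a fixed total order of lock acquisitions, given as thread ids."""
    def __init__(self, order):
        self.order = list(order)
        self.pos = 0
        self.cv = threading.Condition()

    def wait_turn(self, tid):
        with self.cv:
            self.cv.wait_for(
                lambda: self.pos < len(self.order) and self.order[self.pos] == tid)

    def done(self):
        with self.cv:
            self.pos += 1
            self.cv.notify_all()

def run(order):
    sched, lock, log = SyncSchedule(order), threading.Lock(), []

    def worker(tid):
        for _ in range(order.count(tid)):
            sched.wait_turn(tid)     # block until the schedule says it is our turn
            lock.acquire()
            sched.done()             # let the next acquirer start waiting on the lock
            log.append(tid)          # uninstrumented access inside the critical region
            lock.release()

    threads = [threading.Thread(target=worker, args=(t,)) for t in sorted(set(order))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

logs = [run([1, 0, 0, 1]) for _ in range(20)]
```

Each run logs the critical sections in exactly the scheduled order, even though the accesses themselves carry no extra checks.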
So our system actually has limitations, as some folks just pointed out. First of all, the
assumption is that we can actually reuse schedules, which may not always be possible.
For example, you may actually want nondeterminism in your simulation to get those results,
so those workloads or programs will not fit within our approach. And also,
enforcing a deterministic order on operations, memory accesses or
synchronization operations, sometimes introduces delays. If your
program is latency sensitive, that's not a good workload for us, right?
And also we need to keep track of constraints in order to figure out the
preconditions, and our current [inaudible] constraints are integers. If you have
floating point operations, and the schedule actually depends on floating point values,
we cannot handle that. It also requires source code or the LLVM compiler
intermediate representation, bitcode. That's another limitation.
And currently we do not handle nondeterminism from gettimeofday or
malloc: you malloc something, you get a memory address, and you do
some nondeterministic synchronization based on the address returned by
malloc. That we cannot handle. So actually this probably can be handled using
some existing techniques, right.
For example, there are heap implementations that actually are
deterministic, right?
And also it works with only a single process. This doesn't work if you have multiple
processes sending messages to each other, although we're planning on making this
approach work for MPI programs as well, or programs that communicate by
sending and receiving messages.
>>: You also have the [inaudible].
>> Junfeng Yang: Right now, no. There's no determinism for system calls. You
can have races inside the kernel, possibly.
So there are lots of applications we can build on top of this idea, right. At the basic
level you can simplify program understanding, testing, and debugging, right?
Enforce the same schedule, test the schedule at the production. And you can also
use this for deterministic replication of multithreaded programs. When one
program runs and has some nondeterminism, instead of logging and sending the
nondeterminism, if the program hits the schedule cache, you use the same
schedule [inaudible] and therefore it's deterministic. You can report bugs without
revealing the full inputs, which may contain credit card numbers. And also once
you have the schedule, right, you know what the program is going to do in the
future, therefore you can do some scheduling tricks there as well, right? And
actually a very interesting application we're currently exploring is that you can use this
idea to build precise static analyzers for multithreaded programs. I'll
briefly talk about this idea. When you analyze multithreaded programs, we can
use dynamic analysis, which analyzes the schedules as they occur. It's
basically unsound, because in the next run of the program we may actually use a
schedule that's not analyzed. Or we can do static analysis of all schedules. But
that requires a lot of approximations, and sometimes we get imprecise results;
like with race detection, you can get tons of false positives.
With Peregrine, you can actually address these issues to some
degree, right. We can analyze a program only with respect to a small set of
schedules to be enforced at runtime, therefore we get precision, right. We don't
assume all possible schedules, just a small set of schedules. And also we can
enforce these analyzed schedules at runtime using Peregrine. Therefore we can
guarantee soundness, at least for these schedules, the schedules that we analyzed. And
if there are new schedules required at runtime, because you get a new input that
cannot be covered by these schedules, right, we need to learn this new schedule.
And for these rare cases we can actually use some expensive techniques, such as
simulation of the program, to check for errors, for example.
And another assumption, or inference, is that we can frequently reuse
schedules, so most of the inputs, 90 percent of inputs, can hit the schedule cache.
We can actually guarantee precision and soundness for 90 percent of the inputs,
which I think is a good contribution.
And, by the way, this is some ongoing work, right. So we have this idea. We
have this preliminary study and get some results. This is not published yet.
Okay. So one challenge here is how do you actually analyze a program toward
just a small set of schedules. When you do static analysis, you don't have this
notion of schedules. A naive approach is to modify every static analysis to say this
is the schedule, right, take that schedule into account and the program, right. But
that's kind of strenuous, because you have to modify a lot of analyses, and also
sometimes when we build a tool we pull a bunch of analyses together, and if
there's imprecision in one of the analyses which does not consider the schedule,
the imprecision may propagate to other analyses, right?
There are issues here. You know this is the naive solution I just talked about,
modify the analysis, which is not very practical. So the solution is actually taking
this program, let's take this program. Let's take the schedule, right? We can
specialize the program toward the schedule, so at runtime when we run the specialized
program it is going to generate executions matching the schedule. So you
transform the program to get a simpler program.
And this transformation, this specialization process, involves specializing the control
flow plus the data flow of the program. For example, look at data flow. We only
look at [inaudible] def-use chains allowed by the synchronization schedule, the
schedule we enforce, and we can define use here, right? You can have other loads
and stores on the same memory address, but if they are precluded, if the data cannot
flow when you enforce certain schedules, you can ignore those def-use chains.
And once we get this specialized program, which is a lot simpler than the original
program and considers the schedule as well, we can apply stock analysis to the
specialized program.
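The data-flow half of the specialization can be pictured with a small filter over candidate def-use pairs. This is a simplified sketch under assumptions: each store and load carries a hypothetical label and address, `position` gives where the enforced schedule places it, and a pair survives only if the store is ordered before the load. Intervening stores that would overwrite the value are ignored here.

```python
def feasible_def_use(stores, loads, position):
    """Keep (store, load) pairs on the same address that the schedule allows."""
    pairs = []
    for s_label, s_addr in stores:
        for l_label, l_addr in loads:
            if s_addr == l_addr and position[s_label] < position[l_label]:
                pairs.append((s_label, l_label))
    return pairs

stores = [("s1", "x"), ("s2", "x")]
loads = [("l1", "x")]
# Assumed enforced order: s1 runs, then l1, then s2.
position = {"s1": 0, "l1": 1, "s2": 2}
pairs = feasible_def_use(stores, loads, position)
```

Without the schedule both stores could reach the load; with it, only `s1` can, so the chain from `s2` is dropped.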
>>: Do you think of the specialized program as an annotated program? A program
with annotations, or do you think that some statements get dropped or sliced
>> Junfeng Yang: Some statements get dropped. Some loops get unrolled
and some variables become constant. For example, if the
schedule dictates two threads and the program uses a loop to create threads,
we can say, okay, this loop bound must be two, right, because the schedule only
contains two threads, and we can propagate this value to all the other places where
this value is referred. So you can read it as a simplified program. Not just slicing
out statements.
>>: What you're doing is really optimization. You're optimizing program, rewriting
the codes to make it run faster.
>> Junfeng Yang: [inaudible] about that part. We're sure about the simplified
>>: You're making it run faster. You said that in the examples.
>>: And the same thing.
>> Junfeng Yang: What?
>>: This part is --
>> Junfeng Yang: This part is an application, technique. So you can take the
program, do simplification, remove stuff. And it may run faster but we do not
have experimental data to back it up. We believe we can do optimization.
>>: That sounds like an interesting --
>> Junfeng Yang: We'll definitely pursue that direction. But right now our result
is mostly on the precision part. So once you throw away a lot of crap and specialize
the program based on the schedule you can do really precise analysis. For
example, you get automatic [inaudible] because the schedule says two threads,
we're going to clone it two times. We're going to get automatic thread sensitivity.
>>: How do you bake the schedule into the program?
>> Junfeng Yang: Okay. Let's look at the control flow part. The synchronization
operations. And look at the control flow go from [inaudible] output to another as
strengthen the program as to new statements corresponding to the simultaneous
events become control equivalent, right?
And the other paths, bypassing this [inaudible], get chopped off. Because if you
ever go there you won't reach the second [inaudible] operation.
And if there's a loop with one
synchronization operation, and in my schedule I see this operation appear two
times, I know the loop must run twice, right? And I can actually unroll
the loop. That's how we specialize the control flow, right?
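The loop-unrolling step described here can be sketched as counting how often a loop's synchronization operation appears in the schedule. The sketch assumes each loop contains exactly one labeled synchronization operation and that the label occurs nowhere else; the labels, `trip_counts`, and `unroll` are hypothetical.

```python
from collections import Counter

def trip_counts(schedule, loop_sync_op):
    """Map each loop to how many times its sync operation occurs in the schedule."""
    seen = Counter(schedule)
    return {loop: seen[op] for loop, op in loop_sync_op.items()}

def unroll(body, trips):
    """Replicate the loop body once per iteration, tagging each copy."""
    return [(i, stmt) for i in range(trips) for stmt in body]

schedule = ["create", "create", "lock", "unlock", "lock", "unlock"]
counts = trip_counts(schedule, {"spawn_loop": "create"})
unrolled = unroll(["create(worker)"], counts["spawn_loop"])
```

With two `create` events in the schedule, the thread-spawning loop is known to run twice, so it can be replaced by two straight-line copies of its body.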
>>: So it's funny to me because what you're doing really is the techniques we do
for optimization. Like a loop and so on and so forth. But your goal is simplifying
the analysis.
>> Junfeng Yang: Right. Right. To get more precision. Question?
>>: Can I ask it? The specialized programs still will have threads, right?
>> Junfeng Yang: Yes. Will still have threads, right.
>>: And does it -- is the semantics of the thread -- usually when you have a
multithreaded program, the semantics of the multithreaded program is that threads can
context switch, statically at least, at arbitrary places?
>> Junfeng Yang: Right.
>>: The specialized program, does it continue to have the same semantics.
>> Junfeng Yang: Yes. So okay, we also need the dynamic part. Maybe this is
what you're getting at. The synchronization operations in my specialized
program, these synchronization operations map one-to-one to my schedule.
And at runtime I need some technique to enforce the order. That's where this -- why I say that
we actually need to enforce schedules using Peregrine, right? Does that -- is
that what you're getting at?
>>: Yes. That is what I'm getting at. Because then the static analysis needs to
be [inaudible] because if you want static analysis to focus on those schedules
and the static analysis needs to be aware of these other edges, right.
>> Junfeng Yang: Sure, sure. This is how actually we sort of get around the
problem. This data flow part, we're going to -- we actually build this alias
analysis, right, that considers where data flows, based on the schedule. And then
you have another analysis which wants to query alias analysis, right? It just queries our
alias analysis. And therefore, since this alias analysis is made schedule aware,
right, if your analysis depends on the alias analysis, right, your analysis does
not have to be made schedule aware.
>>: I see.
>>: Okay. So you're saying that there's some basic value flow or
data flow analysis that you will provide [inaudible] that holds our information
there, and you just query that as a black box, analysis that as a query black box.
>> Junfeng Yang: We have some initial results. For example, we looked at alias
analysis: if we do not consider the schedule, we get this many aliases; with the
schedule, 99 percent actually get pruned. Precision is pretty -- the increase in
precision is pretty big. And we haven't built a race detector based on this
technique yet, but by running the analysis and looking at the results, we
actually detected, I think, at least five or seven real races in programs
analyzed by a lot of previous systems. You see a lot of papers on these programs.
But they did not report the races we detected; they were previously unknown.
I think I'm very excited about this direction, that we're actually pursuing.
So how much time do I have? Good? I guess I kind of drop off, right? So you
guys have seen the summary of results. I'll skip the evaluation. You want the
evaluation? Okay.
>>: I wanted --
>> Junfeng Yang: Might want to --
>>: [inaudible].
>> Junfeng Yang: Overhead?
>>: One of the things I'm worried about, not just [inaudible] but in general the
whole deterministic multithreading: you compute the sync schedule, let's say you
run your program on a four-core machine and then you find a schedule, then you
try to run the same program on the same input on an N-core machine, and you
try to force the same synchronization schedule. I would like to know what the
overheads are.
>>: Let me ask you the more general question: programs are nondeterministic for a
reason, because then they can respond to environment changes, presence of
more cores, suddenly like four cores become available because another process
finished, et cetera. They're nondeterministic for a reason: so they can use all that
computing capability. And so if you're restricting it by adding some determinism,
I would expect some performance loss.
>> Junfeng Yang: Right. There is some overhead with determinism, particularly
with the worst-case workload that you mentioned: you get the schedule on a
four-core machine, then you run it on an eight-core machine. The overhead would be
large, because on the four-core machine it loses switches; but on the other it probably
doesn't map really well to the machine. So for the scientific computation [inaudible],
normally what they do is, if you have four cores, I'm just going to use four threads;
eight cores, I'm going to use eight threads. So this
number is actually part of your preconditions, right? It's part of your schedule, right.
So on eight cores, for some scientific [inaudible], we
automatically find and use the efficient schedule in that sense.
>>: Those programs are static, because the number of processors, the
number of cores, is part of the input.
>> Junfeng Yang: Right. Part of the input.
>>: If you look at more dynamic scheduling algorithm, past library, what they're
doing is running Apache, the same time to be happening Internet Explorer.
>> Junfeng Yang: Right.
>>: And then Internet Explorer is using two cores but Apache is using six cores, and
when Internet Explorer finishes, Apache grows to use those two cores. So there's
a lot of dynamic [inaudible] in the columns which are designed precisely to
make use of these additional computation [inaudible], so I'd like to know what --
>>: My impression is that's always been the weak point of the existing
multithreaded story, and that thing has been swept under the rug from day one.
>>: I want to get -- since -- you know --
>>: Experiments like what did you find?
>> Junfeng Yang: Our experiments, these are actually -- those are mostly scientific programs,
and these programs that's how programs randomly run full course and this
Apache basically also uses Apache to use other cores. So we do not always --
[inaudible] use that. So my impression is that if you do the experiments, you run
a bunch of competing processes together and you measure performance for some
of the processes within this system, you may get higher overhead. But I think I
can see two ways to address this problem. I do not know whether these ideas
will work. So one idea is if the workload is actually predetermined like this kind of
computations, Markov processes the processes for number of threads have them
together have them as a bigger submitting execution framework, and there you
can capture constraints, use this approach. That's one way, right? Another way
to address this would be: we can have not just one schedule for many inputs, we
can have a few schedules for these many inputs at runtime, right, and the
workload actually becomes a factor for us to choose what schedule to use,
right, and that can get efficiency. So in some sense do we want one-to-one
determinism? Probably not. If you change the input, you get really different
schedules. Do we want many-to-one? Maybe, right. But we can also have
many-to-N schedules, a small number of schedules to choose from at runtime to get
performance. Those may be the way to go. So I haven't tried these ideas yet.
Right. Do you guys want to see more results or? These are the overhead.
Some of them like we get speedup, because they use barriers [inaudible] which
can be avoided by us whether it slips, and some of them get bigger overhead.
>>: What's the overhead if you don't -- if you don't get [inaudible].
>> Junfeng Yang: So that's the recording overhead. Our current recorder is like
[inaudible] x-ray. That's pretty slow because our recorder is not very optimized.
So there are lots of papers talking about how to do efficient recording; we just
haven't got the time to implement that yet. So it can be really large. If you look
at the papers, previous papers like idea developed by MSR guys, it's like 10
percent recorded, the full execution stream, load and store instructions.
>>: So by Apache is it the old Apache server?
>> Junfeng Yang: Co-op Apache server, core Apache server. If you do PHP, we
haven't tried, but I think PHP won't matter because single thread to elaborate,
interpret some PHP scripts and go back, single thread go back with schedule
within PHP module.
>>: I'm a little bit troubled because you could just forget the whole story about
determinism and here's an automatic method speed up programs. And it works
fully automatically and this number of programs it could actually get better
performance end of story. No determinism needed.
>> Junfeng Yang: So maybe that's a better way to spin the work, right? So right
now we started this work by focusing on reliability. Avoid bugs. [inaudible] we
remember schedules and avoid bugs. But right now I think the static analysis
framework is what we're really excited about, and also junior one of my fantastic
students is working on that, right? I think down the line we want to do the optimization
story. I think that will work well as well.
>>: Crazy girl if I look at -- so improvement --
>> Junfeng Yang: Maybe the program will not reason in a smart way because
it's heavy simulation rights. It's arguable --
>>: [inaudible].
>>: You don't have to be that smart. [laughter].
>>: Face the program particularly aren't skillfully written. And so sort of improves
them by figuring out what is a better schedule than --
>> Junfeng Yang: Right. Right.
>>: And the wonderful thing about working on optimization you don't ever
guarantee anything, right?
>> Junfeng Yang: I think it has a guaranteed semantics. It's correct.
>>: And you just have to make sure you don't change semantics or something
like that.
>> Junfeng Yang: Right. As you were saying --
>>: There's some race there. Let there be. Why wasn't it a program.
>> Junfeng Yang: That's actually maybe a better direction, good direction for us
to go in.
>>: But the original sync schedule -- [inaudible] the optimizations you're getting
is sync schedule plus your optimization, right? And the sync schedule plus here.
>> Junfeng Yang: The sync schedule gives us most optimizations, right.
>>: So but Kindle as implemented does not give you the speedup, you guys paid
some [inaudible] with the [inaudible].
>> Junfeng Yang: Right. Optimization. Here the optimization mostly comes
from the synchronization optimization. So right now we do this simplification, we
get a simplified program, and on the simplified program we get results.
Actually, when we run programs we still run the original program. But we
haven't tried, I think, as you're suggesting, to actually run the simplified
program, and we can also ignore races, the analysis part that deals with races,
because as [inaudible] suggested the programmer has resist the programmer --
you do it by races, you do it by much better performance. Great direction.
>>: So do you know what percentage of memory access is best incuration and
[inaudible] like when you are [inaudible].
>> Junfeng Yang: Actually I have this number. Okay so these are the races,
basically. So these are the programs that contain races, right, the two real ones
have races. These are the [inaudible] benchmarks; they contain races as
well. These are from the PARSEC benchmark. We run the race, the number of
memory accesses can be nearly as right. [inaudible] or even more. I don't know
the exact number. It's in the paper. And [inaudible] orders that we were solving,
look at the races, brings us back to the simulation schedule. They're pretty
small, the number of races here. These contend partially once you're focusing
other, no races are detected and for cluster similar things happen. For the order
no races detected.
It should be like most programs you run do not have lots of races. That's just the
general intuition.
Any other questions? Okay. I guess I'll just conclude.
So I talked about schedule memoization: memoize schedules and reuse them on
future inputs, so we can reuse past schedules and avoid buggy schedules. I
talked about this idea of hybrid schedules, which basically combines the benefits of sync
schedules and memory access schedules, and as Madan suggested, we can actually implement
this idea using a static race detector, which we're currently looking at. And I also talked
about this Peregrine system, which computes schedules by taking a trace and relaxing it into
hybrid schedules. It's deterministic, it makes all these programs deterministic, it's
efficient, with low overhead, and it can frequently reuse schedules. Okay.
That's all.