>> Yuxiong He: So good afternoon, everyone. Welcome to the talk.
It's my great pleasure to introduce Borzoo Bonakdarpou. I hope I pronounced it
So he's a research assistant professor at the University of Waterloo. He's broadly
interested in the correctness of software development using formal methods,
especially in the area of distributed and embedded realtime systems. So today
he's going to share with us his work on model-based code generation and
debugging of concurrent programs.
>> Borzoo Bonakdarpou: Okay. Thank you very much for the introduction and
your kind invitation.
So as Yuxiong said, you know, I have been working on not necessarily software
but system correctness since I became a graduate student. This talk will focus
on model-based code generation, which basically focuses on correctness by
So let me give you an overview of the long-term vision that I have, at least for the
next few years.
So I'm generally interested in program correctness, and I'm basically following
two lines of research. One is offline or in a precompiled time and one is online.
So by offline I mean correctness by verification, correctness by construction. I've
done some work on compositional verification, and correct by construction is what
I'm more interested in, so at what I call a micro-level I'm interested in model
repair and model synthesis where given an existing model and a set of
properties, the problem is how we can fix bugs and fix errors in that model.
But what I call a more macro level is model-based code generation where
there exists code, and we want to transform it to different types of workable
platforms. And recently I've been working on tracing and runtime verification.
Because of all the limitations that we all know that state-level synthesis and
model checking have, there is now an emerging trend of using runtime
verification and tracing for debugging purposes.
So this has been my fascination more recently. I gave a talk here at MSR in
2008, early 2008, on model repair, and today I'm going to talk about model-based
code generation and a little bit about tracing and debugging.
So the first part is on model-based construction of distributed programs, and this
is a joint work with Maria Bozga, Mohammed Jaber, Jean Quilbeuf and Joseph
Sifakis, so the work goes back to my post doc time at Verimag, and I'm still
collaborating with them, so that's been fun.
So the idea here is we want to start from the high-level model and generate code
from that high-level model, but the high-level model may be associated with code
as well. So let me describe in more detail.
Why do we do model-based software development? I think you all -- you're all
familiar with the reasons. Because models abstract implementation details.
Analysis of models is easier. We can do levels of verification, model checking on
the model, whereas that's more difficult to do at the implementation level. We can do
testing, simulation, and all sorts of procedures that ensure model correctness,
and then we can transform the model into actual implementation, but the fact is
during this transformation when it is done manually, the bugs are introduced by
programmers because it's a human process, it's a manual process, and bugs are
inevitably introduced.
So in distributed systems I would argue that the problem is a little bit more
amplified when we develop software because of the inherent complex structure
of distributed systems, because of concurrency, non-determinism, race
conditions, low atomicity, the issue of faults, so basically in distributed systems
the problem is a little bit more difficult, and the gap between developing models
and implementing and deploying software is, I would argue, is a little bit wider.
So when I joined Verimag, it was in the early stages of a framework which is
called BIP. I would like to emphasize that I'm not religious to this framework in
any way, but it's just one way to deal with the problem.
My slides are, I think, cut a little bit from the bottom and the top, but that should
be fine.
So the idea is that we start from a high-level model, and I will say what I mean by
high-level model, and in the high-level model we can use verification tools such
as the D-Finder tool that's been developed at Verimag for finding [inaudible]
states in a compositional fashion, and then there are some performance, for
instance, criteria involved. Then we do some source-to-source transformations,
for instance, to get more close to distributed computing reality, you know, for
instance, point-to-point message passing and things like that, and then we
generate code.
So this is the part that I'm going to talk about today from the system model BIP to
source-to-source transformation and then eventually code generation.
So the philosophy of this framework is that we want to first have different
programming paradigms such as data flow, event driven, synchronous, and so
on, and we want to follow a principle of constructivity, which means that if
low-level components satisfy some set of properties, then by composing them,
we want to preserve those properties.
Of course, this is a hard problem. We cannot solve it in all cases. But for some
safety and some [inaudible] properties that can be done.
And then from this high-level model we want to do model transformation and
then get implementation that is correct by construction by preserving functional
properties and some extra-functional properties that I will talk about.
So this is sort of the outline of the first part of the talk. Let me first start by
describing the global state semantics of BIP.
So basically a model has three layers. The first layer is the layer where we
describe the behavior of the model, so these are the set of components, and
each component is basically a state transition system or a Petri net.
Each transition in each component can be associated with a C++ function.
Then there is a level of interactions which is basically a set of [inaudible]
primitives such as rendezvous, broadcast, and so on, and there can be priorities
for scheduling purposes, which basically means if two interactions are enabled at
the same time, we can give priority to one.
There is composition going on. So if you have a composite component here and
another one right here, then we can compose another set of interactions and
priorities and we can get basically a new component that can be composed with
other components later.
Let me show a very simple example. We have four components here. We
intend to have the first component as a sender, the rest of them are receivers,
there's only two transitions -- sorry, one transition and two states. This transition
is labeled by s and it is associated with a port s, r1, r2, and r3. We
compose them by the interaction sr1r2r3. That means when all of these ports
are enabled, then this interaction can take place, and each component takes a
transition from the first state to the second state. So when all of these
components are at their initial state on top, this interaction is enabled because s
is enabled here, r1 is enabled here, and r2 is enabled here, r3 here, and
therefore the transitions can be taken locally and the components will reach the
second state.
Let's imagine we don't have any priorities. We actually have just one interaction
so priorities don't -- a priority doesn't make sense anyway. And this would be a
rendezvous interaction. And that's how we show it graphically.
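
The rendezvous semantics described above can be sketched in a few lines of Python. This is only an illustration, not BIP syntax: the `Component` class and the `rendezvous` function are hypothetical names, and each component is assumed to have exactly one transition from its initial state to its second state.

```python
# Hypothetical sketch: each component is a two-state transition system
# with a single port. These names are illustrative and not part of BIP.
class Component:
    def __init__(self, port):
        self.port = port
        self.state = 0  # 0 = initial state, 1 = second state

    def enabled_ports(self):
        # The only transition is enabled while the component is in state 0.
        return {self.port} if self.state == 0 else set()

    def fire(self, port):
        assert port in self.enabled_ports()
        self.state = 1


def rendezvous(components, interaction):
    """Fire the interaction only if every one of its ports is enabled."""
    by_port = {c.port: c for c in components}
    if all(p in by_port[p].enabled_ports() for p in interaction):
        for p in interaction:
            by_port[p].fire(p)
        return True
    return False


components = [Component(p) for p in ("s", "r1", "r2", "r3")]
first = rendezvous(components, ("s", "r1", "r2", "r3"))   # all ports enabled
second = rendezvous(components, ("s", "r1", "r2", "r3"))  # nothing enabled any more
states = [c.state for c in components]
```

The second call fails because, after the first rendezvous, no component is in its initial state, so none of the four ports is enabled.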
Now, the same components but with a different interaction. So this interaction
says that if s is enabled or s and r1 or s and r2 and all the way to the end, which
is the last one is s and r1 and r2 and r3, then the local components can make
their move based on the ports enabled here.
So this is basically a broadcast, and it gives priority to the larger interaction. So
when all of these components are in their initial state, the maximal interaction is
going to be sr1r2r3. So that is basically a broadcast, and that's how we show a
broadcast graphically.
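
As a rough sketch of the broadcast case, assuming the same port names as in the example, the maximal enabled interaction can be computed directly from the set of enabled ports. The function name `broadcast` is hypothetical; the priority for larger interactions is modeled simply by always including every enabled receiver.

```python
def broadcast(enabled, sender="s", receivers=("r1", "r2", "r3")):
    """Return the maximal enabled interaction, or None if the sender is not enabled.

    The sender is mandatory; each receiver joins only if its port is enabled.
    Picking all enabled receivers models the priority given to the larger
    interaction.
    """
    if sender not in enabled:
        return None
    return (sender,) + tuple(r for r in receivers if r in enabled)


all_ready = broadcast({"s", "r1", "r2", "r3"})
one_ready = broadcast({"s", "r2"})
no_sender = broadcast({"r1", "r3"})
```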
So you can come up with more interesting types of interactions. For instance,
the upper one is an atomic broadcast, which means that when a sends a
message, then either b and c both receive it or none of them.
And the bottom one is a causal chain, so when a is enabled, then the interaction
would causally go downwards here.
So I was not involved in this part of the work, but it has been shown that this
framework gives expressiveness more than any existing [inaudible] algebra. So
that was the basic semantics in very simple words.
So when I joined the effort, we wanted to take such models and then generate
distributed code, and by distributed code I mean that each component that I
showed becomes one standalone application and the interaction basically
becomes a synchronization primitive, and then the question is how do we
generate distributed code out of multi-party interactions. If it is only binary
interactions, that is easy because most existing networks provide primitives
for point-to-point message passing, but when they have multi-party interactions,
the problem becomes a little bit more difficult.
So I started the effort by doing a case study. And that was distributed reset,
which was a very well-known algorithm in self-stabilizing systems due to Arora
and Gouda. I'm going to describe the algorithm very quickly and a high-level
The algorithm has three layers. There's an application layer, and the goal is we
want to achieve a distributed reset. A process initiates a reset and then the
whole system is supposed to get reset.
There's an application layer, there's a wave layer which performs diffusing
computation and there's a tree layer which maintains a spanning tree throughout
the network.
Let me just focus on the middle layer here, the wave layer, which is a diffusing
computation. The algorithm is very simple. A node such as, let's say, 4 requests
a reset, a global reset. It sends a message to its parent, 1. 1 sends the
message to its parent, 0, which is the root. Then when the root receives this
message, it resets its own state, then it bounces back, sends a message to its
children, 1 and 2. The same thing goes on until we reach the leaves. Then when
all the leaf children of a node are reset, then they start the completion wave.
For instance, 7 is the only child of 4. It sends a message to 4. The same thing
between 3 and 1 and the same thing between 8 and 6 and the same thing
between 5 and 2. So in the first wave of completion, 3, 7, and 8, which are the
leaves, are complete. Then their parents. And now since these two are
complete, 2 and 1 become complete and then all the way back to the root and
the diffusing computation is complete.
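
The three phases of the wave (request going up, reset going down, completion coming back up) can be sketched sequentially in Python. The tree below is reconstructed from the node numbers mentioned in the talk; the parent of node 6 is not stated and is assumed here to be 2. The function names are hypothetical, and the real algorithm runs concurrently with messages rather than as a sequential traversal.

```python
# Parent pointers of the spanning tree. Assumption: node 6 hangs off node 2.
PARENT = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2, 7: 4, 8: 6}


def children(node):
    return sorted(c for c, p in PARENT.items() if p == node)


def request_path(node):
    """Nodes visited as the reset request climbs from `node` to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def reset_wave(root=0):
    """Nodes in the order the reset reaches them, level by level."""
    order, frontier = [], [root]
    while frontier:
        order.extend(frontier)
        frontier = [c for n in frontier for c in children(n)]
    return order


def completion_order(root=0):
    """One valid completion order: a node completes after all its children."""
    order = []

    def visit(node):
        for child in children(node):
            visit(child)
        order.append(node)

    visit(root)
    return order


path = request_path(4)
resets = reset_wave()
done = completion_order()
```

Any order in which every node completes after all of its children would be acceptable; the post-order traversal is just one such order.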
It's a very simple algorithm. A self-stabilizing version of this algorithm is shown
by these four or five -- one, two, three, four -- five innocent-looking guarded
commands. By self-stabilizing I mean that starting from any state -- that's
Dijkstra's notion of self-stabilization -- by starting from any state -- it doesn't have
to be reachable -- any state of the system, the diffusing computation is complete
within a finite number of steps. So that means that it takes care of faults as well.
So the algorithm was very easy. This is now a little bit more complex. And now
I'm going to develop a model for that in our framework. So I introduce three
states: normal, init, and reset. We go from normal to init. This is only for one
component. If a node receives a request it resets its state when its parent says
to do so and then it completes. There are some self-loops going on here to
make sure that multiple sessions of reset can go on safely.
Of course, faults -- these would be the ports. Faults can happen. That means it
can change the state of the system randomly, it can change the session number
of a component, and these would be the recovery transitions -- I'm not going
through the details so don't worry about the details. What I'm going to do is I'm
going to freak you out by the level of details here -- and these would be the ports.
So from that pictorial description that I gave you of the tree which was very
comprehensible, easy, then to guarded commands and now to this
what I call almost a mess you can see how things get complex and complicated
from a high-level description to a lower level of formalism, right?
So then I would argue that let's stop here and not go through, you know,
developing distributed code. Now, this is one component. The interaction level
of these components would be complex too because at development time we
don't know which components are going to interact with each other. We don't
know which components are going to be parent and child to each other. So we
have to develop this sort of, you know, complex set of interactions to make sure
that we capture all types of parent-child relationships at run time.
Now, imagine we ask a programmer to develop a program using, say, socket
programming to implement this algorithm. Although the algorithm is very simple,
it's going to be difficult to anticipate all types of synchronization and to guarantee
So the goal is to stay at this level of abstraction and then generate code,
distributed code.
So in action a BIP model is orchestrated by an engine or a scheduler, and this
engine or scheduler can be centralized or distributed, and my focus is going to be
how we develop a distributed scheduler for -- during code generation for these
types of models.
Okay. So the next step of my talk is about transforming a high-level BIP model
into an intermediate model where we don't have multi-party interactions and we
have partial-state semantics.
So let me say what I mean by partial-state semantics. Let's imagine this is a very
simple BIP model. There are seven components here. There's interaction I1, I2,
and I3. Now let's imagine I1 is enabled. That means these three components,
they start executing code, and let's imagine that C3 finishes first. That means C3
is available for computation. C4 is also available for computation. That means I2
is enabled and that means again C3 and C4 can run concurrently. Right?
So there is an issue of atomicity here. In the high-level model either I1 or I2
could execute, but when we go to a concurrent setting, depending
upon the enabledness of components C3 and C4, I1 and I2 could run concurrently.
So then the question is how do we manage atomicity in the sense that the
concurrent implementation adheres to the high-level description of the model.
So let me give a more concrete example. This is the high-level model with four
components, and these three interactions, alpha 1, alpha 2, and alpha 3. Now, if
you assign an engine or a scheduler to manage interactions alpha 1, alpha 2 and
another one to manage interaction alpha 3, these components, these four
components, are replaced here. That means when alpha 2 and alpha 3 are
enabled this component sends a message to this engine and says this port is
enabled. The same thing happens for these two components. And then both
alpha 2 and alpha 3 are enabled and they send a message back to these
components that they can execute alpha 2 and alpha 3 and their associated
transitions locally, but then the problem is how would we guarantee that
this doesn't violate the semantics of the original model.
On top of that there is the issue of conflicting interactions. So what if we have
interaction alpha 1 and alpha 2 that share port P. Obviously we cannot execute
both at the same time because there is only one transition here enabled which is
labeled by P. So both interactions cannot be executed.
There's another type of conflict. What if we are at this state that means this
transition is enabled, this transition is also enabled, and if this -- so that means
potentially these two interactions can -- alpha 1 and alpha 2 -- can be enabled,
but this component can take only one of these transitions at a time. Then the
question is how do we resolve these types of conflicts if we are in a distributed
In a centralized setting it's easy. All these enablednesses are sent to the
scheduler, the scheduler automatically decides which interaction should go
forward. That is straightforward. But in a distributed setting that's going to be a
little bit more difficult. So --
>>: [inaudible]
>> Borzoo Bonakdarpou: Sure.
>>: The conflict is defined by whom?
>> Borzoo Bonakdarpou: You can find out where potentially there could be
conflicts by doing a very simple static analysis, because -- let me go back here.
So if you have a model like this, then by looking at this I know that starting from
this state, both transitions P and Q could be -- are enabled, and if this port and
this port here are also enabled, that means alpha 1 and alpha 2 are concurrently
enabled, but this component can only take one of these transitions. And if these
two interactions are managed by two different engines, they sort of have to, you
know, coordinate at run time and let this component know that either alpha 1 or
alpha 2 exclusively are enabled, and therefore only one of these transitions are
taken. Right?
So a simple static analysis finds out where the conflicts are. Or I should say
potential conflicts.
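
A minimal sketch of such a static analysis follows, under a deliberately conservative assumption: two interactions are flagged as potentially conflicting whenever they involve the same component. The actual analysis described in the talk also looks at which transitions are enabled from a given state; the function name and the data layout below are hypothetical.

```python
from itertools import combinations


def potential_conflicts(interactions):
    """Return pairs of interactions that share at least one component.

    `interactions` maps an interaction name to a set of (component, port)
    pairs. Sharing a component is a conservative over-approximation of a
    real conflict.
    """
    conflicts = set()
    for (a, ports_a), (b, ports_b) in combinations(sorted(interactions.items()), 2):
        components_a = {component for component, _ in ports_a}
        components_b = {component for component, _ in ports_b}
        if components_a & components_b:
            conflicts.add((a, b))
    return conflicts


# Hypothetical model: C1 offers port p to alpha1 and port q to alpha2.
model = {
    "alpha1": {("C1", "p"), ("C2", "x")},
    "alpha2": {("C1", "q"), ("C3", "y")},
    "alpha3": {("C4", "z")},
}
found = potential_conflicts(model)
```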
Okay. So here's a short story. When I started working on this problem, I
developed an algorithm. Actually, it was a very simple algorithm, it was a very
naive algorithm, but still it was an algorithm. It worked. We did some
experiments. It was fun. And then I was just browsing one of the books that I
really liked by Chandy and Misra, one of the God books, and I bumped into this
problem which is called the committee coordination problem. And that was only
by chance.
Professors -- and this is [inaudible] -- professors in a certain university have
organized themselves into committees. Each committee has an unchanging
membership roster of one or more professors. From time to time a professor
may decide to attend a committee meeting. It starts waiting and remains waiting
until a meeting of a committee of which it is a member is started. All meetings
terminate in finite time. The restrictions on convening a meeting are as follows:
Synchronization: a meeting of a committee may be started only if all members of
that committee are waiting. Exclusion: no two committees may convene
simultaneously if they have a common member.
The problem is to ensure that if all members of a committee are waiting, then a
meeting involving some member of this committee is convened, which is the
progress property.
This is exactly our problem. The committee is an interaction. A professor is a
component. If two committees -- that means two interactions that share one
component -- are enabled at the same time, only one should convene, right?
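
The two restrictions can be written down as a small checker. This is only a sketch of the safety conditions (synchronization and exclusion), not a committee coordination algorithm; the function name and the example committees are hypothetical.

```python
def is_valid_round(convened, committees, waiting):
    """Check synchronization and exclusion for a set of convened committees.

    `committees` maps a committee name to its set of professors, and
    `waiting` is the set of professors currently waiting.
    """
    already_meeting = set()
    for name in convened:
        members = committees[name]
        if not members <= waiting:       # synchronization violated
            return False
        if members & already_meeting:    # exclusion violated
            return False
        already_meeting |= members
    return True


committees = {"A": {"p1", "p2"}, "B": {"p2", "p3"}, "C": {"p4"}}
everyone = {"p1", "p2", "p3", "p4"}
disjoint_ok = is_valid_round(["A", "C"], committees, everyone)
shared_member = is_valid_round(["A", "B"], committees, everyone)
not_all_waiting = is_valid_round(["A"], committees, everyone - {"p1"})
```

In the mapping used in the talk, a committee corresponds to an interaction and a professor to a component, so `A` and `B` play the role of two interactions that share component `p2`.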
So that was exciting. So this problem was introduced back in '88 and I thought
there should be a long line of research doing all sorts of -- you know, introducing
all sorts of committee coordination algorithms, so I started doing a survey on the
literature. As I said, the problem was introduced in '88, and the solution was
given by reducing this committee coordination problem to distributed dining or
drinking philosophers problem.
In '89 Rajive Bagrodia introduced some other solution based on message counts,
and, in sort, zincs by reduction again to dining and drinking philosophers
problem, and he presented some simulations, and then in 1990 the whole thing
There is absolutely no -- nothing happening on this problem after 1990. I have
no idea why. Maybe -- the closest language to what we were doing is Ada,
and so I think Ada would be the only language that could use
committee coordination and -- you know, because of all the reasons that Ada
was not as popular as imperative languages, other imperative languages. There
was no research on this problem.
So now reviving the committee coordination problem, there are all sorts of
questions that maybe did not make a lot of sense back in the '80s. One of them
is how do we guarantee maximal or maximum concurrency? Because if we want
to deploy -- generate code and then deploy a model, we want to have as much
parallelism as we can, right? So maximal concurrency is a problem.
Fairness. This is another problem. So these are all the problems that have not
been addressed very -- in detail.
Fault tolerance or self-stabilization. Waiting time. Service time. And the last one
which I cannot see, the effect of time -- oh, I think this is the utilization.
So these are all the problems that have not been addressed. So there is going to
be a lot of opportunities to do research on solving different types of committee
coordination problems.
So some of the results that we developed are -- these are the impossibility
results. Maximal concurrency in the presence of an unfair scheduler is
impossible. Impossible in distributed computing community means that we
cannot develop an algorithm that guarantees maximal concurrency if you have
an unfair scheduler.
Providing maximal concurrency and committee fairness even in the presence of
a fair scheduler is impossible. Maximal concurrency and bounded waiting times,
satisfying both is going to be impossible.
So that's one side of the story, developing committee coordination algorithms.
The other side is how can we transform a high-level model using committee
coordination solutions in a structured manner and then generate code? Because
one can develop, you know, an ad hoc solution that starts from high-level model,
uses some committee coordination solution, and then generates code. But that's
not going to be very efficient, and I will back it up by some experimental results,
because different committee coordination algorithms will end in completely
different experimental results.
So the first solution that I was talking about before I discovered there exists a
problem called committee coordination was to assign an engine to a set of
conflicting interactions. So, for instance, in this case if -- this is our high-level
model, and alpha 1 conflicts with alpha 2 and it conflicts with alpha 3, and the
same thing for alpha 4, 5, and 6. Then what we're going to do is we assign one
engine to these three which resolves all the conflicts between these three locally,
and there will be another engine which resolves all the conflicts between these
So, again, these are the components, and there will be an engine that manages
alpha 1, 2, and 3, and there will be another engine that resolves the conflicts
between the alpha 4, 5, and 6. And these two engines do not have to talk to
each other because there is no, you know, cross conflict between these two
So that was a very straightforward solution. The implementation of an engine
is done by a Petri net, and let's not worry about the details of this Petri net, but
basically in this Petri net, this is where we receive a message that a
transition is enabled from a [inaudible] component here, and when all the ports of
an interaction -- for instance, alpha 1 -- are enabled, then the token is passed
through this barrier here.
One of the preliminary experiments that we did was on bitonic sorting, and the
reason that we were able to do that was this was -- this is a high-level description
of bitonic sorting, and all these interactions here are non-conflicting.
And this is what the model looks like after inserting those engines. These are the
interactions that are managed by these engines here. Let's not worry about the
details of how the transformation is done because I will talk about the transformations in
more detail after these results.
So these are some numbers of bitonic sorting with different sizes of arrays for
sorting in parallel. This is handwritten MPI code. So, for instance, for a 20k
size array it takes 80 seconds. Our implementation based on [inaudible] sockets
takes -- or I should say implementation and transformation -- takes 96 seconds.
Obviously it has more overhead. That we expected.
Then what we did was we developed a direct transformation from a high-level
model to an implementation using MPI. The strange thing here is our
transformation takes less time than the handwritten code.
So, I mean, this was very strange because automatically generated code is now
performing better than handwritten code, which didn't make a lot of sense.
So when we dug into the problem we realized that we are using collective send
and receive primitives in our transformation, whereas when we developed the
handwritten code here, it uses the regular send and receive primitives in MPI.
So this happened to be faster.
Of course, if you used the same type of primitives here, it's going to behave
Now, the first one was in case of one CPU, this is four CPUs, and this one is 4
CPUs on four different machines. So this was a multi-core implementation and
this was a distributed implementation.
So if you ignore this column, the overhead is less than 50 percent, and the fact is
our focus was not really on, you know, performance evaluation. Our focus was
more on [inaudible] correctness. But obviously if we want to take this to a more
serious step we have to take into account performance issues, and that's one of
the things that I'm hoping to achieve by my visit here because Yuxiong is an
expert at parallel computing.
All right. So the problem with that simple solution, where we manage all the
conflicting interactions by one engine, is that in some models it can be the case that
all the interactions are conflicting, such as this one. These all share the same
port. So the first interaction conflicts with the second one, the second one with
the third one, third one with the fourth and all the way to the end.
So if we take that approach, that means we have to manage all the interactions
by one engine, which is a totally centralized solution.
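
The "one engine per set of conflicting interactions" idea amounts to taking the connected components of the conflict graph. The sketch below uses a small union-find for that; the function name is hypothetical and the conflict pairs are made up to mirror the two situations described in the talk (two separate conflict classes, and a chain where everything ends up in one class).

```python
def engines_by_conflict_class(interactions, conflicts):
    """Group interactions so that conflicting ones share an engine.

    Returns the connected components of the conflict graph, each as a
    sorted list of interaction names.
    """
    parent = {i: i for i in interactions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in conflicts:
        parent[find(a)] = find(b)

    classes = {}
    for i in interactions:
        classes.setdefault(find(i), []).append(i)
    return sorted(sorted(group) for group in classes.values())


# Two independent conflict classes: two engines, no cross talk needed.
two_engines = engines_by_conflict_class(
    ["a1", "a2", "a3", "a4", "a5", "a6"],
    [("a1", "a2"), ("a1", "a3"), ("a4", "a5"), ("a4", "a6")],
)

# A chain of conflicts: everything collapses into one centralized engine.
one_engine = engines_by_conflict_class(
    ["a1", "a2", "a3", "a4"],
    [("a1", "a2"), ("a2", "a3"), ("a3", "a4")],
)
```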
And then the question is how can we add to the level of parallelism and have
distributed engines but at the same time we resolve the conflicts safely?
So then we go back to committee coordination problem. Now, what I'm going to
show you in the next few slides is there can potentially be completely different
solutions to committee coordination.
Let's only focus on binary interactions, and if we take only binary interactions
committee coordination is going to be almost the same as the matching problem
in graphs.
In matching we are looking for a set of edges that do not share a common vertex,
right? So, for instance, in this graph, this is a matching because it's an edge that
does not -- there's no other edge so there's no conflict. This is also a matching,
the red vertex here -- I'm sorry, the red edge here and here because they do not
share a common vertex and the same thing here. Right? So that could be one
potential solution for binary interactions.
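
For binary interactions, a simple greedy pass already yields a matching, that is, a set of interactions that can fire together without sharing a component. This is a sketch of the graph idea only, not a distributed algorithm; the function name and the example edges are hypothetical, and the result is maximal (cannot be extended) rather than guaranteed to be of maximum size.

```python
def greedy_matching(edges):
    """Pick edges one by one, skipping any edge that touches a used vertex."""
    matched, chosen = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            chosen.append((u, v))
            matched.update((u, v))
    return chosen


# Vertices are components; each edge is a binary interaction.
matching = greedy_matching([("a", "b"), ("b", "c"), ("c", "d")])
```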
I'm now going to show another variation. Let's imagine this is my high-level
model. What I'm going to do is I'm going to represent each conflict, for instance,
between I1 and I2 by one vertex. So there is a conflict between I1 and I2. This
is this -- okay, maybe I should rephrase that.
I'm going to represent each interaction by one vertex and I'm going to show each
conflict by an edge between these two vertices. So I1 conflicts with I2, there's
this edge between I1 and I2, I2 conflicts with I3 and I add this edge between I2
and I3. It's a reduction basically to solve the problem.
Now, if I solve the independent set problem for this graph I'm basically solving
the conflict resolution problem. The independent set problem, the solution would
be either I1 and I3, or I2. That's another solution.
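
For the three-interaction chain I1, I2, I3, the independent sets that cannot be extended are {I1, I3} and {I2}. A brute-force sketch that enumerates them is shown below; it is exponential and only meant for tiny examples, and the function name is hypothetical.

```python
from itertools import combinations


def maximal_independent_sets(vertices, edges):
    """Enumerate independent sets that are not contained in a larger one."""
    conflict = {frozenset(e) for e in edges}

    def independent(subset):
        return all(frozenset(pair) not in conflict
                   for pair in combinations(subset, 2))

    candidates = [set(s)
                  for size in range(1, len(vertices) + 1)
                  for s in combinations(vertices, size)
                  if independent(s)]
    # Keep only the sets that are not a proper subset of another candidate.
    return [s for s in candidates if not any(s < t for t in candidates)]


solutions = maximal_independent_sets(
    ["I1", "I2", "I3"],
    [("I1", "I2"), ("I2", "I3")],
)
```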
Now, I can translate this to also finding a clique. If this is my high-level model,
I'm going to, again, represent each interaction by a vertex and I connect this
vertex to all non-conflicting interactions. So I1 does not conflict with I7, I6, and
I4, and this is what I get. I do the same thing for other interactions.
Now, if I find the maximum clique, the largest complete subgraph, then I'm
basically finding the maximum size of interactions that can run concurrently. So
three different approaches to solve the same problem.
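
The clique formulation can be sketched the same way: build the graph whose edges join non-conflicting interactions and search for the largest set of pairwise-connected vertices. The example conflicts below are hypothetical (the model on the slide is not reproduced in the transcript), and the brute-force search is only suitable for small inputs.

```python
from itertools import combinations


def largest_concurrent_set(interactions, conflicts):
    """Return one largest clique of the non-conflict (compatibility) graph."""
    conflict = {frozenset(c) for c in conflicts}
    # An edge joins every pair of interactions that do NOT conflict.
    compatible = {frozenset(p) for p in combinations(interactions, 2)} - conflict
    for size in range(len(interactions), 0, -1):
        for group in combinations(interactions, size):
            if all(frozenset(p) in compatible for p in combinations(group, 2)):
                return set(group)
    return set()


best = largest_concurrent_set(
    ["I1", "I2", "I3", "I4"],
    [("I1", "I2"), ("I2", "I3"), ("I3", "I4")],
)
```

A clique in the compatibility graph is exactly an independent set in the conflict graph, which is why the two formulations describe the same set of concurrently executable interactions.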
So my point here is to solve the same problem, we can take completely different
approaches. And then what I'm going to show you is by taking different
approaches, we get completely different performances for the same experiment,
for the same benchmark.
So now the other aspect of the problem is we want to develop some type of
transformation where we can embed different solutions to the committee
coordination problem in a plug-and-play fashion.
So what we developed was a three-layer architecture as follows. Let's imagine
this is the high-level model, and we take as input a partition of interactions. So
the first class or the first set here is alpha 1 and alpha 2 and the second set is
alpha 3 and alpha 4. This is the input. This is given by the designer.
We replace each of these components by the same component with the same
name here, but the structure is going to be a little bit different. I don't go through
the details of how we have to change their internal structure to make sure that we
have partial state semantics.
Let me just say that each component is now going to have two ports. One is the
offer port which means that -- which transition is enabled. So the intention of this
port is to send a message to the scheduler, and this port which is called port is
intended to receive a notification to execute a transition. So that's the first layer.
The second layer is a layer which we call interaction protocol. So the interaction
protocol resolves local conflicts internally. It also receives from each component
which ports are enabled. If it can solve -- if it can resolve a conflict internally --
for instance, the one between alpha 2 and alpha 1 here or the conflict between
alpha 3 and alpha 4 here -- then it does so, but some of the conflicts are going to
be external, such as the conflict between alpha 2 and alpha 3 and alpha 2 and
alpha 4.
Then these two have to talk to each other to make sure that, for instance, alpha 2
and alpha 3 are handled properly. So we're going to have a third layer which
implements a solution to committee coordination. But the interface between this
layer and the third layer is going to be fixed, independent of the conflict
resolution protocol or the committee coordination algorithm that we employ.
So let me just show you three implementations. If we employ a centralized
committee coordination algorithm or conflict resolution protocol, this is what it's
going to look like. All the reservation requests are sent to this centralized
component. So this is almost straightforward.
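
A centralized conflict resolution component can be sketched as a small arbiter that grants a reservation only if none of the involved components is already committed to another interaction. This is an assumption-laden illustration: the class name, the `reserve`/`release` methods, and the "busy set" bookkeeping are hypothetical and are not taken from the talk or from the BIP tools.

```python
class CentralArbiter:
    """Hypothetical centralized conflict resolver for external conflicts."""

    def __init__(self):
        self.busy = set()  # components committed to a granted interaction

    def reserve(self, interaction, components):
        """Grant `interaction` only if all its components are still free."""
        components = set(components)
        if components & self.busy:
            return False
        self.busy |= components
        return True

    def release(self, components):
        """Called once the granted interaction has finished executing."""
        self.busy -= set(components)


arbiter = CentralArbiter()
granted_alpha2 = arbiter.reserve("alpha2", {"C2", "C3"})
granted_alpha3 = arbiter.reserve("alpha3", {"C3", "C4"})  # shares C3 with alpha2
arbiter.release({"C2", "C3"})
retry_alpha3 = arbiter.reserve("alpha3", {"C3", "C4"})
```

Because every reservation goes through one object, exclusion is trivially enforced; the token ring and dining philosophers variants distribute this same decision.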
This is an implementation based on a token ring implementation of committee
coordination. So each component that has the token can resolve the conflict.
For instance, this component can resolve the conflict between alpha 3 and alpha
2. And this can resolve the conflict between alpha 4 and alpha 2, and the same
here. And the token is passed between these three components.
Another implementation is based on dining philosophers. Each two conflicting
interactions are handled by a common fork between two components that want to
resolve the conflict.
So then as I said, the interface is here. It's going to be constant. And then we
can easily embed a committee coordination algorithm that we want for
In terms of correctness, what we have to show is that all these transformations
that I talked about, you know, starting from a high-level model such as this, which
is this one up here, and then get this transformation which has only
point-to-point, you know, message passing, we have to prove somehow that
this model behaves similarly to this model.
So we were able to prove that this three-layer architecture preserves the
semantics of the high-level model by showing observational equivalence
between the high-level model and our transformation. By observational
equivalence I mean it's a bisimulation. So I'm not going to go through the details
of the proof.
Now, in terms of implementation, obviously for a single engine, the
implementation is easy, but when we want to go to distributed engines we have
to take a partitioning scheme, and I'll show how a partitioning scheme can
change the performance of the generated code. Then there is also an input of a
conflict resolution protocol. It can be centralized, token ring, or dining
philosophers. These are the ones that we have implemented so far.
And in terms of code generation, we now have implementation for TCP sockets,
for MPI primitives and POSIX threads. So the POSIX threads implementation is
ideal for a multi-core platform.
So going back to our diffusing computation, I'm just going to show you the results
of experiments. So these are the components that need to achieve a diffusing
computation. These red and blue interactions show you the structure of the
interaction. So this component is interacting with these one, two, three, four
components, and this is a multi-party rendezvous.
Then there is the partitioning scheme which basically means which interactions
are handled by what interaction [inaudible] components. This means that there is
only one partition, this means there is two, this means that -- the second item
here means which committee coordination algorithm is employed. It can be a
centralized one, token ring, or dining philosophers, or it can be nothing. That
means it's a centralized implementation on just one machine.
So let me directly go to this graph, which is interesting. The y axis here is the
total execution time of the generated code for a Torus of 6 by 4 components, and
these are different types of generated code. This is, for instance, 24 partitions
implemented by dining philosophers committee coordination. This one is 24
partitions by centralized token ring, 4 partitions dining philosophers and so on.
So one trend that we can see is almost by increasing the number of partitions,
we are increasing the level of concurrency, and we get better execution time in
the generated code.
The other thing is the centralized -- the totally centralized one, which means
there's no concurrency, takes the maximum time. So this means that distribution
is improving the performance.
The other observation here is -- which is very interesting is the token ring
implementation is performing better than dining philosophers. And this is a little
bit counterintuitive because the dining philosophers implementation is -- allows
more concurrency, allows more parallelism, a higher level of parallelism, but it is
not performing as well as the token ring one.
The reason here is although dining philosophers allows higher level of
parallelism, since there are more components that need to interact with each
other to resolve conflicts there's going to be more messages, and therefore
there's going to be more overhead.
Then we thought, this is a 6 by 4 Torus. Let's increase the number of
components, and that means if the local engines can work concurrently with a
higher level of parallelism, maybe dining philosophers performs better than token
ring, and that was indeed the case.
So for a 20 by 20 Torus it takes 500 seconds to complete diffusing computation
for dining philosophers implementation, and for the token ring implementation it
takes about an hour.
So, I mean, this graph and the previous one clearly show that different committee
coordination algorithms and different partitionings result in completely different
performance, and that means several things. One is there's no silver bullet to
generate code for a distributed implementation or a parallel implementation, and
it also suggests that if we want to take this approach to generate code for
distributed systems, we need to have a very rich library of different partitioning
schemes and different implementations of committee coordination to get the best
>>: [inaudible]
>> Borzoo Bonakdarpou: Yes. I mean -- in what sense?
>>: [inaudible]
>> Borzoo Bonakdarpou: Right.
So -- well, yes. One problem is if that component dies, then, I mean, the whole
system gets stuck, right? I mean, it's -- it's basically the question of distribution
versus, you know, a central scheduler. And when there's only a central
scheduler, that becomes a point of -- if it becomes a point of failure, then the
whole system dies, right?
Did I answer your question?
>>: [inaudible]
>> Borzoo Bonakdarpou: The part of this is -- so, for instance, go back here. If
this component dies the system can still resolve the conflicts between alpha 1
and alpha 2, also alpha 2 and alpha 3. So interactions alpha 1, alpha 2 and
alpha 3 can still take place. The only thing that would not take place is alpha 4,
because this component would never authorize alpha 4 to be executed.
So let me summarize the first part of the talk. The second part is going to be
So we basically addressed the problem of generating distributed codes or
concurrent parallel code out of a high-level model where we have a set of
components that are synchronized by a set of simple primitives. Each
component then, after code generation, becomes a standalone application, and
the synchronization primitives are meant to -- how the components work with
each other.
The implementation is in C++. We tried different transformations. We tried
different committee coordination algorithms for implementing synchronization
issues. These are some of the publications out of this work. So as you can see,
the publications up here are in distributed systems conferences such as DISC
and SSS and IPDPS, and some of them are in the embedded systems
conferences such as EMSOFT and SIES.
Embedded systems conferences are interested in this work because it ensures
correctness which is good for safety critical systems.
And for future work there are a lot of things that we would like to do. For instance,
we want to tailor our transformations for multi-core platforms, for shared memory
platforms. Then we can leverage lock-based or wait-free or transactional
memory and all of those types of implementations and solutions for -- to
implement mutual exclusion.
But for distributed setting there is still a lot of room to work on different types of
schedulers, different types of committee coordination algorithms. I showed you
different -- three different solutions that we have not implemented. We have no
idea yet how they are going to work in action.
Another paradigm is to develop transformations for different types of applications.
For instance, for peer-to-peer networks, for sensor networks where energy is
important, because it's not going to be all about execution time. In sensor
networks, execution and computation is going to be cheap. What is expensive is
radio, is communication.
So then the question is how we can develop transformations that preserve energy as
much as possible and possibly do more computation.
There are questions on synthesizing the glue level or the interaction level. And
another question is how we can leverage this framework to do compositional
verification for, for instance, algorithms such as transactional memory and so on.
So before I go to part two, are there any questions? Not really?
Okay. So part two is about some work that I have been doing since I joined
University of Waterloo, and that is on debugging and testing and tracing, and this
is a joint work with Sebastian Fischmeister and Samaneh Navabpour.
The question that we are trying to address here is if we want to do debugging or
tracing in a system, how we can minimize the probe effects on that system.
Normally we add instrumentation to a program. By instrumentation I mean, you
know, printf statements, breakpoints, and things like that.
And this is not in all cases desirable because adding instrumentation means
adding more probe effects. In time sensitive systems it means, you know,
tampering with the deadlines, it means changing the overhead, it means basically
manipulating the normal behavior of the program.
So the question that we started asking ourselves was how can we minimize
these probe effects. So we came up with a notion of observability, and by which
we mean that if we want to debug or test a program we basically want to trace
the value of a set of variables to evaluate a set of properties. For instance, if you
want to evaluate whether A is greater than 100 -- I'm talking in the debugging and
testing context, not in a verification context -- then that means we have to sort of
record the value of A or the same thing for E and C here, and then the question
is, of course, one way to do that is we add a printf here to write the value of A
to a file or to the screen, but then the question is, is there any way to minimize
this addition of printfs.
So achieving observability in an ad hoc fashion is -- can be done by traditional
methods which tamper with the natural behavior of the program, and as I said,
the examples are printfs and breakpoints.
Now, in a concurrent setting, in a multi-threaded program and in a realtime
program the outcome of this tampering is going to be a little bit more
amplified because adding printfs and adding breakpoints is going to affect the
context switches, it's going to affect the interleavings of threads and so on.
So let's imagine we have these two threads. Don't worry about what the code is
doing. It is just for an illustrative example.
G is x plus y plus z -- x plus z plus y. There is an if-then-else statement here and
three more assignments here. The value of c can depend on whether the if or
the else branch is taken, and there is another thread which also defines the value
of variable e, and variable e is shared between these two threads.
So in an ad hoc manner, we have to add some instrumentation instructions
to record the values of a, e, c, and so on. And as I said, this addition of
instrumentation instructions causes more interleaving scenarios, unpredictable
context switches, changes in the timing behavior, in resource usage, and so on.
So our goal is to reduce the deviation between the behavior of a given program
and the program that is instrumented. So we have source code and we have a
list of desired variables that we want to monitor or we want to log or debug, and
our goal is to find a minimum number of variables that need to be instrumented.
So this is an optimization problem. Let me give you an extreme example of what
I mean, and this extreme example is as follows. We have these six instructions
which define the values of x, y, z, v, u, and w. X is q times 10, y is q divided by
10, z is q plus 10, v is q minus 10, u is q modulo 10, and w is q to the power 10.
All right.
So if we want to debug the values of x, y, z, v, u, and w, one possible way is to add
a printf here after x, after y, z, and so on, right?
But perhaps the smarter way -- this is what we call the set of variables of interest
or desired variables. But probably a smarter way to do this is to only printf the
value of q and then we can extract the value of x, y, z, v, u, and w, right? So by
only adding one printf we can extract the value of all these variables.
This is what I meant by minimizing the instrumentation -- the number of
instrumentation instructions.
So q would be the set of variables that have to be instrumented in order to
observe the values of x, y, z, v, u and w. I call this an extreme example because,
you know, the value of all these variables depend on only one variable, right?
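
The extreme example can be written out as a small sketch. The function name and the logged value 7 are made up for illustration, and integer division is assumed for "q divided by 10" since the subject programs are C programs.

```python
def reconstruct(q):
    """Recompute all six variables of interest offline from one logged value of q."""
    return {
        "x": q * 10,
        "y": q // 10,   # assumes integer division
        "z": q + 10,
        "v": q - 10,
        "u": q % 10,
        "w": q ** 10,
    }


logged_q = 7                    # the only value the instrumented program records
values = reconstruct(logged_q)
```

One logged value replaces six instrumentation points, because each variable of interest is a function of q alone.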
So this is our technical approach. We first extract program slices of each
variable of interest, then we create something that we call an observability graph
and then we check whether those variables of interest are observable using that
graph. If yes, we try to minimize that instrumentation. If no, then we come up
with a new instrumentation scheme.
So this needs the notion of dependency between variables to extract slices. For
instance, if you want to observe the value of a, a depends on the value of b, c, x,
and e, so that means a depends on b in instruction L9. The same type of
dependencies exist, and there can be chains of dependency. The other problem
is there is -- because we are doing static analysis here, statically we do not know
if this if branch is taken or else, so we have to take both into account by static
analysis. That means c can depend on f and g, it can depend on d. So what we
do is we take c and then it depends on the instructions L3 and L5 at the same
So dependency chains is based -- dependency chain is basically a chain of
dependencies where we start from the value of interest that we have -- for
instance, a -- and then we have to trace back all the dependencies to where the
chain is complete.
So one example of this chain is a depends on e, e depends on g, g depends on z
up here. Right? And if we develop a complete dependency chain, then we should be
able to always extract the value of a variable.
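
A small sketch of how such chains could be enumerated. The dependency table is a guess at the example on the slide, pieced together from what is said in the talk, and the instruction labels (T1.L9 and so on) and the function name are hypothetical.

```python
# variable -> list of (defining instruction, variables it uses); assumed layout
defs = {
    "g": [("T1.L1", ["x", "z", "y"])],
    "c": [("T1.L3", ["f", "g"]), ("T1.L5", ["d"])],   # if branch and else branch
    "e": [("T1.L7", ["g"]), ("T2.L1", ["s"])],        # shared between the threads
    "a": [("T1.L9", ["b", "c", "x", "e"])],
}


def dependency_chains(var, defs, prefix=()):
    """Return every chain that starts at var and ends at a variable with no definition."""
    chain = prefix + (var,)
    uses = [u for _, used in defs.get(var, []) for u in used if u not in chain]
    if not uses:
        return [chain]          # the chain is complete
    chains = []
    for u in uses:
        chains.extend(dependency_chains(u, defs, chain))
    return chains


chains = dependency_chains("a", defs)
```

With this assumed table, the chain a, e, g, z mentioned in the talk is one of the enumerated chains.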
In multi-threaded programs it's going to be a little bit more complex because of
existence of shared variables. So in this case e is a shared variable between
thread 2 and thread 1. So e is defined here in instruction L7 of thread 1, and it is
also defined by instruction L1 of thread 2.
So that means we have to take into account all possible interleavings. This can
also depend on the memory models that we use, so the problem can be more
complex. What we aimed here was basically sequential consistency.
So to obtain the set of variables that define the value of a variable, we can use
the notion of program slices, and there's a lot of work in the literature on slicing.
And what we did basically is we took part of the algorithms for slicing concurrent
programs. So, for instance, to find the value of a -- to extract the value of
variable a, it's only sufficient to take these instructions into account because the
other instructions don't affect the value of a.
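
As a rough illustration of slicing, here is a flow-insensitive backward slice. It ignores control flow and thread interleavings, so it is much cruder than the published algorithms for slicing concurrent programs. The program listing, including the two instructions that do not affect a, is hypothetical.

```python
def backward_slice(program, target):
    """Collect the instructions that may affect target (flow-insensitive)."""
    needed = {target}
    in_slice = set()
    changed = True
    while changed:                          # iterate until nothing new is added
        changed = False
        for label, defined, used in program:
            if defined in needed and label not in in_slice:
                in_slice.add(label)
                needed |= set(used)
                changed = True
    return in_slice


# (instruction label, variable defined, variables used); assumed example
program = [
    ("T1.L1", "g", ["x", "z", "y"]),
    ("T1.L3", "c", ["f", "g"]),
    ("T1.L5", "c", ["d"]),
    ("T1.L7", "e", ["g"]),
    ("T1.L8", "h", ["d"]),                  # does not affect a
    ("T1.L9", "a", ["b", "c", "x", "e"]),
    ("T2.L1", "e", ["s"]),
    ("T2.L2", "m", ["e"]),                  # does not affect a
]
slice_of_a = backward_slice(program, "a")
```

The two instructions that define h and m are left out of the slice because nothing that a depends on is computed from them.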
Then we -- when we have the slices, we construct this observability graph. So
an observability graph is defined as follows. This is variable a. We call this a
variable vertex, and this is a context vertex, which is the instruction. This
instruction uses values of variables b, e, and c. And the same thing for x.
E is defined by instructions either L1 or L7, which use variables s and g, and so
on. We can complete the observability graph. And then having the graph, if we
want to observe the value of a, then we have to find the coverage that covers the
value of a. For instance, if I add instrumentation for variables d, f, y, and x, this is
not going to be sufficient because basically y and x are not sufficient to extract
the value of g, right? And when we don't have g, that means we don't have c. If
we don't have c, then that means we don't have a.
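
The coverage check can be sketched like this, under a simplifying assumption: a variable counts as observable if it is instrumented itself, or if every instruction that may define it uses only observable variables. Branch conditions and interleaving order are ignored here, and the graph is the same hypothetical reconstruction of the slide used for illustration.

```python
defs = {
    "g": [("T1.L1", ["x", "z", "y"])],
    "c": [("T1.L3", ["f", "g"]), ("T1.L5", ["d"])],
    "e": [("T1.L7", ["g"]), ("T2.L1", ["s"])],
    "a": [("T1.L9", ["b", "c", "x", "e"])],
}


def observable(var, defs, instrumented, visiting=frozenset()):
    """Check whether var can be recomputed from the instrumented variables."""
    if var in instrumented:
        return True
    if var in visiting or var not in defs:
        return False            # cyclic dependency, or an input nobody logged
    visiting = visiting | {var}
    return all(
        all(observable(u, defs, instrumented, visiting) for u in used)
        for _, used in defs[var]
    )


weak = {"d", "f", "y", "x"}                 # the insufficient set from the talk
g_ok = observable("g", defs, weak)          # z is missing
c_ok = observable("c", defs, weak)          # c needs g
a_ok = observable("a", defs, weak)          # a needs c
full = {"b", "d", "f", "s", "x", "y", "z"}
a_full = observable("a", defs, full)
```

The set d, f, y, x fails exactly as described: without z there is no g, without g there is no c, and without c there is no a.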
Now, if we want to observe the value of multiple variables, that means we have
an observability graph for multiple variables, and then that issue of minimization
makes sense.
All right. So the optimization problem that I was talking about is formally as
follows. We have a concurrent program, a multi-threaded program, we have a set of
variables v, which is the set of desired variables that we want to debug, and the
question is does there exist a subset of variables v prime whose size is less than k,
and k is the size of v, whose instrumentation makes all the variables in v
This problem -- we showed that for sequential programs, not even for a
multi-threaded program, this problem is NP complete. For multi-threaded
programs, we are going to have another exponential blow-up, and that is
because extracting program slices for multi-threaded programs has another
exponential blow-up. So we are suffering from two exponential blow-ups, one to
solve our optimization problem and the other one to generate the program slices.
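
To make the optimization problem concrete, here is a brute-force sketch that tries instrumentation sets in increasing size. It is exponential in the number of variables, which fits the hardness result, and it is not the SMT encoding used in the actual tool chain. The observability rule is the same simplified one (no branch conditions, no interleavings), and all names are hypothetical.

```python
from itertools import combinations


def observable(var, defs, instrumented, visiting=frozenset()):
    """Simplified check: instrumented, or all definitions use observable variables."""
    if var in instrumented:
        return True
    if var in visiting or var not in defs:
        return False
    visiting = visiting | {var}
    return all(observable(u, defs, instrumented, visiting)
               for _, used in defs[var] for u in used)


def minimum_instrumentation(defs, targets):
    """Return a smallest set of variables that makes every target observable."""
    candidates = sorted(
        set(defs) | set(targets)
        | {u for ds in defs.values() for _, used in ds for u in used}
    )
    for size in range(len(candidates) + 1):          # smallest sets first
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(observable(t, defs, chosen) for t in targets):
                return chosen


# The extreme example: six variables that all depend only on q.
extreme = {name: [("L%d" % i, ["q"])] for i, name in enumerate("xyzvuw", start=1)}
best = minimum_instrumentation(extreme, list("xyzvuw"))
```

For the extreme example the search stops at size one with the set containing only q.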
And this is the work that we used -- it was published in [inaudible] that we
implemented to extract program slices.
Our tool chain works as follows. We take a C program and a set of variables of
interest, we implemented the slicer on the LLVM compiler, then we give it to our
observability checker. Our optimization problem is mapped into an SMT model
which we solve with Yices, and that shows what should be instrumented in the
Let me show you the result of some experiments. So we aimed at studying two
things. One was how much we can reduce instrumentation, and the second
study was how much this reduction affects the execution time of a program.
So our case studies were some popular data -- concurrent data structures such as
concurrent linked-list and red-black trees, and we took different implementations.
There is a lock-based implementation, there is a non-blocking implementation.
The lock-based implementation is by Tim Harris. Actually, I think Tim Harris
works with Microsoft Research in Cambridge.
The obstruction-free or the non-blocking implementation is by [inaudible] Attiya
and there is also a lock-free implementation and there is also an implementation
by transactional memory.
So we took all these implementations to see how they behave when we
instrument a set of variables in an ad hoc fashion and when we instrument the
variables optimally.
So this is the reduction that we gained using our method. For linked-list, which
uses nested locks, the original instrumentation had 43 instructions, and the way
we came up with this instrumentation for debugging was we took the third top
variables that are defined in the program by the highest frequency and we
instrumented those. Any definition of those variables -- the value of that variable
is instrumented.
Then we tried to -- then we applied our method to find a minimum number of
instrumentations that has to be done in order to extract the value of those
In the case of this nested lock, from 43 instrumentation instructions we went
down to 20. So that is pretty good. In some cases it's less than 50 percent. In
some cases it's less, in some cases it's more. On average in the experiments that
we conducted, we gained 45 percent reduction. So that was the effectiveness of
our method.
In terms of the effect on execution time, we conducted the following experiments.
This graph shows the performance improvement factor, and this is the I/O delay
of instrumentation. This is simulated. We didn't use different devices for
So one of them was printf, the other one simulates printf, the other one
simulates logging on an EEPROM, the other one on the screen, the
other one on disk. So these are different instrumentation schemes.
So, for instance, in case of linked-list instrumentation, the obstruction-free
algorithm by [inaudible] Attiya, this shows the best improvement factor. So we
took, in our experiment, 100 insertion operations in the red-black tree, and for
different types of I/O delay for instrumentation the improvement factor can go all
the way to -- in terms of execution time -- all the way to 50 times.
So we did not -- we did the same type of work for sequential programs before we
worked on multi-threaded programs. We never reached this much improvement.
50 percent was something that I never thought -- I never imagined. Sorry, not 50
percent. 50 times improvement.
In some cases the improvement is less than 10 times, and that is the case where
the instrumentation was in the loop. When the instrumentation is in the loop and
we cannot reduce it that much, then the improvement is not that much either.
This graph shows the same improvement factor but for a different number of
insertions. So we changed the number of insertions to red-black tree or to
linked-list by -- from 200 to 1000, and as you can see, in some cases the
improvement factor goes all the way to 70. So that means that ad hoc
instrumentation for debugging can slow down a program by as much as 70 times in a
concurrent setting.
So the summary of this part of the talk is we looked at reducing the probe effect
introduced by instrumentation instructions for debugging and testing purposes,
and we tested our method for popular concurrent data structures such as
linked-list and the red-black trees. The problem is NP complete. That means
there is still a lot of work to be done for designing heuristics. What we did is we
transformed our problem into an SMT problem, but for large models, it's still going to
be an intractable problem.
And in some cases we showed up to 70 times gain in performance when we
instrumented a program using our method optimally. For future work we are
looking at designing heuristics, we are looking at the notion of observability in
weak memory models, we're looking at optimization with respect to log size. It's
not always the case that we want to optimize. Sometimes we -- for instance, in
embedded systems we have a device that only has, you know, a certain size,
and we can log things up to some extent, so we want to log things that make the
program as observable as possible. And we're looking at probabilistic
observability. That means finding a set of variables that makes another set of
variables observable with the highest probability.
So thank you very much for your attention. I'll be happy to take questions.