>> Yuxiong He: So good afternoon, everyone. Welcome to the talk. It's my great pleasure to introduce Borzoo Bonakdarpour. I hope I pronounced it right. He's a research assistant professor at the University of Waterloo. He's broadly interested in the correctness of software development using formal methods, especially in the area of distributed and embedded realtime systems. So today he's going to share with us his work on model-based code generation and debugging of concurrent programs.

>> Borzoo Bonakdarpour: Okay. Thank you very much for the introduction and your kind invitation. So as Yuxiong said, I have been working on not necessarily software but system correctness since I became a graduate student. This talk will focus on model-based code generation, which is about correctness by construction. So let me give you an overview of the long-term vision that I have, at least for the next few years. I'm generally interested in program correctness, and I'm basically following two lines of research. One is offline, before the program runs, and one is online. By offline I mean correctness by verification and correctness by construction. I've done some work on compositional verification, but correctness by construction is what I'm more interested in. At what I call the micro level, I'm interested in model repair and model synthesis: given an existing model and a set of properties, the problem is how we can fix bugs and errors in that model. At what I call the macro level is model-based code generation, where there exists a model and we want to transform it into code for different types of target platforms. And recently I've been working on tracing and runtime verification. Because of all the limitations that we know synthesis and model checking have, there is now an emerging trend of using runtime verification and tracing for debugging purposes. So this has been my fascination more recently. I gave a talk here at MSR in early 2008 on model repair, and today I'm going to talk about model-based code generation and a little bit about tracing and debugging. So the first part is on model-based construction of distributed programs, and this is joint work with Marius Bozga, Mohamad Jaber, Jean Quilbeuf, and Joseph Sifakis. The work goes back to my postdoc time at Verimag, and I'm still collaborating with them, so that's been fun. So the idea here is we want to start from a high-level model and generate code from that high-level model, but the high-level model may be associated with code as well. Let me describe this in more detail. Why do we do model-based software development? I think you're all familiar with the reasons. Models abstract implementation details. Analysis of models is easier: we can do verification and model checking on models, whereas that's more difficult to do at the implementation level, and we can do testing, simulation, and all sorts of procedures that ensure model correctness. Then we can transform the model into an actual implementation. But the fact is, when this transformation is done manually, bugs are inevitably introduced, because it's a human process.
So in distributed systems I would argue that the problem is a little bit more amplified when we develop software, because of the inherent complex structure of distributed systems: concurrency, non-determinism, race conditions, low atomicity, the issue of faults. So in distributed systems the problem is more difficult, and the gap between developing models and implementing and deploying software is, I would argue, a little bit wider. When I joined Verimag, it was in the early stages of a framework called BIP. I would like to emphasize that I'm not religious about this framework in any way; it's just one way to deal with the problem. My slides are, I think, cut a little bit from the bottom and the top, but that should be fine. So the idea is that we start from a high-level model, and I will say what I mean by high-level model. On the high-level model we can use verification tools such as the D-Finder tool that's been developed at Verimag for finding [inaudible] states in a compositional fashion, and there are some performance criteria involved, for instance. Then we do some source-to-source transformations, for instance, to get closer to distributed computing reality -- point-to-point message passing and things like that -- and then we generate code. So this is the part that I'm going to talk about today: from the system model in BIP to source-to-source transformation and then eventually code generation. The philosophy of this framework is that we first want to support different programming paradigms such as data flow, event driven, synchronous, and so on, and we want to follow a principle of constructivity, which means that if low-level components satisfy some set of properties, then by composing them we want to preserve those properties. Of course, this is a hard problem; we cannot solve it in all cases, but for some safety and some [inaudible] properties that can be done. And then from this high-level model we want to do model transformation and get an implementation that is correct by construction, preserving functional properties and some extra-functional properties that I will talk about. So this is sort of the outline of the first part of the talk. Let me first start by describing the global-state semantics of BIP. Basically, a model has three layers. The first layer is the layer where we describe the behavior of the model: a set of components, where each component is basically a state transition system or a Petri net. Each transition in each component can be associated with a C++ function. Then there is a level of interactions, which is basically a set of [inaudible] primitives such as rendezvous, broadcast, and so on, and there can be priorities for scheduling purposes, which basically means that if two interactions are enabled at the same time, we can give priority to one. There is also composition going on: if you have a composite component here and another one right here, then we can compose them with another set of interactions and priorities and get a new component that can be composed with other components later. Let me show a very simple example. We have four components here. The first component is a sender and the rest are receivers; each has only one transition and two states. The transition is labeled by a port: s for the sender, and r1, r2, r3 for the receivers. We compose them by the interaction sr1r2r3, which means that when all of these ports are enabled, the interaction can take place and each component takes its transition from the first state to the second state. So when all of these components are in their initial state on top, this interaction is enabled, because s is enabled here, r1 here, r2 here, and r3 here, and therefore the transitions can be taken locally and the components reach the second state.
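To make these semantics concrete, here is a minimal C++ sketch -- my own toy encoding, not BIP's actual API -- of the sender/receiver example: the engine fires the rendezvous sr1r2r3 only when every participating port is enabled, and then every component takes its local transition atomically.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Toy encoding of a BIP-style component: a two-state machine whose
// single transition is labeled by one port and enabled in state 0.
struct Component {
    std::string port;
    int state = 0;                          // 0 = initial, 1 = final
    bool enabled() const { return state == 0; }
    void fire() { state = 1; }              // take the local transition
};

int main() {
    // One sender (port s) and three receivers (ports r1, r2, r3).
    std::vector<Component> comps = {{"s"}, {"r1"}, {"r2"}, {"r3"}};

    // Rendezvous sr1r2r3: it fires only when ALL participating ports
    // are enabled, and then every component moves atomically.
    bool allEnabled = std::all_of(comps.begin(), comps.end(),
        [](const Component& c) { return c.enabled(); });
    if (allEnabled) {
        for (auto& c : comps) c.fire();
        std::cout << "rendezvous sr1r2r3 fired\n";
    }
}
```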
Let's imagine we don't have any priorities -- we actually have just one interaction here, so a priority doesn't make sense anyway. This is a rendezvous interaction, and that's how we show it graphically. Now, take the same components but with a different set of interactions. These interactions say that if s is enabled, or s and r1, or s and r2, and so on all the way to the last one, which is s and r1 and r2 and r3, then the local components can make their moves based on the ports enabled. This is basically a broadcast, and it gives priority to the larger interaction: when all of these components are in their initial state, the maximal interaction is going to be sr1r2r3. So that is basically a broadcast, and that's how we show a broadcast graphically. You can come up with more interesting types of interactions. For instance, the upper one is an atomic broadcast, which means that when a sends a message, either b and c both receive it or neither does. The bottom one is a causal chain: when a is enabled, the interaction causally propagates downwards. I was not involved in this part of the work, but it has been shown that this framework is more expressive than any existing [inaudible] algebra. So that was the basic semantics in very simple words. When I joined the effort, we wanted to take such models and generate distributed code, and by distributed code I mean that each component I showed becomes one standalone application and each interaction becomes a synchronization primitive. The question then is how we generate distributed code out of multiparty interactions. If there are only binary interactions, that is easy, because most existing networks provide primitives for point-to-point message passing, but with multiparty interactions the problem becomes a little bit more difficult. So I started the effort by doing a case study on distributed reset, a very well-known algorithm in self-stabilizing systems due to Arora and Gouda. I'm going to describe the algorithm very quickly and in a high-level fashion. The algorithm has three layers. There's an application layer, where the goal is to achieve a distributed reset: a process initiates a reset, and then the whole system is supposed to get reset. There's a wave layer, which performs a diffusing computation, and there's a tree layer, which maintains a spanning tree throughout the network. Let me just focus on the middle layer here, the wave layer, the diffusing computation. The algorithm is very simple. A node such as, let's say, 4 requests a global reset. It sends a message to its parent, 1. 1 sends the message to its parent, 0, which is the root. When the root receives this message, it resets its own state, then it bounces back and sends a message to its children, 1 and 2. The same thing goes on until we reach the leaves. Then, when all the children of a node are reset, they start the completion wave. For instance, 7 is the only child of 4, so it sends a message to 4. The same thing happens between 3 and 1, between 8 and 6, and between 5 and 2. So in the first wave of completion, 3, 7, and 8, which are leaves, are complete; then their parents; and once these two are complete, 2 and 1 become complete, and so on all the way back to the root, and the diffusing computation is complete. It's a very simple algorithm.
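As an aside, the wave logic itself fits in a few lines of sequential C++. This is a minimal sketch under my own simplifications -- a static tree hard-coded from the example above, no faults, no sessions, no message passing -- just to pin down what the request, reset, and completion waves do:

```cpp
#include <cstdio>
#include <vector>

// Static tree: parent[i] is the parent of node i; the root is its own
// parent. The shape approximates the 9-node example from the talk.
struct Tree {
    std::vector<int> parent;
    std::vector<std::vector<int>> children;
};

// Reset wave: a node resets itself, propagates the reset to its
// children, and reports completion once all of its children are done.
void resetWave(const Tree& t, int node) {
    std::printf("node %d resets\n", node);
    for (int c : t.children[node]) resetWave(t, c);   // downward wave
    std::printf("node %d completes\n", node);          // completion wave
}

// Request wave: forward the request up to the root, then start the reset.
void requestReset(const Tree& t, int node) {
    while (t.parent[node] != node) node = t.parent[node];
    resetWave(t, node);
}

int main() {
    Tree t;
    t.parent   = {0, 0, 0, 1, 1, 2, 2, 4, 6};
    t.children = {{1, 2}, {3, 4}, {5, 6}, {}, {7}, {}, {8}, {}, {}};
    requestReset(t, 4);   // node 4 initiates a global reset
}
```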
A self-stabilizing version of this algorithm is given by these -- one, two, three, four -- five innocent-looking guarded commands. By self-stabilizing I mean, in Dijkstra's sense, that starting from any state of the system -- it doesn't have to be reachable -- the diffusing computation completes within a finite number of steps. So it takes care of faults as well. So the algorithm was very easy; this is now a little bit more complex. Now I'm going to develop a model for it in our framework. I introduce three states: normal, init, and reset. We go from normal to init -- this is only for one component. If a node receives a request, it resets its state when its parent says to do so, and then it completes. There are some self-loops here to make sure that multiple sessions of reset can go on safely. Of course, faults can happen -- these would be the ports -- meaning a fault can change the state of the system randomly, it can change the session number of a component, and these would be the recovery transitions. I'm not going through the details, so don't worry about them; what I'm going to do is freak you out with the level of detail here. So from the pictorial description of the tree that I gave you, which was very comprehensible and easy, to guarded commands, and now to what I would almost call a mess, you can see how things get complex and complicated going from a high-level description to a lower-level formalism, right? So I would argue: let's stop here and not go on to developing distributed code by hand. Now, this is one component. The interaction level of these components would be complex too, because at development time we don't know which components are going to interact with each other; we don't know which components are going to be parent and child to each other. So we have to develop this complex set of interactions to make sure that we capture all possible parent-child relationships at run time. Now, imagine we ask a programmer to implement this algorithm using, say, socket programming. Although the algorithm is very simple, it's going to be difficult to anticipate all the necessary synchronization and to guarantee correctness. So the goal is to stay at this level of abstraction and then generate distributed code. In action, a BIP model is orchestrated by an engine, or scheduler, which can be centralized or distributed, and my focus is going to be how we derive a distributed scheduler for these types of models during code generation. Okay. So the next step of my talk is about transforming a high-level BIP model into an intermediate model where we don't have multiparty interactions and we have partial-state semantics. Let me say what I mean by partial-state semantics. Imagine this very simple BIP model. There are seven components here, and interactions I1, I2, and I3. Now let's imagine I1 is enabled.
That means these three components start executing code. Let's imagine that C3 finishes first; that means C3 is available for computation. C4 is also available for computation. That means I2 is enabled, so I1 and I2 can now run concurrently. Right? So there is an issue of atomicity here. In the high-level model, either I1 or I2 could execute, but when we go to a concurrent setting, depending upon the enabledness of components C3 and C4, I1 and I2 could run concurrently. So the question is how we manage atomicity, in the sense that the concurrent implementation adheres to the high-level description of the model. Let me give a more concrete example. This is the high-level model: four components and three interactions, alpha 1, alpha 2, and alpha 3. Now, we assign an engine, or scheduler, to manage interactions alpha 1 and alpha 2 and another one to manage interaction alpha 3, and these four components are replaced here. That means when alpha 2 and alpha 3 are enabled, this component sends a message to this engine saying this port is enabled. The same thing happens for these two components. Then both alpha 2 and alpha 3 are enabled, and the engines send a message back to these components telling them they can execute alpha 2 and alpha 3 and their associated transitions locally. But then the problem is, how would we guarantee that this doesn't violate the semantics of the original model? On top of that, there is the issue of conflicting interactions. What if we have interactions alpha 1 and alpha 2 that share port P? Obviously we cannot execute both at the same time, because there is only one enabled transition here labeled by P, so both interactions cannot be executed. There's another type of conflict. What if we are at this state, where this transition is enabled and this transition is also enabled? That means alpha 1 and alpha 2 can potentially both be enabled, but this component can take only one of these transitions at a time. The question then is how we resolve these types of conflicts in a distributed setting. In a centralized setting it's easy: all the enabledness information is sent to the scheduler, and the scheduler decides which interaction should go forward. That is straightforward. But in a distributed setting that's going to be a little bit more difficult. So --
>>: [inaudible]
>> Borzoo Bonakdarpour: Sure.
>>: The conflict is defined by whom?
>> Borzoo Bonakdarpour: You can find out where there could potentially be conflicts by doing a very simple static analysis. Let me go back here. If you have a model like this, then by looking at it I know that starting from this state, both transitions P and Q are enabled, and if this port and this port here are also enabled, that means alpha 1 and alpha 2 are concurrently enabled, but this component can only take one of these transitions. And if these two interactions are managed by two different engines, they have to coordinate at run time and let this component know that either alpha 1 or alpha 2, exclusively, is enabled, and therefore only one of these transitions is taken. Right? So a simple static analysis finds out where the conflicts are -- or I should say, the potential conflicts.
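To illustrate, here is a rough sketch of such a static check under my own simplified encoding (an interaction is just the set of component/port pairs it synchronizes; the real analysis works on the BIP model and is finer-grained, flagging a conflict only when some local state enables both ports):

```cpp
#include <iostream>
#include <set>
#include <string>
#include <utility>

// Simplified encoding: an interaction is the set of component/port
// pairs it synchronizes. This check is deliberately conservative: it
// flags any shared component as a potential conflict, whereas the
// finer analysis would also inspect which states enable which ports.
using Interaction = std::set<std::pair<std::string, std::string>>;

bool mayConflict(const Interaction& a, const Interaction& b) {
    for (const auto& [compA, portA] : a)
        for (const auto& [compB, portB] : b)
            if (compA == compB) return true;   // shared component (or port)
    return false;
}

int main() {
    Interaction alpha1 = {{"C1", "p"}, {"C2", "q"}};
    Interaction alpha2 = {{"C2", "q"}, {"C3", "r"}};  // shares C2 and port q
    Interaction alpha3 = {{"C4", "s"}, {"C5", "t"}};
    std::cout << mayConflict(alpha1, alpha2) << "\n";  // 1: potential conflict
    std::cout << mayConflict(alpha1, alpha3) << "\n";  // 0: independent
}
```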
Okay. So here's a short story. When I started working on this problem, I developed an algorithm. It was a very simple, very naive algorithm, but still it was an algorithm. It worked, we did some experiments, it was fun. And then I was just browsing one of the books that I really like, by Chandy and Misra -- one of the God books -- and I bumped into a problem called the committee coordination problem, and that was only by chance. It goes like this: professors -- and this is [inaudible] -- in a certain university have organized themselves into committees. Each committee has an unchanging membership roster of one or more professors. From time to time a professor may decide to attend a committee meeting; it starts waiting and remains waiting until a meeting of a committee of which it is a member is started. All meetings terminate in finite time. The restrictions on convening a meeting are as follows. Synchronization: a meeting of a committee may be started only if all members of that committee are waiting. Exclusion: no two committees may convene simultaneously if they have a common member. The problem is to ensure that if all members of a committee are waiting, then a meeting involving some member of this committee is convened -- the progress property. This is exactly our problem. A committee is an interaction; a professor is a component. If two committees -- that means two interactions that share one component -- are enabled at the same time, only one should convene, right? So that was exciting. This problem was introduced back in '88, and I thought there should be a long line of research introducing all sorts of committee coordination algorithms, so I started doing a survey of the literature. As I said, the problem was introduced in '88, and the solution was given by reducing the committee coordination problem to the distributed dining or drinking philosophers problem. In '89 Rajive Bagrodia introduced another solution based on message counts and, again, reduction to the dining and drinking philosophers problem, and he presented some simulations. And then in 1990 the whole thing dies; there is absolutely nothing happening on this problem after 1990. I have no idea why. The closest language to what we were doing is Ada, so I think Ada would be the only language that could have used committee coordination, and for all the reasons that Ada was not as popular as other imperative languages, there was no research on this problem. So now, reviving the committee coordination problem, there are all sorts of questions that maybe did not make a lot of sense back in the '80s. One of them is how we guarantee maximal or maximum concurrency, because if we want to generate code and then deploy a model, we want to have as much parallelism as we can, right? So maximal concurrency is a problem. Fairness is another. These are problems that have not been addressed in detail. Fault tolerance, or self-stabilization. Waiting time. Service time. And the last one, which I cannot see -- oh, I think this is utilization. So these are all problems that have not been addressed, and there are going to be a lot of opportunities to do research on solving different types of committee coordination problems.
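To make the correspondence concrete, here is a minimal centralized sketch in C++ (names and structure are mine; it makes no attempt at distribution, fairness, or maximal concurrency): a committee convenes only if all its members are waiting, and convening marks the members busy so that overlapping committees are excluded.

```cpp
#include <cstddef>
#include <iostream>
#include <set>
#include <vector>

// Committees are sets of professor ids: a committee plays the role of
// an interaction, a professor the role of a component.
struct Coordinator {
    std::vector<std::set<int>> committees;
    std::set<int> waiting;   // professors waiting to attend a meeting
    std::set<int> busy;      // professors currently in a meeting

    // Synchronization + exclusion: convene committee c only if every
    // member is waiting and none is busy. (Meeting termination, which
    // would eventually release the members, is omitted here.)
    bool tryConvene(std::size_t c) {
        for (int p : committees[c])
            if (!waiting.count(p) || busy.count(p)) return false;
        for (int p : committees[c]) { waiting.erase(p); busy.insert(p); }
        std::cout << "committee " << c << " convened\n";
        return true;
    }
};

int main() {
    Coordinator coord;
    coord.committees = {{0, 1}, {1, 2}, {3}};  // committees 0 and 1 share professor 1
    coord.waiting = {0, 1, 2, 3};
    coord.tryConvene(0);   // convenes: professors 0 and 1 are waiting
    coord.tryConvene(1);   // refused: professor 1 is no longer waiting
    coord.tryConvene(2);   // convenes: professor 3 is independent
}
```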
So some of the results that we developed are these impossibility results. Maximal concurrency in the presence of an unfair scheduler is impossible -- and impossible in the distributed computing community means that we cannot develop an algorithm that guarantees maximal concurrency if we have an unfair scheduler. Providing maximal concurrency and committee fairness, even in the presence of a fair scheduler, is impossible. Satisfying both maximal concurrency and bounded waiting time is impossible. So that's one side of the story: developing committee coordination algorithms. The other side is how we can transform a high-level model using committee coordination solutions in a structured manner and then generate code. One can develop an ad hoc solution that starts from a high-level model, uses some committee coordination algorithm, and then generates code, but that's not going to be very efficient, and I will back that up with experimental results, because different committee coordination algorithms end in completely different experimental results. So the first solution -- the one I came up with before I discovered there exists a problem called committee coordination -- was to assign an engine to each set of conflicting interactions. For instance, in this case, this is our high-level model, and alpha 1 conflicts with alpha 2 and with alpha 3, and the same for alpha 4, 5, and 6. So we assign one engine to the first three, which resolves all the conflicts among them locally, and another engine which resolves all the conflicts among the other three. So, again, these are the components; there will be one engine that manages alpha 1, 2, and 3, and another engine that resolves the conflicts among alpha 4, 5, and 6. These two engines do not have to talk to each other because there is no cross conflict between them. So that was a very straightforward solution. The implementation of an engine is done by a Petri net, and let's not worry about the details of this Petri net. Basically, this is where we receive a message from a [inaudible] component that a transition is enabled, and when all the ports of an interaction -- for instance, alpha 1 -- are enabled, the token is passed through this barrier here. One of the preliminary experiments that we did was on bitonic sorting, and the reason we were able to do that was that this is a high-level description of bitonic sorting and all of these interactions are non-conflicting. And this is how the model looks after inserting those engines; these are the interactions that are managed by these engines here. Let's not worry about the details of how the transformation is done, because I will talk about the transformations in more detail after these results. So these are some numbers for bitonic sorting with different sizes of arrays sorted in parallel. This is handwritten MPI code; for instance, for a 20k-element array it takes 80 seconds. Our implementation -- or I should say, our transformation -- based on [inaudible] sockets takes 96 seconds. Obviously it has more overhead; that we expected. Then we developed a direct transformation from the high-level model to an implementation using MPI. The strange thing here is that our transformation takes less time than the handwritten code. I mean, this was very strange, because automatically generated code was now performing better than handwritten code, which didn't make a lot of sense.
So when we dug into the problem, we realized that we were using collective send and receive primitives in our transformation, whereas the handwritten code uses the regular send and receive primitives in MPI, so ours happened to be faster. Of course, if you use the same type of primitives in both, they behave similarly. Now, the first column was the case of one CPU, this is four CPUs, and this one is four CPUs on four different machines -- so a multi-core implementation and a distributed implementation. If you ignore this column, the overhead is less than 50 percent. The fact is, our focus was not really on performance evaluation; our focus was more on [inaudible] correctness. But obviously, if we want to take this a more serious step, we have to take performance issues into account, and that's one of the things I'm hoping to achieve with my visit here, because Yuxiong is an expert in parallel computing. All right. So the problem with that simple solution of managing all conflicting interactions with one engine is that in some models it can be the case that all the interactions are conflicting, such as this one, where they all share the same port: the first interaction conflicts with the second one, the second with the third, the third with the fourth, and so on to the end. If we take that approach, we have to manage all the interactions with one engine, which is a totally centralized solution. So the question is how we can raise the level of parallelism and have distributed engines, but at the same time resolve the conflicts safely. And here we go back to the committee coordination problem. What I'm going to show you in the next few slides is that there can be completely different solutions to committee coordination. Let's focus only on binary interactions. For binary interactions, committee coordination is almost the same as the matching problem in graphs: in matching, we are looking for a set of edges that do not share a common vertex, right? So, for instance, in this graph, this edge is a matching because there's no other edge sharing a vertex with it, so there's no conflict. The red edges here and here are also a matching, because they do not share a common vertex, and the same here. So that could be one potential solution for binary interactions. Now let me show another variation. Let's imagine this is my high-level model. I'm going to represent each interaction by one vertex, and I'm going to show each conflict by an edge between two vertices. I1 conflicts with I2, so there's this edge between I1 and I2; I2 conflicts with I3, so I add this edge between I2 and I3. It's basically a reduction. Now, if I solve the independent set problem for this graph, I'm basically solving the conflict resolution problem: the solution would be either I1 and I3, or I2. That's another solution. And I can also translate this to finding a clique. If this is my high-level model, I'm going to, again, represent each interaction by a vertex, and I connect each vertex to all non-conflicting interactions. So I1 does not conflict with I7, I6, and I4, and this is what I get. I do the same thing for the other interactions. Now, if I find the maximum clique, I'm basically finding the maximum set of interactions that can run concurrently.
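Here is a small sketch of that reduction (my own encoding; the greedy pass is only illustrative, since maximum independent set is NP-hard in general): vertices are interactions, edges are conflicts, and any independent set is a set of interactions that can safely run at the same time.

```cpp
#include <iostream>
#include <vector>

int main() {
    // Conflict graph: vertex i stands for interaction I(i+1); edges
    // join conflicting interactions. Here: I1-I2 and I2-I3 conflict.
    int n = 3;
    std::vector<std::vector<bool>> conflict(n, std::vector<bool>(n, false));
    auto addConflict = [&](int a, int b) { conflict[a][b] = conflict[b][a] = true; };
    addConflict(0, 1);   // I1 conflicts with I2
    addConflict(1, 2);   // I2 conflicts with I3

    // Greedy independent set: pick a vertex, exclude its neighbors.
    // The chosen interactions may execute concurrently.
    std::vector<int> chosen;
    std::vector<bool> excluded(n, false);
    for (int v = 0; v < n; ++v) {
        if (excluded[v]) continue;
        chosen.push_back(v);
        for (int u = 0; u < n; ++u)
            if (conflict[v][u]) excluded[u] = true;
    }
    for (int v : chosen) std::cout << "I" << v + 1 << " ";  // prints: I1 I3
    std::cout << "\n";
}
```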
So: three different approaches to the same problem. My point here is that to solve the same problem, we can take completely different approaches, and what I'm going to show you is that by taking different approaches, we get completely different performance on the same benchmark. Now, the other aspect of the problem is that we want to develop a transformation where we can embed different solutions to the committee coordination problem in a plug-and-play fashion. What we developed was a three-layer architecture, as follows. Imagine this is the high-level model, and we take as input a partition of the interactions. The first class here is alpha 1 and alpha 2, and the second is alpha 3 and alpha 4; this partition is given by the designer. We replace each component by a component with the same name, but with a slightly different internal structure -- I won't go through the details of how we have to change the internal structure to get partial-state semantics. Let me just say that each component is now going to have two ports. One is the offer port, whose intention is to send a message to the scheduler saying which transitions are enabled; the other, which is called the port, is intended to receive a notification to execute a transition. So that's the first layer. The second layer is what we call the interaction protocol. The interaction protocol receives from each component which ports are enabled and resolves local conflicts internally. If it can resolve a conflict internally -- for instance, the one between alpha 1 and alpha 2 here, or the conflict between alpha 3 and alpha 4 here -- then it does so. But some of the conflicts are going to be external, such as the conflicts between alpha 2 and alpha 3 and between alpha 2 and alpha 4; then these two protocols have to talk to each other to make sure that, for instance, alpha 2 and alpha 3 are handled properly. So we have a third layer, which implements a solution to committee coordination. The interface between the second layer and the third layer is going to be fixed, independent of the conflict resolution protocol or committee coordination algorithm that we employ. Let me show you three implementations. If we employ a centralized committee coordination algorithm, or conflict resolution protocol, this is how it's going to look: all the reservation requests are sent to this centralized component. This is almost straightforward. This one is an implementation based on a token ring: the component that has the token can resolve conflicts. For instance, this component can resolve the conflict between alpha 3 and alpha 2, this one can resolve the conflict between alpha 4 and alpha 2, and the same here, and the token is passed among these three components. Another implementation is based on dining philosophers: each pair of conflicting interactions is handled by a common fork between the two components that want to resolve the conflict. So, as I said, the interface here is going to be constant, and then we can easily embed whatever committee coordination algorithm we want for the implementation.
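The plug-and-play point is that the second layer talks to the third through a small fixed interface. A minimal C++ sketch of what such an interface might look like (method and class names are mine, not BIP's):

```cpp
#include <iostream>
#include <memory>

// Fixed interface between the interaction-protocol layer and the
// conflict-resolution layer: an engine asks to reserve an externally
// conflicting interaction and is told whether it may fire it.
struct ConflictResolver {
    virtual bool reserve(int interaction) = 0;   // true = OK to execute
    virtual void release(int interaction) = 0;   // done, free the reservation
    virtual ~ConflictResolver() = default;
};

// One possible implementation: a single centralized arbiter that
// serializes all reservations. Token-ring or dining-philosophers
// implementations would plug in behind the same interface.
struct CentralizedResolver : ConflictResolver {
    int held = -1;   // interaction currently holding the reservation
    bool reserve(int interaction) override {
        if (held != -1) return false;
        held = interaction;
        return true;
    }
    void release(int interaction) override {
        if (held == interaction) held = -1;
    }
};

int main() {
    std::unique_ptr<ConflictResolver> r = std::make_unique<CentralizedResolver>();
    std::cout << r->reserve(2) << "\n";  // 1: alpha2 reserved
    std::cout << r->reserve(3) << "\n";  // 0: alpha3 must wait
    r->release(2);
    std::cout << r->reserve(3) << "\n";  // 1: now alpha3 may fire
}
```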
In terms of correctness, what we have to show is that after all these transformations I talked about -- starting from a high-level model such as this one up here and arriving at this transformation, which has only point-to-point message passing -- the resulting model behaves like the original one. We were able to prove that the three-layer architecture preserves the semantics of the high-level model by showing observational equivalence between the high-level model and our transformation. By observational equivalence I mean a bisimulation; I'm not going to go through the details of the proof. Now, in terms of implementation, for a single engine the implementation is obviously easy, but when we want to go to distributed engines we have to take a partitioning scheme as input, and I'll show how the partitioning scheme can change the performance of the generated code. There is also an input for the conflict resolution protocol: it can be centralized, token ring, or dining philosophers -- these are the ones we have implemented so far. And in terms of code generation, we now have implementations for TCP sockets, for MPI primitives, and for POSIX threads; the POSIX threads implementation is ideal for a multi-core platform. So, going back to our diffusing computation, let me just show you the results of the experiments. These are the components that need to achieve a diffusing computation, and the red and blue interactions show you the structure of the interactions: this component is interacting with these one, two, three, four components, and this is a multiparty rendezvous. Then there is the partitioning scheme, which says which interactions are handled by which interaction protocol components. This means there is only one partition, this means there are two, and the second item here says which committee coordination algorithm is employed: centralized, token ring, or dining philosophers, or it can be nothing, meaning a centralized implementation on just one machine. So let me go directly to this graph, which is interesting. The y axis is the total execution time of the generated code for a torus of 6 by 4 components, and these are different types of generated code. This is, for instance, 24 partitions implemented with dining philosophers committee coordination; this one is 24 partitions with centralized or token ring coordination, 4 partitions with dining philosophers, and so on. One trend we can see is that by increasing the number of partitions, we increase the level of concurrency, and we get better execution time from the generated code. The other thing is that the totally centralized one, which has no concurrency, takes the maximum time, so distribution is improving the performance. The other observation, which is very interesting, is that the token ring implementation performs better than dining philosophers. This is a little bit counterintuitive, because the dining philosophers implementation allows a higher level of parallelism, yet it does not perform as well as the token ring one. The reason is that although dining philosophers allows a higher level of parallelism, more components need to interact with each other to resolve conflicts, so there are going to be more messages, and therefore more overhead. Then we thought: this was a 6 by 6 torus.
Let's increase the number of components -- if the local engines can work concurrently at a higher level of parallelism, maybe dining philosophers performs better than token ring. And that was indeed the case. For a 20 by 20 torus, it takes 500 seconds to complete the diffusing computation with the dining philosophers implementation, while the token ring implementation takes about an hour. So this graph and the previous one clearly show that different committee coordination algorithms and different partitionings result in completely different performance, and that means several things. One is that there's no silver bullet for generating code for a distributed or parallel implementation. It also suggests that if we want to take this approach to generate code for distributed systems, we need a very rich library of different partitioning schemes and different implementations of committee coordination to get the best result.
>>: [inaudible]
>> Borzoo Bonakdarpour: Yes. I mean -- in what sense?
>>: [inaudible]
>> Borzoo Bonakdarpour: Right. So -- well, yes. One problem is that if that component dies, the whole system gets stuck, right? It's basically the question of distribution versus a central scheduler. When there's only a central scheduler, it becomes a single point of failure, and if it fails, the whole system dies. Did I answer your question?
>>: [inaudible]
>> Borzoo Bonakdarpour: Partly, yes -- so, for instance, go back here. If this component dies, the system can still resolve the conflicts between alpha 1 and alpha 2, and also between alpha 2 and alpha 3, so interactions alpha 1, alpha 2, and alpha 3 can still take place. The only thing that would not take place is alpha 4, because this component would never authorize alpha 4 to be executed. So let me summarize the first part of the talk -- the second part is going to be shorter. We addressed the problem of generating distributed or concurrent parallel code out of a high-level model where we have a set of components that are synchronized by a set of simple primitives. After code generation, each component becomes a standalone application, and the synchronization primitives determine how the components work with each other. The implementation is in C++. We tried different transformations and different committee coordination algorithms for implementing the synchronization. These are some of the publications out of this work. As you can see, some of the publications are in distributed systems conferences such as DISC, SSS, and IPDPS, and some are in embedded systems conferences such as EMSOFT and SIES. Embedded systems conferences are interested in this work because it ensures correctness, which is good for safety-critical systems. And for future work, there are a lot of things we would like to do. For instance, we want to tailor our transformations for multi-core, shared-memory platforms; then we can leverage lock-based, wait-free, or transactional memory implementations and solutions to implement mutual exclusion. For the distributed setting, there is still a lot of room to work on different types of schedulers and different committee coordination algorithms. I showed you three different solutions that we have not implemented, and we have no idea yet how they are going to work in action.
Another direction is to develop transformations for different types of applications -- for instance, for peer-to-peer networks, or for sensor networks, where energy is important, because it's not all about execution time. In sensor networks, execution and computation are cheap; what is expensive is radio communication. So the question is how we can develop transformations that preserve energy as much as possible, possibly by doing more computation. There are questions about synthesizing the glue, the interaction level. And another question is how we can leverage this framework to do compositional verification for, for instance, algorithms such as transactional memory and so on. So before I go to part two, are there any questions? Not really? Okay. So part two is about some work that I have been doing since I joined the University of Waterloo, on debugging, testing, and tracing, and this is joint work with Sebastian Fischmeister and Samaneh Navabpour. The question we are trying to address is: if we want to do debugging or tracing in a system, how can we minimize the probe effects on that system? Normally we add instrumentation to a program -- by instrumentation I mean printf statements, breakpoints, and things like that. This is not always desirable, because adding instrumentation means adding probe effects. In time-sensitive systems it means tampering with the deadlines, changing the overhead, basically manipulating the normal behavior of the program. So the question we started asking ourselves was how we can minimize these probe effects. We came up with a notion of observability, by which we mean that if we want to debug or test a program, we basically want to trace the values of a set of variables in order to evaluate a set of properties. For instance, if we want to evaluate whether A is greater than 100 -- I'm talking in a debugging and testing context, not a verification context -- then we have to record the value of A, and the same for E and C here. One way to do that is to add a printf here to write the value of A to a file or to the screen, but then the question is whether there is any way to minimize these added printfs. Achieving observability in an ad hoc fashion can be done by traditional methods that tamper with the natural behavior of the program -- as I said, printfs and breakpoints. Now, in a concurrent setting -- in a multi-threaded program or a realtime program -- the outcome of this tampering is going to be more amplified, because adding printfs and breakpoints affects the context switches, the interleavings of the threads, and so on. So let's imagine we have these two threads. Don't worry about what the code is doing; it's just an illustrative example. g is x plus z plus y. There is an if-then-else statement here and three more assignments here. The value of c can depend on whether the if or the else branch is taken, and there is another thread which also defines the value of variable e; variable e is shared between the two threads. So in an ad hoc manner, we would have to add instrumentation instructions to record the values of a, c, e, and so on. And as I said, these added instrumentation instructions cause more interleaving scenarios, unpredictable context switches, and changes in timing behavior, resource usage, and so on.
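Here is a hedged C++ reconstruction of that two-thread example (the variable names follow the slide, but the actual expressions and locking are my guesses); the commented-out printf calls mark where ad hoc probes would go, and each one adds I/O and perturbs the schedule:

```cpp
#include <cstdio>
#include <mutex>
#include <thread>

int x = 1, y = 2, z = 3, d = 4, f = 5;
int e = 0;                            // shared between the two threads
std::mutex m;

void thread1() {
    int g = x + z + y;
    // std::printf("g=%d\n", g);      // ad hoc probe: extra I/O here
    int c = (g > 5) ? f + g : d;      // c depends on the branch taken
    std::lock_guard<std::mutex> lock(m);
    int a = c + e;                    // reads shared e
    // std::printf("a=%d c=%d e=%d\n", a, c, e);  // more probes, more
    (void)a;                          // perturbation of the interleavings
}

void thread2() {
    std::lock_guard<std::mutex> lock(m);
    e = x * 2;                        // also defines shared e
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
}
```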
Our goal is to reduce the deviation between the behavior of a given program and the behavior of its instrumented version. We have source code and a list of desired variables that we want to monitor, log, or debug, and our goal is to find a minimum set of variables that need to be instrumented. So this is an optimization problem. Let me give you an extreme example of what I mean. We have these six instructions, which define the values of x, y, z, v, u, and w: x is q times 10, y is q divided by 10, z is q plus 10, v is q minus 10, u is q modulo 10, and w is q to the power 10. If we want to debug the values of x, y, z, v, u, and w -- this is what we call the set of variables of interest, or desired variables -- one possible way is to add a printf after x, after y, after z, and so on, right? But a smarter way is to printf only the value of q, because from q we can extract the values of x, y, z, v, u, and w. So by adding only one printf we can extract the values of all these variables. This is what I mean by minimizing the number of instrumentation instructions: q would be the set of variables that have to be instrumented in order to observe the values of x, y, z, v, u, and w. I call this an extreme example because the values of all these variables depend on only one variable, right?
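In code, the extreme example looks like this (a sketch; in practice the reconstruction would be done offline from the log rather than inside the program):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    int q = 42;

    // Six derived variables -- all fully determined by q alone.
    int x = q * 10;
    int y = q / 10;
    int z = q + 10;
    int v = q - 10;
    int u = q % 10;
    double w = std::pow(q, 10);

    // Naive instrumentation would log all six values. Instead, one
    // probe on q suffices:
    std::printf("q=%d\n", q);

    // Offline, the variables of interest are reconstructed from the
    // log: x = q*10, y = q/10, z = q+10, v = q-10, u = q%10, w = q^10.
    (void)x; (void)y; (void)z; (void)v; (void)u; (void)w;
}
```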
So this is our technical approach. We first extract program slices for each variable of interest, then we create something we call the observability graph, and then we check whether the variables of interest are observable using that graph. If yes, we try to minimize the instrumentation; if no, we come up with a new instrumentation scheme. This uses the notion of dependency between variables to extract slices. For instance, if you want to observe the value of a, and a depends on the values of b, c, x, and e, that means a depends on b in instruction L9. The same type of dependency exists elsewhere, and there can be chains of dependencies. The other problem is that because we are doing static analysis, we do not know statically whether the if branch is taken or the else branch, so we have to take both into account: c can depend on f and g, or it can depend on d. So we take c and say it depends on instructions L3 and L5 at the same time. A dependency chain is basically a chain of dependencies where we start from the variable of interest -- for instance, a -- and trace back all the dependencies until the chain is complete. One example of such a chain is: a depends on e, e depends on g, g depends on z up here. Right? And if we develop a complete dependency chain, then we should always be able to extract the value of the variable. In multi-threaded programs it's going to be more complex because of shared variables. In this case, e is a shared variable between thread 1 and thread 2: e is defined in instruction L7 of thread 1, and it is also defined by instruction L1 of thread 2. That means we have to take into account all possible interleavings. This can also depend on the memory model we use, so the problem can be even more complex; what we assumed here was basically sequential consistency. To obtain the set of variables that define the value of a variable, we can use the notion of program slices, and there's a lot of work in the literature on slicing. What we did is adapt existing algorithms for slicing concurrent programs. For instance, to extract the value of variable a, it is sufficient to take only these instructions into account, because the other instructions don't affect the value of a. Once we have the slices, we construct the observability graph, which is defined as follows. This is variable a -- we call this a variable vertex -- and this is a context vertex, which is an instruction. This instruction uses the values of variables b, e, and c. The same for x. e is defined by instruction L1 or L7, which use variables s and g, and so on; this is how we complete the observability graph. Then, having the graph, if we want to observe the value of a, we have to find a cover that covers a. For instance, if I add instrumentation for variables d, f, y, and x, this is not sufficient, because y and x are not sufficient to extract the value of g, right? And when we don't have g, we don't have c, and when we don't have c, we don't have a. Now, if we want to observe the values of multiple variables, we have an observability graph for multiple variables, and then the minimization question really arises. All right. So the optimization problem that I was talking about is formally as follows. We have a concurrent, multi-threaded program and a set V of desired variables that we want to debug, and the question is: does there exist a subset of variables V' of size at most k, where k is less than the size of V, whose instrumentation makes all the variables in V observable? We showed that this problem is NP-complete even for sequential programs, not just multi-threaded ones. For multi-threaded programs we get another exponential blow-up, because extracting program slices for multi-threaded programs has its own exponential blow-up. So we suffer from two exponential blow-ups: one to solve the optimization problem and one to generate the program slices. This is the work, published in [inaudible], that we implemented to extract program slices. Our tool chain works as follows. We take a C program and a set of variables of interest, we run the slicer, which we implemented on the LLVM compiler, and we give the result to our observability checker. The optimization problem is mapped into an SMT model, which we solve with Yices, and that tells us where the program should be instrumented.
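To show the shape of the optimization, here is a toy brute-force version in C++ (the real tool chain encodes this as an SMT problem for Yices, and the dependency relation below is invented): a variable is observable if it is instrumented directly, or if everything it depends on is itself observable.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Invented dependency relation: deps[v] = variables whose values
// determine v. (The real observability graph also distinguishes
// alternative defining instructions; that is flattened away here.)
std::map<std::string, std::set<std::string>> deps = {
    {"x", {"q"}}, {"y", {"q"}}, {"a", {"x", "y"}},
};

// v is observable if it is instrumented, or if all of its
// dependencies are themselves observable.
bool observable(const std::string& v, const std::set<std::string>& instr) {
    if (instr.count(v)) return true;
    auto it = deps.find(v);
    if (it == deps.end() || it->second.empty()) return false;
    for (const auto& d : it->second)
        if (!observable(d, instr)) return false;
    return true;
}

int main() {
    std::vector<std::string> vars = {"q", "x", "y", "a"};
    std::set<std::string> wanted = {"a", "x", "y"};

    // Brute-force search over instrumentation sets; exponential,
    // matching the NP-completeness of the decision problem.
    std::set<std::string> best;
    bool found = false;
    for (unsigned mask = 0; mask < (1u << vars.size()); ++mask) {
        std::set<std::string> instr;
        for (std::size_t i = 0; i < vars.size(); ++i)
            if (mask & (1u << i)) instr.insert(vars[i]);
        bool ok = true;
        for (const auto& w : wanted) ok = ok && observable(w, instr);
        if (ok && (!found || instr.size() < best.size())) {
            best = instr;
            found = true;
        }
    }
    std::cout << "minimal instrumentation:";
    for (const auto& v : best) std::cout << " " << v;   // prints: q
    std::cout << "\n";
}
```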
Let me show you the results of some experiments. We aimed at studying two things: first, how much we can reduce instrumentation, and second, by how much this reduction affects the execution time of a program. Our case studies were some popular concurrent data structures, concurrent linked lists and red-black trees, and we took different implementations. There is a lock-based implementation and a non-blocking implementation. The lock-based implementation is by Tim Harris -- actually, I think Tim Harris works at Microsoft Research in Cambridge. The obstruction-free, non-blocking implementation is by [inaudible] Attiya, there is also a lock-free implementation, and there is an implementation using transactional memory. We took all these implementations to see how they behave when we instrument a set of variables in an ad hoc fashion versus when we instrument the variables optimally. So this is the reduction we gained using our method. For the linked list that uses nested locks, the original instrumentation had 43 instructions. The way we came up with this instrumentation for debugging was that we took the three variables defined in the program with the highest frequency and instrumented them: every definition of those variables is instrumented. Then we applied our method to find the minimum number of instrumentation instructions needed to extract the values of those variables. In the case of the nested-lock implementation, we went from 43 instrumentation instructions down to 20, which is pretty good. In some cases the reduction is less than 50 percent, in some cases more; on average, across the experiments we conducted, we gained a 45 percent reduction. So that was the effectiveness of our method. In terms of the effect on execution time, we conducted the following experiments. This graph shows the performance improvement factor against the I/O delay of the instrumentation. This is simulated; we didn't use different physical devices for instrumentation. One scheme is printf, another simulates logging to an EEPROM, another to the screen, another to disk. So these are different instrumentation schemes. For instance, in the case of linked-list instrumentation, the obstruction-free algorithm by [inaudible] Attiya shows the best improvement factor. In our experiment we took 100 insertion operations on the red-black tree, and for different I/O delays of the instrumentation, the improvement factor in execution time can go all the way to 50 times. We did the same type of work for sequential programs before we worked on multi-threaded programs, and we never reached this much improvement; a factor of 50 was something I never imagined. In some cases the improvement is less than 10 times, and that is where the instrumentation is inside a loop: when the instrumentation is in a loop and we cannot reduce it much, the improvement is not that large either. This graph shows the same improvement factor but for different numbers of insertions. We varied the number of insertions into the red-black tree or the linked list from 200 to 1000, and as you can see, in some cases the improvement factor goes all the way to 70. That means ad hoc instrumentation for debugging can slow down a program by as much as 70 times in a concurrent setting. So the summary of this part of the talk is that we looked at reducing the probe effect introduced by instrumentation instructions for debugging and testing purposes, and we tested our method on popular concurrent data structures such as linked lists and red-black trees. The problem is NP-complete, which means there is still a lot of work to be done on designing heuristics. What we did is map our problem to an SMT problem, but for large models it's still going to be intractable.
And in some cases we showed up to a 70 times gain in performance when we instrumented a program optimally using our method. For future work, we are looking at designing heuristics, at the notion of observability under weak memory models, and at optimization with respect to log size -- it's not always the case that we want to minimize the number of probes. For instance, in embedded systems we may have a logging device of only a certain size, and we can log things only up to some extent, so we want to log the things that make the program as observable as possible within that budget. And we're looking at probabilistic observability: finding a set of variables that makes another set of variables observable with the highest probability. So thank you very much for your attention. I'll be happy to take questions. [applause]