>> Tom Ball: I'm Tom Ball, and it's my pleasure to welcome Stefan Leue from Germany. He's a professor of computer science in the Department of Computer and Information Science at the University of Konstanz, where he holds the chair for software engineering. And he's interested in formal techniques for the design and analysis of complex systems, looking at model checking, safety analysis, system debugging and causality reasoning, which I think we're going to hear about today. So welcome. >> Stefan Leue: Thank you very much, Tom, for the introduction, and also thank you very much for the kind invitation. Glad to be here. And yes, we want to talk about causality checking today. First I want to say this is joint work with my Ph.D. student, Florian Leitner-Fischer, and also it is ongoing work, so work we are still shaping up. What is the setting of our work? We are interested, as Tom pointed out in the introduction, in the analysis of complex systems. Very typically, these systems consist of hardware components as well as software components, so they are very often referred to as embedded systems. Let me just pick an example here that I will use throughout the talk, where we are looking at a railroad crossing. This consists of components: first of all, we have trains that try to cross the road here. We have cars that also try to cross the same critical section of the road, and we have gates that need to be opened and closed in order to basically implement some protocol that ultimately avoids the dangerous situation where there are both trains and cars inside this railroad crossing. And let us assume that the gate control is managed by some piece of software, so this is in the end about the analysis of a complex embedded system that we are undertaking here. What we're going to use during the talk is a number of events that we shall consider, and let me just briefly introduce them to you.
So we have, first of all, a train that is approaching an intersection. Those who have ever looked at signaling systems in the railroad area know that there are automatic detectors that detect when trains are arriving and when trains are leaving an intersection. In fact, they're actually counting the number of wheels that pass there in order to make sure that, for instance, a train is not accidentally leaving a car on the railroad track. So they are highly sophisticated, but we're looking at this at a much higher level of abstraction. So we have an event that says that a train is in the crossing. TL is the event that says the train is leaving the crossing or has left the crossing. The car can be approaching, the car can be in the crossing, and the car can have left the crossing. And finally, the gate can be closing, it can be opening, and we also incorporate an identified failure state which says that the gate is actually failing. What we typically do next is come up with some behavioral model of the system, and this is typically given in some form of communicating finite automaton. I should point out this is a rather stylized idea of what the actual control is going to be. But what's important here is that the train as well as the car as well as the gate are represented by state machines, that these state machines synchronize via common events, and that, hence, what we consider to be the total model of the system is basically the concurrent execution of these three components. Now, what we also like to do in system analysis is apply a technique called model checking. Let me briefly recap what model checking is. We're starting out with a model of the system. This model is typically given, well, perhaps in UML or SDL or some related notation that is, in the end, interpreted typically as a transition system, which is sort of a variant of what logicians call a Kripke structure.
On the other hand, we have a specification of the requirements on the system, and those are typically given using assertions, temporal logic or automata. That will not be at the heart of this talk, but let us assume that we shall use temporal logic in order to specify the properties here. And finally, we have something that brings the model and the specification together in an algorithmic, systematic form, and that is some form of model checking algorithm or a model checking procedure that checks whether the model is actually satisfying the specification, which is indicated by this satisfaction relation symbol in between the M and the S. So mapping that to our example, we have on the left-hand side, as a model of the system, this collection of communicating finite state automata. We have on the right-hand side a requirement, in this case a safety requirement which specifies the absence of a hazard, as the functional safety experts would say, and it reads that there's never a train in the crossing at the same time when there's a car in the crossing. We can formalize that using a temporal logic formula phi, which says: always not (TC and CC). So the train and the car are never going to be in the crossing at the same time. That's basically the setting that we're using, and as I said, what we would like to do is check this relation between the model and the specification in a tool-supported, automatic, systematic way. And for that, we're using a model checker that is typically using either state space search, some form of symbolic fixed point computation, SAT solving or a number of different techniques. What we're actually going to focus on during this talk is the kind of model checking that focuses on state space search.
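To make the setting concrete, here is a minimal sketch, not from the talk itself, of checking the safety property "always not (TC and CC)" on finite traces. The trace encoding (a list of states, each a set of atomic propositions) and the function name are my own illustrative choices; the event names follow the talk.

```python
def satisfies_safety(trace):
    """Check G !(TC & CC): no state has both the train and the car in the crossing."""
    return all(not ({"TC", "CC"} <= state) for state in trace)

# A good trace: the car leaves the crossing before the train enters.
good = [{"CA"}, {"CC"}, {"CL"}, {"TA"}, {"TC"}]
# A bad trace: the gate fails and both end up in the crossing at once.
bad = [{"CA"}, {"TA"}, {"CC"}, {"GF"}, {"CC", "TC"}]

print(satisfies_safety(good))  # True
print(satisfies_safety(bad))   # False
```

A full model checker would of course evaluate this over all traces of the transition system, not over two hand-picked ones.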
What we are going to look at is explicit state model checking, which performs either systematic depth-first or breadth-first search on the state space in order to locate states that potentially violate the property that we have previously specified; those states are here denoted by these circled X's in the state space. What a state space search does is basically try to find a counter example, or an error path, to such a violating state. So what the model checker will return is something like a path that says, okay, when the car is arriving, the train is arriving, the car is in the crossing, the gate is closing, and then the train is in the crossing, then there's a potential for an accident, because the car hasn't left before the train is entering. And so we're in a potentially dangerous situation here. Now, that's pretty standard technology, and what we're getting as a result is error trails, and they can be presented in various formats. The interesting observation is that if one tells, for instance, a model checker like SPIN not to give you just one single counter example but actually to give you all of the counter examples, there will be a quite significant number of those coming up for the encoding of the railroad crossing example that we chose. It happens to be 47 traces, and they all have this shape here: there are some events in them that we have seen before, and there are numbers in them representing internal events of the model. What becomes clear is that such a sheer amount of data is very difficult to analyze in a manual fashion, in particular because what these traces have in common is that they lead to a property violating state, but they are just evidence for errors occurring. When we want to debug the model, then we're, of course, interested in what is intrinsically the cause for this violation happening, right?
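The idea of asking the model checker for all counter examples can be sketched as a depth-first search that collects every path reaching a violating state. The toy state space below is my own hand-made stand-in, far smaller than the SPIN encoding that yields 47 traces, but it shows the interleavings phenomenon: two orders of TA and CA lead to the same accident.

```python
# Toy event-labeled transition system: state -> list of (event, successor).
TS = {
    "s0": [("TA", "s1"), ("CA", "s2")],
    "s1": [("CA", "s3")],
    "s2": [("TA", "s3")],
    "s3": [("GC", "safe"), ("GF", "s4")],  # gate closes in time, or fails
    "s4": [("CC", "s5")],
    "s5": [("TC", "bad")],  # train enters while the car is still in the crossing
}

def all_error_traces(ts, init="s0", bad_states={"bad"}):
    """Collect every path of events that ends in a property-violating state."""
    traces, stack = [], [(init, [])]
    while stack:
        state, path = stack.pop()
        if state in bad_states:
            traces.append(path)
            continue
        for event, succ in ts.get(state, []):
            stack.append((succ, path + [event]))
    return traces

for t in all_error_traces(TS):
    print(t)  # two interleavings of TA/CA, both ending in GF, CC, TC
```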
Manual analysis, as I pointed out, is tedious, error prone and essentially impossible even for such a simple, not very complex model as this railroad crossing. So our goal is to come up with what we refer to as algorithmic causality computation, which means we want to compute causalities from the data that we can get out of the model in order to answer this question of what the cause actually is. That's basically what I would like to talk about today. The outline of my talk is going to be that we are first going to talk a little bit about models of causation that we can underlie this computation. I will then introduce an adopted structural equation model; I'll explain what that is in a minute. I'll introduce you to what we refer to as causality checking. I'll talk a little bit about an experimental evaluation that we did, and finally conclude my talk. Please, if you have any questions, of course feel free to ask them. Okay. Models of causation. What is a cause? In my mind, there is no really a priori definition of what one should consider a cause. There are various approaches that you can find, both in the technical and the philosophical literature, on the definition of causes. There's, for instance, statistics; mathematicians like to think that correlations say something about causes. There are various forms of causality reasoning, in particular in the philosophy of science. There is the idea of event structures in the analysis of concurrent systems. There's work by Lamport on his happened-before relation and his analysis of distributed systems. And there are many more ideas that seem to allude in some way or another to causality. When I'm saying that I think there's no a priori correct definition of causality, then what I think is much more important to worry about is the adequacy of a causality definition. So we want a causality definition that, in the end, gives us results that we can make a lot of sense out of.
Let us look at what we think is an interesting thought that goes in this direction, just to illustrate that these ideas are not really absolutely new. I would like to refer to David Hume back in the 18th century, who, in his Enquiry Concerning Human Understanding, reflects on what he believes causes are. There are some quotations here, and I would like to read out these few sentences. So he writes: yet so imperfect are the ideas which we form concerning it, that it is impossible to give any just definition of cause, except what is drawn from something extraneous and foreign to it. So he points out that a cause is not something so easy to define. He furthermore makes the observation that similar objects are always conjoined with similar: so similar things always lead to similar results. And then, and this comes closer to the heart of his idea of causality, he says: therefore, we may define a cause to be an object, followed by another, and where all the objects similar to the first are followed by objects similar to the second. So when something happens, or something very similar to it, then it always leads to something second happening in a very similar fashion. And this is actually something that one could refer to as sort of the positive part of the causal definition, which says that if something happened, it has a certain consequence. Then he continues to say: or, in other words, where, if the first object had not been, the second never had existed. This is what we will later on refer to as counterfactual reasoning, which says that, okay, if the first thing does not occur, then the second will not occur as well.
Now, this idea has been picked up by Lewis in the '70s of the last century with his definition of causal dependence, which is quite often referred to, also, by people worried about the philosophy of science and engineering. And he says that where C, which refers to a cause, and E, an effect, are two distinct possible events, E causally depends on C if and only if, if C were to occur, E would occur, and if C were not to occur, E would not occur. And again, here we've got these two ideas: first of all, the positive side of the causal definition, and secondly, sort of the counterfactual argument. >>: [indiscernible] so you cannot distinguish which is the cause and which is the effect. >> Stefan Leue: Yes. In some sense, you could argue that there is sort of a reversal. Typically, this is answered by the fact that one says causality is coinciding with temporal evolution, right? So the earlier one would always be the cause; the later one would be the effect. This is sort of the silent assumption when you make these arguments: one thing happens before the other, and that basically determines which one is the cause and which one is the effect. >>: Which doesn't account for common cause. >> Stefan Leue: Which means? >>: Suppose I had a B that causes both C and E. >> Stefan Leue: Right. >>: That's always the question in an experiment; you need to know that C is an independent variable. >> Stefan Leue: Right, right. >>: By making a randomized experiment. >> Stefan Leue: So it is clear that this argument by itself, and I shall have some illustrations for that, is not a sufficient way to capture all of the cause/effect relationships in the complex technical systems that we're faced with. By the way, this is often simplified by just saying C is causal for E if, were C not to occur, then E would not occur either. So very often, in counterfactual reasoning, one just refers to the second, counterfactual part of the argument. Okay.
So one consequence of this Lewis type of counterfactual reasoning is that it leads to a what-if analysis where we're looking into, as it's sometimes referred to, alternate worlds. So we kind of speculate: if the world had been different, if a different course of action had taken place, then something wouldn't have happened. Had there been another course of action, in which the gate had been closed before the car entered the crossing, there would not have been an accident. So this is what I will refer to as a good world, and I'm a little bit arbitrarily defining good and bad worlds. Good worlds are the ones where the effect does not occur, and the bad worlds are the ones where the effect occurs, because in the back of my mind I'll always have the idea that the bad thing to happen is the accident, right? So naive counterfactual reasoning, as I will refer to it, performs a positive and a counterfactual test for one event C and one event E, and I'll briefly critique some limitations of this naive counterfactual reasoning. Let us assume that we are introducing an event, LB, which says, okay, a stop light is broken, and that we want to consider a conjunction of events to be causal: the gate failing and the light being broken are something that we want to consider conjointly to be causal. Then we have good executions where only one of these two events occurs, and we have bad executions where both of these events are actually occurring. And if you apply the simple counterfactual test, then none of the causal events will be recognized, since they occur both in good and in bad worlds, right? So neither can be recognized as causal according to the naive counterfactual reasoning. A very similar kind of argument applies to disjunctions of causes. Here you have another event that you're introducing, IL: assume that there's a stop light in addition to the gate, and the driver is ignoring the stop light.
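The failure of the naive test on conjunctions can be sketched in a few lines. The encoding is my own: a world is the set of events that occurred, and a single event C passes the naive test for effect E only if it occurs in every bad world (positive part) and in no good world (counterfactual part).

```python
# Worlds for the conjunction example: the accident needs BOTH the gate
# failing (GF) and the stop light being broken (LB).
bad_worlds = [{"GF", "LB"}]        # both faults -> accident
good_worlds = [{"GF"}, {"LB"}]     # a single fault -> no accident

def naive_counterfactual(cause):
    """Naive test for a single event: occurs in all bad worlds, in no good world."""
    positive = all(cause in w for w in bad_worlds)
    counterfactual = all(cause not in w for w in good_worlds)
    return positive and counterfactual

# Each conjunct alone fails, because it also occurs in some good world.
print(naive_counterfactual("GF"))  # False
print(naive_counterfactual("LB"))  # False
```

So the naive test, applied event by event, misses the conjunctive cause GF-and-LB entirely, which is exactly the limitation discussed above.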
Then the disjunction of the causal events, the light being broken or the driver ignoring the light, leads to an accident. Actually, this gives you a good world where none of these events occurs, and bad worlds where either one of them occurs. And very much like in the previous case, none of these events will, with naive counterfactual reasoning, be recognized as actually a cause. There are some more limitations that naive counterfactual reasoning cannot account for. One is the non-occurrence of events, which says that TC, the train in the crossing, together with the fact that the gate is not closed and the car is in the crossing, is actually causal for an accident to occur. This is not something that you can really capture in the simple, naive application of the counterfactual argument. In addition, the ordering of events can be a distinguishing factor between bad and good worlds. So the train in the crossing, the car in the crossing, the gate closing, the gate opening, this is certainly a bad sequence, whereas the gate closing, the train in the crossing, the gate opening, the car crossing is a good sequence, consisting of the same set of events, but just in a different order, and the order, of course, determines whether it's a good or a bad one. And finally, there's something like relevance of events: for instance, the train engineer union decided not to call for a strike on the day the accident happened. You don't really want that to be a cause, even though it would pass the counterfactual test, because it's simply something you do not want to consider as being relevant in your reasoning. Okay. So as a consequence, this naive counterfactual reasoning is best suited to explain simple, single cause/effect chains. It is not suited for effects that have a more logically complex structure.
I just want to point out that there's a large literature, particularly in the philosophy of science, that tries to critique and modify counterfactual reasoning in various ways, and there are more cases than the ones I've presented here that are not adequately addressed. Now, we are considering an approach by Joseph Halpern and Judea Pearl that's called the structural equation model, which tries to address some of these points. In particular, it is aiming at a more complex logical structure and is trying to combine this idea with Lewis' counterfactual reasoning. Time does not, unfortunately, suffice to go into every detail of the basic Halpern/Pearl model, but let me just point out some of the key ideas that they propose. They are considering events as being represented by Boolean variables. They, on the other hand, consider events not only to be Boolean, but also to be possibly defined over arbitrary domains; a variable in the real world attaining a certain value is then mapped onto Boolean variables. They make a distinction between exogenous and endogenous variables, so they make it possible to define which parts of the real world are relevant for our consideration, our reasoning, and which ones are irrelevant. And they in general compute minimal Boolean disjunctions and conjunctions of causal events. They define a number of causality conditions: actually, they define the AC1 condition, which ensures that there exists a world where the Boolean combination of causal events, C, and the effect, E, actually occur. So that is the positive part of the counterfactual test. And then they have two conditions, AC2. The first one says that if at least one of the causal events does not happen, the effect E does not happen. That's basically the counterfactual argument. And then they say that if the causal events occur, the occurrence of other events cannot prevent the effect.
So when you have a set of events that you consider to be causal, you cannot have another event occur that makes the effect go away. And finally, they are imposing a minimality constraint: the causal events that you have identified are actually minimal in the sense that no subset of the identified causal events satisfies AC1 and AC2 at the same time. That's roughly the idea, and I would like to refer you to their paper for the complete formalization and motivation of their approach. We found that to be a very good first match for an application to our -- sorry? >>: Can we go back a slide? Maybe there should be a condition that says that E should not be in C, right? Because if you replace C by E there, E is a minimal event that's going to satisfy AC1, AC2 and AC3; the event is a cause of itself. >> Stefan Leue: Yes, yes. >>: So perhaps it should be other than E. >> Stefan Leue: They actually presume, and this is something that I'm not explaining here, some causal structure between events. So they presume some previous analysis in which they identify different events and in which they identify which events can possibly influence which other events. And it is clear that a cause needs to precede an effect in what they refer to as a causal network, right? So they do provide a little bit more structure than I'm explaining here. These networks are cycle-free, so things are not referring to themselves, okay? So there are a lot of benefits in this model, in that it considers Boolean combinations of events and the distinction between exogenous and endogenous variables. There are some shortcomings in the sense that there's no consideration of event orders as being causal factors, and we had pointed out that that's, in principle, important.
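The three conditions can be illustrated with a deliberately simplified sketch. The assumptions are mine: a purely conjunctive candidate cause Z represented as a set of events, worlds given as (events, effect-occurred) pairs, and no causal network or W-set machinery. Event order is ignored here, which is exactly the gap the talk goes on to close.

```python
from itertools import combinations

# Toy worlds: the accident happens only when the gate fails.
worlds = [({"GF", "CC", "TC"}, True),    # gate fails, both enter: accident
          ({"GC", "CC", "TC"}, False)]   # gate closes in time: no accident

def ac1(z):
    """AC1: some world where all events of Z and the effect occur together."""
    return any(z <= events and effect for events, effect in worlds)

def ac2(z):
    """AC2 (counterfactual part, simplified): some world where part of Z is
    missing and the effect is absent."""
    return any(not z <= events and not effect for events, effect in worlds)

def ac3(z):
    """AC3: minimality; no nonempty strict subset of Z also passes AC1 and AC2."""
    return not any(ac1(set(s)) and ac2(set(s))
                   for r in range(len(z))
                   for s in combinations(z, r) if s)

z = {"GF"}
print(ac1(z), ac2(z), ac3(z))  # True True True
```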
What's also a certain deficit: we like to think in terms of these transition systems, as I argued before, as being the type of Kripke structures that we like to reason about. And in particular, what we like to reason about are traces that describe the temporal evolution of the system in terms of computation steps. So we are basically adding these things to that model, and I would like to introduce how our adopted structural equation model actually looks. Let me introduce, first of all, some auxiliary means that we defined. We first of all defined something that we refer to as an event order logic. It is, in essence, some form of a linear time temporal logic with a certain limited expressiveness. It first of all refers to events, and it contains Boolean expressions that say that events are occurring at all, in the obvious interpretation. In addition to that, it expresses event order conditions. So while A and B up here just simply says that along some trace A and B do occur, A conjunctively-before B actually means that A and B both occur and A happens before B. We then also define interval operators. The one here says that A occurs until eventually B will hold in every state. The next one says that A always holds until eventually B occurs. And the third one basically defines an interval: in the interval defined by A and C, B always holds. We defined this logic because it gave us a convenient notation in order to later on specify the logical constraints that we consider to be causal. We do have a model theoretic semantics for that, and, of course, I don't want to go over the details here. I just want to point out that we defined what it means that a state satisfies a simple formula, which is just given by an event symbol, and that is done by saying, okay, if the system transitioned into the state via that event, then the state satisfies that formula.
And finally, by virtue of applying this semantics, we can also define what it means that a transition system satisfies some formula in this logic; namely, if there is a trace such that this trace satisfies the formula. As I said, the details of that semantics are pretty straightforward, and they are in our paper. So this gives us an opportunity now to basically represent traces, for instance such as sigma here, and to have them characterized by an event order logic formula, which means that basically a trace can be represented by, or can belong to, an equivalence class represented by such an event order logic formula. That is the kind of equivalence we are considering here. >>: What do you think about the difference between holds and occurs in these kinds of order conjunctions? Say, for instance, A conjunctively-before B means that B always holds once A occurs. What's the difference between occurs and holds? Does holds mean has occurred? >> Stefan Leue: Okay. Holds basically means that -- let me just go back. Okay. In every state, B will be satisfied here. And there is some state, until eventually B holds in every state, in which A actually holds. That's maybe a better way of putting it, right? Occurring means that in a number of states, up until the right-hand side is reached, eventually or at some point, A will actually hold. Yes? >>: [inaudible] logic some type of trigger for [indiscernible]. >> Stefan Leue: You probably can't represent it in LTL. >>: Do you have Boolean operators on top of the interval operators? The disjunction, and the interval operators as conjunction? Or do you only have Boolean operators on the atomic propositions? >> Stefan Leue: Only on the atomic propositions. >>: So basically, you're going to have one top level interval expression. So it's probably not first order expressive. >> Stefan Leue: No, it should be LTL expressive. >>: LTL is equivalent to the first order theory of -- >> Stefan Leue: Okay, yeah, yeah.
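The ordered conjunction, the core of the event order logic, is easy to make concrete on finite traces. This sketch uses my own encoding (a trace as a list of event names) and covers only the ordered-conjunction operator, not the interval operators; under the talk's assumption that each event occurs at most once per trace, "first occurrence" is simply the occurrence.

```python
def holds_ordered(trace, a, b):
    """A conjunctively-before B: both events occur and A occurs before B."""
    if a not in trace or b not in trace:
        return False
    return trace.index(a) < trace.index(b)

sigma = ["TA", "CA", "GF", "CC", "TC"]
print(holds_ordered(sigma, "GF", "CC"))  # True: gate fails before the car enters
print(holds_ordered(sigma, "CC", "GF"))  # False: wrong order
print(holds_ordered(sigma, "TC", "TL"))  # False: TL never occurs on sigma
```

A trace then belongs to the equivalence class of a formula exactly when `holds_ordered` style checks succeed for each ordered conjunct.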
>>: So I'm just wondering, can you express everything in LTL in your logic? I'm guessing not, because you don't have Boolean combinations over the operators. >> Stefan Leue: I think so too, yes. So let me just briefly go back. Yes. So how does our adopted model now look? Let us assume that we have a transition system given in the standard form, so it consists of a set of states, a set of actions, a transition relation, a set of initial states, atomic propositions labeling the states, and this labeling function. And let us assume that A is a set of event variables over the event types in Act. I should point out that we're doing one little trick here. We're at the moment not able to deal with repeated occurrences of events, so we solve this in a syntactic manner by distinguishing the first instance of the occurrence of some event A from the second, from the third, and so on. So every event basically occurs at most once along a trace. That's the limiting assumption that we're currently making. Let phi denote an LTL formula representing the non-occurrence of some property violation, so not phi is the violation, and it's often the effect that we're after. It's typically the accident, the bad thing happening. What we're currently doing is limiting our approach to reachability properties that we express by safety LTL formulas of that form. And then let psi be an event order logic formula consisting of event variables in some set Z, or, in other words, the set Z is the set of variables that actually occur in this formula. And we're saying that such a formula is actually considered a cause for the effect not phi if the following conditions that we're defining actually hold. So there's the condition AC1, which basically says there exists a sigma such that both sigma satisfies psi and sigma satisfies not phi.
So it's basically signifying psi to be the cause, and it's leading to the effect, where psi is, again, the event order logic characterization of what we consider to be actually causing the violation. This is what we refer to as the positive side of this counterfactual test. So we can, for instance, pick a candidate cause psi here; in this example it is given by this event order logic formula. It says that there's a train approaching before the car approaching, before the gate failing, before the car in the crossing, and before the train in the crossing. Then we need to pick some sigma satisfying this psi, and thereby, by the occurrence of the events, we define which events belong to our set Z, where Z is the set of candidate events that we consider to be causal. And then we have the set W, which is all of the events that are not occurring in the sigma. So in the example it can be trivially seen that we have for the set Z these events occurring here, and the set W is all of the events in the model that do not occur in this set. So basically, this is what we're starting out with: what positively happens and what causes the effect. Then we're moving to the AC2 definition, or condition. For the first part of it, let me maybe not go over the formalization here in all detail, but let me point out that what we're doing here is actually defining the counterfactual test by saying, okay, there exists a sequence sigma prime where the order and occurrence of events is different from the sigma that we originally considered, that's the sigma that passed the AC1 test, and phi is not violated on this sigma prime.
So when we're looking into the formal definition here, we're basically saying there is a sigma prime such that the sigma prime satisfies this causality constraint, and then we're basically looking in here and saying, okay, it is not leading to the error, and the valuation is different from what the valuation was in the original example. AC2(1) in our example is fulfilled by using sigma, the TA, CA, GF, CC, TC trace, which means the train is approaching, the car is approaching, the gate is failing, the car is in the crossing, and the train is in the crossing. There exists a sequence sigma prime -- this is the good sequence: train approaching, car approaching, gate closing, train in the crossing -- such that the valuation of some of the variables in the set W has changed, and the sigma prime is not leading to the effect that we had seen. So basically, that is the counterfactual test. It basically says there is an execution that does not lead to the negative effect. Okay. So what we're doing now is adding a further condition that expresses the following. It basically says that for a sequence of events to be causal, it cannot be possible to add an event so that causality is violated. So when you have a causal sequence, you want it to suffice by itself: if we add a further event to this causal sequence, we do not want the effect to go away. What that means, in essence, is that this condition is there in order to reveal that potentially the non-occurrence of events can be causal as well. Because if we add an event to a sequence and that makes the causality go away, then it means, in our setting, that in the previous version, when this event was not there, the absence of the event itself can be considered causal. I'll have an example for that in a minute. And that's actually encoded in the formal definition up here.
So in that sequence sigma double prime that we're providing, the valuation of the positive events is as in the originally considered sequence sigma, and the valuation of the negative events is different from what it was in the originally considered sigma. That basically means that for all executions where the events in Z have the original value, as defined by the valuation of the variables in sigma, the values of an arbitrary subset of events in W, the non-occurring events, have no effect on the violation of the property. That's basically what we have to check. So if we consider sigma double prime to be the sequence where the train is arriving, the car is arriving, the gate is failing, the car is crossing, the car is leaving the intersection and the train is getting into the intersection, then suddenly the bad effect would go away, because the car has actually left the intersection, of course, before the train is entering. So the premise of the formal condition for AC2(2) is satisfied; however, for this sigma double prime, the property phi is not violated, as I explained, since the car had left. So the consequence is that this formula is not causal, because the AC2(2) condition failed, and the non-occurrence of this event is causal and basically needs to be added to the causal formula. So what we need to do is find a minimal set of causally non-occurring events and add that to the formula, and I'll talk a little bit about how we're going to go about that. There's a third condition that we add, which is that the formula describing the cause is minimal in the sense that no subset of this formula, no partial formula, actually satisfies the conditions AC1 and AC2. So we can't take any part of the causality away and still establish the causality. So how do we go about capturing this causality of non-occurrence?
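The AC2(2)-style observation can be sketched directly on traces. The encoding is my own simplification: an accident happens exactly when the train enters while the car is still in the crossing, and inserting the extra event CL into the causal sequence makes the violation disappear, revealing that the absence of CL was itself causal.

```python
def leads_to_accident(trace):
    """Accident iff TC occurs while the car is in the crossing (CC seen, no CL yet)."""
    car_in_crossing = False
    for event in trace:
        if event == "CC":
            car_in_crossing = True
        elif event == "CL":
            car_in_crossing = False
        elif event == "TC" and car_in_crossing:
            return True
    return False

sigma = ["TA", "CA", "GF", "CC", "TC"]            # original bad sequence
sigma_pp = ["TA", "CA", "GF", "CC", "CL", "TC"]   # sigma'' with CL inserted

print(leads_to_accident(sigma))     # True
print(leads_to_accident(sigma_pp))  # False: so not-CL must be added to the cause
```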
What we do, basically, is consider all of the event variables in some such sequence sigma double prime that contains a non-occurring event, and we add, depending on the position of the event variable in the formula, an additional term that expresses that the event is not occurring: this formula if it is at the beginning of the causality formula C, this formula if it's at the end of C, or, if it's in the middle of C, we accordingly add such a term to the formula. And then we perform the test for AC2 again. So in the example, we would get, for instance, this expression where we're saying, okay, train approaching, car approaching, gate failing, car crossing, and then it's not true that the car is leaving in between the car in the crossing and the train in the crossing, and this now leads to such an undesired effect. Okay. How do we capture the notion, which was not included in the original structured equation model, that the causality of event order is actually relevant? For some event order logic formula C, what we're doing is replacing the conjunction by the ordered-conjunction operator, which yields a formula C-psi. And so what we're now saying is: consider some C, an event order logic formula over a set Y, a subset of the events that we consider to be candidate events for the causality. The order expressed by psi is not causal if sigma satisfies C-psi and there is a sequence sigma prime among the bad traces such that sigma prime does not satisfy C-psi but sigma prime does satisfy the formula where we have given up the order constraints. So in the example, for instance, the order of the events gate failing, car in the crossing, not car leaving, and train in the crossing is important for causing the accident that we're considering. The relative order of TA and CA is not important, however, but they need to precede the above events. 
And the resulting formula, then, is this one that expresses the causality that we're computing here. Okay. So that is the set of definitions that we have come up with in order to adapt the Halpern and Pearl model for causality to our checking. Now, our goal was also to come up with a mechanization of that idea, to cast it into an actual algorithm in order to do this analysis. What we will exploit in our analysis is the fact that we need alternate worlds in order to reason about counterfactual arguments, and these alternate worlds are given by the model-theoretic semantics of the system models, in the sense that we can define traces that lead to errors and traces that do not lead to errors. And we will do that using capabilities that are provided by the model checkers that we're using -- the explicit-state model checkers that we're currently considering. How does that work? The traces that the model checkers compute do actually define the alternate worlds. We can, first of all, compute with a model checker the set of the bad traces, which are all the counterexamples that lead to a property violation. Let us assume that we can modify the search algorithm in such a way that it will give us all of the traces that lead into bad states. On the other hand, to compute alternate worlds, we can compute the set of good traces by having the model checker search the state space and put any trace that does not lead to an error state into the set of good traces. The search will terminate either when we are trying to close a loop, or when the search depth has been reached, or when a final state has been reached. Please? >>: Do you have some ordering on the good traces, or I guess they're all independent? >> Stefan Leue: I mean, they have a prefix structure, right. 
We do build up some order, as I will show you. >>: There's the assertion of doomed to failure, and I guess there's probably doomed to goodness at some point. No matter what choice you make, you're going to be good, so it's just a -- >> Stefan Leue: Right, okay. So you can -- I'm not sure whether this graphic shows that, but let us assume that, for instance, this is a sub-tree that has only blue arcs, blue edges. Then that is something where you will never again run into a failure, right. >>: Are you going to enumerate all those? >> Stefan Leue: We are enumerating those. I'll explain a little bit about how we algorithmically solve that, okay? So the key idea is to explore the state space using a depth-first or a breadth-first search and collect the bad and good traces: the ones that lead to the property-violating states, and the ones that do not lead into property-violating states. And as I said, this is currently only applicable to reachability properties. I should also point out that it applies to those kinds of properties where, after a property violation has been detected, no meaningful behavior ensues. If you look at some safety properties where, after a property violation is reached, the system may continue to work, there may be another, for instance, assertion violation reached with, again, some causal sequence of events. Those kinds of properties we do not consider. We consider those that are reachability properties, where what happens afterwards is either nothing or not really meaningful for the analysis at hand. Okay. We do define, as an auxiliary construction, notions of sub-executions, and those are the operations that I'm illustrating here. They basically say whether traces are sub-traces of other traces, and whenever you see a dot to go with such an operator, it means they are sub-traces of each other in an ordered manner. 
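The two flavors of sub-execution test can be sketched directly. The operator names and exact semantics here are assumptions for illustration; the talk's "dotted" operators are read as order-preserving subsequence tests.

```python
# Sketch of the sub-execution operators: an unordered sub-trace test and
# an ordered ("dotted") one. Semantics here are illustrative assumptions.

def sub_unordered(sub, trace):
    """Every event of `sub` occurs somewhere in `trace` (order ignored)."""
    return set(sub) <= set(trace)

def sub_ordered(sub, trace):
    """`sub` occurs in `trace` as a (not necessarily contiguous)
    subsequence, i.e. in the same relative order."""
    it = iter(trace)
    return all(ev in it for ev in sub)

bad = ("TA", "CA", "GF", "CC", "TC")
assert sub_ordered(("TA", "GF", "TC"), bad)
assert sub_unordered(("GF", "TA"), bad)
assert not sub_ordered(("GF", "TA"), bad)   # wrong relative order
```

The `ev in it` idiom consumes the iterator as it searches, so each event of `sub` must be found strictly after the previous one.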
And what we were able to show is that we can reduce the checks for AC1 to AC3 and OC1 to sub-execution tests; the proofs for that are in the paper, so I won't show you those here. However, let me show how we are actually going to do that in our construction. We have two implementation variants. One is an offline enumeration, where we're actually enumerating all of the traces, storing the sets of the good and the bad traces, and then performing these sub-trace computations in an offline step, which, of course, leads to considerable storage requirements, memory requirements. The alternative is an on-the-fly method, where we're basically using a depth-first or breadth-first search on the state space and storing the paths in an adequate data structure as we obtain them. The data structure that we're using is the subset graph, which looks as follows. Basically, the graph stores the traces and categorizes them, and it stores them in an order where you see the shortest traces up here at level one, and the longer traces the further you go down in this tree. So the levels correspond to the lengths of the traces, and connections between levels mean that the lower-level trace is a sub-trace of the higher-level trace. So we connect them whenever one trace is a sub-trace of the other one. This is currently the unordered consideration, okay? Now, we have different types of nodes in this graph. We have, first of all, the green nodes. The green nodes basically say that the trace is in the set of the good traces, so it cannot be a trace that by itself is causal. What's also important for a node to be green is that all nodes on the level below that are connected with it are also colored green, and these traces are either prefixes of good or bad traces. By the way, there's something a little bit awkward: I'm saying "on the level below." 
Of course, that means going upwards in the chart. A red node is now a trace that is in the set of bad traces, and all nodes on the level below that are connected to it are actually green. So it's sort of the first trace in an evolution that is turning from a good trace into a bad trace. They are the shortest bad traces found, so they satisfy the minimality constraints, and they are considered to be candidates for being causal traces. The black nodes are good execution traces, but at least one node on the level below that they are connected with is colored red. So they are longer traces. They are good traces, but they have a sub-trace that is a bad trace. And this basically means that they identify one event that turns bad into good. That hints at the non-occurrence-of-events check, the AC2(2) check: it can be carried out by comparing these nodes and their predecessors. So, for instance -- and I shall return to this example in a minute -- the car-leaving event here is basically what makes the difference between these two traces. I'll look at that in more detail in a minute. And then we have the orange nodes. They represent a bad execution trace, and at least one node on the level below that is connected to this orange node is colored red. They are bad traces. They do not satisfy the minimality conditions, so in essence we do not necessarily need to consider them, except perhaps for some special purposes. Okay. So we're able to prove some theorems for our adaptation of the structured equation model of Halpern and Pearl. We can say that an event order logic formula C-sigma that's derived from a red node containing the trace sigma fulfills AC1, AC2(1), and AC3. For breadth-first search, we can say that this is fulfilled immediately, because we know it's always the shortest trace that we're reaching there. For depth-first search, we basically can show it only when the search terminates. 
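The node colouring just described can be sketched over a tiny subset graph. The helper, the one-event-shorter notion of "level below", and the example traces are all illustrative assumptions, not the tool's data structures.

```python
# Minimal sketch of the subset-graph colouring: green = good, red =
# shortest bad, black = good with a red sub-trace, orange = non-minimal
# bad. "Level below" = sub-traces one event shorter. Assumed encoding.

def is_sub(sub, trace):
    """Ordered subsequence test."""
    it = iter(trace)
    return all(ev in it for ev in sub)

def colour_graph(traces, bad):
    colours = {}
    for t in sorted(traces, key=len):                  # shortest first
        below = [colours[p] for p in traces
                 if len(p) == len(t) - 1 and is_sub(p, t)]
        if t in bad:
            colours[t] = "orange" if "red" in below else "red"
        else:
            colours[t] = "black" if "red" in below else "green"
    return colours

prefix = ("TA", "CA", "GF", "CC")                 # good prefix
red    = ("TA", "CA", "GF", "CC", "TC")           # shortest bad trace
black  = ("TA", "CA", "GF", "CC", "CL", "TC")     # good: CL added
colours = colour_graph([prefix, red, black], bad={red})
assert colours[prefix] == "green"
assert colours[red] == "red"
assert colours[black] == "black"
```

Diffing the red node against its black successor names the event (here CL) whose non-occurrence is causal, which is exactly the AC2(2) comparison described above.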
But in depth-first search, when we add a trace to the graph, we don't know yet whether it's a minimal one, right. So we have to explore completely there. The construction of the subset graph can be done by basic state space search. And once the state space search is complete, we have to perform the non-occurrence-of-events test and the order condition tests on the obtained structure. So what are the inferences that we can draw from the graph? First of all, we need to consider all of the red traces as candidates for causality, and then we need to check whether the AC2(2) condition holds, which means we determine whether the non-occurrence of events is causal. So we have a red execution -- for instance, here, at level 5, the execution where we have the train approaching, the car approaching, the gate failing, the car in the crossing and the train in the crossing -- and then we have at the level above the black execution, which has an additional event, and that additional event basically means that the negative effect goes away. We can then identify that this event, car leaving, is actually the one that's responsible for turning the bad trace into a good trace. And so we conclude that in the causal execution here, the non-occurrence of the car-leaving event is actually responsible for that. Right? We need to perform that not just for single events; we also need to check the situations where it's multiple events that are not occurring that turn the one into the other, and identify them as well. >>: So, in principle, you could have a disjunction of negative occurrences placed at different points in the sequence, right? You could say, well, some event between the TA and the CA would prevent the failure. Or a different event, between GF and CC, which would prevent the failure. Is that expressible in your logic? 
I mean, can I actually -- if my level 5 event there had several level 6 events, a level that had negative events in different places, is that expressible, or am I mistaken there? >> Stefan Leue: I believe it should be expressible, but let me check offline. >>: Leaving aside my particular example, it's possible that you could wind up with a situation where you just don't have a causal explanation that is expressible in the logic, right? >> Stefan Leue: When you don't have a -- >>: For a particular set of traces, it might be that I have nothing that satisfies all of those AC conditions. >> Stefan Leue: Um-hmm. >>: So, yeah, there's not some kind of completeness result that says I always have an explanation. >> Stefan Leue: No. In particular because, I mean, you're only getting those traces that the model permits, either in the bad or in the good case, right. You're not getting all possible combinations and orders of events happening, but only the ones that the model permits. So I believe that that's an argument why you couldn't have such a completeness result. >>: I mean, I don't think you would want to have a completeness result, because you want your logic in some way to express some domain knowledge, right. You want it to be incomplete, and you want it to focus on certain kinds of explanations as being more likely than others. >> Stefan Leue: Right, um-hmm. Okay. >>: So if you have this gate-broken event, you could also have other dependent events or other events that might cause a [indiscernible] -- like, you know, if there's a human on site to fix the gate, they could also stop the car. >>: Or the car decides to back up, you know. >>: There are so many other things that aren't described; if you put them in there, it could get much more complicated. >>: So the causal execution could get more complicated. 
As you say, you have many different events that disjunctively could prevent, you know, the bad thing from happening. >> Stefan Leue: Um-hmm. >>: A meteor arrives from outer space or something. There could be different things that could happen. So I was wondering sort of how expressive is my ability to, you know, describe these causal executions. >>: Did you want these things to be minimal in some sense, so it can't get overly complicated? >>: Presumably, you want the simplest explanation, right? >> Stefan Leue: Well, that's what the AC3 condition says. You want the minimal causal events, so -- >>: So here you have a causal execution, and you may have many of those in the graph. So I'm assuming that you want to in some way -- >> Stefan Leue: That is disjunctive, yes. >>: Put it together and say, okay, here I have this disjunction, but really I want to break it down to some simple formula. >> Stefan Leue: Okay. I'll have an example in a minute that shows a fault tree that is exactly this disjunction. You're right, it would be desirable to come up with some sharing of common parts of the different disjuncts, but we're not currently doing that. We're basically producing these disjunctions. I hope that will become clear when I show an example in a minute, okay? One observation, which is somewhat related to the question of how efficiently this works in terms of storage, is that you need to store the black executions, which are these plus-one-event traces that turn bad into good. They're only necessary if you actually want to compute the non-occurrence of events as being causal. So if you want to over-approximate the problem by negating that, then you need to store a lot less. The event order test, which also needs to be carried out, basically looks at the traces, the red ones that are at one level, and tries to determine whether the order between them is relevant. 
So we're basically computing the order of all of these traces, looking at the partial orders that they define, and seeing whether there are any events that remain unordered. So, for instance, in this example, we compute that for TA and CA the order is irrelevant, because there are red traces at the same level with TA, CA and with CA, TA, so their order cannot be relevant. And what we compile, then, is a causal formula of this form where we're saying, okay, their relative order is not relevant, but they need to be ordered with respect to these events later on in the trace. >>: Does that only correspond to the first and the third red box, or does that correspond to all three red boxes then? >> Stefan Leue: That corresponds to all three red boxes. >>: Because in the second one, the gate failing is in between the two approaches. >> Stefan Leue: That's correct. Okay. That's correct. Thanks. There may be something wrong here, yep. I'll have to check. >>: [indiscernible] what gate failing means. >> Stefan Leue: Basically, it's one event. And actually, this formula says that TA and CA do occur, their relative order is not important, but they happen before the gate fails. And so here, I have to point out -- >>: [indiscernible] first, then that TA, CA does occur before GF. It's just, it starts -- it really -- >> Stefan Leue: It depends on the semantics, but honestly, I have to check. I have to go back to the definition of the logic and make sure that there's no mistake in here. Thanks. So there are two ways of implementing this, with BFS or with DFS. >>: Can we go back? So it might be an invariant of this system that you cannot have the car crossing unless it [indiscernible] first. It's an invariant of the system that if CC happened, then CA happened. So it might be redundant to actually give the TA and the CA there. If I'm a user of the system, I might know about this invariant. So if you're asking why it happened, [indiscernible] talk about TA and CA. 
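The order test just described can be sketched by comparing event positions across the red traces at one level: pairs that occur in both orders are reported as order-irrelevant, pairs with a fixed order go into the ordered part of the causal formula. The encoding below is an illustrative assumption.

```python
from itertools import combinations

# Sketch of the event-order test over red traces at one level: which
# event pairs always occur in the same relative order, and which remain
# unordered? Illustrative encoding, not the tool's implementation.

def order_relations(traces):
    fixed, unordered = set(), set()
    for a, b in combinations(sorted(set(traces[0])), 2):
        orders = {t.index(a) < t.index(b) for t in traces}
        if len(orders) == 1:
            fixed.add((a, b) if orders.pop() else (b, a))
        else:
            unordered.add(frozenset((a, b)))
    return fixed, unordered

red_level = [("TA", "CA", "GF", "CC", "TC"),
             ("CA", "TA", "GF", "CC", "TC")]
fixed, unordered = order_relations(red_level)
assert frozenset(("TA", "CA")) in unordered   # TA/CA order irrelevant
assert ("GF", "CC") in fixed                  # gate fails before car crosses
assert ("CA", "GF") in fixed                  # both approaches precede GF
```

For simplicity, only two of the talk's three red traces are modeled here; adding a trace with GF between the approaches would move the CA/GF pair into the unordered set, which is the subtlety the questioner raised.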
Say the gate failed, the car came, the [indiscernible] -- so those four would explain it. But those kinds of reductions are not [indiscernible] check. >> Stefan Leue: No, no. It's not currently included to reason semantically about the model and to infer, for instance, an invariant. It would be interesting -- >>: Checking, right. So if you want the most [indiscernible] form of all the bad things, it's just the property you want to check. That's it. So that's what you're getting at. You have the model and a property formula, right, the property to be checked. All the bad things are things that violate the property. That's the most minimal thing that you could hope for, right? >> Stefan Leue: That's the -- you're saying -- >>: But now you have these traces that go to the bad state, and somehow you want to summarize them as a concise explanation of what they all have in common. Of course, what they have in common is that they are all bad traces. That's what you're aiming at. So if you hide many of these things that go to the bad state, then you do not have an explanation of the traces. I don't know how you reconcile the tension, basically. >> Stefan Leue: How I reconcile? >>: The tension. I mean, if you want the most precise form of what bad traces are, they are just traces that go to a bad state. >>: But TA and CA could be redundant. They could be a common [indiscernible]. >>: Yes, but it sounds to me like [indiscernible] way of explaining bad states is basically the definition of a bad state. That is, you have a car crossing and a train crossing; that would be the minimal, perhaps, definition of a bad state. See what I mean? I'm sorry to [indiscernible]. >>: There's an aspect of linguistic simplicity here. >>: You're looking for a simple explanation of causality in some language, and the language itself is providing a certain deductive bias, saying some explanations are better than others. 
I mean, any event that you can describe about a system is synthetic, right. In the real world, it's a disjunction of an infinite set of events. So clearly, what you choose to describe as an event is going to determine your notion of causality. We're saying that's given: my set of events that I've been working with is given, my set of relationships that I can consider as macro events is given, and within that biased language, I want to pick the simplest explanation that fits these criteria. So I think that that makes sense. I don't quite understand what partial orders between events can be expressed here. It seems like sort of series-parallel graphs or something, because I can say, well, TA and CA are unordered, but they come before GF. >> Stefan Leue: Right. >>: So can I express -- is it only series-parallel relations that I can express that way, or are there other partial orders that can be expressed? >> Stefan Leue: I believe that you could express all partial orders, but let me go back to the definition, okay? There's an issue that we also discussed, which is that, for instance, if you have some initialization sequences, they occur again and again on the bad and the good traces, right. We currently consider them to be part of what Halpern and Pearl would call the causal process that leads up to the effect, but we currently don't have any internal structure by which we can identify some of the events as being in a causal hierarchy higher up than other events, as being root causes of other events. This would be a refinement of this model, but it's not actually currently in there. And the same thing about invariants: we do not currently identify sort of invariant orders. >>: I have a question [indiscernible] -- >>: You can obviously write conditions on bad states that are unsatisfiable. But if you write a condition that is not possible, then it isn't an explanation of anything. 
I mean, it is [indiscernible]. But here the set of events is fixed. Given the fixed set of events, you want the most concise formula possible. If the set of events is just what the property [indiscernible], that's it. But if you want to have a bigger set of events that may lead from the initial state, then this is more general. Does that make sense? >> Stefan Leue: Okay. So for causality checking with BFS: we're doing a breadth-first search, and basically whenever a good or bad execution is found, we're adding it to the subset graph, and we're obtaining these traces by iterating backwards through the predecessor links. I think that's a quite standard technique. We recognize a duplicate when the new trace is the same length -- it may have a different order -- and when the new trace is longer, we store it for the later OC1 test, the order constraint test, basically. For causality checking with DFS, we basically add the trace to the subset graph whenever a good or a bad trace is found; even a prefix counts as a good trace. So the traces don't necessarily need to be maximal when we put them into the graph structure. And when we find a duplicate there, then again we generate all traces formed by the new prefix and all suffixes of the duplicate, and we add those to the subset graph. Okay. So, complexity. We have not done a complete complexity analysis of this yet. There is a caveat: Eiter and Lukasiewicz proved that for the Halpern and Pearl structured equation model, even for one with only binary variables, computing the causal relationship between events is NP-complete. However, they also showed later on that for cycle-free causal dependencies, computing causal relationships can be done in polynomial time, and that's actually what we have. We don't have any circular dependencies in our model. But, again, this is only part of the complexity analysis. 
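The BFS variant of the good/bad trace collection can be sketched on a toy transition system. The transition relation below is a made-up miniature of the railroad model, and the deadlock/depth-bound termination rule is an assumption based on the talk's description.

```python
from collections import deque

# BFS sketch of good/bad trace enumeration: traces reaching an error
# state go into the bad set; maximal non-error traces (deadlock or depth
# bound) go into the good set. The toy model below is an assumption.

def enumerate_traces(init, transitions, error_states, max_depth=10):
    good, bad = [], []
    queue = deque([(init, ())])
    while queue:
        state, trace = queue.popleft()
        if state in error_states:
            bad.append(trace)
            continue
        succs = transitions.get(state, [])
        if not succs or len(trace) >= max_depth:
            good.append(trace)            # deadlock or depth bound reached
            continue
        for event, nxt in succs:
            queue.append((nxt, trace + (event,)))
    return good, bad

transitions = {
    0: [("TA", 1)],                 # train approaches
    1: [("GF", 2), ("GC", 3)],      # gate fails or gate closes
    2: [("CC", 4)],                 # gate open: car enters crossing
    3: [("TC", 5)],                 # gate closed: train passes safely
    4: [("TC", 6)],                 # train enters with car still inside
}
good, bad = enumerate_traces(0, transitions, error_states={6})
assert good == [("TA", "GC", "TC")]
assert bad == [("TA", "GF", "CC", "TC")]
```

Because the search is breadth-first, the shortest bad trace is found first, which is what lets the BFS variant establish the minimality condition immediately.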
Of course, we are paying some penalty compared to simple model checking for all of the storing of traces and for the comparison operations that determine the orders. But that remains to be analyzed in terms of complexity. In terms of experimental evaluation, we have a prototypical implementation of these algorithms called SpinCause, and it is part of the QuantUM tool architecture. QuantUM is a tool that's designed to do functional safety analysis for basically UML and SysML type models, and we go either via PRISM to do some probabilistic model checking, or via Promela and SPIN, which is what I've illustrated here, for purely functional properties, and then here the causality checker, which computes these causalities, and we finally visualize them using fault trees or, alternatively, UML sequence diagrams. In particular, fault trees are very suitable here, and they are a standard engineering practice notation. This is the fault tree that we're getting for the railroad crossing example, and it actually shows the disjunction of two of the formulae that we have computed. Again, it may be possible to do more sharing of events, to do minimal cut set analysis and the like; this is not yet something that we have included. We're using these fault trees at the moment merely as a visual notation, as a visual representation for these orders. By the way, the order constraints that we can express here in the event order logic are richer than the orders you can express with the so-called dynamic fault trees, which simply say this event happens before that event and so on. So we're using a logic that is more expressive than what you can express in fault trees, and that's why we are providing this as a side annotation there. For the second case study that we've carried out -- this first one, certainly in terms of its size, is a toy example. 
The airbag control unit is actually an industrial-strength model that models the architecture of an airbag control system, and we analyze the hazard of the airbag deploying inadvertently without a crash actually occurring. We've done this analysis and obtained this fault tree here. Unfortunately, I can't go into the details of what this model actually says, but let me just give you some statistics. It's a decently-sized model with over 20,000 bad executions, and we do compute order constraints. This is the order constraint that corresponds to the last disjunct of the explanation in the fault tree. >>: Is this the full fault tree for the -- >> Stefan Leue: This is the full fault tree for that model. >>: 20,000 executions -- is there 20,000 ways in which they can -- >> Stefan Leue: Yes, in which it can inadvertently deploy -- ones that are not the intended ones, um-hmm. Okay. We could probably do some more sophisticated, domain-specific causal analysis here, because when you get a property violation of the model, this can be either because you have modeled both the correct and the failure behavior and the failure behavior led to the property violation, or because your normal-behavior model was incorrect, right. What we did here was actually include the failure behavior in the model as well -- the anticipated component failures -- and basically this fault tree represents all events that are not bugs that the modeler, the designer, introduced, but anticipated faults of the actual system, right. In that sense, it's a little bit different from the previous example, where we did basically model debugging, if you like. But it's applicable to both settings, right. And it's a strategy in functional safety analysis to come up with models, for instance in SysML, that contain both the normal operation behavior as well as the error behavior. Okay. Please? >>: [inaudible]. >> Stefan Leue: I have a slide on that, okay. 
So it's being carried out on a compute server that we have in our lab. As I mentioned, we have two approaches: the on-the-fly approach that I just described, and the offline approach. In the offline approach, we first precompute all of the traces and then do the analysis, whereas here we're doing it on the fly. If we look at the airbag case study, we can see that -- okay, sorry. MC is the pure model checking. CC1 is the causality checking with the AC2(2) test -- the test for the non-occurrence of events being causal -- disabled. And CC2 is where that test is enabled. So we're seeing, first of all, if we compare depth-first search with breadth-first search, both in terms of run time and in terms of memory consumption, that the breadth-first search has advantages. That is because we need to generate fewer traces: we can rely on the shortest-trace assumption and throw away the irrelevant traces that we don't need to store more effectively when we're doing a breadth-first search. It also shows that there's some gain if we disable the AC2(2) test step. What we can also conclude is that the on-the-fly approach somewhat outperforms the offline approach, both in terms of run time and -- sorry, here, this is what I should compare -- both in terms of run time and in terms of memory. That is clear because we don't need to store all of the traces; we can really avoid storing longer-than-necessary traces. So the ideal combination seems to be to combine the on-the-fly approach with breadth-first search in the analysis. This is what I mentioned. There's something that I think was included in the abstract but that I did not really get around to addressing, which is doing this with probabilities, which engineers like to see in functional safety. 
We have the whole story also implemented for the PRISM probabilistic model checker, where we can attach probabilities to the system going into an error state, and we are there able to compute the fault tree also with probabilities, where this is basically the probability of what's called the top-level event, which is the hazard that corresponds to the inadvertent deployment. I should say that at the moment we only have the offline version of this probabilistic approach implemented, but we plan to do the on-the-fly version soon as well. Okay. So what's the conclusion? Very briefly, I showed you a technique that actually complements model checking. The aim is algorithmic support for the debugging of models. We defined an adaptation of the structured equation model, proposed an implementation, and showed that it's applicable to non-trivial case studies; certainly, at the case study level more needs to be done, and we want to do more comprehensive tests. What's future work? In general, there is this question of causality checking at the limits of scalability -- we discussed this over lunch also -- when you can't explore the whole state space. What can you do? Can you define equivalence classes, partial-order-type reductions, that give you all of the interleavings without having to explore all of them, and so on and so forth? How does it work together with abstractions? Causality checking in a symbolic environment would also be interesting to look at. Two weeks ago I talked to [indiscernible], and he thinks that with his system he can do most of the things we're doing here in a symbolic setting, except that he can't work with orders and with non-occurrence of events. But still, I think it's interesting to look into that. Then there's on-the-fly causality checking for probabilistic models, and specific adaptations to functional safety analysis -- as I said, minimal cut sets, root causes, common and cascading causes. 
Very briefly, I'm listing some related work here. This is, of course, not all of the related work; there's a large amount of literature on explaining counterexamples. Tom, you did some work on that a number of years ago; Alex Groce did work on that, and so on. But these are works where Halpern and Pearl's model for causality was taken and embedded in some sort of computational environment. There's work done at the IBM Haifa labs where they compute, along a counterexample, when a complex LTL formula can no longer be satisfied, and they refer to basically that model of causality. There's work by [indiscernible] and others on what causes a system to satisfy a specification, which relates the structured equation model by Halpern and Pearl to model coverage and defines a notion of relevance of model components. And there's work that was actually done relatively locally here, which we found out about only about ten days ago -- tracing data errors with view-conditioned causality -- by authors from the University of Washington as well as one co-author from Microsoft Research, that basically tries to relate surprising query results to errors in the query data. It uses counterfactual reasoning and SAT solving. Okay. With that, I would like to thank you for your interest and patience. I'm glad we already had extensive discussions, but if there are more questions, I'm happy to take them. >> Tom Ball: Thanks very much. I think we'll end it there. If you have more questions, one on one, he's here for the whole week. So if you want to set up some more time, please let me know. >> Stefan Leue: Thank you.