>> Tom Ball: I'm Tom Ball, and it's my pleasure to welcome Stefan Leue from
Germany. He's a professor of computer science in the Department of Computer
and Information Science at the University of Konstanz, where he holds the
chair for software engineering. And he's interested in formal techniques for
the design and analysis of complex systems, looking at model checking, safety
analysis, system debugging and causality reasoning, which I think we're going
to hear about today.
So welcome.
>> Stefan Leue: Thank you very much, Tom, for the introduction and also thank
you very much for the kind invitation. Glad to be here.
And yes, we want to talk about causality checking today, and first want to say
this is joint work with my Ph.D. student, Florian Leitner-Fischer and also it
is ongoing work so work we are still shaping up.
What is the setting of our work? We are interested in, as Tom pointed out in
the introduction, in the analysis of complex systems and very typically, these
systems, they comprise -- they consist of hardware components as well as
software components. So very often referred to as embedded systems. And let
me just pick an example here that I will use throughout the talk where we are
looking at a railroad crossing. And this consists of components, first of all
we have trains that try to cross the road here. We have cars that also try to
cross the same critical section of the road, and we have gates that need to be
opened and closed in order to basically implement some protocol that ultimately
avoids the dangerous situation where there are both trains and cars inside this
railroad crossing.
And let us assume that the gate control is managed by some piece of software,
and so this is in the end about the analysis of a complex embedded system that
we are undertaking here.
What we're going to use during the talk is a number of events that we shall
consider, and let me just briefly introduce them to you. So we have, first of
all, a train that is approaching an intersection. Those who have ever looked
at signaling systems in the railroad area know that there are automatic
detectors that detect when trains are arriving, when trains are leaving an
intersection. In fact, they're actually counting the number of wheels that pass
there in order to make sure that, for instance, a train is not accidentally
leaving a car on the railroad track. So they are highly sophisticated.
But we're looking at this at a much higher level of abstraction. So we have an
event that says that a train is in the crossing. TL is the event that says the
train is leaving the crossing or has left the crossing. The car can be
approaching. The car can be in the crossing and the car can have left the
crossing. And finally, the gate can be closing, it can be opening, and we also
incorporate an identified failure state which says that the gate is actually
failing.
What we typically do is next we come up with some behavioral model of the
system, and this is typically given in some form of communicating finite
automaton, and I should point out this is a rather stylized idea of what the
actual control is going to be.
But what's important here is that the train as well as the car as well as the
gate are represented by state machines and that these state machines
synchronize via common events and, hence, what we consider to be the total
model of the system is basically the concurrent execution of these three
components.
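As a rough illustration of the kind of composition the speaker describes, here is a minimal Python sketch of my own (not the actual model used in the talk), with each component given as a transition map and the composed system firing an event only when every component that knows the event can take it:

```python
# Hypothetical encoding of the railroad-crossing components as
# communicating finite automata. Each automaton maps (state, event)
# to a successor state; state and event names are illustrative.

TRAIN = {("far", "TA"): "near", ("near", "TC"): "in", ("in", "TL"): "far"}
CAR   = {("far", "CA"): "near", ("near", "CC"): "in", ("in", "CL"): "far"}
GATE  = {("open", "GC"): "closed", ("closed", "GO"): "open",
         ("open", "GF"): "failed"}

AUTOMATA = [TRAIN, CAR, GATE]

def alphabet(autom):
    return {ev for (_, ev) in autom}

def enabled(states):
    """Events firable in the composed state (one local state per component):
    every automaton whose alphabet contains the event must have a matching
    transition."""
    all_events = set().union(*(alphabet(a) for a in AUTOMATA))
    return {ev for ev in all_events
            if all((s, ev) in a for a, s in zip(AUTOMATA, states)
                   if ev in alphabet(a))}

def step(states, ev):
    """Fire ev: components that know the event move, the others stay put."""
    return tuple(a.get((s, ev), s) for a, s in zip(AUTOMATA, states))

init = ("far", "far", "open")
```

The total model of the system is then the set of states reachable from `init` by repeatedly applying `enabled` and `step`.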
Now, what we also like to do in system analysis is apply a technique that's
called model checking. Let me briefly recap what model checking is. We're
starting out with a model of the system. This model is typically given, well,
perhaps in UML or SysML or some related notation that is, in the end, interpreted
typically as a transition system which is sort of a variant of what logicians
call a Kripke structure.
On the other hand, we have a specification of the requirements on the system,
and those are typically given using assertions, temporal logic or automata that
will not be at the heart of this talk, but let us assume that we shall use
temporal logic in order to specify the properties here.
And finally, we have something that brings the model and the specification
together in an algorithmic, systematic form, and that is some form of model
checking algorithm or a model checking procedure that checks whether the model
is actually satisfying the specification, which is indicated by this
satisfaction relation symbol in between the M and the S.
So mapping that to our example we have on the left-hand side as a model of the
system, this collection of communicating finite state automata, we have on the
right-hand side a requirement, in this case a safety requirement which
specifies the absence of a hazard, as the functional safety experts would say, and
it reads that there's never a train in the crossing at the same time when
there's a car in the crossing, and we can formalize that using a temporal logic
formula that says here phi is defined as always not (TC and CC). So the train
and the car are never going to be in the crossing at the same time.
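To make the formula concrete, here is a hedged sketch (illustrative state names of mine, not the actual encoding) of evaluating "always not (TC and CC)" over a finite trace of composed states:

```python
# The hazard: both the train and the car are in the crossing at once.
def hazard(state):
    train, car, gate = state
    return train == "in" and car == "in"

# phi = G not(TC and CC): the hazard holds in no state along the trace.
def always_no_hazard(trace):
    return not any(hazard(s) for s in trace)

safe_run   = [("far", "far", "open"), ("near", "far", "closed"),
              ("in", "far", "closed"), ("far", "far", "open")]
unsafe_run = [("far", "far", "open"), ("near", "near", "failed"),
              ("in", "in", "failed")]
```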
That's basically the setting that we're using, and as I said, what we would
like to do is we would like to check this relation between the model and the
specification in a tool-supported, automatic, systematic way. And for that,
we're using a model checker that is typically using either state space search,
some form of symbolic fixed point computation or SAT solving or a number of
different techniques.
What we're actually going to focus on during this talk is that kind of model
checking that focuses on state space search.
What we are going to look at is explicit state model checking, which is either
performing systematic depth-first or breadth-first search on the state space in
order to locate states that potentially violate the property that we have
previously specified, those states being denoted here by the circled X's in
the state space.
What a state space search does is basically it tries to find a counter example
or an error path to such a violating state. So what the model checker will
return is something like a path that says, okay, when the car is arriving, the
train is arriving, the car is in the crossing, the gate is closing, and then
the train is in the crossing, then there's a potential for an accident because
the car hasn't left before the train is entering.
And so we're in a potentially dangerous situation here. Now, that's pretty
standard technology, and what we're getting as a result is we're getting error
trails, and they can be presented in various formats. The interesting
observation is that if one tells, for instance, a model checker like spin not
to give you just one single counterexample but actually to give you all
counterexamples, there will be a quite significant number of those coming up
for this encoding of the railroad crossing example that we chose. It happens
to be 47 traces, and they all have this shape here. So there are some events
in them that we have seen before, and there are numbers in them representing
internal events of the model.
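The explicit-state search described above can be sketched generically. This is a minimal BFS of my own (not SPIN's actual algorithm) that returns the event path to the first reachable property-violating state, demonstrated on a deliberately tiny toy system:

```python
from collections import deque

def find_counterexample(init, successors, is_error):
    """Breadth-first search; successors(s) yields (event, next_state) pairs.
    Returns the event sequence leading to the first error state found,
    or None if no reachable state violates the property."""
    queue = deque([(init, [])])
    seen = {init}
    while queue:
        state, path = queue.popleft()
        if is_error(state):
            return path
        for ev, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [ev]))
    return None

# Toy system: the train and the car are each either out of or in the crossing.
def succ(state):
    train, car = state
    if train == "out":
        yield "TC", ("in", car)
    if car == "out":
        yield "CC", (train, "in")

trail = find_counterexample(("out", "out"), succ,
                            lambda s: s == ("in", "in"))
```

Enumerating all counterexamples, as the speaker mentions SPIN can be told to do, would amount to continuing the search instead of returning at the first error state.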
What becomes clear is that such a sheer amount of data is very difficult to
analyze in sort of a manual fashion. In particular, because what these traces
have in common is that they lead to a property violating state, but they are
just evidence for errors occurring. When we want to debug the model, then we
are, of course, interested in what is intrinsically the cause for this
violation happening, right?
Manual analysis, as I pointed out, is tedious, error prone, and is essentially
impossible even for such a simple, not very complex model as this railroad
crossing. So our goal is to actually come up with what we refer to as
algorithmic causality computation, which means we want to compute causalities
from the data that we can get out of the model in order to answer this
question, what actually the cause is going to be like.
That's basically what I would like to talk about today. The outline of my talk
is going to be that we are going to first talk a little bit about models of
causation that can underlie this computation. I will then introduce an
adapted structural equation model. I'll explain what that is in a minute.
I'll introduce you to what we refer to as causality checking. I'll talk a
little bit about an experimental evaluation that we did. And finally, conclude
my talk. Please, if you have any questions, of course feel free to ask them.
Okay. Models of causation. What is a cause? In my mind, there is no really
a priori definition of what one should consider a cause. There are various
approaches that you can find, both in the technical and the philosophical
literature on the definition of causes. There's, for instance, statistics,
where mathematicians like to think that correlations say something about
causes.
There's various forms of causality reasoning, in particular in the philosophy
of science. There is this idea of event structures and the analysis of
concurrent systems. There's work by Lamport on his happened-before relation and
his analysis of distributed systems. And there are many more ideas that seem
to allude in some way or another to causality.
When I'm saying that I think there's no a priori correct definition of
causality, then what I think is much more important to worry about is the
adequacy of a causality definition. So we want a causality definition that, in
the end, gives us results that we can sort of make a lot of sense out of.
Let us look at what we think is sort of an interesting thought that goes into
this direction and just illustrate that these ideas are not really absolutely
new. I would like to refer to David Hume back in the 18th century, who, in his
inquiry concerning human understanding, is reflecting on what he believes
causes are.
And there are some quotations here, and maybe I would like to read out these
few sentences here. So he writes, yet so imperfect are the ideas which we form
concerning it that it is impossible to give any just definition of cause,
except what is drawn from something extraneous and foreign to it.
So he points out cause is not something easy to define, and it always has
something to do with external influences. He furthermore makes an observation
that he says, similar objects are always conjoined with similar. So things
that are similar, he seems to suggest, always lead to similar results.
And then, and this comes closer to the heart of what his causality definition
is, or his idea of causality is, he says that therefore, we may define a cause
to be an object followed by another, and where all the objects, similar to the
first, are followed by objects similar to the second.
So when something happens, then it always or something very similar to it, then
it always leads to something second happening in very similar fashion. And
this is actually something that one could refer to as sort of the positive part
of causal definition that says if something happened, it has a certain
consequence.
Then he continues to say, or, in other words, where if the first object had not
been, the second never had existed. This is what later on we will refer to as
sort of counterfactual reasoning, which says that, okay, if the first thing
does not occur, then the second will not occur as well.
Now, this idea has been picked up by Lewis in the '70s of the last century with
his definition of causal dependence that is quite often referred to, also, by
people worried about the philosophy of science and engineering.
And he says that where C, which refers to a cause, and E, an effect, are two
distinct possible events, E causally depends on C if and only if C were to
occur, E would occur. And if C were not to occur, E would not occur. And
again, here we've got these two ideas. First of all, the positive side of the
causal definition. And secondly, sort of the counterfactual argument.
>>: [indiscernible] so you cannot distinguish which is the cause and which is
the effect.
>> Stefan Leue: Yes. In some sense, you could argue that there is sort of a
reversal. Typically, this is answered by the fact that one says causality is
coinciding with temporal evolution, right. So the earlier one would always be
the cause. The later one would be the effect. Right. This is sort of the
silent assumption when you make these arguments. One thing happens before the
other, and that basically determines which one is the cause and which one's the
effect.
>>: Which doesn't account for common cause.
>> Stefan Leue: Which means?
>>: Suppose I had a B that causes both C and E.
>> Stefan Leue: Right.
>>: That's always the question in an experiment, you need to know that C is an
independent variable.
>> Stefan Leue: Right, right.
>>: By making a randomized experiment.
>> Stefan Leue: So it is clear that this argument by itself, and I shall have
some illustrations for that, is not a sufficient way to capture all of the
cause/effect relationships in the complex technical systems that we're faced
with.
By the way, this is often simplified by just saying C is causal for E if, were
C not to occur, E would not occur either. So very often, in
counterfactual reasoning, one just refers to sort of the second counterfactual
argument type or part of the argument.
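One naive reading of these two tests over finite observations might look like the following sketch. This is my own illustrative encoding, with "worlds" represented as sets of events that occurred, not anything from the talk itself:

```python
def naive_cause(c, e, worlds):
    """C is deemed a cause of E if C occurs somewhere, every world
    containing C also contains E (the positive test), and every world
    lacking C also lacks E (the counterfactual test)."""
    occurs = any(c in w for w in worlds)
    positive = all(e in w for w in worlds if c in w)
    counterfactual = all(e not in w for w in worlds if c not in w)
    return occurs and positive and counterfactual

# GF = gate fails, GC = gate closes, ACC = accident.
worlds = [{"GF", "ACC"}, {"GC"}]
```

On this simple single-cause data the test behaves as intended: the gate failure passes, the gate closing does not.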
Okay. So one sort of a consequence of this Lewis type of idea of
counterfactual reasoning is that it leads to a what-if analysis where we're
looking into, as it's sometimes referred to, alternate worlds. So we
kind of speculate about if the world had been different, if a different course
of action had taken place, then something wouldn't have happened.
Had there been another course of action in which the gate had been closed before
the car entered the crossing, there would not have been an accident. So this
is kind of a good world or what I will refer to as the good world. And I'm a
little bit arbitrarily defining good and bad worlds. Good worlds are the
things where the effects do not occur. And the bad worlds are the ones where
the effect occurs because, in the back of my mind, I'll always have the idea
that a bad thing to happen is the accident, right.
So naive counterfactual reasoning, as I will refer to it, performs a positive
and counterfactual test for one event C and one event E, and I'll briefly
critique what some limitations of this naive counterfactual reasoning are.
Let us assume that we are introducing an event, LB, which says, okay, a stop
light is broken. And that we want to say, okay, the cause is the conjunction
of causal events. So the gate failing and the light being broken is something
that we want to consider conjointly to be causal. Then we have good executions
where only one of these two events occur, and we have bad executions where both
of these events are actually occurring.
And if you apply this simple counterfactual test, then none of the causal
events will be recognized since they occur both in good and in bad worlds,
right. And this cannot be causal according to basically the naive
counterfactual reasoning.
A very similar kind of argument applies to the disjunction of causes. When you
have another event that you're introducing, IL, assume that there's a stop
light in addition to the gate, and the driver is ignoring the stop light. Then
the disjunction of the causal events, the light broken or the driver ignoring
the light, leading to an accident, actually this leads you to a good world
where none of these events occurs, and a bad world, where either one of these
occurs.
And very much like in the previous case, none of these events will, with naive
counterfactual reasoning, be recognized as actually a cause.
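Both limitations can be reproduced on toy data with the same kind of naive test. Again this is my own illustrative encoding, with worlds as sets of occurred events:

```python
def naive_cause(c, e, worlds):
    """Naive test: C occurs, C always suffices for E (positive part),
    and E never happens without C (counterfactual part)."""
    occurs = any(c in w for w in worlds)
    positive = all(e in w for w in worlds if c in w)
    counterfactual = all(e not in w for w in worlds if c not in w)
    return occurs and positive and counterfactual

# Conjunction: the accident (ACC) needs BOTH the gate failing (GF) and the
# light broken (LB), so each event alone also shows up in a good world.
conj_worlds = [{"GF"}, {"LB"}, {"GF", "LB", "ACC"}]

# Disjunction: EITHER the light broken (LB) or the driver ignoring the
# light (IL) suffices, so neither single event is counterfactually necessary.
disj_worlds = [set(), {"LB", "ACC"}, {"IL", "ACC"}]
```

In the conjunctive scenario each event fails the positive test; in the disjunctive one each fails the counterfactual test. So none of the intuitively causal events is recognized.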
There's some more limitations that naive counterfactual reasoning cannot
account for. One is the non-occurrence of events, which says that the TC -- so
the train in the crossing -- together with the fact that the gate is not closed
and the car is in the crossing, is actually causal for an accident to occur.
This is not something that you can really capture in the simple, naive
application of the counterfactual argument.
Another is that the ordering of events can be a distinguishing factor between
bad and good worlds. So the train in the crossing, the car in the crossing, the
gate closing, the gate opening, this is certainly a bad sequence, whereas the
gate in the crossing -- the gate closed, pardon me -- the train in the crossing,
the gate opening, the car crossing is a good sequence, consisting of the same
set of events, but just in a different order, and the order, of course, matters
for whether it's a good or a bad one.
And finally, something like relevance of events that, for instance, the train
engineer union decided not to call for a strike on the day of the accident
happening. You don't really want that to be a cause even though it would pass
the counterfactual test because it's just simply something you do not want to
be -- consider as being relevant in your reasoning.
Okay. So as a consequence, this naive counterfactual reasoning is best suited
to explain sort of simple, single cause/effect chains. It is not suited for
effects that have a more logically complex structure. I just want to point out
there's large, large literature, particularly in the philosophy of science,
that tries to critique and modify counterfactual reasoning in various ways.
And there are more cases than the ones I've presented here that are not
adequately addressed.
Now, we are considering an approach by Joe Halpern and Judea Pearl that's
called structural equation model, which tries to address some of these points.
In particular, which is aiming at sort of more complex logical structure and is
trying to combine this idea with Lewis' counterfactual reasoning.
And time does not unfortunately suffice to go into every detail of the basic
Halpern/Pearl model, but let me just point out some of the key ideas that they
propose. They are considering events as being represented by Boolean
variables. They, on the other hand, consider variables sort of not only to be
Boolean, but also to be possibly defined over arbitrary domains. So a variable
in the real world attaining a certain value is then mapped onto a Boolean
variable.
They make a distinction between exogenous and endogenous variables, so they
make it possible to define which parts of the real world are relevant for our
consideration, our reasoning, and which ones are irrelevant. And they in
general compute minimal Boolean disjunctions and conjunctions of causal events.
And they define a number of causality conditions and, actually, they define the
AC1 condition, which ensures that there exists a world where the Boolean
combination of causal events, C, and the effect, E, actually occur. So that is
sort of the positive part of the counterfactual test.
And then they have two conditions, AC2. First one that says that if at least
one of the causal events does not happen, the effect E does not happen. That's
basically a counterfactual argument. And then they say if the causal events
occur, the occurrence of other events cannot prevent the effect. So when you
have a set of events that you consider to be causal, you cannot see another
event occur that means the effect is actually going away.
And finally, they are imposing a minimality constraint, that the causal events
that you have identified are actually minimal in the sense that no subset of
the causally identified events satisfies AC1 and AC2 at the same time. That's
roughly the idea, and I would like to refer you to their paper for the complete
formalization and motivation for their approach.
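As a much-simplified sketch in the spirit of these three conditions (and only that: it works over plain sets of observed worlds, with the candidate cause as a conjunction of events, and ignores the structural equations, the W-set machinery, and the causal network of the actual Halpern/Pearl definition):

```python
from itertools import combinations

def ac1(cause, effect, worlds):
    """AC1: some world realizes all causal events together with the effect."""
    return any(cause <= w and effect in w for w in worlds)

def ac2(cause, effect, worlds):
    """AC2 (counterfactual part): if at least one causal event is missing,
    the effect is absent."""
    return all(effect not in w for w in worlds if not cause <= w)

def ac3(cause, effect, worlds):
    """AC3, minimality: no proper, non-empty subset passes AC1 and AC2."""
    return not any(ac1(set(sub), effect, worlds)
                   and ac2(set(sub), effect, worlds)
                   for k in range(1, len(cause))
                   for sub in combinations(sorted(cause), k))

def is_cause(cause, effect, worlds):
    return (ac1(cause, effect, worlds) and ac2(cause, effect, worlds)
            and ac3(cause, effect, worlds))

worlds = [{"GF", "LB", "ACC"}, {"GC"}]
```

On this toy data the gate failure alone passes all three conditions, while the larger conjunction with the broken light is rejected by the minimality condition AC3.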
We found that to be a very good sort of first match for an application to
our -- sorry?
>>: Can we go back a slide. Maybe there should be a condition that says that
E should not be in C, right, because if you replace C with E there, E is the
minimal event that's going to satisfy AC1, AC2 and AC3. The event is a cause of
itself.
>> Stefan Leue: Yes, yes.
>>: So perhaps it should say it should be other than E.
>> Stefan Leue: They actually presume this is something that I'm not
explaining here. What they actually do is they presume some causal structure
between events. So they presume some previous analysis in which they identify
different events, and where they identify which events can possibly influence
which other events. And it is clear that a cause needs to precede an effect in
that -- in what they refer to as a causal network, right.
So they do provide a little bit more structure than I'm explaining here.
These are cycle-free. So things are not referring to themselves, okay?
So there are a lot of benefits in this model, in that it considers Boolean
combinations of events, the distinction between exogenous and endogenous
variables. There are some shortcomings in the sense that there's no
consideration of event orders as being causal factors, and we had pointed out
that that's, in principle, important. And what's also a certain deficit, we
like to think in terms of these transition systems, as I argued before, as
being the type of Kripke structures that we like to reason about. And in
particular, what we like to reason about are traces that describe sort of the
temporal logical evolution of the system in terms of computation steps.
And so we are basically adding these things to that model, so I would like to
introduce how our adapted structural equation model actually looks.
Let me introduce, first of all, some auxiliary means that we defined. We first
of all defined something that we refer to as an event order logic. It is, in
essence, some form of a linear time temporal logic with a certain limited
expressiveness. It first of all refers to events and it contains Boolean
expressions that say that events are occurring at all in the obvious
interpretation.
In addition to that, it expresses event order conditions. So while above here
A and B just simply says along some trace A and B do occur, A conjunctively
before B actually means that A and B both occur and A happens before B.
We then also define interval operators. The one here says that A occurs until
eventually B will hold in every state. The next one says that A always holds
until eventually B occurs. And the third one defines basically an interval.
In the interval defined by A and C, B always holds. We defined this logic
because it gave us sort of a convenient notation in order to later on specify
the logical constraints that we considered to be causal.
We do have a model theoretic semantics for that. And, of course, I don't want
to go over the details here. I just want to point out that we defined both
what it means that a state defined -- satisfies actually a simple formula,
which is just given by an event symbol, and that is done by the fact that we're
saying, okay, if the system transitioned into the state, then the state
satisfies that formula. And finally, by virtue of applying this semantics, we
can also define what it means that a transition system is actually satisfying
some formula in this logic; namely, if there is a trace such that this trace
satisfies the formula.
As I said, the details of that semantics are pretty straightforward and they
are in our paper.
So this gives us an opportunity now to basically represent traces, for
instance, such as sigma here, and to have them characterized by an event order
logic formula, which means that basically, a trace can be represented by or can
belong to an equivalence class represented by such an event order logic
formula. That is the kind of equivalence we are considering here.
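On finite traces, taken as lists of event names, the unordered and ordered conjunctions can be evaluated roughly as follows. The function names are mine, not the paper's concrete syntax, and the traces are illustrative:

```python
def occur_both(trace, a, b):
    """Unordered conjunction: A and B both occur somewhere in the trace."""
    return a in trace and b in trace

def ordered_before(trace, a, b):
    """Ordered conjunction: A and B occur, with some A strictly before B."""
    return any(ev == a and b in trace[i + 1:]
               for i, ev in enumerate(trace))

bad_trace  = ["TA", "CA", "GF", "CC", "TC"]  # the car enters before the train
good_trace = ["TA", "CA", "GC", "TC", "TL"]  # gate closes, car never enters
```

The two traces show how the ordered operator separates behaviors that the unordered conjunction cannot distinguish.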
>>: What do you think about the difference between holds and occurs in these
kind of order of conjunctions. Say, for instance, A conjunctive to B means
that B always holds once A occurs. What's the difference between occurs and
holds? Does holds means has occurred?
>> Stefan Leue: Okay. Holds basically means that -- let me just go back.
Okay. In every state, B will be satisfied here. Okay. And there is some
state, until eventually B holds in every state, in which A actually holds.
That's maybe a better way of putting it, right. Occurring means in a number of
states up until the right-hand side is reached, eventually or at some point, A
will actually hold. Yes?
>>: [inaudible] logic some type of trigger for [indiscernible].
>> Stefan Leue: You probably can't represent it in LTL.
>>: You have Boolean operators on top of the interval operators? Take the
disjunction and the interval operators are conjunction? Or do you only have
Boolean operators on the atomic proposition.
>> Stefan Leue: Only on the atomic proposition.
>>: So basically, you're going to have one top-level interval expression. So
it's probably not first order expressive.
>> Stefan Leue: No, it should be LTL expressive.
>>: LTL is equivalent to the first order theory of --
>> Stefan Leue: Okay, yeah, yeah.
>>: So I'm just wondering, can you express everything in LTL in your logic?
I'm guessing not, because you don't have Boolean combinations over the
operators.
>> Stefan Leue: I think so too, yes. So just briefly go back. Yes. So how
does our adapted model now look? Let us assume that we have a
transition system given in the standard form so it consists of a set of states,
a set of actions, a transition relation, a set of initial states, atomic
propositions labeling the states and actually this labeling function.
And let us assume that A is a set of event variables over the event types in
Act. I should point out that we're doing one little trick here. We're at the
moment not able to sort of deal with a repeated occurrence of events so we
solve this in a sort of syntactic manner by distinguishing the first instance
of the occurrence of some event A from the second, from the third. So every
event basically, along a trace, occurs at most once. That's the limiting
assumption that we're currently making.
Let phi denote an LTL formula that's representing the non-occurrence of some
property violation, so the not phi is the violation and it's often the effect
that we're looking for. It's typically the accident, the bad thing
happening.
What we're currently doing is we are limiting our approach to reachability
properties that we express by a safety LTL formula of that form. And then let
psi be an event order logic formula consisting of event variables in some set
Z, or in other words, the set Z is the set of variables that actually occur in
this formula, and we're saying that such a formula is actually considered a
cause for the effect not phi if the following conditions that we're defining
actually hold.
So there's the condition AC1 that says basically there exists a sigma so that
both sigma satisfies the psi and sigma satisfies the not phi. So psi is
basically the event order logic characterization of what we consider to be the
cause, and it's actually leading to the violation -- or it's leading to the
effect. This is actually what we refer to as the positive side, where we see,
again, the positive side of this counterfactual test.
So we can, for instance, pick a candidate cause psi here, and there's an
example here given by this event order logic formula. It says that there's a
train approaching before the car approaching, before the gate failing, before
the car is in the crossing, and before the train is in the crossing. And then
we need to pick some sigma satisfying this psi, and basically we thereby, by
the occurrence of the events, define which events belong to our set Z, and Z
is the set of candidate events that we consider to be causal.
And then we have the set W, which is all of the events that are not occurring
in the sigma. In the example here it can be trivially seen that we have for
the set Z these events occurring here, and the set W is all of the events in
the model that do not occur in this trace.
So basically, this is what we're starting out with. We're starting out with
what positively happens and what causes the effect.
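For this positive test, the candidate cause can be treated as an ordered event sequence and checked as a subsequence of some property-violating trace. The following is my own simplified encoding with illustrative data, not the tool's implementation:

```python
def matches_order(trace, ordered_events):
    """True if ordered_events occur along trace in the given order
    (as a not-necessarily-contiguous subsequence). Membership tests on
    the shared iterator consume it, enforcing the ordering."""
    it = iter(trace)
    return all(ev in it for ev in ordered_events)

def ac1(candidate, bad_traces):
    """AC1: some violating trace realizes the candidate cause."""
    return any(matches_order(t, candidate) for t in bad_traces)

bad_traces = [["TA", "CA", "GF", "CC", "TC"]]
candidate  = ["TA", "CA", "GF", "CC", "TC"]
```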
Then we're moving to the AC2 definition or condition. The first part of it,
and let me maybe not go over in all detail the formalization here, but let me
point out that what we're doing here is we're actually defining the
counterfactual test with saying, okay, there exists a sequence sigma prime
where the order and occurrence of events is different from the sigma that we
have originally considered. That's the sigma that passed the AC1 test. And
fee is not violated on this sigma prime.
So when we're looking into the formal definition here, we're saying
basically there's the sigma prime, so that the sigma prime satisfies this
causality constraint, and then we're basically looking in here and saying,
okay, it is not leading to the error, and the valuation is different from what
the valuation was in the original example.
The AC2(1) in our example is fulfilled by using sigma, the TA, CA, GF, CC, TC
trace, which means the train is approaching, the car is approaching, the gate
is failing, the car is in the crossing, and the train is in the crossing.
Since there exists a sequence sigma prime, so this is the good sequence, train
approaching, car approaching, gate closing, train in the crossing, so that the
valuation basically of some of the variables in the set W has changed and
actually the sigma prime is not leading to the effect that we had seen.
So basically, that is the counterfactual test. Basically says there is an
execution that does not lead to the negative effect. Okay. So what we're
doing now is we're adding a further condition that expresses the following. It
basically says for a sequence of events to be causal, it cannot be possible to
add an event so that causality is violated. So it should not be possible, when
you have a causal sequence, you want it to suffice by itself. You don't -- and
you want to say, okay, if this causal sequence is added or if we add a further
event to this causal sequence, we do not want the effect to go away.
So what, in essence, what that means is this condition is there in order to
reveal that potentially the non-occurrence of events can be causal as well.
Because if we, in other words, add an event to a sequence, which makes the
causality go away, then it means in our setting that in the previous version,
when this event was not there, this absence of the event itself can be
considered causal. I'll have an example for that in a minute. And that's
actually encoded in the formal definition up here.
So in that sequence sigma double prime that we're providing, the valuation of
the positive events is as in the originally considered sequence sigma, and the
valuation of the negative events is different from what it was in the
originally considered sigma. And that basically means that for all executions
where the events in Z have the original value as defined by the valuation of
the variables in sigma, the values of an arbitrary subset of events in W, the
non-occurring events, have no effect on the violation of the property. That's
basically what we have to check.
So if we consider sigma double prime to be the sequence where the train's
arriving, the car's arriving, the gate is failing, the car crossing, the car
leaving the intersection and the train is getting into the intersection, then
suddenly, the bad effect would go away because the car has actually left the
intersection, of course, before the train is actually entering.
And so the formal conditions for AC2(2) are satisfied. However, for the sigma
double prime, the property fee is not violated, as I explained, since the car
had left.
So the consequence is that this formula is not causal because the AC2(2)
conditions failed and the non-occurrence of this event is causal and that
basically needs to be added to the causal formula. So what we need to do is we
need to basically find minimal set of causal non-occurrence events and add that
to the formula. And I'll talk a little bit about how we're going to go about
that.
There's a third condition that we add, AC3, which is that the formula that describes the cause is minimal, in the sense that no subset of this formula, no partial formula, satisfies the conditions AC1 and AC2. So we can't take any causality argument away and still establish the causality.
So how do we go about capturing this causality of non-occurrence? What we do, basically, is consider all of the event variables in some such sequence sigma double prime that contains a non-occurring event, and, depending on the position of the event variable in that formula, we add an additional formula that expresses that the event is not occurring: if it is at the beginning of that causality formula C, or at the end of C, or in the middle of C, we accordingly add such a term to the formula. And then we perform the test for AC2 again.
So in the example, we would get, for instance, this expression where we're saying, okay, train approaching, car approaching, gate failing, car crossing, and then it is not true that the car is leaving the intersection between the car entering the crossing and the train entering the crossing, and this now leads to such an undesired effect.
Okay. How do we capture the notion, which was not included in the original structured equation model, that the causality of event order is actually relevant? For some event order logic formula, what we're doing is we're placing the ordered operator conjunctively, ordering the conjunction, which yields a formula C. And what we're now saying is: consider some C, an event order logic formula over a set Y, a subset of the events that we consider to be candidate events for the causality. The order expressed by C is not causal if sigma satisfies C and there is a sequence sigma prime among the bad traces such that sigma prime does not satisfy C, but sigma prime does satisfy the formula where we have given up the order constraints.
So in the example, for instance, the order of the events gate failed, car crossing, not car leaving, and train in the crossing is important for causing the accident that we're considering. The relative order of TA and CA is not important, however, but they need to precede the above events. And the resulting formula, then, is this one, which expresses the causality that we're computing here.
Okay. So that is the set of definitions that we have come up with in order to adapt the Halpern-Pearl model of causality to our checking. Now, our goal was also to come up with a sort of mechanization of that idea, to cast it into an actual algorithm in order to do this analysis.
So what we will exploit in our analysis is the fact that we need alternate worlds in order to reason with counterfactual arguments, and these alternate worlds are given by the model-theoretic semantics of the system models, in the sense that we can define traces that lead to errors and traces that do not lead to errors. And we will do that using capabilities that are provided by the model checkers that we're using, the explicit-state type model checkers that we're currently considering.
How does that work? The traces that the model checkers compute do actually define the alternate worlds. We can, first of all, compute with a model checker the set of bad traces, which are all the counterexamples that lead to a property violation. Let us assume that we can modify the search algorithm in such a way that it will give us all of the traces that lead into bad states.
On the other hand, to compute alternate worlds, we can compute the set of good traces by having the model checker search the state space and put any trace that does not lead to an error state into the set of good traces. The search will terminate when we either are trying to close a loop, or when the search depth has been reached, or when a final state has been reached. Please?
>>: Do you have some ordering on the good traces, or I guess they're all
independent?
>> Stefan Leue: I mean, they have a prefix structure, right. We do build up some order, as I will show you; that's --
>>: The assertion of doomed to failure, and I guess there's probably doomed to goodness at some point. No matter what choice you make, you're going to be good, so it's just a --

>> Stefan Leue: Right, okay. So you can -- so that's probably, I'm not sure
whether this graphic shows that, but let us assume that, for instance, this is
a sub-tree that has only blue arcs, blue edges. Then that is something where
you will never again run into a failure, right.
>>: Are you going to enumerate all those?

>> Stefan Leue: We are enumerating those. I'll explain a little bit about how we algorithmically solve that, okay?
So the key idea: explore the state space using a depth-first or a breadth-first search. As I said, we collect bad and good traces, the ones that lead to the property-violating states and the ones that do not lead into property-violating states. And as I said, this is currently only applicable to reachability properties. Also, I should point out, to those kinds of properties where, after a property violation has been detected, no meaningful behavior ensues.
If you look at some safety properties where, once a property violation is reached, the system may continue to work, there may be another, for instance, assertion violation reached, again with some causal sequence of events. Those kinds of properties we do not consider. We consider those that are sort of reachability properties, where what happens afterwards is either nothing or not really meaningful for the analysis at hand.
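To make this concrete, here is a minimal sketch in Python of how such a search could split maximal traces into good and bad sets. The event names follow the talk's railroad-crossing example, but the enabling rules and the hazard predicate are simplified assumptions of mine, not the speaker's actual model:

```python
from collections import deque

# Events: TA train approaching, CA car approaching, GF gate failing,
# CC car in crossing, CL car leaving, TC train in crossing.

def is_bad(trace):
    """The hazard: the car is inside the crossing when the train enters."""
    car_in = False
    for e in trace:
        if e == "CC":
            car_in = True
        elif e == "CL":
            car_in = False
        elif e == "TC" and car_in:
            return True
    return False

def enabled(trace):
    """Toy enabling rules playing the role of the model's semantics."""
    nxt = []
    if "TA" not in trace: nxt.append("TA")
    if "CA" not in trace: nxt.append("CA")
    if "GF" not in trace and "TA" in trace: nxt.append("GF")
    if "CC" not in trace and "CA" in trace and "GF" in trace: nxt.append("CC")
    if "CL" not in trace and "CC" in trace: nxt.append("CL")
    if "TC" not in trace and "TA" in trace: nxt.append("TC")
    return nxt

def explore():
    """Breadth-first search that sorts traces into good and bad sets."""
    good, bad, frontier = [], [], deque([()])
    while frontier:
        trace = frontier.popleft()
        if is_bad(trace):
            bad.append(trace)      # stop at the property violation
            continue
        succs = enabled(trace)
        if not succs:
            good.append(trace)     # terminal trace, hazard never reached
        for e in succs:
            frontier.append(trace + (e,))
    return good, bad
```

With these rules, a trace such as TA, CA, GF, CC, TC ends up in the bad set, while TA, CA, GF, CC, CL, TC ends up in the good set.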
Okay. We define, as sort of an auxiliary construction, notions of sub-executions, and those are the operations that I'm illustrating here. They basically say whether traces are sub-traces of other traces, and whenever you see a dot go with such an operator, it means they are sub-traces of each other in an ordered manner. And what we were able to show is that we can reduce the checks for AC1 to AC3 and OC1 to sub-execution tests, and the proofs for that are in the paper. I won't show you that here; I will show, however, how we are actually going to use that in our construction.
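As a sketch, the two sub-execution relations alluded to here can be written as follows, where the dotted operator corresponds to an order-preserving (subsequence) check and the plain one ignores order; the function names are my own:

```python
from collections import Counter

def subtrace_unordered(small, big):
    """Plain sub-execution: every event of `small` occurs in `big`, order ignored."""
    missing = Counter(small) - Counter(big)
    return not missing

def subtrace_ordered(small, big):
    """Dotted sub-execution: `small` is a subsequence of `big`, order preserved."""
    it = iter(big)
    return all(e in it for e in small)  # `in` advances the iterator past each match
```

For example, (TA, GF) is an ordered sub-trace of (TA, CA, GF, CC), while (GF, TA) is only an unordered one.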
We have two implementation variants of that. The one is an offline enumeration, where what we're doing is actually enumerating all of the traces, storing the sets of the good and the bad traces, and then performing these sub-trace computations in an offline step, which, of course, leads to considerable storage requirements, memory requirements.
The alternative is an on-the-fly method, where we're basically using a depth-first or breadth-first search on the state space and storing paths in an adequate data structure as we obtain them. The data structure that we're using is the subset graph, which looks as follows.
And basically, the graph stores the traces and categorizes them, and it stores them in an order where you see the shortest traces up here at level one, and the longer traces the further you go down in this tree. So the levels correspond to the lengths of the traces, and connections between levels mean that the lower-level trace is a sub-trace of the higher-level trace. So we connect them whenever one trace is a sub-trace of the other one. This is currently the unordered consideration, okay?
Now, we have different types of nodes in this graph. We have, first of all, the green nodes. The green nodes basically say that the trace is in the set of the good traces, so it cannot be a trace that by itself is causal. What's also important for a node to be green is that all nodes on the level below that are connected with it are also colored green, and these traces are either prefixes of good or bad traces. By the way, there's something a little bit awkward: I'm saying "on the level below," and of course that means going upwards in the chart.
A red node is a trace that is in the set of bad traces while all nodes on the level below that are connected to it are green. So it's sort of the first trace in an evolution that is turning from a good trace into a bad trace. They are the shortest bad traces found, so they satisfy the minimality constraints, and they are considered to be candidates for being causal traces.
The black nodes are good execution traces, but at least one node on the level below that they are connected with is colored red. So they are longer traces; they are good traces, but they have a sub-trace that is a bad trace. And this basically means that they identify one event that turns a bad trace into a good one. That hints at the non-occurrence-of-events check, the AC2(2) check: it can be carried out by comparing these nodes and their predecessors.
So, for instance, and I shall return to this example in a minute, the car
leaving event here basically is what makes the difference between these two
traces. I'll look at that in more detail in a minute.
And then we have the orange nodes. They represent a bad execution trace where at least one node on the level below that is connected to this orange node is colored red. They are bad traces, but they do not satisfy the minimality conditions, so we, in essence, do not necessarily need to consider them, except for some special purposes, perhaps.
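A possible way to mechanize these colouring rules is to walk the traces level by level, shorter first, looking one level down. This is my own simplification of the rules just described (in particular, I approximate "all connected nodes below are green" by "no connected node below is red"), so treat it as an illustrative sketch:

```python
from collections import Counter

def is_subtrace(small, big):
    """Unordered sub-trace test used for the level-to-level connections."""
    return not (Counter(small) - Counter(big))

def color_nodes(good, bad):
    """Colour each trace green, red, black, or orange per the rules above."""
    traces = sorted(set(good) | set(bad), key=len)
    color = {}
    for t in traces:
        below = [s for s in traces
                 if len(s) == len(t) - 1 and is_subtrace(s, t)]
        red_below = any(color[s] == "red" for s in below)
        if t in bad:
            # minimal bad trace -> red; longer bad trace above a red -> orange
            color[t] = "orange" if red_below else "red"
        else:
            # good trace above a red -> black; otherwise green
            color[t] = "black" if red_below else "green"
    return color
```

On the railroad example, the bad level-5 trace comes out red and the good level-6 trace that extends it by a car-leaving event comes out black.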
Okay. So we're able to prove some theorems for our adaptation of the structured equation model of Halpern and Pearl. We can say that an event order logic formula C_sigma that is derived from a red node containing the trace sigma fulfills AC1, AC2(1), and AC3. For breadth-first search, we can say that this is fulfilled immediately, because we know it's always the shortest trace that we're reaching there. For depth-first search, we can show it basically only when the search terminates, because in depth-first search we add a trace to the graph without knowing yet whether it's a minimal one, right. So we have to explore completely there.
The construction of the subset graph can be done by basic state-space search. And once the state-space search is complete, what we're doing is performing the non-occurrence-of-events test and the order-condition tests on the obtained structure.
So what are the inferences that we can draw from the graph? Basically, first of all, we need to consider all of the red traces as being candidates for causality, and then we need to check whether the condition AC2(2) holds, which means we determine whether the non-occurrence of events is causal.
So we have a red execution, for instance, here at level 5: the execution where we have the train approaching, the car approaching, the gate failing, the car in the crossing, and the train in the crossing. And then we have, at the level above, the black execution, which has an additional event, and that additional event basically makes the negative effect go away. We can then identify that this event, car leaving, is actually the one that's responsible for turning the bad trace into a good trace. And so we conclude that in the causal execution here, the not-car-leaving event is actually responsible for that. Right?
We need to perform that not just for single events; we basically also need to check the situations and identify where it's multiple events that are not occurring that turn the one into the other.
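In this representation, the AC2(2) check reduces to a multiset difference between a red trace and a black trace one level above it: the extra events are the candidates whose non-occurrence is causal. A small sketch (the function name is mine):

```python
from collections import Counter

def nonoccurrence_candidates(red_trace, black_trace):
    """Events the good (black) trace has in addition to the bad (red) one.
    Their absence on the red trace is a candidate cause of the hazard."""
    extra = Counter(black_trace) - Counter(red_trace)
    return sorted(extra.elements())
```

For the railroad example, comparing the red level-5 trace with the black level-6 trace yields ["CL"], i.e. the term "not CL" in the causal formula; with multiple extra events, all of them come back as candidates.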
>>: So you could, in principle, you could have a disjunction of negative occurrences placed at different points in the sequence, right? You could say, well, some event between the TA and the CA would prevent the failure. Or a different event, between GF and CC, which would prevent the failure. Is that expressible in your logic? I mean, can I actually -- if my level 5 event there had several level 6 events, a level that had negative events in different places, is that expressible, or am I mistaken there?
>> Stefan Leue: I believe it should be expressible. I believe it should be, but let me check offline.
>>: Leaving aside my particular example, so it's possible that you could wind
up with a situation where you just don't have a causal explanation that is
expressible in the logic, right?
>> Stefan Leue: When you don't have a --
>>: For a particular set of traces, it might be that I have nothing that
satisfies all of those AC conditions.
>> Stefan Leue: Um-hmm.
>>: So, yeah, so you don't have -- there's not some kind of completeness
result that says I always have an explanation.
>> Stefan Leue: No. In particular, because I mean, you're only getting those
traces that the model permits, either in the bad or in the good case, right.
You're not getting all possible combinations and orders of events happening.
But only the ones that the model permits, right.
So I believe that that's an argument why you couldn't have such a completeness result.
>>: I mean, I don't think you would want to have a completeness result,
because you want your logic in some way to express some domain knowledge,
right. You want it to be incomplete and you want it to focus on certain kinds
of explanations as being more likely than others.
>> Stefan Leue: Right, um-hmm. Okay.
>>: So if you have this gate-broken event, you could also have sort of -- you could also have other dependent events, or other events that might cause a [indiscernible]. Like, you know, if there's a -- if there's a human on site to fix the gate, they could also stop the car.
>>: Or the car decides to back up, you know.
>>: There are so many other things that aren't described, you put them in
there and it could get much more complicated.
>>: So the causal execution could get more complicated. As you say, you have
many different events that disjunctively could prevent, you know, the bad thing
from happening.
>> Stefan Leue: Um-hmm.
>>: A meteor arrives from outer space or something. There could be different things that could happen. So I was wondering sort of how expressive is my ability to, you know, describe these causal executions.
>>: Did you want these things to be minimal in some sense, so we can't get overly complicated?

>>: Presumably, you want the simplest explanation, right?
>> Stefan Leue: Well, that's what the AC3 condition says. You want the minimal causal events, so --
>>: So here you have a causal execution, and you may have many of those in the graph. So I'm assuming that you want to in some way --

>> Stefan Leue: That is disjunctive, yes.
>>: Put it together and say okay, here I have this disjunction, but really I
want to break it down to some simple formula.
>> Stefan Leue: Okay. I'll have an example in a minute that shows a fault tree that is exactly this disjunction. You're right, it would be desirable to come up with some sharing of common parts of the different disjuncts, right. But we're not currently doing that; we're basically producing these disjunctions. I hope that will become clear when I show an example in a minute, okay?
One observation, which is somewhat related to the question of how efficient this is in terms of storage, is that the black executions, the ones with the one additional event that turns bad into good, only need to be stored if you actually want to compute the non-occurrence of events as being causal. So if you are willing to over-approximate the problem by leaving that test out, then you need to store a lot less.
The event order test, which also needs to be carried out, basically looks at the red traces that are at one level and tries to determine whether the order between events is relevant. So we're basically computing the partial orders that these traces define and seeing whether there are any events that remain unordered. So, for instance, in this example, for TA and CA the order is irrelevant, because at the same level there occurs a red trace with TA, CA and a red trace with CA, TA, so their order cannot be relevant.
And so what we compile, then, is a causal formula of this form, where we're saying, okay, their relative order is not relevant, but they need to be ordered with respect to these events later on in the trace.
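This order test can be phrased as: for every pair of events, check whether all red traces of one level agree on their relative order. A hypothetical sketch of mine, assuming each event occurs exactly once per trace:

```python
from itertools import combinations

def order_relevance(red_traces):
    """For each pair of events on the red traces of one level, report whether
    their relative order is fixed (candidate causal order) or varies
    between traces (order cannot be causal)."""
    events = set(red_traces[0])
    relevant, irrelevant = [], []
    for a, b in combinations(sorted(events), 2):
        # True iff `a` occurs before `b`, collected over all traces
        orders = {trace.index(a) < trace.index(b) for trace in red_traces}
        if len(orders) == 1:
            first = a if orders.pop() else b
            second = b if first == a else a
            relevant.append((first, second))
        else:
            irrelevant.append((a, b))
    return relevant, irrelevant
```

On two red traces TA, CA, GF, CC, TC and CA, TA, GF, CC, TC, the pair (CA, TA) comes out unordered, while GF before CC and CC before TC come out as fixed orders.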
>>: Does that only correspond to the first and the third red box, or does that
correspond to all three red boxes then?
>> Stefan Leue: That corresponds to all three red boxes.

>>: Because in the second one, the gate failing is in between the two approaches.
>> Stefan Leue: That's correct. Okay. That's correct. I'll have to check; there may be something wrong here, yep. Thanks.

>>: [indiscernible] what gate failing means.
>> Stefan Leue: Basically, it's one event. And actually, this formula says the TA and the CA do occur. Their relative order is not important, but they happen before this one fails. And so here, I have to point out --

>>: [indiscernible] first, then that TA, CA does occur before GF. It's just, it starts -- it really --
>> Stefan Leue: Depends on the semantics, but I honestly, I have to check. I
have to go back to the definition of the logic and make sure that there's no
mistake in here. Thanks.
So there's two ways of implementing this, with BFS or with DFS, pardon me.
>>: Can we go back. So it might be an invariant of this system that you cannot have the car crossing unless it [indiscernible] first. It's an invariant of the system that if CC happened, then TA happened. So it might be redundant to actually give the TA and the CA there. If I'm a user of the system, I might know about this invariant. So if you're asking why it happened, [indiscernible] talk about TA and CA. Say the gate failed, the car came, the [indiscernible], so those four would explain it. But those kinds of reductions are not [indiscernible] check.
>> Stefan Leue: No, no. It's not currently included to reason semantically about the model and to sort of infer, for instance, an invariant. It would be interesting --

>>: Checking, right. So if you want the most [indiscernible] form of all the bad things, it's just the property you want to check. That's it. So that's what you're getting at. You have the model and a property formula, right, the property to be checked. All the bad things are things that violate the property. That's the most minimal thing that you could hope for, right?
>> Stefan Leue: That's the -- you're saying --
>>: But now you have traces, these traces that go to the bad state, and somehow you want to summarize them in a concise explanation of what they all have in common. Of course, what they have in common is that they are all bad traces. That's what you're playing at. So if you hide many of these things that go to the bad state, then you do not have an explanation of the traces. I don't know how you reconcile the tension, basically.
>> Stefan Leue: How I reconcile?
>>: The tension. I mean, if you want the most precise form of what bad traces are, they are just traces that go to a bad state. And there are the bad states.
>>: But TA and CA could be aberrant. They could be a common [indiscernible].
>>: Yes, but it sounds to me like the [indiscernible] way of explaining bad states is basically the definition of a bad state. That is, you have a car crossing and a train crossing; that would be perhaps the minimal definition of a bad state. See what I mean? I'm sorry to [indiscernible].
>>: There's an aspect of linguistic simplicity here.
>>: You're looking for a simple explanation of causality in some language, and the language itself is providing a certain deductive bias, saying some explanations are better than others. I mean, any event that you can describe about a system is synthetic, right; in the real world, it's a disjunction of an infinite set of events. So clearly, what you choose to describe as an event, all right, is going to determine your notion of causality. We're saying that's given. My set of events that I've been working with is given. My set of relationships that I can consider sort of macro events is given. And within that biased language, I want to pick the simplest explanation, the simplest explanation that fits these criteria.
So I think that that makes sense. I don't quite understand what partial orders between events can be expressed here. It seems like sort of series-parallel graphs or something, because I can say, well, TA and CA are unordered, but they come before GF.

>> Stefan Leue: Right.
>>: So can I express -- is it only series-parallel relations that I can express that way, or are there other partial orders that can be expressed?
>> Stefan Leue: I believe that you could express all partial orders. But let me go back to the definition, okay? There's an issue that we also discussed, which is that, for instance, if you have some initialization sequences, right, they occur again and again on the bad and the good traces, right.
And we currently consider them to be part of what Halpern and Pearl would call the causal process that sort of leads up to the effect, but we currently don't have any internal structure by which we can identify some of the events as being higher up in a causal hierarchy than other events, as being root causes of other events, right. This would be a refinement of this model, but it's not actually currently in there.
And the same thing about invariants. We do not currently identify sort of invariant orders.

>>: I have a question [indiscernible] --
>>: You can obviously write conditions of bad states that are unsatisfiable. But if you can write a condition that is not possible, then it isn't an explanation of anything. I mean, it is [indiscernible]. But here the set of events is fixed. Given the fixed set of events, you want the most concise explanation possible. If the set of events is just what the property [indiscernible], that's it. But if you want to have a bigger set of events that may lead there from the initial state, then this is more general. Does that make sense?
>> Stefan Leue: Okay. We're doing a breadth-first search, and basically whenever a good or bad execution is found, we're adding it to the subset graph; we're obtaining these traces by iterating backwards through the predecessor links. I think that's a quite standard technique. When we reach a duplicate and the new trace is of the same length, it may have a different order; when the new trace is longer, we store it for the later OC1 test, the order-constraint test, basically.
For causality checking with DFS, we basically add the trace to the subset graph whenever a good or a bad trace is found; even a prefix counts as a good trace. So the traces don't necessarily need to be maximal when we put them into the graph structure. And when we find a duplicate there, then again we generate all traces formed by the new prefix and all suffixes of the duplicate, and we add those to the subset graph.
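Putting the pieces together, an on-the-fly variant might maintain the subset graph as traces arrive and, at the end, read off the red candidates via the minimality sub-trace check. The class below is my own much-simplified sketch, not the SpinCause implementation:

```python
from collections import Counter

class SubsetGraph:
    """On-the-fly storage of traces, bucketed by length (simplified sketch)."""

    def __init__(self):
        self.levels = {}                    # length -> {trace: "good" | "bad"}

    def add(self, trace, verdict):
        level = self.levels.setdefault(len(trace), {})
        if trace not in level:              # same-length duplicate: nothing new
            level[trace] = verdict

    def red_candidates(self):
        """Bad traces with no shorter bad sub-trace: the minimality check."""
        def subtrace(small, big):
            return not (Counter(small) - Counter(big))
        bad = [t for lvl in self.levels.values()
               for t, v in lvl.items() if v == "bad"]
        return [t for t in bad
                if not any(subtrace(s, t) for s in bad if len(s) < len(t))]
```

A longer bad trace that contains a shorter bad one as a sub-trace is then filtered out as a non-minimal (orange) candidate.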
Okay. So, complexity. We have not done a complete complexity analysis of this yet. There is a caveat: Eiter and Lukasiewicz proved that for the Halpern-Pearl structured equation model, even for one with only binary variables, computing the causal relationship between events is NP-complete.
However, they also showed later on that for cycle-free causal dependencies, computing causal relationships can be done polynomially, and that's actually what we have; we don't have any sort of circular dependencies in our model. But, again, this is only part of the complexity analysis. Of course, we are paying some penalty compared to simple model checking for all of the storing of traces, for the comparison operations, and for determining the orders. But that remains to be analyzed in terms of complexity.
In terms of experimental evaluation, we have a prototypical implementation of these algorithms called SpinCause, and it is part of the QuantUM tool architecture. QuantUM is a tool that's designed to do functional safety analysis for basically UML, SysML types of models, and we go either via PRISM, to do some probabilistic model checking, or via Promela/SPIN, which is what I've illustrated here, for simple functional properties. And then here is the causality checker, which computes these causalities, and we finally visualize them using fault trees or, alternatively, UML sequence diagrams. In particular, fault trees are very suitable here, and they are a standard notation in engineering practice.
This is the fault tree that we're getting for the railroad crossing example, and it actually shows the disjunction of two of the formulae that we have computed. Again, it may be possible to do more sharing of events, to do minimal cut set analysis, and the like; this is not yet something that we have included. We're using these fault trees at the moment merely as a visual notation, a visual representation for these orders.
By the way, the order constraints that we can express here in the event order logic are richer than the orders you can express with so-called dynamic fault trees, which simply say this event happens before that event and so on. So we're using a logic that is more expressive than what you can express in fault trees; that's why we are providing this as a side annotation there.
As for the second case study that we've carried out: the first one, certainly in terms of its size, is a toy example. The airbag control unit, in contrast, is an industrial-strength model that models the architecture of an airbag control system, and we analyze the hazard of the airbag deploying inadvertently, without a crash actually occurring. We've done this analysis and obtained this fault tree here.
Unfortunately, I can't go into the details of what this model actually says, but let me just give you some statistics here. It's a decently sized model with over 20,000 bad executions. And we do compute order constraints; this is the order constraint that corresponds to the last disjunct of the explanation in the fault tree.
>>: Is this the full fault tree for the --

>> Stefan Leue: This is the full fault tree for that model.

>>: 20,000 executions, is there 20,000 ways in which they can --
>> Stefan Leue: Yes, 20,000 ways in which it can inadvertently deploy; ones that are not the intended ones, um-hmm. Okay. We could probably do some more sophisticated, domain-specific causal analysis here, because when you get a property violation of the model, this can be either because you have modeled both the correct and the failure behavior and the failure behavior led to the property violation, or because your normal-behavior model was incorrect, right.
And so what we did here was actually to include the failure behavior in the model as well -- the anticipated component failures -- and basically this fault tree represents events that are not bugs that the modeler, the designer, introduced, but anticipated faults of the actual system, right. In that sense, it's a little bit different from the previous example, where we did basically model debugging, if you like. But it's applicable to both settings, right. And there is a strategy in functional safety analysis to come up with models, for instance in SysML, that contain both the normal operation behavior as well as the error behavior.
Okay. Please?

>>: [inaudible].
>> Stefan Leue: I have a slide on that, okay. It's being carried out on a compute server that we have in our lab. As I mentioned, we have two approaches: the on-the-fly approach that I just described and the offline approach. In the offline approach, we first precompute all of the traces and then do the analysis, whereas here we're doing it on the fly.
If we look at the airbag case study, we can see that -- okay, sorry. MC is the pure model checking. CC1 is causality checking with the AC2(2) test, the test for non-occurrence of events being causal, disabled. And CC2 is where that is enabled. So we're seeing, first of all, that if we compare depth-first search with breadth-first search, both in terms of run time and in terms of memory consumption, the breadth-first search has advantages. That is because we need to generate fewer traces: we can rely on the shortest-trace assumption and, when doing a breadth-first search, more effectively throw away the irrelevant traces that we don't need to store.
It also shows that there's some gain if we disable the AC2(2) test step. What we can also conclude is that the on-the-fly approach somewhat outperforms the offline approach both in terms of run time and -- sorry, here, this is what I should compare -- both in terms of run time and in terms of memory. That is clear because we don't need to store all of the traces; we can avoid storing the longer-than-necessary traces. So the ideal combination seems to be to combine the on-the-fly approach with the breadth-first search in the analysis.
This is what I mentioned: there's something that I think was included in the abstract but that I did not really get around to addressing, which is doing this with probabilities, which engineers like to see in functional safety. We have the whole story also implemented for the PRISM probabilistic model checker, where we can attach probabilities to the system going into an error state, and we are there able to compute the fault tree with probabilities as well, where this is basically the probability of what's called the top-level event, which is the hazard that corresponds to the inadvertent deployment.
I should say that at the moment we only have the offline version of this probabilistic approach implemented, but we plan to do the on-the-fly version soon as well.
Okay. So what's the conclusion? Very briefly, I showed you a technique that actually complements model checking. The aim is algorithmic support for the debugging of models. We defined an adapted structured equation model, proposed an implementation, and showed that it's applicable to non-trivial case studies; certainly, at the case-study level more needs to be done, and we want to do more comprehensive tests.
What's future work? In general, there is the question of causality checking at the limits of scalability, which we also discussed over lunch: when you can't explore the whole state space, what can you do? Can you define sort of equivalence classes, partial-order style, that give you all of the interleavings without having to explore all of them, and so on and so forth? How does it work together with abstractions?
In general, causality checking in a symbolic environment would be interesting to look at. Two weeks ago I talked to [indiscernible], and he thinks that he and his system can do most of the things we're doing here in a symbolic setting, except that he can't work with orders and with non-occurrence of events. But still, I think it's interesting to look into that.
Then there is online causality checking for probabilistic models, and specific adaptations to functional safety analysis: as I said, minimal cut sets, root causes, common and cascading causes.
Very briefly, I'm listing some related work here. This is, of course, not all of the related work; there's a large amount of literature on explaining counterexamples. Tom, you did some work on that a number of years ago; Alex Groce did work on that, and so on. But these are works where Halpern and Pearl's model of causality was taken and embedded in some sort of computational environment.
There's work done at IBM's [indiscernible] labs, where they compute, along a counterexample, the points where a complex LTL formula can no longer be satisfied, and they refer to basically that model of causality. There is work by [indiscernible] and others on what causes a system to satisfy a specification, which relates the structured equation model by Halpern and Pearl to model coverage and defines a notion of relevance of model components. And there is work that was actually done relatively locally here, which we found out about only about ten days ago, on tracing data errors with view-conditioned causality, by authors from the University of Washington as well as one co-author from Microsoft Research; it tries to relate surprising query results to errors in the queried data, using counterfactual reasoning and SAT solving.
Okay. With that, I would like to thank you for your interest and patience and
I'm glad we had already extensive discussions, but if there are more questions,
I'm happy to take them.
>> Tom Ball: Thanks very much. I think we'll end it there. If you have more
questions, one on one, he's here for the whole week. So if you want to set up
some more time, please let me know.
>> Stefan Leue: Thank you.