Reasoning with Uncertainty 1999/2000
Frans Voorbraak
In these notes a number of formalisms for reasoning with (numeric) uncertainty in AI are treated. The emphasis is on a systematic presentation of the relevant definitions and results. In this the notes supplement the book Representing Uncertain Knowledge by Krause and Clark, which has an emphasis on informal discussions concerning the relative merits of the different formalisms.
The considered formalisms are Probability Theory and some of its generalisations, Dempster-Shafer Theory, and the Certainty Factor Model. Fuzzy Sets and Logic and the theory behind probabilistic networks will be treated in accompanying notes. The notes are mostly self-contained, but it is often convenient to read them along with the relevant material in Representing Uncertain Knowledge.
1 Probability Theory
1.1 Introduction
In this chapter we consider probability theory as a formalism for representing uncertainty in knowledge-based systems. Probability theory is by
far the best developed formalism for reasoning with uncertainty, and it has
many interpretations. The most relevant interpretation for our purpose is
probability as a degree of belief of an ideally rational agent.
We discuss the famous Dutch book argument, which is supposed to show
that degrees of belief of an ideally rational agent not only
can but should
be represented by means of probabilities. In spite of this argument, several
alternative approaches to reasoning with uncertainty in AI have been proposed, since a naive application of probability theory to uncertain reasoning
in knowledge-based systems leads to several problems. Some of these problems are discussed. But first some basic material of axiomatic probability
theory is treated.
1.2 Probability Functions
Probabilities can be assigned to events, which can be formalised as subsets of a sample space Ω (also called a frame). A sample space denotes the certain or sure event and its elements denote all (mutually exclusive) considered possibilities, such as all possible outcomes of some experiment. If A denotes a subset of a sample space Ω, then A is the event that one of the possibilities in A is the case. The empty set ∅ denotes the impossible event.
In order for probabilities to be defined, one has to impose some structure on the set of considered events, and one usually assumes this set to be an algebra (also called a field) on the sample space. Some authors have challenged this assumption on the structure of events to which probabilities are assigned, but the assumption is made in any standard treatment of axiomatic probability theory.
Definition 1.2.1 Let Ω be a set. An algebra Σ on Ω is a set of subsets of Ω which satisfies the following three conditions.
1. (inclusion of sample set) Ω ∈ Σ. (1)
2. (closure under finite unions) A, B ∈ Σ ⇒ A ∪ B ∈ Σ. (2)
3. (closure under complementation) A ∈ Σ ⇒ Ā ∈ Σ. (3)
(Here Ā is the complement of A with respect to Ω.)
A σ-algebra on Ω is an algebra on Ω which additionally satisfies the following property of closure under countable unions.
For every countable I, {A_i : i ∈ I} ⊆ Σ ⇒ ⋃_{i∈I} A_i ∈ Σ. (4)
The notion of a σ-algebra does not play an important role in these notes, since we will soon restrict ourselves to finite sample spaces, where closure under countable unions collapses to closure under finite unions.
The powerset 2^Ω of Ω is an algebra on Ω and one often encounters probabilities defined on this powerset algebra. This powerset algebra 2^Ω is the maximal algebra on Ω. The minimal algebra on Ω is the trivial algebra {∅, Ω}. In general there are many algebras in between.
The following proposition lists some properties of algebras. The proofs are omitted and left as exercises. (This will be standard practice.)
Proposition 1.2.1 Assume that Σ is an algebra on Ω. The following holds.
1. 2^Ω and {∅, Ω} are algebras on Ω, and {∅, Ω} ⊆ Σ ⊆ 2^Ω.
2. For every finite I, {A_i : i ∈ I} ⊆ Σ ⇒ ⋃_{i∈I} A_i ∈ Σ.
3. (closure under finite intersections) A, B ∈ Σ ⇒ A ∩ B ∈ Σ.
4. If Σ is a σ-algebra, then closure under countable intersections holds.
If Δ is a subset of an algebra Σ such that the members of Δ are non-empty and pairwise disjoint and every element of Σ is a countable union of members of Δ, then Δ is called a basis of Σ. Not every algebra has a basis, but the existence of a basis is guaranteed if the algebra is finite, and, in particular, if the algebra is defined on a finite sample space.
Any countable set Δ of subsets of Ω can generate an algebra on Ω by taking the closure of Δ ∪ {Ω} under (countable) unions and complementation. The thus obtained (σ-)algebra is called the (σ-)algebra generated by Δ.
Definition 1.2.2 Let Σ be an algebra on Ω. P is called a probability function on Σ iff P is a real-valued function on Σ satisfying the following three conditions.
1. (non-negativity) For all A ∈ Σ, P(A) ≥ 0. (5)
2. (unit normalisation) P(Ω) = 1. (6)
3. (finite additivity) For all disjoint A, B ∈ Σ, P(A ∪ B) = P(A) + P(B). (7)
If Σ is a σ-algebra on Ω, then P is called a probability measure on Σ iff P is a probability function on Σ which additionally satisfies the following property of countable additivity.
If {A_i : i ∈ I} is a countable set of pairwise disjoint elements of Σ, then P(⋃_{i∈I} A_i) = ∑_{i∈I} P(A_i). (8)
Definition 1.2.3 A probability space is a tuple ⟨Ω, Σ, P⟩, such that Σ is a σ-algebra on Ω and P is a probability measure on Σ.
If ⟨Ω, Σ, P⟩ is a probability space, then Ω is called its sample space. The elements of Ω are called elementary events, and the elements of Σ are called measurable sets or measurable events. If Σ is a σ-algebra on Ω, then PROB(Ω, Σ) denotes the class of all probability measures on Σ. Instead of PROB(Ω, 2^Ω), we sometimes simply write PROB(Ω). Elements of PROB(Ω) are called probability measures over Ω.
From now on, we assume, unless explicitly stated otherwise, that the sample space Ω is finite. Some consequences of this assumption are that the distinction between probability function and probability measure vanishes, and that every algebra has a basis. Notice that if Δ is a basis of an algebra Σ, then a probability function P on Σ is determined by its values on the members of Δ.
We list some useful properties of probability functions.
Proposition 1.2.2 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A and B are elements of Σ.
1. P(A) = P(A ∩ B) + P(A ∩ B̄).
2. P(Ā) = 1 − P(A).
3. P(∅) = 0. (zero normalisation)
4. P(A) ≤ 1.
5. B ⊆ A ⇒ P(A) − P(B) = P(A ∩ B̄).
6. B ⊆ A ⇒ P(B) ≤ P(A). (monotonicity)
7. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
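As an illustration, a finite probability space can be represented by assigning a weight to each elementary event; the probability of an event (a subset of the sample space) is then the sum of the weights of its elements. The following minimal Python sketch does this for a die roll and checks a few of the properties above; all names (omega, weights, prob) are purely illustrative.

    # Sample space of a fair six-sided die, with one weight per elementary event.
    omega = {1, 2, 3, 4, 5, 6}
    weights = {w: 1/6 for w in omega}

    def prob(event):
        """P(event): sum the weights of the elementary events in the given subset of omega."""
        return sum(weights[w] for w in event)

    A = {2, 4, 6}   # the even outcomes
    B = {4, 5, 6}   # the outcomes of at least 4

    # A few of the properties listed in proposition 1.2.2:
    assert prob(set()) == 0                                              # zero normalisation
    assert abs(prob(omega - A) - (1 - prob(A))) < 1e-12                  # P(complement) = 1 - P(A)
    assert abs(prob(A | B) - (prob(A) + prob(B) - prob(A & B))) < 1e-12  # property 7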
The latter result can be generalised to a formula which is interesting in the light of the future discussion of Dempster-Shafer theory.
Proposition 1.2.3 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A_1, A_2, ..., A_n are elements of Σ. For all n ≥ 1,
P(⋃_{i=1}^n A_i) = ∑_{∅≠I⊆{1,...,n}} (−1)^{|I|+1} P(⋂_{i∈I} A_i). (9)
(Try to understand equation (9) by writing it out for small n.)
The first property of proposition 1.2.2 can also be generalised. A set {B_1, B_2, ..., B_n} is called a partition of Ω if its members are pairwise disjoint and B_1 ∪ B_2 ∪ ... ∪ B_n = Ω.
Proposition 1.2.4 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A, B_1, B_2, ..., B_n are elements of Σ such that {B_1, B_2, ..., B_n} is a partition of Ω. Then
P(A) = P(A ∩ B_1) + P(A ∩ B_2) + ··· + P(A ∩ B_n). (10)
Some more properties of probability functions will be given further on. We close this section with the observation that probability functions could have been defined as functions on formulas of a formal language (instead of functions on sets). In the following we make this more precise.
Assume L to be a propositional language built up from a set of propositional letters PL with the usual propositional connectives ¬, ∧, ∨, →, ↔. In line with our assumption that the sample space is finite, we assume that PL is finite. Formulas of L are denoted by φ, ψ, ....
Definition 1.2.4 P is called a probability function on L iff P is a real-valued function on L satisfying the following three conditions.
1. (non-negativity) For all φ ∈ L, P(φ) ≥ 0. (11)
2. (unit normalisation) For all φ ∈ L, ⊨ φ ⇒ P(φ) = 1. (12)
3. (finite additivity) For all φ, ψ ∈ L, ⊨ ¬(φ ∧ ψ) ⇒ P(φ ∨ ψ) = P(φ) + P(ψ). (13)
This definition is of course closely related to definition 1.2.2 of probability functions on sets. To establish an exact connection, notice that the following can be proved.
For all φ, ψ ∈ L, ⊨ φ ↔ ψ ⇒ P(φ) = P(ψ). (14)
Equation (14) tells us that the probability functions of definition 1.2.4 can be viewed as (probability) functions on the Lindenbaum algebra of L, which is, as the name suggests, an algebra in the sense of definition 1.2.1.
One encounters a small complication when using probability functions defined on formulas. In general, a propositional letter does not correspond to an elementary event. Let us define an elementary proposition to be a conjunction of literals, such that every propositional letter appears exactly once in the conjunction. An elementary proposition is a most specific proposition and completely characterises a possible world or valuation.
The notion of an elementary proposition is the correct counterpart of the notion of an elementary event of a sample space. However, if there are n propositional letters, then there are 2^n non-equivalent elementary propositions, whereas not every sample space has a cardinality which is a power of 2. (Consider, for example, the obvious sample space representing the outcomes of a roll of a (six-sided) die.)
Therefore, definition 1.2.4 has to be augmented to allow the encoding of some information about the particular situation that is being modelled. This can be done, but we save ourselves the trouble and use set functions in the rest of these notes.
Exercise 1.2.1 Prove proposition 1.2.1.
Exercise 1.2.2 Give an example of an algebra Σ on some sample space Ω such that Σ is neither 2^Ω nor {∅, Ω}.
Exercise 1.2.3 Show that every finite algebra has a basis.
Exercise 1.2.4 Prove proposition 1.2.2.
Exercise 1.2.5 Formulate the results of proposition 1.2.2 in terms of probability functions defined on formulas, and convince yourself that the proofs you have found in the previous exercise carry over to the context of probabilities defined on formulas.
Exercise 1.2.6 Prove equation (14).
Exercise 1.2.7 Consider the experiment that consists of a single roll of a fair (ordinary, six-sided) die. Give a probability space representing this experiment. What would a representation using probability functions on formulas look like?
1.3 Conditional Probability and Bayes' Rule
Definition 1.3.1 Let P ∈ PROB(Ω, Σ) and A ∈ Σ such that P(A) > 0.
1. For every B ∈ Σ, we define P(B|A), the (conditional) probability of B given A, as follows.
P(B|A) = P(A ∩ B) / P(A). (15)
2. The function P_A (P conditioned by A) on Σ is given by the following equation.
For every B ∈ Σ, P_A(B) = P(B|A). (16)
It is easy to see that the function P_A defined above is a probability function on Σ, and that P_Ω = P. Hence conditioning can be seen as a process for revising or updating probability functions, where P_A is the result of revising P after obtaining the evidence represented by A. This process is order-independent in the sense that (P_A)_B = (P_B)_A = P_{A∩B}, whenever P(A ∩ B) > 0.
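As an illustration, the following Python sketch (with illustrative names) implements conditioning for a finite weighted sample space and checks the order-independence property (P_A)_B = (P_B)_A = P_{A∩B}.

    def condition(weights, A):
        """Return the weights of P_A, i.e. P conditioned by the event A (a set of outcomes)."""
        pA = sum(weights[w] for w in A)
        if pA == 0:
            raise ValueError("conditioning event has probability zero")
        return {w: (weights[w] / pA if w in A else 0.0) for w in weights}

    # A fair die; A = "even outcome", B = "outcome at least 4".
    weights = {w: 1/6 for w in range(1, 7)}
    A, B = {2, 4, 6}, {4, 5, 6}

    PA_then_B = condition(condition(weights, A), B)
    PB_then_A = condition(condition(weights, B), A)
    P_AandB   = condition(weights, A & B)

    # Order-independence: (P_A)_B = (P_B)_A = P_{A∩B} when P(A∩B) > 0.
    assert all(abs(PA_then_B[w] - PB_then_A[w]) < 1e-12 for w in weights)
    assert all(abs(PA_then_B[w] - P_AandB[w]) < 1e-12 for w in weights)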
Some other useful properties are given below.
Proposition 1.3.1 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A, B_1, B_2, ..., B_n are elements of Σ such that {B_1, B_2, ..., B_n} is a partition of Ω and for every i ∈ {1, 2, ..., n}, P(B_i) > 0. Then
P(A) = ∑_{i=1}^n P(A|B_i) · P(B_i). (17)
Proposition 1.3.2 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A_1, A_2, ..., A_n are elements of Σ such that P(⋂_{i=1}^{n−1} A_i) > 0. Then
P(⋂_{i=1}^n A_i) = P(A_1) · P(A_2|A_1) · P(A_3|A_1 ∩ A_2) · ... · P(A_n | ⋂_{i=1}^{n−1} A_i). (18)
The above result is known as the chaining rule. An easy corollary of this chaining rule is that for any P ∈ PROB(Ω, Σ), P(A|C) = P(A|B) · P(B|C), whenever A, B, C ∈ Σ, A ⊆ B ⊆ C, and P(B) > 0. This property essentially characterises the conditioning process, as is shown below.
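The corollary can be checked numerically; a minimal sketch for nested events of a fair die:

    def prob(weights, E):
        return sum(weights[w] for w in E)

    def cond(weights, E, F):
        """P(E|F) for a finite weighted sample space."""
        return prob(weights, E & F) / prob(weights, F)

    weights = {w: 1/6 for w in range(1, 7)}
    A, B, C = {6}, {4, 5, 6}, {2, 3, 4, 5, 6}   # A ⊆ B ⊆ C

    # P(A|C) = P(A|B) · P(B|C) for nested events.
    assert abs(cond(weights, A, C) - cond(weights, A, B) * cond(weights, B, C)) < 1e-12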
Proposition 1.3.3 Assume that P ∈ PROB(Ω, Σ). For every C ∈ Σ such that P(C) > 0, let F_{P,C} denote some probability function on Σ. Then the following two statements are equivalent.
1. F_{P,C} = P_C for every C ∈ Σ with P(C) > 0.
2. F satisfies the three conditions below, where A, B, C range over Σ (with P(C) > 0).
(a) F_{P,Ω} = P.
(b) F_{P,C}(A) = F_{P,C}(A ∩ C).
(c) If P(B) > 0 and A ⊆ B ⊆ C, then F_{P,C}(A) = F_{P,B}(A) · F_{P,C}(B).
Proof. That 1 implies 2 is immediate. To show that 2 implies 1, assume that the conditions listed under 2 are satisfied and let A be an arbitrary subset of Ω. Since A ∩ C ⊆ C ⊆ Ω, we have F_{P,Ω}(A ∩ C) = F_{P,C}(A ∩ C) · F_{P,Ω}(C). Using F_{P,Ω} = P, we obtain P(A ∩ C) = F_{P,C}(A ∩ C) · P(C), and thus F_{P,C}(A ∩ C) = P(A ∩ C)/P(C) = P_C(A). Since F_{P,C}(A) = F_{P,C}(A ∩ C), it follows that F_{P,C} = P_C.
For the computation of (conditional) probabilities a simple result due to Thomas Bayes has proved to be very useful. This result, known as Bayes' rule or Bayes' theorem, solves a frequently encountered problem in applying probability theory to knowledge-based systems.
For example, in (medical) diagnosis, one is interested in the conditional probability of a hypothesis (disease) H given some evidence (symptom) E, but it is usually much easier to obtain P(E|H) (from experts or statistical records) than to obtain P(H|E). Bayes' rule allows one to compute P(H|E) from P(E|H) and the prior probabilities of H and E.
Proposition 1.3.4 (Bayes) Let ⟨Ω, Σ, P⟩ be a probability space. Assume that H, E ∈ Σ such that P(H) > 0 and P(E) > 0. Then
P(H|E) = P(E|H) · P(H) / P(E). (19)
Proof. P(H|E) = P(H ∩ E) / P(E) = (P(H ∩ E) / P(H)) · P(H) / P(E) = P(E|H) · P(H) / P(E).
Although the above proposition is an almost trivial theorem of axiomatic probability theory, the results of its application are not so trivial.
Example 1.3.1 A tumour can be either benign (B) or malignant (M). To gain insight into the nature of a tumour, a radiological test can be performed. The result of this kind of test is classified as either positive (+) or negative (−) with respect to cancer (a malignant tumour). The reliability of such a test is given as follows. Its false positive rate (= the probability that a benign tumour is classified as malignant) is 0.096, and its false negative rate (= the probability that a malignant tumour is classified as benign) is 0.208. Thus, the probability P(+|M) that a malignant tumour produces a positive test result is 0.792.
Now suppose that some particular tumour has a prior probability of 0.01 of being malignant and that it produces a positive test result. In these circumstances, among physicians, the probability that the tumour is malignant is typically assessed to be about 0.75. However, using Bayes' rule one can compute P(M|+) from the data as follows.
P(M|+) = P(+|M) · P(M) / P(+) = P(+|M) · P(M) / (P(+|M) · P(M) + P(+|M̄) · P(M̄)) = (0.792 · 0.01) / (0.792 · 0.01 + 0.096 · 0.99) ≈ 0.077.
Thus, the typical assessment does not even come close to the theoretically correct probability. Possible explanations of this phenomenon are that people often forget to take into account the effect of the prior probability (the base rate fallacy), or that they simply fail to distinguish the probabilities P(A|B) and P(B|A).
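The computation can be spelled out in a few lines of Python; a minimal sketch with the numbers of the example (variable names are illustrative):

    # False positive and false negative rates of the radiological test.
    p_pos_given_benign    = 0.096                      # P(+ | B)
    p_neg_given_malignant = 0.208                      # P(- | M)
    p_pos_given_malignant = 1 - p_neg_given_malignant  # P(+ | M) = 0.792

    p_malignant = 0.01                                 # prior P(M)

    # Bayes' rule, using the partition {M, B} of the sample space for P(+).
    p_pos = (p_pos_given_malignant * p_malignant
             + p_pos_given_benign * (1 - p_malignant))
    p_malignant_given_pos = p_pos_given_malignant * p_malignant / p_pos

    print(round(p_malignant_given_pos, 3))   # 0.077, far below the typical estimate of 0.75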
Although Bayes' rule as formulated in proposition 1.3.4 may give rise to interesting results, one needs a more general formulation if one intends to use the rule in a system for medical diagnosis. In that case, it is not sufficient to calculate the probability of a particular disease given a single symptom. Instead, one is interested in the probability of each considered disease given any combination of symptoms.
To obtain a more general formulation of proposition 1.3.4, simply replace the single symptom E by an arbitrary combination of symptoms. If the hypotheses form a partition of the sample space (as is the case in the above example), then one can use proposition 1.3.1 to dispose of the need to know the prior probabilities of the evidence. One then obtains the following version of Bayes' rule.
Proposition 1.3.5 (Bayes) Let ⟨Ω, Σ, P⟩ be a probability space. Assume that {H_i : i ∈ I} ⊆ Σ is a partition of Ω such that for all i ∈ I, P(H_i) > 0. Further assume that {E_k : k ∈ K} ⊆ Σ such that P(⋂_{k∈K} E_k) > 0. Then we have, for all i ∈ I,
P(H_i | ⋂_{k∈K} E_k) = P(⋂_{k∈K} E_k | H_i) · P(H_i) / ∑_{j∈I} (P(⋂_{k∈K} E_k | H_j) · P(H_j)). (20)
We will see later that there is still a long way to go from the above result
to a method for automatic (medical) diagnosis. We end this section with
an alternative formulation of Bayes' rule which makes use of the notions of
odds and likelihood.
Definition 1.3.2 Let ⟨Ω, Σ, P⟩ be a probability space, and let E, H ∈ Σ. The (prior) odds O(H), the posterior odds O(H|E), and the likelihood ratio λ(H|E) of H given E are defined as follows.
1. Assume that P(H̄) > 0.
O(H) = P(H) / P(H̄) = P(H) / (1 − P(H)). (21)
2. Assume that P(H̄|E) > 0.
O(H|E) = P(H|E) / P(H̄|E) = P(H|E) / (1 − P(H|E)). (22)
3. Assume that P(E|H̄) > 0.
λ(H|E) = P(E|H) / P(E|H̄). (23)
Probability theory can be given an equivalent formulation by using odds instead of probability. Bayes' rule can be seen as a rule for updating odds.
Proposition 1.3.6 (Bayes) Let ⟨Ω, Σ, P⟩ be a probability space. Assume that E and H are elements of Σ such that P(H̄), P(H̄|E), P(E|H̄) > 0. Then
O(H|E) = λ(H|E) · O(H). (24)
The likelihood ratio of H given E tells us how to update the odds on H in the light of the evidence E. A high (≫ 1) value of λ(H|E) corresponds to the situation in which acquiring the evidence E lends much support to the truth of H.
In these notes the odds-likelihood formulation of probability theory will
hardly be used. However, the notion of odds will make a second appearance
in section 1.5 on the Dutch book argument.
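As an illustration, the odds-likelihood form of Bayes' rule applied to the numbers of example 1.3.1 yields the same posterior probability; a minimal sketch:

    p_M = 0.01                      # prior P(M)
    lam = 0.792 / 0.096             # likelihood ratio λ(M|+) = P(+|M) / P(+|not M) = 8.25

    prior_odds = p_M / (1 - p_M)                            # O(M)
    posterior_odds = lam * prior_odds                       # O(M|+) = λ(M|+) · O(M)
    p_M_given_pos = posterior_odds / (1 + posterior_odds)   # back to a probability

    print(round(p_M_given_pos, 3))  # 0.077, as in example 1.3.1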
Exercise 1.3.1 Show that the function P_A defined in definition 1.3.1 is a probability function on Σ.
Exercise 1.3.2 Show that (P_A)_B = (P_B)_A = P_{A∩B}, when P(A ∩ B) > 0.
Exercise 1.3.3 Prove proposition 1.3.1.
Exercise 1.3.4 Assume that P ∈ PROB(Ω, Σ), A, B, C ∈ Σ, A ⊆ B ⊆ C, and P(B) > 0. Show that P(A|C) = P(A|B) · P(B|C).
Exercise 1.3.5 One evening, a taxi causes a serious hit-and-run accident. According to an eye-witness, the taxi involved was blue. The firm Blue Star owns 15 of the 100 taxis in town and those 15 taxis are blue; the other 85 belong to the firm Green Cheese and are green. Tests (under conditions similar to those of that evening) show that in 80% of the cases the eye-witness is able to correctly identify the colour of a blue taxi, and the same percentage is scored for green taxis. What is the probability that the taxi involved was blue? (First guess the probability, then compute it using Bayes' rule.)
Exercise 1.3.6 Show that 0 ≤ P(A) < 1 ⇒ P(A) = O(A) / (1 + O(A)).
Exercise 1.3.7 Prove proposition 1.3.6.
Exercise 1.3.8 Let ⟨Ω, Σ, P⟩ be a probability space, and let E, H ∈ Σ such that 0 < P(E|H̄) < 1. Show that
P(E|H) = λ(H|E) · (1 − λ(H|Ē)) / (λ(H|E) − λ(H|Ē)).
Express P(E|H̄) in terms of λ(H|E) and λ(H|Ē).
1.4 Independence and Computational Issues
There is one basic notion of probability theory that has yet to be introduced, namely the notion of independence.
Definition 1.4.1 Let ⟨Ω, Σ, P⟩ be a probability space, and let A, B ∈ Σ. The sets A and B are called independent iff
P(A ∩ B) = P(A) · P(B). (25)
Alternatively, one can use the following definition.
Definition 1.4.2 Let ⟨Ω, Σ, P⟩ be a probability space, and let A, B ∈ Σ. The set A is called independent from B iff
P(B) > 0 ⇒ P(A|B) = P(A). (26)
It is easy to show that A is independent from B iff A and B are independent. There are at least two reasonable ways to generalise the notion of independence to collections of more than two events.
Definition 1.4.3 Let ⟨Ω, Σ, P⟩ be a probability space. Assume that for some n ≥ 2, the sets A_1, A_2, ..., A_n are elements of Σ.
1. A_1, A_2, ..., A_n are called (completely) independent iff
for every I ⊆ {1, 2, ..., n}, P(⋂_{i∈I} A_i) = ∏_{i∈I} P(A_i). (27)
2. A_1, A_2, ..., A_n are called pairwise independent iff
for every i, j ∈ {1, 2, ..., n} such that i ≠ j, P(A_i ∩ A_j) = P(A_i) · P(A_j). (28)
If A_1, A_2, ..., A_n are (completely) independent, then they are also pairwise independent, but in general the reverse implication is not valid. For a collection of two events, the above notions are of course equivalent.
One also needs a notion of conditional independence. Below, we give the conditional version of definition 1.4.1. The conditional versions of other notions of independence can be obtained analogously.
Definition 1.4.4 Let ⟨Ω, Σ, P⟩ be a probability space, and let A, B, C ∈ Σ such that P(C) > 0. A and B are called (conditionally) independent given C iff
P(A ∩ B|C) = P(A|C) · P(B|C). (29)
We now return to the problem of designing a system for automatic (medical) diagnosis. A simple scheme would be to feed the system with the prior probabilities P(H_i) of the hypotheses (diseases) and the conditional probabilities P(H_i | ⋂_{k∈K} E_k) of any hypothesis given any combination of the bodies of evidence (symptoms). These probabilities are to be extracted from experts or statistical records. The user of the system then only needs to supply the information about bodies of evidence (symptoms) concerning the particular case at hand.
The problem with this simple scheme is that in practice the number of probabilities that have to be fed to the system is too large. If one considers n different symptoms, then there are 2^n combinations of symptoms. Thus a very modest system which only considers 20 different symptoms already has to be fed more than a million probabilities. Bayes' rule does not provide a solution to this problem, since the rule only allows the probabilities P(H_i | ⋂_{k∈K} E_k) to be computed if the probabilities P(⋂_{k∈K} E_k | H_i) are known.
If one assumes that the symptoms are conditionally independent given each disease, then the number of required probabilities is greatly reduced. One then only needs the probabilities P(E_k | H_i), for each i and k, since P(⋂_{k∈K} E_k | H_i) = ∏_{k∈K} P(E_k | H_i). However, the mentioned assumption is usually highly unrealistic.
The problem would be solved if one could find a way to compute the probability P(H_i | ⋂_{k∈K} E_k) from the values P(H_i | E_k), that is, if one could find a kind of combination function which computes the combined effect of different bodies of evidence. The following result shows that there is no such combination function.
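To see how the conditional-independence assumption reduces the computation, here is a minimal naive-Bayes style sketch; the two hypotheses, two symptoms, and all numbers are made up for illustration.

    import math

    # Hypothetical partition {H, notH} and two symptoms assumed conditionally
    # independent given each hypothesis.
    prior = {'H': 0.1, 'notH': 0.9}
    p_symptom = {
        'H':    {'E1': 0.8, 'E2': 0.6},   # P(E_k | H)
        'notH': {'E1': 0.2, 'E2': 0.3},   # P(E_k | notH)
    }
    observed = ['E1', 'E2']

    # P(H_i | E1 ∩ E2) ∝ P(H_i) · ∏_k P(E_k | H_i), by Bayes' rule and the independence assumption.
    unnormalised = {h: prior[h] * math.prod(p_symptom[h][e] for e in observed) for h in prior}
    total = sum(unnormalised.values())
    posterior = {h: round(v / total, 3) for h, v in unnormalised.items()}

    print(posterior)   # {'H': 0.471, 'notH': 0.529}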
Proposition 1.4.1 (Neapolitan) Let Σ be the algebra generated by the subsets A, B, and C of Ω. There is in general no function F which computes for any probability function P on Σ the value P(A|B ∩ C) from the values of P on the algebras generated by at most two elements from {A, B, C}.
Proof. Consider the probability functions P_1 and P_2 defined as follows.
P_1(A ∩ B ∩ C) = P_1(A ∩ B̄ ∩ C̄) = P_1(Ā ∩ B ∩ C̄) = P_1(Ā ∩ B̄ ∩ C) = 0.25.
P_2(Ā ∩ B ∩ C) = P_2(A ∩ B̄ ∩ C) = P_2(A ∩ B ∩ C̄) = P_2(Ā ∩ B̄ ∩ C̄) = 0.25.
(See figure 1.) P_1 and P_2 are identical on algebras generated by at most two elements of {A, B, C}. (This can be easily seen by removing one element of {A, B, C} in figure 1.)
Figure 1: The probability functions P_1 (left) and P_2 (right) mentioned in the proof of Neapolitan's result.
Now suppose there is a function F as described in the proposition. Then F applied to the values of P_1 yields the same result as F applied to the values of P_2. But P_1(A|B ∩ C) = 1, whereas P_2(A|B ∩ C) = 0. Contradiction. (Notice that we have even proved that there exists no function F which approximates P(A|B ∩ C).)
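The proof can be verified numerically; a minimal sketch in which elementary events are encoded as 0/1 triples indicating membership in A, B, and C:

    from itertools import product

    outcomes = list(product([0, 1], repeat=3))

    # P1 is uniform on the outcomes with an even number of 0s, P2 on those with an odd number.
    P1 = {w: (0.25 if w.count(0) % 2 == 0 else 0.0) for w in outcomes}
    P2 = {w: (0.25 if w.count(0) % 2 == 1 else 0.0) for w in outcomes}

    A = {w for w in outcomes if w[0]}
    B = {w for w in outcomes if w[1]}
    C = {w for w in outcomes if w[2]}

    def prob(P, event):
        return sum(P[w] for w in event)

    # P1 and P2 agree on the events determined by at most two of A, B, C
    # (checking a generating family of such events is enough) ...
    for X, Y in [(A, B), (A, C), (B, C)]:
        for event in [X, Y, X & Y, X | Y, X - Y, Y - X]:
            assert abs(prob(P1, event) - prob(P2, event)) < 1e-12

    # ... yet they disagree completely on P(A | B ∩ C).
    assert prob(P1, A & B & C) / prob(P1, B & C) == 1.0
    assert prob(P2, A & B & C) / prob(P2, B & C) == 0.0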
In general, there is no simple (justified) way to avoid the combinatorial
explosion of the required probabilities, and this combinatorial explosion is
one of the main reasons why people started to look for alternative approaches
to reasoning with uncertainty in AI. We will discuss some of these alternatives (the Certainty Factor Model and Dempster-Shafer Theory) later.
However, the simple scheme sketched above can be much improved if one takes into consideration the structure of the knowledge domain.
Example 1.4.1 Consider the problem of finding the most plausible diagnosis among {H_1, H_2, ..., H_10} based on the (presence or absence) of the symptoms {E_1, E_2, ..., E_20}. The simple scheme would need 10 · 2^20 (more than ten million) conditional probabilities.
Assume the following additional information. The hypotheses can be divided into two classes H_a = {H_1, H_2, ..., H_5} and H_b = {H_6, H_7, ..., H_10}, and the symptoms can be divided into three classes E_a = {E_1, E_2, ..., E_7}, E_b = {E_8, E_9, ..., E_14}, and E_c = {E_15, E_16, ..., E_20}. The symptoms from E_a are only relevant to find the most plausible diagnosis among H_a, the symptoms from E_b are only relevant to find the most plausible diagnosis among H_b, and the symptoms from E_c are only relevant to find the most plausible (partial) diagnosis from {H_a, H_b}.
In that case, the number of required conditional probabilities is reduced to 2 · 2^6 + 5 · 2^7 + 5 · 2^7 = 1408.
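The two counts of the example, spelled out:

    # Simple scheme: one conditional probability per hypothesis per combination of 20 symptoms.
    naive_count = 10 * 2**20                             # 10,485,760

    # Structured scheme: E_c (6 symptoms) only distinguishes the classes H_a and H_b,
    # while E_a and E_b (7 symptoms each) only matter within their own class of 5 hypotheses.
    structured_count = 2 * 2**6 + 5 * 2**7 + 5 * 2**7    # 1408

    print(naive_count, structured_count)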
The above example illustrates the fact that the number of required probabilities can be greatly reduced if the problem can be divided up into smaller independent subproblems. Essentially this technique is used in some sophisticated methods for probabilistic reasoning that have recently become available. These methods, mainly developed by Pearl, Lauritzen, and Spiegelhalter, have one feature in common: they all exploit conditional independencies implied by a graphical representation of the problem domain. That is why they are called (probabilistic) network models. In many (but not in all!) situations the independency information allows the development of sound and feasible algorithms for updating probabilities in knowledge-based systems.
The nodes of the networks consist of propositional variables, i.e., functions from the sample space to an exhaustive set of mutually exclusive events. In this they differ from inference networks of rule-based systems, where the nodes consist of propositions.
In Pearl's method the problem domain is represented by a causal network, i.e., a directed acyclic graph, where an arrow represents a causal relationship. (In the literature causal networks are also called Bayesian networks or belief networks.) For some classes of causal networks (the singly connected ones) an efficient local probability propagation scheme can be given which updates probabilities by instantiation of certain variables and local communication between nodes.
Pearl's method derives from considerations about the way humans reason and is meant to be a model for human reasoning. Lauritzen and Spiegelhalter are only interested in the (mathematical) problem of propagating probabilities in networks. Since the notion of (conditional) independence is more fundamental to causal networks than the notion of causality, it is no surprise that Lauritzen and Spiegelhalter use undirected graphs.
The propagation scheme of Lauritzen and Spiegelhalter is defined on trees of cliques of a triangulated undirected graph (a clique is a maximal set of nodes such that every pair of distinct nodes is adjacent). Roughly speaking, efficiency is possible whenever the number of variables in the cliques is small, which is not too restrictive a condition, since people tend to represent causal relationships by means of hierarchies of relatively small clusters of variables.
The network models will be discussed in greater detail at the end of the course. It should be stressed, however, that these models do not completely solve the problem of reasoning with uncertainty in AI. In some situations, the methods are not feasible. Moreover, the available probabilistic information is often incomplete, badly calibrated, and incoherent.
Before we discuss some alternative formalisms which address (some of) these problems, we turn to the justification of using probabilities as degrees of belief of a rational agent.
Exercise 1.4.1 Show that A is independent from B iff A and B are independent.
Exercise 1.4.2 Show that pairwise independent A_1, A_2, ..., A_n are not necessarily completely independent.
Exercise 1.4.3 Show that A is independent from itself iff P(A) = 0 or P(A) = 1.
Exercise 1.4.4 Show that the independence of A and B implies that each of the pairs {A, B̄}, {Ā, B}, and {Ā, B̄} is independent.
Exercise 1.4.5 Assume that 0 < P(C) < 1. Show that A and B being independent given C does not imply that A and B are independent given C̄.
Exercise 1.4.6 Show that in general P(⋂_{k∈K} E_k | H_i) cannot be computed from the probabilities P(E_k | H_i).
Exercise 1.4.7 Consider the problem of finding the most plausible diagnosis among {H_1, H_2, ..., H_20} based on the (presence or absence) of the symptoms {E_1, E_2, ..., E_30}. How many conditional probabilities would be needed for the simple scheme?
Try to add some additional information analogous to example 1.4.1, which reduces the number of required probabilities to less than 1000.
1.5 Dutch Book Argument
This section is devoted to the most widely used argument (dating back to Ramsey and De Finetti) in favour of the position that a probability measure is the unique right representation of rational degrees of belief. The argument is known as the Dutch book argument and roughly runs as follows.
Suppose one takes the degree of belief of an ideally rational individual X in a proposition A to be the number q such that he is willing to bet on A at odds q : 1 − q and against A at odds 1 − q : q. Then under some reasonable assumptions it can be shown that X can avoid accepting a set of bets which would result in a sure loss for X (a Dutch book for X) iff X's assignment of degrees of belief constitutes a probability measure.
Definition 1.5.1 Let Σ be an algebra on Ω. A bet on A ∈ Σ is a tuple b = ⟨A, S, q⟩, where S ≥ 0 and 0 ≤ q ≤ 1. S is called the stake of b, q the betting quotient of b, and q : 1 − q the odds of b. A bet against A is a bet on Ā. BET(Ω, Σ) denotes the class of all bets on elements of Σ.
Example 1.5.1 Suppose you wager 10 on the complete outsider Born Loser, which scores 19 : 1 at the bookmakers. Then the (total) stake of your bet equals 1 × 10 + 19 × 10 = 200 and the odds of your bet on Born Loser are 1 : 19. The betting quotient equals 0.05, since 0.05 : 1 − 0.05 = 1 : 19.
The value of a bet on some event A can be determined as soon as it is decided whether A or Ā holds.
Definition 1.5.2 Let b = ⟨A, S, q⟩ ∈ BET(Ω, Σ).
1. B is called b-specific iff B ∈ Σ \ {∅} such that B ⊆ A or B ⊆ Ā.
2. The value ‖b‖_B of b at a b-specific B is defined as follows.
‖b‖_B = (1 − q)S if B ⊆ A, and ‖b‖_B = −qS if B ⊆ Ā.
Example 1.5.2 (Continuation of example 1.5.1.) Suppose you are lucky: Born Loser wins. Then your bet has the value (1 − q)S = (1 − 0.05) × 200 = 190.
Definition 1.5.3 A book β with respect to an algebra Σ on Ω is a finite subset of BET(Ω, Σ) such that ⟨A, S, q⟩, ⟨A, S, q′⟩ ∈ β ⇒ q = q′. The class of all books with respect to Σ is denoted by BOOK(Ω, Σ).
Definition 1.5.4 Let β ∈ BOOK(Ω, Σ).
1. A is called β-specific iff for every b ∈ β, A is b-specific.
2. The value ‖β‖_A of β at a β-specific A is defined as follows.
For any β-specific A, ‖β‖_A =def ∑_{b∈β} ‖b‖_A. (30)
Definition 1.5.5 A book β with respect to Σ is called a Dutch book iff for every β-specific A ∈ Σ, ‖β‖_A < 0.
Definition 1.5.6 The acceptance set of an individual X with respect to an algebra Σ is a set Acc_X(Ω, Σ) ⊆ BET(Ω, Σ) such that
1. If ⟨A, S, q⟩ ∈ Acc_X(Ω, Σ) and α > 0, then ⟨A, αS, q⟩ ∈ Acc_X(Ω, Σ).
2. If ⟨A, S, q⟩ ∈ Acc_X(Ω, Σ) and 0 ≤ q′ ≤ q, then ⟨A, S, q′⟩ ∈ Acc_X(Ω, Σ).
3. For every A ∈ Σ there is exactly one q such that ⟨A, 1, q⟩ and ⟨Ā, 1, 1 − q⟩ ∈ Acc_X(Ω, Σ).
The unique q mentioned above is called X's degree of belief in A. We define bel_X to be the function on Σ such that for every A ∈ Σ, bel_X(A) = X's degree of belief in A. bel_X is called the belief function of Acc_X(Ω, Σ).
The first condition on acceptance sets implies that the acceptability of a bet should not depend on the stake, but only on the event and the betting quotient (or odds). If a bet is accepted, then the second condition requires that bets on the same event and with the same stake, but with a more favourable betting quotient, should also be accepted. The third condition requires that for each event there is a unique breaking point, that is, a betting quotient at which one is indifferent as to which side of the bet one takes (either betting on the event, or against the event at the reverse odds).
Notice that an acceptance set Acc_X(Ω, Σ) is completely determined by (the degrees of belief given by) its belief function bel_X.
Definition 1.5.7 Let Acc_X(Ω, Σ) be an acceptance set and let bel_X be its belief function. The set Acc_X(Ω, Σ) and the degrees of belief given by bel_X are called coherent iff Acc_X(Ω, Σ) does not contain a Dutch book with respect to Σ.
The conditions on an acceptance set do not rule out the possibility that its degrees of belief are incoherent.
Example 1.5.3 Assume that A and B are elements of an algebra Σ such that A ∩ B = ∅. Further assume that bel_X(A) = 0.3, bel_X(B) = 0.2, and bel_X(A ∪ B) = 0.6. Then X is vulnerable to the following Dutch book with respect to Σ: β = {⟨Ā, 1, 0.7⟩, ⟨B̄, 1, 0.8⟩, ⟨A ∪ B, 1, 0.6⟩}.
Obviously, β is a subset of Acc_X(Ω, Σ). For any β-specific set C there are three possibilities: (1) C ⊆ Ā ∩ B, (2) C ⊆ A ∩ B̄, and (3) C ⊆ Ā ∩ B̄. In case (1), ‖β‖_C = (1 − 0.7) − 0.8 + (1 − 0.6) = −0.1. In case (2), ‖β‖_C = −0.7 + (1 − 0.8) + (1 − 0.6) = −0.1. In case (3), ‖β‖_C = (1 − 0.7) + (1 − 0.8) − 0.6 = −0.1. Hence β is a Dutch book.
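The three case distinctions of the example can be checked mechanically; a minimal sketch in which bets are (event, stake, quotient) triples as in definition 1.5.1 and events are sets of three illustrative elementary outcomes:

    # Elementary outcomes: 'a' (A occurs), 'b' (B occurs), 'n' (neither); A and B are disjoint.
    omega = {'a', 'b', 'n'}
    A, B = {'a'}, {'b'}

    def complement(E):
        return omega - E

    def bet_value(bet, C):
        """Value of a bet <event, stake, quotient> once a bet-specific event C is decided."""
        event, stake, q = bet
        return (1 - q) * stake if C <= event else -q * stake

    # The Dutch book of example 1.5.3: bets against A, against B, and on A ∪ B.
    book = [(complement(A), 1, 0.7), (complement(B), 1, 0.8), (A | B, 1, 0.6)]

    for C in (B, A, omega - (A | B)):                          # cases (1), (2), (3)
        print(round(sum(bet_value(b, C) for b in book), 10))   # -0.1 in every case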
The above example already illustrates an essential step in the proof of the Dutch Book Theorem. This theorem states that the degrees of belief given by bel_X are coherent iff bel_X is a probability function. Before we prove the theorem, we first mention a few lemmata.
Lemma 1.5.1 Let β ∈ BOOK(Ω, Σ) and let Δ be a basis of Σ. The book β is a Dutch book iff for every D ∈ Δ, ‖β‖_D < 0.
Lemma 1.5.2 Let β ⊆ Acc_X(Ω, Σ). If β is a Dutch book, then the book {⟨A, S_A, bel_X(A)⟩ : A ∈ Σ, S_A = ∑_{⟨A,S,q⟩∈β} S} is also a Dutch book contained in Acc_X(Ω, Σ).
Proposition 1.5.1 (Dutch Book Theorem) Let Σ be an algebra on Ω. Acc_X(Ω, Σ) is coherent iff bel_X is a probability function on Σ.
Proof. Assume that Acc_X(Ω, Σ) does not contain a Dutch book with respect to Σ. We show that bel_X is a probability function on Σ.
1. The condition of non-negativity is automatically satisfied.
2. Suppose bel_X(Ω) ≠ 1. Then 0 ≤ bel_X(Ω) < 1, and {⟨∅, 1, 1 − bel_X(Ω)⟩} is a Dutch book ⊆ Acc_X(Ω, Σ). Thus bel_X satisfies unit normalisation.
3. Let A, B ∈ Σ such that A ∩ B = ∅.
(a) If bel_X(A ∪ B) < bel_X(A) + bel_X(B), then
{⟨Ā ∩ B̄, 1, 1 − bel_X(A ∪ B)⟩, ⟨A, 1, bel_X(A)⟩, ⟨B, 1, bel_X(B)⟩}
(a bet against A ∪ B together with bets on A and on B) is a Dutch book ⊆ Acc_X(Ω, Σ).
(b) If bel_X(A ∪ B) > bel_X(A) + bel_X(B), then
{⟨A ∪ B, 1, bel_X(A ∪ B)⟩, ⟨Ā, 1, 1 − bel_X(A)⟩, ⟨B̄, 1, 1 − bel_X(B)⟩}
is a Dutch book ⊆ Acc_X(Ω, Σ).
Thus bel_X satisfies finite additivity.
It remains to show that if bel_X is a probability function on Σ, then Acc_X(Ω, Σ) does not contain a Dutch book with respect to Σ. We use a method due to John Kemeny.
Assume that bel_X is a probability function on Σ, and let Δ be a basis of Σ. Suppose that β ⊆ Acc_X(Ω, Σ) is a Dutch book. By lemma 1.5.2, we may assume that β = {⟨A, S_A, bel_X(A)⟩ : A ∈ Σ}. Then for any D ∈ Δ,
‖β‖_D = ∑_{A∈Σ, D⊆A} (1 − bel_X(A)) · S_A + ∑_{A∈Σ, D⊆Ā} (−bel_X(A)) · S_A.
Consider the estimated profit of β, Prf(β) =def ∑_{D∈Δ} bel_X(D) · ‖β‖_D. Since ∑_{D∈Δ} bel_X(D) = 1, and for any D ∈ Δ, bel_X(D) ≥ 0 and ‖β‖_D < 0, it follows that Prf(β) < 0.
On the other hand, Prf(β) = ∑_{A∈Σ} a_A · S_A, where
a_A = (1 − bel_X(A)) · ∑_{D∈Δ, D⊆A} bel_X(D) − bel_X(A) · ∑_{D∈Δ, D⊆Ā} bel_X(D).
Since bel_X is a probability function on Σ, we have
a_A = (1 − bel_X(A)) · bel_X(A) − bel_X(A) · (1 − bel_X(A)) = 0.
Hence Prf(β) = 0. This contradicts Prf(β) < 0, so we may conclude that Acc_X(Ω, Σ) contains no Dutch book.
There is a large body of literature on the Dutch book argument. Although it has been criticised by many authors, it remains a strong, intuitively appealing, argument for using probabilities as degrees of belief of an
ideally rational agent. The argument is not airtight, but it is reasonable to
demand that any proposal for an alternative theory should be accompanied
by an explanation why the Dutch book argument does not disqualify the
proposed theory.
We end this section by briefly discussing a few well-known objections
against the Dutch book argument.
Objection 1 I don't like gambling. Why should I bet?
Answer.
The argument is about the degrees of belief of
ideally rational
agents. One can argue that such agents differ from humans in that they
maximise utility without being bothered by taking some risk. Alternatively,
one can circumvent the effect of risk-taking by saying that a bet is put in
the acceptance set of an individual
X , not when X actually accepts the bet,
but when he would accept the bet in case risk avoidance would not be an
issue.
Objection 2 I like gambling. It's fun! I don't mind losing some money in
the process.
Answer. Remarks similar to those above apply. In addition, it should be
stressed that a Dutch book results in a
sure
loss as soon as the relevant
propositions are decided. Most gamblers know that they are likely to lose in
the long run, but gambling is attractive because there is (or at least appears
to be) a chance of winning. There is not much fun in gambling without the
possibility to win.
Objection 3 The argument only applies to decidable propositions. To determine the value of a bet on an event A one should be able to decide whether
A or Ā holds.
Answer.
Many propositions are (in principle) decidable. So the result that
for these propositions degrees of belief should be probabilities is still a very
strong result. Moreover, even if the proposition
A is undecidable, it seems
irrational to accept a set of bets that would result in a loss if it would be
determined that A holds and that would also result in a loss if it would be
determined that Ā holds.
Objection 4 It is unreasonable to require that the acceptance of a bet does
not depend on the stake involved.
Answer. The argument is about the degrees of belief of ideally rational
agents. They are not bothered by earthly concerns like having only a finite
amount of money. Moreover, the proof that coherent degrees of belief are
necessarily probabilities only uses bets with stake 1.
Objection 5 In the presence of ignorance with respect to the exact uncertainties, it is unreasonable to require the existence of a (unique) breaking
point at which one is indifferent to bet on or against an event.
Answer. In the presence of ignorance with respect to the exact uncertainties, it is still possible to determine certain least committed choices of exact breaking points, by means of symmetry arguments, or principles such as the indifference principle or the maximum entropy principle.
The given answers to the objections are not meant to be the final words
on the matter. In fact, in chapter 3, we argue (against the answer to the
last objection) that in the presence of ignorance it might be reasonable to
relax the requirement of the existence of a breaking point. This leads to an
argument for generalised probability theory.
Exercise 1.5.1 Let A be the following statement: on the first of January of the year 2000 there will exist a chess program with an ELO rating of 2800 points. Several years ago, professor van den Herik and chess player Böhm agreed to a bet with respect to A. Van den Herik put in 500 on A, while Böhm put in the same amount against A.
What are the stake and the odds of the bet van den Herik agreed to? Can one conclude from this betting behaviour that professor van den Herik firmly believes A? (Why not?)
Exercise 1.5.2 Assume that A ∩ B = ∅, bel_X(A) = 0.4, bel_X(B) = 0.5, and bel_X(A ∪ B) = 0.7. Construct a Dutch book against X.
Exercise 1.5.3 Assume that bel_X(A ∩ B) = 0.2, bel_X(A) = bel_X(B) = 0.4, and bel_X(A ∪ B) = 0.7. Construct a Dutch book against X.
Exercise 1.5.4 Let β = {⟨A, S, q⟩} be a book with respect to an algebra Σ on Ω. Show that β is a Dutch book iff A = ∅, S > 0, and q > 0.
Exercise 1.5.5 Prove lemma 1.5.1.
Exercise 1.5.6 Prove lemma 1.5.2.
Exercise 1.5.7 Let Σ be an algebra on Ω. β ⊆ BET(Ω, Σ) is called a weak Dutch book iff for every β-specific A ∈ Σ, ‖β‖_A ≤ 0, and there exists a β-specific A ∈ Σ such that ‖β‖_A < 0.
Show that if Acc_X(Ω, Σ) does not contain a weak Dutch book with respect to Σ, then bel_X is a probability function on Σ such that bel_X(A) = 1 implies that A = Ω.