Reasoning with Uncertainty
1999/2000
Frans Voorbraak

In these notes a number of formalisms for reasoning with (numeric) uncertainty in AI are treated. The emphasis is on a systematic presentation of the relevant definitions and results. In this the notes supplement the book Representing Uncertain Knowledge by Krause and Clark, which has an emphasis on informal discussions concerning the relative merits of the different formalisms. The considered formalisms are Probability Theory and some of its generalisations, Dempster-Shafer Theory, and the Certainty Factor Model. Fuzzy Sets and Logic and the theory behind probabilistic networks will be treated in accompanying notes. The notes are mostly self-contained, but it is often convenient to read them along with the relevant material in Representing Uncertain Knowledge.

1 Probability Theory

1.1 Introduction

In this chapter we consider probability theory as a formalism for representing uncertainty in knowledge-based systems. Probability theory is by far the best developed formalism for reasoning with uncertainty, and it has many interpretations. The most relevant interpretation for our purpose is probability as a degree of belief of an ideally rational agent. We discuss the famous Dutch book argument, which is supposed to show that degrees of belief of an ideally rational agent not only can but should be represented by means of probabilities. In spite of this argument, several alternative approaches to reasoning with uncertainty in AI have been proposed, since a naive application of probability theory to uncertain reasoning in knowledge-based systems leads to several problems. Some of these problems are discussed. But first some basic material of axiomatic probability theory is treated.

1.2 Probability Functions

Probabilities can be assigned to events, which can be formalised as subsets of a sample space (also called a frame). A sample space Ω denotes the certain or sure event and its elements denote all (mutually exclusive) considered possibilities, such as all possible outcomes of some experiment. If A denotes a subset of a sample space Ω, then A is the event that one of the possibilities in A is the case. The empty set ∅ denotes the impossible event.

In order for probabilities to be defined, one has to impose some structure on the set of considered events, and one usually assumes this set to be an algebra (also called a field) on the sample space. Some authors have challenged this assumption on the structure of events to which probabilities are assigned, but the assumption is made in any standard treatment of axiomatic probability theory.

Definition 1.2.1 Let Ω be a set. An algebra Σ on Ω is a set of subsets of Ω which satisfies the following three conditions.

1. (inclusion of sample set) Ω ∈ Σ. (1)

2. (closure under finite unions) A, B ∈ Σ ⇒ A ∪ B ∈ Σ. (2)

3. (closure under complementation) A ∈ Σ ⇒ A̅ ∈ Σ. (3)

(Here A̅ is the complement of A with respect to Ω.)

A σ-algebra on Ω is an algebra on Ω which additionally satisfies the following property of closure under countable unions.

For every countable I, {A_i : i ∈ I} ⊆ Σ ⇒ ⋃_{i∈I} A_i ∈ Σ. (4)

The notion of a σ-algebra does not play an important role in these notes, since we will soon restrict ourselves to finite sample spaces, where closure under countable unions collapses to closure under finite unions.

The powerset 2^Ω of Ω is an algebra on Ω and one often encounters probabilities defined on this powerset algebra. This powerset algebra 2^Ω is the maximal algebra on Ω.
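As a quick illustration of definition 1.2.1, the following Python sketch (not part of the original notes; the helper names powerset and is_algebra are just illustrative) enumerates the powerset algebra of a small sample space and checks the three closure conditions, both for the powerset and for the four-element algebra generated by a single event:

    from itertools import chain, combinations

    def powerset(omega):
        """All subsets of a finite sample space, as frozensets."""
        omega = list(omega)
        return {frozenset(s) for s in chain.from_iterable(
            combinations(omega, r) for r in range(len(omega) + 1))}

    def is_algebra(sigma, omega):
        """Check conditions (1)-(3) of definition 1.2.1 for a finite family sigma."""
        omega = frozenset(omega)
        return (omega in sigma                                         # (1)
                and all(a | b in sigma for a in sigma for b in sigma)  # (2)
                and all(omega - a in sigma for a in sigma))            # (3)

    omega = {1, 2, 3, 4, 5, 6}                   # outcomes of a roll of a die
    A = frozenset({1, 2})
    assert is_algebra(powerset(omega), omega)    # the maximal algebra, 2^Omega
    assert is_algebra({frozenset(), A, frozenset(omega) - A, frozenset(omega)}, omega)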
The minimal algebra on Ω is the trivial algebra {∅, Ω}. In general there are many algebras in between. The following proposition lists some properties of algebras. The proofs are omitted and left as exercises. (This will be standard practice.)

Proposition 1.2.1 Assume that Σ is an algebra on Ω. The following holds.

1. 2^Ω and {∅, Ω} are algebras on Ω, and {∅, Ω} ⊆ Σ ⊆ 2^Ω.

2. For every finite I, {A_i : i ∈ I} ⊆ Σ ⇒ ⋃_{i∈I} A_i ∈ Σ.

3. (closure under finite intersections) A, B ∈ Σ ⇒ A ∩ B ∈ Σ.

4. If Σ is a σ-algebra, then closure under countable intersections holds.

If Δ is a subset of an algebra Σ such that the members of Δ are nonempty and pairwise disjoint and every element of Σ is a countable union of members of Δ, then Δ is called a basis of Σ. Not every algebra has a basis, but the existence of a basis is guaranteed if the algebra is finite, and, in particular, if the algebra is defined on a finite sample space.

Any countable set Δ of subsets of Ω can generate an algebra on Ω by taking the closure of Δ ∪ {Ω} under (countable) unions and complementation. The thus obtained (σ-)algebra is called the (σ-)algebra generated by Δ.

Definition 1.2.2 Let Σ be an algebra on Ω. P is called a probability function on Σ iff P is a real-valued function on Σ satisfying the following three conditions.

1. (non-negativity) For all A ∈ Σ, P(A) ≥ 0. (5)

2. (unit normalisation) P(Ω) = 1. (6)

3. (finite additivity) For all disjoint A, B ∈ Σ, P(A ∪ B) = P(A) + P(B). (7)

If Σ is a σ-algebra on Ω, then P is called a probability measure on Σ iff P is a probability function on Σ which additionally satisfies the following property of countable additivity.

If {A_i : i ∈ I} is a countable set of pairwise disjoint elements of Σ, then P(⋃_{i∈I} A_i) = ∑_{i∈I} P(A_i). (8)

Definition 1.2.3 A probability space is a tuple ⟨Ω, Σ, P⟩ such that Σ is a σ-algebra on Ω and P is a probability measure on Σ. If ⟨Ω, Σ, P⟩ is a probability space, then Ω is called its sample space. The elements of Ω are called elementary events, and the elements of Σ are called measurable sets or measurable events.

If Σ is a σ-algebra on Ω, then PROB(Ω, Σ) denotes the class of all probability measures on Σ. Instead of PROB(Ω, 2^Ω), we sometimes simply write PROB(Ω). Elements of PROB(Ω) are called probability measures over Ω.

From now on, we assume, unless explicitly stated otherwise, that the sample space Ω is finite. Some consequences of this assumption are that the distinction between probability function and probability measure vanishes, and that every algebra has a basis. Notice that if Δ is a basis of an algebra Σ, then a probability function P on Σ is determined by its values on the members of Δ.

We list some useful properties of probability functions.

Proposition 1.2.2 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A and B are elements of Σ.

1. P(A) = P(A ∩ B) + P(A ∩ B̅).

2. P(A̅) = 1 − P(A).

3. (zero normalisation) P(∅) = 0.

4. P(A) ≤ 1.

5. B ⊆ A ⇒ P(A) − P(B) = P(A ∩ B̅).

6. (monotonicity) B ⊆ A ⇒ P(B) ≤ P(A).

7. P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

The latter result can be generalised to a formula which is interesting in the light of the future discussion of Dempster-Shafer theory.

Proposition 1.2.3 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A_1, A_2, ..., A_n are elements of Σ. For all n ≥ 1,

P(⋃_{i=1}^{n} A_i) = ∑_{∅≠I⊆{1,...,n}} (−1)^{|I|+1} P(⋂_{i∈I} A_i). (9)

(Try to understand equation (9) by writing it out for small n.)
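Writing (9) out for small n can also be done mechanically. The following Python sketch (an illustration only; the sample space, the uniform measure and the three events are made up) checks the identity for n = 3:

    from itertools import combinations
    from fractions import Fraction

    omega = set(range(1, 13))                    # a small finite sample space
    P = lambda A: Fraction(len(A), len(omega))   # the uniform probability measure

    A = [{1, 2, 3, 4}, {3, 4, 5, 6, 7}, {2, 4, 7, 8, 9}]   # three arbitrary events

    # Left-hand side of (9): the probability of the union.
    lhs = P(set.union(*A))

    # Right-hand side of (9): alternating sum over the nonempty index sets I.
    rhs = sum((-1) ** (len(I) + 1) * P(set.intersection(*(A[i] for i in I)))
              for r in range(1, len(A) + 1)
              for I in combinations(range(len(A)), r))

    assert lhs == rhs    # inclusion-exclusion holds for this example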
The first property of proposition 1.2.2 can also be generalised. A set {B_1, B_2, ..., B_n} is called a partition of Ω iff its members are pairwise disjoint and B_1 ∪ B_2 ∪ … ∪ B_n = Ω.

Proposition 1.2.4 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A, B_1, B_2, ..., B_n are elements of Σ such that {B_1, B_2, ..., B_n} is a partition of Ω. Then

P(A) = P(A ∩ B_1) + P(A ∩ B_2) + … + P(A ∩ B_n). (10)

Some more properties of probability functions will be given further on. We close this section with the observation that probability functions could have been defined as functions on formulas of a formal language (instead of functions on sets). In the following we make this more precise.

Assume L to be a propositional language built up from a set of propositional letters PL with the usual propositional connectives ¬, ∧, ∨, →, ↔. In line with our assumption that the sample space is finite, we assume that PL is finite. Formulas of L are denoted by φ, ψ, ....

Definition 1.2.4 P is called a probability function on L iff P is a real-valued function on L satisfying the following three conditions.

1. (non-negativity) For all φ ∈ L, P(φ) ≥ 0. (11)

2. (unit normalisation) For all φ ∈ L, ⊨ φ ⇒ P(φ) = 1. (12)

3. (finite additivity) For all φ, ψ ∈ L, ⊨ ¬(φ ∧ ψ) ⇒ P(φ ∨ ψ) = P(φ) + P(ψ). (13)

This definition is of course closely related to definition 1.2.2 of probability functions on sets. To establish an exact connection, notice that the following can be proved.

For all φ, ψ ∈ L, ⊨ φ ↔ ψ ⇒ P(φ) = P(ψ). (14)

Equation (14) tells us that the probability functions of definition 1.2.4 can be viewed as (probability) functions on the Lindenbaum algebra of L, which is, as the name suggests, an algebra in the sense of definition 1.2.1.

One encounters a small complication when using probability functions defined on formulas. In general, a propositional letter does not correspond to an elementary event. Let us define an elementary proposition to be a conjunction of literals, such that every propositional letter appears exactly once in the conjunction. An elementary proposition is a most specific proposition and completely characterises a possible world or valuation.

The notion of an elementary proposition is the correct counterpart of the notion of an elementary event of a sample space. However, if there are n propositional letters, then there are 2^n non-equivalent elementary propositions, whereas not every sample space has a cardinality which is a power of 2. (Consider, for example, the obvious sample space representing the outcomes of a roll of a (six-sided) die.) Therefore, definition 1.2.4 has to be augmented to allow the encoding of some information about the particular situation that is being modelled. This can be done, but we save ourselves the trouble and use set functions in the rest of these notes.
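To see the correspondence at work, here is a small Python sketch (illustrative only; the two-letter language and the weights are made-up assumptions) in which a probability function on formulas is induced by a distribution over the 2^n valuations, i.e. over the elementary propositions:

    from itertools import product

    letters = ["p", "q"]                  # a tiny propositional language

    # A valuation assigns a truth value to every letter; valuations play the
    # role of elementary events (one per elementary proposition).
    valuations = [dict(zip(letters, bits))
                  for bits in product([True, False], repeat=len(letters))]

    # A made-up distribution over the 2^n valuations (weights sum to 1).
    weight = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}

    def prob(formula):
        """P(formula) = total weight of the valuations satisfying it."""
        return sum(weight[i] for i, v in enumerate(valuations) if formula(v))

    # Finite additivity (13): p∧q and p∧¬q exclude each other and their
    # disjunction is equivalent to p, so the probabilities add up.
    p = prob(lambda v: v["p"])
    assert abs(p - (prob(lambda v: v["p"] and v["q"]) +
                    prob(lambda v: v["p"] and not v["q"]))) < 1e-12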
Exercise 1.2.1 Prove proposition 1.2.1.

Exercise 1.2.2 Give an example of an algebra Σ on some sample space Ω such that Σ is neither 2^Ω nor {∅, Ω}.

Exercise 1.2.3 Show that every finite algebra has a basis.

Exercise 1.2.4 Prove proposition 1.2.2.

Exercise 1.2.5 Formulate the results of proposition 1.2.2 in terms of probability functions defined on formulas, and convince yourself that the proofs you have found in the previous exercise carry over to the context of probabilities defined on formulas.

Exercise 1.2.6 Prove equation (14).

Exercise 1.2.7 Consider the experiment that consists of a single roll of a fair (ordinary, six-sided) die. Give a probability space representing this experiment. What might a representation using probability functions on formulas look like?

1.3 Conditional Probability and Bayes' Rule

Definition 1.3.1 Let P ∈ PROB(Ω, Σ) and A ∈ Σ such that P(A) > 0.

1. For every B ∈ Σ, we define P(B|A), the (conditional) probability of B given A, as follows:

P(B|A) = P(A ∩ B) / P(A). (15)

2. The function P_A (P conditioned by A) on Σ is given by the following equation:

For every B ∈ Σ, P_A(B) = P(B|A). (16)

It is easy to see that the function P_A defined above is a probability function on Σ, and that P_Ω = P. Hence conditioning can be seen as a process for revising or updating probability functions, where P_A is the result of revising P after obtaining the evidence represented by A. This process is order-independent in the sense that (P_A)_B = (P_B)_A = P_{A∩B}, whenever P(A ∩ B) > 0.

Some other useful properties are given below.

Proposition 1.3.1 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A, B_1, B_2, ..., B_n are elements of Σ such that {B_1, B_2, ..., B_n} is a partition of Ω and for every i ∈ {1, 2, ..., n}, P(B_i) > 0. Then

P(A) = ∑_{i=1}^{n} P(A|B_i) · P(B_i). (17)

Proposition 1.3.2 Let ⟨Ω, Σ, P⟩ be a probability space and assume that the sets A_1, A_2, ..., A_n are elements of Σ such that P(⋂_{i=1}^{n−1} A_i) > 0. Then

P(⋂_{i=1}^{n} A_i) = P(A_1) · P(A_2|A_1) · P(A_3|A_1 ∩ A_2) · … · P(A_n | ⋂_{i=1}^{n−1} A_i). (18)

The above result is known as the chaining rule. An easy corollary of this chaining rule is that for any P ∈ PROB(Ω, Σ), P(A|C) = P(A|B) · P(B|C), whenever A, B, C ∈ Σ, A ⊆ B ⊆ C, and P(B) > 0. This property essentially characterises the conditioning process, as is shown below.

Proposition 1.3.3 Assume that P ∈ PROB(Ω, Σ). Let C ∈ Σ such that P(C) > 0, and for every D ∈ Σ with P(D) > 0 let F_{P,D} denote some probability function on Σ. Then the following two statements are equivalent.

1. F_{P,C} = P_C.

2. The functions F satisfy the three conditions below, where A, B, D range over Σ.

(a) F_{P,Ω} = P.

(b) F_{P,C}(A) = F_{P,C}(A ∩ C).

(c) If P(B) > 0 and A ⊆ B ⊆ D, then F_{P,D}(A) = F_{P,B}(A) · F_{P,D}(B).

Proof. That 1 implies 2 is immediate. To show that 2 implies 1, assume that the conditions listed under 2 are satisfied and let A be an arbitrary subset of Ω. Since A ∩ C ⊆ C ⊆ Ω, we have by (c) that F_{P,Ω}(A ∩ C) = F_{P,C}(A ∩ C) · F_{P,Ω}(C). Using F_{P,Ω} = P, we obtain P(A ∩ C) = F_{P,C}(A ∩ C) · P(C), and thus F_{P,C}(A ∩ C) = P(A ∩ C)/P(C) = P_C(A). Since F_{P,C}(A) = F_{P,C}(A ∩ C), it follows that F_{P,C} = P_C.

For the computation of (conditional) probabilities a simple result due to Thomas Bayes has proved to be very useful. This result, known as Bayes' rule or Bayes' theorem, solves a much encountered problem when applying probability theory to knowledge-based systems. For example, in (medical) diagnosis, one is interested in the conditional probability of a hypothesis (disease) H given some evidence (symptom) E, but it is usually much easier to obtain P(E|H) (from experts or statistical records) than to obtain P(H|E). Bayes' rule allows one to compute P(H|E) from P(E|H) and the prior probabilities of H and E.

Proposition 1.3.4 (Bayes) Let ⟨Ω, Σ, P⟩ be a probability space. Assume that H, E ∈ Σ such that P(H) > 0 and P(E) > 0. Then

P(H|E) = P(E|H) · P(H) / P(E). (19)

Proof. P(H|E) = P(H ∩ E)/P(E) = [P(H ∩ E)/P(H)] · [P(H)/P(E)] = P(E|H) · P(H)/P(E).

Although the above proposition is an almost trivial theorem of axiomatic probability theory, the results of its application are not so trivial.
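Before looking at a concrete diagnostic example, here is a minimal Python sketch (not from the notes; the uniform die measure and the helper names are convenient assumptions) of the conditioning operation of definition 1.3.1, its order-independence, and Bayes' rule (19):

    from fractions import Fraction

    # A probability measure over a finite sample space, as weights on the
    # elementary events (here: the uniform measure for a fair die).
    P_elem = {w: Fraction(1, 6) for w in range(1, 7)}

    def P(event, weights=P_elem):
        return sum(weights[w] for w in event)

    def condition(weights, A):
        """P_A of definition 1.3.1: renormalise the weights inside A (P(A) > 0)."""
        pa = sum(weights[w] for w in A)
        return {w: (weights[w] / pa if w in A else Fraction(0)) for w in weights}

    even, high = {2, 4, 6}, {4, 5, 6}

    # Order-independence of conditioning: (P_A)_B = (P_B)_A = P_{A∩B}.
    assert condition(condition(P_elem, even), high) == condition(P_elem, even & high)

    # Bayes' rule (19) with H = even and E = high.
    lhs = P(even, condition(P_elem, high))                       # P(H|E)
    rhs = P(high, condition(P_elem, even)) * P(even) / P(high)   # P(E|H)P(H)/P(E)
    assert lhs == rhs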
Example 1.3.1 A tumour can be either benign (B) or malignant (M). To gain insight into the nature of a tumour, a radiological test can be performed. The result of this kind of test is classified as either positive (+) or negative (−) with respect to cancer (a malignant tumour). The reliability of such a test is given as follows. Its false positive rate (= the probability that a benign tumour is classified as malignant) is 0.096, and its false negative rate (= the probability that a malignant tumour is classified as benign) is 0.208. Thus, the probability P(+|M) that a malignant tumour produces a positive test result is 0.792.

Now suppose that some particular tumour has a prior probability of 0.01 of being malignant and that it produces a positive test result. In these circumstances, among physicians, the probability that the tumour is malignant is typically assessed to be about 0.75. However, using Bayes' rule one can compute P(M|+) from the data as follows.

P(M|+) = P(+|M) · P(M) / P(+) = P(+|M) · P(M) / [P(+|M) · P(M) + P(+|M̅) · P(M̅)]
       = (0.792 · 0.01) / (0.792 · 0.01 + 0.096 · 0.99) ≈ 0.077.

Thus, the typical assessment does not even come close to the theoretically correct probability. Possible explanations of this phenomenon are that people often forget to take into account the effect of the prior probability (base rate fallacy), or that they simply fail to distinguish the probabilities P(A|B) and P(B|A).

Although Bayes' rule as formulated in proposition 1.3.4 may give rise to interesting results, one needs a more general formulation if one intends to use the rule in a system for medical diagnosis. In that case, it is not sufficient to calculate the probability of a particular disease given a single symptom. Instead, one is interested in the probability of each considered disease given any combination of symptoms. To obtain a more general formulation of proposition 1.3.4, simply replace the single symptom E by an arbitrary combination of symptoms. If the hypotheses form a partition of the sample space (as is the case in the above example), then one can use proposition 1.3.1 to dispose of the need to know the prior probabilities of the evidence. One then obtains the following version of Bayes' rule.

Proposition 1.3.5 (Bayes) Let ⟨Ω, Σ, P⟩ be a probability space. Assume that {H_i : i ∈ I} ⊆ Σ is a partition of Ω such that for all i ∈ I, P(H_i) > 0. Further assume that {E_k : k ∈ K} ⊆ Σ such that P(⋂_{k∈K} E_k) > 0. Then we have, for all i ∈ I,

P(H_i | ⋂_{k∈K} E_k) = P(⋂_{k∈K} E_k | H_i) · P(H_i) / [∑_{j∈I} P(⋂_{k∈K} E_k | H_j) · P(H_j)]. (20)

We will see later that there is still a long way to go from the above result to a method for automatic (medical) diagnosis.
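In the two-hypothesis case of example 1.3.1 the computation in (20) looks as follows in Python (a sketch only; the function name posterior and the dictionaries are not part of the notes):

    def posterior(priors, likelihoods):
        """Proposition 1.3.5: normalise likelihood times prior over a partition.

        priors[h]      = P(H_h), for hypotheses H_h forming a partition of Omega
        likelihoods[h] = P(E|H_h), for the observed combination of evidence E
        """
        joint = {h: likelihoods[h] * priors[h] for h in priors}
        total = sum(joint.values())          # = P(E), by proposition 1.3.1
        return {h: joint[h] / total for h in joint}

    # The tumour of example 1.3.1: partition {M, B}, observed evidence "+".
    print(posterior(priors={"M": 0.01, "B": 0.99},
                    likelihoods={"M": 0.792, "B": 0.096}))
    # {'M': 0.0769..., 'B': 0.9230...}  -- the value of about 0.077 computed above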
We end this section with an alternative formulation of Bayes' rule which makes use of the notions of odds and likelihood.

Definition 1.3.2 Let ⟨Ω, Σ, P⟩ be a probability space, and let E, H ∈ Σ. The (prior) odds O(H), the posterior odds O(H|E), and the likelihood ratio λ(H|E) of H given E are defined as follows.

1. Assume that P(H̅) > 0.

O(H) = P(H) / P(H̅) = P(H) / (1 − P(H)). (21)

2. Assume that P(H̅|E) > 0.

O(H|E) = P(H|E) / P(H̅|E) = P(H|E) / (1 − P(H|E)). (22)

3. Assume that P(E|H̅) > 0.

λ(H|E) = P(E|H) / P(E|H̅). (23)

Probability theory can be given an equivalent formulation by using odds instead of probabilities. Bayes' rule can then be seen as a rule for updating odds.

Proposition 1.3.6 (Bayes) Let ⟨Ω, Σ, P⟩ be a probability space. Assume that E and H are elements of Σ such that P(H), P(H̅|E), and P(E|H̅) are all positive. Then

O(H|E) = λ(H|E) · O(H). (24)

The likelihood ratio of H given E tells us how to update the odds on H in the light of the evidence E. A high (≫ 1) value of λ(H|E) corresponds to the situation in which acquiring the evidence E lends much support to the truth of H.

In these notes the odds-likelihood formulation of probability theory will hardly be used. However, the notion of odds will make a second appearance in section 1.5 on the Dutch book argument.

Exercise 1.3.1 Show that the function P_A defined in definition 1.3.1 is a probability function on Σ.

Exercise 1.3.2 Show that (P_A)_B = (P_B)_A = P_{A∩B}, when P(A ∩ B) > 0.

Exercise 1.3.3 Prove proposition 1.3.1.

Exercise 1.3.4 Assume that P ∈ PROB(Ω, Σ), A, B, C ∈ Σ, A ⊆ B ⊆ C, and P(B) > 0. Show that P(A|C) = P(A|B) · P(B|C).

Exercise 1.3.5 One evening, a taxi causes a serious hit-and-run accident. According to an eye-witness, the taxi involved was blue. The firm Blue Star owns 15 of the 100 taxis in town and those 15 taxis are blue; the other 85 belong to the firm Green Cheese and are green. Tests (under conditions similar to those of that evening) show that in 80% of the cases the eye-witness is able to correctly identify the colour of a blue taxi, and the same percentage is scored for green taxis. What is the probability that the taxi involved was blue? (First guess the probability, then compute it using Bayes' rule.)

Exercise 1.3.6 Show that P(A) < 1 ⇒ P(A) = O(A)/(1 + O(A)).

Exercise 1.3.7 Prove proposition 1.3.6.

Exercise 1.3.8 Let ⟨Ω, Σ, P⟩ be a probability space, and let E, H ∈ Σ such that 0 < P(E|H̅) < 1. Show that

P(E|H) = λ(H|E) · (1 − λ(H|E̅)) / (λ(H|E) − λ(H|E̅)).

Express P(E|H̅) in terms of λ(H|E) and λ(H|E̅).

1.4 Independence and Computational Issues

There is one basic notion of probability theory that has yet to be introduced, namely the notion of independence.

Definition 1.4.1 Let ⟨Ω, Σ, P⟩ be a probability space, and let A, B ∈ Σ. The sets A and B are called independent iff

P(A ∩ B) = P(A) · P(B). (25)

Alternatively, one can use the following definition.

Definition 1.4.2 Let ⟨Ω, Σ, P⟩ be a probability space, and let A, B ∈ Σ. The set A is called independent from B iff

P(B) > 0 ⇒ P(A|B) = P(A). (26)

It is easy to show that A is independent from B iff A and B are independent. There are at least two reasonable ways to generalise the notion of independence to collections of more than two events.

Definition 1.4.3 Let ⟨Ω, Σ, P⟩ be a probability space. Assume that for some n ≥ 2, the sets A_1, A_2, ..., A_n are elements of Σ.

1. A_1, A_2, ..., A_n are called (completely) independent iff

for every I ⊆ {1, 2, ..., n}, P(⋂_{i∈I} A_i) = ∏_{i∈I} P(A_i). (27)

2. A_1, A_2, ..., A_n are called pairwise independent iff

for every i, j ∈ {1, 2, ..., n} such that i ≠ j, P(A_i ∩ A_j) = P(A_i) · P(A_j). (28)

If A_1, A_2, ..., A_n are (completely) independent, then they are also pairwise independent, but in general the reverse implication is not valid. For a collection of two events, the above notions are of course equivalent.

One also needs a notion of conditional independence. Below, we give the conditional version of definition 1.4.1. The conditional versions of other notions of independence can be obtained analogously.

Definition 1.4.4 Let ⟨Ω, Σ, P⟩ be a probability space, and let A, B, C ∈ Σ such that P(C) > 0. A and B are called (conditionally) independent given C iff

P(A ∩ B|C) = P(A|C) · P(B|C). (29)
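As a quick numerical illustration of (29) (a sketch with a made-up distribution; none of these numbers come from the notes), the following Python fragment builds a joint distribution over the truth values of three events in which A and B are conditionally independent given C by construction, and checks definitions 1.4.1 and 1.4.4:

    from itertools import product

    # A made-up joint distribution over the truth values of three events A, B, C:
    # given C (and given its complement), A and B are independent by construction.
    p_c = 0.5
    p_a_given = {True: 0.3, False: 0.8}      # P(A|C), P(A|not C)
    p_b_given = {True: 0.6, False: 0.1}      # P(B|C), P(B|not C)

    joint = {}
    for a, b, c in product([True, False], repeat=3):
        pc = p_c if c else 1 - p_c
        pa = p_a_given[c] if a else 1 - p_a_given[c]
        pb = p_b_given[c] if b else 1 - p_b_given[c]
        joint[(a, b, c)] = pc * pa * pb

    def P(pred):
        return sum(p for w, p in joint.items() if pred(*w))

    # Definition 1.4.4: A and B are (conditionally) independent given C.
    pc = P(lambda a, b, c: c)
    lhs = P(lambda a, b, c: a and b and c) / pc
    rhs = (P(lambda a, b, c: a and c) / pc) * (P(lambda a, b, c: b and c) / pc)
    assert abs(lhs - rhs) < 1e-12

    # Definition 1.4.1 fails here: A and B are not unconditionally independent.
    print(P(lambda a, b, c: a and b), P(lambda a, b, c: a) * P(lambda a, b, c: b))
    # prints roughly 0.13 and 0.1925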
We now return to the problem of designing a system for automatic (medical) diagnosis. A simple scheme would be to feed the system with the prior probabilities P(H_i) of the hypotheses (diseases) and the conditional probabilities P(H_i | ⋂_{k∈K} E_k) of any hypothesis given any combination of bodies of evidence (symptoms). These probabilities are to be extracted from experts or statistical records. The user of the system then only needs to supply the information about bodies of evidence (symptoms) concerning the particular case at hand.

The problem with this simple scheme is that in practice the number of probabilities that have to be fed to the system is too large. If one considers n different symptoms, then there are 2^n combinations of symptoms. Thus a very modest system which only considers 20 different symptoms already has to be fed more than a million probabilities. Bayes' rule does not provide a solution to this problem, since the rule only allows the probabilities P(H_i | ⋂_{k∈K} E_k) to be computed if the probabilities P(⋂_{k∈K} E_k | H_i) are known.

If one assumes that the symptoms are conditionally independent given each disease, then the number of required probabilities is greatly reduced. One then only needs the probabilities P(E_k|H_i) for each i and k, since P(⋂_{k∈K} E_k | H_i) = ∏_{k∈K} P(E_k|H_i). However, the mentioned assumption is usually highly unrealistic.

The problem would be solved if one could find a way to compute the probability P(H_i | ⋂_{k∈K} E_k) from the values P(H_i|E_k), that is, if one could find a kind of combination function which computes the combined effect of different bodies of evidence. The following result shows that there is no such combination function.

Proposition 1.4.1 (Neapolitan) Let Σ be the algebra generated by the subsets A, B, and C of Ω. There is in general no function F which computes for any probability function P on Σ the value P(A|B ∩ C) from the values of P on the algebras generated by at most two elements from {A, B, C}.

Proof. Consider the probability functions P_1 and P_2 defined as follows.

P_1(A ∩ B ∩ C) = P_1(A ∩ B̅ ∩ C̅) = P_1(A̅ ∩ B ∩ C̅) = P_1(A̅ ∩ B̅ ∩ C) = 0.25.

P_2(A̅ ∩ B ∩ C) = P_2(A̅ ∩ B̅ ∩ C̅) = P_2(A ∩ B ∩ C̅) = P_2(A ∩ B̅ ∩ C) = 0.25.

(See figure 1.) P_1 and P_2 are identical on the algebras generated by at most two elements of {A, B, C}. (This can be easily seen by removing one element of {A, B, C} in figure 1.)

[Figure 1: Venn diagrams of the probability functions P_1 (left) and P_2 (right) mentioned in the proof of Neapolitan's result; in each diagram the four regions that carry probability mass have probability 0.25.]

Now suppose there is a function F as described in the proposition. Then F applied to the values of P_1 yields the same result as F applied to the values of P_2. But P_1(A|B ∩ C) = 1, whereas P_2(A|B ∩ C) = 0. Contradiction. (Notice that we have even proved that there exists no function F which approximates P(A|B ∩ C).)
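The two functions in the proof can be checked mechanically. In the Python sketch below (illustrative only; the encoding of events as truth assignments is an assumption of the sketch), the elementary events are the eight truth assignments to (A, B, C); under P_1 the event A coincides with "B and C agree", under P_2 with "B and C differ":

    from itertools import product

    # The eight elementary events are the truth assignments to (A, B, C).
    worlds = list(product([True, False], repeat=3))

    # Each function puts probability 0.25 on its four worlds, as in the proof.
    P1 = {w: (0.25 if w[0] == (w[1] == w[2]) else 0.0) for w in worlds}
    P2 = {w: (0.25 if w[0] == (w[1] != w[2]) else 0.0) for w in worlds}

    def prob(P, pred):
        return sum(p for w, p in P.items() if pred(*w))

    # P1 and P2 agree on A, B, C and on every pairwise intersection, hence on
    # the algebras generated by at most two of the three events ...
    small_events = [
        lambda a, b, c: a, lambda a, b, c: b, lambda a, b, c: c,
        lambda a, b, c: a and b, lambda a, b, c: a and c, lambda a, b, c: b and c,
    ]
    assert all(abs(prob(P1, e) - prob(P2, e)) < 1e-12 for e in small_events)

    # ... yet they disagree completely on P(A | B ∩ C).
    for P in (P1, P2):
        print(prob(P, lambda a, b, c: a and b and c) /
              prob(P, lambda a, b, c: b and c))
    # prints 1.0 for P1 and 0.0 for P2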
In general, there is no simple (justified) way to avoid the combinatorial explosion of the required probabilities, and this combinatorial explosion is one of the main reasons why people started to look for alternative approaches to reasoning with uncertainty in AI. We will discuss some of these alternatives (the Certainty Factor Model and Dempster-Shafer Theory) later. However, the simple scheme sketched above can be much improved if one takes into consideration the structure of the knowledge domain.

Example 1.4.1 Consider the problem of finding the most plausible diagnosis among {H_1, H_2, ..., H_10} based on the (presence or absence) of the symptoms {E_1, E_2, ..., E_20}. The simple scheme would need 10 · 2^20 (more than ten million) conditional probabilities. Assume the following additional information. The hypotheses can be divided into two classes H_a = {H_1, H_2, ..., H_5} and H_b = {H_6, H_7, ..., H_10}, and the symptoms can be divided into three classes E_a = {E_1, E_2, ..., E_7}, E_b = {E_8, E_9, ..., E_14}, and E_c = {E_15, E_16, ..., E_20}. The symptoms from E_a are only relevant to find the most plausible diagnosis among H_a, the symptoms from E_b are only relevant to find the most plausible diagnosis among H_b, and the symptoms from E_c are only relevant to find the most plausible (partial) diagnosis from {H_a, H_b}. In that case, the number of required conditional probabilities is reduced to 2 · 2^6 + 5 · 2^7 + 5 · 2^7 = 1408.

The above example illustrates the fact that the number of required probabilities can be greatly reduced if the problem can be divided up into smaller independent subproblems. Essentially this technique is used in some sophisticated methods for probabilistic reasoning that have recently become available. These methods, mainly developed by Pearl, Lauritzen, and Spiegelhalter, have one feature in common: they all exploit conditional independencies implied by a graphical representation of the problem domain. That is why they are called (probabilistic) network models. In many (but not in all!) situations the independency information allows the development of sound and feasible algorithms for updating probabilities in knowledge-based systems.

The nodes of the networks consist of propositional variables, i.e., functions from the sample space to an exhaustive set of mutually exclusive events. In this they differ from the inference networks of rule-based systems, where the nodes consist of propositions.

In Pearl's method the problem domain is represented by a causal network, i.e., a directed acyclic graph, where an arrow represents a causal relationship. (In the literature causal networks are also called Bayesian networks or belief networks.) For some classes of causal networks (the singly connected ones) an efficient local probability propagation scheme can be given which updates probabilities by instantiation of certain variables and local communication between nodes. Pearl's method was derived from considerations about the way humans reason and is meant to be a model for human reasoning.

Lauritzen and Spiegelhalter are only interested in the (mathematical) problem of propagating probabilities in networks. Since the notion of (conditional) independence is more fundamental to causal networks than the notion of causality, it is no surprise that Lauritzen and Spiegelhalter use undirected graphs. Their propagation scheme is defined on trees of cliques of a triangulated undirected graph (a clique is a maximal set of nodes such that every pair of distinct nodes is adjacent). Roughly speaking, efficiency is possible whenever the number of variables in the cliques is small, which is not a too restrictive condition, since people tend to represent causal relationships by means of hierarchies of relatively small clusters of variables.

The network models will be discussed in greater detail at the end of the course. It should be stressed, however, that these models do not completely solve the problem of reasoning with uncertainty in AI. In some situations, the methods are not feasible. Moreover, the available probabilistic information is often incomplete, badly calibrated, and incoherent.
Before we discuss some alternative formalisms which address (some of) these problems, we turn to the justification of using probabilities as degrees of belief of a rational agent.

Exercise 1.4.1 Show that A is independent from B iff A and B are independent.

Exercise 1.4.2 Show that pairwise independent A_1, A_2, ..., A_n are not necessarily completely independent.

Exercise 1.4.3 Show that A is independent from itself iff P(A) = 0 or P(A) = 1.

Exercise 1.4.4 Show that the independence of A and B implies that each of the pairs {A, B̅}, {A̅, B}, and {A̅, B̅} is independent.

Exercise 1.4.5 Assume that 0 < P(C) < 1. Show that the independence of A and B given C does not imply that A and B are independent given C̅.

Exercise 1.4.6 Show that in general P(⋂_{k∈K} E_k | H_i) cannot be computed from the probabilities P(E_k|H_i).

Exercise 1.4.7 Consider the problem of finding the most plausible diagnosis among {H_1, ..., H_20} based on the (presence or absence) of the symptoms {E_1, ..., E_30}. How many conditional probabilities would be needed for the simple scheme? Try to add some additional information analogous to example 1.4.1 which reduces the number of required probabilities to less than 1000.

1.5 Dutch Book Argument

This section is devoted to the most widely used argument (dating back to Ramsey and De Finetti) in favour of the position that a probability measure is the unique right representation of rational degrees of belief. The argument is known as the Dutch book argument and roughly runs as follows. Suppose one takes the degree of belief of an ideally rational individual X in a proposition A to be the number q such that he is willing to bet on A at odds q : 1 − q and against A at odds 1 − q : q. Then under some reasonable assumptions it can be shown that X can avoid accepting a set of bets which would result in a sure loss for X (a Dutch book for X) iff X's assignment of degrees of belief constitutes a probability measure.

Definition 1.5.1 Let Σ be an algebra on Ω. A bet on A ∈ Σ is a tuple b = ⟨A, S, q⟩, where S ≥ 0 and 0 ≤ q ≤ 1. S is called the stake of b, q the betting quotient of b, and q : 1 − q the odds of b. A bet against A is a bet on A̅. BET(Ω, Σ) denotes the class of all bets on elements of Σ.

Example 1.5.1 Suppose you wager Dfl. 10 on the complete outsider Born Loser, which is quoted at 19 : 1 at the bookmakers. Then the (total) stake of your bet equals 1 · 10 + 19 · 10 = 200 and the odds of your bet on Born Loser are 1 : 19. The betting quotient equals 0.05, since 0.05 : 1 − 0.05 = 1 : 19.

The value of a bet on some event A can be determined as soon as it is decided whether A or A̅ holds.

Definition 1.5.2 Let b = ⟨A, S, q⟩ ∈ BET(Ω, Σ).

1. B is called b-specific iff B ∈ Σ∖{∅} such that B ⊆ A or B ⊆ A̅.

2. The value ‖b‖_B of b at a b-specific B is defined as follows:

‖b‖_B = (1 − q) · S if B ⊆ A, and ‖b‖_B = −q · S if B ⊆ A̅.

Example 1.5.2 (Continuation of example 1.5.1.) Suppose you are lucky: Born Loser wins. Then your bet has the value (1 − q) · S = (1 − 0.05) · 200 = 190.

Definition 1.5.3 A book with respect to an algebra Σ on Ω is a finite subset β of BET(Ω, Σ) such that ⟨A, S, q⟩, ⟨A, S, q′⟩ ∈ β ⇒ q = q′. The class of all books with respect to Σ is denoted by BOOK(Ω, Σ).

Definition 1.5.4 Let β ∈ BOOK(Ω, Σ).

1. A is called β-specific iff for every b ∈ β, A is b-specific.

2. The value ‖β‖_A of β at a β-specific A is defined as follows:

For any β-specific A, ‖β‖_A = ∑_{b∈β} ‖b‖_A. (30)

Definition 1.5.5 A book β with respect to Σ is called a Dutch book iff for every β-specific A ∈ Σ, ‖β‖_A < 0.
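Definitions 1.5.1-1.5.5 translate directly into code. The following Python sketch (an illustration only; the two-element sample space and all names are made-up assumptions) evaluates bets at elementary events, reproducing the value of example 1.5.2; since every singleton is b-specific, checking the elementary events is enough to recognise a Dutch book on a powerset algebra (cf. lemma 1.5.1 below):

    def bet_value(bet, world):
        """Definition 1.5.2: payoff of a bet <A, S, q> once a world is decided."""
        A, S, q = bet
        return (1 - q) * S if world in A else -q * S

    def book_value(book, world):
        """Definition 1.5.4: value of a book at an elementary event."""
        return sum(bet_value(b, world) for b in book)

    def is_dutch_book(book, omega):
        """Definition 1.5.5: a sure loss, checked at the elementary events."""
        return all(book_value(book, w) < 0 for w in omega)

    # Example 1.5.2: the bet on Born Loser, total stake 200, betting quotient 0.05.
    omega = {"born_loser_wins", "another_horse_wins"}
    born_loser = (frozenset({"born_loser_wins"}), 200, 0.05)
    print(bet_value(born_loser, "born_loser_wins"))   # roughly 190, as in the example
    print(is_dutch_book([born_loser], omega))         # False: no sure loss here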
Definition 1.5.6 The acceptance set of an individual X with respect to an algebra Σ is a set Acc_X(Ω, Σ) ⊆ BET(Ω, Σ) such that

1. If ⟨A, S, q⟩ ∈ Acc_X(Ω, Σ) and α > 0, then ⟨A, αS, q⟩ ∈ Acc_X(Ω, Σ).

2. If ⟨A, S, q⟩ ∈ Acc_X(Ω, Σ) and 0 ≤ q′ ≤ q, then ⟨A, S, q′⟩ ∈ Acc_X(Ω, Σ).

3. ∀A ∈ Σ, ∃!q such that ⟨A, 1, q⟩ and ⟨A̅, 1, 1 − q⟩ ∈ Acc_X(Ω, Σ).

The unique q mentioned above is called X's degree of belief in A. We define bel_X to be the function on Σ such that for every A ∈ Σ, bel_X(A) = X's degree of belief in A. bel_X is called the belief function of Acc_X(Ω, Σ).

The first condition on acceptance sets implies that the acceptability of a bet should not depend on the stake, but only on the event and the betting quotient (or odds). If a bet is accepted, then the second condition requires that bets on the same event and with the same stake, but with a more favourable betting quotient, should also be accepted. The third condition requires that for each event there is a unique breaking point, that is, a betting quotient at which one is indifferent as to which side of the bet one takes (either betting on the event, or against the event at the reverse odds).

Notice that an acceptance set Acc_X(Ω, Σ) is completely determined by (the degrees of belief given by) its belief function bel_X.

Definition 1.5.7 Let Acc_X(Ω, Σ) be an acceptance set and let bel_X be its belief function. The set Acc_X(Ω, Σ) and the degrees of belief given by bel_X are called coherent iff Acc_X(Ω, Σ) does not contain a Dutch book with respect to Σ.

The conditions on an acceptance set do not rule out the possibility that its degrees of belief are incoherent.

Example 1.5.3 Assume that A and B are elements of an algebra Σ such that A ∩ B = ∅. Further assume that bel_X(A) = 0.3, bel_X(B) = 0.2, and bel_X(A ∪ B) = 0.6. Then X is vulnerable to the following Dutch book with respect to Σ:

β = {⟨A̅, 1, 0.7⟩, ⟨B̅, 1, 0.8⟩, ⟨A ∪ B, 1, 0.6⟩}.

Obviously, β is a subset of Acc_X(Ω, Σ). For any β-specific set C there are three possibilities: (1) C ⊆ A̅ ∩ B, (2) C ⊆ A ∩ B̅, and (3) C ⊆ A̅ ∩ B̅. In case (1), ‖β‖_C = (1 − 0.7) − 0.8 + (1 − 0.6) = −0.1. In case (2), ‖β‖_C = −0.7 + (1 − 0.8) + (1 − 0.6) = −0.1. In case (3), ‖β‖_C = (1 − 0.7) + (1 − 0.8) − 0.6 = −0.1. Hence β is a Dutch book.
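The sure loss can be verified mechanically. In the Python sketch below (the names and the encoding of events are assumptions of the sketch, not notation from the notes), the three relevant cells of the sample space are A, B, and "neither", and the book of example 1.5.3 loses 0.1 on each of them:

    # The three bets of example 1.5.3, written as (event, stake, betting quotient);
    # events are encoded as sets of the three relevant cells of the sample space.
    cells = ["A", "B", "neither"]              # A ∩ B = ∅, so these cells suffice
    not_A, not_B, A_or_B = {"B", "neither"}, {"A", "neither"}, {"A", "B"}

    book = [(not_A, 1, 0.7), (not_B, 1, 0.8), (A_or_B, 1, 0.6)]

    def value(book, cell):
        """Value of the book at a β-specific event contained in the given cell."""
        return sum((1 - q) * S if cell in event else -q * S for event, S, q in book)

    print({cell: round(value(book, cell), 10) for cell in cells})
    # {'A': -0.1, 'B': -0.1, 'neither': -0.1}  -- a sure loss, i.e. a Dutch book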
The above example already illustrates an essential step in the proof of the Dutch Book Theorem. This theorem states that the degrees of belief given by bel_X are coherent iff bel_X is a probability function. Before we prove the theorem, we first mention a few lemmata.

Lemma 1.5.1 Let β ∈ BOOK(Ω, Σ) and let Δ be a basis of Σ. The book β is a Dutch book iff for every D ∈ Δ, ‖β‖_D < 0.

Lemma 1.5.2 Let β ⊆ Acc_X(Ω, Σ). If β is a Dutch book, then the book {⟨A, S_A, bel_X(A)⟩ : A ∈ Σ, S_A = ∑_{⟨A,S,q⟩∈β} S} is also a Dutch book contained in Acc_X(Ω, Σ).

Proposition 1.5.1 (Dutch Book Theorem) Let Σ be an algebra on Ω. Acc_X(Ω, Σ) is coherent iff bel_X is a probability function on Σ.

Proof. Assume that Acc_X(Ω, Σ) does not contain a Dutch book with respect to Σ. We show that bel_X is a probability function on Σ.

1. The condition of non-negativity is automatically satisfied.

2. Suppose bel_X(Ω) ≠ 1. Then bel_X(Ω) < 1, and {⟨∅, 1, 1 − bel_X(Ω)⟩} (the bet against Ω at the reverse odds) is a Dutch book contained in Acc_X(Ω, Σ). Thus bel_X(Ω) = 1, and bel_X satisfies unit normalisation.

3. Let A, B ∈ Σ such that A ∩ B = ∅.

(a) If bel_X(A ∪ B) < bel_X(A) + bel_X(B), then {⟨Ω∖(A ∪ B), 1, 1 − bel_X(A ∪ B)⟩, ⟨A, 1, bel_X(A)⟩, ⟨B, 1, bel_X(B)⟩} is a Dutch book contained in Acc_X(Ω, Σ).

(b) If bel_X(A ∪ B) > bel_X(A) + bel_X(B), then {⟨A ∪ B, 1, bel_X(A ∪ B)⟩, ⟨A̅, 1, 1 − bel_X(A)⟩, ⟨B̅, 1, 1 − bel_X(B)⟩} is a Dutch book contained in Acc_X(Ω, Σ).

Thus bel_X satisfies finite additivity.

It remains to show that if bel_X is a probability function on Σ, then Acc_X(Ω, Σ) does not contain a Dutch book with respect to Σ. We use a method due to John Kemeny.

Assume that bel_X is a probability function on Σ, and let Δ be a basis of Σ. Suppose that β ⊆ Acc_X(Ω, Σ) is a Dutch book. By lemma 1.5.2, we may assume that β = {⟨A, S_A, bel_X(A)⟩ : A ∈ Σ} (with S_A = 0 for those A on which β places no bet). Then for any D ∈ Δ,

‖β‖_D = ∑_{A⊇D} (1 − bel_X(A)) · S_A + ∑_{A⊉D} −bel_X(A) · S_A.

Consider the estimated profit of β, Prf(β) := ∑_{D∈Δ} bel_X(D) · ‖β‖_D. Since ∑_{D∈Δ} bel_X(D) = 1, and for any D ∈ Δ, bel_X(D) ≥ 0 and ‖β‖_D < 0, it follows that Prf(β) < 0.

On the other hand, Prf(β) = ∑_{A∈Σ} a_A · S_A, where

a_A = (1 − bel_X(A)) · ∑_{D∈Δ, D⊆A} bel_X(D) − bel_X(A) · ∑_{D∈Δ, D⊈A} bel_X(D).

Since bel_X is a probability function on Σ, we have

a_A = (1 − bel_X(A)) · bel_X(A) − bel_X(A) · (1 − bel_X(A)) = 0.

Hence Prf(β) = 0, contradicting Prf(β) < 0. We may conclude that Acc_X(Ω, Σ) contains no Dutch book.
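The key step of Kemeny's argument (that breaking-point bets have estimated profit zero when bel_X is a probability function) is easy to check numerically. The Python sketch below uses made-up probabilities and stakes, purely as an illustration: it computes the payoff of such a book at each elementary event and its expectation under bel_X, which comes out as exactly 0, so the payoffs cannot all be negative:

    from fractions import Fraction
    from itertools import chain, combinations

    omega = ["w1", "w2", "w3"]
    # A probability function, given by its (made-up) values on the basis elements.
    bel = {"w1": Fraction(1, 2), "w2": Fraction(1, 3), "w3": Fraction(1, 6)}

    events = [frozenset(s) for s in chain.from_iterable(
        combinations(omega, r) for r in range(len(omega) + 1))]

    # Arbitrary stakes S_A for the book {<A, S_A, bel_X(A)> : A in Sigma}.
    stake = {A: Fraction(len(A) + 1, 3) for A in events}

    def bel_event(A):
        return sum(bel[w] for w in A)

    def payoff(w):
        """Value of the book at w: winning bets pay (1-q)S, losing bets cost qS."""
        return sum((1 - bel_event(A)) * stake[A] if w in A
                   else -bel_event(A) * stake[A]
                   for A in events)

    # The estimated profit of the proof: the expectation of the payoff under bel_X.
    print(sum(bel[w] * payoff(w) for w in omega))   # 0, so not every payoff is < 0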
There is a large body of literature on the Dutch book argument. Although it has been criticised by many authors, it remains a strong, intuitively appealing argument for using probabilities as degrees of belief of an ideally rational agent. The argument is not airtight, but it is reasonable to demand that any proposal for an alternative theory should be accompanied by an explanation why the Dutch book argument does not disqualify the proposed theory.

We end this section by briefly discussing a few well-known objections against the Dutch book argument.

Objection 1 I don't like gambling. Why should I bet?

Answer. The argument is about the degrees of belief of ideally rational agents. One can argue that such agents differ from humans in that they maximise utility without being bothered by taking some risk. Alternatively, one can circumvent the effect of risk-taking by saying that a bet is put in the acceptance set of an individual X, not when X actually accepts the bet, but when he would accept the bet in case risk avoidance were not an issue.

Objection 2 I like gambling. It's fun! I don't mind losing some money in the process.

Answer. Remarks similar to those above apply. In addition, it should be stressed that a Dutch book results in a sure loss as soon as the relevant propositions are decided. Most gamblers know that they are likely to lose in the long run, but gambling is attractive because there is (or at least appears to be) a chance of winning. There is not much fun in gambling without the possibility of winning.

Objection 3 The argument only applies to decidable propositions. To determine the value of a bet on an event A one should be able to decide whether A or A̅ holds.

Answer. Many propositions are (in principle) decidable. So the result that for these propositions degrees of belief should be probabilities is still a very strong result. Moreover, even if the proposition A is undecidable, it seems irrational to accept a set of bets that would result in a loss if it were determined that A holds and that would also result in a loss if it were determined that A̅ holds.

Objection 4 It is unreasonable to require that the acceptance of a bet does not depend on the stake involved.

Answer. The argument is about the degrees of belief of ideally rational agents. They are not bothered by earthly concerns like having only a finite amount of money. Moreover, the proof that coherent degrees of belief are necessarily probabilities only uses bets with stake 1.

Objection 5 In the presence of ignorance with respect to the exact uncertainties, it is unreasonable to require the existence of a (unique) breaking point at which one is indifferent to bet on or against an event.

Answer. In the presence of ignorance with respect to the exact uncertainties, it is still possible to determine certain least committed choices of exact breaking points, by means of symmetry arguments, or principles such as the indifference principle or the maximum entropy principle.

The given answers to the objections are not meant to be the final word on the matter. In fact, in chapter 3, we argue (against the answer to the last objection) that in the presence of ignorance it might be reasonable to relax the requirement of the existence of a breaking point. This leads to an argument for generalised probability theory.

Exercise 1.5.1 Let A be the following statement: On the first of January of the year 2000 there will exist a chess program with an ELO rating of 2800 points. Several years ago, professor van den Herik and chess-player Böhm agreed to a bet with respect to A. Van den Herik put in Dfl. 500 on A while Böhm put in the same amount against A. What are the stake and the odds of the bet van den Herik agreed to? Can one conclude from this betting behaviour that professor van den Herik firmly believes A? (Why not?)

Exercise 1.5.2 Assume that A ∩ B = ∅, bel_X(A) = 0.4, bel_X(B) = 0.5, and bel_X(A ∪ B) = 0.7. Construct a Dutch book against X.

Exercise 1.5.3 Assume that bel_X(A ∩ B) = 0.2, bel_X(A) = bel_X(B) = 0.4, and bel_X(A ∪ B) = 0.7. Construct a Dutch book against X.

Exercise 1.5.4 Let β = {⟨A, S, q⟩} be a book with respect to an algebra Σ on Ω. Show that β is a Dutch book iff A = ∅, S > 0, and q > 0.

Exercise 1.5.5 Prove lemma 1.5.1.

Exercise 1.5.6 Prove lemma 1.5.2.

Exercise 1.5.7 Let Σ be an algebra on Ω. β ⊆ BET(Ω, Σ) is called a weak Dutch book iff for every β-specific A ∈ Σ, ‖β‖_A ≤ 0, and there exists a β-specific A ∈ Σ with ‖β‖_A < 0. Show that if Acc_X(Ω, Σ) does not contain a weak Dutch book with respect to Σ, then bel_X is a probability function on Σ such that bel_X(A) = 1 implies that A = Ω.