Aspects of Bayesian Networks

Amber Tomas

October 31, 2003

Supervisor: Dr Nigel Bean

Thesis submitted as a requirement for the degree of Bachelor of Mathematical and Computer Sciences (Honours) in the School of Applied Mathematics

Contents

Signed Statement
Abstract
1
  1.1 Introduction
  1.2 Preliminary Information
    1.2.1 Graph Theory
    1.2.2 Probability Theory
    1.2.3 Hypothesis Testing
2 Bayesian Networks
  2.1 Independence Relations
  2.2 Probabilistic Bayesian Networks
  2.3 Bayesian Belief Networks
  2.4 Causal Bayesian Networks
3 Information Propagation
  3.1 Belief Distributions
  3.2 Concepts and Notation
  3.3 Belief Propagation
    3.3.1 Propagation in Trees
    3.3.2 Propagation in Polytrees
    3.3.3 Networks containing Loops
  3.4 Answering Queries
4 Aspects of Structure
  4.1 Concepts from Information Theory
  4.2 Changes to the Network
    4.2.1 Changes to the Parents of a Node
    4.2.2 Changes in Networks Containing Loops
    4.2.3 Removal of Nodes
  4.3 Simplifying the Structure
  4.4 The Value of Evidence
    4.4.1 Selecting a Set of Evidence Nodes
    4.4.2 Updating the Set of Evidence Nodes
5 Learning Bayesian Networks from Data
  5.1 Introduction
  5.2 Considerations of learning
  5.3 Methods of Learning Bayesian Networks
    5.3.1 Scoring Functions and Search Methods
    5.3.2 Maximum Likelihood
    5.3.3 Hypothesis Testing
    5.3.4 Resampling
    5.3.5 Bayesian Methods
  5.4 The Information Theoretic Approach
    5.4.1 Chow and Liu Trees
    5.4.2 More general networks
  5.5 The Bayesian Approach
    5.5.1 Notation
    5.5.2 Known Structure
    5.5.3 Unknown Structure
    5.5.4 Prior Distributions
    5.5.5 Incomplete Data
  5.6 Confidence Measures on Structural Features
6 Influence Diagrams for Decision Making
  6.1 Decision Scenarios
  6.2 Influence Diagrams
  6.3 Solution of Influence Diagrams
7 Conclusions and Remarks

List of Figures

1.1 A Bayesian Network for the variables Party, Alcohol Consumption, Level of Co-ordination and Clarity of Speech
1.2 The network in a) has a tree structure, b) is a polytree and c) contains a loop.
2.1 The four types of connections possible in a directed network; a) disconnected node, b) serial connection, c) diverging connection and d) converging connection.
2.2 E is an auxiliary variable used to represent the dependency relations of a) in a directed graph.
2.3 An alternate structure for the probability distribution over the variables given in Figure 1.1.
2.4 The network in a) is a Probabilistic Bayesian Network of a domain and the network in b) is a Bayesian Belief Network of the same domain.
3.1 A Bayesian Network over the two variables X and Y.
3.2 A section of a Bayesian Network showing the messages that node X receives from its neighbours and the messages that X sends out to its neighbours at each iteration.
3.3 A Bayesian Network model to predict a citizen's vote in an election.
3.4 A graphical representation of the Noisy OR-Gate. Only if a cause is present and its inhibitor is not acting will the event X occur.
3.5 The Markov Network of the directed network in a) is formed by removing the arrows on the links and adding a link between nodes which had a common child.
3.6 The formation of a join tree c) from the Bayesian Network in a). This allows the use of propagation methods for trees in what was originally a multiply connected network.
3.7 A simple Bayesian Network containing a loop.
3.8 Networks with the addition of query nodes.
4.1 A tree structure.
4.2 A Bayesian Network formed on the four variables Infection (I), Fever (F), Tired (T) and Alertness (A). The parent of Alertness is yet to be determined.
4.3 Equivalent network structures
4.4 The networks which would be used for inference if the state of X3 in the networks in Figure 4.3 were fixed.
4.5 A Bayesian Network. If X4 is instantiated and X3 removed, then the path from X1 to X6 will be blocked.
4.6 The network formed based on the node ordering X1, ..., X5, with boundary strata as given.
4.7 A structural hierarchy. For example, C could represent a cluster of variables representing causes, D contain disease nodes and S contain nodes representing possible symptoms.
4.8 A Bayesian Network with hypothesis variable Glandular Fever and information variables Test Result, Thermometer Reading and Tiredness.
5.1 A Bayesian Network for which we wish to learn the parameters by the method of maximum likelihood.
5.2 Diagram illustrating the addition of dummy links to the network in a) to form the complete network b).
5.3 All structures on two variables.
6.1 A Bayesian Network with a decision node D and utility nodes U1, U2 and U3.
6.2 A network representing the two decisions which must be made in order to buy a movie ticket.
6.3 Here A represents the variable Attraction to boyfriend, M represents the variable Mood, and E represents the variable Enjoyment.
6.4 A chain of variables representing the decision scenario in Figure 6.2.
6.5 The decision scenario of Figure 6.2 represented as a network with added information and precedence links.
6.6 Influence Diagram representing the decision of which ticket to buy on a sequence of dates.
6.7 Compact representation of the Influence Diagram in Figure 6.6.

Signed Statement

This work contains no material which has been accepted for the award of any other degree or diploma in any university or other tertiary institution and, to the best of my knowledge and belief, due reference has been made in the text to material previously published or written by another person.

SIGNED: ....................... DATE: .......................

Abstract

Bayesian Networks are a representation of a probability distribution. They consist of two components, a graphical component and a probabilistic component. The graphical component encodes the dependence structure which exists between the variables in the domain, and the probabilistic component provides the remaining information about the joint probability distribution. Bayesian Networks are used primarily as a modelling tool to aid decision making, and in this thesis they are discussed in such a context.

In Chapter 2 we introduce Bayesian Networks in a more formal manner and look at three types in particular - Probabilistic Bayesian Networks, Bayesian Belief Networks and Causal Bayesian Networks. Chapter 3 then examines how inference in a Bayesian Network is performed and the process by which probabilities and information are propagated through the network. In Chapter 4 we take an information theoretic approach to examine the effect on a probability distribution of minor changes to the network structure and the best ways to incorporate new information about the variables. In these chapters we assume for the most part that the structure and probabilities of our network are known.
In Chapter 5 we look at how to form a Bayesian Network for the domain of interest given that we have some data and/or prior knowledge. Both the Information Theory and Bayesian approahes are onsidered in some detail. Finally, in Chapter 6 we apply the theory of previous hapters to deision theory. In partiular we disuss the formation of Inuene Diagrams as an extension to Bayesian Networks and how these an be solved to give an optimal deision sequene. vi Chapter 1 1.1 Introdution Bayesian Networks provide a graphial representation and a probabilisti model of the relationships whih exist among a set of variables. For example, the Bayesian Network in Figure 1.1 displays the assoiation struture that exists between the four variables Party, Alohol Consumption, Level of Co-ordination and Clarity of Speeh. It shows the existene of an assoiation between Party and Alohol Consumption but also represents the lak of diret assoiation between attendene at a party and the level of one's o-ordination - this assoiation is mediated by the variable Alohol Consumption. The strength of assoiation along eah link is quantied by a set of onditional probabilities. Quantifying these relationships allows us to make inferene about the variables in the model. For example knowing the state of Level of Co-ordination and Clarity of Speeh allows us to reason about the most likely state of Alohol Consumption and Party. Party Alcohol Consumption Level of Co−ordination Clarity of Speech Figure 1.1: A Bayesian Network for the variables Party, Alohol Consumption, Level of Co-ordination and Clarity of Speeh . The use of Bayesian Networks has inreased signiantly over the last deade as their suitability for aiding deision problems and storing and updating information about the variables in some domain has been reognised and aepted in elds suh as Defene, Management, Artiial Intelligene and Teleommuniations. This boom in popularity 1 1.1. INTRODUCTION has been aided by the large amounts of researh done in the early nineties on improving the eÆieny of reasoning using Bayesian Networks, in onjuntion with the inrease in available omputing power. During this period a broad sope of appliations was realised whih ontributed to the development of easy to use graphial interfae software whih is now widely available from many dierent ompanies. In this thesis we onsider several theoretial aspets of Bayesian Networks. Although Bayesian Networks are a tool for modelling and deision making, by taking a theoretial approah we hope to gain a solid understanding of the proesses whih aet the suitability of a Bayesian Network to model spei senarios. For example, gaining an understanding of how information is used by a network, and the impliations of its struture, we beome more aware of the issues whih are important when formulating a model, and how the assumptions we make might aet the onlusions we draw. Addressing the theoretial aspets of how inferene proeeds when using a Bayesian Network also give us an understanding of the issues that are likely to arise when forming a model of a system, and how best to deal with them. It is hoped that the ideas in this thesis will provide the reader with a good understanding of Bayesian Networks and hene a solid basis from whih one an start to model real-world senarios. Although the onstrution and implementation of a Bayesian Network as a tool for inferene is unique, the theoretial basis for suh networks brings together areas of mathematis suh as graph theory, probability and statistis. 
In Setion 1.2 we present some relevant ideas from these elds, knowledge of whih will be assummed in later setions. In Chapter 2 we introdue Bayesian Networks in a more formal manner and look at three types in partiular - Probabilisti Bayesian Networks, Bayesian Belief Networks and Causal Bayesian Networks. Chapter 3 then examines how inferene in a Bayesian Network is performed and the proess by whih probabilities and information is propagated through the network. In Chapter 4 we take an information theoreti approah to examine the eet on a probability distribution of minor hanges to the network struture and the best ways to inorporate new information about the variables. In these hapters we assume, for the most part, that the struture and probabilities of our network are known. In Chapter 5 we look at how to form a Bayesian Network for the domain of interest given that we have some data and/or prior knowledge. Both the Information Theory and Bayesian approahes are onsidered in some detail. Finally, in Chapter 6 we apply the theory of previous hapters to deision theory. In partiular we disuss the formation of Inuene Diagrams as an extension to Bayesian Networks and how these an be solved to give an optimal deision sequene. 2 1.2. PRELIMINARY INFORMATION 1.2 Preliminary Information 1.2.1 Graph Theory 1.2.1.1 Undireted Graphs An undireted graph G onsists of a set of nodes X and a set of edges E , where eah element e of E is an unordered pair (Xi ; Xj ) and Xi and Xj are distint elements of X . We say a graph is omplete if there exists an edge between every distint pair of nodes. A graph on n nodes is omplete if and only if there are n(n 1)=2 distint edges. The order of a node is the number of links with one end-point at that node. There is said to be a path from X1 to Xk in G if there exist edges in E (X1 ; X2 ); (X2 ; X3 ); : : : ; (Xk 1 ; Xk ), where the Xi are all distint. An undireted graph is said to ontain a yle if the end-points X1 and Xk of a path oinide [11℄. 1.2.1.2 Direted Graphs If the set of edges, E , of a graph onsists of ordered pairs (Xi ; Xj ), then the graph D is alled a direted graph. The diretion of the edges is generally represented graphially by an arrow from node Xi to node Xj . A direted graph is termed ayli if there are no direted yles, that is there does not exist a sequene of edges (X1 ; X2 ); (X2 ; X3 ); : : : ; (Xk 1 ; Xk ) suh that X1 = Xk . Note that, although this denition appears idential to that for undireted graphs, here the diretion of the links is important. As for undireted graphs, a direted graph is omplete if there is an edge (in either diretion) between every distint pair of nodes. Although an undireted omplete graph is unique, there exist many omplete direted graphs over the set E , whih an be obtained by reversing the diretion of some or all of the links. The parents of a node Xj are dened to be those nodes Xi suh that (Xi ; Xj ) is in E . The hildren of a node Xj are dened to be those nodes Xk suh that (Xj ; Xk ) is in E . Additionally, the anestors of a node Xi in D are all nodes whih are predeessors of Xi , that is a node Xa is an anestor of Xi if there exists a direted path from Xa to Xi . A direted graph is a tree if every node has at most one parent. D is alled a polytree or is said to be singly onneted if there are no loops (undireted yles). For example, onsider the direted graphs in Figure 1.2. The graph in Figure 1.2 a) is a tree, as eah node has no more than one parent. 
The graph in Figure 1.2 b) is a polytree, and the graph in Figure 1.2 c) contains a loop, though not a (directed) cycle. In Figure 1.2 b), the node $X_5$ has parents $\{X_3, X_4\}$ and ancestors $\{X_1, X_2, X_3, X_4\}$.

Figure 1.2: The network in a) has a tree structure, b) is a polytree and c) contains a loop.

1.2.2 Probability Theory

In this thesis we use uppercase letters, $X_i$ say, to represent random variables and the associated lower case letter, $x_i$ say, to denote that $X_i$ is in the particular state $x_i$. All variables are assumed to be discrete and have a finite number of mutually exclusive states. We assume that we are interested in the set of random variables $\mathbf{U} = \{X_1, X_2, \ldots, X_n\}$, where $\mathbf{U}$ represents the set of all variables in the universe, or domain, and that the variables in $\mathbf{U}$ exist on the multi-dimensional state space $\mathcal{U}$. The joint probability distribution of the set of variables $\mathbf{U}$ is denoted $P(X_1, X_2, \ldots, X_n)$ or $P(\mathbf{U})$ and specifies the probability that $\{X_1 = x_1\} \cap \{X_2 = x_2\} \cap \ldots \cap \{X_n = x_n\}$ for all $\mathbf{u} = (x_1, x_2, \ldots, x_n)$ in $\mathcal{U}$.

It will often be convenient to decompose the set of variables $\mathbf{U}$ into disjoint subsets. In general, we denote a set by an upper-case bold face letter, the state space of the variables in that set by the corresponding calligraphic letter, and a component of that state space by the lower case bold letter. For example the variables in the set $\mathbf{X}$ can take configurations $\mathbf{x} \in \mathcal{X}$. When making a summation over all $\mathbf{x} \in \mathcal{X}$, say, we will often write $\sum_{\mathbf{x}}$ to mean $\sum_{\mathbf{x} \in \mathcal{X}}$, where the restriction of $\mathbf{x}$ to $\mathcal{X}$ is taken to be implicit. For any configuration $\mathbf{u}$ of the variables in $\mathbf{U}$ and subset of $\mathbf{U}$, $\mathbf{X}$ say, we let $\mathbf{X}[\mathbf{u}]$ be the components of $\mathbf{u}$ that correspond to the random variables in $\mathbf{X}$. That is, $\mathbf{X}[\mathbf{u}]$ is a vector in $\mathcal{X}$, whose entries correspond to a particular subset of the entries in $\mathbf{u}$.

Consider a subset of variables $\mathbf{A} \subseteq \mathbf{U}$ which exists on the multi-dimensional state space $\mathcal{A}$. We denote the joint probability distribution of the variables in $\mathbf{A}$ by $P(\mathbf{A})$. The event $\{\mathbf{A} = \mathbf{a}\}$ occurs when the variables in $\mathbf{A}$ are in configuration (state) $\mathbf{a}$, and $P(\mathbf{A} = \mathbf{a})$ specifies the probability of this event. Hence

\[
P(\mathbf{A} = \mathbf{a}) = P\Big( \bigcap_{X_i \in \mathbf{A}} \{X_i = X_i[\mathbf{a}]\} \Big),
\]

and we will often abbreviate this to $P(\mathbf{a})$. The joint probability distribution of the variables in a subset $\mathbf{A}$ of $\mathbf{U}$ can be obtained from $P(\mathbf{U})$ by the summation

\[
P(\mathbf{A} = \mathbf{a}) = \sum_{\mathbf{u} \in \mathcal{U}} P(\mathbf{u}) \, I(\mathbf{A}[\mathbf{u}] = \mathbf{a}),
\]

for all $\mathbf{a} \in \mathcal{A}$, where $I$ is the indicator function

\[
I = \begin{cases} 1 & \text{if } \mathbf{A}[\mathbf{u}] = \mathbf{a} \\ 0 & \text{otherwise.} \end{cases}
\]

$P(\mathbf{A})$ is called the marginal probability distribution of $\mathbf{A}$. The conditional probability of two sets of random variables $\mathbf{A}$ and $\mathbf{B}$ is given by

\[
P(\mathbf{A} = \mathbf{a} \mid \mathbf{B} = \mathbf{b}) = \frac{P(\{\mathbf{A} = \mathbf{a}\} \cap \{\mathbf{B} = \mathbf{b}\})}{P(\mathbf{B} = \mathbf{b})}, \tag{1.1}
\]

and the joint conditional distribution of $\mathbf{A}$ given $\{\mathbf{B} = \mathbf{b}\}$ is denoted by $P(\mathbf{A} \mid \mathbf{B} = \mathbf{b})$. This implies that

\[
P(\{\mathbf{A} = \mathbf{a}\} \cap \{\mathbf{B} = \mathbf{b}\}) = P(\mathbf{A} = \mathbf{a} \mid \mathbf{B} = \mathbf{b}) \, P(\mathbf{B} = \mathbf{b}). \tag{1.2}
\]

As well as considering the marginal distributions of subsets of variables, we will also want to specify the joint distribution of the variables which belong to two different subsets, where variables in one subset may not be independent of the variables in the other. The joint probability distribution of all the variables which are contained in either $\mathbf{A}$ or in $\mathbf{B}$, $P(\mathbf{A} \cup \mathbf{B})$, specifies the probabilities $P(\{\mathbf{A} = \mathbf{a}\} \cap \{\mathbf{B} = \mathbf{b}\})$ for all $\mathbf{a} \in \mathcal{A}$ and $\mathbf{b} \in \mathcal{B}$. Throughout this thesis the joint probability distribution of the random variables in the union of the sets $\mathbf{A}$ and $\mathbf{B}$ is denoted $P(\mathbf{A}, \mathbf{B})$.
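As a concrete illustration of these definitions, the short Python sketch below (not part of the thesis; the joint table is invented purely for illustration) stores a joint distribution over two binary variables as a table of configurations, recovers a marginal by the indicator-function summation above, and computes a conditional probability directly from definition (1.1).

```python
# A hypothetical joint distribution P(U) over two binary variables A and B,
# stored as a table indexed by configurations u = (a, b).
P = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def marginal(P, index, value):
    """P(component = value), obtained by summing P(u) over all configurations
    u whose given component equals value (the indicator-function sum above)."""
    return sum(p for u, p in P.items() if u[index] == value)

def conditional(P, index_a, a, index_b, b):
    """P(A = a | B = b), computed directly from definition (1.1)."""
    joint = sum(p for u, p in P.items() if u[index_a] == a and u[index_b] == b)
    return joint / marginal(P, index_b, b)

print(marginal(P, 0, 1))           # P(A = 1) = 0.5
print(conditional(P, 0, 1, 1, 1))  # P(A = 1 | B = 1) = 0.4 / 0.6, about 0.667
```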
The variables in $\mathbf{A}$ are said to be independent of the variables in $\mathbf{B}$ if and only if

\[
P(\{\mathbf{A} = \mathbf{a}\} \cap \{\mathbf{B} = \mathbf{b}\}) = P(\mathbf{A} = \mathbf{a}) \, P(\mathbf{B} = \mathbf{b})
\]

for all vectors $\mathbf{a} \in \mathcal{A}$ and $\mathbf{b} \in \mathcal{B}$. Equivalently, as can be seen from (1.1), $\mathbf{A}$ and $\mathbf{B}$ are independent if and only if $P(\mathbf{A} = \mathbf{a} \mid \mathbf{B} = \mathbf{b}) = P(\mathbf{A} = \mathbf{a})$ for all $\mathbf{a}$ and $\mathbf{b}$.

Several times in this thesis we shall make use of the following theorem.

Theorem: The chain rule for probability states that

\[
P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}).
\]

Proof. The result holds trivially for $n = 1$. Assume the result holds for $n = k - 1$. Then

\begin{align*}
P(X_1, X_2, \ldots, X_k) &= P(X_k \mid X_1, X_2, \ldots, X_{k-1}) \, P(X_1, X_2, \ldots, X_{k-1}), \quad \text{from (1.2)} \\
&= P(X_k \mid X_1, X_2, \ldots, X_{k-1}) \prod_{i=1}^{k-1} P(X_i \mid X_1, \ldots, X_{i-1}) \\
&= \prod_{i=1}^{k} P(X_i \mid X_1, X_2, \ldots, X_{i-1}),
\end{align*}

where the second line follows from the assumption.

An important result that is needed in this thesis is Bayes' Rule. To derive this, we first rewrite (1.1) as

\[
P(\mathbf{B} = \mathbf{b} \mid \mathbf{A} = \mathbf{a}) = \frac{P(\{\mathbf{A} = \mathbf{a}\} \cap \{\mathbf{B} = \mathbf{b}\})}{P(\mathbf{A} = \mathbf{a})},
\]

which gives

\[
P(\{\mathbf{A} = \mathbf{a}\} \cap \{\mathbf{B} = \mathbf{b}\}) = P(\mathbf{B} = \mathbf{b} \mid \mathbf{A} = \mathbf{a}) \, P(\mathbf{A} = \mathbf{a}). \tag{1.3}
\]

Applying (1.3) in (1.2) gives

\[
P(\mathbf{B} = \mathbf{b} \mid \mathbf{A} = \mathbf{a}) \, P(\mathbf{A} = \mathbf{a}) = P(\mathbf{A} = \mathbf{a} \mid \mathbf{B} = \mathbf{b}) \, P(\mathbf{B} = \mathbf{b}),
\]

which leads to Bayes' Rule,

\[
P(\mathbf{A} = \mathbf{a} \mid \mathbf{B} = \mathbf{b}) = \frac{P(\mathbf{B} = \mathbf{b} \mid \mathbf{A} = \mathbf{a}) \, P(\mathbf{A} = \mathbf{a})}{P(\mathbf{B} = \mathbf{b})},
\]

for all $\mathbf{a} \in \mathcal{A}$ and $\mathbf{b} \in \mathcal{B}$.

Finally, the expected value of a random variable $X$, say, is defined as

\[
E[X] = \sum_{x \in \mathcal{X}} x \, P(X = x).
\]

It is generally considered to be the `average' value of that random variable.

1.2.3 Hypothesis Testing

`Statistical hypothesis testing is a formal means of distinguishing between probability distributions on the basis of random variables generated from one of the distributions' [33]. In general we assume that the data is generated from the distribution corresponding to our null hypothesis, and we seek to determine if there is enough evidence to reject the null hypothesis in favour of the alternative hypothesis. The hypotheses are generally denoted $H_0$ and $H_a$ respectively. We refer to any formal method for determining whether to reject or accept the null hypothesis as a hypothesis test.

A hypothesis test is generally based on a test statistic, that is some value which can be calculated from observed data. Assuming the null hypothesis to be true, it can be possible to calculate the sampling distribution of the test statistic, or, more likely, the asymptotic sampling distribution - the distribution of the test statistic as the size of the sample tends to infinity. If we then draw a sample and calculate the value of the test statistic for that particular sample, we can use the hypothesised distribution of the test statistic to calculate the probability of observing such a value given that the null hypothesis is true. The p-value is the probability of observing a value of the test statistic at least as extreme as the one obtained, assuming that the null hypothesis is true. If this is small then we have strong evidence against the null hypothesis. The level of significance of a test is the value chosen prior to having observed a sample such that the null hypothesis will be rejected if the p-value is below this level. For example a p-value of 0.03 for a test conducted at the 5% level of significance will result in the null hypothesis being rejected, whereas a p-value of 0.06 will not. In the case where a p-value is greater than the level of significance, it is said that there is insufficient evidence to reject the null hypothesis.
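A minimal sketch of these ideas, assuming a deliberately simple setting that is not taken from the thesis: the null hypothesis is that of a fair coin, the data (16 successes in 20 trials) are invented, the test statistic is the number of successes, its sampling distribution under $H_0$ is Binomial(20, 0.5), and the one-sided p-value is compared with a 5% level of significance.

```python
from math import comb

def binomial_pvalue(n, k, p0=0.5):
    """One-sided p-value: the probability of observing k or more successes in
    n trials when the success probability really is p0 (the null hypothesis)."""
    return sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))

# Hypothetical data: 16 successes in 20 trials.  H0: p = 0.5,  Ha: p > 0.5.
p_value = binomial_pvalue(20, 16)
alpha = 0.05   # level of significance, chosen before seeing the data

print(round(p_value, 4))   # about 0.0059
print("reject H0" if p_value < alpha else "insufficient evidence to reject H0")
```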
A Type One error is the probability that the test will rejet the null hypothesis when it is in fat true, and a Type Two error is the probability that the test will fail to rejet the null hypothesis when the alternative is true. The power of the test is the probability the test will rejet the null hypothesis when the alternative hypothesis is true. If we are performing a test for the values of some parameters of our model, say, then intuitively this should somehow be based on how likely it is that the data we have observed were generated from the hypothesised model. Many hypothesis tests are based on the likelihood. Given that we have sampled N observations x1 ; x2 ; : : : ; xN , if we assume the value of the parameters to be xed then the likelihood of the model is given by L(jx1 ; x2 ; : : : ; xN ) = P (x1 ; x2 ; : : : ; xN j): That is, the probability that the sample we observed was generated from the distribution dened by the parameters . Note that the likelihood is generally onsidered as a funtion of . 7 Chapter 2 Bayesian Networks A Bayesian Network is a direted ayli graph with the following properties: - Eah node represents a variable in the domain of interest. In this thesis we will assume all variables are disrete with a nite number of states1 . For example, in Figure 1.1 the variable Party would typially be an indiator variable with states yes and no. As the other three variables an be measured on a ontinuous sale, we typially break up this sale so that the variables are then ordinal. That is, they have a disrete number of states whih an be ordered, for example Level of Coordination and Alohol Consumption may have states low, moderate and high. - Eah node has assoiated with it a table of onditional probabilities. The table of onditional probabilities at a node X , say, speies the probability that X will be in state x given that its parents Pa(X ) are in some onguration pa(X ) for all states x of X and possible ongurations of the variables in Pa(X ). That is the strength of the relationship between a node and its parent set is quantied by the onditional probability distributions P (X jPa(X ) = pa(X )), for all ongurations pa(X ). In the network in Figure 1.1 for example, the node Level of Coordination would have the onditional probability table P (Level of Coordination jAlohol Consumption ). For eah of the states of Alohol Consumption, this table will give the probability distribution over the states of Level of Coordination. If eah variable has 3 states, as suggested above, there would be 9 entries in the table. Bayesian Networks have been used and interpreted in a variety of ontexts. They an be used as an eÆient means of storing a probability distribution, or the ars interpreted as having ausal impliation and the underlying probability distribution used simply as Although a Bayesian Network an be formed on ontinuous variables or a mixture of ontinuous and disrete variables, the theory is quite dierent to the disrete ase. 1 8 2.1. INDEPENDENCE RELATIONS a tool for inferene. In this hapter we rst look at the relationship between the struture of a network and the probability distribution whih it represents. We then disuss the formation and appliation of three types of Bayesian Network, namely Probabilisti Bayesian Networks, Bayesian Belief Networks and Causal Bayesian Networks. 2.1 Independene Relations The struture of a Bayesian Network represents the independene relations whih exist between the variables. 
By understanding the independene relations enoded by the struture we are able to exploit these to make Bayesian Networks an eÆient tool for the storage and retrieval of information. Here we look at the onnetion between struture and independene in more detail. Let G be a graph, X; Y and Z be three disjoint sets of variables from the set of all variables in the universe U, and let P represent a joint probability distribution over U. We use M to represent some dependeny model in U, where a dependeny model an be thought of as a rule whih determines all subsets of triplets (X; Y; Z) for whih the assertion X is independent of Y given Z is true. If so, we denote this assertion by I (X; Z; Y)M , where if the ontext is lear the subsript M may be omitted. The representation of independene relations in a direted graph is assoiated with the onept of d-separation. This onept is best illustrated by onsidering the types of onnetion possible at a node, and the assoiated impliations of independene. The four types of onnetion possible at a node, B say, are shown in Figure 2.1. Figure 2.1 a) shows the trivial ase; when B is not onneted to any other node, B is onsidered to be independent of the remaining variables in the network. For the other ases, suppose we wish to know the onditions neessary to blok the ow of information from node A to node C , that is to render A and C independent. From Figure 2.1 b) it an be seen that if we know the state of B then no further information an be obtained at C by knowing the state of A, that is, knowing the state of B bloks the ow of information from A to C . A serial onnetion hene represents the notion that A is onditionally independent of C given B , and we say B d-separates A from C . Likewise in Figure 2.1 ), if we know the state of B then no information an be passed between A and C and so B d-separates A from C at a diverging onnetion also. In the ase of a onverging onnetion, shown in Figure 2.1 d), if we have no information about B then A annot reeive information about C and so A and C are independent. However, if the state of B is known and we have knowledge about the state of A say, then this will have an eet on our belief in the state of C . For example, if A represents the variable Inome, C the variable Number of Dependents and B the variable Spending on 9 2.1. INDEPENDENCE RELATIONS B B a) A B A C C b) c) A C d) B Figure 2.1: The four types of onnetions possible in a direted network; a) disonneted node, b) serial onnetion, ) diverging onnetion and d) onverging onnetion. Leisure, knowing the state of Inome does not give us any information about the possible number of dependents. However, if we know that a household has a moderately large expenditure on leisure, knowing there are many dependents would imply that it is likely Inome is quite large. In the ase of a onverging onnetion then, B does not d-separate A from C . More formally, if X; Y and Z are three disjoint subsets of nodes in a direted ayli graph D, then Z is said to d-separate X from Y, denoted hXjZjYiD if there is no path between a node in X and a node in Y for whih the following onditions both hold [30℄, 1. every node with a onverging onnetion is in Z or has a desendent in Z, and 2. every other node is outside Z. Further, we say D is a D-map of M if I (X; Z; Y)M ) hXjZjYiD ; that is if every relation of onditional independene in the model is represented in the graph by an instane of d-separation. 
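The two conditions above translate directly into a path-blocking check. The following Python sketch (not from the thesis; the graph is the Income, Spending on Leisure and Number of Dependents example just discussed) tests whether a single path is blocked by a conditioning set Z. To decide d-separation proper, every path between the two sets of nodes would be checked in the same way.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def path_blocked(dag, path, Z):
    """True if the two conditions above do not both hold along this path,
    i.e. if information cannot flow along it given the nodes in Z."""
    Z = set(Z)
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        converging = node in dag.get(prev, []) and node in dag.get(nxt, [])
        if converging:
            # condition 1 fails: the node is outside Z and has no descendant in Z
            if node not in Z and not (descendants(dag, node) & Z):
                return True
        elif node in Z:
            # condition 2 fails: a serial or diverging node lies inside Z
            return True
    return False

# Income -> Spending on Leisure <- Number of Dependents (a converging connection)
dag = {"Income": ["Spending"], "Dependents": ["Spending"], "Spending": []}
path = ["Income", "Spending", "Dependents"]
print(path_blocked(dag, path, Z=set()))         # True: blocked while Spending is unknown
print(path_blocked(dag, path, Z={"Spending"}))  # False: observing Spending opens the path
```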
D is alled an I-map of M if hXjZjYiD ) I (X; Z; Y)M ; that is the d-separations of the graph orrespond to onditional independenies in the model. D is termed a minimal I-map of M if it is an I-map of M and would no longer be an I-map of M if any link were removed. Note that a graph with fewer links will imply more d-separations and so more assertions of independene. If a link were removed from a minimal I-map then this would reate a d-separation whih does not orrespond to an assertion of onditional independene in the model. If D is both a D-map and an I-map, we say D is a perfet map. There does not neessarily exist a direted ayli graph that is a perfet map of some distribution P . If a network is an I-map, then the d-separation properties of the network orrespond to onditional independenies of the domain. Although not every onditional independeny will neessarily be represented in the network, a network whih is a minimal 10 2.2. PROBABILISTIC BAYESIAN NETWORKS I-map will be as `lose' to a D-map as possible, that is the number of independenies not represented by the network will be minimised. Given that we have a network known to be an I-map of a model, there exist algorithms for deduing the independene relations enoded in the struture. For example, Geiger, Verma and Pearl [15℄ present, among others, an algorithm whih takes as input two sets of nodes X and Z and returns a set of nodes whih ontains suÆient information to ompute P (XjZ). This is equivalent to identifying those nodes that are not d-separated from X, given that we know the states of the variables in Z. 2.2 Probabilisti Bayesian Networks Given a probability distribution P on U, a direted ayli graph D is a Probabilisti Bayesian Network of P if and only if D is a minimal I-map of P [30, pp 119℄. Given any probability distribution P it is possible to onstrut an undireted graph G whih is an I-map of P [30℄. Although the equivalent statement does not hold for direted graphs, with the use of auxillary variables it is possible to represent any dependeny model expressed by an undireted graph G in a direted ayli graph [30, pp 130℄. For example, the graph in Figure 2.2 a) asserts the two independene relationships I (A; fB; Dg; C ) and I (B; fA; C g; D). In Figure 2.2 b), D and B are both serial onnetions and so I (A; fB; Dg; C ) is true also. However, as C has a onverging onnetion, it does not d-separate B and D, hene the relation I (B; fA; C g; D) does not hold in this direted graph. However, if we introdue a fth (auxillary) variable E , as in Figure 2.2 ), then the onnetion at C is serial and so now I (B; fA; C g; D) and the two independene relations of the original undireted network in Figure 2.2 a) hold. A D a) A B C D A B b) C D B c) C E Figure 2.2: E is an auxillary variable used to represent the dependeny relations of a) in a direted graph. Given that we have some distribution P over a set of variables, we require a method of assigning the appropriate links between the nodes in the orresponding network. The following theorem allows us to develop suh a method. 11 2.2. PROBABILISTIC BAYESIAN NETWORKS Theorem 1: A neessary and suÆient ondition for D to be a Probabilisti Bayesian Network of P is that eah variable X be onditionally independent of all its anestors given its parents Pa(X ), and that no proper subset of Pa(X ) satises this ondition2 [30, pp 120℄. Consider some ordering (X1 ; X2 ; : : : ; Xn ) for the variables in U. 
To form a Bayesian Network with respect to this ordering, we specify that the ancestors of a node $X_i$ must be contained in the set $\mathbf{U}^{(i)} = \{X_1, X_2, \ldots, X_{i-1}\}$. If we let $\mathbf{B}_i \subseteq \mathbf{U}^{(i)}$ be a minimal set satisfying $I(X_i, \mathbf{B}_i, \mathbf{U}^{(i)} \setminus \mathbf{B}_i)$, then by Theorem 1 (a proof of this theorem can be found in [30]) any directed acyclic graph formed by designating $\mathbf{B}_i$ as parents of $X_i$, that is setting $\mathrm{Pa}(X_i) = \mathbf{B}_i$, is a Probabilistic Bayesian Network of $P$. We refer to the ordered set of subsets $\{\mathbf{B}_1, \mathbf{B}_2, \ldots, \mathbf{B}_n\}$ as the boundary strata of $M$ relative to the given ordering. Note that if a Bayesian Network is formed by the above method, then

\[
P(X_i \mid X_1, X_2, \ldots, X_{i-1}) = P(X_i \mid \mathrm{Pa}(X_i)). \tag{2.1}
\]

If we choose an alternate ordering for the variables, the boundary strata and hence the resulting network will differ. Hence there are many possible structures which can represent the same probability distribution once the conditional distributions have been specified. However, although every network is a minimal I-map of $P$, some orderings may result in a structure which represents more of the conditional independencies than others. The ordering chosen can be the difference between a complete graph, which requires the largest number of entries in the conditional probability tables, and a more efficient representation. Ideally when forming a Bayesian Network we would like to encode as many independencies as possible. For example, consider the variables represented in Figure 1.1 and suppose the independence relations

\[
I(LC, AC, P), \quad I(CS, AC, P), \quad I(LC, AC, CS)
\]

are judged to hold, where we have abbreviated the variable names to the first letter of each word. Under the ordering $(P, AC, LC, CS)$ the network in Figure 1.1 would result. If we had instead used the ordering $(LC, CS, AC, P)$, we would obtain the network in Figure 2.3.

Figure 2.3: An alternate structure for the probability distribution over the variables given in Figure 1.1.

The joint probability distribution over the domain can be computed most efficiently by making use of the conditional independencies encoded in the network. Beginning with the chain rule we have

\begin{align}
P(X_1, X_2, \ldots, X_n) &= P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_2, X_1) \cdots P(X_n \mid X_{n-1}, \ldots, X_1) \nonumber \\
&= \prod_i P(X_i \mid \mathrm{Pa}(X_i)), \tag{2.2}
\end{align}

where the second line follows from the relation (2.1). Hence the joint distribution can be obtained from the distributions of each node conditioned on its parents. Ordinarily, to store the joint distribution over the variables $X_1, X_2, \ldots, X_n$ we would need to store the probability $P(x_1, x_2, \ldots, x_n)$ for one less than the $\prod_{i=1}^{n} |X_i|$ possible configurations of the variables, where we have used the fact that the probabilities must sum to 1, and used $|X_i|$ to denote the number of states of $X_i$. In a Bayesian Network the size of the conditional probability table required at a node $X_i$ will depend on the number of parents of $X_i$ and the number of states of $X_i$ and its parents. For some fixed configuration of the parents of $X_i$, we know that the sum over the probabilities that $X_i$ is in some state $k$ is equal to 1. Therefore the number of entries required to define the conditional probability tables and hence, from (2.2), the joint probability distribution, is

\[
\sum_{i=1}^{n} (|X_i| - 1) \prod_{X_k \in \mathrm{Pa}(X_i)} |X_k|.
\]

This can be considerably less than the previous expression. For example, consider again the network in Figures 1.1 and 2.3.
In Figure 1.1, assuming Party has 2 states and the other variables 3 states, the onditional probability table at Alohol Consumption has 3 2 entries and those at Level of Coordination and Clarity of Speeh have 3 3 entries. Hene this network requires the speiation of 2 + 6 + 9 + 9 = 26 probabilities. In the network of Figure 2.3, as there are 3 3 possible ongurations for the parents of Alohol 13 2.3. BAYESIAN BELIEF NETWORKS Consumption, the onditional probability table at this node has 3 9 entries. In total there are (3 1) + (3 3) + (3 9) + (2 3) = 45 probabilities to be speied, and so this network, formed on a dierent ordering, is learly less eÆient than the original network. However, as neither are omplete graphs, they are both more eÆeient than storing the joint probaility distribution. This would require storage of 2 33 1 = 53 values. A omplete Bayesian Network will have no storage advantages. The more independene relations that are enoded in the struture of the network, the more eÆient it beomes. Often relations of onditional independene an be fored to hold so that an eÆient approximation to the underlying probability distribution is obtained. Suh methods are presented in Setion 4.3 and Chapter 5. As well as failitating eÆient storage Bayesian Networks also have omputational advantages. For example if a marginal distribution is desired, depending on the ordering of the variables its retrieval may require far fewer operations than summing over all of the variables neessary to obtain the marginal distribution from the joint distribution. In Setion 3 we see how this allows for eÆient updating of the probability distribution when information about the state of a variable has been reeived. 2.3 Bayesian Belief Networks X1 X2 Xn−2 X1 X2 Xn−2 Xn−1 a) Xn−1 Xn b) Xn Figure 2.4: The network in a) is a Probabilisti Bayesian Network of a domain and the network in b) is a Bayesian Belief Network of the same domain. Consider the Probabilisti Bayesian Network in Figure 2.4 a) for the probability distribution P a (X1 ; X2 ; : : : ; Xn ), whih indiates that Xn 1 is independent of the remaining variables. Suppose in forming a Bayesian Network for P a we make Xn the hild of Xn 1 , as depited in Figure 2.4 b), to represent our (inorret) belief in a diret relationship between these two variables. Let Paa (Xn ) be the set fX1 ; : : : ; Xn 2 g and Pab (Xn ) the set Paa (Xn ) [ fXn 1 g. From Figure 2.4 a) we know I (Xn; fX1 ; : : : ; Xn 2 g; fX1 ; : : : ; Xn 1 gnfX1 ; : : : ; Xn 2 g); and so the boundary stratum for node Xn is Bn = fX1 ; : : : ; Xn 2 g. In Figure 2.4 b), sine 14 2.4. CAUSAL BAYESIAN NETWORKS Pa(Xn ) 6= Bn then this is not a minimal I-map and so by denition is not a Probabilisti Bayesian Network. However this network is still a Bayesian Network. If we know that Xn 1 is independent of the remaining variables, and dene the probabilities P (X1 ); : : : ; P (Xn 1 ) identially in a) and b), then n Y1 P b (X1 ; : : : ; Xn ) = P b (Xn jPab (Xn )) P (Xi jPab (Xi )) i=1 n Y1 = P b (Xn jPaa (Xn ); Xn 1 ) P (Xi ); i=1 as the nodes X1 ; : : : ; Xn 1 have no parents. Hene, as Xn is independent of Xn 1 in the true model, n Y1 P b (X1 : : : ; Xn ) = P b (Xn jPaa (Xn )) P (Xi ) i=1 = P a (X1 ; : : : ; Xn ): However, in b) the link Xn 1 ! Xn was reated beause of our belief in a diret relationship between the variables, hene the onditional probabilities we assign to Xn , P b (Xn jPab (Xn )), will not be equal to P a (Xn jPaa (Xn )) as assumed above. That is, the joint probability distributions will not be equal. 
The network formed by the introdution of Xn 1 ! Xn is an example of a Bayesian Belief Network for P . Bayesian Belief Networks are Bayesian Networks formed on a persons beliefs of ausal relationships and onditional independenies. When we form a Bayesian Belief Network we are trying to use our knowledge to rereate the underlying but unknown probability distribution as aurately as possible. The information we have about P is often organised to give an intuitive understanding of the major relationships between variables and the onstraints in the domain. The parents of a node Xi are hene those variables we believe have a diret inuene on Xi . Often the variables alloated as parents of Xi ould be onsidered to be auses of Xi , though in a Bayesian Belief Network, unlike a Causal Bayesian Network, a ausal interpretation is not neessary to alloate a variable as a parent of Xi . 2.4 Causal Bayesian Networks Causal Bayesian Networks are similar to Bayesian Belief Networks exept for the philosophy behind their onstrution. We form a Bayesian Belief Network based on our beliefs in an attempt to model the underlying probability distribution. When we form a Causal Bayesian Network, based on our notions of ausation, we are trying to repliate the human 15 2.4. CAUSAL BAYESIAN NETWORKS system of reasoning. This is ommonly believed to be based on a ausal struture [31℄. In a Causal Bayesian Network it is not important if the network represents the underlying distribution, so long as it is an aurate model of how we reason about the system. This means we an use the network as a model and tool for deision making and inferene, based on the reasoning of the expert who ompiled it. Networks formed in this manner have several useful features. Firstly, if a ausal relation is no longer believed to hold, we an represent this by simply removing the relevant link. Additionally, Causal Bayesian Networks allow one to model the eet of interventions. An intervention ours when a variable is set to a partiular state, for example a swith may be set to ON. This is fundamentally dierent to what we will refer to as an instantiated variable, whih is a variable whose state is known and is in that partiular state beuase of the inuene of other variables in the network. To model an intervention at a node we simply remove the links to that node from its parents and then treat the node as instantiated. Note that it is the partiular topology and ausal assumptions that result in these features. The ausal semantis of Bayesian Networks and interventions are overed extensively in [31℄. Summary In Setion 2.1 we disussed the representation of independene relations in a direted graph. We then dened a Bayesian Network as a model with two omponents - the graphial omponent (a direted graph) speies the independene relations whih hold between the variables in the domain, and the onditional probabilites quantify the relationships whih are present between the variables. Any model whih satises these two omponents is a Bayesian Network. In Setions 2.2, 2.3 and 2.4 we introdued three spei types of Bayesian Networks. These are eah Bayesian Networks whih have additional onstraints on either their onstrution or interpretation. 
A Probabilisti Bayesian Network has the added onstraint that the network must be a minimal I-map of the domain to be represented, a Bayesian Belief Network has only the onstraint that we belief the speied struture (and assoiated independene impliations) and onditional probabilities to be orret, and a Causal Bayesian Network has the onstraint that the parents of a node are its diret auses. Whih type of Bayesian Network is used depends on the spei modelling task at hand. 16 Chapter 3 Information Propagation Consider a Bayesian Network whih represents a probability distribution over the variables in the domain U. If we observe the present state of a variable this may give us information about the likely state of the other variables in the network. For example, onsider the network on two binary variables X and Y , as shown in Figure 3.1. If we observe that P(X) 0 1 0.5 0.5 X Y P(Y|X) 0 1 X=0 0.1 0.9 X=1 0.95 0.05 Figure 3.1: A Bayesian Network over the two variables X and Y . Y is in state 1, we an use this information to update our beliefs about X . In this ase, intuitively, we would expet it to be more likely that X is in state 0 and we would update our beliefs aordingly. In this hapter we look at how to systematially update our beliefs, or the probability distributions at eah node, given that we have reeived some information about the state of a variable or variables. In Setion 3.1 we intodue and justify our probabilisti approah to updating beliefs. In Setion 3.3 we introdue a belief updating mehanism for the partiular ase where the network is a tree, and then extend this to singly onneted networks and networks ontaining loops respetively. Finally, in Setion 3.4, we look at how the propagation sheme an be used to obtain answers to queries about the states of the variables in the network. 17 3.1. BELIEF DISTRIBUTIONS 3.1 Belief Distributions Given that we have a Bayesian Network and have observed the state of some variable, it would be possible to update our beliefs about the remaining variables based on our intuition. In this hapter we propose as an alternative a systemati method based on the rules of probability. Here we address why we should use a propagation mehanism to update our beliefs, instead of relying on the intuition whih allowed us to initialise the network, and justify our assumption that beliefs should obey the rules of probability. In the ontext of belief networks, after having initialised the network, that is having speied the links and the neessary onditional probabilities, we have a belief distribution at eah node. Hene our belief in X , BEL(X ) is speied by the values BEL(X = x) for P all states x of X dened suh that BEL(X = x) 2 [0; 1℄ for all x and x BEL(x) = 1. The Probabilisti Bayesian Network analogue is the marginal probability distribution of X . In the example in Figure 3.1, BEL(X ) = (0:5; 0:5) and BEL(Y ) = (0:525; 0:475). Note that to initialise these beliefs we are required only to give the onditional probability table at eah node and so make only loal assessments of the strength of relation between variables. In a large network we may reeive some information about the state of a variable, and are then required to update the beliefs of nodes whih may be far from where the infomation was observed. To do this subjetively requires us to onsider the relationships between variables over the entire network. 
The omplexity of this task for a reasonable sized network is well beyond what any mind is apable of proessing without making simpliations or resorting to stored generalisations. By using a systemati proedure to update our beliefs by using the information in the network, we are able to remove this highly subjetive task. Another advantage of propagating information in this manner is that a network an be ompiled with expert knowledge and a non-expert an then enter information and obtain a onlusion based on the expert knowledge embedded in the system. The question as to why we an treat beliefs in the same way as probabilities has been addressed extensively in many elds suh as Management, Psyhology and Deision Theory. Given that we assign numerial values to beliefs, on hoosing a sale from 0 to 1 it seems reasonable to assume BEL(X = x) = 0 if we are ertain X 6= x 1 if we are ertain X = x; though assigning values to intermediate degrees of belief is not straightfoward. In general, suggested methods usually involve some form of omparison between the unknown belief and one's belief in an event whih is known to our with a ertain probability. Several researhers have developed rules or axioms that they feel beliefs should follow and in eah 18 3.2. CONCEPTS AND NOTATION ase these rules imply that beliefs should follow the rules of probability [18℄. 3.2 Conepts and Notation When using a Belief Network in a deision senario we are generally interested in only a few variables. The variables of interest are referred to as hypothesis variables and their nodes as hypothesis nodes. We base our deisions on the belief distribution at these nodes. Our goal is to determine the belief distribution at the hypothesis nodes after having reieved some information or evidene e. Nodes at whih we reieve evidene are alled information or evidene nodes. Evidene will typially be in the form of an observation whih onrms the state of the variable. One evidene has been reeived the distribution of that node is xed and we say that the variable is instantiated. Evidene nodes are typially leaf nodes and hypothesis nodes are often root nodes. For example, onsider the network in Figure 1.1. If a parent observed their hild's low level of oordination on returning home late one night, then this observation an be entered as evidene. This an be done by instantiating the variable Level of Coordination suh that P (Level of Coordination je) = (1; 0; 0)T : In this hapter we fous on updating the belief distribution at some node X whih has r states. X has k hildren Y1 ; Y2 ; : : : ; Yk and m parents V1 ; V2 ; : : : ; Vm , so that Pa(X ) = fV1 ; V2; : : : ; Vm g. The universe U onsists of all variables in the network. The belief distribution at X is represented as an r vetor BEL(X ). The omputations will often be in the ontext of updating a single entry of this vetor, BEL(X = x), whih we will often abreviate to BEL(x). Similarly (x) and (x) are salars whilst (X ) and (X ) are vetors onsisting of the r values for (x) or (x) respetively. We now look at how we an update our belief distributions at the hypothesis nodes given the information we have reeived. 3.3 Belief Propagation The objet of belief propagation is to alulate for eah variable X , BEL(X ) P (X je); where e is the evidene available on whih we an update our beliefs. This is sometimes alled the posterior distribution of X , as it is the distribution at X after having taken the evidene into aount. 
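Before developing the propagation machinery it is worth seeing, in a small sketch, exactly what is being computed. For the two-variable network of Figure 3.1 the posterior $P(X \mid Y = 1)$ can be found by brute force: build the joint distribution from the conditional probability tables via the chain rule and renormalise the entries consistent with the evidence. The propagation scheme of the following sections produces the same answer without ever constructing the joint, which is what makes it feasible for large networks. The code below is illustrative only; the numbers are those of Figure 3.1.

```python
# Conditional probability tables from Figure 3.1 (both variables binary).
P_X = [0.5, 0.5]
P_Y_given_X = [[0.10, 0.90],    # row for X = 0
               [0.95, 0.05]]    # row for X = 1

# Joint distribution via the chain rule: P(x, y) = P(x) P(y | x).
joint = {(x, y): P_X[x] * P_Y_given_X[x][y] for x in (0, 1) for y in (0, 1)}

def posterior_X(joint, y_observed):
    """P(X | Y = y_observed): keep the joint entries consistent with the
    evidence and renormalise them."""
    weights = [joint[(x, y_observed)] for x in (0, 1)]
    total = sum(weights)
    return [w / total for w in weights]

print(posterior_X(joint, 1))   # approximately [0.947, 0.053]
```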
To facilitate information propagation throughout the network we break down the task into a series of local propagations. That is, we use an iterative procedure where at each stage a node combines the information available from its neighbours with information from the previous step and then sends this updated information back to its neighbours. Under certain conditions, which are discussed later, this procedure will converge and the distribution at each node will be equal to the posterior distribution $P(X \mid e)$.

The evidence $X$ receives is split into two disjoint subsets, $e^+_X$ - the evidence $X$ receives from its parents, and $e^-_X$ - the evidence $X$ receives from its children. In networks which don't contain loops we can take $e^-_X$ and $e^+_X$ to be independent. These can be split further into $e^-_X(Y_j)$ and $e^+_X(V_l)$ to represent evidence received at $X$ from child $Y_j$ and parent $V_l$ respectively. Note that we are never required to quantify these terms explicitly, but they are used notationally to indicate the source of the information being used in a computation. This allows us to keep track of what information has already been taken account of in updating a belief, so that no information is counted more than once.

If $X$ is a root node then, as $X$ has no parents, we initialise $P(X \mid e^+_X)$ as $P(X)$. If $X$ is a leaf node and an evidence node we initialise $P(X \mid e^-_X)$ to reflect the evidence, for example if $X$ has the states $x_1$ and $x_2$ and we know that $X = x_2$, then $P(X \mid e^-_X) = (0, 1)^T$. If $X$ is a leaf node and is not instantiated then we let $P(X \mid e^-_X)$ assign equal probability to each state of $X$. This reflects the fact that we have no evidence to suggest that $X$ is more likely to be in one state than any other.

Recall that a node $X$ is considered a cause or influence of its children and that this influence in general tends to follow the direction of the arrows. Thus

\[
\pi(x) = P(X = x \mid e^+_X)
\]

is sometimes called the predictive support for $X$, as the information obtained from the parents of $X$ will have a large influence on our belief in the state of $X$.

\[
\lambda(x) = P(e^-_X \mid X = x)
\]

is a measure of the retrospective support for $X = x$, that is the probability that $X$ would be receiving such information from its children given that $X$ is in state $x$. If this is very large then intuitively we would increase our belief in $X = x$.

The local messages that are sent between neighbouring variables are denoted $\lambda_X(v_l)$ and $\pi_{Y_j}(x)$.

\[
\lambda_X(v_l) = P(e_{V_l}(X) \mid V_l = v_l)
\]

is the information sent from $X$ to a parent $V_l$, and

\[
\pi_{Y_j}(x) = P(X = x \mid e^+_{Y_j})
\]

is the message sent from node $X$ to child $Y_j$. The messages that are sent between neighbours at each iteration are shown in Figure 3.2.

Figure 3.2: A section of a Bayesian Network showing the messages that node $X$ receives from its neighbours and the messages that $X$ sends out to its neighbours at each iteration.

In general, we have that

\begin{align}
\mathrm{BEL}(x) &= P(x \mid e^+_X, e^-_X) \nonumber \\
&= \frac{P(x \mid e^+_X) \, P(e^-_X \mid e^+_X, x)}{P(e^-_X \mid e^+_X)}, \quad \text{by Bayes' rule} \nonumber \\
&= \frac{P(x \mid e^+_X) \, P(e^-_X \mid x)}{P(e^-_X)}, \quad \text{as } e^-_X \text{ and } e^+_X \text{ are independent given } x \nonumber \\
&= \alpha \, \pi(x) \lambda(x), \tag{3.1}
\end{align}

where $\alpha$ is a normalisation constant. This gives us an expression for updating our belief given that we know $\pi(x)$ and $\lambda(x)$.

3.3.1 Propagation in Trees

In a tree, a node $X$ can have at most one parent, $V$. To see how $\lambda(x)$ can be obtained, first suppose $X$ has children $Y_1$ and $Y_2$.
Then the support for $X = x$ given the information available from its children is

\begin{align*}
\lambda(x) &= P(e^-_X \mid X = x) \\
&= P(e^-_X(Y_1), e^-_X(Y_2) \mid x) \\
&= P(e^-_X(Y_1) \mid x) \, P(e^-_X(Y_2) \mid x),
\end{align*}

as, in a tree, the evidence received from the subtree rooted at $Y_1$ is independent of the evidence received from the subtree rooted at $Y_2$. In general, given that

\[
\lambda_{Y_j}(x) = P(e^-_X(Y_j) \mid X = x),
\]

independence of the subtrees rooted at the children of $X$ implies that

\[
\lambda(x) = \prod_{j=1}^{k} \lambda_{Y_j}(x), \tag{3.2}
\]

where $X$ has children $Y_1, Y_2, \ldots, Y_k$. Hence we can calculate $\lambda(x)$ at $X$ given that we have received the messages $\lambda_{Y_j}(x)$ from each child $Y_j$ of $X$.

To calculate $\mathrm{BEL}(x)$ we also need $\pi(x)$. Consider the following expansion:

\begin{align}
\pi(x) &= P(x \mid e^+_X) \nonumber \\
&= \sum_v P(x \mid e^+_X, v) \, P(v \mid e^+_X) \quad \text{(conditioning on } V \text{)} \nonumber \\
&= \sum_v P(x \mid v) \, P(v \mid e^+_X) \tag{3.3} \\
&= \sum_v P(x \mid v) \, \pi_X(v), \tag{3.4}
\end{align}

where (3.3) follows from the fact that, in a tree, all the information $X$ receives from its ancestors is contained in $V$, as $V$ d-separates $X$ from all other ancestors of $X$. The probabilities $P(x \mid v)$ can be obtained from the conditional probability table for node $X$. If we let the entries of this table correspond to the entries in the matrix $M_{X|V}$ say, so that $[M_{X|V}]_{ij} = P(X = x_j \mid V = v_i)$, then we can write

\[
\pi(X) = M_{X|V}^T \, \pi_X(V). \tag{3.5}
\]

We are now able to calculate $\mathrm{BEL}(X)$ by (3.1), given that we have received the relevant messages from the neighbours of $X$. In order to facilitate information propagation, $X$ must combine these messages with its current information to send out updated messages to its neighbours. $X$ will send the message $\lambda_X(V)$ to its parent $V$ and $\pi_{Y_j}(X)$ to each child $Y_j$. The information $X$ sends to its parent is

\begin{align*}
\lambda_X(v) &= P(e_V(X) \mid V = v) \\
&= \sum_x P(x, e_V(X) \mid v) \\
&= \sum_x P(e_V(X) \mid v, x) \, P(x \mid v) \\
&= \sum_x P(e^-_X \mid v, x) \, P(x \mid v),
\end{align*}

as, because we are considering propagation in a tree structure, the evidence that $V$ receives from $X$ must have come from the children of $X$, assuming that, as evidence nodes can only be leaf nodes, no evidence is observed at $X$ itself. Then, as $X$ d-separates its descendants from its parent $V$,

\begin{align}
\lambda_X(v) &= \sum_x P(e^-_X \mid x) \, P(x \mid v) \nonumber \\
&= \sum_x \lambda(x) \, P(x \mid v), \tag{3.6}
\end{align}

which we can write as

\[
\lambda_X(V) = M_{X|V} \, \lambda(X).
\]

The message $X$ must send to child $Y_j$ should include the evidence $X$ has obtained from its parent as well as from its other children. If we let $e_X(Y^{(j)})$ denote the combined evidence received at $X$ from all children other than $Y_j$, we can then write

\begin{align*}
\pi_{Y_j}(x) &= P(x \mid e^+_{Y_j}) \\
&= P(x \mid e^+_X, e_X(Y^{(j)})) \\
&= \beta \, P(e_X(Y^{(j)}) \mid x, e^+_X) \, P(x \mid e^+_X), \quad \text{by Bayes' Rule} \\
&= \beta \, P(e_X(Y^{(j)}) \mid x) \, P(x \mid e^+_X), \quad \text{as } X \text{ d-separates } Y^{(j)} \text{ from the ancestors of } X \\
&= \beta \, \lambda_{Y^{(j)}}(x) \, \pi(x),
\end{align*}

where $\beta$ is a normalising constant. As the evidence $X$ receives from each child is independent,

\begin{align*}
\lambda_{Y^{(j)}}(x) &= P(e_X(Y^{(j)}) \mid x) \\
&= \prod_{l=1,\, l \neq j}^{k} P(e^-_X(Y_l) \mid x) \\
&= \frac{\lambda(x)}{\lambda_{Y_j}(x)}, \quad \text{from (3.2)}.
\end{align*}

Hence

\[
\pi_{Y_j}(x) = \beta \, \frac{\lambda(x) \, \pi(x)}{\lambda_{Y_j}(x)} = \beta' \, \frac{\mathrm{BEL}(x)}{\lambda_{Y_j}(x)}. \tag{3.7}
\]

To summarise, for every iteration of the belief propagation mechanism, each node performs the following three tasks:

1. Node $X$ uses the messages received from its neighbours to calculate $\lambda(X)$ and $\pi(X)$, and then updates its belief distribution, $\mathrm{BEL}(X)$.

2. $X$ uses the information obtained from its children to send an updated message to its parent.

3. $X$ uses the information obtained from its parent and children other than $Y_j$ to send a message to child $Y_j$, for $j = 1, \ldots, k$.

To carry out this procedure at node $X$, the only information required is $\pi_X(V)$, $\lambda_{Y_j}(x)$ (for all $j$) and the fixed matrix of conditional probabilities, $M_{X|V}$.
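As a small illustration of these local computations (a sketch, not part of the thesis), the following fragment applies equations (3.6) and (3.1) to the two-node network of Figure 3.1, with $Y$ instantiated to state 1; the parent $X$ plays the role of $V$ in (3.6). In a larger tree the same three tasks would simply be repeated at every node until the messages stabilise.

```python
# Figure 3.1 again: root X with prior P(X) and a single child Y, where
# M[x][y] = P(Y = y | X = x).  Y is instantiated to state 1.
pi_X = [0.5, 0.5]                 # pi(X) = P(X) at a root node
M_Y_given_X = [[0.10, 0.90],
               [0.95, 0.05]]
lam_Y = [0.0, 1.0]                # lambda(Y) for the instantiated leaf Y

# Equation (3.6): the message Y sends its parent, lambda_Y(x) = sum_y lambda(y) P(y | x).
lam_msg = [sum(lam_Y[y] * M_Y_given_X[x][y] for y in (0, 1)) for x in (0, 1)]

# Equation (3.1): BEL(x) is proportional to pi(x) * lambda(x); here lambda(x) equals
# lam_msg[x] because Y is the only child of X.
bel_X = [pi_X[x] * lam_msg[x] for x in (0, 1)]
total = sum(bel_X)
bel_X = [b / total for b in bel_X]
print(bel_X)    # approximately [0.947, 0.053], matching the earlier brute-force result
```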
For trees, these rules guarantee that equilibrium is reached in a time proportional to the longest path in the network and that, at equilibrium, each node has a belief distribution equal to its posterior distribution given all the available evidence [30]. We now demonstrate the procedure with a simple example.

Suppose there are two candidates standing for election, one from each of the two major parties. They have very different defence policies, and whether a citizen is for or against the defence policy of a candidate is believed to be an indicator of whom they will vote for. This can be modelled by the Bayesian Network shown in Figure 3.3 a). Here the variable Defence Policy ($D$) has states 1 and 2, indicating a preference for candidate 1's or candidate 2's defence policy respectively. Vote ($V$) also has states 1 and 2, indicating for whom a citizen will vote. Based on knowledge obtained from pre-election polls, we have $P(D) = (0.3, 0.7)^T$ and the conditional probability table

P(V|D):   V=1    V=2
D=1       0.95   0.05
D=2       0.15   0.85

so that
$$M_{V|D} = \begin{pmatrix} 0.95 & 0.05 \\ 0.15 & 0.85 \end{pmatrix}.$$
As $D$ is a root node and $V$ an uninstantiated leaf node, we initialise $\pi(D) = (0.3, 0.7)^T$ and $\lambda(V) = (1, 1)^T = \lambda_V(D)$. It is not important that the entries of $\pi(\cdot)$ and $\lambda(\cdot)$ sum to 1, as the appropriate scaling is done when normalising $\mathrm{BEL}(\cdot)$.

Figure 3.3: A Bayesian Network model to predict a citizen's vote in an election: a) Defence Policy → Vote; b) with the additional child Rally of Defence Policy; c) with the additional parent Party Member of Vote.

We first update the belief distributions at the nodes:
$$\mathrm{BEL}(d) = \alpha\, \lambda(d)\, \pi(d) = \pi(d), \quad d = 1, 2 \quad \Longrightarrow \quad \mathrm{BEL}(D) = (0.3, 0.7)^T.$$
To calculate $\mathrm{BEL}(V)$ we need
$$\pi(v) = \sum_d P(v \mid d)\, \pi_V(d) = \sum_d P(v \mid d)\, \pi(d), \quad \text{from (3.4)},$$
that is, $\pi(V) = M_{V|D}^T\, \pi(D) = (0.39, 0.61)^T$. Hence
$$\mathrm{BEL}(v) = \alpha\, \lambda(v)\, \pi(v), \quad v = 1, 2 \quad \Longrightarrow \quad \mathrm{BEL}(V) = \pi(V) = (0.39, 0.61)^T.$$

This gives the belief distributions based on the available information. Suppose now that a rally was held to allow people to demonstrate their opposition to the defence policy of candidate 1. The variable Rally ($R$) can be added to our model as in Figure 3.3 b), with states 0 and 1 indicating absence from or presence at the rally respectively. Because the state of $D$ is not known, knowing whether a citizen attended the rally gives us some information about whom they may vote for.

Consider the case of citizen 1, who was suspected to have attended the rally. We therefore instantiate Rally to $(0.2, 0.8)^T$, so that $\lambda(R) = (0.2, 0.8)^T$, as we are 80% certain that the citizen attended the rally. We have the conditional probability table

P(R|D):   R=0   R=1
D=1       1     0
D=2       0.8   0.2

which represents the belief that only 20% of the people supporting candidate 2's defence policy will rally, and that no-one who supports candidate 1's policy will rally against it.

The message $R$ sends to $D$ is
$$\lambda_R(d) = \sum_r \lambda(r)\, P(r \mid d), \quad d = 1, 2 \quad \Longrightarrow \quad \lambda_R(D) = M_{R|D}\, \lambda(R) = (0.2, 0.32)^T.$$
The evidence $D$ receives from its two children is then combined to give
$$\lambda(d) = \lambda_R(d)\, \lambda_V(d), \quad d = 1, 2 \quad \Longrightarrow \quad \lambda(D) = (0.2, 0.32)^T,$$
and hence
$$\mathrm{BEL}(d) = \alpha\, \pi(d)\, \lambda(d), \quad d = 1, 2 \quad \Longrightarrow \quad \mathrm{BEL}(D) = \alpha\,(0.06, 0.224)^T = (0.211, 0.789)^T.$$

$D$ must now send out a message to $V$, which is, from (3.7),
$$\pi_V(d) = \alpha\, \frac{\mathrm{BEL}(d)}{\lambda_V(d)} = \alpha\, \pi(d)\, \lambda_R(d), \qquad (3.8)$$
$$\Longrightarrow \quad \pi_V(D) = \alpha\,(0.211, 0.789)^T. \qquad (3.9)$$
Notice that the form of (3.8) ensures that the information $D$ received from $V$ is not included in the message sent back from $D$ to $V$. Now
$$\pi(V) = M_{V|D}^T\, \pi_V(D) = (0.319, 0.681)^T, \quad \text{from (3.5)}.$$
Because $V$ is a leaf node, $\lambda(V) = P(e^-_V \mid V)$ does not need to be updated and remains as initialised.
Hence
$$\mathrm{BEL}(V) = \alpha\, \lambda(V)\, \pi(V) = \alpha\, \pi(V) = (0.319, 0.681)^T.$$
After taking into account the information about the likelihood of citizen 1's presence at the rally, we have increased our belief that they will vote for candidate 2; we now believe it is more than twice as likely that they will vote for candidate 2 than for candidate 1.

3.3.2 Propagation in Polytrees

Recall that a polytree is a singly connected network: a node may have several parents and children, but the network may not contain loops. Information propagation in polytrees is similar to that in trees except that, as a node $X$ may have more than one parent, the information obtained from the parents of $X$ must be combined in order to update the belief distribution. To send out a message to each parent, $X$ must combine the information introduced from its children and from its other parents. In a polytree no two parents of a given node have a common ancestor, which implies that the information introduced from one parent is independent of the information obtained from any other parent.

To update $\mathrm{BEL}(X)$ we need $\pi(X)$ and $\lambda(X)$. We can write
$$\pi(x) = P(x \mid e^+_X) = P(x \mid e^+_X(V_1), \ldots, e^+_X(V_m)),$$
where $e^+_X(V_l)$ is the evidence $X$ obtains from the link with parent $V_l$. Conditioning on $\mathbf{V} = \{V_1, \ldots, V_m\}$ and using the fact that $P(x \mid v_1, \ldots, v_m, e^+_X) = P(x \mid v_1, \ldots, v_m)$ then gives
$$
\begin{aligned}
\pi(x) &= \sum_{v_1, \ldots, v_m} P(x \mid v_1, \ldots, v_m)\, P(v_1, \ldots, v_m \mid e^+_X(V_1), \ldots, e^+_X(V_m)) \\
&= \sum_{v_1, \ldots, v_m} P(x \mid v_1, \ldots, v_m) \prod_{l=1}^m P(v_l \mid e^+_X(V_1), \ldots, e^+_X(V_m)) \\
&= \sum_{v_1, \ldots, v_m} P(x \mid v_1, \ldots, v_m) \prod_{l=1}^m P(v_l \mid e^+_X(V_l)). \qquad (3.10)
\end{aligned}
$$
The final line follows because we do not know the state of $X$ and, as the connection at $X$ from its parents is converging, the state of one parent is independent of the states of the other parents. If we let $\pi_X(v_l) = P(v_l \mid e^+_X(V_l))$ and $\mathbf{v} = (v_1, v_2, \ldots, v_m)$, then we can write (3.10) as
$$\pi(x) = \sum_{\mathbf{v}} P(x \mid \mathbf{v}) \prod_{l=1}^m \pi_X(v_l). \qquad (3.11)$$
Note that each evidence term $\prod_{l=1}^m \pi_X(v_l)$, for some configuration $\mathbf{v}$, is weighted by $P(x \mid \mathbf{v})$, the probability that $X = x$ given that configuration of its parents. As $\lambda(x)$ is independent of information from the parents of $X$, it remains as for trees, that is,
$$\lambda(x) = \prod_{j=1}^k \lambda_{Y_j}(x).$$
Hence
$$\mathrm{BEL}(x) = \alpha\, \lambda(x)\, \pi(x) = \alpha \prod_{j=1}^k \lambda_{Y_j}(x) \sum_{\mathbf{v}} P(x \mid \mathbf{v}) \prod_{l=1}^m \pi_X(v_l). \qquad (3.12)$$

To calculate the messages $X$ sends to its neighbours, the evidence received from all neighbours is first combined and then redistributed in proportion to the evidential weight on each link. To calculate $\lambda_X(v_l)$, the message $X$ sends to its $l$th parent, consider all other parents as a single information source $\mathbf{V}^{(l)} = \mathbf{V} \setminus \{V_l\}$. The evidence $e^-_{V_l}(X)$ that $V_l$ receives from $X$ is based on the evidence received at $X$ from $\mathbf{V}^{(l)}$,
$$e^+_X(\mathbf{V}^{(l)}) = \bigcup_{k \neq l} e^+_X(V_k),$$
and on the evidence $X$ has received from its children, $e^-_X$. Note that $e^+_X(\mathbf{V}^{(l)})$ is independent of $e^-_X$. Hence the support for $V_l$ given the evidence at $X$ is
$$\lambda_X(v_l) = P(e^+_X(\mathbf{V}^{(l)}), e^-_X \mid v_l) = \sum_x \sum_{\mathbf{v}^{(l)}} P(e^+_X(\mathbf{V}^{(l)}), e^-_X \mid v_l, \mathbf{v}^{(l)}, x)\, P(\mathbf{v}^{(l)}, x \mid v_l),$$
where the double summation is over all states $x$ of $X$ and all configurations $\mathbf{v}^{(l)}$ of the variables in $\mathbf{V}^{(l)}$. As $e^+_X(\mathbf{V}^{(l)})$ and $e^-_X$ are independent, and $X$ d-separates $\mathbf{V}^{(l)}$ and $V_l$ from its children,
$$
\begin{aligned}
\lambda_X(v_l) &= \sum_x \sum_{\mathbf{v}^{(l)}} P(e^+_X(\mathbf{V}^{(l)}) \mid \mathbf{v}^{(l)})\, P(e^-_X \mid x)\, P(\mathbf{v}^{(l)}, x \mid v_l) \\
&= \sum_x \sum_{\mathbf{v}^{(l)}} \frac{P(\mathbf{v}^{(l)} \mid e^+_X(\mathbf{V}^{(l)}))}{P(\mathbf{v}^{(l)})}\, P(e^+_X(\mathbf{V}^{(l)}))\, P(e^-_X \mid x)\, P(x \mid \mathbf{v}^{(l)}, v_l)\, P(\mathbf{v}^{(l)} \mid v_l),
\end{aligned}
$$
by Bayes' rule. Given that $\mathbf{V}^{(l)}$ is independent of $V_l$, we have $P(\mathbf{v}^{(l)} \mid v_l) = P(\mathbf{v}^{(l)})$.
As $\mathbf{v} = \mathbf{v}^{(l)} \cup \{v_l\}$,
$$\lambda_X(v_l) = \beta \sum_x \sum_{\mathbf{v}} P(\mathbf{v}^{(l)} \mid e^+_X(\mathbf{V}^{(l)}))\, P(e^-_X \mid x)\, P(x \mid \mathbf{v})\, I(V_l[\mathbf{v}] = v_l),$$
where $\beta = P(e^+_X(\mathbf{V}^{(l)}))$ is a normalisation constant and $I(\cdot)$ is the indicator function. Substituting $\lambda(x)$ for $P(e^-_X \mid x)$ and $\pi_X(v_k)$ for $P(v_k \mid e^+_X(V_k))$ we obtain
$$\lambda_X(v_l) = \beta \sum_x \sum_{\mathbf{v}} \Big\{ \prod_{k \neq l} \pi_X(v_k) \Big\}\, \lambda(x)\, P(x \mid \mathbf{v})\, I(V_l[\mathbf{v}] = v_l).$$

When $X$ sends a message to child $Y_j$, this should be based on all information obtained at the previous iteration besides that received from $Y_j$. Thus the predictive support child $Y_j$ receives from its parent, $\pi_{Y_j}(x)$, is equal to $P(x \mid e_X \setminus e^-_X(Y_j))$, where $e_X \setminus e^-_X(Y_j)$ is the evidence $X$ has received from each of its parents and children excluding $Y_j$. As $\mathrm{BEL}(X)$ is the belief at $X$ after evidence from all neighbours has been considered, and the evidence received from each of the children of $X$ is independent, we can write
$$\pi_{Y_j}(x) = \alpha\, \frac{\mathrm{BEL}(x)}{\lambda_{Y_j}(x)} = \alpha\, \mathrm{BEL}(x)\big|_{\lambda_{Y_j}(x) = 1}.$$

To demonstrate this procedure, consider again the example from Section 3.3.1. Another factor influencing whom one will vote for is a citizen's commitment to a particular party. One way to model this factor is to add the variable Party Member ($M$) as a parent of Vote, as in Figure 3.3 c). This variable has states 0, 1 and 2, indicating no party membership, or membership of the party of candidate 1 or 2 respectively. As $V$ now has two parents, this model is no longer a tree, though it is a polytree. We have the conditional probability table

P(V|D,M):    V=1   V=2
D=1, M=0     0.8   0.2
D=1, M=1     1     0
D=1, M=2     0.3   0.7
D=2, M=0     0.1   0.9
D=2, M=1     0.6   0.4
D=2, M=2     0     1

Suppose now that citizen 1 is a member of party 2. We therefore instantiate $M$ to 2 and wish to update our belief about whom they will vote for. We have $\pi_V(M) = \pi(M) = (0, 0, 1)^T$ and $\pi_V(D) = (0.211, 0.789)^T$, from (3.9). Hence, from (3.11),
$$\pi(v) = \sum_{d,m} P(v \mid d, m)\, \pi_V(m)\, \pi_V(d) \quad \Longrightarrow \quad \pi(V) = M_{V|D,M}^T\, \pi_{M,D},$$
where $\pi_{M,D}$ is the $6 \times 1$ vector with entries $\pi_V(m)\, \pi_V(d)$ for the states $m = 0, 1, 2$ and $d = 1, 2$. That is,
$$\pi(V) = (0.063, 0.937)^T.$$
Hence
$$\mathrm{BEL}(V) = \alpha\, \lambda(V)\, \pi(V) = (0.063, 0.937)^T.$$
We are now very certain that citizen 1 will vote for candidate 2. Note that, because of the converging connection at Vote, the information introduced at Party Member has no effect on the distributions of Defence Policy or Rally while the state of Vote is unknown.

In summary, to update $\mathrm{BEL}(X)$ we require $\pi_X(V_l)$, $l = 1, \ldots, m$, $\lambda_{Y_j}(x)$, $j = 1, \ldots, k$, and the conditional probabilities $P(X \mid V_1, \ldots, V_m)$. $X$ then calculates and sends a message to each child and parent. Again, use of this procedure guarantees convergence in a time proportional to the longest path in the network. However, in polytrees the presence of multiple parents adds a further degree of complexity to the necessary calculations. The summation required in (3.12) is over all combinations of states of the parent variables; if the number of parents is large, or they have many states, this summation can become intractable. In the next section we present a model which, under several simplifying assumptions, results in a closed-form expression for $\mathrm{BEL}(X)$, which we derive. A small numerical sketch of the polytree update (3.11) and (3.12) is given below.
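The following is a minimal sketch of the polytree update (3.11)-(3.12) at the node Vote, using the message values and conditional probability table quoted above; the array-based representation is illustrative only.

```python
import numpy as np
from itertools import product

# A minimal sketch of (3.11)-(3.12) for node Vote in the Party Member example.
# pi_D and pi_M are the pi messages Vote receives from its two parents;
# cpt[d, m] holds P(V | D=d, M=m).

pi_D = np.array([0.211, 0.789])        # pi_V(D), from (3.9)
pi_M = np.array([0.0, 0.0, 1.0])       # pi_V(M): citizen 1 is a member of party 2
lam_V = np.array([1.0, 1.0])           # V is an uninstantiated leaf node

cpt = np.zeros((2, 3, 2))              # P(V | D, M), rows of the table above
cpt[0, 0] = [0.8, 0.2]; cpt[0, 1] = [1.0, 0.0]; cpt[0, 2] = [0.3, 0.7]
cpt[1, 0] = [0.1, 0.9]; cpt[1, 1] = [0.6, 0.4]; cpt[1, 2] = [0.0, 1.0]

# (3.11): pi(v) = sum over parent configurations of P(v|d,m) * pi(d) * pi(m)
pi_V = np.zeros(2)
for d, m in product(range(2), range(3)):
    pi_V += cpt[d, m] * pi_D[d] * pi_M[m]

# (3.12): BEL(v) = alpha * lambda(v) * pi(v)
bel_V = lam_V * pi_V
bel_V /= bel_V.sum()
print(pi_V, bel_V)    # approximately (0.063, 0.937), as in the text
```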
3.3.2.1 The noisy OR-gate model for polytrees

To simplify the exposition of the noisy OR-gate model, assume that the parents of a variable are its direct causes. We also assume that each variable, or cause, has two states: either the cause is present or it is absent. The model rests on two assumptions.

The first assumption is that an event will not occur if none of its causes is present; if a cause is present, the event may or may not occur. For example, a cause of Broken Window may be Hit by Ball. If this were the only parent of Broken Window, we would be assuming that, were the window to break, it would be because it was hit by a ball; if the window was hit by a ball it may or may not break. In general we refer to the factors that may prevent an event such as breakage from occurring, given that a cause is present, as inhibiting factors. The second assumption of the model is that the inhibiting factors of each cause are independent of the inhibiting factors of the other causes.

We can thus think of a variable $X$ as having several parents, or causes; event $X$ occurs only if there is at least one cause present whose inhibitor is not acting. This is shown schematically in Figure 3.4. If we let $q_l$ be the probability that the $l$th inhibitor is active, and denote by 1 and 0 the states present and absent respectively, then
$$P(X = 1 \mid V_l = 1,\ V_k = 0 \ \forall\, k \neq l) = 1 - q_l.$$
Note that if $q_l = 0$ then the noise component is removed, and the event will always occur if that cause is acting.

Figure 3.4: A graphical representation of the noisy OR-gate: each cause $V_l$ is combined with its inhibitor $I_l$ by an AND, and the results feed an OR. Only if a cause is present and its inhibitor is not acting will the event $X$ occur.

Let $\Theta(\mathrm{pa}(X))$ index the set of all causes of $X$ which are present when the parents of $X$ are in configuration $\mathrm{pa}(X)$; that is, $\Theta(\mathrm{pa}(X)) = \{k : v_k = 1,\ v_k \in \mathrm{pa}(X)\}$. Then $X = 0$ only if the inhibitors of $V_k$, for all $k \in \Theta(\mathrm{pa}(X))$, are acting. As, by the second assumption of the model, the inhibitors are independent,
$$P(X = 0 \mid \mathrm{pa}(X)) = \prod_{l \in \Theta(\mathrm{pa}(X))} q_l \qquad \text{and} \qquad P(X = 1 \mid \mathrm{pa}(X)) = 1 - \prod_{l \in \Theta(\mathrm{pa}(X))} q_l.$$
Note that these expressions define the conditional probability table for node $X$, which is subject to the constraints of the model.

We now use these assumptions to derive an expression for $\mathrm{BEL}(X)$. Recall, from (3.1) and (3.11), that
$$\mathrm{BEL}(X = x) = \alpha\, \lambda(x)\, \pi(x) = \alpha\, \lambda(x) \sum_{\mathrm{pa}(X)} P(X = x \mid \mathrm{pa}(X)) \prod_l \pi_X(v_l), \qquad (3.13)$$
where $\sum_{\mathrm{pa}(X)}$ denotes summation over all possible configurations of the parents of $X$, and $\pi_X(v_l) = P(v_l \mid e^+_X)$. If we denote the final part of this expression by
$$K^x_{\mathrm{Pa}(X)} = \sum_{\mathrm{pa}(X)} P(X = x \mid \mathrm{pa}(X)) \prod_{l=1}^m \pi_X(v_l),$$
then
$$
\begin{aligned}
K^0_{\mathrm{Pa}(X)} &= \sum_{\mathrm{pa}(X)} \Big\{ \prod_{l \in \Theta(\mathrm{pa}(X))} q_l \Big\} \prod_{l=1}^m \pi_X(v_l) \\
&= \sum_{\mathrm{pa}(X)} \Big\{ \prod_{l \in \Theta(\mathrm{pa}(X))} q_l\, \pi_X(v_l) \Big\} \prod_{l \notin \Theta(\mathrm{pa}(X))} \pi_X(v_l).
\end{aligned}
$$
Summing separately over the states $V_i = 0$ and $V_i = 1$ of an element $V_i$ of $\mathrm{Pa}(X)$ gives
$$
\begin{aligned}
K^0_{\mathrm{Pa}(X)} &= \sum_{\substack{\mathrm{pa}(X):\, V_i = 0}} \pi_X(V_i = 0) \Big\{ \prod_{l \in \Theta(\mathrm{pa}(X))} q_l\, \pi_X(v_l) \Big\} \prod_{\substack{l \notin \Theta(\mathrm{pa}(X)),\ l \neq i}} \pi_X(v_l) \\
&\quad + \sum_{\substack{\mathrm{pa}(X):\, V_i = 1}} q_i\, \pi_X(V_i = 1) \Big\{ \prod_{\substack{l \in \Theta(\mathrm{pa}(X)),\ l \neq i}} q_l\, \pi_X(v_l) \Big\} \prod_{l \notin \Theta(\mathrm{pa}(X))} \pi_X(v_l).
\end{aligned}
$$
Note that if $V_i = 0$ then $i \notin \Theta(\mathrm{pa}(X))$, and if $V_i = 1$ then $i \in \Theta(\mathrm{pa}(X))$. Denoting the set $\mathrm{Pa}(X) \setminus \{V_i\}$ by $\mathrm{Pa}_i(X)$, we can hence write
$$K^0_{\mathrm{Pa}(X)} = \big( \pi_X(V_i = 0) + q_i\, \pi_X(V_i = 1) \big) \sum_{\mathrm{pa}_i(X)} \Big\{ \prod_{l \in \Theta(\mathrm{pa}_i(X))} q_l\, \pi_X(v_l) \Big\} \prod_{l \notin \Theta(\mathrm{pa}_i(X))} \pi_X(v_l) = \big( \pi_X(V_i = 0) + q_i\, \pi_X(V_i = 1) \big)\, K^0_{\mathrm{Pa}_i(X)}.$$
Applying this recursively, and substituting $1 - \pi_X(V_l = 1)$ for $\pi_X(V_l = 0) = P(V_l = 0 \mid e^+_X)$, leads to
$$K^0_{\mathrm{Pa}(X)} = \prod_{l=1}^m \big[ \pi_X(V_l = 0) + q_l\, \pi_X(V_l = 1) \big] = \prod_{l=1}^m \big[ 1 + (q_l - 1)\, \pi_X(V_l = 1) \big] \qquad (3.14)$$
$$= \prod_{l=1}^m \big[ 1 - \pi_X(V_l = 1)(1 - q_l) \big]. \qquad (3.15)$$
To find an expression for $K^1_{\mathrm{Pa}(X)}$, note that
$$
\begin{aligned}
K^1_{\mathrm{Pa}(X)} &= \sum_{\mathrm{pa}(X)} \Big( 1 - \prod_{l \in \Theta(\mathrm{pa}(X))} q_l \Big) \prod_{l=1}^m \pi_X(v_l) \\
&= \sum_{\mathrm{pa}(X)} \prod_{l=1}^m \pi_X(v_l) \; - \; K^0_{\mathrm{Pa}(X)} \\
&= \sum_{\mathrm{pa}(X)} P(\mathrm{pa}(X) \mid e^+_X) \; - \; K^0_{\mathrm{Pa}(X)} \\
&= 1 - K^0_{\mathrm{Pa}(X)}. \qquad (3.16)
\end{aligned}
$$
Hence, on substitution of (3.15) and (3.16) into (3.13), with $\lambda(X) = (\lambda(0), \lambda(1))^T$, we obtain
$$\mathrm{BEL}(X = 0) = \alpha\, \lambda(0) \prod_{l=1}^m \big[ 1 - \pi_X(V_l = 1)(1 - q_l) \big], \qquad \mathrm{BEL}(X = 1) = \alpha\, \lambda(1) \Big\{ 1 - \prod_{l=1}^m \big[ 1 - \pi_X(V_l = 1)(1 - q_l) \big] \Big\}.$$
The evaluation of this expression involves calculating a product which grows only linearly with the number of parents of a node, and hence avoids the need to sum over all configurations of the parents as in (3.12). However, the assumptions of the model, most significantly that the inhibitors of the causes of an event are independent, are quite restrictive. The effect this may have on the belief distributions should be weighed against the computational advantages.

3.3.3 Networks containing Loops

In a tree, the subtrees rooted at a node are independent. Additionally, in a polytree the information received from a parent is independent of the information received from any other parent. These independence relations allow messages to be sent out, and received messages combined, in such a way that no evidence is counted twice. If the message propagation scheme used for trees and polytrees were used in a network containing loops, the independence assumptions would no longer be valid, and messages could circulate indefinitely around the loops and never converge to a stable equilibrium.

There are two methods commonly used for message propagation in multiply connected networks: clustering and conditioning. Each of these methods, discussed in Sections 3.3.3.1 and 3.3.3.2 respectively, makes use of the independence properties of the network to enable the use of local computations. However, in highly connected networks these methods can become intractable, and so approximate methods are required. In this case simulation is often used, and in Section 3.3.3.3 we discuss how to apply simulation in the context of Bayesian Networks.

3.3.3.1 Clustering

Clustering, as its name suggests, involves grouping nodes into clusters so as to form a tree. The propagation method introduced for trees can then be used on the clustered nodes, and the updated beliefs distributed back to the original variables. The tree formed as a result of clustering is termed a join tree.

In order to describe the formation of a join tree we first give some definitions. A set of nodes is complete if all nodes are pairwise linked, that is, each node is linked to every other node in the set. A complete set is called a clique if it is not a subset of another complete set. Note that a graph may have many cliques, which need not be disjoint. Additionally, we say an undirected graph is chordal if every cycle of length four or more has at least one chord, that is, an edge joining two non-consecutive nodes of the cycle. We can form a join tree by the following method (a small sketch of step 1 is given after the list):

1. Connect all parents that share a common child and remove the arrows from the links. The resulting graph $G$ is called the Markov Network of the original Bayesian Network.
2. Form a chordal graph $G'$ from $G$. This can be done by the graph triangulation algorithm given in [30].
3. Identify all cliques in $G'$.
4. Order the cliques $C_1, \ldots, C_t$ of $G'$ by rank of the highest-order node in each clique, then connect each $C_i$ to a predecessor $C_j$ ($j < i$) sharing the highest number of nodes with $C_i$.
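As a small illustration of step 1, the sketch below forms the Markov Network of a DAG by 'marrying' parents that share a child and dropping edge directions. The parent sets are chosen to be consistent with the description of Figure 3.6 a) given below, but the exact arrows of that figure are an assumption of this sketch.

```python
from itertools import combinations

# A minimal sketch of step 1 above: form the Markov Network (moral graph) of a
# DAG.  The DAG is written as a dictionary mapping each child to its parents.

parents = {
    "X1": [],
    "X2": ["X1"],
    "X3": ["X1"],
    "X4": ["X1"],
    "X5": ["X2", "X3"],
    "X6": ["X4", "X5"],
}

edges = set()
for child, pa in parents.items():
    for p in pa:                                  # keep every original link, undirected
        edges.add(frozenset((p, child)))
    for p, q in combinations(pa, 2):              # "marry" parents sharing a common child
        edges.add(frozenset((p, q)))

print(sorted(tuple(sorted(e)) for e in edges))
# Steps 2-4 (triangulation, clique identification and ordering) would follow,
# e.g. via the graph triangulation algorithm referenced in [30].
```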
In order to understand why this algorithm works, consider first the formation of the Markov Network in Step 1. At a converging connection, as in Figure 3.5 a), if we know the state of $X_2$ then $X_1$ and $X_3$ are not independent. If the arrows are simply removed from the links, to form the undirected graph in Figure 3.5 b), then knowing the state of $X_2$ renders $X_1$ and $X_3$ independent. When forming the Markov Network of a directed graph $D$ we therefore add a link between the parents $X_1$ and $X_3$, as in Figure 3.5 c), which ensures there are no additional independence relations in the undirected graph. Hence a Markov Network $G$ is an I-map of the corresponding directed graph $D$. Since an I-map implies that any independencies represented in the graph are present in $P$, adding extra links, which can only lessen the number of independencies represented by $G$, leaves $G$ an I-map (though not a minimal I-map). Hence the chordal graph formed in Step 2 is still an I-map of $P$.

Figure 3.5: The Markov Network of the directed network in a) is formed by removing the arrows on the links and adding a link between nodes which had a common child.

If $G$ is an I-map of $P$ and is chordal, then $P$ is said to be decomposable relative to $G$ [30]. In addition, if $P$ is decomposable relative to $G$, then any join tree $T$ of the cliques of $G$ is an I-map relative to $P$. That is, if $\langle C_1 \mid C_2 \mid C_3 \rangle_T$, then the variables in $C_1$ are independent of the variables in $C_3$ given the variables in $C_2$. A proof of this statement is given in [30]. Hence the join tree is an I-map of the original distribution, and we are justified in using the propagation algorithm introduced for trees.

As an example, consider the network in Figure 3.6 a). The corresponding chordal graph is shown in Figure 3.6 b), where the links between $X_2$ and $X_3$ and between $X_5$ and $X_4$ were added in Step 1 of the algorithm, and the link between $X_1$ and $X_5$ in Step 2. There are three cliques in this graph, namely $C_1 = \{X_1, X_2, X_3, X_5\}$, $C_2 = \{X_1, X_4, X_5\}$ and $C_3 = \{X_4, X_5, X_6\}$. The highest-order node, $X_5$, is in each clique. Hence we can join $C_1$ to $C_2$ (as $C_1$ and $C_2$ share two nodes), and $C_2$ to $C_3$, to obtain the join tree shown in Figure 3.6 c). In general the join tree is not unique. Additionally, if we wish to introduce evidence at a node, we do not instantiate a node of the join tree directly; instead a dummy node is added as a parent of the clique which contains the evidence node, and the evidence is entered at this dummy node.

In the above example,
$$P(C_1 \mid C_2) = P(X_1, X_2, X_3, X_5 \mid X_1, X_4, X_5) = P(X_2, X_3 \mid X_1, X_4, X_5) = P(X_2, X_3 \mid X_1, X_5),$$
as $X_2$ and $X_3$ are d-separated from $X_4$ by $X_1$ and $X_5$. In general, the dependence relationships between two cliques $C_i$ and $C_j$ can be computed from the conditional probability of the variables unique to $C_i$, conditioned on the variables $C_i$ shares with $C_j$.

The number of states of a clustered variable increases exponentially with the number of nodes in the cluster, and the computational complexity increases accordingly. In addition, the more highly connected the network, the more nodes each clique will have in common with others, and so the average number of nodes per clique increases.

Figure 3.6: The formation of a join tree c) from the Bayesian Network in a). This allows the use of propagation methods for trees in what was originally a multiply connected network.

3.3.3.2 Conditioning

This method makes use of the d-separation criteria of a network, which can be used to block the flow of evidence along paths in such a way as to render the network singly connected.
The propagation schemes for trees or polytrees can then be used. Consider the simplest loop, as shown in Figure 3.7. If we instantiate $X_2$, that is, set $X_2$ to some state $x_2$, then $X_1$ and $X_3$ are d-separated and so information cannot 'flow' around the loop. Hence at $X_4$ we can assume the information received from $X_1$ is independent of that received from $X_3$, and so can use the polytree algorithm.

Figure 3.7: A simple Bayesian Network containing a loop.

In general we require a set of variables $\mathbf{W}_I$ to be instantiated in order for this approach to be usable. We then update our beliefs at node $X$ by
$$\mathrm{BEL}(X = x) = P(x \mid e) = \sum_{\mathbf{w}_I} P(x \mid \mathbf{w}_I, e)\, P(\mathbf{w}_I \mid e),$$
where the summation is over all possible configurations of the variables in $\mathbf{W}_I$. This is a weighted sum of the probabilities that $X = x$ when the variables in $\mathbf{W}_I$ are instantiated to each possible configuration, the weights being given by $P(\mathbf{w}_I \mid e)$. As instantiating the variables in $\mathbf{W}_I$ renders the network singly connected, the term $P(x \mid \mathbf{w}_I, e)$ for each $\mathbf{w}_I$ can be computed by the algorithm for polytrees. An application of Bayes' rule allows us to calculate the weights,
$$P(\mathbf{w}_I \mid e) = \alpha\, P(e \mid \mathbf{w}_I)\, P(\mathbf{w}_I),$$
where $P(e \mid \mathbf{w}_I)$ can also be computed using the polytree algorithm, by instantiating the variables in $\mathbf{W}_I$ to $\mathbf{w}_I$ and updating the distribution at the evidence nodes. We hence require two applications of the polytree algorithm for each configuration of the variables in $\mathbf{W}_I$. The required storage and computation time for this method grow exponentially with the number of nodes that must be instantiated, as the propagation algorithm must be repeated for every combination of the states of these variables. Hence this method is also not tractable in large or highly connected networks.

3.3.3.3 Stochastic Simulation

Simulation is typically used in networks too large for the efficient use of the above methods. Exact methods applied to large or densely connected networks require a prohibitive amount of either memory or computation, and are not feasible; as mentioned previously, exact inference in Bayesian Networks can become intractable and is in fact NP-hard [6]. With simulation, estimates of belief are obtained by observing the frequency with which a configuration of the variables occurs over a series of runs. Although approximate inference in Bayesian Networks has also been shown to be NP-hard [9], in some networks simulation is the only method that can produce a result at all.

As the predefined conditional probabilities give the probability that a variable will be in a particular state given the states of its parents, a simulation can proceed as follows (a short sketch follows the list):

1. Set $m$ to the desired number of iterations.
2. Select a state for each root node $X_i$ by sampling from the probability distribution $P(X_i)$.
3. For each node $X_i$ in the network for which the configuration of $\mathrm{Pa}(X_i)$ is known, select a state for $X_i$ by sampling from the probability distribution $P(X_i \mid \mathrm{pa}(X_i))$. Continue until all variables are instantiated, and record the final states of the variables.
4. Repeat steps 2 and 3 $m$ times.

If $N_{ik}$ denotes the number of runs for which $X_i$ was in state $k$, let $\widehat{\mathrm{BEL}}(X_i = k) = N_{ik}/m$ for all variables $X_i$ and states $k$.

As the state of a variable on any run is determined only by the configuration of its parents, this method avoids the problems associated with loops encountered in the previous methods. That is, indirect dependencies between variables which render the network multiply connected have no effect on the calculation scheme.
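The sketch below is one possible implementation of steps 1-4 for the Defence Policy/Vote/Rally network of Section 3.3.1; the helper function name is illustrative only. With no evidence entered, the estimate of $\mathrm{BEL}(V)$ should be close to the exact value $(0.39, 0.61)$ computed earlier.

```python
import random

# A minimal sketch of the forward-sampling scheme in steps 1-4 above, for the
# Defence Policy / Vote / Rally example.  CPT values follow the text.

P_D = [0.3, 0.7]                                   # P(D): states 1, 2 (indices 0, 1)
P_V_given_D = [[0.95, 0.05], [0.15, 0.85]]         # P(V | D)
P_R_given_D = [[1.0, 0.0], [0.8, 0.2]]             # P(R | D): states 0, 1

def sample(dist):
    """Draw an index according to the discrete distribution `dist`."""
    return random.choices(range(len(dist)), weights=dist)[0]

m = 100_000
counts_V = [0, 0]
for _ in range(m):
    d = sample(P_D)                                # step 2: sample the root
    v = sample(P_V_given_D[d])                     # step 3: sample children given parents
    r = sample(P_R_given_D[d])
    counts_V[v] += 1

bel_hat_V = [c / m for c in counts_V]              # BEL-hat(V = k) = N_k / m
print(bel_hat_V)                                   # should be close to (0.39, 0.61)
```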
The beliefs obtained by use of this method are only approximations. However, as the number of runs $m$ tends to infinity, the distribution $\widehat{\mathrm{BEL}}(X)$ converges in probability to $\mathrm{BEL}(X)$ [30]. Although the number of computations per run is modest, in practice the number of runs $m$ may have to be quite large for the approximation to be adequate. In a large network the number of configurations of the variables may be enormous; for example, a network used by Cheng et al. in [2] has 179 nodes and about $10^{61}$ configurations. If we take $10^6$ samples, say, we can sample only $10^{-55}$ of the total sample space.

If a root node has some state that occurs with low probability, then many runs are required to obtain sufficient realisations of that state to estimate the distributions of its descendants accurately. Consider the case where a variable $Y$, say, is instantiated to $Y = y$. Given this information we could estimate the distribution of an intermediate variable $X$, $P(X \mid e)$, by the proportion of runs for which $X$ was in state $x$ given that $Y$ was in state $y$, for all states $x$ of $X$. However, the proportion of runs for which $Y = y$ may be small. As these are the only runs which can be used to compute the posterior distribution at the other nodes, this process is very inefficient and a large number of runs is required to obtain sufficient accuracy. For example, if $P(Y = y) = 0.05$ we would expect to be able to count only 5 percent of the runs performed. We now present an alternative method, based on Gibbs sampling, which handles instantiated variables more efficiently.

3.3.3.4 Gibbs Sampling

To begin the simulation all evidence nodes are set to their observed states, and all other variables to arbitrary initial states. We then assign some ordering to the nodes which are not evidence nodes. At each stage we update the state of a variable by sampling from its probability distribution conditional on the remaining variables being in their current configuration. That is, we cycle through the variables indefinitely and at each node $X$ we calculate $P(X \mid \mathbf{U}_{(X)} = \mathbf{u}_{(X)})$, where $\mathbf{U}_{(X)}$ is the set of all variables excluding $X$.

The most computationally demanding step in this process is the evaluation of the distribution $P(X \mid \mathbf{u}_{(X)})$. However, this can be simplified by considering the d-separation properties of the network. We know that the parents of $X$ d-separate $X$ from all ancestors for which the only path to $X$ is via $\mathrm{Pa}(X)$. Hence, if we know the states of the variables in $\mathrm{Pa}(X)$, knowing the states of the ancestors d-separated from $X$ can give us no further information about the state of $X$. We can also obtain information about $X$ from its children. These d-separate $X$ from any descendants for which the only path from $X$ to the descendant is through the children of $X$. However, if a child of $X$ has other parents, there is a converging connection at this child; hence $X$ is not d-separated from the parents of this child, given that the state of the child is fixed. The set consisting of those variables which are either a parent of $X$, a child of $X$, or a parent of a child of $X$ is known as the Markov Blanket of $X$, $\mathrm{Ma}(X)$. If we know the configuration of the variables in the Markov Blanket of $X$, knowing the state of any other variable gives us no further information about $X$; that is, $\mathrm{Ma}(X)$ d-separates $X$ from the rest of the network.

In order to quantify the relationship $P(X \mid \mathbf{u}_{(X)})$ we begin with the chain rule.
If we let $L$ be the set indexing all variables in $\mathbf{U}_{(X)}$ other than the children of $X$, $Y_1, Y_2, \ldots, Y_k$, then, by the chain rule for Bayesian Networks,
$$P(\mathbf{u}) = P(x, \mathbf{u}_{(X)}) = P(x \mid \mathrm{pa}(X)) \prod_{j=1}^k P(y_j \mid \mathrm{pa}(Y_j)) \prod_{l \in L} P(x_l \mid \mathrm{pa}(X_l)). \qquad (3.17)$$
As the final product is independent of $X$, we can write (3.17) as
$$P(x, \mathbf{u}_{(X)}) = \alpha\, P(x \mid \mathrm{pa}(X)) \prod_{j=1}^k P(y_j \mid \mathrm{pa}(Y_j)),$$
and hence, as the marginal distribution $P(\mathbf{u}_{(X)})$ is also independent of $X$,
$$P(x \mid \mathbf{u}_{(X)}) = \frac{P(x, \mathbf{u}_{(X)})}{P(\mathbf{u}_{(X)})} = \alpha'\, P(x \mid \mathrm{pa}(X)) \prod_{j=1}^k P(y_j \mid \mathrm{pa}(Y_j)),$$
where $\alpha'$ is a normalising constant. Hence we can evaluate the probability distribution of $X$ conditioned on all the other variables in the network knowing only the configuration of the variables in $\mathrm{Ma}(X)$, that is, $\mathrm{pa}(X)$, $y_j$ and $\mathrm{pa}(Y_j)$, $j = 1, \ldots, k$.

As time progresses, the system generated by this method is guaranteed to reach a steady state [30]. That is, at any stage the probability that the variables are in some configuration $\mathbf{u}$ is given by $P(\mathbf{u})$, as defined by the conditional probability tables. This occurs whatever the initial states of the variables, though a period of initialisation is required: the process should be run for some time before the configurations of the variables are recorded, in order to allow the process to stabilise. However, there is no way of knowing exactly how long this period should be, and it depends on the starting states of the variables. This method requires that an additional $k$ conditional probabilities be retrieved to update the distribution at each node, compared with the method presented in Section 3.3.3.3. However, because instantiated variables make no difference to the precision of the estimate for a given number of runs, Gibbs sampling is far more efficient when there are many instantiated variables.

3.3.3.5 Importance Sampling

An alternative solution to the problem of estimates having large variance when there are extremely unlikely instantiations of evidence is importance sampling. Suppose we want to estimate the probability of some evidence, $P(\mathbf{E} = \mathbf{e})$, where $\mathbf{E} \subseteq \mathbf{U}$ is a set of evidence nodes. For convenience we denote the set of all other variables in the network, $\mathbf{U} \setminus \mathbf{E}$, by $\mathbf{B}$. Then we can calculate the required probability by summing over the states of the variables in $\mathbf{B}$ while the evidence nodes remain in configuration $\mathbf{e}$. That is,
$$P(\mathbf{E} = \mathbf{e}) = \sum_{\mathbf{b} \in \mathbf{B}} P(\{\mathbf{B} = \mathbf{b}\} \cap \{\mathbf{E} = \mathbf{e}\}) = \sum_{\mathbf{b} \in \mathbf{B}} \frac{P(\{\mathbf{B} = \mathbf{b}\} \cap \{\mathbf{E} = \mathbf{e}\})}{f(\mathbf{b})}\, f(\mathbf{b}), \qquad (3.18)$$
where $f(\mathbf{B})$ is a probability distribution function over $\mathbf{B}$, referred to as the importance function.

Cheng and Druzdzel in [2] propose an adaptive importance sampling algorithm for large Bayesian Networks (AIS-BN). The importance sampling algorithm is similar to the sampling algorithm given in Section 3.3.3.3, except that each root node is randomly instantiated to one of its possible states according to the importance prior distribution for that node, and the state of each remaining node is sampled from its importance conditional probability distribution conditioned on the states of its parents. To estimate the sum (3.18) we can independently generate $m$ samples $s_1, s_2, \ldots, s_m$ from $f(\mathbf{B})$ and let the value of our estimate be
$$\hat{P}_m(\mathbf{E} = \mathbf{e}) = \frac{1}{m} \sum_{i=1}^m \frac{P(\{\mathbf{B} = s_i\} \cap \{\mathbf{E} = \mathbf{e}\})}{f(s_i)}.$$
The distribution $f(\mathbf{B})$ can be chosen to emphasise instantiations of the variables in $\mathbf{B}$ which are regarded as being important, by increasing the probability of those instantiations over that in the original distribution $P(\mathbf{B})$.
This means that under the sampling distribution $f(\mathbf{B})$ these instantiations will be observed relatively more often. However, simply taking the mean of the observations as our estimate would give a biased estimate of $P(\mathbf{E} = \mathbf{e})$; therefore each observation is first multiplied by the weight $1/f(s_i)$, which, from (3.18), can be seen to produce an unbiased estimator. Although sampling from the density $P(\mathbf{B})$ would also give an unbiased estimate, a good choice of importance function can reduce the variance of the estimate substantially. It can be shown that the variance of $\hat{P}_m$ is
$$\sigma^2(\hat{P}_m) = \frac{\sigma_f^2}{m}, \quad \text{where} \quad \sigma_f^2 = \sum_{\mathbf{b} \in \mathbf{B}} \left[ \frac{P(\{\mathbf{B} = \mathbf{b}\} \cap \{\mathbf{E} = \mathbf{e}\})}{f(\mathbf{b})} - P(\mathbf{E} = \mathbf{e}) \right]^2 f(\mathbf{b}).$$
Hence the choice of importance function affects the variance of the estimator. It can be shown that the optimal choice of $f$ is, for all $\mathbf{b} \in \mathbf{B}$,
$$f(\mathbf{b}) = \frac{P(\{\mathbf{B} = \mathbf{b}\} \cap \{\mathbf{E} = \mathbf{e}\})}{P(\mathbf{E} = \mathbf{e})} = P(\mathbf{B} = \mathbf{b} \mid \mathbf{E} = \mathbf{e}), \qquad (3.19)$$
which results in a value of $\sigma_f^2$ equal to 0. In practice it is necessary to estimate (3.19), but it is assumed that functions which are close to optimal will still reduce the variance effectively. The AIS-BN algorithm uses heuristics to continually update the importance function (hence the word 'adaptive') as more samples are obtained, and has been shown to achieve over two orders of magnitude improvement in precision over both likelihood weighting and self-importance sampling [2].

3.4 Answering Queries

After we have formed a Bayesian Network to represent a probability distribution, we may want to use the information stored on the links to answer queries about that distribution, for example the probability that node $X$ is in state $x$. This can be done through simple modifications to the network, namely by adding what we will refer to as query nodes. Given a query, we introduce a query node $Q$ with states corresponding to the hypothesis '$X$ is in state $x$' and the alternative '$X$ is not in state $x$'. We then make the query node $Q$ a child of $X$, as in Figure 3.8 a), with the link probabilities specified by
$$P(Q = 1 \mid X) = \begin{cases} 1 & \text{if } X = x \\ 0 & \text{if } X \neq x. \end{cases}$$
We then propagate the probabilities through the network, and the answer to our query is given by the belief that $Q$ is in state 1.

Figure 3.8: Networks with the addition of query nodes: a) a single query node $Q$ as a child of $X$; b) query nodes $Q_1$ (OR) and $Q_2$ (AND) used to answer a compound query on $X$, $Y$ and $Z$.

This method can be extended to answer compound queries, for example to determine the probability that node $X$ is in state $x$ or $Y$ is in state $y$, and that $Z$ is in state $z$. This is achieved by adding query nodes $Q_1$ and $Q_2$ as in Figure 3.8 b), where the probability table for the OR connective represented by $Q_1$ has the form
$$P(Q_1 = 1 \mid X, Y) = \begin{cases} 1 & \text{if } X = x \text{ or } Y = y \\ 0 & \text{otherwise.} \end{cases}$$
The required probability is then given at node $Q_2$, which represents the AND connective between nodes $Q_1$ and $Z$; that is,
$$P(Q_2 = 1 \mid Q_1, Z) = \begin{cases} 1 & \text{if } Z = z \text{ and } Q_1 = 1 \\ 0 & \text{otherwise.} \end{cases}$$
This approach can be extended in similar ways to answer more complex queries. Note that by adding the query nodes we do not alter the propagation of information through the remainder of the network. This is because the connections at query nodes are always converging and, as no evidence is entered at these nodes, they d-separate their parents. Hence the query nodes do not introduce an alternative path for the flow of information between the variables in the network.
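As a small check of the query-node construction, the sketch below builds the deterministic OR and AND tables for the compound query '$X = x$ or $Y = y$, and $Z = z$' and evaluates the query probability by direct enumeration over an arbitrary toy joint distribution; in practice the probability would instead be obtained by propagating beliefs to $Q_2$ as described above.

```python
import numpy as np
from itertools import product

# The joint distribution below is an arbitrary illustrative example, not one
# from the text; x, y and z are taken to be state 1 of each binary variable.

joint = np.random.dirichlet(np.ones(8)).reshape(2, 2, 2)   # toy P(X, Y, Z)

def q1(x, y):            # OR connective: P(Q1 = 1 | X, Y)
    return 1.0 if (x == 1 or y == 1) else 0.0

def q2(q1_state, z):     # AND connective: P(Q2 = 1 | Q1, Z)
    return 1.0 if (q1_state == 1 and z == 1) else 0.0

# Belief that Q2 = 1, obtained by summing the joint against the query tables.
p_query = sum(joint[x, y, z] * q2(q1(x, y), z)
              for x, y, z in product(range(2), repeat=3))

# Sanity check: the same quantity computed directly from the joint.
direct = joint[1, :, 1].sum() + joint[0, 1, 1]
print(p_query, direct)
```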
Chapter 4

Aspects of Structure

The previous chapter described how information travels through a Bayesian Network to update the belief distributions of the variables. In this chapter we aim to determine the effect on beliefs of changes made to the network structure, and the value of the information observed at the evidence nodes. If we are using our network to make decisions and are unsure of a particular aspect of the structure, for example whether a particular link should be included, it would be unreasonable to use this network if the presence or absence of the link would change the optimal decision. Often the formation of our network is highly subjective, and the true parent sets for each node must be estimated. In Chapter 5 we discuss learning the structure of the network when we have access to data generated from the underlying probability distribution. In this chapter we assume that, given a set of variables to include, allocation of parents is performed subjectively, that is, without using information from data.

Although a Bayesian Network can be used simply as an efficient way of storing information about a joint probability distribution, it is often used as a tool to aid decision making. In this context we class variables as being of one of three types:

1. Hypothesis variables - the variables on whose belief distributions the decision will be based.
2. Information/Evidence variables - leaf nodes whose state we may be able to observe. We can then enter the observed state of the variable as evidence to be propagated through the network. In practice, the collection of this information may have some cost attached to it, for example the cost of a test. (If an observable node is not a leaf node, the evidence can be represented as an adjoining leaf node which is instantiated.)
3. Intermediate variables - the remaining variables, which determine the relationship between the evidence variables and the hypothesis variables and complete the structure of the network.

Thus when we refer to the effect of a structural change we are concerned with the beliefs of a set of hypothesis nodes. If a change has little effect on the distribution of these beliefs then it will not significantly affect our inference, and so it is considered to be of little importance.

In order to quantify changes made to the structure, and hence to the probability distribution represented by the network, we use measures developed in Information Theory, namely entropy and mutual information. These concepts are introduced in Section 4.1 and applied in Section 4.2, where we look at the effect of changing the links and nodes of the network. For reasons of computation and ease of understanding, it is desirable to keep the network as simple as possible whilst ensuring it is still an adequate representation of the probability distribution at the hypothesis nodes. Hence we need to know which changes will simplify the network whilst having as little impact on the belief distributions as possible. In general this is a difficult task. In Section 4.3 we define the problem and then formulate an approximate method of solution. Given that the acquisition of evidence has some cost associated with it, we may not be able to afford to acquire all available evidence; Section 4.4 discusses how to choose the optimal set of information variables given a fixed budget.

4.1 Concepts from Information Theory

In order to assess the effect of changes to structure, some measure of the consequent change in probability distribution is required. A structural change may affect the information reaching the hypothesis variables and hence their distribution. Here we introduce several concepts from Information Theory which enable us to quantify these effects.
When making a decision we wish to know the states of the variables on which the decision is to be based. Given a probability distribution over the states of a hypothesis variable, the more the probability mass is concentrated on only one or two states, the more certain we are about the state of this variable. Entropy is a measure of the amount of uncertainty associated with a random variable: the greater the entropy, the more information is required, on average, to completely determine the state of that variable.

Consider a variable $X_i$ which has $r_i$ states, and denote by $P(x_i^k)$ the probability that $X_i$ is in state $k$. The entropy of $X_i$, $H(X_i)$, is defined as
$$H(X_i) = -\sum_{k=1}^{r_i} P(x_i^k) \log_2 P(x_i^k) = -E[\log_2 P(X_i)].$$
When the probability mass is all assigned to a single state, $H(X_i) = 0$, which indicates complete certainty in the state of this variable. $H(X_i)$ is maximised when the probability is uniformly distributed over the states of $X_i$, in which case
$$H(X_i) = -\sum_{k=1}^{r_i} \frac{1}{r_i} \log_2 \frac{1}{r_i} = \log_2 r_i.$$

The joint entropy $H(X_1, X_2)$ of a pair of discrete random variables with joint distribution $P(X_1, X_2)$ is defined analogously as
$$H(X_1, X_2) = -\sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 P(x_1^k, x_2^l) = -E[\log_2 P(X_1, X_2)].$$
It can be shown, for example in [8], that
$$H(X_1, X_2) = H(X_2) + H(X_1 \mid X_2), \qquad (4.1)$$
where the conditional entropy
$$H(X_1 \mid X_2) = E_{X_2}[H(X_1 \mid X_2)] = \sum_{l=1}^{r_2} P(x_2^l)\, H(X_1 \mid x_2^l) = -\sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 P(x_1^k \mid x_2^l).$$
$H(X_1 \mid X_2)$ represents the average uncertainty which remains about the state of $X_1$ if we can observe the state of $X_2$. Applying (4.1) recursively leads to what is known as the chain rule for conditional entropy,
$$H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \ldots + H(X_n \mid X_{n-1}, \ldots, X_1) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}).$$

Relative entropy is a measure of the difference between two distributions $P_1(X_i)$ and $P_2(X_i)$, and is equivalent to the Kullback-Leibler divergence between the distributions. (Although we denote it by $\mathrm{dist}_K(\cdot)$, it is not strictly a distance (metric), as it does not satisfy the triangle inequality, nor is it symmetric: $\mathrm{dist}_K(P_1, P_2) \neq \mathrm{dist}_K(P_2, P_1)$.) Relative entropy is defined as
$$\mathrm{dist}_K(P_1, P_2) = \sum_{k=1}^{r_i} P_1(x_i^k) \log_2 \frac{P_1(x_i^k)}{P_2(x_i^k)} = E_{P_1}\left[\log_2 \frac{P_1(X_i)}{P_2(X_i)}\right].$$
If $P_1(X_i)$ is the 'true' distribution and we base our decision on $P_2(X_i)$, the uncertainty in this decision is $-\sum_k P_1(x_i^k) \log_2 P_2(x_i^k)$. Hence the inefficiency in assuming the distribution $P_2$ when $P_1$ is in fact the true distribution is given by
$$\Big[-\sum_{k=1}^{r_i} P_1(x_i^k) \log_2 P_2(x_i^k)\Big] - \Big[-\sum_{k=1}^{r_i} P_1(x_i^k) \log_2 P_1(x_i^k)\Big] = \sum_{k=1}^{r_i} P_1(x_i^k) \log_2 \frac{P_1(x_i^k)}{P_2(x_i^k)} = \mathrm{dist}_K(P_1, P_2). \qquad (4.2)$$

To show that the Kullback-Leibler divergence is non-negative, we make use of Jensen's inequality (a proof of which can be found in [8]), which says that for a convex function $f(X)$,
$$E[f(X)] \geq f(E[X]). \qquad (4.3)$$
Hence, as $-\log_2$ is convex,
$$
\begin{aligned}
\mathrm{dist}_K(P_1, P_2) &= E_{P_1}\left[\log_2 \frac{P_1(X_i)}{P_2(X_i)}\right] = E_{P_1}\left[-\log_2 \frac{P_2(X_i)}{P_1(X_i)}\right] \\
&\geq -\log_2 E_{P_1}\left[\frac{P_2(X_i)}{P_1(X_i)}\right], \quad \text{from (4.3)} \\
&= -\log_2 \sum_{k=1}^{r_i} P_1(x_i^k)\, \frac{P_2(x_i^k)}{P_1(x_i^k)} = -\log_2 \sum_{k=1}^{r_i} P_2(x_i^k) = -\log_2(1) = 0. \qquad (4.4)
\end{aligned}
$$
This result, together with (4.2), implies that it is always inefficient to use less accurate information than that which is available; that is, if evidence has been collected we should always use this extra information.
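The following short sketch illustrates the quantities just defined numerically: entropy, conditional entropy via (4.1), and the Kullback-Leibler divergence. The distributions used are arbitrary examples.

```python
import numpy as np

def entropy(p):
    """Entropy in bits; terms with zero probability contribute nothing."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log2(p[p > 0]))

def kl(p1, p2):
    """Kullback-Leibler divergence dist_K(P1, P2) in bits."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mask = p1 > 0
    return np.sum(p1[mask] * np.log2(p1[mask] / p2[mask]))

joint = np.array([[0.3, 0.1],         # an arbitrary joint P(X1, X2)
                  [0.2, 0.4]])
H_joint = entropy(joint.ravel())
H_X2 = entropy(joint.sum(axis=0))
H_X1_given_X2 = H_joint - H_X2        # rearranging (4.1)

print(H_joint, H_X1_given_X2)
print(kl([0.3, 0.7], [0.5, 0.5]))     # non-negative, by (4.4)
print(kl([0.3, 0.7], [0.3, 0.7]))     # 0.0: zero only when P1 = P2
```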
We can use relative entropy to quantify the effect of making a change to the network which may result in the distribution of belief at a hypothesis node changing from $P_1$ to $P_2$, say. If there is more than one hypothesis variable in the network, the effect of a structural change can be quantified by
$$\sum_{X_h \in \mathbf{X}_h} \mathrm{dist}_K(P_1(X_h), P_2(X_h)),$$
where $\mathbf{X}_h \subseteq \mathbf{U}$ is the set of hypothesis variables of interest and, again, the set $\mathbf{U}$ contains all variables in the network. This sum can be weighted to reflect the influence individual hypothesis variables have on the decision.

We now introduce several other measures from Information Theory used in subsequent analyses. Mutual information, $I(X_1; X_2)$, is a measure of the reduction in uncertainty of one random variable due to knowledge of the other. It is the relative entropy between the actual joint distribution and the joint distribution assuming the variables to be independent. That is,
$$
\begin{aligned}
I(X_1; X_2) &= \sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 \frac{P(x_1^k, x_2^l)}{P(x_1^k)\, P(x_2^l)} = \sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 \frac{P(x_1^k \mid x_2^l)}{P(x_1^k)} \\
&= -\sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 P(x_1^k) + \sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 P(x_1^k \mid x_2^l) \\
&= H(X_1) - H(X_1 \mid X_2). \qquad (4.5)
\end{aligned}
$$
Notice that $I(X_1; X_2) = \mathrm{dist}_K(P(X_1, X_2), P(X_1)P(X_2))$, and so from (4.4) and (4.5) we find that $H(X_1) - H(X_1 \mid X_2) \geq 0$, which implies that
$$H(X_1) \geq H(X_1 \mid X_2).$$
This shows that, on average, obtaining information about some variable $X_2$ distinct from $X_1$ will reduce our uncertainty about the state of $X_1$. Only if $X_1$ and $X_2$ are independent will knowledge of the state of $X_2$ not reduce the entropy of $X_1$. Finally, the conditional mutual information of random variables $X_1$ and $X_2$ given $X_3$ is
$$I(X_1; X_2 \mid X_3) = H(X_1 \mid X_3) - H(X_1 \mid X_2, X_3) = E\left[\log_2 \frac{P(X_1, X_2 \mid X_3)}{P(X_1 \mid X_3)\, P(X_2 \mid X_3)}\right]. \qquad (4.6)$$
This is a measure of the reduction in uncertainty about $X_1$ gained by obtaining knowledge of $X_2$ when the state of $X_3$ is already known.

4.2 Changes to the Network

As the structure of a Bayesian Network is completely determined by the set of parents of each node, a modification to the structure is equivalent to a change in the set of parents of at least one node. In this section we consider the effect of such a change on the marginal distributions of the nodes, most significantly the hypothesis nodes.

4.2.1 Changes to the Parents of a Node

Often the hypothesis nodes are root nodes. The network shown in Figure 4.1 is a simple model which allows us to examine the effect of changing the parents of a node in a network which is a tree. Here $X_1$ is the hypothesis node, and each node $X_i$ has children $X_{i+1}$ and $C_i$, where $C_i$ is considered to be the root node of a subtree. That is, if any evidence has been entered in the subtree rooted at $C_i$ then, as the network is a tree, this information must pass through $C_i$ before it can be received at $X_i$ or $X_1$.

Suppose now we change the parent set of $C_4$ to $\mathrm{Pa}(C_4) = \{X_2\}$. We might make such a change if we believe that $X_2$ is a direct cause of $C_4$. Given that the subtree rooted at $C_4$ contains at least one evidence node, instead of this information first being received at $X_4$ and then propagated through $X_3$ and $X_2$ to $X_1$, it is received directly at $X_2$. Hence, as $X_2$ and $X_3$ must now process this information before it can reach $X_4$, we would expect $H(X_4)$ to increase. Similarly, and perhaps more significantly, as $X_2$ is closer to $X_1$ than $X_4$ is, we would expect $X_1$ to receive more direct information and so expect $H(X_1)$ to decrease.
However, structure is not the only thing that affects the propagation of evidence: the conditional probabilities also have a large effect. In the above discussion we assumed that the amount of information reaching $X_4$ from $C_4$ was similar to the amount reaching $X_2$ from $C_4$ after having made the change $\mathrm{Pa}(C_4) = \{X_2\}$. That is, we assumed
$$H(X_4 \mid C_4;\ \mathrm{Pa}(C_4) = X_4) \simeq H(X_2 \mid C_4;\ \mathrm{Pa}(C_4) = X_2).$$
From (4.5) we know that
$$I(X_4; C_4 \mid \mathrm{Pa}(C_4) = X_4) = H(X_4) - H(X_4 \mid C_4;\ \mathrm{Pa}(C_4) = X_4)$$
and
$$I(X_2; C_4 \mid \mathrm{Pa}(C_4) = X_2) = H(X_2) - H(X_2 \mid C_4;\ \mathrm{Pa}(C_4) = X_2).$$
Hence if
$$H(X_4 \mid C_4;\ \mathrm{Pa}(C_4) = X_4) \ll H(X_2 \mid C_4;\ \mathrm{Pa}(C_4) = X_2),$$
then $I(X_4; C_4 \mid \mathrm{Pa}(C_4) = X_4) \gg I(X_2; C_4 \mid \mathrm{Pa}(C_4) = X_2)$, and so it is possible that, by making this change of parents, the information reaching $X_1$ is reduced. That is, the information (entropy-reducing capacity) reaching $X_1$ is determined not only by the structure but also by our beliefs about the strength of the causal relations between a node and its child.

Figure 4.1: A tree structure.

In general, the entropy-reducing capacity of evidence decreases the further from the hypothesis node it is received. This can be shown by the data processing inequality [8], which demonstrates that any manipulation, at a node $X_2$, of the information received from a node $X_3$ cannot increase the amount of information reaching $X_1$. That is, if $X_1 \to X_2 \to X_3$, then $I(X_1; X_2) \geq I(X_1; X_3)$, with equality if and only if $I(X_1; X_2 \mid X_3) = 0$. If we set the variable $X_3$ equal to some function $g(X_2)$ of $X_2$, it follows that
$$I(X_1; X_2) \geq I(X_1; g(X_2)).$$
At each intermediate node between variables $X_1$ and $X_k$, say, the information is manipulated before being propagated on to its neighbours. Hence at each intermediate node the amount of information reaching $X_1$ from $X_k$ will not be increased, and will probably be reduced. That is, the value (entropy-reducing capacity) of evidence to the hypothesis variable is lessened the more times it is manipulated as it is propagated towards the hypothesis node.

In the example in Figure 4.1, if we assign $\mathrm{Pa}(C_4) = \{X_2\}$ when $X_4$ is the 'true' parent of $C_4$, we would be basing our decision at $X_1$ on stronger information than is justified, and could obtain more certain results than we should. If we assign $\mathrm{Pa}(C_4) = \{X_k\}$, where $k$ is greater than four, we would expect greater uncertainty at $X_1$. This suggests that if we are unsure of the correct parent for a variable $X_i$ then, taking a conservative approach, we should assign as the parent of $X_i$ that node at which the information would have the least entropy-reducing effect on the hypothesis node. In general, for some hypothesis node $X^h$, we say a node $X_i$ is further from $X^h$ than $X_j$ is if
$$I(X^h; X_j) > I(X^h; X_i),$$
or equivalently, $H(X^h \mid X_j) < H(X^h \mid X_i)$. We choose as the parent of our node, or subtree, the candidate parent which is furthest from the hypothesis node. Note, however, that this analysis is based on the mutual information
$$I(X^h; X_i) = E\left[\log_2 \frac{P(X^h, X_i)}{P(X^h)\, P(X_i)}\right].$$
As this is an expected value, we would expect the suggested procedure for allocating parents to be conservative only in the long run; it may not be optimal in every situation. Consider the example illustrated in Figure 4.2. In this model we assume that an infection causes fever, which in turn causes tiredness. We wish to add the variable Alertness to the model, but are unsure whether it should be a child of Fever or of Tired. (This is a greatly simplified scenario; it may be more realistic to have both Fever and Tired as parents of Alertness, and to model some other causes of tiredness, for example.)

Figure 4.2: A Bayesian Network formed on the four variables Infection (I), Fever (F), Tired (T) and Alertness (A). The parent of Alertness is yet to be determined.
The conditional probability tables at Infection, Fever and Tired are $P(I) = (0.05, 0.95)$ and

P(F|I):    v. high   high   normal
I = yes    0.31      0.65   0.04
I = no     0         0.1    0.9

P(T|F):       yes    no
F = v. high   0.97   0.03
F = high      0.65   0.35
F = normal    0.15   0.85

If we assume that Fever should be the parent of Alertness, then we add Alertness to the model as a child of Fever. Given that we define the corresponding conditional probabilities as

P(A|F):       high   average   low
F = v. high   0.01   0.2       0.79
F = high      0.05   0.5       0.45
F = normal    0.3    0.6       0.1

then $I(I; A) = 0.03495$, which corresponds to a 12.2% reduction in entropy at Infection. If instead Tired is allocated as the parent of Alertness, with the conditional probabilities

P(A|T):    high   average   low
T = yes    0.02   0.4       0.58
T = no     0.3    0.6       0.1

this results in a value for $I(I; A)$ of 0.01288, which corresponds to only a 4.5% reduction in entropy. Hence Tired is the candidate parent furthest from Infection, as we would expect from the data processing inequality. Given the above discussion, these results imply that we should add Alertness as a child of Tired to avoid over-certainty in the state of Infection.

However, consider the results displayed in the table below. The entries in the first column give the reduction in entropy at Infection for each of the three possible findings at Alertness, that is $H(I) - H(I \mid A = a)$, when the parent of Alertness is Fever. The entries in the second column are the corresponding values for the case where Tired is allocated as the parent of Alertness.

I(I; A = a)    Pa(A) = F   Pa(A) = T
A = high       0.2123      0.1433
A = average    0.0656      0.04
A = low        -0.3804     -0.2075

If we receive evidence that the state of Alertness is low, then the entropy of Infection is lower if the parent of Alertness is Tired than if it is Fever. If Alertness is observed to be average or high, then the entropy at Infection is lower if Fever is the parent. Hence, although $I(I; A)$ is greater when Fever is the parent of Alertness, this is not the case for every possible instantiation of Alertness. This highlights the fact that, because mutual information is an expected value, allocating the parent of a node using the above method may not necessarily result in less information reaching the hypothesis node for every instantiation of that node. In some cases, however, we may have extra information about the type of decision we are making. For example, in this scenario it may be reasonable to assume that in most cases where we are required to make a decision about the hypothesis node Infection, the alertness of the patient will be average or low. Then, if we wish to allocate Alertness so as to have the least entropy-reducing effect, the above results suggest it would be reasonable to allocate Fever as the parent of Alertness, and not Tired as suggested by the previous analysis.
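The comparison above can be reproduced directly from the conditional probability tables. The sketch below forms the joint distribution $P(I, A)$ under each candidate parent of Alertness and evaluates $I(I; A)$ via (4.5); the printed values should agree with 0.03495 and 0.01288, up to rounding.

```python
import numpy as np

# CPTs quoted in the text.
P_I = np.array([0.05, 0.95])                              # (yes, no)
P_F_given_I = np.array([[0.31, 0.65, 0.04],               # (v.high, high, normal)
                        [0.00, 0.10, 0.90]])
P_T_given_F = np.array([[0.97, 0.03],                     # (yes, no)
                        [0.65, 0.35],
                        [0.15, 0.85]])
P_A_given_F = np.array([[0.01, 0.20, 0.79],               # (high, average, low)
                        [0.05, 0.50, 0.45],
                        [0.30, 0.60, 0.10]])
P_A_given_T = np.array([[0.02, 0.40, 0.58],
                        [0.30, 0.60, 0.10]])

def mutual_information(joint):
    """I(X; Y) in bits from a joint table, as in (4.5)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask]))

# Pa(A) = Fever:  P(I, A) = sum_F P(I) P(F|I) P(A|F)
joint_F = np.einsum('i,if,fa->ia', P_I, P_F_given_I, P_A_given_F)
# Pa(A) = Tired:  P(I, A) = sum_{F,T} P(I) P(F|I) P(T|F) P(A|T)
joint_T = np.einsum('i,if,ft,ta->ia', P_I, P_F_given_I, P_T_given_F, P_A_given_T)

print(mutual_information(joint_F), mutual_information(joint_T))
```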
4.2.2 Changes in Networks Containing Loops

The effect of modifying the set of parents in a network containing loops, or of specifying different causes of a variable, is hard to determine in general. Even if a link is added between nodes which share a common child, so that the networks reduce to the same join tree, propagations through the join trees will differ.

We may also wish to examine changes to the parent sets of a probabilistic network, for example if we try to simplify a network over a given set of nodes by removing certain links. A complication here is that the existence of equivalent structures is harder to determine in networks with loops. Two networks are said to be equivalent if they represent the same conditional independence assumptions, that is, every joint distribution encoded by one structure can be encoded by the other and vice versa [20]. For example, the two networks shown in Figure 4.3 are equivalent, as they both encode the same independence relation, namely $I(X_1; X_2; X_3)$.

Figure 4.3: Equivalent network structures.

Recall from Section 2.2 that, given any ordering of the variables, we can form a Probabilistic Bayesian Network of the underlying probability distribution by assigning the variables in the boundary stratum of a node $X_i$ to be the parents of $X_i$. A different structure may result for each ordering, so there may be as many as $n!$ different structures for a Bayesian Network on $n$ variables which represent the same probability distribution. Additionally, how the conditional probabilities are adjusted to compensate for a change is also an important consideration. However, we do know that if we remove a link from a Probabilistic Bayesian Network then the resulting structure will not be equivalent to the original. This follows from the fact that, by definition, a Probabilistic Bayesian Network is a minimal I-map, and so by deleting a link we are forcing an additional constraint of conditional independence to hold.

Note that the two networks in Figure 4.3 cannot be considered equivalent from a causal point of view. If we were to make an intervention and fix the state of $X_3$, say, then the state of this variable is independent of its parents. Hence we can delete the links from $X_3$ to its parents, and our inference would proceed on the networks given in Figure 4.4. These networks are clearly not equivalent.

Figure 4.4: The networks which would be used for inference if the state of $X_3$ in the networks in Figure 4.3 were fixed.

4.2.3 Removal of Nodes

We now examine the issues relating to the removal of a node. We may remove a node from the network in order to simplify it, or we may wish to know the effect of failing to include a variable in the model. Consider the network in Figure 4.5. If we delete node $X_3$, say, we could simply delete all links connected to this node. Then
$$\mathrm{Pa}(X_6) = \mathrm{Pa}(X_6) \setminus \{X_3\} = \{X_4, X_5\}.$$
However, if we do this then $X_2$ will no longer be an ancestor of $X_6$, and so we may choose to add the link $X_2 \to X_6$. $X_1$ will still be an ancestor of $X_6$ through $X_4$; however, if $X_4$ is instantiated then this will d-separate $X_1$ from $X_6$, and so we may need to link $X_1$ to $X_6$ also. That is, we let
$$\mathrm{Pa}(X_6) = \mathrm{Pa}(X_6) \setminus \{X_3\} \cup \mathrm{Pa}(X_3) = \{X_1, X_2, X_4, X_5\}.$$
If we fail to include a node that may be observable, then we sacrifice information.

Figure 4.5: A Bayesian Network. If $X_4$ is instantiated and $X_3$ removed, then the path from $X_1$ to $X_6$ will be blocked.

Consider now the general case of forming a Probabilistic Bayesian Network on the variables $X_1, \ldots, X_n$, under that ordering, according to the method given in Section 2.2, and suppose we leave out the variable $X_k$. Recall that the boundary stratum of a node $X_i$ is defined as the set of nodes $B_i \subseteq \mathbf{U}_{(i)}$ such that $I(X_i; B_i; \mathbf{U}_{(i)} \setminus B_i)$, where $\mathbf{U}_{(i)} = \{X_1, \ldots, X_{i-1}\}$.
On forming the new network which does not include $X_k$ we then need to reassess these boundary strata. If we follow the same variable ordering as previously, then for each $X_i$ with $i < k$ the boundary stratum remains unchanged. The boundary strata also remain unchanged for those $X_i$ with $i > k$ for which $X_k$ is not in $B_i$. This is because, as $\mathbf{U}_{(i)} \setminus B_i = \{\mathbf{U}_{(i)} \setminus \{X_k \cup B_i\}\} \cup \{X_k\}$,
$$I(X_i; B_i; \mathbf{U}_{(i)} \setminus B_i) \;\equiv\; I(X_i; B_i; \{\mathbf{U}_{(i)} \setminus \{X_k \cup B_i\}\} \cup \{X_k\}),$$
which implies that $I(X_i; B_i; \mathbf{U}_{(i)} \setminus \{X_k \cup B_i\})$. This can be written as $I(X_i; B_i; \mathbf{U}'_{(i)} \setminus B_i)$, where $\mathbf{U}'_{(i)} = \mathbf{U}_{(i)} \setminus \{X_k\}$, and so the boundary stratum $B_i$ is unchanged.

For those $X_i$ such that $X_k$ is in $B_i$, the boundary strata must be reassessed. It is not possible simply to substitute $B_k$ for $X_k$ in $B_i$, as $B_i$ may have d-separated $X_i$ from some $X_l$ with $k < l < i$; this particular substitution holds only if $X_{k+1}, \ldots, X_{i-1}$ are all in $B_i$. For example, consider the network on the variables $X_1, \ldots, X_5$, formed in that variable ordering, with
$$B_1 = \emptyset, \quad B_2 = \{X_1\}, \quad B_3 = \{X_2\}, \quad B_4 = \{X_3\}, \quad B_5 = \{X_2, X_3\}.$$
The resulting network is shown in Figure 4.6 a). If we delete $X_3$, say, the assertion that we may replace $X_3$ in $B_5$ by $B_3$, and $X_3$ in $B_4$ by $B_3$, to give $B'_5 = \{X_2\}$ and $B'_4 = \{X_2\}$, is not valid. This would be equivalent to saying
$$I(X_5; \{X_2, X_3\}; \{X_1, X_4\}) \ \text{and}\ I(X_3; X_2; X_1) \implies I(X_5; X_2; \{X_1, X_4\}),$$
which is not true. In Figure 4.6 b) we can see the additional independence relation this creates, $I(X_5; X_2; X_4)$, which does not hold in Figure 4.6 a). As there are no nodes between $X_3$ and $X_4$ in the ordering, we can deduce that $B'_4 = B_3 = \{X_2\}$. However, as $X_5$ comes after $X_3$ in the node ordering and $X_3$ is in $B_5$, we need to assess $B'_5$ separately, so that $B'_5$ is the minimal set satisfying $I(X_5; B'_5; \{X_1, X_2, X_4\} \setminus B'_5)$. In this case we should define $B'_5 = \{X_2, X_4\}$.

Figure 4.6: The network formed from the node ordering $X_1, \ldots, X_5$, with boundary strata as given: a) the original network; b) the network obtained by naively substituting $B_3$ for $X_3$.

This analysis shows that if we delete 'sub-networks', or sub-trees, the structure of the higher parts of the network remains unchanged. Hence, for a network initially formed on some ordering of the variables, it is desirable to begin by eliminating nodes towards the end of this ordering, to minimise the number of boundary strata that need to be reassessed.

4.3 Simplifying the Structure

In this section we assume that we are given a network which we want to simplify in order to minimise storage space and increase efficiency. Our goal is to form a network which is as simple as possible, but which still provides an adequate approximation, at the hypothesis nodes, to the distribution of the original network. To do this directly, each time a structural simplification is made we would have to search through all possible values of the conditional probabilities in order to find those which give the best approximation. Given that even for a network of moderate size there are many interactions to consider, for example the considerations involved in deleting a node in Section 4.2.3, we conclude that this is too computationally intensive, and we instead take a different approach. In essence, we aim to find the simplest possible network whilst keeping the distribution over the hypothesis nodes within a set distance of the original distribution. That is, we wish to find a network structure $S$ of minimal complexity that, given that appropriate conditional probabilities are defined, results in a distribution satisfying some constraint on its distance from the original distribution.
Given a node $X_i$, we have seen that the size of the conditional probability table required will depend on the number of parents of $X_i$ and the number of states of $X_i$ and its parents. Here we denote the number of entries in the conditional probability table at $X_i$ by
$$Sp(X_i) = (|X_i| - 1)\prod_{X_k \in \mathrm{Pa}(X_i)} |X_k|,$$
where again $|X_i|$ denotes the number of states of $X_i$. Note that the complexity of the network depends on the number of nodes, the number of links and the number of states of each variable. We use as a measure of the complexity of a network $S$ [22]
$$\mathrm{Size}(S) = \sum_{X_i \in U} Sp(X_i).$$
This is linear in the number of nodes, and $Sp(X_i)$ is exponential in the number of parents of $X_i$.

Suppose we denote the distribution resulting from a simplification as $P'$. Then, if we let $\mathbf{X}_h$ be the set of hypothesis nodes and $E$ the set of evidence nodes, we could consider the problem
$$\min \mathrm{Size}(S) \quad \text{s.t.} \quad \mathrm{dist}(P(\mathbf{X}_h\mid e), P'(\mathbf{X}_h\mid e)) \le t, \;\; \forall\, e \in E. \qquad (4.1)$$
As (4.1) is quite strict, given a prior distribution for $E$ we may instead wish to minimise $\mathrm{Size}(S)$ subject to a constraint on
$$\sum_{e} \pi_e\, \mathrm{dist}(P(\mathbf{X}_h\mid e), P'(\mathbf{X}_h\mid e)), \qquad (4.2)$$
where $\pi_e$ is the prior probability (or an estimate of the probability) that $E$ is in configuration $e$. This may allow more scope for simplification, though it also comes with added risk, as it is possible that we may observe an instantiation of $E$ that occurs with low probability but for which the distance between $P$ and $P'$ is large.

An additional constraint on simplification is that we include the hypothesis nodes and (in most cases) the information nodes. Our problem can then be posed as a minimum cost network design problem subject to these constraints, where the cost is associated with the complexity of the network. This is similar to the Steiner Tree problem [21], which has been studied extensively. However, the distinguishing feature of our problem is that fixed weights or costs cannot be assigned to the edges or to the nodes. For example, a suitable value for the node weight would be $Sp(X_i)$, but this is a function of the structure of the network through the parents of $X_i$. The presence of the constraint adds another layer of difficulty to finding an efficient method of solution. To avoid a constrained problem, it is suggested, for example in [22], that the term describing adequacy of fit be built into the objective function. This yields an acceptance measure
$$A(P, P') = \mathrm{Size}(S') + \lambda\, \mathrm{dist}(P, P'), \qquad (4.3)$$
where $S'$ is the simplified structure and $\lambda$ some constant. This can then be minimised using standard methods. Here the tuning parameter $\lambda$ is chosen by the user: large values of $\lambda$ correspond to more heavily penalising models with a large distance. However, a careless choice for this parameter could result in models with large size or an unacceptable discrepancy in distance being `accepted.' Notice that if we base our model selection on the acceptance measure, then we need not consider as an improvement any network $S'$ such that $\mathrm{Size}(S') \ge \mathrm{Size}(S)$. This follows as
$$A(S') = \mathrm{Size}(S') + \lambda\,\mathrm{dist}(P', P) \ge \mathrm{Size}(S') + \lambda\,\mathrm{dist}(P, P) \ge \mathrm{Size}(S) + \lambda\,\mathrm{dist}(P, P) = A(S).$$
It was shown in the example concerning Figure 4.5 above that if we want to simplify a given network, as well as considering deletions of links in the network, we must also consider the insertion of links not originally included. An exhaustive search could be done by considering $\mathrm{Size}(S')$ and $\mathrm{dist}(P, P')$ for each possible structure on the $n$ nodes and choosing that structure for which this is minimised.
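Evaluating $Sp(X_i)$ and $\mathrm{Size}(S)$ for a candidate structure is straightforward. The following is a minimal sketch, not from the original text, in which the structure is assumed to be stored as a dictionary of parent lists together with the number of states of each variable.

```python
# Illustrative sketch only: the dictionary layout is an assumed representation.

def sp(node, parents, n_states):
    """Number of free entries in the conditional probability table at `node`:
    Sp(X_i) = (|X_i| - 1) * product of the parents' state counts."""
    size = n_states[node] - 1
    for p in parents.get(node, []):
        size *= n_states[p]
    return size

def network_size(parents, n_states):
    """Size(S): the sum of Sp(X_i) over all nodes of the structure."""
    return sum(sp(x, parents, n_states) for x in n_states)

# Example: X1 -> X3 <- X2, all variables binary.
parents = {"X3": ["X1", "X2"]}
n_states = {"X1": 2, "X2": 2, "X3": 2}
print(network_size(parents, n_states))  # (2-1) + (2-1) + (2-1)*2*2 = 6
```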
However, the number of possible directed acyclic graphs on $n$ nodes is given in [7] by the recursive function
$$f(n) = \sum_{i=1}^{n} (-1)^{i+1}\binom{n}{i}\, 2^{i(n-i)}\, f(n-i).$$
Thus there are 25 possible structures on three nodes, approximately 29 000 on five nodes and approximately $4.2\times 10^{18}$ structures on just ten nodes. Hence exhaustive enumeration of all network structures is clearly not feasible, and other methods have to be considered.

One way of constraining the search space, as discussed in [22] for example, is to consider only those links within some structural hierarchy. For example, if some variables can be considered causes of disease and others symptoms, we can impose the constraints shown graphically in Figure 4.7, where links are allowed only within a cluster or downwards in the hierarchy. Another way of constraining the search is to give the nodes a numeric labelling and specify that a link from $X_i$ to $X_j$ may exist only if $i < j$. However, even with such constraints imposed, for networks of even moderate size, heuristics are needed.

Figure 4.7: A structural hierarchy. For example, $C$ could represent a cluster of variables representing causes, $D$ contain disease nodes and $S$ contain nodes representing possible symptoms.

For any of the model selection criteria discussed earlier, after each simplification we must be able to calculate the network size, update the conditional probabilities and then evaluate the chosen distance measure. As we may make many changes in searching for the optimal simplification, it is important to be able to perform these calculations as efficiently as possible.

Consider first the task of evaluating network size. At each stage assign to every link $l$ the weight $w_l$, where $w_l$ represents the amount by which the size of the network will be reduced by the deletion of this link. Let $S'$ be the network obtained from $S$ by the deletion of a link, and consider the weight on the link from $X_i$ to $X_j$, say,
$$w_l = \sum_{X_k\in S} Sp(X_k) - \sum_{X_k\in S'} Sp'(X_k) = Sp(X_j) - Sp'(X_j).$$
Here $Sp'(X_j)$ represents the value $Sp(X_j)$ in the network $S'$. In the case where $X_j$ is a leaf node with sole parent $X_i$ in $S$, on simplifying the network by removal of the link from $X_i$ to $X_j$ we can also remove the node $X_j$, as it is disconnected from the network. If $X_j$ has sole parent $X_i$ but is a parent of some other variable, we retain $X_j$. If $X_j$ has parents other than $X_i$, $Sp'(X_j) = Sp(X_j)/|X_i|$. Hence
$$w_l = \begin{cases} Sp(X_j) & \text{if } \mathrm{Pa}(X_j) = \{X_i\} \text{ and } X_j \text{ has no children}\\ Sp(X_j) - |X_j| & \text{if } \mathrm{Pa}(X_j) = \{X_i\} \text{ and } X_j \text{ has at least one child}\\ Sp(X_j)\left(1 - \tfrac{1}{|X_i|}\right) & \text{otherwise.}\end{cases}$$
After having removed the link from $X_i$ to $X_j$ we must then update the weights. The only links on which the weights must be altered are those from the remaining variables in $\mathrm{Pa}(X_j)$ to $X_j$. In this case the weights will change to $w_l/|X_i|$.

To carry out model selection we must choose a measure of distance. A reasonable distance measure, additive over the hypothesis nodes, is
$$\mathrm{dist}(P(\mathbf{X}_h), P'(\mathbf{X}_h)) = \sum_{X_i\in\mathbf{X}_h} \mathrm{dist}_K(P(X_i), P'(X_i)),$$
where we have used the Kullback-Leibler divergence. This allows for an efficient method of calculating the distance at each simplification. Based on (4.2),
$$\mathrm{dist}(P, P') = \sum_e \pi_e \sum_{X_i\in\mathbf{X}_h} \mathrm{dist}_K(P(X_i\mid e), P'(X_i\mid e)). \qquad (4.4)$$
Assuming we have the probabilities $P(X_i\mid e)$ from the original network, we can calculate (4.4) efficiently by the use of a scoring function. Define the score of a Bayesian Network $S'$ to be
$$\sigma(S'\mid e) = -\sum_{X_i\in\mathbf{X}_h}\sum_{x_i} P(X_i = x_i\mid e)\log P'(X_i = x_i\mid e),$$
and
$$\tilde{\sigma}' = \sum_{e\in E} \pi_e\, \sigma(S'\mid e),$$
with $\tilde{\sigma}$ denoting the corresponding quantity for the original network.
Then we have that
$$\tilde{\sigma}' - \tilde{\sigma} = \sum_{e\in E}\pi_e\sum_{X_i\in\mathbf{X}_h}\sum_{x_i} P(X_i = x_i\mid e)\big(\log P(X_i = x_i\mid e) - \log P'(X_i = x_i\mid e)\big) = \sum_{e\in E}\pi_e\sum_{X_i\in\mathbf{X}_h}\mathrm{dist}_K\big(P(X_i\mid e), P'(X_i\mid e)\big) = \mathrm{dist}(P, P').$$

To obtain the required probabilities $P'(X_i\mid e)$, we must first update the conditional probabilities of our new network, as described below. For each instantiation of the evidence variables we can then propagate this evidence through the network to obtain the required probabilities at the hypothesis nodes.

If in simplifying the network we remove the link from $X_i$ to $X_j$, say, then we require the conditional probabilities $P(X_j\mid \mathrm{Pa}'(X_j)) = P(X_j\mid \mathrm{Pa}(X_j)\setminus\{X_i\})$. Consider the probability distribution for $X_j$ given a particular configuration of $\mathrm{Pa}'(X_j)$, that is, one row of the conditional probability table. Then the required probabilities are, for all $x_j\in X_j$,
$$P(X_j = x_j\mid pa'(X_j)) = \frac{P(x_j, pa'(X_j))}{P(pa'(X_j))} = \sum_{x_i}\frac{P(x_j, pa'(X_j), X_i = x_i)}{P(pa'(X_j))} = \sum_{x_i}\frac{P(x_j\mid pa'(X_j), X_i = x_i)\,P(pa'(X_j), X_i = x_i)}{P(pa'(X_j))} = \alpha\sum_{x_i} P(x_j\mid pa'(X_j), X_i = x_i)\,P(pa'(X_j), X_i = x_i),$$
where $\alpha$ is the normalisation constant for that particular configuration of $\mathrm{Pa}'(X_j)$. As $\mathrm{Pa}(X_j) = \mathrm{Pa}'(X_j)\cup\{X_i\}$, the probabilities can easily be obtained from the network prior to the removal of the link by fixing the states of the variables in $\mathrm{Pa}'(X_j)$ and summing over the probabilities for each state of $X_i$.

Given that we now have efficient methods for calculating the size and distance required to evaluate the effect of a simplification, we can combine these ideas into a process for network simplification. If we were to use the acceptance measure, we could search for the network which minimises this by using any standard search method. We now present a heuristic for the alternative formulation,
$$\text{minimise } \mathrm{Size}(S) \quad\text{subject to}\quad \mathrm{dist}(P, P') < t, \qquad (4.5)$$
which is based on a greedy method. The required input is a Bayesian Network $N$, with hypothesis nodes $\mathbf{X}_h$, evidence nodes $E$ and also a threshold $t$ on the distance. We denote the set of edges in the network $N^k$ at step $k$ by $L^k$. The heuristic returns a simplified network $N_{best}$.

1   Put $L^1$ equal to the set of edges in the initial network $N^1$.
2   Calculate weights $w_l$ for all edges $l$ in $L^1$.
3   Put $k = 1$.
4   While $L^k \neq \emptyset$, do
5       sort $w_l$ for all edges $l\in L^k$
6       $l_{max} \leftarrow$ edge of maximum weight
7       $L^{k+1} \leftarrow L^k\setminus\{l_{max}\}$
8       $N' \leftarrow N^k\setminus\{l_{max}\}$
9       $\mathrm{Size}(N') \leftarrow \mathrm{Size}(N^k) - w_{l_{max}}$
10      update weights and probabilities on remaining links
11      If $\mathrm{Size}(N') < \mathrm{Size}(N)$,
12          calculate $\mathrm{dist}(P; P\mid N')$
13          If $\mathrm{dist}(P; P\mid N') < t$
14              $N^{k+1} \leftarrow N'$
15              $N_{best} \leftarrow N^{k+1}$
16          else $N^{k+1} \leftarrow N^k$
17      else $N^{k+1} \leftarrow N'$
18      $k \leftarrow k + 1$.

Note that the initial network $N^1$ does not have to be the same as the network we wish to simplify, $N$, and in general will be a complete network. This allows links not in the original network to be included in the simplified network, which may result in a better approximation than simply deleting links from $N$. If time permits, the heuristic should be run for as many complete networks as possible, and some thought should be given to which structures might be likely to result in a good approximation. The heuristic begins with the maximal number of allowable links and at each stage removes the link whose deletion results in the largest decrease in the value of the objective function (4.5), that is, the Size. If the Kullback-Leibler divergence is used, the distance on line 12, required to assess the feasibility of deleting $l_{max}$, may be calculated efficiently by the method presented earlier.
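The conditional-probability update used on line 10 of the heuristic (summing the old table over the states of the removed parent, weighted by the joint probability of the old parent configurations, then renormalising each row) is easy to express with array operations. The following sketch is illustrative only and not from the original text; the array layout and function name are assumptions.

```python
import numpy as np

def drop_parent(cpt, parent_joint, axis):
    """Remove one parent from a conditional probability table.

    cpt:          P(X_j | Pa(X_j)), axes ordered (parents..., X_j).
    parent_joint: P(Pa(X_j)), joint distribution of the old parent set,
                  with the same parent axis order as `cpt`.
    axis:         index of the parent axis for the removed parent X_i.
    """
    joint = cpt * parent_joint[..., np.newaxis]            # P(x_j, pa(X_j))
    reduced = joint.sum(axis=axis)                         # sum over states of X_i
    return reduced / reduced.sum(axis=-1, keepdims=True)   # renormalise each row

# Example: X_j binary with parents (X_i, X_m), both binary; drop X_i (axis 0).
cpt = np.array([[[0.9, 0.1], [0.6, 0.4]],
                [[0.3, 0.7], [0.2, 0.8]]])                 # shape (X_i, X_m, X_j)
parent_joint = np.array([[0.2, 0.3], [0.1, 0.4]])          # P(X_i, X_m)
print(drop_parent(cpt, parent_joint, axis=0))              # new table P(X_j | X_m)
```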
If a deletion is not feasible, the corresponding link is removed from the pool of remaining links, and this is repeated until there are no more links which may be feasibly deleted. At each stage the number of elements in $L^k$ is reduced by one; hence, assuming the network has $n$ nodes, the algorithm will loop at most $n(n-1)/2$ times. Note that there is no constraint that specifies the evidence nodes and hypothesis nodes should be included in the final network. However, the constraint on distance should be tight enough so that any relevant evidence nodes are included, and that the hypothesis nodes remain connected to the network.

Many of the computational difficulties encountered above are due to the fact that the number of possible structures on even a reasonable number of nodes is huge. In Chapter 5 we discuss methods that have been developed for learning the structure of a Bayesian Network from data, and the techniques, approximations and assumptions which are introduced to overcome such difficulties.

4.4 The Value of Evidence

In this section we assume that the set of evidence nodes $E$ in our Bayesian Network is non-empty. Ideally, when making inference at the hypothesis nodes we would base this inference on as much information as possible, that is, we would observe the state of each information variable. However, often this observation may be associated with a cost. For example, consider the information variable Test Result shown as a child of Glandular Fever in Figure 4.8. Although Test Result is an evidence node, it may cost $40 to have the test performed. Evidence may be observed at Thermometer Reading for a very small cost. Note that we make Thermometer Reading a child of Fever so that any discrepancies between the thermometer reading and actual fever can be accounted for, and that similarly Test Result is a child of Glandular Fever in order to allow possible false positives or false negatives to be modelled.

Figure 4.8: A Bayesian Network with hypothesis variable Glandular Fever and information variables Test Result, Thermometer Reading and Tiredness.

In Section 4.4.1 we present a method to find the optimal subset of evidence nodes at which to observe evidence, given a fixed budget. In Section 4.4.2 we then show that the optimal set of evidence changes with structure, or more precisely with the distribution over the hypothesis nodes, and hence should be continually revised.

4.4.1 Selecting a Set of Evidence Nodes

Our objective is to maximise the amount of evidence reaching the hypothesis nodes, subject to a budget constraint. That is,
$$\max_{\tilde{\mathbf{E}}} \; I(\mathbf{X}_h; \tilde{\mathbf{E}}) \quad\text{s.t.}\quad \sum_{X_i\in\tilde{\mathbf{E}}} c(X_i) \le C,$$
where $\tilde{\mathbf{E}}$ is a subset of the possible evidence nodes, $c(X_i)$ is the cost associated with observing evidence at node $X_i$ and $C$ is some predetermined upper bound on the total cost (the budget).

As the instantiation of some variables may affect the entropy reducing capacity of evidence at other nodes, we must consider separately all feasible sets of evidence $\tilde{\mathbf{E}}$ and determine which set maximises $I(\mathbf{X}_h; \tilde{\mathbf{E}})$. Suppose we have a set of evidence nodes $\tilde{\mathbf{E}} = \{X_{n-m+1}, \ldots, X_n\}$, say, and we can then observe evidence at an additional node $X_{n-m}$. Denote the set $\{X_{n-m}, X_{n-m+1}, \ldots, X_n\}$ by $\tilde{\mathbf{E}}^+$. Then we have that
$$I(\mathbf{X}_h; \tilde{\mathbf{E}}^+) = H(\mathbf{X}_h) - H(\mathbf{X}_h\mid\tilde{\mathbf{E}}^+), \quad\text{from (4.5)},$$
$$= H(\mathbf{X}_h) - H(\mathbf{X}_h\mid\tilde{\mathbf{E}}) + H(\mathbf{X}_h\mid\tilde{\mathbf{E}}) - H(\mathbf{X}_h\mid\tilde{\mathbf{E}}^+)$$
$$= I(\mathbf{X}_h; \tilde{\mathbf{E}}) + I(\mathbf{X}_h; X_{n-m}\mid\tilde{\mathbf{E}}), \quad\text{from (4.6)}.$$
Thus, in order to calculate the change in information $\mathbf{X}_h$ receives when a node $X_{n-m}$ is
added to the set of evidence, we need only calculate $I(\mathbf{X}_h; X_{n-m}\mid\tilde{\mathbf{E}})$. However, this will still involve a summation over all states of the variables $X_{n-m}, \ldots, X_n$.

In order to find the optimal solution to the problem above, one would have to perform an exhaustive search, that is, calculate the value of the information for each of the possible sets of evidence nodes with total cost at most $C$. The optimal set of evidence would then be that which maximised $I(\mathbf{X}_h; \tilde{\mathbf{E}})$.

We now propose a greedy search technique for the problem, where we begin with no nodes from which to collect evidence, and at each stage add the node which is affordable and has the greatest entropy reducing capacity at the hypothesis nodes. Although the solution will not always be optimal, in the case of there being many evidence nodes it is computationally efficient.

Let the set of nodes chosen prior to step $k$ at which to observe evidence be denoted $\tilde{\mathbf{E}}^{k-1}$, let $W^k \subseteq E$ be the set of evidence nodes which are candidates for selection at step $k$, and let $F^k \subseteq W^k$ be the set of all feasible nodes at step $k$, that is, nodes which can be included in $\tilde{\mathbf{E}}^k$ without going over budget. $C^k = \sum_{X_i\in\tilde{\mathbf{E}}^k} c(X_i)$ is the cost of collecting the evidence $\tilde{\mathbf{E}}^k$ at the end of step $k$.

1   $\tilde{\mathbf{E}}^0 \leftarrow \emptyset$
2   $W^0 \leftarrow E$, the set of all evidence nodes
3   $C^0 \leftarrow 0$
4   $k \leftarrow 1$
5   for each $X_i \in W^{k-1}$
6       if $c(X_i) \le C - C^{k-1}$
7           $F^k \leftarrow F^k \cup \{X_i\}$
8   if $F^k = \emptyset$
9       return $\{\tilde{\mathbf{E}} \leftarrow \tilde{\mathbf{E}}^{k-1}\}$ and stop.
10  else
11      $X^k \leftarrow \operatorname{argmax}_{X_i\in F^k} I(\mathbf{X}_h; X_i\mid\tilde{\mathbf{E}}^{k-1})$
12      $\tilde{\mathbf{E}}^k \leftarrow \tilde{\mathbf{E}}^{k-1}\cup\{X^k\}$
13      $C^k \leftarrow C^{k-1} + c(X^k)$
14      $W^k \leftarrow F^k\setminus\{X^k\}$
15      $k \leftarrow k+1$
16      go to line 5.

If $C$ is large compared to the cost of each test, it may require fewer steps to begin a similar search starting with $E$ and successively removing nodes with the lowest score. If there exist evidence nodes which, when instantiated, would d-separate other nodes in $W^k$ from $\mathbf{X}_h$, then the mutual information of these d-separated nodes to $\mathbf{X}_h$ would be zero, and so they could be removed from $W^k$. In many cases this could substantially increase the speed of the search, as it would avoid many unnecessary calculations of the mutual information.

4.4.2 Updating the Set of Evidence Nodes

Suppose now that the quantity we are trying to optimise is the information which the evidence provides about a hypothesis node, $I(X_1; X_n, \ldots, X_{n-m})$, where $X_1$ is a root node and a hypothesis node and $X_{n-m}, \ldots, X_n$ are information nodes. We know
$$I(X_1; X_n, \ldots, X_{n-m}) = I(X_n, \ldots, X_{n-m}; X_1) = H(X_n, \ldots, X_{n-m}) - H(X_n, \ldots, X_{n-m}\mid X_1), \qquad (4.1)$$
where, recall from Section 4.1,
$$H(X_n, \ldots, X_{n-m}) = -\sum_{x_n,\ldots,x_{n-m}} P(x_n, \ldots, x_{n-m})\log_2 P(x_n, \ldots, x_{n-m}),$$
and also
$$P(x_n, \ldots, x_{n-m}) = \sum_{x_1,\ldots,x_{n-m-1}} P(x_1, \ldots, x_n) = \sum_{x_1,\ldots,x_{n-m-1}} P(x_n\mid pa(X_n))\cdots P(x_2\mid pa(X_2))P(x_1).$$
As the conditional probabilities are fixed and hence independent of the observed evidence, this last expression is a function of $P(x_1)$. Hence $H(X_n, \ldots, X_{n-m})$ is a function of $P(X_1)$. Similarly, $H(X_n, \ldots, X_{n-m}\mid X_1)$ is a function of
$$P(x_n, \ldots, x_{n-m}\mid x_1) = \frac{P(x_n, \ldots, x_{n-m}, x_1)}{P(x_1)}, \quad\text{for every } (x_1, \ldots, x_n),$$
$$= \sum_{x_2,\ldots,x_{n-m-1}} \frac{P(x_1, \ldots, x_n)}{P(x_1)} = \sum_{x_2,\ldots,x_{n-m-1}} \frac{P(x_n\mid pa(X_n))\cdots P(x_2\mid pa(X_2))P(x_1)}{P(x_1)} = \sum_{x_2,\ldots,x_{n-m-1}} P(x_n\mid pa(X_n))\cdots P(x_2\mid pa(X_2)).$$
Hence, given the conditional probability table for each node, these probabilities are fixed. Therefore
$$H(X_n, \ldots, X_{n-m}\mid X_1) = -\sum_{x_1} P(x_1)\sum_{x_n,\ldots,x_{n-m}} P(x_n, \ldots, x_{n-m}\mid x_1)\log_2 P(x_n, \ldots, x_{n-m}\mid x_1)$$
is a function of $P(X_1)$ and so, from (4.1), $I(X_1; X_n, \ldots, X_{n-m})$ is a function of $P(X_1)$.
This implies that the optimal set of evidence may change as the distribution of $X_1$ is updated. For example, if at time $t = 0$ we have a prior distribution for $X_1$ and observe some evidence, at $t = 1$ we update the belief distribution of $X_1$ to $P(X_1\mid e)$, and this becomes the prior belief of $X_1$ at the next time step. Hence we should look for an optimal set of evidence nodes each time the distribution of $X_1$ changes `substantially.'

Note that $I(X_1; X_n, \ldots, X_{n-m})$ is a concave function of $P(X_1)$ and there exists a value for $P(X_1)$ which results in a global maximum. Das, in [10], notes that as $P(X_1)$ moves away from this optimal value, one must change the specifications of the network in order to maintain optimality. He goes on to say that this can be done in one of two ways, either

1. change the set of evidence nodes so that the information obtained has a greater degree of relevance to the present situation, or

2. change the intermediate nodes and therefore the links in the network so that "the channel of evidence propagation is more relevant to the prevailing situation."

By making the changes suggested in point two, it seems that it is possible to reduce the uncertainty in $X_1$ by more than is justified by the situation. This may result in removing genuine sources of uncertainty in order that the information obtained has a greater effect on reducing the uncertainty of $X_1$. Hence to ensure a decision is not biased, but that the available evidence is maximised, it seems best to continually monitor the optimal set of evidence and update this when necessary.

If possible we should observe all evidence nodes at each stage, though in practice, for reasons of cost or practicality, we are restricted to choosing a subset of nodes. For example, if the decision to be made is some kind of diagnosis at the hypothesis node, then we may be able to reduce the uncertainty in this decision by performing a series of tests, that is, observing evidence. After the information from a particular test has been incorporated and the distribution at the hypothesis node updated, there may be little value in performing the same test again. Hence in this case the evidence obtained by performing other tests may become more valuable and relevant, and we update the set of evidence nodes at the next stage accordingly. This shows that point one above is intuitively reasonable, and agrees with general practice.

Chapter 5

Learning Bayesian Networks from Data

5.1 Introduction

In previous sections we have seen how to form a Bayesian Network from causal or subjective knowledge, how information is propagated through the network and why Bayesian Networks are an efficient means of storing a probability distribution. Often it happens that the domain one wishes to make inference about is not well understood, or that one may be confident in identifying the independencies of the domain (and hence the network structure) but be less sure about the numerical specification of the conditional probabilities. It has also been shown that the opinions of experts, whose knowledge may be used to form a network, are rarely very accurate [18]. In this chapter we will assume we have access to data believed to have been generated from the underlying probability distribution. We can use this data alone to form a Bayesian Network, or we may combine the data with expert or prior knowledge. After having formed a Bayesian Network in this way, we have a model of the domain which can be used to assist decision processes or for other inferential processes.
The structure of the network gives a graphical representation of the relationships between the variables in the domain. In this chapter we discuss the use of data to learn a Bayesian Network. In Section 5.2 we introduce general considerations and the objectives of learning, then in Section 5.3 give an overview of the multitude of approaches to this problem that have evolved from different schools of research. In Sections 5.4 and 5.5 we look at two of the more common and widely accepted methods, namely the Information-Theoretic approach, which was one of the pioneering methods, and then Bayesian methods, which have been the focus of much recent research in this area. Finally, in Section 5.6 we look at an alternative approach based on the bootstrap.

5.2 Considerations of learning

Our goal when learning Bayesian Networks from data may be to learn the structure of the network, the parameters of the network (conditional probabilities) or both the structure and the parameters. The structure depends on the independence relations between the variables of the domain, and the parameter specification is dependent on the structure. Generally there is no one model that stands out as the `correct' network based on the data. Given that we have a random sample of cases there will ultimately be variation between samples, referred to as sample variation. A model is considered to be an acceptable fit to the data if it differs from the observed data by an amount which can be explained by sample variation [30]. There should be enough nodes and links in the model to represent the true underlying distribution, but not so many that the noise (which arises from the sample variation) is modelled. A complete graph would provide the best fit to any data set, though a simpler model with fewer links could provide a better representation of the true underlying dependence structure and would generalise better. The idea of generalisation relates to the ability of a model to predict an independent test observation, that is, an observation not used to fit the model. Although a complete graph will provide the best fit to the data, a model with good generalisation ability will more accurately model future cases. This is important if we form a Bayesian Network based on past cases and wish to enter information and make decisions using this network in the future.

One of the reasons many methodologies for this problem exist is the necessity to treat learning problems differently for what can be broadly classed as small, medium and large sample problems. If a reasonable approximation is to be found from a small sample, the method must rely more heavily on prior or expert knowledge, with the data used to tune this or to test for inconsistencies. When we have large amounts of data, prior knowledge is less significant. Most instances fall into the medium sample category, which lies between these two extremes.

The number of possible structures and the number of parameters to be estimated increase exponentially with the number of nodes in the network. Hence, computationally, model selection may require large amounts of time and storage space. The larger the domain we model, the more possible structures there are and the more parameters are to be learnt, so that we require more data for similar levels of confidence. It would seem desirable that, as the sample size grows large, learning methods should identify a model that is closer to the true distribution, where closeness may be measured by some dissimilarity measure.
However, given large amounts of data and possibly many variables, the computational complexity increases. In general, learning Bayesian Networks is an NP-complete problem with respect to the number of variables [18] and exhaustive enumeration of all possible models is not possible, hence many approximate methods and heuristics have been developed. A common measure of complexity was introduced in Section 4.3 and was given to be the Size of the network. In that section our primary aim was to reduce the complexity of an existing model, whereas here we aim to learn good models and then simplify them if needed. Over-fitting is not often a problem in learning Bayesian Networks, as a result of the constraints imposed on complexity a priori in order to constrain the search space. It is more often the case that we risk oversimplification.

In what follows we assume that we have a database $D$ containing $N$ observations $x_1, \ldots, x_N$. In any observation $x_l = (x_{l1}, \ldots, x_{ln})$, if there is no information recorded on a variable this is referred to as a missing value and the database is termed incomplete. As missing values complicate the analysis they are treated in Section 5.5.5 separately from the case of complete data (no missing values), and we will otherwise assume our data to be complete. We also continue to assume the variables are discrete. This assumption obviously restricts the models we are able to learn, but as the required approach can differ substantially for continuous data it will not be discussed.

5.3 Methods of Learning Bayesian Networks

5.3.1 Scoring Functions and Search Methods

Scoring functions are used extensively for learning and model selection. If the model space is small, the practitioner can look at a number of the `best' models and choose that which seems most appropriate, generally favouring the simpler models. However, when the search space is large, as is the case for most network problems, this procedure needs to be automated. This can be done by defining a function which assigns each model some score. We can then employ a search procedure to find a model with a high score. Scoring functions generally consist of two parts, a measure of fit and a complexity measure, and often take the form
$$(\text{measure of fit}) - \lambda\,(\text{measure of complexity}), \qquad (5.1)$$
where $\lambda$ is a user-defined parameter which determines the relative influence of the two terms in the above expression. If we choose $\lambda$ to be very large this will result in a simple model, which is probably too simple to adequately model the process. If $\lambda$ is very small this will result in choosing the model which best fits the data, and is likely to result in over-fitting. For example, the Maximum Likelihood approach introduced in Section 5.3.2 tends to result in over-fitting, and so often a penalised likelihood score is used which penalises overly complex models. The acceptance measure (4.3) is another example of such a scoring function. Often a model selection procedure can be quite sensitive to the choice of tuning parameter $\lambda$, and care must be taken in determining its value. As mentioned earlier, when learning Bayesian Networks the complexity is often constrained a priori, and so in this case the scoring function will consist solely of a measure of fit.

Almost all scoring functions used in learning Bayesian Networks are decomposable, that is, they can be written as the product of $n$ factors, each of which is a function of only a node and its parents.
A scoring function $\delta(\cdot)$ is decomposable if it can be written in the form
$$\delta(D, S) = \prod_{i=1}^{n} s(x_i\mid pa(X_i)),$$
where $\delta$ is a function of the observed data $D$ and the structure $S$, and $s(\cdot)$ is some specified function. Often the log score is used, so that a decomposable function can then be written as a summation.

5.3.1.1 Search Methods

Once a scoring function has been defined, the model selection procedure is to search over the model space for the model with the best score. Any network structure can be modified by either adding or deleting a link, subject to the constraint that there be no directed cycles. If the scoring function is decomposable, to obtain the score of the new network from the old we need only evaluate $s(x_i\mid pa(X_i))$, where $X_i$ is the node to which the amended link would point. For example, if we let $\Delta_e$ denote the change in $\log\delta(D, S)$ when adding or deleting edge $e$, a common search procedure is to evaluate $\Delta_e$ for all feasible changes at every stage and make that change corresponding to the largest $\Delta_e$. If no $\Delta_e$ is greater than zero we stop, as the score cannot be improved at the next step. A problem with this method is becoming stuck at a local maximum, so in practice techniques such as multiple starts, perturbation of the structure and simulated annealing are used to overcome this.

For the case where every node can have at most one parent, polynomial-time algorithms can be used to find the highest scoring network, based on methods for finding a maximum weight spanning tree. The case where each node has $k$ parents, $k > 1$, is NP-hard with respect to $k$ [7]. It has been shown that greedy algorithms and local search procedures can perform well [1], and procedures such as branch and bound have also been applied successfully.

A commonly used heuristic is known as K2, first proposed in [7]. It begins by assuming a node has no parents, and at each stage adds that parent whose addition increases the probability of the structure by the greatest amount, as measured by the likelihood of the observed data. The score used is a function of each node, its parents and the observed data, such that maximising the score is equivalent to maximising the likelihood. In this case the complexity is constrained by assigning to each node an upper bound on the number of parents it may have.

5.3.2 Maximum Likelihood

Maximum Likelihood is used extensively in the field of statistics. Given a fixed model with structure $S$ and conditional probabilities $\theta$, the sample likelihood is defined as
$$L(S, \theta\mid\mathbf{x}) = \prod_{l=1}^{N} P(x_l\mid S, \theta), \qquad (5.2)$$
where we have assumed the observations $x_1, \ldots, x_N$ are independent. If we were to search over all structures $S$, using the principle of maximum likelihood the optimal value for $(S, \theta)$ would be that for which the likelihood (5.2) is maximised. The maximum likelihood estimators $\hat{S}$ and $\hat{\theta}$ can be considered the most likely values of the corresponding parameters for the model considered, in the sense that the observed data is most likely under the model defined by $\hat{S}$ and $\hat{\theta}$.

As an example, consider the fixed structure in Figure 5.1. In this case each observation $x_l$ is a vector $(t_l, e_l, n_l)$ and the parameters are $\theta\mid S = (\theta_T, \theta_E, \theta_N)$ where, for example, $\theta_E$ is the conditional probability table for node $E$.

Figure 5.1: A Bayesian Network with nodes Taxable Income ($T$), Number of Dependents ($N$) and Expenditure on Leisure ($E$), for which we wish to learn the parameters by the method of maximum likelihood.

Then the likelihood for the parameters $\theta$ is
$$L(\theta\mid\mathbf{x}, S) = \prod_{l=1}^{N} P(t_l, e_l, n_l\mid\theta_T, \theta_E, \theta_N) = \prod_{l=1}^{N} P(t_l\mid\theta_T)P(e_l\mid\theta_E)P(n_l\mid\theta_N) = \Big\{\prod_{l=1}^{N} P(t_l\mid\theta_T)\Big\}\Big\{\prod_{l=1}^{N} P(e_l\mid\theta_E)\Big\}\Big\{\prod_{l=1}^{N} P(n_l\mid\theta_N)\Big\}.$$
The likelihood can be maximised by maximising each term in the braces independently, and so is decomposable. However, this decomposition only applies if there are no missing or hidden values [1]. Another way of considering the maximum likelihood approach is to see it as finding the structure $S$ whose likelihood over the parameters is greatest, that is,
$$\hat{S} = \operatorname*{argmax}_{S} \max_{\theta} P(\mathbf{x}\mid\theta, S), \qquad \hat{\theta} = \operatorname*{argmax}_{\theta} L(\theta\mid\hat{S}, \mathbf{x}).$$
The maximum likelihood approach is best for medium sized problems, as difficulties may arise when there is no data at certain configurations or when there is too much data. For example, if we need to estimate $P(E = e\mid T = t)$ for some states $e$ of $E$ and $t$ of $T$, and there are no observations with $T = t$, then the maximum likelihood estimate is undefined.

5.3.3 Hypothesis Testing

Many classical statistical approaches involve hypothesis testing in some form or another. In this context a hypothesis test may involve the hypotheses

H0: a link exists from node $X$ to node $Y$
Ha: no link exists from node $X$ to node $Y$

or

H0: $S_1$ is no better than $S_2$
Ha: $S_1$ is better than $S_2$,

for some structures $S_1$ and $S_2$. The likelihood ratio hypothesis test is based on the ratio of the likelihoods
$$\frac{L(\theta_1\mid\mathbf{x})}{L(\theta_2\mid\mathbf{x})}$$
for two distinct values of the parameters $\theta_1$ and $\theta_2$. If we condition on the structure of the network, then this determines the parameters to be learnt. The generalised likelihood ratio test is based on the ratio
$$\frac{\max_\theta L(\theta\mid S_2, \mathbf{x})}{\max_\theta L(\theta\mid S_1, \mathbf{x})}, \qquad (5.3)$$
or equivalently the difference in log likelihoods. Small values of (5.3) provide evidence against the hypothesis that $S_1$ is no better than $S_2$. If $S_1$ in the denominator of (5.3) is the complete structure¹, which will have the greatest likelihood, then we can say that, for all $S_2$, $S_2$ is nested in $S_1$. That is, all independencies implied by $S_2$ can be represented by $S_1$. The deviance is defined as twice the logarithm of this ratio.

¹ In this context a complete network is the saturated model.

The edge exclusion deviance for testing whether a single edge can be excluded from the saturated model is given in [36] to be
$$-N\log\big(1 - \mathrm{corr}_N(X_i, X_j\mid \mathbf{X}\setminus\{X_i, X_j\})^2\big),$$
where $\mathrm{corr}_N(X_i, X_j\mid\mathbf{X}\setminus\{X_i, X_j\})$ is the sample correlation coefficient of $X_i$ and $X_j$ given the remaining variables. This deviance has an asymptotic chi-square distribution with one degree of freedom. Hence, when $n$ is small, we can compute the p-values for all $\binom{n}{2}$ edge exclusion deviances from the complete network structure and drop any non-significant edges. This procedure can be supplemented by an iterative procedure which further evaluates the edge inclusion deviances of those edges left out of the model, to see if any should now be included. See [36] for details.

However, if this test were implemented at the 95% level of confidence, say, then we would expect that 5 out of 100 times we would delete a link where the true model would contain one. For a network with a reasonable number of nodes there are many links, and so repeatedly applying this test will result in a large number of errors. Bonferroni [12] suggested that, for an overall confidence level of $\alpha$ (referred to as the family-wise error rate), the significance level of each test should be set to $\alpha/t$, where $t$ is the number of tests to be performed. However, such a stringent rejection criterion can lead to tests with low power, that is, we include links which could be deleted, hence increasing complexity unnecessarily.
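For Gaussian data the edge exclusion deviance above can be computed directly from the sample correlation matrix, since the partial correlation of $X_i$ and $X_j$ given the remaining variables is available from the inverse of that matrix. The following small sketch is illustrative only and is not taken from [36] or the original text; the data layout and function name are assumptions.

```python
import numpy as np
from scipy.stats import chi2

def edge_exclusion_deviance(data, i, j):
    """Deviance -N log(1 - r^2) for excluding the edge X_i -- X_j, where r is
    the sample partial correlation of X_i and X_j given all other variables.
    `data` is an (N, n) array with one column per (Gaussian) variable."""
    n_obs = data.shape[0]
    precision = np.linalg.inv(np.corrcoef(data, rowvar=False))
    r = -precision[i, j] / np.sqrt(precision[i, i] * precision[j, j])
    deviance = -n_obs * np.log(1.0 - r ** 2)
    p_value = chi2.sf(deviance, df=1)   # asymptotic chi-square, 1 d.f.
    return deviance, p_value
```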
Procedures for excluding edges from the complete graph are based on the theory of variable subset regression, where subsets of the maximal model are chosen and compared in a systematic fashion in order to choose one best subset. In this context a subset is a set of links, and the best subset is that with as few members as possible which provides an adequate fit to the data. [36] derives the relationship between the deviance for testing whether the independencies implied by a Bayesian Network structure hold, and the F-statistic for testing that a subset of the regression coefficients is zero.

5.3.4 Resampling

One technique that has been developed over the last twenty years, as computational power has increased, is resampling. This term generally covers any iterative procedure in which a sample is drawn from the data, or from a distribution defined by the data, at every stage. The Gibbs Sampler, first introduced in Section 3.3.3.4, is often used for learning Bayesian Networks in the case of missing data, and is discussed in this context in Section 5.5.5.1. The bootstrap is another resampling technique and involves repeatedly drawing samples from the original sample. It can be used to estimate the sampling distribution of any statistic. An application of the bootstrap to learning Bayesian Networks is given in Section 5.6.

5.3.5 Bayesian Methods

The main factor that differentiates Bayesian from non-Bayesian methods is the specification of a prior distribution, $P(\theta)$, for the parameters $\theta$. This is possible because, unlike in the usual classical approach where the parameter to be estimated is considered fixed, the Bayesian philosophy is to consider the parameter as a random variable. The prior distribution expresses all the information we have about $\theta$ before having observed the data. We then combine this prior knowledge with data believed to have been generated from the underlying distribution to calculate the posterior distribution, the distribution of the parameters after having considered the information available from the data. The maximum a posteriori (MAP) approach is to select as the best model that with the largest posterior probability. Given that we observe some data $D$, the posterior probability is given by Bayes' rule as
$$P(\theta\mid D) = \frac{P(D\mid\theta)P(\theta)}{P(D)} \propto P(D\mid\theta)P(\theta). \qquad (5.4)$$
Hence we can compare two models with parameters $\theta_1$ and $\theta_2$ by considering the posterior odds,
$$\frac{P(D\mid\theta_1)P(\theta_1)}{P(D\mid\theta_2)P(\theta_2)}, \qquad (5.5)$$
the product of the likelihood ratio $P(D\mid\theta_1)/P(D\mid\theta_2)$ and the prior odds $P(\theta_1)/P(\theta_2)$. Hence it may be that we have two models defined by $\theta_1$ and $\theta_2$ such that $P(\theta_1) > P(\theta_2)$ but, because the data is more likely under the model defined by $\theta_2$, the posterior probability of $\theta_2$ is greater than that of $\theta_1$.

If $\theta$ represents the vector of parameters $(\theta_1, \ldots, \theta_n)$, where $\theta_i$ specifies the conditional probability table for node $X_i$, and the $\theta_i$ are independent given some network structure, then
$$P(\theta\mid D) = \prod_{i=1}^{n} P(\theta_i\mid D) \propto \prod_{i=1}^{n} P(D\mid\theta_i)P(\theta_i).$$
Hence the posterior probability for the network is decomposable, and so can be maximised by finding the $\theta_i$ which maximises $P(\theta_i\mid D)$ for each individual node. In general, a prior distribution is needed for both the structure and the conditional probabilities. A prior distribution for structure assigns some probability to every possible structure over the variables, though often many structures are assigned a prior probability of zero, or the structure is assumed known.
Although prior distributions provide a natural way to include domain or expert knowledge, their specification can also be mathematically complex. As model selection may be sensitive to this specification, a poorly chosen prior distribution can make a Bayesian method perform poorly against alternative techniques [1], or result in inference based on spurious assumptions. Prior distributions can be classed as informative or non-informative. Informative prior distributions are used when it is believed some parameter values are more likely than others. Non-informative priors are chosen if there is no prior evidence to suggest that one particular structure or parameter specification would be any more likely than another. The choice of prior distributions for Bayesian Networks is discussed in detail in Section 5.5.4.

The Bayesian approach can be useful when dealing with sparse data. In this case a prior distribution can be defined for the parameters with all values having a positive probability, in order to avoid undefined values. We have already seen some of these ideas in the context of information propagation in Chapter 3. In that context the prior distribution is the distribution over the states of the variables before we have received evidence. We then observe some evidence $e$ and, through the propagation algorithm, update the distribution at each node to the posterior distribution $P(X\mid e)$. At those leaf nodes where data was not observed we assigned the uninformative uniform prior distribution. This is also an example of sequential updating. At each stage of the algorithm a node updates its belief given the information it receives, by computing the posterior distribution. This posterior distribution is then used as the prior distribution at the next stage when new messages are received, until after several iterations all the information has been taken into account.

5.4 The Information Theoretic Approach

One of the constraints often imposed on the structure when learning Bayesian Networks is that each node can have at most $k$ parents, for some integer $k > 0$. In 1968, in what was one of the first attempts to learn the structure of a probabilistic network, Chow and Liu [3] presented a method to find the optimal network to approximate a probability distribution given that each node can have at most one parent, that is, the network must be a tree. A similar method for learning a polytree structure from data was explored by Pearl [30], though in this case an optimal approximation can be found only if the generating distribution can be represented as a polytree. Here we discuss Chow and Liu's method, and then look at extensions to this approach.

5.4.1 Chow and Liu Trees

Chow and Liu were motivated by the large storage requirements of the information necessary to calculate $P(\mathbf{X})$. As has been discussed, networks which capture the independencies amongst the variables make for more efficient storage, and calculation of probabilities can be carried out more efficiently by use of the chain rule. A further advantage of enforcing a tree structure is that, for a given set of data, the ratio of data to the number of unknown parameters is greater than for more complicated structures, and the probability of encountering missing data is reduced. Although a tree structure results in more efficient storage of a distribution, the structural constraints may correspond to assuming independencies which do not hold. Hence for distributions that cannot be represented by a tree, enforcing a tree structure results in an approximation.
It is desirable to make this approximation as good as possible. Chow and Liu wished to find the spanning tree² whose resulting distribution $P^t(\mathbf{X})$ most closely matched the distribution from which the data was generated, $P(\mathbf{X})$. They chose the Kullback-Leibler divergence as their measure of optimality. The Kullback-Leibler divergence between the two distributions $P$ and $P^t$ can be expanded as follows:
$$\mathrm{dist}_K(P, P^t) = \sum_{\mathbf{x}} P(\mathbf{x})\big(\log P(\mathbf{x}) - \log P^t(\mathbf{x})\big) = \sum_{\mathbf{x}} P(\mathbf{x})\log P(\mathbf{x}) - \sum_{\mathbf{x}} P(\mathbf{x})\log\Big[\prod_{i=1}^{n} P^t(x_i\mid pa(X_i))\Big] = -H(\mathbf{X}) - \sum_{\mathbf{x}} P(\mathbf{x})\sum_{i=1}^{n}\log P^t(x_i\mid pa(X_i)), \qquad (5.1)$$
where $H(\mathbf{X})$ is the entropy of $\mathbf{X}$ as defined in Section 4.1. As $P^t(x_i\mid pa(X_i))$ is constant for all configurations of the variables in $\mathbf{X}\setminus\{X_i, \mathrm{Pa}(X_i)\}$,
$$\mathrm{dist}_K(P, P^t) = -H(\mathbf{X}) - \sum_{i=1}^{n}\sum_{x_i, pa(X_i)}\Big[\sum_{\mathbf{x}} P(\mathbf{x})\,\mathbb{I}\big(\{X_i\cup\mathrm{Pa}(X_i)\}[\mathbf{x}] = (x_i, pa(X_i))\big)\Big]\log P^t(x_i\mid pa(X_i))$$
$$= -H(\mathbf{X}) - \sum_{i=1}^{n}\sum_{x_i, pa(X_i)} P(x_i, pa(X_i))\log P^t(x_i\mid pa(X_i)) \qquad (5.2)$$
$$= -H(\mathbf{X}) - \sum_{i=1}^{n}\sum_{pa(X_i)} P(pa(X_i))\sum_{x_i} P(x_i\mid pa(X_i))\log P^t(x_i\mid pa(X_i)). \qquad (5.3)$$
The final summation in (5.3) is of the form $\sum_x f(x)\log g(x)$. Using Lagrange multipliers to find the value of $g$ which maximises this expression, subject to the constraint $\sum_{j=1}^{r} g(x_j) = 1$, gives
$$\Big(\frac{f(x_1)}{g(x_1)}, \ldots, \frac{f(x_r)}{g(x_r)}\Big)^T = \lambda(1, \ldots, 1)^T,$$
that is, $f(x_j) = \lambda g(x_j)$ for all $j$. Then
$$\sum_{j=1}^{r} g(x_j) = 1 \implies \lambda = \sum_{j=1}^{r} f(x_j) = 1,$$
so $g(x_j) = f(x_j)$ for all $j$. That is, $\sum_{x_i} P(x_i\mid pa(X_i))\log P^t(x_i\mid pa(X_i))$ is maximised, and hence (5.3) minimised, when
$$P^t(X_i\mid\mathrm{Pa}(X_i)) = P(X_i\mid\mathrm{Pa}(X_i)) \quad\text{for all } X_i. \qquad (5.4)$$
This implies that the best approximation based on the Kullback-Leibler divergence is realised when the conditional probabilities for the tree are the same as those obtained from $P$. This is a somewhat surprising result. Recall that, as the tree structure is likely to be a simplification of the original structure, the parent set of a node $X_i$ in the tree is unlikely to be the same as in the original network. Hence we might imagine that, to obtain the best approximation of $P$, the conditional probability distributions which define $P^t$ may in some way need to compensate for the structural changes. Instead we find that the required probabilities for $P^t$ can be calculated directly from $P$, and consequently, if the local structure of a node is the same in $P^t$ as it is in $P$, that is $\mathrm{Pa}^t(X_i) = \mathrm{Pa}(X_i)$, then the conditional probability table at $X_i$ does not change.

On substitution of the optimal probabilities defined by (5.4) into (5.2), we obtain
$$\mathrm{dist}_K(P, P^t) = -H(\mathbf{X}) - \sum_{i=1}^{n}\sum_{x_i, pa(X_i)} P(x_i, pa(X_i))\log P(x_i\mid pa(X_i)). \qquad (5.5)$$
Using the relation
$$\log P(x_i\mid pa(X_i)) = \log\frac{P(x_i, pa(X_i))}{P(pa(X_i))} = \log\frac{P(x_i, pa(X_i))}{P(x_i)P(pa(X_i))} + \log P(x_i) \qquad (5.6)$$
in (5.5) gives
$$\mathrm{dist}_K(P, P^t) = -H(\mathbf{X}) - \sum_{i=1}^{n}\sum_{x_i, pa(X_i)} P(x_i, pa(X_i))\log\frac{P(x_i, pa(X_i))}{P(x_i)P(pa(X_i))} - \sum_{i=1}^{n}\sum_{x_i, pa(X_i)} P(x_i, pa(X_i))\log P(x_i)$$
$$= -H(\mathbf{X}) - \sum_{i=1}^{n} I(X_i; \mathrm{Pa}(X_i)) - \sum_{i=1}^{n}\sum_{x_i} P(x_i)\log P(x_i).$$
As the first and third terms are constant over all structures, minimising $\mathrm{dist}_K(P, P^t)$ is equivalent to maximising $\sum_{i=1}^{n} I(X_i; \mathrm{Pa}(X_i))$.

² A spanning tree of a set of nodes $\mathbf{X}$ is a tree which connects all nodes. If $|\mathbf{X}| = n$, a tree is a spanning tree if and only if the number of links is equal to $n - 1$.
Hence, if we define the weight on any branch between $X_i$ and its parent to be $I(X_i; \mathrm{Pa}(X_i))$, the optimal spanning tree is then the maximum weight spanning tree. Any algorithm for finding a maximum weight spanning tree can be used to find the optimal tree approximation. Using Kruskal's algorithm [5] this is done as follows:

1. Compute the distributions $P(X_i, X_j)$ for all pairs $X_i, X_j$.
2. Compute all weights $I(X_i; X_j)$ and order according to magnitude.
3. Take the branch of largest weight not already in the tree and add it to the tree, unless it forms a loop, in which case discard it.
4. Repeat step 3 until a spanning tree is formed, that is, until there are $n - 1$ branches in the tree.

This result was first proved for the case when the probability distribution $P$ is known. However, it is shown in [3] that by using the sample frequencies as estimates of $P(X_i, X_j)$³ to compute weights $\hat{I}(X_i; \mathrm{Pa}(X_i))$, the optimal tree distribution is also the maximum likelihood distribution, and the corresponding consistency properties hold. That is, if the data is generated from an underlying distribution with a tree representation, then as the size of the data set increases the resulting distribution will converge to the true underlying distribution. Note that here we are minimising distance over all nodes in the model and not just the hypothesis nodes, as was the case in Section 4.3. It may be the case that the distribution defined by the optimal tree is not an adequate approximation to the original distribution for every purpose. In 2000, Williamson [37] looked at extending the ideas of Chow and Liu to show, in general, that maximising the mutual information weight gives the best approximation of a probability distribution.

³ The sample frequency estimate of $P(x_i, x_j)$ is given by the proportion of observations in which $X_i = x_i$ and $X_j = x_j$.

5.4.2 More general networks

Another constraint, and an obvious extension to Chow and Liu's work, is to specify that the graph is singly connected, that is, a polytree. In [30], Pearl presents a method for constructing the optimal polytree from data, given that the underlying probability distribution has a polytree representation. This procedure has several shortfalls:

1. It is only valid if we know the underlying distribution can be represented as a polytree.
2. When forming the network from data the procedure is not consistent in the sense of the Chow and Liu algorithm.
3. It returns only a partially directed network.

Although the first two points are characteristics of the method, the third point is due to the occurrence of equivalent structures, as discussed in Section 4.2. Note that the Chow and Liu algorithm returns a tree with undirected links. This is because, as pointed out previously, the relations $X\rightarrow Y\rightarrow Z$, $X\leftarrow Y\leftarrow Z$ and $X\leftarrow Y\rightarrow Z$ are indistinguishable with regard to independence structure. In a tree there are no converging connections, at which we would be able to determine the direction of the links. As this latter form of connection does exist in polytrees, some links are able to be directed, but not all. In general, if a user desires a directed network, the remaining arrows must be assigned manually, for example to represent notions of causation.

Williamson [37] considers a more general approach, where the constraint on structure is now generalised to specifying that no node has more than $k$ parents. He considers searching for the spanning network of maximum weight, where the weights are defined as previously.
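The tree-structured case lends itself to a very short implementation. The sketch below is illustrative only and is not code from [3] or [5]; it estimates the mutual information weights from sample frequencies and applies Kruskal's algorithm as in steps 1-4 above, assuming the data is a list of equal-length tuples of discrete values.

```python
from collections import Counter
from math import log

def mutual_information(data, i, j):
    """Sample-frequency estimate of I(X_i; X_j); natural log (the base does
    not affect the ranking of the branches)."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum(c / n * log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_edges(data, n_vars):
    """Return the n_vars - 1 undirected branches of the maximum weight tree."""
    weights = sorted(((mutual_information(data, i, j), i, j)
                      for i in range(n_vars) for j in range(i + 1, n_vars)),
                     reverse=True)
    uf_parent = list(range(n_vars))   # union-find (unrelated to network parents)
    def find(x):
        while uf_parent[x] != x:
            x = uf_parent[x]
        return x
    tree = []
    for w, i, j in weights:           # heaviest branch first
        ri, rj = find(i), find(j)
        if ri != rj:                  # skip any branch that would form a loop
            uf_parent[ri] = rj
            tree.append((i, j))
    return tree
```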
The justification that a maximum weight network will be a good approximation to the given probability distribution follows from the derivations of Chow and Liu. To obtain a measure of the distance between the underlying distribution and the approximate distribution $P'$ defined by the network, we apply the results from (5.4) and (5.6) in (5.1) to obtain
$$\mathrm{dist}_K(P, P') = -H(\mathbf{X}) - \sum_{\mathbf{x}} P(\mathbf{x})\sum_{i=1}^{n}\log P'(x_i\mid pa(X_i)) = -H(\mathbf{X}) - \sum_{\mathbf{x}} P(\mathbf{x})\sum_{i=1}^{n}\log\frac{P(x_i, pa(X_i))}{P(x_i)P(pa(X_i))} - \sum_{\mathbf{x}} P(\mathbf{x})\sum_{i=1}^{n}\log P(x_i). \qquad (5.7)$$
Applying the result
$$\sum_{\mathbf{x}} P(\mathbf{x})\sum_{i=1}^{n}\log P(x_i) = \sum_{i=1}^{n}\sum_{x_i}\Big[\sum_{\mathbf{x}} P(\mathbf{x})\,\mathbb{I}(\{X_i\}[\mathbf{x}] = x_i)\Big]\log P(x_i) = \sum_{i=1}^{n}\sum_{x_i} P(x_i)\log P(x_i)$$
in (5.7), we get
$$\mathrm{dist}_K(P, P') = -H(\mathbf{X}) - \sum_{i=1}^{n}\sum_{x_i, pa(X_i)} P(x_i, pa(X_i))\log\frac{P(x_i, pa(X_i))}{P(x_i)P(pa(X_i))} - \sum_{i=1}^{n}\sum_{x_i} P(x_i)\log P(x_i) = -H(\mathbf{X}) - \sum_{i=1}^{n} I(X_i; \mathrm{Pa}(X_i)) + \sum_{i=1}^{n} H(X_i).$$
As the first and third terms are again independent of the choice of network, we see that in general the distance will be minimised when $\sum_i I(X_i; \mathrm{Pa}(X_i))$ is maximised, that is, when the allocation of parents is such that each node receives maximum information from its parents. Note that the weight here is attached to nodes rather than to arcs, and the weights change with the allocation of parents. The procedure Williamson adopts is to attach a weight to each arc from parent $V_l$ to child $X_i$ of the form $I(X_i; V_l\mid\mathbf{V}_{(l)})$, where $\mathbf{V}_{(l)} = \mathrm{Pa}(X_i)\setminus\{V_l\}$. This is then used as the basis for a greedy search method which at each stage adds the arc of maximum weight which can be included without violating the constraints or adding cycles. This procedure stops when no more arcs can be added. As the weights change at each stage, the final network will not necessarily be the maximum weight network, though the examples given in [37] show the algorithm often returns close to optimal results.

5.5 The Bayesian Approach

When learning a Bayesian Network it is obviously desirable to base our choice of network on as much information as possible. We saw in Section 2.3 how networks can be formed on expert or causal knowledge alone, and in the previous section we showed how an information theoretic approach can be used when the only information utilised is that from the data. In Section 4.3 prior knowledge was used when enforcing independence relations which are suspected to be true, in order to constrain the search space. In general, if we possess prior knowledge and data it is advantageous to use both. Bayesian methods provide a solid framework for combining prior knowledge with data, and a well developed approach to model selection.

Initially, in Sections 5.5.2 and 5.5.3, we assume our data is complete. In Section 5.5.2 we consider the `simplest' case, learning the parameters of a fixed structure, and then in Section 5.5.3 we discuss methods for learning both the structure and the parameters. Throughout these sections it will be assumed that all necessary prior distributions have been specified, and in Section 5.5.4 we look at how this can be achieved. Specifically, we show that under several assumptions all necessary prior distributions can be obtained from the specification of a prior network and one other assessment. Section 5.5.5 discusses methods for dealing with incomplete data. As this case can be computationally demanding, we introduce a large sample Gaussian approximation and, by a further approximation, the more efficient Bayesian Information Criterion for model selection.
5.5.1 Notation

Our objective is to learn the structure $S$ and corresponding parameters of a Bayesian Network, given that we have observed a random sample of $N$ cases (observations) which are stored in a database $D$. We assume there are $n$ variables, $X_1, \ldots, X_n$, with each variable $X_i$ having $r_i$ states. We use $N_{ijk}$ to denote the number of observations in $D$ for which $X_i$ is in state $k$ and its parents are in configuration $j$. For a structure $S$ the parameters are denoted $\theta_S \in \Theta_S$, where $\theta_S = (\theta_1, \ldots, \theta_n)$ and $\theta_i$ contains the probabilities associated with node $X_i$. In general, we require the probability that $X_i$ is in state $k$ given that its parents are in configuration $j$, for all $i$, $j$ and $k$. This is denoted by
$$\theta_{ijk} = P(x_i^k\mid pa(x_i)^j),$$
and we will also refer to the vectors
$$\theta_{ij} = \big(P(x_i^k\mid pa(x_i)^j)\big), \quad k = 1, \ldots, r_i,$$
and matrices
$$\theta_i = \big(P(x_i^k\mid pa(x_i)^j)\big), \quad k = 1, \ldots, r_i, \; j = 1, \ldots, q_i,$$
where $q_i$ is the number of configurations of the parents of $X_i$.

5.5.2 Known Structure

Here it is assumed that the structure of the network is fixed. We encode our prior uncertainty about $\theta_S$ in the expression $P(\theta_S\mid S)$ for some hypothesised structure $S$, and must then compute the posterior probability $P(\theta_S\mid D, S)$. The maximum a posteriori approach specifies that we select the $\theta_S$ which maximises this expression.

Consider first the observations from a single variable $X_i$. If we assume that the process generating the data is constant over time, then for any given observation the probability that $X_i$ will be in state $k$ can be given by a scalar $f_{i=k} = P(X_i = k)$. Further, the parameters $f_i = (f_{i=1}, f_{i=2}, \ldots, f_{i=r_i})$ render the observations independent. A sequence which satisfies these conditions is known as an $(r_i - 1)$-dimensional multinomial sample with parameters $f_i$ [20]. We now have, from equation (5.4),
$$P(f_i\mid D) = c\,P(D\mid f_i)P(f_i) = c\,P(f_i)\prod_{k=1}^{r_i} f_{i=k}^{N_{ik}}, \qquad (5.1)$$
where $N_{ik}$ is the number of observations for which $X_i$ was in state $k$, and $c$ is a normalising constant. If we assume the prior probability of the parameters follows a Dirichlet distribution⁴, then by definition
$$P(f_i) = \frac{\Gamma\big(\sum_{k=1}^{r_i} N'_{ik}\big)}{\prod_{k=1}^{r_i}\Gamma(N'_{ik})}\prod_{k=1}^{r_i} f_{i=k}^{N'_{ik} - 1}, \qquad N'_{ik} > 0, \qquad (5.2)$$
where $\Gamma(\cdot)$ is the gamma function. The $N'_{ik}$ are user-defined parameters, the evaluation of which is discussed briefly in Section 5.5.4, and in more detail in [20]. Hence, from (5.1), the posterior distribution after having observed a multinomial sample, for a single variable with a Dirichlet prior distribution, is given by
$$P(f_i\mid D) = c\,\frac{\Gamma\big(\sum_{k=1}^{r_i} N'_{ik}\big)}{\prod_{k=1}^{r_i}\Gamma(N'_{ik})}\prod_{k=1}^{r_i} f_{i=k}^{N'_{ik} - 1}\prod_{k=1}^{r_i} f_{i=k}^{N_{ik}} = c'\prod_{k=1}^{r_i} f_{i=k}^{N'_{ik} + N_{ik} - 1}.$$
Note that, given a Dirichlet prior distribution for $f_i$, the posterior distribution for $f_i$ also has a Dirichlet distribution, with parameters $N_{ik} + N'_{ik}$, $k = 1, \ldots, r_i$. We say that the Dirichlet distribution is conjugate under multinomial sampling. As the posterior distribution is of the same functional form as the prior distribution, this greatly simplifies the mathematics of computing the posterior distribution, and in general conjugate distributions are useful for sequential updating.

⁴ This assumption is justified as, under the assumptions of parameter independence and likelihood equivalence introduced later in this section, a Dirichlet distribution on the network parameters is inevitable. See [20] or [14] for a derivation.

We now introduce two more assumptions which we assume to hold for the remainder of this section. The first assumption is that of global parameter independence.
This says that the parameters associated with a variable in a Bayesian Network are independent of the parameters associated with any other variable. Hence we have
$$P(\theta_S\mid S) = \prod_{i=1}^{n} P(\theta_i\mid S).$$
The second assumption is that of local parameter independence. This says that the parameters associated with a variable are independent for each configuration of the parents. That is,
$$P(\theta_i\mid S) = \prod_{j=1}^{q_i} P(\theta_{ij}\mid S), \qquad i = 1, \ldots, n.$$
The validity of this assumption is more questionable than that of the first, though it has been shown to be reasonable for many problems [20]. The above two assumptions are together referred to as parameter independence. Under these assumptions
$$P(\theta_S\mid S) = \prod_{i=1}^{n}\prod_{j=1}^{q_i} P(\theta_{ij}\mid S),$$
and Bayes' Rule, along with the assumptions of a multinomial sample and a Dirichlet prior distribution, gives
$$P(\theta_S\mid D, S) = c\,P(D\mid\theta_S, S)\prod_{i=1}^{n}\prod_{j=1}^{q_i} P(\theta_{ij}\mid S) = c\,\Big\{\prod_{i=1}^{n}\prod_{j=1}^{q_i}\prod_{k=1}^{r_i}\theta_{ijk}^{N_{ijk}}\Big\}\prod_{i=1}^{n}\prod_{j=1}^{q_i} P(\theta_{ij}\mid S) = c'\prod_{i=1}^{n}\prod_{j=1}^{q_i}\prod_{k=1}^{r_i}\theta_{ijk}^{N'_{ijk} + N_{ijk} - 1}, \qquad (5.3)$$
where the last line follows from the generalisation of (5.2). Given that we have specified a prior distribution on the parameters, and so have values for the $N'_{ijk}$, we can use (5.3) in conjunction with a search method to find the value of $\theta_S$ with the largest relative posterior probability.

5.5.3 Unknown Structure

We could use the above method for selecting the parameters of a Bayesian Network with fixed structure when the structure is unknown, by applying it to every possible structure and selecting the structure $S$ and parameters $\theta_S$ which result in the largest value of $P(\theta_S\mid D, S)$. However, this approach does not utilise prior knowledge and, because of the large number of possible structures, is not computationally feasible. Prior knowledge about structure may be used to constrain the search space. Alternatively, taking the Bayesian approach, we specify a prior distribution on the structure and again look for some measure of posterior probability $P(S\mid D)$ that we can use as a scoring function. In taking this approach we begin with an application of the chain rule,
$$P(D\mid S) = \prod_{l=1}^{N} P(x_l\mid D_l, S), \qquad (5.4)$$
where $x_l$ is the $l$th case in the database and $D_l$ denotes the first $l - 1$ cases. $P(x_l\mid D_l, S)$ is hence the probability distribution of the $l$th case given those observed so far, assuming the structure $S$. Conditioning on $\theta_S$ we have
$$P(x_l\mid D_l, S) = \int P(x_l\mid D_l, S, \theta_S)\,P(\theta_S\mid D_l, S)\,d\theta_S.$$
Given that we know $\theta_S$, the probability distribution of $x_l$ does not depend on $D_l$. Hence the multinomial assumption gives us
$$P(x_l\mid D_l, S, \theta_S) = \prod_{i=1}^{n} P(x_i^k\mid pa(X_i)^j) = \prod_{i=1}^{n}\prod_{j=1}^{q_i}\prod_{k=1}^{r_i}\theta_{ijk}^{I},$$
where
$$I = I(i, j, k, l) = \begin{cases} 1 & \text{if } x_i = k \text{ and } pa(X_i) = j \text{ in } x_l\\ 0 & \text{otherwise.}\end{cases}$$
We then have
$$P(x_l\mid D_l, S) = \int\Big\{\prod_{i=1}^{n}\prod_{j=1}^{q_i}\prod_{k=1}^{r_i}\theta_{ijk}^{I}\Big\}\prod_{i=1}^{n}\prod_{j=1}^{q_i} P(\theta_{ij}\mid D_l, S)\,d\theta_{ij} = \prod_{i=1}^{n}\prod_{j=1}^{q_i}\int\prod_{k=1}^{r_i}\theta_{ijk}^{I}\,P(\theta_{ij}\mid D_l, S)\,d\theta_{ij} = \prod_{i=1}^{n}\prod_{j=1}^{q_i}\prod_{k=1}^{r_i} E(\theta_{ijk}\mid D_l, S)^{I}, \qquad (5.5)$$
where $E(\theta_{ijk}\mid D_l, S)$ is the expected value of $\theta_{ijk}$ with respect to $P(\theta_{ij}\mid D_l, S)$. Hence, on substitution of (5.5) into (5.4), we have
$$P(D\mid S) = \prod_{i=1}^{n}\prod_{j=1}^{q_i}\prod_{k=1}^{r_i}\prod_{l=1}^{N} E(\theta_{ijk}\mid D_l, S)^{I}. \qquad (5.6)$$
$E(\theta_{ijk}\mid D_l, S)$ is equal to the probability that $X_i$ will be in state $k$, given that its parents are in configuration $j$, at the next observation $x_l$.
For any j this is, intuitively, given by the ratio of the number of effective observations in which $X_i$ was in state k and its parents in configuration j, to the total number of effective observations in which the parents of $X_i$ were in configuration j. That is,
\[
E(\theta_{ijk} \mid D, S) = \frac{N'_{ijk} + N_{ijk}}{N'_{ij} + N_{ij}}, \tag{5.7}
\]
where $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$. We use the term 'effective observations' to indicate that this ratio depends on the choice of the parameters $N'_{ijk}$, otherwise known as the effective sample size. From (5.3) it can be seen that the parameters of the prior distribution $N'_{ijk}$ indicate that our prior knowledge is equivalent to having observed the sample $N'$ prior to observing our present sample with counts $N_{ijk}$.

As $I = 1$ only if $x_i = k$ and $pa(X_i) = j$ in $x_l$, each time $I = 1$ in (5.6) for some k in the product over l, $N_{ijk}$ will increase by 1. Hence substituting (5.7) into (5.6), using the fact that $\Gamma(c) = (c-1)!$ for $c = 1, 2, \dots$ and simplifying, results in an expression for $P(D \mid S)$. We hence obtain the Bayesian Dirichlet (BD) metric
\[
P(D, S) = P(S)\,P(D \mid S) = P(S)\prod_{i=1}^{n}\prod_{j=1}^{q_i}\frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij}+N_{ij})}\prod_{k=1}^{r_i}\frac{\Gamma(N'_{ijk}+N_{ijk})}{\Gamma(N'_{ijk})}. \tag{5.8}
\]
Given a prior distribution for the structure, we can maximise (5.8) by finding for each variable the parent set that maximises the second product of this expansion. Hence for each variable $X_i$, $i = 1, \dots, n$, we utilise a search procedure as discussed in Section 5.3.1 to find that set of parents which maximises this score. This results in the network with greatest posterior probability. We can then use (5.3) to assign the network parameters.
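A minimal sketch of the per-node factor of (5.8), computed in log space with the log-gamma function to avoid overflow; the layout of the count arrays (one row per parent configuration) and the example counts are assumptions made for illustration.

```python
from math import lgamma

def log_bd_family_score(prime_counts, counts):
    """Log of prod_j [Gamma(N'_ij)/Gamma(N'_ij + N_ij)
                       * prod_k Gamma(N'_ijk + N_ijk)/Gamma(N'_ijk)] for one node X_i.
    prime_counts[j][k] = N'_ijk, counts[j][k] = N_ijk."""
    score = 0.0
    for nprime_j, n_j in zip(prime_counts, counts):
        score += lgamma(sum(nprime_j)) - lgamma(sum(nprime_j) + sum(n_j))
        for npk, nk in zip(nprime_j, n_j):
            score += lgamma(npk + nk) - lgamma(npk)
    return score

# Hypothetical binary node with one binary parent (q_i = 2, r_i = 2),
# uniform prior counts N'_ijk = 1 and observed counts N_ijk.
print(log_bd_family_score([[1, 1], [1, 1]], [[8, 2], [3, 7]]))
```

Because (5.8) factorises over nodes, a search procedure can compare candidate parent sets for each $X_i$ by recomputing only that node's family score.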
5.5.3.1 The BD metric and Network Equivalence

Recall that two Bayesian Network structures S1 and S2 are independence equivalent if they encode the same independence assumptions. Independence equivalence is an equivalence relation and induces a set of equivalence classes over the possible structures for the variables in U [19]. Two structures are distribution equivalent if every joint probability distribution encoded by one structure can be encoded by the other and vice versa. In this case, if the networks are acausal it does not make sense to differentiate between the two structures, and so hypothesising the structure S1 is equivalent to hypothesising the structure S2. This is referred to by Heckerman, Geiger and Chickering [20] as hypothesis equivalence. Given this property we would also expect that equivalent structures S1 and S2 satisfy likelihood equivalence, that is $P(D \mid S_1) = P(D \mid S_2)$, and score equivalence, that is $P(D, S_1) = P(D, S_2)$. However, for causal networks we cannot validly assume hypothesis equivalence, since a hypothesised structure includes the hypothesis that a node's parents are its direct causes.

The above BD metric does not satisfy the assumption of likelihood equivalence. Heckerman et al. [20] derive a metric they call the BDe metric which does satisfy likelihood equivalence and which simplifies the construction of a prior distribution for the parameters. The form of the BDe metric is identical to (5.8) except that specification of the $N'$ is subject to certain constraints. The details are given in [20] and will be discussed further in the following section.

5.5.4 Prior Distributions

For the discussion in Sections 5.5.2 and 5.5.3 we assumed that we had specified prior distributions both for the parameters of the network, $P(\theta_S)$, and the network structures, $P(S)$. In this section we show how we can derive all prior distributions from the formation of a prior network and an assessment of the equivalent sample size $N'$.

5.5.4.1 The Prior Network

The assumptions of parameter independence and likelihood equivalence constrain the parameters of a complete network structure to have a Dirichlet distribution [20], where the parameters of this distribution must satisfy
\[
N'_{ijk} = N'\,P(x_i^k, pa(x_i)^j \mid S_c), \tag{5.9}
\]
where $S_c$ denotes a complete network structure and $N'$ is the equivalent sample size. In specifying the $N'_{ijk}$ for our prior distribution, we should use as much of our prior knowledge as possible. This prior knowledge can be represented compactly in what is called a prior network. A prior network is a Bayesian Network which the user creates for the domain based on their knowledge. This is done essentially as in Section 2.3, where it was discussed how Bayesian Networks could be formed based on expert knowledge alone. The difference is that, in this section, we then combine data with the prior knowledge encoded in the prior network to obtain the posterior network. In our prior network we specify both the structure, which indicates our beliefs in the independence relations between the variables, and the parameters. However, to specify our prior distribution $P(\theta_S)$ we are required, as in (5.9), to condition on a complete network. Given any structure S, we can form a complete network structure $S_c$ from S, which encodes the same assumptions of independence, by adding in dummy links to complete the network. Dummy links change the structure of the network but not the underlying distribution. To see how this is done, consider the prior network (which is not complete) on binary variables X1, X2 and X3 shown in Figure 5.2 a).

[Figure 5.2: Diagram illustrating the addition of dummy links to the network in a) to form the complete network b).]

The conditional probabilities are specified as
\[
P(X_3) = (0.5,\,0.5), \qquad P(X_2) = (0.4,\,0.6), \qquad P(X_1 \mid X_2 = 0) = (0.3,\,0.7), \qquad P(X_1 \mid X_2 = 1) = (0.2,\,0.8).
\]
To make this into a complete network we need to add 2 dummy links. For example we could add the links $X_3 \rightarrow X_1$ and $X_2 \rightarrow X_3$, as in Figure 5.2 b). The only constraint on this stage of the process is that we ensure the network remains acyclic. The corresponding conditional probabilities are then
\[
P(X_1 \mid X_2, X_3):\ (X_2, X_3) = (0,0): (0.3,\,0.7);\ (0,1): (0.3,\,0.7);\ (1,0): (0.2,\,0.8);\ (1,1): (0.2,\,0.8),
\]
\[
P(X_3 \mid X_2 = 0) = (0.5,\,0.5), \qquad P(X_3 \mid X_2 = 1) = (0.5,\,0.5), \qquad P(X_2) = (0.4,\,0.6).
\]
Hence the variable X3 remains independent of X1 and X2. Note that a complete network formed by the addition of dummy links is not a Probabilistic Bayesian Network, as it is not a minimal I-map. Given that we have a complete network corresponding to our prior network, by using standard Bayesian Network inference we can compute the probabilities required in (5.9). The choice of a suitable value for $N'$ then allows us to compute the parameters $N'_{ijk}$ and hence our prior distribution $P(\theta_{S_c})$ for the parameters of our 'completed' prior network.
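The following sketch illustrates (5.9) for the three-variable example above: it enumerates the joint distribution of the completed prior network and converts an equivalent sample size into prior counts $N'_{ijk}$ for the family of $X_1$. The value $N' = 10$ is hypothetical, and in practice the required joint probabilities would be obtained by Bayesian Network inference rather than by full enumeration.

```python
from itertools import product

# Completed prior network X2 -> X3, {X2, X3} -> X1 (Figure 5.2 b), binary states.
p_x2 = {0: 0.4, 1: 0.6}
p_x3_given_x2 = {0: (0.5, 0.5), 1: (0.5, 0.5)}              # x2 -> (P(X3=0), P(X3=1))
p_x1_given_x2x3 = {(0, 0): (0.3, 0.7), (0, 1): (0.3, 0.7),
                   (1, 0): (0.2, 0.8), (1, 1): (0.2, 0.8)}  # (x2, x3) -> (P(X1=0), P(X1=1))

def joint(x1, x2, x3):
    """P(x1, x2, x3) under the completed prior network."""
    return p_x2[x2] * p_x3_given_x2[x2][x3] * p_x1_given_x2x3[(x2, x3)][x1]

N_prime = 10.0   # hypothetical equivalent sample size
# Prior counts for the family of X1 with parents (X2, X3): N'_ijk = N' * P(X1 = k, pa(X1) = j)
prior_counts = {(x2, x3, x1): N_prime * joint(x1, x2, x3)
                for x2, x3, x1 in product((0, 1), repeat=3)}
print(prior_counts)
```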
5.5.4.2 Prior distributions for the network parameters

In addition to assuming parameter independence and likelihood equivalence we will make two additional assumptions which greatly simplify the specification of a prior distribution. The first of these, likelihood modularity, says that given any structure S,
\[
P(x_i \mid pa(x_i), \theta_i, S) = P(x_i \mid pa(x_i), \theta_i)
\]
for all $X_i$. That is, the probabilities at $X_i$ depend only on its parent set, and not on the remaining structure of the network. The second assumption, prior modularity, says that given any two structures S1 and S2 such that $X_i$ has the same parents in S1 and S2,
\[
P(\theta_i \mid S_1) = P(\theta_i \mid S_2).
\]
These two assumptions formalise the notion that the likelihood of $X_i$ and the parameters at $X_i$ depend only on the structure local to $X_i$. That is, if $X_i$ has the same parents in two different network structures, these values will be the same.

The issue of structure equivalence now arises. If two structures are distribution equivalent, the two hypotheses S1 and S2 should satisfy prior equivalence, that is, the prior probabilities of equivalent structures should be equal. Instead of pretending such cases do not exist and associating a prior probability with every network structure, we associate each hypothesis with an equivalence class of structures. Therefore we are actually learning an equivalence class of structures, and not every structure is in the hypothesis space.

As a complete structure represents no assertions of conditional independence, all complete structures are independence equivalent. Given likelihood equivalence we can compute $P(D \mid \theta_{S_c}, S_c)$ and $P(\theta_{S_c} \mid S_c)$ for any complete structure $S_c$ from the likelihood and prior distribution for another complete structure. We do this by performing a change of variables from those in the joint likelihood specified by one network to the variables in the required joint likelihood.

[Figure 5.3: All structures on two variables.]

For example, consider the network structures over two binary variables X and Y, as shown in Figure 5.3. There are three possibilities: X and Y disconnected, $X \rightarrow Y$, or $X \leftarrow Y$, the final two of which are equivalent. The density function for the joint variables $\theta_{xy}$, $\theta_{\bar{x}y}$ and $\theta_{x\bar{y}}$, where $\theta_{xy} = P(X = x, Y = y)$, is given by $\rho_1(\theta_{xy}, \theta_{\bar{x}y}, \theta_{x\bar{y}})$. Suppose that we want to obtain the parameters for the structure $X \rightarrow Y$, with density $\rho_2(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}})$. The inverse transformation from $\rho_2$ to $\rho_1$ is given by the relations
\[
\theta_{xy} = \theta_x\,\theta_{y|x}, \qquad \theta_{\bar{x}y} = (1-\theta_x)\,\theta_{y|\bar{x}}, \qquad \theta_{x\bar{y}} = \theta_x\,(1-\theta_{y|x}),
\]
and the Jacobian of the transformation, the determinant of the matrix of partial derivatives $\partial(\theta_{xy}, \theta_{\bar{x}y}, \theta_{x\bar{y}})/\partial(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}})$, is
\[
J = \theta_x(1-\theta_x) > 0.
\]
Hence we can obtain the required values
\[
\rho_2(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}) = \rho_1\!\left(\theta_x\theta_{y|x},\ (1-\theta_x)\theta_{y|\bar{x}},\ \theta_x(1-\theta_{y|x})\right)\cdot\theta_x(1-\theta_x).
\]
In general we can use the relation
\[
\theta_{x_1,\dots,x_n} = \prod_{i=1}^{n}\theta_{x_i \mid x_1,\dots,x_{i-1}},
\]
and if $S_c$ is any complete structure over the variables in the domain, the Jacobian for the transformation from the joint likelihood of this domain U to $S_c$ is given by
\[
J = \prod_{i=1}^{n-1}\prod_{x_1,\dots,x_i}\left(\theta_{x_i \mid x_1,\dots,x_{i-1}}\right)^{\prod_{j=i+1}^{n} r_j \;-\; 1},
\]
where $r_j$ is the number of states of $X_j$. A derivation of this result is given in the appendix of [20].

Given that we can now compute the prior distribution for any complete structure, under the modularity assumptions and likelihood equivalence we can construct the prior distribution $P(\theta_S \mid S)$ for any structure S. To do this, recall that we can express
\[
P(\theta_S \mid S) = \prod_{i=1}^{n} P(\theta_i \mid S)
\]
by the assumption of global independence. To determine the terms $P(\theta_i \mid S)$ in this expression we first find a complete network structure $S_{ci}$ such that $X_i$ has the same parents in both S and $S_{ci}$. We then use the procedure described above to obtain $P(\theta_{S_{ci}} \mid S_{ci})$ from the parameters of our completed prior network, $P(\theta_{S_c} \mid S_c)$. We can then use global independence to obtain $P(\theta_i \mid S_{ci})$, which by the modularity assumptions is equal to $P(\theta_i \mid S)$. Hence, under the above assumptions, given that we have specified a prior network we can compute the prior parameters for any other structure.
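As a quick check of the two-variable change of variables above, the following sketch computes the Jacobian determinant of the map $(\theta_x, \theta_{y|x}, \theta_{y|\bar{x}}) \mapsto (\theta_{xy}, \theta_{\bar{x}y}, \theta_{x\bar{y}})$ symbolically, using the sympy library, and confirms that it simplifies to $\theta_x(1-\theta_x)$.

```python
import sympy as sp

tx, ty_x, ty_nx = sp.symbols('theta_x theta_y_given_x theta_y_given_notx', positive=True)

# Joint parameters of the two-variable domain expressed via the X -> Y parameters.
t_xy  = tx * ty_x            # P(X = x, Y = y)
t_nxy = (1 - tx) * ty_nx     # P(X = xbar, Y = y)
t_xny = tx * (1 - ty_x)      # P(X = x, Y = ybar)

J = sp.Matrix([t_xy, t_nxy, t_xny]).jacobian([tx, ty_x, ty_nx])
# The determinant simplifies to theta_x*(1 - theta_x), possibly printed as theta_x - theta_x**2.
print(sp.simplify(J.det()))
```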
5.5.4.3 Prior distributions for network structure

The most simple prior distribution is often the uninformative uniform prior distribution, which assigns equal probability to every structure. However, as every structure is then considered equally likely, this makes no use of any prior knowledge we may have about the structure. This method can be refined through use of prior knowledge to enforce some constraints on the structure or node ordering. Those structures to be disallowed are assigned a prior probability of zero, and the remaining probability is then distributed uniformly over those structures which are allowable. Another approach (attributed to Buntine, [1]) is to assign an ordering to the variables and a probability assessment of the presence or absence of each of the $\binom{n}{2}$ possible links, considered to be independent. In this way the prior probability of any structure under that ordering can be obtained. A similar method can be used when a prior network has been defined. Heckerman et al. [20] propose a method which penalises a network according to how much it differs from the prior network, the structure of which is considered to be most likely. Here the difference between some structure S and that of the prior network $S_p$ is measured in terms of the number of arcs by which they differ, denoted by $\delta$, and S is penalised by a constant factor for each such arc. That is,
\[
P(S) = c\,\kappa^{\delta},
\]
where $0 < \kappa \leq 1$ is the user-defined penalty and c the normalisation constant. Note that in this case a network which is equivalent to $S_p$ will not have the same prior probability, and in general it can be seen that this specification does not satisfy prior equivalence. Hence this method should not be used for acausal networks.

5.5.5 Incomplete Data

When we do not have a complete data set, that is some of the $x_{li}$ in D are not recorded, it is important to ascertain why this is so. If the absence of the observation is dependent on the state of the variable, then missing data should be handled differently to if the absence is independent of state, for example if the variable is hidden. An example of the former case is non-response in a survey, where subjects may choose not to respond to a question dealing with drug use if they are heavy users, for fear of reprimand or some other sensitive reason. In this section we assume all missing data is due to hidden variables or is otherwise independent of state.

Suppose there exists a single incomplete observation in our data base. If we let $Y \subseteq U$ denote the observed variables and $Z = U \setminus Y$ the unobserved variables, then the posterior distribution can be expressed as
\[
P(\theta_{ij} \mid y, S) = \sum_{z} P(\theta_{ij} \mid z, y, S)\,P(z \mid y, S). \tag{5.10}
\]
It turns out, under the Dirichlet assumption, that the posterior distribution (5.10) is a linear combination of Dirichlet distributions [18]. If we observe further incomplete cases, some or all of these Dirichlet distributions will themselves become linear combinations of Dirichlet distributions, and so the number of terms in the posterior will increase exponentially with the number of incomplete observations. Hence, in general, exact inference is intractable and approximations need to be made.

5.5.5.1 Gibbs Sampling

When we have missing data the Gibbs sampler is often used to approximate the posterior distribution $P(\theta_S \mid D, S)$ by repeatedly sampling values for the missing data to form a complete data base. To carry out this procedure each missing observation in D is randomly assigned a value.
We then iterate through the unobserved cases and reassign the state of each by sampling from the probability distribution
\[
P(x'_{li} \mid D_c \setminus x_{li}, S) = \frac{P(x'_{li}, D_c \setminus x_{li} \mid S)}{\sum_{x''_{li}} P(x''_{li}, D_c \setminus x_{li} \mid S)},
\]
where $X_i$ was unobserved in case l and $D_c \setminus x_{li}$ is the current completed data base excluding $x_{li}$. Each time we have reassigned all missing values to obtain a new $D_c$, we compute $P(\theta_S \mid D_c, S)$ by the methods presented in Section 5.5.2. This procedure is then iterated G times and the approximation is taken to be the average,
\[
\hat{P}(\theta_S \mid D, S) = \frac{1}{G}\sum_{g=1}^{G} P(\theta_S \mid D_c^g, S),
\]
where $D_c^g$ is the completed data base from the gth iteration. In the limit as G tends to infinity, this estimate will converge to the expected value of $P(\theta_S \mid D, S)$, though in practice convergence can be quite slow.

When the structure is unknown, Gibbs Sampling can be used to approximate $P(D \mid S)$ by the expression
\[
P(D \mid S) = \frac{P(D, \theta_S \mid S)}{P(\theta_S \mid D, S)} = \frac{P(\theta_S \mid S)\,P(D \mid \theta_S, S)}{P(\theta_S \mid D, S)}.
\]
Given that a prior distribution has been specified for the parameters, we can compute the numerator using inference on the Bayesian Network defined by S and $\theta_S$, and can compute the denominator using Gibbs sampling as above. At each of the G iterations, it is necessary to form and then sample from a probability distribution for each missing value, and then compute $P(\theta_S \mid D_c, S)$. From (5.3) we see that this last step requires that a prior distribution $P(\theta_S)$ be specified. Additionally, to compute $P(D \mid S)$ we are required to use inference in a Bayesian Network, which is NP-hard in the number of nodes. In the next section we derive a large-sample Gaussian approximation for the distribution of the parameters, and then go on to show how this can simplify the calculations when we have large amounts of data.

5.5.5.2 Gaussian Approximation

Here we show how one can approximate the distribution $P(\theta_S \mid D, S)$ by a multivariate Gaussian distribution. A d-dimensional multivariate Gaussian distribution with mean vector $\mu$ of dimension $d \times 1$ and variance matrix $\Sigma$ of dimension $d \times d$ is denoted $N_d(\mu, \Sigma)$ and has a probability density function of the form
\[
P(\theta_S) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\left\{-\tfrac{1}{2}(\theta_S - \mu)^{T}\Sigma^{-1}(\theta_S - \mu)\right\}.
\]
As we are approximating $\theta_S$, the dimension d is given by $\sum_{i=1}^{n} q_i(r_i - 1)$, the sum over all nodes of the number of entries to be learnt for the conditional probability tables. We first assume that the structure S is fixed. The maximum a posteriori (MAP) configuration for $P(\theta_S \mid D, S)$ is that which maximises $P(D \mid \theta_S, S)\,P(\theta_S \mid S)$, or equivalently
\[
g(\theta_S) = \log\big(P(D \mid \theta_S, S)\,P(\theta_S \mid S)\big).
\]
If we let $\tilde{\theta}_S$ be the MAP estimate, then a Taylor series approximation to $g(\theta_S)$ about $\tilde{\theta}_S$, truncated after two terms, is
\[
g(\theta_S) \simeq g(\tilde{\theta}_S) + \tfrac{1}{2}(\theta_S - \tilde{\theta}_S)\,H\,(\theta_S - \tilde{\theta}_S)^{T},
\]
where H is the Hessian of $g(\theta_S)$ evaluated at $\tilde{\theta}_S$. Thus, with $H' = -H$,
\[
\exp\{g(\theta_S)\} \simeq \exp\{g(\tilde{\theta}_S)\}\exp\!\left\{-\tfrac{1}{2}(\theta_S - \tilde{\theta}_S)H'(\theta_S - \tilde{\theta}_S)^{T}\right\}.
\]
Hence
\[
P(D \mid \theta_S, S)\,P(\theta_S \mid S) \simeq P(D \mid \tilde{\theta}_S, S)\,P(\tilde{\theta}_S \mid S)\exp\!\left\{-\tfrac{1}{2}(\theta_S - \tilde{\theta}_S)H'(\theta_S - \tilde{\theta}_S)^{T}\right\}
= c\exp\!\left\{-\tfrac{1}{2}(\theta_S - \tilde{\theta}_S)H'(\theta_S - \tilde{\theta}_S)^{T}\right\}, \tag{5.11}
\]
so that we can approximate $P(\theta_S \mid D, S)$ by the Gaussian distribution with mean $\tilde{\theta}_S$ and variance matrix $(H')^{-1}$. Had we chosen to take the Taylor series expansion about some value other than $\tilde{\theta}_S$, then this would be the mean of our Normal approximation. The MAP estimate $\tilde{\theta}_S$ was chosen as it is an intuitively reasonable estimator.
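To make the construction concrete in one dimension, the following sketch approximates a Beta posterior for a single probability by a Normal distribution centred at the MAP value, with variance given by the inverse of the negated second derivative of the log posterior at that point; the counts a and b are hypothetical.

```python
import numpy as np
from scipy.stats import beta, norm

a, b = 15.0, 5.0                              # hypothetical Beta posterior counts
theta_map = (a - 1) / (a + b - 2)             # mode of Beta(a, b)

def neg_second_deriv(theta):
    """H' = -g''(theta) for g(theta) = (a-1) log theta + (b-1) log(1 - theta)."""
    return (a - 1) / theta**2 + (b - 1) / (1 - theta)**2

approx = norm(loc=theta_map, scale=np.sqrt(1.0 / neg_second_deriv(theta_map)))

for t in (0.6, 0.75, 0.9):
    print(t, beta.pdf(t, a, b), approx.pdf(t))   # exact density vs Gaussian approximation
```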
To use this approximation we are required to evaluate $\tilde{\theta}_S$ and also $H'$, which requires considerable computation. The Expectation-Maximisation algorithm, introduced later in this section, can be used to evaluate $\tilde{\theta}_S$. Given the computations involved, there seems to be little benefit in this approach over the Gibbs sampler. However, computation of the MAP estimate $\tilde{\theta}_S$ is more efficient than the Gibbs sampler when our data base is very large. It also allows for the development of an efficient approximation to the distribution $P(D \mid S)$ when the structure is unknown, which we derive in the following section.

5.5.5.3 Laplace's Approximation

In Section 5.5.5.2 we assumed the structure was fixed. In the case of unknown structure we wish to approximate
\[
P(D \mid S) = \int P(D, \theta_S \mid S)\,d\theta_S = \int P(D \mid \theta_S, S)\,P(\theta_S \mid S)\,d\theta_S. \tag{5.12}
\]
From (5.11) we have that
\[
P(D \mid \theta_S, S)\,P(\theta_S \mid S) \simeq P(D \mid \tilde{\theta}_S, S)\,P(\tilde{\theta}_S \mid S)\exp\!\left\{-\tfrac{1}{2}(\theta_S - \tilde{\theta}_S)H'(\theta_S - \tilde{\theta}_S)^{T}\right\}.
\]
Substituting this into (5.12) yields
\[
P(D \mid S) \simeq \int P(D \mid \tilde{\theta}_S, S)\,P(\tilde{\theta}_S \mid S)\exp\!\left\{-\tfrac{1}{2}(\theta_S - \tilde{\theta}_S)H'(\theta_S - \tilde{\theta}_S)^{T}\right\}d\theta_S
= P(D \mid \tilde{\theta}_S, S)\,P(\tilde{\theta}_S \mid S)\,(2\pi)^{d/2}\,|H'|^{-1/2},
\]
where we have used the fact that $|(H')^{-1}| = |H'|^{-1}$. Hence
\[
\log P(D \mid S) \simeq \log P(D \mid \tilde{\theta}_S, S) + \log P(\tilde{\theta}_S \mid S) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|H'|. \tag{5.13}
\]
This approximation, known as Laplace's approximation, is accurate to order $O(1/N)$ and so can be very accurate for large samples [18]. Again the most computationally intensive stage of this approach is in calculating $H'$ and $\tilde{\theta}_S$. For large samples the prior distribution has a relatively small influence on the posterior distribution, and so in this case $\tilde{\theta}_S$ can be approximated by the maximum likelihood estimate $\hat{\theta}_S$. To obtain a more efficient (and less accurate) approximation to (5.13) we can drop the second and third terms to leave only those that increase with N, and substitute $\hat{\theta}_S$ for $\tilde{\theta}_S$ and $d\log N$ for $\log|H'|$ (as $|H'|$ increases with $N^d$). We then obtain what is known as the Bayesian Information Criterion (BIC),
\[
\log P(D \mid S) \simeq \log P(D \mid \hat{\theta}_S, S) - \frac{d}{2}\log N.
\]
This has the form of a measure of fit plus a complexity penalty, where the penalty increases in proportion to the number of parameters to be estimated. As the BIC does not depend on a prior distribution we can use this criterion without needing to assess a prior distribution. However, this is because we have assumed that our sample is large enough to render any prior knowledge insignificant. If this is not the case the BIC should not be used. We now show how $\tilde{\theta}_S$ and $\hat{\theta}_S$ can be calculated.

5.5.5.4 The Expectation-Maximisation Algorithm

In this case we wish to estimate maximum likelihood or MAP values when we have missing data. In general, the Expectation-Maximisation (E-M) algorithm can also be used to simplify difficult maximum likelihood problems. To initialise the algorithm we assign values to the $\theta_{ijk}$ to obtain a configuration for $\theta_S$. For the Expectation step we then compute the expected values of the counts $N_{ijk}$ for a complete data set (the $N_{ijk}$ are sufficient for the parameters $\theta_{ijk}$). This is given by
\[
E(N_{ijk}) = \sum_{l=1}^{N} P(x_i^k, pa(X_i)^j \mid y_l, \theta_S, S), \tag{5.14}
\]
where $y_l$ denotes the possibly incomplete lth case in D. Note that the expectation is with respect to the joint density for X conditioned on the assigned $\theta_S$ and observed data D. Further, (5.14) can be evaluated using inference on the Bayesian Network specified by parameters $\theta_S$ and structure S, with evidence $y_l$. In the Maximisation step we treat the $E(N_{ijk})$ as if they were actual values and find the configuration of $\theta_S$ that maximises $P(\theta_S \mid D_c, S)$, where $D_c$ is the completed data set. As in (5.7) this is given by
\[
\tilde{\theta}_{ijk} = \frac{N'_{ijk} + E(N_{ijk})}{N'_{ij} + \sum_{k=1}^{r_i} E(N_{ijk})}.
\]
We then iterate these two steps. Under certain regularity conditions discussed in [26], the E-M algorithm will converge to a local maximum.
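The following sketch runs these two steps for the simplest possible case, a single binary variable with no parents and some unobserved entries, so that the inference in the Expectation step reduces to using the current parameter value itself; the data vector and prior counts are hypothetical.

```python
# E-M for a single binary node with missing observations (None = unobserved).
data = [1, 0, 1, None, 1, None, 0, 1, 1, None]   # hypothetical cases
n_prime = [1.0, 1.0]                             # hypothetical prior counts N'_k

theta = [0.5, 0.5]                               # initial configuration of the parameters
for _ in range(50):
    # Expectation step: expected counts E(N_k). An observed case contributes 0 or 1;
    # a missing case contributes the current probability P(X = k).
    e_counts = [0.0, 0.0]
    for x in data:
        if x is None:
            e_counts[0] += theta[0]
            e_counts[1] += theta[1]
        else:
            e_counts[x] += 1.0
    # Maximisation step: update as in the expression above, treating E(N_k) as counts.
    total = n_prime[0] + n_prime[1] + e_counts[0] + e_counts[1]
    theta = [(n_prime[k] + e_counts[k]) / total for k in (0, 1)]

print(theta)
```

With parents present, the contribution of a case with missing values would instead be obtained from Bayesian Network inference with the observed part of the case entered as evidence, as in (5.14).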
5.6 Confidence Measures on Structural Features

Most of the previous discussion has involved inducing networks with high scores. This is a global approach, in that the scoring function compares entire network structures. Edge exclusion deviances were also mentioned briefly in Section 5.3.3. This is a more local approach, where links are considered individually and included or discarded based on whether the data gave enough support to the hypothesis that the link exists. To do this, an asymptotic distribution is used, even though in practice we may not have enough data for this to be an adequate approximation. The approach we now consider falls somewhere between these two methods and is based on application of the bootstrap.

As we have a random sample of observations, any inference made about the underlying probability distribution and structure has some uncertainty associated with it, due to the chance that these observations are not representative. Because of this uncertainty, statisticians tend to associate with every estimate a measure of uncertainty or a confidence measure. For example, a 95% confidence interval for a point estimate such as the expected value of a random variable means that we can be 95% certain that the true value lies within this interval but, due to sampling variation (the variation between samples of size N drawn from the distribution), there is a small chance that the true value is outside this interval. The sampling distribution is the probability distribution of the statistic of interest when calculated from a (random) sample of size N. In theory we could estimate the sampling distribution by repeatedly drawing samples of size N from the population, and calculating the value of the statistic for each sample. However, given that in practice we have access to only the N observations in our original sample, we can estimate the sampling distribution by repeatedly drawing samples of size N, with replacement, from this sample. This is known as bootstrapping. The bootstrap uses the empirical distribution function (that represented by the data) as an approximation to the actual distribution function and then, by resampling from this distribution, hopes to estimate the actual sampling distribution. Hence, given a data set D, we can calculate confidence measures for any characteristic, or feature, of that sample.

In the context of Bayesian Networks the features to be estimated may be the existence of an edge between two variables or the Markov Blanket of a given node (recall that the Markov Blanket of a node $X_i$ is the set of nodes which contains the parents of $X_i$, the children of $X_i$ and all parents of the children of $X_i$). In their paper 'Data Analysis with Bayesian Networks: A Bootstrap Approach' [13], Friedman, Goldszmidt and Wyner propose a method for obtaining confidence measures on such features. As the features we are interested in are characteristics of the network learnt from our data, each time we resample a learning algorithm must be run so that we can observe the induced network. Whatever the feature f of interest, we can assign f the value 1 if that feature is present in the induced structure and the value 0 otherwise. If we take many samples of size N and on each sample run the same learning algorithm, due to sampling variation we will induce differing structures. The value of interest is the probability that a feature f exists in the network induced from a sample of size N. Using the bootstrap approach we can obtain an estimate of this value by the following algorithm:

For i = 1, 2, ..., m:
1. Draw, with replacement, a sample of size N from the data D. Denote this sample by $D_i$.
2. Induce a network structure $\hat{S}_i$ by applying some learning algorithm to $D_i$.

For every feature of interest f, define
\[
\hat{P}_N(f) = \frac{1}{m}\sum_{i=1}^{m} f(\hat{S}_i), \qquad
f(\hat{S}_i) = \begin{cases} 1 & \text{if } f \text{ appears in } \hat{S}_i, \\ 0 & \text{otherwise.} \end{cases}
\]
Friedman et al. tested this approach by generating data from known networks. They were interested primarily in three types of feature: existence of links, members of a node's Markov blanket, and partial ordering of variables, that is whether one node is an ancestor of another. The induction procedure used the BDe metric with a uniform prior distribution and equivalent sample size of 0.5. The search procedure was a greedy hill-climbing algorithm with random restarts. The authors were able to draw several conclusions:

1. The bootstrap samples are 'cautious.' The number of false positives is generally small compared to the number of true positives and false negatives. Thus if the confidence in a feature is high it is likely to exist in the underlying network.
2. Establishing the Markov blanket of a node and partial orderings are more robust than features such as the existence of an edge.

As well as simply providing confidence measures on induced features, it is suggested that such an approach could be used to constrain the search space over network structures. By re-sampling to obtain confidence measures, those features with high confidence can be retained and other features assumed not to exist, to hence narrow the search. For example, if the confidence that X is an ancestor of Y is greater than some threshold (the authors used 0.8), the induced network must respect this order. The results show that for small training sets slightly better networks can be found in this way. However, as the main computational aspect of this algorithm is the network induction at each iteration and the improvement in score using this approach is very small, it may be more efficient computationally to simply allocate more time to the search process. This remains to be shown. Another interesting application of this approach may be in detecting hidden variables, that is variables which are present in the underlying network but not included in the induced network. If we can find a group of variables which with high confidence are in each other's Markov blanket, but the edge relationships are unclear, this may be indicative of a hidden variable.
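A skeleton of this resampling loop is sketched below. The function `induce_structure` is a placeholder standing in for whatever structure-learning algorithm is used, and `has_feature` is a hypothetical predicate such as 'the induced structure contains the edge X -> Y'.

```python
import random

def bootstrap_confidence(data, induce_structure, has_feature, m=200, seed=0):
    """Estimate P_N(f): the fraction of bootstrap replicates whose induced
    structure contains the feature of interest."""
    rng = random.Random(seed)
    n = len(data)
    hits = 0
    for _ in range(m):
        resample = [data[rng.randrange(n)] for _ in range(n)]  # sample of size N, with replacement
        structure = induce_structure(resample)                 # run the learning algorithm on D_i
        hits += 1 if has_feature(structure) else 0
    return hits / m
```

Features whose estimated confidence exceeds a chosen threshold (Friedman et al. used 0.8 for ancestor relations) could then be fixed when constraining a subsequent search.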
Chapter 6

Influence Diagrams for Decision Making

6.1 Decision Scenarios

In previous chapters we have looked at how to enter evidence and update the marginal distributions at the hypothesis nodes prior to making a decision. In this chapter we look at decision making more formally. Namely, given a decision scenario, we wish to determine the optimal decision. If we choose to take some action, this amounts to fixing the state of a variable. Note that this is fundamentally different to simply observing the state of a variable as in previous chapters. When we observe that a variable is in some state, it is in that state because of the influence of other variables in the network. When we make the decision to fix the state of a variable, the state of this variable is then independent of the states of the other variables. However, when we make a decision, that is set the state of some variable, such a choice normally alters the probability distribution of another set of variables. This is the consequence of that decision.

The process of attaching a numerical value to a consequence or outcome is known as utility theory, and the actual set of values attached to the possible outcomes is called a utility measure. We may attach a utility measure to one or more variables to represent the desirability of the outcome at those variables. This is done by directing a link from each node with an outcome of interest to a utility node, U say. We then attach a utility to U for each configuration of the parents of U. The principle of Maximum Expected Utility is to choose as the optimal decision that which maximises the expected utility.

Consider the one decision case. There is one decision node, D say, represented in Figure 6.1 by a square node. This has states corresponding to the possible decisions. The utility nodes U1, U2 and U3 are represented in the figure by diamond shaped nodes. The round nodes are, as before, intermediate random variables, which in this context we refer to as chance nodes.

[Figure 6.1: A Bayesian Network with a decision node D and utility nodes U1, U2 and U3.]

The expected utility at U2 is
\[
EU(U_2) = \sum_{x_3}\sum_{x_4} U_2(x_3, x_4)\,P(x_3, x_4 \mid D, e),
\]
where e represents any other information we have available. Similarly,
\[
EU(U_3) = \sum_{x_4}\sum_{x_5}\sum_{x_6} U_3(x_4, x_5, x_6)\,P(x_4, x_5, x_6 \mid D, e)
\]
and
\[
EU(U_1) = U_1(D).
\]
Utilities can be used to represent all costs and rewards. For example U1 may represent the cost of implementation of the decision, and U2 and U3 represent the desirability of the outcomes. In general, given that we have k utility nodes, the expected utility of a decision D can be calculated as
\[
EU(D \mid e) = \sum_{pa(U_1)} U_1(pa(U_1))\,P(pa(U_1) \mid D, e) + \dots + \sum_{pa(U_k)} U_k(pa(U_k))\,P(pa(U_k) \mid D, e),
\]
where again $pa(U_i)$ represents a configuration of the variables in the set of parents of $U_i$, $Pa(U_i)$. We then choose that decision $d^*$ which results in the maximum expected utility, that is
\[
d^{*} = \operatorname*{argmax}_{d}\, EU(D = d \mid e).
\]
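A minimal sketch of this maximisation for a single utility node: `posterior_over_parents` stands in for whatever inference routine returns $P(pa(U) \mid D, e)$, and the decision labels, utility table and posterior probabilities are hypothetical.

```python
def expected_utility(decision, utility_table, posterior_over_parents, evidence=None):
    """EU(D = decision | e): sum over parent configurations of U(pa) * P(pa | D, e)."""
    posterior = posterior_over_parents(decision, evidence)   # dict: configuration -> probability
    return sum(utility_table[config] * p for config, p in posterior.items())

def best_decision(decisions, utility_table, posterior_over_parents, evidence=None):
    """Maximum Expected Utility choice d* = argmax_d EU(D = d | e)."""
    return max(decisions,
               key=lambda d: expected_utility(d, utility_table, posterior_over_parents, evidence))

# Hypothetical two-state outcome with utilities and posteriors for two decisions.
utilities = {'good': 90, 'bad': -90}
posteriors = {'act': {'good': 0.7, 'bad': 0.3}, 'wait': {'good': 0.4, 'bad': 0.6}}
print(best_decision(['act', 'wait'], utilities, lambda d, e: posteriors[d]))
```

With several utility nodes the same pattern applies, summing one such term per utility node as in the expression above.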
In general there can be many decision nodes, often with a temporal ordering which specifies that some decisions must be taken before others. For example, consider the position of a young man who is at the ticket office to buy tickets for himself and his new girlfriend to attend a movie on their first date. He has forgotten his student card, which would have entitled him to a 30% discount. At the ticket booth his first decision is which of two queues to join, and hence by which teller he will be served. The teller may or may not ask for his student ID. He must then decide to buy tickets for either the action or comedy movie. He places a value on how much his girlfriend enjoys the movie, though not quite as much value as he places on the cost. This decision scenario is represented in Figure 6.2.

[Figure 6.2: A network representing the two decisions which must be made in order to buy a movie ticket.]

Here D1 represents the decision of which queue to join. The arc labels represent the decision made, one for each possible decision. The variable ID is a chance node which represents whether he was asked for ID. The utility U1 is defined by $U_1(\text{ID} = \text{yes}) = -100$ and $U_1(\text{ID} = \text{no}) = 0$. D2 represents the decision action or comedy, and E is a chance node representing the girl's enjoyment of the film. We define the utilities U2 by $U_2(\text{low}) = -90$, $U_2(\text{average}) = 0$ and $U_2(\text{high}) = 90$.

In addition, we require the conditional probabilities P(Enjoyment | action), P(Enjoyment | comedy), P(ID | queue 1) and P(ID | queue 2). A Bayesian Network may be used to determine the probabilities $P(E \mid D_2)$ and $P(\text{ID} \mid D_1)$, and may incorporate other variables, such as in Figure 6.3. Then
\[
P(E \mid D_2) = \sum_{m}\sum_{a} P(E \mid D_2, m, a)\,P(m, a).
\]

[Figure 6.3: Here A represents the variable Attraction to boyfriend, M represents the variable Mood, and E represents the variable Enjoyment.]

Given that all utilities and conditional probabilities have been defined, we can then use probability theory to calculate the expected utility of taking decision D1 and the expected utility of D2 given ID and D1 (for a description of how to calculate the optimal decision in a decision tree such as that in Figure 6.2, see for example [32]). We hence have the optimal decision sequence $(d_1, d_2)$. There are several points to note from this example:

1. Utilities are arbitrary, but the utilities at each node should be defined on the same scale.
2. Bayesian Networks can be used to determine the conditional probability distributions.
3. Each decision results in the same sequence of decision-observation scenarios, no matter what decision was made.

A decision scenario which satisfies point 3 is called symmetric. A symmetric decision scenario can be represented by a chain of variables, as in Figure 6.4. Note that although ID has 2 children, this is still a chain of variables, as utility nodes do not represent variables.

[Figure 6.4: A chain of variables representing the decision scenario in Figure 6.2.]

The chain in Figure 6.4 allows us to ascertain the order of the decisions, namely that D1 must be made prior to D2, and also the observations which are made between decisions. In a network, to indicate graphically which observations have been made prior to a decision, we can add a link from the observed node to the next decision node. Such links are called Information Links, as they indicate the information we have available when making that decision. Similarly, to indicate the order of decisions we can add links which are directed from a decision node to the next decision node. Such links are known as Precedence Links. Figure 6.5 represents the decision scenario of Figure 6.2 with information links and precedence links added. As the addition of precedence links means that the network is no longer a tree, we can include the variables Attraction and Mood from Figure 6.3 in our model explicitly. The resulting diagram is called an Influence Diagram and contains all the information in Figures 6.2, 6.3 and 6.4 in a much more compact form.

[Figure 6.5: The decision scenario of Figure 6.2 represented as a network with added information and precedence links.]

6.2 Influence Diagrams

An Influence Diagram is a directed acyclic graph over a set of decision nodes, chance nodes and utility nodes, with the following properties [22]:

1. There is a directed path encompassing all decision nodes.
2. The utility nodes are leaf nodes.
3. The decision nodes and chance nodes are discrete, with a finite number of mutually exclusive states.
4. Each chance node $X_i$ is associated with a conditional probability table which specifies $P(X_i \mid Pa(X_i))$.
5. Each utility node U is associated with a real-valued function over pa(U).

Influence Diagrams were originally developed as a compact representation of symmetric decision scenarios. However, they are now considered an extension of Bayesian Networks to decision scenarios.
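To make the movie example concrete, the following sketch evaluates the two decisions by averaging out the chance nodes and maximising over each decision in turn, in the manner of the decision-tree solution referenced above; the probabilities are hypothetical, since they are not specified here.

```python
# Hypothetical probabilities for the movie scenario of Section 6.1.
p_id_given_queue = {'queue 1': {'yes': 0.6, 'no': 0.4},
                    'queue 2': {'yes': 0.2, 'no': 0.8}}
p_enjoy_given_genre = {'action': {'low': 0.4, 'average': 0.4, 'high': 0.2},
                       'comedy': {'low': 0.1, 'average': 0.4, 'high': 0.5}}
u1 = {'yes': -100, 'no': 0}                      # cost of being asked for ID
u2 = {'low': -90, 'average': 0, 'high': 90}      # value of the girl's enjoyment

def eu_d2(genre):
    """Expected utility of the second decision (genre), averaging out E."""
    return sum(p * u2[e] for e, p in p_enjoy_given_genre[genre].items())

def eu_d1(queue):
    """Expected utility of the first decision, averaging out ID and assuming the best
    genre is chosen afterwards (here the D2 payoff does not depend on ID)."""
    best_second = max(eu_d2(g) for g in ('action', 'comedy'))
    return sum(p * (u1[i] + best_second) for i, p in p_id_given_queue[queue].items())

print({q: eu_d1(q) for q in ('queue 1', 'queue 2')})   # choose the queue with the larger value
```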
There are several restrictions on what is able to be represented by an Influence Diagram. Firstly, point 1 in the above definition of an Influence Diagram specifies that there must be a directed path between decision nodes, and hence there must be some way of ordering the decisions. This is not always possible, and it may be the case that a better solution can be found if the decisions are considered in an alternate ordering. The second assumption implicit in an Influence Diagram is that of no-forgetting, that is, at each stage the decision maker has complete knowledge of what has gone before.

Influence Diagrams are very closely related to Bayesian Networks, and hence we can use a modified chain rule for Influence Diagrams. If we denote the set of chance nodes by $X_C$ and the set of decision nodes by $X_D$, then
\[
P(X_C = x_C \mid X_D = x_D) = \prod_{X \in X_C} P\big(X = \{X\}[x_C] \mid Pa(X) = Pa(X)[x_C, x_D]\big),
\]
where the parents of X may contain both chance nodes and decision nodes.

6.3 Solution of Influence Diagrams

In general, we consider the task of solving an Influence Diagram or decision tree to be equivalent to calculation of the optimal decision sequence. As for Bayesian Networks, there are many methods for solving Influence Diagrams. Once again these methods make use of the d-separation relations in the graph, though d-separation in an Influence Diagram is slightly different to that in Bayesian Networks. When examining the d-separation relations we ignore the utility nodes and also the links into decision nodes, that is, we ignore the information and precedence links.

The most intuitive way to solve an Influence Diagram is to 'unfold' the graph to a tree representation. In our movie example this would be the tree shown in Figure 6.2, where we use the Bayesian Network of Figure 6.3 to calculate the probabilities $P(E \mid D_2)$. We can then use techniques for solving decision trees (see for instance [32]) to find the optimal decision sequence. However, for even moderately small networks, or networks in which the variables have a large number of states, this is computationally infeasible, as the number of branches increases exponentially with the number of possibilities at each node. The most efficient method for solving Influence Diagrams is to use a clustering method similar to that used for Bayesian Networks and discussed in Section 3.3.3.1. This technique builds what is referred to by Jensen [22] as strong junction trees which, because of the temporal ordering present in Influence Diagrams, involve what is known as strong triangulation, where order matters.

The major complexity issue in solving Influence Diagrams is that, because of the no-forgetting assumption, the past is often intractably large. There exist several approximate methods which can be used to get around this by making use of d-separation, for example by the use of information blocking or what are called LIMIDs (Limited Memory Influence Diagrams). Information blocking uses d-separation to simplify the network. As an example, consider a simplified version of the movie ticket scenario where the only decision required to be made is which genre of movie to buy a ticket for. If the boy has a series of dates, a relevant Influence Diagram would be that depicted in Figure 6.6, where the girl's enjoyment of the past movie influences her enjoyment of the next movie (for example she may only enjoy comedy) and the decision of her boyfriend as to which ticket to buy for the next date.

[Figure 6.6: Influence Diagram representing the decision of which ticket to buy on a sequence of dates.]
Note that such a diagram, taken over several time slices, is often represented compactly as in Figure 6.7, where a double link represents the existence of a link between those variables from one time slice to the next. To use the information blocking approximation we ignore the link between enjoyment nodes from one time slice to the next. Then, because we know which decision was made at D2 at the previous time step, this node d-separates the two time slices and they become independent, significantly reducing the computational burden of solution.

[Figure 6.7: Compact representation of the Influence Diagram in Figure 6.6.]

When we use a LIMID we drop the no-forgetting assumption and represent directly what is remembered at each decision node by the use of information links. Hence if we want a memory which is good for N steps, say, then we would direct links from the relevant variables in the past N steps to the decision at present. A LIMID is hence an extension of the information blocking approach. Although the solution to a LIMID is an approximation to the solution of the corresponding infinite memory influence diagram, it is a good compromise between accuracy and computational efficiency. In general, solving Influence Diagrams is more complex than solving Bayesian Networks, but many of the same techniques can be used. Once a model has been formed there exist software packages which can be used to assist in the solution of the Influence Diagram.

Chapter 7

Conclusions and Remarks

In Chapter 2 we introduced several types of Bayesian Networks, namely Probabilistic Bayesian Networks, Bayesian Belief Networks and Causal Bayesian Networks. These are the most general types of Bayesian Network and make clear the distinction between causal and non-causal models. Bayesian Networks can be used to model a wide range of situations, and so more specific types have been developed. For example, Bayesian Networks are particularly useful for models involving hidden variables, and in this context are often referred to as Dynamic Bayesian Networks [29]. However, all are simply specialisations of the more general types introduced here, and so the algorithms and general properties hold.

In Chapter 3 we discussed methods for calculating the distribution after new information had been received. The method presented in Sections 3.3.1 and 3.3.2 is exact and computationally feasible in trees and polytrees of reasonable size, though in practical situations, unless we choose to simplify the network first using the techniques discussed in Section 4.3 for example, most networks will contain loops. The join tree method discussed in Section 3.3.3 was originally developed by Shafer et al. [35]. This is still one of the most efficient methods available when exact calculation is feasible, though Lauritzen and Spiegelhalter [27], and Jensen et al. [24], developed a slightly more efficient method where the conditional probabilities of the network are changed dynamically (referred to as the Hugin method). Most recent research is focused around finding good approximations by use of Monte Carlo methods. Gilks et al. developed Gibbs sampling for Bayesian Networks, see [16], whilst the AIS-BN algorithm mentioned in Section 3.3.3.5 appears to be the most efficient for unlikely instantiations of the evidence nodes [2].

The value of evidence and the effect of changes to network structure, discussed in Chapter 4, was motivated by the issue of sensitivity analysis.
That is, when there is uncertainty regarding the structure (or parameters) of a network, we wish to know how sensitive our conclusions are to changes in those parameters within a reasonable range. More specifically, if $\theta$ is a set of parameters for a Bayesian Network, we may be interested in how $P(X^h \mid e)$ varies with $\theta$. This is an important issue as, obviously, we do not want to pretend conclusions drawn from a model are valid unless they are robust to small changes of that model. It turns out that, under a general assumption, one-way sensitivity analysis, in which we determine the effect on $P(X^h \mid e)$ of varying only one parameter and holding the remainder constant, requires less than two propagations through the network and simple calculations. More sophisticated analyses have also been developed. Jensen [22] gives a good overview, and more details can be found in [25] and [23].

In Section 4.4.2 we showed that it is best to continually update the set of evidence nodes, based on the effect of evidence observed at the previous time-step. This leads into the theory of finding an optimal trouble-shooting strategy. In that case we have some problem, whose cause is uncertain, that we wish to rectify. We can perform a series of actions (which may fix the problem) and tests (to help decide which actions may be best), each of which has some associated cost. The optimal trouble-shooting strategy is that sequence of actions and tests which results in the minimum expected cost of repair, that is the minimum expected cost of the actions and tests which need to be performed before the problem is fixed. Influence Diagrams can be used to find the optimal sequence, and details of their application to this problem can be found in [22].

The Bayesian method for learning Bayesian Networks, discussed in Section 5.5, allows a way to incorporate prior knowledge and uncertainty about the parameters of the network into the model. Although other methods for learning networks exist, as alluded to in Section 5.3 for example, the Bayesian method is most popular at present. A good source of references to the literature on the topic of learning Bayesian Networks is [1].

In Chapter 6 we introduced the idea of extending Bayesian Networks to Influence Diagrams, when we wish to model a decision scenario. Although in order to be computationally feasible these must satisfy several strict assumptions, many practical problems are still able to be modelled. Prominent applications of Influence Diagrams are finding optimal trouble-shooting strategies and, in management, modelling symmetric decision scenarios. See, for example, [22].

Although Bayesian Networks are primarily a numerical model on which to base inference, the graphical aspect of Bayesian Networks is often utilised. For example, given data, a learning algorithm can be run to determine a suitable structure. Using a method such as the bootstrap discussed in Section 5.6, the 'strength' of each of the induced links can be estimated. This graphical representation allows one to easily determine the relationships between the variables. For example, the network available online at [38], produced from a study of gene expression data from a microarray analysis, allows one to see immediately the relationships between the genes: which pairs are strongly related and which genes are independent of the others, for example. Hence a Bayesian Network provides a way of extracting easily interpretable information about the independence relations between the variables from very large data bases.
The graphical component of a Bayesian Network is also used by Sebastiani et al. in [34]. In this case a learning algorithm was run on a database of survey results. The network displayed the relationships between the variables, and this could be published without any risk of loss of confidentiality to the survey participants.

Throughout this thesis we considered only the case of discrete variables. Bayesian Networks can also be extended to use with continuous variables (Gaussian) and a mixture of both discrete and Gaussian variables. See for example [28] and [29]. Continuous variables complicate information propagation but can simplify learning: in the Gaussian case there are only two parameters (the mean and variance) to be learnt for each configuration of the parents of each variable.

In general, we have primarily discussed what a Bayesian Network is, how they can be formed, and the mechanisms underlying the propagation of information through a network. Rather than focus on a specific application, we chose instead to investigate and present aspects of the large theoretical basis which underpins the application of Bayesian Networks to modelling real-world problems. By having discussed the mathematics behind the application of Bayesian Networks, it is hoped that the suitability of a Bayesian Network for any specific modelling task can be assessed, and the broad range of applications realised. By understanding the implications of the structure on the relationship between the variables, we are better able to understand the constraints of our particular model. Furthermore, we are able to understand the source of the complexity issues which may arise when modelling real problems, how best to deal with them, and the effect of any subsequent simplifications. Understanding how information is stored by and propagated through a network allows us to understand the effect of information on the distributions at the hypothesis nodes, and hence analyse the value of evidence to our particular problem. After having gained an understanding of the theory of Bayesian Networks in the first 5 chapters, it was then natural, in Chapter 6, to extend this to Influence Diagrams. This illustrates the ability of Bayesian Networks to be extended to model many different scenarios.

Bayesian Networks form a very broad class of models and generalise many others. They are primarily used for sequential updating, classification, normal regression, hidden Markov Models and, through the learning algorithms, as a 'data mining' tool to extract information about the relationships between variables in a domain. It is hoped this thesis has allowed the reader to appreciate the flexibility of modelling using Bayesian Networks, and furthermore that the approach taken has led to a better understanding of the issues involved in probabilistic reasoning under uncertainty.

Bibliography

[1] Buntine, W. (1996) A Guide to the Literature on Learning Probabilistic Networks from Data, IEEE Trans. on Knowledge and Data Engineering, vol. 8, no. 2, 195-210.

[2] Cheng, J. and Druzdzel, M. (2000) AIS-BN: An Adaptive Importance Sampling Algorithm for Evidential Reasoning in Large Bayesian Networks, Journal of Artificial Intelligence Research, vol. 13, 155-188.

[3] Chow, C. K. and Liu, C. N. (1968) Approximating Discrete Probability Distributions with Dependence Trees, IEEE Trans. on Info. Theory, vol. IT-14, 462-467.

[4] Chow, C. K. and Wagner, T. J. (1973) Consistency of an Estimate of Tree-Dependent Probability Distributions, IEEE Trans. on Info. Theory, vol. 19, 369-371.

[5] Clark, J. and Holton, D. A. (1991) A First Look at Graph Theory, World Scientific.
[6] Cooper, G. F. (1990) The Computational Complexity of Probabilistic Inference using Bayesian Belief Networks, Artificial Intelligence, vol. 42, 393-405.

[7] Cooper, G. F. and Herskovits, E. (1992) A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning, vol. 9, 309-347.

[8] Cover, T. M. and Thomas, J. A. (1991) Elements of Information Theory, John Wiley and Sons.

[9] Dagum, P. (1993) Approximating Probabilistic Inference in Bayesian Belief Networks is NP-hard, Artificial Intelligence, vol. 60, 141-153.

[10] Das, B. (1999) Representing Uncertainties Using Bayesian Networks, DSTO-TR-0918.

[11] Diestel, R. (1997) Graph Theory, Springer-Verlag, New York.

[12] Ewens, W. J. (2001) Statistical Methods in Bioinformatics: An Introduction, Springer, New York.

[13] Friedman, N., Goldszmidt, M. and Wyner, A. (1999) Data Analysis with Bayesian Networks: A Bootstrap Approach, Proc. of the Fifteenth Conference on Uncertainty in Artificial Intelligence.

[14] Geiger, D. and Heckerman, D. (1994) A Characterisation of the Dirichlet Distribution, Microsoft Research MSR-TR-94-16.

[15] Geiger, D., Verma, T. and Pearl, J. (1990) Identifying Independence in Bayesian Networks, Networks, vol. 20, 507-534.

[16] Gilks, T. and Spiegelhalter, D. (1994) A Language and a Program for Complex Bayesian Modelling, The Statistician, vol. 43, 169-178.

[17] Hastie, T., Tibshirani, R. and Friedman, J. H. (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.

[18] Heckerman, D. (1995) A Tutorial on Learning with Bayesian Networks, Microsoft Research MSR-TR-95-06.

[19] Heckerman, D. and Geiger, D. (1995) Likelihoods and Parameter Priors for Bayesian Networks, Microsoft Research MSR-TR-95-54.

[20] Heckerman, D., Geiger, D. and Chickering, D. (1995) Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, Machine Learning, vol. 20, 197-243.

[21] Hwang, F. (1992) The Steiner Tree Problem, Annals of Discrete Maths, vol. 53.

[22] Jensen, F. (2001) Bayesian Networks and Decision Graphs, Springer-Verlag, New York.

[23] Jensen, F., Aldenryd, S. and Jensen, K. (1995) Sensitivity Analysis in Bayesian Networks, Lecture Notes in Artificial Intelligence, vol. 946, 243-250, Springer.

[24] Jensen, F., Lauritzen, S. and Olesen, K. (1990) Bayesian Updating in Causal Probabilistic Networks by Local Computations, Computational Statistics Quarterly, vol. 4, 269-282.

[25] Kjaerulff, U. and van der Gaag, L. C. (2000) Making Sensitivity Analysis Computationally Efficient, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 317-325, Morgan Kaufmann Publishers.

[26] Lauritzen, S. (1995) The Expectation-Maximisation Algorithm for Graphical Association Models with Missing Data, Computational Statistics and Data Analysis, vol. 19, no. 2, 191-201.

[27] Lauritzen, S. and Spiegelhalter, D. (1988) Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems, Journal of the Royal Statistical Society, Series B, vol. 50, 157-224.

[28] McMichael, D., Liu, L. and Pan, H. (1999) Estimating the Parameters of Mixed Bayesian Networks from Incomplete Data, Proc. of the International Conf. on Information, Decision and Control.

[29] Murphy, K. (1998) A Brief Introduction to Graphical Models and Bayesian Networks, www.ai.mit.edu/~murphyk/Bayes/bayes.html

[30] Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann Publishers, California.

[31] Pearl, J. (2000) Causality: Models, Reasoning and Inference, Cambridge University Press.
[32] Raiffa, H. (1968) Decision Analysis: Introductory Lectures on Choices under Uncertainty, Addison-Wesley.

[33] Rice, J. A. (1988) Mathematical Statistics and Data Analysis, Duxbury Press.

[34] Sebastiani, P. and Ramoni, M. (2001) On the Use of Bayesian Networks to Analyze Survey Data, www.citeseer.nj.nec.com/514636.html

[35] Shafer, G. and Shenoy, P. (1990) Probability Propagation, Annals of Mathematics and Artificial Intelligence, vol. 2, 327-352.

[36] Whittaker, J. (1989) Graphical Models in Applied Mathematical Multivariate Statistics, John Wiley and Sons.

[37] Williamson, J. (2000) Approximating Discrete Probability Distributions with Bayesian Networks, Proc. of the International Conf. on Artificial Intelligence in Science and Technology.

[38] www.cs.huji.ac.il/~nirf/GeneExpression/top800/