Bayesian Networks

Overview
1. Introduction to Bayesian Networks
2. Inference
3. Learning Parameters
4. Learning Topology
5. Decision Making

BAYESIAN NETWORKS: AN INTRODUCTION

Overview
1. A Bayesian Network
2. Instantiation
3. Probability Flows
4. Bayesian Networks and Causality
5. Edges and Conditional Independence

Bayesian Networks: Introduction
A Bayesian network consists of two parts:
1. The qualitative part, which is given in the form of a directed acyclic graph (DAG).
   • Each node of the graph represents a variable of the system the Bayesian network is modeling.
   • The edges of the graph represent independency relations between the variables (more on this later).
2. The quantitative part, which is given by probability distributions associated with each variable, which is to say with each node in the graph. (From now on I will talk interchangeably about nodes and variables.)
   • These distributions give the probability that the associated variable takes a particular value given the values of its parent nodes in the graph.

[Figure: the running example network, with nodes A, B, C, D, E and F. A, B and C have no parents; D has parent A; E has parents B and C; F has parents B, D and E. The conditional probability tables used in the worked examples below are:]

  Pr(A=1) = .7,  Pr(A=0) = .3
  Pr(B=1) = .2,  Pr(B=0) = .8
  Pr(C=1) = .6,  Pr(C=0) = .4

  Pr(D=1 | A=1) = .3    Pr(D=1 | A=0) = .9

  Pr(E=1 | B=1, C=1) = .9
  Pr(E=1 | B=1, C=0) = .8
  Pr(E=1 | B=0, C=1) = .6
  Pr(E=1 | B=0, C=0) = .1

[F's conditional probability table is also given in the figure but is not needed for the worked examples below.]

Instantiation
• We will talk about nodes being 'instantiated' when we know that they have a particular value, and uninstantiated when we do not.
• Let's look at what occurs when a node is instantiated…

Probability Flows
• Downwards:
  Pr(D=1) = (.7)(.3) + (.3)(.9) = .21 + .27 = .48

Probability Flows
• Downwards: Let A=1.
  Pr(D=1) = .3

Probability Flows
• Upwards:
  Pr(A=1) = .7

Probability Flows
• Upwards: Let D=1.
  Pr(A=1) = (.7)(.3) / ((.7)(.3) + (.3)(.9)) = .21/(.21 + .27) = .4375

Probability Flows
• Sideways!
• A priori:
  Pr(B=1) = .2
• If C is instantiated and E is not, this is unchanged: Let C=0.
  Pr(B=1) = .2

Probability Flows
• Sideways!
• But if E is instantiated, then knowing the value of C affects our knowledge of B…

Probability Flows
• Sideways!
• Let E=1.
  Pr(B=1) = ((.2)(.6)(.9) + (.2)(.4)(.8)) / ((.2)(.6)(.9) + (.2)(.4)(.8) + (.8)(.6)(.6) + (.8)(.4)(.1))
          = .172/.492
          ≈ .35

Probability Flows
• Sideways!
• Let E=1, C=0.
  Pr(B=1) = (.2)(.4)(.8) / ((.2)(.4)(.8) + (.8)(.4)(.1))
          = .064/.096
          ≈ .667

Probability Flows
• What is going on?
• Sideways inference is akin to 'explaining away':
  Hypothesis 1: Mr N.N. suffered a stroke.
  Hypothesis 2: Mr N.N. had a heart attack.
  Event: Mr N.N. died.

Bayesian Networks and Causality
• So edges represent causation? NO!!! Bayesian networks are not (in general) causal maps.
  – When creating a BN from expert knowledge, the network is often constructed from known causal connections, since humans tend to think in terms of causal relations.
  – When learning a BN from data we cannot assume that an edge represents a causal relationship.
• There have been controversial methodologies suggested for reading causal relationships off non-causal Bayesian networks (cf. Neapolitan).
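The probability-flow calculations above can be reproduced by brute-force enumeration over the relevant fragment of the network. This is only a minimal illustrative sketch (not one of the inference algorithms discussed later); the CPT values are the ones from the example network above.

from itertools import product

# CPTs of the fragments A -> D and B -> E <- C from the example network.
pA = {1: .7, 0: .3}
pB = {1: .2, 0: .8}
pC = {1: .6, 0: .4}
pD_given_A = {1: {1: .3, 0: .7}, 0: {1: .9, 0: .1}}             # pD_given_A[a][d]
pE1_given_BC = {(1, 1): .9, (1, 0): .8, (0, 1): .6, (0, 0): .1}  # Pr(E=1 | B=b, C=c)

def joint_ad(a, d):
    return pA[a] * pD_given_A[a][d]

def joint_bce(b, c, e):
    pe1 = pE1_given_BC[(b, c)]
    return pB[b] * pC[c] * (pe1 if e == 1 else 1 - pe1)

# Downward flow: Pr(D=1) by summing out A.
pr_d1 = sum(joint_ad(a, 1) for a in (0, 1))                      # 0.48

# Upward flow: Pr(A=1 | D=1) by Bayes' rule.
pr_a1_given_d1 = joint_ad(1, 1) / pr_d1                          # 0.4375

# Sideways flow ("explaining away"): Pr(B=1 | E=1) and Pr(B=1 | E=1, C=0).
num = sum(joint_bce(1, c, 1) for c in (0, 1))
den = sum(joint_bce(b, c, 1) for b, c in product((0, 1), repeat=2))
pr_b1_given_e1 = num / den                                       # 0.172 / 0.492, about 0.35

pr_b1_given_e1_c0 = joint_bce(1, 0, 1) / sum(joint_bce(b, 0, 1) for b in (0, 1))  # about 0.667

print(pr_d1, pr_a1_given_d1, pr_b1_given_e1, pr_b1_given_e1_c0)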
Edges and Conditional Independence
• We said 'The graph represents independency relations between the variables.'
• It does so through the edges, or, more accurately, through the ABSENCE of edges between nodes.
  – Recall that two variables, A and B, are independent if: P(A,B) = P(A)P(B)
  – And they are conditionally independent given a variable C if: P(A,B|C) = P(A|C)P(B|C)

BAYESIAN NETWORKS: TECHNICALITIES AND DEFINITIONS

Overview
1. The Markov Condition
2. D-Separation
3. The Markov Blanket
4. Markov Equivalence

The Markov Condition
The Markov Condition: A node in a Bayesian network is conditionally independent of its non-descendants given its parents.
Be careful:
  – Does the sideways flow of probabilities clash with what you think the Markov Condition claims?
  – A node is NOT conditionally independent of its non-descendants given its parents AND its descendants!

The Markov Condition
• We can think of a Bayesian network as 'pulling apart' a joint probability distribution by its conditional independencies, and thereby rendering it tractable.
• This permits us to use probability theory to reason about systems in a tractable way.
  – Imagine each variable in our six-node network can take ten values. The conditional probability tables would then have a total of 11,130 values (10,000 of which would be for node F).
  – The full joint probability table would have 1,000,000 values.
  BUT THEY REPRESENT THE SAME DISTRIBUTION.

The Markov Condition
A BN can do this only because it meets the Markov Condition. In fact, meeting this condition is the formal definition of a Bayesian network:
  A DAG G and a probability distribution P form a Bayesian network if and only if the pair <G,P> satisfies the Markov Condition.
The Markov Condition also entails further conditional independencies…

D-Separation
Some definitions:
Where we have a set of nodes {X_1, X_2, …, X_k}, where k ≥ 2, such that X_i → X_{i−1} or X_{i−1} → X_i for 2 ≤ i ≤ k, we call the set of edges connecting these nodes a chain between X_1 and X_k.
Let the head of an edge be the side next to the child (where the arrow is on our graphs) and the tail be the side next to the parent. We will talk of the edges of a chain meeting at a node on the chain.

D-Separation
Some definitions:
Let A be a set of nodes, X and Y be distinct nodes not in A, and c be a chain between X and Y. Then c is blocked by A if one of the following holds:
1. There is a node Z ∈ A on c, and the edges that meet at Z on c meet head-to-tail.
2. There is a node Z ∈ A on c, and the edges that meet at Z on c meet tail-to-tail.
3. There is a node Z on c, such that Z and all of Z's descendants are not in A, and the edges that meet at Z on c meet head-to-head.

D-Separation
Let A be a set of nodes, and X and Y be distinct nodes not in A. X and Y are d-separated by A if and only if every chain between X and Y is blocked by A.
(This can be generalized: Let G = (V, E) be a DAG, and A, B, and C be mutually disjoint subsets of V. We say A and B are d-separated by C in G if and only if, for every X ∈ A and Y ∈ B, X and Y are d-separated by C.)

D-Separation
1) The Markov Condition entails that all d-separations are conditional independencies;
2) Every conditional independence entailed by the Markov Condition is identified by a d-separation.

The Markov Blanket
A node is conditionally independent of every other node in the graph given its parents, its children, and the other parents of its children. These form the Markov Blanket of the node.
• Why the other parents of the node's children?
• Because of the sideways flow of probabilities.
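The Markov blanket can be read straight off the graph structure. Here is a minimal sketch using a plain parent-list representation of the running example; the representation and function name are illustrative, not part of any library.

# Parents of each node in the running example (A, B and C are roots).
parents = {
    "A": [], "B": [], "C": [],
    "D": ["A"],
    "E": ["B", "C"],
    "F": ["B", "D", "E"],
}

def markov_blanket(node, parents):
    """Parents, children, and the children's other parents of `node`."""
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:                  # co-parents of each child
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

print(markov_blanket("D", parents))   # {'A', 'F', 'B', 'E'}
print(markov_blanket("B", parents))   # {'C', 'D', 'E', 'F'}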
Markov Equivalence
A definition: If edges from nodes A and B meet at a node C, we say that this meeting is coupled if and only if there is also an edge between nodes A and B. Otherwise this meeting is uncoupled.

Markov Equivalence
Two DAGs are Markov equivalent:
⇔ They encode the same conditional independencies;
⇔ They entail the same d-separations;
⇔ They have the same links (edges without regard for direction) and the same set of uncoupled head-to-head meetings.

INFERENCE 1: THE VARIABLE ELIMINATION ALGORITHM

Potentials
We define a potential to be a function mapping value combinations of a set of variables to the non-negative real numbers. For example, f is a potential:

  a  d  f(A=a, D=d)
  1  1  .3
  1  0  .7
  0  1  .9
  0  0  .1

Notice this is the conditional probability table of node D.
• Conditional probability tables are potentials.
• Joint probability tables are potentials.

Potentials
• A potential's 'input' variables (those variables which can take more than one value) are called its scheme; here Scheme(f) = {A, D}.
• So, if we know D is true, then the potential corresponding to our a posteriori knowledge from D's conditional probability table would be:

  a  g(A=a, D=1)
  1  .3
  0  .9

  Scheme(g) = {A}

Multiplication of Potentials
We next define two operations on potentials: multiplication and marginalization.
Given a set of potentials, F, the multiplication of these potentials is itself a potential. The value of each row, r, of a potential formed in this manner is obtained from the product of the row of each function f in F which assigns the variables of Scheme(f) the same values as they are assigned in r:

  h(x) = \prod_{f \in F} f(x|_{Scheme(f)})

Obviously:

  Scheme(h) = \bigcup_{f \in F} Scheme(f)

If f and g are potentials, when we perform the calculation/assignment f = f.g (possible only when Scheme(g) ⊆ Scheme(f)) we will say we multiply g into f.

Multiplication of Potentials

  f(A,D):                     g(A,X):
  a  d  f(A=a, D=d)           a  x  g(A=a, X=x)
  1  1  .3                    1  1  .7
  1  0  .7                    1  0  .3
  0  1  .9                    0  1  .5
  0  0  .1                    0  0  .5

If h = f.g then:

  a  d  x  h(A=a, D=d, X=x)
  1  1  1  .3 × .7 = .21
  1  1  0  .3 × .3 = .09
  1  0  1  .7 × .7 = .49
  1  0  0  .7 × .3 = .21
  0  1  1  .9 × .5 = .45
  0  1  0  .9 × .5 = .45
  0  0  1  .1 × .5 = .05
  0  0  0  .1 × .5 = .05

Marginalization of Potentials
Given a potential f, we can marginalize a variable out of f, and the result is itself a potential. If i is the potential that results from marginalizing the variable A out of f, then:

  Scheme(i) = Scheme(f) \ {A}

Each row, r, in i is computed by summing the rows of f in which the variables of Scheme(i) have the values assigned to them by r.
If we wish to assign to a function variable the result of marginalizing out one of its variables, we will simply say we marginalize the variable out of the function variable.
So if g is the potential that results from marginalizing the variable D out of our example potential f, then:

  a  g(A=a)
  1  1
  0  1
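A minimal sketch of the two operations on dictionary-backed potentials, assuming binary variables; the (scheme, table) representation is just illustrative. It reproduces the h = f.g example and the marginalization of D above.

from itertools import product

# A potential: (scheme, table), where table maps value tuples (ordered as in
# the scheme) to non-negative reals. Binary variables assumed for brevity.
f = (("A", "D"), {(1, 1): .3, (1, 0): .7, (0, 1): .9, (0, 0): .1})
g = (("A", "X"), {(1, 1): .7, (1, 0): .3, (0, 1): .5, (0, 0): .5})

def multiply(*potentials):
    """Multiply potentials; the result's scheme is the union of the schemes."""
    scheme = []
    for s, _ in potentials:
        scheme += [v for v in s if v not in scheme]
    table = {}
    for row in product((0, 1), repeat=len(scheme)):
        assignment = dict(zip(scheme, row))
        value = 1.0
        for s, t in potentials:
            value *= t[tuple(assignment[v] for v in s)]
        table[row] = value
    return tuple(scheme), table

def marginalize_out(potential, var):
    """Sum a variable out of a potential."""
    scheme, table = potential
    new_scheme = tuple(v for v in scheme if v != var)
    new_table = {}
    for row, value in table.items():
        key = tuple(val for v, val in zip(scheme, row) if v != var)
        new_table[key] = new_table.get(key, 0.0) + value
    return new_scheme, new_table

h = multiply(f, g)
print(h[1][(1, 1, 1)])          # .3 * .7 = 0.21
print(marginalize_out(f, "D"))  # g(A=1) = 1, g(A=0) = 1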
The Variable Elimination Algorithm
1. Perform a topological sort (ancestral ordering) on the graph. This will provide us with an ordering in which no node comes before any of its ancestors. This is always possible since the graph is a DAG.
2. Construct a set of 'buckets', where there is one bucket associated with each variable in the network, b(i), and one additional bucket b∅. Each bucket will hold a set of potentials (or constant functions, in the case of b∅). The buckets are ordered according to the topological order obtained in step one, with b∅ at the beginning.
3. Convert the conditional probability tables of the network into potentials and place each in the bucket associated with the largest variable in its scheme, based on the ordering. If there are no variables in a potential's scheme, it is placed in the null bucket.
4. Proceed in reverse order through the buckets:
   i.   Multiply all potentials in the bucket, producing a new potential, p.
   ii.  Marginalize the variable associated with the bucket out of p, producing the potential p'.
   iii. Place p' in the bucket associated with the largest variable in its scheme.
5. Process the null bucket, which simply involves multiplying the constant functions together (scalar multiplication).

The Variable Elimination Algorithm

  Bucket      Contents (scheme)                               Processing, in reverse order
  Bucket ∅    from Bucket A {}                                7. Multiply all, and we have the result.
  Bucket A    Node A's CPT {A};  from Bucket B {A}            6. Multiply all and marginalize out A → {}
  Bucket B    Node B's CPT {B};  from Bucket C {A,B}          5. Multiply all and marginalize out B → {A}
  Bucket C    Node C's CPT {C};  from Bucket D {A,B,C}        4. Multiply all and marginalize out C → {A,B}
  Bucket D    Node D's CPT {A,D};  from Bucket E {B,C,D}      3. Multiply all and marginalize out D → {A,B,C}
  Bucket E    Node E's CPT {B,C,E};  from Bucket F {B,D,E}    2. Multiply all and marginalize out E → {B,C,D}
  Bucket F    Node F's CPT {B,D,E,F}                          1. (Multiply all and) marginalize out F → {B,D,E}

Points to note
The algorithm produces the probability of the evidence. So if it is run without any evidence, it simply marginalizes all variables out and returns 1!
To actually get, say, the a priori probabilities of each variable, we have to run the algorithm repeatedly, assigning each value to each variable (one at a time).
Likewise, to get a posteriori probabilities, we run the algorithm on the evidence, then with the evidence plus the variable-value pair we are interested in, and divide the second result by the first.

Conclusion
The Variable Elimination Algorithm is…
• VERY INEFFICIENT!!!
• (When supplemented) the only algorithm that can estimate error bars for arbitrary nodes in a Bayesian network.
  – Once we have completed the algorithm as given, we proceed backwards, calculating the derivatives of the functions involved. From these we can produce an effective approximation of the variance of the probability distribution, from which we estimate the error bars.

INFERENCE 2: THE JUNCTION TREE ALGORITHM

Junction Trees
• A Junction Tree is a secondary structure that we construct from a Bayesian network:
  1. Take a copy of the DAG and undirect its edges.
  2. Moralize.
  3. Triangulate.          (steps 3 and 4 are performed in a single step)
  4. Find the cliques of the triangulated graph.
  5. Insert sepsets between the cliques.

Build an optimal Junction Tree
1. Begin with a set of n trees, each consisting of a single clique, and an empty set S.
2. For each distinct pair of cliques X and Y, insert a candidate sepset into S, containing all and only the nodes in both X and Y.
3. Repeat until n−1 sepsets have been inserted into the forest:
   A. Choose the candidate sepset, C, which contains the largest number of nodes, breaking ties by choosing the sepset which has the smaller value product (the product of the number of values of the nodes/variables in the sepset).
   B. Delete C from S.
   C. Insert C between the cliques X and Y only if X and Y are on different trees in the forest. (NB This merges the two trees into a larger tree.)

[Figures: the example DAG (A–F); the same graph with its edges undirected; and the moralized graph, in which the parents of each node have been married (adding the edges B–C, B–D and D–E).]
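Steps 1 and 2 (undirect and moralize) are straightforward to sketch on the plain parent-list representation used earlier; this is only an illustrative sketch applied to the running example, not a library routine.

from itertools import combinations

# Parents of each node in the running example.
parents = {"A": [], "B": [], "C": [], "D": ["A"], "E": ["B", "C"], "F": ["B", "D", "E"]}

def moralize(parents):
    """Return the moral graph as a set of undirected edges (frozensets)."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:                          # step 1: drop edge directions
            edges.add(frozenset((p, child)))
        for u, v in combinations(ps, 2):      # step 2: 'marry' each node's parents
            edges.add(frozenset((u, v)))
    return edges

moral = moralize(parents)
print(sorted(sorted(e) for e in moral))
# Adds B-C (E's parents) and B-D, D-E (F's parents) to the undirected skeleton.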
Obtain Cliques from the Triangulated Graph whilst Triangulating
1. Take the moral graph, G1, and make a copy of it, G2.
2. While there are still nodes left in G2:
   A. Select a node V from G2, such that V causes the least number of edges to be added in step 2B, breaking ties by choosing the node that induces the cluster with the smallest weight, where:
      • The weight of a node V is the number of values of V.
      • The weight of a cluster is the product of the weights of its constituent nodes.
   B. The node V and its neighbors in G2 form a cluster, C. Connect all of the nodes in this cluster.
   C. If C is not a sub-graph of a previously stored cluster, store C.
   D. Remove V from G2.

Cliques
  {A, D}   {B, D, E, F}   {B, C, E}

Creating Separation Sets
1. Create n trees, each consisting of a single clique, and an empty set S.
2. For each distinct pair of cliques X and Y, insert X ∩ Y into S, recording the cliques this set was formed from.
3. Repeat until n−1 sepsets have been inserted into the forest:
   a. Select from S the sepset, s, that has the largest number of variables in it, breaking ties by choosing the set which has the lower value product (the product of the number of values of each variable in the set). Further ties can be broken arbitrarily.
   b. Delete s from S.
   c. Insert s between the cliques X and Y only if X and Y are on different trees in the forest. (Note this merges these two trees into a larger tree.)

Cliques and Sepsets
  {A, D} —[D]— {B, D, E, F} —[B, E]— {B, C, E}

Junction Trees and Sub-Graphs
• Junction Trees can be formed from sub-graphs: Find the smallest sub-graph, d-separated from the remainder of the graph by the instantiated nodes, that includes all the nodes we wish to predict a posteriori values for. Construct the JT from this sub-graph. (Store created JTs for reuse.)

A Message Pass
Passing a message from clique X to clique Y via sepset S:
1. Save the potential associated with S.
2. Marginalize a new potential for S out of X.
3. Assign a new potential to Y, such that:

  pot(y)_new = pot(y)_old · ( pot(s)_new / pot(s)_old )

Collect and Distribute Evidence
COLLECT-EVIDENCE(X)
1. Mark X.
2. Call Collect-Evidence recursively on X's unmarked neighboring clusters, if any.
3. Pass a message from X to the cluster which invoked Collect-Evidence(X).

DISTRIBUTE-EVIDENCE(X)
1. Mark X.
2. Pass a message from X to each of its unmarked neighboring clusters, if any.
3. Call Distribute-Evidence recursively on X's unmarked neighboring clusters, if any.

Evidence Potentials

  Variable  Value 1  Value 2  Value 3
  A         1        1        1
  B         1        1        1
  C         1        0        0
  D         1        1        1
  E         .7       .25      .05
  F         1        1        0

We know:
• Variable C has value 1.
• Variable F has value 1 or 2, but not 3.
• Variable E has a 70% chance of being value 1, a 25% chance of being value 2 and a 5% chance of being value 3. (Soft evidence.)

The Junction Tree Algorithm
1. Initialize the Junction Tree:
   – Associate with each clique and sepset a potential with all values set to 1.
2. Multiply in Conditional Probabilities:
   – For each node, find a clique containing the node and its parents (it will exist) and multiply the node's conditional probability table into the clique's potential.
3. Multiply in Evidence:
   – For each node, multiply any evidence that exists into a clique where the node is present.
4. Collect and Distribute Evidence:
   – Pick an arbitrary root clique, and call Collect-Evidence and then Distribute-Evidence on this clique.
5. Marginalize out desired values:
   – For each node you wish to obtain a posteriori probabilities for, select the smallest clique containing it and marginalize the clique's potential down to that node.
6. Normalize:
   – Normalize the potential obtained in step 5.
If using soft evidence, steps 3 and 4 must be repeated until a desired level of convergence is reached.
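The single message-pass update can be sketched directly from the formula above. This is only an illustrative sketch on dict-backed potentials; it assumes a single shared sepset variable appearing in the same relative order in both cliques' schemes, and the cliques used in the example are hypothetical, not the ones built from the running network. The usual 0/0 = 0 convention is applied.

def marginalize_to(scheme, table, keep):
    """Sum a potential down to the variables in `keep` (in their scheme order)."""
    out = {}
    for row, val in table.items():
        key = tuple(v for var, v in zip(scheme, row) if var in keep)
        out[key] = out.get(key, 0.0) + val
    return out

def pass_message(x_scheme, x_table, s_scheme, s_table, y_scheme, y_table):
    """pot(y)_new = pot(y)_old * ( pot(s)_new / pot(s)_old )."""
    old_s = dict(s_table)                                      # 1. save old sepset potential
    new_s = marginalize_to(x_scheme, x_table, set(s_scheme))   # 2. project X onto S
    s_table.clear()
    s_table.update(new_s)
    for row in y_table:                                        # 3. absorb the ratio into Y
        key = tuple(v for var, v in zip(y_scheme, row) if var in s_scheme)
        ratio = new_s[key] / old_s[key] if old_s[key] else 0.0
        y_table[row] *= ratio

# Hypothetical cliques {A, D} and {D, E} linked by sepset {D}; {A, D} already
# holds the product of A's and D's CPTs, everything else is initialized to 1.
x_scheme = ("A", "D")
x_table = {(1, 1): .7 * .3, (1, 0): .7 * .7, (0, 1): .3 * .9, (0, 0): .3 * .1}
s_scheme, s_table = ("D",), {(1,): 1.0, (0,): 1.0}
y_scheme = ("D", "E")
y_table = {(d, e): 1.0 for d in (0, 1) for e in (0, 1)}

pass_message(x_scheme, x_table, s_scheme, s_table, y_scheme, y_table)
print(s_table)   # sepset now holds Pr(D): approximately {(1,): 0.48, (0,): 0.52}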
The Junction Tree Algorithm
• Very efficient: calculates all a posteriori probabilities at once.
• Numerous efficiency enhancements (see the literature).
• Complexity is dominated by the largest clique.
• Should be the algorithm of choice unless seeking error bars.

INFERENCE 3: SAMPLING

Logic Sampling
Given a set of variables, E, whose values, e, we know (or are assuming), estimate a posteriori probabilities for the other variables, U, in the network:
1. Perform a topological sort on the graph.
2. For each node in U, create a scorecard, with a number for each value. Initially set these to 0.
3. Repeat:
   • Randomly generate values for each variable from its conditional probability table, in the order generated in step 1.
   • If E ≠ e, discard the results.
   • Otherwise, for each node in U, add 1 to the score for the value it has taken in this sample.
4. For each variable in U, normalize its scorecard to obtain a posteriori probabilities.

Logic Sampling
Problem: We end up discarding too many cases when the evidence has low probability (which is routinely the case when dealing with large sets of evidence variables).
DO NOT USE LOGIC SAMPLING

Likelihood Sampling
Given a set of variables, E, whose values, e, we know (or are assuming), estimate a posteriori probabilities for the other variables, U, in the network:
1. Perform a topological sort on the graph.
2. Set all nodes in E to their known values.
3. For each node in U, create a scorecard, with a number for each value. Initially set these to 0.
4. Repeat:
   • For each node in U, randomly generate a value from its conditional probability table, in the order generated in step 1.
   • Given the values assigned, calculate the probability, p, that E = e from the evidence nodes' conditional probability tables.
   • For each node in U, add p to the score for the value it has taken in this sample.
5. For each variable in U, sum the results and normalize.

LEARNING 1: PARAMETER LEARNING

Dirichlet Probability Distributions
Definition: The Dirichlet probability distribution with parameters a_1, …, a_n, where

  N = \sum_{m=1}^{n} a_m

is:

  \rho(f_1, f_2, \ldots, f_{n-1}) = \frac{\Gamma(N)}{\prod_{m=1}^{n} \Gamma(a_m)} \, f_1^{a_1 - 1} f_2^{a_2 - 1} \cdots f_n^{a_n - 1}

where 0 ≤ f_m ≤ 1 and \sum_{m=1}^{n} f_m = 1.

Wonkish aside…
There is one more parameter than free variable, since the final value is uniquely determined by those that came before it:

  f_n = 1 - \sum_{m=1}^{n-1} f_m

Dirichlet Probability Distributions
• Don't worry:
  – If the probabilities of a random variable, X, taking its particular values are given by a Dirichlet distribution with parameters a_1, a_2, …, a_n and N = a_1 + a_2 + … + a_n, then:

    Pr(X = v_i) = a_i / N

  – It is often said that a Dirichlet distribution represents the probabilities associated with seeing value v_i occur a_i times out of N.

Dirichlet Probability Distributions
• Let the binary variable X be represented by the Dirichlet distribution Dir(f1, 4, 6):
  Pr(X=v1) = .4   Pr(X=v2) = .6
• Likewise, let the binary variable Y be represented by the Dirichlet distribution Dir(f1, 40, 60):
  Pr(Y=v1) = .4   Pr(Y=v2) = .6

Dirichlet Probability Distributions
• However, our confidence in the probabilities given for Y would be much higher than those given for X (since so much more of the distribution lies in the vicinity of these values).
• We shall also see that, for our purposes, the probabilities for Y would be much more resistant to emendation from new evidence than those for X.
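The difference in responsiveness is simple arithmetic on the parameters. A small sketch, with an illustrative batch of ten new observations of value v1 (the batch size is just an assumption for the example):

# X ~ Dir(4, 6) and Y ~ Dir(40, 60) imply the same probabilities, but respond
# very differently to the same batch of new data.
def dirichlet_mean(params):
    n = sum(params)
    return [a / n for a in params]

x_prior, y_prior = [4, 6], [40, 60]
print(dirichlet_mean(x_prior), dirichlet_mean(y_prior))   # both [0.4, 0.6]

new_counts = [10, 0]                                      # ten observations of v1
x_post = [a + c for a, c in zip(x_prior, new_counts)]     # Dir(14, 6)
y_post = [a + c for a, c in zip(y_prior, new_counts)]     # Dir(50, 60)
print(dirichlet_mean(x_post))                             # [0.7, 0.3]
print(dirichlet_mean(y_post))                             # roughly [0.45, 0.55]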
Learning Dirichlet Distributions
• Imagine we have a network topology and wish to learn the parameters (conditional probability distributions) associated with each variable from the data set D:

  Data Set D
  A  B
  1  1
  1  0
  0  1
  1  0
  0  1

  [Figure: the two-node network A → B, whose conditional probability tables are still to be filled in.]

Learning Dirichlet Distributions
• The basic procedure is:
  1. Create a Dirichlet distribution for each row of each conditional probability table.
  2. For each datum, for each node, find the row in the conditional probability table which corresponds to the values the node's parents take in the given datum, and add 1 to the parameter associated with the value the node takes in the datum.

Learning Dirichlet Distributions
• So we have 3 Dirichlet distributions (let us just show the parameters, which will be zero):

  A:        Dir(0, 0)
  B, A=0:   Dir(0, 0)
  B, A=1:   Dir(0, 0)

Learning Dirichlet Distributions
• And we add the values from the data set:

  A:        Dir(3, 2)   →  Pr(A=1) = .6,   Pr(A=0) = .4
  B, A=0:   Dir(0, 2)   →  Pr(B=0|A=0) = 0,    Pr(B=1|A=0) = 1
  B, A=1:   Dir(2, 1)   →  Pr(B=0|A=1) = .66,  Pr(B=1|A=1) = .33

Learning Dirichlet Distributions
• But... are we really willing to conclude that it is 100% certain that if A=0 then B=1, from two observations?!? NO.

Learning Dirichlet Distributions
• To avoid such issues we do not commence with Dirichlets having only zeros. Rather, if, prior to observing any data, we assume that all values are equally likely, we might assign the a priori distributions:

  A:        Dir(1, 1)
  B, A=0:   Dir(1, 1)
  B, A=1:   Dir(1, 1)

Learning Dirichlet Distributions
• And we add the values from the data set:

  A:        Dir(1+3, 1+2)   →  Pr(A=1) = .57,   Pr(A=0) = .43
  B, A=0:   Dir(1+0, 1+2)   →  Pr(B=0|A=0) = .25,  Pr(B=1|A=0) = .75
  B, A=1:   Dir(1+2, 1+1)   →  Pr(B=0|A=1) = .6,   Pr(B=1|A=1) = .4

Learning Dirichlet Distributions
• With these prior distributions we can:
  – Ensure some reasonable conservatism!
  – Encode defeasible domain knowledge.
  – Set the responsiveness of our learning algorithm to data. (Large values in the parameters will be less responsive.)

LEARNING 2: LEARNING STRUCTURE/TOPOLOGY

Topology Learning
• Given our algorithm for learning parameters from data, we can learn topology.
  Our task in topology learning: Find the graph topology which, when its parameters are learnt from the data, renders the data most likely.

The Bayesian Scoring Criterion

  Pr(d \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i^{(G)}} \frac{\Gamma(N_{ij}^{(G)})}{\Gamma(N_{ij}^{(G)} + M_{ij}^{(G)})} \prod_{k=1}^{r_i} \frac{\Gamma(a_{ijk}^{(G)} + s_{ijk}^{(G)})}{\Gamma(a_{ijk}^{(G)})}

• d is our data.
• G is the graph we are scoring.
• n is the number of nodes in the graph.
• q_i is the number of parent value combinations node i has in its conditional probability table given G.
• N_ij is the sum of the Dirichlet prior parameters for the row of node i's conditional probability table corresponding to parent combination j.
• M_ij is the sum of the learnt additions to the Dirichlet parameters for the same row.
• r_i is the number of values node i has.
• a_ijk is the Dirichlet prior parameter corresponding to value k in row j for node i in graph G.
• s_ijk is the sum of the learnt additions to the same parameter.
• Γ is the Gamma function.

The Bayesian Scoring Criterion
Overview in words: The probability of the data given the graph is the product, for every node and for every row of that node's conditional probability table, of the probability that the node would have the values it did when its parents had the values they did, given our prior parameters and the data encountered thus far.

The Bayesian Scoring Criterion
The criterion is:
• Decomposable: It is the product of a score for each node in the graph.
• Locally updateable: If we change a graph by adding, removing or reversing an edge, we need only rescore the nodes that were, or now are, children of the edge in question.
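A minimal sketch of the per-node factor of the criterion, computed in log space from counts on the little A/B data set above. The function name, the uniform prior of 1 per cell, and the dict-of-lists data layout are illustrative assumptions; rows never seen in the data contribute a factor of 1 and can be skipped.

from math import lgamma
from collections import defaultdict

def family_log_score(node, parents, values, data, prior=1.0):
    """log of prod_j Gamma(N_ij)/Gamma(N_ij + M_ij) * prod_k Gamma(a_ijk + s_ijk)/Gamma(a_ijk)."""
    counts = defaultdict(lambda: defaultdict(float))   # counts[row j][value k] = s_ijk
    for datum in data:
        j = tuple(datum[p] for p in parents)
        counts[j][datum[node]] += 1.0
    score, r = 0.0, len(values[node])
    for row in counts.values():
        N_ij = prior * r                                # sum of prior parameters for row j
        M_ij = sum(row.values())                        # sum of learnt additions
        score += lgamma(N_ij) - lgamma(N_ij + M_ij)
        for k in values[node]:
            score += lgamma(prior + row.get(k, 0.0)) - lgamma(prior)
    return score

data = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 1},
        {"A": 1, "B": 0}, {"A": 0, "B": 1}]
values = {"A": (0, 1), "B": (0, 1)}

# log Pr(d | A -> B) = score of A (no parents) + score of B given A.
log_score = family_log_score("A", (), values, data) + \
            family_log_score("B", ("A",), values, data)
print(log_score)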
Strategy: Greedy Hill Climb
• Traverse the state space of graph topologies via a number of operations on a graph (e.g. insert, remove, or reverse an edge).
• At any point, calculate the alteration to the graph's score that any of these moves would make via local scoring of the nodes affected, and choose the best, so long as it is an improvement.
• Stop when no improvements are possible.
• Use multiple restarts to try and avoid local maxima.
• (We could also use simulated annealing.)

Problems
(Simulated annealing has the same problems.)
1. We would like (Markov) equivalence classes of DAGs to get the same score, since they represent the same conditional independencies. Given our criterion, this will not, in general, occur.

Problems
2. We would like each (Markov) equivalence class of DAGs to have the same a priori probability of being that which is chosen. Given our state space (all possible DAGs) and search method (which can get stuck at local maxima) this is not the case, since some equivalence classes have massively more instances than others.

Problems
3. A naïve implementation of the equation given is complex: each node scored would require us to set up the Dirichlets by loading them with the information from the data (linear in the size of the data set) and then multiply the Dirichlets (exponential in the number of parents of the node).

Solution 1: Equivalent Sample Size
• We can ensure that (Markov) equivalent DAGs have the same score by using an equivalent sample size when we set up our Dirichlet priors.
• The sum of the (initial – see Solution 3) prior parameters of all Dirichlet distributions associated with a node must equal a particular number.

Solution 1: Equivalent Sample Size
Non-equivalent sample size:
  – For node A: 1+1 = 2
  – For node B: 1+1+1+1 = 4

  A:        Dir(1, 1)
  B, A=0:   Dir(1, 1)
  B, A=1:   Dir(1, 1)

Solution 1: Equivalent Sample Size
Equivalent sample size of 4:
  – For node A: 2+2 = 4
  – For node B: 1+1+1+1 = 4

  A:        Dir(2, 2)
  B, A=0:   Dir(1, 1)
  B, A=1:   Dir(1, 1)

Solution 1: Equivalent Sample Size
Fresh concerns?
  – Big ESSs lead to the parameters of nodes with no or few parents being resistant to emendation from evidence.
  – Small ESSs lead to nodes with many parents having misleadingly unconservative probabilities in conditional probability rows that occur only once or twice.
It is common to use small sizes – e.g. the number of values of the node with the most values.

Solution 2: DAG Patterns
• We ensure that (Markov) equivalence classes of DAGs have the same a priori probability of being that which is chosen by searching a state space of DAG equivalence classes rather than DAGs.
• This is done by using DAG Patterns, formed of non-reversible and reversible edges. (See the literature for more information.)
• The operations commonly used for traversing the state space of DAG equivalence classes do not permit the use of simulated annealing.
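Returning to the search strategy above, here is one way a basic greedy hill climb over DAGs could be sketched. It is an illustrative sketch only: it rescores whole graphs rather than exploiting local updateability, starts from the empty graph, and reuses the per-node family score from the scoring sketch above (redefined here so the block stands alone).

from math import lgamma
from collections import defaultdict
from itertools import permutations

def family_log_score(node, parents, values, data, prior=1.0):
    """Per-node log-score, as in the earlier sketch."""
    counts = defaultdict(lambda: defaultdict(float))
    for datum in data:
        counts[tuple(datum[p] for p in parents)][datum[node]] += 1.0
    score, r = 0.0, len(values[node])
    for row in counts.values():
        score += lgamma(prior * r) - lgamma(prior * r + sum(row.values()))
        for k in values[node]:
            score += lgamma(prior + row.get(k, 0.0)) - lgamma(prior)
    return score

def total_score(parents, values, data):
    return sum(family_log_score(n, tuple(ps), values, data) for n, ps in parents.items())

def is_acyclic(parents):
    visiting, done = set(), set()
    def visit(n):
        if n in done: return True
        if n in visiting: return False        # back edge: cycle found
        visiting.add(n)
        ok = all(visit(p) for p in parents[n])
        done.add(n)
        return ok
    return all(visit(n) for n in parents)

def neighbours(parents):
    """All graphs reachable by adding, removing, or reversing one edge."""
    for x, y in permutations(parents, 2):
        g = {n: set(ps) for n, ps in parents.items()}
        if x in g[y]:
            g[y].discard(x); yield g                      # remove x -> y
            g2 = {n: set(ps) for n, ps in g.items()}
            g2[x].add(y); yield g2                        # reverse to y -> x
        else:
            g[y].add(x); yield g                          # add x -> y

def hill_climb(values, data):
    parents = {n: set() for n in values}                  # start from the empty graph
    best = total_score(parents, values, data)
    improved = True
    while improved:
        improved = False
        for g in neighbours(parents):
            if not is_acyclic(g):
                continue
            s = total_score(g, values, data)
            if s > best:                                  # keep the best improving move
                parents, best, improved = g, s, True
    return parents, best

On the tiny A/B data set used above, this simply compares the empty graph with A → B and B → A and keeps whichever scores higher; in practice one would add the multiple restarts mentioned earlier.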
Solution 3: Tractable calculations
• The naïve implementation of the scoring criterion can be improved. We note:
  1. The result of the calculation is the same if we add the evidence one datum at a time or all at once:

     \frac{\Gamma(2)}{\Gamma(2+2)} \cdot \frac{\Gamma(1+2)}{\Gamma(1)} \cdot \frac{\Gamma(1+0)}{\Gamma(1)} = 2/6

     \left( \frac{\Gamma(2)}{\Gamma(2+1)} \cdot \frac{\Gamma(1+1)}{\Gamma(1)} \cdot \frac{\Gamma(1+0)}{\Gamma(1)} \right) \left( \frac{\Gamma(3)}{\Gamma(3+1)} \cdot \frac{\Gamma(2+1)}{\Gamma(2)} \cdot \frac{\Gamma(1+0)}{\Gamma(1)} \right) = (1/2)(4/6) = 2/6

  2. Each Dirichlet distribution that includes only the (original) prior parameters contributes a factor of 1:

     \prod_{j=1}^{q_i^{(G)}} \frac{\Gamma(N_{ij}^{(G)})}{\Gamma(N_{ij}^{(G)})} \prod_{k=1}^{r_i} \frac{\Gamma(a_{ijk}^{(G)})}{\Gamma(a_{ijk}^{(G)})} = 1

Solution 3: Tractable calculations
• So the naïve implementation of the scoring criterion can be improved:
  1. Set the node's score to 1.
  2. For each datum, add the datum to the relevant Dirichlet, and multiply the score by:

     \frac{\Gamma(N_{ij}^{(G)})}{\Gamma(N_{ij}^{(G)} + M_{ij}^{(G)})} \prod_{k=1}^{r_i} \frac{\Gamma(a_{ijk}^{(G)} + s_{ijk}^{(G)})}{\Gamma(a_{ijk}^{(G)})}

     where, if this parent combination has been encountered before, N and each a are now the updated priors. (For a single datum, M_ij = 1 and only the observed value's s_ijk is 1, so this factor is simply a_ijk / N_ij.)

Complexity is linear in the size of the data set.
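The one-datum-at-a-time identity is easy to check numerically. A minimal sketch matching the worked example above: a binary node with uniform prior Dir(1, 1) and two observations of the same value.

from math import gamma

# All at once: prior Dir(1, 1), then two observations of the node's first value.
prior, counts = [1, 1], [2, 0]
N, M = sum(prior), sum(counts)
all_at_once = (gamma(N) / gamma(N + M)) * \
              (gamma(prior[0] + counts[0]) / gamma(prior[0])) * \
              (gamma(prior[1] + counts[1]) / gamma(prior[1]))

# One datum at a time: after each datum the prior is updated before the next.
def add_datum(params, value):
    n = sum(params)
    factor = params[value] / n          # Gamma(n)/Gamma(n+1) * Gamma(a+1)/Gamma(a)
    params[value] += 1
    return factor

params, score = [1, 1], 1.0
for value in (0, 0):                    # the same two observations, added sequentially
    score *= add_datum(params, value)

print(all_at_once, score)               # both 1/3 (= 2/6)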