NOTES on Cooper & Herskovits 1992, Machine Learning: Bayesian network scoring

Notation
$c_i \in \{v_{i1},\dots,v_{ir_i}\}$, the possible values of node $c_i$
$q_i$ = number of configurations $w_{ij}$ of $\pi_i$, the parents of $c_i$
$D$ = the data $= \{c_{ih}\}$; $h$ indexes cases
$BP_{ijk} = \Pr(c_i = v_{ik} \mid \pi_i = w_{ij})$ = the model
$N_{ijk} = \#\{h : c_{ih} = v_{ik} \text{ and } \pi_i = w_{ij} \text{ in case } h\}$

To make it simple, consider just a single variable $i$ and a single parent configuration $j$. The number of values is $r_i \equiv K$, the data are $(N_{ij1},\dots,N_{ijr_i}) \equiv (n_1,\dots,n_K)$, multinomial with sample size $N_{ij} \equiv n$ and parameter vector $(BP_{ij1},\dots,BP_{ijr_i}) \equiv (p_1,\dots,p_K)$. The marginal probability of the data is Dirichlet-multinomial. When $K = r_i = 2$, the marginal of $n_1$ is beta-binomial, and with a flat prior it is discrete uniform on $\{0,\dots,n\}$; of course $n_2 = n - n_1$.

Generally, writing $\mathrm{Dir}(a_1,\dots,a_K) = \prod_k \Gamma(a_k)/\Gamma(\sum_k a_k)$ for the Dirichlet normalizing constant, so that the flat prior density is $1/\mathrm{Dir}(1,\dots,1) = (K-1)!$,

$$
\begin{aligned}
\Pr(n_1,\dots,n_K) &= \int \Pr(n_1,\dots,n_K \mid p_1,\dots,p_K)\,\pi(p_1,\dots,p_K)\,dp \\
&= \int \frac{n!}{\prod_k n_k!}\,\prod_k p_k^{n_k} \cdot \frac{1}{\mathrm{Dir}(1,\dots,1)}\,dp \\
&= \frac{n!}{\prod_k n_k!} \cdot \frac{\mathrm{Dir}(n_1+1,\dots,n_K+1)}{\mathrm{Dir}(1,\dots,1)} \\
&= \frac{n!}{\prod_k n_k!} \cdot \frac{\prod_k \Gamma(n_k+1)}{\Gamma(K+n)} \cdot (K-1)! \\
&= \frac{n!\,(K-1)!}{(K+n-1)!} \qquad\qquad (1)
\end{aligned}
$$

This is the reciprocal of the number of distinct count vectors for the multinomial, $\binom{n+K-1}{K-1}$. Thus the probability is uniform across all count vectors. This we confirm by simulation (a sketch of the check is at the end of these notes). For $K=2$, this is $1/(n+1)$, a well-known result for the beta-binomial.

(1) is for a specific ordered count vector, where you know the count for each label $v_{ik}$. The C-H formula is

$$
\frac{\prod_k n_k!\,(K-1)!}{(K+n-1)!} \qquad\qquad (2)
$$

which you obtain by omitting the multinomial coefficient. (2) is the probability of one specific case-by-case data set; if you know only the counts and not the order of the cases, the number of data sets consistent with those counts is $n!/\prod_k n_k!$. If the different data sets leading to the same count vector are counted as 1 instead of $n!/\prod_k n_k!$, then indeed you divide (1) by $n!/\prod_k n_k!$ and get agreement with C-H.

This is similar to Bose-Einstein statistics, in which the indistinguishability of bosons means that, for example, two photons distributed over two energy levels have probability distribution Pr(HL) = Pr(HH) = Pr(LL) = 1/3 instead of Pr(HL) = 1/2, Pr(HH) = Pr(LL) = 1/4.

Demonstration that (1) is correct by simulation is seen in Cooper,Herskovits.html.

Going from A4 to A5: To simplify, take $n=1$ (a single variable), $q_i=1$, and $r_i=2$. A4 amounts to

$$
\prod_{h=1}^{m} p^{d_h}(1-p)^{1-d_h} = p^{\sum_h d_h}(1-p)^{m-\sum_h d_h},
$$

corresponding to a single data assignment. A5 is the same.

Near the bottom of p. 33, C-H distinguish between $D$ and $D'$, the ordered and unordered data results. So I think everything's fine, since they deal with $D$ and I am dealing with $D'$.

Next issue: For each parent configuration $w$, the probability of the data, conditioning on the parent but marginalizing over the probability table, will be uniform on count vectors $D'$, as we have shown. In terms of $D$, the probability will be lower for count vectors with lots of multiplicity, i.e. where $n!/\prod_k n_k!$ is big. For example, for 8 Bernoulli observations, the probability is highest for all heads or all tails, and smallest for 4 of each, regardless of order. The process penalizes impure nodes. It seems very close to classification trees.

(Note: The uniform Dirichlet density is $(K-1)! = 1/\mathrm{Dir}(1,\dots,1)$, the reciprocal of the volume $1/(K-1)!$ of the simplex in the coordinates $(p_1,\dots,p_{K-1})$. The geometric volume of the simplex is NOT $1/(K-1)!$. For example, for $K=2$, the 1-simplex has length $\sqrt{2}$, but the density is 1, not $1/\sqrt{2}$. This is because of the Jacobian transforming between $(p_1,p_2)$ and $(p_1-p_2,\,p_1+p_2)$, yielding $\sqrt{2}$. For $K=3$, the 2-simplex has base $\sqrt{2}$ and height $\sqrt{(\sqrt{2})^2-(\sqrt{2}/2)^2}=\sqrt{3/2}$, so area $\tfrac{1}{2}\cdot\sqrt{2}\cdot\sqrt{3/2}=\sqrt{3}/2$, not $1/2$. Jacobian again!)
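
Below is a minimal Monte Carlo sketch of the uniformity check for (1). This is my own illustration, not the code behind Cooper,Herskovits.html; it assumes numpy and scipy, and the function name marginal_counts is mine. Draw $p$ from a flat Dirichlet, draw multinomial counts given $p$, and compare the empirical frequency of each count vector with $n!\,(K-1)!/(K+n-1)! = 1/\binom{n+K-1}{K-1}$.

```python
# Sketch (mine): check that the flat-prior Dirichlet-multinomial marginal (1)
# is uniform over count vectors. Assumes numpy and scipy are available.
import numpy as np
from scipy.special import comb
from collections import Counter

rng = np.random.default_rng(0)

def marginal_counts(K, n):
    """Closed form (1): Pr(n_1,...,n_K) = n!(K-1)!/(K+n-1)! for every count vector."""
    return 1.0 / comb(n + K - 1, K - 1)

K, n, reps = 3, 6, 200_000
draws = Counter()
for _ in range(reps):
    p = rng.dirichlet(np.ones(K))      # flat Dirichlet prior on (p_1,...,p_K)
    counts = rng.multinomial(n, p)     # multinomial data given p
    draws[tuple(counts)] += 1

print("distinct count vectors:", int(comb(n + K - 1, K - 1)))   # 28 for K=3, n=6
print("theoretical prob each :", marginal_counts(K, n))         # 1/28 ~ 0.0357
for cv, freq in sorted(draws.items())[:6]:
    print(cv, round(freq / reps, 4))   # each empirical frequency should be near 1/28
```

The $K=2$ case reduces to the stated $1/(n+1)$: with $K=2$ the same loop hits each of the $n+1$ possible values of $n_1$ about equally often.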
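
A quick numeric check of the relation between (1) and (2), and of the 8-observation Bernoulli example above. Again a sketch of mine (the function names are not from C-H): the per-data-set probability (2) times the multiplicity $n!/\prod_k n_k!$ should reproduce the uniform count-vector probability (1).

```python
# Sketch (mine): D vs D' for K = 2, n = 8 Bernoulli cases.
# (2) is the probability of one specific case-by-case data set D;
# (1) is the probability of the count vector D'.
from math import factorial, prod

def prob_count_vector(counts):     # formula (1)
    n, K = sum(counts), len(counts)
    return factorial(n) * factorial(K - 1) / factorial(K + n - 1)

def prob_one_dataset(counts):      # formula (2), the C-H term for one variable/parent config
    n, K = sum(counts), len(counts)
    return prod(factorial(c) for c in counts) * factorial(K - 1) / factorial(K + n - 1)

def multiplicity(counts):          # number of case orderings with these counts
    return factorial(sum(counts)) // prod(factorial(c) for c in counts)

for counts in [(8, 0), (6, 2), (4, 4)]:
    p1, p2, m = prob_count_vector(counts), prob_one_dataset(counts), multiplicity(counts)
    print(counts, "P(D') = %.4f" % p1, "P(D) = %.6f" % p2, "mult = %d" % m,
          "(2)*mult == (1):", abs(p2 * m - p1) < 1e-12)
# P(D') is 1/9 for every count vector; P(D) is largest for (8,0) and (0,8)
# and smallest for (4,4) -- the 'penalize impure nodes' effect.
```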
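
Finally, a small numeric illustration of the simplex-volume note. The general geometric-volume formula $\sqrt{K}/(K-1)!$ used below is my addition (it matches the $K=2$ and $K=3$ cases worked out above): the flat Dirichlet density $(K-1)!$ is the reciprocal of the coordinate-space volume of the simplex, not of its geometric volume.

```python
# Sketch (mine): coordinate-space volume 1/(K-1)! vs geometric volume sqrt(K)/(K-1)!
# of the probability simplex, and the flat Dirichlet density (K-1)!.
from math import factorial, sqrt

for K in (2, 3, 4):
    coord_vol = 1 / factorial(K - 1)       # volume in the (p_1,...,p_{K-1}) coordinates
    geom_vol = sqrt(K) / factorial(K - 1)  # volume of the embedded simplex (Jacobian factor sqrt(K))
    density = factorial(K - 1)             # flat Dirichlet density = 1/Dir(1,...,1)
    print(f"K={K}: coord vol = {coord_vol:.4f}, geometric vol = {geom_vol:.4f}, "
          f"density = {density}, density * coord vol = {density * coord_vol:.4f}")
# K=2: geometric length sqrt(2) ~ 1.4142 vs coordinate length 1;
# K=3: geometric area sqrt(3)/2 ~ 0.8660 vs coordinate area 1/2.
```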