CS B553 Homework 5: Bayesian Network Modeling and Inference
Due date: 3/8/2012

Your boss holds a meeting and assigns you to a project that requires you to apply numerical optimization. Being an overpaid MBA type, she gives instructions whose scope and clarity are exceedingly vague: "optimize our strategic partnership throughput to maximize synergies," or some other nonsense. Consider devising a Bayesian network for recommending the scope of the project and an optimization algorithm to solve it.

1. Draw a Bayesian network with the variables:
- Algorithm, with domain {GD, QN, SA, GA}, indicating gradient descent, quasi-Newton, simulated annealing, and genetic algorithms.
- ProblemSize, with domain {Small, Medium, Large}.
- ObjectiveType, with domain {Convex, Nonconvex, Nonsmooth}.
- Derivatives, a binary variable indicating whether derivatives of the objective function are available.
- LocalMinima, with domain {One, Few, Many}.
- Speed, with domain {Slow, Medium, Fast}, indicating the speed of the algorithm.
- Optimality, a binary variable indicating whether a global optimum will be reached.

ProblemSize, ObjectiveType, and Algorithm should have no parents. Derivatives should have the parent ObjectiveType. LocalMinima should have the parents ProblemSize and ObjectiveType, because convex problems have one minimum, and nonconvex/nonsmooth problems in high-dimensional spaces often have more minima than in low-dimensional ones. Speed should have parents Algorithm, Derivatives, and ProblemSize. Optimality should have parents Algorithm, LocalMinima, and ObjectiveType.

List the CPTs that need to be entered. How many free parameters are there in this network? How many free parameters would be needed in a joint distribution table?

2. Enter some reasonable values for the CPTs given your knowledge of optimization algorithms. These values should be consistent with facts learned in class.
For example, without derivative information, GD and QN methods are slower than when derivative information is available, because finite differencing is somewhat costly. Deterministic quantities should be specified with certainty; for example, nonsmooth objectives will not have derivatives available, and convex problems will have one local minimum. When you have no reasonable background information to determine a variable's distribution, set its conditional distribution to be uniform.

3. Show the steps that you would take to calculate the unconditional distribution over Speed by reasoning with the joint distribution. Show the steps that you would take to calculate the unconditional distribution over Algorithm. (Do this symbolically; do not substitute the numeric values you supplied in question 2.)

4. Suppose you ask your boss about the size of the problem. She replies that it's a small problem – no more than a dozen variables. Furthermore, she overheard you mentioning genetic algorithms, and she asks you to use them – they sure do sound good. Show the symbolic steps that variable elimination would take in order to calculate the distribution over Speed given this new information. (Use the best variable ordering that you can find.)

5. Your boss indicates that it is more important to find a solution quickly than to achieve global optimality. Propose a method for selecting the algorithm that is most likely to solve the problem with Speed=Fast. You also bring up the point that it may be important to spend a week's worth of effort calculating the derivatives of the objective function, and your boss asks you, "How likely is it that this work would improve the optimization outcomes?" Describe how you would calculate the increase in the probability that Speed=Fast supposing that derivative information were available.

6.
Since your boss is known to have only an imperfect notion of technical matters, you suspect that the information she tells you during this meeting may be incorrect, or that the requirements may change halfway through the project. Describe how you would change the network in order to represent the imperfect knowledge about ProblemSize and Speed given in problems 4 and 5. (The resulting model should still be a rigorous probabilistic representation.)

[Figure: the Bayesian network, with nodes Alg, ObjType, PSize, Deriv, LMin, Opt, and Speed]

1. There must be 7 CPTs entered: P(Alg), P(ObjType), P(PSize), P(Deriv|ObjType), P(LMin|ObjType,PSize), P(Opt|Alg,LMin,ObjType), and P(Speed|Alg,Deriv,PSize). These have (4-1) + (3-1) + (3-1) + 3*(2-1) + 3*3*(3-1) + 3*4*3*(2-1) + 4*2*3*(3-1) = 3+2+2+3+18+36+48 = 112 free parameters. A joint distribution table would need 4*3*3*2*3*3*2 - 1 = 1295 parameters.

2. Some reasonable values might be as follows:

P(Alg) = uniform
P(ObjType) = uniform
P(PSize) = uniform

P(Deriv=T|ObjType=Convex) = 0.75
P(Deriv=T|ObjType=Nonconvex) = 0.5
P(Deriv=T|ObjType=Nonsmooth) = 0

(Vectors below are over [One, Few, Many].)
P(LMin=One|ObjType=Convex,PSize) = 1.0
P(LMin|ObjType≠Convex,PSize=Small) = [0.2,0.6,0.2]
P(LMin|ObjType≠Convex,PSize=Medium) = [0.1,0.6,0.3]
P(LMin|ObjType≠Convex,PSize=Large) = [0.01,0.49,0.5]

P(Opt=T|LMin=One,Alg,ObjType={Convex or Nonconvex}) = 0.99
P(Opt=T|LMin=Few,Alg={GD,QN},ObjType={Convex or Nonconvex}) = 0.3
P(Opt=T|LMin=Few,Alg={SA,GA},ObjType={Convex or Nonconvex}) = 0.7
P(Opt=T|LMin=Many,Alg={GD,QN},ObjType={Convex or Nonconvex}) = 0.1
P(Opt=T|LMin=Many,Alg={SA,GA},ObjType={Convex or Nonconvex}) = 0.3
P(Opt=T|LMin=One,Alg={GD,QN},ObjType=Nonsmooth) = 0.5
P(Opt=T|LMin=One,Alg={SA,GA},ObjType=Nonsmooth) = 0.95
P(Opt=T|LMin=Few,Alg={GD,QN},ObjType=Nonsmooth) = 0.15
P(Opt=T|LMin=Few,Alg={SA,GA},ObjType=Nonsmooth) = 0.6
P(Opt=T|LMin=Many,Alg={GD,QN},ObjType=Nonsmooth) = 0.05
P(Opt=T|LMin=Many,Alg={SA,GA},ObjType=Nonsmooth) = 0.2

(Vectors below are over [Slow, Medium, Fast].)
P(Speed|Alg={SA,GA},Deriv,PSize=Small) = [0.7,0.3,0.0]
P(Speed|Alg={SA,GA},Deriv,PSize=Medium) = [0.9,0.1,0.0]
P(Speed|Alg={SA,GA},Deriv,PSize=Large) =
[0.999,0.001,0.0]
P(Speed|Alg=GD,Deriv=T,PSize=Small) = [0.2,0.3,0.5]
P(Speed|Alg=GD,Deriv=F,PSize=Small) = [0.3,0.4,0.3]
P(Speed|Alg=GD,Deriv=T,PSize=Medium) = [0.3,0.4,0.3]
P(Speed|Alg=GD,Deriv=F,PSize=Medium) = [0.4,0.5,0.1]
P(Speed|Alg=GD,Deriv=T,PSize=Large) = [0.8,0.19,0.01]
P(Speed|Alg=GD,Deriv=F,PSize=Large) = [0.9,0.099,0.001]
P(Speed|Alg=QN,Deriv=T,PSize=Small) = [0.2,0.3,0.5]
P(Speed|Alg=QN,Deriv=F,PSize=Small) = [0.3,0.4,0.3]
P(Speed|Alg=QN,Deriv=T,PSize=Medium) = [0.3,0.4,0.3]
P(Speed|Alg=QN,Deriv=F,PSize=Medium) = [0.4,0.5,0.1]
P(Speed|Alg=QN,Deriv=T,PSize=Large) = [0.8,0.19,0.01]
P(Speed|Alg=QN,Deriv=F,PSize=Large) = [0.9,0.099,0.001]

(General rules: larger problems are more likely to be slow. For the descent-based methods GD and QN, derivative information is likely to speed up the solver. QN is in general faster than GD, except perhaps for very large problems, for which its O(N^2) cost per iteration is prohibitive.)

3. P(S) = Σ_{a,ot,p,d,l,o} P(a,ot,p,d,l,o,S)   (marginalization)
= Σ_{a,ot,p,d,l,o} P(a) P(ot) P(p) P(d|ot) P(l|ot,p) P(o|l,ot) P(S|a,d,p)   (chain rule for BNs)
= Σ_{a,ot,p,d} P(a) P(d|ot) P(p) P(S|a,d,p) P(ot) Σ_l P(l|ot,p) Σ_o P(o|l,ot)   (pulling out o and l)
= Σ_{a,ot,p,d} P(a) P(d|ot) P(p) P(S|a,d,p) P(ot) * 1   (the inner sums equal 1)
= Σ_{a,p,d} P(a) P(p) P(S|a,d,p) Σ_{ot} P(ot) P(d|ot)   (pulling out ot)

Next, compute P(d) = Σ_{ot} P(ot) P(d|ot). Finally, compute P(S) = Σ_{a,p,d} P(a) P(p) P(S|a,d,p) P(d).

P(A) = Σ_{ot,p,d,l,o,s} P(A,ot,p,d,l,o,s)   (marginalization)
= Σ_{ot,p,d,l,o,s} P(A) P(ot) P(p) P(d|ot) P(l|ot,p) P(o|l,ot) P(s|A,d,p)   (chain rule for BNs)
= P(A) Σ_{ot,p,d,l,o,s} P(ot) P(p) P(d|ot) P(l|ot,p) P(o|l,ot) P(s|A,d,p)   (pulling P(A) out of the summation)
= P(A)   (the summation is 1)

4.
P(S,A=GA,P=Small) = Σ_{ot,d,l,o} P(A=GA) P(ot) P(P=Small) P(d|ot) P(l|ot,P=Small) P(o|l,ot) P(S|A=GA,d,P=Small)
= P(A=GA) P(P=Small) Σ_{ot,d,l,o} P(ot) P(d|ot) P(l|ot,P=Small) P(o|l,ot) P(S|A=GA,d,P=Small)
= P(A=GA) P(P=Small) Σ_d P(S|A=GA,d,P=Small) Σ_{ot} P(ot) P(d|ot) Σ_l P(l|ot,P=Small) Σ_o P(o|l,ot)

First, O would be eliminated, which involves summing over P(o|l,ot), giving a factor f1 over (L,OT) with constant value 1. Next, L would be eliminated by summing over f1(l,ot) P(l|ot,P=Small), giving a factor f2 over OT. Third, OT would be eliminated via a three-term sum of products, and finally D. Normalizing the result over S then gives the desired conditional distribution P(S|A=GA,P=Small).

5. One reasonable method would be to find P(S=Fast|A) for all values of A and pick the algorithm that maximizes this probability. To calculate the increase in probability given derivative information, compute the difference between P(S=Fast|A,D=T) and P(S=Fast|A,D=F) for the chosen algorithm A.

6. A convenient way to do this is to introduce auxiliary variables BossProblemSize and BossSpeed as new nodes in the network, with arcs from ProblemSize and Speed respectively. They represent the boss's statements about the requirements, whereas ProblemSize and Speed are the true requirements of the project. The uncertainty in the boss's knowledge is encoded in the CPTs P(BossProblemSize|ProblemSize) and P(BossSpeed|Speed). The evidence in your probability queries is then placed not on ProblemSize and Speed, but rather on these auxiliary variables. [Note: I realize this question is ambiguous, and other methods might work as well.]
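Since every query above reduces to sums over a small discrete joint, the answers can be sanity-checked numerically. The sketch below is illustrative helper code, not part of the assignment: it hard-codes the CPT values proposed in question 2 (the Opt CPT is omitted, since Opt is a leaf that sums out of every query used here, and the Slow entry of the GD/QN, Deriv=F, Small row is taken to be 0.3 so that the row sums to 1) and answers the question 5 queries by brute-force enumeration.

```python
from itertools import product

ALGS   = ["GD", "QN", "SA", "GA"]
OTS    = ["Convex", "Nonconvex", "Nonsmooth"]
SIZES  = ["Small", "Medium", "Large"]
LMINS  = ["One", "Few", "Many"]
SPEEDS = ["Slow", "Medium", "Fast"]

# Question 1 check: free parameters in the network vs. in a full joint table.
N_NETWORK = 3 + 2 + 2 + 3*1 + 3*3*2 + 3*4*3*1 + 4*2*3*2   # 112
N_JOINT   = 4*3*3*2*3*3*2 - 1                             # 1295

# P(Deriv=T | ObjType), from question 2.
P_DERIV = {"Convex": 0.75, "Nonconvex": 0.5, "Nonsmooth": 0.0}

def p_lmin(l, ot, size):
    """P(LMin=l | ObjType=ot, PSize=size)."""
    if ot == "Convex":                    # convex => exactly one minimum
        return 1.0 if l == "One" else 0.0
    rows = {"Small":  [0.2, 0.6, 0.2],
            "Medium": [0.1, 0.6, 0.3],
            "Large":  [0.01, 0.49, 0.5]}
    return rows[size][LMINS.index(l)]

def p_speed(s, alg, deriv, size):
    """P(Speed=s | Algorithm=alg, Derivatives=deriv, PSize=size)."""
    if alg in ("SA", "GA"):               # derivative-free methods
        rows = {"Small":  [0.7, 0.3, 0.0],
                "Medium": [0.9, 0.1, 0.0],
                "Large":  [0.999, 0.001, 0.0]}
    elif deriv:
        rows = {"Small":  [0.2, 0.3, 0.5],
                "Medium": [0.3, 0.4, 0.3],
                "Large":  [0.8, 0.19, 0.01]}
    else:                                 # Slow entry for Small assumed 0.3
        rows = {"Small":  [0.3, 0.4, 0.3],
                "Medium": [0.4, 0.5, 0.1],
                "Large":  [0.9, 0.099, 0.001]}
    return rows[size][SPEEDS.index(s)]

def query(speed, alg=None, size=None, deriv=None):
    """P(Speed=speed | given evidence), by enumerating the joint."""
    num = den = 0.0
    for a, ot, sz, d, l, s in product(ALGS, OTS, SIZES,
                                      [True, False], LMINS, SPEEDS):
        if alg is not None and a != alg:
            continue
        if size is not None and sz != size:
            continue
        if deriv is not None and d != deriv:
            continue
        pd = P_DERIV[ot] if d else 1.0 - P_DERIV[ot]
        p = (1/4) * (1/3) * (1/3) * pd * p_lmin(l, ot, sz) * p_speed(s, a, d, sz)
        den += p
        if s == speed:
            num += p
    return num / den

# Question 5: pick the algorithm most likely to be Fast on a Small problem,
# then measure what a week of derivative work would buy for that choice.
best = max(ALGS, key=lambda a: query("Fast", alg=a, size="Small"))
gain = (query("Fast", alg=best, size="Small", deriv=True)
        - query("Fast", alg=best, size="Small", deriv=False))
```

With these numbers, GD and QN tie for the best chance of Speed=Fast on a small problem (SA and GA never finish fast under the question 2 CPTs), and supplying derivatives raises P(Speed=Fast) for the chosen method from 0.3 to 0.5.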
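The question 6 construction can also be made concrete. In the sketch below (purely illustrative; the confusion-matrix numbers are invented, not part of the assignment), a noisy observation node BossPSize replaces hard evidence on ProblemSize, and the posterior over the true size given the boss's statement follows from Bayes' rule:

```python
SIZES = ["Small", "Medium", "Large"]
P_PSIZE = {s: 1/3 for s in SIZES}         # uniform prior, as in question 2

# P(BossPSize = said | PSize = true): hypothetical CPT in which the boss
# reports the true size most of the time but is sometimes off by a category.
P_BOSS = {"Small":  {"Small": 0.8,  "Medium": 0.15, "Large": 0.05},
          "Medium": {"Small": 0.15, "Medium": 0.7,  "Large": 0.15},
          "Large":  {"Small": 0.05, "Medium": 0.15, "Large": 0.8}}

def posterior_psize(said):
    """P(PSize | BossPSize=said), by Bayes' rule with normalization."""
    unnorm = {t: P_PSIZE[t] * P_BOSS[t][said] for t in SIZES}
    z = sum(unnorm.values())
    return {t: v / z for t, v in unnorm.items()}
```

Downstream queries would then condition on BossPSize="Small" (i.e., weight each true ProblemSize by this posterior) rather than fixing ProblemSize=Small outright.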