CS B553 Homework 5: Bayesian Network Modeling and Inference
Due date: 3/8/2012
Your boss holds a meeting and assigns you to a project that requires you to apply numerical
optimization. Being an overpaid MBA type, she gives instructions that are exceedingly vague about the
scope of the project: “optimize our strategic partnership throughput to maximize synergies,” or some
other nonsense. Consider devising a Bayesian network for reasoning about the scope of the project and
for recommending an optimization algorithm to solve it.
1. Draw a Bayesian network with the variables:
- Algorithm with domain {GD,QN,SA,GA}, indicating gradient descent, quasi-Newton,
simulated annealing, and genetic algorithms.
- ProblemSize with domain {Small,Medium,Large}.
- ObjectiveType with domain {Convex,Nonconvex,Nonsmooth}.
- A binary variable Derivatives indicating whether derivatives of the objective
function are available.
- LocalMinima with domain {One,Few,Many}.
- Speed with domain {Slow,Medium,Fast} indicating the speed of the algorithm.
- A binary variable Optimality indicating whether a global optimum will be reached.
ProblemSize, ObjectiveType, and Algorithm should have no parents. Derivatives should have the
parent ObjectiveType. LocalMinima should have the parents ProblemSize and ObjectiveType,
because convex problems have one minimum, and nonconvex/nonsmooth problems in
high-dimensional spaces often have more minima than in low-dimensional ones. Speed should have
parents Algorithm, Derivatives, and ProblemSize. Optimality should have parents Algorithm,
LocalMinima, and ObjectiveType.
List the CPTs that need to be entered. How many free parameters are there in this network?
How many free parameters would be needed in a joint distribution table?
2. Enter in some reasonable values for the CPTs given your knowledge of optimization algorithms.
These values should be consistent with facts learned in class. For example, without derivative
information, GD and QN methods are slower than when derivative information is available,
because finite differencing is somewhat costly.
Deterministic quantities should be specified with certainty. For example, nonsmooth objectives
will not have derivatives available, and convex problems will have one local minimum.
When you have no reasonable background information to determine a variable’s value, set its
conditional distribution to be uniform.
3. Show the steps that you would take to calculate the unconditional distribution over Speed by
reasoning with the joint distribution. Show the steps that you would take to calculate the
unconditional distribution over Algorithm. (Do this symbolically; do not substitute the numeric
values you supplied in question 2.)
4. Suppose you ask your boss about the size of the problem. She replies that it’s a small problem –
no more than a dozen variables. Furthermore, she overheard you mentioning genetic
algorithms, and she asks you to use them – they sure do sound good. Show the symbolic steps that
variable elimination would take in order to calculate the distribution over Speed given this new
information. (Use the best variable ordering that you can find.)
5. Your boss indicates that it is more important to find a solution quickly than to achieve global
optimality. Propose a method for selecting the algorithm that is most likely to solve the
problem with Speed=Fast. You also bring up the point that it may be worth spending a
week’s worth of effort calculating the derivatives of the objective function, and your boss asks
you, “How much would this work improve the optimization outcomes?” Describe how you would
calculate the increase in the probability that Speed=Fast supposing that derivative information were
available.
6. Since your boss is known to have only an imperfect notion of technical matters, you suspect that
the information she tells you during this meeting may be incorrect, or the requirements may
change halfway through the project. Describe how you would change the network in order to
represent the imperfect knowledge about ProblemSize and Speed given in problems 4 and 5.
(The resulting model should still be a rigorous probabilistic representation.)
[Figure: hand-drawn Bayesian network with nodes Alg, ObjType, PSize, Deriv, LMin, Opt, and Speed,
connected with the arcs described in the answer to question 1 below.]
1.
There must be 7 CPTs entered: P(Alg), P(ObjType), P(PSize), P(Deriv|ObjType), P(LMin|ObjType,PSize),
P(Opt|Alg,LMin,ObjType), and P(Speed|Alg,Deriv,PSize).
These have (4-1) + (3-1) + (3-1) + 3·(2-1) + 3·3·(3-1) + 4·3·3·(2-1) + 4·2·3·(3-1) = 3+2+2+3+18+36+48 = 112
free parameters. A joint distribution table would need 4·3·3·2·3·3·2 − 1 = 1295 parameters.
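The counting above can be checked mechanically. A small sketch (domain sizes taken from the problem statement; each CPT contributes (#parent configurations) × (|domain| − 1) free parameters):

```python
# Domain sizes from the problem statement.
ALG, OT, PS, DERIV, LMIN, SPEED, OPT = 4, 3, 3, 2, 3, 3, 2

# Free parameters per CPT: (#parent configurations) * (|domain| - 1).
cpt_params = {
    "P(Alg)":                   ALG - 1,
    "P(ObjType)":               OT - 1,
    "P(PSize)":                 PS - 1,
    "P(Deriv|ObjType)":         OT * (DERIV - 1),
    "P(LMin|ObjType,PSize)":    OT * PS * (LMIN - 1),
    "P(Opt|Alg,LMin,ObjType)":  ALG * LMIN * OT * (OPT - 1),
    "P(Speed|Alg,Deriv,PSize)": ALG * DERIV * PS * (SPEED - 1),
}
network_total = sum(cpt_params.values())
# A full joint table needs one entry per assignment, minus one for normalization.
joint_total = ALG * OT * PS * DERIV * LMIN * SPEED * OPT - 1
print(network_total, joint_total)  # 112 1295
```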
2. Some reasonable values might be as follows:
- P(Alg) = uniform
- P(ObjType) = uniform
- P(PSize) = uniform
- P(Deriv=T|ObjType=Convex) = 0.75, P(Deriv=T|ObjType=Nonconvex) = 0.5,
  P(Deriv=T|ObjType=Nonsmooth) = 0
- P(LMin|ObjType,PSize), with distributions listed in the order [One, Few, Many]:
  P(LMin=One|ObjType=Convex,PSize) = 1.0
  P(LMin|ObjType!=Convex,PSize=Small) = [0.2, 0.6, 0.2]
  P(LMin|ObjType!=Convex,PSize=Medium) = [0.1, 0.6, 0.3]
  P(LMin|ObjType!=Convex,PSize=Large) = [0.01, 0.49, 0.5]
- P(Opt=T|Alg,LMin,ObjType):
  P(Opt|LMin=One,Alg,ObjType={Convex,Nonconvex}) = 0.99
  P(Opt|LMin=Few,Alg={GD,QN},ObjType={Convex,Nonconvex}) = 0.3
  P(Opt|LMin=Few,Alg={SA,GA},ObjType={Convex,Nonconvex}) = 0.7
  P(Opt|LMin=Many,Alg={GD,QN},ObjType={Convex,Nonconvex}) = 0.1
  P(Opt|LMin=Many,Alg={SA,GA},ObjType={Convex,Nonconvex}) = 0.3
  P(Opt|LMin=One,Alg={GD,QN},ObjType=Nonsmooth) = 0.5
  P(Opt|LMin=One,Alg={SA,GA},ObjType=Nonsmooth) = 0.95
  P(Opt|LMin=Few,Alg={GD,QN},ObjType=Nonsmooth) = 0.15
  P(Opt|LMin=Few,Alg={SA,GA},ObjType=Nonsmooth) = 0.6
  P(Opt|LMin=Many,Alg={GD,QN},ObjType=Nonsmooth) = 0.05
  P(Opt|LMin=Many,Alg={SA,GA},ObjType=Nonsmooth) = 0.2
- P(Speed|Alg,Deriv,PSize), with distributions listed in the order [Slow, Medium, Fast]:
  P(Speed|Alg={SA,GA},Deriv,PSize=Small) = [0.7, 0.3, 0.0]
  P(Speed|Alg={SA,GA},Deriv,PSize=Medium) = [0.9, 0.1, 0.0]
  P(Speed|Alg={SA,GA},Deriv,PSize=Large) = [0.999, 0.001, 0.0]
  P(Speed|Alg=GD,Deriv=T,PSize=Small) = [0.2, 0.3, 0.5]
  P(Speed|Alg=GD,Deriv=F,PSize=Small) = [0.3, 0.4, 0.3]
  P(Speed|Alg=GD,Deriv=T,PSize=Medium) = [0.3, 0.4, 0.3]
  P(Speed|Alg=GD,Deriv=F,PSize=Medium) = [0.4, 0.5, 0.1]
  P(Speed|Alg=GD,Deriv=T,PSize=Large) = [0.8, 0.19, 0.01]
  P(Speed|Alg=GD,Deriv=F,PSize=Large) = [0.9, 0.099, 0.001]
  P(Speed|Alg=QN,Deriv=T,PSize=Small) = [0.2, 0.3, 0.5]
  P(Speed|Alg=QN,Deriv=F,PSize=Small) = [0.3, 0.4, 0.3]
  P(Speed|Alg=QN,Deriv=T,PSize=Medium) = [0.3, 0.4, 0.3]
  P(Speed|Alg=QN,Deriv=F,PSize=Medium) = [0.4, 0.5, 0.1]
  P(Speed|Alg=QN,Deriv=T,PSize=Large) = [0.8, 0.19, 0.01]
  P(Speed|Alg=QN,Deriv=F,PSize=Large) = [0.9, 0.099, 0.001]
(General rules: larger problems are more likely to be slow. For the descent-based methods GD and
QN, derivative information is likely to speed up the solver. QN is in general faster than GD,
except perhaps for very large problems, for which its O(N²) cost per iteration is prohibitive.)
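Rows of a CPT are conditional distributions, so each must sum to 1; a quick sanity check catches transcription slips. A sketch over the Alg=GD block of the Speed CPT (values as in the table above, rows in the order [Slow, Medium, Fast]):

```python
# P(Speed | Alg=GD, Deriv, PSize), rows keyed by (Deriv, PSize).
speed_gd = {
    ("T", "Small"):  [0.2, 0.3, 0.5],
    ("F", "Small"):  [0.3, 0.4, 0.3],
    ("T", "Medium"): [0.3, 0.4, 0.3],
    ("F", "Medium"): [0.4, 0.5, 0.1],
    ("T", "Large"):  [0.8, 0.19, 0.01],
    ("F", "Large"):  [0.9, 0.099, 0.001],
}
# Every conditional distribution must normalize.
for (deriv, psize), dist in speed_gd.items():
    assert abs(sum(dist) - 1.0) < 1e-9, (deriv, psize)
```

The same check applies to every CPT in the network, not just this block.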
3. P(S) = Σ_{a,ot,p,d,l,o} P(a,ot,p,d,l,o,S)   (marginalization)
= Σ_{a,ot,p,d,l,o} P(a) P(ot) P(p) P(d|ot) P(l|ot,p) P(o|l,ot) P(S|a,d,p)   (chain rule for BNs)
= Σ_{a,ot,p,d} P(a) P(ot) P(p) P(d|ot) P(S|a,d,p) Σ_l P(l|ot,p) Σ_o P(o|l,ot)   (pulling the sums over o and l inward)
= Σ_{a,ot,p,d} P(a) P(ot) P(p) P(d|ot) P(S|a,d,p)   (both inner sums equal 1)
= Σ_{a,p,d} P(a) P(p) P(S|a,d,p) Σ_{ot} P(ot) P(d|ot)   (pulling the sum over ot inward)
Next, compute P(d) = Σ_{ot} P(ot) P(d|ot) for each value of d.
Finally, compute P(S) = Σ_{a,p,d} P(a) P(p) P(d) P(S|a,d,p).
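The two-pass computation at the end of the derivation can be sketched numerically. The Deriv CPT below matches the Q2 values, but the Speed CPT is left uniform as a placeholder, so the resulting P(Speed) is illustrative only:

```python
import itertools

# Priors (uniform, as in Q2).
P_alg   = {a: 0.25 for a in ["GD", "QN", "SA", "GA"]}
P_ot    = {ot: 1 / 3 for ot in ["Convex", "Nonconvex", "Nonsmooth"]}
P_psize = {p: 1 / 3 for p in ["Small", "Medium", "Large"]}
P_deriv_given_ot = {"Convex": 0.75, "Nonconvex": 0.5, "Nonsmooth": 0.0}  # P(Deriv=T|ot)

# Pass 1: P(Deriv=T) = sum_ot P(ot) P(Deriv=T|ot)
p_d_true = sum(P_ot[ot] * P_deriv_given_ot[ot] for ot in P_ot)
P_deriv = {"T": p_d_true, "F": 1.0 - p_d_true}

def speed_cpt(a, d, p):
    """Placeholder for P(Speed | Alg=a, Deriv=d, PSize=p): uniform."""
    return {"Slow": 1 / 3, "Medium": 1 / 3, "Fast": 1 / 3}

# Pass 2: P(S) = sum_{a,p,d} P(a) P(p) P(d) P(S|a,d,p)
P_speed = {s: 0.0 for s in ["Slow", "Medium", "Fast"]}
for a, p, d in itertools.product(P_alg, P_psize, P_deriv):
    w = P_alg[a] * P_psize[p] * P_deriv[d]
    for s, pr in speed_cpt(a, d, p).items():
        P_speed[s] += w * pr
```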
P(A) = Σ_{ot,p,d,l,o,s} P(A,ot,p,d,l,o,s)   (marginalization)
= Σ_{ot,p,d,l,o,s} P(A) P(ot) P(p) P(d|ot) P(l|ot,p) P(o|l,ot) P(s|A,d,p)   (chain rule for BNs)
= P(A) Σ_{ot,p,d,l,o,s} P(ot) P(p) P(d|ot) P(l|ot,p) P(o|l,ot) P(s|A,d,p)   (pulling P(A) out of the summation)
= P(A)   (the summation equals 1)
4. P(S, A=GA, P=Small)
= Σ_{ot,d,l,o} P(A=GA) P(ot) P(P=Small) P(d|ot) P(l|ot,P=Small) P(o|l,ot) P(S|A=GA,d,P=Small)
= P(A=GA) P(P=Small) Σ_{ot,d,l,o} P(ot) P(d|ot) P(l|ot,P=Small) P(o|l,ot) P(S|A=GA,d,P=Small)
= P(A=GA) P(P=Small) Σ_d P(S|A=GA,d,P=Small) Σ_{ot} P(ot) P(d|ot) Σ_l P(l|ot,P=Small) Σ_o P(o|l,ot)
First, O would be eliminated by summing over P(o|l,ot), giving a factor τ1 over (L,OT) whose
entries are all 1. Next, L would be eliminated by summing over τ1·P(l|ot,P=Small), giving a
factor τ2 over OT. Third, OT would be eliminated via a three-term sum-product, then finally D.
Normalizing the resulting factor over S yields P(S|A=GA,P=Small).
5. One reasonable method would be to find P(S=Fast|A) for all values of A, and pick the algorithm
that maximizes the probability. To calculate the increase in probability given derivative
information, compute the difference between P(S=Fast|A,D=T) and P(S=Fast|A,D=F) for the
chosen algorithm A.
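The selection rule and the value-of-information calculation can be sketched as below. The probabilities are illustrative placeholders; in practice each would come from an inference query on the network:

```python
# Illustrative stand-ins for the inference results P(Speed=Fast | Alg=a).
p_fast_given_alg = {"GD": 0.30, "QN": 0.35, "SA": 0.02, "GA": 0.02}

# Pick the algorithm most likely to be fast.
best = max(p_fast_given_alg, key=p_fast_given_alg.get)

# Value of derivative information for the chosen algorithm
# (again illustrative stand-ins for the two conditional queries):
p_fast_with_deriv    = 0.45  # P(Speed=Fast | Alg=best, Deriv=T)
p_fast_without_deriv = 0.25  # P(Speed=Fast | Alg=best, Deriv=F)
gain = p_fast_with_deriv - p_fast_without_deriv
print(best, gain)
```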
6. A convenient way to do this is to introduce auxiliary variables BossProblemSize and BossSpeed
which are new nodes in the network with arcs from ProblemSize and Speed respectively. They
indicate the boss’ statement of the requirements whereas ProblemSize and Speed are the true
requirements of the project. The uncertainty of the Boss’s knowledge is encoded in the CPTs
P(BossProblemSize|ProblemSize) and P(BossSpeed|Speed). The evidence for your probability
queries is no longer on ProblemSize and Speed, but rather on the auxiliary variables.
[Note: I realize this question is ambiguous, and other methods might work as well.]
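One concrete shape for the observation-noise CPT P(BossProblemSize|ProblemSize) is sketched below. The 0.7/0.15 split is an illustrative assumption about how reliable the boss is, not a value given in the problem:

```python
sizes = ["Small", "Medium", "Large"]

# The boss reports the true size with probability 0.7, and each other
# size with probability 0.15 (illustrative assumption).
boss_psize_cpt = {
    true: {reported: (0.7 if reported == true else 0.15) for reported in sizes}
    for true in sizes
}

# Every row is a proper conditional distribution.
for true in sizes:
    assert abs(sum(boss_psize_cpt[true].values()) - 1.0) < 1e-9
```

Evidence is then asserted on BossProblemSize (e.g. BossProblemSize=Small) rather than on ProblemSize itself, so the network can still entertain the possibility that the true size differs.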