Learning in Bayes Nets

Learning in Bayes Nets
• Task 1: Given the network structure and
given data, where a data point is an
observed setting for the variables, learn the
CPTs for the Bayes Net. Might also start
with priors for CPT probabilities.
• Task 2: Given only the data (and possibly a
prior over Bayes Nets), learn the entire
Bayes Net (both Net structure and CPTs).
Task 1: Maximum Likelihood by
Example (Howard 1970)
• Suppose we have a thumbtack (with a round
flat head and sharp point) that when flipped
can land either with the point up (tails) or
with the point touching the ground (heads).
• Suppose we flip the thumbtack 100 times,
and 70 times it lands on heads. Then we
estimate that the probability of heads the
next time is 0.7. This is the maximum
likelihood estimate.
The General Maximum
Likelihood Setting
• We had a binomial distribution b(n,p) for
n=100, and we wanted a good guess at p.
• We chose the p that would maximize the
probability of our observation of 70 heads.
• In general we have a parameterized
distribution and want to estimate one or
more of its parameters: choose the value(s)
that maximize the probability of the data.
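To make the thumbtack numbers concrete, here is the standard one-line derivation (added for illustration, not from the slides) of the maximum likelihood estimate:

$$
L(p) = \binom{100}{70} p^{70}(1-p)^{30}, \qquad
\frac{d}{dp}\ln L(p) = \frac{70}{p} - \frac{30}{1-p} = 0
\;\Longrightarrow\; p = \frac{70}{100} = 0.7 .
$$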
Back to the Frequentist-Bayes
Debate
• The preceding seems backwards: we want
to maximize the probability of p, not
necessarily of the data (we already have it).
• A Frequentist will say this is the best we can
do: we can’t talk about probability of p; it is
fixed (though unknown).
• A Bayesian says the probability of p is the
degree of belief we assign to it ...
Fortunately the Two Agree
(Almost)
• It turns out that for Bayesians, if our prior
belief is that all values of p are equally
likely, then after observing the data we’ll
assign the highest probability to the
maximum likelihood estimate for p.
• But what if our prior belief is different?
How do we merge the prior belief with the
data to get the best new belief?
Encode Prior Beliefs as a Beta
Distribution
$$
\mathrm{beta}(a,b):\quad f(x) = \frac{\Gamma(a+b+2)}{\Gamma(a+1)\,\Gamma(b+1)}\; x^{a}(1-x)^{b}, \qquad 0 \le x \le 1
$$
Any intuition for this?
• For any positive integer y, Γ(y) = (y−1)!.
• Suppose we use this, and we also replace
  – x with p
  – a with x
  – a+b with n
• Then we get:

$$
\frac{(n+1)!}{x!\,(n-x)!}\; p^{x} (1-p)^{n-x}
$$
• The beta(a,b) is just the binomial(n,p) where n=a+b,
and p becomes the variable. With the change of
variable, we need a different normalizing constant
so the sum (integral) is 1. Hence (n+1)! replaces n!.
Incorporating a Prior
• We assume a beta distribution as our prior
distribution over the parameter p.
• Nice properties: unimodal, we can choose
the mode to reflect the most probable value,
we can choose the variance to reflect our
confidence in this value.
• Best property: a beta distribution is
parameterized by two positive numbers, a
Beta Distribution (Continued)
• (Continued)… and b. Higher values of a
relative to b move the mode of the
distribution toward 1 (to the right), and
higher values of both a and b make the
distribution more peaked (lower
variance). We might for example take a to
be the number of heads, and b to be the
number of tails. At any time, the mode of
Beta Distribution (Continued)
• (Continued)… the beta distribution (which
we use as our estimate of p) is a/(a+b), and as we get
more data, the distribution becomes more
peaked reflecting higher confidence in our
expectation. So we can specify our prior
belief for p by choosing initial values for a
and b such that a/(a+b)=p, and we can
specify confidence in this belief with high
Beta Distribution (Continued)
• (Continued)… initial values for a and b.
Updating our prior belief based on data to
obtain a posterior belief simply requires
incrementing a for every heads outcome
and incrementing b for every tails outcome.
• So after h heads out of n flips, our posterior
distribution says P(heads)=(a+h)/(a+b+n).
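A minimal Python sketch of this updating rule, using the slides' convention that beta(a,b) yields the point estimate a/(a+b); the function names and the particular prior are illustrative, not from the slides:

```python
def beta_update(a, b, heads, tails):
    """Add observed counts to the beta pseudo-counts."""
    return a + heads, b + tails

def point_estimate(a, b):
    """Estimate of P(heads) under the slides' beta(a, b) convention: a / (a + b)."""
    return a / (a + b)

# Prior belief: P(heads) = 2/3, held with little confidence (small counts).
a, b = 2, 1
# Observe h = 70 heads in n = 100 flips.
a, b = beta_update(a, b, heads=70, tails=30)
print(point_estimate(a, b))   # (2 + 70) / (2 + 1 + 100) = 72/103 ≈ 0.70
```

With a stronger prior (larger initial a and b), the same 100 flips would move the estimate less.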
Dirichlet Distributions
• What if our variable is not Boolean but can
take on more values? (Let’s still assume
our variables are discrete.)
• Dirichlet distributions are an extension of
beta distributions for the multi-valued case
(corresponding to the extension from
binomial to multinomial distributions).
• A Dirichlet distribution over a variable with
n values has n parameters rather than 2.
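Sketched below (illustrative code, one pseudo-count per value) is the same bookkeeping extended to a multi-valued variable:

```python
from collections import Counter

def dirichlet_update(prior_counts, observations):
    """Add the observed count for each value to its Dirichlet pseudo-count."""
    posterior = dict(prior_counts)
    for value, n in Counter(observations).items():
        posterior[value] = posterior.get(value, 0) + n
    return posterior

def estimates(counts):
    """Point estimate for each value: its count divided by the total count."""
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}

# A three-valued variable with a weak, uniform prior (one pseudo-count per value).
prior = {"red": 1, "green": 1, "blue": 1}
posterior = dirichlet_update(prior, ["red", "red", "blue", "red", "green"])
print(estimates(posterior))   # red: 4/8, green: 2/8, blue: 2/8
```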
Back to Frequentist-Bayes
Debate
• Recall that under the frequentist view we
estimate each parameter p by taking the ML
estimate (maximum likelihood estimate: the
value for p that maximizes the probability
of the data).
• Under the Bayesian view, we now have a
prior distribution over values of p. If this
prior is a beta, or more generally a Dirichlet
Frequentist-Bayes Debate
(Continued)
• (Continued)… then we can update it to a
posterior distribution quite easily using the
data as illustrated in the thumbtack
example. The result yields a new value for
the parameter p we wish to estimate (e.g.,
probability of heads) called the MAP
(maximum a posteriori) estimate.
• If our prior distribution was uniform over
values for p, then ML and MAP agree.
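As a quick check using the slides' (a,b) convention: the MAP estimate after h heads in n flips is (a+h)/(a+b+n); a uniform prior corresponds to a = b = 0, and then the MAP estimate reduces to h/n, which is exactly the ML estimate.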
Learning CPTs from Complete
Settings
• Suppose we are given a set of data, where
each data point is a complete setting for all
the variables.
• One assumption we make is that the data set
is a random sample from the distribution
we’re trying to model.
• For each node in our network, we consider
each entry in its CPT (each setting of values
Learning CPTs (Continued)
• (Continued)… for its parents). For each
entry in the CPT, we have a prior (possibly
uniform) Dirichlet distribution over its
values. We simply update this distribution
based on the relevant data points (those that
agree on the settings for the parents that
correspond with this CPT entry).
• A second, implicit assumption is that the
Learning CPTs (Continued)
• (Continued)… distributions over different
rows of the CPT are independent of one
another.
• Finally, it is worth noting that instead of this
last assumption, we might have a stronger
bias over the form of the CPT. We might
believe it is a noisy-OR, a linear function,
or a tree, in which case we would instead
use machine learning, linear regression, etc.
Simple Example
• Suppose we believe the variables PinLength
and HeadWeight directly influence whether
a thumbtack comes up heads or tails. For
simplicity, suppose PinLength can be long
or short and HeadWeight can be heavy or
light.
• Suppose we adopt the following prior over
the CPT entries for the variable Thumbtack.
Simple Example (Continued)
  HeadWeight:                      heavy             light
  PinLength:                       long     short    long     short
  P(Thumbtack = heads)              .9       .8       .8       .7
  P(Thumbtack = tails)              .1       .2       .2       .3
  Dirichlet counts (heads, tails)  (9,1)    (4,1)    (16,4)   (7,3)

(Normal roles of rows and columns in CPT reversed just to make it fit.)
Simple Example (Continued)
• Notice that we have equal confidence in our
prior (initial) probabilities for the first and
last columns of the CPT, less confidence in
those of the second column, and more in
those of the third column.
• A new data point will affect only one of the
columns. A new data point will have more
effect on the second column than the others.
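A small sketch of this effect in code, using the prior counts from the table above (the update rule is the same beta increment as before; variable and function names are illustrative):

```python
# Prior Dirichlet counts (heads, tails) for each parent setting of Thumbtack,
# copied from the table above.
prior = {
    ("heavy", "long"):  (9, 1),    # P(heads) = 0.9
    ("heavy", "short"): (4, 1),    # P(heads) = 0.8, lowest total count
    ("light", "long"):  (16, 4),   # P(heads) = 0.8, highest total count
    ("light", "short"): (7, 3),    # P(heads) = 0.7
}

def observe(counts, parents, outcome):
    """Return updated counts after one observed flip; only one row changes."""
    h, t = counts[parents]
    updated = dict(counts)
    updated[parents] = (h + 1, t) if outcome == "heads" else (h, t + 1)
    return updated

# One flip with a heavy head and a short pin that lands tails:
posterior = observe(prior, ("heavy", "short"), "tails")
h, t = posterior[("heavy", "short")]
print(h / (h + t))   # 4/6 ≈ 0.67 -- a large move, since this row had few counts
```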
More Difficult Case: What if
Some Variables are Missing
• Recall our earlier notion of hidden
variables.
• Sometimes a variable is hidden because it
cannot be explicitly measured. For
example, we might hypothesize that a
chromosomal abnormality is responsible for
some patients with a particular cancer not
responding well to treatment.
Missing Values (Continued)
• We might include a node for this
chromosomal abnormality in our network
because we strongly believe it exists, other
variables can be used to predict it, and it is
in turn predictive of still other variables.
• But in estimating CPTs from data, none of
our data points has a value for this variable.
Missing Values (Continued)
• This missing value (hidden variable)
problem arises frequently.
• Chicken-and-egg issue: if we had the CPTs, we
could fill in the missing values; if we had
complete data, we could estimate the CPTs.
• We do have partial data and partial (prior)
CPTs. Can we somehow leverage these into
full data and posterior CPTs?
Three Approaches
• Expectation-Maximization (EM) Algorithm.
• Gibbs Sampling (again).
• Gradient Ascent (Hill-climbing).
K-Means as EM

[Sequence of figure-only slides, not reproduced here, illustrating k-means clustering as alternating expectation (assign points to the nearest cluster) and maximization (recompute cluster centers) steps.]
General EM Framework
• Given: Data with missing values, Space of
possible models, Initial model.
• Repeat until no change greater than
threshold:
– Expectation (E) Step: Compute expectation
over missing values, given model.
– Maximization (M) Step: Replace the current model
with the model that maximizes the probability of the
data as completed in the E step.
(“Soft”) EM vs. “Hard” EM
• Standard (soft) EM: expectation is a probability
distribution.
• Hard EM: expectation is “all or nothing”… most
likely/probable value.
• K-means is usually run as “hard” EM but doesn’t
have to be.
• Advantage of hard EM is computational efficiency
when expectation is over state consisting of values
for multiple variables (next example illustrates).
EM for Parameter Learning: E Step
• For each data point with missing values,
compute the probability of each possible
completion of that data point. Replace the
original data point with all these
completions, weighted by probabilities.
• Computing the probability of each
completion (expectation) is just answering a
query over the missing variables given the
observed ones.
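For a network like the one in the worked example that follows (A and B are parents of C, which is the parent of D and E, with C unobserved), this query only involves C's Markov blanket. A minimal sketch, with CPT values taken from that example and everything else illustrative:

```python
# CPT values from the worked example: probability that each variable is 1 (true).
P_C_given_AB = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.3, (0, 0): 0.2}
P_D_given_C = {1: 0.9, 0: 0.2}
P_E_given_C = {1: 0.8, 0: 0.1}

def bern(p_true, value):
    """P(X = value) when P(X = 1) = p_true."""
    return p_true if value == 1 else 1.0 - p_true

def completion_weights(a, b, d, e):
    """Distribution over the missing C given its Markov blanket (A, B, D, E)."""
    unnorm = {c: bern(P_C_given_AB[(a, b)], c) *
                 bern(P_D_given_C[c], d) *
                 bern(P_E_given_C[c], e)
              for c in (0, 1)}
    z = sum(unnorm.values())
    return {c: w / z for c, w in unnorm.items()}

print(completion_weights(a=0, b=0, d=0, e=0))  # ≈ {0: 0.99, 1: 0.01}, as on the slide
print(completion_weights(a=1, b=0, d=1, e=1))  # ≈ {0: 0.02, 1: 0.98}
```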
EM for Parameter Learning: M Step
• Use the completed data set to update our
Dirichlet distributions as we would use any
complete data set, except that our counts
(tallies) may be fractional now.
• Update CPTs based on new Dirichlet
distributions, as we would with any
complete data set.
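A sketch of the corresponding M step for a single CPT row: fractional completion weights are added to the Dirichlet counts exactly as whole counts would be. The numbers below reproduce the P(E | C=false) update from the multiple-missing-values example later in the slides; the code itself is illustrative:

```python
def m_step_row(prior_counts, weighted_values):
    """Update Dirichlet counts (n_true, n_false) for one CPT row.

    weighted_values: (value, weight) pairs from data points whose parent
    setting matches this row; weights from the E step may be fractional.
    """
    n_true, n_false = prior_counts
    for value, weight in weighted_values:
        if value == 1:
            n_true += weight
        else:
            n_false += weight
    return n_true, n_false

# P(E | C = false) starts at 0.1 with counts (1, 9).  Two weighted completions
# assign C = false weights 0.72 and 0.04 to data points in which E = 1:
counts = m_step_row((1, 9), [(1, 0.72), (1, 0.04)])
print(counts, counts[0] / sum(counts))   # (1.76, 9) -> P(E | C=false) ≈ 0.16
```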
EM for Parameter Learning
• Iterate E and M steps until no changes
occur. We will not necessarily get the
global MAP (or ML given uniform priors)
setting of all the CPT entries, but under a
natural set of conditions we are guaranteed
convergence to a local MAP solution.
• EM algorithm is used for a wide variety of
tasks outside of BN learning as well.
Subtlety for Parameter Learning
• Danger of overcounting: the number of
iterations required to converge to settings
for the missing values should not inflate our counts.
• So after each repetition of the E step, reset all
Dirichlet distributions to their priors before repeating
the M step.
EM for Parameter Learning

Network: A and B are the parents of C; C is the parent of D and of E.

CPTs (current estimates, with Dirichlet counts):
  P(A) = 0.1 (1,9)
  P(B) = 0.2 (1,4)
  P(C | A,B):  A=T,B=T: 0.9 (9,1)   A=T,B=F: 0.6 (3,2)   A=F,B=T: 0.3 (3,7)   A=F,B=F: 0.2 (1,4)
  P(D | C):    C=T: 0.9 (9,1)   C=F: 0.2 (1,4)
  P(E | C):    C=T: 0.8 (4,1)   C=F: 0.1 (1,9)

Data (C is unobserved):
  A B C D E
  0 0 ? 0 0
  0 0 ? 1 0
  1 0 ? 1 1
  0 0 ? 0 1
  0 1 ? 1 0
  0 0 ? 0 1
  1 1 ? 1 1
  0 0 ? 0 0
  0 0 ? 1 0
  0 0 ? 0 1
EM for Parameter Learning

(Same network and CPTs as on the previous slide.)

Data after the E step, with the missing C replaced by its expected distribution:
  A B   C                    D E
  0 0   0: 0.99   1: 0.01    0 0
  0 0   0: 0.80   1: 0.20    1 0
  1 0   0: 0.02   1: 0.98    1 1
  0 0   0: 0.80   1: 0.20    0 1
  0 1   0: 0.70   1: 0.30    1 0
  0 0   0: 0.80   1: 0.20    0 1
  1 1   0: 0.003  1: 0.997   1 1
  0 0   0: 0.99   1: 0.01    0 0
  0 0   0: 0.80   1: 0.20    1 0
  0 0   0: 0.80   1: 0.20    0 1
Multiple Missing Values

(Same network and CPTs as before.)

Data (A and C are both unobserved):
  A B C D E
  ? 0 ? 0 1
Multiple Missing Values

(Same network and CPTs as before.)

E step: the single data point is replaced by its four possible completions, weighted by their probabilities:
  A B C D E   weight
  0 0 0 0 1   0.72
  0 0 1 0 1   0.18
  1 0 0 0 1   0.04
  1 0 1 0 1   0.06
Multiple Missing Values

CPTs after the M step (Dirichlet counts are now fractional):
  P(A) = 0.1 (1.1, 9.9)
  P(B) = 0.17 (1, 5)
  P(C | A,B):  A=T,B=T: 0.9 (9,1)   A=T,B=F: 0.6 (3.06, 2.04)   A=F,B=T: 0.3 (3,7)   A=F,B=F: 0.2 (1.18, 4.72)
  P(D | C):    C=T: 0.88 (9, 1.24)   C=F: 0.17 (1, 4.76)
  P(E | C):    C=T: 0.81 (4.24, 1)   C=F: 0.16 (1.76, 9)

Weighted completions (as on the previous slide):
  A B C D E   weight
  0 0 0 0 1   0.72
  0 0 1 0 1   0.18
  1 0 0 0 1   0.04
  1 0 1 0 1   0.06
Problems with EM
• Only local optimum (not much way around
that, though).
• Deterministic … if priors are uniform, may
be impossible to make any progress…
• … next figure illustrates the need for some
randomization to move us off an
uninformative prior…
What will EM do here?

Network: A → B → C.
  P(A) = 0.5 (1,1)
  P(B | A):  A=T: 0.5 (1,1)   A=F: 0.5 (1,1)
  P(C | B):  B=T: 0.5 (1,1)   B=F: 0.5 (1,1)

Data (B is unobserved):
  A B C
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
EM Dependent on Initial Beliefs

Network: A → B → C.
  P(A) = 0.5 (1,1)
  P(B | A):  A=T: 0.6 (6,4)   A=F: 0.4 (4,6)
  P(C | B):  B=T: 0.5 (1,1)   B=F: 0.5 (1,1)

Data (B is unobserved; same six data points as on the previous slide):
  A B C
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
  0 ? 0
  1 ? 1
EM Dependent on Initial Beliefs

(Same network, CPTs, and data as on the previous slide.)

B is more likely T than F when A is T. Filling this in makes C more likely T than F when B is T. This makes B still more likely T than F when A is T. Etc. A small change in the CPT for B (swap 0.6 and 0.4) would have the opposite effect.
A Second Approach: Gibbs
Sampling
• The idea is analogous to Gibbs Sampling
for Bayes Net inference, which we have
seen in detail.
• First, initialize the values of hidden
variables arbitrarily. Update CPTs based on
current (now complete) data.
• Second, choose one data point and one
unobserved variable X for that data point.
Gibbs Sampling (Continued)
• (Continued)… Reset the value of X within
that data point based on the current CPTs
and the current setting of the variables in
the Markov Blanket of X within that data
point.
• Third, repeat this process for all the other
unobserved variables throughout the data
set and then update the CPTs.
Gibbs Sampling (Continued)
• Fourth, iterate through the previous three
steps some number of times (the chain length).
• Gibbs is faster than (soft) EM if there are many
missing values per data point.
• Gibbs is often slower than hard EM (we now have
more variables than in the pure inference case),
but the results may be better.
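A minimal sketch of the resampling step for one hidden variable, reusing the CPTs of the earlier EM example (everything apart from those numbers is illustrative). The computation is the same Markov-blanket query as in the E step, except that we sample a value rather than keep the whole distribution:

```python
import random

# CPT values from the earlier EM example (probability each variable is 1).
P_C_given_AB = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.3, (0, 0): 0.2}
P_D_given_C = {1: 0.9, 0: 0.2}
P_E_given_C = {1: 0.8, 0: 0.1}

def bern(p_true, value):
    return p_true if value == 1 else 1.0 - p_true

def resample_C(point):
    """Redraw the hidden C in one data point from P(C | its Markov blanket)."""
    a, b, d, e = point["A"], point["B"], point["D"], point["E"]
    w = {c: bern(P_C_given_AB[(a, b)], c) *
            bern(P_D_given_C[c], d) *
            bern(P_E_given_C[c], e)
         for c in (0, 1)}
    point["C"] = 1 if random.random() < w[1] / (w[0] + w[1]) else 0

# One sweep over the data; after the sweep the CPTs would be re-estimated
# from the now-complete data, and the whole process repeated (chain length).
data = [{"A": 0, "B": 0, "C": 0, "D": 0, "E": 0},
        {"A": 1, "B": 0, "C": 0, "D": 1, "E": 1}]
for point in data:
    resample_C(point)
print(data)
```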
Approach 3: Gradient Ascent
• We want to maximize posterior probability
(or likelihood if uniform priors). Where
w1,…,wk are the probabilities in the CPTs
(analogous to weights in a neural network)
and D1,…, Dn are the data points, we will
use a greedy hill-climbing search (making
small changes in the direction of the
gradient) to maximize P(D|w1,…,wk) =
P(D1|w1,…,wk)...P(Dn|w1,…,wk).
Gradient Ascent (Continued)
• Must first define the gradient (slope) of the
function we wish to maximize. It turns out
it is easier to do this with the logarithm of
the posterior probability or likelihood.
Russell & Norvig focus on likelihood so we
will as well. Maximizing log likelihood
also will maximize likelihood but the
function is easier to work with (additive).
Gradient Ascent (Continued)
• Because the log likelihood is additive, we simply need to compute the gradient relative to each individual probability wi, which we can do as follows.

$$
\frac{\partial \ln P(D \mid w_1,\ldots,w_k)}{\partial w_i}
= \frac{\partial \ln \prod_j P(D_j \mid w_1,\ldots,w_k)}{\partial w_i}
= \sum_j \frac{\partial \ln P(D_j \mid w_1,\ldots,w_k)}{\partial w_i}
= \sum_j \frac{\partial P(D_j \mid w_1,\ldots,w_k)/\partial w_i}{P(D_j)}
$$
Gradient Ascent (Continued)
• Based on the preceding derivation, we can
calculate the gradient contribution of each
data point and sum the contributions.
• Hence we want to find the gradient
contribution for a single case (Dj) from a
single CPT with which wi is associated.
• Assume wi is the probability that Xi = xi
given that its parents Pa(Xi) = U are set to ui.
So wi = P(xi | ui).
Gradient Ascent (Continued)
• We now work with the probability of these settings out of all possible settings.

$$
\frac{\partial P(D_j)/\partial w_i}{P(D_j)}
= \frac{\dfrac{\partial}{\partial w_i}\left(\sum_{x,u} P(D_j \mid x,u)\,P(x,u)\right)}{P(D_j)}
= \frac{\dfrac{\partial}{\partial w_i}\left(\sum_{x,u} P(D_j \mid x,u)\,P(x \mid u)\,P(u)\right)}{P(D_j)}
$$
Gradient Ascent (Continued)
• In the preceding, wi appears in only one term of the summation, where x=xi and u=ui. For this term, P(x|u) = wi. So

$$
\frac{\partial P(D_j)/\partial w_i}{P(D_j)}
= \frac{P(D_j \mid x_i,u_i)\,P(u_i)}{P(D_j)}
$$
Gradient Ascent (Continued)
• Applying Bayes’ Theorem:

$$
\frac{\partial P(D_j)/\partial w_i}{P(D_j)}
= \frac{P(x_i,u_i \mid D_j)\,P(D_j)\,P(u_i)}{P(x_i,u_i)\,P(D_j)}
= \frac{P(x_i,u_i \mid D_j)}{P(x_i \mid u_i)}
= \frac{P(x_i,u_i \mid D_j)}{w_i}
$$
Gradient Ascent (Continued)
• We can compute P(xi,ui|Dj) using one of our
already-studied mechanisms for Bayes Net
inference (answering queries from Bayes
Nets).
• We then sum over all data points to get the
gradient contribution from each probability.
We assume the probabilities are
independent of one another, making it easy
to take a small step up the gradient.
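A small sketch of the per-data-point gradient contribution from the derivation above, again using the CPTs of the earlier EM example (illustrative code; the loop over all data points and the actual hill-climbing step are omitted):

```python
# Gradient contribution to ln P(D_j) from one CPT entry, using the earlier
# EM example's network: w_i = P(C=1 | A=0, B=0), and a data point D_j with
# A=0, B=0, D=1, E=0 observed and C hidden.
# From the derivation above, the contribution is P(C=1, A=0, B=0 | D_j) / w_i;
# here A and B are observed to match, so the numerator is just P(C=1 | D_j).

P_C_given_AB = {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.3, (0, 0): 0.2}
P_D_given_C = {1: 0.9, 0: 0.2}
P_E_given_C = {1: 0.8, 0: 0.1}

def bern(p_true, value):
    return p_true if value == 1 else 1.0 - p_true

def posterior_C(a, b, d, e):
    """P(C | A=a, B=b, D=d, E=e), computable by any BN inference method."""
    w = {c: bern(P_C_given_AB[(a, b)], c) *
            bern(P_D_given_C[c], d) *
            bern(P_E_given_C[c], e)
         for c in (0, 1)}
    z = w[0] + w[1]
    return {c: x / z for c, x in w.items()}

w_i = P_C_given_AB[(0, 0)]               # current value of this CPT entry (0.2)
post = posterior_C(a=0, b=0, d=1, e=0)   # inference for the single data point
print(post[1] / w_i)                     # 0.2 / 0.2 = 1.0, this point's contribution
```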