Expectation Maximization

Maximum-Likelihood and Bayesian
Parameter Estimation
Expectation Maximization (EM)
Estimating a Missing Feature Value
Estimating the missing variable when the parameters are known:
• In the absence of x1, the most likely class is ω2
• Choose the value of x1 (the missing variable) that maximizes the likelihood, given the known value of x2
• Choosing the mean of the missing feature (over all classes) would result in worse performance!
This is the case of estimating the hidden variable given the parameters.
In EM the unknowns are both: the parameters and the hidden variables.
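A minimal sketch of this idea, assuming hypothetical known class-conditional Gaussians for two classes ω1 and ω2 over (x1, x2): with x2 observed and x1 missing, it compares imputing x1 by maximizing the likelihood against imputing it with the overall mean of x1. All parameter values below are made up for illustration.

import numpy as np

# Hypothetical known parameters for two classes (means, per-feature std. devs., priors)
means  = np.array([[0.0, 0.0], [3.0, 3.0]])   # class means for (x1, x2)
sigmas = np.array([[1.0, 1.0], [1.0, 1.0]])
priors = np.array([0.5, 0.5])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def likelihood(x1, x2):
    # Mixture likelihood of the full feature vector (x1, x2) over both classes
    return sum(p * gauss(x1, m[0], s[0]) * gauss(x2, m[1], s[1])
               for p, m, s in zip(priors, means, sigmas))

x2_known = 3.2                       # observed feature (makes ω2 the most likely class)
grid = np.linspace(-5, 8, 1001)      # candidate values for the missing x1

# Impute x1 by maximizing the likelihood, given the known parameters and x2
x1_ml = grid[np.argmax([likelihood(x1, x2_known) for x1 in grid])]

# Naive alternative: the mean of x1 over all classes
x1_mean = np.dot(priors, means[:, 0])

print("likelihood-based imputation:", x1_ml)    # close to the mean of the most likely class
print("overall-mean imputation:    ", x1_mean)  # worse: lies between the classes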
EM Task
• Estimate the unknown parameters θ given measurement data U
• However, some variables J are missing and need to be integrated out
• We want to maximize the posterior probability of θ, given the data U, marginalizing over J:

    θ̂ = arg max_θ  Σ_{J ∈ J^n}  P(θ, J | U)

    θ: parameter to be estimated,   J: missing variables,   U: data
EM Principle
• Estimate the unknown parameters θ given measurement data U, but not the nuisance variables J, which need to be integrated out:

    θ̂ = arg max_θ  Σ_{J ∈ J^n}  P(θ, J | U)

• Alternate between estimating the unknowns θ and the hidden variables J
• At each iteration, instead of finding the best J ∈ J^n given an estimate of θ, EM computes a distribution over the space J^n
k-means Algorithm as EM
• Estimate means of k classes when class labels are unknown
• Parameters: means to be estimated
• Hidden variables: class labels
begin initialize m1, m2, ..., mk
     do classify n samples according to nearest mi    (E-step)
        recompute mi                                  (M-step)
     until no change in mi
     return m1, m2, ..., mk
end
An iterative algorithm derivable from EM
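A minimal runnable sketch of this procedure, assuming synthetic 1-D data and k = 2; the data, initialization, and stopping test are illustrative choices, not part of the original slides.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: samples from two 1-D Gaussians, class labels unknown
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

k = 2
means = x[rng.choice(len(x), size=k, replace=False)]   # initialize m1, ..., mk

while True:
    # "E-step": classify each sample according to the nearest mean
    labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
    # "M-step": recompute each mean from the samples assigned to it
    new_means = np.array([x[labels == j].mean() if np.any(labels == j) else means[j]
                          for j in range(k)])
    if np.allclose(new_means, means):    # until no change in the means
        break
    means = new_means

print("estimated means:", np.sort(means))   # roughly [-2, 3]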
EM Importance
• The EM algorithm
  • is widely used for learning in the presence of unobserved variables, e.g., missing features or class labels
  • is used even for variables whose values are never directly observed, provided the general form of the pdf governing these variables is known
  • has been used to train Radial Basis Function Networks and Bayesian Belief Networks
  • is the basis for many unsupervised clustering algorithms
  • is the basis for the widely used Baum-Welch forward-backward algorithm for HMMs
EM Algorithm
• Learning in the presence of unobserved variables
• Only a subset of the relevant instance features might be observable
• Includes the case of unsupervised learning or clustering: how many classes are there?
EM Principle
• The EM algorithm iteratively re-estimates the likelihood using only the data that is present
Likelihood Formulation
• Sample points come from a single distribution:  W = {x1, ..., xn}
• Any sample xk has good and missing (bad) features:  xk = {xkg, xkb}
• The features are thus divided into two sets:  W = Wg ∪ Wb
Likelihood Formulation
• Central equation in EM:

    Q(θ; θ^i) = E_Db [ ln p(Dg, Db; θ) | Dg, θ^i ]

  The expected value is over the missing features Db
  θ^i is the current best estimate for the full distribution
  θ is a candidate vector for an improved estimate
• The algorithm will select the best candidate θ and call it θ^(i+1)
Algorithm EM
begin initialize θ^0, T, i = 0
     do i ← i + 1
        E step: compute Q(θ; θ^i)
        M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
     until Q(θ^(i+1); θ^i) − Q(θ^i; θ^(i−1)) < T
EM for 2D Normal Model
Suppose the data consist of 4 points in 2 dimensions, one point of which is missing a feature:

    D = {x1, x2, x3, x4} = { (0, 2)ᵀ, (1, 0)ᵀ, (2, 2)ᵀ, (*, 4)ᵀ }

where * represents the unknown value of the first feature of point x4. Thus our bad data Db consists of the single feature x41, and the good data Dg consists of the rest.
(Figure: the four points plotted in the x1–x2 plane.)
EM for 2D Normal Model
Assuming that the model is a Gaussian with diagonal covariance and arbitrary mean, it can be described by the parameter vector

    θ = (µ1, µ2, σ1², σ2²)ᵀ
EM for 2D Normal model
We take our initial guess to be a Gaussian centered on the origin with σ1² = σ2² = 1, that is,

    θ^0 = (0, 0, 1, 1)ᵀ
EM for a 2D Normal Model
To find an improved estimate we must calculate Q(θ; θ^0), the expected full-data log-likelihood, where the expectation is over the missing feature x41 given θ^0 and the good data Dg:

    Q(θ; θ^0) = E_x41 [ ln p(xg, xb; θ) | θ^0, Dg ]
EM for a 2D Normal Model
Simplifying this expectation (the integral over x41 can be evaluated in closed form) completes the E step and gives the next estimate, θ^1 = (0.75, 2.0, 0.938, 2.0)ᵀ.
Iterating the E and M steps, the final solution is

    θ = (1.0, 2.0, 0.667, 2.0)ᵀ

i.e., µ = (1, 2)ᵀ with σ1² = 2/3 ≈ 0.667 and σ2² = 2.
EM for 2D normal model
Four data points, one missing the value of x1, are in red.
Initial estimate is a circularly symmetric Gaussian, centered on the origin (gray).
(A better initial estimate could have been derived from the 3 known points.)
Each iteration leads to an improved estimate, labeled by iteration number i;
after 3 iterations, the algorithm converged.
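A minimal sketch of this example, assuming the diagonal-Gaussian model above. The E step only needs the conditional mean and expected squared deviation of the missing feature x41 (which, with a diagonal covariance, come from N(µ1, σ1²) under the current parameters), and the M step recomputes the sample moments with those expectations plugged in; the update formulas below follow from that assumption.

import numpy as np

# Data: rows are points (x1, x2); the first feature of x4 is missing (np.nan)
D = np.array([[0.0, 2.0],
              [1.0, 0.0],
              [2.0, 2.0],
              [np.nan, 4.0]])

mu = np.array([0.0, 0.0])      # initial guess theta^0 = (0, 0, 1, 1)^T
var = np.array([1.0, 1.0])

for i in range(10):
    # E step: with a diagonal covariance, x41 is independent of x42, so under the
    # current parameters x41 ~ N(mu[0], var[0]).
    e_x41 = mu[0]
    e_sq_dev = lambda m: var[0] + (e_x41 - m) ** 2    # E[(x41 - m)^2]

    # M step: recompute means and variances with the expectations filled in
    x1_filled = np.array([0.0, 1.0, 2.0, e_x41])
    new_mu = np.array([x1_filled.mean(), D[:, 1].mean()])
    new_var = np.array([
        (((np.array([0.0, 1.0, 2.0]) - new_mu[0]) ** 2).sum() + e_sq_dev(new_mu[0])) / 4,
        ((D[:, 1] - new_mu[1]) ** 2).mean(),
    ])
    mu, var = new_mu, new_var

print(mu, var)   # converges to mu = (1, 2), var = (0.667, 2.0)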
EM to Estimate Means of k Gaussians
• Data D is drawn from a mixture of k distinct Normal distributions
• A two-step process generates each sample:
  • One of the k distributions is selected at random
  • A single random instance xi is generated according to the selected distribution
Instances Generated by a Mixture of Two Normal
Distributions
Example of EM to Estimate Means of k Gaussians
• Each instance is generated by
  1. Choosing one of the k Gaussians with uniform probability
  2. Generating an instance at random according to that Gaussian
• The Normal distribution is chosen with uniform probability
• Each of the k Normal distributions has the same known variance
• Learning task: output a hypothesis h = <µ1, ..., µk> that describes the means of the k distributions
Estimating Means of k Gaussians
• We would like to find a maximum
likelihood hypothesis for these means: a
hypothesis h that maximizes p(D|h)
Maximum Likelihood Estimate of Mean of a
Single Gaussian
• Given observed data instances x1, x2, ..., xm
• drawn from a single distribution that is normally distributed
• the problem is to find the mean of that distribution
Maximum Likelihood Estimate of Mean of a
Single Gaussian
• The maximum likelihood estimate of the mean of a Normal distribution can be shown to be the one that minimizes the sum of squared errors:

    µ_ML = arg min_µ  (1/2) Σ_{i=1}^{m} (xi − µ)²

• The right-hand side is minimized (and the likelihood maximized) at

    µ_ML = (1/m) Σ_{i=1}^{m} xi

  which is the sample mean
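This follows by setting the derivative of the sum of squared errors to zero:

    d/dµ (1/2) Σ_{i=1}^{m} (xi − µ)² = −Σ_{i=1}^{m} (xi − µ) = 0   ⇒   m·µ = Σ_{i=1}^{m} xi   ⇒   µ_ML = (1/m) Σ_{i=1}^{m} xi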
Mixture of Two Normal Distributions
• We cannot observe which instances were generated by which distribution
• Full description of an instance: <xi, zi1, zi2>
• xi = observed value of the ith instance
• zi1 and zi2 indicate which of the two Normal distributions was used to generate xi
• zij = 1 if the jth Normal distribution was used to generate xi, 0 otherwise
• zi1 and zi2 are hidden variables, which have probability distributions associated with them
Hidden variables specify distribution
(Figure: the two Gaussians; instances generated by the first distribution have zi1 = 1, zi2 = 0, giving the full description (xi, 1, 0); instances from the second have zi1 = 0, zi2 = 1.)
2-Means Problem
• Full description of an instance: <xi, zi1, zi2>
• xi = observed variable
• zi1 and zi2 are hidden variables
• If zi1 and zi2 were observed, we could use maximum likelihood estimates for the means, taking each µj to be the sample mean of the instances generated by the jth distribution:

    µ1 = mean of { xi : zi1 = 1 },    µ2 = mean of { xi : zi2 = 1 }

• Since we do not know zi1 and zi2, we will use EM instead
EM Algorithm Applied to k-Means Problem
• Search for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden binary variables zij given the current hypothesis <µ1, ..., µk>
• Then recalculate the maximum likelihood hypothesis using these expected values for the hidden variables
EM algorithm for two means
(Figure: the two Gaussians with hidden labels zi1 = 1, zi2 = 0 for one component and zi1 = 0, zi2 = 1 for the other, now regarded as probabilities.)
1. Hypothesize the means, then determine the expected values of the hidden variables for all samples
2. Use these hidden variable values to recalculate the means
EM Applied to Two-Means Problem
• Initialize the hypothesis to h = <µ1, µ2>
• Estimate the expected values of the hidden variables zij given the current hypothesis
• Recalculate the maximum likelihood hypothesis using these expected values for the hidden variables
• Re-estimate h repeatedly until the procedure converges to a stationary value of h
EM Algorithm for 2-Means
Step 1
Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <µ1, µ2> holds.
Step 2
Calculate a new maximum likelihood hypothesis h′ = <µ1′, µ2′>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in Step 1.
Then replace the hypothesis h = <µ1, µ2> by the new hypothesis h′ = <µ1′, µ2′> and iterate.
EM First Step
Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <µ1, µ2> holds:

    E[zij] = p(x = xi | µ = µj) / Σ_{n=1}^{2} p(x = xi | µ = µn)

           = exp( −(xi − µj)² / 2σ² ) / Σ_{n=1}^{2} exp( −(xi − µn)² / 2σ² )

E[zij] is the probability that instance xi was generated by the jth Gaussian.
EM Second Step
Calculate a new maximum likelihood hypothesis h′ = <µ1′, µ2′>:

    µj ← Σ_{i=1}^{m} E[zij] xi  /  Σ_{i=1}^{m} E[zij]

Observations:
1. This is similar to the earlier sample mean calculation for a single Gaussian,
       µ_ML = (1/m) Σ_{i=1}^{m} xi,
   but with each instance weighted by E[zij].
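A minimal sketch of these two steps, assuming synthetic 1-D data, k = 2 components with equal priors, and a known common variance σ² = 1; the data and initialization are illustrative choices, not part of the original slides.

import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                                     # known, common variance
x = np.concatenate([rng.normal(0.0, 1.0, 150),   # hidden: generated by Gaussian 1
                    rng.normal(4.0, 1.0, 150)])  # hidden: generated by Gaussian 2

mu = np.array([-1.0, 1.0])                       # initial hypothesis h = <mu1, mu2>

for _ in range(50):
    # Step 1 (E): E[z_ij] = p(x_i | mu_j) / sum_n p(x_i | mu_n)
    p = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma2)
    E_z = p / p.sum(axis=1, keepdims=True)

    # Step 2 (M): mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)

print("estimated means:", mu)   # roughly [0, 4]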
Clustering using EM
Feature Extraction
(Figure: image I_1 is passed through feature extraction to produce a feature vector of 74 values; the slide shows the image and the numeric vector.)
Feature Extraction
(Figure: image I_1 is passed through feature extraction to produce a feature vector of 74 values.)
• 2 global features: aspect ratio and stroke ratio
• 72 local features F(i,j), for subimages i = 1, ..., 9 and slope bins j = 0, ..., 7:

    F(i,j) = s(i,j) / ( N(i) · S(j) ),    where S(j) = max_i s(i,j)

  s(i,j) = number of components with slope j in subimage i
  N(i) = number of components in subimage i
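A minimal sketch of this local-feature computation, assuming the per-subimage slope counts s(i,j) are already available as a 9×8 array; the array contents and the two global feature values below are placeholders, and the normalization simply follows the formula on the slide.

import numpy as np

# Hypothetical slope counts: s[i, j] = number of components with slope j in subimage i
rng = np.random.default_rng(2)
s = rng.integers(0, 6, size=(9, 8)).astype(float)

# N(i): number of components in subimage i (assuming each component falls in one slope bin)
N = s.sum(axis=1)
# S(j): maximum of s(i, j) over the subimages
S = s.max(axis=0)

# F(i, j) = s(i, j) / (N(i) * S(j)); guard against empty subimages or unused slope bins
denom = np.outer(N, S)
F = np.divide(s, denom, out=np.zeros_like(s), where=denom > 0)

# The two global features (aspect ratio, stroke ratio) are prepended to give 74 values
aspect_ratio, stroke_ratio = 0.37, 0.72          # placeholder values
feature_vector = np.concatenate([[aspect_ratio, stroke_ratio], F.ravel()])
print(feature_vector.shape)    # (74,)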
For one feature vector (image I_1), the cluster-conditional probabilities over the EM cycles:

              P(I_1|C_1)   P(I_1|C_2)   P(I_1|C_3)   P(I_1|C_4)   P(I_1|C_5)
  Cycle 1     1.000000     0.000000     0.000000     0.000000     0.000000
  Cycle 2     1.000000     0.000000     0.000000     0.000000     0.000000
  …
  Cycle N     0.990915     0.009083     0.000001     0.000000     0.000000

Final cluster centers (images closest to the means), with cluster priors
P(C_1) = 0.206088, P(C_2) = 0.228980, P(C_3) = 0.203198, and 0.218480, 0.143254 for the remaining two clusters.
Clustering
(Figure: the 74-value initial mean for Cluster 1, initialized with the feature vector for Image 1, and the final mean for Cluster 1 after 10 EM cycles; the numeric vectors are shown on the slide.)
Essence of EM Algorithm
• The current hypothesis is used to estimate the unobserved variables
• The expected values of these variables are then used to calculate an improved hypothesis
• It can be shown that on each iteration through the loop, EM increases the likelihood P(D|h) unless it is at a local maximum
• The algorithm thus converges to a local maximum likelihood hypothesis for <µ1, µ2>
General Statement of EM Algorithm
In the two-means problem, the parameters of interest were θ = <µ1, µ2>, and the full data were the triples <xi, zi1, zi2>, of which only xi is observed.
General Statement of EM Algorithm (cont’d.)
• Let X = {x1, x2, ..., xm} denote the observed data in a set of m independently drawn instances
• Let Z = {z1, z2, ..., zm} denote the unobserved data in these same instances
• Let Y = X ∪ Z denote the full data
General Statement of EM Algorithm (cont’d.)
• The unobserved Z can be treated as a random variable whose p.d.f. depends on the unknown parameters θ and on the observed data X
• Similarly, Y is a random variable because it is defined in terms of the random variable Z
• h denotes the current hypothesized values of the parameters θ
• h′ denotes the revised hypothesis estimated on each iteration of the EM algorithm
General Statement of EM Algorithm (cont’d.)
• The EM algorithm searches for the maximum likelihood hypothesis by seeking the h′ that maximizes E[ln P(Y | h′)]
General Statement of EM Algorithm (cont’d.)
• Define the function Q(h′ | h) that gives E[ln P(Y | h′)] as a function of h′, under the assumption that θ = h and given the observed portion X of the full data Y:

    Q(h′ | h) ← E[ ln P(Y | h′) | h, X ]
General Statement of EM Algorithm
• Repeat until convergence:
  • Step 1: Estimation (E) step: Calculate Q(h′ | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

        Q(h′ | h) ← E[ ln P(Y | h′) | h, X ]

  • Step 2: Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:

        h ← arg max_{h′} Q(h′ | h)
Derivation of k Means Algorithm from General EM algorithm
• Derive the previously seen algorithm for estimating the means of a mixture of k Normal distributions, i.e., estimate the parameters θ = <µ1, ..., µk>
• We are given the observed data X = {<xi>}
• The hidden variables Z = {<zi1, ..., zik>} indicate which of the k Normal distributions was used to generate xi
Derivation of the k Means Algorithm from General EM Algorithm (cont'd.)
• We need to derive an expression for Q(h′ | h)
• First we derive an expression for ln P(Y | h′)
Derivation of the k Means Algorithm
• The probability p(yi | h′) of a single instance yi = <xi, zi1, ..., zik> of the full data can be written

    p(yi | h′) = p(xi, zi1, ..., zik | h′) = (1 / √(2πσ²)) exp( −(1/2σ²) Σ_{j=1}^{k} zij (xi − µj′)² )
Derivation of the k Means Algorithm (cont’d.)
• Given this probability p(yi | h′) for a single instance, the log probability ln P(Y | h′) for all m instances in the data is

    ln P(Y | h′) = ln Π_{i=1}^{m} p(yi | h′)
                 = Σ_{i=1}^{m} ln p(yi | h′)
                 = Σ_{i=1}^{m} ( ln(1 / √(2πσ²)) − (1/2σ²) Σ_{j=1}^{k} zij (xi − µj′)² )
Derivation of the k Means Algorithm (cont'd.)
• In general, for any function f(z) that is a linear function of z, the following equality holds: E[f(z)] = f(E[z])
• Therefore

    E[ ln P(Y | h′) ] = E[ Σ_{i=1}^{m} ( ln(1 / √(2πσ²)) − (1/2σ²) Σ_{j=1}^{k} zij (xi − µj′)² ) ]
                      = Σ_{i=1}^{m} ( ln(1 / √(2πσ²)) − (1/2σ²) Σ_{j=1}^{k} E[zij] (xi − µj′)² )
Derivation of the k Means Algorithm
• To summarize, where h′ = <µ1′, ..., µk′> and E[zij] is calculated based on the current hypothesis h and the observed data X, the function Q(h′ | h) for the k-means problem is

    Q(h′ | h) = Σ_{i=1}^{m} ( ln(1 / √(2πσ²)) − (1/2σ²) Σ_{j=1}^{k} E[zij] (xi − µj′)² )

  where

    E[zij] = exp( −(xi − µj)² / 2σ² ) / Σ_{n=1}^{k} exp( −(xi − µn)² / 2σ² )
Derivation of k Means Algorithm:
Second (Maximization) Step, to find the values h′ = <µ1′, ..., µk′>:

    arg max_{h′} Q(h′ | h) = arg max_{h′} Σ_{i=1}^{m} ( ln(1 / √(2πσ²)) − (1/2σ²) Σ_{j=1}^{k} E[zij] (xi − µj′)² )

                           = arg min_{h′} Σ_{i=1}^{m} Σ_{j=1}^{k} E[zij] (xi − µj′)²

which is maximized by

    µj ← Σ_{i=1}^{m} E[zij] xi  /  Σ_{i=1}^{m} E[zij]
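The last step follows by setting, for each j, the derivative of the weighted sum of squared errors to zero:

    ∂/∂µj′ Σ_{i=1}^{m} Σ_{n=1}^{k} E[zin] (xi − µn′)² = −2 Σ_{i=1}^{m} E[zij] (xi − µj′) = 0
    ⇒   µj′ = Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]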
Summary
• In many parameter estimation tasks, some of the relevant instance variables may be unobservable
• In this case, the EM algorithm is useful