Segmentation and Fitting by EM

EM and model selection
Missing variable problems
• In many vision problems, if some variables were known the
maximum likelihood inference problem would be easy
– fitting; if we knew which line each token came from, it would
be easy to determine line parameters
– segmentation; if we knew the segment each pixel came from,
it would be easy to determine the segment parameters
– fundamental matrix estimation; if we knew which feature
corresponded to which, it would be easy to determine the
fundamental matrix
– etc.
• This sort of thing happens in statistics, too
Missing variable problems
• Strategy
– estimate appropriate values for the missing variables
– plug these in, now estimate parameters
– re-estimate appropriate values for missing variables, continue
• e.g.
– guess which line gets which point
– now fit the lines
– now reallocate points to lines, using our knowledge of the
lines
– now refit, etc.
• We’ve seen this line of thought before (k means)
Missing variables - strategy
• We have a problem with parameters and missing variables
• This suggests: iterate until convergence
– replace the missing variables with their expected values, given fixed values of the parameters
– fix the missing variables, and choose parameters to maximize the likelihood given those fixed values
• e.g., iterate till convergence
– allocate each point to a line with a weight, which is the probability of the point given the line
– refit the lines to the weighted set of points
• Converges to a local extremum
• A somewhat more general form is available
K-Means
• Choose a fixed number of clusters
• Choose cluster centers and point–cluster allocations to minimize the error
– can't do this by search, because there are too many possible allocations
• Algorithm
– fix cluster centers; allocate points to the closest cluster
– fix allocation; compute the best cluster centers
• x could be any set of features for which we can compute a distance (careful about scaling)
• The error being minimized is
$$\sum_{i \in \text{clusters}} \;\; \sum_{j \in \text{elements of } i\text{th cluster}} \left\lVert x_j - \mu_i \right\rVert^2$$
* From Marc Pollefeys COMP 256 2003
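The alternation above takes only a few lines of code. Below is a minimal NumPy sketch of the two steps; the function name, initialisation, and stopping test are illustrative choices, not taken from the slides.

```python
import numpy as np

def kmeans(x, k, n_iters=100, seed=0):
    """Minimal k-means: x is (n, d); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # initial centers
    for _ in range(n_iters):
        # Step 1: fix centers, allocate each point to the closest center.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: fix allocation, recompute each center as the mean of its points.
        new_centers = np.array([
            x[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```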
Image Segmentation by K-Means
• Select a value of K
• Select a feature vector for every pixel (color, texture, position, or a combination of these, etc.)
• Define a similarity measure between feature vectors (usually Euclidean distance)
• Apply the K-Means algorithm
• Apply the connected components algorithm
• Merge any components of size less than some threshold to an adjacent component that is most similar to it
* From Marc Pollefeys COMP 256 2003
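As a rough sketch of that pipeline, the following uses scikit-learn's KMeans and SciPy's connected-components labelling; the particular feature vector (colour plus scaled position) and the omission of the final merging pass are assumptions made for the example, not prescriptions from the slides.

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def segment_kmeans(image, k=4, position_weight=0.1):
    """image: (H, W, 3) float array. Returns an integer label map of connected components."""
    h, w, _ = image.shape
    # Feature vector per pixel: colour plus (scaled) position -- one possible choice.
    yy, xx = np.mgrid[0:h, 0:w]
    feats = np.concatenate(
        [image.reshape(-1, 3),
         position_weight * np.stack([yy, xx], axis=-1).reshape(-1, 2) / max(h, w)],
        axis=1)
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(feats).reshape(h, w)
    # Split each k-means cluster into spatially connected components.
    labels = np.zeros((h, w), dtype=int)
    next_label = 0
    for c in range(k):
        comp, n = ndimage.label(clusters == c)
        labels[comp > 0] = comp[comp > 0] + next_label
        next_label += n
    # A final pass would merge components smaller than a threshold into their
    # most similar neighbouring component (omitted here for brevity).
    return labels

# toy usage on a synthetic two-tone image
img = np.zeros((64, 64, 3)); img[:, 32:] = 1.0
seg = segment_kmeans(img, k=2)
```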
K-Means
• Is an approximation to EM
– Model (hypothesis space): Mixture of N Gaussians
– Latent variables: Correspondence of data and Gaussians
• We notice:
– Given the mixture model, it’s easy to calculate the
correspondence
– Given the correspondence it’s easy to estimate the mixture
models
Generalized K-Means (EM)
Generalized K-Means
• Converges!
• Proof [Neal/Hinton, McLachlan/Krishnan]:
– neither the E-step nor the M-step decreases the data likelihood
– convergence may be to a local maximum or a saddle point
Idea
• Data generated from mixture of Gaussians
• Latent variables: Correspondence between Data Items
and Gaussians
Learning a Gaussian Mixture
(with known covariance)
E-step: compute the expected value of each latent indicator z_ij, the probability that data point x_i was generated by the jth Gaussian:
$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{k} p(x = x_i \mid \mu = \mu_n)} = \frac{\exp\!\left(-\tfrac{1}{2\sigma^2}(x_i - \mu_j)^2\right)}{\sum_{n=1}^{k} \exp\!\left(-\tfrac{1}{2\sigma^2}(x_i - \mu_n)^2\right)}$$

M-step: re-estimate each mean as the responsibility-weighted average of the data:
$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
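A compact NumPy sketch of these two updates for a one-dimensional mixture with shared, known σ and equal mixing weights; the initialisation and iteration count are arbitrary choices for the example.

```python
import numpy as np

def em_known_variance(x, k, sigma, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture with known, shared variance sigma**2.
    x: (m,) data. Returns the estimated means mu, shape (k,)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)          # initial means
    for _ in range(n_iters):
        # E-step: responsibilities E[z_ij], shape (m, k), equal mixing weights assumed.
        log_p = -((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))   # numerical stability
        E_z = p / p.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means.
        mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
    return mu

# toy usage: two clusters around -2 and +3
data = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 1, 200)])
print(em_known_variance(data, k=2, sigma=1.0))
```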
The Expectation/Maximization (EM)
algorithm
• In general we may imagine that an image
comprises L segments (labels)
• Within segment l the pixels (feature vectors) have a probability distribution represented by p_l(x | θ_l)
• θ_l represents the parameters of the data in segment l
– Mean and variance of the greylevels
– Mean vector and covariance matrix of the colours
– Texture parameters
The Expectation/Maximization (EM)
algorithm
2
1
5
3
4
The Expectation/Maximization
(EM) algorithm
• Once again a chicken and egg problem arises
– If we knew θ_l for l = 1…L, then we could obtain a labelling for each x by simply choosing the label which maximizes p_l(x | θ_l)
– If we knew the label for each x, we could obtain θ_l for l = 1…L by using a simple maximum likelihood estimator
• The EM algorithm is designed to deal with this type of problem
but it frames it slightly differently
– It regards segmentation as a missing (or incomplete) data
estimation problem
The Expectation/Maximization
(EM) algorithm
• The incomplete data are just the measured pixel greylevels or
feature vectors
– We can define a probability distribution of the incomplete data as p(x; θ_1, θ_2, …, θ_L)
• The complete data are the measured greylevels or feature vectors
plus a mapping function f (.) which indicates the labelling of
each pixel
– Given the complete data (pixels plus labels) we can easily work out estimates of the parameters θ_l for l = 1…L
– But from the incomplete data no closed form solution exists
The Expectation/Maximization
(EM) algorithm
• Once again we resort to an iterative strategy and
hope that we get convergence
• The algorithm is as follows:
Initialize an estimate of θ_l for l = 1…L
Repeat
  Step 1 (E-step): obtain an estimate of the labels based on the current parameter estimates
  Step 2 (M-step): update the parameter estimates based on the current labelling
Until convergence
The Expectation/Maximization
(EM) algorithm
• A recent approach to applying EM to image
segmentation is to assume the image pixels or
feature vectors follow a mixture model
– Generally we assume that each component of the
mixture model is a Gaussian
– A Gaussian mixture model (GMM)
$$p(x \mid \Theta) = \sum_{l=1}^{L} \alpha_l \, p_l(x \mid \theta_l), \qquad \sum_{l=1}^{L} \alpha_l = 1$$
$$p_l(x \mid \theta_l) = \frac{1}{(2\pi)^{d/2} \det(\Sigma_l)^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l)\right)$$
The Expectation/Maximization
(EM) algorithm
• Our parameter space for our distribution now
includes the mean vectors and covariance
matrices for each component in the mixture
plus the mixing weights
  1 , 1 , 1 ,..........L ,  L ,  L 
• We choose a Gaussian for each component because the ML estimate of each parameter in the M-step then has a simple (linear) closed form
The Expectation/Maximization
(EM) algorithm
• Define a posterior probability P(l | x_j, θ_l) as the probability that pixel j belongs to region l given the value of the feature vector x_j
• Using Bayes' rule we can write the following equation:
$$P(l \mid x_j, \theta_l) = \frac{\alpha_l \, p_l(x_j \mid \theta_l)}{\sum_{k=1}^{L} \alpha_k \, p_k(x_j \mid \theta_k)}$$
• This is the E-step of our EM algorithm, as it allows us to assign probabilities to each label at each pixel
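A sketch of this E-step in NumPy: the multivariate Gaussian density is written out directly and the responsibilities P(l | x_j) are computed for every feature vector; all names are illustrative.

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Multivariate normal density p_l(x | theta_l) for rows of x, shape (n, d)."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))

def e_step(x, alphas, mus, covs):
    """Posterior responsibilities P(l | x_j, theta), shape (n, L)."""
    weighted = np.stack(
        [a * gaussian_density(x, m, c) for a, m, c in zip(alphas, mus, covs)],
        axis=1)
    return weighted / weighted.sum(axis=1, keepdims=True)
```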
The Expectation/Maximization
(EM) algorithm
• The M-step simply updates the parameter estimates using maximum likelihood estimation:
$$\alpha_l^{(m+1)} = \frac{1}{n} \sum_{j=1}^{n} P(l \mid x_j, \theta_l^{(m)})$$
$$\mu_l^{(m+1)} = \frac{\sum_{j=1}^{n} x_j \, P(l \mid x_j, \theta_l^{(m)})}{\sum_{j=1}^{n} P(l \mid x_j, \theta_l^{(m)})}$$
$$\Sigma_l^{(m+1)} = \frac{\sum_{j=1}^{n} P(l \mid x_j, \theta_l^{(m)}) \,(x_j - \mu_l^{(m)})(x_j - \mu_l^{(m)})^T}{\sum_{j=1}^{n} P(l \mid x_j, \theta_l^{(m)})}$$
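And a matching sketch of the M-step, implementing the three updates above given the (n, L) responsibility matrix from the E-step. Note it plugs the newly updated means into the covariance update, which is the usual practice; the slide writes the previous estimate μ_l^(m).

```python
import numpy as np

def m_step(x, resp):
    """x: (n, d) feature vectors; resp: (n, L) responsibilities P(l | x_j).
    Returns updated (alphas, mus, covs)."""
    n, d = x.shape
    n_l = resp.sum(axis=0)                         # effective counts per component
    alphas = n_l / n                               # mixing weights
    mus = (resp.T @ x) / n_l[:, None]              # weighted means, shape (L, d)
    covs = []
    for l in range(resp.shape[1]):
        diff = x - mus[l]                          # new mean used here (see note above)
        covs.append((resp[:, l, None] * diff).T @ diff / n_l[l])
    return alphas, mus, np.array(covs)
```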
Segmentation with EM
Figure from “Color and Texture Based Image Segmentation Using EM and Its Application to Content-Based Image Retrieval,” S.J. Belongie et al., Proc. Int. Conf. Computer Vision, 1998, © 1998, IEEE
EM Clustering: Results
Lines and robustness
• We have one line, and n
points
• Some come from the
line, some from “noise”
• This is a mixture model:
• We wish to determine
– line parameters
– p(comes from line)
Ppoint | line and noise params   Ppoint | line Pcomes from line  
Ppoint | noise Pcomes from noise 
 Ppoint | line   Ppoint | noise (1   )
Estimating the mixture model
• Introduce a set of hidden variables, δ, one for each point. They are one when the point is on the line, and zero when it is off.
• Here K is a normalising constant and k_n is the noise intensity (we'll choose this later).
• If these are known, the negative log-likelihood becomes (the line's parameters are φ and c):
$$Q_c(x; \theta) = \sum_i \left[ \delta_i \,\frac{(x_i \cos\phi + y_i \sin\phi + c)^2}{2\sigma^2} + (1 - \delta_i)\, k_n \right] + K$$
Substituting for delta
• We shall substitute the expected value of δ_i, for a given θ
• Recall θ = (φ, c, λ)
• E[δ_i] = 1 · P(δ_i = 1 | θ, x_i) + 0 · P(δ_i = 0 | θ, x_i)
• Notice that if k_n is small and positive, then this value is close to 1 when the distance to the line is small, and close to zero when it is large
$$P(\delta_i = 1 \mid \theta, x_i) = \frac{P(x_i \mid \delta_i = 1, \theta)\, P(\delta_i = 1)}{P(x_i \mid \delta_i = 1, \theta)\, P(\delta_i = 1) + P(x_i \mid \delta_i = 0, \theta)\, P(\delta_i = 0)}$$
$$= \frac{\lambda \exp\!\left(-\tfrac{1}{2\sigma^2}(x_i \cos\phi + y_i \sin\phi + c)^2\right)}{\lambda \exp\!\left(-\tfrac{1}{2\sigma^2}(x_i \cos\phi + y_i \sin\phi + c)^2\right) + (1 - \lambda)\exp(-k_n)}$$
Algorithm for line fitting
• Obtain some start point θ^(0) = (φ^(0), c^(0), λ^(0))
• Compute the δ's using the formula above
• Compute the maximum likelihood estimate of θ^(1)
– φ, c come from fitting to the weighted points
– λ comes by counting (averaging the weights)
• Iterate to convergence
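A hedged sketch of the whole loop for one line plus a uniform noise process, following the expected-δ formula and the weighted refit described above; the start point, σ, and k_n values are arbitrary example choices.

```python
import numpy as np

def em_line_fit(pts, sigma=0.05, k_n=2.0, n_iters=50):
    """pts: (n, 2) points. Fit one line x*cos(phi) + y*sin(phi) + c = 0 plus
    a uniform 'noise' process. Returns (phi, c, lam, weights)."""
    phi, c, lam = 0.0, 0.0, 0.5                     # crude start point theta^(0)
    for _ in range(n_iters):
        # E-step: expected delta_i = P(point i comes from the line).
        r2 = (pts[:, 0] * np.cos(phi) + pts[:, 1] * np.sin(phi) + c) ** 2
        on_line = lam * np.exp(-r2 / (2 * sigma ** 2))
        off_line = (1 - lam) * np.exp(-k_n)
        w = on_line / (on_line + off_line)
        # M-step: weighted total-least-squares line fit ...
        mean = (w[:, None] * pts).sum(axis=0) / w.sum()
        diff = pts - mean
        S = (w[:, None] * diff).T @ diff
        evals, evecs = np.linalg.eigh(S)
        normal = evecs[:, 0]                        # direction of least weighted scatter
        phi = np.arctan2(normal[1], normal[0])
        c = -normal @ mean
        # ... and lambda by "counting" (averaging the weights).
        lam = w.mean()
    return phi, c, lam, w
```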
The expected values of the deltas at the maximum
(notice the one value close to zero).
Closeup of the fit
Choosing parameters
• What about the noise parameter, and the sigma for the line?
– several methods
• from first principles knowledge of the problem (seldom
really possible)
• play around with a few examples and choose (usually quite
effective, as precise choice doesn’t matter much)
– notice that if k_n is large, this says that points very seldom come from noise, however far from the line they lie
• usually biases the fit, by pushing outliers into the line
• rule of thumb: it's better to fit to the better-fitting points, within reason; if this is hard to do, then the model could be a problem
Other examples
• Segmentation
– a segment is a Gaussian that
emits feature vectors (which
could contain colour; or colour
and position; or colour, texture
and position).
– segment parameters are mean
and (perhaps) covariance
– if we knew which segment each
point belonged to, estimating
these parameters would be easy
– rest is on same lines as fitting
line
• Fitting multiple lines
– rather like fitting one line,
except there are more hidden
variables
– easiest is to encode as an array
of hidden variables, which
represent a table with a one
where the i’th point comes from
the j’th line, zeros otherwise
– rest is on same lines as above
Issues with EM
• Local maxima
– can be a serious nuisance in some problems
– no guarantee that we have reached the “right”
maximum
• Starting
– k means to cluster the points is often a good idea
Local maximum
which is an excellent fit to some points
and the deltas for this maximum
A dataset that is well fitted by four lines
Result of EM fitting, with one line (or at least,
one available local maximum).
Result of EM fitting, with two lines (or at least,
one available local maximum).
Seven lines can produce a rather logical answer
Motion segmentation with EM
• Model an image pair (or video sequence) as consisting of regions of parametric motion
– affine motion is popular:
$$\begin{pmatrix} v_x \\ v_y \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$
• Now we need to
– determine which pixels belong to which region
– estimate the parameters
• Likelihood
– assume $I(x, y, t) = I(x + v_x,\, y + v_y,\, t + 1) + \text{noise}$
• This is a straightforward missing variable problem; the rest is calculation
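To make the likelihood concrete, here is a small sketch that evaluates the affine flow for a given parameter vector and measures how well the brightness-constancy assumption holds at each pixel (bilinear warping via SciPy's map_coordinates; the helper names are illustrative).

```python
import numpy as np
from scipy.ndimage import map_coordinates

def affine_flow(shape, a, b, c, d, tx, ty):
    """Per-pixel affine motion field (vx, vy) for an image of the given shape."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    vx = a * xx + b * yy + tx
    vy = c * xx + d * yy + ty
    return vx, vy

def motion_residual(I_t, I_t1, params):
    """Squared brightness-constancy residual for one affine motion model."""
    vx, vy = affine_flow(I_t.shape, *params)
    h, w = I_t.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    warped = map_coordinates(I_t1, [yy + vy, xx + vx], order=1, mode='nearest')
    return (I_t - warped) ** 2   # low where the model explains the pixel's motion
```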
Three frames from the MPEG “flower garden” sequence
Figure from “Representing Images with Layers,” by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994, IEEE
Grey level shows region no. with highest probability
Segments and motion fields associated with them
Figure from “Representing Images with Layers,” by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994, IEEE
If we use multiple frames to estimate the appearance
of a segment, we can fill in occlusions; so we can
re-render the sequence with some segments removed.
Figure from “Representing Images with Layers,” by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994, IEEE
Some generalities
• Many, but not all
problems that can be
attacked with EM can
also be attacked with
RANSAC
– need to be able to get a
parameter estimate with a
manageably small number
of random choices.
– RANSAC is usually better
• Didn’t present in the most general
form
– in the general form, the
likelihood may not be a linear
function of the missing
variables
– in this case, one takes an
expectation of the likelihood,
rather than substituting
expected values of missing
variables
– Issue doesn’t seem to arise in
vision applications.
Model Selection
• We wish to choose a
model to fit to data
– e.g. is it a line or a circle?
– e.g. is this a perspective or
orthographic camera?
– e.g. is there an aeroplane
there or is it noise?
• Issue
– In general, models with
more parameters will fit a
dataset better, but are
poorer at prediction
– This means we can’t
simply look at the
negative log-likelihood (or
fitting error)
Top is not necessarily a better
fit than bottom
(actually, almost always worse)
We can discount the fitting error with some term in the number
of parameters in the model.
Discounts
• AIC (an information criterion)
– choose the model with the smallest value of
$$-2 L(D; \theta^*) + 2p$$
– p is the number of parameters, and L(D; θ*) is the log-likelihood of the data at the fitted parameters θ*
• BIC (Bayes information criterion)
– choose the model with the smallest value of
$$-2 L(D; \theta^*) + p \log N$$
– N is the number of data points
• Minimum description length
– same criterion as BIC, but derived in a completely different way
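Both scores are trivial to compute once a model has been fitted and its maximised log-likelihood is known; a small illustrative helper:

```python
import numpy as np

def aic(log_likelihood, n_params):
    """AIC: smaller is better."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_points):
    """BIC: smaller is better; penalises parameters more heavily for large N."""
    return -2.0 * log_likelihood + n_params * np.log(n_points)

# e.g. choose between a line (2 params) and a cubic (4 params) fit to N points:
# scores = {"line": bic(ll_line, 2, N), "cubic": bic(ll_cubic, 4, N)}
# best = min(scores, key=scores.get)
```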
Cross-validation
• Split the data set into two pieces, fit to one, and compute the negative log-likelihood on the other
• Average over multiple different splits
• Choose the model with the smallest value of this average
• The difference in averages for two different models is an estimate of the difference in KL divergence of the models from the source of the data
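A hedged sketch of the procedure; fit and neg_log_likelihood stand for whatever fitting routine and likelihood the model family provides, and are assumed interfaces rather than anything defined in the slides.

```python
import numpy as np

def cross_val_score(data, fit, neg_log_likelihood, n_splits=10, seed=0):
    """Average held-out negative log-likelihood over random 50/50 splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(data))
        half = len(data) // 2
        train, test = data[idx[:half]], data[idx[half:]]
        params = fit(train)                               # fit the model to one piece
        scores.append(neg_log_likelihood(test, params))   # score it on the other
    return float(np.mean(scores))

# model selection: pick the model family with the smallest average score, e.g.
# best = min(models, key=lambda m: cross_val_score(data, m.fit, m.nll))
```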
Model averaging
• Very often, it is smarter to use multiple models for prediction than just one
• e.g. motion capture data
– there are a small number of schemes that are used to put markers on the body
– given we know the scheme S and the measurements D, we can estimate the configuration of the body X
• We want
$$P(X \mid D) = P(X \mid S_1, D)\,P(S_1 \mid D) + P(X \mid S_2, D)\,P(S_2 \mid D) + P(X \mid S_3, D)\,P(S_3 \mid D)$$
• If it is obvious what the scheme is from the data, then averaging makes little difference
• If it isn't, then not averaging underestimates the variance of X: we think we have a more precise estimate than we do
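A minimal sketch of the averaging step: each scheme's estimate of X is weighted by that scheme's posterior probability given the data. Strictly, P(X | D) is a mixture of the per-scheme posteriors; averaging point estimates, as here, is the simplest version, and the inputs are assumed to be supplied by per-scheme estimators.

```python
import numpy as np

def model_average(estimates, posteriors):
    """Combine per-scheme estimates of X, weighted by P(S_k | D).
    estimates: list of arrays E[X | S_k, D]; posteriors: P(S_k | D)."""
    posteriors = np.asarray(posteriors, dtype=float)
    posteriors = posteriors / posteriors.sum()          # normalise, just in case
    return sum(p * np.asarray(e) for p, e in zip(posteriors, estimates))

# e.g. three marker schemes with posteriors 0.7, 0.2, 0.1:
# x_hat = model_average([x_s1, x_s2, x_s3], [0.7, 0.2, 0.1])
```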