9.913 Pattern Recognition for Vision
Class VI – Density Estimation
Yuri Ivanov
Fall 2004
Road Map
Density Estimation
• Parametric: Maximum Likelihood, Bayesian
• Non-parametric: Histograms, Kernel Methods, K-NN Method
• Semi-parametric: Mixture Density
Generative vs. Discriminative
There are two schools of thought in Machine Learning:
1. Generative:
• Estimate class models from data
• Compute the discriminant function
• Plug in your data – get the answer
2. Discriminative:
• Estimate the discriminant from data
• Plug in your data – get the answer
Last class
Density Estimation
Density Estimation is at the core of generative Pattern Recognition
P(a < x < b) = \int_a^b p(x)\,dx

mean:  E[x] = \int x\, p(x)\, dx

covariance:  E[(x - E[x])(x - E[x])^T] = \int (x - E[x])(x - E[x])^T\, p(x)\, dx

function mean:  E[f(x)] = \int f(x)\, p(x)\, dx

conditional mean:  E[y \mid x] = \int y\, p(y \mid x)\, dy
Refresher
Minimum expected risk:

R^* = \int \min_\alpha \left[ R(\alpha \mid x) \right] p(x)\, dx

… is based on the conditional risk:

\alpha^* = \arg\min_\alpha R(\alpha \mid x)

… which is computed from the posterior:

R(\alpha \mid x) = \sum_i \lambda(\alpha \mid \omega_i)\, P(\omega_i \mid x)

… which depends on the likelihood:

P(\omega \mid x) = \frac{p(x \mid \omega)\, P(\omega)}{p(x)}
Setting
Data:  D = \{D_i\}_{i=1}^{C}

Assume that D_j contains no information about \omega_i, \forall i \neq j.

Notationally, we abandon the class label:  p(x \mid \omega_i) \rightarrow p(x)

Keep in mind:  p(x \mid \omega_i) \neq p(x)

Goal:
model the probability density function p(x), given a finite number of
data points, x^1, x^2, …, x^N, drawn from it.
Three Methods
1. Parametric
• Good: small number of parameters
• Bad: choice of the parametric form
2. Non-parametric
• Good: data “dictates” the approximator
• Bad: large number of parameters
3. Semi-parametric
• Good: combine the best of both worlds
• Bad: harder to design
• Good again: design can be subject to optimization
Parametric Density Estimation
Estimate the density from a given functional family
Given:  p(x \mid \theta) = f(x, \theta)
Find:  \theta
Two methods of parameter estimation:
1. Maximum Likelihood method
• Parameters are viewed as unknown but fixed values
2. Bayesian method
• Parameters are random variables that have their distributions
Normal (Gaussian) Density Function
A common assumption - Gaussian
\theta = (\mu, \Sigma)

p(x \mid \theta) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

where d is the number of dimensions, |\Sigma| is the "volume" of the covariance, and (x - \mu)^T \Sigma^{-1} (x - \mu) is the squared Mahalanobis distance.

\mu = E[x]  –  d parameters
\Sigma = E[(x - \mu)(x - \mu)^T]  –  d(d+1)/2 parameters
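As a concrete sketch of evaluating this density, here is a small NumPy function (the function name and test values are my own, for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate normal density p(x | mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

# Example: a 2-D Gaussian evaluated at its mean (the peak of the density)
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
print(gaussian_pdf(np.zeros(2), mu, Sigma))
```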
Normal Density
\mu = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} 1 & 0 & 0.5 \\ 0 & 1 & 0.3 \\ 0.5 & 0.3 & 1 \end{bmatrix}

Constant density:  (x - \mu)^T \Sigma^{-1} (x - \mu) = C  – a quadratic surface.

\Sigma is positive semidefinite, so the surface is an ellipsoid.

Principal axes: the eigenvectors of \Sigma; the axis lengths are proportional to \sqrt{\lambda_i}, where \lambda_i are the eigenvalues of \Sigma.
Whitening Transform
Define:

\Lambda = \mathrm{diag}(\mathrm{eigval}(\Sigma))  – scaling matrix
\Phi = \mathrm{eigvec}(\Sigma)  – rotation matrix

Then:

W = \Lambda^{-1/2} \Phi^T  – "unscales" and "unrotates" the data

For all  x \sim N(0, \Sigma):   W x \sim N(0, I)
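A minimal NumPy sketch of the whitening transform under these definitions (the sample data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)   # x ~ N(0, Sigma)

# Eigendecomposition of the (sample) covariance: Sigma = Phi Lambda Phi^T
eigvals, Phi = np.linalg.eigh(np.cov(X, rowvar=False))
W = np.diag(eigvals ** -0.5) @ Phi.T                          # W = Lambda^{-1/2} Phi^T

Z = X @ W.T                                                   # whitened data, Wx per sample
print(np.cov(Z, rowvar=False).round(2))                       # ~ identity
```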
You Are Here
[Road map: we are in the Parametric branch – Maximum Likelihood.]
Maximum Likelihood
Parameters are fixed but unknown.
D \equiv \{x^1, x^2, \ldots, x^N\}  – a data set drawn from p(x)

Notationally, we make the density explicitly dependent on the parameters:  p(x) \rightarrow p(x \mid \theta)

Assuming that the data are drawn independently (i.i.d.):

L(\theta) \equiv p(D \mid \theta) = \prod_{n=1}^{N} p(x^n \mid \theta)  – the likelihood function

To find \theta, maximize L(\theta) with respect to the parameters.
Maximum Likelihood
Maximizing L(\theta) is equivalent to maximizing the log-likelihood function:

l(\theta) \equiv \log L(\theta) = \log \prod_{n=1}^{N} p(x^n \mid \theta) = \sum_{n=1}^{N} \log p(x^n \mid \theta)

To find \theta, set the derivative to 0:

\nabla_\theta\, l(\theta) = \sum_{n=1}^{N} \nabla_\theta \log p(x^n \mid \theta) = 0

… and solve for \theta.
Quick Summary – ML Parameter Estimation
P(\omega_i \mid x) = P(\omega_i \mid x, \theta_i) = \frac{p(x \mid \omega_i, \theta_i)\, P(\omega_i \mid \theta_i)}{p(x \mid \theta_i)}

The prior P(\omega_i \mid \theta_i) is the easy part; for the class-conditional density we plug in p(x \mid \hat{\theta}), where we pick

\hat{\theta} = \arg\max_\theta\, p(D \mid \theta) = \arg\max_\theta \prod_{n=1}^{N} p(x^n \mid \theta)
Solving a Maximum Likelihood Problem
Fixed covariance:

[Figure: candidate Gaussians with fixed covariance; the likelihood p(D \mid \theta) and the log-likelihood \log p(D \mid \theta) peak at the same \hat{\theta}.]
Maximum Likelihood Example
In d dimensions:

\nabla_\theta\, l(\theta) = \sum_n \nabla_\theta \left[ -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x^n - \mu)^T \Sigma^{-1} (x^n - \mu) \right]

Solving for the mean:

\nabla_\mu\, l(\theta) = \sum_n \Sigma^{-1}(x^n - \hat{\mu}) = 0 \;\Longrightarrow\;
\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x^n  – the arithmetic average of the samples
Maximum Likelihood Example (cont.)
\nabla_\theta\, l(\theta) = \sum_n \nabla_\theta \left[ -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x^n - \mu)^T \Sigma^{-1} (x^n - \mu) \right]

Solving for the covariance, using the identities for symmetric M:

\frac{d\,|M|}{dM} = |M|\, M^{-1}
\qquad \text{and} \qquad
\frac{d\,(a^T M^{-1} b)}{dM} = -M^{-1} a b^T M^{-1}

\nabla_\Sigma\, l(\theta) = -\tfrac{1}{2} \sum_n \left[ \hat{\Sigma}^{-1} - \hat{\Sigma}^{-1}(x^n - \hat{\mu})(x^n - \hat{\mu})^T \hat{\Sigma}^{-1} \right] = 0 \;\Longrightarrow\;

\hat{\Sigma} = \frac{1}{N} \sum_{n=1}^{N} (x^n - \hat{\mu})(x^n - \hat{\mu})^T  – the arithmetic average of the individual covariances (a biased estimate)
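A short NumPy sketch of these ML estimates; note the 1/N normalization, matching the biased estimate above (function and variable names are mine):

```python
import numpy as np

def ml_gaussian(X):
    """ML estimates of a Gaussian's mean and covariance from the rows of X."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)                  # arithmetic average of the samples
    diff = X - mu_hat
    Sigma_hat = (diff.T @ diff) / N          # 1/N, i.e. the biased ML estimate
    return mu_hat, Sigma_hat

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=2000)
mu_hat, Sigma_hat = ml_gaussian(X)
print(mu_hat, Sigma_hat, sep="\n")
```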
Recursive ML
What if data comes one sample at a time?
\hat{\mu}_N = \frac{1}{N} \sum_{n=1}^{N} x^n
= \frac{1}{N}\left[ x^N + \sum_{n=1}^{N-1} x^n \right]
= \frac{1}{N}\left[ x^N + (N-1)\,\hat{\mu}_{N-1} \right]
= \hat{\mu}_{N-1} + \frac{1}{N}\left[ x^N - \hat{\mu}_{N-1} \right]

This estimate "stiffens" with more data (as it should).

One idea – fix the fraction. Then the estimate can track a non-stationary process.
Recursive ML
Fix the update rate and retrace the steps:
v_N = v_{N-1} + \gamma \left[ x^N - v_{N-1} \right] = (1 - \gamma)\, v_{N-1} + \gamma\, x^N
= (1 - \gamma)^2 v_{N-2} + (1 - \gamma)\gamma\, x^{N-1} + \gamma\, x^N
\;\;\vdots
= (1 - \gamma)^M v_{N-M} + \sum_{k=1}^{M} (1 - \gamma)^{M-k} \gamma\, x^{N-M+k}

The weights decay with recency: recent samples count more than older ones.

[Figure: the fixed-rate estimate v_n compared with the running mean \hat{\mu}_n.]
Simple Example
Several images from a static camera:
How much noise is in it?
x = \mathrm{vec}(I_t - I_{t-1}), \qquad \hat{\mu} = 0, \quad \hat{\sigma} = 1.2
Now we can set a threshold that will
statistically distinguish pixel noise
from an object
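A sketch of that idea, assuming grayscale frames given as NumPy arrays; the 3σ threshold and the function names are my choices for illustration:

```python
import numpy as np

def noise_sigma(frame_prev, frame_curr):
    """Estimate the pixel-noise std from the difference of two static frames."""
    x = (frame_curr.astype(float) - frame_prev.astype(float)).ravel()  # x = vec(I_t - I_{t-1})
    return x.std()   # ML estimate; the mean of the difference is ~0 for a static scene

def foreground_mask(frame_prev, frame_curr, sigma, k=3.0):
    """Flag pixels whose change exceeds k standard deviations of the noise."""
    diff = np.abs(frame_curr.astype(float) - frame_prev.astype(float))
    return diff > k * sigma

# Usage (frames would come from the static camera):
# sigma = noise_sigma(I0, I1)          # e.g. ~1.2 in the slide's example
# mask = foreground_mask(I1, I2, sigma)
```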
Problems with ML
We are given two estimates:
[Figure: two Gaussian fits (\mu_1, \Sigma_1) and (\mu_2, \Sigma_2) overlaid on the true distribution.]

Which one do we believe?

ML gives a single solution, regardless of uncertainty.
You Are Here
[Road map: Parametric branch – Bayesian.]
Bayesian Parameter Estimation
In classification our goal so far has been to estimate P (w | x )
Let’s make the dependency on the data explicit:
P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{p(x \mid D)}

P(\omega_i \mid D)  – this is easy to compute
p(x \mid D)  – this is easy to compute by marginalization

What about p(x \mid \omega_i, D)?
Bayesian Parameter Estimation
This is still a supervised problem:

P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{p(x \mid D)}, \qquad D = \{D_1, D_2, \ldots, D_C\}

p(x \mid \omega_i, D) = p\left(x \mid \omega_i, \{D_j\}\right) = p\left(x \mid \omega_i, D_i, \{D_j\}_{j \neq i}\right) = p(x \mid \omega_i, D_i)

P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i \mid D)}{p(x \mid D)}
Bayesian Parameter Estimation
We will assume that we can obtain “labeled” data, so again:
Notationally:  p(x \mid \omega_i, D_i) \rightarrow p(x \mid D)

Now our problem is to compute the density for x given the data D.

We assume the form of p(x) – the source density for D:  p(x) \rightarrow p(x \mid \theta)

… and treat \theta as a random variable.
Bayesian Parameter Estimation
Instead of choosing a value for a parameter, we use them all:
p(x \mid D) = \int p(x, \theta \mid D)\, d\theta = \int p(x \mid \theta, D)\, p(\theta \mid D)\, d\theta = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

(x is independent of D given \theta; we chose the form of p(x \mid \theta) – so what is p(\theta \mid D)?)

The data predict the new sample: we average the densities p(x \mid \theta) over ALL possible values of \theta, weighted by the posterior probability of each.
Bayesian Parameter Estimation
Computing the posterior probability for \theta, using Bayes rule:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

where p(\theta) is the prior belief about the parameters (a density).

Using independence:  p(D \mid \theta) = \prod_{n=1}^{N} p(x^n \mid \theta)

The Bayesian method does not commit to a particular value of \theta, but uses the entire distribution.
Quick Summary – Bayesian Parameter Estimation
P(\omega_i \mid x) = P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i \mid D)}{p(x \mid D)}

The prior P(\omega_i \mid D) is easy; the class-conditional density is hard:

p(x \mid D_i) = \int p(x \mid \theta)\, p(\theta \mid D_i)\, d\theta,
\qquad \text{with} \qquad
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
\quad \text{and} \quad
p(D \mid \theta) = \prod_{n=1}^{N} p(x^n \mid \theta)

We "know" the prior p(\theta)*, **

* Non-informative prior – doesn't introduce bias
** Conjugate prior – causes p(\theta \mid D) to have the same functional form as p(D \mid \theta)
Bayesian Parameter Estimation
For \theta = \mu:   p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu

[Figure: parameter prior and posterior; the Bayesian solution averages the weighted likelihoods rather than picking the single ML solution.]
Bayesian Parameter Estimation - Example
p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu

First let's deal with the parameter.

Likelihood:  p(x \mid \mu) = N(\mu, \sigma^2), with \sigma^2 fixed
Parameter prior:  p(\mu) = N(\mu_0, \sigma_0^2)

Need to find p(\mu \mid D). Bayes rule again:

p_N(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{p(D)} = \alpha \left[ \prod_{n=1}^{N} p(x^n \mid \mu) \right] N(\mu_0, \sigma_0^2) = N(\mu_N, \sigma_N^2)

This N-sample parameter posterior is itself a Gaussian; we need \mu_N and \sigma_N.
Bayesian Parameter Estimation - Example
So, the posterior is a Gaussian
p_N(\mu \mid D) = N(\mu_N, \sigma_N^2)

After some algebra and identifying the terms:

\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}
 – when Gaussians multiply, precisions add

… and

\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, \bar{x} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0
 \qquad (\bar{x} \text{ is the sample mean})

With increasing N the variance of the posterior decreases and the prior becomes unimportant.
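A small sketch of these closed-form posterior updates for the mean of a Gaussian with known variance (the names and the numbers in the example are mine):

```python
import numpy as np

def posterior_mean_var(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_N, sigma_N^2) for the mean of N(mu, sigma2), prior N(mu0, sigma0_2)."""
    N = len(x)
    xbar = np.mean(x)
    sigmaN_2 = 1.0 / (N / sigma2 + 1.0 / sigma0_2)             # precisions add
    muN = (N * sigma0_2 * xbar + sigma2 * mu0) / (N * sigma0_2 + sigma2)
    return muN, sigmaN_2

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)                              # data with sigma = 1
muN, sigmaN_2 = posterior_mean_var(x, sigma2=1.0, mu0=0.0, sigma0_2=10.0)
# The predictive density for a new x is N(muN, sigma2 + sigmaN_2)
print(muN, sigmaN_2)
```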
Bayesian Parameter Estimation - Example
Now the integral:
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
= \int N(\mu, \sigma^2)\, N(\mu_N, \sigma_N^2)\, d\mu = N(\mu_N, \sigma^2 + \sigma_N^2)

You can show that the result is also a Gaussian.

Any guesses about why the Gaussian is such a common assumption?
Recursive Bayes
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

For the N-point likelihood:

p(D^N \mid \theta) = \prod_{n=1}^{N} p(x^n \mid \theta)
= p(x^N \mid \theta) \prod_{n=1}^{N-1} p(x^n \mid \theta)
= p(x^N \mid \theta)\, p(D^{N-1} \mid \theta)

From this, the recursive relation for the posterior:

p(\theta \mid D^N) = \frac{p(x^N \mid \theta)\, p(D^{N-1} \mid \theta)\, p(\theta)}{p(D^N)}
= \frac{p(x^N \mid \theta)\, p(\theta \mid D^{N-1})}{\int p(x^N \mid \theta)\, p(\theta \mid D^{N-1})\, d\theta}
Recursive Bayes (cont.)
Again:

p(\theta \mid D^N) = \frac{p(x^N \mid \theta)\, p(\theta \mid D^{N-1})}{\int p(x^N \mid \theta)\, p(\theta \mid D^{N-1})\, d\theta}
 – a 1-point update.

Setting N = 1 in the Gaussian example:

\frac{1}{\sigma_n^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_{n-1}^2}

\mu_n = \frac{\sigma_{n-1}^2}{\sigma_{n-1}^2 + \sigma^2}\, x^n + \frac{\sigma^2}{\sigma_{n-1}^2 + \sigma^2}\, \mu_{n-1}
Recursive Bayes (cont.)
[Figure: the likelihood p(x \mid \theta) and the posteriors p(\theta \mid D^N) for N = 1, 2, 10; the posterior sharpens as N grows.]
Problems with Bayesian Method
1. Integration is difficult
2. Analytic solutions are only available for restricted class of
densities
3. Technicality: if the true p(x \mid \theta) is NOT what we assume it is, the
prior probability of any parameter setting is 0!
4. Integration is difficult
5. Did I mention that the integration is hard?
Relation between Bayesian and ML Inference
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) = \left[ \prod_n p(x^n \mid \theta) \right] p(\theta) = L(\theta)\, p(\theta)
 – which peaks at \hat{\theta}_{ML}

If the peak is sharp and p(\theta) is flat, then:

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
\approx \int p(x \mid \hat{\theta})\, p(\theta \mid D)\, d\theta = p(x \mid \hat{\theta}) \int p(\theta \mid D)\, d\theta = p(x \mid \hat{\theta})

As N \rightarrow \infty,\; p(x \mid D) \rightarrow p(x \mid \hat{\theta}).
Non-Parametric Methods for Density Estimation
Non-parametric methods do not assume any particular form for p(x)
1. Histograms
2. Kernel Methods
3. K-NN method
You Are Here
[Road map: Non-parametric branch – Histograms.]
Histograms
\hat{P}(x) is a discrete approximation of p(x).

• Count the number of times that x lands in the i-th bin:

H(i) = \sum_{j=1}^{N} I(x^j \in R_i), \quad \forall i = 1, 2, \ldots, M

• Normalize:

\hat{P}(i) = \frac{H(i)}{\sum_{j=1}^{M} H(j)}
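A minimal sketch of a histogram density estimate in NumPy; dividing by the bin width as well as the total count makes the estimate integrate to 1 (the bin count and data are illustrative):

```python
import numpy as np

def histogram_density(x, M=10):
    """Histogram estimate of p(x): counts normalized by N and by bin width."""
    counts, edges = np.histogram(x, bins=M)
    widths = np.diff(edges)
    p_hat = counts / (counts.sum() * widths)   # integrates to 1 over the bins
    return p_hat, edges

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)
p_hat, edges = histogram_density(x, M=20)
print((p_hat * np.diff(edges)).sum())          # ~1.0
```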
Histograms
How many bins?

[Figure: histogram estimates \hat{P}(x) of the same p(x) with M = 3, 10, 20, and 50 bins; too few bins "oversmooth", too many "overfit".]
Histograms
Good:
• Once it is constructed, the data can be discarded
• Quick and intuitive
Bad:
• Very sensitive to number of bins, M
• Estimated density is not smooth
• Poor generalization in higher dimensions
Aside: Curse of dimensionality (Bellman, ‘61):
• Imagine we build a histogram of a 1-d feature (say, Hue)
• 10 bins
• 1 bin = 10% of the input space
• need at least 10 points to populate every bin
• We add another feature (say, Saturation)
• 10 bins again
• 1 bin = 1% of the input space
• we need at least 100 points to populate every bin
• We add another feature (say, Value)
• 10 bins again
• 1 bin = 0.1% of the input space
• we need at least 1000 points to populate every bin
N = b^d  – the number of points grows exponentially with dimension
Aside: Curse Continues
Volume of a cube in R^d with side l:   V_l = l^d

Volume of a cube with side l - \epsilon:   V_\epsilon = (l - \epsilon)^d

Volume of the \epsilon-shell:   \Delta = V_l - V_\epsilon = l^d - (l - \epsilon)^d

Ratio of the volume of the \epsilon-shell to the volume of the cube:

\frac{\Delta}{V_l} = \frac{l^d - (l - \epsilon)^d}{l^d} = 1 - \left(1 - \frac{\epsilon}{l}\right)^d \rightarrow 1 \text{ as } d \rightarrow \infty !!!!!
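A quick numerical check of that ratio (the side length, shell thickness, and dimensions are arbitrary choices for illustration):

```python
# Fraction of a cube's volume that lies in a thin outer shell, as dimension grows.
l, eps = 1.0, 0.05
for d in (1, 2, 10, 100, 1000):
    shell_fraction = 1.0 - (1.0 - eps / l) ** d
    print(f"d = {d:4d}: shell fraction = {shell_fraction:.3f}")
# In high dimensions nearly all of the volume is in the shell.
```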
Aside: Lessons of the curse
In generative models:
• Use as much data as you can get your hands on
• Reduce dimensionality as much as you can get away with
<End of Digression>
General Reasoning
By definition:

P(x \in R) = P = \int_R p(x')\, dx'

If we have N i.i.d. points drawn from p(x):

P(|\{x \in R\}| = k) = \frac{N!}{k!\,(N-k)!}\, P^k (1 - P)^{N-k} = B(N, P)

(the binomial coefficient counts the unique k vs. (N-k) splits; P^k is the probability that k particular points fall in R; (1-P)^{N-k} the probability that the rest do not)

B(N, P) is the binomial distribution of k.
General Reasoning (cont.)
Mean and variance of B(N, P):

Mean:   \mu = E[k] = NP \;\Longrightarrow\; P = E[k/N]

Variance:   \sigma^2 = E[(k - \mu)^2] = NP(1-P) \;\Longrightarrow\; E\left[(k/N - P)^2\right] = \left(\frac{\sigma}{N}\right)^2 = \frac{P(1-P)}{N}

That is:
• E[k/N] is a good estimate of P
• k/N is distributed around P with vanishing variance

So:   P \approx k/N
General Reasoning (cont.)
So:   P \approx k/N

On the other hand, under mild assumptions (R small enough that p is roughly constant over it):

P = \int_R p(x')\, dx' \approx p(x)\, V,  \quad \text{where } V \text{ is the volume of } R \text{ (not of } p(x))

… which leads to:

p(x) \approx \frac{k}{NV}
General Reasoning (cont.)
Now, given N data points, how do we actually estimate p(x) \approx \frac{k}{NV}?

• Fix k and grow V until it encloses k points  →  K-Nearest Neighbors (K-NN)
• Fix V and count how many points k it encloses  →  Kernel methods
You Are Here
[Road map: Non-parametric branch – Kernel Methods.]
Kernel Methods of Density Estimation
We choose V by specifying a hypercube with side h:   V = h^d

Mathematically, use the unit-hypercube indicator:

H(y) = \begin{cases} 1 & |y_j| < 1/2,\; j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}

This is a kernel function:   H(y) \geq 0\; \forall y \quad \text{and} \quad \int H(y)\, dy = 1
Parzen Windows
Then H\left(\frac{x - x^n}{h}\right) is the indicator of a hypercube with side h centered at x^n.

H can help count the points in the volume V around any x (the h-neighborhood of x):

k(x) = \sum_{n=1}^{N} H\left(\frac{x - x^n}{h}\right)

Points outside the h-neighborhood contribute nothing to the count at x.
Rectangular Kernel
So the number of points in the h-neighborhood of x,

k(x) = \sum_{n=1}^{N} H\left(\frac{x - x^n}{h}\right),

… is easily converted to the density estimate:

\tilde{p}(x) = \frac{k(x)}{NV} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^d} H\left(\frac{x - x^n}{h}\right)
= \frac{1}{N} \sum_{n=1}^{N} K(x, x^n),

where each kernel K(x, x^n) = \frac{1}{h^d} H\left(\frac{x - x^n}{h}\right) integrates to 1.

Subtle point:

\int \left[ \frac{1}{N} \sum_{n=1}^{N} K(x, x^n) \right] dx = \frac{1}{N} \sum_{n=1}^{N} \int K(x, x^n)\, dx = 1
\;\Longrightarrow\; \int \tilde{p}(x)\, dx = 1
Example
[Figure: source density and rectangular-kernel estimates with h = 1, 2, 4.]
Smoothed Window Functions
The problem is, as with histograms, that the estimate is discontinuous.

We can choose a smoother kernel, s.t.

\tilde{p}(x) \geq 0\; \forall x \quad \text{and} \quad \int \tilde{p}(x)\, dx = 1

(ensured by the kernel conditions).

E.g. <loud cheer> a (spherical) Gaussian:

K(x, x^n) = \frac{1}{(2\pi h^2)^{d/2}} \exp\left( -\frac{\|x - x^n\|^2}{2h^2} \right)

… so:

\tilde{p}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{d/2}} \exp\left( -\frac{\|x - x^n\|^2}{2h^2} \right)
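A compact NumPy sketch of this Gaussian Parzen-window estimator (the data, bandwidth, and names are illustrative):

```python
import numpy as np

def parzen_gaussian(x_query, X, h):
    """Gaussian Parzen-window density estimate at the query points."""
    d = X.shape[1]
    # Squared distances between every query point and every data point
    sq_dist = ((x_query[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    norm = (2 * np.pi * h ** 2) ** (d / 2)
    return np.exp(-sq_dist / (2 * h ** 2)).sum(axis=1) / (norm * len(X))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))          # 1-D data as an (N, d) array
grid = np.linspace(-4, 4, 9)[:, None]
print(parzen_gaussian(grid, X, h=0.5).round(3))  # estimated density on the grid
```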
Example
[Figure: source density and Gaussian-kernel estimates with h = 0.6, 1.0, 1.3.]
Some Insight
It is interesting to look at the expectation of the estimate with respect to all possible datasets:

E[\tilde{p}(x)] = E\left[ \frac{1}{N} \sum_{n=1}^{N} K(x, x^n) \right] = E[K(x, x')]
= \int K(x - x')\, p(x')\, dx'  – a convolution with the true density.

That is, E[\tilde{p}(x)] = p(x) only if K(x, x') = \delta(x - x').

But not for a finite data set!
Conditions for Convergence
How small can we make h for a given N ?
\lim_{N \to \infty} h_N^d = 0  – it should go to 0

\lim_{N \to \infty} N h_N^d = \infty  – but it should shrink more slowly than 1/N

(based on a similar analysis of the variance of the estimates)

E.g.:   h_N^d = h_1^d / \sqrt{N}   or   h_N^d = h_1^d / \log(N)

Note that the choice of h_1^d is still up to us.
Problems With Kernel Estimation
• Need to choose the width parameter, h
  • Can be chosen empirically
  • Can be adaptive, e.g. h_j = h\, d_{jk}, where d_{jk} is the distance from x^j to its k-th nearest neighbor
• Need to store all the data to represent the density
  • This leads to Mixture Density Estimation
You Are Here
[Road map: Non-parametric branch – K-NN Method.]
K-Nearest Neighbors
Recall that:   \tilde{p}(x) = \frac{k}{NV}

Now we fix k (typically k = \sqrt{N}) and expand V until it contains k points.

This is not a true density! E.g., choose N = 1, k = 1. Then:

\tilde{p}(x) = \frac{1}{1 \cdot |x - x^1|}

Oops! The estimate diverges at x^1 and does not integrate to 1.

BUT it is useful for a number of theoretical and practical reasons.
K-NN Classification Rule
Let’s try classification with K-NN density estimate
Data:
N  – total number of points
N_j  – number of points in class \omega_j

Need to find the class label for a query, x.

Expand a sphere from x to include K points:
K  – number of neighbors of x
K_j  – number of points of class \omega_j among the K
KNN Classification
Then class priors are given by:   P(\omega_j) = \frac{N_j}{N}

We can estimate conditional and marginal densities around any x:

p(x \mid \omega_j) = \frac{K_j}{N_j V}, \qquad p(x) = \frac{K}{NV}

By Bayes rule:

P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)} = \frac{K_j}{N_j V} \cdot \frac{N_j}{N} \cdot \frac{NV}{K} = \frac{K_j}{K}

Then for minimum error rate classification:   C = \arg\max_j K_j
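A minimal brute-force K-NN classifier following this rule (the function name and toy data are mine):

```python
import numpy as np

def knn_classify(x, X_train, y_train, K=5):
    """Assign x to the class with the most representatives among its K nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)     # distances to all training points
    neighbors = y_train[np.argsort(dists)[:K]]      # labels of the K closest points
    labels, counts = np.unique(neighbors, return_counts=True)
    return labels[np.argmax(counts)]                # C = argmax_j K_j

# Toy usage with two 2-D Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(knn_classify(np.array([2.5, 2.5]), X, y, K=7))   # most likely class 1
```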
KNN Classification
Important theoretical result:

In the extreme case K = 1, with P = \lim_{N \to \infty} P_N(\mathrm{error}) denoting the limiting N-sample error rate, it can be shown that

P^* \leq P \leq P^* \left( 2 - \frac{c}{c-1} P^* \right)

That is, using just the single-neighbor rule, the error rate is at most
twice the Bayes error!!!
Problems with Non-parametric Methods
• Memory: need to store all data points
• Computation: need to compute distances to all data points
every time
• Parameter choice: need to choose the smoothing parameter
You Are Here
[Road map: Semi-parametric branch – Mixture Density.]
Mixture Density Model
Mixture model – a linear combination of parametric densities:

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j)

where M is the number of components, p(x \mid j) is the component density, and P(j) is the component "prior", with

P(j) \geq 0\; \forall j \quad \text{and} \quad \sum_{j=1}^{M} P(j) = 1

Uses MUCH fewer "kernels" than kernel methods.
The kernels are parametric densities, subject to estimation.
Example
[Figure: a mixture p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j) built from a few Gaussian components.]
Mixture Density
Using the ML principle, the objective function is the log-likelihood:

l(\theta) \equiv \log \prod_{n=1}^{N} p(x^n) = \sum_{n=1}^{N} \log \left[ \sum_{j=1}^{M} p(x^n \mid j)\, P(j) \right]

Differentiate with respect to the parameters:

\nabla_{\theta_j} l(\theta) = \sum_{n=1}^{N} \frac{\partial}{\partial \theta_j} \log \left[ \sum_{k=1}^{M} p(x^n \mid k)\, P(k) \right]
= \sum_{n=1}^{N} \frac{1}{\sum_{k=1}^{M} p(x^n \mid k)\, P(k)} \,\frac{\partial}{\partial \theta_j} \left[ p(x^n \mid j)\, P(j) \right]
Mixture Density
Again, let's assume that each component density p(x \mid j) is a Gaussian.
We need to estimate M priors, and M sets of means and covariances.

\frac{\partial l(\theta)}{\partial \mu_j} = \sum_{n=1}^{N} P(j \mid x^n)\, \left[ \Sigma_j^{-1} (x^n - \hat{\mu}_j) \right]

Setting it to 0 and solving for \mu_j:

\hat{\mu}_j = \frac{\sum_{n=1}^{N} P(j \mid x^n)\, x^n}{\sum_{n=1}^{N} P(j \mid x^n)}  – a convex combination of all the data
Mixture Density
Similarly for the covariances:

\frac{\partial l(\theta)}{\partial \Sigma_j} = \frac{1}{2} \sum_{n=1}^{N} P(j \mid x^n) \left[ -\hat{\Sigma}_j^{-1} + \hat{\Sigma}_j^{-1} (x^n - \hat{\mu}_j)(x^n - \hat{\mu}_j)^T \hat{\Sigma}_j^{-1} \right]

Setting it to 0 and solving for \Sigma_j:

\hat{\Sigma}_j = \frac{\sum_{n=1}^{N} P(j \mid x^n)\, (x^n - \hat{\mu}_j)(x^n - \hat{\mu}_j)^T}{\sum_{n=1}^{N} P(j \mid x^n)}
Mixture Density
A little harder for P(j) – the optimization is subject to constraints:

\sum_{j=1}^{M} P(j) = 1 \quad \text{and} \quad P(j) \geq 0\; \forall j

Here is a trick (a softmax parameterization) to enforce the constraints:

P(j) = \frac{\exp(\gamma_j)}{\sum_{k=1}^{M} \exp(\gamma_k)},
\qquad
\frac{\partial P(i)}{\partial \gamma_j} = \delta(i - j)\, P(j) - P(i)\, P(j)
Mixture Density
Using the chain rule:

\nabla_{\gamma_j} l(\theta) = \sum_{k=1}^{M} \frac{\partial l(\theta)}{\partial P(k)} \frac{\partial P(k)}{\partial \gamma_j}
= \sum_{k=1}^{M} \sum_{n=1}^{N} \frac{p(x^n \mid k)}{p(x^n)} \left( \delta_{jk}\, P(j) - P(j)\, P(k) \right)

= \sum_{n=1}^{N} \left[ \frac{p(x^n \mid j)}{p(x^n)}\, P(j) - \sum_{k=1}^{M} \frac{p(x^n \mid k)}{p(x^n)}\, P(j)\, P(k) \right]

= \sum_{n=1}^{N} \left[ P(j \mid x^n) - P(j) \sum_{k=1}^{M} P(k \mid x^n) \right]
= \sum_{n=1}^{N} \left\{ P(j \mid x^n) - P(j) \right\} = 0

The last expression gives the value at the extremum:

P(j) = \frac{1}{N} \sum_{n=1}^{N} P(j \mid x^n)
Mixture Density
What’s the problem?
1
P( j) =
N
N
N
n
P
(
j
|
x
)
�
mˆ j =
n =1
n
n
P
(
j
|
x
)
x
�
n =1
N
� P( j | x )
n
n =1
� P( j | x ) ( x
N
n
Sˆ j =
n =1
n
- mˆ j )( x - mˆ j )
n
T
N
n
P
(
j
|
x
)
�
n =1
We can’t compute these directly!
Solution – EM algorithm. We will study it in Clustering.
Fall 2004
Pattern Recognition for Vision
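To make the circularity concrete, here is a sketch of one round of the fixed-point iteration implied by these equations: compute the posteriors P(j | x^n) from the current parameters, then re-estimate the parameters. This alternation is exactly what the EM algorithm formalizes; the function names and structure below are my own illustration:

```python
import numpy as np

def gaussian(X, mu, Sigma):
    """Multivariate normal density evaluated at the rows of X."""
    d = mu.shape[0]
    diff = X - mu
    maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * maha) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

def mixture_update(X, priors, means, covs):
    """One round: posteriors P(j|x^n) from the current parameters, then re-estimate."""
    M, N = len(priors), len(X)
    # P(j | x^n) for every component and sample (Bayes rule over components)
    post = np.stack([priors[j] * gaussian(X, means[j], covs[j]) for j in range(M)])
    post /= post.sum(axis=0, keepdims=True)
    # Re-estimate the parameters with the posteriors held fixed
    Nj = post.sum(axis=1)
    new_priors = Nj / N
    new_means = [post[j] @ X / Nj[j] for j in range(M)]
    new_covs = [((X - new_means[j]).T * post[j]) @ (X - new_means[j]) / Nj[j]
                for j in range(M)]
    return new_priors, new_means, new_covs
```

Iterating this update from some initial guess is the alternation that will reappear when we study EM.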
You Are Here
[Road map recap: Density Estimation – Parametric (Max Likelihood, Bayesian), Non-parametric (Histograms, Kernel Methods, K-NN Method), Semi-parametric (Mixture Density).]