9.913 Pattern Recognition for Vision
Class VI – Density Estimation
Yuri Ivanov, Fall 2004

Road Map
Density Estimation
• Parametric: Max Likelihood, Bayesian
• Non-parametric: Histograms, Kernel Methods, K-NN Method
• Semi-parametric: Mixture Density

Generative vs. Discriminative
There are two schools of thought in Machine Learning:
1. Generative:
• Estimate class models from data
• Compute the discriminant function
• Plug in your data – get the answer
2. Discriminative:
• Estimate the discriminant from data
• Plug in your data – get the answer
(Last class)

Density Estimation
Density estimation is at the core of generative pattern recognition:
$P(a < x < b) = \int_a^b p(x)\,dx$
mean: $E[x] = \int x\,p(x)\,dx$
covariance: $E[(x - E[x])(x - E[x])^T] = \int (x - E[x])(x - E[x])^T\, p(x)\,dx$
function mean: $E[f(x)] = \int f(x)\,p(x)\,dx$
conditional mean: $E[y\,|\,x] = \int y\,p(y\,|\,x)\,dy$

Refresher
Minimum expected risk: $R^* = \int \min_\alpha \left[R(\alpha\,|\,x)\right] p(x)\,dx$
… is based on the conditional risk: $\alpha^* = \arg\min_\alpha R(\alpha\,|\,x)$
… which is computed from the posterior: $R(\alpha\,|\,x) = \sum_i \lambda(\alpha\,|\,\omega_i)\,P(\omega_i\,|\,x)$
… which depends on the likelihood: $P(\omega\,|\,x) = \dfrac{p(x\,|\,\omega)\,P(\omega)}{p(x)}$

Setting
Data: $D = \{D_i\}_{i=1}^C$
Assume that $D_j$ contains no information about $\omega_i$, $\forall i \neq j$.
Notationally, we abandon the class label: $p(x\,|\,\omega_i) \rightarrow p(x)$.
Keep in mind: $p(x\,|\,\omega_i) \neq p(x)$.
Goal: model the probability density function p(x), given a finite number of data points, x1, x2, …, xN, drawn from it.

Three Methods
1. Parametric
• Good: small number of parameters
• Bad: choice of the parametric form
2. Non-parametric
• Good: data “dictates” the approximator
• Bad: large number of parameters
3. Semi-parametric
• Good: combines the best of both worlds
• Bad: harder to design
• Good again: the design can be subject to optimization
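To make the refresher concrete before turning to parameter estimation, here is a minimal Python sketch of the generative pipeline: given class-conditional densities and priors, compute the posteriors with Bayes rule and pick the minimum-error-rate class. The densities and priors are made-up illustrative values, not numbers from the lecture.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """1-d Gaussian density N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical class models (the "generative" step: estimate these from data).
class_params = [(-1.0, 1.0), (2.0, 0.5)]     # (mean, variance) per class
priors = np.array([0.7, 0.3])                # P(omega_i)

x = 0.4                                      # query point
likelihoods = np.array([gauss_pdf(x, m, v) for m, v in class_params])  # p(x | omega_i)
posteriors = likelihoods * priors
posteriors /= posteriors.sum()               # P(omega_i | x) = p(x|omega_i) P(omega_i) / p(x)

decision = np.argmax(posteriors)             # minimum-error-rate (0/1 loss) decision
print(posteriors, decision)
```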
Parametric Density Estimation
Estimate the density from a given functional family.
Given: $p(x\,|\,\theta) = f(x, \theta)$.  Find: $\theta$.
Two methods of parameter estimation:
1. Maximum Likelihood method
• Parameters are viewed as unknown but fixed values
2. Bayesian method
• Parameters are random variables that have their own distributions

Normal (Gaussian) Density Function
A common assumption – Gaussian, $\theta = (\mu, \Sigma)$:
$p(x\,|\,\theta) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
d – number of dimensions; $|\Sigma|$ – the “volume” of the covariance; $(x-\mu)^T\Sigma^{-1}(x-\mu)$ – the squared Mahalanobis distance.
$\mu = E[x]$ – d parameters
$\Sigma = E\left[(x-\mu)(x-\mu)^T\right]$ – d(d+1)/2 parameters

Normal Density
$\mu = \begin{bmatrix}0\\0\\0\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}1 & 0 & .5\\ 0 & 1 & .3\\ .5 & .3 & 1\end{bmatrix}$
Constant density: $(x-\mu)^T\Sigma^{-1}(x-\mu) = C$ – a quadratic surface.
$\Sigma$ positive semidefinite ⇒ the surface is an ellipsoid.
Principal axes: eigenvectors of $\Sigma$; lengths proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $\Sigma$.

Whitening Transform
Define: $\Lambda = \mathrm{diag}(\mathrm{eigval}(\Sigma))$ – scaling matrix; $\Phi = \mathrm{eigvec}(\Sigma)$ – rotation matrix.
Then $W = \Lambda^{-1/2}\Phi^T$ “unscales” and “unrotates” the data:
for all $x \sim N(0, \Sigma)$, $Wx \sim N(0, I)$.

You Are Here: Parametric → Maximum Likelihood

Maximum Likelihood
Parameters are fixed but unknown.
$D \equiv \{x^1, x^2, \ldots, x^N\}$ – a data set drawn from p(x).
Notationally, we make the density explicitly dependent on the parameters: $p(x) \rightarrow p(x\,|\,\theta)$.
Assuming that the data are drawn independently (i.i.d.):
$L(\theta) \equiv p(D\,|\,\theta) = \prod_{n=1}^N p(x^n\,|\,\theta)$ – the likelihood function.
To find $\theta$, maximize $L(\theta)$ with respect to the parameters.

Maximum Likelihood
Maximizing $L(\theta)$ is equivalent to maximizing the log-likelihood function:
$l(\theta) \equiv \log L(\theta) = \log \prod_{n=1}^N p(x^n\,|\,\theta) = \sum_{n=1}^N \log p(x^n\,|\,\theta)$
To find $\theta$, set the derivative to zero,
$\nabla_\theta\, l(\theta) = \sum_{n=1}^N \nabla_\theta \log p(x^n\,|\,\theta) = 0$,
and solve for $\theta$.

Quick Summary – ML Parameter Estimation
$P(\omega_i\,|\,x) = P(\omega_i\,|\,x, \theta_i) = \dfrac{p(x\,|\,\omega_i, \theta_i)\,P(\omega_i\,|\,\theta_i)}{p(x\,|\,\theta_i)}$ – easy, once we have $p(x\,|\,\hat\theta)$.
We pick $\hat\theta = \arg\max_\theta p(D\,|\,\theta) = \arg\max_\theta \prod_{n=1}^N p(x^n\,|\,\theta)$.

Solving a Maximum Likelihood Problem
[Figure: fixed covariance, several candidate densities; $p(D\,|\,\theta)$ and $\log p(D\,|\,\theta)$ peak at the same $\hat\theta$.]

Maximum Likelihood Example
In d dimensions:
$\nabla_\theta\, l(\theta) = \sum_n \nabla_\theta \left[ -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x^n-\mu)^T\Sigma^{-1}(x^n-\mu) \right]$
Solving for the mean:
$\nabla_\mu\, l(\theta) = \sum_n \Sigma^{-1}(x^n - \hat\mu) = 0$
$\hat\mu = \dfrac{1}{N}\sum_{n=1}^N x^n$ – the arithmetic average of the samples.

Maximum Likelihood Example (cont.)
Solving for the covariance, using the identities for symmetric M:
$\dfrac{d|M|}{dM} = |M|\,M^{-1}$ and $\dfrac{d}{dM}\left(a^T M^{-1} b\right) = -M^{-1} a b^T M^{-1}$
$\nabla_\Sigma\, l(\theta) = -\tfrac{1}{2}\sum_n \left[ \hat\Sigma^{-1} - \hat\Sigma^{-1}(x^n - \hat\mu)(x^n - \hat\mu)^T \hat\Sigma^{-1} \right] = 0$
$\hat\Sigma = \dfrac{1}{N}\sum_{n=1}^N (x^n - \hat\mu)(x^n - \hat\mu)^T$ – the arithmetic average of the individual covariances (a biased estimate).

Recursive ML
What if the data come one sample at a time?
$\hat\mu_N = \dfrac{1}{N}\sum_{n=1}^N x^n = \dfrac{1}{N}\left[x^N + \sum_{n=1}^{N-1} x^n\right] = \dfrac{1}{N}\left[x^N + (N-1)\hat\mu_{N-1}\right] = \hat\mu_{N-1} + \dfrac{1}{N}\left[x^N - \hat\mu_{N-1}\right]$
This estimate “stiffens” with more data (as it should).
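A minimal numpy sketch of the two results above: the batch ML estimates of a Gaussian’s mean and (biased) covariance, and the equivalent one-sample-at-a-time update of the mean. The data are synthetic, chosen only to check that the recursive estimate reproduces the batch one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[1.0, 0.3], [0.3, 0.5]], size=500)
N, d = X.shape

# Batch ML estimates.
mu_hat = X.mean(axis=0)                              # (1/N) sum_n x^n
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N        # biased: divides by N, not N-1

# Recursive ML estimate of the mean: mu_n = mu_{n-1} + (1/n)(x^n - mu_{n-1}).
mu_rec = np.zeros(d)
for n, x in enumerate(X, start=1):
    mu_rec += (x - mu_rec) / n

print(np.allclose(mu_rec, mu_hat))   # True: the recursive update reproduces the batch mean
```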
One idea – fix the fraction; then the estimate can track a non-stationary process.

Recursive ML
Fix the update rate $\gamma$ and retrace the steps:
$v_N = v_{N-1} + \gamma\left[x^N - v_{N-1}\right] = (1-\gamma)v_{N-1} + \gamma x^N = (1-\gamma)^2 v_{N-2} + (1-\gamma)\gamma x^{N-1} + \gamma x^N = \ldots = (1-\gamma)^M v_{N-M} + \gamma\sum_{k=0}^{M-1}(1-\gamma)^k x^{N-k}$
Recency weights: recent samples count more than old ones.
[Figure: the fixed-rate estimate $v_n$ compared with the ML estimate $\hat\mu_n$.]

Simple Example
Several images from a static camera: how much noise is in them?
$x = \mathrm{vec}(I_t - I_{t-1})$, giving $\mu = 0$, $\sigma = 1.2$.
Now we can set a threshold that statistically distinguishes pixel noise from an object.

Problems with ML
We are given two estimates of the true distribution: $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$. Which one do we believe?
ML gives a single solution, regardless of uncertainty.

You Are Here: Parametric → Bayesian

Bayesian Parameter Estimation
In classification our goal so far has been to estimate $P(\omega\,|\,x)$.
Let’s make the dependency on the data explicit:
$P(\omega_i\,|\,x, D) = \dfrac{p(x\,|\,\omega_i, D)\,P(\omega_i\,|\,D)}{p(x\,|\,D)}$
$P(\omega_i\,|\,D)$ – this is easy to compute.
$p(x\,|\,D)$ – this is easy to compute by marginalization.
What about $p(x\,|\,\omega_i, D)$?

Bayesian Parameter Estimation
This is a supervised problem so far: $D = \{D_1, D_2, \ldots, D_C\}$, so
$p(x\,|\,\omega_i, D) = p\left(x\,|\,\omega_i, D_i, \{D_j\}_{j\neq i}\right) = p(x\,|\,\omega_i, D_i)$
$P(\omega_i\,|\,x, D) = \dfrac{p(x\,|\,\omega_i, D_i)\,P(\omega_i\,|\,D)}{p(x\,|\,D)}$

Bayesian Parameter Estimation
We will assume that we can obtain “labeled” data, so again, notationally: $p(x\,|\,\omega_i, D_i) \rightarrow p(x\,|\,D)$.
Now our problem is to compute the density for x given the data D.
We assume the form of p(x) – the source density for D: $p(x) \rightarrow p(x\,|\,\theta)$ – and treat $\theta$ as a random variable.

Bayesian Parameter Estimation
Instead of choosing a value for a parameter, we use them all:
$p(x\,|\,D) = \int p(x, \theta\,|\,D)\,d\theta = \int p(x\,|\,\theta, D)\,p(\theta\,|\,D)\,d\theta = \int p(x\,|\,\theta)\,p(\theta\,|\,D)\,d\theta$
(the new sample x is independent of D given $\theta$; the form of $p(x\,|\,\theta)$ is the one we chose).
That is, we average the densities $p(x\,|\,\theta)$ over ALL possible values of $\theta$, each weighted by its posterior probability.

Bayesian Parameter Estimation
Computing the posterior probability of $\theta$ using Bayes rule:
$p(\theta\,|\,D) = \dfrac{p(D\,|\,\theta)\,p(\theta)}{\int p(D\,|\,\theta)\,p(\theta)\,d\theta}$
where $p(\theta)$ is the prior belief about the parameters (a density).
Using independence: $p(D\,|\,\theta) = \prod_{n=1}^N p(x^n\,|\,\theta)$
The Bayesian method does not commit to a particular value of θ, but uses the entire distribution.
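A small sketch of that idea in code, assuming a 1-d Gaussian with unknown mean and known variance: discretize θ = μ on a grid, form the posterior p(μ|D) ∝ p(D|μ)p(μ), and approximate the predictive density p(x|D) = ∫p(x|μ)p(μ|D)dμ by a Riemann sum. The prior and the data are arbitrary choices for illustration.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(1)
data = rng.normal(loc=1.5, scale=1.0, size=20)       # x^1..x^N, sigma^2 = 1 assumed known

mu_grid = np.linspace(-5, 5, 2001)                   # discretize theta = mu
d_mu = mu_grid[1] - mu_grid[0]

prior = gauss(mu_grid, 0.0, 4.0)                     # p(mu): an assumed N(0, 2^2) prior
log_lik = np.sum([np.log(gauss(x, mu_grid, 1.0)) for x in data], axis=0)
posterior = prior * np.exp(log_lik - log_lik.max())  # p(mu|D) ∝ p(D|mu) p(mu)
posterior /= posterior.sum() * d_mu                  # normalize on the grid

# Predictive density: p(x|D) = ∫ p(x|mu) p(mu|D) dmu, approximated by a sum.
x_query = 2.0
p_x_given_D = np.sum(gauss(x_query, mu_grid, 1.0) * posterior) * d_mu
print(p_x_given_D)
```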
Quick Summary – Bayesian Parameter Estimation
$P(\omega_i\,|\,x) = P(\omega_i\,|\,x, D) = \dfrac{p(x\,|\,\omega_i, D_i)\,P(\omega_i\,|\,D)}{p(x\,|\,D)}$ – easy, once we have the hard part:
$p(x\,|\,\omega_i, D_i) = \int p(x\,|\,\theta)\,p(\theta\,|\,D)\,d\theta$, where $p(\theta\,|\,D) = \dfrac{p(D\,|\,\theta)\,p(\theta)}{p(D)}$ and $p(D\,|\,\theta) = \prod_{n=1}^N p(x^n\,|\,\theta)$.
We “know” the prior $p(\theta)$ *, **:
* Non-informative prior – doesn’t introduce bias.
** Conjugate prior – causes $p(\theta\,|\,D)$ to have the same functional form as $p(D\,|\,\theta)$.

Bayesian Parameter Estimation
For $\theta = \mu$: $\int p(x\,|\,\mu)\,p(\mu\,|\,D)\,d\mu$
[Figure: parameter prior and parameter posterior; the ML solution is a single point, the Bayesian solution is a combination of weighted likelihoods.]

Bayesian Parameter Estimation – Example
First let’s deal with the parameter.
Likelihood: $p(x\,|\,\mu) = N(\mu, \sigma^2)$, with $\sigma^2$ fixed.
Parameter prior: $p(\mu) = N(\mu_0, \sigma_0^2)$.
Need to find $p(\mu\,|\,D)$. Bayes rule again:
$p_N(\mu\,|\,D) = \dfrac{p(D\,|\,\mu)\,p(\mu)}{p(D)} = \alpha\left[\prod_{n=1}^N p(x^n\,|\,\mu)\right] N(\mu_0, \sigma_0^2) = N(\mu_N, \sigma_N^2)$
This N-sample parameter posterior is a Gaussian; we need $\mu_N$ and $\sigma_N^2$.

Bayesian Parameter Estimation – Example
So the posterior is a Gaussian, $p_N(\mu\,|\,D) = N(\mu_N, \sigma_N^2)$. After some algebra and identifying the terms:
$\dfrac{1}{\sigma_N^2} = \dfrac{N}{\sigma^2} + \dfrac{1}{\sigma_0^2}$ – when Gaussians multiply, precisions add
… and
$\mu_N = \dfrac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\bar x + \dfrac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0$
With increasing N the covariance of the posterior decreases and the prior becomes unimportant.

Bayesian Parameter Estimation – Example
Now the integral:
$p(x\,|\,D) = \int p(x\,|\,\theta)\,p(\theta\,|\,D)\,d\theta = \int N(\mu, \sigma^2)\,N(\mu_N, \sigma_N^2)\,d\mu = N(\mu_N,\; \sigma^2 + \sigma_N^2)$
You can show that it is also a Gaussian. Any guesses about why the Gaussian is such a common assumption?

Recursive Bayes
For the N-point likelihood:
$p(D^N\,|\,\theta) = \prod_{n=1}^N p(x^n\,|\,\theta) = p(x^N\,|\,\theta)\prod_{n=1}^{N-1} p(x^n\,|\,\theta) = p(x^N\,|\,\theta)\,p(D^{N-1}\,|\,\theta)$
From this, the recursive relation for the posterior:
$p(\theta\,|\,D^N) = \dfrac{p(x^N\,|\,\theta)\,p(D^{N-1}\,|\,\theta)\,p(\theta)}{p(D^N)} = \dfrac{p(x^N\,|\,\theta)\,p(\theta\,|\,D^{N-1})}{\int p(x^N\,|\,\theta)\,p(\theta\,|\,D^{N-1})\,d\theta}$

Recursive Bayes (cont.)
Again: $p(\theta\,|\,D^N) = \dfrac{p(x^N\,|\,\theta)\,p(\theta\,|\,D^{N-1})}{\int p(x^N\,|\,\theta)\,p(\theta\,|\,D^{N-1})\,d\theta}$ – a one-point update.
Setting N = 1 in the Gaussian example:
$\dfrac{1}{\sigma_n^2} = \dfrac{1}{\sigma^2} + \dfrac{1}{\sigma_{n-1}^2}$
$\mu_n = \dfrac{\sigma_{n-1}^2}{\sigma_{n-1}^2 + \sigma^2}\,x^n + \dfrac{\sigma^2}{\sigma_{n-1}^2 + \sigma^2}\,\mu_{n-1}$

Recursive Bayes (cont.)
[Figure: the posterior $p(\theta\,|\,D^N)$ sharpening around the likelihood $p(x\,|\,\theta)$ for N = 1, 2, 10.]

Problems with the Bayesian Method
1. Integration is difficult.
2. Analytic solutions are only available for a restricted class of densities.
3. Technicality: if the true $p(x\,|\,\theta)$ is NOT what we assume it is, the prior probability of any parameter setting is 0!
4. Integration is difficult.
5. Did I mention that the integration is hard?

Relation between Bayesian and ML Inference
$p(\theta\,|\,D) \propto p(D\,|\,\theta)\,p(\theta) = \left[\prod_n p(x^n\,|\,\theta)\right] p(\theta) = L(\theta)\,p(\theta)$ – peaks at $\hat\theta_{ML}$.
If the peak is sharp and $p(\theta)$ is flat, then:
$p(x\,|\,D) = \int p(x\,|\,\theta)\,p(\theta\,|\,D)\,d\theta \approx \int p(x\,|\,\hat\theta)\,p(\theta\,|\,D)\,d\theta = p(x\,|\,\hat\theta)\int p(\theta\,|\,D)\,d\theta = p(x\,|\,\hat\theta)$
As $N \to \infty$, $p(x\,|\,D) \approx p(x\,|\,\hat\theta)$.
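Closing out the Bayesian section, a sketch of the Gaussian–Gaussian conjugate update from the example above, done both in batch (the closed-form $\mu_N$, $\sigma_N^2$) and recursively one point at a time; it also shows the posterior mean approaching the sample mean as N grows, i.e. the Bayesian answer approaching the ML one. The prior and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0                                    # known likelihood variance
mu0, sigma0_2 = 0.0, 0.25                       # assumed prior N(mu0, sigma0^2)
data = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=100)

# Batch conjugate update (closed form from the slides).
N = len(data)
xbar = data.mean()
sigmaN_2 = 1.0 / (N / sigma2 + 1.0 / sigma0_2)                        # precisions add
muN = (N * sigma0_2 * xbar + sigma2 * mu0) / (N * sigma0_2 + sigma2)

# Recursive one-point updates give the same answer.
mu_n, sn_2 = mu0, sigma0_2
for x in data:
    sn_2_new = 1.0 / (1.0 / sigma2 + 1.0 / sn_2)
    mu_n = (sn_2 * x + sigma2 * mu_n) / (sn_2 + sigma2)
    sn_2 = sn_2_new

print(np.isclose(mu_n, muN), muN, xbar)   # posterior mean ~ sample mean for large N
# Predictive density for a new x: N(muN, sigma2 + sigmaN_2).
```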
Non-Parametric Methods for Density Estimation
Non-parametric methods do not assume any particular form for p(x):
1. Histograms
2. Kernel Methods
3. K-NN method

You Are Here: Non-parametric → Histograms

Histograms
$\hat P(x)$ is a discrete approximation of p(x).
• Count the number of times that x lands in the i-th bin:
$H(i) = \sum_{n=1}^N I(x^n \in R_i), \quad \forall i = 1, 2, \ldots, M$
• Normalize:
$\hat P(i) = \dfrac{H(i)}{\sum_{j=1}^M H(j)}$

Histograms
How many bins?
[Figure: p(x) and its histogram estimate $\hat P(x)$ for M = 3, 10, 20, and 50 bins – too few bins “oversmooth”, too many “overfit”.]

Histograms
Good:
• Once the histogram is constructed, the data can be discarded.
• Quick and intuitive.
Bad:
• Very sensitive to the number of bins, M.
• The estimated density is not smooth.
• Poor generalization in higher dimensions.

Aside: Curse of Dimensionality (Bellman, ’61)
• Imagine we build a histogram of a 1-d feature (say, Hue):
• 10 bins; 1 bin = 10% of the input space; we need at least 10 points to populate every bin.
• We add another feature (say, Saturation):
• 10 bins again; 1 bin = 1% of the input space; we need at least 100 points to populate every bin.
• We add another feature (say, Value):
• 10 bins again; 1 bin = 0.1% of the input space; we need at least 1000 points to populate every bin.
$N = b^d$ – the number of points grows exponentially with the dimension.

Aside: The Curse Continues
Volume of a cube in $R^d$ with side l: $V_l = l^d$.
Volume of a cube with side $l - \epsilon$: $V_\epsilon = (l - \epsilon)^d$.
Volume of the ε-shell: $\Delta = V_l - V_\epsilon = l^d - (l-\epsilon)^d$.
Ratio of the volume of the ε-shell to the volume of the cube:
$\dfrac{\Delta}{V_l} = \dfrac{l^d - (l-\epsilon)^d}{l^d} = 1 - \left(1 - \dfrac{\epsilon}{l}\right)^d \to 1$ as $d \to \infty$ !!!!!

Aside: Lessons of the Curse
In generative models:
• Use as much data as you can get your hands on.
• Reduce dimensionality as much as you can get away with.
<End of Digression>

General Reasoning
By definition: $P(x \in R) = P = \int_R p(x')\,dx'$
If we have N i.i.d. points drawn from p(x):
$P(|\{x \in R\}| = k) = \dfrac{N!}{k!(N-k)!}\,P^k(1-P)^{N-k} = B(N, P)$
(the number of unique k vs. (N−k) splits, times the probability that k particular x-es are in R, times the probability that the rest are not).
B(N, P) is a binomial distribution over k.

General Reasoning (cont.)
Mean and variance of B(N, P):
Mean: $\mu = E[k] = NP \;\Rightarrow\; P = E[k/N]$
Variance: $\sigma^2 = E[(k-\mu)^2] = NP(1-P) \;\Rightarrow\; E\left[(k/N - P)^2\right] = \left(\dfrac{\sigma}{N}\right)^2 = P(1-P)/N$
That is:
• E[k/N] is a good estimate of P.
• P is distributed around this estimate with vanishing variance.
So: $P \approx k/N$.

General Reasoning (cont.)
On the other hand, under mild assumptions, for a small region R of volume V around x:
$P = \int_R p(x')\,dx' \approx p(x)\,V$
… which leads to:
$p(x) \approx \dfrac{k}{NV}$
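A quick numerical sanity check of the k/(NV) argument: draw samples from a known 1-d density, count how many fall in a small interval of length V around a point x, and compare k/(NV) with the true p(x). The numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
samples = rng.normal(0.0, 1.0, size=N)      # draws from a known density p(x) = N(0, 1)

x, V = 0.5, 0.1                              # evaluation point and the "volume" (interval length)
k = np.sum(np.abs(samples - x) < V / 2)      # points falling inside the region R around x

p_est = k / (N * V)                          # p(x) ~= k / (N V)
p_true = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
print(p_est, p_true)                         # the two should be close
```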
General Reasoning (cont.)
Now, given N data points – how do we really estimate p(x) from $p(x) \approx \dfrac{k}{NV}$?
• Fix k and vary V until it encloses k points → K-Nearest Neighbors (KNN).
• Fix V and count how many points k it encloses → Kernel methods.

You Are Here: Non-parametric → Kernel Methods

Kernel Methods of Density Estimation
We choose V by specifying a hypercube with side h: $V = h^d$.
Mathematically:
$H(y) = \begin{cases}1 & |y_j| < 1/2,\ j = 1, \ldots, d\\ 0 & \text{otherwise}\end{cases}$
Kernel function: $H(y) \geq 0\ \forall y$ and $\int H(y)\,dy = 1$.

Parzen Windows
Then $H\left((x - x^n)/h\right)$ is a hypercube with side h centered at $x^n$.
H can help count the points in a volume V around any x (the h-neighborhood of x):
$k(x) = \sum_{n=1}^N H\left(\dfrac{x - x^n}{h}\right)$
Points outside the neighborhood make no contribution to the count at x.

Rectangular Kernel
So the number of points in the h-neighborhood of x,
$k(x) = \sum_{n=1}^N H\left(\dfrac{x - x^n}{h}\right)$,
is easily converted into a density estimate:
$\tilde p(x) = \dfrac{k(x)}{NV} = \dfrac{1}{N}\sum_{n=1}^N \dfrac{1}{h^d} H\left(\dfrac{x - x^n}{h}\right) = \dfrac{1}{N}\sum_{n=1}^N K(x, x^n)$
Subtle point – it integrates to 1:
$\int \left[\dfrac{1}{N}\sum_{n=1}^N K(x, x^n)\right] dx = \dfrac{1}{N}\sum_{n=1}^N \left[\int K(x, x^n)\,dx\right] = 1 \;\Rightarrow\; \int \tilde p(x)\,dx = 1$

Example
[Figure: a source density and its rectangular-kernel estimates for h = 1, 2, 4.]

Smoothed Window Functions
The problem is the same as with histograms – the estimate is discontinuous.
We can choose a smoother function, such that $\tilde p(x) \geq 0\ \forall x$ and $\int \tilde p(x)\,dx = 1$ (ensured by the kernel conditions).
E.g. <loud cheer> a (spherical) Gaussian:
$K(x, x^n) = \dfrac{1}{(2\pi)^{d/2} h^d}\exp\left(-\dfrac{\|x - x^n\|^2}{2h^2}\right)$
… so:
$\tilde p(x) = \dfrac{1}{N}\sum_{n=1}^N \dfrac{1}{(2\pi)^{d/2} h^d}\exp\left(-\dfrac{\|x - x^n\|^2}{2h^2}\right)$

Example
[Figure: a source density and its Gaussian-kernel estimates for h = 0.6, 1.0, 1.3.]

Some Insight
It is interesting to look at the expectation of the estimate with respect to all possible datasets:
$E\left[\tilde p(x)\right] = E\left[\dfrac{1}{N}\sum_{n=1}^N K(x, x^n)\right] = E\left[K(x, x')\right] = \int K(x - x')\,p(x')\,dx'$ – a convolution with the true density.
That is, $\tilde p(x) = p(x)$ if $K(x, x') = \delta(x - x')$ – but not for a finite data set!

Conditions for Convergence
How small can we make h for a given N?
$\lim_{N\to\infty} h_N^d = 0$ – it should go to 0, but more slowly than 1/N, so that
$\lim_{N\to\infty} N h_N^d = \infty$.
(Based on a similar analysis of the variance of the estimates.)
E.g. $h_N^d = h_1^d/\sqrt{N}$ or $h_N^d = h_1^d/\log N$. Note that the choice of $h_1^d$ is still up to us.

Problems with Kernel Estimation
• Need to choose the width parameter h:
• it can be chosen empirically;
• it can be adaptive, e.g. $h_j = h\,d_{jk}$, where $d_{jk}$ is the distance from $x^j$ to its k-th nearest neighbor.
• Need to store all the data to represent the density:
• this leads to Mixture Density Estimation.
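A minimal Gaussian-kernel (Parzen window) estimator following the formula above; the bandwidths and the two-lump data are arbitrary choices for illustration.

```python
import numpy as np

def parzen_gaussian(x_eval, data, h):
    """Kernel density estimate p~(x) = (1/N) sum_n K(x, x^n), spherical Gaussian kernel."""
    data = np.atleast_2d(data)          # shape (N, d)
    x_eval = np.atleast_2d(x_eval)      # shape (Q, d)
    N, d = data.shape
    # Squared distances between every query point and every stored sample.
    sq = ((x_eval[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-0.5 * sq / h ** 2) / ((2 * np.pi) ** (d / 2) * h ** d)
    return K.mean(axis=1)               # average over the N kernels

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])[:, None]
xs = np.linspace(-5, 5, 200)[:, None]
for h in (0.1, 0.5, 1.5):               # small h -> spiky estimate, large h -> oversmoothed
    p_hat = parzen_gaussian(xs, data, h)
    print(h, p_hat.max())
```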
You Are Here: Non-parametric → K-NN Method

K-Nearest Neighbors
Recall that $\tilde p(x) = \dfrac{k}{NV}$.
Now we fix k (typically $k = \sqrt{N}$) and expand V until it contains k points.
This is not a true density! E.g. choose N = 1, k = 1. Then $\tilde p(x) = \dfrac{1}{NV} \propto \dfrac{1}{|x - x^1|}$, whose integral diverges. Oops!
BUT it is useful for a number of theoretical and practical reasons.

K-NN Classification Rule
Let’s try classification with the K-NN density estimate.
Data: N total points, of which $N_j$ are in class $\omega_j$.
We need to find the class label for a query x: expand a sphere from x until it includes K points.
K – the number of neighbors of x; $K_j$ – the points of class $\omega_j$ among the K.

KNN Classification
Then the class priors are given by $p(\omega_j) = \dfrac{N_j}{N}$.
We can estimate the conditional and marginal densities around any x:
$p(x\,|\,\omega_j) = \dfrac{K_j}{N_j V}, \qquad p(x) = \dfrac{K}{NV}$
By Bayes rule:
$p(\omega_j\,|\,x) = \dfrac{p(x\,|\,\omega_j)\,p(\omega_j)}{p(x)} = \dfrac{K_j}{N_j V}\cdot\dfrac{N_j}{N}\cdot\dfrac{NV}{K} = \dfrac{K_j}{K}$
Then for minimum error rate classification: $C = \arg\max_j K_j$.

KNN Classification
An important theoretical result: in the extreme case K = 1, it can be shown that the large-sample error rate $P = \lim_{N\to\infty} P_N(\mathrm{error})$ satisfies
$P^* \leq P \leq P^*\left(2 - \dfrac{c}{c-1}P^*\right) \leq 2P^*$
where $P^*$ is the Bayes error and c is the number of classes.
That is, using just the single-neighbor rule, the error rate is at most twice the Bayes error!!!

Problems with Non-parametric Methods
• Memory: need to store all the data points.
• Computation: need to compute distances to all the data points every time.
• Parameter choice: need to choose the smoothing parameter.
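A minimal brute-force K-NN classifier implementing the rule above: for each query, take the K nearest stored points, count the class memberships $K_j$, and decide $\arg\max_j K_j$ (the posterior estimate is $K_j/K$). The two-class synthetic data are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, K=5):
    """Brute-force K-NN: posterior estimate P(w_j|x) ~= K_j / K, decision = argmax_j K_j."""
    preds = []
    for x in X_query:
        dists = np.linalg.norm(X_train - x, axis=1)     # distance to every stored point
        nearest = y_train[np.argsort(dists)[:K]]        # labels of the K nearest neighbors
        counts = np.bincount(nearest)                   # K_j for each class
        preds.append(np.argmax(counts))
    return np.array(preds)

rng = np.random.default_rng(5)
X0 = rng.normal([-1.0, -1.0], 0.8, size=(100, 2))       # class 0
X1 = rng.normal([1.5, 1.5], 0.8, size=(100, 2))         # class 1
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 100 + [1] * 100)

queries = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, -1.0]])
print(knn_predict(X_train, y_train, queries, K=7))
```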
You Are Here: Semi-parametric → Mixture Density

Mixture Density Model
A mixture model is a linear combination of parametric densities:
$p(x) = \sum_{j=1}^M p(x\,|\,j)\,P(j)$
M – the number of components; $p(x\,|\,j)$ – the component densities; $P(j)$ – the component “priors”, with $P(j) \geq 0\ \forall j$ and $\sum_{j=1}^M P(j) = 1$.
It uses far fewer “kernels” than the kernel methods, and the kernels are parametric densities, subject to estimation.

Example
[Figure: a multimodal density represented as $p(x) = \sum_{j=1}^M p(x\,|\,j)P(j)$.]

Mixture Density
Using the ML principle, the objective function is the log-likelihood:
$l(\theta) \equiv \log\prod_{n=1}^N p(x^n) = \sum_{n=1}^N \log\left[\sum_{j=1}^M p(x^n\,|\,j)\,P(j)\right]$
Differentiate with respect to the parameters:
$\dfrac{\partial l(\theta)}{\partial \theta_j} = \sum_{n=1}^N \dfrac{\partial}{\partial \theta_j}\log\left[\sum_{k=1}^M p(x^n\,|\,k)P(k)\right] = \sum_{n=1}^N \dfrac{1}{\sum_{k=1}^M p(x^n\,|\,k)P(k)}\,\dfrac{\partial}{\partial \theta_j}\left[p(x^n\,|\,j)P(j)\right]$

Mixture Density
Again let’s assume that $p(x\,|\,j)$ is a Gaussian. We need to estimate M priors and M sets of means and covariances.
$\dfrac{\partial l(\theta)}{\partial \mu_j} = \sum_{n=1}^N P(j\,|\,x^n)\left[\Sigma_j^{-1}(x^n - \hat\mu_j)\right]$
Setting it to 0 and solving for $\mu_j$:
$\hat\mu_j = \dfrac{\sum_{n=1}^N P(j\,|\,x^n)\,x^n}{\sum_{n=1}^N P(j\,|\,x^n)}$ – a convex sum of all the data.

Mixture Density
Similarly for the covariances:
$\dfrac{\partial l(\theta)}{\partial \Sigma_j} = -\dfrac{1}{2}\sum_{n=1}^N P(j\,|\,x^n)\left[\hat\Sigma_j^{-1} - \hat\Sigma_j^{-1}(x^n - \hat\mu_j)(x^n - \hat\mu_j)^T\hat\Sigma_j^{-1}\right]$
Setting it to 0 and solving for $\Sigma_j$:
$\hat\Sigma_j = \dfrac{\sum_{n=1}^N P(j\,|\,x^n)\,(x^n - \hat\mu_j)(x^n - \hat\mu_j)^T}{\sum_{n=1}^N P(j\,|\,x^n)}$

Mixture Density
It is a little harder for P(j) – the optimization is subject to constraints:
$\sum_{j=1}^M P(j) = 1$ and $P(j) \geq 0\ \forall j$.
Here is a trick to enforce the constraints:
$P(j) = \dfrac{\exp(\gamma_j)}{\sum_{k=1}^M \exp(\gamma_k)}, \qquad \dfrac{\partial P(i)}{\partial \gamma_j} = \delta_{ij}P(j) - P(i)P(j)$

Mixture Density
Using the chain rule:
$\dfrac{\partial l(\theta)}{\partial \gamma_j} = \sum_{k=1}^M \dfrac{\partial l(\theta)}{\partial P(k)}\dfrac{\partial P(k)}{\partial \gamma_j} = \sum_{n=1}^N \sum_{k=1}^M \dfrac{p(x^n\,|\,k)}{p(x^n)}\left(\delta_{jk}P(j) - P(j)P(k)\right)$
$= \sum_{n=1}^N \left[\dfrac{p(x^n\,|\,j)}{p(x^n)}P(j) - P(j)\sum_{k=1}^M \dfrac{p(x^n\,|\,k)}{p(x^n)}P(k)\right] = \sum_{n=1}^N \left[P(j\,|\,x^n) - P(j)\sum_{k=1}^M P(k\,|\,x^n)\right] = \sum_{n=1}^N \left\{P(j\,|\,x^n) - P(j)\right\} = 0$
The last expression gives the value at the extremum:
$P(j) = \dfrac{1}{N}\sum_{n=1}^N P(j\,|\,x^n)$

Mixture Density
What’s the problem?
$P(j) = \dfrac{1}{N}\sum_{n=1}^N P(j\,|\,x^n), \qquad \hat\mu_j = \dfrac{\sum_n P(j\,|\,x^n)\,x^n}{\sum_n P(j\,|\,x^n)}, \qquad \hat\Sigma_j = \dfrac{\sum_n P(j\,|\,x^n)\,(x^n - \hat\mu_j)(x^n - \hat\mu_j)^T}{\sum_n P(j\,|\,x^n)}$
We can’t compute these directly – the posteriors $P(j\,|\,x^n)$ themselves depend on the parameters we are trying to estimate.
Solution – the EM algorithm. We will study it in Clustering.

You Are Here – Road Map: Parametric (Max Likelihood, Bayesian), Non-parametric (Histograms, Kernel Methods, K-NN Method), Semi-parametric (Mixture Density).
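Since the lecture defers EM to the Clustering class, here is only a minimal sketch of the fixed-point iteration those update equations suggest (i.e., EM for a 1-d, two-component Gaussian mixture): alternate between computing the posteriors $P(j\,|\,x^n)$ under the current parameters and re-estimating $P(j)$, $\mu_j$, $\sigma_j^2$ with the formulas above. The data, initialization, and iteration count are illustrative assumptions, not the lecture’s implementation.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 0.7, 300), rng.normal(3, 1.2, 700)])  # 1-d data
N, M = len(x), 2

P = np.full(M, 1.0 / M)            # component priors P(j)
mu = np.array([-1.0, 1.0])         # component means
var = np.array([1.0, 1.0])         # component variances

for _ in range(100):
    # "E-step": posteriors P(j | x^n) under the current parameters.
    lik = np.stack([gauss(x, mu[j], var[j]) for j in range(M)])   # shape (M, N)
    post = lik * P[:, None]
    post /= post.sum(axis=0, keepdims=True)
    # "M-step": plug the posteriors into the update equations from the slides.
    Nj = post.sum(axis=1)
    P = Nj / N                                                    # P(j) = (1/N) sum_n P(j|x^n)
    mu = (post @ x) / Nj                                          # weighted means of the data
    var = (post * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nj

print(P, mu, var)   # should roughly recover the 0.3/0.7 mix of N(-2, 0.7^2) and N(3, 1.2^2)
```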