9.913 Pattern Recognition for Vision
Class VI – Density Estimation
Yuri Ivanov, Fall 2004

Road Map
Density Estimation
• Parametric: Max Likelihood, Bayesian
• Non-parametric: Histograms, Kernel Methods, K-NN Method
• Semi-parametric: Mixture Density

Generative vs. Discriminative
There are two schools of thought in Machine Learning:
1. Generative:
• Estimate class models from data
• Compute the discriminant function
• Plug in your data – get the answer
2. Discriminative:
• Estimate the discriminant from data
• Plug in your data – get the answer
(Last class)

Density Estimation
Density estimation is at the core of generative pattern recognition:
$P(a < x < b) = \int_a^b p(x)\,dx$
mean: $E[x] = \int x\,p(x)\,dx$
covariance: $E[(x - E[x])(x - E[x])^T] = \int (x - E[x])(x - E[x])^T\, p(x)\,dx$
function mean: $E[f(x)] = \int f(x)\,p(x)\,dx$
conditional mean: $E[y\,|\,x] = \int y\,p(y\,|\,x)\,dy$

Refresher
Minimum expected risk: $R^* = \int \min_\alpha \left[R(\alpha\,|\,x)\right] p(x)\,dx$
… is based on the conditional risk: $\alpha^* = \arg\min_\alpha R(\alpha\,|\,x)$
… which is computed from the posterior: $R(\alpha\,|\,x) = \sum_i \lambda(\alpha\,|\,\omega_i)\,P(\omega_i\,|\,x)$
… which depends on the likelihood: $P(\omega\,|\,x) = \dfrac{p(x\,|\,\omega)\,P(\omega)}{p(x)}$

Setting
Data: $D = \{D_i\}_{i=1}^C$
Assume that $D_j$ contains no information about $\omega_i$, $\forall i \neq j$.
Notationally, we abandon the class label: $p(x\,|\,\omega_i) \rightarrow p(x)$.
Keep in mind: $p(x\,|\,\omega_i) \neq p(x)$.
Goal: model the probability density function p(x), given a finite number of data points, x1, x2, …, xN, drawn from it.

Three Methods
1. Parametric
• Good: small number of parameters
• Bad: choice of the parametric form
2. Non-parametric
• Good: data “dictates” the approximator
• Bad: large number of parameters
3. Semi-parametric
• Good: combines the best of both worlds
• Bad: harder to design
• Good again: the design can be subject to optimization
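To make the refresher concrete before turning to parameter estimation, here is a minimal Python sketch of the generative pipeline: given class-conditional densities and priors, compute the posteriors with Bayes rule and pick the minimum-error-rate class. The densities and priors are made-up illustrative values, not numbers from the lecture.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """1-d Gaussian density N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical class models (the "generative" step: estimate these from data).
class_params = [(-1.0, 1.0), (2.0, 0.5)]     # (mean, variance) per class
priors = np.array([0.7, 0.3])                # P(omega_i)

x = 0.4                                      # query point
likelihoods = np.array([gauss_pdf(x, m, v) for m, v in class_params])  # p(x | omega_i)
posteriors = likelihoods * priors
posteriors /= posteriors.sum()               # P(omega_i | x) = p(x|omega_i) P(omega_i) / p(x)

decision = np.argmax(posteriors)             # minimum-error-rate (0/1 loss) decision
print(posteriors, decision)
```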
Parametric Density Estimation
Estimate the density from a given functional family.
Given: $p(x\,|\,\theta) = f(x, \theta)$.  Find: $\theta$.
Two methods of parameter estimation:
1. Maximum Likelihood method
• Parameters are viewed as unknown but fixed values
2. Bayesian method
• Parameters are random variables that have their own distributions

Normal (Gaussian) Density Function
A common assumption – Gaussian, $\theta = (\mu, \Sigma)$:
$p(x\,|\,\theta) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
d – number of dimensions; $|\Sigma|$ – the “volume” of the covariance; $(x-\mu)^T\Sigma^{-1}(x-\mu)$ – the squared Mahalanobis distance.
$\mu = E[x]$ – d parameters
$\Sigma = E\left[(x-\mu)(x-\mu)^T\right]$ – d(d+1)/2 parameters

Normal Density
$\mu = \begin{bmatrix}0\\0\\0\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}1 & 0 & .5\\ 0 & 1 & .3\\ .5 & .3 & 1\end{bmatrix}$
Constant density: $(x-\mu)^T\Sigma^{-1}(x-\mu) = C$ – a quadratic surface.
$\Sigma$ positive semidefinite ⇒ the surface is an ellipsoid.
Principal axes: eigenvectors of $\Sigma$; lengths proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $\Sigma$.

Whitening Transform
Define: $\Lambda = \mathrm{diag}(\mathrm{eigval}(\Sigma))$ – scaling matrix; $\Phi = \mathrm{eigvec}(\Sigma)$ – rotation matrix.
Then $W = \Lambda^{-1/2}\Phi^T$ “unscales” and “unrotates” the data:
for all $x \sim N(0, \Sigma)$, $Wx \sim N(0, I)$.

You Are Here: Parametric → Maximum Likelihood

Maximum Likelihood
Parameters are fixed but unknown.
$D \equiv \{x^1, x^2, \ldots, x^N\}$ – a data set drawn from p(x).
Notationally, we make the density explicitly dependent on the parameters: $p(x) \rightarrow p(x\,|\,\theta)$.
Assuming that the data are drawn independently (i.i.d.):
$L(\theta) \equiv p(D\,|\,\theta) = \prod_{n=1}^N p(x^n\,|\,\theta)$ – the likelihood function.
To find $\theta$, maximize $L(\theta)$ with respect to the parameters.

Maximum Likelihood
Maximizing $L(\theta)$ is equivalent to maximizing the log-likelihood function:
$l(\theta) \equiv \log L(\theta) = \log \prod_{n=1}^N p(x^n\,|\,\theta) = \sum_{n=1}^N \log p(x^n\,|\,\theta)$
To find $\theta$, set the derivative to zero,
$\nabla_\theta\, l(\theta) = \sum_{n=1}^N \nabla_\theta \log p(x^n\,|\,\theta) = 0$,
and solve for $\theta$.

Quick Summary – ML Parameter Estimation
$P(\omega_i\,|\,x) = P(\omega_i\,|\,x, \theta_i) = \dfrac{p(x\,|\,\omega_i, \theta_i)\,P(\omega_i\,|\,\theta_i)}{p(x\,|\,\theta_i)}$ – easy, once we have $p(x\,|\,\hat\theta)$.
We pick $\hat\theta = \arg\max_\theta p(D\,|\,\theta) = \arg\max_\theta \prod_{n=1}^N p(x^n\,|\,\theta)$.

Solving a Maximum Likelihood Problem
[Figure: fixed covariance, several candidate densities; $p(D\,|\,\theta)$ and $\log p(D\,|\,\theta)$ peak at the same $\hat\theta$.]

Maximum Likelihood Example
In d dimensions:
$\nabla_\theta\, l(\theta) = \sum_n \nabla_\theta \left[ -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}(x^n-\mu)^T\Sigma^{-1}(x^n-\mu) \right]$
Solving for the mean:
$\nabla_\mu\, l(\theta) = \sum_n \Sigma^{-1}(x^n - \hat\mu) = 0$
$\hat\mu = \dfrac{1}{N}\sum_{n=1}^N x^n$ – the arithmetic average of the samples.

Maximum Likelihood Example (cont.)
Solving for the covariance, using the identities for symmetric M:
$\dfrac{d|M|}{dM} = |M|\,M^{-1}$ and $\dfrac{d}{dM}\left(a^T M^{-1} b\right) = -M^{-1} a b^T M^{-1}$
$\nabla_\Sigma\, l(\theta) = -\tfrac{1}{2}\sum_n \left[ \hat\Sigma^{-1} - \hat\Sigma^{-1}(x^n - \hat\mu)(x^n - \hat\mu)^T \hat\Sigma^{-1} \right] = 0$
$\hat\Sigma = \dfrac{1}{N}\sum_{n=1}^N (x^n - \hat\mu)(x^n - \hat\mu)^T$ – the arithmetic average of the individual covariances (a biased estimate).

Recursive ML
What if the data come one sample at a time?
$\hat\mu_N = \dfrac{1}{N}\sum_{n=1}^N x^n = \dfrac{1}{N}\left[x^N + \sum_{n=1}^{N-1} x^n\right] = \dfrac{1}{N}\left[x^N + (N-1)\hat\mu_{N-1}\right] = \hat\mu_{N-1} + \dfrac{1}{N}\left[x^N - \hat\mu_{N-1}\right]$
This estimate “stiffens” with more data (as it should).
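A minimal numpy sketch of the two results above: the batch ML estimates of a Gaussian’s mean and (biased) covariance, and the equivalent one-sample-at-a-time update of the mean. The data are synthetic, chosen only to check that the recursive estimate reproduces the batch one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[1.0, 0.3], [0.3, 0.5]], size=500)
N, d = X.shape

# Batch ML estimates.
mu_hat = X.mean(axis=0)                              # (1/N) sum_n x^n
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N        # biased: divides by N, not N-1

# Recursive ML estimate of the mean: mu_n = mu_{n-1} + (1/n)(x^n - mu_{n-1}).
mu_rec = np.zeros(d)
for n, x in enumerate(X, start=1):
    mu_rec += (x - mu_rec) / n

print(np.allclose(mu_rec, mu_hat))   # True: the recursive update reproduces the batch mean
```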
One idea – fix the fraction; then the estimate can track a non-stationary process.

Recursive ML
Fix the update rate $\gamma$ and retrace the steps:
$v_N = v_{N-1} + \gamma\left[x^N - v_{N-1}\right] = (1-\gamma)v_{N-1} + \gamma x^N = (1-\gamma)^2 v_{N-2} + (1-\gamma)\gamma x^{N-1} + \gamma x^N = \ldots = (1-\gamma)^M v_{N-M} + \gamma\sum_{k=0}^{M-1}(1-\gamma)^k x^{N-k}$
Recency weights: recent samples count more than old ones.
[Figure: the fixed-rate estimate $v_n$ compared with the ML estimate $\hat\mu_n$.]

Simple Example
Several images from a static camera: how much noise is in them?
$x = \mathrm{vec}(I_t - I_{t-1})$, giving $\mu = 0$, $\sigma = 1.2$.
Now we can set a threshold that statistically distinguishes pixel noise from an object.

Problems with ML
We are given two estimates of the true distribution: $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$. Which one do we believe?
ML gives a single solution, regardless of uncertainty.

You Are Here: Parametric → Bayesian

Bayesian Parameter Estimation
In classification our goal so far has been to estimate $P(\omega\,|\,x)$.
Let’s make the dependency on the data explicit:
$P(\omega_i\,|\,x, D) = \dfrac{p(x\,|\,\omega_i, D)\,P(\omega_i\,|\,D)}{p(x\,|\,D)}$
$P(\omega_i\,|\,D)$ – this is easy to compute.
$p(x\,|\,D)$ – this is easy to compute by marginalization.
What about $p(x\,|\,\omega_i, D)$?

Bayesian Parameter Estimation
This is a supervised problem so far: $D = \{D_1, D_2, \ldots, D_C\}$, so
$p(x\,|\,\omega_i, D) = p\left(x\,|\,\omega_i, D_i, \{D_j\}_{j\neq i}\right) = p(x\,|\,\omega_i, D_i)$
$P(\omega_i\,|\,x, D) = \dfrac{p(x\,|\,\omega_i, D_i)\,P(\omega_i\,|\,D)}{p(x\,|\,D)}$

Bayesian Parameter Estimation
We will assume that we can obtain “labeled” data, so again, notationally: $p(x\,|\,\omega_i, D_i) \rightarrow p(x\,|\,D)$.
Now our problem is to compute the density for x given the data D.
We assume the form of p(x) – the source density for D: $p(x) \rightarrow p(x\,|\,\theta)$ – and treat $\theta$ as a random variable.

Bayesian Parameter Estimation
Instead of choosing a value for a parameter, we use them all:
$p(x\,|\,D) = \int p(x, \theta\,|\,D)\,d\theta = \int p(x\,|\,\theta, D)\,p(\theta\,|\,D)\,d\theta = \int p(x\,|\,\theta)\,p(\theta\,|\,D)\,d\theta$
(the new sample x is independent of D given $\theta$; the form of $p(x\,|\,\theta)$ is the one we chose).
That is, we average the densities $p(x\,|\,\theta)$ over ALL possible values of $\theta$, each weighted by its posterior probability.

Bayesian Parameter Estimation
Computing the posterior probability of $\theta$ using Bayes rule:
$p(\theta\,|\,D) = \dfrac{p(D\,|\,\theta)\,p(\theta)}{\int p(D\,|\,\theta)\,p(\theta)\,d\theta}$
where $p(\theta)$ is the prior belief about the parameters (a density).
Using independence: $p(D\,|\,\theta) = \prod_{n=1}^N p(x^n\,|\,\theta)$
The Bayesian method does not commit to a particular value of θ, but uses the entire distribution.
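A small sketch of that idea in code, assuming a 1-d Gaussian with unknown mean and known variance: discretize θ = μ on a grid, form the posterior p(μ|D) ∝ p(D|μ)p(μ), and approximate the predictive density p(x|D) = ∫p(x|μ)p(μ|D)dμ by a Riemann sum. The prior and the data are arbitrary choices for illustration.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(1)
data = rng.normal(loc=1.5, scale=1.0, size=20)       # x^1..x^N, sigma^2 = 1 assumed known

mu_grid = np.linspace(-5, 5, 2001)                   # discretize theta = mu
d_mu = mu_grid[1] - mu_grid[0]

prior = gauss(mu_grid, 0.0, 4.0)                     # p(mu): an assumed N(0, 2^2) prior
log_lik = np.sum([np.log(gauss(x, mu_grid, 1.0)) for x in data], axis=0)
posterior = prior * np.exp(log_lik - log_lik.max())  # p(mu|D) ∝ p(D|mu) p(mu)
posterior /= posterior.sum() * d_mu                  # normalize on the grid

# Predictive density: p(x|D) = ∫ p(x|mu) p(mu|D) dmu, approximated by a sum.
x_query = 2.0
p_x_given_D = np.sum(gauss(x_query, mu_grid, 1.0) * posterior) * d_mu
print(p_x_given_D)
```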
Quick Summary – Bayesian Parameter Estimation
$P(\omega_i\,|\,x) = P(\omega_i\,|\,x, D) = \dfrac{p(x\,|\,\omega_i, D_i)\,P(\omega_i\,|\,D)}{p(x\,|\,D)}$ – easy, once we have the hard part:
$p(x\,|\,\omega_i, D_i) = \int p(x\,|\,\theta)\,p(\theta\,|\,D)\,d\theta$, where $p(\theta\,|\,D) = \dfrac{p(D\,|\,\theta)\,p(\theta)}{p(D)}$ and $p(D\,|\,\theta) = \prod_{n=1}^N p(x^n\,|\,\theta)$.
We “know” the prior $p(\theta)$ *, **:
* Non-informative prior – doesn’t introduce bias.
** Conjugate prior – causes $p(\theta\,|\,D)$ to have the same functional form as $p(D\,|\,\theta)$.

Bayesian Parameter Estimation
For $\theta = \mu$: $\int p(x\,|\,\mu)\,p(\mu\,|\,D)\,d\mu$
[Figure: parameter prior and parameter posterior; the ML solution is a single point, the Bayesian solution is a combination of weighted likelihoods.]

Bayesian Parameter Estimation – Example
First let’s deal with the parameter.
Likelihood: $p(x\,|\,\mu) = N(\mu, \sigma^2)$, with $\sigma^2$ fixed.
Parameter prior: $p(\mu) = N(\mu_0, \sigma_0^2)$.
Need to find $p(\mu\,|\,D)$. Bayes rule again:
$p_N(\mu\,|\,D) = \dfrac{p(D\,|\,\mu)\,p(\mu)}{p(D)} = \alpha\left[\prod_{n=1}^N p(x^n\,|\,\mu)\right] N(\mu_0, \sigma_0^2) = N(\mu_N, \sigma_N^2)$
This N-sample parameter posterior is a Gaussian; we need $\mu_N$ and $\sigma_N^2$.

Bayesian Parameter Estimation – Example
So the posterior is a Gaussian, $p_N(\mu\,|\,D) = N(\mu_N, \sigma_N^2)$. After some algebra and identifying the terms:
$\dfrac{1}{\sigma_N^2} = \dfrac{N}{\sigma^2} + \dfrac{1}{\sigma_0^2}$ – when Gaussians multiply, precisions add
… and
$\mu_N = \dfrac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\bar x + \dfrac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0$
With increasing N the covariance of the posterior decreases and the prior becomes unimportant.

Bayesian Parameter Estimation – Example
Now the integral:
$p(x\,|\,D) = \int p(x\,|\,\theta)\,p(\theta\,|\,D)\,d\theta = \int N(\mu, \sigma^2)\,N(\mu_N, \sigma_N^2)\,d\mu = N(\mu_N,\; \sigma^2 + \sigma_N^2)$
You can show that it is also a Gaussian. Any guesses about why the Gaussian is such a common assumption?

Recursive Bayes
For the N-point likelihood:
$p(D^N\,|\,\theta) = \prod_{n=1}^N p(x^n\,|\,\theta) = p(x^N\,|\,\theta)\prod_{n=1}^{N-1} p(x^n\,|\,\theta) = p(x^N\,|\,\theta)\,p(D^{N-1}\,|\,\theta)$
From this, the recursive relation for the posterior:
$p(\theta\,|\,D^N) = \dfrac{p(x^N\,|\,\theta)\,p(D^{N-1}\,|\,\theta)\,p(\theta)}{p(D^N)} = \dfrac{p(x^N\,|\,\theta)\,p(\theta\,|\,D^{N-1})}{\int p(x^N\,|\,\theta)\,p(\theta\,|\,D^{N-1})\,d\theta}$

Recursive Bayes (cont.)
Again: $p(\theta\,|\,D^N) = \dfrac{p(x^N\,|\,\theta)\,p(\theta\,|\,D^{N-1})}{\int p(x^N\,|\,\theta)\,p(\theta\,|\,D^{N-1})\,d\theta}$ – a one-point update.
Setting N = 1 in the Gaussian example:
$\dfrac{1}{\sigma_n^2} = \dfrac{1}{\sigma^2} + \dfrac{1}{\sigma_{n-1}^2}$
$\mu_n = \dfrac{\sigma_{n-1}^2}{\sigma_{n-1}^2 + \sigma^2}\,x^n + \dfrac{\sigma^2}{\sigma_{n-1}^2 + \sigma^2}\,\mu_{n-1}$

Recursive Bayes (cont.)
[Figure: the posterior $p(\theta\,|\,D^N)$ sharpening around the likelihood $p(x\,|\,\theta)$ for N = 1, 2, 10.]

Problems with the Bayesian Method
1. Integration is difficult.
2. Analytic solutions are only available for a restricted class of densities.
3. Technicality: if the true $p(x\,|\,\theta)$ is NOT what we assume it is, the prior probability of any parameter setting is 0!
4. Integration is difficult.
5. Did I mention that the integration is hard?

Relation between Bayesian and ML Inference
$p(\theta\,|\,D) \propto p(D\,|\,\theta)\,p(\theta) = \left[\prod_n p(x^n\,|\,\theta)\right] p(\theta) = L(\theta)\,p(\theta)$ – peaks at $\hat\theta_{ML}$.
If the peak is sharp and $p(\theta)$ is flat, then:
$p(x\,|\,D) = \int p(x\,|\,\theta)\,p(\theta\,|\,D)\,d\theta \approx \int p(x\,|\,\hat\theta)\,p(\theta\,|\,D)\,d\theta = p(x\,|\,\hat\theta)\int p(\theta\,|\,D)\,d\theta = p(x\,|\,\hat\theta)$
As $N \to \infty$, $p(x\,|\,D) \approx p(x\,|\,\hat\theta)$.
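Closing out the Bayesian section, a sketch of the Gaussian–Gaussian conjugate update from the example above, done both in batch (the closed-form $\mu_N$, $\sigma_N^2$) and recursively one point at a time; it also shows the posterior mean approaching the sample mean as N grows, i.e. the Bayesian answer approaching the ML one. The prior and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0                                    # known likelihood variance
mu0, sigma0_2 = 0.0, 0.25                       # assumed prior N(mu0, sigma0^2)
data = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=100)

# Batch conjugate update (closed form from the slides).
N = len(data)
xbar = data.mean()
sigmaN_2 = 1.0 / (N / sigma2 + 1.0 / sigma0_2)                        # precisions add
muN = (N * sigma0_2 * xbar + sigma2 * mu0) / (N * sigma0_2 + sigma2)

# Recursive one-point updates give the same answer.
mu_n, sn_2 = mu0, sigma0_2
for x in data:
    sn_2_new = 1.0 / (1.0 / sigma2 + 1.0 / sn_2)
    mu_n = (sn_2 * x + sigma2 * mu_n) / (sn_2 + sigma2)
    sn_2 = sn_2_new

print(np.isclose(mu_n, muN), muN, xbar)   # posterior mean ~ sample mean for large N
# Predictive density for a new x: N(muN, sigma2 + sigmaN_2).
```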
Non-Parametric Methods for Density Estimation
Non-parametric methods do not assume any particular form for p(x):
1. Histograms
2. Kernel Methods
3. K-NN method

You Are Here: Non-parametric → Histograms

Histograms
$\hat P(x)$ is a discrete approximation of p(x).
• Count the number of times that x lands in the i-th bin:
$H(i) = \sum_{n=1}^N I(x^n \in R_i), \quad \forall i = 1, 2, \ldots, M$
• Normalize:
$\hat P(i) = \dfrac{H(i)}{\sum_{j=1}^M H(j)}$

Histograms
How many bins?
[Figure: p(x) and its histogram estimate $\hat P(x)$ for M = 3, 10, 20, and 50 bins – too few bins “oversmooth”, too many “overfit”.]

Histograms
Good:
• Once the histogram is constructed, the data can be discarded.
• Quick and intuitive.
Bad:
• Very sensitive to the number of bins, M.
• The estimated density is not smooth.
• Poor generalization in higher dimensions.

Aside: Curse of Dimensionality (Bellman, ’61)
• Imagine we build a histogram of a 1-d feature (say, Hue):
• 10 bins; 1 bin = 10% of the input space; we need at least 10 points to populate every bin.
• We add another feature (say, Saturation):
• 10 bins again; 1 bin = 1% of the input space; we need at least 100 points to populate every bin.
• We add another feature (say, Value):
• 10 bins again; 1 bin = 0.1% of the input space; we need at least 1000 points to populate every bin.
$N = b^d$ – the number of points grows exponentially with the dimension.

Aside: The Curse Continues
Volume of a cube in $R^d$ with side l: $V_l = l^d$.
Volume of a cube with side $l - \epsilon$: $V_\epsilon = (l - \epsilon)^d$.
Volume of the ε-shell: $\Delta = V_l - V_\epsilon = l^d - (l-\epsilon)^d$.
Ratio of the volume of the ε-shell to the volume of the cube:
$\dfrac{\Delta}{V_l} = \dfrac{l^d - (l-\epsilon)^d}{l^d} = 1 - \left(1 - \dfrac{\epsilon}{l}\right)^d \to 1$ as $d \to \infty$ !!!!!

Aside: Lessons of the Curse
In generative models:
• Use as much data as you can get your hands on.
• Reduce dimensionality as much as you can get away with.
<End of Digression>

General Reasoning
By definition: $P(x \in R) = P = \int_R p(x')\,dx'$
If we have N i.i.d. points drawn from p(x):
$P(|\{x \in R\}| = k) = \dfrac{N!}{k!(N-k)!}\,P^k(1-P)^{N-k} = B(N, P)$
(the number of unique k vs. (N−k) splits, times the probability that k particular x-es are in R, times the probability that the rest are not).
B(N, P) is a binomial distribution over k.

General Reasoning (cont.)
Mean and variance of B(N, P):
Mean: $\mu = E[k] = NP \;\Rightarrow\; P = E[k/N]$
Variance: $\sigma^2 = E[(k-\mu)^2] = NP(1-P) \;\Rightarrow\; E\left[(k/N - P)^2\right] = \left(\dfrac{\sigma}{N}\right)^2 = P(1-P)/N$
That is:
• E[k/N] is a good estimate of P.
• P is distributed around this estimate with vanishing variance.
So: $P \approx k/N$.

General Reasoning (cont.)
On the other hand, under mild assumptions, for a small region R of volume V around x:
$P = \int_R p(x')\,dx' \approx p(x)\,V$
… which leads to:
$p(x) \approx \dfrac{k}{NV}$
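A quick numerical sanity check of the k/(NV) argument: draw samples from a known 1-d density, count how many fall in a small interval of length V around a point x, and compare k/(NV) with the true p(x). The numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
samples = rng.normal(0.0, 1.0, size=N)      # draws from a known density p(x) = N(0, 1)

x, V = 0.5, 0.1                              # evaluation point and the "volume" (interval length)
k = np.sum(np.abs(samples - x) < V / 2)      # points falling inside the region R around x

p_est = k / (N * V)                          # p(x) ~= k / (N V)
p_true = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
print(p_est, p_true)                         # the two should be close
```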
General Reasoning (cont.)
Now, given N data points – how do we really estimate p(x) from $p(x) \approx \dfrac{k}{NV}$?
• Fix k and vary V until it encloses k points → K-Nearest Neighbors (KNN).
• Fix V and count how many points k it encloses → Kernel methods.

You Are Here: Non-parametric → Kernel Methods

Kernel Methods of Density Estimation
We choose V by specifying a hypercube with side h: $V = h^d$.
Mathematically:
$H(y) = \begin{cases}1 & |y_j| < 1/2,\ j = 1, \ldots, d\\ 0 & \text{otherwise}\end{cases}$
Kernel function: $H(y) \geq 0\ \forall y$ and $\int H(y)\,dy = 1$.

Parzen Windows
Then $H\left((x - x^n)/h\right)$ is a hypercube with side h centered at $x^n$.
H can help count the points in a volume V around any x (the h-neighborhood of x):
$k(x) = \sum_{n=1}^N H\left(\dfrac{x - x^n}{h}\right)$
Points outside the neighborhood make no contribution to the count at x.

Rectangular Kernel
So the number of points in the h-neighborhood of x,
$k(x) = \sum_{n=1}^N H\left(\dfrac{x - x^n}{h}\right)$,
is easily converted into a density estimate:
$\tilde p(x) = \dfrac{k(x)}{NV} = \dfrac{1}{N}\sum_{n=1}^N \dfrac{1}{h^d} H\left(\dfrac{x - x^n}{h}\right) = \dfrac{1}{N}\sum_{n=1}^N K(x, x^n)$
Subtle point – it integrates to 1:
$\int \left[\dfrac{1}{N}\sum_{n=1}^N K(x, x^n)\right] dx = \dfrac{1}{N}\sum_{n=1}^N \left[\int K(x, x^n)\,dx\right] = 1 \;\Rightarrow\; \int \tilde p(x)\,dx = 1$

Example
[Figure: a source density and its rectangular-kernel estimates for h = 1, 2, 4.]

Smoothed Window Functions
The problem is the same as with histograms – the estimate is discontinuous.
We can choose a smoother function, such that $\tilde p(x) \geq 0\ \forall x$ and $\int \tilde p(x)\,dx = 1$ (ensured by the kernel conditions).
E.g. <loud cheer> a (spherical) Gaussian:
$K(x, x^n) = \dfrac{1}{(2\pi)^{d/2} h^d}\exp\left(-\dfrac{\|x - x^n\|^2}{2h^2}\right)$
… so:
$\tilde p(x) = \dfrac{1}{N}\sum_{n=1}^N \dfrac{1}{(2\pi)^{d/2} h^d}\exp\left(-\dfrac{\|x - x^n\|^2}{2h^2}\right)$

Example
[Figure: a source density and its Gaussian-kernel estimates for h = 0.6, 1.0, 1.3.]

Some Insight
It is interesting to look at the expectation of the estimate with respect to all possible datasets:
$E\left[\tilde p(x)\right] = E\left[\dfrac{1}{N}\sum_{n=1}^N K(x, x^n)\right] = E\left[K(x, x')\right] = \int K(x - x')\,p(x')\,dx'$ – a convolution with the true density.
That is, $\tilde p(x) = p(x)$ if $K(x, x') = \delta(x - x')$ – but not for a finite data set!

Conditions for Convergence
How small can we make h for a given N?
$\lim_{N\to\infty} h_N^d = 0$ – it should go to 0, but more slowly than 1/N, so that
$\lim_{N\to\infty} N h_N^d = \infty$.
(Based on a similar analysis of the variance of the estimates.)
E.g. $h_N^d = h_1^d/\sqrt{N}$ or $h_N^d = h_1^d/\log N$. Note that the choice of $h_1^d$ is still up to us.

Problems with Kernel Estimation
• Need to choose the width parameter h:
• it can be chosen empirically;
• it can be adaptive, e.g. $h_j = h\,d_{jk}$, where $d_{jk}$ is the distance from $x^j$ to its k-th nearest neighbor.
• Need to store all the data to represent the density:
• this leads to Mixture Density Estimation.
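A minimal Gaussian-kernel (Parzen window) estimator following the formula above; the bandwidths and the two-lump data are arbitrary choices for illustration.

```python
import numpy as np

def parzen_gaussian(x_eval, data, h):
    """Kernel density estimate p~(x) = (1/N) sum_n K(x, x^n), spherical Gaussian kernel."""
    data = np.atleast_2d(data)          # shape (N, d)
    x_eval = np.atleast_2d(x_eval)      # shape (Q, d)
    N, d = data.shape
    # Squared distances between every query point and every stored sample.
    sq = ((x_eval[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-0.5 * sq / h ** 2) / ((2 * np.pi) ** (d / 2) * h ** d)
    return K.mean(axis=1)               # average over the N kernels

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])[:, None]
xs = np.linspace(-5, 5, 200)[:, None]
for h in (0.1, 0.5, 1.5):               # small h -> spiky estimate, large h -> oversmoothed
    p_hat = parzen_gaussian(xs, data, h)
    print(h, p_hat.max())
```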
You Are Here: Non-parametric → K-NN Method

K-Nearest Neighbors
Recall that $\tilde p(x) = \dfrac{k}{NV}$.
Now we fix k (typically $k = \sqrt{N}$) and expand V until it contains k points.
This is not a true density! E.g. choose N = 1, k = 1. Then $\tilde p(x) = \dfrac{1}{NV} \propto \dfrac{1}{|x - x^1|}$, whose integral diverges. Oops!
BUT it is useful for a number of theoretical and practical reasons.

K-NN Classification Rule
Let’s try classification with the K-NN density estimate.
Data: N total points, of which $N_j$ are in class $\omega_j$.
We need to find the class label for a query x: expand a sphere from x until it includes K points.
K – the number of neighbors of x; $K_j$ – the points of class $\omega_j$ among the K.

KNN Classification
Then the class priors are given by $p(\omega_j) = \dfrac{N_j}{N}$.
We can estimate the conditional and marginal densities around any x:
$p(x\,|\,\omega_j) = \dfrac{K_j}{N_j V}, \qquad p(x) = \dfrac{K}{NV}$
By Bayes rule:
$p(\omega_j\,|\,x) = \dfrac{p(x\,|\,\omega_j)\,p(\omega_j)}{p(x)} = \dfrac{K_j}{N_j V}\cdot\dfrac{N_j}{N}\cdot\dfrac{NV}{K} = \dfrac{K_j}{K}$
Then for minimum error rate classification: $C = \arg\max_j K_j$.

KNN Classification
An important theoretical result: in the extreme case K = 1, it can be shown that the large-sample error rate $P = \lim_{N\to\infty} P_N(\mathrm{error})$ satisfies
$P^* \leq P \leq P^*\left(2 - \dfrac{c}{c-1}P^*\right) \leq 2P^*$
where $P^*$ is the Bayes error and c is the number of classes.
That is, using just the single-neighbor rule, the error rate is at most twice the Bayes error!!!

Problems with Non-parametric Methods
• Memory: need to store all the data points.
• Computation: need to compute distances to all the data points every time.
• Parameter choice: need to choose the smoothing parameter.
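A minimal brute-force K-NN classifier implementing the rule above: for each query, take the K nearest stored points, count the class memberships $K_j$, and decide $\arg\max_j K_j$ (the posterior estimate is $K_j/K$). The two-class synthetic data are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, K=5):
    """Brute-force K-NN: posterior estimate P(w_j|x) ~= K_j / K, decision = argmax_j K_j."""
    preds = []
    for x in X_query:
        dists = np.linalg.norm(X_train - x, axis=1)     # distance to every stored point
        nearest = y_train[np.argsort(dists)[:K]]        # labels of the K nearest neighbors
        counts = np.bincount(nearest)                   # K_j for each class
        preds.append(np.argmax(counts))
    return np.array(preds)

rng = np.random.default_rng(5)
X0 = rng.normal([-1.0, -1.0], 0.8, size=(100, 2))       # class 0
X1 = rng.normal([1.5, 1.5], 0.8, size=(100, 2))         # class 1
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 100 + [1] * 100)

queries = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, -1.0]])
print(knn_predict(X_train, y_train, queries, K=7))
```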
You Are Here: Semi-parametric → Mixture Density

Mixture Density Model
A mixture model is a linear combination of parametric densities:
$p(x) = \sum_{j=1}^M p(x\,|\,j)\,P(j)$
M – the number of components; $p(x\,|\,j)$ – the component densities; $P(j)$ – the component “priors”, with $P(j) \geq 0\ \forall j$ and $\sum_{j=1}^M P(j) = 1$.
It uses far fewer “kernels” than the kernel methods, and the kernels are parametric densities, subject to estimation.

Example
[Figure: a multimodal density represented as $p(x) = \sum_{j=1}^M p(x\,|\,j)P(j)$.]

Mixture Density
Using the ML principle, the objective function is the log-likelihood:
$l(\theta) \equiv \log\prod_{n=1}^N p(x^n) = \sum_{n=1}^N \log\left[\sum_{j=1}^M p(x^n\,|\,j)\,P(j)\right]$
Differentiate with respect to the parameters:
$\dfrac{\partial l(\theta)}{\partial \theta_j} = \sum_{n=1}^N \dfrac{\partial}{\partial \theta_j}\log\left[\sum_{k=1}^M p(x^n\,|\,k)P(k)\right] = \sum_{n=1}^N \dfrac{1}{\sum_{k=1}^M p(x^n\,|\,k)P(k)}\,\dfrac{\partial}{\partial \theta_j}\left[p(x^n\,|\,j)P(j)\right]$

Mixture Density
Again let’s assume that $p(x\,|\,j)$ is a Gaussian. We need to estimate M priors and M sets of means and covariances.
$\dfrac{\partial l(\theta)}{\partial \mu_j} = \sum_{n=1}^N P(j\,|\,x^n)\left[\Sigma_j^{-1}(x^n - \hat\mu_j)\right]$
Setting it to 0 and solving for $\mu_j$:
$\hat\mu_j = \dfrac{\sum_{n=1}^N P(j\,|\,x^n)\,x^n}{\sum_{n=1}^N P(j\,|\,x^n)}$ – a convex sum of all the data.

Mixture Density
Similarly for the covariances:
$\dfrac{\partial l(\theta)}{\partial \Sigma_j} = -\dfrac{1}{2}\sum_{n=1}^N P(j\,|\,x^n)\left[\hat\Sigma_j^{-1} - \hat\Sigma_j^{-1}(x^n - \hat\mu_j)(x^n - \hat\mu_j)^T\hat\Sigma_j^{-1}\right]$
Setting it to 0 and solving for $\Sigma_j$:
$\hat\Sigma_j = \dfrac{\sum_{n=1}^N P(j\,|\,x^n)\,(x^n - \hat\mu_j)(x^n - \hat\mu_j)^T}{\sum_{n=1}^N P(j\,|\,x^n)}$

Mixture Density
It is a little harder for P(j) – the optimization is subject to constraints:
$\sum_{j=1}^M P(j) = 1$ and $P(j) \geq 0\ \forall j$.
Here is a trick to enforce the constraints:
$P(j) = \dfrac{\exp(\gamma_j)}{\sum_{k=1}^M \exp(\gamma_k)}, \qquad \dfrac{\partial P(i)}{\partial \gamma_j} = \delta_{ij}P(j) - P(i)P(j)$

Mixture Density
Using the chain rule:
$\dfrac{\partial l(\theta)}{\partial \gamma_j} = \sum_{k=1}^M \dfrac{\partial l(\theta)}{\partial P(k)}\dfrac{\partial P(k)}{\partial \gamma_j} = \sum_{n=1}^N \sum_{k=1}^M \dfrac{p(x^n\,|\,k)}{p(x^n)}\left(\delta_{jk}P(j) - P(j)P(k)\right)$
$= \sum_{n=1}^N \left[\dfrac{p(x^n\,|\,j)}{p(x^n)}P(j) - P(j)\sum_{k=1}^M \dfrac{p(x^n\,|\,k)}{p(x^n)}P(k)\right] = \sum_{n=1}^N \left[P(j\,|\,x^n) - P(j)\sum_{k=1}^M P(k\,|\,x^n)\right] = \sum_{n=1}^N \left\{P(j\,|\,x^n) - P(j)\right\} = 0$
The last expression gives the value at the extremum:
$P(j) = \dfrac{1}{N}\sum_{n=1}^N P(j\,|\,x^n)$

Mixture Density
What’s the problem?
$P(j) = \dfrac{1}{N}\sum_{n=1}^N P(j\,|\,x^n), \qquad \hat\mu_j = \dfrac{\sum_n P(j\,|\,x^n)\,x^n}{\sum_n P(j\,|\,x^n)}, \qquad \hat\Sigma_j = \dfrac{\sum_n P(j\,|\,x^n)\,(x^n - \hat\mu_j)(x^n - \hat\mu_j)^T}{\sum_n P(j\,|\,x^n)}$
We can’t compute these directly – the posteriors $P(j\,|\,x^n)$ themselves depend on the parameters we are trying to estimate.
Solution – the EM algorithm. We will study it in Clustering.

You Are Here – Road Map: Parametric (Max Likelihood, Bayesian), Non-parametric (Histograms, Kernel Methods, K-NN Method), Semi-parametric (Mixture Density).
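Since the lecture defers EM to the Clustering class, here is only a minimal sketch of the fixed-point iteration those update equations suggest (i.e., EM for a 1-d, two-component Gaussian mixture): alternate between computing the posteriors $P(j\,|\,x^n)$ under the current parameters and re-estimating $P(j)$, $\mu_j$, $\sigma_j^2$ with the formulas above. The data, initialization, and iteration count are illustrative assumptions, not the lecture’s implementation.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 0.7, 300), rng.normal(3, 1.2, 700)])  # 1-d data
N, M = len(x), 2

P = np.full(M, 1.0 / M)            # component priors P(j)
mu = np.array([-1.0, 1.0])         # component means
var = np.array([1.0, 1.0])         # component variances

for _ in range(100):
    # "E-step": posteriors P(j | x^n) under the current parameters.
    lik = np.stack([gauss(x, mu[j], var[j]) for j in range(M)])   # shape (M, N)
    post = lik * P[:, None]
    post /= post.sum(axis=0, keepdims=True)
    # "M-step": plug the posteriors into the update equations from the slides.
    Nj = post.sum(axis=1)
    P = Nj / N                                                    # P(j) = (1/N) sum_n P(j|x^n)
    mu = (post @ x) / Nj                                          # weighted means of the data
    var = (post * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nj

print(P, mu, var)   # should roughly recover the 0.3/0.7 mix of N(-2, 0.7^2) and N(3, 1.2^2)
```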