Course Outline (overview diagram)
• Model information complete → Bayes decision theory → "optimal" rules
• Model information incomplete →
  – Supervised learning: parametric approach (plug-in rules); nonparametric approach (density estimation, geometric rules such as k-NN, MLP)
  – Unsupervised learning: parametric approach (mixture resolving); nonparametric approach (cluster analysis: hard, fuzzy)
(Illustrations use a two-dimensional feature space.)

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (Supervised Learning)
• Introduction
• Maximum-Likelihood Estimation
• Bayesian Estimation
• Curse of Dimensionality
• Component Analysis & Discriminants
• EM Algorithm

Introduction
• Bayesian framework: we could design an optimal classifier if we knew
  – P(ωi): the prior probabilities
  – P(x | ωi): the class-conditional densities
  Unfortunately, we rarely have this complete information.
• Instead, design a classifier from a set of labeled training samples (supervised learning)
  – Assume the priors are known
  – A sufficient number of training samples is needed to estimate the class-conditional densities, especially when the dimensionality of the feature space is large
• Assumption about the problem: a parametric model of P(x | ωi) is available
  – Assume P(x | ωi) is multivariate Gaussian: P(x | ωi) ~ N(μi, Σi), characterized by the two parameters μi and Σi
• Parameter estimation techniques: Maximum-Likelihood (ML) estimation and Bayesian estimation
  – The results of the two procedures are nearly identical, but there is a subtle difference
  – In ML estimation the parameters are assumed to be fixed but unknown
  – Bayesian estimation, by its nature, utilizes whatever prior information is available about the unknown parameters
• MLE: the best parameters are those that maximize the probability of obtaining the samples actually observed
• Bayesian methods view the parameters as random variables having some known prior distribution (how do we know the priors?)
• In either approach, we use P(ωi | x) for the classification rule

Maximum-Likelihood Estimation
• Good convergence properties as the sample size increases: the estimated parameter value approaches the true value as n increases
• Simpler than alternative techniques
• General principle: assume c classes and P(x | ωj) ~ N(μj, Σj), i.e.
  $P(x \mid \omega_j) \equiv P(x \mid \omega_j, \theta_j)$, where $\theta_j = (\mu_j, \Sigma_j) = (\mu_{j1}, \mu_{j2}, \ldots, \sigma_{j11}, \sigma_{j22}, \mathrm{cov}(x_j^m, x_j^n), \ldots)$
• Use the samples from class ωj to estimate the parameters of class ωj
• Use the information in the training samples to estimate θ = (θ1, θ2, ..., θc); θi (i = 1, 2, ..., c) is associated with the ith category
• Suppose the sample set D contains n i.i.d. samples x1, x2, ..., xn:
  $P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) \equiv F(\theta)$
  P(D | θ) is called the likelihood of θ with respect to the set of samples
• The ML estimate of θ is, by definition, the value $\hat{\theta}$ that maximizes P(D | θ): "it is the value of θ that best agrees with the actually observed training samples"
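As a small illustration of the likelihood viewed as a function of θ (a hypothetical sketch, not part of the original slides), the code below evaluates the log-likelihood of a univariate Gaussian sample over a grid of candidate means, holding σ fixed for simplicity, and confirms that the maximizer is essentially the sample mean. The function name `log_likelihood` and the toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=100)   # n i.i.d. training samples

def log_likelihood(mu, sigma, x):
    """log P(D | theta) for a univariate Gaussian, theta = (mu, sigma)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Grid search over candidate means (sigma held fixed at 1.0 for simplicity)
mus = np.linspace(0.0, 4.0, 401)
best_mu = mus[np.argmax([log_likelihood(m, 1.0, D) for m in mus])]
print(best_mu, D.mean())   # the maximizer is (approximately) the sample mean
```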
Optimal estimation
• Let θ = (θ1, θ2, ..., θp)^t and let $\nabla_\theta$ be the gradient operator
  $\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$
• Define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ)
• New problem statement: determine the θ that maximizes the log-likelihood,
  $\hat{\theta} = \arg\max_\theta\, l(\theta)$
• The set of necessary conditions for an optimum is
  $\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) = 0$

Example of a specific case: unknown μ
• P(x | μ) ~ N(μ, Σ); the samples are drawn from a multivariate normal population
  $\ln P(x_k \mid \mu) = -\tfrac{1}{2}\ln\!\left[(2\pi)^d |\Sigma|\right] - \tfrac{1}{2}(x_k - \mu)^t \Sigma^{-1}(x_k - \mu)$
  and
  $\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$
• Here θ = μ, so the ML estimate of μ must satisfy
  $\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0$
• Multiplying by Σ and rearranging, we obtain
  $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$,
  which is the arithmetic average (sample mean) of the training samples.
• Conclusion: given that P(xk | ωj), j = 1, 2, ..., c is Gaussian in a d-dimensional feature space, estimate the vector θ = (θ1, ..., θc)^t and perform classification using the Bayes decision rule of Chapter 2.

ML Estimation: univariate Gaussian case, unknown μ and σ²
• θ = (θ1, θ2) = (μ, σ²)
  $l = \ln P(x_k \mid \theta) = -\tfrac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$
  $\nabla_\theta l = \begin{bmatrix} \frac{\partial}{\partial\theta_1}\ln P(x_k \mid \theta) \\[2pt] \frac{\partial}{\partial\theta_2}\ln P(x_k \mid \theta) \end{bmatrix} = 0
  \;\Rightarrow\;
  \begin{cases} \dfrac{1}{\theta_2}(x_k - \theta_1) = 0 \\[4pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} = 0 \end{cases}$
• Summing over the n samples:
  $\sum_{k=1}^{n}\frac{1}{\hat{\sigma}^2}(x_k - \hat{\mu}) = 0 \quad (1)$
  $-\sum_{k=1}^{n}\frac{1}{\hat{\sigma}^2} + \sum_{k=1}^{n}\frac{(x_k - \hat{\mu})^2}{\hat{\sigma}^4} = 0 \quad (2)$
• Combining (1) and (2), one obtains
  $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$

Bias
• The ML estimate of σ² is biased:
  $E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$
• An unbiased estimator for Σ is the sample covariance matrix
  $C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t$

Bayesian Estimation (Bayesian learning approach for pattern classification problems)
• In MLE, θ was assumed to have a fixed value; in Bayesian estimation, θ is a random variable
• The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
• Goal: compute P(ωi | x, D). Given the training sample set D, Bayes formula can be written
  $P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\, P(\omega_j \mid D)}$
• To derive the preceding equation, use
  $P(x, D \mid \omega_i) = P(x \mid \omega_i, D_i)\,P(D \mid \omega_i)$, $\quad P(x \mid D) = \sum_j P(x, \omega_j \mid D)$,
  and P(ωi | D) = P(ωi) (the training samples provide this). Thus
  $P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\,P(\omega_j)}$

Bayesian Parameter Estimation: Gaussian case
• Goal: estimate θ using the a-posteriori density P(θ | D)
• The univariate Gaussian case, P(μ | D): μ is the only unknown parameter,
  $P(x \mid \mu) \sim N(\mu, \sigma^2), \qquad P(\mu) \sim N(\mu_0, \sigma_0^2)$, where μ0 and σ0 are known
  $P(\mu \mid D) = \frac{P(D \mid \mu)\,P(\mu)}{\int P(D \mid \mu)\,P(\mu)\,d\mu} = \alpha \prod_{k=1}^{n} P(x_k \mid \mu)\,P(\mu) \quad (1)$
• Reproducing density: the posterior is again Gaussian,
  $P(\mu \mid D) \sim N(\mu_n, \sigma_n^2) \quad (2)$
  with updated parameters
  $\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0,
  \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$
  where $\hat{\mu}_n$ is the sample mean of the n training samples.
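As a concrete check of the update formulas for μn and σn² (a minimal sketch under the same known-variance assumption; the function name `bayes_update_mean` and the toy data are my own), the posterior parameters can be computed directly:

```python
import numpy as np

def bayes_update_mean(x, sigma2, mu0, sigma0_sq):
    """Posterior N(mu_n, sigma_n^2) for the mean of a Gaussian with known
    variance sigma2, given samples x and prior N(mu0, sigma0_sq)."""
    n = len(x)
    sample_mean = np.mean(x)
    sigma_n_sq = sigma0_sq * sigma2 / (n * sigma0_sq + sigma2)
    mu_n = (n * sigma0_sq * sample_mean + sigma2 * mu0) / (n * sigma0_sq + sigma2)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=50)          # known sigma^2 = 1
print(bayes_update_mean(x, sigma2=1.0, mu0=0.0, sigma0_sq=10.0))
# With a broad prior and n = 50, mu_n is close to the sample mean.
```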
The univariate case: P(x | D)
• P(μ | D) has been computed; P(x | D) remains to be computed:
  $P(x \mid D) = \int P(x \mid \mu)\,P(\mu \mid D)\,d\mu$ is Gaussian
• It provides
  $P(x \mid D) \sim N(\mu_n,\; \sigma^2 + \sigma_n^2)$,
  which is the desired class-conditional density P(x | Dj, ωj)
• Together with the priors P(ωj) and using Bayes formula, we obtain the Bayesian classification rule:
  $\max_j P(\omega_j \mid x, D) \;\equiv\; \max_j P(x \mid \omega_j, D_j)\,P(\omega_j)$

Bayesian Parameter Estimation: general theory
• The computation of P(x | D) can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are:
  – The form of P(x | θ) is assumed known, but the value of θ is not known exactly
  – Our knowledge about θ is assumed to be contained in a known prior density P(θ)
  – The rest of our knowledge about θ is contained in a set D of n samples x1, x2, ..., xn drawn independently according to P(x)
• The basic problem is: "compute the posterior density P(θ | D)", then "derive P(x | D)"
• Using Bayes formula,
  $P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta)\,P(\theta)\,d\theta}$,
  and by the independence assumption,
  $P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$

Overfitting: the problem of insufficient data
• How do we train a classifier (e.g., estimate the covariance matrix) when the training set size is small compared to the number of features?
• Reduce the dimensionality
  – Select a subset of the features
  – Combine the available features to get a smaller number of more "salient" features
• Bayesian techniques
  – Assume a reasonable prior on the parameters to compensate for the small amount of training data
• Model simplification
  – Assume statistical independence of the features
• Heuristics
  – Threshold the estimated covariance matrix so that only correlations above a threshold are retained

Practical observations
• Most heuristics and model simplifications are almost surely incorrect
• In practice, however, the performance of classifiers based on model simplification is often better than with full parameter estimation
• Paradox: how can a suboptimal/simplified model perform better on the test dataset than the MLE of the full parameter set?
  – The answer involves the problem of insufficient data

Insufficient data in curve fitting
• The curve-fitting example shows that a 10th-degree polynomial fits the training data with zero error; however, the test (generalization) error is much higher for this fitted curve
• When the data size is small, one cannot be sure how complex the model should be
• A small change in the data will change the parameters of the 10th-degree polynomial significantly; such instability is not a desirable quality

Handling insufficient data
• Heuristics and model simplifications (as above)
• Shrinkage is an intermediate approach that combines a "common covariance" with the individual covariance matrices
  – The individual covariance matrices shrink toward a common covariance matrix
  – Also called regularized discriminant analysis
• Shrinkage estimator for the covariance matrix of class i, given a shrinkage factor 0 < α < 1 (see the sketch below):
  $\Sigma_i(\alpha) = \frac{(1-\alpha)\,n_i\,\Sigma_i + \alpha\, n\,\Sigma}{(1-\alpha)\,n_i + \alpha\, n}$
• Further, the common covariance can be shrunk toward the identity matrix:
  $\Sigma(\beta) = (1-\beta)\,\Sigma + \beta\, I$
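The two shrinkage formulas above translate directly into code. This is a minimal sketch, not from the slides: the function names and the way the common covariance is pooled from the stacked toy data are my own illustrative choices.

```python
import numpy as np

def shrink_covariance(S_i, n_i, S_common, n, alpha):
    """Shrink the class covariance S_i (from n_i samples) toward the common
    covariance S_common (from n samples), with shrinkage factor 0 < alpha < 1."""
    return ((1 - alpha) * n_i * S_i + alpha * n * S_common) / \
           ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(S, beta):
    """Further shrink a covariance matrix toward the identity matrix."""
    d = S.shape[0]
    return (1 - beta) * S + beta * np.eye(d)

# Toy usage: two classes in 3 dimensions with small sample sizes
rng = np.random.default_rng(2)
X1, X2 = rng.normal(size=(20, 3)), rng.normal(size=(15, 3))
S1 = np.cov(X1, rowvar=False)
S = np.cov(np.vstack([X1, X2]), rowvar=False)        # common (pooled) estimate
S1_reg = shrink_to_identity(shrink_covariance(S1, 20, S, 35, alpha=0.3), beta=0.1)
```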
Problems of Dimensionality: introduction
• Real-world applications usually come with a large number of features
  – Text documents are represented using the frequencies of tens of thousands of words
  – Images are often represented by extracting local features from a large number of regions within the image
• Naive intuition: the more features, the better the classification performance? Not always!
• There are two issues that must be confronted in high-dimensional feature spaces:
  – How does the classification accuracy depend on the dimensionality and the number of training samples?
  – What is the computational complexity of the classifier?

Statistically independent features
• If the features are statistically independent, it is possible to get excellent performance as the dimensionality increases
• For a two-class problem with multivariate normal classes $P(x \mid \omega_j) \sim N(\mu_j, \Sigma)$ and equal prior probabilities, the probability of error is
  $P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\,du$,
  where the Mahalanobis distance r is defined by
  $r^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$
• When the features are independent, the covariance matrix is diagonal, and
  $r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$
• Since r² increases monotonically with the number of features, P(e) decreases: as long as the class means differ along the added features, the error decreases

Increasing dimensionality
• If a given set of features does not result in good classification performance, it is natural to add more features
• High dimensionality increases the cost and complexity of both feature extraction and classification
• If the probabilistic structure of the problem is completely known, adding new features cannot increase the Bayes risk

Curse of dimensionality
• In practice, increasing the dimensionality beyond a certain point with a finite number of training samples often leads to worse performance, not better
• The main reasons for this paradox are:
  – The Gaussian assumption that is typically made is almost surely incorrect
  – The training sample size is always finite, so the estimation of the class-conditional densities is not very accurate
• Analysis of this "curse of dimensionality" problem is difficult

A simple example (Trunk, IEEE Trans. PAMI, 1979)
• Two classes with equal priors $p(\omega_1) = p(\omega_2) = \tfrac{1}{2}$ and
  $p(x \mid \omega_1) \sim N(\mu, I), \qquad p(x \mid \omega_2) \sim N(-\mu, I)$,
  i.e.
  $p(x \mid \omega_1) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}\left(x_i - \tfrac{1}{\sqrt{i}}\right)^2\right],
  \qquad
  p(x \mid \omega_2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}\left(x_i + \tfrac{1}{\sqrt{i}}\right)^2\right]$
  where N is the number of features and the i-th component of the mean vector is $1/\sqrt{i}$:
  $\mu_1 = \left(1, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{3}}, \tfrac{1}{\sqrt{4}}, \ldots\right)^t, \qquad \mu_2 = -\mu_1$

Case 1: mean values known
• Bayes decision rule: decide ω1 if
  $x^t \mu = x_1\mu_1 + x_2\mu_2 + \cdots + x_N\mu_N > 0, \quad \text{i.e.} \quad \sum_{i=1}^{N} \frac{x_i}{\sqrt{i}} > 0$
• The resulting error probability is
  $P_e(N) = \int_{\theta(N)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz,
  \qquad \theta(N) = \frac{\|\mu_1 - \mu_2\|}{2} = \left(\sum_{i=1}^{N} \frac{1}{i}\right)^{1/2}$
• Since $\sum_{i=1}^{N} 1/i$ is a divergent series, $P_e(N) \to 0$ as $N \to \infty$

Case 2: mean values unknown
• m labeled training samples are available; the mean is estimated by the pooled estimate
  $\hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$, where $x^{(i)}$ is replaced by $-x^{(i)}$ if $x^{(i)} \in \omega_2$
• Plug-in decision rule: decide ω1 if
  $x^t \hat{\mu} = x_1\hat{\mu}_1 + x_2\hat{\mu}_2 + \cdots + x_N\hat{\mu}_N > 0$
• The error probability is
  $P_e(N, m) = P(\omega_2)\,P(x^t\hat{\mu} > 0 \mid \omega_2) + P(\omega_1)\,P(x^t\hat{\mu} \le 0 \mid \omega_1) = P(x^t\hat{\mu} > 0 \mid \omega_2)$ (by symmetry)
• Let $z = x^t\hat{\mu} = \sum_i x_i\hat{\mu}_i$; it is difficult to compute the exact distribution of z, but (for x ∈ ω1)
  $E[z] = \sum_{i=1}^{N}\frac{1}{i}, \qquad \mathrm{Var}[z] = \left(1 + \frac{1}{m}\right)\sum_{i=1}^{N}\frac{1}{i} + \frac{N}{m}$,
  and $\dfrac{z - E[z]}{\sqrt{\mathrm{Var}[z]}} \to N(0, 1)$ (standard normal) as $N \to \infty$
• Therefore
  $P_e(N, m) = \int_{\theta(N,m)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz,
  \qquad \theta(N, m) = \frac{E[z]}{\sqrt{\mathrm{Var}[z]}}
  = \frac{\sum_{i=1}^{N} 1/i}{\left[\left(1 + \frac{1}{m}\right)\sum_{i=1}^{N} \frac{1}{i} + \frac{N}{m}\right]^{1/2}}$
• For fixed m, $\lim_{N\to\infty} \theta(N, m) = 0$, and hence
  $\lim_{N\to\infty} P_e(N, m) = \tfrac{1}{2}$:
  with estimated parameters, adding features eventually drives the error toward chance level

Component Analysis and Discriminants
• Combine features in order to reduce the dimension of the feature space
• Linear combinations are simple to compute and tractable
• Project the high-dimensional data onto a lower-dimensional space
• Two classical approaches for finding an "optimal" linear transformation:
  – PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
  – MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense
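To make the PCA idea concrete (a minimal sketch, not from the slides; the function `pca_project` and the toy data are illustrative), projecting onto the top-k eigenvectors of the sample covariance matrix gives the k-dimensional representation that is optimal in the least-squares sense:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k principal components (directions of
    maximum variance / best least-squares representation)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k]                     # top-k eigenvectors
    return X_centered @ W                           # n x k projected data

# Toy usage: reduce 10-dimensional correlated data to 2 dimensions
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
Y = pca_project(X, k=2)
print(Y.shape)   # (200, 2)
```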