ECE 8443 – Pattern Recognition / ECE 8527 – Introduction to Machine Learning and Pattern Recognition

LECTURE 05: BAYESIAN ESTIMATION

• Objectives: Bayesian Estimation; Example
• Resources:
  D.H.S.: Chapter 3 (Part 2)
  J.O.S.: Bayesian Parameter Estimation
  A.K.: The Holy Trinity
  A.E.: Bayesian Methods
  J.H.: Euro Coin

Introduction to Bayesian Parameter Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and the class-conditional densities, p(x|ωi).
• Bayes: treat the parameters as random variables having some known prior distribution. Observation of samples converts this to a posterior.
• Bayesian learning: sharpen the a posteriori density, causing it to peak near the true value.
• Supervised vs. unsupervised: do we know the class assignments of the training data?
• Bayesian estimation and ML estimation produce very similar results in many cases.
• Reduces statistical inference (prior knowledge or beliefs about the world) to probabilities.
ECE 8527: Lecture 05, Slide 1

Class-Conditional Densities
• Posterior probabilities, P(ωi|x), are central to Bayesian classification.
• Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the likelihood, p(x|ωi).
• But what if the priors and class-conditional densities are unknown?
• The answer is that we can compute the posterior, P(ωi|x), using all of the information at our disposal (e.g., training data).
• For a training set, D, Bayes formula becomes (likelihood × prior / evidence):

  $$P(\omega_i|\mathbf{x},D) = \frac{p(\mathbf{x}|\omega_i,D)\,P(\omega_i|D)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,D)\,P(\omega_j|D)}$$

• We assume the priors are known: P(ωi|D) = P(ωi).
• Also, assume functional independence: Di has no influence on p(x|ωj, D) if i ≠ j. This gives:

  $$P(\omega_i|\mathbf{x},D) = \frac{p(\mathbf{x}|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,D_j)\,P(\omega_j)}$$

ECE 8527: Lecture 05, Slide 2

The Parameter Distribution
• Assume the parametric form of the evidence, p(x), is known: p(x|θ).
• Any information we have about θ prior to collecting samples is contained in a known prior density p(θ).
• Observation of samples converts this to a posterior, p(θ|D), which we hope is peaked around the true value of θ.
• Our goal is to estimate a parameter vector:

  $$p(\mathbf{x}|D) = \int p(\mathbf{x},\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$$

• We can write the joint distribution as a product:

  $$p(\mathbf{x}|D) = \int p(\mathbf{x}|\boldsymbol{\theta},D)\,p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta} = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$$

  because the samples are drawn independently.
• This equation links the class-conditional density p(x|D) to the posterior, p(θ|D). But numerical solutions are typically required!
ECE 8527: Lecture 05, Slide 3

Univariate Gaussian Case
• Case: only the mean is unknown: p(x|μ) ~ N(μ, σ²).
• Known prior density: p(μ) ~ N(μ₀, σ₀²).
• Using Bayes formula:

  $$p(\mu|D) = \frac{p(D|\mu)\,p(\mu)}{\int p(D|\mu)\,p(\mu)\,d\mu} = \alpha\prod_{k=1}^{n} p(x_k|\mu)\,p(\mu)$$

• Rationale: once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.
ECE 8527: Lecture 05, Slide 4

Univariate Gaussian Case
• Applying our Gaussian assumptions:

  $$p(\mu|D) = \alpha\prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^{2}\right]\cdot\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right]$$

  $$= \alpha'\exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{x_k-\mu}{\sigma}\right)^{2}+\left(\frac{\mu-\mu_0}{\sigma_0}\right)^{2}\right)\right]$$

  where factors that do not depend on μ have been absorbed into the constants α and α'.
ECE 8527: Lecture 05, Slide 5
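The product form of the posterior on Slides 4–5 is easy to check numerically. The sketch below (not part of the original slides) evaluates the unnormalized posterior on a grid of candidate means and then normalizes it; the sample values, σ, μ₀, and σ₀ are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustrative assumptions: known variance sigma^2, Gaussian prior N(mu0, sigma0^2).
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
D = np.array([0.8, 1.3, 0.4, 1.1, 0.9])          # hypothetical training samples

mu_grid = np.linspace(-3.0, 3.0, 601)             # candidate values of the unknown mean

# Unnormalized posterior: prod_k p(x_k | mu) * p(mu), computed in the log domain.
log_post = norm.logpdf(D[:, None], loc=mu_grid, scale=sigma).sum(axis=0) \
         + norm.logpdf(mu_grid, loc=mu0, scale=sigma0)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, mu_grid)                   # normalize so it integrates to one

print("posterior mode (grid):", mu_grid[np.argmax(post)])
```

As n grows, this grid estimate peaks more sharply near the sample mean, which is exactly the Bayesian learning behavior derived on the next few slides.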
Univariate Gaussian Case (Cont.)
• Now we need to work this into a simpler form. Collecting the terms that depend on μ (and absorbing everything else into the normalization constant):

  $$p(\mu|D) = \alpha''\exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}}+\frac{1}{\sigma_0^{2}}\right)\mu^{2}-2\left(\frac{n}{\sigma^{2}}\hat{\mu}_n+\frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]$$

• Completing the square in μ:

  $$p(\mu|D) = \alpha''\exp\!\left[-\frac{1}{2}\left(\frac{n}{\sigma^{2}}+\frac{1}{\sigma_0^{2}}\right)\left(\mu-\left(\frac{n}{\sigma^{2}}+\frac{1}{\sigma_0^{2}}\right)^{-1}\left(\frac{n}{\sigma^{2}}\hat{\mu}_n+\frac{\mu_0}{\sigma_0^{2}}\right)\right)^{2}\right]$$

  where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
ECE 8527: Lecture 05, Slide 6

Univariate Gaussian Case (Cont.)
• p(μ|D) is an exponential of a quadratic function of μ, which makes it a normal density. Because this is true for any n, it is referred to as a reproducing density.
• p(μ) is referred to as a conjugate prior.
• Write p(μ|D) ~ N(μn, σn²):

  $$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^{2}\right]$$

• Expand the quadratic term:

  $$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu^{2}}{\sigma_n^{2}}-\frac{2\mu_n\mu}{\sigma_n^{2}}+\frac{\mu_n^{2}}{\sigma_n^{2}}\right)\right]$$

• Equate coefficients of our two expressions for p(μ|D):

  $$\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu^{2}}{\sigma_n^{2}}-\frac{2\mu_n\mu}{\sigma_n^{2}}+\frac{\mu_n^{2}}{\sigma_n^{2}}\right)\right] = \alpha''\exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}}+\frac{1}{\sigma_0^{2}}\right)\mu^{2}-2\left(\frac{n}{\sigma^{2}}\hat{\mu}_n+\frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]$$

ECE 8527: Lecture 05, Slide 7

Univariate Gaussian Case (Cont.)
• Rearrange terms so that the dependencies on μ are clear, and associate the coefficients of μ² and μ:

  $$\frac{1}{\sigma_n^{2}} = \frac{n}{\sigma^{2}}+\frac{1}{\sigma_0^{2}} \qquad\text{and}\qquad \frac{\mu_n}{\sigma_n^{2}} = \frac{n}{\sigma^{2}}\hat{\mu}_n+\frac{\mu_0}{\sigma_0^{2}}$$

• There is actually a third equation, involving the terms that do not depend on μ (the constant factors), but we can ignore it since it is not a function of μ and is a complicated equation to solve.
ECE 8527: Lecture 05, Slide 8

Univariate Gaussian Case (Cont.)
• Two equations and two unknowns. First, solve for σn²:

  $$\sigma_n^{2} = \left(\frac{n}{\sigma^{2}}+\frac{1}{\sigma_0^{2}}\right)^{-1} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2}+\sigma^{2}}$$

• Next, solve for μn:

  $$\mu_n = \sigma_n^{2}\left(\frac{n}{\sigma^{2}}\hat{\mu}_n+\frac{\mu_0}{\sigma_0^{2}}\right) = \left(\frac{n\sigma_0^{2}}{n\sigma_0^{2}+\sigma^{2}}\right)\hat{\mu}_n+\left(\frac{\sigma^{2}}{n\sigma_0^{2}+\sigma^{2}}\right)\mu_0$$

• Summarizing:

  $$\mu_n = \left(\frac{n\sigma_0^{2}}{n\sigma_0^{2}+\sigma^{2}}\right)\hat{\mu}_n+\left(\frac{\sigma^{2}}{n\sigma_0^{2}+\sigma^{2}}\right)\mu_0 \qquad\qquad \sigma_n^{2} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2}+\sigma^{2}}$$

ECE 8527: Lecture 05, Slide 9

Bayesian Learning
• μn represents our best guess after n samples.
• σn² represents our uncertainty about this guess.
• σn² approaches σ²/n for large n – each additional observation decreases our uncertainty.
• The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is known as Bayesian learning.
ECE 8527: Lecture 05, Slide 10

Class-Conditional Density
• How do we obtain p(x|D)? (The derivation is tedious.)

  $$p(x|D) = \int p(x|\mu)\,p(\mu|D)\,d\mu = \int \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^{2}\right]d\mu$$

  $$= \frac{1}{2\pi\sigma\sigma_n}\exp\!\left[-\frac{1}{2}\,\frac{(x-\mu_n)^{2}}{\sigma^{2}+\sigma_n^{2}}\right]f(\sigma,\sigma_n)$$

  where:

  $$f(\sigma,\sigma_n) = \int \exp\!\left[-\frac{1}{2}\,\frac{\sigma^{2}+\sigma_n^{2}}{\sigma^{2}\sigma_n^{2}}\left(\mu-\frac{\sigma_n^{2}x+\sigma^{2}\mu_n}{\sigma^{2}+\sigma_n^{2}}\right)^{2}\right]d\mu$$

• Note that: p(x|D) ~ N(μn, σ² + σn²).
• The conditional mean, μn, is treated as the true mean.
• p(x|D) and P(ωj) can be used to design the classifier.
ECE 8527: Lecture 05, Slide 11

Multivariate Case
• Assume p(x|μ) ~ N(μ, Σ) and p(μ) ~ N(μ₀, Σ₀), where Σ, μ₀, and Σ₀ are assumed to be known.
• Applying Bayes formula:

  $$p(\boldsymbol{\mu}|D) = \alpha\prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\mu})\,p(\boldsymbol{\mu}) = \alpha'\exp\!\left[-\frac{1}{2}\left(\boldsymbol{\mu}^{t}\!\left(n\Sigma^{-1}+\Sigma_0^{-1}\right)\boldsymbol{\mu}-2\boldsymbol{\mu}^{t}\!\left(\Sigma^{-1}\sum_{k=1}^{n}\mathbf{x}_k+\Sigma_0^{-1}\boldsymbol{\mu}_0\right)\right)\right]$$

  which has the form:

  $$p(\boldsymbol{\mu}|D) = \alpha''\exp\!\left[-\frac{1}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)^{t}\Sigma_n^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)\right]$$

• Once again, p(μ|D) ~ N(μn, Σn), and we have a reproducing density.
ECE 8527: Lecture 05, Slide 12

Estimation Equations
• Equating coefficients between the two Gaussians:

  $$\Sigma_n^{-1} = n\Sigma^{-1}+\Sigma_0^{-1} \qquad \Sigma_n^{-1}\boldsymbol{\mu}_n = n\Sigma^{-1}\hat{\boldsymbol{\mu}}_n+\Sigma_0^{-1}\boldsymbol{\mu}_0 \qquad \hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$

• The solution to these equations is:

  $$\boldsymbol{\mu}_n = \Sigma_0\left(\Sigma_0+\tfrac{1}{n}\Sigma\right)^{-1}\hat{\boldsymbol{\mu}}_n+\tfrac{1}{n}\Sigma\left(\Sigma_0+\tfrac{1}{n}\Sigma\right)^{-1}\boldsymbol{\mu}_0 \qquad\qquad \Sigma_n = \Sigma_0\left(\Sigma_0+\tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma$$

• It also follows that: p(x|D) ~ N(μn, Σ + Σn).
ECE 8527: Lecture 05, Slide 13
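The estimation equations on Slide 13 translate directly into a few lines of linear algebra. The sketch below (not from the original slides) is a minimal NumPy implementation of the closed-form update; the values chosen for Σ, μ₀, Σ₀, and the data are illustrative assumptions.

```python
import numpy as np

def bayes_mean_update(X, Sigma, mu0, Sigma0):
    """Bayesian estimate of an unknown Gaussian mean with known covariance
    Sigma and Gaussian prior N(mu0, Sigma0), per the estimation equations."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                       # sample mean (sufficient statistic)
    A = np.linalg.inv(Sigma0 + Sigma / n)         # (Sigma0 + (1/n) Sigma)^{-1}
    mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n                          # p(x|D) ~ N(mu_n, Sigma + Sigma_n)

# Illustrative usage with made-up parameters.
rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
mu0, Sigma0 = np.zeros(2), np.eye(2)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=50)

mu_n, Sigma_n = bayes_mean_update(X, Sigma, mu0, Sigma0)
print("mu_n =", mu_n)
print("Sigma_n =\n", Sigma_n)    # shrinks roughly like Sigma/n as n grows
```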
General Theory
• The computation of p(x|D) can be applied to any situation in which the unknown density can be parameterized.
• The basic assumptions are:
  The form of p(x|θ) is assumed known, but the value of θ is not known exactly.
  Our knowledge about θ is assumed to be contained in a known prior density p(θ).
  The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown probability density function p(x).
ECE 8527: Lecture 05, Slide 14

Formal Solution
• The posterior is given by:

  $$p(\mathbf{x}|D) = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$$

• Using Bayes formula, we can write p(θ|D) as:

  $$p(\boldsymbol{\theta}|D) = \frac{p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}$$

• and by the independence assumption:

  $$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})$$

• This constitutes the formal solution to the problem because we have an expression for the probability of the data given the parameters.
• This also illuminates the relation to the maximum likelihood estimate: suppose p(D|θ) reaches a sharp peak at θ = θ̂.
ECE 8527: Lecture 05, Slide 15

Comparison to Maximum Likelihood
• This also illuminates the relation to the maximum likelihood estimate:
  Suppose p(D|θ) reaches a sharp peak at θ = θ̂.
  p(θ|D) will also peak at the same place if p(θ) is well behaved.
  p(x|D) will be approximately p(x|θ̂), which is the ML result.
  If the peak of p(D|θ) is very sharp, then the influence of prior information on the uncertainty about the true value of θ can be ignored.
  However, the Bayes solution tells us how to use all of the available information to compute the desired density p(x|D).
ECE 8527: Lecture 05, Slide 16

Recursive Bayes Incremental Learning
• To indicate explicitly the dependence on the number of samples, let: Dⁿ = {x₁, x₂, …, xₙ}.
• We can then write our expression for p(D|θ) recursively:

  $$p(D^{n}|\boldsymbol{\theta}) = p(\mathbf{x}_n|\boldsymbol{\theta})\,p(D^{n-1}|\boldsymbol{\theta})$$

• We can write the posterior density using a recursive relation:

  $$p(\boldsymbol{\theta}|D^{n}) = \frac{p(D^{n}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(D^{n}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}} = \frac{p(\mathbf{x}_n|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D^{n-1})}{\int p(\mathbf{x}_n|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D^{n-1})\,d\boldsymbol{\theta}}$$

  where p(θ|D⁰) = p(θ).
• This is called Recursive Bayes Incremental Learning because we have a method for incrementally updating our estimates (a concrete sketch follows below).
ECE 8527: Lecture 05, Slide 17
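As a concrete illustration of the recursive relation (not part of the original slides), the sketch below updates the Gaussian-mean posterior from Slides 4–9 one sample at a time: after each observation, the current posterior N(μn, σn²) becomes the prior for the next step. The data values and prior parameters are illustrative assumptions.

```python
def recursive_mean_update(x, mu_prior, var_prior, var_known):
    """One step of recursive Bayes learning for an unknown Gaussian mean:
    combine the current prior N(mu_prior, var_prior) with one sample x
    drawn from N(mu, var_known); the result becomes the next prior."""
    var_post = 1.0 / (1.0 / var_known + 1.0 / var_prior)
    mu_post = var_post * (x / var_known + mu_prior / var_prior)
    return mu_post, var_post

# Illustrative assumptions: sigma^2 = 1 (known), prior N(0, 4).
mu_n, var_n = 0.0, 4.0
for x in [0.8, 1.3, 0.4, 1.1, 0.9]:
    mu_n, var_n = recursive_mean_update(x, mu_n, var_n, var_known=1.0)
    print(f"after x = {x:4.1f}: mu_n = {mu_n:.3f}, sigma_n^2 = {var_n:.3f}")
```

Because the Gaussian prior is a reproducing density, this one-sample-at-a-time update produces exactly the same μn and σn² as the batch equations on Slide 9.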
When Do ML and Bayesian Estimation Differ?
• For infinite amounts of data, the solutions converge. However, limited data is always a problem.
• If prior information is reliable, a Bayesian estimate can be superior.
• Bayesian estimates for uniform priors are similar to an ML solution.
• If p(θ|D) is broad or asymmetric around the true value, the two approaches are likely to produce different solutions.
• When designing a classifier using these techniques, there are three sources of error:
  Bayes error: the error due to overlapping class-conditional distributions.
  Model error: the error due to an incorrect model or an incorrect assumption about the parametric form.
  Estimation error: the error arising from the fact that the parameters are estimated from a finite amount of data.
ECE 8527: Lecture 05, Slide 18

Noninformative Priors and Invariance
• The information in the prior is based on the designer's knowledge of the problem domain.
• We expect the prior distribution to be translation and scale invariant – it should not depend on the actual value of the parameter.
• A prior that satisfies this property is referred to as a "noninformative prior":
  The Bayesian approach remains applicable even when little or no prior information is available.
  Such situations can be handled by choosing a prior density that gives equal weight to all possible values of θ.
  Priors that seemingly impart no prior preference, the so-called noninformative priors, also arise when the prior is required to be invariant under certain transformations.
  Frequently, the desire to treat all possible values of θ equitably leads to priors with infinite mass. Such noninformative priors are called improper priors.
ECE 8527: Lecture 05, Slide 19

Example of Noninformative Priors
• For example, if we assume the prior distribution of the mean of a continuous random variable is independent of the choice of origin, the only prior that could satisfy this is a uniform distribution over the entire real line (which cannot be normalized, i.e., it is improper).
• Consider a parameter σ and the transformed variable σ̃ = ln(σ). Scaling σ by a positive constant a gives ln(aσ) = ln a + ln σ, so a change of scale in σ corresponds to a shift of σ̃. A prior invariant to this shift must be uniform in ln σ, which corresponds to the noninformative prior p(σ) = 1/σ on σ; this prior is also improper.
ECE 8527: Lecture 05, Slide 20

Sufficient Statistics
• Direct computation of p(D|θ) and p(θ|D) for large data sets is challenging (e.g., neural networks).
• We need a parametric form for p(x|θ) (e.g., Gaussian).
• Gaussian case: computation of the sample mean and covariance, which was straightforward, contained all the information relevant to estimating the unknown population mean and covariance.
• This property exists for other distributions as well.
• A sufficient statistic is a function s of the samples D that contains all the information relevant to a parameter θ.
• A statistic s is said to be sufficient for θ if p(D|s,θ) is independent of θ:

  $$p(\boldsymbol{\theta}|\mathbf{s},D) = \frac{p(D|\mathbf{s},\boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathbf{s})}{p(D|\mathbf{s})} = p(\boldsymbol{\theta}|\mathbf{s})$$

ECE 8527: Lecture 05, Slide 21

The Factorization Theorem
• Theorem: A statistic s is sufficient for θ if and only if p(D|θ) can be written as:

  $$p(D|\boldsymbol{\theta}) = g(\mathbf{s},\boldsymbol{\theta})\,h(D)$$

• There are many ways to formulate sufficient statistics (e.g., define a vector of the samples themselves).
• The theorem is useful only when the function g() and the sufficient statistic are simple (e.g., a sample mean calculation).
• The factoring of p(D|θ) is not unique:

  $$g'(\mathbf{s},\boldsymbol{\theta}) = f(\mathbf{s})\,g(\mathbf{s},\boldsymbol{\theta}), \qquad h'(D) = h(D)/f(\mathbf{s})$$

• Define a kernel density that is invariant to this scaling:

  $$\tilde{g}(\mathbf{s},\boldsymbol{\theta}) = \frac{g(\mathbf{s},\boldsymbol{\theta})}{\int g(\mathbf{s},\boldsymbol{\theta}')\,d\boldsymbol{\theta}'}$$

• Significance: most practical applications of parameter estimation involve simple sufficient statistics and simple kernel densities.
ECE 8527: Lecture 05, Slide 22

Gaussian Distributions
• For a Gaussian with unknown mean θ and known covariance Σ:

  $$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n}\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}_k-\boldsymbol{\theta})^{t}\Sigma^{-1}(\mathbf{x}_k-\boldsymbol{\theta})\right]$$

  $$= \exp\!\left[-\frac{1}{2}\left(n\boldsymbol{\theta}^{t}\Sigma^{-1}\boldsymbol{\theta}-2\boldsymbol{\theta}^{t}\Sigma^{-1}\sum_{k=1}^{n}\mathbf{x}_k\right)\right]\cdot\frac{1}{(2\pi)^{nd/2}|\Sigma|^{n/2}}\exp\!\left[-\frac{1}{2}\sum_{k=1}^{n}\mathbf{x}_k^{t}\Sigma^{-1}\mathbf{x}_k\right]$$

• This isolates the θ dependence in the first factor; hence, by the Factorization Theorem, the sample mean is a sufficient statistic.
• The kernel is:

  $$\tilde{g}(\hat{\boldsymbol{\mu}}_n,\boldsymbol{\theta}) = \frac{1}{(2\pi)^{d/2}\left|\tfrac{1}{n}\Sigma\right|^{1/2}}\exp\!\left[-\frac{1}{2}(\boldsymbol{\theta}-\hat{\boldsymbol{\mu}}_n)^{t}\left(\tfrac{1}{n}\Sigma\right)^{-1}(\boldsymbol{\theta}-\hat{\boldsymbol{\mu}}_n)\right]$$

ECE 8527: Lecture 05, Slide 23

The Exponential Family
• This can be generalized. For a member of the exponential family:

  $$p(\mathbf{x}|\boldsymbol{\theta}) = \alpha(\mathbf{x})\exp\!\left[a(\boldsymbol{\theta})+\mathbf{b}(\boldsymbol{\theta})^{t}\mathbf{c}(\mathbf{x})\right]$$

  and:

  $$p(D|\boldsymbol{\theta}) = \exp\!\left[n\,a(\boldsymbol{\theta})+\mathbf{b}(\boldsymbol{\theta})^{t}\sum_{k=1}^{n}\mathbf{c}(\mathbf{x}_k)\right]\prod_{k=1}^{n}\alpha(\mathbf{x}_k) = g(\mathbf{s},\boldsymbol{\theta})\,h(D)$$

• Examples: the Gaussian, exponential, Poisson, and binomial distributions, among many others, belong to this family.
ECE 8527: Lecture 05, Slide 24

Summary
• Introduction to Bayesian parameter estimation.
• The role of the class-conditional distribution in a Bayesian estimate.
• Estimation of the posterior and the probability density function assuming the only unknown parameter is the mean, and that the conditional density of the "features" given the mean, p(x|θ), can be modeled as a Gaussian distribution.
• Bayesian estimates of the mean for the multivariate Gaussian case.
• General theory for Bayesian estimation.
• Comparison to maximum likelihood estimates.
• Recursive Bayes incremental learning.
• Noninformative priors.
• Sufficient statistics.
• Kernel density.
ECE 8527: Lecture 05, Slide 25

"The Euro Coin"
• Getting ahead a bit, let's see how we can put these ideas to work on a simple example due to David MacKay, and explained by Jon Hamaker.
ECE 8527: Lecture 05, Slide 26
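To preview the style of reasoning used in the Euro coin example, the sketch below (not from the slides) performs a Bayesian estimate of a coin's head probability using a uniform, noninformative Beta(1, 1) prior and its conjugate Beta posterior; the head/tail counts are illustrative placeholders, not the figures from MacKay's example.

```python
from scipy.stats import beta

# Uniform (noninformative) prior on the head probability: Beta(1, 1).
a0, b0 = 1.0, 1.0

# Illustrative counts (placeholders, not MacKay's actual data).
heads, tails = 60, 40

# Conjugate update: Beta prior + binomial likelihood -> Beta posterior.
posterior = beta(a0 + heads, b0 + tails)

print("posterior mean of P(heads):", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```

The Beta prior plays the same reproducing-density role for the coin's bias that the Gaussian prior played for the unknown mean earlier in this lecture.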