ECE 8527: Introduction to Machine Learning and Pattern Recognition
LECTURE 05: BAYESIAN ESTIMATION
• Objectives:
Bayesian Estimation
Example
• Resources:
D.H.S.: Chapter 3 (Part 2)
J.O.S.: Bayesian Parameter Estimation
A.K.: The Holy Trinity
A.E.: Bayesian Methods
J.H.: Euro Coin
Introduction to Bayesian Parameter Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the
prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).
• Bayes: treat the parameters as random variables having some known prior
distribution. Observation of samples converts this to a posterior.
• Bayesian learning: sharpen the a posteriori density causing it to peak near
the true value.
• Supervised vs. unsupervised: do we know the class assignments of the
training data?
• Bayesian estimation and ML estimation produce very similar results in many
cases.
• Reduces statistical inference (prior knowledge or beliefs about the world) to
probabilities.
ECE 8527: Lecture 05, Slide 1
Class-Conditional Densities
• Posterior probabilities, P(ωi|x), are central to Bayesian classification.
• Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the
likelihood, p(x|ωi).
• But what if the priors and class-conditional densities are unknown?
• The answer is that we can compute the posterior, P(ωi|x), using all of the
information at our disposal (e.g., training data).
• For a training set, D, Bayes formula becomes:
P(i | x, D) 
likelihood  prior

evidence
p(x i , D) P(i D)
c
 p(x  j , D) P( j D)
j 1
• We assume priors are known: P(ωi|D) = P(ωi).
• Also, assume functional independence: the samples in Di have no influence on p(x|ωj, Dj) if i ≠ j. This gives:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\, P(\omega_j)}$$
ECE 8527: Lecture 05, Slide 2
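• As an illustration of the formula above, here is a minimal sketch in Python. It assumes, purely for illustration, that each class-conditional density p(x|ωj, Dj) has already been estimated as a univariate Gaussian from its own training set Dj; the parameter values and priors in the demo are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def class_posteriors(x, class_params, priors):
    """Evaluate P(w_i | x, D) for each class using Bayes formula.

    class_params: list of (mean, var) pairs estimated from each class's data D_i.
    priors:       list of prior probabilities P(w_i).
    """
    likelihoods = np.array([gaussian_pdf(x, m, v) for (m, v) in class_params])
    joint = likelihoods * np.array(priors)   # likelihood * prior
    return joint / joint.sum()               # normalize by the evidence

# Hypothetical two-class example; parameters assumed already estimated from D_1, D_2.
print(class_posteriors(x=1.2, class_params=[(0.0, 1.0), (2.0, 1.0)], priors=[0.5, 0.5]))
```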
The Parameter Distribution
• Assume the parametric form of the evidence, p(x), is known: p(x|θ).
• Any information we have about θ prior to collecting samples is contained in a
known prior density p(θ).
• Observation of samples converts this to a posterior, p(θ|D), which we hope is
peaked around the true value of θ.
• Our goal is to compute p(x|D), which requires integrating over the parameter vector θ:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x}, \boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

• We can write the joint distribution as a product:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta}, D)\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta} = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

because the samples are drawn independently (once θ is known, x does not depend on D).
• This equation links the class-conditional density p(x|D) to the posterior, p(θ|D). But numerical solutions are typically required!
ECE 8527: Lecture 05, Slide 3
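• Since this integral rarely has a closed form, a numerical sketch can help. The snippet below is illustrative only: it assumes a one-dimensional parameter θ, evaluates the integral on a grid, and uses a hypothetical posterior N(1.0, 0.2²) just to have something to integrate against.

```python
import numpy as np

def predictive_density(x, theta_grid, posterior, likelihood_fn):
    """Approximate p(x|D) = ∫ p(x|θ) p(θ|D) dθ on a discrete grid of θ values.

    theta_grid:    1-D array of θ values.
    posterior:     p(θ|D) evaluated on theta_grid (should integrate to 1).
    likelihood_fn: callable returning p(x|θ).
    """
    integrand = np.array([likelihood_fn(x, th) for th in theta_grid]) * posterior
    return np.trapz(integrand, theta_grid)

# Hypothetical example: θ is the mean of a unit-variance Gaussian,
# and the posterior p(θ|D) is taken to be N(1.0, 0.2^2) for illustration.
theta = np.linspace(-4, 6, 2001)
post = np.exp(-0.5 * ((theta - 1.0) / 0.2) ** 2) / (0.2 * np.sqrt(2 * np.pi))
lik = lambda x, th: np.exp(-0.5 * (x - th) ** 2) / np.sqrt(2 * np.pi)
print(predictive_density(0.5, theta, post, lik))
```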
Univariate Gaussian Case
• Case: only the mean is unknown: p(x|μ) ~ N(μ, σ²).
• Known prior density: p(μ) ~ N(μ0, σ0²).
• Using Bayes formula:

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{p(D)} = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha\, p(D \mid \mu)\, p(\mu) = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$$

• Rationale: once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.
ECE 8527: Lecture 05, Slide 4
Univariate Gaussian Case
• Applying our Gaussian assumptions:

$$\begin{aligned}
p(\mu \mid D) &= \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^{2}\right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^{2}\right] \\
&= \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{\mu - \mu_0}{\sigma_0}\right)^{2} + \sum_{k=1}^{n}\left(\frac{x_k - \mu}{\sigma}\right)^{2}\right)\right]
\end{aligned}$$

where factors that do not depend on μ have been absorbed into the normalization constants α and α'.
ECE 8527: Lecture 05, Slide 5
Univariate Gaussian Case (Cont.)
• Now we need to work this into a simpler form:

$$\begin{aligned}
p(\mu \mid D) &= \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{\mu - \mu_0}{\sigma_0}\right)^{2} + \sum_{k=1}^{n}\left(\frac{x_k - \mu}{\sigma}\right)^{2}\right)\right] \\
&= \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{1}{\sigma^{2}}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right] \\
&= \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]
\end{aligned}$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean and terms independent of μ have been absorbed into the constants.
ECE 8527: Lecture 05, Slide 6
Univariate Gaussian Case (Cont.)
• p(μ|D) is an exponential of a quadratic function, which makes it a normal
distribution. Because this is true for any n, it is referred to as a reproducing
density.
• p(μ) is referred to as a conjugate prior.
• Write p(μ|D) ~ N(μn, σn²):

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^{2}\right]$$

• Expand the quadratic term:

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu^{2} - 2\mu\mu_n + \mu_n^{2}}{\sigma_n^{2}}\right)\right]$$

• Equate coefficients of our two functions:

$$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu^{2} - 2\mu\mu_n + \mu_n^{2}}{\sigma_n^{2}}\right)\right] = \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]$$

ECE 8527: Lecture 05, Slide 7
Univariate Gaussian Case (Cont.)
• Rearrange terms so that the dependencies on μ are clear:

$$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{\mu_n^{2}}{2\sigma_n^{2}}\right] \exp\!\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^{2}}\,\mu^{2} - \frac{2\mu_n}{\sigma_n^{2}}\,\mu\right)\right] = \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]$$

• Associate the terms in μ² and μ to obtain equations for σn² and μn:

$$\frac{1}{\sigma_n^{2}} = \frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}} \qquad\text{and}\qquad \frac{\mu_n}{\sigma_n^{2}} = \frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}$$

• There is actually a third equation, involving the terms not related to μ:

$$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{\mu_n^{2}}{2\sigma_n^{2}}\right] = \alpha''$$

but we can ignore this since it is not a function of μ and is a complicated equation to solve.
ECE 8527: Lecture 05, Slide 8
Univariate Gaussian Case (Cont.)
• Two equations and two unknowns. Solve for μn and σn². First, solve for σn²:

$$\sigma_n^{2} = \frac{1}{\dfrac{n}{\sigma^{2}} + \dfrac{1}{\sigma_0^{2}}} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$$

• Next, solve for μn:

$$\mu_n = \sigma_n^{2}\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right) = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\left(\frac{n\,\hat{\mu}_n}{\sigma^{2}} + \frac{\mu_0}{\sigma_0^{2}}\right) = \left(\frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\hat{\mu}_n + \left(\frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\mu_0$$

• Summarizing:

$$\mu_n = \left(\frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\hat{\mu}_n + \left(\frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\mu_0 \qquad\quad \sigma_n^{2} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$$

ECE 8527: Lecture 05, Slide 9
Bayesian Learning
• μn represents our best guess after n samples.
• σn2 represents our uncertainty about this guess.
• σn2 approaches σ2/n for large n – each additional observation decreases our
uncertainty.
• The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is
known as Bayesian learning.
ECE 8527: Lecture 05, Slide 10
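• A minimal sketch of the update equations from the previous slide (a sketch only: it assumes the observation variance σ² is known, and uses synthetic data) shows both the posterior mean μn and how σn² shrinks toward σ²/n:

```python
import numpy as np

def gaussian_mean_posterior(x, mu0, var0, var):
    """Posterior N(mu_n, var_n) for an unknown Gaussian mean with known variance.

    x:    1-D array of samples.
    mu0:  prior mean.  var0: prior variance.  var: known observation variance.
    """
    n = len(x)
    mu_hat = x.mean()
    mu_n = (n * var0 / (n * var0 + var)) * mu_hat + (var / (n * var0 + var)) * mu0
    var_n = (var0 * var) / (n * var0 + var)
    return mu_n, var_n

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)   # synthetic data, true mean = 2.0
for n in (1, 10, 100, 1000):
    mu_n, var_n = gaussian_mean_posterior(data[:n], mu0=0.0, var0=1.0, var=1.0)
    print(f"n={n:4d}  mu_n={mu_n:.3f}  var_n={var_n:.5f}  sigma^2/n={1.0/n:.5f}")
```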
Class-Conditional Density
• How do we obtain p(x|D)? (The derivation is tedious.)

$$\begin{aligned}
p(x \mid D) &= \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \\
&= \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^{2}\right] d\mu \\
&= \frac{1}{2\pi\sigma\sigma_n} \exp\!\left[-\frac{1}{2}\,\frac{(x-\mu_n)^{2}}{\sigma^{2}+\sigma_n^{2}}\right] f(\sigma, \sigma_n)
\end{aligned}$$

where:

$$f(\sigma, \sigma_n) = \int \exp\!\left[-\frac{1}{2}\,\frac{\sigma^{2}+\sigma_n^{2}}{\sigma^{2}\sigma_n^{2}}\left(\mu - \frac{\sigma_n^{2}x + \sigma^{2}\mu_n}{\sigma^{2}+\sigma_n^{2}}\right)^{2}\right] d\mu$$

• Note that: p(x|D) ~ N(μn, σ² + σn²).
• The conditional mean, μn, is treated as the true mean.
• p(x|D) and P(ωj) can be used to design the classifier.
ECE 8527: Lecture 05, Slide 11
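• Continuing the earlier sketch, the predictive density in this case needs no numerical integration; an illustrative snippet (the values of μn and σn² below are hypothetical, as if carried over from a posterior update):

```python
import numpy as np

def predictive_pdf(x, mu_n, var_n, var):
    """p(x|D) = N(x; mu_n, var + var_n) for the univariate Gaussian case."""
    total_var = var + var_n
    return np.exp(-0.5 * (x - mu_n) ** 2 / total_var) / np.sqrt(2 * np.pi * total_var)

# Hypothetical posterior parameters: mu_n = 1.9, var_n = 0.01, known var = 1.0.
xs = np.array([0.0, 1.0, 2.0, 3.0])
print(predictive_pdf(xs, mu_n=1.9, var_n=0.01, var=1.0))
```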
Multivariate Case
• Assume: p(x ) ~ N (, ) and p() ~ N (0 , 0 )
where  0 , and  0 are assumed to be known.
• Applying Bayes formula:
n
p | D     p(x k | ) p 
k 1
n
 1 t

1
1
t
1

  exp  ( (n    0 )  2 (  x k   01  0 )) 
 2

i 1
which has the form:
 1

p | D     exp  (   n ) t  1 (   n )
 2

• Once again: p | D  ~ N ( n ,  n )
and we have a reproducing density.
ECE 8527: Lecture 05, Slide 12
Estimation Equations
• Equating coefficients between the two Gaussians:

$$\boldsymbol{\Sigma}_n^{-1} = n\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1} \qquad\quad \boldsymbol{\Sigma}_n^{-1}\boldsymbol{\mu}_n = n\boldsymbol{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_n + \boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 \qquad\quad \hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$

• The solution to these equations is:

$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\hat{\boldsymbol{\mu}}_n + \tfrac{1}{n}\boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\boldsymbol{\mu}_0 \qquad\quad \boldsymbol{\Sigma}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \tfrac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\tfrac{1}{n}\boldsymbol{\Sigma}$$

• It also follows that: p(x|D) ~ N(μn, Σ + Σn).
ECE 8527: Lecture 05, Slide 13
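• A small numerical sketch of these multivariate update equations (illustrative only: Σ, μ0, and Σ0 are assumed known, as stated above, and the data are synthetic):

```python
import numpy as np

def mv_gaussian_mean_posterior(X, mu0, Sigma0, Sigma):
    """Posterior N(mu_n, Sigma_n) for an unknown multivariate Gaussian mean.

    X:      (n, d) array of samples.
    mu0:    (d,) prior mean.  Sigma0: (d, d) prior covariance.
    Sigma:  (d, d) known observation covariance.
    """
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    A = np.linalg.inv(Sigma0 + Sigma / n)      # (Sigma0 + (1/n) Sigma)^-1
    mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=np.eye(2), size=200)
mu_n, Sigma_n = mv_gaussian_mean_posterior(X, mu0=np.zeros(2), Sigma0=np.eye(2), Sigma=np.eye(2))
print(mu_n)       # close to the true mean [1, -1]
print(Sigma_n)    # roughly Sigma / n for large n
```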
General Theory
• p(x | D) computation can be applied to any situation in which the unknown
density can be parameterized.
• The basic assumptions are:
 The form of p(x | θ) is assumed known, but the value of θ is not known
exactly.
 Our knowledge about θ is assumed to be contained in a known prior
density p(θ).
 The rest of our knowledge about θ is contained in a set D of n random
variables x1, x2, …, xn drawn independently according to the unknown
probability density function p(x).
ECE 8527: Lecture 05, Slide 14
Formal Solution
• The desired density is given by:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

• Using Bayes formula, we can write p(θ|D) as:

$$p(\boldsymbol{\theta} \mid D) = \frac{p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}$$

• and by the independence assumption:

$$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$
• This constitutes the formal solution to the problem because we have an
expression for the probability of the data given the parameters.
• This also illuminates the relation to the maximum likelihood estimate:
• Suppose p(D|θ) reaches a sharp peak at θ = θ̂.
ECE 8527: Lecture 05, Slide 15
Comparison to Maximum Likelihood
• This also illuminates the relation to the maximum likelihood estimate:
 Suppose p(D|θ) reaches a sharp peak at θ = θ̂.
 p(θ|D) will also peak at the same place if p(θ) is well-behaved.
 p(x|D) will be approximately p(x|θ̂), which is the ML result.
 If the peak of p(D| θ ) is very sharp, then the influence of prior information
on the uncertainty of the true value of θ can be ignored.
 However, the Bayes solution tells us how to use all of the available
information to compute the desired density p(x|D).
ECE 8527: Lecture 05, Slide 16
Recursive Bayes Incremental Learning
• To indicate explicitly the dependence on the number of samples, let:
Dn  x1, x2 ,..., xn
• We can then write our expression for p(D| θ ):
p( Dn | )  p(xn | ) p( Dn1 | )
0
where p( D | )  p() .
• We can write the posterior density using a recursive relation:

$$p(\boldsymbol{\theta} \mid D^{n}) = \frac{p(D^{n} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D^{n} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}} = \frac{p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D^{n-1})}{\int p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D^{n-1})\, d\boldsymbol{\theta}}$$

• where p(θ|D⁰) = p(θ).
• This is called Recursive Bayes Incremental Learning because it gives us a method for incrementally updating our estimates.
ECE 8527: Lecture 05, Slide 17
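• A minimal grid-based sketch of this recursion (a one-dimensional θ, a flat prior, and synthetic data are assumed purely for illustration); each new sample multiplies the current posterior by p(xn|θ) and renormalizes:

```python
import numpy as np

def recursive_bayes(samples, theta_grid, prior, likelihood_fn):
    """Recursively update p(θ|D^n) ∝ p(x_n|θ) p(θ|D^{n-1}) on a grid of θ values."""
    posterior = prior / np.trapz(prior, theta_grid)     # p(θ|D^0) = p(θ)
    for x in samples:
        posterior = likelihood_fn(x, theta_grid) * posterior
        posterior /= np.trapz(posterior, theta_grid)    # renormalize after each sample
    return posterior

theta = np.linspace(-5, 5, 1001)
flat_prior = np.ones_like(theta)
lik = lambda x, th: np.exp(-0.5 * (x - th) ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(2)
data = rng.normal(2.0, 1.0, size=50)                    # synthetic data, true mean = 2.0
post = recursive_bayes(data, theta, flat_prior, lik)
print(theta[np.argmax(post)])                           # peaks near the true mean
```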
When do ML and Bayesian Estimation Differ?
• For infinite amounts of data, the solutions converge. However, limited data
is always a problem.
• If prior information is reliable, a Bayesian estimate can be superior.
• Bayesian estimates for uniform priors are similar to an ML solution.
• If p(θ| D) is broad or asymmetric around the true value, the approaches are
likely to produce different solutions.
• When designing a classifier using these techniques, there are three
sources of error:
 Bayes Error: the error due to overlapping distributions
 Model Error: the error due to an incorrect model or incorrect assumption
about the parametric form.
 Estimation Error: the error arising from the fact that the parameters are
estimated from a finite amount of data.
ECE 8527: Lecture 05, Slide 18
Noninformative Priors and Invariance
• The information about the prior is based on the designer’s knowledge of the
problem domain.
• We expect the prior distributions to be “translation and scale invariant” – they should not depend on the actual value of the parameter.
• A prior that satisfies this property is referred to as a
“noninformative prior”:
 The Bayesian approach remains applicable even when little or no prior
information is available.
 Such situations can be handled by choosing a prior density giving equal
weight to all possible values of θ.
 Priors that seemingly impart no prior preference, the so-called
noninformative priors, also arise when the prior is required to be invariant
under certain transformations.
 Frequently, the desire to treat all possible values of θ equitably leads to
priors with infinite mass. Such noninformative priors are called improper
priors.
ECE 8527: Lecture 05, Slide 19
Example of Noninformative Priors
• For example, if we assume the prior distribution of a mean of a continuous
random variable is independent of the choice of the origin, the only prior
that could satisfy this is a uniform distribution (which isn’t possible).
• Consider a parameter σ, and a transformation of this variable to a new variable, σ̃ = ln(σ). Suppose we also scale σ by a positive constant: σ̂ = ln(aσ) = ln(a) + ln(σ). A noninformative prior on σ is the inverse distribution p(σ) = 1/σ, which is also improper.
ECE 8527: Lecture 05, Slide 20
Sufficient Statistics
• Direct computation of p(D|θ) and p(θ|D) for large data sets is challenging
(e.g. neural networks)
• We need a parametric form for p(x|θ) (e.g., Gaussian)
• Gaussian case: computation of the sample mean and covariance, which was
straightforward, contained all the information relevant to estimating the
unknown population mean and covariance.
• This property exists for other distributions.
• A sufficient statistic is a function s of the samples D that contains all the
information relevant to a parameter, θ.
• A statistic, s, is said to be sufficient for θ if p(D|s,θ) is independent of θ:
p ( | s , D ) 
p ( D | s , ) p ( | s )
 p ( | s )
p( D | s)
ECE 8527: Lecture 05, Slide 21
The Factorization Theorem
• Theorem: A statistic, s, is sufficient for θ, if and only if p(D|θ) can be written
as: p(D | )  g (s, )h(D) .
• There are many ways to formulate sufficient statistics
(e.g., define a vector of the samples themselves).
• Useful only when the function g() and the sufficient statistic are simple
(e.g., sample mean calculation).
• The factoring of p(D|θ) is not unique:
g ( s, )  f ( s) g ( s, )
h( D)  h( D) / f ( s)
• Define a kernel density invariant to scaling:
g ( s , )
g~ ( s, ) 
 g ( s ,  ) d
• Significance: most practical applications of parameter estimation involve
simple sufficient statistics and simple kernel densities.
ECE 8527: Lecture 05, Slide 22
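• A quick numerical check of this idea (a sketch under the Gaussian assumption with known variance): two data sets of the same size and the same sample mean yield log-likelihoods that differ only by a θ-independent constant, so the sample mean carries all of the information about θ:

```python
import numpy as np

def log_likelihood(theta_grid, X, var=1.0):
    """Log-likelihood of a Gaussian mean θ for data X (known variance)."""
    return np.array([-0.5 * np.sum((X - th) ** 2) / var for th in theta_grid])

theta = np.linspace(-2, 4, 7)
D1 = np.array([0.0, 1.0, 2.0, 3.0])      # sample mean = 1.5
D2 = np.array([1.4, 1.6, 0.5, 2.5])      # same size, same sample mean = 1.5

diff = log_likelihood(theta, D1) - log_likelihood(theta, D2)
print(diff)   # constant across θ: the data enter only through n and the sample mean
```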
Gaussian Distributions
 1

t 1




exp

x



x


k
k
 2

d / 2 1/ 2
k 1 2 

n
1
p ( D )  
 1 n t 1
t 1
t 1 

exp







x

x

k
k 
 2
d / 2 1/ 2
 k 1

2  
1
n
 n t 1

t 1
 exp          x k 
 k 1 
 2

1
 1 n t 1  


exp   x k  x k 
 2 d / 2  1/ 2
 2 k 1
 

• This isolates the θ dependence in the first term, and hence, the sample mean
is a sufficient statistic using the Factorization Theorem.
• The kernel is:
~ ,  
g~ 
n
 1
~ t  1  1   
~ 

exp





n
n 
n


2 d / 2  1/ 2  2 
1
ECE 8527: Lecture 05, Slide 23
The Exponential Family
• This can be generalized:

$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \alpha(\mathbf{x}) \exp\!\left[a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^{t}\mathbf{c}(\mathbf{x})\right]$$

and:

$$p(D \mid \boldsymbol{\theta}) = \exp\!\left[n\,a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^{t}\sum_{k=1}^{n}\mathbf{c}(\mathbf{x}_k)\right] \prod_{k=1}^{n}\alpha(\mathbf{x}_k) = g(\mathbf{s}, \boldsymbol{\theta})\,h(D)$$
• Examples:
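 For instance, the univariate Gaussian with an unknown mean θ and known variance σ² is one member of this family:

$$p(x \mid \theta) = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{x^{2}}{2\sigma^{2}}\right]}_{\alpha(x)}\,\exp\!\Bigg[\underbrace{-\frac{\theta^{2}}{2\sigma^{2}}}_{a(\theta)} + \underbrace{\frac{\theta}{\sigma^{2}}}_{b(\theta)}\,\underbrace{x}_{c(x)}\Bigg]$$

so the sufficient statistic is s = Σk c(xk) = Σk xk, the sum (equivalently the sample mean) of the observations.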
ECE 8527: Lecture 05, Slide 24
Summary
• Introduction of Bayesian parameter estimation.
• The role of the class-conditional distribution in a Bayesian estimate.
• Estimation of the posterior and probability density function assuming the only
unknown parameter is the mean, and the conditional density of the “features”
given the mean, p(x|θ), can be modeled as a Gaussian distribution.
• Bayesian estimates of the mean for the multivariate Gaussian case.
• General theory for Bayesian estimation.
• Comparison to maximum likelihood estimates.
• Recursive Bayesian incremental learning.
• Noninformative priors.
• Sufficient statistics
• Kernel density.
ECE 8527: Lecture 05, Slide 25
“The Euro Coin”
• Getting ahead a bit, let’s see how we can put these ideas to work on a simple
example due to David MacKay, and explained by Jon Hamaker.
ECE 8527: Lecture 05, Slide 26