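Parameter Estimation: Bayesian Estimation
Chapter 3 (Duda et al.) – Sections 3.3-3.7
CS479/679 Pattern Recognition
Dr. George Bebis

Bayesian Estimation
• Assumes that the parameters θ are random variables with a known a priori distribution p(θ).
• Estimates a distribution rather than making a point estimate like ML:
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\theta})\, p(\boldsymbol{\theta}/D)\, d\boldsymbol{\theta}$$
Note: the BE solution might not be of the parametric form assumed.

Role of Training Examples
• If p(x/ωi) and P(ωi) are known, Bayes' rule allows us to compute the posterior probabilities P(ωi/x):
$$P(\omega_i/\mathbf{x}) = \frac{p(\mathbf{x}/\omega_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j)\,P(\omega_j)}$$
• Consider the role of the training examples D by introducing them in the computation of the posterior probabilities: P(ωi/x, D)

Role of Training Examples (cont’d)
$$\begin{aligned}
P(\omega_i/\mathbf{x}, D) &= \frac{p(\mathbf{x}, D/\omega_i)\,P(\omega_i)}{p(\mathbf{x}, D)} = \frac{p(\mathbf{x}/D, \omega_i)\,p(D/\omega_i)\,P(\omega_i)}{p(\mathbf{x}/D)\,p(D)} \\
&= \frac{p(\mathbf{x}/\omega_i, D)\,P(\omega_i/D)}{p(\mathbf{x}/D)} = \frac{p(\mathbf{x}/\omega_i, D)\,P(\omega_i/D)}{\sum_j p(\mathbf{x}, \omega_j/D)} \quad \text{(marginalize)} \\
&= \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i/D_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j/D_j)} \quad \text{(using only the samples from class } i\text{)}
\end{aligned}$$

Role of Training Examples (cont’d)
• The training examples Di are important in determining both the class-conditional densities and the prior probabilities:
$$P(\omega_i/\mathbf{x}, D_i) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i/D_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j/D_j)}$$
• For simplicity, replace P(ωi/D) with P(ωi):
$$P(\omega_i/\mathbf{x}, D_i) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$

Bayesian Estimation (BE)
• Need to estimate p(x/ωi, Di) for every class ωi:
$$P(\omega_i/\mathbf{x}, D_i) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$
• If the samples in Dj give no information about θi for i ≠ j, we need to solve c independent problems of the form: “Given D, estimate p(x/D)”

BE Approach
• Estimate p(x/D) as follows:
$$p(\mathbf{x}/D) = \int p(\mathbf{x}, \boldsymbol{\theta}/D)\, d\boldsymbol{\theta} = \int p(\mathbf{x}/\boldsymbol{\theta}, D)\, p(\boldsymbol{\theta}/D)\, d\boldsymbol{\theta}$$
• Since p(x/θ, D) = p(x/θ), we have:
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\theta})\, p(\boldsymbol{\theta}/D)\, d\boldsymbol{\theta}$$

Interpretation of BE Solution
• If we are less certain about the exact value of θ, consider a weighted average of p(x/θ) over the possible values of θ:
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\theta})\, p(\boldsymbol{\theta}/D)\, d\boldsymbol{\theta}$$
• Samples D exert their influence on p(x/D) through p(θ/D).

BE Solution – Special Case
• Suppose p(θ/D) peaks sharply at θ = θ̂. Then p(x/D) can be approximated as follows (assuming that p(x/θ) is smooth):
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\theta})\, p(\boldsymbol{\theta}/D)\, d\boldsymbol{\theta} \approx p(\mathbf{x}/\hat{\boldsymbol{\theta}})$$

Relation to ML solution
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\theta})\, p(\boldsymbol{\theta}/D)\, d\boldsymbol{\theta}, \qquad p(\boldsymbol{\theta}/D) = \frac{p(D/\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(D)}$$
• If p(D/θ) peaks sharply at θ = θ̂, then p(θ/D) will, in general, peak sharply at θ = θ̂ too (i.e., close to the ML solution):
$$p(\mathbf{x}/D) \approx p(\mathbf{x}/\hat{\boldsymbol{\theta}})$$
• Therefore, ML is a special case of BE!

BE Main Steps
(1) Compute p(θ/D):
$$p(\boldsymbol{\theta}/D) = \frac{p(D/\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(D)} = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k/\boldsymbol{\theta})\, p(\boldsymbol{\theta})$$
(2) Compute p(x/D):
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\theta})\, p(\boldsymbol{\theta}/D)\, d\boldsymbol{\theta}$$

Case 1: Univariate Gaussian, Unknown μ
Assume p(x/μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²) (known μ0 and σ0)
D={x1,x2,…,xn} (independently drawn)
(1) Compute p(μ/D):
$$p(\mu/D) = \frac{p(D/\mu)\, p(\mu)}{p(D)} = \alpha \prod_{k=1}^{n} p(x_k/\mu)\, p(\mu)$$

Case 1: Univariate Gaussian, Unknown μ (cont’d)
• It can be shown that p(μ/D) has the following form:
$$p(\mu/D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$
• p(μ/D) peaks at μn, where (with the sample mean $\bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$):
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

Case 1: Univariate Gaussian, Unknown μ (cont’d)
• μn is a linear combination of x̄n and μ0 with nonnegative weights that sum to one (i.e., it lies between them).
• If σ0 ≠ 0, then μn → x̄n as n → ∞ (ML estimate).
• σn² → 0 as n → ∞ (a smaller σn² implies more samples!)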
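The Case 1 posterior update can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard closed-form result for a Gaussian likelihood with known variance σ² and a Gaussian prior N(μ0, σ0²); the function names `gaussian_mean_posterior` and `predictive_params` and the sample values are hypothetical, chosen only for this example.

```python
def gaussian_mean_posterior(samples, sigma2, mu0, sigma0_2):
    """Return (mu_n, sigma_n2), the mean and variance of p(mu/D).

    Assumes p(x/mu) ~ N(mu, sigma2) with known sigma2 and a prior
    p(mu) ~ N(mu0, sigma0_2). Names are illustrative, not from the slides.
    """
    n = len(samples)
    if n == 0:
        # No data: the posterior is just the prior.
        return mu0, sigma0_2
    xbar = sum(samples) / n                      # sample mean (ML estimate of mu)
    denom = n * sigma0_2 + sigma2
    # Weighted combination of the sample mean and the prior mean.
    mu_n = (n * sigma0_2 / denom) * xbar + (sigma2 / denom) * mu0
    # Posterior variance shrinks as n grows.
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2


def predictive_params(samples, sigma2, mu0, sigma0_2):
    """Return mean and variance of p(x/D) ~ N(mu_n, sigma2 + sigma_n2)."""
    mu_n, sigma_n2 = gaussian_mean_posterior(samples, sigma2, mu0, sigma0_2)
    return mu_n, sigma2 + sigma_n2


# Hypothetical data: the posterior mean moves from mu0 toward the sample mean.
data = [1.0, 2.0, 3.0]
for n in range(len(data) + 1):
    print(n, gaussian_mean_posterior(data[:n], sigma2=1.0, mu0=0.0, sigma0_2=1.0))
```

With these assumed values the posterior mean always lies between μ0 and the sample mean, and the posterior variance decreases with every added sample, which is the behavior described on the slides.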
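Case 1: Univariate Gaussian, Unknown μ (cont’d)
[Figure: Bayesian Learning]

Case 1: Univariate Gaussian, Unknown μ (cont’d)
(2) Compute p(x/D), where μn and σn² are the mean and variance of p(μ/D):
$$p(x/D) = \int p(x/\mu)\, p(\mu/D)\, d\mu \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$
(the result is independent of μ)
• As the number of samples increases, p(x/D) converges to p(x/μ).

Case 2: Multivariate Gaussian, Unknown μ
Assume p(x/μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0) (known μ0, Σ0)
D={x1,x2,…,xn} (independently drawn)
(1) Compute p(μ/D):
$$p(\boldsymbol{\mu}/D) = \frac{p(D/\boldsymbol{\mu})\, p(\boldsymbol{\mu})}{p(D)} = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k/\boldsymbol{\mu})\, p(\boldsymbol{\mu})$$

Case 2: Multivariate Gaussian, Unknown μ (cont’d)
• It can be shown that p(μ/D) has the following form:
$$p(\boldsymbol{\mu}/D) = c \exp\left[-\frac{1}{2}(\boldsymbol{\mu} - \boldsymbol{\mu}_n)^t\, \Sigma_n^{-1}\, (\boldsymbol{\mu} - \boldsymbol{\mu}_n)\right]$$
where (with the sample mean $\bar{\mathbf{x}}_n = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k$):
$$\boldsymbol{\mu}_n = \Sigma_0\left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1} \bar{\mathbf{x}}_n + \frac{1}{n}\Sigma\left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1} \boldsymbol{\mu}_0$$
$$\Sigma_n = \Sigma_0\left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1} \frac{1}{n}\Sigma$$

Case 2: Multivariate Gaussian, Unknown μ (cont’d)
(2) Compute p(x/D):
$$p(\mathbf{x}/D) = \int p(\mathbf{x}/\boldsymbol{\mu})\, p(\boldsymbol{\mu}/D)\, d\boldsymbol{\mu} \sim N(\boldsymbol{\mu}_n, \Sigma + \Sigma_n)$$

Recursive Bayes Learning
• Develop an incremental learning algorithm. Let D^n = (x1, x2, …, xn-1, xn), where the first n-1 samples form D^(n-1).
• Rewrite $p(D^n/\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k/\boldsymbol{\theta})$ as follows:
$$p(D^n/\boldsymbol{\theta}) = p(\mathbf{x}_n/\boldsymbol{\theta})\, p(D^{n-1}/\boldsymbol{\theta})$$

Recursive Bayes Learning (cont’d)
$$\begin{aligned}
p(\boldsymbol{\theta}/D^n) &= \frac{p(D^n/\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(D^n)} = \frac{p(D^n/\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D^n/\boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}} \\
&= \frac{p(\mathbf{x}_n/\boldsymbol{\theta})\, p(D^{n-1}/\boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(\mathbf{x}_n/\boldsymbol{\theta})\, p(D^{n-1}/\boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}} \\
&= \frac{p(\mathbf{x}_n/\boldsymbol{\theta})\, p(\boldsymbol{\theta}/D^{n-1})}{\int p(\mathbf{x}_n/\boldsymbol{\theta})\, p(\boldsymbol{\theta}/D^{n-1})\, d\boldsymbol{\theta}}, \qquad n = 1, 2, \ldots
\end{aligned}$$
where $p(\boldsymbol{\theta}/D^0) = p(\boldsymbol{\theta})$.

Example
Assume p(x/θ) ~ U(0, θ) with prior p(θ) ~ U(0, 10), and apply:
$$p(\theta/D^n) = \frac{p(x_n/\theta)\, p(\theta/D^{n-1})}{\int p(x_n/\theta)\, p(\theta/D^{n-1})\, d\theta}, \qquad \text{where } p(\theta/D^0) = p(\theta)$$

Example (cont’d)
(x4 = 8) In general:
$$p(\theta/D^n) \propto \frac{1}{\theta^n}, \qquad \text{for } \max_x[D^n] \le \theta \le 10$$

Example (cont’d)
[Figure: p(θ/D^n) over the iterations, starting from p(θ/D^0)]
• p(θ/D^4) peaks at θ̂ = 8
• ML estimate: p(x/θ̂) ~ U(0, 8)
• Bayesian estimate:
$$p(x/D) = \int p(x/\theta)\, p(\theta/D)\, d\theta$$

ML vs Bayesian Estimation
• Number of training data
– The two methods are equivalent given an infinite amount of training data (and prior distributions that do not exclude the true solution).
– For small training data sets, they give different results in most cases.
• Computational complexity
– ML uses differential calculus or gradient search to maximize the likelihood.
– Bayesian estimation requires complex multidimensional integration techniques.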
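The recursive Bayes example above can be sketched numerically, which also shows how the ML and Bayesian estimates differ on a small data set. This is a rough grid-based sketch, assuming p(x/θ) ~ U(0, θ) and a prior U(0, 10); the function names and grid resolution are hypothetical, and of the four sample values only x4 = 8 appears in the slides (the other three are assumed for illustration).

```python
def recursive_bayes_uniform(samples, theta_max=10.0, steps=1000):
    """Grid approximation of p(theta/D^n) for p(x/theta) ~ U(0, theta).

    The prior p(theta/D^0) is U(0, theta_max). Returns the grid, the
    posterior values on the grid, and the grid spacing.
    """
    d = theta_max / steps
    thetas = [d * (i + 1) for i in range(steps)]
    post = [1.0 / theta_max] * steps             # p(theta/D^0) = p(theta)
    for x in samples:
        # Numerator: p(x_n/theta) * p(theta/D^(n-1)); likelihood is 1/theta for theta >= x.
        unnorm = [(1.0 / t if t >= x else 0.0) * p for t, p in zip(thetas, post)]
        # Denominator: integral of the numerator over theta (Riemann sum).
        z = sum(unnorm) * d
        post = [u / z for u in unnorm]
    return thetas, post, d


def predictive_density(x, thetas, post, d):
    """Approximate p(x/D) = integral of p(x/theta) p(theta/D) d(theta)."""
    return sum((1.0 / t if t >= x else 0.0) * p for t, p in zip(thetas, post)) * d


# Assumed data set: only the last value (x4 = 8) is taken from the slides.
samples = [4.0, 7.0, 2.0, 8.0]
thetas, post, d = recursive_bayes_uniform(samples)

theta_ml = max(samples)                          # ML estimate of theta
for x in (1.0, 9.0):
    ml_density = 1.0 / theta_ml if x <= theta_ml else 0.0
    print(x, ml_density, predictive_density(x, thetas, post, d))
```

Under these assumptions the ML density U(0, 8) assigns zero probability to any x above 8, while the Bayesian predictive density keeps a small tail up to 10 because the posterior still spreads its mass over θ between 8 and 10.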
ML vs Bayesian Estimation (cont’d)
• Solution complexity
– ML solutions are easier to interpret (i.e., they must be of the assumed parametric form).
– A Bayesian estimation solution might not be of the parametric form assumed.
• Prior distribution
– If the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions.
– In general, the two methods will give different solutions.

Computational Complexity: ML Estimation
dimensionality: d, # training data: n, # classes: c
• Learning complexity (n > d)
$$g(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \hat{\boldsymbol{\mu}})^t\, \hat{\Sigma}^{-1}\, (\mathbf{x} - \hat{\boldsymbol{\mu}}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)$$
– Estimating μ̂: O(dn)
– Estimating Σ̂: O(d²n)
– Inverting Σ̂: O(d³)
– (d/2) ln 2π: O(1)
– ln |Σ̂|: O(d²)
– ln P(ω): O(n)
These computations must be repeated c times (once for each class).

Computational Complexity: ML Estimation (cont’d)
dimensionality: d, # training data: n, # classes: c
• Classification complexity
$$g(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \hat{\boldsymbol{\mu}})^t\, \hat{\Sigma}^{-1}\, (\mathbf{x} - \hat{\boldsymbol{\mu}}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)$$
– Quadratic term: O(d²)
– Remaining terms: O(1)
These computations must be repeated c times, and the class with the maximum g(x) is chosen.

Computational Complexity: Bayesian Estimation
• Learning complexity: higher than ML
• Classification complexity: same as ML

Main Sources of Error in Classifier Design
$$P(\omega_i/\mathbf{x}, D_i) = \frac{p(\mathbf{x}/\omega_i, D_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}/\omega_j, D_j)\,P(\omega_j)}$$
• Bayes error
– The error due to overlapping densities p(x/ωi).
• Model error
– The error due to choosing an incorrect model.
• Estimation error
– The error due to incorrectly estimated parameters (e.g., due to a small number of training examples).

Overfitting
• When the number of training examples is inadequate, the solution obtained might not be optimal.
• Consider the problem of curve fitting:
– Points were selected from a parabola (plus noise).
– A 10th degree polynomial fits the data perfectly but does not generalize well.
• A greater error on the training data might improve generalization!
• Rule of thumb: # training examples > # model parameters

Overfitting (cont’d)
• Control model complexity
– Assume a diagonal covariance matrix (i.e., uncorrelated features).
– Use the same covariance matrix for all classes.
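• Shrinkage techniques
– Shrink the individual covariance matrices toward the common covariance matrix:
$$\Sigma_i(\alpha) = \frac{(1 - \alpha)\, n_i\, \Sigma_i + \alpha\, n\, \Sigma}{(1 - \alpha)\, n_i + \alpha\, n}, \qquad 0 < \alpha < 1$$
– Shrink the common covariance matrix toward the identity matrix:
$$\Sigma(\beta) = (1 - \beta)\, \Sigma + \beta I, \qquad 0 < \beta < 1$$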
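
The two shrinkage formulas above can be written out directly. The sketch below is a minimal illustration using plain nested lists as matrices; the function names `shrink_to_common` and `shrink_to_identity` and all numeric values are hypothetical, and the covariance matrices are assumed to have been estimated beforehand.

```python
def shrink_to_common(sigma_i, n_i, sigma, n, alpha):
    """Shrink a class covariance sigma_i toward the common covariance sigma.

    n_i is the number of samples in class i, n the total number of samples,
    and alpha (0 < alpha < 1) controls the amount of shrinkage.
    """
    w_i = (1.0 - alpha) * n_i                    # weight of the class covariance
    w = alpha * n                                # weight of the common covariance
    return [
        [(w_i * a + w * b) / (w_i + w) for a, b in zip(row_i, row)]
        for row_i, row in zip(sigma_i, sigma)
    ]


def shrink_to_identity(sigma, beta):
    """Shrink the common covariance sigma toward the identity matrix."""
    d = len(sigma)
    return [
        [(1.0 - beta) * sigma[r][c] + (beta if r == c else 0.0) for c in range(d)]
        for r in range(d)
    ]


# Hypothetical 2x2 covariance estimates for illustration only.
sigma_class = [[2.0, 0.0], [0.0, 2.0]]
sigma_common = [[4.0, 2.0], [2.0, 4.0]]
print(shrink_to_common(sigma_class, n_i=10, sigma=sigma_common, n=30, alpha=0.5))
print(shrink_to_identity(sigma_common, beta=0.5))
```

In practice α and β would be tuned (for example on held-out data); larger values trade a less flexible covariance model for lower estimation error when training examples are scarce.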