L6: Parameter estimation

• Introduction
• Parameter estimation
• Maximum likelihood
• Bayesian estimation
• Numerical examples

Introduction

• In previous lectures we showed how to build classifiers when the underlying densities are known
  – Bayesian Decision Theory introduced the general formulation
  – Quadratic classifiers covered the special case of unimodal Gaussian data
• In most situations, however, the true distributions are unknown and must be estimated from data
  – Two approaches are commonplace
    • Parameter estimation (this lecture)
    • Non-parametric density estimation (the next two lectures)
• Parameter estimation
  – Assume a particular form for the density (e.g., Gaussian), so that only its parameters (e.g., mean and variance) need to be estimated
    • Maximum Likelihood
    • Bayesian Estimation
• Non-parametric density estimation
  – Assume NO knowledge about the form of the density
    • Kernel Density Estimation
    • Nearest Neighbor Rule

ML vs. Bayesian parameter estimation

• Maximum Likelihood
  – The parameters are assumed to be FIXED but unknown
  – The ML solution seeks the value of θ that "best" explains the dataset X:

    \hat{\theta}_{ML} = \arg\max_{\theta} \, p(X|\theta)

• Bayesian estimation
  – Parameters are assumed to be random variables with some (assumed) known a priori distribution
  – Bayesian methods seek to estimate the posterior density p(θ|X)
  – The final density p(x|X) is obtained by integrating out the parameters:

    p(x|X) = \int p(x|\theta) \, p(\theta|X) \, d\theta

  [Figure: for ML, the likelihood p(X|θ) is plotted against θ and the estimate is the single peak θ̂; for the Bayesian approach, a broad prior p(θ) is combined with the data to yield a posterior p(θ|X).]

Maximum Likelihood

• Problem definition
  – Assume we seek to estimate a density p(x) that is known to depend on a number of parameters θ = [θ1, θ2, …, θM]^T
    • For a Gaussian pdf, θ1 = μ, θ2 = σ, and p(x) = N(μ, σ)
    • To make the dependence explicit, we write p(x|θ)
  – Assume we have a dataset X = {x^(1), x^(2), …, x^(N)} drawn independently from the distribution p(x|θ) (an i.i.d. set)
    • Then we can write

      p(X|\theta) = \prod_{i=1}^{N} p\big(x^{(i)}|\theta\big)

    • The ML estimate of θ is the value that maximizes the likelihood p(X|θ)

      \hat{\theta} = \arg\max_{\theta} \, p(X|\theta)

    • This corresponds to the intuitive idea of choosing the value of θ that is most likely to give rise to the data

• For convenience, we will work with the log likelihood
  – Because the log is a monotonic function:

    \hat{\theta} = \arg\max_{\theta} \, p(X|\theta) = \arg\max_{\theta} \, \log p(X|\theta)

  [Figure: p(X|θ) and log p(X|θ) plotted against θ; taking logs does not move the maximizer θ̂.]
  – Hence, the ML estimate of θ can be written as:

    \hat{\theta} = \arg\max_{\theta} \, \log \prod_{i=1}^{N} p\big(x^{(i)}|\theta\big) = \arg\max_{\theta} \, \sum_{i=1}^{N} \log p\big(x^{(i)}|\theta\big)

    • This simplifies the problem, since we now maximize a sum of terms rather than a long product of terms
    • An added advantage of taking logs will become very clear when the distribution is Gaussian

Example: Gaussian case, μ unknown

• Problem statement
  – Assume a dataset X = {x^(1), x^(2), …, x^(N)} and a density of the form p(x) = N(μ, σ), where σ is known
  – What is the ML estimate of the mean? With θ = μ:

    \hat{\mu} = \arg\max_{\mu} \sum_{i=1}^{N} \log p\big(x^{(i)}|\mu\big)
              = \arg\max_{\mu} \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2}\big(x^{(i)}-\mu\big)^2 \right) \right]
              = \arg\max_{\mu} \sum_{i=1}^{N} \left[ \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\big(x^{(i)}-\mu\big)^2 \right]

  – The maxima of a function are defined by the zeros of its derivative

    \frac{\partial}{\partial \mu} \sum_{i=1}^{N} \log p\big(x^{(i)}|\mu\big) = \frac{1}{\sigma^2}\sum_{i=1}^{N}\big(x^{(i)}-\mu\big) = 0
    \;\Rightarrow\; \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}

  – So the ML estimate of the mean is the average value of the training data, a very intuitive result! (A numerical sanity check follows.)
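As a sanity check on the closed-form result above, here is a minimal sketch (not part of the original slides) that maximizes the Gaussian log-likelihood over a grid of candidate means and compares the maximizer against the sample mean. The dataset, the known σ, and the helper function log_likelihood are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Illustrative setup: N i.i.d. samples from a Gaussian with known sigma.
rng = np.random.default_rng(0)
mu_true, sigma, N = 0.8, 0.3, 50
x = rng.normal(mu_true, sigma, size=N)

def log_likelihood(mu, x, sigma):
    """Sum of log N(x^(i); mu, sigma^2) over the i.i.d. dataset."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Brute-force maximization of the log-likelihood over a grid of candidate means.
grid = np.linspace(0.0, 2.0, 2001)
ll = np.array([log_likelihood(m, x, sigma) for m in grid])
mu_grid = grid[np.argmax(ll)]

print("grid-search ML estimate  :", mu_grid)
print("sample mean (closed form):", x.mean())  # should agree up to the grid resolution
```

The grid search is only a numerical check; the derivation above already tells us that the maximizer of the log-likelihood is the sample mean.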
Example: Gaussian case, both μ and σ unknown

• A more general case is when neither μ nor σ is known
  – Fortunately, the problem can be solved in the same fashion
  – The derivative becomes a gradient, since we now have two variables, θ = [θ1, θ2]^T = [μ, σ²]^T

    \nabla_{\theta} \sum_{i=1}^{N} \log p\big(x^{(i)}|\theta\big) =
    \begin{bmatrix}
      \sum_{i=1}^{N} \frac{1}{\theta_2}\big(x^{(i)}-\theta_1\big) \\[4pt]
      \sum_{i=1}^{N} \left[ -\frac{1}{2\theta_2} + \frac{\big(x^{(i)}-\theta_1\big)^2}{2\theta_2^2} \right]
    \end{bmatrix} = 0

  – Solving for θ1 and θ2 yields

    \hat{\theta}_1 = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}; \qquad
    \hat{\theta}_2 = \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}-\hat{\theta}_1\big)^2

• Therefore, the ML estimate of the variance is the sample variance of the dataset, again a very pleasing result
  – Similarly, it can be shown that the ML estimates for the multivariate Gaussian are the sample mean vector and the sample covariance matrix

    \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}; \qquad
    \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} \big(x^{(i)}-\hat{\mu}\big)\big(x^{(i)}-\hat{\mu}\big)^T

Bias and variance

• How good are these estimates?
  – Two measures of "goodness" are used for statistical estimates
    • BIAS: how close is the estimate to the true value?
    • VARIANCE: how much does it change for different datasets?
  [Figure: the bias is the offset between the sampling distribution of the estimator and the true value θ_TRUE; the variance is the spread of that distribution.]
  – The bias-variance tradeoff
    • In most cases, you can only decrease one of them at the expense of the other
  [Figure: two sampling distributions around θ_TRUE, one with low bias but high variance, the other with high bias but low variance.]

• What is the bias of the ML estimate of the mean?

    E\big[\hat{\mu}\big] = E\!\left[\frac{1}{N}\sum_{i=1}^{N} x^{(i)}\right] = \frac{1}{N}\sum_{i=1}^{N} E\big[x^{(i)}\big] = \mu

  – Therefore the sample mean is an unbiased estimate
• What is the bias of the ML estimate of the variance?

    E\big[\hat{\sigma}^2\big] = E\!\left[\frac{1}{N}\sum_{i=1}^{N}\big(x^{(i)}-\hat{\mu}\big)^2\right] = \frac{N-1}{N}\,\sigma^2 \neq \sigma^2

  – Thus, the ML estimate of the variance is BIASED
    • This is because the ML estimate of the variance uses \hat{\mu} instead of the true mean μ
  – How "bad" is this bias?
    • For N → ∞ the bias vanishes asymptotically
    • The bias is only noticeable when we have very few samples, in which case we should not be doing statistics in the first place!
  – Notice that MATLAB uses the unbiased estimate of the covariance

    \hat{\Sigma}_{UNBIAS} = \frac{1}{N-1}\sum_{i=1}^{N}\big(x^{(i)}-\hat{\mu}\big)\big(x^{(i)}-\hat{\mu}\big)^T

Bayesian estimation

• In the Bayesian approach, our uncertainty about the parameters is represented by a pdf
  – Before we observe the data, the parameters are described by a prior density p(θ), which is typically very broad to reflect the fact that we know little about their true value
  – Once we obtain data, we make use of Bayes theorem to find the posterior p(θ|X)
    • Ideally we want the data to sharpen the posterior p(θ|X), that is, to reduce our uncertainty about the parameters
  [Figure: a broad prior p(θ) and a much sharper posterior p(θ|X) plotted over θ.]
  – Remember, though, that our goal is to estimate p(x) or, more exactly, p(x|X), the density given the evidence provided by the dataset X

• Let us derive the expression of a Bayesian estimate
  – From the definition of conditional probability

    p(x, \theta|X) = p(x|\theta, X) \, p(\theta|X)

  – p(x|θ, X) is independent of X, since knowledge of θ completely specifies the (parametric) density. Therefore

    p(x, \theta|X) = p(x|\theta) \, p(\theta|X)

  – and, using the theorem of total probability, we can integrate θ out:

    p(x|X) = \int p(x|\theta) \, p(\theta|X) \, d\theta

• The only unknown in this expression is p(θ|X); using Bayes rule

    p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} = \frac{p(X|\theta)\,p(\theta)}{\int p(X|\theta)\,p(\theta)\,d\theta}

  – where p(X|θ) can be computed using the i.i.d. assumption

    p(X|\theta) = \prod_{i=1}^{N} p\big(x^{(i)}|\theta\big)

• NOTE: the last three expressions suggest a procedure to estimate p(x|X). This is not to say that integrating these expressions is easy! (A numerical sketch of the procedure is given below.)
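When θ is low-dimensional, the procedure suggested above can be carried out numerically: evaluate prior × likelihood on a grid of θ values, normalize, and integrate θ out to obtain p(x|X). The following sketch (not from the original slides) does this for a Gaussian likelihood with known σ and a Gaussian prior on the mean; the dataset, grid range, and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                        # assumed known standard deviation
x = rng.normal(0.8, sigma, 10)     # illustrative i.i.d. dataset X

# Grid over the unknown parameter theta (here, the mean).
theta = np.linspace(-1.0, 2.0, 3001)
dtheta = theta[1] - theta[0]

def gauss(v, mean, s):
    """Univariate normal density N(v; mean, s^2)."""
    return np.exp(-(v - mean) ** 2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

prior = gauss(theta, 0.0, 0.3)                                    # p(theta)
lik = np.prod(gauss(x[:, None], theta[None, :], sigma), axis=0)   # p(X|theta) via the i.i.d. product
posterior = prior * lik
posterior /= posterior.sum() * dtheta                             # divide by ∫ p(X|θ) p(θ) dθ

# Predictive density p(x|X) = ∫ p(x|θ) p(θ|X) dθ at a few test points.
x_test = np.array([0.2, 0.8, 1.4])
p_pred = [np.sum(gauss(xt, theta, sigma) * posterior) * dtheta for xt in x_test]
print("posterior mode:", theta[np.argmax(posterior)])
print("p(x|X) at", x_test, ":", np.round(p_pred, 3))
```

For larger N the product of densities underflows, so in practice one accumulates log-likelihoods and exponentiates after subtracting the maximum; the grid itself also becomes impractical as the dimension of θ grows, which is why the closed-form Gaussian case worked out next is so convenient.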
• Example
  – Assume a univariate density where our random variable x is generated from a normal distribution with known standard deviation σ
  – Our goal is to find the mean μ of the distribution given some i.i.d. data points X = {x^(1), x^(2), …, x^(N)}
  – To capture our knowledge about θ = μ, we assume that it also follows a normal density, with mean μ0 and standard deviation σ0

    p_0(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2 \right)

  – We use Bayes rule to develop an expression for the posterior p(μ|X)

    p(\mu|X) = \frac{p(X|\mu)\,p(\mu)}{p(X)} = \frac{p_0(\mu)}{p(X)} \prod_{i=1}^{N} p\big(x^{(i)}|\mu\big)
             = \frac{1}{p(X)} \, \frac{1}{\sqrt{2\pi}\,\sigma_0} e^{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2}
               \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2}\big(x^{(i)}-\mu\big)^2}

  [Bishop, 1995]

  – To understand how Bayesian estimation changes the posterior as more data becomes available, we will find the maximum of p(μ|X)
  – Setting the partial derivative with respect to θ = μ to zero,

    \frac{\partial}{\partial \mu} \log p(\mu|X) = 0 \;\Rightarrow\;
    -\frac{1}{\sigma_0^2}(\mu-\mu_0) + \sum_{i=1}^{N} \frac{1}{\sigma^2}\big(x^{(i)}-\mu\big) = 0

  – which, after some algebraic manipulation, becomes

    \mu_N = \underbrace{\frac{\sigma^2}{N\sigma_0^2+\sigma^2}\,\mu_0}_{\text{prior}}
          + \underbrace{\frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\,\frac{1}{N}\sum_{i=1}^{N} x^{(i)}}_{\text{ML}}

    • Therefore, as N increases, the estimate of the mean μ_N moves from the initial prior μ0 toward the ML solution
  – Similarly, the standard deviation σ_N can be found to be

    \frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}

  [Bishop, 1995]

Example

• Assume that the true mean of the distribution p(x) is μ = 0.8, with standard deviation σ = 0.3
  – In reality we would not know the true mean; we are just "playing God"
  – We generate a number of examples from this distribution
  – To capture our lack of knowledge about the mean, we assume a normal prior p_0(μ) with μ0 = 0.0 and σ0 = 0.3
  – The figure below shows the posterior p(μ|X)
    • As N increases, the estimate μ_N approaches its true value (μ = 0.8) and the spread σ_N (our uncertainty about the estimate) decreases
  [Figure: posterior P(θ|X) versus θ for N = 0, 1, 5, 10; as N grows the curves sharpen and their peaks move toward θ = 0.8.]

ML vs. Bayesian estimation

• What is the relationship between these two estimates?
  – By definition, the likelihood p(X|θ) peaks at the ML estimate
  – If this peak is relatively sharp and the prior is broad, the integral below will be dominated by the region around the ML estimate:

    p(x|X) = \int p(x|\theta)\,p(\theta|X)\,d\theta \;\cong\; p\big(x|\hat{\theta}\big) \int p(\theta|X)\,d\theta = p\big(x|\hat{\theta}\big)

    since \int p(\theta|X)\,d\theta = 1
    • Therefore, the Bayesian estimate will approximate the ML solution
  – As we saw in the previous example, when the number of available data points increases, the posterior p(θ|X) tends to sharpen
    • Thus, the Bayesian estimate of p(x) will approach the ML solution as N → ∞ (illustrated numerically below)
    • In practice, the two approaches yield different results only when the number of observations is limited
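To make the convergence concrete, here is a brief sketch (not part of the original slides) that applies the closed-form expressions for μ_N and σ_N with the parameter values of the numerical example above (μ = 0.8, σ = 0.3, μ0 = 0.0, σ0 = 0.3); the random seed and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 0.8, 0.3      # "playing God": the true generating distribution
mu0, sigma0 = 0.0, 0.3         # normal prior on the unknown mean
x = rng.normal(mu_true, sigma, 50)

for N in (0, 1, 5, 10, 50):
    ml = x[:N].mean() if N > 0 else float("nan")   # ML estimate (undefined for N = 0)
    xbar = 0.0 if N == 0 else ml
    # Closed-form posterior parameters derived above:
    #   mu_N        = sigma^2/(N*sigma0^2 + sigma^2)*mu0 + N*sigma0^2/(N*sigma0^2 + sigma^2)*xbar
    #   1/sigma_N^2 = N/sigma^2 + 1/sigma0^2
    mu_N = (sigma**2 * mu0 + N * sigma0**2 * xbar) / (N * sigma0**2 + sigma**2)
    sigma_N = (N / sigma**2 + 1 / sigma0**2) ** -0.5
    print(f"N={N:3d}  mu_N={mu_N:6.3f}  sigma_N={sigma_N:5.3f}  ML={ml:6.3f}")
```

As on the slides, μ_N starts at the prior mean μ0 for N = 0 and moves toward the sample mean as N grows, while σ_N shrinks, so the Bayesian and ML estimates coincide in the large-sample limit.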