My lectures
1 Introduction to Bayesian inference and parameter estimation
2 Inference in models with hidden variables
3 Approximate inference
4 Non-parametric modelling: Gaussian process inference in differential equation models of gene regulation
Magnus Rattray (University of Manchester), Inference and estimation, 14/06/10

Introduction to Inference and Estimation
Magnus Rattray
Machine Learning and Optimization Group, University of Manchester
June 14th, 2010

Outline
Probability basics and Bayes' theorem
Representing belief as a probability distribution
Using Bayes' theorem to update beliefs
Parameter estimation: maximum likelihood, MAP, posterior mean and least-squares fit
Overfitting and regularisation

Probability basics
p(X) is the probability of observing an event X.
p(X, Y) is the probability of observing two events X and Y – the joint probability of X and Y.
p(X|Y) is the probability of observing event X given that you observed event Y – the conditional probability of X given Y.
The basic rules:
(1) p(not X) + p(X) = 1   (normalisation)
(2) p(X, Y) = p(X|Y) p(Y)   (product rule)
(3) p(Y) = Σ_X p(X, Y)   (sum rule)

Bayes' theorem
The product rule of probability is
  p(X, Y) = p(X) p(Y|X)
Since p(X, Y) = p(Y, X) we can swap X and Y on the right,
  p(X, Y) = p(Y) p(X|Y)
which, after a bit of rearranging, gives Bayes' theorem:
  p(X|Y) = p(Y|X) p(X) / p(Y),   where p(Y) = Σ_X p(Y|X) p(X)
Inverse probability: if we know the probability of an observation Y given an unobserved X, then this gives us the probability of X given Y.
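The product rule, sum rule and Bayes' theorem above can be checked numerically on a small discrete joint distribution; a minimal sketch (the joint probabilities are made up for illustration):

```python
# Numerical check of the product rule, sum rule and Bayes' theorem
# on a small made-up joint distribution over binary events X and Y.

# Joint probabilities p(X, Y); illustrative values that sum to 1.
p_joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Sum rule: p(Y) = sum_X p(X, Y), and likewise for p(X).
def p_Y(y):
    return sum(p for (x, yy), p in p_joint.items() if yy == y)

def p_X(x):
    return sum(p for (xx, y), p in p_joint.items() if xx == x)

# Conditionals via the product rule: p(X|Y) = p(X, Y) / p(Y).
def p_X_given_Y(x, y):
    return p_joint[(x, y)] / p_Y(y)

def p_Y_given_X(y, x):
    return p_joint[(x, y)] / p_X(x)

# Bayes' theorem: p(X|Y) = p(Y|X) p(X) / p(Y), for every (x, y) pair.
for x in (0, 1):
    for y in (0, 1):
        bayes = p_Y_given_X(y, x) * p_X(x) / p_Y(y)
        assert abs(p_X_given_Y(x, y) - bayes) < 1e-12

# Normalisation: p(X=0) + p(X=1) = 1.
assert abs(p_X(0) + p_X(1) - 1.0) < 1e-12
print("Bayes' theorem verified on the toy joint distribution")
```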
Bayes' theorem example: am I in Edinburgh?
Prior: I'm in Edinburgh 5% of the year, otherwise I'm in Manchester.
Model: it rains on 10% of days in Edinburgh, 30% of days in Manchester.
Data (D): it has not rained for 5 days.

p(Edinburgh) = 0.05
p(D|Edinburgh) = (0.9)^5
p(D|Manchester) = (0.7)^5

p(Edinburgh|D) = p(D|Edinburgh) p(Edinburgh) / [p(D|Edinburgh) p(Edinburgh) + p(D|Manchester) p(Manchester)]
              = (0.9)^5 × 0.05 / [(0.9)^5 × 0.05 + (0.7)^5 × 0.95]
              = 0.1561

So there is roughly a 15% posterior probability that I'm in Edinburgh.
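The Edinburgh/Manchester example above can be computed directly:

```python
# Posterior probability of being in Edinburgh given 5 dry days,
# using the prior and rain model from the slides.
prior_ed = 0.05                 # p(Edinburgh)
prior_man = 1.0 - prior_ed      # p(Manchester)
lik_ed = 0.9 ** 5               # p(no rain for 5 days | Edinburgh)
lik_man = 0.7 ** 5              # p(no rain for 5 days | Manchester)

evidence = lik_ed * prior_ed + lik_man * prior_man
posterior_ed = lik_ed * prior_ed / evidence

print(round(posterior_ed, 4))   # 0.1561
```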
Bayesian inference
Use probability to represent belief in the parameters θ of a model.
Use Bayes' theorem to update beliefs after observing some data D, captured by the posterior distribution p(θ|D):
  p(θ|D) = p(D|θ) p(θ) / p(D),   where p(D) = Σ_θ p(D|θ) p(θ)
The prior distribution p(θ) is what we believe before seeing the data.
The evidence p(D) can be used to compare alternative models.
Uncertain parameters and random variables are treated equivalently.
The catch: computing p(θ|D) and p(D) is usually very difficult.

Motivating example: estimating gene expression
A gene is expressed with concentration θ.
Observation model: xi = θ + random noise.
Data are observations: D = {x1, x2, x3, ..., xn}.
What is our belief in θ after seeing the data?
Representing belief as a distribution
Normal prior distribution for θ with mean µ0 and variance σ0²:
[Figure: bell-shaped prior density p(θ) centred at 0, plotted for θ ∈ (−3, 3)]
We write θ ∼ N(µ0, σ0²) and the density is
  p(θ) = (1 / sqrt(2π σ0²)) exp( −(θ − µ0)² / (2σ0²) )

Representing observation as a distribution
It is also natural to use probability to represent observation error.
Assume the errors εi are normally distributed with mean 0 and variance σε²:
  xi = θ + εi   →   xi ∼ N(θ, σε²)
Or in other words:
  p(xi|θ) = (1 / sqrt(2π σε²)) exp( −(xi − θ)² / (2σε²) )
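The observation model can be sketched in code: evaluate the normal density and simulate noisy observations xi = θ + εi (the true θ and sample size here are illustrative assumptions):

```python
import math
import random

# Normal density, matching the slides' p(x_i | theta) with variance var.
def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

random.seed(0)
theta = 1.5          # true (unknown) expression level; illustrative value
sigma_eps = 1.0      # observation noise standard deviation

# Simulate observations x_i = theta + eps_i with eps_i ~ N(0, sigma_eps^2).
data = [theta + random.gauss(0.0, sigma_eps) for _ in range(10)]

# Sanity check: the density integrates to ~1 (crude Riemann sum).
grid = [i * 0.01 for i in range(-1000, 1000)]
mass = sum(normal_pdf(x, theta, sigma_eps ** 2) * 0.01 for x in grid)
print(round(mass, 3))
```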
Updating beliefs
Bayes' theorem tells us how to update our belief given some data:
  p(θ|D) = p(D|θ) p(θ) / p(D)
or in words: Posterior = Likelihood × Prior / Evidence.
For n independent (i.i.d.) observations the likelihood is:
  p(D|θ) = Π_{i=1}^n p(xi|θ)
The denominator is called the evidence, Bayes factor or marginal likelihood:
  p(D) = ∫ p(D|θ) p(θ) dθ

Back to the example: estimating expression level
Zero-mean, unit-variance prior: θ ∼ N(0, 1)
Observation model with σε² = 1: xi ∼ N(θ, 1)
[Figure: the prior p(θ) and the posterior p(θ|D) over θ ∈ (−3, 3); as observations arrive the posterior narrows and shifts towards the data]

After one data point
Normal prior distribution: θ ∼ N(µ0, σ0²)
After the first observation x1 we have:
  p(θ|x1) ∝ p(x1|θ) p(θ) ∝ exp( −(x1 − θ)²/2 ) exp( −(θ − µ0)²/(2σ0²) )
One finds p(θ|x1) = N(θ|µ, σ²) with:
  µ = (µ0 + x1 σ0²) / (1 + σ0²),   σ² = σ0² / (1 + σ0²)
The posterior mean is a linear combination of the observation and the prior mean:
  µ = σ² x1 + (1 − σ²) µ0
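The conjugate update can be checked numerically; the sketch below assumes, as on the slides, a N(µ0, σ0²) prior and unit noise variance:

```python
# Posterior update for the Gaussian model on the slides (noise variance 1):
# prior theta ~ N(mu0, s0sq), observations x_i ~ N(theta, 1).
def posterior(mu0, s0sq, data):
    n = len(data)
    mu = (mu0 + s0sq * sum(data)) / (1 + n * s0sq)
    var = s0sq / (1 + n * s0sq)
    return mu, var

# One data point: mu = (mu0 + x1*s0sq)/(1+s0sq), var = s0sq/(1+s0sq).
mu, var = posterior(0.0, 1.0, [2.0])
assert abs(mu - 1.0) < 1e-12 and abs(var - 0.5) < 1e-12

# The posterior mean is a convex combination of observation and prior mean:
# mu = var * x1 + (1 - var) * mu0.
assert abs(mu - (var * 2.0 + (1 - var) * 0.0)) < 1e-12

# With n = 0 we recover the prior.
assert posterior(0.3, 2.0, []) == (0.3, 2.0)
print("conjugate update formulas check out")
```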
[Figure: the prior p(θ) and the posterior p(θ|D) after one observation, over θ ∈ (−3, 3)]

After n data points
Given D = {x1, x2, ..., xn} the posterior mean and variance are:
  µ = (µ0 + σ0² Σ_{i=1}^n xi) / (1 + nσ0²),   σ² = σ0² / (1 + nσ0²)
For n = 0 we recover the prior: µ = µ0 and σ = σ0.
For big n: µ → x̄ = (1/n) Σ_{i=1}^n xi – the empirical mean.
For big n: σ → σε/√n – the standard error in the mean.
The prior is most important when the dataset is relatively small.

Parameter point estimates
Sometimes we make a point estimate of the parameters:
  Maximum likelihood (ML):   θ_ML = argmax_θ p(D|θ)
  Maximum a posteriori (MAP):   θ_MAP = argmax_θ p(θ|D)
  Posterior mean (PM):   θ_PM = ∫ θ p(θ|D) dθ
Probabilities are usually tiny, so we work with logarithms:
  ln p(θ|D) = ln p(D|θ) + ln p(θ) + term independent of θ
Since argmax_x f(x) = argmax_x ln f(x) we have:
  ML parameter estimation:   θ_ML = argmax_θ [ln p(D|θ)]
  MAP parameter estimation:   θ_MAP = argmax_θ [ln p(D|θ) + ln p(θ)]
The log-prior acts as an additive regularisation term.
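A crude grid-search sketch makes the log-prior-as-regulariser point concrete for the Gaussian model above (prior θ ∼ N(0, 1), noise variance 1; the three data points are made up):

```python
# Grid-search illustration of ML vs MAP for the Gaussian model:
# maximising ln p(D|theta) vs ln p(D|theta) + ln p(theta).
data = [2.1, 1.9, 2.4]

def log_lik(theta):
    return sum(-(x - theta) ** 2 / 2 for x in data)   # up to a constant

def log_prior(theta):
    return -theta ** 2 / 2                            # N(0, 1), up to a constant

grid = [i / 1000 for i in range(-4000, 4001)]
theta_ml = max(grid, key=log_lik)
theta_map = max(grid, key=lambda t: log_lik(t) + log_prior(t))

xbar = sum(data) / len(data)
n, s0sq = len(data), 1.0
print(theta_ml, theta_map)
assert abs(theta_ml - xbar) < 1e-3                               # ML -> empirical mean
assert abs(theta_map - n * s0sq * xbar / (1 + n * s0sq)) < 1e-3  # MAP shrunk towards 0
```

The MAP estimate (1.6) sits between the empirical mean (≈2.13) and the prior mean (0): the log-prior pulls the maximiser towards the prior.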
Expression example: ML versus MAP
Maximum likelihood (ML) estimation:
  θ_ML = x̄ = (1/n) Σ_{i=1}^n xi
MAP estimation:
  θ_MAP = (µ0 + nσ0² x̄) / (1 + nσ0²)
For small n the MAP solution is regularised by the prior.
For large n, θ_MAP → θ_ML and the two are equivalent.
Note: θ_MAP = θ_PM here, but not in general – which one we choose is usually down to computational convenience.

Maximum likelihood and least squares
Regression model: y = f(x, θ) + ε
Maximum likelihood (ML) parameter estimation:
  θ_ML = argmax_θ [ln p(D|θ)]
With a normal observation model we have
  p(D|θ) = Π_{i=1}^n p(yi|xi, θ) = Π_{i=1}^n (1 / sqrt(2π σε²)) exp( −(yi − f(xi, θ))² / (2σε²) )
Using ln Π_i fi = Σ_i ln fi we have
  ln p(D|θ) = −(1 / 2σε²) Σ_{i=1}^n (yi − f(xi, θ))² + term independent of θ
When the noise is normal and i.i.d., ML reduces to a least-squares fit.
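The ML/least-squares equivalence can be checked numerically; the sketch below assumes a simple line-through-origin model f(x, θ) = θx and illustrative data:

```python
import math

# Check that Gaussian ML and least squares pick the same parameter,
# here for f(x, theta) = theta * x (the linear form and data are
# illustrative assumptions, not from the slides).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.9]
sigma_sq = 0.5   # any fixed noise variance gives the same argmax

def log_lik(theta):
    return sum(-math.log(math.sqrt(2 * math.pi * sigma_sq))
               - (y - theta * x) ** 2 / (2 * sigma_sq)
               for x, y in zip(xs, ys))

def sse(theta):
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

grid = [i / 1000 for i in range(-2000, 2001)]
theta_ml = max(grid, key=log_lik)
theta_ls = min(grid, key=sse)
assert theta_ml == theta_ls   # same argmax, as the derivation shows
print(theta_ml)               # prints 1.0
```

Because ln p(D|θ) is a constant minus the sum of squared errors divided by 2σε², maximising one is exactly minimising the other.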
Maximum likelihood overfitting: PSSM example
Position Specific Score Matrices (PSSMs) are sequence motif models. Five training sequences of length 3 and the ML estimates per position:

  Sequence:    x1  x2  x3  x4  x5
  Position 1:   A   A   A   G   A
  Position 2:   T   T   C   T   T
  Position 3:   C   C   C   C   C

  ML estimates:   pos 1   pos 2   pos 3
    θ_A:          4/5     0       0
    θ_G:          1/5     0       0
    θ_C:          0       1/5     1
    θ_T:          0       4/5     0

p(ATT) = 4/5 × 4/5 × 0 = 0 – really?
Maximum likelihood parameter estimation can overfit on small datasets.
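The zero-probability problem and the pseudo-count fix can be sketched directly (the sequences are the five from the table above, read column-wise):

```python
from collections import Counter

# PSSM estimation from the slides' five training sequences.
# ML uses raw frequencies; adding one pseudo-count per base gives the
# posterior-mean estimate under a uniform Dirichlet prior.
seqs = ["ATC", "ATC", "ACC", "GTC", "ATC"]
bases = "ACGT"

def pssm(seqs, pseudo=0.0):
    cols = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        total = len(seqs) + 4 * pseudo
        cols.append({b: (counts[b] + pseudo) / total for b in bases})
    return cols

def prob(word, cols):
    p = 1.0
    for pos, b in enumerate(word):
        p *= cols[pos][b]
    return p

ml = pssm(seqs)                 # maximum likelihood
pm = pssm(seqs, pseudo=1.0)     # Laplace / posterior mean

print(prob("ATT", ml))          # 0.0 -- unseen base at position 3
print(round(prob("ATT", pm), 4))  # 5/9 * 5/9 * 1/9 = 0.0343
```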
Maximum likelihood overfitting: PSSM example (continued)
Laplace's method: use the posterior mean (PM) estimate with a uniform Dirichlet prior:

  PM estimates:   pos 1   pos 2   pos 3
    θ_A:          5/9     1/9     1/9
    θ_G:          2/9     1/9     1/9
    θ_C:          1/9     2/9     2/3
    θ_T:          1/9     5/9     1/9

This is equivalent to adding a single pseudo-count of each base in each column.
Informative priors can be used to capture known composition biases – important in profile HMM estimation.

Summary
Belief/uncertainty can be represented as a probability distribution.
The posterior distribution captures our belief in the model parameters after observing some data.
Bayes' theorem can be used to obtain the posterior distribution given a likelihood and a prior.
Maximum likelihood can overfit given limited data.
Bayesian point estimates (MAP or posterior mean) regularise the ML estimate.
For further reading and applications to HMMs, see Durbin et al., "Biological Sequence Analysis" (Cambridge University Press, 1998).