My lectures

1. Introduction to Bayesian inference and parameter estimation
2. Inference in models with hidden variables
3. Approximate inference
4. Non-parametric modelling: Gaussian process inference in differential equation models of gene regulation
Magnus Rattray (University of Manchester)
Inference and estimation
14/06/10
1 / 22
Introduction to Inference and Estimation
Magnus Rattray
Machine Learning and Optimization Group, University of Manchester
June 14th, 2010
Outline
Probability basics and Bayes theorem
Representing belief as a probability distribution
Using Bayes theorem to update beliefs
Parameter estimation: maximum likelihood, MAP, posterior mean and least-squares fit
Overfitting and regularization
Probability basics
p(X) is the probability of observing an event X
p(X, Y) is the probability of observing two events X and Y – the joint probability of X and Y
p(X|Y) is the probability of observing event X given that you observed event Y – the conditional probability of X given Y
The basic rules:
(1) p(not X) + p(X) = 1 (normalisation)
(2) p(X, Y) = p(X|Y) p(Y) (product rule)
(3) p(Y) = Σ_X p(X, Y) (sum rule)
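These rules can be checked numerically on a small discrete joint distribution; a minimal sketch (the joint table below is invented purely for illustration):

```python
# A toy joint distribution p(X, Y) over X in {0, 1} and Y in {0, 1}.
# The probabilities are invented for illustration; they sum to 1.
p_joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Sum rule: p(Y) = sum over X of p(X, Y)
p_Y = {y: sum(p for (x, yy), p in p_joint.items() if yy == y) for y in (0, 1)}

# Product rule: p(X, Y) = p(X|Y) p(Y), so p(X|Y) = p(X, Y) / p(Y)
p_X_given_Y = {(x, y): p_joint[(x, y)] / p_Y[y] for (x, y) in p_joint}

# Normalisation: p(X=0|Y=y) + p(X=1|Y=y) = 1 for each y
for y in (0, 1):
    assert abs(p_X_given_Y[(0, y)] + p_X_given_Y[(1, y)] - 1.0) < 1e-12
```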
Bayes theorem
The product rule of probability is
p(X, Y) = p(X) p(Y|X)
Since p(X, Y) = p(Y, X) we can swap X and Y on the right,
p(X, Y) = p(Y) p(X|Y)
which, after a bit of rearranging, gives:
Bayes theorem
p(X|Y) = p(Y|X) p(X) / p(Y), where p(Y) = Σ_X p(Y|X) p(X)
Inverse probability: if we know the probability of an observation Y given an unobserved X, then this gives us the probability of X given Y.
Bayes theorem example
Am I in Edinburgh?
Prior: I'm in Edinburgh 5% of the year, otherwise I'm in Manchester
Model: It rains on 10% of days in Edinburgh, 30% of days in Manchester
Data (D): It has not rained for 5 days
p(Edinburgh) = 0.05
p(D|Edinburgh) = (0.9)^5
p(D|Manchester) = (0.7)^5
p(Edinburgh|D) = p(D|Edinburgh) p(Edinburgh) / [p(D|Edinburgh) p(Edinburgh) + p(D|Manchester) p(Manchester)]
              = (0.9)^5 × 0.05 / [(0.9)^5 × 0.05 + (0.7)^5 × 0.95]
              = 0.1561
About 16% posterior probability that I'm in Edinburgh
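The calculation above can be reproduced in a few lines; a minimal sketch:

```python
# Bayes theorem for the Edinburgh example: posterior after 5 dry days.
prior_edi = 0.05                  # p(Edinburgh); p(Manchester) = 0.95
lik_edi = (1 - 0.10) ** 5         # p(D|Edinburgh): no rain for 5 days
lik_man = (1 - 0.30) ** 5         # p(D|Manchester)

evidence = lik_edi * prior_edi + lik_man * (1 - prior_edi)
posterior_edi = lik_edi * prior_edi / evidence
print(round(posterior_edi, 4))    # 0.1561
```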
Bayesian inference
Use probability to represent belief in the parameters θ of a model
Use Bayes theorem to update beliefs after observing some data D, captured by the posterior distribution p(θ|D):
p(θ|D) = p(D|θ) p(θ) / p(D), where p(D) = Σ_θ p(D|θ) p(θ)
The prior distribution p(θ) is what we believe before seeing the data
The evidence p(D) can be used to compare alternative models
Uncertain parameters and random variables are treated equivalently
The catch: computing p(θ|D) and p(D) is usually very difficult
Motivating example: estimating gene expression
Gene expressed with concentration θ
Observation model: xi = θ + random noise
Data are the observations: D = {x1, x2, x3, ..., xn}
What is our belief in θ after seeing the data?
Representing belief as a distribution
Normal prior distribution for θ with mean µ0 and variance σ0²:
[Figure: density p(θ) of the normal prior plotted over θ ∈ [−3, 3]]
We write θ ∼ N(µ0, σ0²) and the density is
p(θ) = (1 / √(2πσ0²)) exp(−(θ − µ0)² / (2σ0²))
Representing observation as a distribution
Also natural to use probability to represent observation error
Assume errors εi are normally distributed with mean 0 and variance σε²:
xi = θ + εi  →  xi ∼ N(θ, σε²)
Or in other words:
p(xi|θ) = (1 / √(2πσε²)) exp(−(xi − θ)² / (2σε²))
Updating beliefs
Bayes theorem tells us how to update our belief given some data:
p(θ|D) = p(D|θ) p(θ) / p(D)
or in words: Posterior = (Likelihood × Prior) / Evidence
For n independent (i.i.d.) observations:
Likelihood: p(D|θ) = ∏_{i=1}^n p(xi|θ)
The denominator is called the Evidence, Bayes factor or Marginal Likelihood:
Evidence: p(D) = ∫ p(D|θ) p(θ) dθ
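When the evidence integral has no closed form, a one-dimensional θ can be handled by brute force on a grid; a minimal sketch with a standard normal prior, unit noise variance, and three invented observations:

```python
import math

def normal_pdf(z, mean, var):
    # Density of N(mean, var) at z
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

data = [0.8, 1.2, 1.0]                         # invented observations
dtheta = 0.001
grid = [-3 + i * dtheta for i in range(6001)]  # theta in [-3, 3]

prior = [normal_pdf(t, 0.0, 1.0) for t in grid]
lik = [math.prod(normal_pdf(x, t, 1.0) for x in data) for t in grid]

# Evidence: p(D) = integral of p(D|theta) p(theta) dtheta (Riemann sum)
evidence = sum(l * p for l, p in zip(lik, prior)) * dtheta
posterior = [l * p / evidence for l, p in zip(lik, prior)]

# Posterior mean on the grid; matches the analytic (mu0 + s0sq*sum(x))/(1 + n*s0sq)
post_mean = sum(t * q for t, q in zip(grid, posterior)) * dtheta
```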
Back to the example: estimating expression level
Zero-mean, unit variance prior: θ ∼ N(0, 1)
Observation model with σε² = 1: xi ∼ N(θ, 1)
[Figure: observed data points xi, the prior p(θ) and the posterior p(θ|D) plotted over θ ∈ [−3, 3]; the posterior narrows around the data as observations accumulate]
After one data point
Normal prior distribution: θ ∼ N(µ0, σ0²)
After the first observation x1 (unit observation noise variance) we have:
p(θ|x1) ∝ p(x1|θ) p(θ) ∝ exp(−(x1 − θ)² / 2) exp(−(θ − µ0)² / (2σ0²))
One finds p(θ|x1) = N(θ|µ, σ²) with:
µ = (µ0 + x1 σ0²) / (1 + σ0²),  σ² = σ0² / (1 + σ0²)
Posterior mean is a linear combination of the observation and the prior mean:
µ = σ² x1 + (1 − σ²) µ0
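A quick numerical check of the one-observation update (the prior parameters and observation below are invented for illustration):

```python
# Posterior after one observation with a N(mu0, s0sq) prior and
# unit observation-noise variance, as on the slide.
mu0, s0sq = 0.0, 2.0   # prior mean and variance (illustrative values)
x1 = 1.5               # a single invented observation

mu = (mu0 + x1 * s0sq) / (1 + s0sq)   # posterior mean
ssq = s0sq / (1 + s0sq)               # posterior variance

# The posterior mean is a linear combination of observation and prior mean
assert abs(mu - (ssq * x1 + (1 - ssq) * mu0)) < 1e-12
```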
After one data point
[Figure: prior p(θ) and posterior p(θ|D) after one observation, plotted over θ ∈ [−3, 3]; the posterior shifts toward the observation and narrows]
After n data points
Given D = {x1, x2, ..., xn} the posterior mean and variance are:
µ = (µ0 + σ0² Σ_{i=1}^n xi) / (1 + nσ0²),  σ² = σ0² / (1 + nσ0²)
For n = 0 we recover the prior: µ = µ0 and σ = σ0
For large n: µ → x̄ = (1/n) Σ_{i=1}^n xi – the empirical mean
For large n: σ → σε/√n – the standard error in the mean
The prior is most important when the dataset is relatively small
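These large-n limits can be checked by simulation; a minimal sketch with a deliberately poor prior mean (all values invented):

```python
import random

random.seed(0)
mu0, s0sq = 5.0, 1.0     # prior mean far from the truth, unit prior variance
n = 10000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # true theta = 0

mu = (mu0 + s0sq * sum(xs)) / (1 + n * s0sq)      # posterior mean
ssq = s0sq / (1 + n * s0sq)                       # posterior variance

xbar = sum(xs) / n
# For large n the data overwhelm the prior: mu ~ xbar, variance ~ 1/n
assert abs(mu - xbar) < 1e-3
assert abs(ssq - 1 / n) < 1e-6
```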
Parameter point estimates
Sometimes we make a point estimate of the parameters:
Maximum likelihood (ML): θ_ML = argmax_θ p(D|θ)
Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|D)
Posterior mean (PM): θ_PM = ∫ θ p(θ|D) dθ
Parameter point estimates
Probabilities are usually tiny so we work with logarithms:
ln p(θ|D) = ln p(D|θ) + ln p(θ) + term independent of θ
Since argmax_x f(x) = argmax_x ln f(x) we have:
Maximum likelihood (ML) parameter estimation: θ_ML = argmax_θ [ln p(D|θ)]
Maximum a posteriori (MAP) parameter estimation: θ_MAP = argmax_θ [ln p(D|θ) + ln p(θ)]
The log-prior acts as an additive regularisation term
Expression example: ML versus MAP
Maximum likelihood (ML) estimation:
θ_ML = x̄ = (1/n) Σ_{i=1}^n xi
MAP estimation:
θ_MAP = (µ0 + nσ0² x̄) / (1 + nσ0²)
For small n the MAP solution is regularised by the prior
For large n, θ_MAP → θ_ML and the two are equivalent
Note: θ_MAP = θ_PM here but not in general – which one we choose is usually down to computational convenience
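The shrinkage effect is easy to see numerically; a minimal sketch with invented observations and a prior centred at zero:

```python
# ML vs MAP for the expression example: the prior shrinks MAP toward mu0.
xs = [2.1, 1.9, 2.3, 2.0]   # invented observations
mu0, s0sq = 0.0, 0.5        # prior mean and variance

n = len(xs)
xbar = sum(xs) / n
theta_ml = xbar
theta_map = (mu0 + n * s0sq * xbar) / (1 + n * s0sq)

# The MAP estimate sits between the prior mean and the ML estimate
assert mu0 < theta_map < theta_ml
```

Increasing n (with the same x̄) moves theta_map toward theta_ml, as the slide states.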
Maximum likelihood and least squares
Regression model: y = f(x, θ) + ε
Maximum likelihood (ML) parameter estimation: θ_ML = argmax_θ [ln p(D|θ)]
With a normal observation model we have
p(D|θ) = ∏_{i=1}^n p(yi|xi, θ) = ∏_{i=1}^n (1 / √(2πσε²)) exp(−(yi − f(xi, θ))² / (2σε²))
Since ln ∏_i fi = Σ_i ln fi we have
ln p(D|θ) = −(1 / (2σε²)) Σ_{i=1}^n (yi − f(xi, θ))² + term indep. of θ
When the noise is normal and i.i.d., ML → least-squares fit
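A concrete check for a linear model f(x, θ) = θx: maximising the log-likelihood by grid search lands on the closed-form least-squares slope (the data below are invented):

```python
# With normal i.i.d. noise, maximising ln p(D|theta) is the same as
# minimising the sum of squared residuals.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.1, 2.9, 4.2]     # invented noisy observations of y ~ x

# Closed-form least-squares slope for f(x, theta) = theta * x
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def neg_log_lik(theta, noise_var=1.0):
    # Negative log-likelihood up to a constant independent of theta
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * noise_var)

# A coarse grid search over theta finds (numerically) the same optimum
grid = [i * 0.001 for i in range(2001)]   # theta in [0, 2]
theta_hat = min(grid, key=neg_log_lik)
assert abs(theta_hat - theta_ls) < 1e-3
```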
Maximum likelihood overfitting: PSSM example
Position Specific Score Matrices (PSSMs) are sequence motif models.
Five example sequences: x1 = ATC, x2 = ATC, x3 = ACC, x4 = GTC, x5 = ATC
ML estimates of the base probabilities at each of the three positions:

         pos 1   pos 2   pos 3
θ_A^ML    4/5      0       0
θ_G^ML    1/5      0       0
θ_C^ML     0      1/5      1
θ_T^ML     0      4/5      0

p(ATT) = 4/5 × 4/5 × 0 = 0 – really?
Maximum likelihood parameter estimation can overfit on small datasets
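The zero-probability problem is easy to reproduce; a minimal sketch building the ML PSSM from the five sequences above:

```python
# ML estimate of a PSSM from five length-3 sequences.
seqs = ["ATC", "ATC", "ACC", "GTC", "ATC"]
bases = "AGCT"

# Per-position relative frequencies (the ML estimates)
pssm_ml = [{b: sum(s[j] == b for s in seqs) / len(seqs) for b in bases}
           for j in range(len(seqs[0]))]

def prob(word, pssm):
    # Probability of a word: product of per-position base probabilities
    p = 1.0
    for j, b in enumerate(word):
        p *= pssm[j][b]
    return p

print(prob("ATT", pssm_ml))   # 0.0 -- T was never observed at position 3
```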
Maximum likelihood overfitting: PSSM example
Laplace's method: use the posterior mean (PM) estimate with a uniform Dirichlet distribution prior:

         pos 1   pos 2   pos 3
θ_A^PM    5/9     1/9     1/9
θ_G^PM    2/9     1/9     1/9
θ_C^PM    1/9     2/9     2/3
θ_T^PM    1/9     5/9     1/9

Equivalent to adding a single pseudo-count of each base in each column
Informative priors can be used to capture known composition biases – important in profile HMM estimation
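The pseudo-count version is a one-line change to the ML estimate: add 1 to every count and 4 (one per base) to every column total. A minimal sketch:

```python
# Posterior-mean PSSM: one pseudo-count per base per column
# (uniform Dirichlet prior over the four bases).
seqs = ["ATC", "ATC", "ACC", "GTC", "ATC"]
bases = "AGCT"

pssm_pm = [{b: (sum(s[j] == b for s in seqs) + 1) / (len(seqs) + len(bases))
            for b in bases}
           for j in range(len(seqs[0]))]

# The previously impossible word now has non-zero probability
p_att = pssm_pm[0]["A"] * pssm_pm[1]["T"] * pssm_pm[2]["T"]
assert p_att > 0
print(round(pssm_pm[0]["A"], 4))   # 0.5556, i.e. 5/9 as in the table
```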
Summary
Belief/uncertainty can be represented as a probability distribution
The posterior distribution captures our belief in the model parameters after observing some data
Bayes theorem can be used to obtain the posterior distribution given a likelihood and prior
Maximum likelihood can overfit given limited data
Bayesian point estimates (MAP or posterior mean estimates) regularise the ML estimate
For further reading and applications to HMMs see Durbin et al., "Biological Sequence Analysis" (Cambridge University Press, 1998).