11. Maximum Likelihood Estimation

MT2004
Olivier GIMENEZ
Telephone: 01334 461827
E-mail: olivier@mcs.st-and.ac.uk
Website: http://www.creem.st-and.ac.uk/olivier/OGimenez.html
11. Maximum Likelihood Estimation
• So far, we’ve provided confidence intervals for and tested
hypotheses about model parameters (mean of a normally
distributed population)
• Objective here: estimating the parameters of a model, using
data
• Example: We want to estimate the probability θ of getting a head upon flipping a particular coin.
• We flip the coin ‘independently’ 10 times (i.e. we sample n = 10
flips), obtaining the following result: H H T H H H T T H H
11. Maximum Likelihood Estimation
• We flip the coin ‘independently’ 10 times (i.e. we sample n = 10
flips), obtaining the following result: H H T H H H T T H H
• The probability of obtaining this sequence – in advance of collecting the data – is a function of the unknown parameter θ:
• Pr(data | parameter) = Pr(H H T H H H T T H H | θ)
= θ θ (1 - θ) θ θ θ (1 - θ) (1 - θ) θ θ
= θ^7 (1 - θ)^3
• But the data for our particular sample are fixed: we have already
collected them!
• The parameter θ also has a fixed value, but this value is unknown. We only know that it lies between 0 and 1.
11. Maximum Likelihood Estimation
• The value of θ lies between 0 and 1
• We shall treat the probability of the observed data as a function of θ
• This function is called the likelihood function:
• L(parameter | data) = Pr(H H T H H H T T H H | θ)
= L(θ | H H T H H H T T H H)
= L(θ | data)
= θ^7 (1 - θ)^3
• The probability function and the likelihood function are given by the same formula. But the probability function is a function of the data with the value of the parameter fixed, while the likelihood function is a function of the parameter with the data fixed.
11. Maximum Likelihood Estimation
• Here are some representative values of the likelihood for different values of θ
[Table: the likelihood of observing 7 heads and 3 tails, L(θ) = θ^7 (1 - θ)^3, for different values of θ, the probability of observing a head]
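A minimal sketch in R (not part of the original slides): tabulating the likelihood over a grid of values of θ reproduces the kind of table shown above; the grid itself is an arbitrary choice.

# Likelihood of 7 heads and 3 tails as a function of theta
theta <- seq(0, 1, by = 0.1)
lik   <- theta^7 * (1 - theta)^3
data.frame(theta, likelihood = round(lik, 6))

The likelihood is largest near θ = 0.7, the sample proportion of heads.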
11. Maximum Likelihood Estimation
• The probability of obtaining the sample of data that we have in hand, H H T H H H T T H H, is small regardless of the true value of θ.
• This is usually the case; any specific sample result – including the one that is realised – will have low probability
• Nevertheless, the likelihood contains useful information about the unknown parameter θ
• E.g. θ cannot be zero or one (either value would give the observed data probability 0), and is unlikely to be close to zero or one.
• Reversing this reasoning, the value of θ that is most supported by the data is the one for which the likelihood is largest
• This value is the maximum-likelihood estimate (MLE) of θ
11. Maximum Likelihood Estimation
• More generally, for n independent flips of the coin, producing a particular sequence that includes x heads and n - x tails,
L(θ | data) = Pr(data | θ) = θ^x (1 - θ)^(n - x)
• We want the value of θ that maximises L(θ | data), which we often abbreviate L(θ)
• It is simpler – and equivalent – to find the value of θ that maximises the log of the likelihood
log L(θ) = x log θ + (n - x) log(1 - θ)
11. Maximum Likelihood Estimation
• Differentiating log L(θ) with respect to θ produces
d log L(θ)/dθ = x/θ - (n - x)/(1 - θ)
• Setting the derivative to 0 and solving produces the MLE which, as before, is the sample proportion: θ̂ = x/n
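A minimal sketch in R, checking this result numerically for the coin data (x = 7 heads out of n = 10 flips) with the built-in optimize() function:

# Log-likelihood for the coin example and its numerical maximiser
x <- 7; n <- 10
loglik <- function(theta) x * log(theta) + (n - x) * log(1 - theta)
optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum  # approx. 0.7 = x/n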
11. Maximum Likelihood Estimation
• In greater generality: consider a set of observations x1,…, xn which are modelled as observations of independent discrete random variables with probability function f(x; θ) which depends on some (vector of) parameters θ.
• According to the model, the probability of obtaining the observed data is the product of the probability functions for each observation, i.e.
L(θ; x1,…, xn) = f(x1; θ) × f(x2; θ) × … × f(xn; θ) = ∏ f(xi; θ)
• We seek the parameters of the model that make the data look most probable, in other words, we seek to maximise the likelihood L(θ; x1,…, xn) (a function of the parameters with the data fixed) with respect to θ
11. Maximum Likelihood Estimation
• Equivalently, we seek to maximise the log-likelihood
l(θ; x1,…, xn) = log L(θ; x1,…, xn) = ∑ log f(xi; θ)
• Recall that log(a b) = log(a) + log(b)
• Example: suppose that you have n observations x1,…, xn on independent Poisson distributed random variables, each with probability function
f(x; λ) = λ^x e^(-λ) / x!   for x = 0, 1, 2,…
1. Form the likelihood and then the corresponding log-likelihood
2. Maximise the log-likelihood w.r.t. λ and obtain its MLE
11. Maximum Likelihood Estimation
1. Form the likelihood…
L(λ; x1,…, xn) = ∏ λ^(xi) e^(-λ) / xi! = λ^(∑ xi) e^(-nλ) / ∏ xi!
… and then the corresponding log-likelihood
l(λ) = log L(λ) = (∑ xi) log λ - nλ - ∑ log(xi!)
11. Maximum Likelihood Estimation
2. Maximise the log-likelihood w.r.t. λ and obtain its MLE
dl(λ)/dλ = (∑ xi)/λ - n
so, setting this to zero, the MLE of λ is:
λ̂ = (∑ Xi)/n = X̄
• The expression that we've just derived is an estimator, i.e. a function of the random variables X1,…, Xn
• The value of this function obtained by evaluating it at the observed values x1,…, xn is an estimate.
11. Maximum Likelihood Estimation
The MLE of λ is: λ̂ = X̄ = (∑ Xi)/n
• The expression that we've just derived is an estimator, i.e. a function of the random variables X1,…, Xn
• The value of this function obtained by evaluating it at the observed values x1,…, xn is an estimate.
• Suppose that in this case we have 4 observations: 1, 3, 8 and 2. What is the maximum likelihood estimate?
11. Maximum Likelihood Estimation
The MLE of λ is: λ̂ = X̄
• Suppose that in this case we have 4 observations: 1, 3, 8 and 2. What is the maximum likelihood estimate?
• The maximum likelihood estimate is: λ̂ = (1 + 3 + 8 + 2)/4 = 3.5
• Note that, in general, we should check that we have indeed obtained a maximum of the likelihood, so we should calculate the second derivative d²l(λ)/dλ² and check that it is negative (which it is in this example)… In other words, l(λ) is concave.
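A minimal sketch in R for this example, checking the closed-form answer against a direct numerical maximisation of the Poisson log-likelihood:

# Poisson MLE: the sample mean
obs <- c(1, 3, 8, 2)
mean(obs)                                       # 3.5

# The same answer from maximising the log-likelihood numerically
loglik <- function(lambda) sum(dpois(obs, lambda, log = TRUE))
optimize(loglik, interval = c(0.01, 20), maximum = TRUE)$maximum  # approx. 3.5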
11. Maximum Likelihood Estimation
A more complicated example…
Suppose that you have a series of measurements y1,…, yn of radioactive emission counts from samples of caesium of masses x1,…, xn, respectively. You wish to model the counts as Poisson random variables, where each Yi has mean λxi. Obtain the maximum likelihood estimator of λ (the radioactivity per unit mass)
1. Form the likelihood and then the corresponding log-likelihood
2. Maximise the log-likelihood w.r.t. λ and obtain its MLE
11. Maximum Likelihood Estimation
Suppose that you have a series of measurements y1,…, yn of radioactive emission counts from samples of caesium of masses x1,…, xn, respectively. You wish to model the counts as Poisson random variables, where each Yi has mean λxi. Obtain the maximum likelihood estimator of λ (the radioactivity per unit mass)
1. Form the likelihood and then the corresponding log-likelihood
L(λ) = ∏ (λxi)^(yi) e^(-λxi) / yi!
l(λ) = log L(λ) = (∑ yi) log λ + ∑ yi log xi - λ ∑ xi - ∑ log(yi!)
11. Maximum Likelihood Estimation
Suppose that you have a series of measurements y1,…, yn of radioactive emission counts from samples of caesium of masses x1,…, xn, respectively. You wish to model the counts as Poisson random variables, where each Yi has mean λxi. Obtain the maximum likelihood estimator of λ (the radioactivity per unit mass)
2. Maximise the log-likelihood w.r.t. λ and obtain its MLE
dl(λ)/dλ = (∑ yi)/λ - ∑ xi
and setting this to zero gives the maximum likelihood estimator:
λ̂ = ∑ Yi / ∑ xi
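A minimal sketch in R with made-up masses and counts (purely illustrative, not data from the slides):

# Hypothetical masses (x) and emission counts (y)
x <- c(1.2, 0.8, 2.5, 1.0)
y <- c(30, 18, 64, 27)
sum(y) / sum(x)   # MLE of lambda, the rate per unit mass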
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
• So far, MLE for a single parameter, using discrete data
(Binomial, Poisson)
• Maximum likelihood estimation works as well for continuous
random variables
• The likelihood is the product of the p.d.f.'s of the r.v.'s, evaluated at the observed values
• BUT, the likelihood (or log-likelihood) can no longer be interpreted as a probability of getting the observed data, given θ, only as a probability density of getting the observed data.
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
• BUT, the likelihood (or log-likelihood) can no longer be interpreted as a probability of getting the observed data, given θ, only as a probability density of getting the observed data.
• In practice, this makes no difference. We maximise the
likelihood w.r.t. the parameters as usual.
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
Example 1: The following data are a small part of a dataset on
coal mining disasters; the numbers are times in days between
major disasters: 157, 33, 186, 78, 538, 3.
One model for such data assumes that the times are independent random variables T1,…, T6, all with the same p.d.f. f(t) = λ exp(-λt), t > 0
1. Form the likelihood and then the corresponding log-likelihood
2. Maximise the log-likelihood w.r.t. λ and obtain the MLEs (estimator and estimate)
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
Example 1: The following data are a small part of a dataset on coal mining disasters; the numbers are times in days between major disasters: 157, 33, 186, 78, 538, 3.
One model for such data assumes that the times are independent random variables T1,…, T6, all with the same p.d.f. f(t) = λ exp(-λt)
1. Form the likelihood…
L(λ) = ∏ λ exp(-λti) = λ^6 exp(-λ ∑ ti)
… and then the corresponding log-likelihood
l(λ) = log L(λ) = 6 log λ - λ ∑ ti
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
Example 1: The following data are a small part of a dataset on coal mining disasters; the numbers are times in days between major disasters: 157, 33, 186, 78, 538, 3.
One model for such data assumes that the times are independent random variables T1,…, T6, all with the same p.d.f. f(t) = λ exp(-λt)
2. Maximise the log-likelihood w.r.t. λ and obtain the MLEs (estimator and estimate)
dl(λ)/dλ = 6/λ - ∑ ti
and setting this to zero gives the maximum-likelihood estimator:
λ̂ = 6 / ∑ Ti = 1/T̄
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
Example 1: The following data are a small part of a dataset on coal mining disasters; the numbers are times in days between major disasters: 157, 33, 186, 78, 538, 3.
One model for such data assumes that the times are independent random variables T1,…, T6, all with the same p.d.f. f(t) = λ exp(-λt)
2. Maximise the log-likelihood w.r.t. λ and obtain the MLEs (estimator and estimate)
Plugging in the observed values for T1,…, T6, we get an estimate:
λ̂ = 6 / (157 + 33 + 186 + 78 + 538 + 3) = 6/995 ≈ 0.00603 disasters per day
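A minimal sketch in R for the coal mining example, again checking the closed-form estimate against a direct numerical maximisation:

# Exponential MLE: the reciprocal of the sample mean
t <- c(157, 33, 186, 78, 538, 3)
1 / mean(t)                                     # approx. 0.00603

# The same answer from the log-likelihood directly
loglik <- function(lambda) sum(dexp(t, rate = lambda, log = TRUE))
optimize(loglik, interval = c(1e-6, 1), maximum = TRUE)$maximum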
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
• Example 2: Suppose that we have some observations x1,…, xn which we wish to model as observations of i.i.d. r.v.'s from a normal distribution with unknown mean μ and unknown variance σ², to be estimated.
1. Form the likelihood and then the corresponding log-likelihood
2. Maximise the log-likelihood w.r.t. μ and σ² and obtain the MLEs
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
1. Form the likelihood and then the corresponding log-likelihood
L(μ, σ²) = ∏ (1/√(2πσ²)) exp(-(xi - μ)²/(2σ²))
l(μ, σ²) = log L(μ, σ²) = -(n/2) log(2πσ²) - (1/(2σ²)) ∑ (xi - μ)²
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
2. Maximise the log-likelihood w.r.t. μ and σ² and obtain the MLEs
First, we find the partial derivative w.r.t. μ
∂l/∂μ = (1/σ²) ∑ (xi - μ)
and setting this to zero gives:
∑ (xi - μ) = 0
so,
μ̂ = (∑ Xi)/n = X̄
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
2. Maximise the log-likelihood w.r.t. μ and σ² and obtain the MLEs
Then, we find the partial derivative w.r.t. σ²
∂l/∂σ² = -n/(2σ²) + (1/(2σ⁴)) ∑ (xi - μ)²
and setting this to zero (with μ replaced by μ̂ = X̄) gives:
nσ² = ∑ (xi - X̄)²
so,
σ̂² = (1/n) ∑ (Xi - X̄)²
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
To sum up, we have that:
μ̂ = X̄ = (∑ Xi)/n
and
σ̂² = (1/n) ∑ (Xi - X̄)²
• Note that the maximum likelihood estimator of the variance σ² is NOT the sample variance s² = (1/(n - 1)) ∑ (Xi - X̄)²
• In general, MLEs are biased (but the bias tends to zero as the sample size gets larger)
• The MLEs do have the advantage of being consistent
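A minimal sketch in R illustrating the difference: var() uses the divisor n - 1, whereas the MLE divides by n (the data vector here is arbitrary).

# MLE of the variance vs the usual sample variance
x <- c(4.1, 5.3, 6.0, 4.8, 5.5)
mle_var <- mean((x - mean(x))^2)   # divides by n
s2      <- var(x)                  # divides by n - 1
c(mle = mle_var, sample = s2)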
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
• Example 3: Suppose that we have some observations x1,…, xn which we wish to model as observations of i.i.d. r.v.'s from a Weibull distribution with unknown shape and scale parameters, to be estimated.
1. Form the log-likelihood
2. Maximise the log-likelihood w.r.t. the two parameters and obtain the MLEs
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
• Example 3: Suppose that we have some observations x1,…, xn which we wish to model as observations of i.i.d. r.v.'s from a Weibull distribution with unknown shape and scale parameters, to be estimated.
1. Form the log-likelihood
11. Maximum Likelihood Estimation
Likelihood for continuous distributions
• Example 3: Suppose that we have some observations x1,…, xn which we wish to model as observations of i.i.d. r.v.'s from a Weibull distribution with unknown shape and scale parameters, to be estimated.
2. Maximise the log-likelihood w.r.t. the two parameters and obtain the MLEs
One ends up with a nonlinear equation in the shape parameter that cannot be solved in closed form.
We need to use optimisation routines, e.g. optim in R, to find the maximum of the log-likelihood (or, equivalently, the minimum of the negative log-likelihood)
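A minimal sketch of this approach in R, using optim() to minimise the negative Weibull log-likelihood (the data are simulated here purely for illustration; the starting values are arbitrary):

# Simulated data standing in for real observations
set.seed(1)
x <- rweibull(50, shape = 1.5, scale = 2)

# Negative log-likelihood as a function of (shape, scale)
negloglik <- function(par) {
  shape <- par[1]; scale <- par[2]
  if (shape <= 0 || scale <= 0) return(Inf)    # keep parameters admissible
  -sum(dweibull(x, shape = shape, scale = scale, log = TRUE))
}

fit <- optim(c(1, 1), negloglik)               # Nelder-Mead by default
fit$par                                        # MLEs of the shape and scale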
11. Maximum Likelihood Estimation
Invariance of MLEs
The invariance property of maximum likelihood estimators: if θ̂ is the MLE of θ and g is a one-to-one function, then g(θ̂) is the MLE of g(θ).
Example 1: suppose that x1,…, xn are observations ~ N(μ, σ²). Find the MLE of σ.
We saw that the MLE for σ² is:
σ̂² = (1/n) ∑ (Xi - X̄)²
If we consider the one-to-one function:
g(σ²) = √(σ²) = σ   (for σ² > 0)
Then the invariance property says that the MLE for σ is:
σ̂ = √( (1/n) ∑ (Xi - X̄)² )
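A minimal sketch in R of this invariance in action (the data vector is arbitrary): the MLE of σ is simply the square root of the MLE of σ².

# MLE of sigma via invariance
x <- c(4.1, 5.3, 6.0, 4.8, 5.5)
mle_var   <- mean((x - mean(x))^2)   # MLE of sigma^2
mle_sigma <- sqrt(mle_var)           # MLE of sigma, by invariance
mle_sigma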
11. Maximum Likelihood Estimation
Invariance of MLEs
The invariance property of maximum likelihood estimators:
Example 2: suppose that x1,…, xk are observations on independent binomial r.v.'s, each with n trials and unknown probability p. The likelihood of p is:
L(p) = ∏ (n choose xi) p^(xi) (1 - p)^(n - xi) = [∏ (n choose xi)] p^(∑ xi) (1 - p)^(nk - ∑ xi)
Find the MLE of p, and deduce the MLE of the mean of the Bin(n, p) distribution
11. Maximum Likelihood Estimation
Invariance of MLEs
Example: suppose that x1,…, xk are observations on independent binomial r.v.'s, each with n trials and unknown probability p. Find the MLE of p, and deduce the MLE of the mean of the Bin(n, p) distribution
The log-likelihood is:
l(p) = ∑ log(n choose xi) + (∑ xi) log p + (nk - ∑ xi) log(1 - p)
11. Maximum Likelihood Estimation
Invariance of MLEs
Example: suppose that x1,…, xk are observations on independent binomial r.v.'s, each with n trials and unknown probability p. Find the MLE of p, and deduce the MLE of the mean of the Bin(n, p) distribution
Differentiating the log-likelihood w.r.t. p gives
dl(p)/dp = (∑ xi)/p - (nk - ∑ xi)/(1 - p)
and setting this to zero gives:
p̂ = (∑ xi)/(nk)
11. Maximum Likelihood Estimation
Invariance of MLEs
Example: suppose that x1,…, xk are observations on independent binomial r.v.’s, each with n
trials and unknown probability p. Find the MLE of p, and deduce the MLE of the mean of
the Bin(n,p) distribution
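A minimal sketch in R of this example with made-up data (k = 5 binomial observations, each with n = 20 trials; the counts are hypothetical):

# MLE of p and, by invariance, of the binomial mean n*p
n <- 20
x <- c(7, 11, 9, 12, 8)            # hypothetical counts of successes
k <- length(x)
p_hat    <- sum(x) / (n * k)       # MLE of p
mean_hat <- n * p_hat              # MLE of the mean, equals mean(x)
c(p_hat = p_hat, mean_hat = mean_hat)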