Math 362: Mathematical Statistics II
Le Chen
le.chen@emory.edu
Emory University
Atlanta, GA
Last updated on April 13, 2021
2021 Spring
Chapter 5. Estimation
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
§ 5.1 Introduction
Motivating example: Given an unfair coin, or p-coin, such that

    X = 1 (head) with probability p,
    X = 0 (tail) with probability 1 − p,

how would you determine the value of p?
Solution:
1. Try the coin several times, say, three times. Suppose what you obtain is "HHT".
2. Draw a conclusion from the experiment you just made.
Rationale: The choice of the parameter p should be the value that
maximizes the probability of the sample:

    P(X1 = 1, X2 = 1, X3 = 0) = P(X1 = 1) P(X2 = 1) P(X3 = 0) = p²(1 − p).

Maximize f(p) = p²(1 − p) ...

    # Hello, R.
    p <- seq(0, 1, 0.01)
    plot(p, p^2 * (1 - p), type = "l", col = "red")
    title("Likelihood")
    # add a vertical dotted (lty = 4) blue line
    abline(v = 0.67, col = "blue", lty = 4)
    # add some text
    text(0.67, 0.01, "2/3")
A random sample of size n from the population – Bernoulli(p):
– X1, · · · , Xn are i.i.d.¹ random variables, each following Bernoulli(p).
– Suppose the outcomes of the random sample are: X1 = k1, · · · , Xn = kn.
– What is your choice of p based on the above random sample?

    p = (1/n) Σ_{i=1}^n ki =: k̄.

¹ independent and identically distributed
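A quick numerical illustration (my own sketch, not from the slides): the following R lines simulate a Bernoulli(p) sample and compute the sample mean as the estimate of p; the sample size n = 10, the true p = 0.6, and the seed are made-up values chosen for reproducibility.

    set.seed(1)                          # for reproducibility
    n <- 10                              # hypothetical sample size
    p <- 0.6                             # hypothetical true parameter
    k <- rbinom(n, size = 1, prob = p)   # a Bernoulli(p) random sample
    p_est <- mean(k)                     # the estimate: p = k-bar
    p_est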
A random sample of size n from the population with given pdf:
– X1, · · · , Xn are i.i.d. random variables, each following the same given pdf.
– A statistic or an estimator is a function of the random sample.
  A statistic/estimator is a random variable! E.g.,

      p̂ = (1/n) Σ_{i=1}^n Xi.

– The outcome of a statistic/estimator is called an estimate. E.g.,

      pₑ = (1/n) Σ_{i=1}^n ki.
§ 5.2 Estimating parameters: MLE and MME
Two methods for estimating parameters, and the corresponding estimators:
1. Method of maximum likelihood  →  MLE
2. Method of moments             →  MME
Maximum Likelihood Estimation

Definition 5.2.1. For a random sample of size n from the discrete (resp.
continuous) population/pdf pX(k; θ) (resp. fY(y; θ)), the likelihood function
L(θ) is the product of the pdf evaluated at Xi = ki (resp. Yi = yi), i.e.,

    L(θ) = ∏_{i=1}^n pX(ki; θ)    (resp. L(θ) = ∏_{i=1}^n fY(yi; θ)).

Definition 5.2.2. Let L(θ) be as defined in Definition 5.2.1. If θₑ is a value
of the parameter such that L(θₑ) ≥ L(θ) for all possible values of θ, then we
call θₑ the maximum likelihood estimate for θ.
Examples for MLE

Often, but not always, the MLE can be obtained by setting the first derivative
equal to zero.

E.g. 1. Poisson distribution: pX(k) = e^{−λ} λ^k / k!, k = 0, 1, · · · .

    L(λ) = ∏_{i=1}^n e^{−λ} λ^{ki} / ki! = e^{−nλ} λ^{Σ_{i=1}^n ki} (∏_{i=1}^n ki!)^{−1}.

    ln L(λ) = −nλ + (Σ_{i=1}^n ki) ln λ − ln(∏_{i=1}^n ki!).

    (d/dλ) ln L(λ) = −n + (1/λ) Σ_{i=1}^n ki.

    (d/dλ) ln L(λ) = 0   ⟹   λₑ = (1/n) Σ_{i=1}^n ki =: k̄.

Comment: The critical point is indeed a global maximum because

    (d²/dλ²) ln L(λ) = −(1/λ²) Σ_{i=1}^n ki < 0.
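As an illustrative check (my own sketch, not part of the original slides), the closed form λₑ = k̄ can be compared with a direct numerical maximization of the Poisson log-likelihood in base R; the rate λ = 3 and sample size 50 below are made up.

    set.seed(42)
    k <- rpois(50, lambda = 3)                                   # hypothetical Poisson sample
    negloglik <- function(lam) -sum(dpois(k, lam, log = TRUE))   # minus the log-likelihood
    fit <- optimize(negloglik, interval = c(0.01, 20))           # numerical minimizer
    c(closed_form = mean(k), numerical = fit$minimum)            # the two should agree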
The following two cases are related to waiting times.

E.g. 2. Exponential distribution: fY(y) = λ e^{−λy} for y ≥ 0.

    L(λ) = ∏_{i=1}^n λ e^{−λ yi} = λ^n exp(−λ Σ_{i=1}^n yi).

    ln L(λ) = n ln λ − λ Σ_{i=1}^n yi.

    (d/dλ) ln L(λ) = n/λ − Σ_{i=1}^n yi.

    (d/dλ) ln L(λ) = 0   ⟹   λₑ = n / Σ_{i=1}^n yi =: 1/ȳ.
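A minimal R sketch of this case (my own illustration; the true rate 2 and sample size 60 are made up), plotting the log-likelihood and marking λₑ = 1/ȳ:

    set.seed(7)
    y <- rexp(60, rate = 2)                                        # hypothetical sample
    lam_e <- 1 / mean(y)                                           # closed-form MLE
    loglik <- function(lam) sum(dexp(y, rate = lam, log = TRUE))
    curve(Vectorize(loglik)(x), from = 0.5, to = 5,
          xlab = "lambda", ylab = "log-likelihood")                # the curve peaks at lam_e
    abline(v = lam_e, col = "blue", lty = 4)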
A random sample of size n from the following population:

E.g. 3. Gamma distribution: fY(y; λ) = λ^r y^{r−1} e^{−λy} / Γ(r) for y ≥ 0, with r > 1 known.

    L(λ) = ∏_{i=1}^n (λ^r / Γ(r)) yi^{r−1} e^{−λ yi}
         = λ^{rn} Γ(r)^{−n} (∏_{i=1}^n yi^{r−1}) exp(−λ Σ_{i=1}^n yi).

    ln L(λ) = rn ln λ − n ln Γ(r) + ln(∏_{i=1}^n yi^{r−1}) − λ Σ_{i=1}^n yi.

    (d/dλ) ln L(λ) = rn/λ − Σ_{i=1}^n yi.

    (d/dλ) ln L(λ) = 0   ⟹   λₑ = rn / Σ_{i=1}^n yi = r/ȳ.

Comment:
– When r = 1, this reduces to the exponential distribution case.
– If r is also unknown, it becomes much more complicated: there is no
  closed-form solution, and one needs a numerical solver². Try MME instead.

² [DW, Example 7.2.25]
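Since the slide only points to a numerical solver for the case where r is also unknown, here is a minimal sketch (my own illustration, not the treatment in [DW, Example 7.2.25]) of how that could be done in base R; the true values r = 3, λ = 2 are made up, and the optimization runs over log-parameters so that both stay positive.

    set.seed(11)
    y <- rgamma(200, shape = 3, rate = 2)            # hypothetical sample: r = 3, lambda = 2
    negloglik <- function(logpar) {                  # logpar = c(log r, log lambda)
      r <- exp(logpar[1]); lam <- exp(logpar[2])
      -sum(dgamma(y, shape = r, rate = lam, log = TRUE))
    }
    fit <- optim(c(0, 0), negloglik)                 # start at r = lambda = 1
    exp(fit$par)                                     # numerical MLE of (r, lambda)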
A detailed study with data:

E.g. 4. Geometric distribution: pX(k; p) = (1 − p)^{k−1} p, k = 1, 2, · · · .

    L(p) = ∏_{i=1}^n (1 − p)^{ki−1} p = (1 − p)^{−n + Σ_{i=1}^n ki} p^n.

    ln L(p) = (−n + Σ_{i=1}^n ki) ln(1 − p) + n ln p.

    (d/dp) ln L(p) = −(−n + Σ_{i=1}^n ki)/(1 − p) + n/p.

    (d/dp) ln L(p) = 0   ⟹   pₑ = n / Σ_{i=1}^n ki = 1/k̄.

Comment: Its cousin distribution, the negative binomial distribution,
can be worked out similarly (see Ex 5.2.14).
MLE for Geom. Distr. of sample size n = 128

    k    Observed frequency    Predicted frequency
    1            72                  74.14
    2            35                  31.2
    3            11                  13.13
    4             6                   5.52
    5             2                   2.32
    6             2                   0.98

Real p = unknown and MLE for p = 0.5792
    # The example from the book.
    library(pracma)      # load the library "Practical Numerical Math Functions" (for dot())
    library(gridExtra)   # needed for grid.table() below
    k <- c(72, 35, 11, 6, 2, 2)   # observed freq.
    a <- 1:6
    pe <- sum(k) / dot(k, a)      # MLE for p.
    f <- a
    for (i in 1:6) {
      f[i] <- round((1 - pe)^(i - 1) * pe * sum(k), 2)
    }
    # Initialize the table
    d <- matrix(1:18, nrow = 6, ncol = 3)
    # Now add the column names
    colnames(d) <- c("k", "Observed freq.", "Predicted freq.")
    d[1:6, 1] <- a
    d[1:6, 2] <- k
    d[1:6, 3] <- f
    grid.table(d)   # show the table
    PlotResults("unknown", pe, d, "Geometric.pdf")   # output the results using a user-defined function
MLE for Geom. Distr. of sample size n = 128

    k    Observed frequency    Predicted frequency
    1            42                  40.96
    2            31                  27.85
    3            15                  18.94
    4            11                  12.88
    5             9                   8.76
    6             5                   5.96
    7             7                   4.05
    8             2                   2.75
    9             1                   1.87
    10            2                   1.27
    11            1                   0.87
    13            1                   0.59
    14            1                   0.4

Real p = 0.3333 and MLE for p = 0.32
    # Now let's generate random samples from a Geometric distribution with p = 1/3,
    # with the same size of the sample.  (Requires library(pracma) for dot(), as above.)
    p <- 1/3
    n <- 128
    gdata <- rgeom(n, p) + 1                          # generate random samples
    g <- table(gdata)                                 # count frequency of your data
    g <- t(rbind(as.numeric(rownames(g)), g))         # transpose and combine two columns
    pe <- n / dot(g[, 1], g[, 2])                     # MLE for p
    f <- g[, 1]                                       # initialize f
    for (i in 1:nrow(g)) {
      f[i] <- round((1 - pe)^(i - 1) * pe * n, 2)
    }                                                 # compute the expected frequency
    g <- cbind(g, f)                                  # add one column to your matrix
    colnames(g) <- c("k", "Observed freq.", "Predicted freq.")   # specify the column names
    g_df <- as.data.frame(g)                          # one can use a data frame to store data
    g_df                                              # show the data on your terminal
    PlotResults(p, pe, g, "Geometric2.pdf")           # output the results using a user-defined function
MLE for Geom. Distr. of sample size n = 300

    k    Observed frequency    Predicted frequency
    1            99                 105.88
    2            69                  68.51
    3            47                  44.33
    4            28                  28.69
    5            27                  18.56
    6             9                  12.01
    7             8                   7.77
    8             5                   5.03
    9             5                   3.25
    10            3                   2.11

Real p = 0.3333 and MLE for p = 0.3529
In case we have several parameters:

E.g. 5. Normal distribution: fY(y; µ, σ²) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}, y ∈ R.

    L(µ, σ²) = ∏_{i=1}^n (1/(√(2π) σ)) e^{−(yi−µ)²/(2σ²)}
             = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (yi − µ)² ).

    ln L(µ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (yi − µ)².

    ∂/∂µ  ln L(µ, σ²) = (1/σ²) Σ_{i=1}^n (yi − µ),
    ∂/∂σ² ln L(µ, σ²) = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (yi − µ)².

Setting both partial derivatives equal to zero gives

    µₑ = ȳ   and   σ²ₑ = (1/n) Σ_{i=1}^n (yi − ȳ)².
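A small R check of these formulas (my own sketch; the values µ = 5, σ = 2 and n = 1000 are made up). Note that the MLE of σ² uses the 1/n divisor, whereas R's var() uses 1/(n − 1).

    set.seed(3)
    y <- rnorm(1000, mean = 5, sd = 2)       # hypothetical normal sample
    mu_e     <- mean(y)                      # MLE of mu
    sigma2_e <- mean((y - mean(y))^2)        # MLE of sigma^2 (divisor n)
    c(mu_e = mu_e, sigma2_e = sigma2_e, var_with_n_minus_1 = var(y))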
In case the parameters determine the support of the density (non-regular case):

E.g. 6. Uniform distribution on [a, b] with a < b: fY(y; a, b) = 1/(b − a) if y ∈ [a, b].

    L(a, b) = ∏_{i=1}^n 1/(b − a) = 1/(b − a)^n   if a ≤ y1, · · · , yn ≤ b,
    L(a, b) = 0                                   otherwise.

L(a, b) is monotone increasing in a and decreasing in b. Hence, in
order to maximize L(a, b), one needs to choose

    aₑ = ymin   and   bₑ = ymax.

E.g. 7. fY(y; θ) = 2y/θ² for y ∈ [0, θ].

    L(θ) = ∏_{i=1}^n 2yi/θ² = 2^n θ^{−2n} ∏_{i=1}^n yi   if 0 ≤ y1, · · · , yn ≤ θ,
    L(θ) = 0                                             otherwise.

    ⟹   θₑ = ymax.
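An illustration of the non-regular case in E.g. 7 (my own sketch; θ = 4 is a made-up value): since the cdf is F(y) = y²/θ² on [0, θ], samples can be drawn by the inverse transform Y = θ√U, and the MLE is simply the sample maximum.

    set.seed(5)
    theta <- 4                          # assumed true parameter
    y <- theta * sqrt(runif(50))        # inverse transform: F(y) = y^2 / theta^2
    theta_e <- max(y)                   # MLE: the sample maximum
    theta_e                             # always <= theta, slightly below it on average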
In case of a discrete parameter:

E.g. 8. Wildlife sampling (capture–tag–recapture). In the past, a tags have
been put on animals in the population. In order to estimate the population size N,
one randomly captures n animals, and k of them are tagged. Find the MLE for N.

Sol. The number of tagged animals among the n recaptured follows a
hypergeometric distribution (here C(m, j) denotes the binomial coefficient "m choose j"):

    pX(k; N) = C(a, k) C(N − a, n − k) / C(N, n),

so the likelihood is

    L(N) = C(a, k) C(N − a, n − k) / C(N, n).

How to maximize L(N)?

    > a=10
    > k=5
    > n=20
    > N=seq(a,a+100)
    > p=choose(a,k)*choose(N-a,n-k)/choose(N,n)
    > plot(N,p,type = "p")
    > print(paste("The MLE is", n*a/k))
    [1] "The MLE is 40"
The graph suggests studying the following ratio:

    r(N) := L(N)/L(N − 1) = [(N − n)/N] × [(N − a)/(N − a − n + k)].

    r(N) < 1   ⟺   na < Nk,   i.e., N > na/k.

Hence L(N) increases while N < na/k and decreases afterwards, so

    Nₑ = arg max{ L(N) : N ∈ {⌊na/k⌋, ⌈na/k⌉} }.
Method of Moments Estimation

Rationale: The population moments should be close to the sample moments, i.e.,

    E(Y^k) ≈ (1/n) Σ_{i=1}^n yi^k,   k = 1, 2, 3, · · · .

Definition 5.2.3. For a random sample of size n from the discrete (resp.
continuous) population/pdf pX(k; θ1, · · · , θs) (resp. fY(y; θ1, · · · , θs)),
the solutions of the system

    E(Y)   = (1/n) Σ_{i=1}^n yi,
      ...
    E(Y^s) = (1/n) Σ_{i=1}^n yi^s,

which are denoted by θ₁ₑ, · · · , θₛₑ, are called the method of moments
estimates of θ1, · · · , θs.
Examples for MME

The MME is often the same as the MLE.

E.g. 1. Normal distribution: fY(y; µ, σ²) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}, y ∈ R.

    µ = E(Y) = (1/n) Σ_{i=1}^n yi = ȳ,
    σ² + µ² = E(Y²) = (1/n) Σ_{i=1}^n yi²,

⟹
    µₑ = ȳ,
    σ²ₑ = (1/n) Σ_{i=1}^n yi² − µₑ² = (1/n) Σ_{i=1}^n (yi − ȳ)².

More examples where the MLE coincides with the MME: Poisson, Exponential, Geometric.
The MME is often much more tractable than the MLE.

E.g. 2. Gamma distribution³: fY(y; r, λ) = λ^r y^{r−1} e^{−λy} / Γ(r) for y ≥ 0.

    r/λ = E(Y) = (1/n) Σ_{i=1}^n yi = ȳ,
    r/λ² + r²/λ² = E(Y²) = (1/n) Σ_{i=1}^n yi²,

⟹
    rₑ = ȳ²/σ̂²,
    λₑ = ȳ/σ̂² = rₑ/ȳ,

where ȳ is the sample mean and σ̂² is the sample variance: σ̂² := (1/n) Σ_{i=1}^n (yi − ȳ)².

Comment: The MME for λ is consistent with the MLE when r is known.

³ Check Theorem 4.6.3 on p. 269 for the mean and variance.
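A short R illustration of these moment estimates (my own sketch; the true values r = 3, λ = 2 are made up), using the 1/n sample variance as defined above:

    set.seed(9)
    y <- rgamma(500, shape = 3, rate = 2)    # hypothetical sample: r = 3, lambda = 2
    ybar <- mean(y)
    s2   <- mean((y - ybar)^2)               # sample variance with the 1/n divisor
    r_e   <- ybar^2 / s2                     # MME of r
    lam_e <- ybar / s2                       # MME of lambda
    c(r_e = r_e, lambda_e = lam_e)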
Another tractable example for the MME, while less tractable for the MLE:

E.g. 3. Neg. binomial distribution: pX(k; p, r) = C(k + r − 1, k) (1 − p)^k p^r, k = 0, 1, · · · .

    r(1 − p)/p  = E(X)   = k̄,
    r(1 − p)/p² = Var(X) = σ̂²,

⟹
    pₑ = k̄/σ̂²,
    rₑ = k̄²/(σ̂² − k̄).

From a data example: rₑ = 12.74 and pₑ = 0.391.
E.g. 4. fY(y; θ) = 2y/θ² for y ∈ [0, θ].

    ȳ = E[Y] = ∫₀^θ 2y²/θ² dy = (2/(3θ²)) y³ |_{y=0}^{y=θ} = (2/3) θ.

    ⟹   θₑ = (3/2) ȳ.
§ 5.3 Interval Estimation

Rationale. A point estimate alone does not convey how precise it is.
By using the variance of the estimator, one can construct an interval that,
with high probability, will contain the unknown parameter.
– The interval is called a confidence interval.
– The high probability is called the confidence level.
E.g. 1. A random sample of size 4, (Y1 = 6.5, Y2 = 9.2, Y3 = 9.9, Y4 = 12.4),
from a normal population:

    fY(y; µ) = (1/(√(2π) · 0.8)) exp( −(1/2) ((y − µ)/0.8)² ).

Both MLE and MME give µₑ = ȳ = (1/4)(6.5 + 9.2 + 9.9 + 12.4) = 9.5.
The estimator µ̂ = Ȳ follows a normal distribution.
Construct a 95%-confidence interval for µ ...
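The interval itself is a short computation; a minimal R sketch of it (the critical value 1.96 is z_{0.025}) gives ȳ ± 1.96 · 0.8/√4 ≈ (8.72, 10.28):

    y <- c(6.5, 9.2, 9.9, 12.4)
    sigma <- 0.8                      # known population standard deviation
    z <- qnorm(0.975)                 # about 1.96
    mean(y) + c(-1, 1) * z * sigma / sqrt(length(y))   # 95% confidence interval for mu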
“The parameter is an unknown constant and no probability
statement concerning its value may be made.”
–Jerzy Neyman, original developer of confidence intervals.
In general, for a normal population with σ known, the 100(1 − α)%
confidence interval for µ is

    ( ȳ − z_{α/2} σ/√n ,  ȳ + z_{α/2} σ/√n ).

Comment: There are many variations.
1. One-sided intervals such as ( ȳ − z_α σ/√n , ȳ ) or ( ȳ , ȳ + z_α σ/√n ).
2. σ unknown and sample size small: z-score → t-score.
3. σ unknown and sample size large: z-score, by the CLT.
4. Non-Gaussian population but sample size large: z-score, by the CLT.
Theorem. Let k be the number of successes in n independent trials, where
n is large and p = P(success) is unknown. An approximate 100(1 − α)%
confidence interval for p is the set of numbers

    ( k/n − z_{α/2} √((k/n)(1 − k/n)/n) ,  k/n + z_{α/2} √((k/n)(1 − k/n)/n) ).

Proof: It follows from the following facts:
– X ∼ binomial(n, p) iff X = Y1 + · · · + Yn, where the Yi are i.i.d. Bernoulli(p):
  E[Yi] = p and Var(Yi) = p(1 − p).
– Central Limit Theorem: Let W1, W2, · · · , Wn be a sequence of i.i.d.
  random variables whose distribution has mean µ and variance σ²; then

      (Σ_{i=1}^n Wi − nµ) / √(nσ²)   approximately follows N(0, 1) when n is large.
– When the sample size n is large, by the central limit theorem,

      (Σ_{i=1}^n Yi − np) / √(np(1 − p))   approximately ∼ N(0, 1),

  and

      (X − np)/√(np(1 − p)) = (X/n − p)/√(p(1 − p)/n) ≈ (X/n − p)/√(pₑ(1 − pₑ)/n).

– Since pₑ = k/n, we see that

      P( −z_{α/2} ≤ (X/n − p)/√((k/n)(1 − k/n)/n) ≤ z_{α/2} ) ≈ 1 − α,

  i.e., the 100(1 − α)% confidence interval for p is

      ( k/n − z_{α/2} √((k/n)(1 − k/n)/n) ,  k/n + z_{α/2} √((k/n)(1 − k/n)/n) ).
E.g. 1. Use the median test to check the randomness of a random generator.
Suppose y1, · · · , yn denote measurements presumed to have come
from a continuous pdf fY(y). Let k denote the number of yi's that
are less than the median of fY(y). If the sample is random, we would
expect the difference between k/n and 1/2 to be small. More specifically,
a 95% confidence interval based on k should contain the value 0.5.
Let fY(y) = e^{−y}. The median is m = 0.69315.
    #! /usr/bin/Rscript
    main <- function() {
      args <- commandArgs(trailingOnly = TRUE)
      n <- 100                  # number of random samples
      r <- as.numeric(args[1])  # rate of the exponential
      # Check if the rate argument is given.
      if (is.na(r)) return("Please provide the rate and try again.")
      # Now start computing ...
      f <- function(y) pexp(y, rate = r) - 0.5
      m <- uniroot(f, lower = 0, upper = 100, tol = 1e-9)$root
      print(paste("For rate", r, "exponential distribution,",
                  "the median is equal to", round(m, 3)))
      data <- rexp(n, r)                           # generate n random samples
      data <- round(data, 3)                       # round to 3 digits after the decimal
      data <- matrix(data, nrow = 10, ncol = 10)   # turn the data into a matrix
      prmatrix(data)                               # show the data on the terminal
      k <- sum(data > m)                           # count how many entries are bigger than m
      lowerbd <- k/n - 1.96 * sqrt((k/n) * (1 - k/n) / n)
      upperbd <- k/n + 1.96 * sqrt((k/n) * (1 - k/n) / n)
      print(paste("The 95% confidence interval is (",
                  round(lowerbd, 3), ",",
                  round(upperbd, 3), ")"))
    }
    main()

Try it from the command line ...
Instead of the C.I.

    ( k/n − z_{α/2} √((k/n)(1 − k/n)/n) ,  k/n + z_{α/2} √((k/n)(1 − k/n)/n) ),

one can simply report the point estimate k/n together with the margin of error

    d := z_{α/2} √((k/n)(1 − k/n)/n).

Since max_{p∈(0,1)} p(1 − p) = p(1 − p)|_{p=1/2} = 1/4, we always have

    d ≤ z_{α/2}/(2√n) =: dm.
Comment:
1. When p is close to 1/2, d ≈ z_{α/2}/(2√n), which is equivalent to σ_p ≈ 1/(2√n).
   E.g., with n = 1000, k/n = 0.48, and α = 5%:

       d = 1.96 √(0.48 × 0.52/1000) = 0.03097   and   dm = 1.96/(2√1000) = 0.03099,
       σ_p = √(0.48 × 0.52/1000) = 0.01579873   and   σ_p ≈ 1/(2√1000) = 0.01581139.

2. When p is away from 1/2, the discrepancy between d and dm becomes big....
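These numbers are easy to reproduce; a short R check (my own) matching the values above:

    n <- 1000; phat <- 0.48; z <- qnorm(0.975)
    c(d  = z * sqrt(phat * (1 - phat) / n),    # about 0.03097
      dm = z / (2 * sqrt(n)))                  # about 0.03099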
E.g. Running for presidency. Max and Sirius obtained 480 and 520 votes,
respectively. What is the probability that Max will win?
What if the sample size is n = 5000 and Max obtained 2400 votes?
Choosing sample sizes

    z_{α/2} √(p(1 − p)/n) ≤ d   ⟺   n ≥ z²_{α/2} p(1 − p)/d²     (when p is known)
    z_{α/2}/(2√n) ≤ d           ⟺   n ≥ z²_{α/2}/(4d²)           (when p is unknown)

E.g. Anti-smoking campaign. We need a 95% C.I. with a margin of error equal to 1%.
Determine the sample size.
Answer: n ≥ 1.96²/(4 × 0.01²) = 9604.

E.g.' In order to reduce the sample size, a small pilot sample is first used to get a
rough value of p. One finds that p ≈ 0.22. Determine the sample size again.
Answer: n ≥ 1.96² × 0.22 × 0.78/0.01² = 6592.2.
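A short R check of both sample-size calculations (my own; it reproduces the figures above up to the rounding of z_{α/2}):

    z <- qnorm(0.975); d <- 0.01
    ceiling(z^2 / (4 * d^2))                # worst case, p unknown: 9604
    p <- 0.22
    ceiling(z^2 * p * (1 - p) / d^2)        # with the pilot estimate p = 0.22: about 6592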
§ 5.4 Properties of Estimators

Question: Estimators are not in general unique (MLE or MME ...). How should one
select an estimator?

Recall: For a random sample of size n from the population with given pdf,
we have X1, · · · , Xn, which are i.i.d. r.v.'s. The estimator θ̂ is a function of the Xi's:

    θ̂ = θ̂(X1, · · · , Xn).

Criteria:
1. Unbiasedness.                                   (mean)
2. Efficiency, the minimum-variance estimator.     (variance)
3. Sufficiency.
4. Consistency.                                    (asymptotic behavior)
Unbiasedness

Definition 5.4.1. Given a random sample of size n whose population
distribution depends on an unknown parameter θ, let θ̂ be an estimator of θ.
Then θ̂ is called unbiased if E(θ̂) = θ,
and θ̂ is called asymptotically unbiased if lim_{n→∞} E(θ̂) = θ.
E.g. 1. fY(y; θ) = 2y/θ² if y ∈ [0, θ].

    – θ̂1 = (3/2) Ȳ
    – θ̂2 = Ymax.
    – θ̂3 = ((2n + 1)/(2n)) Ymax.

E.g. 2. Let X1, · · · , Xn be a random sample of size n with the unknown
parameter θ = E(X). Show that for any constants ai's,

    θ̂ = Σ_{i=1}^n ai Xi   is unbiased   ⟺   Σ_{i=1}^n ai = 1.
E.g. 3. Let X1, · · · , Xn be a random sample of size n with the unknown
parameter σ² = Var(X).

    – σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)²

    – S² = sample variance = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

    – S = sample standard deviation = √( (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)² ).   (Biased for σ!)
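A quick Monte Carlo illustration of the bias (my own sketch; standard normal data, n = 5, 10,000 replications): the 1/n estimator underestimates σ² = 1 on average, the 1/(n − 1) version does not, and even S is biased for σ = 1.

    set.seed(2)
    n <- 5
    sims <- replicate(10000, {
      x <- rnorm(n)                           # true sigma^2 = 1
      c(mle = mean((x - mean(x))^2),          # 1/n divisor
        s2  = var(x),                         # 1/(n - 1) divisor
        s   = sd(x))
    })
    rowMeans(sims)    # roughly 0.8, 1.0, and 0.94: only s2 is unbiased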
E.g. 4. Exponential distr.: fY(y; λ) = λ e^{−λy} for y ≥ 0. λ̂ = 1/Ȳ is biased.

    nȲ = Σ_{i=1}^n Yi ∼ Gamma(n, λ). Hence,

    E(λ̂) = E(1/Ȳ) = n ∫₀^∞ (1/y) (λ^n/Γ(n)) y^{n−1} e^{−λy} dy
          = (nλ/(n − 1)) ∫₀^∞ (λ^{n−1}/Γ(n − 1)) y^{(n−1)−1} e^{−λy} dy
            [the integrand is the pdf of Gamma(n − 1, λ)]
          = (n/(n − 1)) λ.

Biased! But E(λ̂) = (n/(n − 1)) λ → λ as n → ∞. (Asymptotically unbiased.)

Note: λ̂* = (n − 1)/(nȲ) is unbiased.

E.g. 4'. Exponential distr.: fY(y; θ) = (1/θ) e^{−y/θ} for y ≥ 0. θ̂ = Ȳ is unbiased:

    E(θ̂) = (1/n) Σ_{i=1}^n E(Yi) = (1/n) Σ_{i=1}^n θ = θ.
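A short simulation (my own; λ = 2, n = 5, 20,000 replications) illustrating the factor n/(n − 1) = 1.25 and the corrected estimator λ̂*:

    set.seed(6)
    lambda <- 2; n <- 5
    lam_hat <- replicate(20000, 1 / mean(rexp(n, rate = lambda)))
    c(biased    = mean(lam_hat),                     # about n/(n - 1) * lambda = 2.5
      corrected = mean((n - 1) / n * lam_hat))       # about lambda = 2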
Efficiency

Definition 5.4.2. Let θ̂1 and θ̂2 be two unbiased estimators for a parameter
θ. If Var(θ̂1) < Var(θ̂2), then we say that θ̂1 is more efficient than θ̂2.
The relative efficiency of θ̂1 w.r.t. θ̂2 is the ratio Var(θ̂1)/Var(θ̂2).
E.g. 1. fY(y; θ) = 2y/θ² if y ∈ [0, θ]. Which is more efficient? Find the relative
efficiency of θ̂1 w.r.t. θ̂3.

    – θ̂1 = (3/2) Ȳ
    – θ̂3 = ((2n + 1)/(2n)) Ymax.

E.g. 2. Let X1, · · · , Xn be a random sample of size n with the unknown
parameter θ = E(X) (suppose σ² = Var(X) < ∞).
Among all unbiased estimators θ̂ = Σ_{i=1}^n ai Xi with Σ_{i=1}^n ai = 1,
find the most efficient one.

Sol:
    Var(θ̂) = Σ_{i=1}^n ai² Var(X) = σ² Σ_{i=1}^n ai² ≥ σ² (1/n) (Σ_{i=1}^n ai)² = (1/n) σ²,

with equality iff a1 = · · · = an = 1/n.
Hence, the most efficient one is the sample mean θ̂ = X̄.
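For E.g. 1, a quick simulation sketch (my own; θ = 4, n = 10, inverse-transform sampling as in § 5.2) comparing the two unbiased estimators; θ̂3 should show the much smaller variance.

    set.seed(8)
    theta <- 4; n <- 10
    sims <- replicate(10000, {
      y <- theta * sqrt(runif(n))                   # sample from f(y) = 2y / theta^2
      c(t1 = 1.5 * mean(y),                         # theta-hat-1
        t3 = (2 * n + 1) / (2 * n) * max(y))        # theta-hat-3
    })
    apply(sims, 1, mean)    # both close to theta = 4 (unbiased)
    apply(sims, 1, var)     # theta-hat-3 has the smaller variance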
Plan
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
55
Chapter 5. Estimation
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
56
§ 5.5 MVE: The Cramér-Rao Lower Bound
Question: Can one identify the unbiased estimator having the smallest
variance?
Short answer: In many cases, yes!
We are going to develop the theory to answer this question in detail!
57
Regular Estimation/Condition: The set of y (resp. k) values where fY(y; θ) ≠ 0 (resp. pX(k; θ) ≠ 0) does not depend on θ.

I.e., the support of the pdf does not depend on the parameter (so that one can differentiate under the integral sign).

Definition. The Fisher Information of a continuous (resp. discrete) random variable Y (resp. X) with pdf fY(y; θ) (resp. pmf pX(k; θ)) is defined as

I(θ) = E[ (∂ ln fY(Y; θ)/∂θ)² ]    (resp. I(θ) = E[ (∂ ln pX(X; θ)/∂θ)² ]).
58
Lemma. Under the regular condition, let Y1, · · · , Yn be a random sample of size n from the continuous population pdf fY(y; θ). Then the Fisher information in the random sample Y1, · · · , Yn equals n times the Fisher information in a single observation:

E[ ( ∂ ln f_{Y1,···,Yn}(Y1, · · · , Yn; θ)/∂θ )² ] = n E[ ( ∂ ln fY(Y; θ)/∂θ )² ] = n I(θ).    (1)

(A similar statement holds for the discrete case pX(k; θ).)

Proof. Based on two observations:

LHS = E[ ( ∑_{i=1}^n ∂ ln f_{Yi}(Yi; θ)/∂θ )² ],

and, for each i,

E[ ∂ ln f_{Yi}(Yi; θ)/∂θ ] = ∫_R [∂fY(y; θ)/∂θ] / fY(y; θ) · fY(y; θ) dy = ∫_R ∂fY(y; θ)/∂θ dy = (by R.C.) ∂/∂θ ∫_R fY(y; θ) dy = ∂/∂θ 1 = 0.

Expanding the square, the cross terms E[ (∂ ln f_{Yi}/∂θ)(∂ ln f_{Yj}/∂θ) ] with i ≠ j factor by independence and vanish by the second observation, leaving n I(θ).
59
Lemma. Under the regular condition, if ln fY(y; θ) is twice differentiable in θ, then

I(θ) = −E[ ∂² ln fY(Y; θ)/∂θ² ].    (2)

(A similar statement holds for the discrete case pX(k; θ).)

Proof. This is due to the two facts:

∂² ln fY(Y; θ)/∂θ² = [∂²fY(Y; θ)/∂θ²] / fY(Y; θ) − ( [∂fY(Y; θ)/∂θ] / fY(Y; θ) )²
= [∂²fY(Y; θ)/∂θ²] / fY(Y; θ) − ( ∂ ln fY(Y; θ)/∂θ )²

and

E[ [∂²fY(Y; θ)/∂θ²] / fY(Y; θ) ] = ∫_R [∂²fY(y; θ)/∂θ²] / fY(y; θ) · fY(y; θ) dy = ∫_R ∂²fY(y; θ)/∂θ² dy = (by R.C.) ∂²/∂θ² ∫_R fY(y; θ) dy = ∂²/∂θ² 1 = 0.

Taking expectations in the first identity and using the second gives (2). □
60
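As a quick sanity check on (2), here is a numerical sketch (not in the original slides) verifying that the two expressions for the Fisher information agree, here for the exponential pdf fY(y; θ) = (1/θ)e^{−y/θ} (the density treated in E.g. 2' below), whose score is ∂ ln fY/∂θ = −1/θ + y/θ² and whose second derivative is ∂² ln fY/∂θ² = 1/θ² − 2y/θ³; the value θ = 2 and the Monte Carlo sample size are arbitrary choices.

# Numerical check that E[(d/dtheta ln f)^2] and -E[d^2/dtheta^2 ln f] agree
# for the Exponential pdf f_Y(y; theta) = (1/theta)*exp(-y/theta); both equal 1/theta^2.
set.seed(1)
theta <- 2
y <- rexp(200000, rate = 1/theta)    # draws from the Exponential with mean theta
mean((-1/theta + y/theta^2)^2)       # Monte Carlo estimate of E[(score)^2]
-mean(1/theta^2 - 2*y/theta^3)       # Monte Carlo estimate of -E[second derivative of ln f]
1/theta^2                            # exact Fisher information, for comparison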
Theorem (Cramér-Rao Inequality). Under the regular condition, let Y1, · · · , Yn be a random sample of size n from the continuous population pdf fY(y; θ), and let θ̂ = θ̂(Y1, · · · , Yn) be any unbiased estimator for θ. Then

Var(θ̂) ≥ 1 / (n I(θ)).

(A similar statement holds for the discrete case pX(k; θ).)

Proof. If n = 1, then by the Cauchy-Schwarz inequality,

E[ (θ̂ − θ) · ∂ ln fY(Y; θ)/∂θ ] ≤ sqrt( Var(θ̂) × I(θ) ).

On the other hand,

E[ (θ̂ − θ) · ∂ ln fY(Y; θ)/∂θ ] = ∫_R (θ̂ − θ) [∂fY(y; θ)/∂θ] / fY(y; θ) · fY(y; θ) dy
= ∫_R (θ̂ − θ) ∂fY(y; θ)/∂θ dy
= ∂/∂θ ∫_R (θ̂ − θ) fY(y; θ) dy + 1 = 1,

since ∫_R (θ̂ − θ) fY(y; θ) dy = E(θ̂ − θ) = 0. Combining the two displays gives 1 ≤ sqrt( Var(θ̂) I(θ) ), i.e., Var(θ̂) ≥ 1/I(θ). For general n, apply (1). □
61
Definition. Let Θ be the set of all estimators θ̂ that are unbiased for the parameter θ. We say that θ̂∗ is a best or minimum-variance estimator (MVE) if θ̂∗ ∈ Θ and

Var(θ̂∗) ≤ Var(θ̂)   for all θ̂ ∈ Θ.

Definition. An unbiased estimator θ̂ is efficient if Var(θ̂) is equal to the Cramér-Rao lower bound, i.e., Var(θ̂) = (n I(θ))^{−1}.

The efficiency of an unbiased estimator θ̂ is defined to be (n I(θ) Var(θ̂))^{−1}.

[Diagram: the efficient estimators form a subset of the MVEs, which in turn form a subset of the set Θ of unbiased estimators.]
62
E.g. 1. X ∼ Bernoulli(p). Check whether p̂ = X̄ is efficient.

Step 1. Compute the Fisher information:

pX(k; p) = p^k (1 − p)^{1−k}
ln pX(k; p) = k ln p + (1 − k) ln(1 − p)
∂ ln pX(k; p)/∂p = k/p − (1 − k)/(1 − p)
∂² ln pX(k; p)/∂p² = −k/p² − (1 − k)/(1 − p)²

−E[ ∂² ln pX(X; p)/∂p² ] = E[ X/p² + (1 − X)/(1 − p)² ] = 1/p + 1/(1 − p) = 1/(pq)

⟹ I(p) = 1/(pq),   q = 1 − p.

Step 2. Compute Var(p̂):

Var(p̂) = (1/n²) Var( ∑_{i=1}^n Xi ) = (1/n²) · npq = pq/n.

Conclusion. Because p̂ is unbiased and Var(p̂) = (n I(p))^{−1}, p̂ is efficient.
63
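A simulation sketch (not in the original slides) that checks this conclusion numerically; the values of p, n and the number of replications B are arbitrary choices.

# Check that Var(p.hat) with p.hat = Xbar matches the Cramer-Rao lower bound
# 1/(n*I(p)) = p*(1-p)/n for a Bernoulli(p) sample.
set.seed(1)
p <- 0.3; n <- 50; B <- 20000
p.hat <- replicate(B, mean(rbinom(n, size = 1, prob = p)))
var(p.hat)          # simulated Var(p.hat)
p * (1 - p) / n     # Cramer-Rao lower bound pq/n; the two agree closely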
E.g. 2. Exponential distr.: fY(y; λ) = λ e^{−λy} for y ≥ 0. Is λ̂ = 1/Ȳ efficient?

Answer. No, because λ̂ is biased. Nevertheless, we can still compute the Fisher information as follows:

ln fY(y; λ) = ln λ − λy
∂ ln fY(y; λ)/∂λ = 1/λ − y
∂² ln fY(y; λ)/∂λ² = −1/λ²

−E[ ∂² ln fY(Y; λ)/∂λ² ] = E[ 1/λ² ] = 1/λ²   ⟹   I(λ) = λ^{−2}.

Try λ̂∗ := ((n − 1)/n) · (1/Ȳ). It is unbiased. Is it efficient?
64
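The question can be probed numerically. Below is a simulation sketch (not in the original slides) comparing the variance of λ̂∗ with the Cramér-Rao bound 1/(n I(λ)) = λ²/n; the values of λ, n and B are arbitrary choices.

# Is lambda.star = ((n-1)/n)*(1/Ybar) efficient for the Exponential(rate = lambda) model?
# Compare its simulated variance with the Cramer-Rao bound lambda^2/n.
set.seed(1)
lambda <- 2; n <- 30; B <- 20000
lambda.star <- replicate(B, { y <- rexp(n, rate = lambda); (n - 1)/n * 1/mean(y) })
mean(lambda.star)     # close to lambda: lambda.star is unbiased
var(lambda.star)      # simulated variance
lambda^2 / n          # Cramer-Rao lower bound

The simulated variance sits above λ²/n (in fact one can show Var(λ̂∗) = λ²/(n − 2)), so λ̂∗ is unbiased but not efficient.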
E.g. 2'. Exponential distr.: fY(y; θ) = θ^{−1} e^{−y/θ} for y ≥ 0. Is θ̂ = Ȳ efficient?

Step 1. Compute the Fisher information:

ln fY(y; θ) = −ln θ − y/θ
∂ ln fY(y; θ)/∂θ = −1/θ + y/θ²
∂² ln fY(y; θ)/∂θ² = 1/θ² − 2y/θ³

−E[ ∂² ln fY(Y; θ)/∂θ² ] = E[ −1/θ² + 2Y/θ³ ] = −1/θ² + 2θ/θ³ = θ^{−2}   ⟹   I(θ) = θ^{−2}.

Step 2. Compute Var(θ̂):

Var(Ȳ) = (1/n²) ∑_{i=1}^n Var(Yi) = (1/n²) · nθ² = θ²/n.

Conclusion. Because θ̂ is unbiased and Var(θ̂) = (n I(θ))^{−1}, θ̂ is efficient.
65
E.g. 3. fY(y; θ) = 2y/θ² for y ∈ [0, θ]. Is θ̂ = (3/2) Ȳ efficient?

Step 1. Compute the Fisher information:

ln fY(y; θ) = ln(2y) − 2 ln θ
∂ ln fY(y; θ)/∂θ = −2/θ

By the definition of the Fisher information,

I(θ) = E[ ( ∂ ln fY(Y; θ)/∂θ )² ] = E[ (−2/θ)² ] = 4/θ².

However, if we compute the second derivative,

∂² ln fY(y; θ)/∂θ² = 2/θ²,

then

−E[ ∂² ln fY(Y; θ)/∂θ² ] = −2/θ² ≠ 4/θ² = I(θ).    (†)

Step 2. Compute Var(θ̂):

Var(θ̂) = (9/(4n)) Var(Y) = (9/(4n)) · (θ²/18) = θ²/(8n).

Discussion. Even though θ̂ is unbiased, we have two discrepancies: (†) and

Var(θ̂) = θ²/(8n) ≤ θ²/(4n) = 1/(n I(θ)).

This is because this is not a regular estimation problem: the support [0, θ] depends on θ!
66
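A Monte Carlo sketch (not in the original slides) of the discrepancy just noted: the simulated variance of θ̂ = (3/2)Ȳ matches θ²/(8n) and falls below θ²/(4n) = 1/(n I(θ)), which is a valid lower bound only under the regularity condition. Sampling again uses Y = θ√U with U ∼ Uniform(0, 1); θ, n and B are arbitrary choices.

# Non-regular case: the variance of theta.hat = (3/2)*Ybar lies below the formal
# "Cramer-Rao bound" theta^2/(4n), because the support [0, theta] depends on theta.
set.seed(1)
theta <- 2; n <- 20; B <- 20000
theta.hat <- replicate(B, 3/2 * mean(theta * sqrt(runif(n))))
var(theta.hat)        # simulated variance
theta^2 / (8 * n)     # exact variance of theta.hat
theta^2 / (4 * n)     # 1/(n*I(theta)); not a valid bound in this non-regular model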
Plan
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
67
Chapter 5. Estimation
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
68
§ 5.6 Sufficient Estimators

Rationale: Let θ̂ be an estimator of the unknown parameter θ. Does θ̂ contain all of the information about θ that is in the sample?

Equivalently, how can one reduce the random sample of size n, denoted by (X1, · · · , Xn), to a function without losing any information about θ?

E.g., let's choose the function h(X1, · · · , Xn) := (1/n) ∑_{i=1}^n Xi, the sample mean. In many cases, h(X1, · · · , Xn) contains all relevant information about the true mean E(X). In that case, h(X1, · · · , Xn), as an estimator, is sufficient.
69
Definition. Let (X1, · · · , Xn) be a random sample of size n from a discrete population with an unknown parameter θ, and let θ̂ (resp. θ̃) be an estimator (resp. estimate) of θ. We call θ̂ and θ̃ sufficient if

P( X1 = k1, · · · , Xn = kn | θ̂ = θ̃ ) = b(k1, · · · , kn)    (Sufficiency-1)

is a function that does not depend on θ.

For a random sample (Y1, · · · , Yn) from a continuous population, (Sufficiency-1) should read

f_{Y1,··· ,Yn | θ̂ = θ̃}( y1, · · · , yn | θ̂ = θ̃ ) = b(y1, · · · , yn).

Note: θ̂ = h(X1, · · · , Xn) and θ̃ = h(k1, · · · , kn), or θ̂ = h(Y1, · · · , Yn) and θ̃ = h(y1, · · · , yn).
70
Equivalently,

Definition. ... θ̂ (or θ̃) is sufficient if the likelihood function can be factorized as

L(θ) = ∏_{i=1}^n pX(ki; θ) = g(θ̃, θ) · b(k1, · · · , kn)   (discrete case),
L(θ) = ∏_{i=1}^n fY(yi; θ) = g(θ̃, θ) · b(y1, · · · , yn)   (continuous case),    (Sufficiency-2)

where g is a function of the two arguments θ̃ and θ only, and b is a function that does not depend on θ.
71
E.g. 1. A random sample of size n from Bernoulli(p). p̂ = ∑_{i=1}^n Xi. Check the sufficiency of p̂ for p by (Sufficiency-1):

Case I: If k1, · · · , kn ∈ {0, 1} are such that ∑_{i=1}^n ki ≠ c, then

P( X1 = k1, · · · , Xn = kn | p̂ = c ) = 0.

Case II: If k1, · · · , kn ∈ {0, 1} are such that ∑_{i=1}^n ki = c, then
72
P( X1 = k1, · · · , Xn = kn | p̂ = c )
= P( X1 = k1, · · · , Xn = kn, p̂ = c ) / P( p̂ = c )
= P( X1 = k1, · · · , Xn = kn, Xn + ∑_{i=1}^{n−1} Xi = c ) / P( ∑_{i=1}^n Xi = c )
= P( X1 = k1, · · · , X_{n−1} = k_{n−1}, Xn = c − ∑_{i=1}^{n−1} ki ) / P( ∑_{i=1}^n Xi = c )
= [ ∏_{i=1}^{n−1} p^{ki} (1 − p)^{1−ki} × p^{c − ∑_{i=1}^{n−1} ki} (1 − p)^{1 − c + ∑_{i=1}^{n−1} ki} ] / [ (n choose c) p^c (1 − p)^{n−c} ]
= 1 / (n choose c).
73
In summary,

P( X1 = k1, · · · , Xn = kn | p̂ = c ) = 1/(n choose c) if ki ∈ {0, 1} s.t. ∑_{i=1}^n ki = c, and = 0 otherwise.

Hence, by (Sufficiency-1), p̂ = ∑_{i=1}^n Xi is a sufficient estimator for p.
74
E.g. 1'. As in E.g. 1, check the sufficiency of p̂ for p by (Sufficiency-2):

Notice that p̃ = ∑_{i=1}^n ki. Then

L(p) = ∏_{i=1}^n pX(ki; p) = ∏_{i=1}^n p^{ki} (1 − p)^{1−ki} = p^{∑_{i=1}^n ki} (1 − p)^{n − ∑_{i=1}^n ki} = p^{p̃} (1 − p)^{n − p̃}.

Therefore, p̃ (or p̂) is sufficient, since (Sufficiency-2) is satisfied with

g(p̃, p) = p^{p̃} (1 − p)^{n − p̃}   and   b(k1, · · · , kn) = 1.

Comment. 1. The estimator p̂ is sufficient but not unbiased, since E(p̂) = np ≠ p.

2. Any one-to-one function of a sufficient estimator is again a sufficient estimator. E.g., p̂2 := (1/n) p̂, which is unbiased, sufficient, and an MVE.

3. p̂3 := X1 is not sufficient!
75
E.g. 2. Poisson(λ), pX(k; λ) = e^{−λ} λ^k / k!, k = 0, 1, · · · . Show that λ̂ = (∑_{i=1}^n Xi)² is sufficient for λ for a sample of size n.

Sol: The corresponding estimate is λ̃ = (∑_{i=1}^n ki)². Then

L(λ) = ∏_{i=1}^n e^{−λ} λ^{ki} / ki! = e^{−nλ} λ^{∑_{i=1}^n ki} ( ∏_{i=1}^n ki! )^{−1} = e^{−nλ} λ^{√λ̃} × ( ∏_{i=1}^n ki! )^{−1},

with g(λ̃, λ) = e^{−nλ} λ^{√λ̃} and b(k1, · · · , kn) = ( ∏_{i=1}^n ki! )^{−1}.

Hence, λ̂ is a sufficient estimator for λ.
76
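Sufficiency can also be illustrated by simulation. The sketch below (not in the original slides) checks that, conditional on the total ∑Xi = t, the distribution of X1 no longer depends on λ: the empirical conditional distributions computed under λ = 1 and λ = 3 nearly coincide. The values n = 5, t = 10, B and the two λ's are arbitrary choices.

# Conditional on sum(X) = t, a Poisson sample's distribution is free of lambda;
# compare the empirical conditional distribution of X1 under two lambda values.
set.seed(1)
n <- 5; t <- 10; B <- 200000
cond.x1 <- function(lambda) {
  x <- matrix(rpois(B * n, lambda), ncol = n)     # B samples of size n
  keep <- rowSums(x) == t                         # condition on the observed total
  table(factor(x[keep, 1], levels = 0:t)) / sum(keep)
}
round(rbind(lambda.1 = cond.x1(1), lambda.3 = cond.x1(3)), 3)   # the two rows nearly coincide
# (In fact X1 | sum(X) = t is Binomial(t, 1/n), which does not involve lambda.)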
E.g. 3. Let Y1, · · · , Yn be a random sample from fY(y; θ) = 2y/θ² for y ∈ [0, θ]. Is the MLE θ̂ = Ymax sufficient for θ?

Sol: The corresponding estimate is θ̃ = ymax.

L(θ) = ∏_{i=1}^n (2yi/θ²) I_{[0,θ]}(yi) = 2^n θ^{−2n} ( ∏_{i=1}^n yi ) ∏_{i=1}^n I_{[0,θ]}(yi)
= 2^n θ^{−2n} ( ∏_{i=1}^n yi ) × I_{[0,θ]}(ymax)
= 2^n θ^{−2n} I_{[0,θ]}(θ̃) × ∏_{i=1}^n yi,

with g(θ̃, θ) = 2^n θ^{−2n} I_{[0,θ]}(θ̃) and b(y1, · · · , yn) = ∏_{i=1}^n yi.

Hence, θ̂ is a sufficient estimator for θ.

Note: The MME θ̂ = (3/2) Ȳ is NOT sufficient for θ!
77
Plan
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
78
Chapter 5. Estimation
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
79
Definition. An estimator θ̂n = h(W1, · · · , Wn) is said to be consistent if it converges to θ in probability, i.e., for any ε > 0,

lim_{n→∞} P( |θ̂n − θ| < ε ) = 1.

Comment: In the ε-δ language, the above convergence in probability says:

∀ ε > 0, ∀ δ > 0, ∃ n(ε, δ) > 0 s.t. ∀ n ≥ n(ε, δ),   P( |θ̂n − θ| < ε ) > 1 − δ.
80
A useful tool to check convergence in probability is

Theorem (Chebyshev's inequality). Let W be any r.v. with finite mean µ and variance σ². Then for any ε > 0,

P( |W − µ| < ε ) ≥ 1 − σ²/ε²,

or, equivalently,

P( |W − µ| ≥ ε ) ≤ σ²/ε².

Proof. ...
81
As a consequence of Chebyshev's inequality, we have

Proposition. The sample mean µ̂n = (1/n) ∑_{i=1}^n Wi is consistent for E(W) = µ, provided that the population W has finite mean µ and variance σ².

Proof.

E(µ̂n) = µ   and   Var(µ̂n) = σ²/n.

Hence, for all ε > 0,

P( |µ̂n − µ| ≤ ε ) ≥ 1 − σ²/(nε²) → 1.
82
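A small numerical sketch (not in the original slides) of this proposition: the probability P(|µ̂n − µ| < ε) increases toward 1 as n grows. Here the population is taken to be exponential with mean µ = 2; the values of µ, ε and B are arbitrary choices.

# The sample mean concentrates around mu as n grows (consistency).
set.seed(1)
mu <- 2; eps <- 0.1; B <- 5000
for (n in c(10, 100, 1000, 10000)) {
  hit <- replicate(B, abs(mean(rexp(n, rate = 1/mu)) - mu) < eps)
  cat("n =", n, ":  P(|sample mean - mu| < eps) is about", mean(hit), "\n")
}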
E.g. 1. Let Y1, · · · , Yn be a random sample of size n from the uniform pdf fY(y; θ) = 1/θ, y ∈ [0, θ]. Let θ̂n = Ymax. We know that Ymax is biased. Is it consistent?

Sol. The c.d.f. of Y is FY(y) = y/θ for y ∈ [0, θ]. Hence,

f_{Ymax}(y) = n FY(y)^{n−1} fY(y) = n y^{n−1}/θ^n,   y ∈ [0, θ].

Therefore, for 0 < ε < θ,

P( |θ̂n − θ| < ε ) = P( θ − ε < θ̂n < θ + ε )
= ∫_{θ−ε}^{θ} (n y^{n−1}/θ^n) dy + ∫_{θ}^{θ+ε} 0 dy
= 1 − ((θ − ε)/θ)^n → 1   as n → ∞,

since ((θ − ε)/θ)^n → 0. Hence θ̂n = Ymax is consistent.
83
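Since the probability above has the closed form 1 − ((θ − ε)/θ)^n, its convergence to 1 is easy to check numerically. A one-line sketch (not in the original slides; the values of θ and ε are arbitrary choices):

# P(|Ymax - theta| < eps) = 1 - ((theta - eps)/theta)^n for the Uniform[0, theta] maximum
theta <- 1; eps <- 0.05
n <- c(10, 50, 100, 500)
1 - ((theta - eps)/theta)^n    # approaches 1: Ymax is consistent even though it is biased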
E.g. 2. Suppose Y1, Y2, · · · , Yn is a random sample from the exponential pdf fY(y; λ) = λ e^{−λy}, y > 0. Show that λ̂n = Y1 is not consistent for λ.

Sol. To prove that λ̂n is not consistent for λ, we only need to find one ε > 0 such that the following limit does not hold:

lim_{n→∞} P( |λ̂n − λ| < ε ) = 1.    (3)

We can choose ε = λ/m for any m ≥ 1. Then

|λ̂n − λ| ≤ λ/m   ⟺   (1 − 1/m) λ ≤ λ̂n ≤ (1 + 1/m) λ   ⟹   λ̂n ≥ (1 − 1/m) λ.
84
Hence,

P( |λ̂n − λ| < λ/m ) ≤ P( λ̂n ≥ (1 − 1/m) λ )
= P( Y1 ≥ (1 − 1/m) λ )
= ∫_{(1−1/m)λ}^{∞} λ e^{−λy} dy
= e^{−(1−1/m)λ²} < 1.

Therefore, the limit in (3) cannot hold.
85
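A quick numerical check (not in the original slides) of this bound: with ε = λ/m, the simulated value of P(|Y1 − λ| < λ/m) stays below e^{−(1−1/m)λ²} < 1 and does not depend on n at all. The values of λ, m and B are arbitrary choices.

# P(|Y1 - lambda| < lambda/m) is bounded away from 1 and does not improve with n.
set.seed(1)
lambda <- 2; m <- 4; B <- 100000
y1 <- rexp(B, rate = lambda)           # Y1 has the same distribution for every n
mean(abs(y1 - lambda) < lambda/m)      # simulated P(|Y1 - lambda| < lambda/m)
exp(-(1 - 1/m) * lambda^2)             # the upper bound from the display above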
Plan
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
86
Chapter 5. Estimation
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation
87
Rationale: Let W be an estimator dependent on a parameter θ.

1. Frequentists view θ as a parameter whose exact value is to be estimated.

2. Bayesians view θ as the value of a random variable Θ.

One can incorporate prior knowledge about Θ (the prior distribution pΘ(θ) if Θ is discrete, fΘ(θ) if Θ is continuous) and use Bayes' formula to update that knowledge upon a new observation W = w:

gΘ(θ | W = w) = pW(w | Θ = θ) pΘ(θ) / P(W = w)   if W is discrete,
gΘ(θ | W = w) = fW(w | Θ = θ) fΘ(θ) / fW(w)      if W is continuous,

where gΘ(θ | W = w) is called the posterior distribution of Θ.
88
Schematically,

P(Θ | W) = P(W | Θ) P(Θ) / P(W),

where P(W | Θ) is the likelihood of the sample W, P(Θ) is the prior distribution of Θ, P(W) is the total probability of the sample W, and P(Θ | W) is the posterior of Θ.
89
Four cases for computing the posterior distribution gΘ(θ | W = w):

– Θ discrete, W discrete:       pW(w | Θ = θ) pΘ(θ) / ∑_i pW(w | Θ = θi) pΘ(θi)
– Θ discrete, W continuous:     fW(w | Θ = θ) pΘ(θ) / ∑_i fW(w | Θ = θi) pΘ(θi)
– Θ continuous, W discrete:     pW(w | Θ = θ) fΘ(θ) / ∫_R pW(w | Θ = θ') fΘ(θ') dθ'
– Θ continuous, W continuous:   fW(w | Θ = θ) fΘ(θ) / ∫_R fW(w | Θ = θ') fΘ(θ') dθ'
90
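As a minimal illustration of the case where both Θ and W are discrete (the top-left entry of the table), here is a short sketch (not in the original slides) with a made-up prior and made-up data: Θ is restricted to three candidate values, W | Θ = θ is Binomial(n, θ), and the posterior follows directly from Bayes' formula.

# Posterior over a discrete prior for Theta, with W | Theta = theta ~ Binomial(n, theta).
theta.grid <- c(0.2, 0.5, 0.8)        # candidate values of Theta (made up)
prior      <- c(0.25, 0.50, 0.25)     # prior p_Theta(theta) (made up)
n <- 10; w <- 9                       # observed W = 9 successes in n = 10 trials (made up)
likelihood <- dbinom(w, size = n, prob = theta.grid)     # p_W(w | Theta = theta)
posterior  <- likelihood * prior / sum(likelihood * prior)
rbind(theta.grid, prior, posterior)   # the posterior puts most of its mass on theta = 0.8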
Gamma distributions

Γ(r) := ∫_0^∞ y^{r−1} e^{−y} dy,   r > 0.

Two parametrizations for Gamma distributions:

1. With a shape parameter r and a scale parameter θ:

fY(y; r, θ) = y^{r−1} e^{−y/θ} / (θ^r Γ(r)),   y > 0, r, θ > 0.

2. With a shape parameter r and a rate parameter λ = 1/θ:

fY(y; r, λ) = λ^r y^{r−1} e^{−λy} / Γ(r),   y > 0, r, λ > 0.

E[Y] = r/λ = rθ   and   Var(Y) = r/λ² = rθ².
91
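As a quick check (not in the original notes), R's dgamma accepts either parametrization, and the two densities agree when λ = 1/θ:

# The scale and rate parametrizations give the same density when lambda = 1/theta
r <- 3; theta <- 0.5; lambda <- 1 / theta
y <- c(0.1, 1, 2, 5)
all.equal(dgamma(y, shape = r, scale = theta),
          dgamma(y, shape = r, rate = lambda))   # TRUE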
# Plot gamma distributions
x <- seq(0, 20, 0.01)
k <- 3        # shape parameter
theta <- 0.5  # scale parameter
plot(x, dgamma(x, k, scale = theta),
     type = "l",
     col = "red")
Beta distributions

B(α, β) := ∫_0^1 y^(α−1) (1 − y)^(β−1) dy,  α, β > 0
         = Γ(α)Γ(β) / Γ(α + β)   (see Appendix).

Beta distribution:
fY(y; α, β) = y^(α−1) (1 − y)^(β−1) / B(α, β),  y ∈ [0, 1], α, β > 0.

E[Y] = α / (α + β)  and  Var(Y) = αβ / ((α + β)² (α + β + 1)).
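The mean and variance formulas can be checked by simulation; the sketch below (added here) uses the same α = 13, β = 2 that the plot code that follows uses:

# Simulation check of E[Y] and Var(Y) for Beta(alpha, beta)
set.seed(1)
alpha <- 13; beta_par <- 2
y <- rbeta(1e6, alpha, beta_par)
c(sample_mean = mean(y), formula = alpha / (alpha + beta_par))
c(sample_var  = var(y),
  formula = alpha * beta_par / ((alpha + beta_par)^2 * (alpha + beta_par + 1)))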
# Plot Beta distributions
x <- seq(0, 1, 0.01)
a <- 13
b <- 2
plot(x, dbeta(x, a, b),
     type = "l",
     col = "red")
E.g. 1. Let X1, · · · , Xn be a random sample from Bernoulli(θ):
pXi(k; θ) = θ^k (1 − θ)^(1−k) for k = 0, 1.

Let X = X1 + · · · + Xn. Then X follows Binomial(n, θ).

Prior distribution: Θ ∼ Beta(r, s), i.e.,
fΘ(θ) = [Γ(r + s) / (Γ(r)Γ(s))] θ^(r−1) (1 − θ)^(s−1) for θ ∈ [0, 1].

X1, · · · , Xn | θ ∼ Bernoulli(θ),   Θ ∼ Beta(r, s),   r & s are known.
X = X1 + · · · + Xn | θ ∼ Binomial(n, θ),   Θ ∼ Beta(r, s),   r & s are known.
X is discrete and Θ is continuous.

gΘ(θ|X = k) = pX(k|Θ = θ) fΘ(θ) / ∫_R pX(k|Θ = θ′) fΘ(θ′) dθ′

pX(k|Θ = θ) fΘ(θ) = C(n, k) θ^k (1 − θ)^(n−k) × [Γ(r + s) / (Γ(r)Γ(s))] θ^(r−1) (1 − θ)^(s−1)
                  = C(n, k) [Γ(r + s) / (Γ(r)Γ(s))] θ^(k+r−1) (1 − θ)^(n−k+s−1),  θ ∈ [0, 1],

where C(n, k) = n! / (k!(n − k)!) is the binomial coefficient.

pX(k) = ∫_R pX(k|Θ = θ′) fΘ(θ′) dθ′
      = C(n, k) [Γ(r + s) / (Γ(r)Γ(s))] ∫_0^1 θ′^(k+r−1) (1 − θ′)^(n−k+s−1) dθ′
      = C(n, k) [Γ(r + s) / (Γ(r)Γ(s))] × Γ(k + r)Γ(n − k + s) / Γ((k + r) + (n − k + s)).
Therefore,

gΘ(θ|X = k) = [C(n, k) [Γ(r + s) / (Γ(r)Γ(s))] θ^(k+r−1) (1 − θ)^(n−k+s−1)]
              / [C(n, k) [Γ(r + s) / (Γ(r)Γ(s))] Γ(k + r)Γ(n − k + s) / Γ((k + r) + (n − k + s))]
            = [Γ(n + r + s) / (Γ(k + r)Γ(n − k + s))] θ^(k+r−1) (1 − θ)^(n−k+s−1),  θ ∈ [0, 1].

Conclusion: the posterior ∼ Beta(k + r, n − k + s).
Recall that the prior ∼ Beta(r, s).
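This conjugacy can be verified numerically in R; the sketch below uses illustrative values of n, k, r, s (not prescribed by the derivation) and compares the normalized likelihood × prior with the Beta(k + r, n − k + s) density on a grid:

# Binomial likelihood x Beta(r, s) prior, normalized, vs. Beta(k + r, n - k + s)
n <- 10; k <- 2; r <- 7; s <- 193                 # illustrative values
theta <- seq(0.001, 0.999, by = 0.001)
unnorm <- dbinom(k, n, theta) * dbeta(theta, r, s)
post <- unnorm / (sum(unnorm) * 0.001)            # grid approximation of the normalizing integral
max(abs(post - dbeta(theta, k + r, n - k + s)))   # small (up to grid error)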
It remains to determine the values of r and s to incorporate the prior knowledge:

PK 1. The mean is about 0.035.
E(Θ) = 0.035  ⟹  r / (r + s) = 0.035  ⟺  r / s = 7 / 193.

PK 2. The pdf should be mostly concentrated over [0, 0.07]. (Candidate choices of (r, s) are compared by trial in the plots below.)
x <- seq(0, 1, length = 1025)
plot(x, dbeta(x, 4, 102), type = "l")
plot(x, dbeta(x, 7, 193), type = "l")
dev.off()
pdf <- cbind(dbeta(x, 4, 102), dbeta(x, 7, 193))
matplot(x, pdf,
        type = "l",
        lty = 1:2,
        xlab = "theta", ylab = "PDF",
        lwd = 2                        # line width
)
legend(0.2, 25,                        # position of legend
       c("Beta(4,102)", "Beta(7,193)"),
       col = 1:2, lty = 1:2,
       ncol = 1,                       # number of columns
       cex = 1.5,                      # font size
       lwd = 2                         # line width
)
abline(v = 0.07, col = "blue", lty = 1, lwd = 1.5)
text(0.07, -0.5, "0.07")
abline(v = 0.035, col = "gray60", lty = 3, lwd = 2)
text(0.035, 1, "0.035")
If we choose r = 7 and s = 193:

gΘ(θ|X = k) = [Γ(n + 200) / (Γ(k + 7)Γ(n − k + 193))] θ^(k+6) (1 − θ)^(n−k+192),  θ ∈ [0, 1].

Moreover, if n = 10 and k = 2,

gΘ(θ|X = 2) = [Γ(210) / (Γ(9)Γ(201))] θ^8 (1 − θ)^200,  θ ∈ [0, 1].
x <- seq(0, 0.1, length = 1025)
pdf <- cbind(dbeta(x, 7, 193), dbeta(x, 9, 201))
matplot(x, pdf,
        type = "l",
        lty = 1:2,
        xlab = "theta", ylab = "PDF",
        lwd = 2                        # line width
)
legend(0.05, 25,                       # position of legend
       c("Posterior Beta(9,201)", "Prior Beta(7,193)"),
       col = 1:2, lty = 1:2,
       ncol = 1,                       # number of columns
       cex = 1.5,                      # font size
       lwd = 2                         # line width
)
abline(v = 0.07, col = "blue", lty = 1, lwd = 1.5)
text(0.07, -0.5, "0.07")
abline(v = 0.035, col = "black", lty = 3, lwd = 2)
text(0.035, 1, "0.035")
Definition. If the posterior distributions p(Θ|X) are in the same probability distribution family as the prior probability distribution p(Θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

1. Beta distributions are conjugate priors for Bernoulli, binomial, negative binomial, and geometric likelihoods.
2. Gamma distributions are conjugate priors for Poisson and exponential likelihoods.
E.g. 2. Let X1, · · · , Xn be a random sample from Poisson(θ):
pX(k; θ) = e^(−θ) θ^k / k!  for k = 0, 1, · · · .

Let W = X1 + · · · + Xn. Then W follows Poisson(nθ).

Prior distribution: Θ ∼ Gamma(s, µ), i.e.,
fΘ(θ) = [µ^s / Γ(s)] θ^(s−1) e^(−µθ) for θ > 0.

X1, · · · , Xn | θ ∼ Poisson(θ),   Θ ∼ Gamma(s, µ),   s & µ are known.
W = X1 + · · · + Xn | θ ∼ Poisson(nθ),   Θ ∼ Gamma(s, µ),   s & µ are known.
gΘ(θ|W = w) = pW(w|Θ = θ) fΘ(θ) / ∫_R pW(w|Θ = θ′) fΘ(θ′) dθ′

pW(w|Θ = θ) fΘ(θ) = [e^(−nθ) (nθ)^w / w!] × [µ^s / Γ(s)] θ^(s−1) e^(−µθ)
                  = [n^w µ^s / (w! Γ(s))] θ^(w+s−1) e^(−(µ+n)θ),  θ > 0.

pW(w) = ∫_R pW(w|Θ = θ′) fΘ(θ′) dθ′
      = [n^w µ^s / (w! Γ(s))] ∫_0^∞ θ′^(w+s−1) e^(−(µ+n)θ′) dθ′
      = [n^w µ^s / (w! Γ(s))] × Γ(w + s) / (µ + n)^(w+s).
Therefore,

gΘ(θ|W = w) = [[n^w µ^s / (w! Γ(s))] θ^(w+s−1) e^(−(µ+n)θ)]
              / [[n^w µ^s / (w! Γ(s))] Γ(w + s) / (µ + n)^(w+s)]
            = [(µ + n)^(w+s) / Γ(w + s)] θ^(w+s−1) e^(−(µ+n)θ),  θ > 0.

Conclusion: the posterior of Θ ∼ Gamma(w + s, n + µ).
Recall that the prior of Θ ∼ Gamma(s, µ).
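The same kind of numerical check as in E.g. 1 works here; this sketch uses illustrative values of n, w, s, µ that are not prescribed by the derivation:

# Poisson(n * theta) likelihood x Gamma(s, mu) prior, vs. Gamma(w + s, mu + n)
n <- 50; w <- 92; s <- 88; mu <- 50               # illustrative values
theta <- seq(0.01, 4, by = 0.01)
unnorm <- dpois(w, n * theta) * dgamma(theta, shape = s, rate = mu)
post <- unnorm / (sum(unnorm) * 0.01)             # grid approximation of the normalizing integral
max(abs(post - dgamma(theta, shape = w + s, rate = mu + n)))  # small (up to grid error)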
Case Study 5.8.1

x <- seq(0, 4, length = 1025)
pdf <- cbind(dgamma(x, shape = 88, rate = 50),
             dgamma(x, shape = 88 + 92, 100),
             dgamma(x, 88 + 92 + 72, 150))
matplot(x, pdf,
        type = "l",
        lty = 1:3,
        xlab = "theta", ylab = "PDF",
        lwd = 2                        # line width
)
legend(2, 3.5,                         # position of legend
       c("Prior Gamma(88,50)",
         "Posterior1 Gamma(180,100)",
         "Posterior2 Gamma(252,150)"),
       col = 1:3, lty = 1:3,
       ncol = 1,                       # number of columns
       cex = 1.5,                      # font size
       lwd = 2                         # line width
)
Bayesian Point Estimation

Question. Can one calculate an appropriate point estimate θ_e given the posterior gΘ(θ|W = w)?

Definitions. Let θ_e be an estimate for θ based on a statistic W. The loss function associated with θ_e is denoted L(θ_e, θ), where L(θ_e, θ) ≥ 0 and L(θ, θ) = 0.

Let gΘ(θ|W = w) be the posterior distribution of the random variable Θ. Then the risk associated with θ_e is the expected value of the loss function with respect to the posterior distribution of Θ:

risk = ∫_R L(θ_e, θ) gΘ(θ|W = w) dθ    if Θ is continuous,
risk = Σᵢ L(θ_e, θᵢ) gΘ(θᵢ|W = w)      if Θ is discrete.
Theorem. Let gΘ(θ|W = w) be the posterior distribution of the random variable Θ.

1. If L(θ_e, θ) = |θ_e − θ|, then the Bayes point estimate for θ is the median of gΘ(θ|W = w).
2. If L(θ_e, θ) = (θ_e − θ)², then the Bayes point estimate for θ is the mean of gΘ(θ|W = w).

Remarks
1. The median usually does not have a closed-form formula.
2. The mean usually does have a closed-form formula.
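For instance, with the Beta(9, 201) posterior obtained earlier, the two Bayes point estimates can be computed directly in R (an added illustration, not from the notes):

# Bayes point estimates for a Beta(9, 201) posterior
a <- 9; b <- 201
qbeta(0.5, a, b)   # posterior median: estimate under absolute-error loss
a / (a + b)        # posterior mean:   estimate under squared-error loss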
(Figure: see https://en.wikipedia.org)
Proof (of Part 1).
Let m be the median of the random variable W. We first claim that

E(|W − m|) ≤ E(|W|).   (⋆)

For any constant b ∈ R, because
1/2 = P(W ≤ m) = P(W − b ≤ m − b),
we see that m − b is the median of W − b. Hence, by (⋆) applied to W − b,

E(|W − m|) = E(|(W − b) − (m − b)|) ≤ E(|W − b|)  for all b ∈ R,

which proves the statement.
Proof (of Part 1, continued).
It remains to prove (⋆). Without loss of generality, we may assume m > 0. Then

E(|W − m|) = ∫_R |w − m| fW(w) dw
           = ∫_{−∞}^{m} (m − w) fW(w) dw + ∫_{m}^{∞} (w − m) fW(w) dw
           = −∫_{−∞}^{m} w fW(w) dw + ∫_{m}^{∞} w fW(w) dw + m[P(W ≤ m) − P(W > m)],

and the last term equals (1/2)(m − m) = 0. Splitting the first integral at 0,

E(|W − m|) = −∫_{−∞}^{0} w fW(w) dw − ∫_{0}^{m} w fW(w) dw + ∫_{m}^{∞} w fW(w) dw
           ≤ −∫_{−∞}^{0} w fW(w) dw + ∫_{0}^{m} w fW(w) dw + ∫_{m}^{∞} w fW(w) dw,

since ∫_{0}^{m} w fW(w) dw ≥ 0. The right-hand side is

∫_{−∞}^{0} |w| fW(w) dw + ∫_{0}^{∞} w fW(w) dw = ∫_R |w| fW(w) dw = E(|W|).
Proof (of Part 2).
Let µ be the mean of W. Then for any b ∈ R,

E[(W − b)²] = E[([W − µ] + [µ − b])²]
            = E[(W − µ)²] + 2(µ − b) E(W − µ) + (µ − b)²
            = E[(W − µ)²] + (µ − b)²     (since E(W − µ) = 0)
            ≥ E[(W − µ)²],

that is, E[(W − µ)²] ≤ E[(W − b)²] for all b ∈ R.
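The theorem can also be illustrated by simulation (a sketch, with an arbitrary skewed distribution standing in for the posterior): over a grid of candidate estimates b, the average absolute loss is minimized near the median and the average squared loss near the mean.

# Median minimizes mean |W - b|; mean minimizes mean (W - b)^2
set.seed(1)
w <- rgamma(1e5, shape = 2, rate = 1)            # stand-in "posterior" sample
b <- seq(0, 5, by = 0.01)                        # candidate point estimates
abs_risk <- sapply(b, function(bb) mean(abs(w - bb)))
sq_risk  <- sapply(b, function(bb) mean((w - bb)^2))
c(minimizer_abs = b[which.min(abs_risk)], median = median(w))
c(minimizer_sq  = b[which.min(sq_risk)],  mean   = mean(w))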
E.g. 1'.

X1, · · · , Xn | θ ∼ Bernoulli(θ),   Θ ∼ Beta(r, s),   r & s are known.
X = X1 + · · · + Xn | θ ∼ Binomial(n, θ),   Θ ∼ Beta(r, s),   r & s are known.

Prior Beta(r, s) → posterior Beta(k + r, n − k + s) upon observing X = k for a random sample of size n.

Consider the L² (squared-error) loss function.

θ_e = mean of Beta(k + r, n − k + s)
    = (k + r) / (n + r + s)
    = [n / (n + r + s)] × (k/n) + [(r + s) / (n + r + s)] × [r / (r + s)],

where k/n is the MLE and r/(r + s) is the mean of the prior.
MLE vs. Prior

θ_e = [n / (n + r + s)] × (k/n) + [(r + s) / (n + r + s)] × [r / (r + s)]
            MLE                          Mean of Prior

The posterior mean is a weighted average of the MLE k/n and the prior mean r/(r + s); the weight on the MLE grows with the sample size n.
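A one-line numerical check of this decomposition (reusing the illustrative n = 10, k = 2, r = 7, s = 193 from before):

# Posterior mean equals the weighted average of the MLE and the prior mean
n <- 10; k <- 2; r <- 7; s <- 193
(k + r) / (n + r + s)
n / (n + r + s) * (k / n) + (r + s) / (n + r + s) * (r / (r + s))   # same value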
E.g. 2'.

X1, · · · , Xn | θ ∼ Poisson(θ),   Θ ∼ Gamma(s, µ),   s & µ are known.
W = X1 + · · · + Xn | θ ∼ Poisson(nθ),   Θ ∼ Gamma(s, µ),   s & µ are known.

Prior Gamma(s, µ) → posterior Gamma(w + s, µ + n) upon observing W = w for a random sample of size n.

Consider the L² (squared-error) loss function.

θ_e = mean of Gamma(w + s, µ + n)
    = (w + s) / (µ + n)
    = [n / (µ + n)] × (w/n) + [µ / (µ + n)] × (s/µ),

where w/n is the MLE and s/µ is the mean of the prior.
MLE vs. Prior

θ_e = [n / (µ + n)] × (w/n) + [µ / (µ + n)] × (s/µ)
            MLE                   Mean of Prior

Again the posterior mean is a weighted average of the MLE w/n and the prior mean s/µ, with the weight on the MLE increasing in n.
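A small sketch (with an arbitrary prior rate µ, not from the notes) of how the weight on the MLE, n/(µ + n), grows with the sample size:

# Weight on the MLE in the Poisson-Gamma posterior mean
mu <- 50
n <- c(10, 50, 100, 500, 1000)
round(n / (mu + n), 3)   # approaches 1 as n grows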
Appendix: Beta integral

Lemma.  B(α, β) := ∫_0^1 x^(α−1) (1 − x)^(β−1) dx = Γ(α)Γ(β) / Γ(α + β).

Proof. Notice that
Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx  and  Γ(β) = ∫_0^∞ y^(β−1) e^(−y) dy.
Hence,
Γ(α)Γ(β) = ∫_0^∞ ∫_0^∞ x^(α−1) y^(β−1) e^(−(x+y)) dx dy.
The key in the proof is the following change of variables:

x = r² cos²(θ),  y = r² sin²(θ)

⟹  ∂(x, y)/∂(r, θ) = ( 2r cos²(θ)    −2r² cos(θ) sin(θ)
                        2r sin²(θ)     2r² cos(θ) sin(θ) )

⟹  det ∂(x, y)/∂(r, θ) = 4r³ sin(θ) cos(θ).
Therefore,

Γ(α)Γ(β) = ∫_0^{π/2} dθ ∫_0^∞ dr  r^(2(α+β)−4) e^(−r²) cos^(2α−2)(θ) sin^(2β−2)(θ) × 4r³ sin(θ) cos(θ)   [Jacobian]
         = 4 ( ∫_0^{π/2} cos^(2α−1)(θ) sin^(2β−1)(θ) dθ ) ( ∫_0^∞ r^(2(α+β)−1) e^(−r²) dr ).

Now let us compute the following two integrals separately:

I₁ := ∫_0^{π/2} cos^(2α−1)(θ) sin^(2β−1)(θ) dθ
I₂ := ∫_0^∞ r^(2(α+β)−1) e^(−r²) dr
For I₂, by the change of variable r² = u (so that 2r dr = du),

I₂ = ∫_0^∞ r^(2(α+β)−1) e^(−r²) dr
   = (1/2) ∫_0^∞ r^(2(α+β)−2) e^(−r²) · 2r dr
   = (1/2) ∫_0^∞ u^(α+β−1) e^(−u) du
   = (1/2) Γ(α + β).
For I₁, by the change of variable √x = cos(θ) (so that −sin(θ) dθ = dx/(2√x)),

I₁ = ∫_0^{π/2} cos^(2α−1)(θ) sin^(2β−1)(θ) dθ
   = ∫_0^{π/2} cos^(2α−1)(θ) sin^(2β−2)(θ) × sin(θ) dθ
   = ∫_0^1 x^(α−1/2) (1 − x)^(β−1) × (1/(2√x)) dx
   = (1/2) ∫_0^1 x^(α−1) (1 − x)^(β−1) dx
   = (1/2) B(α, β).
Therefore,

Γ(α)Γ(β) = 4 I₁ I₂ = 4 × (1/2) B(α, β) × (1/2) Γ(α + β),

i.e.,

B(α, β) = Γ(α)Γ(β) / Γ(α + β).  ∎
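The lemma can also be checked numerically in R (a sketch; the values of α and β below are arbitrary):

# Numerical check of B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta)
alpha <- 2.3; beta_par <- 5.7
lhs <- integrate(function(x) x^(alpha - 1) * (1 - x)^(beta_par - 1), 0, 1)$value
rhs <- gamma(alpha) * gamma(beta_par) / gamma(alpha + beta_par)
c(lhs = lhs, rhs = rhs)   # agree up to numerical integration error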