as simulated pairs from the posterior. A further complication is that these pairs are
not independent of one another. But, nonetheless, a histogram of the collection of θ_k
could be used as an estimate of the marginal posterior distribution of θ. The posterior
mean of θ can be estimated as

$$E(\theta \mid X) \approx \frac{1}{N}\sum_{k=1}^{N} \theta_k$$
8.7 Efficiency and the Cramér-Rao Lower Bound
In most statistical estimation problems, there are a variety of possible parameter
estimates. For example, in Chapter 7 we considered both the sample mean and a
ratio estimate, and in this chapter we considered the method of moments and the
method of maximum likelihood. Given a variety of possible estimates, how would we
choose which to use? Qualitatively, it would be sensible to choose that estimate whose
sampling distribution was most highly concentrated about the true parameter value.
To define this aim operationally, we would need to specify a quantitative measure
of such concentration. Mean squared error is the most commonly used measure of
concentration, largely because of its analytic simplicity. The mean squared error of θ̂
as an estimate of θ0 is
$$\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta_0)^2 = \mathrm{Var}(\hat\theta) + \bigl[E(\hat\theta) - \theta_0\bigr]^2$$

(See Theorem A of Section 4.2.1.) If the estimate θ̂ is unbiased [E(θ̂) = θ_0], MSE(θ̂) = Var(θ̂). When the estimates under consideration are unbiased, comparison of their mean squared errors reduces to comparison of their variances, or equivalently, standard errors.
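This decomposition follows by adding and subtracting E(θ̂) inside the square; the cross term has expectation zero:

$$
\begin{aligned}
E(\hat\theta - \theta_0)^2 &= E\bigl[(\hat\theta - E(\hat\theta)) + (E(\hat\theta) - \theta_0)\bigr]^2 \\
&= E\bigl(\hat\theta - E(\hat\theta)\bigr)^2 + 2\bigl(E(\hat\theta) - \theta_0\bigr)\,E\bigl[\hat\theta - E(\hat\theta)\bigr] + \bigl(E(\hat\theta) - \theta_0\bigr)^2 \\
&= \mathrm{Var}(\hat\theta) + \bigl(E(\hat\theta) - \theta_0\bigr)^2
\end{aligned}
$$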
Given two estimates, θ̂ and θ̃ , of a parameter θ, the efficiency of θ̂ relative to θ̃
is defined to be
$$\mathrm{eff}(\hat\theta, \tilde\theta) = \frac{\mathrm{Var}(\tilde\theta)}{\mathrm{Var}(\hat\theta)}$$
Thus, if the efficiency is smaller than 1, θ̂ has a larger variance than θ̃ has. This
comparison is most meaningful when both θ̂ and θ̃ are unbiased or when both have
the same bias. Frequently, the variances of θ̂ and θ̃ are of the form
$$\mathrm{Var}(\hat\theta) = \frac{c_1}{n}, \qquad \mathrm{Var}(\tilde\theta) = \frac{c_2}{n}$$
where n is the sample size. If this is the case, the efficiency can be interpreted as
the ratio of sample sizes necessary to obtain the same variance for both θ̂ and θ̃ . (In
Chapter 7, we compared the efficiencies of estimates of a population mean from a
simple random sample, a stratified random sample with proportional allocation, and
a stratified random sample with optimal allocation.)
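To make the sample-size interpretation explicit: if θ̂ is computed from n_1 observations and θ̃ from n_2 observations, their variances are equal when

$$\frac{c_1}{n_1} = \frac{c_2}{n_2}, \qquad \text{that is,} \qquad \frac{n_2}{n_1} = \frac{c_2}{c_1} = \mathrm{eff}(\hat\theta, \tilde\theta)$$

An efficiency of .5, for example, means that θ̃ attains with half as many observations the variance that θ̂ attains, or equivalently that θ̂ requires twice as many observations as θ̃.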
EXAMPLE A  Muon Decay
Two estimates have been derived for α in the problem of muon decay. The method of
moments estimate is
$$\tilde\alpha = 3\bar X$$
The maximum likelihood estimate is the solution of the nonlinear equation
$$\sum_{i=1}^{n} \frac{X_i}{1 + \hat\alpha X_i} = 0$$
We need to find the variances of these two estimates.
Since the variance of a sample mean is σ²/n, we compute σ²:

$$\sigma^2 = E(X^2) - [E(X)]^2 = \int_{-1}^{1} x^2\,\frac{1 + \alpha x}{2}\,dx - \frac{\alpha^2}{9} = \frac{1}{3} - \frac{\alpha^2}{9}$$

Thus, the variance of the method of moments estimate is

$$\mathrm{Var}(\tilde\alpha) = 9\,\mathrm{Var}(\bar X) = \frac{3 - \alpha^2}{n}$$
The exact variance of the mle, α̂, cannot be computed in closed form, so we approximate it by the asymptotic variance,

$$\mathrm{Var}(\hat\alpha) \approx \frac{1}{n I(\alpha)}$$
and then compare this asymptotic variance to the variance of α̃. The ratio of the former
to the latter is called the asymptotic relative efficiency. By definition,
$$
\begin{aligned}
I(\alpha) &= E\left[\left(\frac{\partial}{\partial\alpha}\log f(x \mid \alpha)\right)^{2}\right] \\
&= \int_{-1}^{1} \frac{x^2}{(1+\alpha x)^2}\,\frac{1+\alpha x}{2}\,dx \\
&= \frac{\log\!\left(\dfrac{1+\alpha}{1-\alpha}\right) - 2\alpha}{2\alpha^{3}}, \qquad -1 < \alpha < 1,\ \alpha \neq 0 \\
&= \frac{1}{3}, \qquad \alpha = 0
\end{aligned}
$$
The asymptotic relative efficiency is thus (for α ≠ 0)

$$\frac{\mathrm{Var}(\hat\alpha)}{\mathrm{Var}(\tilde\alpha)} = \frac{2\alpha^{3}}{(3 - \alpha^{2})\left[\log\!\left(\dfrac{1+\alpha}{1-\alpha}\right) - 2\alpha\right]}$$
The following table gives this efficiency for various values of α between 0 and 1;
symmetry would yield the values between −1 and 0.
α            0.0    .1     .2     .3     .4     .5     .6     .7     .8     .9     .95
Efficiency   1.0    .997   .989   .975   .953   .931   .878   .817   .727   .582   .464
As α tends to 1, the efficiency tends to 0. Thus, the mle is not much better than the
method of moments estimate for α close to 0 but does increasingly better as α tends
to 1.
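The efficiency formula is easy to evaluate numerically. The following Python sketch (the function names fisher_info and asymptotic_rel_eff are ours) reproduces entries of the table above.

import numpy as np

def fisher_info(alpha):
    """I(alpha) = [log((1+alpha)/(1-alpha)) - 2*alpha] / (2*alpha**3), with I(0) = 1/3."""
    if alpha == 0:
        return 1 / 3
    return (np.log((1 + alpha) / (1 - alpha)) - 2 * alpha) / (2 * alpha**3)

def asymptotic_rel_eff(alpha):
    """Var(alpha_mle) / Var(alpha_mom) = [1 / I(alpha)] / (3 - alpha**2)."""
    return 1 / (fisher_info(alpha) * (3 - alpha**2))

for a in (0.0, 0.1, 0.9, 0.95):
    print(a, round(asymptotic_rel_eff(a), 3))   # approximately 1.0, .997, .582, .464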
It must be kept in mind that we used the asymptotic variance of the mle, so we
calculated an asymptotic relative efficiency, viewing this as an approximation to the
actual relative efficiency. To gain more precise information for a given sample size,
a simulation of the sampling distribution of the mle could be conducted. This might
be especially interesting for α = 1, a case for which the formula for the asymptotic
variance given above does not appear to make much sense. With a simulation study,
the behavior of the bias as n and α vary could be analyzed (we showed that the mle
is asymptotically unbiased, but there may be bias for a finite sample size), and the
actual distribution could be compared to the approximating normal. ■
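A minimal Python sketch of such a study follows (an illustration of ours, not from the text): observations are drawn by inverting the cdf of f(x|α) = (1 + αx)/2 on [−1, 1], the score equation is solved with a root finder, and the choices α₀ = .5, n = 50, and 2000 replications are arbitrary. The simulated bias and variance of the mle can then be compared with zero and with 1/[nI(α₀)].

import numpy as np
from scipy.optimize import brentq

def sample_x(alpha, n, rng):
    """Draw n observations from f(x|alpha) = (1 + alpha*x)/2 on [-1, 1] by inversion."""
    u = rng.uniform(size=n)
    if alpha == 0:
        return 2 * u - 1
    return (-1 + np.sqrt(1 + alpha**2 + 2 * alpha * (2 * u - 1))) / alpha

def mle_alpha(x):
    """Solve the score equation sum(x_i / (1 + a*x_i)) = 0 for a in (-1, 1)."""
    score = lambda a: np.sum(x / (1 + a * x))
    lo, hi = -1 + 1e-6, 1 - 1e-6
    if score(lo) <= 0:      # log-likelihood decreasing everywhere: maximum at the left edge
        return lo
    if score(hi) >= 0:      # log-likelihood increasing everywhere: maximum at the right edge
        return hi
    return brentq(score, lo, hi)

rng = np.random.default_rng(0)
alpha0, n, n_sim = 0.5, 50, 2000
est = np.array([mle_alpha(sample_x(alpha0, n, rng)) for _ in range(n_sim)])
asy_var = 2 * alpha0**3 / (n * (np.log((1 + alpha0) / (1 - alpha0)) - 2 * alpha0))
print("simulated bias:     ", est.mean() - alpha0)
print("simulated variance: ", est.var())
print("asymptotic variance:", asy_var)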
In searching for an optimal estimate, we might ask whether there is a lower bound
for the MSE of any estimate. If such a lower bound existed, it would function as a
benchmark against which estimates could be compared. If an estimate achieved this
lower bound, we would know that it could not be improved upon. In the case in which
the estimate is unbiased, the Cramér-Rao inequality provides such a lower bound. We
now state and prove the Cramér-Rao inequality.
THEOREM A Cramér-Rao Inequality
Let X 1 , . . . , X n be i.i.d. with density function f (x|θ ). Let T = t (X 1 , . . . , X n )
be an unbiased estimate of θ. Then, under smoothness assumptions on f (x|θ ),
$$\mathrm{Var}(T) \ge \frac{1}{n I(\theta)}$$
Proof
Let
$$Z = \sum_{i=1}^{n} \frac{\partial}{\partial\theta}\log f(X_i \mid \theta) = \sum_{i=1}^{n} \frac{\dfrac{\partial}{\partial\theta} f(X_i \mid \theta)}{f(X_i \mid \theta)}$$
In Section 8.5.2, we showed that E(Z) = 0. Because the correlation coefficient of Z and T is less than or equal to 1 in absolute value,

$$\mathrm{Cov}^2(Z, T) \le \mathrm{Var}(Z)\,\mathrm{Var}(T)$$
It was also shown in Section 8.5.2 that

$$\mathrm{Var}\!\left[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\right] = I(\theta)$$
Therefore,

$$\mathrm{Var}(Z) = n I(\theta)$$
The proof will be completed by showing that Cov(Z , T ) = 1. Since Z has
mean 0,
$$\mathrm{Cov}(Z, T) = E(ZT) = \int\!\cdots\!\int t(x_1, \ldots, x_n)\left[\sum_{i=1}^{n}\frac{\dfrac{\partial}{\partial\theta} f(x_i \mid \theta)}{f(x_i \mid \theta)}\right]\prod_{j=1}^{n} f(x_j \mid \theta)\,dx_j$$

Noting that

$$\sum_{i=1}^{n}\left[\frac{\dfrac{\partial}{\partial\theta} f(x_i \mid \theta)}{f(x_i \mid \theta)}\right]\prod_{j=1}^{n} f(x_j \mid \theta) = \frac{\partial}{\partial\theta}\prod_{i=1}^{n} f(x_i \mid \theta)$$
we rewrite the expression for the covariance of Z and T as

$$
\begin{aligned}
\mathrm{Cov}(Z, T) &= \int\!\cdots\!\int t(x_1, \ldots, x_n)\,\frac{\partial}{\partial\theta}\prod_{i=1}^{n} f(x_i \mid \theta)\,dx_i \\
&= \frac{\partial}{\partial\theta}\int\!\cdots\!\int t(x_1, \ldots, x_n)\prod_{i=1}^{n} f(x_i \mid \theta)\,dx_i \\
&= \frac{\partial}{\partial\theta} E(T) = \frac{\partial}{\partial\theta}(\theta) = 1
\end{aligned}
$$
which proves the inequality. [Note the interchange of differentiation and integration that must be justified by the smoothness assumptions on f (x|θ ).]
■
Theorem A gives a lower bound on the variance of any unbiased estimate. An
unbiased estimate whose variance achieves this lower bound is said to be efficient.
Since the asymptotic variance of a maximum likelihood estimate is equal to the
lower bound, maximum likelihood estimates are said to be asymptotically efficient.
For a finite sample size, however, a maximum likelihood estimate may not be efficient, and maximum likelihood estimates are not the only asymptotically efficient
estimates.
EXAMPLE B  Poisson Distribution
In Example B in Section 8.5.3, we found that for the Poisson distribution
$$I(\lambda) = \frac{1}{\lambda}$$

Therefore, by Theorem A, for any unbiased estimate T of λ, based on a sample of independent Poisson random variables, X_1, . . . , X_n,

$$\mathrm{Var}(T) \ge \frac{\lambda}{n}$$
The mle of λ was found to be X̄ = S/n, where S = X_1 + · · · + X_n. Since S follows a Poisson distribution with parameter nλ, Var(S) = nλ and Var(X̄) = λ/n. Therefore, X̄ attains the Cramér-Rao lower bound, and we know that no unbiased estimator of λ can have a smaller variance. In this sense, X̄ is optimal for the Poisson distribution. But note that the theorem does not preclude the possibility that there is a biased estimator of λ that has a smaller mean squared error than X̄ does. ■
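A quick Monte Carlo check of this bound is straightforward; the following sketch (λ = 4, n = 30, and the number of replications are arbitrary choices) compares the simulated variance of X̄ with λ/n.

import numpy as np

rng = np.random.default_rng(0)
lam, n, n_rep = 4.0, 30, 100_000
# Sample mean of each of n_rep independent Poisson samples of size n
xbars = rng.poisson(lam, size=(n_rep, n)).mean(axis=1)
print(xbars.var())   # Monte Carlo variance of X-bar
print(lam / n)       # Cramér-Rao lower bound λ/n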
8.7.1 An Example: The Negative Binomial Distribution
The Poisson distribution is often the first model considered for random counts; it
has the property that the mean of the distribution is equal to the variance. When it is
found that the variance of the counts is substantially larger than the mean, the negative
binomial distribution is sometimes instead considered as a model. We consider a
reparametrization and generalization of the negative binomial distribution introduced
in Section 2.1.3, which is a discrete distribution on the nonnegative integers with a
frequency function depending on the parameters m and k:
$$f(x \mid m, k) = \left(1 + \frac{m}{k}\right)^{-k}\frac{\Gamma(k + x)}{x!\,\Gamma(k)}\left(\frac{m}{m + k}\right)^{x}$$
The mean and variance of the negative binomial distribution can be shown
to be
$$\mu = m, \qquad \sigma^2 = m + \frac{m^2}{k}$$

It is apparent that this distribution is overdispersed (σ² > µ) relative to the Poisson. We will not derive the mean and variance. (They are most easily obtained by using moment-generating functions.)
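The parametrization is convenient to work with numerically. The following Python sketch (the function name nb_pmf is ours; the parameter values are the mle's from the insect-count example later in this section) evaluates the frequency function via the log-gamma function and checks the mean and variance formulas by summing over a long grid.

import numpy as np
from scipy.special import gammaln

def nb_pmf(x, m, k):
    """Negative binomial frequency function in the (m, k) parametrization."""
    x = np.asarray(x, dtype=float)
    log_p = (gammaln(k + x) - gammaln(x + 1) - gammaln(k)
             - k * np.log1p(m / k) + x * np.log(m / (m + k)))
    return np.exp(log_p)

m, k = 1.146, 1.025
x = np.arange(0, 400)              # support truncated where the tail is negligible
p = nb_pmf(x, m, k)
mean = np.sum(x * p)
var = np.sum(x**2 * p) - mean**2
print(p.sum(), mean, var)          # approximately 1, m, and m + m**2/k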
The negative binomial distribution can be used as a model in several cases:
• If k is an integer, the distribution of the number of successes up to the kth failure in a sequence of independent Bernoulli trials with probability of success p = m/(m + k) is negative binomial.
• Suppose that Λ is a random variable following a gamma distribution and that for λ, a given value of Λ, X follows a Poisson distribution with mean λ. It can be shown that the unconditional distribution of X is negative binomial. Thus, for situations in which the rate varies randomly over time or space, the negative binomial distribution might tentatively be considered as a model.
• The negative binomial distribution also arises with a particular type of clustering. Suppose that counts of colonies, or clusters, follow a Poisson distribution and that each colony has a random number of individuals. If the probability distribution of the number of individuals per colony is of a particular form (the logarithmic series distribution), it can be shown that the distribution of counts of individuals is negative binomial. The negative binomial distribution might be a plausible model for the distribution of insect counts if the insects hatch from depositions, or clumps, of larvae.
• The negative binomial distribution can be applied to model population size in a certain birth/death process, the assumption being that the birth rate and death rate per individual are constant and that there is a constant rate of immigration.
Anscombe (1950) discusses estimation of the parameters m and k and compares
the efficiencies of several methods of estimation. The simplest method is the method
of moments; from the relations of m and k to µ and σ 2 given previously, the method
of moments estimates of m and k are
$$\hat m = \bar X, \qquad \hat k = \frac{\bar X^{2}}{\hat\sigma^{2} - \bar X}$$
Another relatively simple method of estimation of m and k is based on the number
of zeros. The probability of the count being zero is
$$p_0 = \left(1 + \frac{m}{k}\right)^{-k}$$
If m is estimated by the sample mean and there are n_0 zeros out of a sample size of n, then k is estimated by k̂, where k̂ satisfies

$$\frac{n_0}{n} = \left(1 + \frac{\bar X}{\hat k}\right)^{-\hat k}$$
Although the solution cannot be obtained in closed form, it is not difficult to find by
iteration.
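Any one-dimensional root finder will do. In the following sketch (the helper name k_from_zeros is ours), the inputs 70 zeros out of n = 150 leaves and a sample mean of about 1.15 are taken from the insect-count example later in this section, and the resulting k̂ comes out close to 1.

import numpy as np
from scipy.optimize import brentq

def k_from_zeros(xbar, n0, n):
    """Solve n0/n = (1 + xbar/k)**(-k) for k; a root exists when n0/n > exp(-xbar)."""
    g = lambda k: (1 + xbar / k) ** (-k) - n0 / n
    return brentq(g, 1e-6, 1e6)

print(k_from_zeros(xbar=1.15, n0=70, n=150))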
Figure 8.11, from Anscombe (1950), shows the asymptotic efficiencies of the two
methods of estimation of the negative binomial parameters relative to the maximum
likelihood estimate. In the figure, the method of moments is method 1 and the method
based on the number of zeros is method 2. Method 2 is quite efficient when the mean
is small—that is, when there are a large number of zeros. Method 1 becomes more
efficient as k increases.
304
Chapter 8
Estimation of Parameters and Fitting of Probability Distributions
[FIGURE 8.11  Asymptotic efficiencies of estimates of negative binomial parameters, from Anscombe (1950). Curves of equal relative efficiency (50%, 75%, 90%, and 98%) are plotted against the mean m and the exponent k for method 1 (the method of moments) and method 2 (the method based on the number of zeros).]
The maximum likelihood estimate is asymptotically efficient but is somewhat
more difficult to compute. The equations will not be written out here. Bliss and Fisher
(1953) discuss computational methods and give several examples. The maximum
likelihood estimate of m is the sample mean, but that of k is the solution of a nonlinear
equation.
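A generic numerical route, in place of those specialized equations, is to set m̂ equal to the sample mean (as stated above) and maximize the log-likelihood over k with a one-dimensional search. The Python sketch below (the function names nb_loglik and fit_mk are ours, and this is an illustration rather than the procedure of Bliss and Fisher) uses the frequency function given earlier.

import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def nb_loglik(k, x, m):
    """Log-likelihood of the (m, k) parametrization, summed over the sample x."""
    x = np.asarray(x, dtype=float)
    return np.sum(gammaln(k + x) - gammaln(x + 1) - gammaln(k)
                  - k * np.log1p(m / k) + x * np.log(m / (m + k)))

def fit_mk(x):
    """m-hat is the sample mean; k-hat maximizes the profile log-likelihood in k."""
    m_hat = np.mean(x)
    result = minimize_scalar(lambda k: -nb_loglik(k, x, m_hat),
                             bounds=(1e-3, 1e3), method="bounded")
    return m_hat, result.x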
EXAMPLE A  Insect Counts
Let us consider an example from Bliss and Fisher (1953). From each of 6 apple trees
in an orchard that was sprayed, 25 leaves were selected. On each of the leaves, the
number of adult female red mites was counted. Intuitively, we might conclude that
this situation was too heterogeneous for a Poisson model to fit; the rates of infestation
might be different on different trees and at different locations on the same tree.
The following table shows the observed counts and the expected counts from fitting
Poisson and negative binomial distributions. The mle’s for k and m were k̂ = 1.025
and m̂ = 1.146.
Number per Leaf   Observed Count   Poisson Distribution   Negative Binomial Distribution
      0                 70                47.7                      69.5
      1                 38                54.6                      37.6
      2                 17                31.3                      20.1
      3                 10                12.0                      10.7
      4                  9                 3.4                       5.7
      5                  3                  .75                      3.0
      6                  2                  .15                      1.6
      7                  1                  .03                       .85
      8+                 0                  .00                       .95
Casual inspection of this table makes it clear that the Poisson does not fit; there are many more small and large counts observed than are expected for a Poisson distribution. ■
A recursive relation is useful in fitting the negative binomial distribution:
$$p_0 = \left(1 + \frac{m}{k}\right)^{-k}, \qquad p_n = \frac{k + n - 1}{n}\left(\frac{m}{k + m}\right)p_{n-1}$$
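With n = 150 leaves (6 trees × 25 leaves) and the fitted values k̂ = 1.025 and m̂ = 1.146, the recursion reproduces the negative binomial column of the table above; in the short sketch below, the "8+" cell is obtained by lumping the remaining tail probability.

m_hat, k_hat, n_leaves = 1.146, 1.025, 150
p = [(1 + m_hat / k_hat) ** (-k_hat)]                 # p_0
for x in range(1, 8):
    p.append(p[-1] * (k_hat + x - 1) / x * m_hat / (k_hat + m_hat))
expected = [n_leaves * px for px in p]
expected.append(n_leaves * (1 - sum(p)))              # expected count in the "8+" cell
print([round(e, 1) for e in expected])                # compare with 69.5, 37.6, 20.1, ...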
8.8 Sufficiency
This section introduces the concept of sufficiency and some of its theoretical implications. Suppose that X 1 , . . . , X n is a sample from a probability distribution with the
density or frequency function f (x|θ ). The concept of sufficiency arises as an attempt
to answer the following question: Is there a statistic, a function T (X 1 , . . . , X n ), that
contains all the information in the sample about θ ? If so, a reduction of the original
data to this statistic without loss of information is possible. For example, consider
a sequence of independent Bernoulli trials with unknown probability of success, θ .
We may have the intuitive feeling that the total number of successes contains all
the information about θ that there is in the sample, that the order in which the successes occurred, for example, does not give any additional information. The following
definition formalizes this idea.
DEFINITION
A statistic T(X_1, . . . , X_n) is said to be sufficient for θ if the conditional distribution of X_1, . . . , X_n, given T = t, does not depend on θ for any value of t. ■
In other words, given the value of T , which is called a sufficient statistic, we can
gain no more knowledge about θ from knowing more about the probability distribution
of X 1 , . . . , X n . (Formally, we could envision keeping only T and throwing away all
the X i without any loss of information. Informally, and more realistically, this would
make no sense at all. The values of the X i might indicate that the model did not fit or
that something was fishy about the data. What would you think, for example, if you
saw 50 ones followed by 50 zeros in a sequence of supposedly independent Bernoulli
trials?)