Stat 643 Outline
Spring 2010
Steve Vardeman
Iowa State University
April 22, 2010
Abstract
This outline summarizes the main points of the class lectures. Details
(including proofs) must be seen in class notes and reference books.
Contents

1 Characteristic Functions
  1.1 Some Simple Generalities and Basic Facts About Chf's
  1.2 Chf's and Probabilities
  1.3 Chf's and Convergence in Distribution

2 Central Limit Theorems
  2.1 Basic CLTs for Independent Double Arrays
  2.2 Related Limit Results
    2.2.1 CLTs Under Dependence
    2.2.2 CLTs for Sums of Random Numbers of Random Summands
    2.2.3 Rates for CLTs and Generalizations
    2.2.4 Large Deviation Results
    2.2.5 Results About "Stable Laws"
    2.2.6 Functional Central Limit Theorems and Empirical Processes

3 Conditioning
  3.1 Basic Results About Conditional Expectation
  3.2 Conditional Variance
  3.3 Conditional Analogues of Properties of Ordinary Expectation
  3.4 Conditional Probability

4 Transition to Statistics

5 Sufficiency and Related Concepts
  5.1 Sufficiency and the Factorization Theorem
  5.2 Minimal Sufficiency
  5.3 Ancillarity and Completeness

6 Facts About Common Statistical Models
  6.1 Bayes Models
  6.2 Exponential Families
  6.3 Measures of Statistical Information
    6.3.1 Fisher Information
    6.3.2 Kullback-Leibler Information

7 Statistical Decision Theory
  7.1 Basic Framework and Concepts
  7.2 (Finite Dimensional) Geometry of Decision Theory
  7.3 Complete Classes of Decision Rules
  7.4 Sufficiency and Decision Theory
  7.5 Bayes Decision Rules
  7.6 Minimax Decision Rules
  7.7 Invariance/Equivariance Theory
    7.7.1 Location Parameter Estimation and Invariance/Equivariance
    7.7.2 Generalities of Invariance/Equivariance Theory

8 Asymptotics of Likelihood Inference
  8.1 Asymptotics of Likelihood Estimation
    8.1.1 Consistency of Likelihood-Based Estimators
    8.1.2 Asymptotic Normality of Likelihood-Based Estimators
  8.2 Asymptotics of LRT-like Tests and Competitors and Related Confidence Regions
  8.3 Asymptotic Shape of the Loglikelihood and Related Bayesian Asymptotics

9 Optimality in Finite Sample Point Estimation
  9.1 Unbiasedness
  9.2 "Information" Inequalities

10 Optimality in Finite Sample Testing
  10.1 Simple versus Simple Testing
1 Characteristic Functions

Recall from simple complex analysis that $i \equiv \sqrt{-1}$ (we'll have to tell from context whether $i$ is standing for the root of $-1$ or for something else like an index of summation). For some purposes, complex numbers $a + bi$ are just a compact representation of points in $\mathbb{R}^2$. That is, $a + bi$ with "real part" $a$ and "imaginary part" $b$ is equivalent to the ordered pair $(a, b) \in \mathbb{R}^2$. But the complex number rules of operation give the convenience of operating with many of the same rules of computation that we use for real numbers and others that are derivable by simply operating algebraically on $i$. For example, a reciprocal of the complex number $a + bi$ (another complex number that when multiplied times it produces 1) is
\[
(a + bi)^{-1} = \frac{a - bi}{(a + bi)(a - bi)} = \frac{a - bi}{a^2 + b^2}
\]
If $g$ is a complex-valued function, we may think of $g$ in terms of two real-valued functions ($\mathrm{Re}\, g(t)$ and $\mathrm{Im}\, g(t)$) such that
\[
g(t) = \mathrm{Re}\, g(t) + i\, \mathrm{Im}\, g(t)
\]
So, for example, we may use the shorthand
\[
\int g \, d\mu \equiv \int \mathrm{Re}\, g \, d\mu + i \int \mathrm{Im}\, g \, d\mu
\]
Further, in the obvious way, derivatives of complex functions come from derivatives of two real-valued functions
\[
\frac{d}{dt} g(t) = \frac{d}{dt} \mathrm{Re}\, g(t) + i \frac{d}{dt} \mathrm{Im}\, g(t)
\]
A standard complex analysis convention is (for real $\theta$) to define
\[
e^{i\theta} \equiv \cos\theta + i \sin\theta
\]
(You should check that this definition is plausible based on the forms of the Maclaurin series for $\exp t$, $\cos t$, and $\sin t$.) Now it's easy to see that $|e^{i\theta}| = 1$ so that $e^{i\theta}$ is on the unit circle in $\mathbb{C}$ (or $\mathbb{R}^2$). It's also straightforward to verify that for real $\theta_1$ and $\theta_2$
\[
e^{i(\theta_1 + \theta_2)} = e^{i\theta_1} e^{i\theta_2}
\]
Further, the complex-valued function of a single real variable $g(t) = e^{itb}$ has $n$th derivative
\[
g^{(n)}(t) = (ib)^n e^{itb}
\]
1.1 Some Simple Generalities and Basic Facts About Chf's

Definition 1 If $\mu$ is a probability distribution on $\mathbb{R}^k$ (and is the distribution of the random vector $X$) then the characteristic function of $\mu$ (or of $X$) is the mapping $\phi$ from $\mathbb{R}^k$ to $\mathbb{C}$ defined by
\[
\phi(t) = E e^{it'X} = \int e^{it'x} \, d\mu(x) = \int \cos(t'x) \, d\mu(x) + i \int \sin(t'x) \, d\mu(x)
\]
Characteristic functions are guaranteed to exist for all probability distributions. Both A&L Ch 10 and Chung Ch 6 are full of interesting and useful stuff about chf's. Some simple facts are:
1. If $\phi$ is a chf, then $\phi(0) = 1$.

2. If $\phi$ is a chf, then $|\phi(t)| \le 1 \ \forall t$.

3. If $\phi_1, \ldots, \phi_m$ are chf's for $k$-dimensional distributions $\mu_1, \ldots, \mu_m$ and $p_1, \ldots, p_m$ specify a distribution over $\{1, 2, \ldots, m\}$, then $\sum p_l \phi_l$ is the chf of $\sum p_l \mu_l$.

4. If $X_1, \ldots, X_m$ are independent $k$-dimensional random vectors with chf's $\phi_1, \ldots, \phi_m$, then the sum $S_m = \sum_{l=1}^m X_l$ has chf $\phi_{S_m} = \prod_{l=1}^m \phi_l$.

5. If $\phi$ is the chf of real-valued $X$, then for real numbers $a$ and $b$, $Y = a + bX$ has chf $\phi_Y(t) = e^{iat} \phi(bt)$.

6. If $\phi$ is the chf of $X$, then $\bar{\phi}$ (the complex conjugate of $\phi$) is the chf of $-X$.

7. If $X$ and $Y$ are iid with chf $\phi$, then (the symmetrically distributed) random vector $X - Y$ has (real-valued) chf $\phi_{X-Y} = |\phi|^2$.

8. If $\phi$ is the chf of $X$ and $\mu$ is the distribution of $X$, then the (symmetric) distribution $\frac{1}{2}\mu + \frac{1}{2}\mu^-$ (for $\mu^-$ the distribution of $-X$) has (real-valued) chf $\mathrm{Re}\, \phi$.

9. If $\phi_1$ is the chf of the $k_1$-dimensional $X_1$ and $\phi_2$ is the chf of the $k_2$-dimensional $X_2$ and the random vectors $X_1$ and $X_2$ are independent, then for $t_1 \in \mathbb{R}^{k_1}$ and $t_2 \in \mathbb{R}^{k_2}$ the random vector $(X_1', X_2')'$ has characteristic function
\[
\phi(t_1, t_2) = \phi_1(t_1)\, \phi_2(t_2)
\]
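Facts like these can be spot-checked numerically. Below is a small sketch (the Ber($p$) choice and the value $p = 0.3$ are illustrative assumptions, not from the outline) verifying Fact 7: for $X, Y$ iid, the chf of $X - Y$ computed directly from its pmf agrees with $|\phi|^2$ and is real-valued.

```python
import cmath

# Check Fact 7 for X, Y iid Ber(p): chf of X - Y equals |phi|^2.
# p = 0.3 is an arbitrary illustrative choice.
p = 0.3

def phi(t):  # chf of Ber(p): (1-p) + p e^{it}
    return (1 - p) + p * cmath.exp(1j * t)

def phi_diff(t):  # chf of X - Y computed directly from its pmf on {-1, 0, 1}
    probs = {-1: p * (1 - p), 0: p**2 + (1 - p)**2, 1: p * (1 - p)}
    return sum(q * cmath.exp(1j * t * d) for d, q in probs.items())

for t in [-2.0, -0.5, 0.0, 1.0, 3.7]:
    assert abs(phi_diff(t) - abs(phi(t)) ** 2) < 1e-12
    assert abs(phi_diff(t).imag) < 1e-12  # real-valued, as Fact 7 claims
```

The same few lines of algebra also confirm Facts 1 and 2 here: $\phi(0) = 1$ and $|\phi(t)| \le 1$.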
On pages 155-156 Chung provides a nice list of standard distributions and their chf's. Table 1 is derived from Chung's list.

Per Prof. Nordman's chf notes and 10.1.3 of A&L, there is the promise that chf's are continuous.

Theorem 2 If $\phi$ is a chf, then $\phi$ is uniformly continuous on $\mathbb{R}^k$.

Per Nordman's Theorem 7.1, 10.1.4 of A&L, or page 67 of Lamperti, there is the fact that when they exist, moments of random variables can be deduced from derivatives of chf's.
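Any closed-form entry of Table 1 below can be spot-checked directly from the defining sum $Ee^{itX}$. Here is a sketch for the Bi($n, p$) row (the values $n = 7$, $p = 0.4$ are arbitrary assumptions for illustration).

```python
import cmath
from math import comb

# Check the Bi(n, p) chf: summing e^{itx} against the binomial pmf
# should reproduce the closed form ((1-p) + p e^{it})^n.
n, p = 7, 0.4

def chf_by_sum(t):
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) * cmath.exp(1j * t * x)
               for x in range(n + 1))

def chf_closed_form(t):
    return ((1 - p) + p * cmath.exp(1j * t)) ** n

for t in [0.0, 0.3, -1.2, 2.5]:
    assert abs(chf_by_sum(t) - chf_closed_form(t)) < 1e-10
```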
Table 1: Chung's List of Chf's

Distribution | pdf or pmf | Chf
Unit point mass at $a$ | | $e^{iat}$
Mass $\frac{1}{2}$ at each of $\pm 1$ | | $\cos t$
Ber($p$) | $p^x (1-p)^{1-x} I[x \in \{0, 1\}]$ | $(1-p) + p e^{it}$
Bi($n, p$) | $\binom{n}{x} p^x (1-p)^{n-x} I[x \in \{0, 1, \ldots, n\}]$ | $\left((1-p) + p e^{it}\right)^n$
Geo($p$) | $p (1-p)^x I[x \in \{0, 1, \ldots\}]$ | $\frac{p}{1 - (1-p) e^{it}}$
Poisson($\lambda$) | $\frac{e^{-\lambda} \lambda^x}{x!} I[x \in \{0, 1, \ldots\}]$ | $\exp\left(\lambda (e^{it} - 1)\right)$
Exp($\lambda$) (mean $\frac{1}{\lambda}$) | $\lambda e^{-\lambda x} I[x > 0]$ | $\left(1 - \frac{it}{\lambda}\right)^{-1}$
U($-a, a$) | $\frac{1}{2a} I[-a < x < a]$ | $\frac{\sin at}{at}$
Triangular on $[-a, a]$ | $\frac{a - |x|}{a^2} I[-a < x < a]$ | $\frac{2(1 - \cos at)}{a^2 t^2}$
N($\mu, \sigma^2$) | $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ | $\exp\left(i\mu t - \frac{\sigma^2 t^2}{2}\right)$
N($0, 1$) | | $e^{-t^2/2}$
Cauchy with scale $\sigma > 0$ | $\frac{\sigma}{\pi (\sigma^2 + x^2)}$ | $e^{-\sigma |t|}$

Theorem 3 If $\phi$ is the chf of the random variable $X$, and for some positive integer $r$, $E|X|^r < \infty$, then $\phi$ has a continuous $r$th derivative $\phi^{(r)}$ on $\mathbb{R}$ and
\[
\phi^{(r)}(0) = i^r E X^r
\]
As a bit of an aside, an interesting and important additional property of characteristic functions (provided by Bochner's Theorem) is that they are non-negative definite complex-valued functions. That means that any real-valued characteristic function can serve as a lag-correlation function for a second-order stationary stochastic process. That fact gives large classes of models for time series and spatial statistics (that are defined in terms of their lag-correlation functions). Related to this use of characteristic functions are theorems that provide sufficient conditions for a real-valued function to be a characteristic function. One such theorem is on page 191 of Chung and is 10.1.8 of A&L.

Theorem 4 If $f$ on $\mathbb{R}$ is decreasing and convex on $\mathbb{R}^+ = [0, \infty)$ with $f(0) = 1$, $f(t) \ge 0$, and $f(t) = f(-t)$, then $f$ is a chf.

1.2 Chf's and Probabilities

Theorem 3 shows that when they exist, chf's encode the moments of a distribution. In fact, a chf encodes all there is to know about a distribution. One can get all probabilities from a distribution's chf. 10.2.6 of A&L and Theorem 11.1.1 of Rosenthal is next.
Theorem 5 (Fourier Inversion Theorem) If $\phi$ is the chf of a probability distribution $\mu$ on $\mathbb{R}$, then for any real $a < b$
\[
\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \phi(t) \, dt = \mu((a, b)) + \frac{1}{2}\mu(\{a\}) + \frac{1}{2}\mu(\{b\})
\]
Two simple consequences of Theorem 5 are next.
Corollary 6 Under the hypotheses of Theorem 5, if $a$ and $b$ are continuity points of the cdf $F$ corresponding to $\mu$, then
\[
\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \phi(t) \, dt = \mu((a, b))
\]
Corollary 7 (Fourier Uniqueness Theorem) Chf's uniquely determine probability distributions.

Corollary 7 is 10.2.5 of A&L.

We will say that $X$ is symmetrically distributed iff $X$ and $-X$ have the same distribution.

Corollary 8 $X$ is symmetrically distributed iff its characteristic function $\phi$ is real-valued.
There are all sorts of properties of distributions that are reflected in particular properties of characteristic functions. Per Nordman's Theorem 7.2 and Corollary 7.4 (that is 10.2.4 of A&L) there are the following two results, the second of which follows from Corollary 6.

Theorem 9 (Riemann-Lebesgue Lemma) If the distribution of a random variable $X$ has a density $f$ wrt Lebesgue measure on $\mathbb{R}$, then the chf of $X$, $\phi$, satisfies
\[
\phi(t) \to 0 \text{ as } |t| \to \infty
\]
Corollary 10 If the probability distribution $\mu$ on $\mathbb{R}$ has a chf $\phi$ for which $\int |\phi(t)| \, dt < \infty$, then $\mu$ has density wrt Lebesgue measure (pdf) on $\mathbb{R}$
\[
f(x) = \frac{1}{2\pi} \int e^{-itx} \phi(t) \, dt
\]
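The inversion formula of Corollary 10 can be exercised numerically. The sketch below (assumptions: the N($0,1$) case with chf $e^{-t^2/2}$, an integration range truncated at $T = 12$, and step size $0.01$) recovers the standard normal density; only the real part $\cos(tx) e^{-t^2/2}$ is integrated, since the imaginary part cancels by symmetry.

```python
from math import exp, cos, pi, sqrt

# Approximate f(x) = (1/2 pi) * integral of e^{-itx} phi(t) dt for the
# N(0,1) chf phi(t) = e^{-t^2/2}, by a Riemann/trapezoid-style sum.
def inverted_density(x, h=0.01, T=12.0):
    n_steps = round(2 * T / h)
    total = 0.0
    for k in range(n_steps + 1):
        t = -T + k * h
        # real part of e^{-itx} phi(t); imaginary part integrates to 0
        total += cos(t * x) * exp(-t * t / 2) * h
    return total / (2 * pi)

for x in [0.0, 0.5, 1.7]:
    assert abs(inverted_density(x) - exp(-x * x / 2) / sqrt(2 * pi)) < 1e-9
```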
A&L's 10.2.6b is also interesting. It is next.

Theorem 11 For a distribution $\mu$ on $\mathbb{R}$ with chf $\phi$ and real number $a$,
\[
\mu(\{a\}) = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} e^{-ita} \phi(t) \, dt
\]
1.3 Chf's and Convergence in Distribution

The major remaining result concerning characteristic functions is that they are useful for establishing convergence in distribution. That is, for distributions $\mu_n$ with chf's $\phi_n$ and a distribution $\mu$ with chf $\phi$, $\mu_n \stackrel{d}{\to} \mu$ iff $\phi_n(t) \to \phi(t) \ \forall t \in \mathbb{R}^k$. Notice that such a result is a kind of continuity theorem for the mappings that 1) produce chf's from distributions and 2) produce distributions from chf's. (Continuity of a function means that proximity of arguments implies proximity of function values.)

To get to this kind of continuity result, a technical lemma is required. The following is 10.3.3 of A&L and 11.1.13 of Rosenthal.

Lemma 12 Let $\{\mu_n\}$ be a sequence of distributions on $\mathbb{R}$ with corresponding sequence of characteristic functions $\{\phi_n\}$. If there exists $g : \mathbb{R} \to \mathbb{C}$ that is continuous at 0 and $t_0 > 0$ such that
\[
\phi_n(t) \to g(t) \ \forall t \in (-t_0, t_0)
\]
then $\{\mu_n\}$ is tight.
This then enables proof of the important continuity theorem (that is 10.3.4 of A&L).

Theorem 13 (The $\mathbb{R}^1$ version of the Continuity Theorem) If distributions $\{\mu_n\}$ on $\mathbb{R}$ have characteristic functions $\{\phi_n\}$ and distribution $\mu$ has chf $\phi$,
\[
\mu_n \stackrel{d}{\to} \mu \text{ iff } \phi_n(t) \to \phi(t) \ \forall t \in \mathbb{R}^1
\]
(See A&L 10.4.4 for an $\mathbb{R}^k$ version of this.)

A technical lemma provides a simple Central Limit Theorem as a direct consequence of Theorem 13. This concerns the complex number version of a limit from freshman calculus.

Lemma 14 If complex numbers $c_n$ converge to $c$, then
\[
\lim_{n \to \infty} \left(1 + \frac{c_n}{n}\right)^n = e^c
\]
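Lemma 14 is exactly the step that turns the Taylor expansion of a chf into a normal limit. A small numerical illustration (the choices $c = 1.3i$ and $c_n = c + 1/n$ are assumptions for illustration):

```python
import cmath

# Illustrate Lemma 14: (1 + c_n/n)^n -> e^c for complex c_n -> c.
c = 1.3j
errors = []
for n in [10, 100, 10_000, 1_000_000]:
    c_n = c + 1.0 / n                      # a sequence with c_n -> c
    val = (1 + c_n / n) ** n
    errors.append(abs(val - cmath.exp(c)))

# the gap to e^c shrinks (roughly like 1/n) as n grows
assert errors == sorted(errors, reverse=True)
assert errors[-1] < 1e-5
```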
Theorem 15 Suppose the random variables $\{X_n\}$ are iid with $EX_1^2 < \infty$. Let $EX_1 = \mu$ and $\mathrm{Var} X_1 = \sigma^2 < \infty$. Then
\[
\sqrt{n}\left(\bar{X}_n - \mu\right) \stackrel{d}{\to} \mathrm{N}\left(0, \sigma^2\right)
\]
A second very useful consequence of Theorem 13 is the so-called "Cramér-Wold device." This appears as 10.4.5 of A&L.

Theorem 16 (Cramér-Wold Device) Suppose that $\{X_n\}$ is a sequence of $k$-dimensional random vectors. Then
\[
X_n \stackrel{d}{\to} X \text{ iff } a'X_n \stackrel{d}{\to} a'X \ \forall a \in \mathbb{R}^k
\]
2 Central Limit Theorems

The basic idea of Central Limit Theory is that sums of small, not too highly dependent, random variables should be approximately normal. The simplest such theorems are for sequences of summands $X_1, X_2, \ldots$, where for $S_n \equiv \sum_{i=1}^n X_i$ there are sequences of constants $a_n$ and $b_n$ such that
\[
\frac{S_n - a_n}{b_n} \stackrel{d}{\to} \mathrm{N}(0, 1)
\]
A more general and useful framework for CLTs is that of double arrays.

Definition 17 $\{X_{ij}\}$ for $i = 1, 2, \ldots$ and $j = 1, 2, \ldots, n_i$ where $n_i \to \infty$ as $i \to \infty$ will be called a double array of random variables, and in the case where $n_i = i$ we will call the collection a triangular array. Standard notation will be to let
\[
S_i = \sum_{j=1}^{n_i} X_{ij}
\]
It is easy to see that even if one assumes that $\{X_{ij}\}_{j=1,2,\ldots,n_i}$ is a set of independent random variables, there is no guarantee that there are constants $a_i$ and $b_i$ such that
\[
\frac{S_i - a_i}{b_i} \stackrel{d}{\to} \mathrm{N}(0, 1)
\]
2.1 Basic CLTs for Independent Double Arrays

The most famous condition under which Central Limit Theorems are proved is the famous "Lindeberg condition."

Definition 18 A double array of independent random variables $\{X_{ij}\}$ with $EX_{ij} = 0 \ \forall i, j$ for which each $EX_{ij}^2 < \infty$ is said to satisfy the Lindeberg condition provided for $B_i^2 \equiv \mathrm{Var} S_i = \sum_{j=1}^{n_i} EX_{ij}^2$,
\[
\forall \epsilon > 0 \qquad \frac{1}{B_i^2} \sum_{j=1}^{n_i} E X_{ij}^2 I[|X_{ij}| > \epsilon B_i] \to 0 \text{ as } i \to \infty
\]
This condition says that the tails of the $X_{ij}$ contribute less and less to the variance of $S_i$ with increasing $i$. Next is 11.1.1 of A&L.

Theorem 19 (Lindeberg Central Limit Theorem) If $\{X_{ij}\}$ is a double array of independent mean 0 random variables with finite variances that satisfy the Lindeberg condition, then
\[
\frac{S_i}{B_i} \stackrel{d}{\to} \mathrm{N}(0, 1)
\]
A simpler-looking condition under which the Lindeberg condition holds is Liapounov's condition.
Definition 20 A double array of independent random variables $\{X_{ij}\}$ with $EX_{ij} = 0 \ \forall i, j$ for which each $E|X_{ij}|^{2+\delta} < \infty$ for some fixed $\delta > 0$ is said to satisfy the Liapounov condition provided for $B_i^2 \equiv \mathrm{Var} S_i = \sum_{j=1}^{n_i} EX_{ij}^2$
\[
\frac{1}{B_i^{2+\delta}} \sum_{j=1}^{n_i} E|X_{ij}|^{2+\delta} \to 0 \text{ as } i \to \infty
\]
The Liapounov condition says that the ratio of the total absolute $2 + \delta$ moment to the $1 + \frac{\delta}{2}$ power of the total absolute 2nd moment goes to 0. Next is a result on page 348 of A&L.

Theorem 21 If a double array of independent random variables $\{X_{ij}\}$ with $EX_{ij} = 0 \ \forall i, j$ for which each $E|X_{ij}|^{2+\delta} < \infty$ for some fixed $\delta > 0$ satisfies the Liapounov condition, it also satisfies the Lindeberg condition.
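For a concrete triangular array the Liapounov quantity can be written down in closed form. A sketch (the array is an assumption for illustration: $X_{ij}$ iid taking values $\pm\frac{1}{2}$ with probability $\frac{1}{2}$ each, $n_i = i$, $\delta = 1$): here $B_i^2 = i/4$ and $\sum_j E|X_{ij}|^3 = i/8$, so the ratio is $i^{-1/2} \to 0$ and the Lindeberg condition (hence the CLT) follows via Theorem 21.

```python
# Liapounov quantity of Definition 20 with delta = 1 for the assumed
# triangular array X_ij iid on {-1/2, +1/2}, n_i = i.
def liapounov_ratio(i: int, delta: float = 1.0) -> float:
    var_total = i * 0.25        # B_i^2: i summands each with variance 1/4
    abs_moment = i * 0.125      # sum of E|X_ij|^{2+delta} = i * (1/2)^3
    return abs_moment / var_total ** (1 + delta / 2)

assert abs(liapounov_ratio(100) - 0.1) < 1e-12   # 100^{-1/2}
assert liapounov_ratio(10_000) < liapounov_ratio(100)   # decreases in i
```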
2.2 Related Limit Results

The basic central limit theorems can be generalized in a number of directions. A few such directions are mentioned here.

2.2.1 CLTs Under Dependence

One direction is to allow some (not too severe) dependencies in the variables being summed.

Definition 22 A sequence of random variables $X_1, X_2, \ldots$ is said to be $m$-dependent if
\[
(X_1, \ldots, X_j) \text{ is independent of } (X_{j+m}, X_{j+m+1}, \ldots) \ \forall j
\]
On his page 224, Chung gives an $m$-dependent central limit theorem.

Theorem 23 If an $m$-dependent sequence of uniformly bounded random variables $X_1, X_2, \ldots$ is such that, for $\sigma_n \equiv \sqrt{\mathrm{Var} S_n}$,
\[
\frac{\sigma_n}{n^{1/3}} \to \infty
\]
then
\[
\frac{S_n - ES_n}{\sigma_n} \stackrel{d}{\to} \mathrm{N}(0, 1)
\]
The condition on the variance of the partial sum in Theorem 23 says that the variance of $S_n$ can not grow too slowly. Note that an iid model of course has $\sigma_n = \sigma\sqrt{n}$.
2.2.2 CLTs for Sums of Random Numbers of Random Summands

There are limit theorems for random sums of random variables like that of A&L problem 11.11 and Chung page 226.

Theorem 24 Suppose that $X_1, X_2, \ldots$ is a sequence of iid mean 0 variance 1 random variables and $\nu_1, \nu_2, \ldots$ is a sequence of random variables taking only positive integer values such that for some $c > 0$
\[
\frac{\nu_n}{n} \stackrel{P}{\to} c
\]
then
\[
\frac{S_{\nu_n}}{\sqrt{\nu_n}} \stackrel{d}{\to} \mathrm{N}(0, 1)
\]
This kind of theorem is fairly easy to prove in the case that $\{X_j\}$ and $\{\nu_j\}$ are independent (using characteristic functions and conditioning). But this result is harder to prove and much more general.
2.2.3 Rates for CLTs and Generalizations

There are theorems about the rate of convergence of distributions to normal. That is, Theorem 15 says that in the iid finite variance case
\[
P\left[\frac{S_n - n\mu}{\sigma\sqrt{n}} \le t\right] \to \Phi(t) \ \forall t
\]
This together with Polya's Theorem (9.1.4 of A&L) says that
\[
\Delta_n \equiv \sup_t \left| P\left[\frac{S_n - n\mu}{\sigma\sqrt{n}} \le t\right] - \Phi(t) \right| \to 0
\]
and there are rate theorems about $\Delta_n$ like 11.4.1 of A&L.

Theorem 25 (The Berry-Esseen Theorem) Suppose the random variables $\{X_j\}$ are iid with $E|X_1|^3 < \infty$. Let $EX_1 = \mu$ and $\mathrm{Var} X_1 = \sigma^2$. Then there is a $C \in (0, \infty)$ such that
\[
\Delta_n \le C \frac{E|X_1 - \mu|^3}{\sigma^3 \sqrt{n}}
\]
In fact, the $C$ in Theorem 25 is no more than 5.05.

There are also rate theorems for "higher order" approximations called "Edgeworth Expansions." See for example, Theorem 11.4 of A&L, that says that under conditions like those of Theorem 25, with
\[
\Delta_n' \equiv \sup_t \left| P\left[\frac{S_n - n\mu}{\sigma\sqrt{n}} \le t\right] - \left[\Phi(t) - \frac{E(X_1 - \mu)^3}{6\sigma^3\sqrt{n}} \left(t^2 - 1\right)\varphi(t)\right] \right|
\]
(here $\Phi$ and $\varphi$ are the standard normal cdf and pdf),
\[
\Delta_n' = o\left(n^{-1/2}\right) \text{ or even } \Delta_n' = o\left(n^{-1}\right)
\]
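The Berry-Esseen bound of Theorem 25 can be checked against an exactly computable case. A sketch (assumed example, not from the outline: $X_j$ iid Ber($\frac{1}{2}$), so $S_n \sim$ Bi($n, \frac{1}{2}$), $\mu = \frac{1}{2}$, $\sigma = \frac{1}{2}$, $E|X_1 - \mu|^3/\sigma^3 = 1$) comparing the exact Kolmogorov distance $\Delta_n$ with the envelope $C/\sqrt{n}$, using the $C = 5.05$ quoted after the theorem.

```python
from math import comb, erf, sqrt

# Exact Delta_n for standardized Bi(n, 1/2) sums versus the
# Berry-Esseen envelope C / sqrt(n) with C = 5.05.
def std_normal_cdf(t):
    return 0.5 * (1 + erf(t / sqrt(2)))

def kolmogorov_distance(n):
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        t_k = (k - n / 2) / (sqrt(n) / 2)   # standardized jump point
        worst = max(worst, abs(cdf - std_normal_cdf(t_k)))  # just below jump
        cdf += comb(n, k) * 0.5**n
        worst = max(worst, abs(cdf - std_normal_cdf(t_k)))  # at the jump
    return worst

for n in [10, 50, 200]:
    assert kolmogorov_distance(n) <= 5.05 / sqrt(n)
```

The sup over $t$ is attained at the jump points of the binomial cdf, which is why checking just below and at each jump suffices.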
2.2.4 Large Deviation Results

There are theorems about approximating other probabilities related to $S_n$. For $X_1, X_2, \ldots$ a sequence of iid variables with $EX_1 = \mu$ and $t > \mu$, so called "large deviation" results concern approximation of $P\left[\frac{S_n}{n} > t\right]$. Note that typical Central Limit Theorems say that
\[
P\left[\frac{S_n - n\mu}{\sigma\sqrt{n}} > t\right] = P\left[\frac{S_n}{n} > \mu + t\frac{\sigma}{\sqrt{n}}\right] \to 1 - \Phi(t)
\]
and so CLTs provide approximate cumulative probabilities for values of $\frac{S_n}{n}$ differing from $\mu$ by amounts that shrink to 0 at rate $n^{-1/2}$. Large deviation results instead concern probabilities of exceeding a fixed value $t > \mu$. Theorem 11.4.6 of A&L says
\[
\left(P\left[\frac{S_n}{n} > t\right]\right)^{1/n} \to C(t)
\]
for $C(t)$ some function depending upon the distribution of $X_1$.
2.2.5 Results About "Stable Laws"

CLTs say that there are sequences of constants $a_n$ and $b_n$ such that
\[
\frac{S_n - a_n}{b_n} \quad (1)
\]
has a standard normal limit. A sensible question is whether quantities (1) can have any other types of limit distributions.

Definition 26 A non-degenerate distribution $G$ on $\mathbb{R}$ is stable iff for $X_1, X_2, \ldots$ iid $G$, $\exists$ sequences of constants $a_n$ and $b_n$ such that
\[
\frac{S_n - a_n}{b_n} \stackrel{d}{=} X_1
\]
Theorem 27 A non-degenerate distribution $G$ on $\mathbb{R}$ is stable iff $\exists$ some distribution $F$ and sequences of constants $a_n$ and $b_n$ such that with $X_1, X_2, \ldots$ iid $F$
\[
\frac{S_n - a_n}{b_n} \stackrel{d}{\to} G
\]
Section 11.2 of A&L is full of interesting facts about stable laws (including representations of their characteristic functions).
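A chf-based sketch of Definition 26 (assumed example: the standard Cauchy law, whose chf is $e^{-|t|}$ per Table 1): with $a_n = 0$ and $b_n = n$, $S_n/n$ has chf $\left(e^{-|t/n|}\right)^n = e^{-|t|}$, the same as $X_1$, so the Cauchy law is stable, and the usual CLT norming $b_n = \sqrt{n}$ is not the right one here.

```python
from math import exp

# Stability of the standard Cauchy law checked through its chf.
def cauchy_chf(t):
    return exp(-abs(t))

def chf_of_scaled_sum(t, n):
    # chf of S_n / n: product of n copies of the chf evaluated at t/n
    return cauchy_chf(t / n) ** n

for t in [0.0, 0.7, -2.3, 5.0]:
    for n in [2, 10, 1000]:
        assert abs(chf_of_scaled_sum(t, n) - cauchy_chf(t)) < 1e-12
```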
2.2.6 Functional Central Limit Theorems and Empirical Processes

For $C[0,1]$ the space of continuous functions on the unit interval with metric
\[
d(f, g) = \sup_t |f(t) - g(t)|
\]
one may think of putting distributions on the Borel sets in the complete separable metric space $C[0,1]$. Thinking of the argument of an $f \in C[0,1]$ as "time," such a distribution specifies a stochastic process model.

Suppose that $\{X_j\}$ are iid with $EX_1 = 0$ and $\mathrm{Var} X_1 = 1$ and define $W_n(t)$ on $[0,1]$ by
\[
W_n\left(\frac{j}{n}\right) = \frac{1}{\sqrt{n}} S_j \text{ for } j = 0, 1, \ldots, n
\]
and by linear interpolation for $t$ not of the form $j/n$ for any $j = 0, 1, \ldots, n$. $W_n(t)$ is a random element of $C[0,1]$ and there is a probability measure $\mu_n$ such that $W_n \sim \mu_n$. The multivariate CLT can be used to show that $\forall \ 0 \le t_1 \le t_2 \le \cdots \le t_k \le 1$
\[
\left(W_n(t_1), W_n(t_2), \ldots, W_n(t_k)\right)' \stackrel{d}{\to} \mathrm{MVN}_k(0, \Sigma)
\]
where
\[
\Sigma = (\sigma_{ij}) \text{ with } \sigma_{ij} = t_i \wedge t_j \quad (2)
\]
It further turns out that there is a distribution $\mu$ on $C[0,1]$ called the Standard Brownian Motion such that $\forall \ 0 \le t_1 \le t_2 \le \cdots \le t_k \le 1$, $W \sim \mu$ implies that
\[
\left(W(t_1), W(t_2), \ldots, W(t_k)\right)' \sim \mathrm{MVN}_k(0, \Sigma)
\]
where $\Sigma$ has form (2). ($\mu$ is a Gaussian process distribution, meaning all finite dimensional marginals are Gaussian.) In addition, under these assumptions there is a "functional CLT" (11.4.9 of A&L).

Theorem 28 For $\{X_j\}$ iid with $EX_1 = 0$ and $\mathrm{Var} X_1 = 1$, and $W_n(t)$, $\mu_n$, and $\mu$ as defined above
\[
\mu_n \stackrel{d}{\to} \mu
\]
The distribution of $W_n$ converges to a standard Brownian motion. (Convergence in distribution in this context is most naturally convergence of probabilities of sets of functions with boundaries that are null sets under the limiting distribution. See A&L Section 9.3 for convergence in distribution in abstract spaces.)
A related line of argument concerns "empirical processes" and the "Brownian Bridge" distribution on $C[0,1]$ and is discussed in Section 11.4.5 of A&L. That is, suppose that $W \sim \mu$ for $\mu$ on $C[0,1]$ the standard Brownian motion distribution. Let
\[
B(t) = W(t) - tW(1)
\]
Then $B(t)$ has a Gaussian process distribution called the "Brownian Bridge" with
\[
B(0) = B(1) = 0 \text{ and } EB(t) = 0 \ \forall t
\]
and
\[
\mathrm{Cov}(B(t_1), B(t_2)) = t_1 \wedge t_2 - t_1 t_2
\]
This distribution on $C[0,1]$ can be thought of as arising as a limit distribution in the following way. For $\{U_j\}$ iid U$(0,1)$ and (the empirical cumulative distribution function for the $U$'s)
\[
F_n(t) \equiv \frac{1}{n} \sum_{j=1}^n I[U_j \le t] \text{ for } t \in [0,1]
\]
let
\[
Y_n(t) = \text{the linear interpolation of } \sqrt{n}\left(F_n(t) - t\right) \text{ made from values at the jump points of } F_n \text{ (i.e. where } t \text{ is some } U_j\text{)}
\]
Then there is 11.4.1 of A&L.

Theorem 29 With $B(t)$ and $Y_n(t)$ as above
\[
Y_n(t) \stackrel{d}{\to} B(t)
\]
$Y_n(t)$ is the "empirical process" for the $U_j$ and Theorem 29 says that approximations for probabilities for the process (and thence for $F_n(t)$) can be had from use of the Brownian bridge distribution.
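A simulation sketch of Theorem 29 (the choices $n = 500$ uniforms, 200 replications, and the seed are arbitrary assumptions): $\sqrt{n} \sup_t |F_n(t) - t|$ should behave like $\sup_t |B(t)|$ for a Brownian bridge $B$, whose mean is about 0.87, rather than growing or collapsing with $n$.

```python
import random

# Monte Carlo look at the scaled Kolmogorov statistic sqrt(n) * sup|F_n - t|.
def kolmogorov_statistic(n, rng):
    u = sorted(rng.random() for _ in range(n))
    # sup_t |F_n(t) - t| is attained at the jump points of F_n
    d = max(max(abs((j + 1) / n - u[j]), abs(j / n - u[j])) for j in range(n))
    return n ** 0.5 * d

rng = random.Random(643)
stats = [kolmogorov_statistic(500, rng) for _ in range(200)]
mean_k = sum(stats) / len(stats)
# loose bracket around E sup|B(t)| ~ 0.87; a sharp test would need many more reps
assert 0.6 < mean_k < 1.2
```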
3 Conditioning

One needs properly general notions of "conditioning" in order to deal coherently with general statistical models. Some thought about even fairly simple not-completely-discrete problems leaves one convinced that ideas of conditional expectation must focus on conditioning $\sigma$-algebras, not conditioning variables (a conditioning variable generates a corresponding $\sigma$-algebra, but it is the latter that is intrinsic, not the former, as many slightly different variables can generate the same $\sigma$-algebra). Here is the standard definition.

Definition 30 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{G} \subset \mathcal{F}$ be a sub $\sigma$-algebra. Let $X$ be a real-valued measurable function (a random variable) on $\Omega$ with $E|X| = \int |X(\omega)| \, dP(\omega) < \infty$. Then the conditional expectation of $X$ given $\mathcal{G}$, written $E(X|\mathcal{G})$, is a random variable $\Omega \to \mathbb{R}$ with the properties that

1. $E(X|\mathcal{G})$ is $\mathcal{G}$-measurable, and

2. for any $A \in \mathcal{G}$
\[
\int_A E(X|\mathcal{G}) \, dP = \int_A X \, dP
\]
In other notation, the basic requirement 2. of Definition 30 is that $E I_A E(X|\mathcal{G}) = E I_A X$ for all $A \in \mathcal{G}$. The existence of $E(X|\mathcal{G})$ satisfying Definition 30 follows from the Radon-Nikodym Theorem.

The case $X = I_B$ for $B \in \mathcal{F}$ gets its own notation.

Definition 31 For $B \in \mathcal{F}$ and $\mathcal{G} \subset \mathcal{F}$ a sub $\sigma$-algebra, the conditional probability of $B$ given $\mathcal{G}$ is
\[
P(B|\mathcal{G}) \equiv E(I_B|\mathcal{G})
\]
Several comments about the notion of conditional expectation are in order. First, $E(X|\mathcal{G})$ is clearly unique only up to sets of $P$ measure 0. Second, the case where $\mathcal{G} = \sigma(W)$ for some random variable $W$ is
\[
E(X|W) \equiv E(X|\sigma(W))
\]
And third, A&L have what is a very clever and possibly more natural (at least for statisticians) definition of conditioning that applies to $X$ with second moments. Namely, $E(X|\mathcal{G})$ is the projection of $X$ onto the linear subspace of $L_2(\Omega, \mathcal{F}, P)$ consisting of $\mathcal{G}$-measurable functions. (More on this later.)

If $\mathcal{G}$ is the trivial sub $\sigma$-algebra, i.e. $\mathcal{G} = \{\emptyset, \Omega\}$, then $E(X|\mathcal{G})$ must be constant and thus $E(X|\mathcal{G}) = EX$. On the other hand, if $\mathcal{G} = \mathcal{F}$, then it's easy to argue that $E(X|\mathcal{G}) = X$. Intuitively, the smaller is $\mathcal{G}$, the "less random" is $E(X|\mathcal{G})$, i.e. the "more averaging has gone on in its creation," i.e. the fewer values it takes on.
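On a finite $\Omega$ the "averaging" intuition is literal. A toy sketch (the example is an assumption for illustration: $\Omega = \{0, \ldots, 5\}$ with uniform $P$, $\mathcal{G}$ generated by the partition $\{\{0,1,2\}, \{3,4,5\}\}$, $X(\omega) = \omega^2$): $E(X|\mathcal{G})$ is constant on each atom (property 1 of Definition 30) and integrates like $X$ over every $A \in \mathcal{G}$ (property 2).

```python
# Conditional expectation on a finite probability space as atom-by-atom averaging.
omega = list(range(6))
prob = {w: 1 / 6 for w in omega}
x = {w: float(w * w) for w in omega}       # X(omega) = omega^2
partition = [{0, 1, 2}, {3, 4, 5}]         # atoms generating the sub sigma-algebra G

def cond_exp(w):
    atom = next(a for a in partition if w in a)
    p_atom = sum(prob[v] for v in atom)
    return sum(x[v] * prob[v] for v in atom) / p_atom   # average of X over the atom

# property 2 of Definition 30: integrals over each atom (hence each A in G) agree
for atom in partition:
    lhs = sum(cond_exp(w) * prob[w] for w in atom)
    rhs = sum(x[w] * prob[w] for w in atom)
    assert abs(lhs - rhs) < 1e-12

assert abs(cond_exp(0) - (0 + 1 + 4) / 3) < 1e-12   # constant on {0,1,2}
```

Shrinking the partition to the single atom $\Omega$ collapses `cond_exp` to $EX$, matching the trivial-$\mathcal{G}$ remark above.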
3.1 Basic Results About Conditional Expectation

Some basic facts about conditional expectation are collected in the next theorem.

Theorem 32 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{G} \subset \mathcal{F}$ be a sub $\sigma$-algebra.

1. If $c \in \mathbb{R}$, $E(c|\mathcal{G}) = c$ a.s.

2. If $X$ has a first moment and is $\mathcal{G}$-measurable, then $E(X|\mathcal{G}) = X$ a.s.

3. If $X$ has a first moment and $X \ge 0$ a.s., then $E(X|\mathcal{G}) \ge 0$ a.s.

4. If $X_1, \ldots, X_k$ have first moments and $c_1, \ldots, c_k$ are real, then
\[
E\left(\sum_{i=1}^k c_i X_i \,\Big|\, \mathcal{G}\right) = \sum_{i=1}^k c_i E(X_i|\mathcal{G}) \text{ a.s.}
\]
5. If $X$ has a first moment then $|E(X|\mathcal{G})| \le E(|X| \,|\, \mathcal{G})$ a.s.
Next is 12.1.5ii of A&L and 13.2.6 of Rosenthal.

Theorem 33 Suppose that $X$ and $Y$ are random variables and $\mathcal{G} \subset \mathcal{F}$ is a sub $\sigma$-algebra. If $Y$ and $XY$ have first moments and $X$ is $\mathcal{G}$-measurable, then
\[
E(XY|\mathcal{G}) = X E(Y|\mathcal{G}) \text{ a.s.}
\]
(A $\mathcal{G}$-measurable $X$ can be "factored out" of a conditional expectation.)

Then there is a standard result from 12.1.5i of A&L or 13.2.7 of Rosenthal.

Theorem 34 Let $X$ be a random variable with first moment and $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \mathcal{F}$ be two sub $\sigma$-algebras. Then
\[
E(E(X|\mathcal{G}_2)|\mathcal{G}_1) = E(X|\mathcal{G}_1) \text{ a.s.}
\]
And there is a natural result about independence.

Theorem 35 Suppose that $X$ has a first moment and $\mathcal{G} \subset \mathcal{F}$ is a sub $\sigma$-algebra. If $\mathcal{G}$ and $\sigma(X)$ are independent (in the sense that $A \in \mathcal{G}$ and $B \in \sigma(X)$ implies that $P(A \cap B) = P(A)P(B)$) then
\[
E(X|\mathcal{G}) = EX \text{ a.s.}
\]
3.2 Conditional Variance

Definition 31 is one important special case of conditional expectation. Another is conditional variance.

Definition 36 If $EX^2 < \infty$ and $\mathcal{G} \subset \mathcal{F}$ is a sub $\sigma$-algebra, define the conditional variance
\[
\mathrm{Var}(X|\mathcal{G}) = E\left(X^2|\mathcal{G}\right) - \left(E(X|\mathcal{G})\right)^2
\]
Theorem 12.2.6 of A&L is next.

Theorem 37 If $EX^2 < \infty$ and $\mathcal{G} \subset \mathcal{F}$ is a sub $\sigma$-algebra, then
\[
\mathrm{Var} X = \mathrm{Var}\left(E(X|\mathcal{G})\right) + E\left(\mathrm{Var}(X|\mathcal{G})\right)
\]
The next result shows that the A&L definition of conditional expectation is appropriate for $X \in L_2(\Omega, \mathcal{F}, P)$.

Theorem 38 If $EX^2 < \infty$ and $\mathcal{G} \subset \mathcal{F}$ is a sub $\sigma$-algebra, then $E(X|\mathcal{G})$ minimizes
\[
E(X - Y)^2
\]
over choices of $\mathcal{G}$-measurable random variables $Y$.

(So under the hypotheses of Theorem 38, $E(X|\mathcal{G})$ is indeed the projection of $X$ onto the subspace of $L_2(\Omega, \mathcal{F}, P)$ consisting of $\mathcal{G}$-measurable random variables.)
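The decomposition in Theorem 37 can be verified mechanically on a finite space. A sketch (the example is an assumption for illustration: $\Omega = \{0, \ldots, 5\}$, uniform $P$, $\mathcal{G}$ generated by the partition $\{\{0,1\}, \{2,3,4\}, \{5\}\}$):

```python
# Verify Var X = Var(E(X|G)) + E(Var(X|G)) on a finite probability space.
omega = list(range(6))
p = 1 / 6
x = [1.0, 4.0, 2.0, 2.0, 8.0, 3.0]
partition = [[0, 1], [2, 3, 4], [5]]

ex = sum(x[w] * p for w in omega)
var_x = sum((x[w] - ex) ** 2 * p for w in omega)

var_of_cond_mean = 0.0
mean_of_cond_var = 0.0
for atom in partition:
    p_atom = len(atom) * p
    m = sum(x[w] for w in atom) / len(atom)              # E(X|G) on this atom
    v = sum((x[w] - m) ** 2 for w in atom) / len(atom)   # Var(X|G) on this atom
    var_of_cond_mean += (m - ex) ** 2 * p_atom
    mean_of_cond_var += v * p_atom

assert abs(var_x - (var_of_cond_mean + mean_of_cond_var)) < 1e-12
```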
3.3 Conditional Analogues of Properties of Ordinary Expectation

Many properties of ordinary expectation have analogues for conditional expectations. For one, there is a conditional version of Jensen's inequality (that is 12.2.4 of A&L).

Theorem 39 For $-\infty \le a < b \le \infty$ suppose that $X$ on $(\Omega, \mathcal{F}, P)$ has $P[a < X < b] = 1$ and that the function $\phi$ is convex on $(a, b)$ with $E|\phi(X)| < \infty$. Then for any sub $\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$,
\[
\phi\left(E(X|\mathcal{G})\right) \le E\left(\phi(X)|\mathcal{G}\right) \text{ a.s.}
\]
And there are convergence theorems for conditional expectations like a conditional monotone convergence theorem (12.2.1 of A&L), a conditional version of the Fatou lemma (12.2.2 of A&L), and a conditional dominated convergence theorem (12.2.3 of A&L).

Theorem 40 (Monotone Convergence Theorem for Conditional Expectation) If $\{X_n\}$ is a sequence of non-negative random variables on $(\Omega, \mathcal{F}, P)$ with
\[
X_n \le X_{n+1} \text{ a.s. and } X_n \to X \text{ a.s.}
\]
then
\[
E(X_n|\mathcal{G}) \to E(X|\mathcal{G}) \text{ a.s.}
\]
Theorem 41 (Fatou's Lemma for Conditional Expectation) If $\{X_n\}$ is a sequence of non-negative random variables on $(\Omega, \mathcal{F}, P)$, then
\[
\liminf E(X_n|\mathcal{G}) \ge E\left(\liminf X_n|\mathcal{G}\right) \text{ a.s.}
\]
Theorem 42 (Dominated Convergence Theorem for Conditional Expectation) Suppose $\{X_n\}$ is a sequence of random variables on $(\Omega, \mathcal{F}, P)$ and that $X_n \to X$ a.s. If there $\exists$ a random variable $Z$ such that $|X_n| \le Z$ a.s. for all $n$ and $E|Z| < \infty$, then
\[
E(X_n|\mathcal{G}) \to E(X|\mathcal{G}) \text{ a.s.}
\]
For statistical applications, it is Theorem 39 that is the most important of these analogues of standard results for ordinary expectations.
3.4 Conditional Probability

Consider the implications of Definition 31. It's possible to argue that for every countable set of disjoint $A_i \in \mathcal{F}$,
\[
\sum_i P(A_i|\mathcal{G}) = P(\cup A_i|\mathcal{G}) \text{ a.s.} \quad (3)
\]
i.e. that $\exists N_{\{A_i\}}$ such that $P\left(N_{\{A_i\}}\right) = 0$ and
\[
\sum_i P(A_i|\mathcal{G})(\omega) = P(\cup A_i|\mathcal{G})(\omega) \ \forall \omega \in \Omega \setminus N_{\{A_i\}}
\]
It's tempting to expect that this means that except for a set of $P$ measure 0, I may think of
\[
P(\,\cdot\,|\mathcal{G})(\omega)
\]
as a probability measure. But (3) does not in general guarantee this. The problem is that there are typically uncountably many sets $\{A_i\}$ and $\cup N_{\{A_i\}}$ need not be $\mathcal{F}$-measurable, let alone have $P$ measure 0. So the fact that (3) holds does not guarantee the existence of an object satisfying the next definition.
Definition 43 For $(\Omega, \mathcal{F}, P)$ a probability space and $\mathcal{G} \subset \mathcal{F}$ and $\mathcal{D} \subset \mathcal{F}$ two sub $\sigma$-algebras, a function
\[
\nu : \mathcal{D} \times \Omega \to [0, 1]
\]
is called a regular conditional probability on $\mathcal{D}$ given $\mathcal{G}$ provided

1. $\forall A \in \mathcal{D}$, $\nu(A, \cdot) = P(A|\mathcal{G})(\cdot)$ a.s., and

2. $\forall \omega \in \Omega$, $\nu(\cdot, \omega)$ is a probability measure on $(\Omega, \mathcal{D})$.

A regular conditional probability need not exist. (See the example on page 81 of Breiman's Probability.) When a regular conditional probability does exist, conditional expectations can be computed using it. That is, there is the following.

Lemma 44 If $\nu$ is a regular conditional probability on $\mathcal{D}$ given $\mathcal{G}$ and $Y$ is $\mathcal{D}$-measurable, then
\[
E(Y|\mathcal{G})(\omega) = \int Y \, d\nu(\cdot, \omega) \text{ for } P \text{ almost all } \omega
\]
In the case where $\mathcal{D} = \sigma(X)$, a regular conditional probability on $\mathcal{D}$ given $\mathcal{G}$ is called a regular conditional distribution of $X$ given $\mathcal{G}$. Provided that $X$ takes values in a nice enough space (like $(\mathbb{R}^k, \mathcal{B}^k)$ or even $(\mathbb{R}^\infty, \mathcal{B}^\infty)$) it's possible to conclude that a regular conditional distribution of $X$ given $\mathcal{G}$ exists. This is 12.3.1 of A&L.
Theorem 45 Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space and $(\mathcal{X}, \mathcal{B})$ is a Polish space with Borel $\sigma$-algebra. Let $X : \Omega \to \mathcal{X}$ be measurable. Then for any sub $\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$ $\exists$ a regular conditional distribution of $X$ given $\mathcal{G}$ (a regular conditional probability on $\sigma(X)$ given $\mathcal{G}$).

Under the conditions of Theorem 45, for any measurable $h : \mathcal{X} \to \mathbb{R}$ with $E|h(X)| < \infty$, for $\nu$ the conditional distribution of $X$ given $\mathcal{G}$ guaranteed to exist by the theorem
\[
E(h(X)|\mathcal{G})(\omega) = \int h(X) \, d\nu(\cdot, \omega) \text{ for } P \text{ almost all } \omega
\]
and this obviously then works for indicator functions $h$. So for $A \in \mathcal{B}$
\[
P\left(X^{-1}(A)|\mathcal{G}\right)(\omega) = E\left(I_{X^{-1}(A)}|\mathcal{G}\right)(\omega) = E(I_A(X)|\mathcal{G})(\omega) = \nu\left(\{\omega' \,|\, X(\omega') \in A\}, \omega\right) = \nu\left(X^{-1}(A), \omega\right) \text{ for } P \text{ almost all } \omega
\]
making use of Lemma 44 for the switch between conditional expectation and values of $\nu$.
A final topic to be considered in the discussion of conditional probability is conditional independence.

Definition 46 Suppose $\mathcal{G}$ is a sub $\sigma$-algebra of $\mathcal{F}$ and for each $l$ belonging to some index set $\Lambda$, $\mathcal{F}_l$ is a subset of $\mathcal{F}$. We will say that $\{\mathcal{F}_l\}_{l \in \Lambda}$ is conditionally independent given $\mathcal{G}$ if for any positive integer $k$ and $l_1, l_2, \ldots, l_k \in \Lambda$
\[
P(A_1 \cap A_2 \cap \cdots \cap A_k | \mathcal{G}) = \prod_{i=1}^k P(A_i|\mathcal{G}) \text{ a.s.}
\]
for all choices of $A_i \in \mathcal{F}_{l_i}$. A collection of random variables $\{X_l\}_{l \in \Lambda}$ is called conditionally independent given $\mathcal{G}$ if $\{\sigma(X_l)\}_{l \in \Lambda}$ is conditionally independent given $\mathcal{G}$.

Next is the result of Problem 12.18 of A&L.

Proposition 47 Suppose $\mathcal{G}_1$, $\mathcal{G}_2$, and $\mathcal{G}_3$ are three sub $\sigma$-algebras of $\mathcal{F}$. $\mathcal{G}_1$ and $\mathcal{G}_2$ are conditionally independent given $\mathcal{G}_3$ if and only if
\[
P(A|\sigma(\mathcal{G}_2 \cup \mathcal{G}_3)) = P(A|\mathcal{G}_3) \ \forall A \in \mathcal{G}_1
\]
And there is the result of Problem 12.19 of A&L.

Proposition 48 Suppose $\mathcal{G}_1$, $\mathcal{G}_2$, and $\mathcal{G}_3$ are three sub $\sigma$-algebras of $\mathcal{F}$. If $\sigma(\mathcal{G}_1 \cup \mathcal{G}_3)$ is independent of $\mathcal{G}_2$, then $\mathcal{G}_1$ and $\mathcal{G}_2$ are conditionally independent given $\mathcal{G}_3$.
4
Transition to Statistics
In the usual probability theory set-up, a (measurable) random observable X
maps ( ; F; P ) to some measure space (X ; B) and thereby induces a probability
measure P X on (X ; B) (the distribution of the random observable). In statistical theory, the space ( ; F; P ) largely disappears from consideration, being
replaced by consideration of X ; B; P X , and there is not just a single P X to
be considered but rather a whole family of such (giving di¤erent models for the
observable) indexed by a parameter belonging to some index set . It is
completely standard to abuse notation somewhat and replace the P X notation
18
with just P notation for the distribution of X, and thus consider a set of models
for X
P fP g 2
We’ll use a subscript to indicate computations made under the P model. So,
for example,
Z
E g (X) =
gdP
X
The object is then to use X to learn about , some function of , or perhaps
some as yet unobserved Y that has a distribution depending upon : The reason
to have learned measure-theoretic probability is that it now allows a logically
coherent approach to statistical analysis in models more general than the completely discrete or completely continuous models of Stat 542/543.
Henceforth, we will suppose that 9 a -…nite measure such that
P
8 2
(we will say that P is dominated by ). The Radon-Nikodym theorem then
promises that for each 2 ; 9 f : X ! R such that
Z
P (B) =
f d 8B 2 B
B
dP
is the R-N derivative of P with respect to ). f basically tells one
d
how to go from probabilities for one model for X to probabilities for another.
Then, for observable X, f (X) is a random function of that becomes the
likelihood function in a properly general statistical set-up.
We note for future reference that a Bayesian approach to inference will add to these measure-theoretic model assumptions an assumption that θ has some distribution G (on Θ, after Θ is given some σ-algebra). Notice then that X|θ ~ P_θ together with θ ~ G presumably produces some (single) joint distribution on X × Θ that in turn typically can be thought of as producing some regular conditional distribution of θ given σ(X) in the style of Section 3.4 (that serves as a posterior distribution of θ|X).
5 Sufficiency and Related Concepts

5.1 Sufficiency and the Factorization Theorem
Roughly, T(X) is sufficient for P = {P_θ}_{θ∈Θ} (or for θ) provided "the conditional distribution of X given T(X) is the same for each θ," and this ought to be equivalent to the likelihood function being essentially the same for any two possible values of X, say x and x′, with T(x) = T(x′). This equivalence is the "factorization theorem."
A formal development that will produce the factorization theorem begins with a statistic T, a measurable map

T : (X, B) → (T, F)

and the sub-σ-algebra of B,

B0 = B(T) = T^{-1}(F) = {T^{-1}(F) | F ∈ F} = σ(T)
Definition 49 T (or B0) is sufficient for P (or for θ) if for every B ∈ B there exists a B0-measurable function Y such that ∀θ ∈ Θ

Y = P_θ(B|B0) a.s. P_θ

This general definition of sufficiency is phrased in terms of conditional probabilities (conditional expectations of indicators) and says that for each B, there is a single Y that works as conditional probability for B given B0 as one looks across all θ. In the event that there is a single regular conditional distribution of X given B0 = B(T) = σ(T) that works for all θ ∈ Θ, Definition 49 says that T is sufficient. (But the definition doesn't require the existence of such a regular conditional distribution, only the existence of the "θ-free" conditional expectations one B at a time.)
Theorem 50 (Halmos-Savage Factorization Theorem) Suppose that P ≪ μ where μ is σ-finite. Then T is sufficient for P iff there exist a nonnegative B-measurable function h and nonnegative F-measurable functions g_θ such that ∀θ

dP_θ/dμ (x) = f_θ(x) = g_θ(T(x)) h(x) a.s. μ

Theorem 50 says that T is sufficient iff x and x′ with T(x) = T(x′) have likelihood functions that are proportional.
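As a concrete numeric illustration (a Python sketch, not part of the original notes): for an iid Poisson(θ) sample, f_θ(x) = exp(−nθ) θ^{Σx_i} Π(1/x_i!) factors as g_θ(T(x)) h(x) with T(x) = Σx_i, so two samples with the same sum have proportional likelihood functions, and the log likelihood ratio between them is constant in θ.

```python
import math

def poisson_loglik(theta, x):
    """Log likelihood of an iid Poisson(theta) sample x."""
    n, s = len(x), sum(x)
    return -n * theta + s * math.log(theta) - sum(math.lgamma(xi + 1) for xi in x)

def loglik_ratio(theta, x, xprime):
    return poisson_loglik(theta, x) - poisson_loglik(theta, xprime)

# x and xprime share the same value of T = sum(x), so their likelihoods are
# proportional: the log ratio equals log(h(x)/h(x')) for every theta.
x, xprime = [0, 2, 4], [2, 2, 2]
ratios = [loglik_ratio(t, x, xprime) for t in (0.5, 1.0, 3.0, 7.0)]
```

The θ values and samples here are arbitrary; any pair of samples with equal sums behaves the same way.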
Theorem 50 is not so easy to prove. A series of lemmas is needed. The first is Lemma 1.2 of Shao and the hint of Problem 12.1 of A&L. (Below B^k is the Borel σ-algebra on R^k.)

Lemma 51 (Lehmann's Theorem) Let T : X → T be measurable and ψ : (X, B) → (R^k, B^k) be B(T)-measurable. Then there exists an F-measurable function φ : (T, F) → (R^k, B^k) such that ψ(x) = φ(T(x)).
Next is Lemma 2.1 of Shao.

Lemma 52 P is dominated by a σ-finite measure iff P is dominated by a probability measure of the form

λ = Σ_{i=1}^∞ c_i P_{θ_i}

for some countable set {P_{θ_i}} ⊂ P and set of c_i ≥ 0 with Σ_{i=1}^∞ c_i = 1.
Lemma 53 Suppose that P is dominated by a σ-finite measure and T : (X, B) → (T, F). Then T is sufficient iff there exist nonnegative F-measurable functions g_θ such that ∀θ (and with λ as in Lemma 52 fixed)

dP_θ/dλ (x) = g_θ(T(x)) a.s. λ
We'll take Lemmas 51 and 52 as given. An outline of the proof of Lemma 53 using Lemmas 51 and 52 is

(⇒) For P′_θ and λ′ the restrictions of P_θ and λ to B0, dP′_θ/dλ′ = g_θ ∘ T by Lehmann's Theorem. Then show g_θ ∘ T = dP_θ/dλ using facts about P_θ(B|B0).

(⇐) Show E_λ(I_B|B0) = E_θ(I_B|B0) using the representation of dP_θ/dλ.
Then, an outline of the proof of Theorem 50 using Lemmas 51-53 is

(⇒) dP_θ/dμ "=" (dP_θ/dλ)(dλ/dμ) plus a.e.'s, for dP_θ/dλ = g_θ ∘ T (Lemma 53) and dλ/dμ = h.

(⇐) dλ/dμ "=" h Σ_{i=1}^∞ c_i (g_{θ_i} ∘ T) plus a.e.'s. Together with dP_θ/dμ = (g_θ ∘ T) h, this implies that dP_θ/dλ is of the "right" form.

5.2 Minimal Sufficiency
A sufficient statistic T(X) is a reduction of data X that in some sense still carries all the information available about P. A sensible question is what form the "most compact" sufficient reduction of X takes. This is the notion of minimal sufficiency.

Definition 54 A sufficient statistic T : (X, B) → (T, F) is minimal sufficient for P (or for θ) provided for every sufficient statistic S : (X, B) → (S, G) there exists a function U : S → T such that

T = U ∘ S a.s. P

(This definition is equivalent to, or close to equivalent to, B(T) ⊂ B(S).)

The fundamental qualitative insight here is that "the 'shape' of the log-likelihood" is minimal sufficient. This is phrased in technical terms in the next result.
Theorem 55 Suppose that P is dominated by a σ-finite measure μ, and that T : (X, B) → (T, F) is sufficient for P. Suppose further that there exist versions of the R-N derivatives

dP_θ/dμ (x) = f_θ(x)

such that the existence of k(x, y) > 0 with f_θ(y) = f_θ(x) k(x, y) ∀θ implies that T(x) = T(y). Then T is minimal sufficient.
A more technical and perhaps less illuminating line of argument for establishing minimal sufficiency is based on the next two results. (Shao's Theorem 2.3ii is the k = 1 version of Theorem 56.)

Theorem 56 Suppose that P = {P_i}_{i=0}^k is finite and let f_i be the R-N derivative of P_i with respect to some dominating σ-finite measure μ. Suppose that all k + 1 densities f_i are positive everywhere on X. Then

T(X) = ( f_1(X)/f_0(X), f_2(X)/f_0(X), ..., f_k(X)/f_0(X) )

is minimal sufficient for P.
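A small Python sketch of Theorem 56 (the family chosen here is an arbitrary illustration, not from the notes): for a finite family of three Poisson distributions, all positive on the common support, the vector of likelihood ratios T(x) = (f_1(x)/f_0(x), f_2(x)/f_0(x)) can be computed directly; in this example T is 1-1 in x, so for a single observation the minimal sufficient reduction is equivalent to x itself.

```python
import math

def poisson_pmf(lam, x):
    # Positive for every x in {0, 1, 2, ...}, as Theorem 56 requires.
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

# A finite family P = {P_0, P_1, P_2} of Poisson distributions.
lams = [1.0, 2.0, 5.0]

def T(x):
    """Likelihood-ratio statistic (f_1/f_0, f_2/f_0) at the point x."""
    f0 = poisson_pmf(lams[0], x)
    return tuple(poisson_pmf(lam, x) / f0 for lam in lams[1:])

# T separates points whose likelihood functions are not proportional;
# here T(x) = (e^{-1} 2^x, e^{-4} 5^x) is strictly increasing, hence 1-1.
values = [T(x) for x in range(10)]
```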
(The positivity assumption in Theorem 56 isn't really necessary, but proof without that assumption is fairly unpleasant.) The next result is a version of Shao's Theorem 2.3i.

Theorem 57 Suppose that P is a family of distributions and P0 ⊂ P dominates P (P(B) = 0 ∀P ∈ P0 implies that P(B) = 0 ∀P ∈ P). If T is sufficient for P and minimal sufficient for P0, then it is minimal sufficient for P.
5.3 Ancillarity and Completeness
Sufficiency of T(X) means roughly that "what is left over in X beyond the information in T(X) is of no additional use in inferences about θ." This notion is related to the concepts of ancillarity and completeness.

Definition 58 A statistic T : (X, B) → (T, F) is said to be ancillary for P = {P_θ}_{θ∈Θ} if the distribution of T(X) (i.e. the measure P_θ^T on (T, F) defined by P_θ^T(F) = P_θ(T^{-1}(F))) does not depend upon θ (i.e. is the same for all θ).

Definition 59 A statistic T : (X, B) → (R^1, B^1) is said to be 1st order ancillary for P = {P_θ}_{θ∈Θ} if E_θ T(X) does not depend upon θ (i.e. is the same for all θ).

Two slightly different versions of the notion of completeness are next. They are both stated in the framework of Definition 58. That is, suppose T : (X, B) → (T, F), define P_θ^T on (T, F) by P_θ^T(F) = P_θ(T^{-1}(F)), and let P^T = {P_θ^T}_{θ∈Θ}.
Definition 60 (A) P^T (or T) is complete (or boundedly complete) for P or θ if

i) U : (X, B) → (R^1, B^1) is B(T)-measurable (and bounded) and
ii) E_θ U(X) = 0 ∀θ

imply that U = 0 a.s. P_θ ∀θ.

(B) P^T (or T) is complete (or boundedly complete) for P or θ if

i) h : (T, F) → (R^1, B^1) is F-measurable (and bounded) and
ii) E_θ h(T(X)) = 0 ∀θ

imply that h ∘ T = 0 a.s. P_θ ∀θ.
By virtue of Lehmann's Theorem, the two versions of completeness in Definition 60 are equivalent. They both say that there is no nontrivial (bounded) function of T that is an unbiased estimator of 0. This is equivalent to saying that there is no nontrivial (bounded) function of T that is first order ancillary. Bounded completeness is obviously a slightly weaker condition than completeness. Two simple results regarding completeness and sub-families of models are next.
Proposition 61 If T is complete for P = {P_θ}_{θ∈Θ} and Θ′ ⊂ Θ, T need not be complete for P′ = {P_θ}_{θ∈Θ′}.

Theorem 62 If Θ″ ⊂ Θ and Θ″ dominates Θ in the sense that P_θ(B) = 0 ∀θ ∈ Θ″ implies that P_θ(B) = 0 ∀θ ∈ Θ, then T (boundedly) complete for P″ = {P_θ}_{θ∈Θ″} implies that T is (boundedly) complete for P = {P_θ}_{θ∈Θ}.
Completeness of a statistic guarantees that there is no part of it (no nontrivial function of it) that is somehow irrelevant in and of itself to inference about θ. That is reminiscent of minimal sufficiency. An important connection between the two concepts is next. The proof of part i) of the following is on pages 95-96 of Schervish. (Silvey page 30 has a heuristic argument for part ii).)

Theorem 63 (Bahadur's Theorem) Suppose that T : (X, B) → (T, F) is sufficient and boundedly complete. Then

i) if (T, F) = (R^k, B^k) then T is minimal sufficient, and
ii) if there is any minimal sufficient statistic, then T is minimal sufficient.

Another standard result involving completeness, sufficiency, and ancillarity is the following (from page 99 of Schervish).

Theorem 64 (Basu's Theorem) If T is a boundedly complete sufficient statistic and U is ancillary, then for all θ the random variables U and T are independent (according to P_θ).
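A Monte Carlo sketch of Basu's Theorem (a standard example, not from the notes): in the N(θ, 1) location model, the sample mean is a boundedly complete sufficient statistic and the sample variance S² is ancillary, so the two are independent under every θ; their sample correlation over many replications should be near 0.

```python
import math
import random

random.seed(0)
theta, n, reps = 3.0, 5, 20000
xbars, s2s = [], []
for _ in range(reps):
    x = [random.gauss(theta, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    xbars.append(xbar)
    s2s.append(s2)

# Sample correlation of (Xbar, S^2) pairs; independence predicts ~0.
mx, ms = sum(xbars) / reps, sum(s2s) / reps
cov = sum((a - mx) * (b - ms) for a, b in zip(xbars, s2s)) / reps
vx = sum((a - mx) ** 2 for a in xbars) / reps
vs = sum((b - ms) ** 2 for b in s2s) / reps
corr = cov / math.sqrt(vx * vs)
```

The specific θ, n, and replication count are arbitrary; any choice illustrates the same independence.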
6 Facts About Common Statistical Models

6.1 Bayes Models
The "Bayes" approach to statistical problems is to add model assumptions to
the basic structure introduced thus far. To the standard assumptions concerning
P = fP g 2 , "Bayesians" add a distribution on the parameter space . That
is, suppose that
G is a distribution on ( ; C)
where it may be convenient to suppose that G is dominated by some -…nite
measure and that
dG
( ) = g( )
d
If considered as a function of both x and , f (x) is B C-measurable, then 9
a probability (a joint distribution for (X; ) on (X
; B C)) de…ned by
Z
Z
X;
(A) =
f (x) d (
G) (x; ) =
f (x) g ( ) d (
) (x; )
A
A
This distribution has marginals

π^X(B) = ∫_{B×Θ} f_θ(x) d(μ × G)(x, θ) = ∫_B ∫_Θ f_θ(x) dG(θ) dμ(x)

(so that dπ^X/dμ (x) = ∫ f_θ(x) dG(θ) = ∫ f_θ(x) g(θ) dν(θ)) and

π^θ(C) = ∫_{X×C} f_θ(x) d(μ × G)(x, θ) = ∫_C ∫_X f_θ(x) dμ(x) dG(θ) = G(C)
Further, the conditionals (regular conditional probabilities) are computable as

π^{X|θ}(B|θ) = ∫_B f_θ(x) dμ(x) = P_θ(B)

and

π^{θ|X}(C|x) = ∫_C [ f_θ(x) / ∫ f_θ(x) dG(θ) ] dG(θ)

Note then that

dπ^{θ|X}/dG (θ|x) = f_θ(x) / ∫ f_θ(x) dG(θ)

and thus that

dπ^{θ|X}/dν (θ|x) = f_θ(x) g(θ) / ∫ f_θ(x) g(θ) dν(θ)

is the usual "posterior density" for θ given X = x.

Bayesians use the posterior distribution or density (THAT DEPENDS ON ONE'S CHOICE OF G) as their basis of inference for θ. Pages 84-88 of Schervish concern a Bayesian formulation of sufficiency and argue (basically) that the Bayesian form is equivalent to the "classical" form.
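The posterior-density formula above can be checked numerically in a standard conjugate example (a Python sketch with arbitrarily chosen n, x, a, b): for a Binomial(n, θ) observation with a Beta(a, b) prior, normalizing f_θ(x) g(θ) directly on a grid should reproduce the Beta(a + x, b + n − x) closed form.

```python
import math

def beta_pdf(t, a, b):
    logB = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - logB)

def binom_pmf(x, n, t):
    return math.comb(n, x) * t**x * (1 - t) ** (n - x)

n, x, a, b = 10, 7, 2.0, 3.0
grid = [(i + 0.5) / 10000 for i in range(10000)]          # midpoints on (0, 1)
unnorm = [binom_pmf(x, n, t) * beta_pdf(t, a, b) for t in grid]
marginal = sum(unnorm) / len(grid)                        # Riemann sum of the denominator
posterior = [u / marginal for u in unnorm]                # f_theta(x) g(theta) / integral

# Compare with the conjugate closed form Beta(a + x, b + n - x) near theta = 0.5.
i = 5000
closed = beta_pdf(grid[i], a + x, b + n - x)
```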
6.2 Exponential Families

Many standard families of distributions can be treated with a single set of analyses by recognizing them to be of a common "exponential family" form. The following is a list of facts about such families. (See, for example, Schervish Sections 2.2.1 and 2.2.2, or scattered results in Shao or the old books by Lehmann (TSH and TPE) for details.)
Definition 65 Suppose that P = {P_θ} is dominated by a σ-finite measure μ. P is called an exponential family if for some h(x) ≥ 0,

f_θ(x) = dP_θ/dμ (x) = exp( a(θ) + Σ_{i=1}^k η_i(θ) T_i(x) ) h(x)  ∀θ.

Definition 66 A family of probability measures P = {P_θ} is said to be identifiable if θ_1 ≠ θ_2 implies that P_{θ_1} ≠ P_{θ_2}.
Claim 67 Let η = (η_1, η_2, ..., η_k), Γ = {η ∈ R^k | ∫ h(x) exp(Σ η_i T_i(x)) dμ(x) < ∞}, and consider also the family of distributions (on X) with R-N derivatives wrt μ of the form

f_η(x) = K(η) exp( Σ_{i=1}^k η_i T_i(x) ) h(x),  η ∈ Γ.

Call this family P*. This set of distributions on X is at least as large as P (and this second parameterization is mathematically nicer than the first). Γ is called the natural parameter space for P*. It is a convex subset of R^k. (TSH, page 57.) If Γ lies in a subspace of dimension less than k, then f_η(x) (and therefore f_θ(x)) can be written in a form involving fewer than k statistics T_i. We will henceforth assume Γ to be fully k dimensional. Note that depending upon the nature of the functions η_i(θ) and the parameter space Θ, P may be a proper subset of P*. That is, defining

Γ* = {(η_1(θ), η_2(θ), ..., η_k(θ)) ∈ R^k | θ ∈ Θ},

Γ* can be a proper subset of Γ.
Claim 68 The "support" of P , de…ned as fxjf (x) > 0g, is clearly fxjh(x) >
0g, which is independent of . The distributions in P are thus mutually absolutely continuous.
Claim 69 From the Factorization Theorem, the statistic T = (T1 ; T2 ; :::; Tk ) is
su¢ cient for P.
Claim 70 T has induced measures fP T j 2
family.
25
g which also form an exponential
Claim 71 If
contains an open rectangle in Rk , then T is complete for P.
(See pages 142-143 of TSH.)
Claim 72 If
contains an open rectangle in Rk (and actually under the much
weaker assumptions given on page 44 of TPE) T is minimal su¢ cient for P.
Claim 73 If g is any measurable real valued function such that E_η |g(X)| < ∞, then

E_η g(X) = ∫ g(x) f_η(x) dμ(x)

is continuous on Γ and has continuous partial derivatives of all orders on the interior of Γ. These can be calculated as

∂^{α_1+α_2+⋯+α_k} / (∂η_1^{α_1} ∂η_2^{α_2} ⋯ ∂η_k^{α_k}) E_η g(X) = ∫ g(x) [ ∂^{α_1+α_2+⋯+α_k} / (∂η_1^{α_1} ∂η_2^{α_2} ⋯ ∂η_k^{α_k}) f_η(x) ] dμ(x).

(See page 59 of TSH.)
Claim 74 If for u = (u_1, u_2, ..., u_k), both η_0 and η_0 + u belong to Γ,

E_{η_0} exp( u_1 T_1(X) + u_2 T_2(X) + ⋯ + u_k T_k(X) ) = K(η_0) / K(η_0 + u).

Further, if η_0 is in the interior of Γ, then

E_{η_0} ( T_1^{α_1}(X) T_2^{α_2}(X) ⋯ T_k^{α_k}(X) ) = K(η_0) [ ∂^{α_1+α_2+⋯+α_k} / (∂η_1^{α_1} ∂η_2^{α_2} ⋯ ∂η_k^{α_k}) (1/K(η)) ]|_{η=η_0}.

In particular, E_{η_0} T_j(X) = [ ∂/∂η_j (−ln K(η)) ]|_{η=η_0}, Var_{η_0} T_j(X) = [ ∂²/∂η_j² (−ln K(η)) ]|_{η=η_0}, and Cov_{η_0}(T_j(X), T_l(X)) = [ ∂²/(∂η_j ∂η_l) (−ln K(η)) ]|_{η=η_0}.
Claim 75 If X = (X_1, X_2, ..., X_n) has iid components, with each X_i ~ P_η, then X generates a k dimensional exponential family on (X^n, B^n) wrt μ^n. The k dimensional statistic Σ_{i=1}^n T(X_i) is sufficient for this family. Under the condition that Γ* contains an open rectangle, this statistic is also complete and minimal sufficient.
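The derivative formulas of Claim 74 can be checked numerically in a one-dimensional example (a Python sketch, with the Poisson family written in natural form as an assumed illustration): f_η(x) = K(η) e^{ηx} / x! with K(η) = exp(−e^η), so A(η) = −ln K(η) = e^η, and E T = A′(η), Var T = A″(η) should match the mean and variance computed directly from the pmf.

```python
import math

def A(eta):
    # Cumulant function -ln K(eta) for the Poisson family in natural form.
    return math.exp(eta)

def mean_and_var_from_pmf(eta, upper=200):
    """Mean and variance of T(X) = X computed directly from the Poisson pmf."""
    lam = math.exp(eta)
    pmf = [math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)) for k in range(upper)]
    m = sum(k * p for k, p in enumerate(pmf))
    v = sum((k - m) ** 2 * p for k, p in enumerate(pmf))
    return m, v

eta, d = 0.7, 1e-5
A1 = (A(eta + d) - A(eta - d)) / (2 * d)             # numeric A'(eta)
A2 = (A(eta + d) - 2 * A(eta) + A(eta - d)) / d**2   # numeric A''(eta)
m, v = mean_and_var_from_pmf(eta)
```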
6.3 Measures of Statistical Information

Details for the development here can be found in Section 2.3 of Schervish.
6.3.1 Fisher Information

This is an attempt to quantify how much on average the likelihood f_θ(X) tells one about θ ∈ Θ ⊂ R^k. To even begin, one must have enough technical conditions in place to make it possible to define statistical information.
Definition 76 We will say that the model P with dominating σ-finite measure μ is FI Regular at θ_0 ∈ Θ ⊂ R^k provided there is an open neighborhood of θ_0, say O, such that

1. f_θ(x) > 0 ∀x ∈ X and θ ∈ O,

2. ∀x, f_θ(x) has first order partial derivatives at θ_0, and

3. 1 = ∫ f_θ(x) dμ(x) can be differentiated with respect to each θ_i under the integral sign at θ_0; that is 0 = ∫ [ ∂f_θ(x)/∂θ_i ]|_{θ=θ_0} dμ(x).

Definition 77 If the model P with dominating σ-finite measure μ is FI Regular at θ_0 ∈ Θ ⊂ R^k, and

E_{θ_0} ( [ ∂ ln f_θ(X)/∂θ_i ]|_{θ=θ_0} )² < ∞  ∀i

then the k × k matrix

I(θ_0) = E_{θ_0} [ ( ∂ ln f_θ(X)/∂θ_i )( ∂ ln f_θ(X)/∂θ_j ) ]|_{θ=θ_0}

is called the Fisher Information about θ contained in X at θ_0.
Claim 78 Fisher information doesn’t depend upon the dominating measure.
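A quick Monte Carlo sketch of Definition 77 (a standard example, not from the notes): for the N(θ, 1) model the score is ∂ ln f_θ(x)/∂θ = (x − θ), so E_θ[(X − θ)²] should be 1 at every θ.

```python
import random

random.seed(1)
theta, reps = 2.0, 200000
# Average of squared scores approximates the Fisher Information I(theta) = 1.
scores_sq = [(random.gauss(theta, 1.0) - theta) ** 2 for _ in range(reps)]
info_hat = sum(scores_sq) / reps
```

The same estimate (up to Monte Carlo error) results at any other θ, reflecting that this information is constant in θ for a location model with fixed variance.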
The Fisher Information certainly does depend upon the parameterization one uses. For example, suppose that the model P is FI regular at θ_0 ∈ Θ ⊂ R^1 and that for some "nice" function h (say 1-1 with derivative h′)

γ = h(θ)

Then for P_θ with R-N derivative f_θ with respect to μ, the distributions Q_γ (for γ in the range of h) are defined by

Q_γ = P_{h^{-1}(γ)}

and have R-N derivatives (again with respect to μ)

g_γ = dQ_γ/dμ = f_{h^{-1}(γ)}

Then, continuing to let "prime" denote differentiation with respect to θ or γ,

g′_γ = f′_{h^{-1}(γ)} · ( 1/h′(h^{-1}(γ)) )

and

I(γ) = ( 1/h′(h^{-1}(γ)) )² J(h^{-1}(γ))

where I is the information in X about γ and J is the information in X about θ.
Under appropriate conditions, there is a second form for the Fisher Information matrix.

Theorem 79 Suppose that the model P is FI regular at θ_0 ∈ Θ. If ∀x, f_θ(x) has continuous partials in the neighborhood O, and 1 = ∫ f_θ(x) dμ(x) can be differentiated twice with respect to each θ_i under the integral sign at θ_0; that is 0 = ∫ [ ∂f_θ(x)/∂θ_i ]|_{θ=θ_0} dμ(x) and 0 = ∫ [ ∂²f_θ(x)/(∂θ_i ∂θ_j) ]|_{θ=θ_0} dμ(x) ∀i and j, then

I(θ_0) = −E_{θ_0} ( [ ∂² ln f_θ(X)/(∂θ_i ∂θ_j) ]|_{θ=θ_0} )
One reason Fisher Information is an attractive measure is that for independent observations, it is additive.

Proposition 80 If X_1, X_2, ..., X_n are independent with X_i ~ P_{i,θ}, then X = (X_1, X_2, ..., X_n) carries Fisher Information I(θ) = Σ_{i=1}^n I_i(θ) (where in the obvious notation I_i(θ) is the Fisher information carried by X_i).

Fisher information also behaves "properly" in terms of what happens when one reduces X to some statistic T(X). That is, there are the next two results, the second of which is like 2.86 of Schervish.
Proposition 81 Suppose the model P is FI Regular at θ_0 ∈ Θ ⊂ R^k. If the function T is 1-1, then the Fisher Information in T(X) is the same as the Fisher Information in X.

Proposition 82 Suppose the model P is FI Regular at θ_0 ∈ Θ ⊂ R^k. (?And also suppose that P^T is FI regular at θ_0?) Then the matrix

I_X(θ_0) − I_{T(X)}(θ_0)

is non-negative definite. Further, if T(X) is sufficient, then I_X(θ_0) − I_{T(X)}(θ_0) = 0. Also, I_X(θ_0) − I_{T(X)}(θ_0) = 0 ∀θ_0 implies that T(X) is sufficient.
In exponential families, the Fisher Information has a very nice form.

Proposition 83 For a family of distributions as in Claim 67,

I(η_0) = Var_{η_0}(T(X)) = [ ∂²/(∂η_i ∂η_j) (−ln K(η)) ]|_{η=η_0}
6.3.2 Kullback-Leibler Information

This is another measure of "information" that aims to measure how far apart two distributions are in the sense of likelihood, i.e. in the sense of how hard it will be to discriminate between them on the basis of X. It is a quantity that arises naturally in the asymptotics of maximum likelihood.

Definition 84 If P and Q are probability measures on X with R-N derivatives with respect to a common dominating measure μ, p and q respectively, then the Kullback-Leibler Information is the P expected log likelihood ratio

I(P, Q) = E_P ln( p(X)/q(X) ) = ∫ ln( p(x)/q(x) ) p(x) dμ(x)
Claim 85 I (P; Q) is not in general the same as I (Q; P ).
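The defining integral can be evaluated directly in a simple case (a Python sketch with an assumed-standard closed form): for P = N(0, 1) and Q = N(μ, 1), I(P, Q) = μ²/2, and a Riemann-sum evaluation of E_P ln(p(X)/q(X)) should reproduce it.

```python
import math

def kl_normals_numeric(mu, lo=-12.0, hi=12.0, m=200000):
    """Midpoint Riemann sum of ln(p/q) p over a wide grid, p = N(0,1), q = N(mu,1)."""
    step = (hi - lo) / m
    total = 0.0
    for i in range(m):
        x = lo + (i + 0.5) * step
        p = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        # ln p(x) - ln q(x) for unit-variance normals:
        log_ratio = (-x * x / 2) - (-((x - mu) ** 2) / 2)
        total += log_ratio * p * step
    return total

mu = 1.5
numeric = kl_normals_numeric(mu)
closed = mu**2 / 2
```

(The equal-variance normal case happens to be symmetric in P and Q; families with unequal variances show the general asymmetry of Claim 85.)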
Lemma 86 I(P, Q) ≥ 0 and there is equality only when P = Q.

Lemma 86 follows from the next (slightly more general) lemma (from page 75 of Silvey), upon noting that C(·) = −ln(·) is strictly convex.

Lemma 87 Suppose that P and Q are two different probability measures on X with R-N derivatives with respect to a common dominating measure μ, p and q respectively. Suppose further that a function C : (0, ∞) → R is convex. Then

E_P C( q(X)/p(X) ) ≥ C(1)

with strict inequality if C is strictly convex.
K-L Information has a kind of additive property similar to that possessed by the Fisher Information.

Theorem 88 If X = (X_1, X_2) and under both P and Q the variables X_1 and X_2 are independent, then

I_X(P, Q) = I_{X_1}(P^{X_1}, Q^{X_1}) + I_{X_2}(P^{X_2}, Q^{X_2})

K-L Information also has the kind of non-increasing property associated with reduction of X to a statistic T(X) possessed by the Fisher Information.

Theorem 89 I_X(P, Q) ≥ I_{T(X)}(P^{T(X)}, Q^{T(X)}) and there is equality if and only if T(X) is sufficient for {P, Q}.
Finally, there is also an important connection between the notions of Fisher and K-L Information under the formulation of Section 6.3.1.

Theorem 90 Under the hypotheses of Theorem 79, if one can reverse the order of limits in all

[ ∂²/(∂θ_i ∂θ_j) ( E_{θ_0} ln f_θ(X) ) ]|_{θ=θ_0}

to get

[ E_{θ_0} ( ∂²/(∂θ_i ∂θ_j) ln f_θ(X) ) ]|_{θ=θ_0}

then

[ ∂²/(∂θ_i ∂θ_j) I(P_{θ_0}, P_θ) ]|_{θ=θ_0} = I(θ_0)

(Clearly, the first I is the K-L Information and the second is the Fisher Information.)
As a bit of an aside, it is worth considering what would establish the legitimacy of exchange of order of differentiation and integration (e.g. as required in the hypotheses of Theorem 90). For any function h(θ) : R^1 → R^1, define

D^1_Δ(h)(θ) = h(θ + Δ) − h(θ)

and

D^2_Δ(h)(θ) = h(θ + Δ) + h(θ − Δ) − 2h(θ)

Then

lim_{Δ→0} (1/Δ) D^1_Δ(h)(θ) = h′(θ)

and

lim_{Δ→0} (1/Δ²) D^2_Δ(h)(θ) = h″(θ)

Consider g(θ, x) : R^1 × X → R^1. Then, for example,

(1/Δ²) ∫ D^2_Δ( g(·, x) )(θ_0) f_{θ_0}(x) dμ(x)
= (1/Δ²) ∫ [ g(θ_0 + Δ, x) + g(θ_0 − Δ, x) − 2g(θ_0, x) ] f_{θ_0}(x) dμ(x)
= (1/Δ²) D^2_Δ( ∫ g(θ, x) f_{θ_0}(x) dμ(x) )(θ_0)

Now the limit of the last of these (as Δ → 0) is

[ d²/dθ² ∫ g(θ, x) f_{θ_0}(x) dμ(x) ]|_{θ=θ_0}

So the question of interchange of expectation and second derivative is that of whether the (μ) integral of the limit of

(1/Δ²) D^2_Δ( g(·, x) )(θ_0) f_{θ_0}(x)

is the limit of the (μ) integral. This might be established using the dominated convergence theorem.
7 Statistical Decision Theory

This is a formal framework for balancing various kinds of risks under uncertainty, and can sometimes be used to look for "good" statistical procedures. It finds modern uses in the search for good data mining and machine learning algorithms.

7.1 Basic Framework and Concepts

To the usual statistical modeling framework of Section 4 we now add the following elements:
1. some "action space" A with -algebra E,
2. a suitably measurable "loss function" L :
3. (non-randomized) decision rules
A ! [0; 1), and
: (X ; B) ! (A; E) :
(For data X; (X) is the action taken on the basis of X.)
R
De…nition 91 R ( ; ) E L ( ; (X)) = L ( ; (x)) dP (x) mapping
[0; 1) is called the risk function for .
Definition 92 δ is at least as good as δ′ if R(θ, δ) ≤ R(θ, δ′) ∀θ.

Definition 93 δ is better than (dominates) δ′ if R(θ, δ) ≤ R(θ, δ′) ∀θ with strict inequality for at least one θ ∈ Θ.

Definition 94 δ and δ′ are risk equivalent if R(θ, δ) = R(θ, δ′) ∀θ.

Definition 95 δ is best in a class of decision rules Δ if δ ∈ Δ and δ is at least as good as any other δ′ ∈ Δ.

Definition 96 δ is inadmissible in Δ if ∃ δ′ ∈ Δ which is better than (dominates) δ.

Definition 97 δ is admissible in Δ if it is not inadmissible in Δ (if there is no δ′ which is better).
We will use the notation

D = {δ} = the class of (non-randomized) decision rules

It is technically useful to extend the notion of decision procedures to include the possibility of randomizing in various ways.

Definition 98 If for each x ∈ X, φ_x is a distribution on (A, E), then φ_x is called a behavioral decision rule.

The notion of a behavioral decision rule is that one observes X = x and then makes a random choice of an element of A using distribution φ_x. We'll let

D* = {φ_x} = the class of behavioral decision rules

It's possible to think of D as a subset of D* by associating with δ ∈ D a behavioral decision rule φ_x that is a point mass distribution on A concentrated at δ(x). The natural definition of the risk function of a behavioral decision rule is (abusing notation and using "R" here too)

R(θ, φ) = ∫_X ∫_A L(θ, a) dφ_x(a) dP_θ(x)
A second (less intuitively appealing) notion of randomizing decisions is one that might somehow pick an element of D at random (and then plug in X). Let F be a σ-algebra on D that contains all singleton sets.

Definition 99 A randomized decision function (or rule) Ψ is a probability measure on (D, F).

δ with distribution Ψ becomes a random object. Let

D** = {Ψ} = the class of randomized decision rules

It's possible to think of D as a subset of D** by associating with δ ∈ D a randomized decision rule Ψ placing mass 1 on δ. The natural definition of the risk function of a randomized decision rule is (yet again abusing notation and using "R" here too)

R(θ, Ψ) = ∫_D R(θ, δ) dΨ(δ) = ∫_D ∫_X L(θ, δ(x)) dP_θ(x) dΨ(δ)

(assuming that R(θ, δ) is properly measurable).
The behavioral decision rules are most natural, while the randomized decision rules are easiest to deal with in some proofs. So it is a reasonably important question when D* and D** are equivalent in the sense of generating the same set of risk functions. Some properly qualified version of the following is true.

Proposition 100 If A is a complete separable metric space with E the Borel σ-algebra and ??? regarding the distributions P_θ and ???, then D* and D** are equivalent in terms of generating the same set of risk functions.

D* and D** are clearly more complicated than D. A sensible question is when they really provide anything D doesn't provide. One kind of negative answer can be given for the case of convex loss. The following is like 2.5a of Shao, page 151 of Schervish, page 40 of Berger, or page 78 of Ferguson.
Lemma 101 Suppose that A is a convex subset of R^d and φ_x is a behavioral decision rule. Define a non-randomized decision rule by

δ(x) = ∫_A a dφ_x(a)

(In the case that d > 1, interpret δ(x) as vector-valued, the integral as a vector of integrals over the d coordinates of a ∈ A.) Then

1. if L(θ, ·) : A → [0, ∞) is convex, then

R(θ, δ) ≤ R(θ, φ)

and

2. if L(θ, ·) : A → [0, ∞) is strictly convex, R(θ, φ) < ∞ and P_θ({x | φ_x is non-degenerate}) > 0, then

R(θ, δ) < R(θ, φ)
Corollary 102 Suppose that A is a convex subset of R^d, φ_x is a behavioral decision rule and

δ(x) = ∫_A a dφ_x(a)

Then

1. if L(θ, a) is convex in a ∀θ, δ is at least as good as φ, and

2. if L(θ, a) is convex in a ∀θ and for some θ_0 the function L(θ_0, a) is strictly convex in a, R(θ_0, φ) < ∞ and P_{θ_0}({x | φ_x is non-degenerate}) > 0, then δ is better than φ.

The corollary shows, e.g., that for squared error loss estimation, averaging out over non-trivial randomization in a behavioral decision rule will in fact improve that estimator.
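A Monte Carlo sketch of this squared error loss point (the model and randomization mechanism are arbitrary illustrations): a behavioral rule that adds mean-zero noise U to an estimate d has risk E(d + U − θ)² = E(d − θ)² + Var U, so replacing it by its mean δ(x) = d strictly lowers the risk.

```python
import random

random.seed(2)
theta, n, reps = 1.0, 4, 50000
risk_behavioral, risk_mean = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(theta, 1.0) for _ in range(n)]
    d = sum(x) / n                       # delta(x): the mean of phi_x
    a = d + random.uniform(-1.0, 1.0)    # one draw from a non-degenerate phi_x
    risk_behavioral += (a - theta) ** 2
    risk_mean += (d - theta) ** 2
risk_behavioral /= reps
risk_mean /= reps
```

Here the true risks are 1/4 for δ and 1/4 + 1/3 for the randomized rule, and the simulation estimates reflect that gap.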
7.2 (Finite Dimensional) Geometry of Decision Theory

A very helpful device for understanding some of the basics of decision theory is the geometry associated with cases where Θ is finite. Let

Θ = {θ_1, θ_2, ..., θ_k}

Assume that R(θ, φ) < ∞ ∀θ ∈ Θ and φ ∈ D* and note that in this case R(·, φ) : Θ → [0, ∞) can be thought of as a k-vector. Let

S = { y = (y_1, y_2, ..., y_k) ∈ R^k | y_i = R(θ_i, φ) for all i and some φ ∈ D* }
  = the set of all randomized risk vectors
Theorem 103 S is convex.

In fact, it turns out that if

S′ = { y = (y_1, y_2, ..., y_k) ∈ R^k | y_i = R(θ_i, δ) for all i and some δ ∈ D }

S is the convex hull of S′. This is the smallest convex set containing S′, the set of all convex combinations of points of S′, the intersection of all convex sets containing S′. (By the definition of randomized decision rules, each produces a y that is a convex combination of points in S′, so it is at least obvious that S is a subset of the convex hull of S′.)

Definition 104 For x ∈ R^k, the lower quadrant of x is

Q_x = { z | z_i ≤ x_i ∀i }

Theorem 105 y ∈ S (or the decision rule giving rise to y) is admissible if and only if Q_y ∩ S = {y}.

Definition 106 For S̄ the closure of S, the lower boundary of S is

λ(S) = { y | Q_y ∩ S̄ = {y} }

Definition 107 S is closed from below if λ(S) ⊂ S.

Let A(S) = { y | Q_y ∩ S = {y} } be the set of admissible risk points.

Theorem 108 If S is closed, A(S) = λ(S).

Theorem 108 is very easy to prove. A better theorem (one with weaker hypotheses) that is perhaps surprisingly harder to prove is next.

Theorem 109 If S is closed from below, A(S) = λ(S).
Now drop the finiteness assumption.

7.3 Complete Classes of Decision Rules

Definition 110 A class of decision rules C ⊂ D* is a complete class if for any φ ∉ C, ∃ φ′ ∈ C such that φ′ is better than φ.

(One can do better in the complete class than anywhere outside of the complete class in terms of risk functions.)

Definition 111 A class of decision rules C ⊂ D* is an essentially complete class if for any φ ∉ C, ∃ φ′ ∈ C such that φ′ is at least as good as φ.

Definition 112 A class of decision rules C ⊂ D* is a minimal complete class if it is complete and is a subset of any other complete class.
Let A(D*) be the set of admissible rules in D*.

Theorem 113 If a minimal complete class C exists, then C = A(D*).

Theorem 114 If A(D*) is complete, then it is minimal complete.

Theorems 113 and 114 are properly qualified versions of the rough (not quite true) statement "A(D*) is the minimal complete class."
7.4 Sufficiency and Decision Theory

Presumably one can typically do as well in a decision problem based on a sufficient statistic T(X) as one can do based on X.

Theorem 115 If T is sufficient for P and φ is a behavioral decision rule, then ∃ another behavioral decision rule φ′ that is a function of T and has the same risk function as φ.

(That φ′ is a function of T means that T(x) = T(y) implies that φ′_x and φ′_y are the same distributions on A.)

Theorem 115 shows that the class of rules that are functions of a sufficient statistic is essentially complete. Theorem 118 shows that under a convex loss, conditioning on a sufficient statistic can sometimes actually improve a decision rule. We state two small results that lead up to this.
Lemma 116 Suppose A ⊂ R^d is convex and δ_1 and δ_2 are two non-randomized decision rules. Then

δ = (1/2)(δ_1 + δ_2)

is also a decision rule. Further,

1. if L(θ, a) is convex in a and R(θ, δ_1) = R(θ, δ_2), then R(θ, δ) ≤ R(θ, δ_1), and

2. if L(θ, a) is strictly convex in a, R(θ, δ_1) = R(θ, δ_2) < ∞, and P_θ(δ_1(X) ≠ δ_2(X)) > 0, then R(θ, δ) < R(θ, δ_1).

Corollary 117 Suppose A ⊂ R^d is convex and δ_1 and δ_2 are two non-randomized decision rules with identical risk functions. If L(θ, a) is convex in a ∀θ, the existence of a θ ∈ Θ for which L(θ, a) is strictly convex in a, R(θ, δ_1) = R(θ, δ_2) < ∞ and P_θ(δ_1(X) ≠ δ_2(X)) > 0, implies that δ_1 and δ_2 are inadmissible.

Finally, we have the general form of the Rao-Blackwell Theorem (that usually appears in its SELE form in Stat 543 and is 2.5.6 of Shao, page 152 of Schervish, Berger page 41, and Ferguson page 121).
Theorem 118 (Rao-Blackwell Theorem) Suppose A ⊂ R^d is convex and δ is a nonrandomized decision function with E_θ ||δ(X)|| < ∞ ∀θ. Suppose further that T is sufficient for θ and with B0 = B(T) let

δ′(x) ≡ E(δ|B0)(x)

Then δ′ is a non-randomized decision rule. Further,

1. if L(θ, ·) is convex in a, then R(θ, δ′) ≤ R(θ, δ), and

2. for any θ for which L(θ, a) is strictly convex in a, R(θ, δ) < ∞, and P_θ(δ′(X) ≠ δ(X)) > 0, it is the case that R(θ, δ′) < R(θ, δ).
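A Monte Carlo sketch of the Rao-Blackwell improvement (a standard squared-error example, not from the notes): in an iid Poisson(λ) model, δ(X) = X_1 is unbiased for λ, and conditioning on the sufficient statistic T = ΣX_i gives δ′(X) = E(X_1|T) = T/n = X̄, dropping the risk from λ to λ/n.

```python
import random

def poisson_draw(rng, lam):
    """Poisson draw via exponential inter-arrival times on the unit interval."""
    t, k = 0.0, 0
    while True:
        t += rng.expovariate(lam)
        if t > 1.0:
            return k
        k += 1

rng = random.Random(3)
lam, n, reps = 4.0, 8, 40000
risk_delta, risk_rb = 0.0, 0.0
for _ in range(reps):
    x = [poisson_draw(rng, lam) for _ in range(n)]
    risk_delta += (x[0] - lam) ** 2          # delta(X) = X_1
    risk_rb += (sum(x) / n - lam) ** 2       # delta'(X) = E(X_1 | T) = Xbar
risk_delta /= reps
risk_rb /= reps
```

With λ = 4 and n = 8 the true risks are 4 and 0.5, and the simulated risks estimate that eightfold improvement.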
7.5 Bayes Decision Rules

The Bayes approach to decision theory is one means of reducing risk functions to numbers so that they can be compared in a straightforward fashion. Let G be a distribution on (Θ, C).

Definition 119 The Bayes risk of φ ∈ D* with respect to the prior G is (abusing notation again and once more using the symbol R)

R(G, φ) = ∫ R(θ, φ) dG(θ)

Further (again abusing notation) the "minimum" Bayes risk is

R(G) ≡ inf_{φ′ ∈ D*} R(G, φ′)

Definition 120 φ ∈ D* is said to be Bayes with respect to G if

R(G, φ) = R(G)

Definition 121 φ ∈ D* is said to be ε-Bayes with respect to G provided

R(G, φ) ≤ R(G) + ε
Bayes rules are often admissible, at least if the prior involved "spreads its mass around sufficiently."

Theorem 122 If Θ is countable and G is a prior with G({θ}) > 0 ∀θ and φ is Bayes with respect to G, then φ is admissible.

Theorem 123 Suppose Θ ⊂ R^k is such that every neighborhood of any point θ ∈ Θ has a non-empty intersection with the interior of Θ. Suppose further that R(θ, φ) is continuous in θ for all φ ∈ D*. Let G be a prior distribution that has support Θ in the sense that every open ball that is a subset of Θ has positive G probability. Then if R(G) < ∞ and φ is a Bayes rule with respect to G, φ is admissible.
Theorem 124 If every Bayes rule with respect to G has the same risk function, they are all admissible.

Corollary 125 In a problem where Bayes rules exist and are unique, they are admissible.

A converse of Theorem 122 (for finite Θ) is given in Theorem 127. Its proof requires the use of a piece of k-dimensional analytical geometry (called the separating hyperplane theorem) that is stated next.

Theorem 126 (Separating Hyperplane Theorem) Let S_1 and S_2 be two disjoint convex subsets of R^k. Then ∃ p ∈ R^k such that p ≠ 0 and Σ p_i y_i ≤ Σ p_i x_i ∀y ∈ S_1 and ∀x ∈ S_2.

Theorem 127 If Θ is finite and φ is admissible, then φ is Bayes with respect to some prior.
For non-finite Θ, admissible rules are "usually" Bayes or "limits" of Bayes rules. See Section 2.10 of Ferguson or page 546 of Berger.

As the next result shows, Bayesians don't need randomized decision rules, at least for purposes of achieving minimum Bayes risk, R(G). See also page 147 of Schervish.

Proposition 128 Suppose that φ ∈ D* is Bayes versus G and R(G) < ∞. Then there exists a nonrandomized rule δ ∈ D that is also Bayes versus G.

There remain the questions of 1) when Bayes rules exist and 2) what they look like when they do exist. These are addressed next.

Theorem 129 If Θ is finite, S is closed from below, and G assigns positive probability to each θ ∈ Θ, there is a rule Bayes versus G.

Definition 130 A formal nonrandomized Bayes rule versus a prior G is a rule δ(x) such that ∀x ∈ X,

δ(x) is an a ∈ A minimizing ∫ L(θ, a) [ f_θ(x) / ∫ f_θ(x) dG(θ) ] dG(θ)

Definition 131 If G is a σ-finite measure, a formal nonrandomized generalized Bayes rule versus G is a rule δ(x) such that ∀x ∈ X,

δ(x) is an a ∈ A minimizing ∫ L(θ, a) f_θ(x) dG(θ)
7.6
Minimax Decision Rules
An
R alternative to the Bayes reduction of R ( ; ) to the number R (G; ) =
R ( ; ) dG ( ) is to reduce R ( ; ) to the number supR ( ; ).
De…nition 132 A decision rule
2 D is said to be minimax if
supR ( ; ) = inf
supR ( ;
0
0
)
It is intuitively plausible that if one is trying to produce a minimax rule by
pushing down "the highest peak" in R ( ; ) (considered as a function of ),
that will tend to direct one toward rules with fairly "‡at" risk functions. So
the notion of "equalizer rules" has a central place in minimax theory.
Definition 133 If a decision rule $\phi \in \mathcal{D}$ has a constant risk function, it is called an equalizer rule.

Theorem 134 If $\phi \in \mathcal{D}$ is an equalizer rule and is admissible, then it is minimax.
Theorem 135 Suppose that $\{\phi_i\}$ is a sequence of decision rules, each $\phi_i$ Bayes versus a prior $G_i$. If $R(G_i, \phi_i) \to C < \infty$ and $\phi$ is a decision rule with $R(\theta, \phi) \leq C\ \forall \theta$, then $\phi$ is minimax.
Corollary 136 If $\phi \in \mathcal{D}$ is an equalizer rule and is Bayes with respect to a prior $G$, then it is minimax.

Corollary 137 If $\phi \in \mathcal{D}$ is Bayes versus $G$ and $R(\theta, \phi) \leq R(G)\ \forall \theta$, then it is minimax.
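As a concrete illustration of Corollary 136 (an added example, not part of the original outline): for $X \sim \text{Binomial}(n, p)$ under squared error loss, the Bayes estimator versus the Beta$(\sqrt{n}/2, \sqrt{n}/2)$ prior is $\delta(X) = (X + \sqrt{n}/2)/(n + \sqrt{n})$, and a short computation shows its risk is constant in $p$: it is an equalizer rule that is Bayes, hence minimax. A sketch in Python checking the constant risk numerically:

```python
import math

n = 25
a = math.sqrt(n) / 2.0  # Beta(a, a) prior parameters

def risk(p):
    """Squared-error risk of the Bayes rule delta(X) = (X + a) / (n + 2a)
    at a Binomial(n, p) parameter p: variance plus squared bias."""
    var = n * p * (1 - p) / (n + 2 * a) ** 2
    bias = a * (1 - 2 * p) / (n + 2 * a)
    return var + bias ** 2

risks = [risk(p / 100.0) for p in range(1, 100)]
# Algebraic simplification of the (constant) risk value
constant_risk = 1.0 / (4.0 * (math.sqrt(n) + 1) ** 2)
```

The risks over the grid all agree with the closed form $1/(4(\sqrt{n}+1)^2)$, confirming the equalizer property.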
In light of Corollaries 136 and 137, a reasonable question is "How might I identify a prior that might produce a Bayes rule that is minimax?" An answer to this question is "Look for a 'least favorable' prior distribution."

Definition 138 A prior distribution $G$ is said to be least favorable if
$$R(G) = \sup_{G'} R(G')$$

Theorem 139 If $\phi$ is Bayes versus $G$ and $R(\theta, \phi) \leq R(G)\ \forall \theta$, then $G$ is least favorable.

7.7 Invariance/Equivariance Theory
In decision problems with sufficient symmetries, it may make sense to restrict attention to decision rules that have corresponding symmetries. It then can happen that risk functions of such rules take relatively few values ... are potentially even constant. Then when comparing (constant) risk functions, one is in a position to identify a best (symmetric) rule. Invariance/equivariance theory is the formalization of this line of argument.
7.7.1 Location Parameter Estimation and Invariance/Equivariance

Definition 140 When $\mathcal{X} = \mathbb{R}^k$ and $\Theta = \mathbb{R}^k$, to say that $\theta$ is a location parameter means that under $P_\theta$, $X - \theta$ has the same distribution for all $\theta$.

Definition 141 When $\mathcal{X} = \mathbb{R}^k$ and $\Theta = \mathbb{R}$, to say that $\theta$ is a 1-dimensional location parameter means that under $P_\theta$, $X - \theta\mathbf{1}$ has the same distribution for all $\theta$.
Until further notice, assume that $\Theta = \mathcal{A} = \mathbb{R}$.

Definition 142 A loss function is called location invariant if $L(\theta, a) = \rho(\theta - a)$ for some non-negative function $\rho$. If $\rho(z)$ is increasing in $z$ for $z > 0$ and decreasing in $z$ for $z < 0$, the corresponding decision problem is called a location estimation problem.
Definition 143 A non-randomized decision rule $\delta$ is called location equivariant if
$$\delta(x + c\mathbf{1}) = \delta(x) + c$$

Theorem 144 If $\theta$ is a 1-dimensional location parameter, $L(\theta, a)$ is location invariant, and $\delta$ is location equivariant, then $R(\theta, \delta)$ is constant in $\theta$.

Definition 145 A function $g : \mathbb{R}^k \to \mathbb{R}$ is called location invariant if $g(x + c\mathbf{1}) = g(x)$.
A location invariant function is "constant on '45 degree lines.'"

Lemma 146 A function $u$ is location invariant if and only if $u$ depends upon $x$ only through
$$y(x) = (x_1 - x_k,\ x_2 - x_k,\ \ldots,\ x_{k-1} - x_k)$$

Lemma 147 Suppose that a non-randomized decision rule $\delta_0$ is location equivariant. Then $\delta_1$ is location equivariant if and only if $\exists$ a location invariant function $u$ such that
$$\delta_1 = \delta_0 + u$$
Theorem 148 Suppose that $L(\theta, a) = \rho(\theta - a)$ is location invariant and $\exists$ a location equivariant estimator $\delta_0$ for $\theta$ with finite risk. Let $\rho_0(z) = \rho(-z)$. Suppose that for each $y$ $\exists$ a number $v(y)$ that minimizes
$$E_{\theta = 0}\left[\rho_0(\delta_0 - v) \mid Y = y\right]$$
over choices of $v$. Then
$$\delta(x) = \delta_0(x) - v(y(x))$$
is a best location equivariant estimator of $\theta$.

In the case of squared error loss, the estimator of Theorem 148 is "Pitman's estimator" and can be written out more explicitly in some situations.
Theorem 149 Suppose that $L(\theta, a) = (\theta - a)^2$ and that the distribution of $X = (X_1, \ldots, X_k)$ under $P_\theta$ has R-N derivative with respect to Lebesgue measure on $\mathbb{R}^k$
$$f_\theta(x) = f(x_1 - \theta, \ldots, x_k - \theta) = f(x - \theta\mathbf{1})$$
and that $f$ has second moments. Then the best location equivariant estimator of $\theta$ is
$$\delta(x) = \frac{\int_{-\infty}^{\infty} \theta\, f(x - \theta\mathbf{1})\, d\theta}{\int_{-\infty}^{\infty} f(x - \theta\mathbf{1})\, d\theta}$$

Pitman's estimator is the formal generalized Bayes estimator versus Lebesgue measure on $\mathbb{R}$.
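To make Theorem 149 concrete (an added illustration, not from the outline): for a standard normal location model the Pitman estimator reduces to the sample mean, and the defining ratio of integrals can be checked by simple numerical quadrature. A sketch assuming iid N$(\theta, 1)$ observations:

```python
import math

x = [0.3, -1.2, 2.0, 0.7]  # hypothetical data from the location family

def f(z):
    """Standard k-variate normal density (product form) at the vector z."""
    return math.exp(-0.5 * sum(zi * zi for zi in z)) / (2 * math.pi) ** (len(z) / 2)

# Pitman estimator: ratio of integrals over theta, by a midpoint Riemann sum.
lo, hi, m = -20.0, 20.0, 40000
h = (hi - lo) / m
num = den = 0.0
for j in range(m):
    theta = lo + (j + 0.5) * h
    w = f([xi - theta for xi in x])
    num += theta * w * h
    den += w * h

pitman = num / den
xbar = sum(x) / len(x)  # for the normal model, the Pitman estimator is xbar
```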
Lemma 150 For the case of $L(\theta, a) = (\theta - a)^2$, a best location equivariant estimator of $\theta$ must be unbiased for $\theta$.

7.7.2 Generalities of Invariance/Equivariance Theory
A somewhat more general story about invariance leads to a generalization of Theorem 144. So, suppose that $\mathcal{P} = \{P_\theta\}$ is identifiable and that $L(\theta, a) = L(\theta, a')\ \forall \theta$ implies that $a = a'$. Invariance/equivariance theory is phrased in terms of transformations $\mathcal{X} \to \mathcal{X}$, $\Theta \to \Theta$, and $\mathcal{A} \to \mathcal{A}$ and how decision rules relate to them.
Definition 151 For 1-1 transformations of $\mathcal{X}$ onto $\mathcal{X}$, $g_1$ and $g_2$, the composition of $g_1$ and $g_2$ is another 1-1 transformation of $\mathcal{X}$ onto $\mathcal{X}$ defined by
$$g_2 \circ g_1(x) = g_2(g_1(x))\ \forall x$$

Definition 152 For $g$ a 1-1 transformation of $\mathcal{X}$ onto $\mathcal{X}$, if $h$ is another such transformation such that $h \circ g(x) = x\ \forall x$, $h$ is called the inverse of $g$ and denoted $h = g^{-1}$.
It's an easy result that both $g^{-1} \circ g(x) = x\ \forall x$ and $g \circ g^{-1}(x) = x\ \forall x$.

Definition 153 The identity transformation on $\mathcal{X}$ is defined by $e(x) = x\ \forall x$.

Another easy fact is then that $e \circ g = g \circ e = g$ for $g$ any 1-1 transformation of $\mathcal{X}$ onto $\mathcal{X}$.
Proposition 154 If $G$ is a class of 1-1 transformations of $\mathcal{X}$ onto $\mathcal{X}$ that is closed under composition and inversion, then with the operation "$\circ$" $G$ is a group in the usual mathematical sense.

To go anywhere with this theory, we need transformations on $\Theta$ that fit with the ones on $\mathcal{X}$. At a minimum, we don't want the transformations on $\mathcal{X}$ to move us outside of $\mathcal{P}$.
Definition 155 If $g$ is a 1-1 transformation of $\mathcal{X}$ onto $\mathcal{X}$ such that
$$\left\{P_\theta^{g(X)}\right\}_{\theta \in \Theta} = \mathcal{P}$$
the transformation is said to leave the model invariant. If each element of a group of transformations $G$ leaves $\mathcal{P}$ invariant, we'll say that the group leaves the model invariant.

One circumstance in which $G$ leaves $\mathcal{P}$ invariant is that where the model $\mathcal{P}$ can be thought of as "generated by $G$" in the sense that for some $U \sim P$ (for a fixed $P$), $\mathcal{P} = \left\{P^{g(U)} \mid g \in G\right\}$.
If a group of transformations on $\mathcal{X}$ (say $G$) leaves $\mathcal{P}$ invariant, there is a natural corresponding group of transformations on $\Theta$. That is, for $g \in G$,
$$X \sim P_\theta \Rightarrow g(X) \sim P_{\theta'} \text{ for some } \theta' \in \Theta$$
(and in fact, because of the identifiability assumption, there is only one such $\theta'$). So one might adopt the next definition.

Definition 156 If $G$ leaves $\mathcal{P}$ invariant, for each $g \in G$, define $\bar{g} : \Theta \to \Theta$ by
$$\bar{g}(\theta) = \text{the element } \theta' \in \Theta \text{ such that } X \sim P_\theta \Rightarrow g(X) \sim P_{\theta'}$$
Proposition 157 For $G$ a group of 1-1 transformations of $\mathcal{X}$ onto $\mathcal{X}$ leaving $\mathcal{P}$ invariant, the set
$$\bar{G} = \{\bar{g} \mid g \in G\}$$
with the operation "function composition" is a group of 1-1 transformations of $\Theta$ onto $\Theta$.

($\bar{G}$ is the homomorphic image of $G$ under the map $g \to \bar{g}$.)

Now we need transformations on $\mathcal{A}$ that fit together nicely with those on $\mathcal{X}$ and $\Theta$.
Definition 158 A loss function $L(\theta, a)$ is said to be invariant under the group of transformations $G$ provided for every $g \in G$ and $a \in \mathcal{A}$, $\exists$ a unique $a' \in \mathcal{A}$ such that
$$L(\theta, a) = L(\bar{g}(\theta), a')\ \forall \theta$$

Definition 159 If $G$ leaves $\mathcal{P}$ invariant and the loss function $L(\theta, a)$ is invariant, define $\tilde{g} : \mathcal{A} \to \mathcal{A}$ by
$$\tilde{g}(a) = \text{the element } a' \text{ of } \mathcal{A} \text{ such that } L(\theta, a) = L(\bar{g}(\theta), a')\ \forall \theta$$

According to the notation in Definitions 156 and 159, for $G$ leaving $\mathcal{P}$ invariant and $L(\theta, a)$ invariant,
$$L(\theta, a) = L(\bar{g}(\theta), \tilde{g}(a))$$
Proposition 160 For $G$ a group of 1-1 transformations of $\mathcal{X}$ onto $\mathcal{X}$ leaving $\mathcal{P}$ invariant and invariant loss function $L(\theta, a)$,
$$\tilde{G} = \{\tilde{g} \mid g \in G\}$$
with the operation "function composition" is a group of 1-1 transformations of $\mathcal{A}$ onto $\mathcal{A}$.

($\tilde{G}$ is the homomorphic image of $G$ under the map $g \to \tilde{g}$.)
Finally, we come to the business of equivariance of decision functions. Here we restrict attention to non-randomized decision rules.

Definition 161 In an invariant decision problem (one where $G$ leaves $\mathcal{P}$ invariant and the loss function $L(\theta, a)$ is invariant) a non-randomized rule $\delta$ is said to be equivariant provided
$$\delta(g(x)) = \tilde{g}(\delta(x))\ \forall x \in \mathcal{X} \text{ and } g \in G$$

We now have enough structure to state the promised generalization of Theorem 144.

Theorem 162 If $G$ leaves $\mathcal{P}$ invariant, the loss function $L(\theta, a)$ is invariant, and the non-randomized rule $\delta$ is equivariant, then
$$R(\theta, \delta) = R(\bar{g}(\theta), \delta)\ \forall \theta \in \Theta \text{ and } g \in G$$

Definition 163 For $\theta \in \Theta$, $O_\theta = \{\theta' \in \Theta \mid \theta' = \bar{g}(\theta) \text{ for some } \bar{g} \in \bar{G}\}$ is called the orbit in $\Theta$ containing $\theta$.
The distinct orbits $O_\theta$ partition $\Theta$, and Theorem 162 says that an equivariant non-randomized rule in an invariant problem has a risk function that is constant on orbits. Comparing risk functions for equivariant non-randomized rules in an invariant problem reduces to comparing risks on orbits. This, of course, is most helpful when there is only one orbit.
Definition 164 A group of 1-1 transformations of a space onto itself is called transitive if for any two points in the space there is a transformation in the group taking the first point to the second.

Corollary 165 If $G$ leaves $\mathcal{P}$ invariant, the loss function $L(\theta, a)$ is invariant, and $\bar{G}$ is transitive over $\Theta$, then every equivariant non-randomized decision rule has constant risk.

Definition 166 In an invariant decision problem, an equivariant decision rule $\delta$ is said to be a best (non-randomized) equivariant (or MRE: minimum risk equivariant) decision rule provided that for any other (non-randomized) equivariant rule $\delta'$,
$$R(\theta, \delta) \leq R(\theta, \delta')\ \forall \theta$$
(and it suffices to check that $R(\theta, \delta) \leq R(\theta, \delta')$ for any one $\theta$).
8 Asymptotics of Likelihood Inference

As usual, write $\mathcal{P} = \{P_\theta\}$. Suppose that $\mathcal{P}$ is identifiable and dominated by a $\sigma$-finite measure $\mu$, and write
$$f_\theta = \frac{dP_\theta}{d\mu}$$
We'll here consider asymptotics for inference based on the likelihood function $f_\theta(X)$ (a random function of $\theta$).
We'll concentrate on the iid (one sample) case. Here is some notation we'll use in the discussion. The basic one-observation model is $(\mathcal{X}, \mathcal{B}, P_\theta)$ with parameter space $\Theta \subset \mathbb{R}^k$. Based on this we'll take

$X^n = (X_1, X_2, \ldots, X_n)$, the observable through stage $n$
$\mathcal{X}^n$, the observation space for $X^n$
$\mathcal{B}^n$, the $n$-fold product $\sigma$-algebra
$P^n_\theta$, the distribution of $X^n$
$\mu^n$, the dominating (product) measure on $\mathcal{X}^n$
$f^n_\theta = \dfrac{dP^n_\theta}{d\mu^n}$

Notice that if one wants to prove almost sure convergence results, one needs a fixed probability space in which to operate for all $n$. In that case, one could use a big space involving $\Omega = \mathcal{X}^\infty$, $\mathcal{C} = \mathcal{B}^\infty$, $P^\infty_\theta$, $\mu^\infty$ and think of $X_i$ in $X^n = (X_1, X_2, \ldots, X_n)$ as
$$X_i(\omega) = \text{the } i\text{th coordinate of } \omega$$
8.1 Asymptotics of Likelihood Estimation

8.1.1 Consistency of Likelihood-Based Estimators

Definition 167 A statistic $T_n : \mathcal{X}^n \to \Theta$ is called a maximum likelihood estimator of $\theta$ if
$$f^n_{T_n(x^n)}(x^n) = \sup_{\theta \in \Theta} f^n_\theta(x^n)\ \forall x^n \in \mathcal{X}^n$$

For a fixed $x^n \in \mathcal{X}^n$, a $\theta \in \Theta$ maximizing $f^n_\theta(x^n)$ might be called a maximum likelihood estimate even when some $x^n$'s might not have corresponding maximizers.

Definition 168 Suppose that $\{T_n\}$ is a sequence of statistics, $T_n : \mathcal{X}^n \to \Theta \subset \mathbb{R}^k$.

1. $\{T_n\}$ is (weakly) consistent at $\theta_0 \in \Theta$ if for every $\epsilon > 0$,
$$P^n_{\theta_0}\left[\|T_n - \theta_0\| > \epsilon\right] \to 0 \text{ as } n \to \infty$$

2. $\{T_n\}$ is strongly consistent at $\theta_0 \in \Theta$ if $T_n \to \theta_0$ a.s. $P^\infty_{\theta_0}$.
Let
$$z_{\theta_0}(\theta) = E_{\theta_0} \ln f_\theta(X) = -I(\theta_0, \theta) + \int \ln(f_{\theta_0}(x))\, f_{\theta_0}(x)\, d\mu(x)$$
It is a corollary of Lemma 86 that for $\theta \neq \theta_0$,
$$z_{\theta_0}(\theta) < z_{\theta_0}(\theta_0)$$
Consider the loglikelihood through $n$ observations
$$\ell_n(x^n, \theta) = \ln f^n_\theta(x^n) = \sum_{i=1}^n \ln f_\theta(x_i)$$
Note that
$$z_{\theta_0}(\theta) = E_{\theta_0} \frac{1}{n} \ell_n(X^n, \theta)$$
and in fact the LLN promises that for every $\theta$,
$$\frac{1}{n} \ell_n(X^n, \theta) \to z_{\theta_0}(\theta) \text{ in } P_{\theta_0} \text{ probability}$$
So, assuming enough regularity to make the convergence uniform in $\theta$, there is perhaps hope that a single maximizer of $\frac{1}{n} \ell_n(X^n, \theta)$ will exist and be close to the maximizer of $z_{\theta_0}(\theta)$ (namely $\theta_0$), i.e. that an MLE might exist and be consistent. To make this kind of argument precise is quite hard. See Theorems 7.49 and 7.54 of Schervish. We'll take a different, easier, and much more common line of development.
Where the $f_\theta$ are differentiable in $\theta$, an MLE must satisfy the likelihood equations.

Definition 169 In the case where $\Theta \subset \mathbb{R}^k$ and the $f_\theta$ are differentiable in $\theta$, the likelihood equations are
$$\frac{\partial}{\partial \theta_i} \ln f^n_\theta(x^n) = 0 \text{ for } i = 1, 2, \ldots, k$$

Much of the relatively simple theory of (maximum) likelihood revolves more around solutions of the likelihood equations than it does maximizers of the likelihood. For a one-dimensional parameter there is the following result.
Theorem 170 Suppose that $k = 1$ and $\exists$ an open neighborhood of $\theta_0$, say $O$, such that

1. $f_\theta(x) > 0\ \forall x$ and $\forall \theta \in O$,
2. $\forall x$, $f_\theta(x)$ is differentiable at every point $\theta \in O$, and
3. $E_{\theta_0} \ln f_\theta(X)$ exists for all $\theta \in O$ and $E_{\theta_0} \ln f_{\theta_0}(X)$ is finite.

Then if $\epsilon > 0$ and $\delta > 0$, $\exists\ n$ such that $\forall m > n$ the $P_{\theta_0}$ probability that the equation
$$\frac{d}{d\theta}\, \ell_m(X^m, \theta) = 0$$
(the likelihood equation) has a root within $\epsilon$ of $\theta_0$ is at least $1 - \delta$.
Corollary 171 Under the hypotheses of Theorem 170, suppose in addition that $\Theta$ is open and $f_\theta(x)$ is differentiable at every point $\theta \in \Theta$, so that the likelihood equation
$$\frac{d}{d\theta}\, \ell_n(X^n, \theta) = \sum_{i=1}^n \frac{d}{d\theta} \ln f_\theta(X_i) = 0$$
makes sense at all $\theta \in \Theta$. Define $\theta_n$ to be the root of the likelihood equation when there is exactly one (and otherwise adopt any definition for $\theta_n$). If with $P_{\theta_0}$ probability approaching 1 the likelihood equation has a single root, then
$$\theta_n \to \theta_0 \text{ in } P_{\theta_0} \text{ probability}$$
Corollary 172 Under the hypotheses of Theorem 170, if $\{T_n\}$ is a sequence of estimators consistent at $\theta_0$ and
$$\hat{\theta}_n = \begin{cases} T_n & \text{if the likelihood equation has no roots} \\ \text{the root of the likelihood equation closest to } T_n & \text{otherwise} \end{cases}$$
then
$$\hat{\theta}_n \to \theta_0 \text{ in } P_{\theta_0} \text{ probability}$$

Actually implementing the prescription for $\hat{\theta}_n$ in Corollary 172 is not without its practical and philosophical difficulties. Another line of attack for possibly improving a consistent estimator is to adopt a "1-step Newton improvement" on it. Writing
$$L_n(\theta) = \ell_n(X^n, \theta) = \sum_{i=1}^n \ln f_\theta(X_i)$$
this is
$$\tilde{\theta}_n = T_n - \frac{L'_n(T_n)}{L''_n(T_n)}$$
(of course provided $L''_n(T_n) \neq 0$). In the event that $k > 1$, this becomes
$$\tilde{\theta}_n = T_n - \left[L''_n(T_n)\right]^{-1} L'_n(T_n)$$
for $L'_n(T_n)$ the $k \times 1$ vector of first partials of $L_n(\theta)$ and $L''_n(T_n)$ the $k \times k$ matrix of second partials of $L_n(\theta)$. It is plausible that estimators of this type end up having asymptotic behavior similar to that of real roots of the likelihood equations. See, e.g., Schervish's development around his Theorem 7.75 for a more general ("M-estimation") version of this.
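As an illustration of the 1-step Newton improvement (a sketch added here, not part of the outline): for the Cauchy location family $f_\theta(x) = 1/[\pi(1 + (x - \theta)^2)]$, the sample median is a consistent starting estimator $T_n$, and one Newton step on the loglikelihood is easy to code:

```python
def cauchy_newton_step(xs):
    """One Newton step T_n - L'(T_n)/L''(T_n) for the Cauchy location
    loglikelihood, started from the sample median."""
    s = sorted(xs)
    n = len(s)
    tn = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    # L'(t)  = sum 2(x_i - t) / (1 + (x_i - t)^2)
    # L''(t) = sum 2((x_i - t)^2 - 1) / (1 + (x_i - t)^2)^2
    lp = sum(2 * (x - tn) / (1 + (x - tn) ** 2) for x in xs)
    lpp = sum(2 * ((x - tn) ** 2 - 1) / (1 + (x - tn) ** 2) ** 2 for x in xs)
    return tn if lpp == 0 else tn - lp / lpp

# hypothetical sample with one outlying value
theta_tilde = cauchy_newton_step([1.2, -0.3, 0.8, 5.0, 0.1, 0.6, -1.1])
```

The derivatives follow from $\ln f_\theta(x) = -\ln\pi - \ln(1 + (x-\theta)^2)$; the median supplies the consistent $T_n$ that Corollary 172 and Theorem 176 require.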
8.1.2 Asymptotic Normality of Likelihood-Based Estimators

An honest version of the Stat 543 claim "MLE's are asymptotically Normal" is next.

Theorem 173 Suppose that $k = 1$ and $\exists$ an open neighborhood of $\theta_0$, say $O$, such that

1. $f_\theta(x) > 0\ \forall x$ and $\forall \theta \in O$,
2. $\forall x$, $f_\theta(x)$ is three times differentiable at every point of $O$,
3. $\exists\ M(x) \geq 0$ with $E_{\theta_0} M(X) < \infty$ and $\left|\frac{d^3}{d\theta^3} \ln f_\theta(x)\right| \leq M(x)\ \forall x$ and $\forall \theta \in O$,
4. $1 = \int f_\theta(x)\, d\mu(x)$ can be differentiated twice with respect to $\theta$ under the integral at $\theta_0$, and
5. $I_1(\theta) \in (0, \infty)\ \forall \theta \in O$.

If with $P_{\theta_0}$ probability approaching 1, $\hat{\theta}_n$ is a root of the likelihood equation and $\hat{\theta}_n \to \theta_0$ in $P_{\theta_0}$ probability, then under $\theta_0$
$$\sqrt{n}\left(\hat{\theta}_n - \theta_0\right) \stackrel{d}{\to} \mathrm{N}\left(0, \frac{1}{I_1(\theta_0)}\right)$$
Corollary 174 Under the hypotheses of Theorem 173, if $I_1(\theta)$ is continuous at $\theta_0$, then under $\theta_0$
$$\sqrt{n I_1\left(\hat{\theta}_n\right)}\left(\hat{\theta}_n - \theta_0\right) \stackrel{d}{\to} \mathrm{N}(0, 1)$$

Corollary 175 Under the hypotheses of Theorem 173, under $\theta_0$
$$\sqrt{-L''_n\left(\hat{\theta}_n\right)}\left(\hat{\theta}_n - \theta_0\right) \stackrel{d}{\to} \mathrm{N}(0, 1)$$
What is often a more practically useful result (parallel to Theorem 173) concerns "one-step Newton improvements" on "$\sqrt{n}$-consistent" estimators. (The following is a special case of Schervish's Theorem 7.75.)

Theorem 176 Under the hypotheses 1-5 of Theorem 173, suppose that under $\theta_0$ estimators $T_n$ are $\sqrt{n}$-consistent, that is, $\sqrt{n}(T_n - \theta_0)$ converges in distribution (or more generally, is $O(1)$ in $P_{\theta_0}$ probability). Then with
$$\tilde{\theta}_n = T_n - \frac{L'_n(T_n)}{L''_n(T_n)}$$
under $\theta_0$
$$\sqrt{n}\left(\tilde{\theta}_n - \theta_0\right) \stackrel{d}{\to} \mathrm{N}\left(0, \frac{1}{I_1(\theta_0)}\right)$$

It is then obvious that versions of Corollaries 174 and 175 hold where $\hat{\theta}_n$ is replaced by $\tilde{\theta}_n$.
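A quick Monte Carlo sanity check of Corollary 174 (an added illustration, not from the outline): for iid Exponential observations with rate $\theta$, the MLE is $1/\bar{X}$ and $I_1(\theta) = 1/\theta^2$, so $\sqrt{n I_1(\hat\theta_n)}(\hat\theta_n - \theta_0) = \sqrt{n}(\hat\theta_n - \theta_0)/\hat\theta_n$ should look standard normal:

```python
import random
import statistics

random.seed(7)
theta0, n, reps = 1.0, 400, 1000
zs = []
for _ in range(reps):
    xs = [random.expovariate(theta0) for _ in range(n)]
    mle = 1.0 / (sum(xs) / n)  # MLE of the exponential rate
    # I_1(theta) = 1/theta^2, so sqrt(n I_1(mle)) = sqrt(n)/mle
    zs.append((n ** 0.5) * (mle - theta0) / mle)

m = statistics.mean(zs)   # should be near 0
s = statistics.stdev(zs)  # should be near 1
```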
8.2 Asymptotics of LRT-like Tests and Competitors and Related Confidence Regions

This primarily concerns the most common "general purpose" means of cooking up tests (and related confidence sets) in non-standard statistical models. For testing $\mathrm{H}_0 : \theta \in \Theta_0$ versus $\mathrm{H}_a : \theta \in \Theta_1 = \Theta - \Theta_0$ we might consider a statistic
$$LR(x) = \frac{\sup_{\theta \in \Theta_1} f_\theta(x)}{\sup_{\theta \in \Theta_0} f_\theta(x)}$$
or perhaps
$$\lambda(x) = \max(LR(x), 1) = \frac{\sup_{\theta \in \Theta} f_\theta(x)}{\sup_{\theta \in \Theta_0} f_\theta(x)}$$
where rejection is appropriate for large $\lambda(x)$. Notice that if there is an MLE of $\theta$, say $\hat{\theta}$,
$$\sup_{\theta \in \Theta} f_\theta(x) = f_{\hat{\theta}}(x)$$
and there is thus the promise of using "MLE-type" asymptotics to establish limiting distributions for likelihood ratio-type test statistics. Corresponding to the case of a point null hypothesis ($\mathrm{H}_0 : \theta = \theta_0$) in the iid (one-sample) context there is the following. (This is the $k = 1$ version. Similar results hold for $k > 1$ with $\chi^2_k$ limits.)
Theorem 177 Under the hypotheses of Theorem 173, if
$$\Lambda_n = 2 \ln \frac{f^n_{\hat{\theta}_n}(X^n)}{f^n_{\theta_0}(X^n)} = 2\left(L_n\left(\hat{\theta}_n\right) - L_n(\theta_0)\right)$$
then under $\theta_0$
$$\Lambda_n \stackrel{d}{\to} \chi^2_1$$

Theorem 177 and its $k > 1$ generalizations are relevant to testing of point null hypotheses and corresponding confidence set making for entire parameter vectors. There are also versions of $\chi^2$ limit theorems for composite null hypotheses relevant to confidence set making for parts of a $k$-dimensional parameter vector. (See, for example, Schervish page 459 for the complete statement of a theorem of this type.) In vague terms, one can often prove things like what is indicated in the next claim.
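To illustrate Theorem 177 (an added example, not part of the outline): for iid N$(\theta, \sigma^2)$ with $\sigma^2$ known, $2(L_n(\hat\theta_n) - L_n(\theta_0))$ simplifies algebraically to $n(\bar{X} - \theta_0)^2/\sigma^2$, which is exactly $\chi^2_1$ under $\theta_0$, so the limit theorem holds trivially here. A sketch checking the identity numerically:

```python
import math

xs = [2.1, 1.7, 2.9, 2.4, 1.5, 2.2]  # hypothetical sample
sigma2, theta0 = 1.0, 2.0
n = len(xs)
xbar = sum(xs) / n  # the MLE of theta when sigma2 is known

def loglik(theta):
    """Normal loglikelihood with known variance sigma2."""
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - theta) ** 2 / (2 * sigma2) for x in xs)

lrt = 2 * (loglik(xbar) - loglik(theta0))
closed_form = n * (xbar - theta0) ** 2 / sigma2
```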
Claim 178 Suppose that $\Theta \subset \mathbb{R}^p$ and appropriate regularity conditions hold in the iid problem. If $k < p$ and
$$\Lambda_n = \frac{\sup_{\theta \in \Theta} f^n_\theta(X^n)}{\sup_{\theta \text{ s.t. } \theta_i = \theta_i^0,\ i = 1, 2, \ldots, k} f^n_\theta(X^n)}$$
then under any $\theta$ for which $\theta_i = \theta_i^0$ for $i = 1, 2, \ldots, k$,
$$2 \ln \Lambda_n \stackrel{d}{\to} \chi^2_k$$
Claim 178 is relevant to testing $\mathrm{H}_0 : \theta_i = \theta_i^0$ for $i = 1, 2, \ldots, k$ and for making confidence sets for $(\theta_1, \theta_2, \ldots, \theta_k)$. A way of thinking about confidence sets for part of a parameter vector that come from the inversion of likelihood ratio tests is in terms of "profile loglikelihood" functions.

Definition 179 In the (iid) context where $\Theta \subset \mathbb{R}^p$ and $k < p$, the function of $(\theta_1, \theta_2, \ldots, \theta_k)$
$$L_n(\theta_1, \theta_2, \ldots, \theta_k) = \sup_{(\theta_{k+1}, \ldots, \theta_p)} L_n(\theta)$$
is called the profile loglikelihood function for $(\theta_1, \theta_2, \ldots, \theta_k)$.

With the notation/jargon of Definition 179 and results like Claim 178, large $n$ confidence sets for $(\theta_1, \theta_2, \ldots, \theta_k)$ become
$$\left\{(\theta_1, \ldots, \theta_k) \,\middle|\, L_n(\theta_1, \ldots, \theta_k) > \sup_{(\theta_1, \ldots, \theta_k)} L_n(\theta_1, \ldots, \theta_k) - \frac{1}{2}\chi^2_k\right\}$$
(for $\chi^2_k$ a small upper percentage point of the $\chi^2_k$ distribution).
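An illustration of Definition 179 (a sketch added here, not from the outline): for iid N$(\mu, \sigma^2)$ with both parameters unknown, profiling out $\sigma^2$ gives $L_n(\mu) = -\frac{n}{2}\left[\ln(2\pi\hat\sigma^2(\mu)) + 1\right]$ with $\hat\sigma^2(\mu) = \frac{1}{n}\sum(x_i - \mu)^2$, and the approximate 95% confidence set for $\mu$ keeps those $\mu$ within $\frac{1}{2}(3.84)$ of the profile maximum:

```python
import math

xs = [4.9, 5.6, 5.1, 4.4, 5.8, 5.0, 4.7, 5.3]  # hypothetical data
n = len(xs)
xbar = sum(xs) / n

def profile_loglik(mu):
    """Profile loglikelihood for mu: sigma^2 maximized out analytically."""
    s2 = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * s2) + 1)

peak = profile_loglik(xbar)  # the profile is maximized at the overall MLE xbar
cut = peak - 0.5 * 3.84      # 3.84 = upper 5% point of chi-square_1
grid = [3.5 + 0.01 * j for j in range(300)]
conf_set = [mu for mu in grid if profile_loglik(mu) > cut]
```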
A generalization of the hypothesis $\mathrm{H}_0 : \theta_i = \theta_i^0$ for $i = 1, 2, \ldots, k$ is the hypothesis $\mathrm{H}_0 : h_i(\theta) = c_i$ for $i = 1, 2, \ldots, k$ for some set of functions $\{h_i(\theta)\}$ (that could be $h_i(\theta) = \theta_i$) and constants $\{c_i\}$. The obvious LRTs based on
$$\Lambda^{c_1, c_2, \ldots, c_k}_n = \frac{\sup_{\theta \in \Theta} f^n_\theta(X^n)}{\sup_{\theta \text{ s.t. } h_i(\theta) = c_i,\ i = 1, 2, \ldots, k} f^n_\theta(X^n)}$$
could be defined, $\chi^2_k$ limit distributions identified under appropriate conditions on the $h_i(\theta)$, and large sample confidence sets for $(h_1(\theta), h_2(\theta), \ldots, h_k(\theta))$ of the form
$$\left\{(c_1, c_2, \ldots, c_k) \,\middle|\, 2 \ln \Lambda^{c_1, c_2, \ldots, c_k}_n < \chi^2_k\right\}$$
made (for $\chi^2_k$ a small upper percentage point of the $\chi^2_k$ distribution).

Tests (and associated confidence sets) sometimes discussed as competitors for likelihood ratio tests (and regions) are the so-called "Wald" and "score" procedures. So consider next Wald tests for $\mathrm{H}_0 : h_i(\theta) = c_i$ for $i = 1, 2, \ldots, k$ in the (iid) context where $\Theta \subset \mathbb{R}^p$ and $k < p$.
Define
$$h(\theta) = (h_1(\theta), h_2(\theta), \ldots, h_k(\theta))'$$
and for $\hat{\theta}_n$ such that under $\theta_0$
$$\sqrt{n}\left(\hat{\theta}_n - \theta_0\right) \stackrel{d}{\to} \mathrm{MVN}_p\left(0, I_1^{-1}(\theta_0)\right)$$
consider the estimator $h\left(\hat{\theta}_n\right)$ of $h(\theta)$. For differentiable $h(\theta)$ let
$$H(\theta) = \left[\frac{\partial}{\partial \theta_j} h_i(\theta)\right]_{k \times p}$$
The $\delta$-method shows that under $\theta_0$
$$\sqrt{n}\left(h\left(\hat{\theta}_n\right) - h(\theta_0)\right) \stackrel{d}{\to} \mathrm{MVN}_k\left(0, H(\theta_0)\, I_1^{-1}(\theta_0)\, H(\theta_0)'\right)$$
Let $B(\theta_0) = H(\theta_0)\, I_1^{-1}(\theta_0)\, H(\theta_0)'$. Provided $H(\theta_0)$ is full rank so that $B(\theta_0)$ is invertible, and $h(\theta_0) = (c_1, c_2, \ldots, c_k)'$, under $\theta_0$
$$n\left(h\left(\hat{\theta}_n\right) - \begin{pmatrix} c_1 \\ \vdots \\ c_k \end{pmatrix}\right)' B(\theta_0)^{-1} \left(h\left(\hat{\theta}_n\right) - \begin{pmatrix} c_1 \\ \vdots \\ c_k \end{pmatrix}\right) \stackrel{d}{\to} \chi^2_k$$
So, supposing that $B(\theta)$ is continuous at $\theta_0$, an "expected information" version of a Wald test of $\mathrm{H}_0 : h_i(\theta) = c_i$ for $i = 1, 2, \ldots, k$ can be based on the statistic
$$W^{c_1, c_2, \ldots, c_k}_n = n\left(h\left(\hat{\theta}_n\right) - \begin{pmatrix} c_1 \\ \vdots \\ c_k \end{pmatrix}\right)' B\left(\hat{\theta}_n\right)^{-1} \left(h\left(\hat{\theta}_n\right) - \begin{pmatrix} c_1 \\ \vdots \\ c_k \end{pmatrix}\right)$$
and a limiting $\chi^2_k$ null distribution. Or, an "observed information" version of $W_n$ can be had by writing
$$\hat{B}_n = H\left(\hat{\theta}_n\right) \left[\frac{1}{n} D_n\left(\hat{\theta}_n\right)\right]^{-1} H\left(\hat{\theta}_n\right)'$$
for
$$D_n(\theta) = \left[-\frac{\partial^2}{\partial \theta_i \partial \theta_j} L_n(\theta)\right]$$
and then replacing $B\left(\hat{\theta}_n\right)^{-1}$ in $W^{c_1, c_2, \ldots, c_k}_n$ with $\hat{B}_n^{-1}$. Note finally that a ($k$-dimensional) large sample ellipsoidal confidence set for $h(\theta)$ is then of the form
$$\left\{(c_1, c_2, \ldots, c_k)' \,\middle|\, W^{c_1, c_2, \ldots, c_k}_n < \chi^2_k\right\}$$
(for $\chi^2_k$ a small upper percentage point of the $\chi^2_k$ distribution).
The "score test" competitor for LRT and Wald tests of H0 :hi ( ) = ci for
i = 1; 2; : : : ; k is motivated by the notion that if H0 is true and ~n maximizes
the loglikelihood Ln ( ) subject to the constraints imposed by the null hypothesis,
then ~n should typically nearly be an absolute maximizer of Ln ( ), and therefore
49
the gradient of Ln ( ) at ~n should be small. That is, de…ne the p
vector
!
@
~
rLn n =
Ln ( )
@ i
= ~n
1 gradient
(this is the score function evaluated at the restricted MLE, ~n ) and under H0
we expect this to be close to 0. The score test statistic is
Vnc1 ;c2 ;:::;ck =
1
rLn ~n
n
0
I1
~n rLn ~n
1
As it turns out, it is typically the case that under H0 , Vnc1 ;c2 ;:::;ck di¤ers from
2 ln cn1 ;c2 ;:::;ck (the version of the corresponding LRT statistic with the standard
limiting null distribution) by a quantity tending to 0 in probability. Hence a 2k
limiting null distribution is appropriate for a score test using Vnc1 ;c2 ;:::;ck . Note
too that another large sample con…dence set for h ( ) is then of the form
0
(c1 ; c2 ; : : : ; ck ) jVnc1 ;c2 ;:::;ck <
2
k
(for 2k a small upper percentage point of the 2k distribution).
It is common to point out that when dealing with a composite null hypothesis, in order to use an LRT one must …nd both unrestricted and restricted
MLEs, while to use a Wald test only an unrestricted MLE is needed, and to use
a score test only a restricted MLE is needed.
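A worked comparison of the three statistics (an added illustration, not from the outline): for iid Poisson$(\lambda)$ and $\mathrm{H}_0 : \lambda = c$ (so $h(\lambda) = \lambda$ and $I_1(\lambda) = 1/\lambda$), all three have simple closed forms: LRT $= 2n(\bar{x}\ln(\bar{x}/c) - \bar{x} + c)$, Wald $= n(\bar{x} - c)^2/\bar{x}$ (unrestricted MLE only), and score $= n(\bar{x} - c)^2/c$ (restricted MLE only). All are asymptotically $\chi^2_1$ under $\mathrm{H}_0$ and close to one another when $\bar{x} \approx c$:

```python
import math

c = 3.0                                # null value of the Poisson mean
xs = [2, 4, 3, 5, 3, 2, 4, 3, 3, 4]    # hypothetical counts
n = len(xs)
xbar = sum(xs) / n                     # unrestricted MLE of lambda

# All three statistics target the same chi-square_1 null limit.
lrt = 2 * n * (xbar * math.log(xbar / c) - xbar + c)
wald = n * (xbar - c) ** 2 / xbar      # uses only the unrestricted MLE xbar
score = n * (xbar - c) ** 2 / c        # uses only the restricted MLE c
```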
8.3 Asymptotic Shape of the Loglikelihood and Related Bayesian Asymptotics

We consider some simple/crude/easy implications of the foregoing for questions of the large sample shape of random functions that are the loglikelihood and Bayes posterior densities. (Much finer/harder results are available and some are discussed in Schervish. See his Section 7.4.) Standard statistical folklore is that for large $n$, under $\theta_0$:

1. loglikelihoods are steep with a peak near $\theta_0$ and a locally quadratic shape in continuous contexts, and
2. "posteriors are consistent" if a prior spreads some of its mass around at/near $\theta_0$, and in continuous contexts the corresponding posterior density will be approximately normal with mean an MLE of $\theta$.
In the following discussion, continue the iid (one sample) context.

Lemma 180 For $\theta \neq \theta_0$,
$$L_n(\theta_0) - L_n(\theta) \to \infty \text{ in } P_{\theta_0} \text{ probability}$$

Claim 181 If $G$ is a prior on $\Theta$ with density $g$ with respect to $\nu$, the posterior density is
$$g(\theta \mid X^n) = \frac{g(\theta)\, f^n_\theta(X^n)}{\int g(\theta)\, f^n_\theta(X^n)\, d\nu(\theta)}$$
and therefore for $\theta \neq \theta_0$,
$$\frac{g(\theta \mid X^n)}{g(\theta_0 \mid X^n)} \to 0 \text{ in } P_{\theta_0} \text{ probability}$$
Corollary 182 If $\Theta$ is finite, $\nu$ is counting measure, and $g(\theta) > 0\ \forall \theta$, the posterior $\theta \mid X^n$ is consistent in the sense that for any $A \subset \Theta$,
$$P\left[\theta \in A \mid X^n\right] \to I[\theta_0 \in A] \text{ in } P_{\theta_0} \text{ probability}$$
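A tiny simulation illustrating Corollary 182 (an added example, not from the outline): with $\Theta = \{0, 1\}$, equal prior mass on each point, and N$(\theta, 1)$ data generated under $\theta_0 = 0$, the posterior probability of $\theta = 0$ should head to 1:

```python
import math
import random

random.seed(11)
n = 200
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # data under theta_0 = 0

def loglik(theta):
    """Normal loglikelihood up to a constant (constants cancel in the ratio)."""
    return sum(-0.5 * (x - theta) ** 2 for x in xs)

l0, l1 = loglik(0.0), loglik(1.0)
# equal prior mass 1/2 on each point of Theta = {0, 1}
post0 = 1.0 / (1.0 + math.exp(l1 - l0))  # posterior probability of theta = 0
```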
A (crude) result giving a hint about shape for a loglikelihood is next.
Lemma 183 Under the hypotheses of Theorem 173, suppose that $\hat{\theta}_n$ is an MLE of $\theta$ consistent at $\theta_0$. Then
$$Q_n(\lambda) \equiv L_n\left(\hat{\theta}_n\right) - L_n\left(\hat{\theta}_n + \frac{\lambda}{\sqrt{n}}\right) \to \frac{1}{2}\, \lambda^2 I_1(\theta_0) \text{ in } P_{\theta_0} \text{ probability}$$
Lemma 183 and the phenomenon it represents have implications for the nature of a posterior distribution. Schervish has a very careful and very hard development. What we can get easily is the following.

Corollary 184 Under the hypotheses of Theorem 173, suppose that a prior $G$ has density $g$ with respect to Lebesgue measure on $\Theta$, $g(\theta_0) > 0$, and $g$ is continuous at $\theta_0$. If $\hat{\theta}_n$ is an MLE of $\theta$ consistent at $\theta_0$, then the posterior density $g(\theta \mid X^n)$ has the property that
$$\ln\left(\frac{g\left(\hat{\theta}_n \mid X^n\right)}{g\left(\hat{\theta}_n + \frac{\lambda}{\sqrt{n}} \mid X^n\right)}\right) \to \frac{1}{2}\, \lambda^2 I_1(\theta_0) \text{ in } P_{\theta_0} \text{ probability}$$
Corollary 184 suggests that under $\theta_0$ one might, roughly speaking, expect posterior densities to look approximately normal with mean $\hat{\theta}_n$ and variance $1/n I_1(\theta_0)$. So a practical approximate form for $g(\theta \mid X^n)$ might be
$$\mathrm{N}\left(\hat{\theta}_n,\ \frac{1}{n I_1\left(\hat{\theta}_n\right)}\right) \text{ or } \mathrm{N}\left(\hat{\theta}_n,\ \frac{1}{-L''_n\left(\hat{\theta}_n\right)}\right)$$
Schervish proves that under appropriate conditions the posterior density of
$$\sqrt{-L''_n\left(\hat{\theta}_n\right)}\left(\theta - \hat{\theta}_n\right)$$
(a random function of $\theta$) converges in an appropriate sense to the standard normal density. It is another (harder) matter to prove that approximate posterior probabilities for this are in some appropriate sense close to real posterior probabilities.
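A quick check of the normal approximation (an added illustration, not from the outline): for Binomial$(n, \theta)$ with a uniform prior, the exact posterior is Beta$(s+1, n-s+1)$, whose mode is exactly the MLE $s/n$ and whose variance is close to the suggested $1/(n I_1(\hat\theta)) = \hat\theta(1-\hat\theta)/n$:

```python
n, s = 100, 40                      # hypothetical binomial data
alpha, beta = s + 1, n - s + 1      # Beta posterior under a uniform prior

mle = s / n
beta_mode = (alpha - 1) / (alpha + beta - 2)  # equals s/n exactly
beta_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
approx_var = mle * (1 - mle) / n    # 1/(n I_1(mle)) for the Bernoulli model
```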
9 Optimality in Finite Sample Point Estimation

9.1 Unbiasedness

Definition 185 The bias of an estimator $\delta$ for $\gamma(\theta) \in \mathbb{R}$ is
$$E_\theta(\delta(X) - \gamma(\theta)) = \int_{\mathcal{X}} (\delta(x) - \gamma(\theta))\, dP_\theta(x)$$

Definition 186 An estimator $\delta$ is unbiased for $\gamma(\theta) \in \mathbb{R}$ provided
$$E_\theta(\delta(X) - \gamma(\theta)) = 0\ \forall \theta \in \Theta$$
The following suggests that often, no unbiased estimator can be much good.

Theorem 187 If $L(\theta, a) = w(\theta)(\gamma(\theta) - a)^2$ for $w(\theta) > 0\ \forall \theta$ and $\delta$ is both Bayes with respect to $G$ with finite risk function and unbiased for $\gamma(\theta)$, then (according to the joint distribution of $(X, \theta)$)
$$\delta(X) = \gamma(\theta) \text{ a.s.}$$

Corollary 188 Suppose $\Theta \subset \mathbb{R}$ is nontrivial (has at least 2 elements) and $\{P_\theta\}$ has elements that are mutually absolutely continuous. Then if $L(\theta, a) = (\theta - a)^2$, no Bayes estimator with finite variance can be unbiased.
Definition 189 $\delta$ is best unbiased for $\gamma(\theta) \in \mathbb{R}$ if it is unbiased and is at least as good as any other unbiased estimator.

Theorem 190 Suppose that $\mathcal{A} \subset \mathbb{R}$ is convex, $L(\theta, a)$ is convex in $a\ \forall \theta$, and $\delta_1$ and $\delta_2$ are two best unbiased estimators of $\gamma(\theta)$. Then for any $\theta$ for which $L(\theta, a)$ is strictly convex in $a$ and $R(\theta, \delta_1) < \infty$,
$$P_\theta[\delta_1(X) = \delta_2(X)] = 1$$

The special case of SELE (squared error loss estimation) of $\gamma(\theta)$ is of most importance. For this case, $E_\theta \delta^2(X) < \infty$ and $\delta$ unbiased for $\gamma(\theta)$ imply that
$$R(\theta, \delta) = \mathrm{Var}_\theta\, \delta(X)$$
Definition 191 $\delta$ is called a UMVUE (uniformly minimum variance unbiased estimator) of $\gamma(\theta) \in \mathbb{R}$ if it is unbiased and
$$\mathrm{Var}_\theta\, \delta(X) \leq \mathrm{Var}_\theta\, \delta'(X)\ \forall \theta$$
for $\delta'$ any other unbiased estimator of $\gamma(\theta)$.

The primary tool of UMVUE theory is the famous Lehmann-Scheffé Theorem.
Theorem 192 (Lehmann-Scheffé) Suppose that $\mathcal{A} \subset \mathbb{R}$ is convex, $T$ is sufficient for $\mathcal{P} = \{P_\theta\}$, and $\delta$ is an unbiased estimator of $\gamma(\theta)$. Let
$$E[\delta \mid T] = \delta_0 \circ T$$
(the latter representation guaranteed by Lehmann's Theorem). Then $\delta_0 \circ T$ is an unbiased estimator of $\gamma(\theta)$. If in addition $T$ is complete for $\mathcal{P}$:

1. if $\delta'_0 \circ T$ is unbiased for $\gamma(\theta)$, then $\delta'_0 \circ T = \delta_0 \circ T$ a.s. $P_\theta\ \forall \theta$, and
2. $L(\theta, a)$ convex in $a\ \forall \theta$ implies that
   (a) $\delta_0 \circ T$ is best unbiased, and
   (b) if $\delta'$ is any other best unbiased estimator of $\gamma(\theta)$, then $\delta' = \delta_0 \circ T$ a.s. $P_\theta$ for any $\theta$ for which $R(\theta, \delta') < \infty$ and $L(\theta, a)$ is strictly convex in $a$.

9.2 "Information" Inequalities

The topic of discussion here is inequalities about variances of estimators, some of which involve "information" measures. The basic tool is the Cauchy-Schwarz inequality.
Lemma 193 (Cauchy-Schwarz) If $U$ and $V$ are random variables with $EU^2 < \infty$ and $EV^2 < \infty$, then
$$EU^2\, EV^2 \geq (EUV)^2$$

Corollary 194 If $X$ and $Y$ have finite second moments,
$$(\mathrm{Cov}(X, Y))^2 \leq \mathrm{Var}X\, \mathrm{Var}Y$$
(so that $|\mathrm{Corr}(X, Y)| \leq 1$).

Corollary 195 Suppose that $\delta(Y)$ takes values in $\mathbb{R}$ and $E\delta^2(Y) < \infty$. If $g$ is a measurable function such that $Eg(Y) = 0$ and $0 < Eg^2(Y) < \infty$, then defining $C \equiv E\delta(Y)g(Y)$,
$$\mathrm{Var}\, \delta(Y) \geq \frac{C^2}{Eg^2(Y)}$$
A more statistical-looking version of Corollary 195 is next.

Corollary 196 Suppose that $\delta(X)$ takes values in $\mathbb{R}$ and $E_\theta \delta^2(X) < \infty$. If $g_\theta$ is a measurable function such that $E_\theta g_\theta(X) = 0$ and $0 < E_\theta g_\theta^2(X) < \infty$, then defining $C(\theta) \equiv E_\theta \delta(X) g_\theta(X)$,
$$\mathrm{Var}_\theta\, \delta(X) \geq \frac{C^2(\theta)}{E_\theta g_\theta^2(X)}$$
Corollary 196 is the source of several interesting inequalities as one chooses different interesting $g_\theta$'s. The most famous derives from the choice
$$g_\theta(x) = \frac{d}{d\theta} \ln f_\theta(x)$$
and is stated next.

Theorem 197 (Cramér-Rao) Suppose that the model $\mathcal{P} = \{P_\theta\}$ (for $\Theta \subset \mathbb{R}$) is FI regular at $\theta_0$ and that $0 < I(\theta_0) < \infty$. If $E_\theta \delta(X) = \int \delta(x) f_\theta(x)\, d\mu(x)$ can be differentiated under the integral sign at $\theta_0$, i.e. if
$$\frac{d}{d\theta} E_\theta \delta(X)\bigg|_{\theta = \theta_0} = \int \delta(x)\, \frac{d}{d\theta} f_\theta(x)\bigg|_{\theta = \theta_0}\, d\mu(x)$$
then
$$\mathrm{Var}_{\theta_0}\, \delta(X) \geq \frac{\left(\frac{d}{d\theta} E_\theta \delta(X)\big|_{\theta = \theta_0}\right)^2}{I(\theta_0)}$$
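A numerical illustration of Theorem 197 (an added example, not part of the outline): for Poisson$(\lambda)$, the score is $x/\lambda - 1$, $I(\lambda) = 1/\lambda$, and the unbiased estimator $\delta(X) = X$ has variance $\lambda = 1/I(\lambda)$, so the bound is attained. A Monte Carlo estimate of $I(\lambda)$ as the mean squared score:

```python
import math
import random

random.seed(3)
lam = 2.5

def poisson(l):
    """Knuth's simple Poisson sampler (fine for small means)."""
    L, k, p = math.exp(-l), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

draws = [poisson(lam) for _ in range(200000)]
score_sq = [((x / lam) - 1.0) ** 2 for x in draws]
fisher_mc = sum(score_sq) / len(score_sq)  # should be near I(lam) = 1/lam
cr_bound = 1.0 / fisher_mc                 # near Var(X) = lam: bound attained
```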
Then, consider the choice
$$g_{\theta_0}(x) = \frac{f_\theta(x) - f_{\theta_0}(x)}{f_{\theta_0}(x)}$$
Provided $P_\theta \ll P_{\theta_0}$, Corollary 196 implies that
$$\mathrm{Var}_{\theta_0}\, \delta(X) \geq \frac{\left(E_\theta \delta(X) - E_{\theta_0} \delta(X)\right)^2}{E_{\theta_0}\left(\frac{f_\theta(X) - f_{\theta_0}(X)}{f_{\theta_0}(X)}\right)^2}$$
This produces the so-called Chapman-Robbins inequality.

Theorem 198 (Chapman-Robbins)
$$\mathrm{Var}_{\theta_0}\, \delta(X) \geq \sup_{\theta \text{ s.t. } P_\theta \ll P_{\theta_0}} \frac{\left(E_\theta \delta(X) - E_{\theta_0} \delta(X)\right)^2}{E_{\theta_0}\left(\frac{f_\theta(X) - f_{\theta_0}(X)}{f_{\theta_0}(X)}\right)^2}$$

10 Optimality in Finite Sample Testing
We consider the classical hypothesis testing problem. Write $\Theta = \Theta_0 \cup \Theta_1$ where $\Theta_0 \cap \Theta_1 = \emptyset$, and we must decide between $\mathrm{H}_0 : \theta \in \Theta_0$ and $\mathrm{H}_a : \theta \in \Theta_1$ based on $X \sim P_\theta$. This can be thought of in decision-theoretic terms as a two-action decision problem, i.e. where $\mathcal{A} = \{0, 1\}$. One possible loss for such an approach is (0-1 loss)
$$L(\theta, a) = I[\theta \in \Theta_0]\, I[a = 1] + I[\theta \in \Theta_1]\, I[a = 0]$$
While a completely decision-theoretic approach is possible, the subject is older than formal decision theory and has its own set of emphases and peculiar terminology.
Definition 199 A non-randomized decision function $\delta : \mathcal{X} \to \mathcal{A} = \{0, 1\}$ is called a hypothesis test.

Definition 200 The rejection region for a non-randomized test $\delta$ is $\{x \mid \delta(x) = 1\}$.

Definition 201 A Type 1 error in a testing problem is taking action 1 when $\theta \in \Theta_0$. A Type 2 error is taking action 0 when $\theta \in \Theta_1$.

In a two-action decision problem a behavioral decision rule is for each $x$ a distribution over $\mathcal{A} = \{0, 1\}$, which is clearly characterized by $\phi_x(\{1\}) \in [0, 1]$. In hypothesis testing terminology one has

Definition 202 A randomized test is a function $\phi : \mathcal{X} \to [0, 1]$ with the interpretation that if $\phi(x) = p$, one rejects $\mathrm{H}_0$ with probability $p$.
For 0-1 loss, for $\theta \in \Theta_0$
$$R(\theta, \phi) = E_\theta I[a = 1] = E_\theta \phi(X)$$
while for $\theta \in \Theta_1$
$$R(\theta, \phi) = E_\theta I[a = 0] = 1 - E_\theta \phi(X)$$
(with $a$ the action randomly chosen by $\phi$), and it is thus clear that the function of $\theta$, $E_\theta \phi(X)$, will be of central importance to the theory of hypothesis testing.

Definition 203 The power function of a (possibly randomized) test $\phi$ is
$$\beta(\theta) = E_\theta \phi(X) = \int \phi(x)\, dP_\theta(x)$$

Definition 204 The size or level of the test $\phi$ is $\sup_{\theta \in \Theta_0} \beta(\theta)$.

10.1 Simple versus Simple Testing
Consider the case of testing where $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$. We may for each test $\phi$ define the (0-1 loss) risk vector
$$(R(\theta_0, \phi), R(\theta_1, \phi)) = (\beta(\theta_0), 1 - \beta(\theta_1))$$
and consider the standard decision-theoretic two-dimensional risk set $\mathcal{S}$. Fairly obviously, a completely equivalent device is to instead consider the set $\mathcal{V}$ of points
$$(\beta(\theta_0), \beta(\theta_1))$$

Lemma 205 For a simple versus simple testing problem with
$$\mathcal{S} = \{(\beta(\theta_0), 1 - \beta(\theta_1)) \mid \phi \text{ is a test}\} \text{ and } \mathcal{V} = \{(\beta(\theta_0), \beta(\theta_1)) \mid \phi \text{ is a test}\}$$

1. both $\mathcal{S}$ and $\mathcal{V}$ are convex,
2. both $\mathcal{S}$ and $\mathcal{V}$ are symmetric with respect to the point $\left(\frac{1}{2}, \frac{1}{2}\right)$,
3. $(0, 1) \in \mathcal{S}$, $(1, 0) \in \mathcal{S}$, and $(0, 0) \in \mathcal{V}$, $(1, 1) \in \mathcal{V}$, and
4. both $\mathcal{S}$ and $\mathcal{V}$ are closed.
Definition 206 For testing $\mathrm{H}_0 : \theta = \theta_0$ versus $\mathrm{H}_a : \theta = \theta_1$, a test $\phi$ is called most powerful of size $\alpha$ if it is of size $\alpha$ and for any other test $\phi'$ with $\beta_{\phi'}(\theta_0) \leq \alpha$,
$$\beta_{\phi'}(\theta_1) \leq \beta_\phi(\theta_1)$$

The fundamental theorem of all testing theory is the Neyman-Pearson Lemma. It has implications that reach far beyond the present case of simple versus simple testing, but it is stated in the present context.
Theorem 207 (Neyman-Pearson) Suppose that $\mathcal{P} = \{P_{\theta_0}, P_{\theta_1}\} \ll \mu$, a sigma-finite measure, and let $f_i = \dfrac{dP_{\theta_i}}{d\mu}$.

1. (Sufficiency) Any test of the form
$$\phi(x) = \begin{cases} 1 & \text{if } f_1(x) > k f_0(x) \\ \gamma(x) & \text{if } f_1(x) = k f_0(x) \\ 0 & \text{if } f_1(x) < k f_0(x) \end{cases} \tag{*}$$
for some $k \in [0, \infty)$ and $\gamma(x) : \mathcal{X} \to [0, 1]$ is most powerful of its size. Further (corresponding to the $k = \infty$ case), the test
$$\phi(x) = \begin{cases} 1 & \text{if } f_0(x) = 0 \\ 0 & \text{otherwise} \end{cases} \tag{**}$$
is most powerful of its size $\alpha = 0$.

2. (Existence) For every $\alpha \in [0, 1]$, $\exists$ a test of the form (*) with $\gamma(x) = \gamma$ (a constant in $[0, 1]$) or a test of form (**) with $\beta(\theta_0) = \alpha$.

3. (Uniqueness) If $\phi$ is a most powerful test of size $\alpha$, then it is of form (*) or (**) except possibly for $x \in N$ with $P_{\theta_0}(N) = P_{\theta_1}(N) = 0$.
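A worked instance of the Neyman-Pearson Lemma (an added example, not from the outline): for $f_0 = $ N$(0,1)$ versus $f_1 = $ N$(1,1)$, the likelihood ratio $f_1(x)/f_0(x) = e^{x - 1/2}$ is increasing in $x$, so the form (*) test rejects when $x$ exceeds the upper $\alpha$ point of N$(0,1)$, and its power is $1 - \Phi(z_{1-\alpha} - 1)$:

```python
import math

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_quantile(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha = 0.05
cutoff = phi_quantile(1 - alpha)     # reject H0 when x > cutoff
size = 1 - phi_cdf(cutoff)           # equals alpha by construction
power = 1 - phi_cdf(cutoff - 1.0)    # beta(theta_1) under the mean-1 alternative
```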