Exponential families
STAT 582
The family of distributions with range not depending on the parameter and with sufficient statistics
that have dimension independent of sample size turns out to be quite rich. It is called the exponential
family of distributions. These have density
$$f(x;\theta) = \exp\bigl(c(\theta)^T T(x) - B(\theta)\bigr)\,h(x), \qquad x \in E \subset \mathbb{R}^n. \tag{1}$$
where the set $E$ does not depend on $\theta$. The sufficient statistic $T(x)$, which is determined up to a multiplicative constant, is called the natural sufficient statistic. We say that an exponential family is minimal if the component functions of $c(\theta)$ and the component statistics of $T(x)$ are each linearly independent. We can always achieve this by reparametrization.
Example: (1) $\Gamma(\theta,1)$: For the gamma distribution with shape parameter $\theta$ and scale parameter 1 we have $T(x)=\sum\log x_i$, $E=\{x : x_i>0,\ i=1,\dots,n\}$, $c(\theta)=\theta-1$, $B(\theta)=n\log\Gamma(\theta)$ and $h(x)=\exp(-\sum x_i)$.
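A small numerical check of this factorization (our own sketch, not part of the notes; it assumes NumPy/SciPy and arbitrary values of $\theta$ and $n$):

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
theta, n = 2.5, 5                      # illustrative values, not from the notes
x = rng.gamma(theta, 1.0, size=n)      # a sample from Gamma(theta, 1)

# log joint density computed directly from the Gamma(theta, 1) density
log_joint = np.sum((theta - 1.0) * np.log(x) - x - gammaln(theta))

# exponential-family factorization: c(theta) * T(x) - B(theta) + log h(x)
T = np.sum(np.log(x))                  # natural sufficient statistic
c = theta - 1.0
B = n * gammaln(theta)
log_h = -np.sum(x)

print(np.isclose(log_joint, c * T - B + log_h))   # True
```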
(2) The inverse normal: This distribution is given by the density

$$f(x;\mu,\lambda) = \Bigl(\frac{\lambda}{2\pi}\Bigr)^{1/2} x^{-3/2}\exp\bigl(-\lambda(x-\mu)^2/2\mu^2 x\bigr), \qquad x>0.$$

Here $\mu$ is the mean, and $\lambda$ is a precision parameter. It sometimes helps to reparametrize using $\alpha=\lambda/\mu^2$, yielding

$$f(x;\alpha,\lambda) = \exp\bigl((\alpha\lambda)^{1/2} + \tfrac12\log\lambda - \alpha x/2 - \lambda/2x\bigr)\,(2\pi)^{-1/2}x^{-3/2},$$

so that $(\sum x_i, \sum x_i^{-1})$ is the natural sufficient statistic, with natural parameter $(-\alpha/2, -\lambda/2)$.
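As a quick sanity check on the reparametrization (again our own sketch; the grid and parameter values are arbitrary), the $(\mu,\lambda)$ density and the $(\alpha,\lambda)$ exponential-family form should agree pointwise when $\alpha=\lambda/\mu^2$:

```python
import numpy as np

mu, lam = 1.7, 3.2                     # illustrative values
alpha = lam / mu**2
x = np.linspace(0.1, 10.0, 200)

# density in the (mu, lambda) parametrization
f1 = np.sqrt(lam / (2 * np.pi)) * x**-1.5 * np.exp(-lam * (x - mu)**2 / (2 * mu**2 * x))

# exponential-family form in (alpha, lambda), with h(x) = (2*pi)**-0.5 * x**-1.5
f2 = (np.exp(np.sqrt(alpha * lam) + 0.5 * np.log(lam) - alpha * x / 2 - lam / (2 * x))
      * (2 * np.pi)**-0.5 * x**-1.5)

print(np.allclose(f1, f2))             # True
```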
For a minimal family, the sufficient statistic T is also minimal sufficient. For a proof, see Lehmann:
Theory of Point Estimation, Example 5.9, pp. 43-44.
If we parametrize the family using $\eta = c(\theta)$, this is called the natural parametrization (or the canonical parametrization). We then write

$$f(x;\eta) = \exp\bigl(\eta^T T(x) - A(\eta)\bigr)h(x)$$

where

$$A(\eta) = \log \int_E \exp\bigl(\eta^T T(x)\bigr)h(x)\,dx.$$

The natural parameter space is $H=\{\eta : A(\eta) < \infty\}$.
Theorem 1: H is a convex set.
Proof: Let $0<a<1$ and take $\eta$ and $\eta_1$ in $H$. Write

$$A(a\eta + (1-a)\eta_1) = \log \int_E \bigl(\exp(\eta^T T(x))h(x)\bigr)^a \bigl(\exp(\eta_1^T T(x))h(x)\bigr)^{1-a}\,dx.$$

But $u^a v^{1-a} \le au+(1-a)v$ for $u,v>0$ (take logarithms of both sides and use the fact that the logarithm is a concave function), so the integral is at most $a\exp(A(\eta)) + (1-a)\exp(A(\eta_1))$, whence

$$A(a\eta + (1-a)\eta_1) \le \log\bigl(a\exp(A(\eta)) + (1-a)\exp(A(\eta_1))\bigr) < \infty.$$

(Hölder's inequality gives the sharper bound $A(a\eta + (1-a)\eta_1) \le aA(\eta) + (1-a)A(\eta_1)$, so $A$ is in fact convex on $H$.)
Example: In the case of $\Gamma(\theta,1)$ we define $h(x)=\exp(-\sum x_i - \sum\log x_i)$, and find that $\theta$ is itself the natural parameter (obviously, any linear function of $\theta$ is also natural). The natural parameter space is then $\mathbb{R}_+$.
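The following sketch (ours; it uses scipy.integrate.quad and arbitrary values of $\eta$) computes $A(\eta) = \log\int_E \exp(\eta T(x))h(x)\,dx$ directly for $\Gamma(\theta,1)$ with $n=1$, checks it against $\log\Gamma(\eta)$, and illustrates numerically that $A$ is convex (cf. the remark at the end of the proof of Theorem 1):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def A(eta):
    # A(eta) = log integral_0^inf exp(eta * T(x)) h(x) dx,
    # with T(x) = log x and h(x) = exp(-x - log x), i.e. the integrand is x**(eta-1) * exp(-x)
    val, _ = quad(lambda x: x**(eta - 1.0) * np.exp(-x), 0.0, np.inf)
    return np.log(val)

print(np.isclose(A(2.5), gammaln(2.5)))     # True: A(eta) = log Gamma(eta) on H = (0, inf)

# convexity of A (cf. the Hoelder remark in the proof of Theorem 1)
eta0, eta1, a = 1.3, 4.0, 0.3
print(A(a * eta0 + (1 - a) * eta1) <= a * A(eta0) + (1 - a) * A(eta1))   # True
```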
Theorem 2: If $d = \dim(H) = 1$ we have, whenever $\eta \in \operatorname{int}(H)$, that

$$E_\eta T(X) = A'(\eta)$$

and

$$\operatorname{Var}_\eta T(X) = A''(\eta).$$
Proof: First compute

$$E_\eta T(X) = \int_E T(x)f(x;\eta)\,dx = \int_E T(x)\exp\bigl(\eta T(x)-A(\eta)\bigr)h(x)\,dx$$

and

$$A'(\eta) = \frac{d}{d\eta}\log\int_E \exp(\eta T)h\,dx = \frac{\int_E T\exp(\eta T)h\,dx}{\int_E \exp(\eta T)h\,dx} = \int_E T\exp(\eta T - A)h\,dx = E_\eta T,$$

where the differentiation under the integral sign needs to be justified. To do that, define $\psi(\eta)=\exp(A(\eta))$. Note that

$$\frac{\psi(\eta+k)-\psi(\eta)}{k} = \frac1k\Bigl(\int_E \exp((\eta+k)T)h\,dx - \int_E \exp(\eta T)h\,dx\Bigr) = \int_E \frac{\exp(kT)-1}{k}\exp(\eta T)h\,dx.$$

Now, if $|k| < \delta$,

$$\Bigl|\frac{\exp(kT)-1}{k}\Bigr| = \Bigl|\sum_{i=1}^\infty \frac{T^i k^{i-1}}{i!}\Bigr| \le \sum_{i=1}^\infty \frac{|T|^i |k|^{i-1}}{i!} \le \frac{\exp(|T|\delta)-1}{\delta} \le \frac{\exp(\delta T)+\exp(-\delta T)}{\delta},$$

and thus

$$\Bigl|\frac{\psi(\eta+k)-\psi(\eta)}{k}\Bigr| \le \frac1\delta\bigl(\psi(\eta+\delta)+\psi(\eta-\delta)\bigr) < \infty$$

since $\eta\in\operatorname{int}(H)$ (so $\delta$ can be chosen with $\eta\pm\delta\in H$). Hence the dominated convergence theorem applies, and

$$\lim_{k\to0}\frac{\psi(\eta+k)-\psi(\eta)}{k} = \int_E \lim_{k\to0}\frac{\exp(kT)-1}{k}\exp(\eta T)h\,dx = \int_E T\exp(\eta T)h\,dx.$$

This justifies the interchange of integration and differentiation (and the argument can be repeated for higher derivatives).
To compute the variance, first note that

$$\frac{d^2}{ds^2}\log\psi(s) = \frac{\psi''(s)}{\psi(s)} - \Bigl(\frac{\psi'(s)}{\psi(s)}\Bigr)^2.$$

By differentiating under the integral sign, $\psi''(\eta)=\int_E T^2\exp(\eta T)h\,dx$, so $\psi''/\psi=E_\eta T^2$, whence

$$A'' = \frac{d^2}{d\eta^2}\log\psi(\eta) = E T^2 - (E T)^2 = \operatorname{Var} T.$$
Example: (1) $\Gamma(\theta,1)$: For $n=1$ we have $A(\theta)=\log\Gamma(\theta)$, so

$$E_\theta T = E_\theta \log X = \frac{\Gamma'(\theta)}{\Gamma(\theta)} = \Psi(\theta),$$

where $\Psi$ is the digamma function.
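A quick Monte Carlo check of these identities (our own sketch; the sample size and $\theta$ are arbitrary): by Theorem 2, the mean of $\log X$ should be $\Psi(\theta)=A'(\theta)$ and its variance should be $A''(\theta)=\Psi'(\theta)$, the trigamma function.

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(1)
theta = 3.0
x = rng.gamma(theta, 1.0, size=200_000)

print(np.mean(np.log(x)), digamma(theta))      # E[log X]   vs  A'(theta) = digamma(theta)
print(np.var(np.log(x)), polygamma(1, theta))  # Var[log X] vs  A''(theta) = trigamma(theta)
```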
(2) The Rayleigh density: let

$$f(x;\theta) = \frac{x}{\theta^2}\exp(-x^2/2\theta^2),$$

where $x$ and $\theta$ are positive real numbers. This is the density of the length of a bivariate normal vector whose two components are independent normal with mean 0 and variance $\theta^2$. Writing this in exponential family form for a sample of size $n$ we get

$$f(x;\theta) = \exp\Bigl(-\frac{1}{2\theta^2}\sum x_i^2 - n\log\theta^2 + \sum\log x_i\Bigr).$$

Thus $\eta=c(\theta)=-1/2\theta^2$, or $\theta^2=-1/2\eta$. Also $B(\theta)=n\log\theta^2$, so $A(\eta)=-n\log(-2\eta)$. By Theorem 2 we get

$$E\sum X_i^2 = -\frac{n}{\eta} = 2n\theta^2$$

and

$$\operatorname{Var}\sum X_i^2 = \frac{n}{\eta^2} = 4n\theta^4.$$

To compute these moments directly from the density one must compute, e.g.,

$$\int_0^\infty \frac{x^3}{\theta^2}\exp(-x^2/2\theta^2)\,dx,$$

an exercise in integration by parts.
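These moment formulas are easy to confirm by simulation (a sketch of ours; $\theta$, $n$ and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.5, 10, 100_000

# each row is a sample of size n from the Rayleigh density (x / theta**2) * exp(-x**2 / (2 * theta**2))
x = rng.rayleigh(scale=theta, size=(reps, n))
s = np.sum(x**2, axis=1)                 # the natural sufficient statistic sum x_i**2

print(np.mean(s), 2 * n * theta**2)      # E   sum X_i**2 = 2 n theta**2
print(np.var(s), 4 * n * theta**4)       # Var sum X_i**2 = 4 n theta**4
```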
Corollary 1: For $d>1$ we have

$$E_\eta T_i(X) = \frac{\partial}{\partial\eta_i}A(\eta)$$

and

$$\operatorname{Cov}_\eta\bigl(T_i(X),T_j(X)\bigr) = \frac{\partial^2}{\partial\eta_i\,\partial\eta_j}A(\eta).$$
Corollary 2: The moment generating function of $T$ is

$$E_\eta \exp\bigl(\textstyle\sum t_i T_i\bigr) = \psi(\eta+t)/\psi(\eta),$$

where $\psi(\eta)=\exp(A(\eta))$.
Corollary 3: The cumulant generating function of $T$ is

$$\kappa(t) = \log E_\eta \exp\bigl(\textstyle\sum t_i T_i\bigr) = A(\eta+t) - A(\eta).$$

Thus $A^{(k)}(\eta)$ is the $k$th cumulant of $T$.
Example: (1) $\Gamma(a,b)$: For the gamma distribution with shape $a$ and rate $b$ we have

$$f(x;\eta) = \exp\bigl(a\textstyle\sum\log x_i - b\sum x_i + na\log b - n\log\Gamma(a) - \sum\log x_i\bigr).$$

The natural parameter is $\eta = (a,-b)^T$, with (for $n=1$) $A = \log\Gamma(a) - a\log b$, and by Corollary 2 the joint mgf of $(\log X, X)$ for a single observation is

$$\exp\bigl(\log\Gamma(a+t_1) - (a+t_1)\log(b-t_2) - \log\Gamma(a) + a\log b\bigr) = \frac{\Gamma(a+t_1)}{\Gamma(a)}\,(1-t_2/b)^{-a}(b-t_2)^{-t_1}.$$

Setting $t_1=0$ we get the mgf for $X$, and setting $t_2=0$ that for $\log X$.
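The closed form can be verified by numerical integration (our sketch; $a$, $b$, $t_1$, $t_2$ are arbitrary values with $t_2<b$):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

a, b = 2.3, 1.7                          # shape a, rate b
t1, t2 = 0.4, 0.5                        # need t2 < b for the mgf to exist

# E[ exp(t1 * log X + t2 * X) ] by direct integration against the Gamma(a, b) density
def integrand(x):
    log_dens = a * np.log(b) + (a - 1.0) * np.log(x) - b * x - gammaln(a)
    return np.exp(t1 * np.log(x) + t2 * x + log_dens)

mgf_numeric, _ = quad(integrand, 0.0, np.inf)

# closed form from Corollary 2
mgf_formula = np.exp(gammaln(a + t1) - gammaln(a)) * (1 - t2 / b)**(-a) * (b - t2)**(-t1)

print(np.isclose(mgf_numeric, mgf_formula))   # True
```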
(2) Inverse normal: Here $A(\eta) = -\tfrac12\log\lambda - (\alpha\lambda)^{1/2}$ in the $(\alpha,\lambda)$ parametrization, so by Corollary 3 the cgf of $(X, X^{-1})$ is

$$\kappa(t_1,t_2) = \tfrac12\log\frac{\lambda}{\lambda-2t_2} + (\alpha\lambda)^{1/2} - \bigl((\alpha-2t_1)(\lambda-2t_2)\bigr)^{1/2}.$$
The exponential family is closed under sampling, in the sense that if we take a random sample $X_1,\dots,X_n$ from the distribution (1), the joint density is

$$\prod_{i=1}^n f(x_i;\theta) = \exp\Bigl(\sum_i c(\theta)^T T(x_i) - nB(\theta)\Bigr)\prod_i h(x_i) = \exp\Bigl(c(\theta)^T \bigl(\textstyle\sum_i T(x_i)\bigr) - nB(\theta)\Bigr)h_n(x_1,\dots,x_n),$$

so the natural sufficient statistic is $\sum T(x_i)$. Notice that this always has the same dimension, regardless of sample size. Furthermore, the natural sufficient statistic itself has an exponential family distribution. More precisely, we have the following result:
Theorem 3: Let $X$ be distributed according to the exponential family

$$f\bigl(x;(\alpha,\beta)^T\bigr) = \exp\Bigl(\sum_{i=1}^r \alpha_i U_i(x) + \sum_{i=1}^s \beta_i T_i(x) - A(\alpha,\beta)\Bigr)h(x).$$

Then the distribution of $T_1,\dots,T_s$ is an exponential family of the form

$$f_T\bigl(t;(\alpha,\beta)^T\bigr) = \exp\Bigl(\sum_{i=1}^s \beta_i t_i - A(\alpha,\beta)\Bigr)g(t;\alpha)$$

and the conditional distribution of $U_1,\dots,U_r$ given $T=t$ is an exponential family of the form

$$f_{U\mid T}\bigl(u\mid t;(\alpha,\beta)^T\bigr) = \exp\Bigl(\sum_{i=1}^r \alpha_i u_i - A_t(\alpha)\Bigr)k(t,u).$$
Proof: We prove the result in the discrete case. A general proof is in Lehmann: Testing Statistical Hypotheses, Wiley, 1959, p. 52. Compute

$$P(T(X)=t) = \sum_{x:T(x)=t} f\bigl(x;(\alpha,\beta)^T\bigr) = \sum_{x:T(x)=t} \exp\Bigl(\sum_{i=1}^r \alpha_i U_i(x) + \sum_{i=1}^s \beta_i T_i(x) - A(\alpha,\beta)\Bigr)h(x)$$
$$= \exp\Bigl(\sum_{i=1}^s \beta_i t_i - A(\alpha,\beta)\Bigr)\sum_{x:T(x)=t}\exp\Bigl(\sum_{i=1}^r \alpha_i U_i(x)\Bigr)h(x),$$

which is of the stated form with $g(t;\alpha) = \sum_{x:T(x)=t}\exp\bigl(\sum_i \alpha_i U_i(x)\bigr)h(x)$. This proves the first part. For the second part, write

$$P(U=u\mid T=t) = \frac{P(U=u,T=t)}{P(T=t)} = \frac{\exp\bigl(\sum_{i=1}^r \alpha_i u_i + \sum_{i=1}^s \beta_i t_i - A(\alpha,\beta)\bigr)k(u,t)}{\exp\bigl(\sum_{i=1}^s \beta_i t_i - A(\alpha,\beta)\bigr)g(t;\alpha)} = \exp\Bigl(\sum_i \alpha_i u_i - \log g(t;\alpha)\Bigr)k(u,t),$$

where $k(u,t)=\sum_{x:U(x)=u,\,T(x)=t} h(x)$; thus $A_t(\alpha) = \log g(t;\alpha)$.
The likelihood equation for an exponential family is simple.
Theorem 4: Let $C=\operatorname{int}\{c(\theta):\theta\in\Theta\}$, and let $X$ be distributed according to a minimal exponential family. If the equation

$$E_\theta T(X) = T(x)$$

has a solution $\hat\theta(x)$ with $c(\hat\theta(x))\in C$, then $\hat\theta$ is the unique mle of $\theta$.
Proof: Look first at the canonical parametrization, so that the log likelihood is

$$\sum_i \eta_i T_i(x) - A(\eta) + \log h(x)$$

and the likelihood equations are

$$T_i(x) = \frac{\partial}{\partial\eta_i}A(\eta) = E_\eta T_i(X),$$

using Corollary 1 to Theorem 2. In addition, minus the matrix of second derivatives of the log likelihood is just $[\partial^2 A(\eta)/\partial\eta_i\,\partial\eta_j]$, the covariance matrix of $T$. Since the parametrization is minimal, this covariance matrix is positive definite, so any solution of the likelihood equations is the unique maximizer over $H$.

Now note that the values of $l_n(\theta)$, $\theta\in\Theta$, are a subset of the values of $l_n(\eta)$, $\eta\in H$. Hence if $c(\hat\theta)\in C$ then $\hat\theta$ corresponds to the unique maximizing $\hat\eta$.
Example: $\Gamma(a,b)$: The likelihood equations are

$$\frac{na}{b} = \sum x_i$$
$$n\bigl(\Psi(a) - \log b\bigr) = \sum\log x_i.$$
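A sketch of solving these equations numerically (our own code; the data are simulated with arbitrary parameter values): the first equation gives $b = a/\bar x$, and substituting into the second leaves a one-dimensional equation in $a$, namely $\Psi(a) - \log a = \overline{\log x} - \log\bar x$, which a root finder handles.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(3)
a_true, b_true = 2.0, 0.5                        # shape a, rate b (illustrative values)
x = rng.gamma(a_true, 1.0 / b_true, size=5_000)  # numpy's gamma takes a scale, hence 1/b

xbar, logbar = np.mean(x), np.mean(np.log(x))

# first likelihood equation: n a / b = sum x_i  =>  b = a / xbar;
# substituting into the second: digamma(a) - log(a) = mean(log x) - log(xbar)
a_hat = brentq(lambda a: digamma(a) - np.log(a) - (logbar - np.log(xbar)), 1e-6, 1e6)
b_hat = a_hat / xbar

print(a_hat, b_hat)    # close to (2.0, 0.5) for a sample this large
```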