Exponential families

STAT 582

The family of distributions whose range does not depend on the parameter, and whose sufficient statistics have dimension independent of the sample size, turns out to be quite rich. It is called the exponential family of distributions. These have density

$$f(x;\theta) = \exp(c(\theta)^T T(x) - B(\theta))\,h(x), \qquad x \in E \subset \mathbb{R}^n, \tag{1}$$

where the set $E$ does not depend on $\theta$. The sufficient statistic $T(x)$, which is determined up to a multiplicative constant, is called the natural sufficient statistic. We say that an exponential family is minimal if the components of $c(\theta)$ and the components of $T(x)$ are each linearly independent. We can always achieve this by reparametrization.

Example: (1) $\Gamma(\theta,1)$: For the gamma distribution with shape parameter $\theta$ and scale parameter 1 we have $T(x) = \sum \log x_i$, $E = \{x : x_i > 0,\ i = 1,\dots,n\}$, $c(\theta) = \theta - 1$, $B(\theta) = n \log \Gamma(\theta)$ and $h(x) = \exp(-\sum x_i)$.

(2) The inverse normal: This distribution is given by the density

$$f(x;\mu,\lambda) = \left(\frac{\lambda}{2\pi}\right)^{1/2} x^{-3/2} \exp(-\lambda(x-\mu)^2/2\mu^2 x), \qquad x > 0.$$

Here $\mu$ is the mean, and $\lambda$ is a precision parameter. It sometimes helps to reparametrize using $\alpha = \lambda/\mu^2$, yielding

$$f(x;\alpha,\lambda) = \exp\bigl((\alpha\lambda)^{1/2} + \tfrac{1}{2}\log\lambda - \alpha x/2 - \lambda/2x\bigr)\,(2\pi)^{-1/2}x^{-3/2},$$

so that $(\sum x_i, \sum x_i^{-1})$ is sufficient for the natural parameter $(-\alpha/2, -\lambda/2)$.

For a minimal family, the sufficient statistic $T$ is also minimal sufficient. For a proof, see Lehmann: Theory of Point Estimation, Example 5.9, pp. 43-44.

If we parametrize the family using $\eta = c(\theta)$, this is called the natural parametrization (or the canonical parametrization). We then write

$$f(x;\eta) = \exp(\eta^T T(x) - A(\eta))\,h(x)$$

where

$$A(\eta) = \log \int_E \exp(\eta^T T(x))\,h(x)\,dx.$$

The natural parameter space is $H = \{\eta : A(\eta) < \infty\}$.

Theorem 1: $H$ is a convex set.

Proof: Let $0 < a < 1$ and take $\eta$ and $\eta_1$ in $H$. Write

$$A(a\eta + (1-a)\eta_1) = \log \int_E \bigl(\exp(\eta^T T(x))h(x)\bigr)^a \bigl(\exp(\eta_1^T T(x))h(x)\bigr)^{1-a}\,dx.$$

But $u^a v^{1-a} \le au + (1-a)v$ (take logarithms of both sides and use the fact that the logarithm is a concave function), so the integral is at most $a\exp(A(\eta)) + (1-a)\exp(A(\eta_1)) < \infty$, whence $A(a\eta + (1-a)\eta_1) < \infty$. (Hölder's inequality gives the stronger conclusion $A(a\eta + (1-a)\eta_1) \le aA(\eta) + (1-a)A(\eta_1)$, i.e., $A$ is itself convex on $H$.)

Example: In the case of $\Gamma(\theta,1)$ we define $h(x) = \exp(-\sum x_i - \sum \log x_i)$, and find that $\theta$ is itself the natural parameter (obviously, any linear function of $\theta$ is also natural). The natural parameter space is then $\mathbb{R}_+$.

Theorem 2: If $d = \dim(H) = 1$ we have, whenever $\eta \in \operatorname{int}(H)$, that

$$E_\eta T(X) = A'(\eta) \quad\text{and}\quad \operatorname{Var}_\eta T(X) = A''(\eta).$$

Proof: First compute

$$E_\eta T(X) = \int_E T(x) f(x;\eta)\,dx = \int_E T(x)\exp(\eta T(x) - A(\eta))h(x)\,dx$$

and

$$A'(\eta) = \frac{d}{d\eta}\log\int_E \exp(\eta T)h\,dx = \frac{\int_E T\exp(\eta T)h\,dx}{\int_E \exp(\eta T)h\,dx} = \int_E T\exp(\eta T - A)h\,dx = E_\eta T,$$

where the differentiation under the integral sign needs to be justified. To do that, define $\psi(\eta) = \exp(A(\eta))$. Note that

$$\frac{\psi(\eta+k)-\psi(\eta)}{k} = \frac{1}{k}\left(\int_E \exp((\eta+k)T)h\,dx - \int_E \exp(\eta T)h\,dx\right) = \int_E \frac{\exp(kT)-1}{k}\,\exp(\eta T)h\,dx.$$

Now, if $|k| < \delta$,

$$\left|\frac{\exp(kT)-1}{k}\right| = \left|\sum_{i=1}^\infty \frac{T^i k^{i-1}}{i!}\right| \le \sum_{i=1}^\infty \frac{|T|^i \delta^{i-1}}{i!} = \frac{\exp(\delta|T|)-1}{\delta} < \frac{\exp(\delta T)+\exp(-\delta T)}{\delta}$$

and thus

$$\left|\frac{\psi(\eta+k)-\psi(\eta)}{k}\right| < \frac{1}{\delta}\bigl(\psi(\eta+\delta)+\psi(\eta-\delta)\bigr) < \infty$$

since $\eta \in \operatorname{int}(H)$. Hence the dominated convergence theorem applies, and

$$\lim_{k\to 0}\frac{\psi(\eta+k)-\psi(\eta)}{k} = \int_E \lim_{k\to 0}\frac{\exp(kT)-1}{k}\,\exp(\eta T)h\,dx = \int_E T\exp(\eta T)h\,dx.$$

This justifies the interchange of integration and differentiation (and the argument can be repeated for higher derivatives). To compute the variance, first note that

$$\frac{d^2}{ds^2}\log\psi(s) = \frac{\psi''(s)}{\psi(s)} - \left(\frac{\psi'(s)}{\psi(s)}\right)^2.$$

By differentiating under the integral sign, $\psi''(\eta) = \int_E T^2 \exp(\eta T)h\,dx$, so $\psi''/\psi = E_\eta T^2$, whence

$$A''(\eta) = \frac{d^2}{d\eta^2}\log\psi(\eta) = ET^2 - (ET)^2 = \operatorname{Var} T.$$
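Theorem 2 is easy to sanity-check by simulation. The sketch below (assuming numpy and scipy are available) does so for the $\Gamma(\theta,1)$ family with $n = 1$, whose moments are worked out in the example that follows: here $\eta = \theta$, $T(X) = \log X$ and $A(\eta) = \log\Gamma(\eta)$, so $A'$ is the digamma function and $A''$ the trigamma function. The shape value, sample size, and seed are illustrative choices only.

```python
# Minimal Monte Carlo check of Theorem 2 for Gamma(theta, 1), n = 1:
# E[log X] should equal digamma(theta) and Var[log X] should equal
# trigamma(theta) = polygamma(1, theta).
import numpy as np
from scipy.special import digamma, polygamma

theta = 2.5                       # shape parameter (illustrative value)
rng = np.random.default_rng(0)
x = rng.gamma(shape=theta, scale=1.0, size=1_000_000)
t = np.log(x)                     # the natural sufficient statistic T(X)

print("E[log X]  : MC =", t.mean(), "  A'(eta)  =", digamma(theta))
print("Var[log X]: MC =", t.var(),  "  A''(eta) =", polygamma(1, theta))
```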
Example: (1) $\Gamma(\theta,1)$: For $n = 1$ we have $A(\theta) = \log\Gamma(\theta)$, so

$$E_\theta T = E_\theta \log X = \frac{\Gamma'(\theta)}{\Gamma(\theta)} = \Psi(\theta),$$

where $\Psi(x)$ is the digamma function.

(2) The Rayleigh density: let

$$f(x;\theta) = \frac{x}{\theta^2}\exp(-x^2/2\theta^2),$$

where $x$ and $\theta$ are positive real numbers. This is the density of the length of a bivariate normal vector, where the two components are independent $N(0,\theta^2)$. Writing this in exponential family form for a sample of size $n$ we get

$$f(x;\theta) = \exp\left(-\frac{1}{2\theta^2}\sum x_i^2 - n\log\theta^2 + \sum\log x_i\right).$$

Thus $\eta = c(\theta) = -1/2\theta^2$, or $\theta^2 = -1/2\eta$. Also $B(\theta) = n\log\theta^2$, so $A(\eta) = -n\log(-2\eta)$. By Theorem 2 we get

$$E\sum X_i^2 = -\frac{n}{\eta} = 2n\theta^2$$

and

$$\operatorname{Var}\sum X_i^2 = \frac{n}{\eta^2} = 4n\theta^4.$$

To compute these moments directly from the density one must compute, e.g.,

$$\int_0^\infty \frac{x^3}{\theta^2}\exp(-x^2/2\theta^2)\,dx,$$

an exercise in partial integration.

Corollary 1: For $d > 1$ we have

$$E_\eta T_i(X) = \frac{\partial}{\partial\eta_i}A(\eta)$$

and

$$\operatorname{Cov}_\eta(T_i(X), T_j(X)) = \frac{\partial^2}{\partial\eta_i\,\partial\eta_j}A(\eta).$$

Corollary 2: The moment generating function of $T$ is

$$E_\eta \exp\left(\sum t_i T_i\right) = \psi(\eta+t)/\psi(\eta),$$

where $\psi(\eta) = \exp(A(\eta))$.

Corollary 3: The cumulant generating function of $T$ is

$$\kappa(t) = \log E_\eta \exp\left(\sum t_i T_i\right) = A(\eta+t) - A(\eta).$$

Thus (for $d = 1$) $A^{(k)}(\eta)$ is the $k$'th cumulant of $T$.

Example: (1) $\Gamma(a,b)$: Here

$$f(x;\eta) = \exp\left(a\sum\log x_i - b\sum x_i + na\log b - n\log\Gamma(a) - \sum\log x_i\right).$$

The natural parameter is $\eta = (a,-b)^T$, paired with the natural statistic $(\sum\log x_i, \sum x_i)$, and $A(\eta) = n(\log\Gamma(a) - a\log b)$. By Corollary 2, for $n = 1$ the joint mgf of $(\log X, X)$ is

$$\exp\bigl(\log\Gamma(a+t_1) - (a+t_1)\log(b-t_2) - \log\Gamma(a) + a\log b\bigr) = \frac{\Gamma(a+t_1)}{\Gamma(a)}\,(1-t_2/b)^{-a}(b-t_2)^{-t_1},$$

valid for $t_1 > -a$, $t_2 < b$. Setting $t_1 = 0$ we get the mgf of $X$, and setting $t_2 = 0$ that of $\log X$.

(2) Inverse normal: Here (for a single observation) $A(\eta) = -(\alpha\lambda)^{1/2} - \tfrac{1}{2}\log\lambda$ in the parametrization above, so by Corollary 3 the cgf of $(X, X^{-1})$ is

$$\kappa(t_1,t_2) = A(\eta+t) - A(\eta) = (\alpha\lambda)^{1/2} + \tfrac{1}{2}\log\lambda - \bigl((\alpha-2t_1)(\lambda-2t_2)\bigr)^{1/2} - \tfrac{1}{2}\log(\lambda-2t_2)$$

for $t_1 < \alpha/2$, $t_2 < \lambda/2$.

The exponential family is closed under sampling, in the sense that if we take a random sample $X_1,\dots,X_n$ from the distribution (1), the joint density is

$$\prod_{i=1}^n f(x_i;\theta) = \exp\left(c(\theta)^T\sum T(x_i) - nB(\theta)\right)\prod h(x_i) = \exp\left(c(\theta)^T\left(\sum T(x_i)\right) - nB(\theta)\right)h_n(x_1,\dots,x_n),$$

so the natural sufficient statistic is $\sum T(x_i)$. Notice that this always has the same dimension, regardless of sample size. Furthermore, the natural sufficient statistic itself has an exponential family distribution. More precisely, we have the following result:

Theorem 3: Let $X$ be distributed according to the exponential family

$$f(x;(a,b)^T) = \exp\left(\sum_{i=1}^r a_i U_i(x) + \sum_{i=1}^s b_i T_i(x) - A(a,b)\right)h(x).$$

Then the distribution of $T_1,\dots,T_s$ is an exponential family of the form

$$f^T(t;(a,b)^T) = \exp\left(\sum_{i=1}^s b_i t_i - A(a,b)\right)g(t;a)$$

and the conditional distribution of $U_1,\dots,U_r$ given $T = t$ is an exponential family of the form

$$f^{U\mid T}(u \mid t;(a,b)^T) = \exp\left(\sum_{i=1}^r a_i u_i - A_t(a)\right)k(t,u).$$

Proof: We prove the result in the discrete case. A general proof is in Lehmann: Testing Statistical Hypotheses, Wiley, 1959, p. 52. Compute

$$P(T(X) = t) = \sum_{x:T(x)=t} f(x;(a,b)^T) = \sum_{x:T(x)=t} \exp\left(\sum_{i=1}^r a_i U_i(x) + \sum_{i=1}^s b_i T_i(x) - A(a,b)\right)h(x) = \exp\left(\sum_{i=1}^s b_i t_i - A(a,b)\right)\sum_{x:T(x)=t}\exp\left(\sum_{i=1}^r a_i U_i(x)\right)h(x).$$

This proves the first part, with $g(t;a) = \sum_{x:T(x)=t}\exp(\sum a_i U_i(x))h(x)$. For the second part, write

$$P(U=u \mid T=t) = \frac{P(U=u, T=t)}{P(T=t)} = \frac{\exp\left(\sum_{i=1}^r a_i u_i + \sum_{i=1}^s b_i t_i - A(a,b)\right)k(u,t)}{\exp\left(\sum_{i=1}^s b_i t_i - A(a,b)\right)g(t;a)} = \exp\left(\sum a_i u_i - \log g(t;a)\right)k(u,t),$$

where

$$k(u,t) = \sum_{x:U(x)=u,\,T(x)=t} h(x),$$

so that $A_t(a) = \log g(t;a)$.
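As a concrete discrete illustration of Theorem 3 (not worked out in the notes), take independent Poisson counts $X_1 \sim \text{Poisson}(\lambda_1)$, $X_2 \sim \text{Poisson}(\lambda_2)$ with $U = X_1$ and $T = X_1 + X_2$: the conditional law of $U$ given $T = t$ is $\text{Binomial}(t, \lambda_1/(\lambda_1+\lambda_2))$, an exponential family in $a = \log(\lambda_1/\lambda_2)$ alone. The sketch below (assuming numpy and scipy) checks this by simulation; all parameter values are arbitrary.

```python
# Monte Carlo check that P(U = u | T = t) matches the binomial pmf,
# illustrating the conditional part of Theorem 3 in a Poisson example.
import numpy as np
from scipy.stats import binom

l1, l2, t = 3.0, 5.0, 6           # illustrative rates and conditioning value
rng = np.random.default_rng(3)
x1 = rng.poisson(l1, size=2_000_000)
x2 = rng.poisson(l2, size=2_000_000)

keep = x1[x1 + x2 == t]           # condition on T = t
for u in range(t + 1):
    mc = (keep == u).mean()
    print(u, round(mc, 4), round(binom.pmf(u, t, l1 / (l1 + l2)), 4))
```

Note that the conditional frequencies depend on $\lambda_1, \lambda_2$ only through $\lambda_1/(\lambda_1+\lambda_2)$, i.e., through $a$, as the theorem asserts.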
The likelihood equation for an exponential family is simple.

Theorem 4: Let $C = \operatorname{int}\{c(\theta) : \theta\in\Theta\}$, and let $X$ be distributed according to a minimal exponential family. If the equation

$$E_\theta T(X) = T(x)$$

has a solution $\hat\theta(x)$ with $c(\hat\theta(x)) \in C$, then $\hat\theta$ is the unique mle of $\theta$.

Proof: Look first at the canonical parametrization, so that the log likelihood is

$$\sum \eta_i T_i(x) - A(\eta) + \log h(x)$$

and the likelihood equations are

$$T_i(x) = \frac{\partial}{\partial\eta_i}A(\eta) = E_\eta T_i(X),$$

using Corollary 1 to Theorem 2. In addition, minus the matrix of second derivatives of the log likelihood is just $[\partial^2 A(\eta)/\partial\eta_i\,\partial\eta_j]$, or the covariance matrix of $T$. Since the parametrization is minimal, the covariance matrix is positive definite, so the log likelihood is strictly concave and any solution of the likelihood equations is its unique maximizer. Now note that the values of $l_n(\theta)$, $\theta\in\Theta$, are a subset of the values of $l_n(\eta)$, $\eta\in H$. Hence if $c(\hat\theta)\in C$, then $\hat\theta$ corresponds to the unique maximizing $\hat\eta$.

Example: $\Gamma(a,b)$: The likelihood equations are

$$\frac{na}{b} = \sum x_i$$

$$n(\Psi(a) - \log b) = \sum\log x_i.$$
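These equations have no closed-form solution in $a$. Substituting $b = a/\bar x$ from the first equation into the second reduces the system to one equation in $a$, which can be solved numerically. The sketch below (assuming numpy and scipy) does this with a scalar root finder; the simulated data, the true parameter values, and the bracketing interval are illustrative choices.

```python
# Minimal sketch of solving the Gamma(a, b) likelihood equations:
# a / b = mean(x) and Psi(a) - log b = mean(log x). With b = a / xbar,
# the profile equation Psi(a) - log(a / xbar) - mean(log x) = 0 has a
# unique root (by Jensen's inequality, mean(log x) < log(xbar)).
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=5_000)  # true a = 3, rate b = 2

xbar, logbar = x.mean(), np.log(x).mean()

def profile_eq(a):
    # Psi(a) - log(a / xbar) - mean(log x), monotone pieces bracket one root
    return digamma(a) - np.log(a / xbar) - logbar

a_hat = brentq(profile_eq, 1e-6, 1e6)   # heuristic bracketing interval
b_hat = a_hat / xbar

print("MLE: a =", a_hat, " b =", b_hat)
```

By Theorem 4, since the fitted $c(\hat\theta)$ lies in the interior of the natural parameter space, this root is the unique mle.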