Central Limit Theorem for Linear Models


Fall 2014

The CLT you learn in Math Stat applies to sums or averages of independent R.V.s. We need a slightly different version to apply to coefficient estimates, which are much like means.

From Math Stat:

The Lindeberg CLT (Arnold, 1981) says that if $Y_1, Y_2, \ldots \sim (0, \sigma^2)$ with $\sigma^2 < \infty$, and $c_1, c_2, \ldots$ is a sequence of constants such that
$$\frac{\max_j c_j^2}{\sum_j c_j^2} \longrightarrow 0 \quad \text{as } n \to \infty,$$
then
$$\frac{\sum_{j=1}^n c_j Y_j}{\sqrt{\sum_{j=1}^n c_j^2}} \stackrel{D}{\longrightarrow} N(0, \sigma^2).$$
The condition keeps the $c_j$'s from increasing so fast that the last term dominates the sum.
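As a quick sanity check, here is a minimal simulation sketch (not from the notes; the weights, error distribution, and constants are assumed example values): with skewed errors and weights satisfying the condition, the standardized weighted sum comes out approximately $N(0, \sigma^2)$.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 5000
sigma2 = 4.0                        # Var(Y_j); Y_j are centered exponentials below
c = np.sqrt(np.arange(1, n + 1))    # slowly growing weights

# Lindeberg-type condition: max c_j^2 / sum c_j^2 should be small
print("max c_j^2 / sum c_j^2 =", c.max() ** 2 / np.sum(c ** 2))

Y = rng.exponential(scale=2.0, size=(reps, n)) - 2.0   # mean 0, variance 4, skewed
Z = (Y @ c) / np.sqrt(np.sum(c ** 2))                  # standardized weighted sums

print("empirical mean:", Z.mean())                     # near 0
print("empirical var :", Z.var(), " (sigma^2 =", sigma2, ")")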

Simple case:

Suppose we have a balanced ANOVA model (1, 2, or multiway) with $n$ observations per cell, $Y_{ij} \sim$ iid $(\mu_i, \sigma^2)$. Take the mean in group $i$, $\bar Y_{i\cdot} = \frac{1}{n}\sum_j Y_{ij}$, and consider $\frac{1}{n}\sum_j (Y_{ij} - \mu_i)$, so let $c_j = n^{-1}$. The condition is easily satisfied because $\max_j(c_j^2) = n^{-2}$ and $\sum_j c_j^2 = n \times n^{-2} = n^{-1}$, so the ratio is $n^{-2}/n^{-1} = n^{-1}$, which goes to 0 nicely as $n \to \infty$. We conclude
$$\frac{n^{-1}\sum_j (Y_{ij} - \mu_i)}{n^{-1/2}} = n^{1/2}(\bar Y_{i\cdot} - \mu_i) \stackrel{D}{\longrightarrow} N(0, \sigma^2),$$
or, equivalently, $\bar Y_{i\cdot}$ is approximately $N(\mu_i, \sigma^2/n)$ for large $n$.

For any n , the cell means are independent of each other. Any linear combination of cell means will converge in distribution to a normal distribution.
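A small sketch of this simple case (the cell means, per-cell sample size, and skewed error distribution are assumed example values, not from the notes): $\sqrt{n}(\bar Y_{i\cdot} - \mu_i)$ has variance close to $\sigma^2$ in each cell even though the errors are far from normal.

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([10.0, 12.0, 15.0])     # assumed cell means
sigma2 = 4.0
n, reps = 50, 4000                    # n observations per cell

# skewed errors with mean 0 and variance 4 (centered exponentials)
eps = rng.exponential(scale=2.0, size=(reps, mu.size, n)) - 2.0
Y = mu[None, :, None] + eps
Z = np.sqrt(n) * (Y.mean(axis=2) - mu)          # sqrt(n)(Ybar_i. - mu_i), per cell

print("per-cell variances of sqrt(n)(Ybar - mu):", Z.var(axis=0))   # each near 4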

Full linear model (Sen and Srivastava, 1997)

Change the condition for the CLT slightly, letting $a_{ni} = c_i / \sqrt{\sum_i c_i^2}$, and require $\max_i |a_{ni}| \longrightarrow 0$ and $\sum_i a_{ni}^2 \longrightarrow 1$. This allows the use of a triangular array of constants instead of reusing the same values (Gnedenko and Kolmogorov, 1954). The conclusion is that $\sum_i a_{ni} Y_i \stackrel{D}{\longrightarrow} N(0, \sigma^2)$ as $n \to \infty$.

Linear model:

Take the simple case: $y \sim (X\beta, \sigma^2 I)$ with $X$ of full column rank, and consider the coefficient vector $(b - \beta)$, or actually the distribution of $\sigma^{-2}(b - \beta)^T(X^TX)(b - \beta)$:
$$
\begin{aligned}
\sigma^{-2}(b - \beta)^T(X^TX)(b - \beta)
&= \sigma^{-2}\,[(X^TX)^{-1}X^Ty - \beta]^T (X^TX)\,[(X^TX)^{-1}X^Ty - \beta] \\
&= \sigma^{-2}\,[(X^TX)^{-1}(X^Ty - X^TX\beta)]^T X^TX\,[(X^TX)^{-1}(X^Ty - X^TX\beta)] \\
&= \sigma^{-2}\,(X^Ty - X^TX\beta)^T (X^TX)^{-1}(X^TX)(X^TX)^{-1}(X^Ty - X^TX\beta) \\
&= \sigma^{-2}\,(y - X\beta)^T X(X^TX)^{-1}X^T (y - X\beta) \\
&= \sigma^{-2}\,\epsilon^T X(X^TX)^{-1}X^T \epsilon \\
&= \sigma^{-2}\,\epsilon^T H \epsilon = \sigma^{-2}\,\epsilon^T P \epsilon,
\end{aligned}
$$
where $H = P = X(X^TX)^{-1}X^T$ is the hat matrix (the perpendicular projection operator onto the column space of $X$).
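This chain of equalities is easy to verify numerically. Below is a small sketch with an assumed random design (the design matrix, $\beta$, and $\sigma^2$ are made-up example values) checking that $\sigma^{-2}(b - \beta)^T(X^TX)(b - \beta)$ and $\sigma^{-2}\epsilon^T H\epsilon$ agree.

import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 40, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # full column rank
beta = np.array([1.0, -2.0, 0.5])
eps = rng.normal(scale=np.sqrt(sigma2), size=n)
y = X @ beta + eps

b = np.linalg.solve(X.T @ X, X.T @ y)          # OLS estimate
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix (projection onto C(X))

lhs = (b - beta) @ (X.T @ X) @ (b - beta) / sigma2
rhs = eps @ H @ eps / sigma2
print(lhs, rhs)                                 # identical up to rounding error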

Use the full rank SVD or eigenvalue decomposition on $H_n$ to obtain $H_n = L_n L_n^T$ with $L_n^T L_n = I_r$. Take a vector $u$ s.t. $u^T u = 1 = \sum_i u_i^2$ and let $a_n^{(i)} = u^T L_n^{(i)}$ (a scalar), where $L_n^{(i)}$ is the $i$th row of $L_n$, then look at
$$u^T L_n^T \epsilon = \sum_{i=1}^n a_n^{(i)} \epsilon_i .$$
By Hölder's (Cauchy–Schwarz) inequality,
$$|a_n^{(i)}| = |u^T L_n^{(i)}| \le (u^T u)^{1/2}\,(L_n^{(i)T} L_n^{(i)})^{1/2} = 1 \times h_{ii}^{1/2},$$
where $h_{ii} = L_n^{(i)T} L_n^{(i)}$ is the $i$th diagonal of the hat matrix.

We need $\max_i h_{ii} \longrightarrow 0$, so the condition needed is that the diagonals of the hat matrix, the leverages, must go to zero as $n \to \infty$.

Then $\sum_i (a_n^{(i)})^2 = u^T L_n^T L_n u = u^T u = 1$, so we conclude that $u^T L_n^T \epsilon \stackrel{D}{\longrightarrow} N(0, \sigma^2)$ for every norm-one vector $u$. By properties of the MVN, $L_n^T \epsilon$ has an asymptotic $N(\mathbf{0}, \sigma^2 I_r)$ distribution. Hence
$$\sigma^{-2}(b - \beta)^T(X^TX)(b - \beta) = \sigma^{-2}\,(L_n^T\epsilon)^T(L_n^T\epsilon) \stackrel{D}{\longrightarrow} \chi^2_r .$$
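A simulation sketch of this conclusion (the design, $\beta$, $\sigma^2$, and skewed error distribution are assumed example values): with moderate $n$ and small leverages, the quadratic form behaves approximately like $\chi^2_r$ even though the errors are not normal.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps, sigma2 = 500, 4000, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
r = np.linalg.matrix_rank(X)                        # 3 for this design
beta = np.array([1.0, 2.0, -1.0])
XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)

stat = np.empty(reps)
for k in range(reps):
    eps = rng.exponential(scale=2.0, size=n) - 2.0   # skewed, mean 0, variance 4
    b = XtX_inv @ X.T @ (X @ beta + eps)             # OLS estimate
    stat[k] = (b - beta) @ XtX @ (b - beta) / sigma2

print("simulated mean:", stat.mean(), "  chi^2_r mean:", r)
print("P(stat > chi^2_{r,0.95}):", np.mean(stat > stats.chi2.ppf(0.95, df=r)))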

In practice, we use t and F tests, so we need to ensure that those distributions also hold. We know that $s^2$ converges to $\sigma^2$ in probability, so
$$s^{-2}(b - \beta)^T(X^TX)(b - \beta) \stackrel{D}{\longrightarrow} \chi^2_r .$$

In general, if $C$ is a $p \times q$ matrix of rank $q$ and $C^T\beta$ is estimable, then
$$s^{-2}\,(C^Tb - C^T\beta)^T\,[C^T(X^TX)^{g}C]^{-1}\,(C^Tb - C^T\beta) \stackrel{D}{\longrightarrow} \chi^2_q .$$

However, the above assumes that $s^2$ is a “perfect” estimate of the variance. We generally prefer not to make that assumption and use the F distribution instead:

$$\frac{1}{q s^2}\,(C^Tb - C^T\beta)^T\,[C^T(X^TX)^{g}C]^{-1}\,(C^Tb - C^T\beta) \stackrel{D}{\longrightarrow} F_{q,\,n-r} .$$
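A sketch of how this F statistic might be computed for a full-rank design (everything here, including the contrast matrix $C$ and the data, is an assumed example; with $X$ of full rank, the generalized inverse $(X^TX)^g$ is just the ordinary inverse).

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])    # full rank, r = 3
beta = np.array([1.0, 0.5, -0.5])
y = X @ beta + rng.normal(scale=2.0, size=n)

r = np.linalg.matrix_rank(X)
b = np.linalg.solve(X.T @ X, X.T @ y)                 # OLS estimate
s2 = np.sum((y - X @ b) ** 2) / (n - r)               # residual variance estimate

C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]).T                     # p x q with q = 2
q = C.shape[1]
diff = C.T @ b - C.T @ beta                           # estimate minus hypothesized value
M = C.T @ np.linalg.inv(X.T @ X) @ C                  # full rank, so g-inverse = inverse
F = (diff @ np.linalg.solve(M, diff)) / (q * s2)
print("F =", F, "  F_{q, n-r, 0.95} =", stats.f.ppf(0.95, q, n - r))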

The critical assumption is that no single point is entirely “influential”, but rather that the leverage of each point goes to zero. We shall see that $\sum_i h_{ii} = \mathrm{trace}(H) = \mathrm{rank}(X)$, which does not increase as $n$ does.
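A quick numeric sketch of these two leverage facts on an assumed random two-predictor design: the leverages always sum to $\mathrm{rank}(X)$, and the largest one shrinks as $n$ grows.

import numpy as np

rng = np.random.default_rng(4)
for n in (20, 200, 2000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    H = X @ np.linalg.solve(X.T @ X, X.T)             # hat matrix
    h = np.diag(H)                                    # leverages
    print(f"n={n:5d}  sum(h_ii)={h.sum():.3f}  "
          f"rank(X)={np.linalg.matrix_rank(X)}  max(h_ii)={h.max():.4f}")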

More generally, the above also works with $\mathrm{Var}(\epsilon) = \sigma^2 V$ for known $V$; we just need to do a little transformation. Take $L$ to be the Cholesky factor of $V^{-1}$. That means $V^{-1} = LL^T$, with $\mathrm{rank}(V^{-1}) = \mathrm{rank}(V) = \mathrm{rank}(L)$, since we cannot increase rank through multiplication. $L$ and $L^T$ are therefore non-singular, and $(L^T)^{-1}L^{-1} = (LL^T)^{-1} = V$.

Premultiply our linear model by $L^T$ to get
$$L^Ty = L^TX\beta + L^T\epsilon = X^*\beta + \epsilon^* .$$
The mean of $\epsilon^*$ is still $\mathbf{0}$, and it has variance-covariance matrix
$$\sigma^2\,L^T V L = \sigma^2\,L^T (L^T)^{-1} L^{-1} L = \sigma^2 I ,$$
so the second formulation of our model fits the conditions for the CLT, and as $n$ gets big, the sampling distribution of $b$ converges to a normal distribution.
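A minimal sketch of this transformation, assuming an AR(1)-style known $V$ (the correlation structure, $\beta$, and $\sigma^2$ are made-up values): after premultiplying by $L^T$ with $V^{-1} = LL^T$, the transformed errors are uncorrelated with constant variance, and ordinary least squares on the transformed data gives the usual generalized least squares estimate.

import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 200, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, -1.0])

rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))   # known AR(1) correlation
eps = rng.multivariate_normal(np.zeros(n), sigma2 * V)
y = X @ beta + eps

L = np.linalg.cholesky(np.linalg.inv(V))      # V^{-1} = L L^T
ys, Xs = L.T @ y, L.T @ X                     # transformed model: Var(L^T eps) = sigma^2 I
b_gls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys) # OLS on transformed data = GLS
print("GLS estimate:", b_gls)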

In practice, we never know exactly how big $n$ needs to be to make the sampling distribution close enough to normality so that our inferences based on t and F distributions are valid. Problems in convergence occur when we have skewed residuals and/or outliers in the data. If residuals have short tails, the CLT will work well even if sample sizes are only moderate ($n - r = 30$?); otherwise we might need several hundred points to apply the CLT.

References

Arnold, S. The Theory of Linear Models and Multivariate Analysis. Wiley (1981).

Sen, A. K. and Srivastava, M. S. Regression Analysis: Theory, Methods, and Applications. Springer-Verlag Inc (1997).
