Fall 2014
The CLT you learn in Math Stat applies to sums or averages of independent R.V.s. We need a slightly different version to apply to coefficient estimates, which are much like means.
From Math Stat:
The Lindeberg CLT (Arnold, 1981) says that if $Y_1, Y_2, \ldots \overset{iid}{\sim} (0, \sigma^2)$ with $\sigma^2 < \infty$, and $c_1, c_2, \ldots$ is a sequence of constants such that
$$\frac{\max_j c_j^2}{\sum_{j=1}^n c_j^2} \longrightarrow 0 \quad \text{as } n \to \infty,$$
then
$$\frac{\sum_{j=1}^n c_j Y_j}{\sqrt{\sum_{j=1}^n c_j^2}}$$
converges in distribution to $N(0, \sigma^2)$. The condition keeps the $c_j$'s from increasing so fast that the last term dominates the sum.
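A quick simulation sketch (numpy; the weights $c_j = \sqrt{j}$ are an arbitrary choice that satisfies the condition, since $\max_j c_j^2 / \sum_j c_j^2 = 2/(n+1) \to 0$) illustrates the conclusion even for skewed, non-normal $Y_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 5000
c = np.sqrt(np.arange(1, n + 1))   # c_j = sqrt(j): max c_j^2 / sum c_j^2 = 2/(n+1) -> 0

# Y_j iid with mean 0, variance sigma^2 = 4, deliberately skewed (shifted exponential)
Y = rng.exponential(scale=2.0, size=(reps, n)) - 2.0

# sum_j c_j Y_j / sqrt(sum_j c_j^2), once per replicate
Z = (Y @ c) / np.sqrt(np.sum(c ** 2))
print(Z.mean(), Z.var())           # close to 0 and sigma^2 = 4, despite the skewed Y_j
```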
Simple case:
Suppose we have a balanced ANOVA model (1, 2, or multiway) with $n$ observations per cell, $Y_{ij} - \mu_i \overset{iid}{\sim} (0, \sigma^2)$. Take the mean in group $i$:
$$\bar{Y}_{i\cdot} - \mu_i = \frac{1}{n}\sum_{j=1}^n (Y_{ij} - \mu_i),$$
so let $c_j = n^{-1}$. The condition is easily satisfied because $\max(c_j^2) = n^{-2}$ and $\sum c_j^2 = n \times n^{-2} = n^{-1}$, so the fraction is $n^{-2}/n^{-1} = n^{-1}$, which goes to 0 nicely as $n \to \infty$. We conclude
$$\frac{n^{-1}\sum_j (Y_{ij} - \mu_i)}{n^{-1/2}} = n^{1/2}(\bar{Y}_{i\cdot} - \mu_i) \overset{D}{\longrightarrow} N(0, \sigma^2),$$
that is, $\bar{Y}_{i\cdot}$ is asymptotically $N(\mu_i, \sigma^2/n)$.
For any n , the cell means are independent of each other. Any linear combination of cell means will converge in distribution to a normal distribution.
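As a sanity check, here is a minimal sketch (numpy, with hypothetical values $\mu_i = 10$, $\sigma = 3$, and deliberately skewed within-cell errors) showing $n^{1/2}(\bar{Y}_{i\cdot} - \mu_i)$ landing close to $N(0, \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 5000
mu_i, sigma = 10.0, 3.0                    # hypothetical cell mean and error SD

# n observations per cell, skewed around mu_i, scaled so the variance is sigma^2
errors = (rng.gamma(shape=2.0, size=(reps, n)) - 2.0) * (sigma / np.sqrt(2.0))
Ybar = (mu_i + errors).mean(axis=1)        # one cell mean per replicate

Z = np.sqrt(n) * (Ybar - mu_i)
print(Z.mean(), Z.std())                   # close to 0 and sigma = 3
```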
Full linear model (Sen and Srivastava, 1997)
Change the condition for the CLT slightly, letting $a_{ni} = c_i / \sqrt{\sum c_i^2}$, and require $\max_i |a_{ni}| \longrightarrow 0$ and $\sum_i a_{ni}^2 \longrightarrow 1$. This allows the use of a triangular array of constants instead of reusing the same values (Gnedenko and Kolmogorov, 1954). The conclusion is that $\sum_i a_{ni} Y_i \overset{D}{\longrightarrow} N(0, \sigma^2)$ as $n \to \infty$.
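To see what "triangular array" buys us, here is a sketch (the weights $c_i = 1 + \sin(2\pi i/n)$ are an arbitrary illustration; they are recomputed for every $n$, so no fixed sequence is being reused) verifying that the normalized weights meet both conditions:

```python
import numpy as np

for n in (10, 100, 1000, 10000):
    i = np.arange(1, n + 1)
    c = 1.0 + np.sin(2 * np.pi * i / n)        # a fresh row of weights for each n
    a = c / np.sqrt(np.sum(c ** 2))            # a_ni = c_i / sqrt(sum_i c_i^2)
    print(n, np.abs(a).max(), np.sum(a ** 2))  # max |a_ni| shrinks; sum a_ni^2 is exactly 1
```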
Linear model:
Take the simple case: $y \sim (X\beta, \sigma^2 I)$ with $X$ of full column rank, and consider the coefficient vector $(b - \beta)$, or actually the distribution of
\begin{align*}
\sigma^{-2}(b - \beta)^T (X^TX)(b - \beta)
&= \sigma^{-2}\left[(X^TX)^{-1}X^Ty - \beta\right]^T (X^TX) \left[(X^TX)^{-1}X^Ty - \beta\right] \\
&= \sigma^{-2}\left[(X^TX)^{-1}(X^Ty - X^TX\beta)\right]^T X^TX \left[(X^TX)^{-1}(X^Ty - X^TX\beta)\right] \\
&= \sigma^{-2}(X^Ty - X^TX\beta)^T (X^TX)^{-1}(X^TX)(X^TX)^{-1}(X^Ty - X^TX\beta) \\
&= \sigma^{-2}(y - X\beta)^T X(X^TX)^{-1}X^T (y - X\beta) \\
&= \sigma^{-2}\epsilon^T X(X^TX)^{-1}X^T \epsilon \\
&= \sigma^{-2}\epsilon^T H \epsilon = \sigma^{-2}\epsilon^T P \epsilon,
\end{align*}
where $H = P$ is the perpendicular projection operator (hat matrix).
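The chain of equalities above is exact algebra, so it can be checked numerically. A minimal sketch (numpy; the design, $\beta$, and $\sigma$ below are arbitrary made-up values):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 50, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # full column rank design
beta = np.array([1.0, -2.0, 0.5])                           # made-up true coefficients
eps = rng.normal(scale=sigma, size=n)
y = X @ beta + eps

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)        # OLS estimate
H = X @ np.linalg.solve(XtX, X.T)        # hat matrix X (X'X)^{-1} X'

q_left = (b - beta) @ XtX @ (b - beta) / sigma**2   # sigma^{-2} (b-beta)' X'X (b-beta)
q_right = eps @ H @ eps / sigma**2                  # sigma^{-2} eps' H eps
print(q_left, q_right)                              # equal up to rounding error
```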
Use the full rank SVD or eigenvalue decomposition on $H_n$ to obtain $H_n = L_n L_n^T$ with $L_n^T L_n = I_r$. Since $H_n = L_n L_n^T$, the quadratic form is $\sigma^{-2}\epsilon^T H_n \epsilon = \sigma^{-2}(L_n^T\epsilon)^T(L_n^T\epsilon)$, the squared length of $L_n^T\epsilon$. Take a vector $b$ s.t. $b^Tb = 1 = \sum b_i^2$ and let $a_n^{(i)} = b^T L_n^{(i)}$ (a scalar, where $L_n^{(i)}$ is the $i$th row of $L_n$), then look at
$$b^T L_n^T \epsilon = \sum_{i=1}^n a_n^{(i)} \epsilon_i.$$
By Hölder's inequality,
$$|a_n^{(i)}| = |b^T L_n^{(i)}| \le (b^Tb)^{1/2}\,(L_n^{(i)T} L_n^{(i)})^{1/2} = 1 \times \sqrt{h_{ii}},$$
where $h_{ii} = L_n^{(i)T} L_n^{(i)}$ is the $i$th diagonal of the hat matrix. We need $\max_i h_{ii} \longrightarrow 0$, so the condition needed is that the diagonals of the hat matrix, the leverages, must go to zero as $n \longrightarrow \infty$. Then
$$\sum_i \left(a_n^{(i)}\right)^2 = b^T L_n^T L_n b = b^T b = 1,$$
so we conclude that $b^T L_n^T \epsilon \overset{D}{\longrightarrow} N(0, \sigma^2)$ for every norm-one vector $b$. By properties of the MVN, $L_n^T \epsilon$ has an asymptotic $N(\mathbf{0}, \sigma^2 I)$ distribution. Hence
$$\sigma^{-2}(b - \beta)^T (X^TX)(b - \beta) \overset{D}{\longrightarrow} \chi^2_r.$$
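A small Monte Carlo sketch (numpy; the uniform errors have variance 1 so $\sigma^2 = 1$, and the three-column design is an arbitrary illustration) shows the quadratic form matching the first two moments of $\chi^2_r$ with $r = 3$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # r = rank(X) = 3
XtX = X.T @ X

q = np.empty(reps)
for k in range(reps):
    eps = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)  # non-normal, mean 0, variance 1
    diff = np.linalg.solve(XtX, X.T @ eps)                  # b - beta = (X'X)^{-1} X' eps
    q[k] = diff @ XtX @ diff                                # sigma^2 = 1 here

print(q.mean(), q.var())   # close to 3 and 6, the mean and variance of chi^2_3
```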
In practice, we use $t$ and $F$ tests, so we need to ensure that these distributions also hold. We know that $s^2$ converges to $\sigma^2$ in probability, so
$$s^{-2}(b - \beta)^T(X^TX)(b - \beta) \overset{D}{\longrightarrow} \chi^2_r.$$
In general, if $C$ is a $p \times q$ matrix of rank $q$ and $C^T\beta$ is estimable, then
$$s^{-2}(C^Tb - C^T\beta)^T\left[C^T(X^TX)^g C\right]^{-1}(C^Tb - C^T\beta) \overset{D}{\longrightarrow} \chi^2_q,$$
where $(X^TX)^g$ is a generalized inverse of $X^TX$.
However, the above assumes that $s^2$ is a "perfect" estimate of the variance. We generally prefer not to make that assumption and use the $F$ distribution instead:
$$\frac{1}{q s^2}(C^Tb - C^T\beta)^T\left[C^T(X^TX)^g C\right]^{-1}(C^Tb - C^T\beta) \overset{D}{\longrightarrow} F_{q,\,n-r}.$$
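A sketch of the computation (numpy/scipy; the design is full rank, so the ordinary inverse plays the role of the generalized inverse, and the null hypothesis $C^T\beta = 0$ on the two slopes is an invented example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, r = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, r - 1))])
beta = np.array([1.0, 0.0, 0.0])               # H0 true: both slopes are zero
y = X @ beta + rng.normal(size=n)

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)
resid = y - X @ b
s2 = resid @ resid / (n - r)                   # the usual variance estimate

C = np.array([[0.0, 0.0],                      # C is p x q with rank q = 2,
              [1.0, 0.0],                      # picking out the two slopes
              [0.0, 1.0]])
q = C.shape[1]
d = C.T @ b                                    # C'b - C'beta, with C'beta = 0 under H0
middle = C.T @ np.linalg.solve(XtX, C)         # C' (X'X)^{-1} C  (X full rank here)
F = d @ np.linalg.solve(middle, d) / (q * s2)
print(F, stats.f.sf(F, q, n - r))              # F statistic and its p-value
```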
The critical assumption is that no single point is entirely "influential": the leverage of every point must go to zero. We shall see that $\sum_i h_{ii} = \mathrm{trace}(H) = \mathrm{rank}(X)$, which does not increase as $n$ does, so on average the leverages shrink like $\mathrm{rank}(X)/n$.
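A quick check of both facts (numpy, with an arbitrary design regenerated at several sample sizes): $\sum h_{ii}$ stays at $\mathrm{rank}(X)$ while $\max h_{ii}$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(6)
for n in (20, 200, 2000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # leverages h_ii
    print(n, h.sum(), h.max())   # sum stays at rank(X) = 3; max h_ii goes to 0
```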
More generally, the above also works with $\mathrm{Var}(\epsilon) = \sigma^2 V$ for known $V$; we just need to do a little transformation. Take $L$ to be the Cholesky factor of $V^{-1}$. That means $V^{-1} = LL^T$, with $\mathrm{rank}(V^{-1}) = \mathrm{rank}(V) = \mathrm{rank}(L)$, since we cannot increase rank through multiplication.
$L$ and $L^T$ are therefore non-singular, and $L^{-T}L^{-1} = (LL^T)^{-1} = V$.
Premultiply our linear model by $L^T$ to get
$$L^T y = L^T X\beta + L^T\epsilon = X^*\beta + \epsilon^*.$$
The mean of $\epsilon^*$ is still $\mathbf{0}$, and it has variance-covariance matrix $\sigma^2 L^T V L = \sigma^2 L^T L^{-T} L^{-1} L = \sigma^2 I$, so the second formulation of our model fits the conditions for the CLT, and as $n$ gets big the sampling distribution of $b$ converges to a normal distribution.
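A minimal sketch of the whitening step (numpy; $V$ below is an arbitrary made-up positive definite matrix standing in for a known covariance structure):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)                 # made-up known positive definite V

L = np.linalg.cholesky(np.linalg.inv(V))    # lower-triangular L with V^{-1} = L L^T
# If Var(eps) = sigma^2 V, then Var(L^T eps) = sigma^2 L^T V L, which is sigma^2 I:
print(np.allclose(L.T @ V @ L, np.eye(n)))  # True
```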
In practice, we never know exactly how big $n$ needs to be to make the sampling distribution close enough to normality for our inferences based on $t$ and $F$ distributions to be valid. Problems in convergence occur when we have skewed residuals and/or outliers in the data. If the residuals have short tails, the CLT will work well even when sample sizes are only moderate ($n - r = 30$?); otherwise we might need several hundred points to apply the CLT.
Arnold, S. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley.

Gnedenko, B. V. and Kolmogorov, A. N. (1954). Limit Distributions for Sums of Independent Random Variables. Addison-Wesley.

Sen, A. K. and Srivastava, M. S. (1997). Regression Analysis: Theory, Methods, and Applications. Springer-Verlag.