Least Squares Fitting and Inference for Linear Models

Author(s): Kerby Shedden, Ph.D., 2010
License: Unless otherwise noted, this material is made available under the
terms of the Creative Commons Attribution Share Alike 3.0 License:
http://creativecommons.org/licenses/by-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your
ability to use, share, and adapt it. The citation key on the following slide provides information about how you
may share and adapt this material.
Copyright holders of content included in this material should contact open.michigan@umich.edu with any
questions, corrections, or clarification regarding the use of content.
For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use.
Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis
or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please
speak to your physician if you have questions about your medical condition.
Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.
1 / 157
Least Squares Fitting and Inference for Linear
Models
Kerby Shedden
Department of Statistics, University of Michigan
September 30, 2015
2 / 157
Goals and terminology
The goal is to learn about an unknown function f that relates variables
Y ∈ R and X ∈ Rp through a relationship Y ≈ f (X ). The variables Y
and X have distinct roles:
• Independent variables (predictors, regressors, covariates, exogenous variables): X = (X1, . . . , Xp)′
• Dependent variable (response, outcome, endogenous variable): Y
Note that the terms “independent” and “dependent” do not imply
statistical independence or linear algebraic independence. They refer to
the setting of an experiment where the value of X can be manipulated,
and we observe the consequent changes in Y .
3 / 157
Goals and terminology
The analysis is empirical (based on a sample of data):
Yi ∈ R,   Xi = (Xi1, . . . , Xip)′ ∈ Rp,   i = 1, . . . , n,
where n is the sample size.
The data (Yi , Xi ) tell us what is known about the i th “analysis unit”,
“case”, or “observation”.
4 / 157
Why do we want to do this?
• Prediction: Estimate f(X) to predict the typical value of Y at a given X point (not necessarily one of the Xi points in the data).
• Model inference: Inductive learning about the relationship between X and Y, such as understanding which predictors or combinations of predictors are associated with particular changes in Y.
“Model inference” often has the goal of better understanding the
physical (biological, social, etc.) mechanism underlying the relationship
between X and Y .
We will see that inferences about mechanisms can be misleading,
particularly when based on observational data.
5 / 157
Examples:
• An empirical model for the weather conditions 48 hours from now could be based on current and historical weather conditions. Such a model could have a lot of practical value, but it would not necessarily provide a lot of insight into the atmospheric processes that underlie changes in the weather.
• A study of the relationship between childhood lead exposure and subsequent behavioral problems would primarily be of interest for inference, rather than prediction. Such a model could be used to assess whether there is any risk due to lead exposure, and to estimate the overall effects of lead exposure in a large population. The effect of lead exposure on an individual child is probably too small in relation to numerous other risk factors for such a model to be of predictive value at the individual level.
6 / 157
Statistical interpretations of the regression function
The most common way of putting “curve fitting” into a statistical
framework is to define f as the conditional expectation
f (x) ≡ E [Y |X = x].
Less commonly, the regression function is defined as a conditional
quantile, such as the median
f (x) ≡ median(Y |X = x)
or even some other quantile f (x) ≡ Qp (Y |X = x). This is called quantile
regression.
7 / 157
The conditional expectation function
The conditional expectation E [Y |X ] can be viewed in two ways:
1. As a deterministic function of X , essentially what we would get if
we sampled a large number of X , Y pairs from their joint
distribution, and took the average of the Y values that are coupled
with a specific X value. If there are densities we can write:
E[Y|X = x] = ∫ y · f(y|X = x) dy.
2. As a scalar random variable. A realization is obtained by sampling
X from its marginal distribution, then plugging this value into the
deterministic function described in 1 above.
In regression analysis, 1 is much more commonly used than 2.
8 / 157
Least Squares Fitting
In a linear model, the independent variables X are postulated to be
related to the dependent variable Y via a linear relationship
Yi ≈ Σ_{j=1}^p βj Xij = β′Xi.
This is called a “linear model” because it is a linear function of β, not
because it is a linear function of X .
To estimate f , we need to estimate the βj . One approach to doing this is
to minimize the following function of β:
L(β) = Σ_i (Yi − Σ_j βj Xij)² = Σ_i (Yi − β′Xi)².
This is called least squares estimation.
9 / 157
Simple linear regression
A special case of the linear model is simple linear regression, when there
is p = 1 covariate and an intercept (a covariate whose value is always 1).
L(α, β) = Σ_i (Yi − α − βXi)².
We can differentiate with respect to α and β:
∂L/∂α = −2 Σ_i (Yi − α − βXi) = −2 Σ_i Ri
∂L/∂β = −2 Σ_i (Yi − α − βXi)Xi = −2 Σ_i Ri Xi

where Ri = Yi − α − βXi is the “working residual” (requires working values for α and β).
10 / 157
Simple linear regression
Setting
∂L/∂α = ∂L/∂β = 0
and solving for α and β yields
α̂ = Ȳ − β̂X̄

β̂ = [Σ_i Yi Xi/n − ȲX̄] / [Σ_i Xi²/n − X̄²],

where Ȳ = Σ_i Yi/n and X̄ = Σ_i Xi/n are the sample mean values (averages).
The “hat” notation (ˆ) distinguishes the least-squares optimal values of α
and β from an arbitrary pair of parameter values.
11 / 157
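As a quick check on these formulas, here is a minimal sketch assuming numpy is available; the data are simulated purely for illustration, and the closed-form estimates should match a generic least squares solver.

    import numpy as np

    # Hypothetical simulated data for illustration
    rng = np.random.default_rng(0)
    n = 100
    x = rng.normal(size=n)
    y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

    # Closed-form simple linear regression estimates from the slide
    beta_hat = (np.mean(y * x) - y.mean() * x.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
    alpha_hat = y.mean() - beta_hat * x.mean()

    # Compare with a generic least squares fit of the design [1, x]
    X = np.column_stack([np.ones(n), x])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(alpha_hat, beta_hat)   # should match coefs[0], coefs[1]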
Two important identities
We will call α̂ and β̂ the “least squares estimates” of the model
parameters α and β.
At the least squares solution,
∂L/∂α = −2 Σ_i Ri = 0
∂L/∂β = −2 Σ_i Ri Xi = 0,

so we have the following basic properties for the least squares estimates:
• The residuals Ri = Yi − α̂ − β̂Xi sum to zero.
• The residuals are orthogonal to the independent variable X.

Recall that two vectors v, w ∈ Rd are orthogonal if Σ_j vj wj = 0.
12 / 157
Two important identities
Centered and uncentered sums of squares can be related as follows:
Σ_i Xi²/n − X̄² = Σ_i (Xi − X̄)²/n.

Centered and uncentered cross-products can be related as follows:

Σ_i Yi Xi/n − ȲX̄ = Σ_i Yi(Xi − X̄)/n = Σ_i (Yi − Ȳ)(Xi − X̄)/n.
13 / 157
Simple linear regression
Note that

Σ_i (Xi − X̄)²/n   and   Σ_i (Yi − Ȳ)(Xi − X̄)/n

are essentially the sample variance of X1, . . . , Xn and the sample covariance of the (Xi, Yi) pairs. Since β̂ is their ratio, we can replace n in the denominator of each with n − 1, so that

β̂ = cov(Y, X)/var(X),

where cov and var here denote the usual unbiased sample estimates of covariance and variance.
14 / 157
Convex functions
A map Q : Rd → R is convex if for any v , w ∈ Rd ,
Q(λv + (1 − λ)w ) ≤ λQ(v ) + (1 − λ)Q(w ),
for 0 ≤ λ ≤ 1. If the inequality is strict for 0 < λ < 1 and all v ≠ w, then Q is strictly convex.
[Figure: graph of a convex function Q, showing the chord value λQ(v) + (1 − λ)Q(w) lying above Q(λv + (1 − λ)w) for points between v and w.]
15 / 157
Convex functions
A key property of strictly convex functions is that they have at most one
global minimizer. That is, there exists at most one v ∈ Rd such that
Q(v ) ≤ Q(w ) for all w ∈ Rd .
The proof is simple. Suppose there exists v ≠ w such that
Q(v ) = Q(w ) = inf u∈Rd Q(u).
If Q is strictly convex and λ = 1/2, then
Q(v /2 + w /2) < (Q(v ) + Q(w ))/2 = inf u∈Rd Q(u),
Thus z = (v + w )/2 has the property that Q(z) < inf u∈Rd Q(u), a
contradiction.
16 / 157
Convexity of quadratic functions
A general quadratic function can be written
Q(v) = v′Av + b′v + c,

where A is a d × d matrix, b and v are vectors in Rd, and c is a scalar.

Note that

v′Av = Σ_{i,j} vi vj Aij,    b′v = Σ_j bj vj.
17 / 157
Convexity of quadratic functions
If b ∈ col(A), we can complete the square to eliminate the linear term,
giving us
Q(v) = (v − f)′A(v − f) + s,

where f is any vector satisfying Af = −b/2, and s = c − f′Af.

If A is invertible, we can take f = −A⁻¹b/2.
Since the property of being convex is invariant to translations in both the
domain and range, without loss of generality we can assume f = 0 and
s = 0 for purposes of analyzing the convexity of Q.
18 / 157
Convexity of quadratic functions
Two key definitions:
• A square matrix A is positive definite if v′Av > 0 for all vectors v ≠ 0.
• A square matrix is positive semidefinite if v′Av ≥ 0 for all v.
We will now show that the quadratic function Q(v) = v′Av is strictly convex if and only if A is positive definite.

Note that without loss of generality, A is symmetric, since otherwise Ã ≡ (A + A′)/2 gives the same quadratic form as A.
19 / 157
Convexity of quadratic functions
Q(λv + (1 − λ)w) = (λv + (1 − λ)w)′A(λv + (1 − λ)w)
                 = λ²v′Av + (1 − λ)²w′Aw + 2λ(1 − λ)v′Aw.

λQ(v) + (1 − λ)Q(w) − Q(λv + (1 − λ)w) = λ(1 − λ)(v′Av + w′Aw − 2w′Av)
                                        = λ(1 − λ)(v − w)′A(v − w) ≥ 0.

If 0 < λ < 1, this is a strict inequality for all v ≠ w if and only if A is positive definite.
20 / 157
Convexity of quadratic functions
The gradient, or Jacobian of a scalar-valued function y = f (x) (y ∈ R,
x ∈ Rd ) is
(∂f /∂x1 , . . . , ∂f /∂xd ),
which is viewed as a row vector by convention.
21 / 157
Convexity of quadratic functions
If A is symmetric, the Jacobian of Q(v) = v′Av is 2v′A′ = 2v′A. To see this, write

v′Av = Σ_i vi² Aii + Σ_{i≠j} vi vj Aij

and differentiate with respect to v_ℓ to get the ℓth component of the Jacobian:

2v_ℓ A_ℓℓ + Σ_{j≠ℓ} vj A_ℓj + Σ_{i≠ℓ} vi A_iℓ = 2(Av)_ℓ.
22 / 157
Convexity of quadratic functions
The Hessian of a scalar-valued function y = f (x) (y ∈ R, x ∈ Rd ) is the
matrix
Hij = ∂ 2 f /∂xj ∂xi .
If A is symmetric, the Hessian of Q is 2A. To see this, note that the ℓth component of 2v′A is twice the inner product of v with the ℓth row (or column) of A; its derivative with respect to a second index ℓ′ is 2A_ℓℓ′.

It follows that a quadratic function is strictly convex if and only if its Hessian is positive definite (more generally, any continuous, twice differentiable function is convex on Rd if and only if its Hessian matrix is everywhere positive semidefinite, and it is strictly convex if its Hessian is everywhere positive definite).
23 / 157
Uniqueness of the simple linear regression least squares fit
The least squares solution for simple linear regression, α̂, β̂, is unique as
long as var[X] (the sample variance of the covariate) is positive.
To see this, note that the Hessian (second derivative matrix) of L(α, β) is
H = | ∂²L/∂α²     ∂²L/∂α∂β |   | 2n     2X·       |
    | ∂²L/∂α∂β    ∂²L/∂β²  | = | 2X·    2 Σ_i Xi² |

where X· = Σ_i Xi.
24 / 157
Uniqueness of the simple linear regression least squares fit
If var(X) > 0 then this is a positive definite matrix, since the leading principal submatrices have positive determinants:

|2n| > 0

| 2n     2X·       |
| 2X·    2 Σ_i Xi² | = 4n Σ_i Xi² − 4(X·)² = 4n(n − 1)var(X) > 0.
25 / 157
The fitted line and the data center
The fitted line
Y = α̂ + β̂X
passes through the center of mass of the data (X̄ , Ȳ ).
This can be seen by substituting X = X̄ into the equation of the fitted
line, which yields Ȳ .
26 / 157
Regression slopes and the Pearson correlation
The sample Pearson correlation coefficient between X and Y is
ρ̂ = cov(Y, X) / (SD(Y) SD(X)).

The relationship between β̂ and ρ̂ is

β̂ = ρ̂ · SD(Y)/SD(X).
Thus the fitted slope has the same sign as the Pearson correlation
coefficient between Y and X .
27 / 157
Reversing X and Y
If X and Y are reversed in a simple linear regression, the slope is
β̂∗ = cov(Y, X)/var(Y) = cor(Y, X) · SD(X)/SD(Y).
If the data fall exactly on a line, then cor(Y, X) = ±1, so β̂∗ = 1/β̂, which
is consistent with algebraically rearranging
Y = α + βX
to
X = −α/β + Y /β.
But if the data do not fit a line exactly, this property does not hold.
28 / 157
Norms
The Euclidean norm on vectors (the most commonly used norm) is:
‖V‖ = √(Σ_i Vi²) = √(V′V).

Here are some useful identities, for vectors V, W ∈ Rp:

‖V + W‖² = ‖V‖² + ‖W‖² + 2V′W
‖V − W‖² = ‖V‖² + ‖W‖² − 2V′W
29 / 157
Fitting multiple linear regression models
For multiple regression (p > 1), the covariate data define the design
matrix:

X = | 1   x11   x12   · · ·   x1p |
    | 1   x21   x22   · · ·   x2p |
    | ·    ·     ·    · · ·    ·  |
    | 1   xn1   xn2   · · ·   xnp |

Note that in some situations the first column of 1's (the intercept) will not be included.
30 / 157
Fitting multiple linear regression models
The linear model coefficients are written as a vector
β = (β0, β1, . . . , βp)′
where β0 is the intercept and βk is the slope corresponding to the k th
covariate. For a given working covariate vector β, the vector of fitted
values is given by the matrix-vector product
Ŷ = Xβ,
which is an n-dimensional vector.
The vector of residuals Y − Xβ is also an n-dimensional vector.
31 / 157
Fitting multiple linear regression models
The goal of least-squares estimation is to minimize the sum of squared
differences between the fitted and observed values.
L(β) = Σ_i (Yi − Ŷi)² = ‖Y − Xβ‖².
Estimating β by minimizing L(β) is called ordinary least squares (OLS).
32 / 157
The multivariate chain rule
Suppose g(·) is a map from Rm to Rn and f(·) is a scalar-valued function on Rn, and let h = f ◦ g, i.e. h(z) = f(g(z)). Let fj(x) = ∂f(x)/∂xj, let

∇f(x) = (f1(x), · · · , fn(x))′

denote the gradient of f, and let J denote the Jacobian of g:

Jij(z) = ∂gi(z)/∂zj.
33 / 157
The multivariate chain rule
Then
∂h(z)/∂zj = Σ_i fi(g(z)) ∂gi(z)/∂zj = [J(z)′∇f(g(z))]j.

Thus we can write the gradient of h as a matrix-vector product between the (transposed) Jacobian of g and the gradient of f:

∇h = J′∇f
where J is evaluated at z and ∇f is evaluated at g (z).
34 / 157
The least squares gradient function
For the least squares problem, the gradient of L(β) with respect to β is
∂L/∂β = −2X′(Y − Xβ).

This can be seen by differentiating

L(β) = Σ_i (Yi − β0 − Σ_{j=1}^p Xij βj)²

element-wise, or by differentiating

‖Y − Xβ‖²

using the multivariate chain rule, letting g(β) = Y − Xβ and f(x) = Σ_j xj².
35 / 157
Normal equations
Setting ∂L/∂β = 0 yields the “normal equations:”
X′Xβ = X′Y.

Thus calculating the least squares estimate of β reduces to solving a system of p + 1 linear equations. Algebraically we can write

β̂ = (X′X)⁻¹X′Y,

which is often useful for deriving analytical results. However this
expression should not be used to numerically calculate the coefficients.
36 / 157
Solving the normal equations
The most standard numerical approach is to calculate the QR
decomposition of
X = QR
where Q is an n × (p + 1) orthogonal matrix (i.e. Q′Q = I) and R is a (p + 1) × (p + 1) upper triangular matrix.

The QR decomposition can be calculated rapidly and with high precision. Once it is obtained, the normal equations become

Rβ = Q′Y,

which is an easily solved (p + 1) × (p + 1) triangular system.
37 / 157
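A minimal sketch of this computation, assuming numpy and scipy are available (the data here are simulated for illustration); the coefficients from the QR approach should agree with a generic least squares solver.

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(1)
    n, p = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    Y = X @ beta_true + rng.normal(size=n)

    # Reduced QR decomposition: Q is n x (p+1), R is (p+1) x (p+1) upper triangular
    Q, R = np.linalg.qr(X)
    beta_qr = solve_triangular(R, Q.T @ Y)     # solve R beta = Q'Y

    beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(np.allclose(beta_qr, beta_lstsq))    # True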
Matrix products
In multiple regression we encounter the matrix product X0 X. Let’s review
some ways to think about matrix products.
If A ∈ R^(n×m) and B ∈ R^(n×p) are matrices, we can form the product A′B ∈ R^(m×p).

Suppose we partition A and B by rows:

A = | — a1 — |          B = | — b1 — |
    | — a2 — |              | — b2 — |
    |   ···  |              |   ···  |
    | — an — |              | — bn — |
38 / 157
Matrix products
Then
A′B = a1′b1 + a2′b2 + · · · + an′bn,

where each aj′bj term is an outer product

aj′bj = | aj1 bj1   aj1 bj2   · · · |
        | aj2 bj1   aj2 bj2   · · · |   ∈ R^(m×p).
        |   · · ·     · · ·   · · · |
39 / 157
Matrix products
Now if we partition A and B by columns,

A = [ a1   a2   a3 ],      B = [ b1   b2   b3 ],

then A′B is a matrix of inner products:

A′B = | a1′b1   a1′b2   · · · |
      | a2′b1   a2′b2   · · · |
      |  · · ·   · · ·   · · · |
40 / 157
Matrix products
Thus we can view the product X′X involving the design matrix in two different ways. If we partition X by rows (the cases)

X = | — x1 — |
    | — x2 — |
    |   ···  |
    | — xn — |

then

X′X = Σ_{i=1}^n xi′xi

is the sum of the outer product matrices of the cases.
41 / 157
Matrix products
If we partition X by the columns (the variables),

X = [ x1   x2   x3 ],

then X′X is a matrix whose entries are the pairwise inner products of the variables ([X′X]ij = xi′xj).
42 / 157
Mathematical properties of the multiple regression fit
The multiple regression least squares solution is unique as long as the columns of X are linearly independent. Here is the proof:

1. The Hessian of L(β) is 2X′X.
2. For v ≠ 0, v′(X′X)v = (Xv)′Xv = ‖Xv‖² > 0, since the columns of X are linearly independent. Therefore the Hessian of L is positive definite.
3. Since L(β) is quadratic with a positive definite Hessian matrix, it is strictly convex and hence has a unique global minimizer.
43 / 157
Projections
Suppose S is a subspace of Rd , and V is a vector in Rd . The projection
operator PS maps V to the vector in S that is closest to V :
PS(V) = argmin_{η∈S} ‖V − η‖².
44 / 157
Projections
Property 1: (V − PS(V))′s = 0 for all s ∈ S. To see this, let s ∈ S. Without loss of generality ‖s‖ = 1 and (V − PS(V))′s ≤ 0. Let λ ≥ 0, and write

‖V − PS V + λs‖² = ‖V − PS V‖² + λ² + 2λ(V − PS(V))′s.

If (V − PS(V))′s ≠ 0, then for sufficiently small λ > 0, λ² + 2λ(V − PS(V))′s < 0. This means that PS(V) − λs is closer to V than PS(V), contradicting the definition of PS(V).
45 / 157
Projections
Property 2: Given a subspace S of Rd, any vector V ∈ Rd can be written uniquely in the form V = VS + VS⊥, where VS ∈ S and s′VS⊥ = 0 for all s ∈ S. To prove uniqueness, suppose

V = VS + VS⊥ = ṼS + ṼS⊥.

Then

0 = ‖VS − ṼS + VS⊥ − ṼS⊥‖²
  = ‖VS − ṼS‖² + ‖VS⊥ − ṼS⊥‖² + 2(VS − ṼS)′(VS⊥ − ṼS⊥)
  = ‖VS − ṼS‖² + ‖VS⊥ − ṼS⊥‖²,

which is only possible if VS = ṼS and VS⊥ = ṼS⊥. Existence follows from Property 1, with VS = PS(V) and VS⊥ = V − PS(V).
46 / 157
Projections
Property 3: The projection PS (V ) is unique.
This follows from property 2 – if there exists a V for which V1 ≠ V2 ∈ S
both minimize the distance from V to S, then V = V1 + U1 and
V = V2 + U2 (Uj = V − Vj ) would be distinct decompositions of V as a
sum of a vector in S and a vector in S ⊥ , contradicting property 2.
47 / 157
Projections
Property 4: PS (PS (V )) = PS (V ). The proof of this is simple, since
PS (V ) ∈ S, and any element of S has zero distance to itself.
A matrix or linear map with this property is called idempotent.
48 / 157
Projections
Property 5: PS is a linear operator.
Let A, B be vectors with θA = PS (A) and θB = PS (B). Then we can
write
A + B = θA + θB + (A + B − θA − θB ),
where θA + θB ∈ S, and
s′(A + B − θA − θB) = s′(A − θA) + s′(B − θB) = 0
for all s ∈ S. By Property 2 above, this representation is unique, so
θA + θB = PS (A + B).
49 / 157
Projections
Property 6: Suppose PS is the projection operator onto a subspace S.
Then I − PS , where I is the identity matrix, is the projection operator
onto the subspace
S⊥ ≡ {u ∈ Rd : u′s = 0 for all s ∈ S}.

To prove this, write

V = (I − PS)V + PS V,

and note that ((I − PS)V)′s = 0 for all s ∈ S, so (I − PS)V ∈ S⊥, and u′PS V = 0 for all u ∈ S⊥. By property 2 this decomposition is unique, and therefore I − PS is the projection operator onto S⊥.
50 / 157
Projections
Property 7: Since PS(V) is linear, it can be represented in the form PS(V) = PS · V for a suitable square matrix PS. Suppose S is spanned by the columns of a matrix X with linearly independent columns. Then

PS = X(X′X)⁻¹X′.

To prove this, let Q = X(X′X)⁻¹X′, so for an arbitrary vector V,

V = QV + (V − QV)
  = X(X′X)⁻¹X′V + (I − X(X′X)⁻¹X′)V,

and note that the first summand is in S.
51 / 157
Projections
Property 7 (continued):
To show that the second summand is in S⊥, take s ∈ S and write s = Xb. Then

s′(I − X(X′X)⁻¹X′)V = b′X′(I − X(X′X)⁻¹X′)V = 0.

Therefore this is the unique decomposition from Property 2, above, so PS must be the projection.
52 / 157
Projections
Property 7 (continued)
An alternate approach to property 7 is constructive. Let θ = Xγ, and
suppose we wish to minimize the distance between θ and V . Using
calculus, we differentiate with respect to γ and solve for the stationary
point:
∂‖V − Xγ‖²/∂γ = ∂(V′V − 2V′Xγ + γ′X′Xγ)/∂γ = −2X′V + 2X′Xγ = 0.

The solution is

γ = (X′X)⁻¹X′V,

so

θ = X(X′X)⁻¹X′V.
53 / 157
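The projection matrix and its properties (idempotence, symmetry, PX = X, orthogonality of residuals to col(X)) are easy to verify numerically; here is a small sketch assuming numpy, with a randomly generated design matrix used purely for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

    # Projection onto col(X): P = X (X'X)^{-1} X'
    P = X @ np.linalg.solve(X.T @ X, X.T)

    print(np.allclose(P @ P, P))      # idempotent (Property 4)
    print(np.allclose(P, P.T))        # symmetric (Property 9)
    print(np.allclose(P @ X, X))      # P fixes every column of X
    V = rng.normal(size=n)
    print(np.allclose(X.T @ (V - P @ V), 0))   # V - P V is orthogonal to col(X) (Property 1)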
Projections
Property 8: The composition of projections onto S and S ⊥ (in either
order) is identically zero.
PS ◦ PS ⊥ = PS ⊥ ◦ PS ≡ 0. This can be shown by direct calculation using
the representations of PS and PS ⊥ given above.
54 / 157
Least squares and projections
The least squares problem of minimizing

‖Y − Xβ‖²

is equivalent to minimizing ‖Y − η‖² over η ∈ col(X).
Therefore the minimizing value η̂ is the projection of Y onto col(X).
If the columns of X are linearly independent, there is a unique vector β̂
such that Xβ̂ = η̂. These are the least squares coefficient estimates.
55 / 157
Properties of the least squares fit
• The fitted regression surface passes through the mean point (X̄, Ȳ ).
To see this, note that the fitted surface at X̄ (the vector of column-wise
means) is
X̄′β̂ = (1′X/n)β̂ = 1′X(X′X)⁻¹X′Y/n,

where 1 is a column vector of 1's. Since 1 ∈ col(X), it follows that

X(X′X)⁻¹X′1 = 1,

which gives the result, since Ȳ = 1′Y/n.
56 / 157
Properties of the least squares fit
• The multiple regression residuals sum to zero. The residuals are
R ≡ Y − Ŷ = Y − PS Y = (I − PS)Y = PS⊥ Y,

where S = col(X). The sum of residuals can be written

1′(I − X(X′X)⁻¹X′)Y,

where 1 is a vector of 1's. If X includes an intercept, PS 1 = 1, so

1′X(X′X)⁻¹X′ = 1′,

so

1′(I − X(X′X)⁻¹X′) = 0.
57 / 157
Orthogonal matrices
An orthogonal matrix X satisfies X′X = I. This is equivalent to stating that the columns of X are mutually orthonormal.

If X is square and orthogonal, then X′ = X⁻¹ and also XX′ = I.

If X is orthogonal then the projection onto col(X) simplifies to

X(X′X)⁻¹X′ = XX′.
If X is orthogonal and the first column of X is constant, it follows that
the remaining columns of X are centered and have sample variance
1/(n − 1).
58 / 157
Orthogonal matrices
• If X is orthogonal, the slopes obtained by using multiple regression of
Y on X = (X1 , . . . , Xp ) are the same as the slopes obtained by carrying
out p simple linear regressions of Y on each covariate separately.
To see this, note that the multiple regression slope estimate for the ith covariate is

β̂m,i = X:i′Y,

where X:i is column i of X. Since X′X = I it follows that each covariate has zero sample mean, and sample variance equal to 1/(n − 1). Thus the simple linear regression slope for covariate i is

β̂i = cov(X:i, Y)/var(X:i) = X:i′Y = β̂m,i.
59 / 157
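A quick numerical illustration of this fact, assuming numpy; the orthogonal design here is built by orthonormalizing a random matrix whose first column is constant, so it is an arbitrary example rather than anything from the text.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 100, 3
    M = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    X, _ = np.linalg.qr(M)            # columns of X are orthonormal, first column is constant
    Y = rng.normal(size=n)

    # Multiple regression slopes: since X'X = I, beta_hat = X'Y
    beta_multiple = X.T @ Y

    # Simple regression slope of Y on each non-constant column: cov(X_i, Y)/var(X_i)
    for i in range(1, p + 1):
        xi = X[:, i]
        b_simple = np.cov(xi, Y, ddof=1)[0, 1] / np.var(xi, ddof=1)
        print(np.isclose(b_simple, beta_multiple[i]))   # True for each covariate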
Comparing multiple regression and simple regression slopes
• The signs of the multiple regression slopes need not agree with the
signs of the corresponding simple regression slopes.
For example, suppose there are two covariates, both with mean zero and
variance 1, and for simplicity assume that Y has mean zero and variance
1. Let r12 be the correlation between the two covariates, and let r1y and
r2y be the correlations between each covariate and the response. It
follows that

X′X/(n − 1) = | n/(n − 1)    0      0   |
              |     0        1      r12 |
              |     0        r12    1   |

X′Y/(n − 1) = | 0   |
              | r1y |
              | r2y |.
60 / 157
Comparing multiple regression and simple regression slopes
So we can write

β̂ = (X′X/(n − 1))⁻¹(X′Y/(n − 1)) = (1/(1 − r12²)) | 0              |
                                                   | r1y − r12 r2y  |
                                                   | r2y − r12 r1y  |.
Thus, for example, if r1y, r2y, r12 ≥ 0 and r12 > r1y/r2y, then β̂1 has opposite signs in simple and multiple regression. Note that if r1y > r2y it is impossible for r12 > r1y/r2y. Thus, the effect direction for the covariate that is more strongly marginally correlated with Y cannot be reversed.
This is an example of “Simpson’s paradox”.
61 / 157
Comparing multiple regression and simple regression slopes
Numerical example: if r1y = 0.1, r2y = 0.6, r12 = 0.6, then X1 is
(marginally) positively associated with Y , but for fixed values of X2 , the
association between Y and X1 is negative.
62 / 157
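A sketch of this computation, assuming numpy: plugging the correlations from the example into the expression for β̂ on the previous slide shows the sign reversal.

    import numpy as np

    r1y, r2y, r12 = 0.1, 0.6, 0.6

    # Slope estimates for the two (standardized) covariates, from the previous slide
    beta = np.array([r1y - r12 * r2y, r2y - r12 * r1y]) / (1 - r12 ** 2)
    print(beta)   # the first slope is negative even though cor(X1, Y) = 0.1 > 0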
Formulation of regression models in terms of probability
distributions
Up to this point, we have primarily expressed regression models in terms
of the mean structure, e.g.
E[Y|X = x] = β′x.
Regression models are commonly discussed in terms of moment
structures (E [Y |X = x], var[Y |X = x]) or quantile structures
(Qp [Y |X = x]), rather than as fully-specified probability distributions.
63 / 157
Formulation of regression models in terms of probability
distributions
If we want to specify the model more completely, we can think in terms
of a random “error term” that describes how the observed value y
deviates from the ideal value f (x), where (x, y ) is generated according to
the regression model.
A very general regression model is

Y = f(X, ε),

where ε is a random variable with expected value zero.

If we specify the distribution of ε, then we have fully specified the distribution of Y|X.
64 / 157
Formulation of regression models in terms of probability
distributions
A more restrictive “additive error” model is:

Y = f(X) + ε.

Under this model,

E[Y|X] = E[f(X) + ε|X] = E[f(X)|X] + E[ε|X] = f(X) + E[ε|X].

Without loss of generality, E[ε|X] = 0, so E[Y|X] = f(X).
65 / 157
Regression model formulations and parameterizations
A parametric regression model is:
Y = f(X; θ) + ε,

where θ is a finite dimensional parameter vector.

Examples:

1. The linear response surface model f(X; θ) = θ′X
2. The quadratic response surface model f(X; θ) = θ1 + θ2 X + θ3 X²
3. The Gompertz curve f(X; θ) = θ1 exp(θ2 exp(θ3 X)), with θ2, θ3 ≤ 0.
Models 1 and 2 are both “linear models” because they are linear in θ.
The Gompertz curve is a non-linear model because it is not linear in θ.
66 / 157
Basic inference for simple linear regression
This section deals with statistical properties of least squares fits that can be derived under minimal conditions.

Specifically, we will derive properties of β̂ that hold for the generating model y = x′β + ε, where:

(i) E[ε|x] = 0
(ii) var[ε|x] = σ² exists and is constant across cases
(iii) the random variables ε are uncorrelated across cases (given x).

We will not assume here that ε follows a particular distribution, e.g. a Gaussian distribution.
67 / 157
The relationship between E[ε|X] and cov(ε, X)

If we treat X as a random variable, the condition that

E[ε|X] = 0

for all X implies that cor(X, ε) = 0. This follows from the double expectation theorem:

cov(X, ε) = E[Xε] − E[X] · E[ε]
          = E[Xε]
          = E_X[E[Xε|X]]
          = E_X[X E[ε|X]]
          = E_X[X · 0]
          = 0.
68 / 157
The relationship between E[ε|X] and cov(ε, X)

The converse is not true. If cor(X, ε) = 0 and E[ε] = 0 it may not be the case that E[ε|X] = 0.

For example, if X ∈ {−1, 0, 1} and ε ∈ {−1, 1}, with joint distribution

            X = −1   X = 0   X = 1
  ε = −1     1/12     4/12    1/12
  ε = 1      3/12      0      3/12

then E[ε] = 0 and cor(X, ε) = 0, but E[ε|X] is not identically zero. When ε and X are jointly Gaussian, cor(ε, X) = 0 implies that ε and X are independent, which in turn implies that E[ε|X] = E[ε] = 0.
69 / 157
Data sampling
To be able to interpret the results of a regression analysis, we need to
know how the data were sampled. In particular, it is important to consider whether X and/or Y and/or the pair (X, Y) should be considered as random draws from a population. Here are two important situations:

• Designed experiment: We are studying the effect of temperature on reaction yield in a chemical synthesis. The temperature X is controlled and set by the experimenter. In this case, Y is randomly sampled conditionally on X, but X is not randomly sampled, so it doesn't make sense to consider X to be a random variable.

• Observational study: We are interested in the relationship between cholesterol level X and blood pressure Y. We sample people at random from a well defined population (e.g. all registered nurses in the state of Michigan) and measure their blood pressure and cholesterol levels. In this case, X and Y are sampled from their joint distribution, and both can be viewed as random variables.
70 / 157
Data sampling
There is another design that we may encounter:
• Case/control study: Again suppose we are interested in the relationship between cholesterol level X and blood pressure Y. But now suppose we have an exhaustive list of blood pressure measurements for all nurses in the state of Michigan. We wish to select a subset of 500 nurses to contact for acquiring cholesterol measures, and it is decided that studying the 250 nurses with greatest blood pressure together with the 250 nurses with lowest blood pressure will be most informative. In this case X is randomly sampled conditionally on Y, but Y is not randomly sampled.
Summary: Regression models are formulated in terms of the conditional
distribution of Y given X . The statistical properties of β̂ are easiest to
calculate and interpret as being conditional on X . The way that the data
are sampled also affects our interpretation of the results.
71 / 157
Basic inference for simple linear regression via OLS
Above we showed that the slope and intercept estimates are
P
Yi (Xi − X̄ )
cd
ov(Y , X )
= Pi
β̂ =
,
2
var[X
c ]
i (Xi − X̄ )
α̂ = Ȳ − β̂ X̄ .
Note that we are using the useful fact that
X
i
(Yi − Ȳ )(Xi − X̄ ) =
X
i
Yi (Xi − X̄ ) =
X
(Yi − Ȳ )Xi .
i
72 / 157
Sampling means of parameter estimates
First we will calculate the sampling means of α̂ and β̂. A useful identity
is that
β̂ = [Σ_i (α + βXi + εi)(Xi − X̄)] / [Σ_i (Xi − X̄)²]
   = β + [Σ_i εi(Xi − X̄)] / [Σ_i (Xi − X̄)²].

From this identity it is clear that E[β̂|X] = β. Thus β̂ is unbiased (a parameter estimate is unbiased if its sampling mean is the same as the population value of the parameter).

The intercept is also unbiased:

E[α̂|X] = E[Ȳ − β̂X̄|X]
        = α + βX̄ + E[ε̄|X] − E[β̂X̄|X]
        = α.
73 / 157
Sampling variances of parameter estimates
Next we will calculate the sampling variances of α̂ and β̂. These values
capture the variability of the parameter estimates over replicated studies
or experiments from the same population.
First we will need the following result:
cov(β̂, ε̄|X) = [Σ_i cov(εi, ε̄|X)(Xi − X̄)] / [Σ_i (Xi − X̄)²] = 0,

since cov(εi, ε̄|X) = σ²/n does not depend on i.
74 / 157
Sampling variances of parameter estimates
To derive the sampling variances, start with the identity:
β̂ = β + [Σ_i εi(Xi − X̄)] / [Σ_i (Xi − X̄)²].

The sampling variances are

var[β̂|X] = σ² / Σ_i (Xi − X̄)² = σ² / ((n − 1)var[X]),

var[α̂|X] = var[Ȳ − β̂X̄|X]
          = var[α + βX̄ + ε̄ − β̂X̄|X]
          = var[ε̄|X] + X̄² var[β̂|X] − 2X̄ cov[ε̄, β̂|X]
          = σ²/n + X̄²σ²/((n − 1)var[X]).
75 / 157
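These variance formulas can be checked by simulation; here is a minimal sketch assuming numpy, with the design X held fixed across replications (i.e. conditioning on X) and all parameter values chosen arbitrarily for illustration.

    import numpy as np

    rng = np.random.default_rng(4)
    n, alpha, beta, sigma = 50, 1.0, 2.0, 1.5
    x = rng.normal(loc=3.0, size=n)          # fixed design, held constant below
    vx = np.var(x, ddof=1)

    nrep = 20000
    slopes = np.empty(nrep)
    intercepts = np.empty(nrep)
    for r in range(nrep):
        y = alpha + beta * x + rng.normal(scale=sigma, size=n)
        b = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
        slopes[r] = b
        intercepts[r] = y.mean() - b * x.mean()

    print(slopes.var(), sigma ** 2 / ((n - 1) * vx))                 # var[beta_hat | X]
    print(intercepts.var(),
          sigma ** 2 / n + x.mean() ** 2 * sigma ** 2 / ((n - 1) * vx))  # var[alpha_hat | X]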
Sampling covariance of parameter estimates
The sampling covariance between the slope and intercept is
cov(α̂, β̂|X) = cov(Ȳ, β̂|X) − X̄ var[β̂|X]
             = −σ²X̄/((n − 1)var[X]).

When X̄ > 0, it's easy to see what the expression for cov(α̂, β̂|X) is telling us:

[Figure: a scatterplot with fitted lines, illustrating that when X̄ > 0, underestimating the slope goes with overestimating the intercept, and overestimating the slope goes with underestimating the intercept.]
76 / 157
Basic inference for simple linear regression via OLS
Some observations:
• All variances scale with sample size like 1/n.
• β̂ does not depend on X̄.
• var[α̂] is minimized if X̄ = 0.
• α̂ and β̂ are uncorrelated if X̄ = 0.
77 / 157
Some properties of residuals
Start with the following useful expression:
Ri ≡ Yi − α̂ − β̂Xi = Yi − Ȳ − β̂(Xi − X̄).

Since Yi = α + βXi + εi and therefore Ȳ = α + βX̄ + ε̄, we can subtract to get

Yi − Ȳ = β(Xi − X̄) + εi − ε̄;

it follows that

Ri = (β − β̂)(Xi − X̄) + εi − ε̄.

Since E[β̂] = β and E[εi − ε̄] = 0, it follows that E[Ri] = 0. Note that this is a distinct fact from the identity Σ_i Ri = 0.
78 / 157
Some properties of residuals
It is important to distinguish the residual Ri from the “observation errors” εi. The identity Ri = (β − β̂)(Xi − X̄) + εi − ε̄ shows us that the centered observation error εi − ε̄ is one part of the residual. The other term

(β − β̂)(Xi − X̄)

reflects the fact that the residuals are also influenced by how well we recover the true slope β through our estimate β̂.

To illustrate, consider two possibilities:

• We overestimate the slope (β − β̂ < 0). The residuals to the right of the mean (i.e. when X > X̄) are shifted down (toward −∞), and the residuals to the left of the mean are shifted up.
• We underestimate the slope (β − β̂ > 0). The residuals to the right of the mean are shifted up, and the residuals to the left of the mean are shifted down.
79 / 157
Some properties of residuals
We can use the identity Ri = (β − β̂)(Xi − X̄) + εi − ε̄ to derive the variance of each residual. We have:

var[(β − β̂)(Xi − X̄)] = σ²(Xi − X̄)² / Σ_j (Xj − X̄)²

var[εi − ε̄|X] = var[εi|X] + var[ε̄|X] − 2cov(εi, ε̄|X)
             = σ² + σ²/n − 2σ²/n
             = σ²(n − 1)/n.

cov((β − β̂)(Xi − X̄), εi − ε̄|X) = −(Xi − X̄)cov(β̂, εi|X)
                                 = −σ²(Xi − X̄)² / Σ_j (Xj − X̄)²,
80 / 157
Some properties of residuals
Thus the variance of the i th residual is
var[Ri|X] = var[(β − β̂)(Xi − X̄) + εi − ε̄|X]
          = var[(β − β̂)(Xi − X̄)|X] + var[εi − ε̄|X] + 2cov((β − β̂)(Xi − X̄), εi − ε̄|X)
          = (n − 1)σ²/n − σ²(Xi − X̄)² / Σ_j (Xj − X̄)².

Note the following identity, which ensures that this expression is non-negative:

(Xi − X̄)² / Σ_j (Xj − X̄)² ≤ (n − 1)/n.
81 / 157
Some properties of residuals
• The residuals are not iid – they are correlated with each other, and they have different variances.
• var[Ri] < var[εi] – the residuals are less variable than the errors.
82 / 157
Some properties of residuals
Since E[Ri] = 0 it follows that var[Ri] = E[Ri²]. Therefore the expected value of Σ_i Ri² is

(n − 1)σ² − σ² = (n − 2)σ²,

since the terms (Xi − X̄)² / Σ_j (Xj − X̄)² sum to one.
83 / 157
Estimating the error variance σ 2
Since

E[Σ_i Ri²] = (n − 2)σ²,

it follows that

Σ_i Ri²/(n − 2)

is an unbiased estimate of σ². That is,

E[Σ_i Ri²/(n − 2)] = σ².
84 / 157
Basic inference for multiple linear regression via OLS
We have the following useful identity for the multiple linear regression
least squares fit:
β̂ = (X′X)⁻¹X′Y = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε.

Letting η = (X′X)⁻¹X′ε we see that E[η|X] = 0 and

var[η|X] = var[β̂|X] = (X′X)⁻¹X′var[ε|X]X(X′X)⁻¹ = σ²(X′X)⁻¹.

• var[ε|X] = σ²I since (i) the εi are uncorrelated and (ii) the εi have constant variance given X.
85 / 157
Variance of β̂ in multiple regression OLS
Let ui be the ith row of the design matrix X. Then

X′X = Σ_i ui′ui

where ui′ui is an outer product (a (p + 1) × (p + 1) matrix).

If we have the limiting behavior

n⁻¹ Σ_i ui′ui → Q

for a fixed (p + 1) × (p + 1) matrix Q, then X′X ≈ nQ, so

cov(β̂|X) ≈ σ²Q⁻¹/n.

Thus we see the usual influence of sample size on the standard errors of the regression coefficients.
86 / 157
Level sets of quadratic forms
Suppose

C = | 1     0.5 |
    | 0.5   1   |

so that

C⁻¹ = (2/3) |  2   −1 |
            | −1    2 |.

Left side: level curves of g(x) = x′Cx
Right side: level curves of h(x) = x′C⁻¹x

[Figure: contour plots of the two quadratic forms; the directions in which the level curves of g are most spread out are the directions in which the level curves of h are most compressed.]
87 / 157
Eigen-decompositions and quadratic forms
The dominant eigenvector of C maximizes the “Rayleigh quotient”

g(x) = x′Cx/x′x,

thus it points in the direction of greatest change of g.

If

C = | 1     0.5 |
    | 0.5   1   |,

the dominant eigenvector points in the (1, 1) direction. The dominant eigenvector of C⁻¹ points in the (−1, 1) direction.

If the eigendecomposition of C is Σ_j λj vj vj′ then the eigendecomposition of C⁻¹ is Σ_j λj⁻¹ vj vj′ – thus directions in which the level curves of C are most spread out are the directions in which the level curves of C⁻¹ are most compressed.
88 / 157
Variance of β̂ in multiple regression OLS
If p = 2 and

X′X/n = | 1   0   0 |
        | 0   1   r |
        | 0   r   1 |,

then

cov(β̂) = σ²n⁻¹ | 1        0              0           |
               | 0    1/(1 − r²)    −r/(1 − r²)     |
               | 0   −r/(1 − r²)     1/(1 − r²)     |.

So if p = 2 and X1 and X2 are positively colinear (meaning that (X1 − X̄1)′(X2 − X̄2) > 0), the corresponding slope estimates β̂1 and β̂2 are negatively correlated.
89 / 157
The residual sum of squares
The residual sum of squares (RSS) is the squared norm of the residual
vector:
RSS = ‖Y − Ŷ‖² = ‖Y − PY‖² = ‖(I − P)Y‖² = Y′(I − P)(I − P)Y = Y′(I − P)Y,

where P is the projection matrix onto col(X). The last equivalence follows from the fact that I − P is a projection and hence is idempotent.
90 / 157
The expected value of the RSS
The expression RSS = Y′(I − P)Y is a quadratic form in Y, and we can write

Y′(I − P)Y = tr(Y′(I − P)Y) = tr((I − P)YY′),

where the second equality uses the circulant property of the trace. For three factors, the circulant property states that

tr(ABC) = tr(CAB) = tr(BCA).
91 / 157
The expected value of the RSS
By linearity we have

E tr[(I − P)YY′] = tr[(I − P) · E[YY′]],

and

E[YY′] = Xββ′X′ + E[Xβε′] + E[εβ′X′] + E[εε′]
       = Xββ′X′ + E[εε′]
       = Xββ′X′ + σ²I.

Since PX = X and hence (I − P)X = 0,

(I − P)E[YY′] = σ²(I − P).

Therefore the expected value of the RSS is

E[RSS] = σ²tr(I − P).
92 / 157
Four more properties of projection matrices
Property 9: A projection matrix P is symmetric. One way to show this
is to let V1 , . . . , Vq be an orthonormal basis for S, where P is the
projection onto S. Then complete the Vj with Vq+1 , . . . , Vd to get a
basis. By direct calculation,
(P − Σ_{j=1}^q Vj Vj′)Vk = 0

for all k, hence P = Σ_{j=1}^q Vj Vj′, which is symmetric.
93 / 157
Four more properties of projection matrices
Property 10: A projection matrix is positive semidefinite. Let V be an
arbitrary vector and write V = V1 + V2 , where V1 ∈ S and V2 ∈ S ⊥ .
Then
(V1 + V2)′P(V1 + V2) = V1′V1 ≥ 0.
94 / 157
Four more properties of projection matrices
Property 11: The eigenvalues of a projection matrix P must be zero or
one.
Suppose λ, v is an eigenvalue/eigenvector pair:
Pv = λv .
If P is the projection onto a subspace S, this implies that λv is the
closest element of S to v. But if λ ≠ 0, then λv ∈ S implies v ∈ S, and v is strictly closer to itself than λv is, unless λ = 1 or v = 0. Therefore only 0 and 1 can be
eigenvalues of P.
95 / 157
Four more properties of projection matrices
Property 12: The trace of a projection matrix is its rank.
The rank of a matrix is the number of nonzero eigenvalues. The trace of
a matrix is the sum of all eigenvalues. Since the nonzero eigenvalues of a
projection matrix are all 1, the rank and the trace must be identical.
96 / 157
The expected value of the RSS
We know that E [RSS] = σ 2 tr(I − P). Since I − P is the projection onto
col(X)⊥ , I − P has rank n − rank(X) = n − p − 1. Thus
E [RSS] = σ 2 (n − p − 1),
so
RSS/(n − p − 1)
is an unbiased estimate of σ 2 .
97 / 157
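A short simulation sketch (assuming numpy, with an arbitrary simulated design) illustrating that RSS/(n − p − 1) is unbiased for σ²:

    import numpy as np

    rng = np.random.default_rng(5)
    n, p, sigma = 40, 4, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    beta = rng.normal(size=p + 1)

    est = []
    for _ in range(5000):
        Y = X @ beta + rng.normal(scale=sigma, size=n)
        bhat, *_ = np.linalg.lstsq(X, Y, rcond=None)
        rss = np.sum((Y - X @ bhat) ** 2)
        est.append(rss / (n - p - 1))

    print(np.mean(est), sigma ** 2)   # the average of the estimates is close to sigma^2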
Covariance matrix of residuals
Since E β̂ = β, it follows that E Ŷ = Xβ = EY . Therefore we can derive
the following simple expression for the covariance matrix of the residuals.
cov(Y − Ŷ) = E[(Y − Ŷ)(Y − Ŷ)′]
           = (I − P)E[YY′](I − P)
           = (I − P)(Xββ′X′ + σ²I)(I − P)
           = σ²(I − P).
98 / 157
Distribution of the RSS
The RSS can be written
RSS = tr[(I − P)YY′] = tr[(I − P)εε′].
Therefore, the distribution of the RSS does not depend on β. It also
depends on X only through col(X).
99 / 157
Distribution of the RSS
If the distribution of ε is invariant under orthogonal transforms, i.e.

ε =ᵈ Qε   (equality in distribution)

when Q is a square orthogonal matrix, then we can make the stronger statement that the distribution of the RSS only depends on X through its rank.

To see this, construct a square orthogonal matrix Q so that Q′(I − P)Q is the projection onto a fixed subspace S of dimension n − p − 1 (so Q′ maps col(I − P) to S). Then

tr[(I − P)εε′] =ᵈ tr[(I − P)Qε(Qε)′] = tr[Q′(I − P)Qεε′].

Note that since Q is square we have QQ′ = Q′Q = I.
100 / 157
Optimality
For a given design matrix X, there are many linear estimators that are
unbiased for β. That is, there are many matrices M ∈ Rp+1×n such that
E [MY |X] = β
for all β. The Gauss-Markov theorem states that among these, the
least squares estimate is “best,” in the sense that its covariance matrix is
“smallest.”
Here we are using the definition that a matrix A is “smaller” than a
matrix B if
B −A
is positive definite.
101 / 157
Optimality (BLUE)
Let β∗ = MY be any linear unbiased estimator of β. The statement

cov(β̂|X) ≤ cov(β∗|X)

implies that for any fixed vector θ,

var[θ′β̂|X] ≤ var[θ′β∗|X].
The Gauss-Markov theorem implies that the least squares estimate β̂ is
the “BLUE” (best linear unbiased estimator) for the least squares model.
102 / 157
Optimality (proof of GMT)
The idea of the proof is to show that for any linear unbiased estimate β ∗
of β, β ∗ − β̂ and β̂ are uncorrelated. It follows that
cov(β ∗ |X)
= cov(β ∗ − β̂ + β̂|X)
=
cov(β ∗ − β̂|X) + cov(β̂|X)
≥
cov(β̂|X).
To prove the theorem note that
E [β ∗ |X] = M · E [Y |X] = MXβ = β
for all β, and let B = (X0 X)−1 X0 , so that
E [β̂|X] = BXβ = β
for all β, so (M − B)X ≡ 0.
103 / 157
Optimality (proof of GMT)
Therefore
cov(β ∗ − β̂, β̂|X)
= E [(M − B)Y (BY − β)0 |X]
= E [(M − B)YY 0 B 0 |X] − E [(M − B)Y β 0 |X]
=
(M − B)(Xββ 0 X0 + σ 2 I )B 0 − (M − B)Xββ 0
= σ 2 (M − B)B 0
=
0.
Note that we have an explicit expression for the gap between cov(β̂) and
cov(β ∗ ):
cov(β ∗ − β̂|X) = σ 2 (M − B)(M − B)0 .
104 / 157
Multivariate density families
If C is a covariance matrix, many families of multivariate densities have the form c · φ((x − µ)′C⁻¹(x − µ)), where c > 0 and φ : R+ → R+ is a function, typically decreasing with a mode at the origin (e.g. for the multivariate normal density, φ(u) = exp(−u/2), and for the multivariate t-distribution with d degrees of freedom, φ(u) = (1 + u/d)^(−(d+1)/2)).

Left side: level sets of g(x) = x′Cx
Right side: level sets of a density c · φ((x − µ)′C⁻¹(x − µ))

[Figure: contour plots of the quadratic form (left) and of the corresponding density (right).]
105 / 157
Regression inference with Gaussian errors
The random vector X = (X1, . . . , Xp)′ has a p-dimensional standard multivariate normal distribution if its components are independent and follow standard normal marginal distributions.

The density of X is the product of p standard normal densities:

p(X) = (2π)^(−p/2) exp(−(1/2) Σ_j Xj²) = (2π)^(−p/2) exp(−(1/2) X′X).
106 / 157
Regression inference with Gaussian errors
If we transform
Z = µ + AX ,
we get a random variable satisfying
E[Z] = µ
cov(Z) = A cov(X) A′ = AA′ ≡ Σ.
107 / 157
Regression inference with Gaussian errors
The density of Z can be obtained using the change of variables formula:

p(Z) = (2π)^(−p/2) |A⁻¹| exp(−(1/2)(Z − µ)′A⁻ᵀA⁻¹(Z − µ))
     = (2π)^(−p/2) |Σ|^(−1/2) exp(−(1/2)(Z − µ)′Σ⁻¹(Z − µ)).

This distribution is denoted N(µ, Σ). The log-density is

−(1/2) log|Σ| − (1/2)(Z − µ)′Σ⁻¹(Z − µ),

with the constant term dropped.
108 / 157
Regression inference with Gaussian errors
The joint log-density for an iid sample of size n is

−(n/2) log|Σ| − (1/2) Σ_i (Xi − µ)′Σ⁻¹(Xi − µ) = −(n/2) log|Σ| − (n/2) tr(Sxx Σ⁻¹),

where

Sxx = Σ_i (Xi − µ)(Xi − µ)′/n.
109 / 157
The Cholesky decomposition
If Σ is a non-singular covariance matrix, there is a lower triangular matrix
A with positive diagonal elements such that
AA′ = Σ.

This matrix A can be denoted Σ^(1/2), and is called the Cholesky square root.
110 / 157
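A minimal sketch assuming numpy: the Cholesky factor can be used to generate draws with a given covariance matrix, as in the transformation Z = µ + AX above. The covariance matrix and mean here are arbitrary illustrative values.

    import numpy as np

    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    A = np.linalg.cholesky(Sigma)        # lower triangular, A A' = Sigma
    print(np.allclose(A @ A.T, Sigma))   # True

    # Transform standard normal draws into N(mu, Sigma) draws
    rng = np.random.default_rng(6)
    mu = np.array([1.0, -2.0])
    Z = mu + (A @ rng.normal(size=(2, 100000))).T
    print(np.cov(Z, rowvar=False))       # approximately Sigma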
Properties of the multivariate normal distribution:
• A linear function of a multivariate normal random vector is also
multivariate normal. Specifically, if
X ∼ N(µ, Σ)

is p-variate normal, and θ is a q × p matrix with q ≤ p, then Y ≡ θX has a

N(θµ, θΣθ′)

distribution.

To prove this fact, write

X = µ + AZ,

where AA′ = Σ is the Cholesky decomposition.
111 / 157
Properties of the multivariate normal distribution:
Next, extend θ to a square invertible matrix

θ̃ = | θ  |
     | θ∗ |,

where θ∗ ∈ R^((p−q)×p).

The matrix θ∗ can be chosen such that

θΣθ∗′ = 0,

by the Gram-Schmidt procedure. Let

Ỹ = θ̃X = | Y  | = | θµ + θAZ   |
          | Y∗ |   | θ∗µ + θ∗AZ |.
112 / 157
Properties of the multivariate normal distribution:
Therefore

cov(Ỹ) = | θΣθ′      0       |
         | 0         θ∗Σθ∗′  |,

and

cov(Ỹ)⁻¹ = | (θΣθ′)⁻¹      0            |
           | 0             (θ∗Σθ∗′)⁻¹   |.

Using the change of variables formula, and the structure of the multivariate normal density, it follows that

p(Ỹ) = p(Y)p(Y∗).

This implies that Y and Y∗ are independent, and by inspecting the form of their densities, both are seen to be multivariate normal.
113 / 157
Properties of the multivariate normal distribution:
• A consequence of the above argument is that in general, uncorrelated
components of a multivariate normal vector are independent.
• If X is a standard multivariate normal vector, and Q is a square
orthogonal matrix, then QX is also standard multivariate normal. This
follows directly from the fact that QQ′ = I.
114 / 157
The χ2 distribution
If z is a standard normal random variable, the density of z² can be calculated directly as

p(x) = x^(−1/2) exp(−x/2)/√(2π),   x > 0.

This is the χ²₁ distribution. The χ²_p distribution is defined to be the distribution of

Σ_{j=1}^p zj²,

where z1, . . . , zp are iid standard normal random variables.
115 / 157
The χ2 distribution
By direct calculation, if F ∼ χ²₁, then

E[F] = 1,   var[F] = 2.

Therefore the mean of the χ²_p distribution is p and the variance is 2p.
116 / 157
The χ2 distribution
The χ²_p density is

p(x) = x^(p/2 − 1) exp(−x/2) / (Γ(p/2) 2^(p/2)).

To prove this by induction, assume that the χ²_(p−1) density is

x^((p−1)/2 − 1) exp(−x/2) / (Γ((p − 1)/2) 2^((p−1)/2)).

The χ²_p density is the density of the sum of a χ²_(p−1) random variable and a χ²₁ random variable, which can be written
117 / 157
The χ2 distribution
(1 / (Γ((p − 1)/2) Γ(1/2) 2^(p/2))) ∫_0^x s^((p−1)/2 − 1) exp(−s/2) (x − s)^(−1/2) exp(−(x − s)/2) ds

  = (1 / (Γ((p − 1)/2) Γ(1/2) 2^(p/2))) exp(−x/2) ∫_0^x s^((p−1)/2 − 1) (x − s)^(−1/2) ds

  = (1 / (Γ((p − 1)/2) Γ(1/2) 2^(p/2))) exp(−x/2) x^(p/2 − 1) ∫_0^1 u^(p/2 − 3/2) (1 − u)^(−1/2) du,

where u = s/x.

The density for χ²_p can now be obtained by applying the identities

Γ(1/2) = √π
∫_0^1 u^(α−1) (1 − u)^(β−1) du = Γ(α)Γ(β)/Γ(α + β).
118 / 157
The χ2 distribution and the RSS
Let P be a projection matrix and Z be a vector of iid standard normal values. For any square orthogonal matrix Q,

Z′PZ = (QZ)′QPQ′(QZ).

Since QZ is equal in distribution to Z, Z′PZ is equal in distribution to Z′QPQ′Z.

If the rank of P is k, we can choose Q so that QPQ′ is the projection onto the first k canonical basis vectors, i.e.

QPQ′ = | Ik   0 |
       | 0    0 |,

where Ik is the k × k identity matrix.
119 / 157
The χ2 distribution and the RSS
This gives us

Z′PZ =ᵈ Z′QPQ′Z = Σ_{j=1}^k Zj²,

which follows a χ²_k distribution.

It follows that

((n − p − 1)/σ²) σ̂² = Y′(I − P)Y/σ² = (ε/σ)′(I − P)(ε/σ) ∼ χ²_(n−p−1).
120 / 157
The χ2 distribution and the RSS
Thus when the errors are Gaussian, we have
var[σ̂²] = 2σ⁴/(n − p − 1).
121 / 157
The t distribution
Suppose Z is standard normal, V ∼ χ²_p, and V is independent of Z. Then

T = √p · Z/√V

has a “t distribution with p degrees of freedom.”

Note that by the law of large numbers, V/p converges almost surely to 1 as p → ∞. Therefore T converges in distribution to a standard normal distribution.

To derive the t density apply the change of variables formula. Let

(U, W) ≡ (Z/√V, V).
122 / 157
The t distribution
The Jacobian is

J = | 1/√V    −Z/(2V^(3/2)) |
    | 0        1            | = V^(−1/2) = W^(−1/2).

Since the joint density of Z and V is

p(Z, V) ∝ exp(−Z²/2) V^(p/2 − 1) exp(−V/2),

it follows that the joint density of U and W is

p(U, W) ∝ exp(−U²W/2) W^(p/2 − 1/2) exp(−W/2)
        = exp(−W(U² + 1)/2) W^(p/2 − 1/2).
123 / 157
The t distribution
Therefore

p(U) = ∫ p(U, W) dW
     ∝ ∫ exp(−W(U² + 1)/2) W^(p/2 − 1/2) dW
     = (U² + 1)^(−p/2 − 1/2) ∫ exp(−Y/2) Y^(p/2 − 1/2) dY
     ∝ (U² + 1)^(−p/2 − 1/2),

where Y = W(U² + 1).

Finally, write T = √p U to get

p(T) ∝ (T²/p + 1)^(−(p+1)/2).
124 / 157
Confidence intervals
The linear model residuals are
R ≡ Y − Ŷ = (I − P)Y
and the estimated coefficients are
β̂ = (X′X)⁻¹X′Y.
125 / 157
Confidence intervals
Recalling that (I − P)X = 0 and E[Y − Ŷ|X] = 0, it follows that

cov(Y − Ŷ, β̂) = E[(Y − Ŷ)β̂′]
             = E[(I − P)YY′X(X′X)⁻¹]
             = (I − P)(Xββ′X′ + σ²I)X(X′X)⁻¹
             = 0.

Therefore every estimated coefficient is uncorrelated with every residual. If ε is Gaussian, they are also independent.

Since σ̂² is a function of the residuals, it follows that if ε is Gaussian, then β̂ and σ̂² are independent.
126 / 157
Confidence interval for a regression coefficient
Let

Vk = [(X′X)⁻¹]kk,

so that

var[β̂k|X] = σ²Vk.

If the errors ε are multivariate Gaussian N(0, σ²I), then

(β̂k − βk)/(σ√Vk) ∼ N(0, 1).
127 / 157
Confidence intervals for a regression coefficient
Therefore

(n − p − 1)σ̂²/σ² ∼ χ²_(n−p−1),

and using the fact that β̂ and σ̂² are independent, it follows that the pivotal quantity

(β̂k − βk)/√(σ̂²Vk)

has a t-distribution with n − p − 1 degrees of freedom.
128 / 157
Confidence intervals for a regression coefficient
Therefore if QT(q, k) is the qth quantile of the t-distribution with k degrees of freedom, then for 0 ≤ α ≤ 1 and

qα = QT(1 − (1 − α)/2, n − p − 1),

P(−qα ≤ (β̂k − βk)/√(σ̂²Vk) ≤ qα) = α.

Rearranging terms we get the confidence interval

P(β̂k − σ̂qα√Vk ≤ βk ≤ β̂k + σ̂qα√Vk) = α,

which has coverage probability α.
129 / 157
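A sketch of this interval in code, assuming numpy and scipy, with simulated data and a 95% coverage target (α = 0.95). The explicit matrix inverse is used here for clarity of exposition, not numerical efficiency (cf. the earlier remark about the normal equations).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n, p = 80, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    beta = np.array([1.0, 0.5, -0.3])
    Y = X @ beta + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    bhat = XtX_inv @ X.T @ Y
    sigma2_hat = np.sum((Y - X @ bhat) ** 2) / (n - p - 1)

    alpha = 0.95
    q = stats.t.ppf(1 - (1 - alpha) / 2, df=n - p - 1)
    for k in range(p + 1):
        se = np.sqrt(sigma2_hat * XtX_inv[k, k])       # sigma_hat * sqrt(V_k)
        print(k, bhat[k] - q * se, bhat[k] + q * se)   # confidence interval for beta_k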
Confidence intervals for the expected response
Let x∗ be any point in R^(p+1). The expected response at X = x∗ is

E[Y|X = x∗] = β′x∗.

A point estimate for this value is β̂′x∗, which is unbiased since β̂ is unbiased, and has variance

var[β̂′x∗] = σ²x∗′(X′X)⁻¹x∗ ≡ σ²Vx∗.
130 / 157
Confidence intervals for the expected response
As above we have that

(β̂′x∗ − β′x∗)/√(σ̂²Vx∗)

has a t-distribution with n − p − 1 degrees of freedom.

Therefore

P(β̂′x∗ − σ̂qα√Vx∗ ≤ β′x∗ ≤ β̂′x∗ + σ̂qα√Vx∗) = α

defines a CI for E[Y|X = x∗] with coverage probability α.
131 / 157
Prediction intervals
Suppose Y∗ is a new observation at X = x∗, independent of the data used to estimate β̂ and σ̂². If the errors are Gaussian, then Y∗ − β̂′x∗ is Gaussian, with the following mean and variance:

E[Y∗ − β̂′x∗|X] = β′x∗ − β′x∗ = 0

and

var[Y∗ − β̂′x∗|X] = σ²(1 + Vx∗).
132 / 157
Prediction intervals
It follows that

(Y∗ − β̂′x∗)/√(σ̂²(1 + Vx∗))

has a t-distribution with n − p − 1 degrees of freedom. Therefore a prediction interval at x∗ with coverage probability α is defined by

P(β̂′x∗ − σ̂qα√(1 + Vx∗) ≤ Y∗ ≤ β̂′x∗ + σ̂qα√(1 + Vx∗)) = α.
133 / 157
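Continuing the same kind of sketch (assuming numpy and scipy, with a hypothetical new covariate vector x∗ chosen just for illustration), the prediction interval widens the expected-response interval through the extra 1 in 1 + Vx∗:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    n, p = 80, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    Y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    bhat = XtX_inv @ X.T @ Y
    sigma2_hat = np.sum((Y - X @ bhat) ** 2) / (n - p - 1)
    q = stats.t.ppf(0.975, df=n - p - 1)

    x_star = np.array([1.0, 0.2, -1.0])          # hypothetical new point (includes intercept)
    v = x_star @ XtX_inv @ x_star                # V_{x*}
    fit = bhat @ x_star
    ci = (fit - q * np.sqrt(sigma2_hat * v), fit + q * np.sqrt(sigma2_hat * v))            # for E[Y|x*]
    pi = (fit - q * np.sqrt(sigma2_hat * (1 + v)), fit + q * np.sqrt(sigma2_hat * (1 + v)))  # for Y*
    print(ci, pi)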
Wald tests
Suppose we want to carry out a formal hypothesis test for the null
hypothesis βk = 0, for some specified index k.
Since var(β̂k |X) = σ 2 Vk , the “Z-score”
β̂k /
p
σ 2 Vk
follows a standard normal distribution under the null hypothesis. The
“Z-test”
√ or “asymptotic Wald test” rejects the null hypothesis if
|β̂k |/ σ 2 Vk > F −1 (1 − α/2).
More generally, we can also test hypotheses of the form βk = c using the
Z-score
(β̂k − c)/
p
σ 2 Vk
134 / 157
Wald tests
If the errors are normally distributed, then

β̂k/√(σ̂²Vk)

follows a t-distribution with n − p − 1 degrees of freedom under the null hypothesis, and the Wald test rejects the null hypothesis if |β̂k|/√(σ̂²Vk) > Ft⁻¹(1 − α/2, n − p − 1), where Ft(·, d) is the Student t CDF with d degrees of freedom, and α is the type-I error probability (e.g. α = 0.05).
135 / 157
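A minimal sketch assuming numpy and scipy: t-based Wald statistics and two-sided p-values for each coefficient of a simulated fit (the data and true coefficients are arbitrary).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    n, p = 120, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    Y = X @ np.array([0.0, 0.4, 0.0]) + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    bhat = XtX_inv @ X.T @ Y
    sigma2_hat = np.sum((Y - X @ bhat) ** 2) / (n - p - 1)

    for k in range(p + 1):
        t_stat = bhat[k] / np.sqrt(sigma2_hat * XtX_inv[k, k])
        p_value = 2 * stats.t.sf(abs(t_stat), df=n - p - 1)   # two-sided p-value
        print(k, t_stat, p_value)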
Wald tests for contrasts
More generally, we can consider the contrast θ′β defined by a fixed vector θ ∈ R^(p+1). The population value of the contrast can be estimated by the plug-in estimate θ′β̂.

We can test the null hypothesis θ′β = 0 with the Z-score

θ′β̂/√(σ̂²θ′(X′X)⁻¹θ).

This approximately follows a standard normal distribution, and more “exactly” (under specified assumptions) it follows a Student t-distribution with n − p − 1 degrees of freedom (all under the null hypothesis).
136 / 157
F-tests
Suppose we have two nested design matrices X1 ∈ Rn×p1 and
X2 ∈ Rn×p2 , such that
col(X1 ) ⊂ col(X2 ).
We may wish to compare the model defined by X1 to the model defined
by X2 . To do this, we need a test statistic that discriminates between the
two models.
Let P1 and P2 be the corresponding projections, and let
Ŷ(1) = P1 Y,
Ŷ(2) = P2 Y

be the fitted values.
137 / 157
F-tests
Due to the nesting, P2 P1 = P1 and P1 P2 = P1. Therefore

(P2 − P1)² = P2 − P1,

so P2 − P1 is a projection; it projects onto the orthogonal complement of col(X1) in col(X2).

Since

(I − P2)(P2 − P1) = 0,

it follows that if E[Y] ∈ col(X2) then

cov(Y − Ŷ(2), Ŷ(2) − Ŷ(1)) = E[(I − P2)YY′(P2 − P1)] = 0.

If the linear model errors are Gaussian, Y − Ŷ(2) and Ŷ(2) − Ŷ(1) are independent.
138 / 157
F-tests
Since P2 X1 = P1 X1 = X1, we have

(I − P2)X1 = (P2 − P1)X1 = 0.

Now suppose we take as the null hypothesis that E[Y] ∈ col(X1), so we can write Y = θ + ε, where θ ∈ col(X1). Therefore under the null hypothesis

‖Y − Ŷ(2)‖² = tr[(I − P2)YY′] = tr[(I − P2)εε′]

and

‖Ŷ(2) − Ŷ(1)‖² = tr[(P2 − P1)YY′] = tr[(P2 − P1)εε′].
139 / 157
F-tests
Since I − P2 and P2 − P1 are projections onto subspaces of dimension n − p2 and p2 − p1, respectively, it follows that

‖Y − Ŷ(2)‖²/σ² = ‖(I − P2)Y‖²/σ² ∼ χ²_(n−p2)

and under the null hypothesis

‖Ŷ(2) − Ŷ(1)‖²/σ² = ‖(P2 − P1)Y‖²/σ² ∼ χ²_(p2−p1).

Therefore consider the ratio

[‖Ŷ(2) − Ŷ(1)‖²/(p2 − p1)] / [‖Y − Ŷ(2)‖²/(n − p2)].

Since ‖Ŷ(2) − Ŷ(1)‖² will tend to be large when E[Y] ∉ col(X1), i.e. when the null hypothesis is false, this quantity can be used as a test-statistic. It is called the F-test statistic.
140 / 157
The F distribution
The F-test statistic is the ratio between two rescaled independent χ2
draws.
If U ∼ χ²_p and V ∼ χ²_q are independent, then

(U/p)/(V/q)

has an “F-distribution with p, q degrees of freedom,” denoted Fp,q.

To derive the kernel of the density, let

(X, Y) ≡ (U/V, V).
141 / 157
The F distribution
The Jacobian of the map is

| 1/V    −U/V² |
| 0       1    | = 1/V.

The joint density of U and V is

p(U, V) ∝ U^(p/2 − 1) exp(−U/2) V^(q/2 − 1) exp(−V/2).

The joint density of X and Y is

p(X, Y) ∝ X^(p/2 − 1) Y^((p+q)/2 − 1) exp(−Y(X + 1)/2).
142 / 157
The F distribution
Now let
Z = Y(X + 1),

so

p(X, Z) ∝ X^(p/2 − 1) Z^((p+q)/2 − 1) (X + 1)^(−(p+q)/2) exp(−Z/2)

and hence

p(X) ∝ X^(p/2 − 1)/(X + 1)^((p+q)/2).
143 / 157
The F distribution
Now if we let

F = (U/p)/(V/q) = (q/p)X,

then

p(F) ∝ F^(p/2 − 1)/(pF/q + 1)^((p+q)/2).
144 / 157
The F distribution
Therefore, under the null hypothesis the F-test statistic follows an F_(p2−p1, n−p2) distribution:

[‖Ŷ(2) − Ŷ(1)‖²/(p2 − p1)] / [‖Y − Ŷ(2)‖²/(n − p2)] ∼ F_(p2−p1, n−p2).
145 / 157
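A sketch of the F-test for nested models, assuming numpy and scipy; the smaller design X1 here is an intercept-only model nested in a two-covariate model X2, and the data are simulated for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(10)
    n = 100
    x = rng.normal(size=(n, 2))
    Y = 1.0 + 0.5 * x[:, 0] + rng.normal(size=n)

    X1 = np.ones((n, 1))                          # p1 = 1 column
    X2 = np.column_stack([np.ones(n), x])         # p2 = 3 columns, col(X1) nested in col(X2)

    def fitted(X, Y):
        b, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return X @ b

    Yhat1, Yhat2 = fitted(X1, Y), fitted(X2, Y)
    p1, p2 = X1.shape[1], X2.shape[1]

    F = (np.sum((Yhat2 - Yhat1) ** 2) / (p2 - p1)) / (np.sum((Y - Yhat2) ** 2) / (n - p2))
    p_value = stats.f.sf(F, p2 - p1, n - p2)      # upper tail of F_{p2-p1, n-p2}
    print(F, p_value)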
Simultaneous confidence intervals
If θ is a fixed vector, we can cover the value θ′β with probability α by pivoting on the t_(n−p−1)-distributed pivotal quantity

(θ′β̂ − θ′β)/(σ̂√Vθ),

where

Vθ = θ′(X′X)⁻¹θ.

Now suppose we have a set T ⊂ R^(p+1) of vectors θ, and we want to construct a set of confidence intervals such that

P(all θ′β covered, θ ∈ T) = α.

We call this a set of simultaneous confidence intervals for {θ′β; θ ∈ T}.
146 / 157
The Bonferroni approach to simultaneous confidence
intervals
The Bonferroni approach can be applied when T is a finite set, |T | = k.
Let

Ij = I(CI j covers θj′β)

and

Ijᶜ = I(CI j does not cover θj′β).

Then the union bound implies that

P(I1 and I2 · · · and Ik) = 1 − P(I1ᶜ or I2ᶜ · · · or Ikᶜ)
                          ≥ 1 − Σ_j P(Ijᶜ).
147 / 157
The Bonferroni approach to simultaneous confidence
intervals
As long as

1 − α ≥ Σ_j P(Ijᶜ),
the intervals cover simultaneously. One way to achieve this is if each
interval individually has probability
α0 ≡ 1 − (1 − α)/k
of covering its corresponding true value. To do this, use the same
approach as used to construct single confidence intervals, but with a
much larger value of qα .
148 / 157
The Scheffé approach to simultaneous confidence intervals
The Scheffé approach can be applied if T is a linear subspace of Rp+1 .
Begin with the pivotal quantity

(θ′β̂ − θ′β)/√(σ̂²θ′(X′X)⁻¹θ),

and postulate that a symmetric interval can be found so that

P(−Qα ≤ (θ′β̂ − θ′β)/√(σ̂²θ′(X′X)⁻¹θ) ≤ Qα for all θ ∈ T) = α.

Equivalently, we can write

P(sup_(θ∈T) (θ′β̂ − θ′β)²/(σ̂²θ′(X′X)⁻¹θ) ≤ Qα²) = α.
149 / 157
The Scheffé approach to simultaneous confidence intervals

Since

β̂ − β = (X′X)⁻¹X′ε,

we have

(θ′β̂ − θ′β)²/(σ̂²θ′(X′X)⁻¹θ) = θ′(X′X)⁻¹X′εε′X(X′X)⁻¹θ / (σ̂²θ′(X′X)⁻¹θ)
                              = Mθ′εε′Mθ / (σ̂²Mθ′Mθ),

where

Mθ = X(X′X)⁻¹θ.
150 / 157
The Scheffé approach to simultaneous confidence intervals
Note that

Mθ′εε′Mθ / (σ̂²Mθ′Mθ) = ⟨ε, Mθ/‖Mθ‖⟩²/σ̂²,

i.e. it is the squared length of the projection of ε onto the line spanned by Mθ (divided by σ̂²).

The quantity ⟨ε, Mθ/‖Mθ‖⟩² is maximized (over θ ∈ T) at ‖Pε‖², where P is the projection onto the linear space

{X(X′X)⁻¹θ | θ ∈ T} = {Mθ}.

Therefore

sup_(θ∈T) ⟨ε, Mθ/‖Mθ‖⟩²/σ̂² = ‖Pε‖²/σ̂²,

and since {Mθ} ⊂ col(X), it follows that Pε and σ̂² are independent.
151 / 157
The Scheffé approach to simultaneous confidence intervals
Moreover,

‖Pε‖²/σ² ∼ χ²_q

where q = dim(T), and as we know,

(n − p − 1)σ̂²/σ² ∼ χ²_(n−p−1).

Thus

(‖Pε‖²/q)/σ̂² ∼ F_(q, n−p−1).
152 / 157
The Scheffé approach to simultaneous confidence intervals
Let QF be the α quantile of the F_(q,n−p−1) distribution. Then

P(|θ′β̂ − θ′β|/√(θ′(X′X)⁻¹θ) ≤ σ̂√(qQF) for all θ ∈ T) = α,

so

P(θ′β̂ − σ̂√(qQF)·√Vθ ≤ θ′β ≤ θ′β̂ + σ̂√(qQF)·√Vθ for all θ ∈ T) = α

defines a level α simultaneous confidence set for {θ′β | θ ∈ T}, where Vθ = θ′(X′X)⁻¹θ.
153 / 157
Polynomial regression
The conventional linear model has the mean specification
E[Y|X1, . . . , Xp] = β0 + β1 X1 + · · · + βp Xp.

It is possible to accommodate nonlinear relationships while still working with linear models.

Polynomial regression is a traditional approach to doing this. If there is only one predictor variable, polynomial regression uses the mean structure

E[Y|X] = β0 + β1 X + β2 X² + · · · + βp X^p.
Note that this is still a linear model, as it is linear in the coefficients β.
Multiple regression techniques (e.g. OLS) can be used for estimation and
inference.
154 / 157
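A sketch of a polynomial fit as a linear model, assuming numpy; np.vander builds the design matrix with columns 1, X, X², … and the data here are simulated from an arbitrary quadratic mean function.

    import numpy as np

    rng = np.random.default_rng(11)
    n = 200
    x = rng.uniform(-2, 2, size=n)
    y = 1.0 + x - 0.5 * x ** 2 + rng.normal(scale=0.3, size=n)

    degree = 3
    X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, x^3
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit
    print(coefs)                                    # approximately [1, 1, -0.5, 0]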
Functional linear regression
A drawback of polynomial regression is that the polynomials can be
highly colinear (e.g. cor(U, U²) ≈ 0.97 if U is uniform on (0, 1)). Also,
polynomials are unbounded and it is often desirable to model E [Y |X ] as
a bounded function.
If there is a single covariate X , we can use a model of the form
E [Y |X ] = β1 φ1 (X ) + · · · + βp φp (X )
where the φj (·) are basis functions that are chosen based on what we
think E [Y |X ] might look like. Splines, sinusoids, wavelets, and radial
basis functions are possible choices.
Since the mean function is linear in the unknown parameters βj , this is a
linear model and can be estimated using multiple linear regression (OLS)
techniques.
155 / 157
Confidence bands
Suppose we have a functional linear model of the form
E[Y|X, t] = β′X + f(t),

and we model f(t) as

f(t) = Σ_{j=1}^q γj φj(t)

where the φj(·) are basis functions.

A confidence band for f with coverage probability α is an expression of the form

f̂(t) ± M(t)

such that

P(f̂(t) − M(t) ≤ f(t) ≤ f̂(t) + M(t) ∀t) = α.
156 / 157
Confidence bands
We can use as a point estimate
f̂(t) ≡ Σ_{j=1}^q γ̂j φj(t).

For each fixed t, Σ_{j=1}^q γ̂j φj(t) is a linear combination of the γ̂j, so if we simultaneously cover all linear combinations of the γj, we will have our confidence band. Thus the Scheffé procedure can be applied with T = Rq.
157 / 157