MATH341 Handout

Linear Models
Victor Panaretos
Institut de Mathématiques – EPFL
victor.panaretos@epfl.ch

What is a Regression Model?

Statistical model for: Y (random variable) depending on x (non-random variable).
Aim: understand the effect of x on the random quantity Y.

General formulation(1):
    Y ~ Distribution{g(x)}

Statistical problem: estimate (learn) g(·) from data {(x_i, y_i)}_{i=1}^n. Use for:
- Description
- Inference
- Prediction
- Data compression (parsimonious representations)
- ...

(1) Often books/people write Y | x ~ Distribution{g(x)}, but this implies that (X, Y) have a joint distribution; this assumption is unnecessary (e.g., in a designed experiment we choose the values for x). Despite this, we write Y | x to remind ourselves that the distribution of Y depends on x.
Example: Gas mileage

[Figure: histogram of Auto$mpg (frequency against miles per gallon) — miles per gallon for 392 car models.]

Example: Honolulu tide

[Figure: recorded tide heights in Honolulu.]
Example: Honolulu tide with time covariate

[Figure: recorded tide heights in Honolulu plotted against time.]

Example: Gas mileage with horsepower covariate

[Figure: scatterplot of gas mileage (MPG) against horsepower.]
Great Variety of Models

x can be:
- continuous, discrete, categorical, vector, ...
- arrive randomly, or be chosen by the experimenter, or both.
However x arises, we treat it as constant in the analysis.

Distribution can be:
- Gaussian (Normal), Laplace, binomial, Poisson, gamma, general exponential family, ...

Function g(·) can be:
- g(x) = β_0 + β_1 x,  g(x) = Σ_{k=1}^K β_k e^{ikx},  cubic spline, ...

Fundamental Case: Normal Linear Regression

Remember the general model:  Y ~ Distribution{g(x)}.

Fundamental case: Y, x ∈ R, g(x) = β_0 + β_1 x, Distribution = Gaussian:
    Y | x ~ N(β_0 + β_1 x, σ²)
    ⟺
    Y = β_0 + β_1 x + ε,   ε ~ N(0, σ²).
The second version is useful for mathematical work, but is puzzling statistically, since we don't observe ε.

Also, x could be a vector (Y, β_0 ∈ R, x ∈ R^p, β ∈ R^p):
    Y | x ~ N(β_0 + β^T x, σ²)
    ⟺
    Y = β_0 + β^T x + ε,   ε ~ N(0, σ²).
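To make the fundamental case concrete, here is a minimal R sketch (R is the language used in the examples later in this handout) that simulates data from the simple Normal linear model and fits it; the sample size, coefficients and noise level below are illustrative choices, not values taken from the course.

    set.seed(1)
    n <- 100
    x <- runif(n, 0, 10)                  # covariate values, treated as fixed in the analysis
    beta0 <- 2; beta1 <- 0.5; sigma <- 1  # "true" parameters for the simulation
    y <- beta0 + beta1 * x + rnorm(n, sd = sigma)   # Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)

    fit <- lm(y ~ x)                      # least squares / maximum likelihood fit
    summary(fit)$coefficients             # estimates of beta0 and beta1 with standard errors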
Example: Professor's Van

[Figure slides introducing the Professor's Van example (images not reproduced here).]
Tools of the trade

Start from the Normal linear model → gradually generalise.

Important features of the Normal linear model:
- Gaussian distribution
- Linearity
These two combine well and give geometric insights to solve the estimation problem. Thus we need to revise some linear algebra and probability ...

Will base the course on the Gaussian assumption, but relax linearity later:
- linear Gaussian regression
- nonlinear Gaussian regression
- nonparametric Gaussian regression
Many further generalisations are possible ...

Projections, Spectra, Gaussian Law
Reminder: Subspaces, Spectra, Projections

If Q is an n × p real matrix, we define the column space (or range) of Q to be the set spanned by its columns:
    M(Q) = {y ∈ R^n : ∃ β ∈ R^p, y = Qβ}.
- M(Q) is a subspace of R^n.
- The columns of Q provide a coordinate system for the subspace M(Q).
- If Q is of full column rank (p), then the coordinates corresponding to a y ∈ M(Q) are unique.
- Allows interpretation of the system of linear equations Qβ = y:
  [existence of a solution ↔ is y an element of M(Q)?]
  [uniqueness of a solution ↔ is there a unique coordinate vector β?]

Reminder: Subspaces, Spectra, Projections

Two further important subspaces are associated with a real n × p matrix Q:
- the null space (or kernel), ker(Q), of Q is the subspace defined as
      ker(Q) = {x ∈ R^p : Qx = 0};
- the orthogonal complement of M(Q), M⊥(Q), is the subspace defined as
      M⊥(Q) = {y ∈ R^n : y^T Qx = 0, ∀ x ∈ R^p} = {y ∈ R^n : y^T v = 0, ∀ v ∈ M(Q)}.
The orthogonal complement may be defined for arbitrary subspaces by using the second equality.
Reminder: Subspaces, Spectra, Projections

Theorem (Spectral Theorem). A p × p matrix Q is symmetric if and only if there exists a p × p orthogonal matrix U and a diagonal matrix Λ such that
    Q = U Λ U^T.
In particular:
1. the columns of U = (u_1 ... u_p) are eigenvectors of Q, i.e. there exist λ_j such that
       Q u_j = λ_j u_j,  j = 1, ..., p;
2. the entries of Λ = diag(λ_1, ..., λ_p) are the corresponding eigenvalues of Q, which are real; and
3. the rank of Q is the number of non-zero eigenvalues.
Note: if the eigenvalues are distinct, the eigenvectors are unique (up to re-ordering).

Theorem (Singular Value Decomposition). Any n × p real matrix can be factorised as
    Q_{n×p} = U_{n×n} D_{n×p} V^T_{p×p},
where U and V^T are orthogonal with columns called left singular vectors and right singular vectors, respectively, and D is diagonal with real entries called singular values.
1. The left singular vectors are eigenvectors of QQ^T.
2. The right singular vectors are eigenvectors of Q^T Q.
3. The squares of the singular values are eigenvalues of both QQ^T and Q^T Q.
4. The left singular vectors corresponding to non-zero singular values form an orthonormal basis for M(Q).
5. The left singular vectors corresponding to zero singular values form an orthonormal basis for M⊥(Q).
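A quick numerical illustration of the two decompositions in R: eigen() gives the spectral decomposition of a symmetric matrix and svd() the singular value decomposition. The matrix below is an arbitrary example, not taken from the slides.

    set.seed(2)
    Q <- matrix(rnorm(12), nrow = 4, ncol = 3)       # arbitrary 4 x 3 real matrix

    S  <- crossprod(Q)                               # Q^T Q, a symmetric 3 x 3 matrix
    es <- eigen(S)                                   # spectral decomposition S = U Lambda U^T
    max(abs(S - es$vectors %*% diag(es$values) %*% t(es$vectors)))   # ~ 0

    sv <- svd(Q)                                     # Q = U D V^T
    max(abs(Q - sv$u %*% diag(sv$d) %*% t(sv$v)))                    # ~ 0
    round(sv$d^2, 6)                                 # squared singular values ...
    round(eigen(tcrossprod(Q))$values, 6)            # ... are the non-zero eigenvalues of Q Q^T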
Reminder: Subspaces, Spectra, Projections

Recall that a matrix Q is called idempotent if Q² = Q.
An orthogonal projection (henceforth projection) onto a subspace V is a symmetric idempotent matrix H such that M(H) = V.

Proposition. The only possible eigenvalues of a projection matrix are 0 and 1.

Proposition. Let V be a subspace and H be a projection onto V. Then Hy = y for all y ∈ V.

Proposition. If P and Q are projection matrices onto a subspace V, then P = Q.

Proposition. Let V be a subspace and H be a projection onto V. Then I − H is the projection matrix onto V⊥.

Proof. (I − H)^T = I − H^T = I − H since H is symmetric, and (I − H)² = I − 2H + H² = I − H. Thus I − H is a projection matrix. It remains to identify the column space of I − H. Let H = U Λ U^T be the spectral decomposition of H. Then I − H = UU^T − U Λ U^T = U(I − Λ)U^T. Hence the column space of I − H is spanned by the eigenvectors of H corresponding to zero eigenvalues of H, which coincides with M⊥(H) = V⊥.

Proposition. If x_1, ..., x_p are linearly independent and are such that span(x_1, ..., x_p) = V, then the projection onto V can be represented as
    H = X (X^T X)^{-1} X^T,
where X is a matrix with columns x_1, ..., x_p.
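The last proposition is easy to check numerically. A minimal R sketch, with an arbitrary full-column-rank matrix X (not taken from the slides):

    set.seed(3)
    X <- cbind(1, rnorm(6), runif(6))          # 6 x 3 matrix of full column rank
    H <- X %*% solve(crossprod(X)) %*% t(X)    # H = X (X^T X)^{-1} X^T

    max(abs(H - t(H)))          # ~ 0: H is symmetric
    max(abs(H %*% H - H))       # ~ 0: H is idempotent
    max(abs(H %*% X - X))       # ~ 0: Hy = y for every y in M(X)
    round(eigen(H)$values, 6)   # eigenvalues: three 1's and three 0's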
Reminder: Subspaces, Spectra, Projections

Proposition. Let V be a subspace of R^n and H be a projection onto V. Then
    ‖x − Hx‖ ≤ ‖x − v‖,  ∀ v ∈ V.

Proof. Let H = U Λ U^T be the spectral decomposition of H, U = (u_1 ... u_n) and Λ = diag(λ_1, ..., λ_n). Letting p = dim(V),
1. λ_1 = ... = λ_p = 1 and λ_{p+1} = ... = λ_n = 0,
2. u_1, ..., u_n is an orthonormal basis of R^n,
3. u_1, ..., u_p is an orthonormal basis of V.
Then
    ‖x − Hx‖² = Σ_{i=1}^n (x^T u_i − (Hx)^T u_i)²          [orthonormal basis]
              = Σ_{i=1}^n (x^T u_i − x^T H u_i)²            [H is symmetric]
              = Σ_{i=1}^n (x^T u_i − λ_i x^T u_i)²          [u's are eigenvectors of H]
              = 0 + Σ_{i=p+1}^n (x^T u_i)²                  [eigenvalues 0 or 1]
              ≤ Σ_{i=1}^p (x^T u_i − v^T u_i)² + Σ_{i=p+1}^n (x^T u_i)²   ∀ v ∈ V
              = ‖x − v‖².
Reminder: Subspaces, Spectra, Projections

Proposition. Let V_1 ⊆ V ⊆ R^n be two nested linear subspaces. If H_1 is the projection onto V_1 and H is the projection onto V, then
    H H_1 = H_1 = H_1 H.

Proof. First we show that H H_1 = H_1, and then that H_1 H = H H_1. For all y ∈ R^n we have H_1 y ∈ V_1. But then H_1 y ∈ V, since V_1 ⊆ V. Therefore H H_1 y = H_1 y. We have shown that (H H_1 − H_1) y = 0 for all y ∈ R^n, so that H H_1 − H_1 = 0, as its kernel is all of R^n. Hence H H_1 = H_1.
(Or, take n linearly independent vectors y_1, ..., y_n ∈ R^n, and use them as columns of the n × n matrix Y. Now Y is invertible, and (H H_1 − H_1) Y = 0, so H H_1 − H_1 = 0, giving H H_1 = H_1.)
To prove that H_1 H = H H_1, note that symmetry of projection matrices and the first part of the proof give
    H_1 H = H_1^T H^T = (H H_1)^T = (H_1)^T = H_1 = H H_1.

Positive-Definite Matrices

Definition (Non-Negative Matrix – Quadratic Form Definition). A p × p real symmetric matrix Σ is called non-negative definite (written Σ ⪰ 0) if and only if x^T Σ x ≥ 0 for all x ∈ R^p. If x^T Σ x > 0 for all x ∈ R^p \ {0}, then we call Σ positive definite (written Σ ≻ 0).

An equivalent definition is:

Definition (Non-Negative Matrix – Spectral Definition). A p × p real symmetric matrix Σ is called non-negative definite (written Σ ⪰ 0) if and only if the eigenvalues of Σ are non-negative. If the eigenvalues of Σ are strictly positive, then Σ is called positive definite (written Σ ≻ 0).

Lemma (Exercise). Prove that the two definitions are equivalent.
Covariance Matrices

Definition (Covariance Matrix). Let Y = (Y_1, ..., Y_n)^T be a random n × 1 vector such that E‖Y‖² < ∞. The covariance matrix of Y, say Σ, is the n × n symmetric matrix with entries
    Σ_ij = cov(Y_i, Y_j) = E[(Y_i − E[Y_i])(Y_j − E[Y_j])],   1 ≤ i ≤ j ≤ n.
That is, the covariance matrix encodes the variances of the coordinates of Y (on the diagonal) and the covariances between the coordinates of Y (off the diagonal). If we write
    μ = E[Y] = (E[Y_1], ..., E[Y_n])^T
for the mean vector of Y, then the covariance matrix of Y can be written as
    Σ = E[(Y − μ)(Y − μ)^T] = E[Y Y^T] − μ μ^T.
Whenever Y is a random vector, we will write cov(Y) or var(Y) for the covariance matrix of Y.

Lemma. Let Y be a random d-vector such that E‖Y‖² < ∞. Let α, β ∈ R^d be fixed vectors. If Σ denotes the covariance matrix of Y,
- the variance of α^T Y is α^T Σ α;
- the covariance of α^T Y with β^T Y is α^T Σ β.
Proof. Exercise.

Corollary (Covariance of Projections). Let Y be a random d-vector such that E‖Y‖² < ∞. Let μ be the mean vector and Σ be the covariance matrix of Y. If A is a p × d real matrix, the mean vector and covariance matrix of AY are Aμ and A Σ A^T, respectively.
Covariance Matrices and Non-Negative Matrices

Proposition (Non-Negative and Covariance Matrices). Let Σ be a real symmetric matrix. Then Σ is non-negative definite if and only if Σ is the covariance matrix of some random variable Y.
Proof. Exercise.

Principal Component Analysis

Let Y be a random vector in R^d with covariance matrix Σ.
- Find the direction v_1 ∈ S^{d−1} such that the projection of Y onto v_1 has maximal variance.
- For j = 2, 3, ..., d, find the direction v_j ⊥ v_1, ..., v_{j−1} such that the projection of Y onto v_j has maximal variance.
Solution: maximise Var(v_1^T Y) = v_1^T Σ v_1 over ‖v_1‖ = 1. Now
    v_1^T Σ v_1 = v_1^T U Λ U^T v_1 = ‖Λ^{1/2} U^T v_1‖² = Σ_{i=1}^d λ_i (u_i^T v_1)²   [change of basis]
But Σ_{i=1}^d (u_i^T v_1)² = ‖v_1‖² = 1, so we have a convex combination of the {λ_j}_{j=1}^d:
    Σ_i p_i λ_i,   Σ_i p_i = 1,   p_i ≥ 0,   i = 1, ..., d.
But λ_1 ≥ λ_i ≥ 0 (ordering the eigenvalues decreasingly), so clearly this sum is maximised when p_1 = 1 and p_j = 0 for all j ≠ 1, i.e. v_1 = u_1.
Iteratively, v_j = u_j, i.e. the principal components are eigenvectors of Σ.
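In R, the eigendecomposition of the empirical covariance matrix gives the principal component directions directly; prcomp() performs the same computation via the SVD. A minimal sketch on simulated data (all numbers below are arbitrary illustrations):

    set.seed(4)
    A <- matrix(c(2, 0, 0,
                  1, 1, 0,
                  0, 0, 0.3), 3, 3)
    Y <- matrix(rnorm(200 * 3), ncol = 3) %*% A   # 200 draws of a correlated 3-d vector

    S  <- cov(Y)                # empirical covariance matrix
    ev <- eigen(S)              # spectral decomposition of the covariance
    ev$vectors[, 1]             # first principal direction v1 = u1 (up to sign)
    prcomp(Y)$rotation[, 1]     # same direction from prcomp (possibly with opposite sign)
    ev$values                   # variances of the projections, in decreasing order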
Principal Component Analysis

Theorem (Optimal Linear Dimension Reduction). Let Y be a mean-zero random variable in R^d with d × d covariance Σ. Let H be the projection matrix onto the span of the first k eigenvectors of Σ. Then
    E‖Y − HY‖² ≤ E‖Y − QY‖²
for any d × d projection operator Q of rank at most k.

Intuitively: if you want to approximate a mean-zero random variable taking values in R^d by a random variable that ranges over a subspace of dimension at most k ≤ d, the optimal choice is the projection of the random variable onto the space spanned by its first k principal components (eigenvectors of the covariance). "Optimal" is with respect to the mean squared error.

For the proof, use the lemma below (it follows immediately from the spectral decomposition):

Lemma. Q is a rank-k projection matrix if and only if there exist orthonormal vectors {v_j}_{j=1}^k such that Q = Σ_{j=1}^k v_j v_j^T.

Proof of the Optimal Linear Dimension Reduction Theorem. Write Q = Σ_{j=1}^k v_j v_j^T for some orthonormal {v_j}_{j=1}^k. Then,
    E‖Y − QY‖² = E[Y^T (I − Q)^T (I − Q) Y] = E tr{(I − Q) Y Y^T (I − Q)^T}
               = tr{(I − Q) E[Y Y^T] (I − Q)^T} = tr{(I − Q)^T (I − Q) Σ}
               = tr{(I − Q) Σ}
               = tr{Σ} − tr{ Σ_{j=1}^k v_j v_j^T Σ }
               = Σ_{i=1}^d λ_i − Σ_{j=1}^k v_j^T Σ v_j
               = Σ_{i=1}^d λ_i − Σ_{j=1}^k Var(v_j^T Y).
If we can minimise this expression over all {v_j}_{j=1}^k with v_i^T v_j = 1{i = j}, then we're done. By PCA, this is done by choosing the top k eigenvectors of Σ.
Principal Component Analysis

Corollary. Let {x_1, ..., x_n} ⊂ R^d be such that x_1 + ... + x_n = 0, and let X be the matrix with columns {x_i}_{i=1}^n. The best approximating k-hyperplane to the points {x_1, ..., x_n} is given by the span of the first k eigenvectors of the matrix X X^T, i.e. if H is the projection onto this span, it holds that
    Σ_{i=1}^n ‖x_i − H x_i‖² ≤ Σ_{i=1}^n ‖x_i − Q x_i‖²
for any d × d projection operator Q of rank at most k.
Proof. Define the discrete random vector Y by P[Y = x_i] = 1/n (its covariance is proportional to X X^T), and use optimal linear dimension reduction.

Gaussian Vectors and Affine Transformations

Definition (Multivariate Gaussian Distribution). A random vector Y in R^d has the multivariate normal distribution if and only if α^T Y has the univariate normal distribution, ∀ α ∈ R^d.

Observation: From the definition it follows that Y must have some well-defined mean vector μ and some well-defined covariance matrix Σ. To see this, note that since E{(α^T Y)²} < ∞ for all α, we can successively pick α to be equal to each canonical basis vector and conclude that each coordinate has finite variance and thus E‖Y‖² < ∞. So all the means, variances and covariances of its coordinates are well defined.
Then, the mean vector μ (say) and covariance matrix Σ (say) can be (uniquely) determined entrywise by equating
    μ_i = E[e_i^T Y]   and   Σ_ij = cov{e_i^T Y, e_j^T Y},
where e_j is the j-th canonical basis vector e_j = (0, 0, ..., 1, ..., 0, 0)^T (with the 1 in the j-th position).
Gaussian Vectors and Affine Transformations

How can we use this definition to determine basic properties?
The moment generating function (MGF) of a random vector W in R^d is defined as
    M_W(θ) = E[e^{θ^T W}],   θ ∈ R^d,
provided the expectation exists. When the MGF exists it characterises the distribution of the random vector. Furthermore, two random vectors are independent if and only if their joint MGF is the product of their marginal MGFs, i.e.
    X_{n×1} independent of Y_{m×1}  ⟺  E[e^{θ^T X + φ^T Y}] = E[e^{θ^T X}] E[e^{φ^T Y}],   ∀ θ ∈ R^n & φ ∈ R^m.

Gaussian Vectors and Affine Transformations

Useful facts:
1. The moment generating function of Y ~ N(μ, Σ) is M_Y(u) = exp(u^T μ + ½ u^T Σ u).
2. If Y ~ N(μ_{p×1}, Σ_{p×p}) and given B_{n×p} and ξ_{n×1}, then ξ + BY ~ N(ξ + Bμ, B Σ B^T).
3. The N(μ, Σ) density, assuming Σ nonsingular, is
       f_Y(y) = (2π)^{-p/2} |Σ|^{-1/2} exp{−½ (y − μ)^T Σ^{-1} (y − μ)}.
4. Constant-density isosurfaces are ellipsoidal.
5. Marginals of a Gaussian are Gaussian (the converse is NOT true).
6. Σ diagonal ⟺ independent coordinates Y_j.
7. If Y ~ N(μ_{p×1}, Σ_{p×p}), then AY independent of BY ⟺ A Σ B^T = 0.

Proposition (Property 1: Moment Generating Function). The moment generating function of Y ~ N(μ, Σ) is
    M_Y(u) = exp(u^T μ + ½ u^T Σ u).
Proof. Let v ∈ R^d be arbitrary. Then v^T Y is scalar Gaussian with mean v^T μ and variance v^T Σ v. Hence it has moment generating function
    M_{v^T Y}(t) = E[e^{t v^T Y}] = exp{t (v^T μ) + (t²/2)(v^T Σ v)}.
Now take t = 1 and observe that
    M_{v^T Y}(1) = E[e^{v^T Y}] = M_Y(v).
Combining the two, we conclude that
    M_Y(v) = exp(v^T μ + ½ v^T Σ v),   v ∈ R^d.

Proposition (Property 2: Affine Transformation). For Y ~ N(μ_{p×1}, Σ_{p×p}) and given B_{n×p} and ξ_{n×1}, we have
    ξ + BY ~ N(ξ + Bμ, B Σ B^T).
Proof.
    M_{ξ+BY}(u) = E exp{u^T (ξ + BY)} = exp(u^T ξ) E exp{(B^T u)^T Y}
                = exp(u^T ξ) M_Y(B^T u)
                = exp(u^T ξ) exp{(B^T u)^T μ + ½ u^T B Σ B^T u}
                = exp{u^T ξ + u^T Bμ + ½ u^T B Σ B^T u}
                = exp{u^T (ξ + Bμ) + ½ u^T B Σ B^T u}.
And this last expression is the MGF of a N(ξ + Bμ, B Σ B^T) distribution.
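Property 2 is easy to check by simulation. A minimal R sketch using MASS::mvrnorm (MASS ships with standard R installations); the particular μ, Σ, ξ and B below are arbitrary illustrations:

    library(MASS)                                   # for mvrnorm()
    set.seed(5)
    mu    <- c(1, -1)
    Sigma <- matrix(c(2, 0.5,
                      0.5, 1), 2, 2)
    B  <- matrix(c(1, 2, 0, -1, 1, 1), nrow = 3)    # arbitrary 3 x 2 matrix
    xi <- c(0, 1, 2)

    Y <- mvrnorm(n = 1e5, mu = mu, Sigma = Sigma)   # rows are draws of Y ~ N(mu, Sigma)
    Z <- t(xi + B %*% t(Y))                         # rows are draws of xi + B Y

    colMeans(Z); xi + as.vector(B %*% mu)           # empirical vs theoretical mean
    cov(Z);      B %*% Sigma %*% t(B)               # empirical vs theoretical covariance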
Gaussian Vectors and Affine Transformations

Proposition (Property 3: Density Function). Let Σ_{p×p} be nonsingular. The density of N(μ_{p×1}, Σ_{p×p}) is
    f_Y(y) = (2π)^{-p/2} |Σ|^{-1/2} exp{−½ (y − μ)^T Σ^{-1} (y − μ)}.
Proof. Let Z = (Z_1, ..., Z_p)^T be a vector of iid N(0, 1) random variables. Then, because of independence,
(a) the density of Z is
        f_Z(z) = Π_{i=1}^p (2π)^{-1/2} exp(−z_i²/2) = (2π)^{-p/2} exp(−½ z^T z);
(b) the MGF of Z is
        M_Z(u) = E exp(Σ_{i=1}^p u_i Z_i) = Π_{i=1}^p E exp(u_i Z_i) = exp(u^T u/2),
which is the MGF of a p-variate N(0, I) distribution.
By the spectral theorem, Σ admits a square root Σ^{1/2}. Furthermore, since Σ is non-singular, so is Σ^{1/2}. Now observe that from Property 2 we have
    Y =d Σ^{1/2} Z + μ ~ N(μ, Σ).
By the change of variables formula,
    f_Y(y) = f_{Σ^{1/2} Z + μ}(y) = |Σ^{-1/2}| f_Z{Σ^{-1/2}(y − μ)}
           = (2π)^{-p/2} |Σ|^{-1/2} exp{−½ (y − μ)^T Σ^{-1} (y − μ)}.
[Recall that to obtain the density of W = g(X) at w, we need to evaluate f_X at g^{-1}(w) but also multiply by the Jacobian determinant of g^{-1} at w.]

Proposition (Property 4: Isosurfaces). The isosurfaces of a N(μ_{p×1}, Σ_{p×p}) are (p − 1)-dimensional ellipsoids centred at μ, with principal axes given by the eigenvectors of Σ and with anisotropies given by the ratios of the square roots of the corresponding eigenvalues of Σ.
Proof. Exercise: use Property 3 and the spectral theorem.

Proposition (Property 5: Coordinate Distributions). Let Y = (Y_1, ..., Y_p)^T ~ N(μ_{p×1}, Σ_{p×p}). Then Y_j ~ N(μ_j, Σ_jj).
Proof. Observe that Y_j = (0, 0, ..., 1, ..., 0, 0) Y (with the 1 in the j-th position) and use Property 2.

Proposition (Property 6: Diagonal Σ ⟺ Independence). Let Y = (Y_1, ..., Y_p)^T ~ N(μ_{p×1}, Σ_{p×p}). Then the Y_i are mutually independent if and only if Σ is diagonal.
Proof. Suppose that the Y_j are independent. Property 5 yields Y_j ~ N(μ_j, σ_j²) for some σ_j > 0. Thus the density of Y is
    f_Y(y) = Π_{j=1}^p f_{Y_j}(y_j) = Π_{j=1}^p (σ_j √(2π))^{-1} exp{−(y_j − μ_j)²/(2σ_j²)}
           = (2π)^{-p/2} |diag(σ_1², ..., σ_p²)|^{-1/2} exp{−½ (y − μ)^T diag(σ_1^{-2}, ..., σ_p^{-2}) (y − μ)}.
Hence Y ~ N{μ, diag(σ_1², ..., σ_p²)}, i.e. the covariance is diagonal.
Conversely, assume Σ is diagonal, say Σ = diag(σ_1², ..., σ_p²). Then we can reverse the steps of the first part to see that the joint density f_Y(y) can be written as a product of the marginal densities f_{Y_j}(y_j), thus proving independence.
Proposition (Property 7: AY, BY independent ⟺ A Σ B^T = 0). If Y ~ N(μ_{p×1}, Σ_{p×p}), and A_{m×p}, B_{d×p} are real matrices, then
    AY independent of BY  ⟺  A Σ B^T = 0.
Proof. It suffices to prove the result assuming μ = 0 (and it simplifies the algebra). Let W_{(m+d)×1} = (AY; BY) and θ_{(m+d)×1} = (u; v), with u ∈ R^m and v ∈ R^d. Then
    M_W(θ) = E[exp{W^T θ}] = E exp{Y^T A^T u + Y^T B^T v} = E exp{Y^T (A^T u + B^T v)} = M_Y(A^T u + B^T v)
           = exp{½ (A^T u + B^T v)^T Σ (A^T u + B^T v)}
           = exp{½ (u^T A Σ A^T u + v^T B Σ B^T v + u^T A Σ B^T v + v^T B Σ A^T u)}.
First assume A Σ B^T = 0 (so also B Σ A^T = (A Σ B^T)^T = 0). Then
    M_W(θ) = exp{½ u^T A Σ A^T u} exp{½ v^T B Σ B^T v} = M_{AY}(u) M_{BY}(v),
i.e. the joint MGF is the product of the marginal MGFs, proving independence.
For the converse, assume that AY and BY are independent. Then, for all u, v,
    M_W(θ) = M_{AY}(u) M_{BY}(v)
    ⟹ exp{½ (u^T A Σ A^T u + v^T B Σ B^T v + u^T A Σ B^T v + v^T B Σ A^T u)} = exp{½ u^T A Σ A^T u} exp{½ v^T B Σ B^T v}
    ⟹ exp{½ · 2 u^T A Σ B^T v} = 1
    ⟹ u^T A Σ B^T v = 0,   ∀ u ∈ R^m, v ∈ R^d
    ⟹ the orthocomplement(*) of the column space of A Σ B^T is the whole of R^m
    ⟹ A Σ B^T = 0.
(*) Recall that for Q_{m×d} we have M⊥(Q) = {y ∈ R^m : y^T Q x = 0, ∀ x ∈ R^d}.
Gaussian Quadratic Forms and the χ² Distribution

Definition (χ² distribution). Let Z ~ N(0, I_{p×p}). Then ‖Z‖² = Σ_{j=1}^p Z_j² is said to have the chi-square (χ²) distribution with p degrees of freedom; we write ‖Z‖² ~ χ²_p.
[Thus, χ²_p is the distribution of the sum of squares of p real independent standard Gaussian random variates.]

Definition (F distribution). Let V ~ χ²_p and W ~ χ²_q be independent random variables. Then (V/p)/(W/q) is said to have the F distribution with p and q degrees of freedom; we write (V/p)/(W/q) ~ F_{p,q}.

Proposition (Gaussian Quadratic Forms).
1. If Z ~ N(0_{p×1}, I_{p×p}) and H is a projection of rank r ≤ p, then
       Z^T H Z ~ χ²_r.
2. If Y ~ N(μ_{p×1}, Σ_{p×p}) with nonsingular Σ, then
       (Y − μ)^T Σ^{-1} (Y − μ) ~ χ²_p.
Exercise: prove these results.
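Part 1 of the proposition is easy to verify by Monte Carlo in R; the sketch below builds a projection H of rank r from an arbitrary matrix and checks that Z^T H Z behaves like a χ²_r variable (the dimensions and simulation size are arbitrary choices):

    set.seed(6)
    p <- 6; r <- 3; nsim <- 1e4
    A <- matrix(rnorm(p * r), p, r)             # arbitrary p x r matrix of full column rank
    H <- A %*% solve(crossprod(A)) %*% t(A)     # projection of rank r onto M(A)

    qf <- replicate(nsim, {
      Z <- rnorm(p)                             # Z ~ N(0, I_p)
      drop(t(Z) %*% H %*% Z)                    # the quadratic form Z^T H Z
    })

    c(mean(qf), var(qf))                        # close to r and 2r (here 3 and 6)
    ks.test(qf, "pchisq", df = r)               # no evidence against the chi^2_r law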
Approximately χ² Quadratic Forms

What if the random vector is not Gaussian? Here's a CLT that helps:

Theorem (Weighted Sum CLT). Let {X_n} be an i.i.d. sequence of real random variables, with common mean 0 and variance 1. Let {α_n} be a sequence of real constants. Then,
    sup_{1≤j≤n} [ α_j² / Σ_{i=1}^n α_i² ] → 0 as n → ∞   ⟹   (Σ_{i=1}^n α_i²)^{-1/2} Σ_{i=1}^n α_i X_i →d N(0, 1).

- The supremum condition amounts to saying that, in the limit, any single component contributes a negligible proportion of the total variance.
- The coefficient sequence {α_n} might very well diverge, without contradicting the negligibility condition.

Linear Models: Likelihood and Geometry
Simple Normal Linear Regression

General formulation:
    Y_i | x_i ~ind Distribution{g(x_i)},   i = 1, ..., n.
Simple Normal Linear Regression:
- Distribution = N{g(x), σ²}
- g(x) = β_0 + β_1 x
Resulting model:
    Y_i ~ind N(β_0 + β_1 x_i, σ²)
    ⟺
    Y_i = β_0 + β_1 x_i + ε_i,   ε_i ~ind N(0, σ²).

Example: Professor's Van

    Fillup   Km/L
       1     7.72
       2     8.54
       3     8.35
       4     8.55
       5     8.16
       6     8.12
       7     7.46
       8     6.43
       9     6.74
      10     6.72
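The Km/L values in the table can be fitted directly in R; a minimal sketch, assuming (as the table suggests) that the fill-up number 1–10 is used as the covariate:

    vans <- data.frame(fillup = 1:10,
                       kml    = c(7.72, 8.54, 8.35, 8.55, 8.16,
                                  8.12, 7.46, 6.43, 6.74, 6.72))
    van.lm <- lm(kml ~ fillup, data = vans)   # simple Normal linear regression
    summary(van.lm)
    plot(kml ~ fillup, data = vans)
    abline(van.lm)                            # adds the fitted regression line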
Simple Normal Linear Regression

Jargon: Y is the response variable and x is the explanatory variable (or covariate).
Data structure: for i = 1, ..., n, pairs (x_i, y_i), where the x_i are fixed values of x and y_i is treated as a realisation of Y_i at x_i.
Linearity: linearity is in the parameters, not the explanatory variable.
Example: flexibility in what we define as explanatory — instead of x_j we may use sin(x_j):
    Y_j = β_0 + β_1 sin(x_j) + ε_j,   ε_j ~iid Normal(0, σ²).
Example: sometimes a transformation may be required:
    Y_j = β_0 e^{β_1 x_j} η_j,   η_j ~iid Lognormal
    —[ log() ↔ exp() ]→
    log Y_j = log β_0 + β_1 x_j + log η_j,   log η_j ~iid Normal.

Multiple Normal Linear Regression

x could be a vector (instead of x_i ∈ R we could have x_i ∈ R^q):
    Y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_q x_{iq} + ε_i,   ε_i ~ind N(0, σ²).
Letting p = q + 1, this can be summarised via matrix notation:

    ( Y_1 )   ( 1  x_11 ...  x_1q ) ( β_0 )   ( ε_1 )
    ( Y_2 ) = ( 1  x_21 ...  x_2q ) ( β_1 ) + ( ε_2 )
    (  ⋮  )   ( ⋮    ⋮          ⋮ ) (  ⋮  )   (  ⋮  )
    ( Y_n )   ( 1  x_n1 ...  x_nq ) ( β_q )   ( ε_n )

    ⟹   Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},   ε ~ N_n(0, σ²I).

X is called the design matrix.
Example: Cement Heat Evolution

    Case   3CaO.Al_2O_3   3CaO.SiO_2   4CaO.Al_2O_3.Fe_2O_3   2CaO.SiO_2     Heat
      1          7             26                6                 60       78.50
      2          1             29               15                 52       74.30
      3         11             56                8                 20      104.30
      4         11             31                8                 47       87.60
      5          7             52                6                 33       95.90
      6         11             55                9                 22      109.20
      7          3             71               17                  6      102.70
      8          1             31               22                 44       72.50
      9          2             54               18                 22       93.10
     10         21             47                4                 26      115.90
     11          1             40               23                 34       83.80
     12         11             66                9                 12      113.30
     13         10             68                8                 12      109.40

(Columns: percent weights of the four compounds; response: heat evolved.)

[Figure: scatterplots of heat evolved against the percent weight of each of the four compounds.]

Example: polynomial terms for MPG vs Horsepower

[Figure: scatterplot of gas mileage (MPG) against horsepower.]

Perhaps more fitting than
    Y_j = β_0 + β_1 x_j + ε_j
would be
    Y_j = β_0 + β_1 x_j + β_2 x_j² + ε_j.
- Still a linear model, but now with 2 covariates: x_j and x_j².
- Normally we would require a (hyper)plane to visualise the dependence of the mean on 2 or more covariates.
- When the additional covariates are variable transformations, we can visualise the mean dependence via a non-linear curve, even though the model is linear.

Likelihood for Normal Linear Regression

Model is:
    Y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_q x_{iq} + ε_i,   ε_i ~iid N(0, σ²)
    ⟺
    Y = Xβ + ε,   ε ~ N_n(0, σ²I).
Observe: y = (y_1, ..., y_n)^T for a given fixed design matrix X, i.e.:
    (y_1, x_11, ..., x_1q), ..., (y_i, x_i1, ..., x_iq), ..., (y_n, x_n1, ..., x_nq).

Likelihood and Loglikelihood

    L(β, σ²) = (2πσ²)^{-n/2} exp{ −(1/(2σ²)) (y − Xβ)^T (y − Xβ) }
    ℓ(β, σ²) = −(1/2) { n log 2π + n log σ² + (1/σ²) (y − Xβ)^T (y − Xβ) }
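For later use, the cement table above can be entered in R as a data frame; the variable names x1–x4 and y are chosen to match the lm() call that appears further down in the handout:

    cement <- data.frame(
      x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),        # 3CaO.Al2O3
      x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68), # 3CaO.SiO2
      x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),         # 4CaO.Al2O3.Fe2O3
      x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),  # 2CaO.SiO2
      y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
             72.5, 93.1, 115.9, 83.8, 113.3, 109.4)               # heat evolved
    )
    pairs(cement)    # scatterplot matrix, analogous to the figures above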
Maximum Likelihood Estimation

Whatever the value of σ², the log-likelihood is maximised when (y − Xβ)^T (y − Xβ) is minimised. Hence, the MLE of β is:
    β̂ = arg max_β −(y − Xβ)^T (y − Xβ) = arg min_β (y − Xβ)^T (y − Xβ).
Obtain the minimum by solving:
    0 = ∂/∂β (y − Xβ)^T (y − Xβ)
    0 = [∂(y − Xβ)/∂β] · ∂(y − Xβ)^T (y − Xβ)/∂(y − Xβ)       (chain rule)
    0 = X^T (y − Xβ)                                          (normal equations)
    X^T X β = X^T y
    β̂ = (X^T X)^{-1} X^T y      (if X has rank p).

Maximum Likelihood Estimation

β̂ is called the least squares estimator because it is the result of minimising the sum of squares
    (y − Xβ)^T (y − Xβ) = Σ_{i=1}^n (y_i − β_0 − β_1 x_{i1} − β_2 x_{i2} − ... − β_q x_{iq})².
Thus we are trying to find the β that gives the hyperplane with minimum sum of squared vertical distances from our observations.

Maximum Likelihood Estimation

Since the MLE of β is β̂ = (X^T X)^{-1} X^T y for all values of σ², we have
    arg max_{σ²} max_β ℓ(β, σ²) = arg max_{σ²} ℓ(β̂, σ²)
                                = arg max_{σ²} −(1/2){ n log 2π + n log σ² + (1/σ²)(y − Xβ̂)^T (y − Xβ̂) }.
Differentiating and setting equal to zero yields
    σ̂² = (1/n) (y − Xβ̂)^T (y − Xβ̂).
Next week we will see that a better (unbiased) estimator is
    S² = (1/(n − p)) (y − Xβ̂)^T (y − Xβ̂).

Maximum Likelihood Estimation

Residuals: e = y − Xβ̂, so that e = (e_1, ..., e_n)^T, with
    e_i = y_i − β̂_0 − β̂_1 x_{i1} − β̂_2 x_{i2} − ... − β̂_q x_{iq}.
Fitted values: ŷ = Xβ̂, so that ŷ = (ŷ_1, ..., ŷ_n)^T, with
    ŷ_i = β̂_0 + β̂_1 x_{i1} + ... + β̂_q x_{iq}.
The "regression line" is such that Σ_i e_i² is minimised over all β.
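The closed-form estimators can be computed "by hand" in R and compared with lm(); a sketch using the cement data frame constructed in the earlier sketch:

    X <- cbind(1, as.matrix(cement[, c("x1", "x2", "x3", "x4")]))   # design matrix with intercept
    y <- cement$y
    n <- nrow(X); p <- ncol(X)

    beta.hat   <- solve(crossprod(X), crossprod(X, y))   # (X^T X)^{-1} X^T y (normal equations)
    e          <- y - X %*% beta.hat                     # residuals
    sigma2.hat <- sum(e^2) / n                           # MLE of sigma^2 (biased)
    S2         <- sum(e^2) / (n - p)                     # unbiased estimator S^2

    cbind(by.hand = drop(beta.hat),
          lm      = coef(lm(y ~ x1 + x2 + x3 + x4, data = cement)))
    c(sigma2.hat = sigma2.hat, S2 = S2)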
Example: Professor's Van

[Figure: the Professor's Van data with the fitted regression line.]

Parameter estimates: β̂_0 = 8.6, β̂_1 = 0.068, S² = 17.4.

Example: MPG vs Horsepower

Model with the linear term only.

[Figure: scatterplot of gas mileage (MPG) against horsepower with the fitted regression line.]

Parameter estimates: β̂_0 = 39.94, β̂_1 = −0.16 and S² = 24.06.
Example: MPG vs Horsepower

Model with linear and quadratic terms.

[Figure: scatterplot of gas mileage (MPG) against horsepower with the fitted quadratic curve.]

Parameter estimates: β̂_0 = 56.90, β̂_1 = −0.47, β̂_2 = 0.0012 and S² = 19.13.

The Geometry of Least Squares

There are two dual geometrical viewpoints that one may adopt on

    ( Y_1 )   ( 1  x_11  x_12  ...  x_1q ) ( β_0 )   ( ε_1 )
    ( Y_2 )   ( 1  x_21  x_22  ...  x_2q ) ( β_1 )   ( ε_2 )
    (  ⋮  ) = ( ⋮    ⋮     ⋮          ⋮  ) (  ⋮  ) + (  ⋮  )
    ( Y_n )   ( 1  x_n1  x_n2  ...  x_nq ) ( β_q )   ( ε_n )

- Row geometry: focus on the n OBSERVATIONS (the rows of the design matrix).
- Column geometry: focus on the p EXPLANATORIES (the columns of the design matrix).

Both are useful, usually for different things:
- Row geometry: useful for exploratory analysis.
- Column geometry: useful for theoretical analysis.
Both geometries give useful, but different, intuitive interpretations of the least squares estimators.
Row Geometry (Observations)

- Corresponds to the "scatterplot geometry" (data space): n points in R^p, each corresponds to an observation.
- The least squares parameters give the parametric equation for a hyperplane.
- The hyperplane has the property that it minimizes the sum of squared vertical distances of the observations from the plane itself, over all possible hyperplanes.
- Fitted values are vertical projections (NOT orthogonal projections!) of the observations onto the plane; residuals are signed vertical distances of the observations from the plane.

Column Geometry (Variables)

Adopt the dual perspective:
- Consider the entire vector y as a single point living in R^n.
- Then consider each variable (column) as a point also in R^n.
- What is the interpretation of the p-dimensional vector β̂, and the n-dimensional vectors ŷ and e, in this dual space?
- It turns out there is another important plane here: the plane spanned by the variable vectors (the column vectors of X). Recall that this is the column space of X, denoted by
      M(X) := {Xβ : β ∈ R^p}.

Column Geometry (Variables)

Q: What does Y = Xβ + ε mean?
A: Y is [some element of M(X)] + [Gaussian disturbance].
Any realisation y of Y will lie outside M(X) (almost surely). The MLE estimates β by minimising
    (y − Xβ)^T (y − Xβ) = ‖y − Xβ‖².
Thus we search for a β giving the element of M(X) with the minimum distance from y. Hence ŷ = Xβ̂ is the projection of y onto M(X):
    ŷ = Xβ̂ := X(X^T X)^{-1} X^T y = Hy.
H is the hat matrix (it puts a hat on y!). β̂ will now be the unique (since X is of full rank) vector of coordinates of ŷ with respect to the basis of columns of X.

Column Geometry (Variables)

Another derivation of the MLE of β: choose β̂ to minimise (y − Xβ)^T (y − Xβ) = ‖y − Xβ‖², so
    β̂ = arg min_β ‖y − Xβ‖².
- min_{β ∈ R^p} ‖y − Xβ‖² = min_{μ ∈ M(X)} ‖y − μ‖².
- But the unique μ that yields min_{μ ∈ M(X)} ‖y − μ‖² is μ = Py. Here P is the projection onto the column space of X, M(X).
- Since X is of full rank, P = X(X^T X)^{-1} X^T. So μ = X(X^T X)^{-1} X^T y, which implies that
      Xβ̂ = μ = X(X^T X)^{-1} X^T y,   i.e.   β̂ = (X^T X)^{-1} X^T y.
The (Column) Geometry of Least Squares

So what is β̂?
- If the columns of X are linearly independent, they are a (non-orthogonal) basis for M(X).
- Hence for any z ∈ M(X), there exists a unique γ ∈ R^p such that z = Xγ. So γ contains the coordinates of z with respect to the X-column basis.
- Consequently, β̂ contains the coordinates of ŷ with respect to the X-column basis.
- But ŷ = Hy = X(X^T X)^{-1} X^T y = Xu with u = (X^T X)^{-1} X^T y, so u is the unique vector that gives the coordinates of ŷ with respect to the X-column basis.
- Hence we must have β̂ = u = (X^T X)^{-1} X^T y.

The (Column) Geometry of Least Squares

Facts:
1. e = (I − H)y = (I − H)ε.
2. ŷ and e are orthogonal, i.e. ŷ^T e = 0.
3. Pythagoras: y^T y = ŷ^T ŷ + e^T e = y^T H y + ε^T (I − H) ε.
Derivation:
1. e = y − Xβ̂ = y − Hy = (I − H)y = (I − H)(Xβ + ε) = (I − H)Xβ + (I − H)ε = (I − H)ε.
2. e = y − ŷ = (I − H)y ⟹ ŷ^T e = y^T H^T (I − H) y = 0.
3. y^T y = (Hy + (I − H)y)^T (Hy + (I − H)y) = ŷ^T ŷ + e^T e + 2 y^T H (I − H) y, and the last term is 0.

Weighted Least Squares

Assume a slightly different model:
    Y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_q x_{iq} + ε_i/√w_i,   ε_i ~ind N(0, σ²),  w_i > 0
    ⟺
    Y_i ~ind N( β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_q x_{iq}, σ²/w_i ),
with the w_j known weights (example: each Y_j is an average of w_j measurements).
This arises often in practice (e.g., in sample surveys), but also arises in theory.
Weighted Least Squares

Transformation:
    y' = W^{1/2} y,   X' = W^{1/2} X,   with W_{n×n} = diag(w_1, ..., w_n).
This leads to the usual scenario. In this notation we obtain:
    β̂ = [(X')^T X']^{-1} (X')^T y' = (X^T W X)^{-1} X^T W y.
Similarly:
    S² = (1/(n − p)) y^T { W − W X (X^T W X)^{-1} X^T W } y.

Distribution Theory of Least Squares
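A hedged R sketch of the weighted least squares computation on simulated data with known weights; lm() implements exactly the transformation above through its weights argument:

    set.seed(10)
    n <- 50
    x <- runif(n)
    w <- sample(1:5, n, replace = TRUE)            # e.g. each Y_i is an average of w_i measurements
    y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))    # var(eps_i) = sigma^2 / w_i, with sigma = 1

    X <- cbind(1, x)
    W <- diag(w)
    beta.wls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)   # (X^T W X)^{-1} X^T W y

    cbind(by.hand = drop(beta.wls),
          lm      = coef(lm(y ~ x, weights = w)))           # identical estimates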
Least Squares Estimators

Gaussian Linear Model:
    Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},   ε ~ N_n(0, σ²I).
We have derived the estimators:
1. β̂ = (X^T X)^{-1} X^T y
2. σ̂² = (1/n) (y − Xβ̂)^T (y − Xβ̂) = (1/n) ‖y − ŷ‖²
3. S² = (1/(n − p)) ‖y − ŷ‖²
We need to study the distribution of these estimators for the purpose of:
- understanding their precision
- building confidence intervals
- testing hypotheses
- comparing them to other candidate estimators
- ...

Joint Distribution of LSE's

Theorem. Let Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1} with ε ~ N_n(0, σ²I) and assume that X has full rank p < n. Then,
1. β̂ ~ N_p{β, σ²(X^T X)^{-1}};
2. β̂ and S² are independent; and
3. (n − p) S²/σ² ~ χ²_{n−p}, where χ²_ν denotes the chi-square distribution with ν degrees of freedom.

Corollary. Let Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1} with ε ~ N_n(0, σ²I). The statistic Hy is sufficient for the parameter β. If X has full rank p < n, then β̂ is also sufficient for β.

Corollary. S² is unbiased whereas σ̂² is biased (so we prefer S²).
Geometry Reminder

[Figure: geometric picture of the decomposition y = ŷ + e (orthogonal projection of y onto M(X)).]

Joint Distribution of LSE's

Proof of the Theorem.
1. Recall our results for linear transformations of Gaussian variables:
       β̂ = (X^T X)^{-1} X^T Y,   Y ~ N_n(Xβ, σ²I)   ⟹   β̂ ~ N_p{β, σ²(X^T X)^{-1}}.
2. If e is independent of ŷ = Xβ̂, then S² = e^T e/(n − p) will be independent of β̂ (why?). Now notice that:
       e = (I − H)y,   ŷ = Hy,   y ~ N(Xβ, σ²I).
   Therefore, from the properties of the Gaussian distribution, e is independent of ŷ, since
       (I − H)(σ²I)H = σ²(I − H)H = 0,
   by idempotency of H.
3. For the last part recall that e = (I − H)ε, so
       (n − p)S² = (n − p) e^T e/(n − p) = ε^T (I − H) ε,
   by idempotency of H. But ε ~ N_n(0, σ²I_n), so σ^{-1}ε ~ N_n(0, I_n). Therefore, by the properties of normal quadratic forms (see the Gaussian Quadratic Forms proposition above),
       (n − p) S²/σ² = (σ^{-1}ε)^T (I − H) (σ^{-1}ε) ~ χ²_{n−p}.

Proof of the first Corollary.
Write y = Hy + (I − H)y = ŷ + e. If we can show that the conditional distribution of the 2n-dimensional vector W = (ŷ^T, e^T)^T given ŷ does not depend on β, then we will also know that the conditional distribution of y = ŷ + e given ŷ does not depend on β either, proving the proposition.
But we have proven that ŷ is independent of e. Therefore, conditional on ŷ, e always has the same distribution N{0, (I − H)σ²}. It follows that, conditional on ŷ, the vector W has a distribution whose first n coordinates equal ŷ almost surely, and whose last n coordinates are N{0, (I − H)σ²}. Neither of those two depend on β, and the proof is complete.
When X has full rank, β̂ is a 1-1 function of Hy, and is also sufficient for β.

Proof of the second Corollary.
Recall that if Q ~ χ²_d, then E[Q] = d. Hence E[(n − p)S²/σ²] = n − p, so E[S²] = σ², whereas E[σ̂²] = σ²(n − p)/n < σ².
Confidence and Prediction Intervals

How to construct a CI for a linear combination of the parameters, c^T β?
We obtain
    c^T β̂ ~ N_1( c^T β, σ² c^T (X^T X)^{-1} c ) = N_1( c^T β, τ² ),   writing τ² := σ² c^T (X^T X)^{-1} c.
Therefore Q = (c^T β̂ − c^T β)/τ ~ N_1(0, 1). Hence Q² ~ χ²_1, and Q² is independent of S² (since β̂ is independent of S²), while (n − p)S²/σ² ~ χ²_{n−p}.
In conclusion:
    Q² / [ ((n − p)S²/σ²) / (n − p) ] = (c^T β̂ − c^T β)² / { S² c^T (X^T X)^{-1} c } ~ F_{1,n−p}.
But for real W, W² ~ F_{1,n−p} ⟺ W ~ t_{n−p}, so base the CI on:
    (c^T β̂ − c^T β) / √{ S² c^T (X^T X)^{-1} c } ~ t_{n−p}.
Obtain the (1 − α)·100% CI:
    c^T β̂ ∓ t_{n−p}(α/2) √{ S² c^T (X^T X)^{-1} c }.

What about a (1 − α) CI for β_r (the r-th coordinate)?
Let c_r = (0, 0, ..., 0, 1, 0, ..., 0)^T (with the 1 in the r-th position). Then β_r = c_r^T β. Therefore, base the CI on
    (β̂_r − β_r) / √(S² v_{r,r}) = (c_r^T β̂ − c_r^T β) / √{ S² c_r^T (X^T X)^{-1} c_r } ~ t_{n−p},
where v_{r,s} is the (r, s) element of (X^T X)^{-1}. Obtain the (1 − α)·100% CI:
    β̂_r ∓ t_{n−p}(α/2) √(S² v_{rr}).

Confidence and Prediction Intervals

Suppose we want to predict the value of y_+ for an x_+ ∈ R^p. Our model predicts y_+ by x_+^T β̂.
But y_+ = x_+^T β + ε_+, so a prediction interval is DIFFERENT from an interval for a linear combination c^T β (extra uncertainty due to ε_+):
    E[x_+^T β̂ + ε_+] = x_+^T β,
    var[x_+^T β̂ + ε_+] = var[x_+^T β̂] + var[ε_+] = σ² [ x_+^T (X^T X)^{-1} x_+ + 1 ].
Base the prediction interval on:
    (x_+^T β̂ − y_+) / √( S² {1 + x_+^T (X^T X)^{-1} x_+} ) ~ t_{n−p}.
Obtain the (1 − α) prediction interval:
    x_+^T β̂ ∓ t_{n−p}(α/2) √( S² {1 + x_+^T (X^T X)^{-1} x_+} ).

The Coefficient of Determination, R²

R² is a measure of fit of the model to the data.
We are trying to best approximate y through an element of the column space of X. How successful are we? The squared error is e^T e. How large is this, relative to the data variation? Look at
    ‖e‖²/‖y‖² = e^T e / y^T y = y^T (I − H) y / y^T y = 1 − ŷ^T ŷ / y^T y.
Define
    R_0² = ŷ^T ŷ / y^T y = ‖ŷ‖² / ‖y‖².
Note that 0 ≤ R_0² ≤ 1.
Interpretation: what proportion of the squared norm of y does our fitted value ŷ explain?
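These interval formulas can be reproduced in R and compared with confint() and predict(); a minimal sketch on a small simulated fit (any lm object would do, and all numbers below are illustrative):

    set.seed(11)
    x <- runif(40); y <- 1 + 2 * x + rnorm(40)
    fit <- lm(y ~ x)

    X  <- model.matrix(fit); n <- nrow(X); p <- ncol(X)
    S2 <- sum(resid(fit)^2) / (n - p)
    V  <- solve(crossprod(X))                      # (X^T X)^{-1}

    # 95% CI for the slope (the 2nd coefficient), by hand vs confint()
    r  <- 2
    ci <- coef(fit)[r] + c(-1, 1) * qt(0.975, n - p) * sqrt(S2 * V[r, r])
    rbind(by.hand = ci, confint = confint(fit)[r, ])

    # 95% prediction interval at x+ = (1, 0.5)
    xplus   <- c(1, 0.5)
    se.pred <- sqrt(S2 * (1 + drop(t(xplus) %*% V %*% xplus)))
    pi <- sum(xplus * coef(fit)) + c(-1, 1) * qt(0.975, n - p) * se.pred
    rbind(by.hand = pi,
          predict = predict(fit, data.frame(x = 0.5), interval = "prediction")[, 2:3])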
Geometry Reminder

[Figure: geometric illustration of the projections used in defining R².]

Different Versions of R²

"Centred (in fact, usual) R²". This compares the empirical variance of ŷ to the empirical variance of y, instead of the empirical norms. In other words:
    R² = [ (1/n) Σ_{i=1}^n (ŷ_i − ȳ)² ] / [ (1/n) Σ_{i=1}^n (y_i − ȳ)² ] = [ Σ_{i=1}^n ŷ_i² − n ȳ² ] / [ Σ_{i=1}^n y_i² − n ȳ² ]
(note that (1/n) Σ_{i=1}^n ŷ_i = (1/n) Σ_{i=1}^n (y_i − e_i) = ȳ, because e ⊥ 1 — recall that 1 is the vector of 1's = first column of the design matrix X — so Σ_i e_i = 0).
Note that
    R² = ‖ŷ − ȳ1‖² / ‖y − ȳ1‖².
- R_0² is mathematically more natural (it does not treat the first column of X as special).
- R² is statistically more relevant (it expresses variance — the first column of X usually is special, in statistical terms!).
- R_0² and R² may differ a lot when ȳ is large.

Different Versions of R²

Geometrical interpretation of R²: project y and ŷ on the orthogonal complement of 1, then compare the norms (of the projections):
    R² = ‖ŷ − ȳ1‖² / ‖y − ȳ1‖² = ‖(I − 1(1^T 1)^{-1} 1^T) ŷ‖² / ‖(I − 1(1^T 1)^{-1} 1^T) y‖²,
since
    1(1^T 1)^{-1} 1^T y = 1 · (1/n) Σ_{i=1}^n y_i = ȳ1,
    1(1^T 1)^{-1} 1^T ŷ = 1 · (1/n) Σ_{i=1}^n ŷ_i = ȳ1.
Intuition: we should not take into account the part of ‖y‖ that is explained by a constant; we only want to see the effect of the explanatory variables.
NOTE: statistical packages (e.g., R) provide R² (and/or R_a², see below), not R_0².
Exercise: show that R² = [corr({ŷ_i}_{i=1}^n, {y_i}_{i=1}^n)]².
Exercise: show that R² ≤ R_0².

Horsepower and MPG of cars

R² coefficients for the linear and quadratic models:

                R_0²    R²
    linear      0.96   0.61
    quadratic   0.97   0.69

[Figure: scatterplot of gas mileage (MPG) against horsepower.]
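A short R sketch computing R_0², the usual R² and the adjusted R_a² (defined just below) directly from a fit, to make the definitions concrete; the simulated data are purely illustrative:

    set.seed(12)
    x <- runif(60); y <- 1 + 2 * x + rnorm(60)
    fit  <- lm(y ~ x)
    yhat <- fitted(fit); n <- length(y); p <- length(coef(fit))

    R2.0 <- sum(yhat^2) / sum(y^2)                           # uncentred version
    R2   <- sum((yhat - mean(y))^2) / sum((y - mean(y))^2)   # centred (usual) version
    R2.a <- R2 - (1 - R2) * (p - 1) / (n - p)                # adjusted version
    c(R2.0 = R2.0, R2 = R2, R2.adj = R2.a)

    summary(fit)$r.squared; summary(fit)$adj.r.squared       # agree with R2 and R2.a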
Different Versions of R²

The adjusted R² takes into account the number of variables employed. It is defined as:
    R_a² = R² − (1 − R²) (p − 1)/(n − p),
which is the same as the more familiar form 1 − (1 − R²)(n − 1)/(n − p). This corrects for the fact that we can always increase R² by adding variables. One can also correct the un-centred R_0² by evaluating
    R_0² − (1 − R_0²) p/(n − p).

Example: Cement Heat Evolution
The data are the 13 cases tabulated earlier (percent weights of the four compounds; response: heat evolved).

[Figure: scatterplots of heat evolved against the percent weight of each of the four compounds.]

> cement.lm <- lm(y ~ x1 + x2 + x3 + x4, data = cement)
> summary(cement.lm)

                Estimate   Std. Error   t value   Pr(>|t|)
   (Intercept)   62.4054      70.0710      0.89     0.3991
   x1             1.5511       0.7448      2.08     0.0708
   x2             0.5102       0.7238      0.70     0.5009
   x3             0.1019       0.7547      0.14     0.8959
   x4            -0.1441       0.7091     -0.20     0.8441

   Residual standard error: 2.446 on 8 degrees of freedom
   R-Squared: 0.9824
Example: Cement Heat Evolution

> x.plus
[1] 25 25 25 25
> predict(cement.lm, x.plus, interval="confidence", se.fit=T, level=0.95)
     Fit   Lower   Upper
   112.8    97.5   128.2
> predict(cement.lm, x.plus, interval="prediction", se.fit=T, level=0.95)
     Fit   Lower   Upper
   112.8    96.5   129.2

Horsepower and MPG of cars

[Figure: scatterplot of gas mileage (MPG) against horsepower.]
Horsepower and MPG of cars

> auto.lm <- lm(mpg ~ horsepower + I(horsepower^2), data=Auto)
> summary(auto.lm)

                      Estimate   Std. Error   t value   Pr(>|t|)
   (Intercept)         56.9000       1.8004     31.60    < 2e-16
   horsepower          -0.4662       0.0311    -14.98    < 2e-16
   I(horsepower^2)      0.0012       0.0001     10.08    < 2e-16

   Residual standard error: 4.374 on 389 degrees of freedom
   R-Squared: 0.6876

> x.plus
  horsepower
         120
> predict(auto.lm, x.plus, interval="confidence", se.fit=T, level=0.95)
     Fit   Lower   Upper
   18.68   18.03   19.33
> predict(auto.lm, x.plus, interval="prediction", se.fit=T, level=0.95)
     Fit   Lower   Upper
   18.68   10.05   27.30
Gauss-Markov & Optimal Estimation

Gaussian Linear Model: Efficiency of LSE (Optimality)

Q: Geometry suggests that the LSE β̂ is a sensible estimator. But is it the best we can come up with?
A: Yes, β̂ is the unique minimum variance unbiased estimator of β.
(To be seen in the Statistical Theory course, since β̂ is sufficient and complete.)
Thus, in the Gaussian linear model, the LSE are the best we can do as far as unbiased estimators go.
(Actually one can show that S² is the optimal unbiased estimator of σ², by similar arguments.)

Second Order Assumptions: Optimality in a weaker setting?

The crucial assumption so far was
    Normality: ε ~ N_n(0, σ²I).
What if we drop this strong assumption and assume something weaker?
    Uncorrelatedness: E[ε] = 0 & var[ε] = σ²I
(notice we do not assume any particular distribution).
How well do our LSE estimators perform in this case?
(Note that in this setup the observations may not be independent — uncorrelatedness implies independence only in the Gaussian case.)

Second Order Assumptions

For a start, we retain unbiasedness:

Lemma. If we only assume E[ε] = 0 & var[ε] = σ²I instead of ε ~ N(0, σ²I), then the following remain true:
1. E[β̂] = β;
2. Var[β̂] = σ²(X^T X)^{-1};
3. E[S²] = σ².

But what about optimality properties?
Gauss–Markov Theorem

Theorem. Let Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}, with p < n, X having rank p, E[ε] = 0 and var[ε] = σ²I. Then
    β̂ = (X^T X)^{-1} X^T Y
is the best linear unbiased estimator of β; that is, for any linear unbiased estimator β̃ of β, it holds that
    var(β̃) − var(β̂) ⪰ 0.

Proof. Let β̃ be linear and unbiased, in other words:
    β̃ = AY, for some A_{p×n};
    E[β̃] = β, for all β ∈ R^p.
These two properties combine to yield
    β = E[β̃] = E[AY] = E[AXβ + Aε] = AXβ,   ∀ β ∈ R^p
    ⟹ (AX − I)β = 0,   ∀ β ∈ R^p.
We conclude that the null space of (AX − I) is the entire R^p, and so AX = I. Now, if E[ε] = 0 and cov[ε] = σ²I,
    var[β̃] − var[β̂] = A σ²I A^T − σ²(X^T X)^{-1}
                     = σ² { A A^T − AX (X^T X)^{-1} X^T A^T }
                     = σ² A (I − H) A^T
                     = σ² A (I − H)(I − H)^T A^T
                     ⪰ 0.

Large Sample Distribution of β̂

↪ Gauss–Markov says β̂ is the optimal linear unbiased estimator when cov(ε) = σ²I, regardless of whether or not ε is Gaussian.
Question: What can we say about the distribution of β̂ when cov(ε) = σ²I, but ε is not necessarily Gaussian?
Note that we can always write
    β̂ − β = (X^T X)^{-1} X^T ε.
Since there is a huge variety of candidate distributions for ε that would be compatible with the property cov(ε) = σ²I, we cannot say very much about the exact distribution of β̂ − β = (X^T X)^{-1} X^T ε.
Can we at least hope to say something about this distribution asymptotically, as the sample becomes large?

Large Sample Distribution of β̂

Large sample ⟺ increasing number of observations: we let n → ∞ (the number of rows of X tends to infinity), with the number of columns of X, i.e. p, held fixed.

Theorem (Large Sample Distribution of β̂). Let {X_n}_{n≥1} be a sequence of n × p design matrices and Y_n = X_n β + ε_n. If
1. X_n is of full rank p for all n ≥ 1,
2. max_{1≤i≤n} [ x_i^T (X_n^T X_n)^{-1} x_i ] → 0 as n → ∞ (where x_i^T is the i-th row of X_n),
3. ε_n is a zero-mean n-vector with i.i.d. coordinates of variance σ²,
then the least squares estimator β̂_n = (X_n^T X_n)^{-1} X_n^T Y_n satisfies
    (X_n^T X_n)^{1/2} (β̂_n − β) →d N_p(0, σ²I).
Large Sample Distribution of β̂

The theorem's conclusion can be interpreted as:
    for n "large enough",  β̂ ≈d N{β, σ²(X_n^T X_n)^{-1}},
i.e. the distribution of β̂ gradually becomes the same as what it would be if ε were Gaussian ... provided the design matrix X satisfies the extra condition (2).
Condition (2) can be shown to be equivalent to: the diagonal elements of H_n = X_n(X_n^T X_n)^{-1} X_n^T, say h_jj(n), converge to zero uniformly in j as n → ∞.
Note that trace(H_n) = p, so that the average h_jj(n) is p/n → 0 — the question is: do all the h_jj(n) → 0 uniformly?
This has a very clear interpretation in terms of the form of the design, which we will see when we discuss the notions of leverage and influence.

Large Sample Distribution of β̂

To understand Condition (2), consider the simple linear model
    Y_i = β_0 + β_1 t_i + ε_i,   i = 1, ..., n.
Here p = 2. One can show that
    h_jj(n) = 1/n + (t_j − t̄)² / Σ_{k=1}^n (t_k − t̄)².
Suppose t_i = i, for i = 1, ..., n (regular grid). Then
    h_jj(n) = 1/n + {j − (n + 1)/2}² / {n(n² − 1)/12},
so
    max_{1≤j≤n} h_jj(n) = h_nn(n) = 1/n + 6(n − 1)/{n(n + 1)} → 0 as n → ∞.

Large Sample Distribution of β̂

Now consider t_i = 2^i (grid points spread apart as n grows). The centre of mass and sum of squares of the grid points are now
    t̄ = 2(2^n − 1)/n,   Σ_{i=1}^n (t_i − t̄)² = (4/3)(4^n − 1) − (2^{n+1} − 2)²/n,
and so
    max_{1≤j≤n} h_jj(n) = h_nn(n) → 3/4  as n → ∞
(so condition (2) fails).

Proof (of the Large Sample Distribution theorem).
Recall that β̂_n − β = (X_n^T X_n)^{-1} X_n^T ε_n. We will show that for any unit vector u,
    u^T (X_n^T X_n)^{-1/2} X_n^T ε_n →d N(0, σ²),
and then the theorem will be proven by the Cramér–Wold device(*). Now notice that
    u^T (X_n^T X_n)^{-1/2} X_n^T ε_n = γ_n^T ε_n,
where:
1. γ_n = (γ_{n,1}, ..., γ_{n,n})^T = ( u^T (X_n^T X_n)^{-1/2} x_1, ..., u^T (X_n^T X_n)^{-1/2} x_n )^T;
2. γ_{n,i}² ≤ ‖u‖² ‖(X_n^T X_n)^{-1/2} x_i‖² = x_i^T (X_n^T X_n)^{-1} x_i   (Cauchy–Schwarz);
3. γ_n^T γ_n = u^T (X_n^T X_n)^{-1/2} (X_n^T X_n) (X_n^T X_n)^{-1/2} u = 1.
Consequently, the result follows from the weighted sum CLT upon noticing:
    max_{1≤i≤n} γ_{n,i}² / Σ_{k=1}^n γ_{n,k}² ≤ max_{1≤i≤n} x_i^T (X_n^T X_n)^{-1} x_i → 0.
(*) Cramér–Wold: ξ_n →d ξ in R^d if and only if u^T ξ_n →d u^T ξ in R for all unit vectors u.

Diagnostics
Assumptions to Check for

Four basic assumptions are inherent in the Gaussian linear regression model:
- Linearity: E[Y] is linear in X.
- Homoskedasticity: var[ε_j] = σ² for all j = 1, ..., n.
- Gaussian distribution: the errors are Normally distributed.
- Independent errors: ε_i independent of ε_j for i ≠ j.
How do we check these assumptions?

Scientific reasoning: it is impossible to validate model assumptions. We cannot prove that the assumptions hold; we can only provide evidence in favour of (or against!) them.
Strategy:
- Find implications of each assumption that we can check graphically (mostly concerning residuals).
- When one of these assumptions fails clearly, then Gaussian linear regression is inappropriate as a model for the data.
- Construct appropriate plots and assess them (requires experience).
- "Magical thinking": beware of overinterpreting plots!
Isolated problems, such as outliers and influential observations, also deserve investigation. They may or may not decisively affect model validity.
Residuals Revisited

Residuals e: the basic tool for checking assumptions.
    e = y − ŷ = y − Xβ̂ = (I − H)y = (I − H)ε.
Intuition: the residuals represent the aspects of y that cannot be explained by the columns of X.
Since ε ~ N_n(0, σ²I), if the model is correct we should have e ~ N_n{0, σ²(I − H)}. So if the assumptions hold:
    e_i ~ N{0, σ²(1 − h_ii)},   cov(e_i, e_j) = −σ² h_ij.
Note that the residuals are correlated, and that they have unequal variances. Define the standardised residuals:
    r_i := e_i / ( s √(1 − h_ii) ),   i = 1, ..., n.
These are still correlated but have variance 1.
(One can decorrelate by U^T e, where H = U Λ U^T — why?)

Checking for Linearity

A first impression can be drawn by looking at plots of the response against each of the explanatory variables. Other plots to look at? Notice that under the assumption of linearity we have
    X^T e = 0.
Hence, no correlation should appear between explanatory variables and residuals.
- Plot the standardised residuals r against each explanatory variable (columns of X).
  ↪ No systematic patterns should appear in these plots. A systematic pattern would suggest incorrect dependence of the response on the particular explanatory (e.g. the need to add a transformation of that explanatory as an additional variable).
- Plot the standardised residuals r against explanatories left out of the model.
  ↪ No systematic patterns should appear in these plots. A systematic pattern suggests that we have left out an explanatory variable that should have been included.

[Figures: residual plots illustrating "Linearity OK" and "Linearity NOT OK – need to add sin(x1) in model".]
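In R the standardised residuals r_i are returned by rstandard(); a minimal sketch of the residual plots just described, on simulated data in which a sin(x1) term has deliberately been left out of the fitted model:

    set.seed(13)
    n  <- 200
    x1 <- runif(n, 0, 2 * pi); x2 <- rnorm(n)
    y  <- 1 + sin(x1) + 0.5 * x2 + rnorm(n, sd = 0.2)

    fit <- lm(y ~ x1 + x2)          # wrong mean structure: sin(x1) modelled as linear in x1
    r   <- rstandard(fit)           # standardised residuals e_i / (s * sqrt(1 - h_ii))

    par(mfrow = c(1, 2))
    plot(x1, r)                     # systematic pattern: linearity NOT OK, add sin(x1)
    plot(x2, r)                     # no pattern: this explanatory looks fine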
Important Covariate Left out

[Figure: residual plot showing a systematic pattern caused by an important covariate left out of the model.]

Checking for Homoskedasticity

Homoskedastic = "same" + "spread".
According to our model assumptions, the variance of the errors ε_j should be the same across indices: var(ε_j) = σ².
- Plot r against the fitted values ŷ. (Why not against y?)
  ↪ A random scatter should appear, with approximately constant spread of the values of r for the different values of ŷ. "Trumpet" or "bulging" effects indicate failure of the homoskedasticity assumption.
  ↪ Since ŷ^T e = 0, this plot can also be used to check linearity, as before.

[Figures: residual plots illustrating "Homoskedasticity OK" and "Heteroskedasticity (i.e. lack of homoskedasticity)".]
Checking for Normality

Idea: compare the distribution of the standardised residuals against a Normal distribution. How? Compare the empirical with the theoretical quantiles ...
The p-quantile (p ∈ [0, 1]) of a distribution F is the value ξ_p defined as
    ξ_p := inf{ ξ ∈ R : F(ξ) ≥ p }.
Notation: ξ_p = F^{-1}(p) (although the inverse may not be well defined). Given a sample X_1, ..., X_n, the empirical p-quantile is the value ξ̂_p defined as
    ξ̂_p := inf{ ξ ∈ R : #{X_i ≤ ξ}/n ≥ p }.
Notation: ξ̂_p = F̂_n^{-1}(p).

Checking for Normality

A quantile plot for a given sample plots certain empirical quantiles against the corresponding theoretical quantiles (i.e. those under the assumed distribution). If the sample at hand originates from F, then we expect the points of the plot to fall close to the 45° line.
- Plot the empirical {k/n}_{k=1}^n quantiles of the standardised residuals, r_(1) ≤ r_(2) ≤ ... ≤ r_(n), against the theoretical quantiles Φ^{-1}{1/(n+1)}, ..., Φ^{-1}{n/(n+1)} of a N(0, 1) distribution.
  ↪ Think why we pick k/(n+1) instead of k/n.
  ↪ If the points of the quantile plot deviate significantly from the 45° line, there is evidence against the normality assumption. Outliers, skewness and heavy tails are easily revealed.
- Beware of overinterpretation when n is small!

[Figures: QQ plots for n = 50, n = 100 and n = 300, and a QQ plot where normality is NOT OK.]
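The normal quantile plot of the standardised residuals takes one line in R; a minimal sketch on an arbitrary simulated fit:

    set.seed(14)
    x <- runif(100); y <- 1 + 2 * x + rnorm(100)
    fit <- lm(y ~ x)

    r <- rstandard(fit)      # standardised residuals
    qqnorm(r)                # empirical quantiles of r against N(0,1) quantiles
    abline(0, 1, lty = 2)    # the 45-degree line; points should hug it if normality holds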
Checking for Independence

It is assumed that $\operatorname{var}[\varepsilon] = \sigma^2 I$. Under the assumption of normality
this is equivalent to independence.

It is difficult to check this assumption in practice. One thing to check for is clustering,
which may suggest dependence, e.g. identifying groups of related individuals with correlated
responses.

When observations are time-ordered we can look at the correlation
$\operatorname{corr}[r_t, r_{t+k}]$ or the partial correlation
$\operatorname{corr}[r_t, r_{t+k} \mid r_{t+1}, \ldots, r_{t+k-1}]$. When such correlations
exist, we enter the domain of time series.

Existence of dependence:
inflates standard errors,
seriously affects estimator reliability.

Figure: ACF and partial ACF of the standardised residuals (series std.residual) against lag.
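A sketch of the corresponding correlograms in R (again assuming a fitted model `fit` with
time-ordered observations; the object name is an assumption):

```r
## Correlograms of the standardised residuals for time-ordered data
std.residual <- rstandard(fit)
par(mfrow = c(1, 2))
acf(std.residual,  main = "Series std.residual")   # autocorrelations corr(r_t, r_{t+k})
pacf(std.residual, main = "Series std.residual")   # partial autocorrelations
par(mfrow = c(1, 1))
```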
Identifying Influential Observations

An influential observation can usually be categorised as an:
outlier (relatively easier to spot by eye), OR
leverage point (not as easy to spot by eye).

Influential observations:
may or may not decisively affect model validity;
require scrutiny on an individual basis and consultation with the data expert;
may reveal unanticipated aspects of the scientific problem that are worth studying, and so must
not simply be scorned as "non-conformists"!

David Brillinger (Berkeley): "You will not find your Nobel prize in the fit, you will find it in
the outliers!"

Outliers

An outlier is an observation that stands out in some way from the rest of the observations,
causing surprise! An exact mathematical definition exists (Tukey) but we will not pursue it.

In regression, outliers are points falling far from the cloud surrounding the regression line
(or surface). They have the effect of "pulling" the regression line (surface) toward them.

Outliers can be checked for visually through:
The regression scatterplot → points that can be seen to fall relatively far from the point cloud
surrounding the regression line (surface).
Residual plots → points that fall beyond (−2, 2) in the $(\hat y, r)$ plot.

Outliers may result from a data registration error, or a single extreme event. They can, however,
result from a deeper inadequacy of our model (especially if there are many!).

Figure: "An Outlier".
Figure: "Professor's Van: Outliers".
Leverage and Leverage Points

Outliers may be influential: they "stand out" in the "y-dimension". However, an observation may
also be influential because of unusual values in the "x-dimension". Such influential observations
cannot be so easily detected through plots.

Call $(x_j, y_j)$ the j-th case and notice that
$$\operatorname{var}(y_j - \hat y_j) = \operatorname{var}(e_j) = \sigma^2(1 - h_{jj}).$$
If $h_{jj} \approx 1$, then the model is constrained so that $\hat y_j = x_j^\top\hat\beta \simeq y_j$
(i.e., a separate parameter is entirely devoted to fitting this observation!).

$h_{jj}$ is called the leverage of the j-th case.
Since $\operatorname{trace}(H) = \sum_{j=1}^n h_{jj} = p$, we cannot have low leverage for all cases.
A good design corresponds to $h_{jj} \simeq p/n$ for all j
(i.e. the assumption $\max_{j\le n} h_{jj} \to 0$ as $n \to \infty$ of the asymptotic theorem is satisfied).
Leverage point (rule of thumb): if $h_{jj} > 2p/n$ the observation needs further scrutiny, e.g.
fitting again without the j-th case and studying the effect.

Outlier + Leverage Point = TROUBLE

Figure: "A (very) Noticeable Leverage Point".
Assessing the Influence of an Observation

How do we find cases having a strong effect on the fitted model?
Idea: see the effect when case j, i.e. $(x_j, y_j)$, is dropped.

Let $\hat\beta_{-j}$ be the LSE when the model is fitted to the data without case j, and let
$\hat y_{-j} = X\hat\beta_{-j}$ be the corresponding fitted value. Define Cook's distance
$$C_j = \frac{1}{p\,s^2}(\hat y - \hat y_{-j})^\top(\hat y - \hat y_{-j}),$$
which measures the scaled distance between $\hat y$ and $\hat y_{-j}$. One can show that
$$C_j = \frac{r_j^2\, h_{jj}}{p\,(1 - h_{jj})},$$
so a large $C_j$ implies a large $r_j$ and/or a large $h_{jj}$.

Cases with $C_j > 8/(n - 2p)$ are worth a closer look (rule of thumb).
Plot $C_j$ against the index $j = 1, \ldots, n$ and compare with the $8/(n - 2p)$ level.

Figure: "A Cook Distance Plot".
Summary

Diagnostic plots usually constructed:

y against columns of X → check for linearity and outliers
standardised residual r against columns of X → check for linearity
r against explanatories not included → check for variables left out
r against fitted values $\hat y$ → check for homoskedasticity
Normal quantile plot → check for normality
Cook's distance plot → check for influential observations
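A compact sketch of this diagnostic battery in R (assuming a fitted lm object `fit`; the object
name is illustrative):

```r
## Standard diagnostic battery for a fitted linear model
r    <- rstandard(fit)                            # standardised residuals
yhat <- fitted(fit)

plot(yhat, r); abline(h = c(-2, 0, 2), lty = 2)   # homoskedasticity / linearity / outliers
qqnorm(r); qqline(r)                              # normality
plot(cooks.distance(fit), type = "h")             # influential observations
abline(h = 8 / (nobs(fit) - 2 * length(coef(fit))), lty = 2)  # 8/(n - 2p) reference level

## leverages: flag cases with h_jj > 2p/n
h <- hatvalues(fit)
which(h > 2 * length(coef(fit)) / nobs(fit))
```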
Detour: Very Brief Reminder on Testing Hypotheses
Hypothesis Testing Setup

Scientific theories lead to assertions that are testable using empirical data. Data may discredit
the theory (call it the hypothesis) or not (i.e., the empirical findings are reasonable under the
hypothesis).

Example: the theory of "luminiferous aether" in the late 19th century to explain light travelling
in vacuum; discredited by the Michelson-Morley experiment.

Similarities with the logical/mathematical concept of a necessary condition.

Hypothesis Testing Setup

$H_0$: the null hypothesis → the scientific theory under scrutiny
Y: the data → the experimental setup to test the theory
$T(\cdot)$: a test statistic, assumed positive

INTUITION:
The null hypothesis would predict a certain plausible range of values for T(Y) (plausible results
of the experiment). We would say that the assertion made by the null hypothesis (theory) is not
supported by the data if T(Y) is an extreme (unlikely) observation given the range of plausible
values predicted by the hypothesis (if the experimental evidence appears to be inconsistent with
the theory).
Hypothesis Testing Setup

Plausibility of different values of $T(\cdot)$ under the theory $H_0$
→ described by the distribution of T(Y) under the null hypothesis: $P_{H_0}[T(Y) \in \cdot\,]$.

Suppose that we perform the experiment T(Y) and the result is T(Y) = t. The result t is judged to
be incompatible with the hypothesis when
$$p = P_{H_0}[T(Y) \ge t]$$
is small. The value p is called the p-value.

Small values of p suggest that we have observed something which is unlikely to happen if $H_0$
holds true. Large values of p suggest that what we have observed is plausible if $H_0$ holds true.

(The choice of T is often guided by an alternative hypothesis $H_1$, under which T should be large.)

Thus we reject the null hypothesis when p is small.

Example: Mean of a Normal Distribution
Let $X \sim N(\mu, \sigma^2)$, unknown mean, known variance.
$H_0$: $\mu = 0$.
Data: $Y = (X_1, \ldots, X_n)$, $X_i \overset{d}{=} X$, $X_i$ independent of $X_j$ for $i \neq j$.

Test statistic: $T(Y) = \Bigl(\dfrac{\sum_i X_i}{\sigma\sqrt{n}}\Bigr)^2$ (tends to be large when $\mu \neq 0$).

Perform the experiment (i.e., obtain values $y = (x_1, \ldots, x_n)$) and observe $T(y) = t$.

Under the null hypothesis, $T(Y) \sim \chi^2_1$. Hence:
$$p = P_{H_0}[T(Y) \ge t] = P[\chi^2_1 \ge t] = P\bigl[\{N(0,1) \ge \sqrt{t}\} \cup \{N(0,1) \le -\sqrt{t}\}\bigr].$$
Usually reject when p < 0.05.
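As an illustrative sketch, the p-value of this $\chi^2_1$ test can be computed in R (the data
vector `x` and the known standard deviation `sigma` are placeholders):

```r
## p-value for H0: mu = 0 with known variance sigma^2
t.obs <- (sum(x) / (sigma * sqrt(length(x))))^2      # observed test statistic
p.val <- pchisq(t.obs, df = 1, lower.tail = FALSE)
## equivalently, via the two standard-normal tails:
p.val2 <- 2 * pnorm(sqrt(t.obs), lower.tail = FALSE)
```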
Example: Testing for $c^\top\beta = 0$ in a Gaussian Regression

Let $Y \sim N(X\beta, \sigma^2 I)$, unknown $\beta$, unknown variance.
$H_0$: $c^\top\beta = 0$.
Data: $(y, X)$.

Test statistic:
$$T(Y) = \Biggl(\frac{c^\top\hat\beta}{S\sqrt{c^\top(X^\top X)^{-1}c}}\Biggr)^2.$$
Suppose we observe $T(y) = t$ and let $W \sim t_{n-p}$. Then
$$p = P_{H_0}[T(Y) \ge t] = P\bigl[\{W \ge \sqrt{t}\} \cup \{W \le -\sqrt{t}\}\bigr].$$
Reject the null hypothesis if $p < \alpha$, for some small $\alpha$.

Example continued and two comments

For continuous test statistics with everywhere positive densities, if we reject $H_0$ whenever
$p < \alpha$, then our (type I) error probability is $\alpha$.
→ The probability of rejecting $H_0$ when in fact $H_0$ is true is $\alpha$.

There is a close link with confidence intervals.
→ We will only illustrate this link in a specific example: the test above is identical to building
a $1 - \alpha$ confidence interval for $c^\top\beta$ based on
$\dfrac{c^\top\hat\beta}{S\sqrt{c^\top(X^\top X)^{-1}c}}$
and rejecting the hypothesis $H_0$ if and only if the interval does not contain zero.
Many many issues remain (this was just a reminder!)

The role of an alternative hypothesis.
How do we choose a test statistic?
Are there optimal tests in a given situation?
Simple and composite hypotheses.
One- and two-sided tests.
Limitations of hypothesis testing . . .
. . .

Review your 2nd year Probability/Statistics course!

Nested Model Selection & ANOVA
Comparing Nested Models

Consider the model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon.$$
This will always have higher $R^2$ than the sub-model:
$$y = \beta_0 + \beta_1 x_1 + \varepsilon.$$
Why? (think of the geometry . . . )

The question is: is the first model significantly better than the second one?
→ i.e. does the first model explain the variation adequately enough, or should we incorporate
extra explanatory variables? We need a quantitative answer.

Rephrasing the Question: Gaussian Linear Model

The model is $y = X\beta + \varepsilon$ with $\varepsilon \sim N_n(0, \sigma^2 I)$. Estimate:
$$\hat\beta = (X^\top X)^{-1}X^\top y.$$
Interpretation: $\hat y = X\hat\beta = Hy$ is the projection of y onto the column space of X,
$\mathcal M(X)$. This subspace has dimension p, when X is of full column rank p.

Now for q < p write X in block notation as
$$X = (\underbrace{X_1}_{n\times q}\ \ \underbrace{X_2}_{n\times(p-q)}).$$
Interpretation: $X_1$ is built from the first q columns of X and $X_2$ from the rest. Similarly
write $\beta = (\beta_1\ \beta_2)^\top$ so that:
$$y = X\beta + \varepsilon = (X_1\ X_2)\binom{\beta_1}{\beta_2} + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon.$$
Our question can now be stated as: Is $\beta_2 = 0$?
Residual Sums of Squares

Let $H_1 = X_1(X_1^\top X_1)^{-1}X_1^\top$, $\hat y_1 = H_1 y$ and $e_1 = y - \hat y_1$.
Pythagoras tells us that:
$$\underbrace{\|y - \hat y_1\|^2}_{RSS(\hat\beta_1) = \|e_1\|^2}
= \underbrace{\|y - \hat y\|^2}_{RSS(\hat\beta) = \|e\|^2}
+ \underbrace{\|\hat y - \hat y_1\|^2}_{RSS(\hat\beta_1) - RSS(\hat\beta) = \|e - e_1\|^2}.$$
Notice that $RSS(\hat\beta_1) \ge RSS(\hat\beta)$ always (think why!).

So the idea is simple: to see if it is worthwhile to include $\beta_2$ we will compare how much
larger $RSS(\hat\beta_1)$ is compared to $RSS(\hat\beta)$. Equivalently, we can look at a ratio like
$$\{RSS(\hat\beta_1) - RSS(\hat\beta)\}/RSS(\hat\beta).$$
To construct a test based on this quantity, we need to figure out distributions . . .

Geometry Revisited

Figure: geometric illustration of the projections $\hat y$, $\hat y_1$ and the residuals e, $e_1$.
Distributions of Sums of Squares

Theorem
We have the following properties:
(A) $e - e_1 \perp e$;
(B) $\|e\|^2 = RSS(\hat\beta)$ and $\|e_1 - e\|^2 = RSS(\hat\beta_1) - RSS(\hat\beta)$ are independent;
(C) $\|e\|^2 \sim \sigma^2\chi^2_{n-p}$;
(D) under the hypothesis $H_0: \beta_2 = 0$, $\|e_1 - e\|^2 \sim \sigma^2\chi^2_{p-q}$.

Proof.
(A) holds since $e - e_1 = y - \hat y - y + \hat y_1 = -\hat y + \hat y_1 \in \mathcal M(X_1, X_2)$,
but $e \in [\mathcal M(X_1, X_2)]^\perp$.

To show (B), we notice that
$$e_1 = (I - H_1)y = (I - H_1 H)y,$$
because $\mathcal M(X_1) \subseteq \mathcal M(X_1, X_2)$. Therefore,
$$e - e_1 = (I - H)y - (I - H_1 H)y = y - Hy - y + H_1 H y = (H_1 - I)Hy.$$
But recall that $y \sim N(X\beta, \sigma^2 I)$. Therefore, to prove independence of
$e - e_1 = (H_1 - I)Hy$ and $e = (I - H)y$, we need to show that
$$(H_1 - I)H[\sigma^2 I](I - H)^\top = 0.$$
This is clearly the case since $H(I - H) = 0$, proving (B).

(C) follows immediately, since we have already proven last time that
$RSS(\hat\beta) \sim \sigma^2\chi^2_{n-p}$ for all $\beta$ (even when $\beta_2 \neq 0$).
Proof (continued).
To prove (D), we note that
$$e - e_1 = (H_1 - I)Hy \sim N\bigl\{(H_1 - I)HX\beta,\ \sigma^2\underbrace{(H_1 - I)HH^\top(H_1 - I)^\top}_{= H - H_1}\bigr\}.$$
But $HX = X(X^\top X)^{-1}X^\top X = X$. So, in block notation,
$$e - e_1 \sim N\bigl((H_1 - I)X_1\beta_1 + (H_1 - I)X_2\beta_2,\ \sigma^2(H - H_1)\bigr).$$
Now $(I - H_1)X_1\beta_1 = 0$ always, since $I - H_1$ projects onto $\mathcal M^\perp(X_1)$. Therefore,
$$e - e_1 \sim N(0, \sigma^2(H - H_1)) \quad\text{when } \beta_2 = 0.$$
Now observe that $(H - H_1)^\top = H - H_1$ and $(H - H_1)^2 = H - H_1$ (because
$\mathcal M(X_1) \subseteq \mathcal M(X_1, X_2)$). Thus,
$$e - e_1 \sim N(0, \sigma^2(H - H_1)^2) \;\Longrightarrow\; e - e_1 \overset{d}{=} (H - H_1)\varepsilon$$
$$\Longrightarrow\; RSS(\hat\beta_1) - RSS(\hat\beta) = \|e - e_1\|^2 \overset{d}{=} \varepsilon^\top(H - H_1)\varepsilon \sim \sigma^2\chi^2_{p-q},$$
since $H - H_1$ is symmetric idempotent with trace $p - q$ and $\varepsilon \sim N(0, \sigma^2 I_n)$.

Relative Sum of Squares Reductions

Corollary
We conclude that, under the hypothesis $\beta_2 = 0$,
$$\frac{\bigl(RSS(\hat\beta_1) - RSS(\hat\beta)\bigr)/(p - q)}{RSS(\hat\beta)/(n - p)} \sim F_{p-q,\,n-p}.$$
The F-Test

The distributional results suggest the following test:

Have $Y \sim N(X_1\beta_1 + X_2\beta_2, \sigma^2 I)$.
$H_0$: $\beta_2 = 0$.
Data: $(y, X_1, X_2)$.

Test statistic:
$$T = \frac{\bigl(RSS(\hat\beta_1) - RSS(\hat\beta)\bigr)/(p - q)}{RSS(\hat\beta)/(n - p)}.$$
Then, under $H_0$, it holds that $T \sim F_{p-q,\,n-p}$. Suppose we observe $T = t$. Then
$$p = P_{H_0}[T(Y) \ge t] = P[F_{p-q,\,n-p} \ge t].$$
Reject the null hypothesis if $p < \alpha$, for some small $\alpha$, usually 0.05.

Example: Nested Models in Cement Data

We fitted the model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon.$$
But would the following simpler model in fact be adequate?
$$y = \beta_0 + \beta_1 x_1 + \varepsilon.$$
Intuitively: is the extra explanatory power of the "larger" model significant enough to justify
its use instead of a simpler model? (i.e., is the residual vector for the "larger" model
significantly smaller than that of the simpler model?)

In this case, $n = 13$, $p = 5$, $q = 2$, and
$$RSS(\hat\beta) = 47.86, \qquad RSS(\hat\beta_1) = 1265.7,$$
yielding
$$t = \frac{(1265.7 - 47.86)/(5 - 2)}{47.86/(13 - 5)} = 67.86.$$
Then $p = P[F_{3,8} \ge 67.86] = 4.95 \times 10^{-6}$, so we reject the hypothesis
$H_0: \beta_2 = \beta_3 = \beta_4 = 0$.
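A sketch of this comparison in R, using anova() on two nested fits (the data frame `cement` and
the variable names y, x1-x4 are placeholders for the cement data):

```r
## F-test comparing nested models via their residual sums of squares
fit.small <- lm(y ~ x1, data = cement)
fit.full  <- lm(y ~ x1 + x2 + x3 + x4, data = cement)
anova(fit.small, fit.full)   # reports {(RSS1 - RSS)/(p - q)} / {RSS/(n - p)} and its F p-value
```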
Example: MPG vs Horsepower

We can fit the quadratic model:
$$\mathrm{MPG} = \beta_0 + \beta_1\,\mathrm{horsepower} + \beta_2\,\mathrm{horsepower}^2 + \varepsilon.$$
But would the model with only the linear term suffice?
$$\mathrm{MPG} = \beta_0 + \beta_1\,\mathrm{horsepower} + \varepsilon.$$
Intuitively: is the reduction in RSS afforded by the "complex" model substantial enough to justify
its use instead of the simpler model?

In this case, $n = 392$, $p = 3$, $q = 2$, and
$$RSS(\hat\beta) = 1943.9, \qquad RSS(\hat\beta_1) = 14433.1,$$
yielding
$$t = \frac{(14433.1 - 1943.9)/(3 - 2)}{1943.9/(392 - 3)} = \frac{12489.2}{1943.9/389} \approx 2499.$$
The corresponding $p = P[F_{1,389} \ge t]$ is below $2.2 \times 10^{-16}$, so we reject the
hypothesis $H_0: \beta_2 = 0$.

Figure: gas mileage (MPG) against horsepower for the 392 car models, with the fitted curves.
The Analysis of Variance

Let $1, X_1, \ldots, X_r$ be groups of columns of X (the "terms"), such that
$$X = (\underbrace{1}_{n\times 1}\ \underbrace{X_1}_{n\times q_1}\ \underbrace{X_2}_{n\times q_2}\ \cdots\ \underbrace{X_r}_{n\times q_r}),
\qquad \beta = (\beta_0\ \beta_1\ \beta_2\ \cdots\ \beta_r)^\top.$$
We have
$$y = X\beta + \varepsilon = 1\beta_0 + X_1\beta_1 + \cdots + X_r\beta_r + \varepsilon.$$
We would like to do the same "F-test investigation", but this time do it term-by-term. That is,
we want to look at the following sequence of nested models:
$$y = 1\beta_0 + \varepsilon$$
$$y = 1\beta_0 + X_1\beta_1 + \varepsilon$$
$$y = 1\beta_0 + X_1\beta_1 + X_2\beta_2 + \varepsilon$$
$$\vdots$$
$$y = 1\beta_0 + X_1\beta_1 + X_2\beta_2 + \cdots + X_r\beta_r + \varepsilon$$

The Analysis of Variance

Proceed similarly as before. Define:
$$\mathcal X_0 := 1 \quad\text{and}\quad \mathcal X_k = (X_0\ X_1\ X_2\ \cdots\ X_k), \quad k \in \{0, \ldots, r\};$$
$$H_k := \mathcal X_k(\mathcal X_k^\top\mathcal X_k)^{-1}\mathcal X_k^\top, \qquad
\hat y_k := H_k y, \qquad e_k = y - \hat y_k, \qquad k \in \{0, \ldots, r\}.$$
Note that $\hat y_0 = \bar y\,1$.

As before, Pythagoras implies
$$\underbrace{\|y - \hat y_0\|^2}_{\|e_0\|^2}
= \underbrace{\|y - \hat y_r\|^2}_{\|e_r\|^2 = RSS_r}
+ \underbrace{\|\hat y_r - \hat y_{r-1}\|^2}_{RSS_{r-1} - RSS_r}
+ \cdots + \underbrace{\|\hat y_1 - \hat y_0\|^2}_{RSS_0 - RSS_1},$$
i.e.
$$\|e_0\|^2 = \|e_r\|^2 + \sum_{k=0}^{r-1}\underbrace{\|e_{k+1} - e_k\|^2}_{RSS_k - RSS_{k+1}},$$
with $RSS_k$ the residual sum of squares for $\hat y_k$, having $\nu_k$ degrees of freedom.
The Analysis of Variance

Some observations:
$RSS_k - RSS_{k+1}$ is the reduction in residual sum of squares caused by adding $X_{k+1}$, when
the model already contains $X_0, \ldots, X_k$.
$RSS_r$ and $\{RSS_k - RSS_{k+1}\}_{k=0}^{r-1}$ are all mutually independent.
Obviously, $\nu_0 \ge \nu_1 \ge \nu_2 \ge \cdots \ge \nu_r$, and $\nu_{k+1} = \nu_k$ if
$X_{k+1} \in \mathcal M(\mathcal X_k)$.

Given this information, we want to see how adding each term to the model sequentially affects the
explanatory capacity of the model.
→ In other words, we want to investigate the reduction in the residual sum of squares achieved by
adding each term to the model. Is this significant?

ANOVA Table

Terms                 df        Reduction in RSS        Residual RSS
1                     n - 1     -                       RSS_0
1, X_1                nu_1      RSS_0 - RSS_1           RSS_1
1, X_1, X_2           nu_2      RSS_1 - RSS_2           RSS_2
...                   ...       ...                     ...
1, X_1, ..., X_r      nu_r      RSS_{r-1} - RSS_r       RSS_r

The F-statistic for testing the significance of the reduction in RSS when $X_k$ is added to the
model containing the terms $1, X_1, \ldots, X_{k-1}$ is
$$F_k = \frac{(RSS_{k-1} - RSS_k)/(\nu_{k-1} - \nu_k)}{RSS_r/\nu_r},$$
with $F_k \sim F_{\nu_{k-1} - \nu_k,\ \nu_r}$ under the null hypothesis $H_0: \beta_k = 0$.
Large values of $F_k$ relative to the null distribution are evidence against $H_0$.

Example: Nested Sequence in Cement Data

Reductions in the overall sum of squares when sequentially entering the terms x1, x2, x3 and x4.
Does adding extra variables improve the model significantly?

            Df   Red Sum Sq   F value    p-value
x1           1      1450.08    242.37   2.88e-07
x2           1      1207.78    201.87   5.86e-07
x3           1         9.79      1.64     0.2366
x4           1         0.25      0.04     0.8441
Residual     8        47.86

In this case, each term is a single column (variable).

Warning!

The significance of entering a term depends on how the sequence is defined: when entering terms
in a different order we get different results! (why?)
When a term is entered "early" and is significant, this does not tell us much (why?)
When a term is entered "late" and is significant, then this is quite informative (why?)
Why is this true? Are there special cases when the order of entering terms doesn't matter?
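A sketch of the sequential (Type I) ANOVA in R, again using placeholder names for the cement
data, which makes the order dependence visible:

```r
## Sequential ANOVA: terms are entered in the order given in the formula
fit <- lm(y ~ x1 + x2 + x3 + x4, data = cement)
anova(fit)                                        # reductions in RSS for x1, then x2, ...
anova(lm(y ~ x4 + x3 + x2 + x1, data = cement))   # different entry order, different reductions
```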
The Effect of Orthogonality

Consider terms $X_0 = 1, X_1, X_2$ from X, so
$$X = (\underbrace{X_0}_{n\times 1}\ \underbrace{X_1}_{n\times q_1}\ \underbrace{X_2}_{n\times q_2}),
\qquad \beta = (\beta_0\ \beta_1\ \beta_2)^\top.$$
Assume orthogonality of the terms, i.e. $X_i^\top X_j = 0$ for $i \neq j$. Notice that in this case
$$\hat\beta = \begin{pmatrix} X_0^\top X_0 & 0 & 0\\ 0 & X_1^\top X_1 & 0\\ 0 & 0 & X_2^\top X_2\end{pmatrix}^{-1}
(X_0\ X_1\ X_2)^\top y
\;\Longrightarrow\;
\hat\beta_0 = \bar y,\quad \hat\beta_1 = (X_1^\top X_1)^{-1}X_1^\top y,\quad \hat\beta_2 = (X_2^\top X_2)^{-1}X_2^\top y.$$
It follows that the reductions in sums of squares are unique, in the sense that they do not depend
upon the order of entry of the terms into the model (show this!).

Intuition: $X_i$ contains linear information for y that is completely independent of that in
$X_j$, $i \neq j$.

Model Selection / Collinearity / Shrinkage
Theory vs Practice

Theory: we are given a relationship
$$y = X\beta + \varepsilon$$
and asked to provide estimators, tests, confidence intervals, optimality properties . . .
. . . and we can do it with complete success!

Practice: we are given data (y, X) and suspect a linear relationship between y and some of the
columns of X. We don't know a priori which exactly!
→ Need to select a "most appropriate" subset of the columns of X.
→ General principle: parsimony (Latin parsimonia: sparingness; simplicity and least number of
requisites and assumptions; economy or frugality of components and associations).

Albert Einstein (1879-1955)

"Everything should be made as simple as possible, but no simpler."
William of Ockham (?1285-1347)

Occam's razor: It is vain to do with more what can be done with fewer.
Given several explanations of the same phenomenon, we should prefer the simplest.

Exploratory Data Analysis

Graphical exploration provides an initial picture:
plots of y against candidate variables;
plots of transformations of y against candidate variables;
plots of transformations of certain variables against y;
plots of pairs of candidate variables.

This will often provide a starting point, but:

Automatic Model Selection: we need objective model comparison criteria, as a screening device.
→ We saw how to do an F-test, but what if the models to be compared are not nested?

Automatic Model Building: for situations when p is large, so there are lots of possible models.
→ Automatic methods for building a model? We saw that ANOVA depends on the order of entry of the
variables into the model . . .

Automatic Model Selection

Consider a design matrix X with p variables. Denote the set of all models generated by X (the
model powerset) by $2^X$ → $2^p$ possible models! If we wish to consider k different
transformations of each variable, then p becomes $(1 + k)p$.

Fast algorithms (branch and bound, leaps in R) exist to fit them, but they don't work for large p,
and anyway . . . we need a criterion for comparison.

So given a collection of models, we need an automatic (objective) way to pick out a "best" one
(unfortunately we cannot look carefully at all of them, BUT NOTHING replaces careful scrutiny of
the final model by an experienced researcher).

Model Selection Criteria

Many possible choices, none universally accepted. Some (classical) possibilities:
Prediction error based criteria (CV)
Information criteria (AIC, BIC, . . . )
Mallows' Cp statistic

Before looking at these, let's introduce some terminology. Suppose that the truth is
$y = X\beta + \varepsilon$ but with $\beta_r = 0$ for some subset r of the coefficients.

The true model contains only the columns for which $\beta_r \neq 0$.
→ Equivalently, the true model uses $\tilde X$ as the design matrix, the latter being the matrix
of columns of X corresponding to non-zero coefficients.

A correct model is the true model plus extra columns.
→ Equivalently, a correct model has a design matrix $X_\diamond$ such that
$\mathcal M(\tilde X) \subseteq \mathcal M(X_\diamond)$.

A wrong model is a model that does not contain all the columns of the true model.
→ Equivalently, a wrong model has a design matrix $X_\diamond$ such that
$\mathcal M(\tilde X) \cap \mathcal M(X_\diamond) \neq \mathcal M(\tilde X)$.
Expected Prediction Error

We may wish to choose a model by minimising the error we make on average when predicting a future
observation given our model. Our "experiment" is: a design matrix X and a response y at X.

Every model $f \in 2^X$ will yield fitted values $\hat y(f) = H_f y$. Suppose we now obtain new,
independent responses $y_+$ for the same "experimental setup" X. Then one approach is to select
the model
$$f^\ast = \arg\min_{f \in 2^X}\ \underbrace{\tfrac1n\,\mathrm E\|y_+ - \hat y(f)\|^2}_{\Delta(f)},$$
where the expectation is taken over both y and $y_+$.

The Bias/Variance Tradeoff

Let X be a design matrix, and let $X_\diamond$ ($n\times p$) and $\tilde X$ ($n\times q$) be
matrices built using columns of X. Suppose that the true relationship between y and X is
$$y = \underbrace{\tilde X\gamma}_{=:\ \mu} + \varepsilon,$$
but we use the design matrix $X_\diamond$ instead of $\tilde X$ (i.e., we fit a different model).
Therefore our fitted values are
$$\hat y = X_\diamond(X_\diamond^\top X_\diamond)^{-1}X_\diamond^\top y = H_\diamond y.$$
Now suppose that we obtain new observations $y_+$ corresponding to the same design X:
$$y_+ = \tilde X\gamma + \varepsilon_+ = \mu + \varepsilon_+.$$
Then, observe that
$$y_+ - \hat y = \mu + \varepsilon_+ - H_\diamond(\mu + \varepsilon) = (I - H_\diamond)\mu + \varepsilon_+ - H_\diamond\varepsilon.$$

It follows that
$$\|y_+ - \hat y\|^2 = (y_+ - \hat y)^\top(y_+ - \hat y)
= \mu^\top(I - H_\diamond)\mu + \varepsilon^\top H_\diamond\varepsilon + \varepsilon_+^\top\varepsilon_+ + [\text{cross terms}].$$
Since E[cross terms] = 0 (why?), we observe that
$$\Delta = \begin{cases}
n^{-1}\mu^\top(I - H_\diamond)\mu + (1 + p/n)\sigma^2, & \text{if the model is wrong;}\\
(1 + p/n)\sigma^2, & \text{if the model is correct;}\\
(1 + q/n)\sigma^2, & \text{if the model is true.}
\end{cases}$$
Selecting a correct model instead of the true model brings in additional variance, because q < p.
Selecting a wrong model instead of the true model results in bias, since $(I - H_\diamond)\mu \neq 0$
when $\mu$ is not in the column space of $X_\diamond$.
We must find a balance between small variance (few columns in the model) and small bias (all
columns in the model).

Cross Validation

$\Delta$ is impossible to calculate (it depends on the unknown $\mu$ and $\sigma^2$), so we must
find a proxy (estimator) $\hat\Delta$.

Suppose that n is large, so that we can split the data in two pieces:
X, y used to estimate the model;
$X'$, $y'$ used to estimate the prediction error for the model.
The estimator of the prediction error will be
$$\hat\Delta = (n')^{-1}\|y' - X'\hat\beta\|^2.$$
In practice n can be small and we often cannot afford to split the data (the variance of
$\hat\Delta$ is too large). Instead we use the leave-one-out cross validation sum of squares:
$$n\,\hat\Delta_{CV} = CV = \sum_{j=1}^n (y_j - x_j^\top\hat\beta_{-j})^2,$$
where $\hat\beta_{-j}$ is the estimate produced when dropping the j-th case.
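As the next slide shows, the leave-one-out sum of squares does not require n refits. A minimal
sketch in R (assuming a fitted lm object `fit`):

```r
## Leave-one-out cross validation without refitting n times,
## using CV = sum_j (y_j - x_j' beta.hat)^2 / (1 - h_jj)^2
e <- residuals(fit)
h <- hatvalues(fit)
CV  <- sum((e / (1 - h))^2)
GCV <- sum((e / (1 - mean(h)))^2)   # "generalised" CV: h_jj replaced by trace(H)/n
```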
Cross Validation

There is no need to perform n regressions, since
$$CV = \sum_{j=1}^n \frac{(y_j - x_j^\top\hat\beta)^2}{(1 - h_{jj})^2},$$
so the full regression may be used (show this!). Alternatively one may use a more stable version:
$$GCV = \sum_{j=1}^n \frac{(y_j - x_j^\top\hat\beta)^2}{(1 - \operatorname{trace}(H)/n)^2},$$
where "G" stands for "generalised", and we guard against any $h_{jj} \approx 1$. It holds that:
$$\mathrm E[GCV] = \frac{\mu^\top(I - H_\diamond)\mu}{(1 - p/n)^2} + \frac{n\sigma^2}{1 - p/n}.$$
→ Suggests a strategy: pick variables to minimise (G)CV.

Akaike's Information Criterion

Criteria can also be obtained based on the notion of information (relative entropy). Same basic
idea as for the prediction error: aim to choose the candidate model f(y) to minimise the
information distance
$$\int \log\frac{g(y)}{f(y)}\,g(y)\,dy \;\ge\; 0,$$
where g(y) represents the true model; this is equivalent to maximising the expected log likelihood
$$\int \log f(y)\,g(y)\,dy.$$
One can show that (apart from constants) the information distance is estimated by
$$\mathrm{AIC} = -2\hat\ell + 2p \qquad (\asymp\ n\log\hat\sigma^2 + 2p\ \text{in the linear model}),$$
where $\hat\ell$ is the maximised log likelihood for the given model, and p is the number of
parameters.

Other Information Criteria
An improved (corrected) version of AIC for regression problems:
$$\mathrm{AIC}_c = \mathrm{AIC} + \frac{2p(p+1)}{n - p - 1}.$$
One can also use Bayes' information criterion
$$\mathrm{BIC} = -2\hat\ell + p\log n.$$
Mallows suggested
$$C_p = \frac{SS_p}{s^2} + 2p - n,$$
where $SS_p$ is the RSS for the fitted model and $s^2$ estimates $\sigma^2$.

Comments:
AIC tends to choose models that are too complicated, but AICc cures this somewhat;
BIC is model selection consistent: if the true model is among those fitted, BIC chooses it with
probability → 1 as n → ∞ (for fixed p).

Simulation Experiment

For each $n \in \{10, 20, 40\}$ we construct 20 $n\times 7$ design matrices. We multiply each of
these design matrices from the right by $\beta = (1, 2, 3, 0, 0, 0, 0)^\top$ and add an $n\times 1$
Gaussian error. We do this independently 50 times, obtaining 1000 regressions with p = 3.
Selected models with 1 or 2 covariates have a bias term, and those with 4 or more covariates have
excess variance.

Table: for each n = 10, 20, 40, the number of covariates (1-7) selected by Cp, BIC, AIC and AICc
over the 1000 regressions.
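A sketch of comparing candidate models by information criteria in R (the data frame `dat` and the
variable names are placeholders):

```r
## Comparing candidate models by information criteria
fits <- list(m1 = lm(y ~ x1, data = dat),
             m2 = lm(y ~ x1 + x2, data = dat),
             m3 = lm(y ~ x1 + x2 + x3 + x4, data = dat))
sapply(fits, AIC)   # -2 * maximised log-likelihood + 2 * (number of parameters)
sapply(fits, BIC)   # -2 * maximised log-likelihood + log(n) * (number of parameters)
```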
Automatic Model Building

We saw so far:
Automatic Model Selection: build a set of models and select the "best" one.
Now look at a different philosophy:
Automatic Model Building: construct a single model in a way that would hopefully provide a
good one.

There are three standard methods for doing this:
Forward Selection
Backward Elimination
Stepwise Selection

CAUTION: although widely used, these have no theoretical basis. There is an element of
arbitrariness . . .

Forward/Backward/Stepwise Selection

Forward selection: starting from the model with the constant only,
1. add each remaining term separately to the current model;
2. if none of these terms is significant, stop; otherwise
3. update the current model to include the most significant new term; go to step 1.

Backward elimination: starting from the model with all terms,
1. if all terms are significant, stop; otherwise
2. update the current model by dropping the term with the smallest F statistic; go to step 1.

Stepwise: starting from an arbitrary model,
1. consider three options - add a term, delete a term, swap a term in the model for one not in
the model - and choose the most significant option;
2. if the model is unchanged, stop; otherwise go to step 1.

Some thoughts:
Each procedure may produce a different model.
A systematic search minimising prediction error, AIC or similar over all possible models is
preferable, BUT it is not always feasible (e.g., when p is large).
Stepwise methods can fit 'highly significant' models to purely random data!
The main problem is the lack of an objective function.
This can be improved by comparing Prediction Error/AIC for different models at each step: this
uses an objective function, but no systematic search.

Example: Nuclear Power Station Data
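A sketch of these searches in R via step(), which scores candidate moves by AIC rather than raw
significance (the data frame `dat` is a placeholder):

```r
## Stepwise search guided by an objective function (AIC)
full <- lm(y ~ ., data = dat)
null <- lm(y ~ 1, data = dat)
step(null, scope = formula(full), direction = "forward")
step(full, direction = "backward")
step(null, scope = formula(full), direction = "both")   # stepwise
```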
Data on light water reactors (LWR) constructed in the USA. The covariates are date (date
construction permit issued), T1 (time between application for and issue of permit), T2 (time
between issue of operating license and construction permit), capacity (power plant capacity in
MWe), PR (=1 if LWR already present on site), NE (=1 if constructed in north-east region of USA),
CT (=1 if cooling tower used), BW (=1 if nuclear steam supply system manufactured by
Babcock-Wilcox), N (cumulative number of power plants constructed by each architect-engineer),
PT (=1 if partial turnkey plant).

       cost     date   T1   T2   capacity   PR   NE   CT   BW    N   PT
  1   460.05   68.58   14   46        687    0    1    0    0   14    0
  2   452.99   67.33   10   73       1065    0    0    1    0    1    0
  3   443.22   67.33   10   85       1065    1    0    1    0    1    0
  4   652.32   68.00   11   67       1065    0    1    1    0   12    0
  5   642.23   68.00   11   78       1065    1    1    1    0   12    0
  6   345.39   67.92   13   51        514    0    1    1    0    3    0
  7   272.37   68.17   12   50        822    0    0    0    0    5    0
  8   317.21   68.42   14   59        457    0    0    0    0    1    0
  ..
 32   270.71   67.83    7   80        886    1    0    0    1   11    1
Example: Nuclear Power Station Data

Full model (magnitudes of the estimates and t statistics):

           Est      t
Int.     14.24   3.37
date      0.20   3.21
logT1     0.092  0.38
logT2     0.29   1.05
logcap    0.694  5.10
PR        0.092  1.20
NE        0.25   3.35
CT        0.12   1.82
BW        0.033  0.33
log(N)    0.08   1.74
PT        0.22   1.83
s (df)   0.164 (21)

The backward-elimination fit (s = 0.159 on 25 df) and the forward-selection fit (s = 0.195 on
28 df) retain only subsets of these terms.

More Dangers of "Big" Models

Recall: $\hat y$ is the projection of y onto $\mathcal M(X)$.
→ Adding more variables (columns) into X "enlarges" $\mathcal M(X)$
. . . IF the rank increases by the number of new variables.

Consider two extremes:
Adding a new variable $X_{p+1} \in \mathcal M^\perp(X)$ → gives us completely "new" information.
Adding a new variable $X_{p+1} \in \mathcal M(X)$ → gives no "new" information; we cannot even do
least squares (why not?).

What if we are between the two extremes? What if
$$X(X^\top X)^{-1}X^\top X_{p+1} = HX_{p+1} \simeq X_{p+1}?$$
We can certainly fit the regression, but what will happen?

Problem of Multicollinearity

Multicollinearity: when the p explanatories concentrate around a subspace of dimension q < p
[simplest case: pairs of variables that are correlated].
BUT: it might exist even if pairs of variables appear uncorrelated!
It can be caused by:
poor design [can try designing again],
inherent relationships [other remedies needed].

Using block matrix properties, we have
$$\operatorname{var}(\hat\beta) = \sigma^2\bigl[(X\ X_{p+1})^\top(X\ X_{p+1})\bigr]^{-1},
\qquad
\bigl[(X\ X_{p+1})^\top(X\ X_{p+1})\bigr]^{-1} = \begin{pmatrix} A & B\\ C & D\end{pmatrix},$$
where, writing $S = X_{p+1}^\top X_{p+1} - X_{p+1}^\top H X_{p+1}$,
$$A = (X^\top X)^{-1} + (X^\top X)^{-1}X^\top X_{p+1}\,S^{-1}\,X_{p+1}^\top X(X^\top X)^{-1},
\qquad B = -(X^\top X)^{-1}X^\top X_{p+1}\,S^{-1},$$
$$C = -S^{-1}X_{p+1}^\top X(X^\top X)^{-1}, \qquad D = S^{-1}.$$

So what are the results?
Unstable matrix inversion.
Huge variances of the estimators!
→ Signs can even flip for different data, giving the impression of inverse effects.
Individual coefficients appear insignificant → t-test p-values are inflated.
But the global F-test might still give a significant result!
The Picket-Fence (Hocking & Pendleton)

Figure: the "picket fence" illustration of a regression plane fitted to collinear explanatories.

Diagnosing Multicollinearity

Simple first steps:
Look at scatterplots.
Look at the correlation matrix of the explanatories.
→ This might not reveal more complex linear constraints, though.

Look at the variance inflation factors:
$$\mathrm{VIF}_j = \frac{\operatorname{var}(\hat\beta_j)\,\|X_j\|^2}{\sigma^2}
= \|X_j\|^2\,\bigl[(X^\top X)^{-1}\bigr]_{jj}.$$
One can show that
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
where $R_j^2$ is the coefficient of determination for the regression
$$X_j = \alpha_{0,j} + \alpha_{1,j}X_1 + \cdots + \alpha_{j-1,j}X_{j-1} + \alpha_{j+1,j}X_{j+1} + \cdots + \alpha_{p,j}X_p + \varepsilon,$$
measuring the linear dependence of $X_j$ on the other columns of X.

Interpretation: how much the variance is inflated when including variable j, as compared to the
variance we would obtain if $X_j$ were orthogonal to the other variables; that is, how much worse
we are doing as compared to the ideal case.
Rule of thumb: $\mathrm{VIF}_j > 5$ or $\mathrm{VIF}_j > 10$ is considered to be "large".
Large values of $\mathrm{VIF}_j$ indicate that $X_j$ is linearly dependent on the other columns of
the design matrix.

More on Diagnosing Multicollinearity

Let $X_{-j}$ be the design matrix without the j-th variable. Then
$$R_j^2 = \frac{\|X_{-j}(X_{-j}^\top X_{-j})^{-1}X_{-j}^\top X_j\|^2}{\|X_j\|^2} \in [0, 1]$$
is close to 1 if $\underbrace{X_{-j}(X_{-j}^\top X_{-j})^{-1}X_{-j}^\top}_{H_{-j}} X_j \simeq X_j$.

Consider the spectral decomposition of $X^\top X$: $X^\top X = U\Lambda U^\top$ with
$\Lambda = \operatorname{diag}\{\lambda_1, \ldots, \lambda_p\}$ and $U^\top U = I$. Then
$$\det(X^\top X) = \prod_{j=1}^p \lambda_j, \qquad \operatorname{rank}(X^\top X) = \#\{j : \lambda_j \neq 0\}.$$
Hence "small" $\lambda_j$'s mean "almost" reduced rank, revealing the effect of collinearity.
Measure this using the condition index:
$$CI_j(X^\top X) := \sqrt{\lambda_{\max}/\lambda_j}.$$
Global "instability" is measured by the condition number,
$$CN(X^\top X) = \sqrt{\lambda_{\max}/\lambda_{\min}}.$$
Rule of thumb: CN > 30 indicates moderate to significant collinearity, CN > 100 indicates severe
collinearity (choices vary).
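A sketch of these diagnostics in R for a fitted model `fit` (an assumed object name; car::vif()
gives the VIFs directly):

```r
## Variance inflation factors and condition number of the design
X <- model.matrix(fit)[, -1]                         # explanatory columns (drop the intercept)
vif <- sapply(seq_len(ncol(X)), function(j) {
  Rj2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared     # regress X_j on the other columns
  1 / (1 - Rj2)                                      # VIF_j = 1 / (1 - R_j^2)
})
names(vif) <- colnames(X)
lam <- eigen(crossprod(X), only.values = TRUE)$values
sqrt(max(lam) / min(lam))                            # condition number CN(X'X)
```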
Remedies?

If the design is faulty, we may redesign.
→ Otherwise? Inherent relationships between the explanatories:

Variable deletion - attempt to remove problematic variables
→ e.g., by backward elimination.

Choose an orthogonal basis for $\mathcal M(X)$ and use its elements as explanatories
→ from the spectrum, $X^\top X = U\Lambda U^\top$, use the columns of XU
→ OK for prediction
→ Problem: we lose interpretability.

Other approaches?

Example: Body Fat Data

Body fat is a measure of health → not easy to measure!
Collect 252 measurements on body fat and some explanatory variables.
Can we use measuring tape and scales only to find body fat?

Explanatory variables: age, weight, height, neck, chest, abdomen, hip, thigh, knee, ankle,
biceps, forearm, wrist.
Some Scatterplots [library(car); scatterplot.matrix( . . . )]

Figure: scatterplot matrix of weight, height, abdomen and wrist, showing strong pairwise
dependence among the explanatories.

Looks like we're in trouble. Let's go ahead and fit anyway . . .

Model Fit Summary
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   18.1885     17.3486     1.05    0.2955
age            0.0621      0.0323     1.92    0.0562
weight         0.0884      0.0535     1.65    0.0998
height         0.0696      0.0960     0.72    0.4693
neck           0.4706      0.2325     2.02    0.0440
chest          0.0239      0.0991     0.24    0.8100
abdomen        0.9548      0.0864    11.04    0.0000
hip            0.2075      0.1459     1.42    0.1562
thigh          0.2361      0.1444     1.64    0.1033
knee           0.0153      0.2420     0.06    0.9497
ankle          0.1740      0.2215     0.79    0.4329
biceps         0.1816      0.1711     1.06    0.2897
forearm        0.4520      0.1991     2.27    0.0241
wrist          1.6206      0.5349     3.03    0.0027

R2 = 0.749, F-test: p < 2.2e-16.
Variable Deletion: Backward Elimination

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   22.6564     11.7139     1.93    0.0543
age            0.0658      0.0308     2.14    0.0336
weight         0.0899      0.0399     2.25    0.0252
neck           0.4666      0.2246     2.08    0.0388
abdomen        0.9448      0.0719    13.13    0.0000
hip            0.1954      0.1385     1.41    0.1594
thigh          0.3024      0.1290     2.34    0.0199
forearm        0.5157      0.1863     2.77    0.0061
wrist          1.5367      0.5094     3.02    0.0028

Multiple R-Squared: 0.7466; F-statistic p-value: < 2.2e-16.

Variance inflation factors for the full model:
age 2.25, weight 33.51, height 1.67, neck 4.32, chest 9.46, abdomen 11.77, hip 14.80,
thigh 7.78, knee 4.61, ankle 1.91, biceps 3.62, forearm 2.19, wrist 3.38.
For the backward-eliminated model they drop to:
age 2.05, weight 18.82, neck 4.08, abdomen 8.23, hip 13.47, thigh 6.28, forearm 1.94, wrist 3.09.

Diagnostic Check

Figure: singular values (eigenvalue roots) of the design, decaying sharply with index.

Condition indices CI_1, ..., CI_13:
1.00, 17.47, 25.30, 58.61, 83.59, 100.63, 137.90, 175.29, 192.62, 213.01, 228.16, 268.21, 555.67.

Condition Number ≈ 556!

Split Data in Two and Fit Separately (Picket Fence)

Table: the same model fitted separately to the two halves of the data; the coefficient estimates
(e.g. an intercept of 1.22 on one half against 32.66 on the other) and the significance patterns
change drastically between the two fits.

Variable Transformation: Eigenvector Basis

Define Z = XU as the design matrix.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   18.1885     17.3486     1.05    0.2955
Z[, 1]         0.1353      0.0619     2.19    0.0297
Z[, 2]         0.0168      0.0916     0.18    0.8546
Z[, 3]         0.2372      0.1070     2.22    0.0276
Z[, 4]         0.7188      0.0571    12.58    0.0000
Z[, 5]         0.0248      0.0827     0.30    0.7649
Z[, 6]         0.4546      0.1001     4.54    0.0000
Z[, 7]         0.5903      0.1366     4.32    0.0000
Z[, 8]         0.1207      0.1742     0.69    0.4890
Z[, 9]         0.0836      0.1914     0.44    0.6627
Z[, 10]        0.5043      0.2082     2.42    0.0162
Z[, 11]        0.5735      0.2254     2.54    0.0116
Z[, 12]        0.3007      0.2628     1.14    0.2536
Z[, 13]        1.5168      0.5447     2.78    0.0058

R2 = 0.749, F-test p-value < 2.2e-16.
From Rotation to Shrinkage

The multicollinearity problem is that $\det(X^\top X) \approx 0$ [i.e. $X^\top X$ is almost not
invertible].

The eigenvector approach rotates the space so as to "free" the dependence of one coefficient
$\beta_j$ on the others $\{\beta_i\}_{i\neq j}$.
→ It imposes a constraint on X (orthogonal columns).
→ Problem: we lose interpretability! (prediction is OK)
→ Example: the most significant "rotated" term in the fat data is
Z[,4] = -0.01*age -0.058*weight -0.011*height +0.46*neck -0.144*chest -0.441*abdomen +0.586*hip
+0.22*thigh -0.197*knee -0.044*ankle -0.07*biceps -0.33*forearm -0.249*wrist.

Another approach to reduce this strong dependence?
→ Impose a constraint on $\beta$! How? (this introduces bias)

Ridge Regression

A solution: add a "small amount" of a full rank matrix to $X^\top X$.

For reasons to become clear soon, we standardise the design matrix. Write
$X = (1\ W)$, $\beta = (\beta_0\ \gamma^\top)^\top$, and recentre/rescale the covariates, defining
$$Z_j = \frac{W_j - \bar W_j\,\mathbf 1}{\sqrt{n}\,\operatorname{sd}(W_j)}.$$
The coefficients now have a common scale. The interpretation of $\gamma_j$ is slightly different:
not the "mean impact on the response per unit change of the explanatory variable", but now the
"mean impact on the response per unit deviation of the explanatory variable from its mean,
measured in units of standard deviation".
The $Z_j$ are all orthogonal to 1 and are of unit norm.

Ridge Regression

Since $Z_j \perp \mathbf 1$ for all j, we can estimate $\beta_0$ and $\gamma$ by two separate
regressions (orthogonality). The least squares estimators become
$$\hat\beta_0 = \bar Y, \qquad \hat\gamma = (Z^\top Z)^{-1}Z^\top Y.$$
Ridge regression replaces $Z^\top Z$ by $Z^\top Z + \lambda I$ (i.e. adds a "ridge"):
$$\hat\beta_0 = \bar Y, \qquad \hat\gamma_\lambda = (Z^\top Z + \lambda I)^{-1}Z^\top Y.$$
Adding $\lambda I$ to $Z^\top Z$ makes the inversion more stable.
→ $\lambda$ is called the ridge parameter.

Ridge Regression: Shrinkage Viewpoint

The ridge term $\lambda I$ seems slightly ad hoc. Motivation?
→ One can see that $(\hat\beta_0, \hat\gamma_\lambda) = (\bar Y, (Z^\top Z + \lambda I)^{-1}Z^\top Y)$ minimizes
$$\|Y - \beta_0\mathbf 1 - Z\gamma\|_2^2 + \lambda\|\gamma\|_2^2,$$
or equivalently minimizes $\|Y - \beta_0\mathbf 1 - Z\gamma\|_2^2$ subject to
$\sum_{j=1}^{p-1}\gamma_j^2 = \|\gamma\|_2^2 \le r(\lambda)$,
instead of the least squares estimator, which minimizes $\|Y - \beta_0\mathbf 1 - Z\gamma\|_2^2$.

Idea: in the presence of collinearity, the coefficients are ill-defined: a wildly positive
coefficient can be cancelled out by a largely negative coefficient (many coefficient combinations
can produce the same effect). By imposing a size constraint, we limit the possible coefficient
combinations!
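A minimal sketch of the ridge estimator in R, for a fitted lm object `fit` (the object name and
the value of lambda are illustrative; in practice lambda is chosen by cross validation, e.g. via
MASS::lm.ridge or glmnet with alpha = 0):

```r
## Ridge regression on the standardised design
W      <- model.matrix(fit)[, -1]
y      <- model.response(model.frame(fit))
Z      <- scale(W) / sqrt(nrow(W) - 1)       # centred columns of unit norm
lambda <- 1
gamma.ridge <- solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y))
beta0.hat   <- mean(y)                       # intercept estimated separately (Z_j orthogonal to 1)
```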
L2 Shrinkage [Ridge Regression]

Proposition
Let $Z_{n\times q}$ be a matrix of rank $r \le q$ with centred column vectors of unit norm. Given
$\lambda > 0$, the unique minimiser of
$$Q(\hat\beta_0, \hat\gamma) = \|y - \hat\beta_0\mathbf 1 - Z\hat\gamma\|_2^2 + \lambda\|\hat\gamma\|_2^2$$
is $(\hat\beta_0, \hat\gamma) = \bigl(\bar y,\ (Z^\top Z + \lambda I)^{-1}Z^\top y\bigr)$.

Proof.
Write $y = \underbrace{(y - \bar y\mathbf 1)}_{\in\,\mathcal M^\perp(\mathbf 1)} + \underbrace{\bar y\mathbf 1}_{\in\,\mathcal M(\mathbf 1)}$.
Note also that by assumption $\mathbf 1 \in \mathcal M^\perp(Z)$. Therefore, by Pythagoras' theorem,
$$\|y - \hat\beta_0\mathbf 1 - Z\hat\gamma\|_2^2
= \|(\bar y - \hat\beta_0)\mathbf 1 + (y - \bar y\mathbf 1 - Z\hat\gamma)\|^2
= \|(\bar y - \hat\beta_0)\mathbf 1\|_2^2 + \|y - \bar y\mathbf 1 - Z\hat\gamma\|_2^2.$$
Therefore,
$$\min_{\hat\beta_0, \hat\gamma} Q(\hat\beta_0, \hat\gamma)
= \min_{\hat\beta_0}\|(\bar y - \hat\beta_0)\mathbf 1\|_2^2
+ \min_{\hat\gamma}\bigl\{\|(y - \bar y\mathbf 1) - Z\hat\gamma\|_2^2 + \lambda\|\hat\gamma\|_2^2\bigr\}.$$
Clearly $\arg\min_{\hat\beta_0}\|(\bar y - \hat\beta_0)\mathbf 1\|_2^2 = \bar y$, while the second
component can be written
$$\min_{\hat\gamma\in\mathbb R^q}\
\Bigl\|\begin{pmatrix} y - \bar y\mathbf 1\\ 0_{q\times 1}\end{pmatrix}
- \begin{pmatrix} Z\\ \sqrt{\lambda}\,I_{q\times q}\end{pmatrix}\hat\gamma\Bigr\|_2^2$$
using block notation. This is the usual least squares problem, with solution
$$\hat\gamma = \Bigl[(Z^\top\ \ \sqrt{\lambda}I_{q\times q})\begin{pmatrix} Z\\ \sqrt{\lambda}I_{q\times q}\end{pmatrix}\Bigr]^{-1}
(Z^\top\ \ \sqrt{\lambda}I_{q\times q})\begin{pmatrix} y - \bar y\mathbf 1\\ 0_{q\times 1}\end{pmatrix}
= (Z^\top Z + \lambda I)^{-1}Z^\top(y - \bar y\mathbf 1).$$
Note that $Z^\top Z + \lambda I$ is indeed invertible: writing $Z^\top Z = U\Omega U^\top$ with
$\Omega = \operatorname{diag}\{\omega_1, \ldots, \omega_r, \omega_{r+1}, \ldots, \omega_q\}$
(since $Z^\top Z \succeq 0$ and $\operatorname{rank}(Z^\top Z) = \operatorname{rank}(Z)$), we have
$$Z^\top Z + \lambda I = U\Omega U^\top + U(\lambda I_{q\times q})U^\top = U(\Omega + \lambda I_{q\times q})U^\top \succ 0.$$
To complete the proof, observe that
$Z^\top(y - \bar y\mathbf 1) = Z^\top y - \bar y\,Z^\top\mathbf 1 = Z^\top y$.

The Effect of Shrinkage

Note that if the SVD of Z is $Z = V\Omega U^\top$, the last steps of the previous proof may be
used to show that
$$\hat\gamma_\lambda = \sum_{j=1}^q \frac{\omega_j}{\omega_j^2 + \lambda}\,(v_j^\top y)\,u_j,$$
where the $v_j$'s and $u_j$'s are the columns of V and U, respectively.

Compare this to the ordinary least squares solution, when $\lambda = 0$:
$$\hat\gamma = (Z^\top Z)^{-1}Z^\top y = \sum_{j=1}^q \frac{1}{\omega_j}\,(v_j^\top y)\,u_j,$$
which is not even defined if Z is of reduced rank.

The role of $\lambda$ is to reduce the size of $1/\omega_j$ when $\omega_j$ becomes very small.
Bias and Variance

Proposition
Let $\hat\gamma_\lambda$ be the ridge regression estimator of $\gamma$. Then
$$\operatorname{bias}(\hat\gamma_\lambda; \gamma) = -\lambda\,(Z^\top Z + \lambda I_q)^{-1}\gamma
\qquad\text{and}\qquad
\operatorname{cov}(\hat\gamma_\lambda) = \sigma^2(Z^\top Z + \lambda I)^{-1}Z^\top Z\,(Z^\top Z + \lambda I)^{-1}.$$

Proof.
Since $\mathrm E(\hat\gamma_\lambda) = (Z^\top Z + \lambda I)^{-1}Z^\top\mathrm E(y) = (Z^\top Z + \lambda I)^{-1}Z^\top Z\gamma$, the bias is
$$\operatorname{bias}(\hat\gamma_\lambda; \gamma) = \mathrm E(\hat\gamma_\lambda) - \gamma
= \bigl\{(Z^\top Z + \lambda I)^{-1}Z^\top Z - I\bigr\}\gamma
= \bigl\{(Z^\top Z + \lambda I)^{-1}(Z^\top Z + \lambda I - \lambda I) - I\bigr\}\gamma$$
$$= \bigl\{I - \lambda(Z^\top Z + \lambda I)^{-1} - I\bigr\}\gamma = -\lambda(Z^\top Z + \lambda I)^{-1}\gamma.$$
The covariance term is obvious.

Domination over Least Squares

Corollary
Assume that $\operatorname{rank}(Z_{n\times q}) = q$ and let
$$\tilde\gamma = (Z^\top Z)^{-1}Z^\top y \qquad\&\qquad \hat\gamma_\lambda = (Z^\top Z + \lambda I)^{-1}Z^\top y$$
be the least squares estimator and the ridge estimator, respectively. Then
$$\mathrm E\bigl[(\tilde\gamma - \gamma)(\tilde\gamma - \gamma)^\top\bigr]
- \mathrm E\bigl[(\hat\gamma_\lambda - \gamma)(\hat\gamma_\lambda - \gamma)^\top\bigr] \succ 0$$
for all $\lambda \in (0,\ 2\sigma^2/\|\gamma\|^2)$.

The ridge estimator is uniformly better than least squares! How can this be?
(What happened to Gauss-Markov?)
Gauss-Markov only covers unbiased estimators, but the ridge estimator is biased.
A bit of bias can improve the MSE by reducing variance!
Also, there is a catch! The "right" range for $\lambda$ depends on unknowns.
Choosing a good $\lambda$ is all about balancing bias and variance.

Proof.
From our bias/variance calculations on the ridge estimator, we have
$$\mathrm E\bigl[(\tilde\gamma - \gamma)(\tilde\gamma - \gamma)^\top\bigr]
- \mathrm E\bigl[(\hat\gamma_\lambda - \gamma)(\hat\gamma_\lambda - \gamma)^\top\bigr]
= \sigma^2(Z^\top Z)^{-1}
- (Z^\top Z + \lambda I)^{-1}\sigma^2 Z^\top Z(Z^\top Z + \lambda I)^{-1}
- \lambda^2(Z^\top Z + \lambda I)^{-1}\gamma\gamma^\top(Z^\top Z + \lambda I)^{-1}$$
$$= (Z^\top Z + \lambda I)^{-1}\bigl\{\sigma^2\lambda\bigl(2I + \lambda(Z^\top Z)^{-1}\bigr) - \lambda^2\gamma\gamma^\top\bigr\}(Z^\top Z + \lambda I)^{-1}.$$
To go from the second to the third expression, we wrote
$$\sigma^2(Z^\top Z)^{-1}
= \sigma^2(Z^\top Z + \lambda I)^{-1}(Z^\top Z + \lambda I)(Z^\top Z)^{-1}(Z^\top Z + \lambda I)(Z^\top Z + \lambda I)^{-1}
= (Z^\top Z + \lambda I)^{-1}\bigl(\sigma^2 Z^\top Z + 2\lambda\sigma^2 I + \lambda^2\sigma^2(Z^\top Z)^{-1}\bigr)(Z^\top Z + \lambda I)^{-1}$$
and did the tedious (but straightforward) algebra. The middle term is positive definite if
$$2\lambda\sigma^2 I + \lambda^2\sigma^2(Z^\top Z)^{-1} - \lambda^2\gamma\gamma^\top \succ 0.$$
Noting that we can always write $I = \bar\gamma\bar\gamma^\top + \sum_{j=1}^{q-1}\theta_j\theta_j^\top$,
so that $\gamma\gamma^\top = \|\gamma\|^2\bar\gamma\bar\gamma^\top$, for
$\{\bar\gamma = \gamma/\|\gamma\|, \theta_1, \ldots, \theta_{q-1}\}$ an orthonormal basis of
$\mathbb R^q$, we see that $\lambda \in (0,\ 2\sigma^2/\|\gamma\|^2)$ suffices for positive
definiteness to hold true.

Bias-Variance Tradeoff

Role of $\lambda$: it regulates the bias-variance tradeoff.
$\lambda\uparrow$ decreases the variance (collinearity) but increases the bias;
$\lambda\downarrow$ decreases the bias, but the variance is inflated if collinearity exists.

Recall:
$$\mathrm E\|\hat\gamma - \gamma\|^2
= \underbrace{\mathrm E\|\hat\gamma - \mathrm E\hat\gamma\|^2}_{\text{Variance}\ =\ \operatorname{trace}[\operatorname{cov}(\hat\gamma)]}
+ \underbrace{\|\mathrm E\hat\gamma - \gamma\|^2}_{\text{Bias}^2}
+ 2(\mathrm E\hat\gamma - \gamma)^\top\underbrace{\mathrm E[\hat\gamma - \mathrm E\hat\gamma]}_{=0},$$
and, with $Z^\top Z = U\Omega U^\top$,
$$\operatorname{trace}\bigl(\operatorname{cov}(\hat\gamma_\lambda)\bigr)
= \sigma^2\sum_{j=1}^q \frac{\omega_j^2}{(\omega_j^2 + \lambda)^2}.$$
So choose $\lambda$ so as to optimally trade an increase in bias against a decrease in variance.
→ Use cross validation!
L1 Shrinkage?

Motivated by the ridge regression formulation, we can consider
$$\min\ \|Y - \beta_0\mathbf 1 - Z\gamma\|_2^2
\ \ \text{subject to}\ \ \sum_{j=1}^{p-1}|\gamma_j| = \|\gamma\|_1 \le r(\lambda)
\qquad\Longleftrightarrow\qquad
\min\ \|Y - \beta_0\mathbf 1 - Z\gamma\|_2^2 + \lambda\|\gamma\|_1.$$
No explicit form is available (unless $Z^\top Z = I$); a quadratic programming algorithm is needed.
The resulting estimator is non-linear in Y.

Why choose a different type of norm? Because the L1 penalty (almost) produces a "continuous"
model selection! It shrinks the coefficient size by a different version of the magnitude.

Orthogonal Design

When the explanatory variables are orthogonal (i.e. $Z^\top Z = I$), the LASSO exactly performs
model selection via thresholding. Consider the linear model
$$Y = \beta_0\,\mathbf 1_{n\times 1} + Z_{n\times(p-1)}\,\gamma_{(p-1)\times 1} + \varepsilon_{n\times 1},$$
where $Z^\top\mathbf 1 = 0$ and $Z^\top Z = I$. Let $\hat\gamma$ be the least squares estimator of $\gamma$,
$$\hat\gamma = (Z^\top Z)^{-1}Z^\top Y = Z^\top Y.$$

Theorem
The unique solution to the LASSO problem
$$\min_{\beta_0\in\mathbb R,\ \gamma\in\mathbb R^{p-1}}\ \|Y - \beta_0\mathbf 1 - Z\gamma\|_2^2 + \lambda\|\gamma\|_1$$
is given by $(\hat\beta_0, \gamma^\ast) = (\bar Y, (\gamma_1^\ast, \ldots, \gamma_{p-1}^\ast))$, defined as
$$\hat\beta_0 = \bar Y \qquad\&\qquad
\gamma_i^\ast = \operatorname{sgn}(\hat\gamma_i)\Bigl(|\hat\gamma_i| - \frac{\lambda}{2}\Bigr)_+,
\quad i = 1, \ldots, p-1.$$

Proof.
Note that since $Z^\top\mathbf 1 = 0$ and since $\beta_0$ does not appear in the L1 penalty, we have
$\hat\beta_0 = (\mathbf 1^\top\mathbf 1)^{-1}\mathbf 1^\top Y = \bar Y$, and
$$\min_{\gamma}\ \|Y - \hat\beta_0\mathbf 1 - Z\gamma\|_2^2 + \lambda\|\gamma\|_1
= \min_{\gamma\in\mathbb R^{p-1}}\ \|u - Z\gamma\|_2^2 + \lambda\|\gamma\|_1,$$
where $u = Y - \bar Y\mathbf 1$ for tidiness. Expanding the squared norm gives
$$\|u - Z\gamma\|_2^2 = u^\top u - 2u^\top Z\gamma + \gamma^\top\underbrace{(Z^\top Z)}_{=I}\gamma
= u^\top u - 2\underbrace{Y^\top Z}_{=\hat\gamma^\top}\gamma + 2\bar Y\underbrace{\mathbf 1^\top Z}_{=0}\gamma + \gamma^\top\gamma.$$
Since $u^\top u$ does not depend on $\gamma$, we see that the LASSO objective function is
$$-2\hat\gamma^\top\gamma + \|\gamma\|_2^2 + \lambda\|\gamma\|_1.$$
Clearly, this has the same minimizer if multiplied across by 1/2, i.e.
$$-\hat\gamma^\top\gamma + \tfrac12\|\gamma\|_2^2 + \tfrac{\lambda}{2}\|\gamma\|_1
= \sum_{i=1}^{p-1}\Bigl(-\hat\gamma_i\gamma_i + \tfrac12\gamma_i^2 + \tfrac{\lambda}{2}|\gamma_i|\Bigr).$$
L1 Shrinkage [The LASSO]

Notice that we now have a sum of p - 1 objective functions, each depending only on one $\gamma_i$.
We can thus optimise each separately. That is, for any given $1 \le i \le p-1$, we must minimise
$$-\hat\gamma_i\gamma_i + \tfrac12\gamma_i^2 + \tfrac{\lambda}{2}|\gamma_i|.$$
We distinguish 3 cases:

1. Case $\hat\gamma_i = 0$. In this case the objective function becomes
$\tfrac12\gamma_i^2 + \tfrac{\lambda}{2}|\gamma_i|$ and it is clear that it is minimised when
$\gamma_i = 0$. So in this case $\gamma_i^\ast = 0$.

2. Case $\hat\gamma_i > 0$. In this case the objective function is minimised somewhere in the
range $\gamma_i \in [0, \infty)$, because the term $-\hat\gamma_i\gamma_i$ is negative there (and
all other terms are positive). But when $\gamma_i \ge 0$, the objective function becomes
$$-\hat\gamma_i\gamma_i + \tfrac12\gamma_i^2 + \tfrac{\lambda}{2}\gamma_i
= \bigl(\tfrac{\lambda}{2} - \hat\gamma_i\bigr)\gamma_i + \tfrac12\gamma_i^2.$$
If $\tfrac{\lambda}{2} - \hat\gamma_i \ge 0$, then the minimum over $\gamma_i \in [0, \infty)$ is
clearly at $\gamma_i = 0$. Otherwise, when $\tfrac{\lambda}{2} - \hat\gamma_i < 0$, we differentiate
and find the minimum at $\gamma_i = \hat\gamma_i - \lambda/2 > 0$. In summary,
$\gamma_i^\ast = (\hat\gamma_i - \lambda/2)_+ = \operatorname{sgn}(\hat\gamma_i)(|\hat\gamma_i| - \lambda/2)_+$.

3. Case $\hat\gamma_i < 0$. In this case the objective function is minimised somewhere in the
range $\gamma_i \in (-\infty, 0]$, because the term $-\hat\gamma_i\gamma_i$ is negative there (and
all other terms are positive). But when $\gamma_i \le 0$, the objective function becomes
$$-\hat\gamma_i\gamma_i + \tfrac12\gamma_i^2 + \tfrac{\lambda}{2}(-\gamma_i)
= \bigl(\tfrac{\lambda}{2} + \hat\gamma_i\bigr)(-\gamma_i) + \tfrac12\gamma_i^2
= \bigl(\tfrac{\lambda}{2} - |\hat\gamma_i|\bigr)(-\gamma_i) + \tfrac12\gamma_i^2.$$
If $\tfrac{\lambda}{2} - |\hat\gamma_i| \ge 0$, then the minimum over $\gamma_i \in (-\infty, 0]$
is clearly at $\gamma_i = 0$, since $-\gamma_i$ ranges over $[0, \infty)$. Otherwise, when
$\tfrac{\lambda}{2} - |\hat\gamma_i| < 0$, we differentiate and find the minimum at
$\gamma_i = -(|\hat\gamma_i| - \lambda/2) < 0$, which we may re-write as
$-(|\hat\gamma_i| - \lambda/2) = \operatorname{sgn}(\hat\gamma_i)(|\hat\gamma_i| - \lambda/2)_+$.
In summary, $\gamma_i^\ast = \operatorname{sgn}(\hat\gamma_i)(|\hat\gamma_i| - \lambda/2)_+$.

The proof is now complete, as we can see that all three cases yield
$$\gamma_i^\ast = \operatorname{sgn}(\hat\gamma_i)\Bigl(|\hat\gamma_i| - \frac{\lambda}{2}\Bigr)_+,
\qquad i = 1, \ldots, p-1.$$
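As a small sketch, the soft-thresholding solution of the theorem can be written as an R function
and applied to the least squares coefficients of an orthonormal design:

```r
## Soft-thresholding: LASSO solution under an orthonormal design (Z'Z = I)
soft.threshold <- function(gamma.ls, lambda) {
  sign(gamma.ls) * pmax(abs(gamma.ls) - lambda / 2, 0)
}
soft.threshold(c(-1.3, 0.2, 0.9), lambda = 1)   # gives -0.8, 0.0, 0.4
```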
LASSO as the Relaxation of Best Subsets

Intuition: the L1 norm induces "sharp" balls!
Model selection is induced by regulating the lasso (through $\lambda$).
Extreme case: the L0 "norm" gives best subsets selection!
$$\|\gamma\|_0 = \sum_{j=1}^{p-1}|\gamma_j|^0 = \sum_{j=1}^{p-1}\mathbf 1\{\gamma_j \neq 0\} = \#\{j : \gamma_j \neq 0\}.$$
Generally, $\|\gamma\|_p^p = \sum_{j=1}^{p-1}|\gamma_j|^p$ gives sharp balls for $0 < p \le 1$:
the balls are more concentrated around the axes.

LASSO profile for Bodyfat Data [LARS algorithm]

Figure: standardised LASSO coefficient profiles for the body fat data against |beta|/max|beta|
(computed with the LARS algorithm), together with the cross-validation error for different values
of the fraction $r(\lambda) = \|\hat\gamma\|_1$.
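A sketch of how such a profile and its cross-validated tuning can be obtained in R with the
glmnet package (the data frame `fat` and the response name `bodyfat` are placeholders):

```r
## LASSO path and cross-validation over lambda
library(glmnet)
x <- model.matrix(bodyfat ~ ., data = fat)[, -1]
y <- fat$bodyfat
path <- glmnet(x, y, alpha = 1)        # alpha = 1 gives the LASSO penalty
plot(path, xvar = "norm")              # coefficient profiles against the L1 norm
cvfit <- cv.glmnet(x, y, alpha = 1)    # K-fold cross-validation over lambda
coef(cvfit, s = "lambda.min")          # coefficients at the CV-selected lambda
```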
Robust Linear Modeling

Robust/Resistant Methods

The "success" of the LSE in a regression model depends on "assumptions":
Normality (the LSE is optimal in this case);
Not many "extreme" observations (the LSE is affected by "extremities").

Picture:

Resistant procedure: not strongly affected by changes to the data.
Robust procedure: not strongly affected by departures from the assumed distribution.
Often: Robust ⟺ Resistant.

Motivating Example: Estimating a Mean

Let $X_1, \ldots, X_n$ iid F; estimate $\mu = \int_{-\infty}^{\infty} x\,F(dx)$ by
$$\bar x = \frac1n\sum_{i=1}^n x_i = \arg\min_{\mu\in\mathbb R}\sum_{i=1}^n (x_i - \mu)^2.$$
Some observations:
Extremely sensitive to outliers (low breakdown point).
It blows up from a single value: $x_1 \mapsto x_1 + \Delta \implies \bar x \mapsto \bar x + \Delta/n$.
If $\Delta$ is large relative to n → disaster . . .
$\bar x$ is optimal (MLE) when F is Normal; it may not be optimal for other possible F's . . .

Can we "cure" the sensitivity by using a different distance function?
$$m = \arg\min_{\mu\in\mathbb R}\sum_{i=1}^n |x_i - \mu|
= \begin{cases} x_{(k+1)}, & n = 2k+1;\\[2pt] \dfrac{x_{(k)} + x_{(k+1)}}{2}, & n = 2k.\end{cases}$$
The median is much less sensitive to bad values.
Higher breakdown point: one must blow up at least 50% of the observations to blow m up.
The median is optimal (MLE) when F is Laplace.
But how well does m perform when F ≈ Normal (relative efficiency)?

Remember the picture:
Motivating Example: Estimating a Mean

Other alternatives?

Trimmed mean: throw away the most extreme observations:
$$\bar x_{trm} = \frac{1}{|E^c|}\sum_{i\notin E} x_i,$$
E being the subset of the $\alpha n$ most extreme observations from each end.

Weighted estimate:
$$\bar x_{wm} = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i},$$
with weights downplaying certain observations (i.e., give less weight to extremes . . . ).
How do we objectively/automatically choose the weights?

Both $m$ and $\bar x_{trm}$ may 'throw away' information. View them as special cases of the
weighted estimate. → Find a general formulation of which the above are special cases.

Regression Setup

The regression situation is similar. We have:
$$Y = X\beta + \varepsilon; \qquad \varepsilon \sim F, \quad \mathrm E[\varepsilon] = 0, \quad
\operatorname{cov}[\varepsilon] = \sigma^2 I.$$
The LSE for $\beta$ is given by
$$\hat\beta = (X^\top X)^{-1}X^\top y = \arg\min_{\beta\in\mathbb R^p}\sum_i (y_i - x_i^\top\beta)^2.$$
It is disastrous if $y_i \mapsto y_i + c$ with c large:
$$\hat\beta \mapsto \hat\beta + (X^\top X)^{-1}x_i\,c.$$
Optimal at F = Normal.
Gauss-Markov: optimal linear for any F.
→ It may not be overall optimal for other F's.

Robust/Resistant Alternatives

L1 regression: $\tilde\beta = \arg\min_{\beta\in\mathbb R^p}\sum_{i=1}^n |y_i - x_i^\top\beta|$.

Trimmed least squares: $\check\beta = \arg\min_{\beta\in\mathbb R^p}\sum_{i=1}^K (y_i - x_i^\top\beta)^2_{(i)}$,
where we set $K = \lfloor n/2\rfloor + \lfloor(p+1)/2\rfloor$.

Weighted least squares: $\bar\beta = (X^\top V^{-1}X)^{-1}X^\top V^{-1}Y$ for a diagonal weight
matrix (recall the earlier lecture):
$$V = \operatorname{diag}\{w_1, w_2, \ldots, w_n\},$$
with weights downplaying certain observations (i.e., give less weight to extremes . . . ).

We would like to formalise the concept of robust/resistant estimation.
→ Seek a unifying approach: find a general formulation of which the above are special cases.

M-Estimators

Instead of $(\cdot)^2$ or $|\cdot|$, consider a more general distance function $\rho(\cdot)$.
The MLE when the errors are Gaussian is obtained by maximising the loglikelihood kernel
$$\hat\beta = \arg\max_{\beta\in\mathbb R^p}\ -\frac12\sum_{i=1}^n
\Bigl(\frac{y_i - x_i^\top\beta}{\sigma}\Bigr)^2.$$
Replacing $\rho(u) = u^2$ by a general $\rho(\cdot)$ yields:
$$\hat\beta := \arg\min_{\beta\in\mathbb R^p}\sum_{i=1}^n
\rho\Bigl(\frac{y_i - x_i^\top\beta}{\sigma}\Bigr).$$
Call this an M(aximum likelihood like)-Estimator.
M-Estimation as Weighted Regression

Obtaining
    arg min_{β∈ℝ^p} Σ_{i=1}^n ρ(y_i − x_i^⊤β)
reduces to solving
    Σ_{i=1}^n ψ(y_i − x_i^⊤β) x_i = 0,
with ψ(t) = dρ(t)/dt. Letting w(u) = ψ(u)/u, this reduces to
    Σ_{i=1}^n w_i x_i (y_i − x_i^⊤β) = 0,   where   w_i = w(y_i − x_i^⊤β).

But this is simply the weighting scenario!
Robust regression can be written as a weighted regression, but the weights depend on the data.

Examples of Distance Functions and Weight Functions

Idea: choose ρ to have desirable properties (reduce/eliminate the impact of outliers); this is the same as choosing the weight function. Distance functions are in 1–1 correspondence with weight functions.

Some typical examples are:
    ρ(z) = z²,  giving  w(u) = 2
    ρ(z) = |z|,  giving  w(u) = 1/|u|
    Huber:  ρ(z) = z² if |z| ≤ H,  and  2H|z| − H² otherwise
    Bisquare:  ρ(z) = (B²/6)[1 − {1 − (z/B)²}³] if |z| ≤ B,  and  B²/6 otherwise

[Figures: the distance functions ρ and the corresponding weight functions w for these examples.]
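To make the ρ ↔ w correspondence concrete, here is a minimal R sketch (not from the slides; the helper names and the bisquare tuning constant 4.685 are my own choices) plotting the Huber and bisquare weight functions; the constant factor 2 coming from ρ(z) = z² is omitted since it cancels in the estimating equations.

w_huber <- function(u, H = 1.345) ifelse(abs(u) <= H, 1, H / abs(u))        # Huber weight psi(u)/u (up to a constant)
w_bisq  <- function(u, B = 4.685) ifelse(abs(u) <= B, (1 - (u / B)^2)^2, 0) # bisquare weight
u <- seq(-6, 6, length.out = 200)
plot(u, w_huber(u), type = "l", ylab = "w(u)", ylim = c(0, 1))
lines(u, w_bisq(u), lty = 2)   # the bisquare gives extreme residuals weight exactly zero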
Computing a Regression M-Estimator

We have an explicit expression for the LSE, but M-estimation is a non-linear optimisation problem: use an iterative approach.

Iteratively re-weighted least squares:
1. Obtain an initial estimate β̂^(0).
2. Form normalised residuals u_i^(0) = (y_i − x_i^⊤β̂^(0)) / MAD(y_i − x_i^⊤β̂^(0)).
3. Obtain w_i^(0) = w(u_i^(0)) for the chosen weight function w.
4. Perform weighted least squares with V^(0) = diag{w_1^(0), . . . , w_n^(0)}.
5. Obtain the updated estimate β̂^(1).
6. Iterate until convergence (?).

(Asymptotic) Distribution of M-Estimators

We obtained the M-Estimator as the solution of the system
    X^⊤ ψ(β) = 0
instead of X^⊤(y − Xβ) = 0. Here we defined
    ψ(β) = (ψ(y_1 − x_1^⊤β), . . . , ψ(y_n − x_n^⊤β))^⊤.

If these estimating equations are unbiased, i.e.
    E[X^⊤ ψ(β)] = 0   for all β ∈ ℝ^p,
then under mild regularity conditions, as n → ∞, we can show that
    β̂_n ≈_d N_p( β ,  E[X^⊤∇ψ]^{-1} X^⊤ E[ψψ^⊤] X E[X^⊤∇ψ]^{-1} ).

Example: Professor's Van

[Figures: least squares and robust fits to the van data.]
β̂ = 0.07 (with p = 0.06), while β̃ = 0.09 (with p ≈ 0).
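A minimal R sketch of the iteratively re-weighted least squares scheme above, using the Huber weight (the function and example data are my own choices, not from the slides; in practice MASS::rlm() implements this, including the MAD scale step).

irls_huber <- function(X, y, H = 1.345, tol = 1e-8, maxit = 50) {
  beta <- solve(crossprod(X), crossprod(X, y))                    # step 1: initial LSE
  for (it in 1:maxit) {
    r <- drop(y - X %*% beta)
    u <- r / mad(r)                                               # step 2: MAD-normalised residuals
    w <- ifelse(abs(u) <= H, 1, H / abs(u))                       # step 3: Huber weights
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * y))   # step 4: weighted least squares
    if (max(abs(beta_new - beta)) < tol) break                    # step 6: convergence check
    beta <- beta_new                                              # step 5: update
  }
  drop(beta)
}
X <- cbind(1, stackloss$Air.Flow, stackloss$Water.Temp)           # built-in example data
irls_huber(X, stackloss$stack.loss)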
Asymptotic Relative Efficiency (ARE)

Remember our picture: [Figure]

ARE measures the quality of one estimator β̃ of the p × 1 parameter β relative to another, often the MLE β̂, for which var(β̂) ≈ I^{-1}(β) for large sample sizes.

The ARE of β̃ relative to β̂ is
    ( |var(β̂)| / |var(β̃)| )^{1/p}   (× 100%).

The ARE of β̃_r relative to β̂_r is
    var(β̂_r) / var(β̃_r)   (× 100%).

Generally the ARE of β̃ relative to β̂ is less than 1 (100%): low ARE is bad, high ARE is good.

ARE in the Linear Model

Linear model y = Xβ + ε, with ε_j iid ∼ g(·); assume var(ε_j) = σ² < ∞ is known.

Assume the MLE is regular, with
    i_g = ∫ ( ∂ log g(u)/∂u )² g(u) du = − ∫ ( ∂² log g(u)/∂u² ) g(u) du.

The ARE of the LSE of β relative to the MLE of β is
    1 / (σ² i_g).

Examples:
ARE at g(·) Gaussian: 1
ARE at g(·) Laplace: 1/2
ARE of Huber at g(·) Gaussian is 95% with H = 1.345
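The motivating question of how well the median does when F is actually Normal can be answered by a quick simulation (my own illustration, not from the slides): under Gaussian errors the median has ARE 2/π ≈ 64% relative to the mean, while under Laplace errors the ordering reverses and the ratio is about 2.

set.seed(1)
n <- 101; R <- 10000
rlaplace <- function(n) sample(c(-1, 1), n, replace = TRUE) * rexp(n)   # assumed helper: standard Laplace draws
are_med_vs_mean <- function(rgen) {
  est <- replicate(R, { x <- rgen(n); c(mean = mean(x), med = median(x)) })
  var(est["mean", ]) / var(est["med", ])    # ARE of the median relative to the mean
}
c(gaussian = are_med_vs_mean(rnorm), laplace = are_med_vs_mean(rlaplace))   # approx 2/pi and 2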
Mallow's Rule

"A simple and useful strategy is to perform one's analysis both robustly and by standard methods and to compare the results. If the differences are minor, either set may be presented. If the differences are not minor, one must perforce consider why not, and the robust analysis is already at hand to guide the next steps."

Perform the analysis both ways and compare the results.
Plot the weights to see which observations were downweighted.
Try to understand why.

Nonlinear and Nonparametric Models
The Big Picture

Recall the most general version of regression given in Week 1:
    Y_i | x_i^⊤  ind∼  Dist{g(x_i^⊤)},   i = 1, . . . , n.

So far we have investigated what happens when
    g(x^⊤) = x^⊤β,   β ∈ ℝ^p,   Dist = N(x^⊤β, σ²).

We now consider a more general situation:
    Y_i | x_i^⊤  ind∼  N{η(x_i^⊤, β), σ²},   i = 1, . . . , n,
where η(x_i^⊤, β) is a KNOWN function that depends on a parameter β ∈ ℝ^p, but is not linear in β.
The distribution remains Gaussian.

Example: Logistic Growth

Decennial population data from the US, for 1790–1990; y is population in millions, x is time.

Regression model:
    Y_i = β_1 / (1 + exp(β_2 + β_3 x_i)) + ε_i,   ε_i iid∼ N(0, σ²),   i = 1, . . . , n.

Here
    η(x, β) = β_1 / (1 + exp(β_2 + β_3 x)).

Cannot be transformed into a linear regression problem.
Coefficient interpretation is different than in a linear model.
Related to the differential equation
    dη(x)/dx = C η(x){1 − η(x)}.

[Figure: the US population data with the fitted logistic growth curve.]
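A minimal R sketch of fitting such a logistic growth curve with nls(); it uses the built-in uspop series (decennial US population 1790–1970, in millions, so a slightly shorter span than the slide's 1790–1990) and the self-starting SSlogis() curve, which parametrises the same function as Asym/(1 + exp((xmid − x)/scal)).

pop  <- as.numeric(uspop)                            # built-in decennial US population, 1790-1970
year <- as.numeric(time(uspop))
fit  <- nls(pop ~ SSlogis(year, Asym, xmid, scal))   # logistic growth curve, self-starting
summary(fit)
plot(year, pop, xlab = "Year", ylab = "Population (millions)")
lines(year, fitted(fit), col = "red")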
Basic Observations and Notation

Still assume independent random variables Y_1, . . . , Y_n, with observed values y_1, . . . , y_n, and explanatories x_1, . . . , x_n.
Distribution still Gaussian.

Introduce notation:
    y = (y_1, . . . , y_n)^⊤ ∈ ℝ^n,
    η(β) = (η_1(β), . . . , η_n(β))^⊤ = (η(x_1^⊤, β), . . . , η(x_n^⊤, β))^⊤,
i.e.
    η(·) : ℝ^p → ℝ^n,   β ∈ ℝ^p ↦ η(β) ∈ ℝ^n.
Therefore η(β) is a vector-valued function.

Analogy with the linear case: η(β) plays the role of Xβ, but is no longer linear in β.

The model now is:
    y = η(β) + ε,   β ∈ ℝ^p,   ε ∼ N_n(0, σ²I),
with y, η(β) and ε all n × 1.
Likelihood and . . . least squares - Again!

Since ε_i iid∼ N(0, σ²), we have
    y ∼ N{η(β), σ²I},
so the likelihood and loglikelihood are
    L(β, σ²) = (2πσ²)^{-n/2} exp{ −(1/2σ²) (y − η(β))^⊤(y − η(β)) },
    ℓ(β, σ²) = −(1/2){ n log 2π + n log σ² + (1/σ²)(y − η(β))^⊤(y − η(β)) }.

. . . exactly as in the linear case, but with η(β) replacing Xβ. Hence this suggests the least squares estimators (assuming identifiability):
    β̂ = arg min_{β∈ℝ^p} ‖y − η(β)‖²,
    σ̂² = (1/n) ‖y − η(β̂)‖².

Model Fitting by Taylor Expansions

The main problem is non-linearity: we cannot obtain a closed form solution in general.
↪ Idea: linearise locally, assuming that η is sufficiently smooth.

First-order Taylor expansion: approximate as
    η(β) ≈ η(β^(0)) + [∇η]_{β=β^(0)} (β − β^(0)),
    (n×1)   (n×1)       (n×p)           (p×1)
where β is sufficiently close to β^(0).

We dropped higher order terms by appealing to the smoothness of η (smoothness ≈ "close to zero" higher derivatives).
Model Fitting by Taylor Expansions

The linearised representation suggests a Newton–Raphson-type iteration.

Suppose an initial estimate β^(0) is available. Let D^(0) = [∇η]_{β=β^(0)} and β = β^(0) + u^(0).

The Taylor expansion yields
    y − η(β^(0)) ≈ D^(0) (β − β^(0)) + ε = D^(0) u^(0) + ε.

To get β we need u^(0). Consider the following iteration:
1. Initialise with β^(0).
2. Let
       u^(1) = arg min_{u∈ℝ^p} ‖y − η(β^(0)) − D^(0) u‖²
   (but this is just a linear least squares problem, with y^(0) = y − η(β^(0)) playing the role of the response and D^(0) that of X!)
3. Thus set u^(1) = ([D^(0)]^⊤ D^(0))^{-1} [D^(0)]^⊤ {y − η(β^(0))}.
4. Let β^(1) = β^(0) + u^(1) and iterate until a convergence criterion is satisfied (e.g. ‖β̂^(k+1) − β̂^(k)‖ < ε).
Return the last β^(k) as β̂.

Geometry of Nonlinear Least Squares

As β ranges over ℝ^p, η(β) traces a p-dimensional differentiable manifold (smooth surface) in ℝ^n,
    M(η) = {η(β) : β ∈ ℝ^p}.

β provides the intrinsic coordinates on that manifold.
y is obtained by selecting a point η(β) on the manifold, and adding a mean zero Gaussian vector ε.
Regression asks us to find the coordinates of the point on the manifold that generated y.
We would like to project y onto the manifold, but do not have a closed form expression!
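A minimal R sketch of the iteration above for a generic mean function; eta() and its Jacobian D() are user-supplied placeholders (my own naming, not from the slides), and in practice nls() carries out this kind of fitting with numerically computed derivatives where needed.

gauss_newton <- function(y, eta, D, beta0, tol = 1e-8, maxit = 100) {
  beta <- beta0
  for (it in 1:maxit) {
    r  <- y - eta(beta)                                  # current residuals y - eta(beta^(k))
    Dk <- D(beta)                                        # Jacobian at beta^(k)
    u  <- drop(solve(crossprod(Dk), crossprod(Dk, r)))   # linear least squares step
    beta <- beta + u
    if (sqrt(sum(u^2)) < tol) break                      # convergence criterion
  }
  beta
}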
Geometry of Nonlinear Least Squares

The Newton–Raphson algorithm is interpretable via differential geometry:

The p-dimensional tangent plane at a point η(β^(0)) ∈ M(η) is spanned by the columns of D^(0) = [∇η]_{β=β^(0)}. Hence we may write
    T_{η(β^(0))} M(η) = { η(β^(0)) + D^(0) u : u ∈ ℝ^p }.

In other words, the p columns of D^(0), when translated by η(β^(0)), form a basis for the tangent plane at η(β^(0)).

The Taylor expansion merely says that if β is close to β^(0), we approximately have η(β) − η(β^(0)) ∈ T_{η(β^(0))} M(η). This is equivalent to the expression
    η(β) − η(β^(0)) ≈ [∇η]_{β=β^(0)} (β − β^(0)) = D^(0) u^(0).

Therefore, y − η(β^(0)) ≈ D^(0) u^(0) + ε means that E[y] approximately lies in T_{η(β^(0))} M(η).

The Newton–Raphson algorithm is iterated projection onto approximating linear subspaces.

[Figure: the manifold M(η), the data point y, and the tangent plane at η(β^(0)).]

Geometry of Linear Approximation

Summarising, suppose we consider η(β^(0)) as the origin of the space (i.e., the tangent space is now a subspace).

Then y^(0) = y − η(β^(0)) is approximately the response obtained when adding ε to an element D^(0)(β − β^(0)) ∈ T_{η(β^(0))} M(η).

So, approximately, we have our usual linear problem, and we can use orthogonal projection to solve it.

This amounts to approximating the manifold M(η) by a plane T_{η(β^(0))} M(η) locally around η(β^(0)).

Once the initial value β^(0) is updated to β^(1), we use a new tangent plane approximation and repeat the whole procedure.

But how do we obtain our initial β^(0)?
Choosing β^(0)

Successful linearisation depends on a good initial value.
Occasionally, one can find initial values by inspection in simple problems.
More generally, it takes some experimentation. E.g., one can try fitting polynomial models to the data, use these to find fitted values at fixed design points, and solve a system of equations to get initial values.

Example: consider the model y_j = β_0 + β_1 exp(−x_j/β_2) + ε_j.
1. Fit a polynomial regression to the data.
2. Find fitted values ỹ_0, ỹ_1, ỹ_2 at x_0, x_0 + δ, x_0 + 2δ.
3. Equate the fitted values with the model expectation:
       ỹ_k = β_0 + β_1 exp{−(x_0 + kδ)/β_2},   k = 0, 1, 2.
4. The system yields the initial estimate β_2^(0) = δ / log[(ỹ_0 − ỹ_1)/(ỹ_1 − ỹ_2)].
5. Get initial values for β_0, β_1 by linear regression, once β_2^(0) is at hand.

Approximate CIs for Parameters

Under smoothness conditions on η, one can in general prove that
    S^{-1} {∇η(β̂)^⊤ ∇η(β̂)}^{1/2} (β̂ − β)  ≈_d  N_p(0, I_p)
for large n, where S² = (n − p)^{-1} ‖e‖². We may thus mimic the linear case:
    c^⊤β̂  ≈_d  N_1( c^⊤β ,  S² c^⊤ {∇η(β̂)^⊤ ∇η(β̂)}^{-1} c ).

So base confidence intervals (and tests) on
    ( c^⊤β̂ − c^⊤β ) / { S² c^⊤ {∇η(β̂)^⊤ ∇η(β̂)}^{-1} c }^{1/2}  ≈_d  N(0, 1),
which gives a (1 − α) 100% CI:
    c^⊤β̂ ± z_{α/2} { S² c^⊤ {∇η(β̂)^⊤ ∇η(β̂)}^{-1} c }^{1/2}.

A More Flexible Regression Model

Until today we have discussed the following setup:
    Y_i | x_i  ind∼  Dist[y | η_i]   with   η_i = g(x_i; β),   x_i ∈ ℝ,   β ∈ B ⊆ ℝ^p,
with g(·;·) known up to β, to be estimated from the data. Starting from the simplest problem, Dist = N(η, σ²), e.g.
    Dist(· | η) = N(· | η) and η = g(x | β) = x^⊤β,
    Dist(· | η) = N(· | η) and η = g(x | β) = η(x, β).

We would now like to extend the model to a more flexible dependence:
    Y_i | x_i  ind∼  Dist[y | η_i]   with   η_i = g(x_i),   g ∈ F ⊆ L² (say),
with g unknown, to be estimated given data {(y_i, x_i)}_{i=1}^n. F is usually assumed to be a class of smooth functions (e.g., C^k).

A nonparametric problem (the parameter is ∞-dimensional)!
How do we estimate g in this context?

Scatterplot Smoothing

    Y_i = g(x_i) + ε_i,   ε_i iid∼ N(0, σ²).

[Figure: Motorcycle Accident Data, head acceleration (g) against time after impact (ms).]
Exploiting Smoothness

Ideally: multiple y's at each x_i (n → ∞ and large covariate classes). Then we could average the y's at each x_i and interpolate . . .
But this is never the case . . . Usually the x_i are distinct, with a unique response at each x_i.

[Figure: response against x, one observation per design point.]

Here is where the smoothness assumption comes in:
Since we have a unique y at each x_i, we need to borrow information from nearby . . .
. . . use continuity!!! (or even better, smoothness)
Recall: a function g : ℝ → ℝ is continuous if
    ∀ ε > 0 ∃ δ > 0 : |x − x_0| < δ ⟹ |g(x) − g(x_0)| < ε.
So maybe average the y_i's corresponding to x_i's in a δ-neighbourhood of x as ĝ(x)?
This motivates the use of a kernel smoother . . .
Kernel Smoothing

Naive idea: ĝ(x_0) should be the average of the y_i-values with x_i's "close" to x_0:
    ĝ(x_0) = ( Σ_{i=1}^n y_i 1{|x_i − x_0| ≤ δ} ) / ( Σ_{i=1}^n 1{|x_i − x_0| ≤ δ} ).

A weighted average! Choose other weights? Kernel estimator:
    ĝ(x_0) = ( Σ_{i=1}^n y_i K((x_i − x_0)/λ) ) / ( Σ_{i=1}^n K((x_i − x_0)/λ) ).

K is a weight function (kernel), e.g. a pdf
↪ usually symmetric, non-negative, decreasing away from zero.
λ is the bandwidth parameter
↪ small λ gives local behaviour, large λ gives global behaviour.
The choice of K is not so important; the choice of λ is very important!

The resulting fitted values are linear in the responses, i.e., ŷ = S_λ y, where the smoothing matrix S_λ depends on x_1, . . . , x_n, K and λ. Analogous to a projection matrix in linear regression, but S_λ is NOT a projection.

Visualising a Kernel at Work

[Figure: the kernel weights sliding across the data as x_0 varies.]
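A minimal R sketch of the kernel estimator above with a Gaussian kernel (my own helper, not from the slides); the motorcycle data are taken from the MASS package, whereas the slides use objects time and accel directly.

nw_smooth <- function(x, y, x0, lambda) {
  sapply(x0, function(t) {
    w <- dnorm((x - t) / lambda)     # Gaussian kernel weights K((x_i - x_0)/lambda)
    sum(w * y) / sum(w)              # weighted average of the responses
  })
}
library(MASS)                        # provides the mcycle data
grid <- seq(min(mcycle$times), max(mcycle$times), length.out = 200)
plot(mcycle$times, mcycle$accel, xlab = "Time After Impact (ms)", ylab = "Head Acceleration (g)")
lines(grid, nw_smooth(mcycle$times, mcycle$accel, grid, lambda = 1))
lines(grid, nw_smooth(mcycle$times, mcycle$accel, grid, lambda = 4), col = "red")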
Motorcycle Data Kernel Smooth

plot(time,accel,xlab="Time After Impact (ms)",ylab="Head Accelaration (g)")
lines(ksmooth(time,accel,kernel="normal",bandwidth=0.7))
lines(ksmooth(time,accel,kernel="normal",bandwidth=5),col="red")
lines(ksmooth(time,accel,kernel="normal",bandwidth=10),col="blue")

http://smat.epfl.ch/courses.html

[Figure: kernel smooths of the motorcycle data for the three bandwidths.]

Penalised Likelihood

Find g ∈ C² that minimises
    Σ_{i=1}^n {y_i − g(x_i)}²  +  λ ∫_I {g''(t)}² dt
(the first term is the fit, the second the roughness penalty).

This is a Gaussian likelihood with a roughness penalty
↪ if we use only the likelihood, any interpolating function is an MLE!
λ balances fidelity to the data and smoothness of the estimated g.

Remarkably, the problem has a unique explicit solution!
↪ a Natural Cubic Spline with knots at {x_i}_{i=1}^n:
    piecewise polynomials of degree 3,
    with pieces defined at the knots,
    with two continuous derivatives at the knots,
    and linear outside the data boundary.
Natural Cubic Spline Details

Can represent such splines via natural spline basis functions B_j, as
    s(t) = Σ_{j=1}^n γ_j B_j(t),
where
    B_1(t) = 1,
    B_2(t) = t,
    B_{m+2}(t) = d_m(t) − d_{n−1}(t),   m = 1, . . . , n − 2,
with
    d_m(t) = [ (t − x_m)_+³ − (t − x_n)_+³ ] / (x_n − x_m),
where the x_m are the knot locations and (·)_+ = max{·, 0} is the positive part of any function.

Figure: The n = 4 natural spline basis functions for knots at x_1 = 0.2, x_2 = 0.4, x_3 = 0.6 and x_4 = 0.8.
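A small R sketch (my own helper names, not from the slides) that builds these n basis functions and reproduces the figure's setting of four knots.

d_m <- function(t, xm, xn) (pmax(t - xm, 0)^3 - pmax(t - xn, 0)^3) / (xn - xm)   # the d_m(t) above
ns_basis <- function(t, knots) {
  n <- length(knots)
  B <- cbind(1, t, sapply(1:(n - 2), function(m)
    d_m(t, knots[m], knots[n]) - d_m(t, knots[n - 1], knots[n])))
  colnames(B) <- paste0("B", 1:n)
  B
}
tt <- seq(0, 1, length.out = 200)
matplot(tt, ns_basis(tt, c(0.2, 0.4, 0.6, 0.8)), type = "l", lty = 1, ylab = "B_j(t)")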
Where does this come from?

We wish to find a basis for natural cubic splines with knot locations {x_i}_{i=1}^n.

Observe that any piecewise polynomial PP_3(t) of order 3 with 2 continuous derivatives at the knots can be expanded in the truncated power series basis
    PP_3(t) = Σ_{j=0}^3 a_j t^j + Σ_{i=1}^n b_i (t − x_i)_+³.

The n + 4 coefficients {a_j}_{j=0}^3 ∪ {b_i}_{i=1}^n must satisfy constraints to ensure linearity beyond the boundary knots:
    a_2 = 0  and  a_3 = 0,
    Σ_{i=1}^n b_i = 0,
    Σ_{i=1}^n b_i x_i = 0.

Can then use these relations to re-express the basis in the form on the previous slide, with only n (rather than n + 4) basis functions, and unconstrained coefficients.

Back to Penalised Likelihood

Letting
    γ = (γ_1, . . . , γ_n)^⊤,   g(t) = Σ_{i=1}^n γ_i B_i(t),
    B = {B_ij} = {B_j(x_i)},   Ω_ij = ∫ B_i''(t) B_j''(t) dt,
our penalised likelihood
    Σ_{i=1}^n {y_i − g(x_i)}² + λ ∫_I {g''(t)}² dt
becomes
    (y − Bγ)^⊤(y − Bγ) + λ γ^⊤Ωγ.

Differentiating and equating to zero yields
    (B^⊤B + λΩ) γ̂ = B^⊤y  ⟹  γ̂ = (B^⊤B + λΩ)^{-1} B^⊤y.

The smoothing matrix is S_λ = B(B^⊤B + λΩ)^{-1}B^⊤. The natural cubic spline fit is approximately a kernel smoother.
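The closed form γ̂ = (B^⊤B + λΩ)^{-1}B^⊤y can be checked numerically; the sketch below is my own construction on simulated data, reusing the ns_basis() helper from the earlier sketch, using only ten knots for simplicity and approximating Ω by a Riemann sum. It is for illustration only, since smooth.spline() is the practical tool.

ns_basis_dd <- function(t, knots) {                      # second derivatives B_j''(t)
  n <- length(knots)
  dd <- function(xm) 6 * (pmax(t - xm, 0) - pmax(t - knots[n], 0)) / (knots[n] - xm)
  cbind(0, 0, sapply(1:(n - 2), function(m) dd(knots[m]) - dd(knots[n - 1])))
}
set.seed(1)
x <- sort(runif(50)); y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)   # simulated data
knots <- unname(quantile(x, seq(0, 1, length.out = 10)))           # 10 knots for simplicity
tt <- seq(0, 1, length.out = 1000)
Omega <- crossprod(ns_basis_dd(tt, knots)) / length(tt)            # crude Riemann approximation of Omega
B <- ns_basis(x, knots)
gamma_hat <- solve(crossprod(B) + 0.001 * Omega, crossprod(B, y))  # (B'B + lambda*Omega)^{-1} B'y
plot(x, y); lines(tt, ns_basis(tt, knots) %*% gamma_hat, col = "red")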
Motorcycle Example Cubic Spline Fit

lines(smooth.spline(time,accel),col="red")

[Figure: the smoothing spline fit overlaid on the motorcycle data.]

Motorcycle Example Cubic Spline Residuals

[Figures: residuals against time (ms), and a normal Q-Q plot of the residuals.]
Equivalent Degrees of Freedom

Least squares estimation, y = Xβ + ε, gives ŷ = Hy with trace(H) = p, in terms of the projection matrix H = X(X^⊤X)^{-1}X^⊤. Here
    ŷ = B(B^⊤B + λΩ)^{-1}B^⊤ y = S_λ y.

Idea: define the equivalent degrees of freedom of the smoother S_λ as
    trace(S_λ) = Σ_{j=1}^n 1 / (1 + λτ_j),
where the τ_j are the eigenvalues of K = (B^⊤B)^{-1/2} Ω (B^⊤B)^{-1/2}.

Each eigenvalue of S_λ lies in (0, 1] (equal to 1 only in the two directions on which the penalty vanishes), so this is a smoothing, NOT a projection, matrix.
Hence trace(S_λ) is monotone decreasing in λ, with trace(S_λ) → 2 as λ → ∞ (K has two zero eigenvalues) and trace(S_λ) → n as λ → 0.
Note the 1–1 map λ ↔ trace(S_λ) = df, so we usually determine the roughness using df (interpretation is easier).

Bias/Variance Tradeoff

Focus on the fit for the given grid x_1, . . . , x_n:
    ĝ = (ĝ(x_1), . . . , ĝ(x_n)),   g = (g(x_1), . . . , g(x_n)).

Consider the mean squared error:
    E‖ĝ − g‖² = E{‖E(ĝ) − ĝ‖²} + ‖g − E(ĝ)‖²
(the first term is the variance, the second the squared bias).
When the estimator is potentially biased, we need to worry about both!

In the case of a linear smoother, for which ĝ = S_λ y, we find that
    E(‖ĝ − g‖²) = trace(S_λ S_λ^⊤) σ² + (g − S_λ g)^⊤ (g − S_λ g).

Choosing λ:
    λ ↑ ⟹ variance ↓ but bias ↑,
    λ ↓ ⟹ bias ↓ but variance ↑.
We would like to choose λ to find the optimal bias-variance tradeoff:
↪ unfortunately, the optimal λ will generally depend on the unknown g!
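In R, smooth.spline() exposes exactly this λ ↔ df correspondence: one may ask for a fit by equivalent degrees of freedom and read off the implied λ (the mcycle data from MASS again stand in for the slides' time/accel objects).

library(MASS)
fit <- with(mcycle, smooth.spline(times, accel, df = 12))   # request 12 equivalent degrees of freedom
c(lambda = fit$lambda, df = fit$df)                         # the lambda that produces df = 12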
Fitted values are ŷ = S_λ y.

The fitted value ŷ_j^(−j) obtained when y_j is dropped from the fit satisfies
    S_jj(λ) (y_j − ŷ_j^(−j)) = ŷ_j − ŷ_j^(−j),
where S_jj(λ) is the (j, j) element of S_λ.

The cross-validation sum of squares is therefore
    CV(λ) = Σ_{j=1}^n (y_j − ŷ_j^(−j))² = Σ_{j=1}^n { (y_j − ŷ_j) / (1 − S_jj(λ)) }²,
and the generalised cross-validation sum of squares is
    GCV(λ) = Σ_{j=1}^n { (y_j − ŷ_j) / (1 − trace(S_λ)/n) }².

Orthogonal Series: “Parametrising” The Problem

Depending on what F is, if g(·) ∈ F (a Hilbert space) we can write
    g(x) = Σ_{k=1}^∞ θ_k φ_k(x)   (in an appropriate sense),
with {φ_k}_{k=1}^∞ known (orthogonal) basis functions for F, e.g.
    F = L²(−π, π),   {φ_k} = {e^{ikx}}_{k∈ℤ},   φ_i ⊥ φ_j, i ≠ j.
This gives the Fourier series expansion, θ_k = (1/2π) ∫ g(x) e^{−ikx} dx.

Idea: if we truncate the series, then we have a simple linear regression!
    Y_i = Σ_{k=1}^K θ_k φ_k(x_i) + ε_i,   K < ∞.

Notice: truncation has implications, e.g., in the Fourier case:
truncating implies that we assume g ∈ G ⊂ L²;
interpret this as a smoothness assumption on g.
How do we choose K optimally?
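In R, smooth.spline() will pick λ by exactly these criteria: generalised cross-validation by default, or ordinary leave-one-out cross-validation with cv = TRUE (again using the mcycle data, with MASS already loaded).

fit_gcv <- with(mcycle, smooth.spline(times, accel))             # lambda chosen by GCV
fit_cv  <- with(mcycle, smooth.spline(times, accel, cv = TRUE))  # leave-one-out CV (duplicated x values trigger a warning)
c(df_gcv = fit_gcv$df, df_cv = fit_cv$df)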
Convolution: Series Truncation ≃ Smoothing

Easy exercise in Fourier analysis:
    Σ_{|k|≤ν} θ_k e^{ikx} = (1/2π) ∫ g(y) D_ν(x − y) dy,
with the Dirichlet kernel of order ν,
    D_ν(u) = sin{(ν + 1/2)u} / sin(u/2).

[Figure: the Dirichlet kernel.]

Recall the kernel smoother:
    ĝ(x_0) = ( Σ_{i=1}^n y_i K(x_i − x_0) ) / ( Σ_{i=1}^n K(x_i − x_0) ) = c ∫_I y(x) K(x − x_0) dx,
with y(x) = Σ_{i=1}^n y_i δ(x − x_i).

So if K is the Dirichlet kernel, we can do series approximation via kernel smoothing.
This works for other series expansions with other kernels (e.g., Fourier with convergence factors).
From x ∈ ℝ to (x_1, . . . , x_p) ∈ ℝ^p

So far: how to estimate g : ℝ → ℝ (assumed smooth) in
    Y_i = g(x_i) + ε_i,   ε_i iid∼ N(0, σ²),
given data {(y_i, x_i)}_{i=1}^n.

Generalise to include multivariate explanatories?
"Immediate" generalisation: g : ℝ^p → ℝ (smooth),
    Y_j = g(x_{j1}, . . . , x_{jp}) + ε_j,   ε_j iid∼ N(0, σ²).
Estimation by (e.g.) a multivariate kernel method.
Two basic drawbacks of this approach . . .
    the shape of the kernel (the definition of "local"),
    the curse of dimensionality.

What is "local" in ℝ^p?

We need some definition of "local" in the space of explanatories
↪ use some metric on ℝ^p ∋ (x_1, . . . , x_p)!
But which one? Choice of metric ⟺ choice of geometry (e.g., curvature reflects the intertwining of dimensions).

The geometry should reflect structure in the explanatories:
    potentially different units of measurement (variable stretching of space),
    g may be of higher variation in some dimensions (need finer neighbourhoods there),
    statistical dependencies present in the explanatories ("local" should reflect these).
Curse of Dimensionality (U[0, 1]^p)

[Figure: side length of the sub-cube needed to capture a given average percentage of a uniform sample, for p = 1, 2, 5, 10.]

Curse of Dimensionality

Bellman (1961): "neighbourhoods with a fixed number of points become less local as the dimensions increase"

The notion of local in terms of % of data fails in high dimensions
↪ there is too much space!
Hence, to allow for reasonably small bandwidths
↪ the density of sampling must increase.
We need ever larger samples as the dimension grows.
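A one-line computation (my own illustration) behind the figure: to capture a fraction r of a uniform sample on [0,1]^p, a sub-cube needs side length r^(1/p), which approaches 1 as p grows, so "local" neighbourhoods stop being local.

r <- 0.1                                          # target fraction of the sample
sapply(c(1, 2, 5, 10), function(p) r^(1 / p))     # required side length for p = 1, 2, 5, 10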
Tackling the Dimensionality Issue

Attempt to find a link/compromise between:
    our mastery of the 1D case (at least we can do that well . . . ),
    and higher dimensional explanatories (and their associated difficulties).

One approach: Projection-Pursuit Regression
    Y = Σ_{k=1}^K h_k(ϑ_k^⊤ x) + ε,   ‖ϑ_k‖ = 1,   ε ∼ N(0, σ²).

It additively decomposes g into smooth functions h_k : ℝ → ℝ.
Each function depends on a global feature
↪ a linear combination of the explanatories,
↪ with the projection directions chosen for best fit
↪ similarities to tomography.
Each h_k is a ridge function of x: it varies only in the direction defined by ϑ_k.

Projection Pursuit Regression

How is the model fitted to data?
Assume only one term, K = 1, and consider the penalised likelihood:
    min_{h_1 ∈ C², ‖ϑ‖=1} Σ_{i=1}^n {y_i − h_1([ϑ^⊤x]_i)}² + λ ∫_I {h_1''(t)}² dt.

Two steps:
Smooth: given a direction ϑ, fitting h_1(ϑ^⊤x) is done via 1D smoothing splines.
Pursue: given h_1, we have a non-linear regression problem w.r.t. ϑ.
Hence, iterate between the two steps.
The complication is that h_1 is not explicitly known, so we need numerical derivatives.
Computationally intensive (impractical in the '80s).
→ Further terms are added in a forward stepwise manner.
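Projection pursuit regression is available in base R as ppr(); a quick sketch on the rock permeability data used later in the slides (the number of terms here is my own arbitrary choice).

fit_ppr <- ppr(perm ~ area + peri, data = rock, nterms = 2)   # two ridge terms h_k(theta_k' x)
summary(fit_ppr)                                              # fitted directions and term summaries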
Additive Models

Projection pursuit:
(+) It can uniformly approximate C¹ functions on compact subsets of ℝ^p arbitrarily well as K → ∞ (very useful for prediction).
(−) Interpretability? What do the terms mean within the problem?

We need something that can be interpreted variable-by-variable.

Compromise: the Additive Model
    Y_j = α + Σ_{k=1}^p f_k(x_{jk}) + ε_j,   ε_j iid∼ N(0, σ²),
the f_k's univariate smooth functions, with Σ_j f_k(x_{jk}) = 0.

In our standard setting, we have:
    Y_j | x̃_j  ind∼  Dist(· | η_j)  →  Dist = N(η_j, σ²),   η_j = μ_j = α + Σ_{k=1}^p f_k(x_{jk}).

The Backfitting Algorithm

How do we fit the additive model?
↪ We know how to fit each f_k separately quite well.
↪ Take advantage of this . . .

Motivation: fix j and drop it for ease:
    E[ Y − Σ_{m≠k} f_m(x_m) ] = f_k(x_k).

This suggests the Backfitting Algorithm:
(1) Initialise: α = ave{y_j}, f_k = f_k⁰, k = 1, . . . , p.
(2) Cycle: f_k = S_k( y − α − Σ_{m≠k} f_m ),   k = 1, . . . , p, 1, . . . , p, . . .
(3) Stop: when the individual functions don't change.
Here S_k is an arbitrary scatterplot smoother.
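A minimal backfitting sketch for two explanatories, using smooth.spline() as the scatterplot smoother S_k (my own construction on simulated data, not from the slides; in practice one would use a gam() implementation, as in the rock example below).

backfit2 <- function(x1, x2, y, df = 4, maxit = 20) {
  alpha <- mean(y)                                   # (1) initialise
  f1 <- f2 <- rep(0, length(y))
  for (it in 1:maxit) {                              # (2) cycle over the two smooths
    f1 <- predict(smooth.spline(x1, y - alpha - f2, df = df), x1)$y; f1 <- f1 - mean(f1)
    f2 <- predict(smooth.spline(x2, y - alpha - f1, df = df), x2)$y; f2 <- f2 - mean(f2)
  }
  list(alpha = alpha, f1 = f1, f2 = f2)
}
set.seed(1)
x1 <- runif(200); x2 <- runif(200)
y  <- 1 + sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(200, sd = 0.2)
fit <- backfit2(x1, x2, y)
plot(x1, fit$f1); plot(x2, fit$f2)                   # estimated f_1 and f_2 (up to centring)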
Example: Diabetes Data

[Figures: C-peptide plotted against Age and against Base Deficit, together with the fitted additive-model smooth terms s(age, 2.29) and s(base, 1.87).]
Example: Rock Permeability Data

Measurements on 48 rock samples from a petroleum reservoir:

rock.gam<-gam(perm~1+s(peri)+s(area),family=gaussian)

[Figures: the estimated smooth terms s(peri, 8.74) and s(area, 3.36).]

Family: gaussian
Link function: identity

Formula:
perm ~ 1 + s(peri) + s(area)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   415.45      27.18   15.29   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Est.rank      F  p-value
s(peri) 8.739        9 18.286 9.49e-11 ***
s(area) 3.357        7  6.364 7.41e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = 0.815   Deviance explained = 86.3%
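The slides do not say which package supplies gam(); the output format shown matches mgcv, in which case the call would need the package loaded and the data frame named explicitly, roughly as below (my reconstruction, not from the slides).

library(mgcv)                                   # assumption: the mgcv implementation of gam()
rock.gam <- gam(perm ~ 1 + s(peri) + s(area), family = gaussian, data = rock)
summary(rock.gam)
plot(rock.gam, pages = 1)                       # the two estimated smooth terms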
Where does the cubic spline magic come from??

To show the result, we will first consider the problem of:

Smooth Interpolation (with boundary constraints)
Given design points 0 < x_1 < . . . < x_n < 1 and corresponding response values {y_i}_{i=1}^n, find a function
    h ∈ H := { g : [0, 1] → ℝ : ∫_0^1 (g''(u))² du < ∞, g(0) = g'(0) = 0 }
such that:
    h interpolates the points {(x_i, y_i)}_{i=1}^n, i.e. h(x_i) = y_i, i = 1, . . . , n;
    h is smoothest (has minimal curvature) among all H-interpolants,
        C(h) := ∫_0^1 (h''(x))² dx ≤ ∫_0^1 (f''(x))² dx =: C(f)   for every interpolant f ∈ H.

A 'magic' function

Our analysis will hinge on a seemingly magical function (which is in fact the covariance of integrated Brownian motion):
    k(x, y) = ( x²y/2 − x³/6 ) 1{0 ≤ x < y} + ( xy²/2 − y³/6 ) 1{y ≤ x ≤ 1}.

[Figure: x ↦ k(x, 1/4), x ↦ k(x, 1/2), x ↦ k(x, 3/4).]
k(x, y) = ( x²y/2 − x³/6 ) 1{0 ≤ x < y} + ( xy²/2 − y³/6 ) 1{y ≤ x ≤ 1}

Now write k_{x_0}(x) = k(x, x_0) and notice (the second equality below is Taylor's formula with integral remainder, f(x) = f(0) + f'(0)x + ∫_0^x f''(u)(x − u) du) that for any f ∈ C²:
    ∫_0^1 k''_{x_0}(u) f''(u) du = ∫_0^{x_0} (x_0 − u) f''(u) du + ∫_{x_0}^1 0 du = f(x_0) − f(0) − f'(0) x_0.

So if h'(0) = h(0) = 0, we can evaluate h at x_0 by integrating it against k_{x_0}:
    ∫_0^1 k''_{x_0}(u) h''(u) du = h(x_0).

Note that k_y(x) itself is C² and such that k_y(0) = k'_y(0) = 0, so that
    ∫_0^1 k''_x(u) k''_y(u) du = k_y(x).
We say that k reproduces itself.

Theorem (k is a non-negative kernel)
Given any {x_i}_{i=1}^n ⊂ (0, 1) we have
    Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) ≥ 0   for all (a_1, . . . , a_n)^⊤ ∈ ℝ^n,
in other words K = {k(x_i, x_j)}_{i,j=1}^n is nonnegative. If furthermore the {x_i} are distinct, we have strict inequality in the display and so K is strictly positive.

Proof.
By k reproducing itself we can see that
    Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) = ∫_0^1 ( Σ_{i=1}^n a_i k''_{x_i}(u) )² du ≥ 0.
Each function k''_{x_i}(u) = (x_i − u) 1{u < x_i} is a linear function supported on [0, x_i); if the {x_i} are distinct, these supports are distinct and the k''_{x_j} are linearly independent. Consequently the sum can be zero only if a_1 = a_2 = . . . = a_n = 0.

Step 2: Guessing a Feasible Candidate Solution

We guess the form
    h(x) = Σ_{i=1}^n θ_i k_{x_i}(x),
where the θ_i are chosen so that the interpolation property is valid,
    y_i = Σ_{j=1}^n θ_j k(x_j, x_i),   i = 1, . . . , n,
by solving a linear system of n equations in n unknowns, y = Kθ, where
    y = (y_1, . . . , y_n)^⊤,   K = {k(x_j, x_i)}_{i,j=1}^n,   θ = (θ_1, . . . , θ_n)^⊤.
The solution K^{-1}y exists uniquely, because K ≻ 0 since k is a nonnegative kernel and the x_i are distinct.

Check: h is feasible.
We can directly verify that h ∈ C² and h(0) = h'(0) = 0, so h ∈ H.
By the reproducing kernel's evaluation property and the interpolation constraints, h also interpolates:
    h(x_i) = Σ_{j=1}^n θ_j k(x_j, x_i) = y_i.

Step 3: Showing Optimality

Consider another interpolant h̄ ∈ H. Letting δ = h̄ − h, we have δ ∈ H and
    C(h̄) = ∫ (h̄'')² = ∫ (h'' + δ'')² = ∫ (h'')² + ∫ (δ'')² + 2 ∫ h''δ''.

By the evaluation property,
    ∫ h''δ'' = ∫ δ'' Σ_{i=1}^n θ_i k''_{x_i} = Σ_{i=1}^n θ_i ∫ δ'' k''_{x_i} = Σ_{i=1}^n θ_i δ(x_i) = Σ_i θ_i (h̄(x_i) − h(x_i)) = 0.

Therefore
    C(h̄) = ∫ (h̄'')² = ∫ (h'')² + ∫ (δ'')² ≥ ∫ (h'')² = C(h),
showing optimality. The inequality is strict unless ∫ (δ'')² = 0, in which case δ'' = 0. Taylor's formula (with integral remainder) then says
    δ(x) = δ(0) + δ'(0)x + ∫_0^x δ''(u)(x − u) du = (h̄(0) − h(0)) + (h̄'(0) − h'(0))x + 0 = 0,
establishing uniqueness of the optimum.
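A small numerical sketch (my own simulated design points and responses, not from the slides) can make both facts concrete: the first lines verify the evaluation property ∫_0^1 k''_{x_0} f'' = f(x_0) for one f ∈ H by numerical integration, and the rest solve y = Kθ and plot the resulting interpolant h(x) = Σ_i θ_i k_{x_i}(x).

k <- function(x, y)
  (x^2 * y / 2 - x^3 / 6) * (x < y) + (x * y^2 / 2 - y^3 / 6) * (y <= x)

## Evaluation property at x0 = 0.4 for f(u) = u^3 (so f(0) = f'(0) = 0, f''(u) = 6u):
x0 <- 0.4
integrate(function(u) pmax(x0 - u, 0) * 6 * u, 0, 1)$value   # equals f(x0) = 0.064

## Smooth interpolation of simulated points:
set.seed(1)
x <- sort(runif(6)); y <- sin(2 * pi * x)
K <- outer(x, x, k)                      # K_{ij} = k(x_i, x_j), symmetric and positive definite
theta <- solve(K, y)                     # theta = K^{-1} y
xx <- seq(0, 1, length.out = 200)
h <- drop(outer(xx, x, k) %*% theta)     # h(x) = sum_i theta_i k(x, x_i)
plot(x, y); lines(xx, h)                 # interpolates the points, with h(0) = h'(0) = 0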
Recap

Theorem (Smooth Interpolation)
Given design points 0 < x_1 < . . . < x_n < 1 and corresponding response values {y_i}_{i=1}^n, there exists a function of the form
    h(x) = Σ_{i=1}^n θ_i k_{x_i}(x)
that uniquely minimises the curvature functional
    C(g) = ∫_0^1 (g''(u))² du
among all g ∈ H = { g : [0, 1] → ℝ : ∫_0^1 (g''(u))² < ∞, g(0) = g'(0) = 0 } that interpolate {(x_i, y_i)}_{i=1}^n. The coefficients θ = (θ_1, . . . , θ_n)^⊤ are uniquely given by
    θ = K^{-1} y,
where y = (y_1, . . . , y_n)^⊤ and K = {k(x_j, x_i)}_{i,j=1}^n.

Smoothing vs Interpolation

Interpolating would be overfitting; we want to smooth: find g ∈ C² minimising
    Σ_{i=1}^n {y_i − g(x_i)}²  +  λ ∫_{x_1}^{x_n} {g''(t)}² dt
(fit plus roughness penalty), for design points x_1 < . . . < x_n and responses {y_i}_{i=1}^n. Note that we can always re-scale the x-axis so that, without loss of generality, we may assume 0 < x_1 < . . . < x_n < 1.

We proceed in two steps:
1. Split the candidate function into a linear part plus an H-part.
2. Use smooth interpolation as a technical device.

Smoothing vs Interpolation, Step 1: Linear Plus H

Any candidate g ∈ C² can be decomposed into two components by Taylor's formula with integral remainder:
    g(x) = g(0) + g'(0)x + ∫_0^x g''(u)(x − u) du = α + βx + h(x),
where h is obviously C² and can be verified to satisfy h(0) = 0 and h'(0) = 0. Therefore
    h ∈ H = { g : [0, 1] → ℝ : ∫_0^1 (g''(u))² du < ∞, g(0) = g'(0) = 0 }.

So we may re-write the problem as minimising
    Σ_{i=1}^n {y_i − (α + βx_i + h(x_i))}²  +  λ ∫_0^1 {h''(t)}² dt
over all (α, β, h) ∈ ℝ × ℝ × H.
Note that the penalty term is indifferent to (α, β), since ∫_0^1 {g''}² = ∫_0^1 {h''(t)}² dt.

Smoothing vs Interpolation, Step 2: Guesswork and Leveraging Interpolation

Given our smooth interpolation theorem, we guess that the optimum has the form
    s(x) = α + βx + Σ_{j=1}^n θ_j k_{x_j}(x).

To show this, let g(x) be any C² function, so that
    g(x) = α_1 + β_1 x + h(x),   α_1, β_1 ∈ ℝ,   h ∈ H,
and define z_i = h(x_i). Letting f be the smoothest interpolant of the (x_i, z_i) among H-functions, we have
    Σ_{i=1}^n {y_i − α_1 − β_1 x_i − h(x_i)}² + λ ∫_0^1 {h''}²  ≥  Σ_{i=1}^n {y_i − α_1 − β_1 x_i − f(x_i)}² + λ ∫_0^1 {f''}²
(the fit terms are equal because h(x_i) = z_i = f(x_i)), where the inequality is strict unless g is itself of the form s(x) (whence f = h, by our smooth interpolation theorem).
And the final step

Observe that k_y(·) = k(·, y) is C², is a cubic on [0, y) and linear on [y, 1].
Hence Σ_{j=1}^n θ_j k_{x_j}(x) is
    a cubic in any subinterval [x_i, x_{i+1}),
    C² at the joins (knots) x_j,
    zero left of x_1 and linear right of x_n.
Therefore s(x) = α + βx + Σ_{j=1}^n θ_j k_{x_j}(x) is
    a cubic polynomial between knots,
    C² at the knots,
    linear outside [x_1, x_n],
. . . i.e. it is a natural cubic spline with knots {x_1, . . . , x_n}.

. . . it turns out that any natural cubic spline with knots {x_1, . . . , x_n} can also be written as α + βx + Σ_{j=1}^n θ_j k_{x_j}(x) (we will not prove the latter).

Therefore,
1. The solution s(x) is of the form Σ_{k=1}^n γ_k B_k(x).
2. θ^⊤Kθ = ∫_0^1 ( Σ_{j=1}^n θ_j k''_{x_j}(u) )² du = ∫_0^1 ( Σ_{j=1}^n γ_j B_j''(u) )² du = γ^⊤Ωγ,
   where Ω_ij = ∫_0^1 B_i''(t) B_j''(t) dt (the integrand is just [s''(u)]², since the linear part of s has zero second derivative).
   The first equality is from the reproducing kernel property.
   Ω ⪰ 0, because if there existed a γ with γ^⊤Ωγ < 0, then there would exist (α, β, θ) such that Σ_{k=1}^n γ_k B_k(x) = α + βx + Σ_{j=1}^n θ_j k_{x_j}(x) and θ^⊤Kθ = γ^⊤Ωγ < 0, contradicting our proof that K ⪰ 0.
Theorem (Roughness Penalised Least Squares is Spline Smoothing)
Given covariates 0 < x_1 < . . . < x_n < 1 and responses {y_i}_{i=1}^n, the functional
    L(g) = Σ_{i=1}^n {y_i − g(x_i)}² + λ ∫_0^1 (g''(u))² du
is uniquely minimised over g ∈ C²(0, 1) at
    ĝ(x) = Σ_{j=1}^n γ̂_j B_j(x),
with
    (γ̂_1, . . . , γ̂_n)^⊤ = γ̂ = (B^⊤B + λΩ)^{-1} B^⊤y,
where
    B_j(x) is the j-th natural spline basis function with knots {x_j}_{j=1}^n,
    y = (y_1, . . . , y_n)^⊤,
    B = {B_j(x_i)}_{i,j=1}^n,
    Ω = { ∫_0^1 B_i''(t) B_j''(t) dt }_{i,j=1}^n is a positive definite matrix.