Linear methods for regression
Hege Leite Størvold
Tuesday 12.04.05
Linear regression models

Assumes that the regression function E(Y | X) is linear:

    E(Y | X) = β_0 + Σ_{i=1}^{p} β_i X_i = f(X)

Linear models are old tools, but they are:

- Still very useful
- Simple
- Easy to interpret in terms of the effects of the regressors
- Very general, since the X_i's can be any functions of other variables
  (quantitative or qualitative)

They are also useful to understand because most other methods are
generalizations of them.
Matrix Notation



- X is the n × (p+1) matrix of input vectors
- y is the n-vector of outputs (labels)
- β is the (p+1)-vector of parameters

        | 1  x_1^T |   | 1  x_11  x_12  ...  x_1p |
    X = | 1  x_2^T | = | 1  x_21  x_22  ...  x_2p |
        |   ...    |   |           ...            |
        | 1  x_n^T |   | 1  x_n1  x_n2  ...  x_np |

    y = (y_1, y_2, ..., y_n)^T
    β = (β_0, β_1, ..., β_p)^T
Least squares estimation

The linear regression model has the form

    f(X) = β_0 + Σ_{j=1}^{p} x_j β_j

where the β_j's are unknown parameters or coefficients.

Typically we have a set of training data (x1,y1), …, (xn,yn)
from which we want to estimate the parameters β.

The most popular estimation method is least squares
Linear regression and least squares

Least squares: find the solution β̂ by minimizing the
residual sum of squares (RSS):

    RSS(β) = Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2 = (y − Xβ)^T (y − Xβ)
This is a reasonable criterion when:

- the training samples are random, independent draws, OR
- the y_i's are conditionally independent given the x_i's
Geometrical view of least squares

Simply find the best linear fit to the data.

e_i is the residual of observation i.

[Figure: fitted regression and residuals; left panel: one covariate, right panel: two covariates]
Solving Least Squares

Derivative of a quadratic product:

    d/dx (Ax + b)^T C (Dx + e) = A^T C (Dx + e) + D^T C^T (Ax + b)

Then,

    ∂RSS/∂β = ∂/∂β (Xβ − y)^T I_n (Xβ − y)
            = X^T I_n (Xβ − y) + X^T I_n (Xβ − y)
            = −2 X^T (y − Xβ)

Setting the first derivative to zero:

    X^T y − X^T X β = 0
    X^T X β = X^T y
    β̂ = (X^T X)^{-1} X^T y
The normal equations

Assuming that (X^T X) is non-singular, the normal
equations give the unique least squares solution:

    β̂_OLS = (X^T X)^{-1} X^T y

Least squares predictions:

    ŷ = X β̂_OLS = X (X^T X)^{-1} X^T y

When (X^T X) is singular the least squares coefficients are
no longer uniquely defined, and some alternative strategy is
needed to obtain a solution:

- Recoding and/or dropping redundant columns
- Filtering
- Controlling the fit by regularization
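As a concrete illustration of the estimator above, here is a minimal numpy sketch (data and variable names are invented for the example) that solves the normal equations directly and checks the result against numpy's built-in least squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p+1), first column = intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y (assumes X^T X is non-singular)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same solution from numpy's (numerically more stable) least squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))   # True
```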
Geometrical interpretation of least squares estimates

The predicted outcome ŷ is the orthogonal projection of y onto
the column space of X (which spans a subspace of R^n).
Properties of least squares estimators

If the Y_i are independent, X is fixed and Var(Y_i) = σ^2 is
constant, then

    E(β̂) = β,    Var(β̂) = σ^2 (X^T X)^{-1}

and

    E(σ̂^2) = σ^2    with    σ̂^2 = 1/(n − p − 1) Σ_i ( y_i − f̂(x_i) )^2

If, in addition, Y_i = f(X_i) + ε with ε ~ N(0, σ^2), then

    β̂ ~ N( β, (X^T X)^{-1} σ^2 )    and    (n − p − 1) σ̂^2 ~ σ^2 χ^2_{n−p−1}
Properties of the least squares solution

To test the hypothesis that a particular coefficient β_j = 0 we
calculate

    z_j = β̂_j / ( σ̂ √v_j )

where v_j is the j'th diagonal element of (X^T X)^{-1}.

- Under the null hypothesis that β_j = 0, z_j is distributed as
  t_{n−p−1}, and hence a large absolute value of z_j will reject the
  null hypothesis.
- A (1 − 2α) confidence interval for β_j can be formed by:

      ( β̂_j − z^{(1−α)} √v_j σ̂ ,  β̂_j + z^{(1−α)} √v_j σ̂ )

- An F-test can be used to test the nullity of a vector of
  parameters.
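A small sketch of how the z-scores could be computed from the quantities defined above (the function name and the assumption that X already contains the intercept column are mine):

```python
import numpy as np

def z_scores(X, y):
    """OLS estimates and their z-scores; X is assumed to include the intercept column."""
    n, p1 = X.shape                                  # p1 = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p1)            # sigma_hat^2 = RSS / (n - p - 1)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))      # sigma_hat * sqrt(v_j)
    return beta_hat, beta_hat / se                   # z_j = beta_hat_j / (sigma_hat sqrt(v_j))
```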
Significance of Many Parameters

We may want to test many features at once, comparing a reduced
model M_0 with m parameters to an alternative (full) model M_A with
k parameters that contains M_0 (m < k).

Use the F statistic:

    F = [ (RSS_0 − RSS_A) / (k − m) ] / [ RSS_A / (n − k − 2) ]  ~  F(k − m, n − k − 2)

where RSS_0 is the residual sum of squares of the reduced model and
RSS_A that of the full (alternative) model.
Example: Prostate cancer


- Response: level of prostate-specific antigen
- Regressors: 8 clinical measures for men receiving a prostatectomy

Results from the linear fit:

Term        Coefficient   Std. error   Z score
intercept        2.48        0.09        27.66
lcalvol          0.68        0.13         5.37
lweight          0.30        0.11         2.75
age             -0.14        0.10        -1.40
lbph             0.21        0.10         2.06
svi              0.31        0.12         2.47
lcp             -0.29        0.15        -1.87
gleason         -0.02        0.15        -0.15
pgg45            0.27        0.15         1.74
Gram-Schmidt Procedure
1) Initialize z_0 = x_0 = 1
2) For j = 1 to p:
   - For k = 0 to j−1, regress x_j on the z_k's to get the univariate
     least squares estimates

         γ̂_kj = ⟨z_k, x_j⟩ / ⟨z_k, z_k⟩

   - Then compute the next residual

         z_j = x_j − Σ_{k=0}^{j−1} γ̂_kj z_k

3) Let Z = [z_0 z_1 … z_p] and Γ be upper triangular with entries γ̂_kj.
   Then

       X = Z Γ = Z D^{-1} D Γ = Q R

   where D is diagonal with D_jj = ||z_j||.

Cost: O(n p^2)
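A short numpy sketch of regression by successive orthogonalization as described in steps 1)–3) (my own function and variable names; the first column of X is assumed to be the constant 1):

```python
import numpy as np

def successive_orthogonalization(X):
    """Gram-Schmidt on the columns of X (first column assumed to be all ones).
    Returns Z with orthogonal columns and upper-triangular Gamma such that X = Z @ Gamma."""
    n, p1 = X.shape
    Z = np.zeros((n, p1))
    Gamma = np.eye(p1)
    for j in range(p1):
        z = X[:, j].astype(float).copy()
        for k in range(j):
            Gamma[k, j] = Z[:, k] @ X[:, j] / (Z[:, k] @ Z[:, k])   # univariate LS coefficient
            z -= Gamma[k, j] * Z[:, k]
        Z[:, j] = z
    return Z, Gamma
```

The coefficient of the last input in the full multiple regression can then be read off as β̂_p = ⟨z_p, y⟩ / ⟨z_p, z_p⟩, which is the observation behind the QR view on the next slide.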
Technique for Multiple Regression



Computing β̂ = (X^T X)^{-1} X^T y directly has poor numerical
properties.

QR decomposition of X: decompose X = Q R where

- Q is an n × (p+1) orthogonal matrix (Q^T Q = I_{p+1})
- R is a (p+1) × (p+1) upper triangular matrix

Then

    β̂ = (R^T Q^T Q R)^{-1} R^T Q^T y = (R^T R)^{-1} R^T Q^T y = R^{-1} (R^T)^{-1} R^T Q^T y = R^{-1} Q^T y

1) Compute Q^T y
2) Solve R β̂ = Q^T y by back-substitution
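A minimal sketch of the two-step QR solve (numpy for the decomposition, scipy for the triangular back-substitution; the helper name is mine):

```python
import numpy as np
from scipy.linalg import solve_triangular

def ols_qr(X, y):
    """Least squares via the thin QR decomposition: compute Q^T y, then solve
    R beta = Q^T y by back-substitution (R is upper triangular)."""
    Q, R = np.linalg.qr(X)             # Q: n x (p+1) with orthonormal columns
    return solve_triangular(R, Q.T @ y)
```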
Multiple outputs

Suppose we want to predict multiple outputs Y_1, Y_2, …, Y_K from multiple
inputs X_1, X_2, …, X_p. Assume a linear model for each output:

    Y_k = β_0k + Σ_{j=1}^{p} X_j β_jk + ε_k = f_k(X) + ε_k

With n training cases the model can be written in matrix notation

    Y = X B + E

where

- Y is the n×K response matrix, with ik entry y_ik
- X is the n×(p+1) input matrix
- B is the (p+1)×K matrix of parameters
- E is the n×K matrix of errors
Multiple outputs cont.

A straightforward generalization of the univariate loss function is

    RSS(B) = Σ_{k=1}^{K} Σ_{i=1}^{n} ( y_ik − f_k(x_i) )^2 = tr[ (Y − XB)^T (Y − XB) ]

and the least squares estimates have the same form as before:

    B̂ = (X^T X)^{-1} X^T Y

The coefficients for the k'th outcome are just the least squares estimates of
the single-output regression of y_k on x_0, x_1, …, x_p.

If the errors ε = (ε_1, …, ε_K) are correlated, a modified model might be
more appropriate (details in the textbook).
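A one-function sketch showing that the multi-output estimate reduces to column-wise single-output least squares (function and variable names are mine):

```python
import numpy as np

def multi_output_ols(X, Y):
    """B_hat = (X^T X)^{-1} X^T Y; each column of B_hat is simply the single-output
    least squares fit of the corresponding column of Y on X."""
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B_hat
```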
Why Least Squares?

Gauss-Markov theorem:

- The least squares estimates have the smallest variance among all
  linear unbiased estimates.

However, there may exist a biased estimator with lower mean square
error:

    MSE(β̂) = E(β̂ − β)^2 = Var(β̂) + [E(β̂) − β]^2

The last (bias) term is zero for least squares.
Subset selection and Coefficient Shrinkage

Two reasons why we are not happy with least squares:

- Prediction accuracy: LS estimates often provide predictions with
  low bias, but high variance.
- Interpretation: when the number of regressors is too large, the
  model is difficult to interpret. One seeks a smaller set of
  regressors with the strongest effects.

We will consider a number of approaches to variable selection and
coefficient shrinkage.
Subset selection and shrinkage: Motivation

Bias – variance trade-off:

- Goal: choose a model to minimize the error

      MSE(β̂) = Var(β̂) + bias^2(β̂)

- Method: sacrifice a little bit of bias to reduce the variance.

Better interpretation: find the strongest factors in the input space.
Subset selection


Goal: to eliminate unnecessary variables from the model.

We will consider three approaches:

- Best subset regression: choose the subset of size k that gives the
  lowest RSS.
- Forward stepwise selection: continually add the feature with the
  largest F-ratio.
- Backward stepwise selection: remove the feature with the smallest
  F-ratio.

The stepwise methods are greedy techniques – not guaranteed to find
the best model.
Best subset regression

For each k ∈ {0, 1, 2, ..., p} find the subset of size k that gives
the smallest RSS.

- The leaps and bounds procedure works for p ≤ 40.
- How to choose k? Choose the model that minimizes prediction error
  (not a topic here).
- When p is large, searching through all subsets is not feasible.
  Can we seek a good path through the subsets instead?
Forward Stepwise selection

Method:

- Start with the intercept-only model.
- Sequentially include the variable that most improves RSS(β), based
  on the F statistic

      F = [ RSS(β̃) − RSS(β̂) ] / [ RSS(β̂) / (n − k − 2) ]

  where β̃ is the current model and β̂ the model with the candidate
  variable added.
- Stop when no new variable improves the fit significantly.
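A greedy forward-stepwise sketch following the recipe above (the stopping threshold, the function name and the n − k − 2 denominator follow my reading of the slide; this is an illustration, not an optimized implementation):

```python
import numpy as np

def forward_stepwise(X, y, f_threshold=4.0):
    """Greedy forward selection: start from the intercept-only model and repeatedly add
    the variable with the largest F statistic, stopping when it falls below f_threshold."""
    n, p = X.shape
    rss = lambda M: np.sum((y - M @ np.linalg.lstsq(M, y, rcond=None)[0]) ** 2)
    selected, remaining = [], list(range(p))
    Xc = np.ones((n, 1))                     # current design matrix (intercept only)
    rss_cur = rss(Xc)
    while remaining:
        rss_new, j_best = min((rss(np.column_stack([Xc, X[:, j]])), j) for j in remaining)
        k = len(selected) + 1                # non-intercept terms after adding the candidate
        F = (rss_cur - rss_new) / (rss_new / (n - k - 2))
        if F < f_threshold:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        Xc = np.column_stack([Xc, X[:, j_best]])
        rss_cur = rss_new
    return selected
```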
Backward Stepwise selection

Method:

- Start with the full model.
- Sequentially delete the predictor that produces the smallest value
  of the F statistic, i.e. increases RSS(β) the least.
- Stop when each predictor remaining in the model produces a
  significant value of the F statistic.

Hybrids between forward and backward stepwise selection exist.
Subset selection

Produces a model that is interpretable and possibly has lower
prediction error.

Forces some dimensions of X to zero, and thus probably decreases
Var(β̂):

    β̂ ~ N( β, (X^T X)^{-1} σ^2 )

The optimal subset must be chosen to minimize prediction error
(model selection: not a topic here).
Shrinkage methods

Use additional penalties/constraints to reduce the coefficients.

Shrinkage methods are more continuous than stepwise selection and
therefore don't suffer as much from variability.

Two examples:

- Ridge regression: minimize least squares subject to

      Σ_{j=1}^{p} β_j^2 ≤ s

- The lasso: minimize least squares subject to

      Σ_{j=1}^{p} |β_j| ≤ s
Ridge regression


Shrinks the regression coefficients by imposing a penalty on their
size. The complexity parameter λ controls the amount of shrinkage:

    β̂_ridge = argmin_β { Σ_i ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2 + λ Σ_{j=1}^{p} β_j^2 }

or, equivalently,

    β̂_ridge = argmin_β Σ_i ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2
    subject to Σ_{j=1}^{p} β_j^2 ≤ s

There is a one-to-one correspondence between s and λ.
Properties of ridge regression

Solution in matrix notation:

    β̂_ridge = (X^T X + λ I)^{-1} X^T y

Adding λ > 0 to the diagonal of X^T X before inversion makes the
problem non-singular even if X is not of full rank.

The size constraint prevents the coefficient estimates of highly
correlated variables from showing high variance.

The quadratic penalty makes the ridge solution a linear function of y.
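A minimal sketch of the closed-form ridge solution above; it assumes, as is customary, that X and y have been centred (and X standardized) so that no intercept is penalized (function and variable names are mine):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate beta_hat = (X^T X + lam I)^{-1} X^T y.
    X is assumed centred/standardized with no intercept column; y is assumed centred."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)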
Properties of ridge regression cont.

Can also be motivated through Bayesian statistics by choosing an
appropriate prior for β.

Performs no automatic variable selection.

The ridge existence theorem states that there exists a λ > 0 such
that

    MSE(β̂_ridge) < MSE(β̂_OLS)

The optimal complexity parameter must be estimated.
Example

[Figure: ridge coefficient profiles. The complexity parameter of the
model is the effective degrees of freedom; the parameters are
continuously shrunken towards zero.]
Singular value decomposition (SVD)

The SVD of the matrix X ∈ R^{n×p} has the form

    X = U D V^T

where U ∈ R^{n×p} and V ∈ R^{p×p} are orthogonal matrices and
D = diag(d_1, ..., d_r), with

    d_1 ≥ d_2 ≥ ... ≥ d_r > 0

the non-zero singular values of X.

- r ≤ min(n, p) is the rank of X.
- The eigenvectors v_i (the columns of V) are called the principal
  components of X.
Linear regression by SVD

A general solution to y = Xβ can be written as

    β̂ = Σ_{i=1}^{p} ω_i (u_i^T y / d_i) v_i

The filter factors ω_i determine the extent of shrinking,
0 ≤ ω_i ≤ 1, or stretching, ω_i > 1, along the singular directions u_i.

For the OLS solution ω_i = 1, i = 1, ..., p, i.e. all the directions
u_i contribute equally.
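A small sketch of this filtered solution; passing omega = 1 everywhere reproduces OLS, while omega = d**2 / (d**2 + lam) gives ridge (function name and use of the thin SVD are my choices):

```python
import numpy as np

def filtered_solution(X, y, omega):
    """beta_hat = sum_i omega_i (u_i^T y / d_i) v_i from the thin SVD of X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ (omega * (U.T @ y) / d)
```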
Ridge regression by SVD

In ridge regression the filter factors are given by

    ω_i = d_i^2 / (d_i^2 + λ),    i = 1, ..., p

This shrinks the OLS estimator in every direction, depending on λ
and the corresponding d_i.

The directions with low variance (small singular values) are the
directions shrunken the most by ridge.

- Assumption: y varies most in the directions of high variance.
The Lasso


A shrinkage method like ridge, but with important differences.

The lasso estimate:

    β̂_lasso = argmin_β Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2
    subject to Σ_{j=1}^{p} |β_j| ≤ s

The L1 penalty makes the solution nonlinear in y:

- Quadratic programming is needed to compute the solutions.
- Sufficient shrinkage will cause some coefficients to be exactly
  zero, so it acts like a subset selection method.
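The slides point to quadratic programming; as an alternative illustration, here is a plain coordinate-descent sketch of the penalized (Lagrangian) form of the lasso with soft-thresholding (my own simplified implementation, assuming standardized columns of X and centred y):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent minimizing 0.5*||y - X beta||^2 + lam*||beta||_1.
    A teaching sketch, not a production solver."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]       # partial residual excluding x_j
            rho = X[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]   # soft-threshold
    return beta
```

For λ large enough several entries of beta come out exactly zero, which is the subset-selection behaviour mentioned above.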
Example

[Figure: lasso coefficient profiles, plotted against
s = t / Σ_{j=1}^{p} |β̂_j|. Note that the lasso profiles hit zero,
while those for ridge do not.]
A unifying view

We can view these linear regression techniques under a common
framework:

    β̂ = argmin_β { Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} x_ij β_j )^2 + λ Σ_{j=1}^{p} |β_j|^q }

λ includes the bias, and q indicates a prior distribution on β:

- λ = 0: least squares
- λ > 0, q = 0: subset selection (counts the number of nonzero parameters)
- λ > 0, q = 1: the lasso
- λ > 0, q = 2: ridge regression
Methods using derived input directions


Goal: use linear combinations of the inputs as inputs in the
regression.

Includes:

- Principal components regression: regress on M < p principal
  components of X.
- Partial least squares: regress on M < p directions of X weighted
  by y.

The methods differ in the way the linear combinations of the input
variables are constructed.
PCR

Use the linear combinations z_m = X v_m as new features.

- v_j is the principal component (column of V) corresponding to the
  j'th largest element of D, i.e. the directions of maximal sample
  variance.
- For some M ≤ p form the derived input vectors
  [z_1 … z_M] = [X v_1 … X v_M].

Regressing y on z_1, …, z_M gives the solution

    ŷ_PCR(M) = Σ_{m=1}^{M} θ̂_m z_m    where    θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩
PCR continued

The m'th principal component direction v_m solves:

    max_{||α|| = 1, v_l^T S α = 0, l = 1, ..., m−1}  Var(Xα)

The filter factors become

    ω_i = 1 for i ≤ M,    ω_i = 0 for i > M

i.e. PCR discards the p − M smallest-eigenvalue components from the
OLS solution. If M = p it gives the OLS solution.
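A compact PCR sketch based on the SVD (centred X and y assumed; function name is mine): regress y on the first M derived inputs z_m = X v_m and map the coefficients back to the original inputs:

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression on the first M components."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:M].T                              # derived inputs z_m = X v_m,  m = 1..M
    theta = (Z.T @ y) / np.sum(Z ** 2, axis=0)    # theta_m = <z_m, y> / <z_m, z_m>
    return Vt[:M].T @ theta                       # beta_PCR = sum_m theta_m v_m
```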
Comparison of PCR and ridge

[Figure: shrinkage and truncation patterns of the filter factors as a
function of the principal component index.]
PLS





Idea: find the directions that have high variance and high
correlation with y. In the construction of each z_m the inputs are
weighted by the strength of their univariate effect on y.

Pseudo-algorithm:

Set x_j^(0) = (x_j − x̄_j) / √var(x_j) and ŷ^(0) = ȳ 1.
For m = 1, ..., p:

1. Find the m'th PLS component:  z_m = X^(m−1) φ̂_m, where φ̂_m = X^(m−1)T y
2. Regress y on z_m:  θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩
3. Update the fit:  ŷ^(m) = ŷ^(m−1) + θ̂_m z_m
4. Orthogonalize each x_j^(m−1) with respect to z_m:
   X^(m) = X^(m−1) − z_m ⟨X^(m−1), z_m⟩ / ⟨z_m, z_m⟩

PLS solution:  β̂_j^PLS(m) = Σ_{l=1}^{m} θ̂_l φ̂_lj
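A direct transcription of the pseudo-algorithm into numpy (my own variable names; X is assumed already standardized and only the fitted values are returned, for brevity):

```python
import numpy as np

def pls_fit(X, y, M):
    """PLS following the pseudo-algorithm above; returns fitted values after M steps."""
    n, p = X.shape
    Xm = X.astype(float).copy()
    y_hat = np.full(n, y.mean())                      # y_hat^(0) = ybar * 1
    for m in range(M):
        phi = Xm.T @ y                                # weights from univariate effects on y
        z = Xm @ phi                                  # m'th PLS component
        theta = (z @ y) / (z @ z)
        y_hat = y_hat + theta * z                     # y_hat^(m) = y_hat^(m-1) + theta_m z_m
        Xm = Xm - np.outer(z, (Xm.T @ z) / (z @ z))   # orthogonalize columns w.r.t. z_m
    return y_hat
```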
PLS cont.

Nonlinear function of y, because y is used to find the linear
components.

As with PCR, M = p gives the OLS estimate, while M < p directions
produce a reduced regression.

The m'th PLS direction is found by using the φ̂_m that maximizes the
covariation between the input and output variables:

    max_{||α|| = 1, φ̂_l^T S α = 0, l = 1, ..., m−1}  Corr^2(y, Xα) Var(Xα)

where S is the sample covariance matrix of the x_j.
PLS cont.

The filter factors for PLS become

    ω_i = 1 − Π_{j=1}^{q} ( 1 − d_i^2 / θ_j )

where θ_1 ≥ … ≥ θ_q are the Ritz values (not defined here).

- Note that some ω_i > 1, but it can be shown that PLS shrinks the
  OLS solution:  ||β̂_PLS||_2 ≤ ||β̂_OLS||_2
- It can also be shown that the sequence of PLS components for
  m = 1, 2, …, p represents the conjugate gradient sequence for
  computing the OLS solution.
Comparison of the methods
Consider the general solution:

    β̂ = Σ_{i=1}^{p} ω_i (u_i^T y / d_i) v_i

- Ridge shrinks all directions, but shrinks the low-variance
  directions most:

      ω_i = d_i^2 / (d_i^2 + λ),    i = 1, ..., p

- PCR leaves the M high-variance directions alone and discards the
  rest:

      ω_i = 1 for i = 1, ..., M;    ω_i = 0 for i = M+1, ..., p

- PLS tends to shrink the low-variance directions, but can inflate
  some of the higher-variance directions:

      ω_i = 1 − Π_{j=1}^{q} ( 1 − d_i^2 / θ_j )
Comparison of the methods
Consider an example with two correlated inputs x_1 and x_2, ρ = 0.5,
and assume true regression coefficients β_1 = 4 and β_2 = 2.

[Figure: coefficient profiles (β_1, β_2) for PCR, PLS, ridge, least
squares, best subset and the lasso as their tuning parameters are
varied.]
Example: Prostate cancer

Term         LS       Best subset   Ridge     Lasso     PCR       PLS
intercept    2.480    2.495         2.467     2.477     2.513     2.452
lcalvol      0.680    0.740         0.389     0.545     0.544     0.440
lweight      0.305    0.367         0.238     0.237     0.337     0.351
age         -0.141                 -0.029              -0.152    -0.017
lbph         0.210                  0.159     0.098     0.213     0.248
svi          0.305                  0.217     0.165     0.315     0.252
lcp         -0.288                  0.026              -0.053     0.078
gleason     -0.021                  0.042               0.230     0.003
pgg45        0.267                  0.123     0.059    -0.053     0.080
Test err.    0.586    0.574         0.540     0.491     0.527     0.636
Std. err.    0.184    0.156         0.168     0.152     0.122     0.172