Linear Methods for Regression

Dept. Computer Science & Engineering,
Shanghai Jiao Tong University
Outline
• The simple linear regression model
• Multiple linear regression
• Model selection and shrinkage: the state of the art
Preliminaries
• Data $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
– $x_i$ is the predictor (regressor, covariate, feature, independent variable)
– $y_i$ is the response (dependent variable, outcome)
• We denote the regression function by
$\mu(x) = E(Y \mid x)$
• This is the conditional expectation of Y given x.
• The linear regression model assumes a specific linear form for
$\mu(x) = \beta_0 + \beta x$
which is usually thought of as an approximation to the truth.
Fitting by least squares
• Minimize:
$\sum_{i=1}^{N} (y_i - \beta_0 - \beta x_i)^2$
• Solutions are
$\hat\beta_0, \hat\beta = \arg\min_{\beta_0, \beta} \sum_{i=1}^{N} (y_i - \beta_0 - \beta x_i)^2$
$\hat\beta = \frac{\sum_{i=1}^{N} (x_i - \bar{x}) y_i}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta \bar{x}$
• $\hat{y}_i = \hat\beta_0 + \hat\beta x_i$ are called the fitted or predicted values
• $r_i = y_i - \hat\beta_0 - \hat\beta x_i$ are called the residuals
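As a concrete illustration, here is a minimal NumPy sketch of these closed-form estimates on simulated data (the data and variable names are illustrative, not from the slides):

```python
import numpy as np

# Toy data (illustrative only): y roughly linear in x with noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=50)

# Closed-form least squares estimates from the slide:
#   beta_hat  = sum((x_i - xbar) * y_i) / sum((x_i - xbar)^2)
#   beta0_hat = ybar - beta_hat * xbar
xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * y) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta_hat * xbar

y_fitted = beta0_hat + beta_hat * x   # fitted (predicted) values
residuals = y - y_fitted              # residuals
print(beta0_hat, beta_hat)
```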
Gaussian Distribution
• The normal distribution with arbitrary center μ and variance σ².
Standard errors & confidence intervals
• We often assume further that
$y_i = \beta_0 + \beta x_i + \varepsilon_i$
where $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Then
$\mathrm{se}(\hat\beta) = \left[ \frac{\sigma^2}{\sum (x_i - \bar{x})^2} \right]^{1/2}$
Estimate $\sigma^2$ by $\hat\sigma^2 = \sum (y_i - \hat{y}_i)^2 / (N - 2)$.
• Under additional assumption of normality for the $\varepsilon_i$'s, a 95% confidence interval for $\beta$ is:
$\hat\beta \pm 1.96\, \hat{\mathrm{se}}(\hat\beta), \qquad \hat{\mathrm{se}}(\hat\beta) = \left[ \frac{\hat\sigma^2}{\sum (x_i - \bar{x})^2} \right]^{1/2}$
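A standalone NumPy sketch of these quantities on the same kind of simulated data as before (names and data are illustrative):

```python
import numpy as np

# Simulate simple regression data, refit, then compute se and the 95% CI.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=50)

beta_hat = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta_hat * x.mean()
residuals = y - (beta0_hat + beta_hat * x)

N = len(y)
sigma2_hat = np.sum(residuals ** 2) / (N - 2)                # estimate of sigma^2
se_beta = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))  # se(beta_hat)
ci = (beta_hat - 1.96 * se_beta, beta_hat + 1.96 * se_beta)  # 95% confidence interval
print(se_beta, ci)
```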
Fitted Line and Standard Errors
$\hat\mu(x) = \hat\beta_0 + \hat\beta x = \bar{y} + \hat\beta (x - \bar{x})$

$\mathrm{se}[\hat\mu(x)] = \left[ \mathrm{var}(\bar{y}) + \mathrm{var}(\hat\beta)(x - \bar{x})^2 \right]^{1/2} = \left[ \frac{\sigma^2}{n} + \frac{\sigma^2 (x - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]^{1/2}$
• Fitted regression line with pointwise standard errors: $\hat\mu(x) \pm 2\, \mathrm{se}[\hat\mu(x)]$
Multiple linear regression
P 1
f ( xi )   0  
x j
• Model is
j 1 ij
• Equivalently in matrix notation:
f  X
• f is N-vector of predicted values
• X is N × p matrix of regresses, with ones in
the first column
•  is a p-vector of parameters
Estimation by least squares
ˆ  arg min
 (y
i
 0  
p 1
j 1
x ij  j )
2
i
 arg min( y  X  ) ( y  X  )
T
T
1
T
ˆ
Solution is
  (X X ) X y
yˆ  X ˆ
T
1
2
ˆ
Also
Var (  )  ( X X ) 
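A minimal sketch of these matrix formulas in NumPy (simulated data; in practice np.linalg.lstsq or a QR decomposition is preferred to forming $X^T X$ explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 4                                    # illustrative sizes
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])  # ones in first column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# Var(beta_hat) = (X^T X)^{-1} sigma^2, with sigma^2 estimated from the residuals
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p)
var_beta = np.linalg.inv(X.T @ X) * sigma2_hat
print(beta_hat)
```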
The Bias-variance tradeoff
• A good measure of the quality of an estimator $\hat{f}(x)$ is the mean squared error. Let $f_0(x)$ be the true value of $f(x)$ at the point x. Then
$\mathrm{Mse}[\hat{f}(x)] = E[\hat{f}(x) - f_0(x)]^2$
• This can be written as
$\mathrm{Mse}[\hat{f}(x)] = \mathrm{Var}[\hat{f}(x)] + [E\hat{f}(x) - f_0(x)]^2$
that is, variance + bias².
• Typically, when bias is low, variance is high and
vice-versa. Choosing estimators often involves a
tradeoff between bias and variance.
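A small Monte Carlo check of the decomposition MSE = variance + bias², using an illustrative shrunken-mean estimator (all numbers here are assumptions chosen for the demonstration, not from the slides):

```python
import numpy as np

# Estimate f0 = 5 by the shrunken sample mean c * ybar:
# shrinking (c < 1) adds bias but reduces variance.
rng = np.random.default_rng(0)
f0, n, c, reps = 5.0, 20, 0.8, 100_000

estimates = np.array([c * rng.normal(loc=f0, scale=2.0, size=n).mean()
                      for _ in range(reps)])

mse = np.mean((estimates - f0) ** 2)
variance = estimates.var()
bias_sq = (estimates.mean() - f0) ** 2
print(mse, variance + bias_sq)   # the two agree (up to floating-point error)
```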
The Bias-variance tradeoff
• If the linear model is correct for a given problem, then the least squares prediction $\hat{f}$ is unbiased, and has the lowest variance among all unbiased estimators that are linear functions of y.
• But there can be (and often exist) biased
estimators with smaller MSE.
• Generally, by regularizing (shrinking, dampening,
controlling) the estimator in some way, its variance
will be reduced; if the corresponding increase in
bias is small, this will be worthwhile.
The Bias-variance tradeoff
• Examples of regularization: subset selection
(forward, backward, all subsets); ridge regression,
the lasso.
• In reality models are almost never correct, so there
is an additional model bias between the closest
member of the linear model class and the truth.
Model Selection
• Often we prefer a restricted estimate because of its reduced estimation variance.
Analysis of time series data
• Two approaches: frequency domain (Fourier); see discussion of wavelet smoothing.
• Time domain. Main tool is the auto-regressive (AR) model of order k:
$y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$
• Fit by linear least squares regression on lagged data:
$y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k}$
$y_{t-1} = \beta_1 y_{t-2} + \beta_2 y_{t-3} + \cdots + \beta_k y_{t-k-1}$
$\vdots$
$y_{k+1} = \beta_1 y_k + \beta_2 y_{k-1} + \cdots + \beta_k y_1$
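One way to set up these lagged regressions in NumPy (a sketch; the function name and the simulated series are illustrative):

```python
import numpy as np

def fit_ar(y, k):
    """Fit an AR(k) model by least squares on lagged data
    (no intercept, as on the slide; y is a 1-D, roughly centered series)."""
    # Row t regresses y_t on (y_{t-1}, ..., y_{t-k}).
    X = np.column_stack([y[k - j - 1: len(y) - j - 1] for j in range(k)])
    target = y[k:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta

# Illustrative use on a simulated AR(2) series
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()
print(fit_ar(y, k=2))   # roughly [0.6, -0.3]
```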
Variable subset selection
• We retain only a subset of the coefficients
and set to zero the coefficients of the rest.
• There are different strategies:
– All subsets regression finds, for each $s = 0, 1, 2, \ldots, p$, the subset of size s that gives the smallest residual sum of squares. The question of how to choose s involves the tradeoff between bias and variance: can use cross-validation (see below)
Variable subset selection
– Rather than search through all possible subsets,
we can seek a good path through them. Forward
stepwise selection starts with the intercept and
then sequentially adds into the model the
variable that most improves the fit. The
improvement in fit is usually based on the
F ratio:
$F = \frac{\mathrm{RSS}(\hat\beta^{\mathrm{old}}) - \mathrm{RSS}(\hat\beta^{\mathrm{new}})}{\mathrm{RSS}(\hat\beta^{\mathrm{new}}) / (N - s)}$
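A sketch of forward stepwise selection driven by this F ratio (the helper names and the stopping threshold f_threshold are assumptions, not part of the slides):

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for OLS of y on the given columns of X."""
    Xs = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def forward_stepwise(X, y, f_threshold=4.0):
    """Greedy forward selection; X is assumed to carry the intercept in column 0."""
    N = X.shape[0]
    active = [0]                                  # start with the intercept
    candidates = list(range(1, X.shape[1]))
    while candidates:
        rss_old = rss(X, y, active)
        # Pick the candidate giving the largest drop in RSS
        best = min(candidates, key=lambda j: rss(X, y, active + [j]))
        rss_new = rss(X, y, active + [best])
        s = len(active) + 1                       # parameters in the new model
        F = (rss_old - rss_new) / (rss_new / (N - s))
        if F < f_threshold:                       # stop when the F ratio is small
            break
        active.append(best)
        candidates.remove(best)
    return active
```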
Variable subset selection
• Backward stepwise selection starts with the full
OLS model, and sequentially deletes variables.
• There are also hybrid stepwise selection strategies
which add in the best variable and delete the least
important variable, in a sequential manner.
• Each procedure has one or more tuning
parameters:
– subset size
– P-values for adding or dropping terms
Model Assessment
• Objectives:
1. Choose a value of a tuning parameter for a technique
2. Estimate the prediction performance of a given model
• For both of these purposes, the best approach is to run the
procedure on an independent test set, if one is available
• If possible one should use different test data for (1) and (2)
above: a validation set for (1) and a test set for (2)
• Often there is insufficient data to create a separate
validation or test set. In this instance Cross-Validation is
useful.
K-Fold Cross-Validation
• Primary method for estimating a tuning parameter
(such as subset size)
• Divide the data into K roughly equal parts (typically
K=5 or 10)
K-Fold Cross-Validation
• for each $k = 1, 2, \ldots, K$, fit the model with parameter $\lambda$ to the other K − 1 parts, giving $\hat\beta^{-k}(\lambda)$, and compute its error in predicting the kth part:
$E_k(\lambda) = \sum_{i \in k\text{th part}} (y_i - x_i \hat\beta^{-k}(\lambda))^2$
This gives the cross-validation error
$CV(\lambda) = \frac{1}{K} \sum_{k=1}^{K} E_k(\lambda)$
• do this for many values of $\lambda$ and choose the value of $\lambda$ that makes $CV(\lambda)$ smallest.
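A generic sketch of this procedure (the fit/predict callables and their signatures are assumed placeholders for whatever model is being tuned):

```python
import numpy as np

def cv_error(X, y, fit, predict, lam, K=10, seed=0):
    """K-fold cross-validation error for one value of the tuning parameter lam.
    `fit(X, y, lam)` and `predict(model, X)` are user-supplied (illustrative API)."""
    N = len(y)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, K)                         # K roughly equal parts
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train], lam)               # fit on the other K-1 parts
        resid = y[test] - predict(model, X[test])          # predict the k-th part
        errors.append(np.sum(resid ** 2))                  # E_k(lam)
    return np.mean(errors)                                 # CV(lam)

# Choose the lam with the smallest CV error, e.g.:
# best_lam = min(candidate_lams, key=lambda lam: cv_error(X, y, fit, predict, lam))
```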
K-Fold Cross-Validation
• In our variable subsets example, $\lambda$ is the subset size
• $\hat\beta^{-k}(\lambda)$ are the coefficients for the best subset of size $\lambda$, found from the training set that leaves out the k-th part of the data
• $E_k(\lambda)$ is the estimated test error for this best subset.
K-Fold Cross-Validation
• From the K cross-validation training sets, the K test error estimates are averaged to give:
$CV(\lambda) = \frac{1}{K} \sum_{k=1}^{K} E_k(\lambda)$
• Note that different subsets of size $\lambda$ will (probably) be found from each of the K cross-validation training sets. Doesn't matter: the focus is on subset size, not the actual subset.
The Bootstrap approach
• Bootstrap works by sampling N times with
replacement from training set to form a “bootstrap”
data set. Then model is estimated on bootstrap
data set, and predictions are made for original
training set.
• This process is repeated many times and the
results are averaged.
• Bootstrap most useful for estimating standard
errors of predictions.
• Sometimes produces better estimates than cross-validation (topic for current research)
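A sketch of bootstrap standard errors for OLS predictions (function name and defaults are illustrative):

```python
import numpy as np

def bootstrap_prediction_se(X, y, n_boot=500, seed=0):
    """Bootstrap standard errors of OLS predictions on the original training set."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    preds = np.empty((n_boot, N))
    for b in range(n_boot):
        idx = rng.integers(0, N, size=N)                   # sample N times with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        preds[b] = X @ beta                                # predict the original training set
    return preds.std(axis=0)                               # pointwise standard errors
```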
Shrinkage methods
• Ridge regression
• The ridge estimator is defined by
$\hat\beta^{\mathrm{ridge}} = \arg\min_{\beta} (y - X\beta)^T (y - X\beta) + \lambda \sum_j \beta_j^2$
• Equivalently,
$\hat\beta^{\mathrm{ridge}} = \arg\min_{\beta} (y - X\beta)^T (y - X\beta) \quad \text{subject to } \sum_j \beta_j^2 \le s$
Shrinkage methods
• The parameter $\lambda > 0$ penalizes $\beta_j$ proportional to its size $\beta_j^2$. Solution is
$\hat\beta = (X^T X + \lambda I)^{-1} X^T y$
where $I$ is the identity matrix. This is a biased estimator that for some value of $\lambda > 0$ may have smaller mean squared error than the least squares estimator.
• Note $\lambda = 0$ gives the least squares estimator; if $\lambda \to \infty$, then $\hat\beta \to 0$.
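A direct transcription of this closed form (a sketch; assumes centered/standardized inputs so that no intercept is penalized):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y.
    Assumes the columns of X are centered/standardized and y is centered."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers least squares; very large lam drives the coefficients toward 0.
```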
The Lasso
• The lasso is a shrinkage method like ridge,
but acts in a nonlinear manner on the
outcome y.
• The lasso is defined by
$\hat\beta^{\mathrm{lasso}} = \arg\min_{\beta} (y - X\beta)^T (y - X\beta) \quad \text{subject to } \sum_j |\beta_j| \le t$
The Lasso
• Notice that the ridge penalty $\sum_j \beta_j^2$ is replaced by $\sum_j |\beta_j|$
• this makes the solutions nonlinear in y, and a quadratic programming algorithm is used to compute them.
• because of the nature of the constraint, if t is chosen small enough then the lasso will set some coefficients exactly to zero. Thus the lasso does a kind of continuous model selection.
The Lasso
• The parameter t should be adaptively chosen to minimize an estimate of expected prediction error, using say cross-validation
• Ridge vs Lasso: if inputs are orthogonal,
ridge multiplies least squares coefficients by
a constant < 1, lasso translates them
towards zero by a constant, truncating at
zero.
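A small numerical illustration of this contrast in the orthonormal-input case, where both estimators have simple closed forms (the coefficients and the constants lam, gamma are made-up values):

```python
import numpy as np

# For orthonormal inputs (X^T X = I), ridge and lasso act directly on the
# least squares coefficients: ridge rescales, lasso soft-thresholds.
beta_ls = np.array([3.0, 1.5, 0.4, -0.2])   # hypothetical least squares coefficients

lam = 1.0      # ridge penalty (assumed value)
gamma = 0.5    # lasso translation amount (assumed value)

beta_ridge = beta_ls / (1.0 + lam)                                        # scaled by a constant < 1
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - gamma, 0.0)  # soft-thresholded

print(beta_ridge)   # every coefficient shrunk proportionally
print(beta_lasso)   # small coefficients set exactly to zero
```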
A family of shrinkage estimators
• Consider the criterion
$\tilde\beta = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 \quad \text{subject to } \sum_j |\beta_j|^q \le s$
• for $q \ge 0$. The contours of constant value of $\sum_j |\beta_j|^q$ are shown for the case of two inputs.
Use of derived input directions
• Principal components regression
• We choose a set of linear combinations of the xj s,
and then regress the outcome on these linear
combinations.
• The particular combinations used are the sequence
of principal components of the inputs. These are
uncorrelated and ordered by decreasing variance.
• If $S$ is the sample covariance matrix of $x_1, x_2, \ldots, x_p$, then the eigenvector equations
$S q_j = d_j^2 q_j$
define the principal components of $S$.
• Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.
PCA regression
• Write $q_{(j)}$ for the ordered principal components, ordered from largest to smallest value of $d_j^2$.
• Then principal components regression computes the derived input columns
$z_j = X q_{(j)}$
and then regresses y on $z_1, z_2, \ldots, z_J$ for some $J \le p$.
PCA regression
• Since the $z_j$'s are orthogonal, this regression is just a sum of univariate regressions:
$\hat{y}^{\mathrm{pcr}} = \bar{y} + \sum_{j=1}^{J} \hat\theta_j z_j$
where $\hat\theta_j$ is the univariate regression coefficient of y on $z_j$.
• Principal components regression is very similar to ridge regression: both operate on the principal components of the input matrix.
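A compact sketch of principal components regression via the eigendecomposition of the sample covariance matrix (function name is illustrative; assumes centered X and y):

```python
import numpy as np

def pcr(X, y, J):
    """Principal components regression on the first J components.
    Assumes the columns of X are centered and y is centered."""
    S = X.T @ X / X.shape[0]                 # sample covariance matrix
    d2, Q = np.linalg.eigh(S)                # eigenvalues in ascending order
    order = np.argsort(d2)[::-1]             # largest variance first
    Q = Q[:, order[:J]]
    Z = X @ Q                                # derived inputs z_1, ..., z_J
    # The z_j's are orthogonal, so each theta_j is a univariate regression coefficient
    theta = (Z.T @ y) / np.sum(Z ** 2, axis=0)
    y_hat = Z @ theta                        # add back ybar if y was not centered
    beta = Q @ theta                         # implied coefficients on the original inputs
    return y_hat, beta
```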
PCA regression
• Ridge regression shrinks the coefficients of the principal components, with relatively more shrinkage applied to the smaller components than the larger; principal components regression discards the p − J smallest-eigenvalue components.
Partial least squares
• This technique also constructs a set of linear
combinations of the xj s for regression, but unlike
principal components regression, it uses y (in
addition to X) for this construction.
– We assume that x is centered and begin by computing the univariate regression coefficients
$\hat\gamma_j = \langle x_j, y \rangle$
Partial least squares
• From this we construct the derived input
$z_1 = \sum_j \hat\gamma_j x_j$
which is the first partial least squares direction.
• The outcome y is regressed on $z_1$, giving coefficient $\hat\theta_1$:
$r_1 = y - \hat\theta_1 z_1$
• then we orthogonalize $y, x_1, \ldots, x_p$ with respect to $z_1$:
$x_l^* = x_l - \hat\delta_l z_1$
(with $\hat\delta_l$ the regression coefficient of $x_l$ on $z_1$).
• We continue this process, until J directions have been obtained.
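A sketch of this iteration in NumPy (function name is illustrative; assumes centered X and y):

```python
import numpy as np

def pls(X, y, J):
    """Partial least squares with J directions, following the procedure above.
    Assumes the columns of X and y are centered."""
    Xk, yk = X.copy(), y.astype(float)
    y_hat = np.zeros_like(yk)
    for _ in range(J):
        gamma = Xk.T @ yk                          # univariate coefficients <x_j, y>
        z = Xk @ gamma                             # derived input (PLS direction)
        theta = (z @ yk) / (z @ z)                 # regress the outcome on z
        y_hat += theta * z
        yk = yk - theta * z                        # orthogonalize y with respect to z
        Xk = Xk - np.outer(z, (z @ Xk) / (z @ z))  # orthogonalize each x_l with respect to z
    return y_hat
```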
Ridge vs PCR vs PLS vs Lasso
• Recent studies have shown that ridge and PCR outperform PLS in prediction, and they are simpler to understand.
• Lasso outperforms ridge when there are a
moderate number of sizable effects, rather than
many small effects. It also produces more
interpretable models.
• These are still topics for ongoing research.