Chapter 3:
Linear Methods for Regression
The Elements of Statistical Learning
Aaron Smalter
Chapter Outline



3.1 Introduction
3.2 Linear Regression Models and Least Squares
3.3 Multiple Regression from Simple Univariate Regression
3.4 Subset Selection and Coefficient Shrinkage
3.5 Computational Considerations
3.1 Introduction




A linear regression model assumes the regression function E(Y|X) is linear in the inputs X1, ..., Xp.
A simple, precomputer-era model.
Can outperform nonlinear models when there are few training cases, a low signal-to-noise ratio, or sparse data.
Can also be applied to transformations of the inputs.
3.2 Linear Regression Models and
Least Squares

Vector of inputs: X = (X1, X2, ..., Xp)
Predict a real-valued output Y with the linear model
f(X) = β0 + Σ_{j=1}^p Xj βj
3.2 Linear Regression Models and
Least Squares (cont'd)

Inputs derived from various sources

Quantitative inputs

Transformations of inputs (log, square, sqrt)

Basis expansions (X2 = X1^2, X3 = X1^3)


Numeric coding (map one multi-level input into
many Xi's)
Interactions between variables (X3 = X1 * X2)
3.2 Linear Regression Models and
Least Squares (cont'd)


Set of training data used to estimate
parameters,
(x1,y1) ... (xN,yN)
Each xi is a vector of feature measurements,
xi = (xi1, xi2, ... xip)^T
3.2 Linear Regression Models and
Least Squares (cont'd)

Least Squares


Pick a set of coefficients,
β = (β0, β1, ..., βp)^T
in order to minimize the residual sum of squares (RSS):
RSS(β) = Σ_{i=1}^N ( y_i − β0 − Σ_{j=1}^p x_ij βj )^2
3.2 Linear Regression Models and
Least Squares (cont'd)



How do we pick β so as to minimize the RSS?
Let X be the N x (p+1) matrix
Rows are input vectors (with a 1 in the first position)
Columns are feature vectors
Let y be the N x 1 vector of outputs
3.2 Linear Regression Models and
Least Squares (cont'd)

Rewrite the RSS in matrix form:
RSS(β) = (y − Xβ)^T (y − Xβ)
Differentiate with respect to β:
∂RSS/∂β = −2 X^T (y − Xβ)
Set the first derivative to zero (assume X has full column rank, so X^T X is positive definite):
X^T (y − Xβ) = 0
Obtain the unique solution:
β̂ = (X^T X)^{-1} X^T y
Fitted values are:
ŷ = X β̂ = X (X^T X)^{-1} X^T y
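A minimal NumPy sketch of these formulas on synthetic data (variable names here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1), 1s in the first column
beta_true = np.array([2.0, 0.5, -1.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# beta_hat = (X^T X)^{-1} X^T y  -- solve the normal equations rather than forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                      # fitted values
rss = np.sum((y - y_hat) ** 2)            # residual sum of squares
print(beta_hat, rss)
```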
3.2.2 The Gauss-Markov Theorem

Asserts that the least squares estimates have the smallest variance among all linear unbiased estimates.
Consider estimating any linear combination of the parameters, θ = a^T β.
3.2.2 The Gauss-Markov Theorem (cont'd)

The least squares estimate is
θ̂ = a^T β̂ = a^T (X^T X)^{-1} X^T y
If X is fixed, this is a linear function c0^T y of the response vector y.
If the linear model is correct, a^T β̂ is unbiased.
Gauss-Markov Theorem: if c^T y is any other linear estimator that is unbiased for a^T β, then
Var(a^T β̂) <= Var(c^T y)
3.2.2 The Gauss-Markov Theorem (cont'd)

However, an unbiased estimator is not always best.
A biased estimator may exist with smaller mean squared error, trading a little bias for a larger reduction in variance.
In reality, most models are approximations of the truth and therefore biased; we need to pick a model that balances bias and variance.
3.3 Multiple Regression from
Simple Univariate Regression

A linear model with p > 1 inputs is a multiple linear regression model.
Suppose we first have the univariate model (p = 1, no intercept):
Y = Xβ + ε
The least squares estimate and residuals are
β̂ = Σ_i x_i y_i / Σ_i x_i^2,    r_i = y_i − x_i β̂
3.3 Multiple Regression from
Simple Univariate Reg. (cont'd)


With inner-product notation we can rewrite this as
β̂ = <x, y> / <x, x>,    r = y − x β̂
We will use univariate regression as a building block for multiple least squares regression.
3.3 Multiple Regression from
Simple Univariate Reg. (cont'd)



Suppose the inputs x1, x2, ..., xp are orthogonal
(<xj, xk> = 0 for all j != k).
Then the multiple least squares estimates β̂j are equal to
the univariate estimates <xj, y> / <xj, xj>.
When inputs are orthogonal, they have no effect
on each other's parameter estimates.
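A quick numerical check of this fact, using a made-up orthonormal design built from a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
# Build an orthogonal design: the Q factor of a QR decomposition has orthonormal columns
Q, _ = np.linalg.qr(rng.normal(size=(50, 3)))
y = rng.normal(size=50)

# Multiple least squares on all columns at once
beta_multi = np.linalg.lstsq(Q, y, rcond=None)[0]

# Univariate estimates <xj, y> / <xj, xj>, one column at a time
beta_uni = np.array([Q[:, j] @ y / (Q[:, j] @ Q[:, j]) for j in range(3)])

print(np.allclose(beta_multi, beta_uni))  # True: orthogonal inputs decouple the estimates
```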
3.3 Multiple Regression from
Simple Univariate Reg. (cont'd)



Observed data are not usually orthogonal, so we must make them so.
Given an intercept and a single input x, the least squares coefficient of x has the form
β̂1 = <x − x̄·1, y> / <x − x̄·1, x − x̄·1>
This is the result of two simple regression applications:
1. regress x on 1 to produce the residual z = x − x̄·1
2. regress y on the residual z to give the coefficient β̂1
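A small sketch of this two-step procedure on synthetic data, confirming it reproduces the usual least squares slope:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, size=80)
y = 1.0 + 2.0 * x + rng.normal(size=80)

# Step 1: regress x on 1 (the constant), leaving the residual z = x - xbar * 1
z = x - x.mean()
# Step 2: regress y on the residual z to get the slope
beta1_hat = (z @ y) / (z @ z)

# Same slope as fitting intercept + x jointly by least squares
X = np.column_stack([np.ones_like(x), x])
print(beta1_hat, np.linalg.lstsq(X, y, rcond=None)[0][1])
```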
3.3 Multiple Regression from
Simple Univariate Reg. (cont'd)

In this procedure, the phrases ”regress b on a”
or ”b is regressed on a” refers to:


A simple univariate regression of b on a with no
intercept,

producing coefficient y^^ = <a,b> / <a,a>,

And residual vector b – y^^a.
We say b is adjusted for a, or is
”orthogonalized” with respect to a.
3.3 Multiple Regression from
Simple Univariate Reg. (cont'd)


We can generalize this approach to p inputs (regression by successive orthogonalization, i.e. the Gram-Schmidt procedure):
1. Initialize z0 = x0 = 1.
2. For j = 1, 2, ..., p: regress xj on z0, z1, ..., z_{j-1} to produce the residual zj.
3. Regress y on the final residual zp.
The result of the algorithm is
β̂p = <z_p, y> / <z_p, z_p>
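A sketch of successive orthogonalization on random synthetic data; the coefficient recovered from the last residual matches the multiple least squares estimate (the helper name is illustrative):

```python
import numpy as np

def successive_orthogonalization(X):
    """Gram-Schmidt on the columns of X (first column assumed to be the intercept).
    Returns the orthogonalized columns z_0, ..., z_p."""
    Z = X.astype(float)
    for j in range(1, X.shape[1]):
        for k in range(j):
            zk = Z[:, k]
            # regress x_j on z_k and subtract the projection, keeping the residual
            Z[:, j] -= (zk @ X[:, j]) / (zk @ zk) * zk
    return Z

rng = np.random.default_rng(3)
N, p = 60, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

Z = successive_orthogonalization(X)
beta_p_gs = (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])      # regress y on the last residual z_p
beta_p_ls = np.linalg.lstsq(X, y, rcond=None)[0][-1]    # multiple least squares coefficient
print(np.isclose(beta_p_gs, beta_p_ls))                 # True
```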
3.3.1 Multiple Outputs

Suppose we have multiple outputs Y1, Y2, ..., YK,
predicted from inputs X0, X1, X2, ..., Xp.
Assume a linear model for each output:
Y_k = β_0k + Σ_{j=1}^p X_j β_jk + ε_k
In matrix notation:
Y = X B + E
Y is the N x K response matrix, with (i,k) entry y_ik
X is the N x (p+1) input matrix
B is the (p+1) x K parameter matrix
E is the N x K matrix of errors
3.3.1 Multiple Outputs (cont'd)

Generalization of the univariate loss function:
RSS(B) = Σ_{k=1}^K Σ_{i=1}^N (y_ik − f_k(x_i))^2 = tr[(Y − XB)^T (Y − XB)]
The least squares estimate has the same form:
B̂ = (X^T X)^{-1} X^T Y
Multiple outputs do not affect each other's least squares estimates.
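A short check on synthetic data that the single multivariate solve and the K separate univariate solves coincide:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, K = 50, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1)
Y = rng.normal(size=(N, K))                                   # N x K responses

# One multivariate solve: B_hat = (X^T X)^{-1} X^T Y, a (p+1) x K matrix
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Column-by-column univariate solves give exactly the same coefficients
B_cols = np.column_stack([np.linalg.solve(X.T @ X, X.T @ Y[:, k]) for k in range(K)])
print(np.allclose(B_hat, B_cols))   # True: outputs do not affect each other
```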
3.4 Subset Selection and
Coefficient Shrinkage


As mentioned, unbiased estimators are not always best.
Two reasons why we are often not satisfied with the least squares estimates:
1. prediction accuracy – low bias but large variance; we can improve accuracy by shrinking or setting some coefficients to zero.
2. interpretation – with a large number of predictors the signal gets lost in the noise; we want to find a smaller subset with the strongest effects.
3.4.1 Subset Selection



Improve estimator performance by retaining
only a subset of variables.
Use least squares to estimate coefficients of
remaining inputs.
Several strategies:

Best subset regression

Forward stepwise selection

Backwards stepwise selection

Hybrid stepwise selection
3.4.1 Subset Selection (cont'd)

Best subset regression





For each k in {0, 1, 2, ..., p},
find the subset of size k that gives the smallest residual sum of squares.
The leaps and bounds procedure (Furnival and Wilson, 1974) makes this feasible for p up to about 30-40.
Typically choose k so that an estimate of expected prediction error is minimized.
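A brute-force sketch of best subset search; unlike leaps and bounds it simply enumerates all size-k subsets, so it is only sensible for small p (the helper best_subset and the data are illustrative):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Exhaustive search over all size-k subsets for the smallest RSS."""
    N, p = X.shape
    best = (np.inf, None)
    for subset in combinations(range(p), k):
        Xs = np.column_stack([np.ones(N), X[:, subset]])     # intercept + chosen inputs
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best[0]:
            best = (rss, subset)
    return best

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=80)
print(best_subset(X, y, k=2))   # typically favours the subset containing columns 0 and 3
```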
3.4.1 Subset Selection (cont'd)

Forward stepwise selection

Searching all subsets is time-consuming.
Instead, find a good path through them:
Start with the intercept,
then sequentially add the predictor that most improves the fit.
"Improved fit" is judged by the F-statistic: add the predictor that gives the largest value of F.
Stop adding when no remaining predictor gives a significantly large F value.
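A simplified sketch of forward stepwise selection; the F cutoff of 4.0 and the helper names are made up for illustration, not a prescription from the text:

```python
import numpy as np

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def forward_stepwise(X, y, f_cutoff=4.0):
    N, p = X.shape
    active, remaining = [], list(range(p))
    cur = np.ones((N, 1))                      # start with the intercept only
    while remaining:
        rss_cur = rss(cur, y)
        scores = []
        for j in remaining:                    # F statistic for adding each candidate
            Xj = np.column_stack([cur, X[:, j]])
            rss_j = rss(Xj, y)
            df = N - Xj.shape[1]
            scores.append(((rss_cur - rss_j) / (rss_j / df), j))
        best_f, best_j = max(scores)
        if best_f < f_cutoff:                  # no predictor improves the fit enough
            break
        active.append(best_j)
        remaining.remove(best_j)
        cur = np.column_stack([cur, X[:, best_j]])
    return active

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 1] + 1.5 * X[:, 4] + rng.normal(size=100)
print(forward_stepwise(X, y))   # typically picks columns 1 and 4
```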
3.4.1 Subset Selection (cont'd)

Backward stepwise selection

Similar to the previous procedure, but starts with the full model
and sequentially deletes predictors.
Drop the predictor giving the smallest F value.
Stop when dropping any further predictor would lead to a significant decrease in fit.
3.4.1 Subset Selection (cont'd)

Hybrid stepwise selection




Consider both forward and backward moves at each iteration
and make the "best" move.
Needs a parameter to set the threshold between choosing an "add" or a "drop" operation.
The stopping rule (for all stepwise methods) finds only a local optimum; it is not guaranteed to find the best model.
3.4.3 Shrinkage Methods



Subset selection is a discrete process
(variables retained or not) so may give high
variance.
Shrinkage methods are similar, but use a
continuous process to avoid high variance.
Examine two methods:

Ridge regression

Lasso
3.4.3 Shrinkage Methods (cont'd)

Ridge regression


Shrinks the regression coefficients by imposing a penalty on their size.
The ridge coefficients minimize the penalized residual sum of squares,
β̂^ridge = argmin_β { Σ_{i=1}^N ( y_i − β0 − Σ_{j=1}^p x_ij βj )^2 + λ Σ_{j=1}^p βj^2 }
where λ >= 0 is a complexity parameter controlling the amount of shrinkage.
Mitigates the high variance produced when many input variables are correlated.
3.4.3 Shrinkage Methods (cont'd)

Ridge regression

Can be reparameterized using centered inputs, replacing each x_ij with x_ij − x̄_j.
Estimate the intercept by β̂0 = ȳ = (1/N) Σ_i y_i.
The remaining ridge regression solutions are given by
β̂^ridge = (X^T X + λI)^{-1} X^T y
(with X now the centered N x p input matrix).
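A minimal NumPy sketch of the centered ridge solution, compared with ordinary least squares on the same synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
N, p = 60, 5
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(size=N)

# Center inputs and response, so the intercept (y.mean()) drops out of the penalty
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 10.0                                   # complexity parameter lambda
# beta_ridge = (Xc^T Xc + lam * I)^{-1} Xc^T yc
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
beta_ls = np.linalg.lstsq(Xc, yc, rcond=None)[0]
print(beta_ridge)   # shrunk toward zero relative to the least squares fit below
print(beta_ls)
```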
3.4.3 Shrinkage Methods (cont'd)

The singular value decomposition (SVD) of the centered input matrix X has the form
X = U D V^T
U and V are N x p and p x p orthogonal matrices
Columns of U span the column space of X
Columns of V span the row space of X
D is a p x p diagonal matrix with diagonal entries d1 >= ... >= dp >= 0 (the singular values)
3.4.3 Shrinkage Methods (cont'd)

Using the SVD, rewrite the least squares fitted vector:
X β̂^ls = U U^T y
Now the ridge fitted vector is:
X β̂^ridge = Σ_{j=1}^p u_j · ( d_j^2 / (d_j^2 + λ) ) · u_j^T y
A greater amount of shrinkage is applied to directions with smaller d_j^2.
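A numerical check of this SVD form of the ridge fit, showing the shrinkage factors d_j^2 / (d_j^2 + λ) on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(8)
Xc = rng.normal(size=(60, 5))
Xc -= Xc.mean(axis=0)                     # centered input matrix
yc = rng.normal(size=60)
lam = 10.0

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # X = U D V^T

# Least squares fit: X beta_ls = U U^T y
fit_ls = U @ (U.T @ yc)
# Ridge fit: each coordinate u_j^T y is shrunk by d_j^2 / (d_j^2 + lambda)
shrink = d**2 / (d**2 + lam)
fit_ridge = U @ (shrink * (U.T @ yc))

# Same answer as solving the ridge normal equations directly
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(5), Xc.T @ yc)
print(np.allclose(fit_ridge, Xc @ beta_ridge))   # True
print(shrink)   # smaller singular values get more shrinkage
```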
3.4.3 Shrinkage Methods (cont'd)




The SVD of the centered matrix X expresses the principal components of the variables in X.
The eigendecomposition of X^T X is
X^T X = V D^2 V^T
The eigenvectors v_j are the principal component directions of X.
The first principal component direction has the largest sample variance; the last one has the minimum variance.
3.4.3 Shrinkage Methods (cont'd)

Ridge regression


Implicitly assumes that the response varies most in the directions of high variance of the inputs.
Therefore shrinks the coefficients of the low-variance components more than those of the high-variance ones.
3.4.3 Shrinkage Methods (cont'd)

Lasso




Shrinks coefficients, like ridge regression, but with some important differences.
The estimate is defined by
β̂^lasso = argmin_β Σ_{i=1}^N ( y_i − β0 − Σ_{j=1}^p x_ij βj )^2   subject to   Σ_{j=1}^p |βj| <= t
Again, we reparameterize with centered inputs and fit a model without an intercept.
The ridge penalty Σ_j βj^2 is replaced by the lasso penalty Σ_j |βj|.
3.4.3 Shrinkage Methods (cont'd)

Computed using quadratic programming.
The lasso is a kind of continuous subset selection:
Making t very small forces some coefficients to be exactly zero.
If t is larger than t0 = Σ_j |β̂_j| (the L1 norm of the least squares solution), the lasso estimates coincide with least squares.
If t is chosen as half of this threshold, the coefficients are shrunk by about 50% on average.
Choose t to minimize an estimate of expected prediction error.
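The slides mention quadratic programming; the sketch below instead uses coordinate descent with soft-thresholding on the equivalent penalty (Lagrangian) form, where the penalty weight lam plays the role of the bound t (helper names and data are illustrative):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for  1/2 * ||y - X b||^2 + lam * ||b||_1
    (the penalty form equivalent to the bound form with parameter t)."""
    N, p = X.shape
    b = np.zeros(p)
    col_ss = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]          # partial residual excluding x_j
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_ss[j]
    return b

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 6))
X -= X.mean(axis=0)
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)
y -= y.mean()
print(lasso_cd(X, y, lam=20.0))   # several coefficients driven exactly to zero
```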
3.4.4 Methods Using Derived
Input Directions


Sometimes we have many inputs and instead want to produce a smaller number of derived inputs that are linear combinations of the originals.
Different methods for constructing the linear combinations:

Principal component regression

Partial least squares
3.4.4 Methods Using Derived
Input Directions (cont'd)

Principal Components Regression

Constructs the linear combinations z_m using the principal components of X.
The regression is a sum of univariate regressions:
ŷ^pcr_(M) = ȳ·1 + Σ_{m=1}^M θ̂_m z_m,    θ̂_m = <z_m, y> / <z_m, z_m>
Each z_m is a linear combination of the original x_j.
Similar to ridge regression, but instead of shrinking the principal components we discard the p − M components with the smallest eigenvalues.
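A sketch of principal components regression keeping the first M components (the helper pcr and the synthetic data are illustrative):

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression keeping the first M components.
    Inputs are centered; the intercept is just the mean of y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:M].T                              # first M principal components z_m
    # Each theta_m is a univariate regression of y on z_m (the z_m are orthogonal)
    theta = np.array([z @ yc / (z @ z) for z in Z.T])
    beta = Vt[:M].T @ theta                        # back to coefficients on the original x_j
    return beta, y.mean()

rng = np.random.default_rng(10)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=80)
beta, intercept = pcr(X, y, M=3)
print(beta, intercept)
```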
3.4.4 Methods Using Derived
Input Directions (cont'd)

Partial least squares

The linear combinations use y in addition to X.
Begin by computing the univariate regression coefficient of y on each x_j,
φ̂_1j = <x_j, y>
Construct the derived input
z_1 = Σ_{j=1}^p φ̂_1j x_j
In the construction of each z_m, the inputs are weighted by the strength of their univariate effect on y.
The outcome y is regressed on z_1,
each x_j is orthogonalized with respect to z_1,
and we iterate until M <= p directions have been obtained.
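A sketch of the partial least squares iteration described above; with M = p directions it reproduces the ordinary least squares fit (the helper pls is illustrative, and the inputs are assumed standardized):

```python
import numpy as np

def pls(X, y, M):
    """Partial least squares fit with M derived directions.
    Assumes the columns of X are standardized and y is centered."""
    N, p = X.shape
    Xj = X.copy()
    y_hat = np.zeros(N)
    for m in range(M):
        phi = Xj.T @ y                      # univariate coefficients <x_j, y>
        z = Xj @ phi                        # derived input, weighted by effect on y
        theta = (z @ y) / (z @ z)           # regress y on z
        y_hat = y_hat + theta * z
        # orthogonalize each remaining x_j with respect to z
        Xj = Xj - np.outer(z, (z @ Xj) / (z @ z))
    return y_hat

rng = np.random.default_rng(11)
X = rng.normal(size=(80, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=80)
y = y - y.mean()

# With M = p directions, PLS reproduces the ordinary least squares fit
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(pls(X, y, M=5), X @ beta_ls))   # True
```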
3.4.6 Multiple Outcome Shrinkage
and Selection


The least squares estimates in a multi-output linear model are a collection of individual estimates for each output.
To apply selection and shrinkage we could simply apply a univariate technique
individually to each outcome,
or simultaneously to all outcomes.
3.4.6 Multiple Outcome Shrinkage
and Selection (cont'd)

Canonical correlation analysis (CCA)

A data reduction technique that combines the responses; similar to PCA.
Finds
a sequence of uncorrelated linear combinations Xu_m of the inputs, and
a corresponding sequence of uncorrelated linear combinations Yv_m of the responses,
that maximize the correlations
Corr^2(Y v_m, X u_m)
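A sketch of CCA via whitening and an SVD, one standard construction (the helper cca and the synthetic data are illustrative):

```python
import numpy as np

def cca(X, Y):
    """Canonical correlation analysis via whitening + SVD."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Whiten each block so its sample cross-product becomes the identity
    Lx = np.linalg.cholesky(Xc.T @ Xc)
    Ly = np.linalg.cholesky(Yc.T @ Yc)
    Xw = Xc @ np.linalg.inv(Lx).T
    Yw = Yc @ np.linalg.inv(Ly).T
    # Singular values of the whitened cross-product are the canonical correlations
    U, rho, Vt = np.linalg.svd(Xw.T @ Yw)
    u_dirs = np.linalg.inv(Lx).T @ U        # input-side directions (combinations Xu_m)
    v_dirs = np.linalg.inv(Ly).T @ Vt.T     # response-side directions (combinations Yv_m)
    return rho, u_dirs, v_dirs

rng = np.random.default_rng(12)
Z = rng.normal(size=(200, 2))                                   # shared latent signal
X = np.column_stack([Z, rng.normal(size=(200, 3))])
Y = Z @ rng.normal(size=(2, 3)) + 0.5 * rng.normal(size=(200, 3))
print(cca(X, Y)[0])   # leading canonical correlations are large, trailing ones small
```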
3.4.6 Multiple Outcome Shrinkage
and Selection (cont'd)

Reduced-rank regression


Formalizes the CCA approach with a regression model that explicitly pools information across responses.
Solves a multivariate regression problem subject to a rank constraint on the coefficient matrix:
B̂^rr(m) = argmin_{rank(B) = m} Σ_{i=1}^N (y_i − B^T x_i)^T Σ^{-1} (y_i − B^T x_i)
where Σ is the covariance of the errors.
Questions