Lecture 4, 5 – Linear Regression
Rice ELEC 697
Farinaz Koushanfar
Fall 2006
Summary
• The simple linear regression model
• Confidence intervals
• Multiple linear regression
• Model selection and shrinkage methods
• Homework 0
Preliminaries
• Data streams X and Y, forming the measurement tuples (x1, y1), …, (xN, yN)
• xi is the predictor (regressor, covariate, feature, independent variable)
• yi is the response (dependent variable, outcome)
• Denote the regression function by f(x) = E(Y | x)
• The linear regression model assumes a specific linear form for f(x) = β0 + β1·x
Linear Regression - More
• The β's are unknown parameters (coefficients)
• The X's can come from different sources:
  – Quantitative inputs
  – Transformations of quantitative inputs
  – Basis expansions, e.g. X2 = X1², X3 = X1³
  – Numeric or "dummy" coding of the levels of qualitative inputs
  – Interactions between variables, e.g. X3 = X1·X2
• No matter the source of X, the model is linear in the parameters
Fitting by Least Squares
• For (x1, y1), …, (xN, yN), estimate the β's!
• xi = (xi1, xi2, …, xip)ᵀ is the vector of feature measurements for the i-th case
• Least squares criterion:
  RSS(β) = Σ_{i=1}^N ( yi − β0 − Σ_{j=1}^p xij·βj )²
How to Find the Least Squares Fit?
• X is N×(p+1), y is the N-vector of outputs
• RSS(β) = (y − Xβ)ᵀ(y − Xβ)
• ∂RSS/∂β = −2Xᵀ(y − Xβ),   ∂²RSS/∂β∂βᵀ = 2XᵀX
• If X has full column rank, then XᵀX is positive definite; set ∂RSS/∂β = 0:
  Xᵀ(y − Xβ) = 0
  β̂ = (XᵀX)⁻¹Xᵀy
  ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy,  where H = X(XᵀX)⁻¹Xᵀ is the hat matrix
  (a numerical sketch follows below)
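
Not part of the original slides: a minimal numpy sketch of the normal-equations fit on synthetic data (the data, seed, and variable names are illustrative assumptions; in practice numpy.linalg.lstsq is preferred for numerical stability).

import numpy as np

# Synthetic data: N cases, p predictors (illustrative only)
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1), first column = intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y                       # same as X @ beta_hat
print(beta_hat)
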
Geometrical Representation
• The least squares estimate lives in ℝN: ŷ is the projection of y onto the subspace spanned by the columns of X
• Minimize RSS(β) = ||y − Xβ||²; the residual vector y − ŷ is orthogonal to this subspace
• H (the hat matrix) is also known as the projection matrix
Non-full Rank Case!
• What if the columns of X are linearly dependent?
• XᵀX is singular, so the β's are not uniquely defined
• ŷ is still the projection of y onto the column space of X
• There is just more than one way to express that projection!
• Resolve by dropping redundant columns of X
• Most software packages detect this and automatically implement some strategy to remove them
Sampling Properties of β̂
• Assume the yi's are uncorrelated with constant variance Var(yi) = σ², and that the xi's are fixed
• Then β̂ = (XᵀX)⁻¹Xᵀy and Var(β̂) = (XᵀX)⁻¹σ²
• We use an unbiased estimator for σ²:
  σ̂² = (1/(N − p − 1)) Σ_{i=1}^N (yi − ŷi)²
• Assume further that Y is linear in X1, …, Xp and that the deviations ε are additive, ε ~ N(0, σ²)
• We can then show that β̂ ~ N(β, (XᵀX)⁻¹σ²)  (see the sketch below)
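
A sketch, not from the slides, of the unbiased variance estimate and the coefficient covariance on the same kind of synthetic data (all names and the data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p predictors
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Unbiased estimate of sigma^2: RSS / (N - p - 1)
sigma2_hat = resid @ resid / (N - p - 1)

# Estimated covariance of beta_hat: (X^T X)^{-1} * sigma^2
cov_beta = np.linalg.inv(X.T @ X) * sigma2_hat
std_err = np.sqrt(np.diag(cov_beta))
print(sigma2_hat, std_err)
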
Confidence Intervals
• Assume that our population parameter of interest is the population mean
• What is the meaning of a 95% confidence interval in this situation?
• ✗ (wrong): there is a 95% chance that the confidence interval contains the population mean
  – But any particular confidence interval either contains the population mean or it doesn't. The confidence interval shouldn't be interpreted as a probability.
• ✓ (correct): if samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then 95% of these intervals should contain the population mean.
But Before We Proceed…
• Let's review some real probability and statistics, and refresh your memory about a number of useful distributions!
Chi-square Distribution
• For Z1, Z2, …, Zν independent, each Zi ~ N(0, 1)
• Set χ² = Z1² + Z2² + … + Zν², with ν > 0
• The distribution of χ² is chi-square with ν "degrees of freedom" (DF)
• χ²_ν(x) denotes the value of the distribution function at x
• χ²_{p,ν} denotes the p-th quantile of the distribution
• Usually we use look-up tables to find these values (or software; see the sketch below)
• For the regression variance estimate σ̂² = (1/(N − p − 1)) Σ_{i=1}^N (yi − ŷi)²:
  (N − p − 1)·σ̂² ~ σ²·χ²_{N−p−1}
• This is a χ² distribution with N − p − 1 DF
Example – Chi-square Distribution
t-distribution
• Let Z and 2 be independent random variables, assume further
that
– Z has the standard normal distribution
– 2 has the chi-square distribution with  degrees of freedom
– Define t as:
t
Z
2 
• The distribution of t is referred to as t-distribution (student’s tdistribution) with  “degrees of freedom”
• The value of the corresponding distribution function is denoted
by t(x) and its p-th quantile is denoted by tp, (0p1)
Example – t-distribution
Tail of t-distribution vs. Normal
F-distribution
• Let 12, 22 be independent random variables, assume further
that
– 12 has the chi-square distribution with 1 DF
– 22 has the chi-square distribution with 2 DF
• Define F as:
F
1
2
1
2
2
2
• F distribution with 1 DF in the numerator and 2 DF in the
denomenator.
• F1,2(x): distribution function at x, p-th quantile is Fp,1,2
• F is always positive and F1,2(-)= F1,2(0)=0, quantiles>0
Example – F-distribution
Confidence Interval (CI)
• A CI is an estimated range of values which is likely to include an unknown population parameter
• The estimated range is calculated from a given set of sample data
• If independent samples are taken repeatedly from the same population, and a CI is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter.
Hypothesis Testing
• (N − p − 1)·σ̂² ~ σ²·χ²_{N−p−1}  (a χ² distribution with N − p − 1 DF)
• β̂ and σ̂² are statistically independent
• Hypothesis: is βj = 0?
• Standardized coefficient or Z-score:  zj = β̂j / (σ̂·√vj)
• vj is the j-th diagonal element of (XᵀX)⁻¹
• If βj = 0, zj is distributed as t_{N−p−1} (t distribution, N − p − 1 DF)
• Enter the t-table for N − p − 1 DF, choose the significance level (α) required, and find the tabulated value
• If |zj| exceeds the tabulated value, then βj is significantly different from zero and the hypothesis is rejected!  (A sketch follows below.)
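
A sketch, not from the slides, of the Z-score test on synthetic data where one coefficient is truly zero (the data, seed, and α = 0.05 are illustrative assumptions):

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=N)   # beta_2 truly 0

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))
v = np.diag(np.linalg.inv(X.T @ X))          # v_j: diagonal of (X^T X)^{-1}

z = beta_hat / (sigma_hat * np.sqrt(v))      # Z-scores
crit = t.ppf(0.975, df=N - p - 1)            # two-sided test at alpha = 0.05
p_values = 2 * t.sf(np.abs(z), df=N - p - 1)
for j, (zj, pv) in enumerate(zip(z, p_values)):
    print(f"beta_{j}: z = {zj:7.2f}  reject H0: {abs(zj) > crit}  (p = {pv:.3f})")
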
Test the Significance of a Group
• Simultaneously test the significance of a group of β's
• F-statistic:
  F = [ (RSS0 − RSS1) / (p1 − p0) ] / [ RSS1 / (N − p1 − 1) ]
• RSS1 is for the bigger model with p1 + 1 parameters
• RSS0 is for the nested smaller model with p0 + 1 parameters, i.e. with p1 − p0 parameters constrained to be zero
• The F-statistic measures the change in RSS per additional parameter, normalized by an estimate of σ²
• If the smaller model is correct, the F-statistic ~ F_{p1−p0, N−p1−1}  (see the sketch below)
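
A sketch, not from the slides, of the nested-model F-test on synthetic data (the split into p0 = 2 "real" and 2 "extra" predictors is an illustrative assumption):

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(2)
N = 100
X_small = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])    # intercept + p0 = 2 predictors
X_extra = rng.normal(size=(N, 2))                                   # 2 candidate predictors
X_big = np.column_stack([X_small, X_extra])                         # p1 = 4 predictors
y = X_small @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)       # extras truly irrelevant

def rss(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

p0, p1 = 2, 4
RSS0, RSS1 = rss(X_small, y), rss(X_big, y)
F = ((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (N - p1 - 1))
p_value = f.sf(F, p1 - p0, N - p1 - 1)
print(F, p_value)    # large p-value: no evidence the extra group is needed
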
Confidence Intervals (CI)
• Recall that β̂ ~ N(β, (XᵀX)⁻¹σ²)
• Isolate βj and get the (1 − 2α) CI for βj:
  ( β̂j − z^{(1−α)}·√vj·σ̂ ,  β̂j + z^{(1−α)}·√vj·σ̂ )
• where z^{(1−α)} is the (1 − α) percentile of the normal distribution
• Similarly, obtain a confidence set for the entire vector β:
  Cβ = { β | (β̂ − β)ᵀ XᵀX (β̂ − β) ≤ σ̂²·χ²_{p+1}^{(1−α)} }
• where χ²_ℓ^{(1−α)} is the (1 − α) percentile of the χ² distribution with ℓ DF  (a sketch for the single-coefficient interval follows below)
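
A sketch, not from the slides, of the (1 − 2α) interval for each coefficient on synthetic data (using the normal percentile as on the slide; a t_{N−p−1} percentile could be substituted):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))
v = np.diag(np.linalg.inv(X.T @ X))

alpha = 0.025                      # gives a 1 - 2*alpha = 95% interval
z = norm.ppf(1 - alpha)            # normal percentile
lower = beta_hat - z * np.sqrt(v) * sigma_hat
upper = beta_hat + z * np.sqrt(v) * sigma_hat
for j in range(p + 1):
    print(f"beta_{j}: [{lower[j]:.3f}, {upper[j]:.3f}]")
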
The Gauss-Markov Theorem
• The least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates
• Restriction to unbiased estimation is not always the best choice
• Focus on estimating a linear combination θ = aᵀβ of β
• For example, the prediction f(x0) = x0ᵀβ
• For a fixed X, any other linear estimator of β has the form β̃ = Cy
The Gauss-Markov Theorem - Proof
• β̃ = Cy: since the new estimator is unbiased,
  E(Cy | X) = E(CXβ + Cε | X) = CXβ = β  ⇒  CX = I
• Var(β̃ | X) = C·Var(y | X)·Cᵀ = σ²CCᵀ
• Need to show Var(β̃ | X) ≥ Var(β̂_OLS | X)
• Define D = C − (XᵀX)⁻¹Xᵀ, or equivalently Dy = β̃ − β̂_OLS
• I = CX = (D + (XᵀX)⁻¹Xᵀ)X = DX + (XᵀX)⁻¹XᵀX  ⇒  DX = 0
• Var(β̃ | X) = σ²CCᵀ = σ²[D + (XᵀX)⁻¹Xᵀ][Dᵀ + X(XᵀX)⁻¹] = σ²[DDᵀ + (XᵀX)⁻¹], since DX = 0; DDᵀ is positive semidefinite ⇒
• Var(β̃ | X) ≥ Var(β̂_OLS | X), with equality only when D = 0, i.e. β̃ = β̂_OLS
Multiple Outputs
• Multiple outputs Y1, Y2, …, YK from inputs X0, X1, X2, …, Xp
• For each output:  Yk = β0k + Σ_{j=1}^p Xj·βjk + εk = fk(X) + εk
• In matrix form: Y = XB + E, with Y (N×K), X (N×(p+1)), B ((p+1)×K), E (N×K)
• RSS(B) = Σ_{k=1}^K Σ_{i=1}^N (yik − fk(xi))² = tr[(Y − XB)ᵀ(Y − XB)]
• The least squares estimates:  B̂ = (XᵀX)⁻¹XᵀY  (see the sketch below)
• Note: if the errors are correlated, Cov(ε) = Σ, the criterion becomes
  RSS(B; Σ) = Σ_{i=1}^N (yi − f(xi))ᵀ Σ⁻¹ (yi − f(xi))
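
A sketch, not from the slides, showing that the same normal equations handle a multi-output Y column-by-column (synthetic data, illustrative names):

import numpy as np

rng = np.random.default_rng(13)
N, p, K = 50, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])      # N x (p+1)
B_true = rng.normal(size=(p + 1, K))
Y = X @ B_true + rng.normal(scale=0.2, size=(N, K))             # N x K, two outputs

# Same normal equations, applied column-by-column: B_hat = (X^T X)^{-1} X^T Y
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.round(B_hat - B_true, 2))
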
The Bias-Variance Trade-off
• A good measure of the quality of an estimator f*(x) is its mean squared error (MSE)
• Let f0(x) be the true value of f(x) at the point x. Then:
  MSE[f*(x)] = E[f*(x) − f0(x)]², which can be written as
  MSE[f*(x)] = Var[f*(x)] + [E f*(x) − f0(x)]²
• This is variance plus squared bias
• Typically, when bias is low, variance is high and vice versa. Choosing estimators often involves a trade-off between bias and variance  (a small simulation sketch follows below)
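
A small simulation sketch, not from the slides, illustrating MSE = variance + squared bias at a single point x0; the true function (sin), sample size, and polynomial estimators are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(4)
f0 = np.sin                      # true regression function
x0, n_sims, N = 1.0, 2000, 30    # evaluation point, repetitions, sample size

def simulate(degree):
    """Fit a polynomial of a given degree many times; return estimates of f0(x0)."""
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0, 3, size=N)
        y = f0(x) + rng.normal(scale=0.5, size=N)
        coefs = np.polyfit(x, y, deg=degree)
        preds[s] = np.polyval(coefs, x0)
    return preds

for degree in (0, 1, 5):
    est = simulate(degree)
    var = est.var()
    bias2 = (est.mean() - f0(x0)) ** 2
    mse = ((est - f0(x0)) ** 2).mean()
    print(f"degree {degree}: var={var:.4f}  bias^2={bias2:.4f}  var+bias^2={var + bias2:.4f}  MSE={mse:.4f}")
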
Linear Methods for Regression
• If the linear model is correct for a given problem, then the OLS prediction is unbiased and has the lowest variance among all linear unbiased estimators
• But there can be (and often are) biased estimators with smaller MSE
• Generally, by regularizing (shrinking, dampening, controlling) the estimator in some way, its variance will be reduced; if the corresponding increase in bias is small, this will be worthwhile
• Examples of regularization: subset selection (forward, backward, all subsets), ridge regression, the lasso
• In reality, models are almost never correct, so there is an additional model bias between the closest member of the linear model class and the truth
Model Selection
• Often we prefer a restricted estimate because of its reduced estimation variance
Subset Selection and Coefficient Shrinkage
• There are two reasons why we are often not satisfied with the least squares estimates:
• Prediction accuracy: it can sometimes be improved by trading a little bias to reduce the variance of the predicted values
• Interpretation: determine a smaller subset of predictors that exhibit the strongest effects. To get the "big picture" we sacrifice some of the small details
• Let us describe a number of approaches to variable selection and coefficient shrinkage…
Subset Selection - Linear Model
• Best subset selection: find the subset of size k that gives the smallest RSS
  – Leaps and bounds procedure (Furnival and Wilson, 1974)
  – essentially an exhaustive search over all possible combinations!
• Let's take another look at the prostate cancer example:
• Correlation between the level of prostate specific antigen (PSA) and clinical predictors
• We use log(PSA) – lpsa – as the response variable
• Denote: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45
Prostate Cancer Data
All the Subset Models for PC Example
Subset Selection - Linear Model
• Exhaustive search (best subset selection) becomes infeasible for large p
• Forward stepwise selection: start from the intercept and sequentially add variables
• With k inputs let the estimate be β̃; with k + 1 inputs let it be β̂. Form the F-ratio
  F = [ RSS(β̃) − RSS(β̂) ] / [ RSS(β̂) / (N − k − 2) ]
• Typically, add the predictor producing the largest value of F, stopping when no predictor can produce an F-ratio greater than the 90th–95th percentile of the F_{1, N−k−2} distribution  (see the sketch below)
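
A sketch, not from the slides, of forward stepwise selection with the F-ratio stopping rule on synthetic data (the 90th-percentile threshold and the data are illustrative assumptions):

import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
N, p = 100, 6
X = rng.normal(size=(N, p))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=N)   # only predictors 0 and 3 matter

def rss(cols):
    Z = np.column_stack([np.ones(N)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ beta
    return r @ r

selected = []
while len(selected) < p:
    k = len(selected)
    rss_old = rss(selected)
    # F-ratio for adding each remaining predictor
    candidates = [j for j in range(p) if j not in selected]
    F = {j: (rss_old - rss(selected + [j])) / (rss(selected + [j]) / (N - k - 2))
         for j in candidates}
    best = max(F, key=F.get)
    if F[best] < f_dist.ppf(0.90, 1, N - k - 2):   # stop below the 90th percentile
        break
    selected.append(best)
print("selected predictors:", selected)
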
Subset Selection - Linear Model
• Backward stepwise selection: start with the full model and sequentially delete predictors
• Again, typically uses the F-ratio as the stopping criterion
• Drop the predictor producing the smallest F at each stage, stopping when every predictor remaining in the model produces a value of F greater than the 90th–95th percentile when dropped
• There are also hybrid strategies that simultaneously consider both forward and backward moves
• Note: this is only a local search; we are not considering all combinations, just a sequential path through them
K-fold Cross Validation
• Primary method for estimating a tuning parameter λ (such as the subset size)
• Divide the data into K roughly equal parts (typically K = 5 or 10)
• For each k = 1, 2, …, K, fit the model with parameter λ to the other K − 1 parts, giving β̂^{−k}(λ), and compute its error Ek(λ) in predicting the k-th part
• This gives the cross-validation error  CV(λ) = (1/K) Σ_{k=1}^K Ek(λ)
• Do this for many values of λ, and choose the λ that makes CV(λ) smallest
K-fold Cross Validation Notation
• In our variable-subsets example, λ is the subset size
• β̂^{−k}(λ) are the coefficients for the best subset of size λ, found from the training set that leaves out the k-th part of the data
• Ek(λ) is the estimated test error for this best subset
• From the K cross-validation training sets, the K test error estimates are averaged to give CV(λ)
• Note that a different subset of size λ will (probably) be found from each of the K cross-validation training sets. That doesn't matter: the focus is on the subset size, not the actual subset.  (A sketch follows below.)
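
A sketch, not from the slides, of K-fold cross-validation over subset size; as an assumption, a greedy forward search stands in for the exact best-subset search to keep the example short:

import numpy as np

rng = np.random.default_rng(6)
N, p = 120, 8
X = rng.normal(size=(N, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)   # only 3 useful predictors

def fit_greedy_subset(Xtr, ytr, size):
    """Greedy forward stand-in for best-subset search of a given size (assumption)."""
    chosen = []
    for _ in range(size):
        best_j, best_rss = None, np.inf
        for j in range(Xtr.shape[1]):
            if j in chosen:
                continue
            Z = np.column_stack([np.ones(len(ytr))] + [Xtr[:, c] for c in chosen + [j]])
            beta = np.linalg.lstsq(Z, ytr, rcond=None)[0]
            r = ytr - Z @ beta
            if r @ r < best_rss:
                best_j, best_rss = j, r @ r
        chosen.append(best_j)
    Z = np.column_stack([np.ones(len(ytr))] + [Xtr[:, c] for c in chosen])
    beta = np.linalg.lstsq(Z, ytr, rcond=None)[0]
    return chosen, beta

K = 5
folds = np.array_split(rng.permutation(N), K)
for size in range(1, p + 1):                       # lambda = subset size
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        cols, beta = fit_greedy_subset(X[train], y[train], size)
        Zte = np.column_stack([np.ones(len(test))] + [X[test, c] for c in cols])
        errs.append(np.mean((y[test] - Zte @ beta) ** 2))
    print(f"subset size {size}: CV error = {np.mean(errs):.3f}")
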
The Bootstrap Approach
• The bootstrap works by sampling N times with replacement from the training set to form a "bootstrap" data set. The model is then estimated on the bootstrap data set, and predictions are made for the original training set
• This process is repeated many times and the results are averaged
• The bootstrap is most useful for estimating standard errors of predictions  (see the sketch below)
• We can also use modified versions of the bootstrap to estimate prediction error
• It sometimes produces better estimates than cross-validation (a topic of current research)
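
A sketch, not from the slides, of the bootstrap used to estimate the standard errors of the coefficients of a simple linear fit (the data and B = 1000 replicates are illustrative):

import numpy as np

rng = np.random.default_rng(7)
N = 80
x = rng.uniform(0, 5, size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=N)
X = np.column_stack([np.ones(N), x])

B = 1000
boot_betas = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, N, size=N)                 # sample N cases with replacement
    boot_betas[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

print("bootstrap standard errors of (beta0, beta1):", boot_betas.std(axis=0))
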
Shrinkage Methods – Ridge Regression
• Because subset selection is a discrete process, it often produces a high-variance model
• Shrinkage methods are more continuous
• Ridge regression: the ridge coefficients minimize a penalized RSS:
  β̂_ridge = argmin_β { Σ_{i=1}^N (yi − β0 − Σ_{j=1}^p xij·βj)² + λ·Σ_{j=1}^p βj² }
• Equivalently, minimize the RSS subject to Σ_{j=1}^p βj² ≤ t
• The parameter λ > 0 penalizes each βj in proportion to its size βj². The solution is
  β̂_ridge = (XᵀX + λI)⁻¹Xᵀy
• where I is the identity matrix. This is a biased estimator that for some values of λ > 0 may have smaller mean squared error than the least squares estimator. Note λ = 0 gives the least squares estimator; as λ → ∞, β̂_ridge → 0  (see the sketch below)
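
A sketch, not from the slides, of the closed-form ridge solution on standardized, centered synthetic data (the λ grid is illustrative); the norm of the solution shrinks toward zero as λ grows:

import numpy as np

rng = np.random.default_rng(8)
N, p = 60, 10
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # ridge is usually applied to standardized inputs
y = X[:, 0] - X[:, 1] + rng.normal(size=N)
y = y - y.mean()                                    # center y so the intercept can be dropped

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam*I)^{-1} X^T y (intercept excluded from the penalty)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 1.0, 10.0, 1000.0):
    beta = ridge(X, y, lam)
    print(f"lambda={lam:7.1f}  ||beta|| = {np.linalg.norm(beta):.3f}")   # shrinks toward 0 as lambda grows
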
Prostate Cancer Example (Cont’d)
Shrinkage Methods – The Lasso
• The lasso is a shrinkage method like ridge, but it acts in a nonlinear manner on the outcome y
• The lasso is defined by:
  β̂_lasso = argmin_β Σ_{i=1}^N (yi − β0 − Σ_{j=1}^p xij·βj)²   subject to Σ_{j=1}^p |βj| ≤ t
• Notice that the ridge penalty Σ βj² is replaced by Σ |βj|
• This makes the solutions nonlinear in y, and a quadratic programming algorithm is used to compute them
• Because of the nature of the constraint, if t is chosen small enough the lasso will set some coefficients exactly to zero. Thus the lasso does a kind of continuous model selection  (see the sketch below)
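
A sketch, not from the slides, using scikit-learn's Lasso, which solves the penalized (Lagrangian) form with a parameter alpha rather than the constraint bound t; the data and alpha values are illustrative:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
N, p = 80, 10
X = rng.normal(size=(N, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=N)   # only two nonzero coefficients

# scikit-learn solves the penalized form; its alpha corresponds to some bound t in the slide's formulation
for alpha in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(lasso.coef_ == 0.0)
    print(f"alpha={alpha}: {n_zero} of {p} coefficients set exactly to zero")
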
Shrinkage Methods
• The parameter t should be adaptively chosen to minimize an estimate of expected prediction error, using, say, cross-validation
• Ridge vs. lasso: if the inputs are orthogonal, ridge multiplies the least squares coefficients by a constant < 1, while the lasso translates them towards zero by a constant, truncating at zero  (see the sketch below)
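
A sketch, not from the slides, of the two rules in the orthonormal-input case; the exact scaling of λ depends on how the penalized criterion is written, so the constant here is illustrative:

import numpy as np

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])   # least squares coefficients, orthonormal inputs assumed
lam = 0.5

beta_ridge = beta_ols / (1.0 + lam)                                        # uniform proportional shrinkage
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)   # soft thresholding
print(beta_ridge)
print(beta_lasso)
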
Prostate Cancer Example (Cont’d)
Lagrange Multiplier -- Concept
• Solving an optimization problem
  – finding a min or max, e.g., min f(P)
• No closed form, subject to constraints (e.g. g(P) = 0)
• If there were no constraints, you would set grad(f(P)) = 0
• With constraints, define F(P, λ) = f(P) − λ·g(P)
• Now set the gradient of F to 0: grad(F(P, λ)) = 0
• One more dimension: the partial derivative w.r.t. λ set to 0 gives g(P) = 0
• Thus grad(F) = 0 automatically satisfies our constraint
• Add a new Lagrange multiplier (λ) for each constraint  (a small worked example follows below)
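
A small worked example, not from the slides: minimize f(x, y) = x² + y² subject to g(x, y) = x + y − 1 = 0.

F(x, y, \lambda) = x^2 + y^2 - \lambda (x + y - 1)

\frac{\partial F}{\partial x} = 2x - \lambda = 0, \quad
\frac{\partial F}{\partial y} = 2y - \lambda = 0, \quad
\frac{\partial F}{\partial \lambda} = -(x + y - 1) = 0

\Rightarrow \; x = y = \tfrac{\lambda}{2}, \qquad
x + y = 1 \;\Rightarrow\; x = y = \tfrac{1}{2}, \; \lambda = 1

So the constrained minimum is at (1/2, 1/2), exactly where a level set of f touches the constraint line tangentially.
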
Lagrange Multiplier – Geometrical Example
• In 2D, assume the objective function is: min f(x, y)
• Constraint: g(x, y) − c = 0
• Traverse along g(x, y) − c = 0
• It may intersect the level sets f(x, y) = dn at many points
• Only where g = c and a level set of f(x, y) touches the constraint tangentially, but does not cross it, do we find a min
• For more info, see "Lagrange Multipliers without Permanent Scarring" (by Dan Klein)
• From Wikipedia: drawn in green is the locus of points satisfying the constraint g(x, y) = c. Drawn in blue are contours of f. Arrows represent the gradient, which points in a direction normal to the contour.
Principal Component Analysis (PCA)
Intro – eigenvalues, eigenvectors
• A square matrix can be interpreted as a transformation – translation, rotation, stretching, etc.
• Eigenvectors of a transformation are vectors that are either left intact or simply multiplied by a scalar factor after the transformation
• An eigenvector's eigenvalue is the scale factor by which it has been multiplied
• E.g., matrix A, eigenvector u, eigenvalue λ
• By definition, Au = λu
  – To find them, set Au − λu = 0, solve det(A − λI) = 0, then a system of linear equations  (see the sketch below)
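
A minimal numpy sketch, not from the slides, checking the definition Au = λu on a small symmetric matrix (the matrix is illustrative):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Solve A u = lambda u numerically
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)        # 3 and 1 (in some order)
print(eigvecs)        # columns are the corresponding eigenvectors

# Check the definition for the first pair
u, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ u, lam * u))   # True
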
PCA Intro – Singular Value Decomposition (SVD)
• The covariance matrix of a matrix X (N×p, p ≤ N) is a p×p (square) matrix whose elements are the covariances between column pairs of X
• Singular value decomposition: the matrix X can be written as X = UDVᵀ
• U and V are N×p and p×p orthogonal matrices:
  – the columns of U span the column space of X, with UᵀU = I_{p×p}
  – the columns of V span the row space, with VᵀV = I_{p×p}
  – D is a p×p diagonal matrix, with diagonal entries d1 ≥ d2 ≥ … ≥ dp ≥ 0 called the singular values of X
• The SVD always exists, and is unique up to signs
PCA Intro – SVD, correlation matrix
• The SVD gives the eigenvalues and eigenvectors of XXᵀ and XᵀX
• The eigenvectors of XᵀX make up the columns of V; the eigenvectors of XXᵀ make up the columns of U
• XᵀX = VDᵀUᵀUDVᵀ = VD²Vᵀ, so V holds the eigenvectors of the covariance matrix
• The singular values in D (the diagonal entries of D) are the square roots of the eigenvalues of XᵀX and are arranged in descending order
• The singular values are always real numbers. If the matrix X is real, then U and V are also real  (see the sketch below)
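
A sketch, not from the slides, verifying numerically that the squared singular values of a centered X equal the eigenvalues of XᵀX (synthetic data, illustrative names):

import numpy as np

rng = np.random.default_rng(10)
N, p = 50, 4
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                      # center the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: U (N x p), d (p,), Vt (p x p)

# Eigen-decomposition of X^T X has eigenvalues d^2 and eigenvectors = columns of V
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
print(np.allclose(np.sort(eigvals), np.sort(d ** 2)))   # True: singular values are sqrt of eigenvalues
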
PCA and dimension reduction
• The eigenvectors vj are the principal component directions of X
• The first principal component z1 = Xv1 = u1·d1 has the largest variance among all normalized directions: Var(z1) = d1²/N
• Subsequent zj's have maximal variance dj²/N subject to the constraint of being orthogonal to the earlier ones
• If we have a large number of correlated inputs, we can produce a small number of linear combinations Zm, m = 1, …, M (M < p) of the original inputs Xj, and then use the Zm's as regression inputs
Methods Using Derived Input Directions
• Principal Component Regression:
PCA Regression
• Let Dq be D with all but the first q diagonal elements set to zero. Then X̂q = U·Dq·Vᵀ, and
  X̂q = argmin_{rank(X̃q) ≤ q} || X − X̃q ||
• Write q(j) for the ordered principal component directions (the columns of V), ordered from largest to smallest value of dj²
• Then principal component regression computes the derived input columns zj = X·q(j) and regresses y on z1, z2, …, zJ for some J ≤ p
• Since the zj's are orthogonal, this regression is just a sum of univariate regressions:
  ŷ_pcr = ȳ + Σ_{j=1}^J θ̂j·zj
• where θ̂j is the univariate regression coefficient of y on zj  (see the sketch below)
PCA vs. Ridge Regression
• Principal components regression is very similar to ridge regression: both operate on the principal components of the input matrix.
• Ridge regression shrinks the coefficients of the principal components, with relatively more shrinkage applied to the smaller components than the larger; principal components regression discards the p − J smallest-eigenvalue components.
Partial Least Squares
• This technique also constructs a set of linear combinations of the xj's for regression, but unlike principal components regression, it uses y (in addition to X) for this construction.
• We assume that y is centered (and the xj's standardized) and begin by computing the univariate regression coefficient φ̂j of y on each xj
• From these we construct the derived input z1 = Σj φ̂j·xj, which is the first partial least squares direction.
• The outcome y is regressed on z1, giving coefficient θ̂1, and then we orthogonalize y, x1, …, xp with respect to z1: r1 = y − θ̂1·z1, and
  xℓ* = xℓ − θ̂ℓ·z1,  where θ̂ℓ is the coefficient of xℓ regressed on z1
• We continue this process until J directions are obtained.
Partial Least Squares (Cont’d)
• In this manner, partial least squares produces a sequence of derived inputs or directions z1, z2, …, zJ.
• As with principal components regression, if we continue on to construct J = p new directions, we get back the ordinary least squares estimates; use of J < p directions produces a reduced regression
• Notice that in the construction of each zj, the inputs are weighted by the strength of their univariate effect on y  (a sketch of the first direction follows below)
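
A sketch, not from the slides, of the first partial least squares direction and the orthogonalization step (synthetic standardized data; the names phi and theta follow the slide's description):

import numpy as np

rng = np.random.default_rng(12)
N, p = 100, 5
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardized inputs (as the slides assume)
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=N)
y = y - y.mean()                                # centered response

# First PLS direction: weight each input by its univariate regression coefficient on y
phi = X.T @ y / np.sum(X ** 2, axis=0)          # phi_j = <x_j, y> / <x_j, x_j>
z1 = X @ phi
theta1 = (z1 @ y) / (z1 @ z1)                   # regress y on z1

# Orthogonalize the inputs (and y) with respect to z1 before finding the next direction
X1 = X - np.outer(z1, (z1 @ X) / (z1 @ z1))
r1 = y - theta1 * z1
print(theta1)
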
Example – Cancer Data (Cont’d)
Ridge vs. PCA vs. PLS vs. Lasso
• Recent studies have shown that ridge and PCR outperform PLS in prediction, and they are simpler to understand.
• The lasso outperforms ridge when there are a moderate number of sizable effects, rather than many small effects. It also produces more interpretable models.
• These are still topics of ongoing research.
PCA – Singular Value Decomposition (SVD)
• The SVD of the N×p matrix X has the form X = UDVᵀ
• U and V are N×p and p×p orthogonal matrices:
  – the columns of U span the column space of X
  – the columns of V span the row space
  – D is a p×p diagonal matrix, with diagonal entries d1 ≥ d2 ≥ … ≥ dp ≥ 0 called the singular values of X
• The SVD always exists, and is unique up to signs
• The columns of V are the principal component directions, and the principal components are zj = X·vj = uj·dj