
Multiple Linear Regression
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/teaching.html
1 / 42
Passenger car mileage
Consider the carmpg dataset taken from here.
Variables:
make.model – Make and model
vol – Cubic feet of cab space
hp – Engine horsepower
mpg – Average miles per gallon
sp – Top speed (mph)
wt – Vehicle weight (100 lb)
Goal: Predict a car’s gas consumption based on its characteristics.
Graphics: pairwise scatterplots and possibly individual boxplots.
To [R].
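Below is a minimal R sketch of these graphics. The file name carmpg.csv and the read.csv() call are assumptions about how the data are stored; the column names are the ones listed above.

  # Load the data (assumed here to be stored as a CSV file)
  carmpg <- read.csv("carmpg.csv")

  # Pairwise scatterplots of the numeric variables
  pairs(carmpg[, c("vol", "hp", "mpg", "sp", "wt")])

  # Individual boxplots
  for (v in c("vol", "hp", "mpg", "sp", "wt")) boxplot(carmpg[[v]], main = v)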
2 / 42
Scatterplot highlights
hp, sp and wt seem correlated.
vol and wt seem correlated.
Highly correlated predictors can lead to a misinterpretation of the fit.
mpg appears to depend on hp and sp in a nonlinear, roughly polynomial way.
mpg versus wt shows a fan shape indicating unequal variances.
3 / 42
Least squares regression
We fit a linear model:
mpg = β0 + β1 vol + β2 hp + β3 sp + β4 wt
With data {(xi, yi) : i = 1, . . . , n}, where xi = (xi,1, . . . , xi,p), we fit:
yi = β0 + β1 xi,1 + · · · + βp xi,p + εi
Standard assumptions: the measurement errors are i.i.d. normal with mean zero, i.e.
ε1, . . . , εn ∼ N(0, σ²), i.i.d.
The least squares fit minimizes the error sum of squares:
SSE(β0, β1, . . . , βp) = Σ_{i=1}^n (yi − β0 − β1 xi,1 − · · · − βp xi,p)²
Under the standard assumptions, this fit coincides with the maximum likelihood estimate (MLE).
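In R, this fit is obtained with lm(); a minimal sketch, assuming the carmpg data frame loaded earlier:

  # Least squares fit of mpg on the four predictors
  fit <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  summary(fit)   # estimates, standard errors, t-tests, R^2, overall F-test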
4 / 42
Fitted values, residuals and standard error
The fitted (predicted) values are defined as:
ŷi = β̂0 + β̂1 xi,1 + · · · + β̂p xi,p
The residuals are defined as:
ei = yi − ŷi
Our estimate of σ² is the mean squared error of the fit:
σ̂² = (1/(n − p − 1)) Σ_{i=1}^n (yi − ŷi)² = (1/(n − p − 1)) Σ_{i=1}^n ei²
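Continuing the sketch above, these quantities come straight out of the fitted lm object:

  fit  <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  yhat <- fitted(fit)               # fitted values
  e    <- residuals(fit)            # residuals
  sigma(fit)^2                      # estimate of sigma^2
  sum(e^2) / df.residual(fit)       # the same, computed from the residuals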
5 / 42
Matrix interpretation
Let X be the n × (p + 1) matrix with row vectors {(1, xi) : i = 1, . . . , n}.
Define y = (y1, . . . , yn), ε = (ε1, . . . , εn) and β = (β0, β1, . . . , βp).
The model is:
y = Xβ + ε
Then the least squares estimate β̂ minimizes
‖y − Xβ‖²
If the columns of X are linearly independent, then
β̂ = (XᵀX)⁻¹ Xᵀ y
Note that the estimate is linear in the response.
Define the hat matrix
H = X(XᵀX)⁻¹ Xᵀ
H is the orthogonal projection onto span(X).
The fitted values may be expressed as
ŷ = Xβ̂ = X(XᵀX)⁻¹ Xᵀ y = Hy
The residuals may be expressed as
e = y − ŷ = (I − H)y
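For illustration only, here is a sketch of the same computation done by hand with the design matrix (lm() itself uses a numerically more stable QR decomposition):

  fit <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  X <- model.matrix(fit)                    # n x (p+1) matrix, first column all 1s
  y <- carmpg$mpg

  beta_hat <- solve(t(X) %*% X, t(X) %*% y) # (X'X)^{-1} X'y
  H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix
  yhat <- H %*% y                           # fitted values
  e <- y - yhat                             # residuals
  cbind(beta_hat, coef(fit))                # agree up to numerical error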
6 / 42
Distributions
Assume that the εi's are i.i.d. normal with mean zero.
Define M = (XᵀX)⁻¹ ∈ R^((p+1)×(p+1)).
β̂ = (β̂0, . . . , β̂p) has the multivariate normal distribution with mean β (unbiased) and covariance matrix σ²M.
I.e. the β̂j's are jointly normal with
E β̂j = βj,   cov(β̂j, β̂k) = σ² Mjk
For σ̂², we have
σ̂² ∼ (σ²/(n − p − 1)) χ²_{n−p−1}
β̂ and σ̂² are independent.
7 / 42
t-Tests
Consequently,
(β̂j − βj) / √(σ̂² Mjj) ∼ t_{n−p−1}
In general, if c = (c0, c1, . . . , cp)ᵀ ∈ R^(p+1), then
(cᵀβ̂ − cᵀβ) / se(cᵀβ̂) ∼ t_{n−p−1},   where se(cᵀβ̂) = σ̂ √(cᵀMc)
In particular, the following is a level-(1 − α) confidence interval for β0 + β1 x1 + · · · + βp xp = βᵀx, where x = (1, x1, . . . , xp):
β̂ᵀx ± t^{α/2}_{n−p−1} se(β̂ᵀx)
To [R].
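In R, confint() gives the t-based intervals for the individual coefficients, and predict() gives the interval for βᵀx at a chosen x; the new-car values below are invented purely for illustration:

  fit <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  confint(fit, level = 0.95)             # intervals for the individual beta_j

  newcar <- data.frame(vol = 100, hp = 120, sp = 110, wt = 30)
  predict(fit, newdata = newcar, interval = "confidence", level = 0.95)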
8 / 42
Confidence bands / regions
Using the fact that
(β̂ − β)ᵀ XᵀX (β̂ − β) / ((p + 1) σ̂²) ∼ F_{p+1,n−p−1}
the following defines a level-(1 − α) confidence region (an ellipsoid) for β:
(β̂ − β)ᵀ XᵀX (β̂ − β) ≤ (p + 1) σ̂² F^{α}_{p+1,n−p−1}
Using the same fact, Scheffé's S-method gives that
cᵀβ̂ ± ((p + 1) F^{α}_{p+1,n−p−1})^{1/2} se(cᵀβ̂)
is a level-(1 − α) confidence interval for cᵀβ simultaneously for all c ∈ R^(p+1).
As a special case, we obtain the following confidence band:
β̂ᵀx ± ((p + 1) F^{α}_{p+1,n−p−1})^{1/2} se(β̂ᵀx)
9 / 42
Comparing models
Say we want to test
H0 : β1 = · · · = βp = 0
H1 : βj ≠ 0, for some j = 1, . . . , p
Same as
H0 : yi = β0 + εi
H1 : yi = β0 + β1 xi,1 + · · · + βp xi,p + εi
The model under H0 is a submodel of the model under H1.
10 / 42
Analysis of variance
The residual sum of squares under H0 is
SSY = Σ_{i=1}^n (yi − ȳ)²
It has n − 1 degrees of freedom.
The residual sum of squares under H1 is
SSE = Σ_{i=1}^n (yi − ŷi)²
It has n − p − 1 degrees of freedom.
The sum of squares due to regression is
SSreg = SSY − SSE = Σ_{i=1}^n (ŷi − ȳ)²
It has p degrees of freedom.
The ANOVA F-test rejects H0 for large values of
F = (SSreg/p) / (SSE/(n − p − 1))
Under H0,
F ∼ F_{p,n−p−1}
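In R the overall F-test is reported by summary(); it can also be reproduced as an anova() comparison of the two models above, as in this sketch:

  fit0 <- lm(mpg ~ 1, data = carmpg)                    # model under H0
  fit1 <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)   # model under H1
  anova(fit0, fit1)            # ANOVA table with the F-statistic and p-value
  summary(fit1)$fstatistic     # the same F-statistic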
11 / 42
Coefficient of (multiple) determination
Defined as
R² = 1 − SSE/SSY
Interpreted as the fraction of variation in y explained by x.
R² always increases when variables are added.
The adjusted R² incorporates the number of predictors:
Ra² = 1 − (SSE/(n − p − 1)) / (SSY/(n − 1))
12 / 42
Testing whether a subset of the variables are zero
Consider a subset of variables J ⊂ {1, . . . , p}.
We want to test
H0 : ∀ j ∈ J, βj = 0
H1 : ∃ j ∈ J, βj ≠ 0
Same as
H0 : yi = β0 + Σ_{j ∉ J} βj xi,j + εi
H1 : yi = β0 + Σ_{j=1}^p βj xi,j + εi
The model under H0 is a submodel of the model under H1.
13 / 42
Analysis of Variance
Let SSE_J be the residual sum of squares (RSS) for the model under H0.
Let SSE be the RSS for the model under H1.
The ANOVA F-test rejects H0 for large values of
F = ((SSE_J − SSE)/|J|) / (SSE/(n − p − 1))
Under the null,
F ∼ F_{|J|,n−p−1}
In particular, testing βj = 0 versus βj ≠ 0 using the F-test above is equivalent to using the t-test described earlier.
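In R this partial F-test is again an anova() comparison of nested fits; in the sketch below J = {vol, sp} is chosen only as an example:

  fit_J <- lm(mpg ~ hp + wt, data = carmpg)             # model under H0 (vol, sp dropped)
  fit   <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)  # model under H1
  anova(fit_J, fit)    # F with |J| = 2 and n - p - 1 degrees of freedom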
14 / 42
Testing whether a given set of linear combinations are zero
Consider a full-rank q-by-(p + 1) matrix A.
We want to test
H0 : Aβ = 0
H1 : Aβ ≠ 0
Writing A = (A1, A2) with A2 q-by-q and invertible, and partitioning X = (X1, X2) and β = (β1, β2) conformably, we are comparing the two models
H0 : y = (X1 − X2 A2⁻¹ A1) β1 + ε
H1 : y = Xβ + ε
The model under H0 is a submodel of the model under H1.
15 / 42
Analysis of Variance
Let SSE_A be the RSS for the model under H0.
Let SSE be the RSS for the model under H1.
The ANOVA F-test rejects H0 for large values of
F = ((SSE_A − SSE)/rank(A)) / (SSE/(n − p − 1))
Under the null,
F ∼ F_{rank(A),n−p−1}
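One way to carry this out in R is linearHypothesis() from the add-on car package (not part of the original slides); a sketch testing the single linear constraint β_hp = β_sp:

  library(car)   # provides linearHypothesis()
  fit <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  linearHypothesis(fit, "hp - sp = 0")   # H0: A beta = 0 with A = (0, 0, 1, -1, 0)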
16 / 42
Spreading
Spreading the variables often equalizes the leverage of the data points and improves the fit, both numerically
(e.g. R2 ) and visually.
A typical situation is an accumulation of data points near 0, in which case applying a logarithm or a square
root helps spread the data more evenly.
Example: mammals dataset.
17 / 42
Standardized variables
Standardizing the variables removes the unit and makes comparing the magnitude of the coefficients easier.
One way to do so is to make all the variables have mean 0 and unit length:
y ← (y − ȳ)/√SSY   and   xj ← (xj − x̄j)/√SSXj
An intercept is not needed anymore.
Standardizing the variables changes the estimates β̂j, but not σ̂², the multiple R², or the p-values of the tests we saw.
18 / 42
Checking assumptions
Can we rely on those p-values?
In other words, are the assumptions correct?
1. Mean zero, i.e. model accuracy
2. Equal variances, i.e. homoscedasticity
3. Independence
4. Normality
19 / 42
Residuals
We look at the residuals as substitutes for the errors.
Remember that e = (e1 , . . . , en ), with
e = y − ŷ = y − Xβ̂ = (I − H)y
Under the standard assumptions, e is multivariate normal with mean zero and covariance matrix σ²(I − H).
In particular,
var(ei) = σ²(1 − hii) for all i,   cov(ei, ej) = −σ² hij for all i ≠ j
20 / 42
Standardized residuals
In practice, the residuals are often corrected for unequal variance.
Internally studentized residuals (what R uses):
ri = ei / (σ̂ √(1 − hii))
Note that:
ri²/(n − p − 1) ∼ Beta(1/2, (n − p − 2)/2),   so that ri is approximately distributed as t_{n−p−1}
Externally studentized residuals:
ti = ei / (σ̂(i) √(1 − hii))
where σ̂(i) comes from the fit with the ith observation removed.
Note that:
ti ∼ t_{n−p−2}
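Both kinds of studentized residuals are built into R; a short sketch:

  fit <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  ri <- rstandard(fit)    # internally studentized residuals
  ti <- rstudent(fit)     # externally studentized residuals
  plot(fitted(fit), ri); abline(h = 0, lty = 2)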
21 / 42
Checking model accuracy
With several predictors, the residuals-vs-fitted-values plot may be misleading. Instead, we look at partial
residual plots, which focus on one variable at a time.
Say we want to look at the influence of predictor Xj = (x1,j , . . . , xn,j ) (the jth column of X) on the response
y.
Added variable plots take into account the influence of the other predictors.
1. Regress y on all the predictors excluding Xj , obtaining residuals e(j) .
2. Regress Xj on all the predictors excluding Xj , obtaining residuals X(j) .
3. Plot e(j) versus X(j) (a code sketch follows this list).
Component-plus-residual plots are an alternative: plot e + β̂j Xj versus Xj, where e is the vector of residuals from the full fit.
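Here is a sketch of the added variable plot for wt built by hand, following the three steps above; the avPlots() and crPlots() functions in the car package produce added-variable and component-plus-residual plots directly.

  # Step 1: residuals of mpg on all predictors except wt
  e_y <- residuals(lm(mpg ~ vol + hp + sp, data = carmpg))
  # Step 2: residuals of wt on the same predictors
  e_x <- residuals(lm(wt ~ vol + hp + sp, data = carmpg))
  # Step 3: plot one against the other
  plot(e_x, e_y, xlab = "wt | others", ylab = "mpg | others")
  abline(lm(e_y ~ e_x))   # the slope equals the coefficient of wt in the full fit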
22 / 42
Checking homoscedasticity
Partial residual plots may help assess whether the errors have equal variance or not.
A fan-shape in the plot is an indication that the errors do not have equal variances.
23 / 42
Checking normality
A q-q plot of the standardized residuals is used.
24 / 42
Checking for independence
Checking for independence often requires defining a dependency structure explicitly and testing against that.
(For otherwise we would need a number of replicates of the data, i.e. multiple samples of the same size.)
If dependency is expected, then it is preferable to fit some time series model (for the errors). In that context,
we could use something like the Durbin-Watson test to assess whether the errors are autocorrelated.
25 / 42
Outliers
Outliers are points that are unusual compared to the bulk of the observations.
An outlier in predictor is a data point (xi , yi ) such that xi is away from the bulk of the predictor vectors.
Such a point has high leverage.
An outlier in response is a data point (xi , yi) such that yi is away from the trend implied by the other
observations.
A point that is both an outlier in predictor and response is said to be influential.
26 / 42
Detecting outliers
To detect outliers in predictor, plotting the hat values hii may be useful. (hii > 2p/n is considered suspect.)
To detect outliers in response, plotting the residuals may be useful.
27 / 42
Leave-one-out diagnostics
If βb(i) is the least squares coefficient vector computed with the ith case deleted, then
β̂ − β̂(i) = (XᵀX)⁻¹ xi ei / (1 − hii)
Define
ŷ(i) = Xβ̂(i)
In particular, ŷ(i),i is the value at xi predicted by the model fitted without the observation (xi, yi).
28 / 42
Cook’s distances
Di = ‖ŷ − ŷ(i)‖² / ((p + 1) σ̂²)   (change in fitted values)
   = (β̂ − β̂(i))ᵀ XᵀX (β̂ − β̂(i)) / ((p + 1) σ̂²)   (change in coefficients)
   = ri² hii / ((p + 1)(1 − hii))   (combination of residuals and hat values)
Di > F^{(.05)}_{p+1,n−p−1} is considered suspect.
To [R].
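In R, leverages and Cook's distances are available directly; a sketch using the cutoffs quoted above:

  fit <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  h <- hatvalues(fit)         # leverages h_ii
  D <- cooks.distance(fit)    # Cook's distances D_i

  p <- length(coef(fit)) - 1  # number of predictors
  n <- nrow(carmpg)
  which(h > 2 * p / n)                    # suspect leverages
  which(D > qf(0.95, p + 1, n - p - 1))   # suspect Cook's distances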
29 / 42
DFBETAS and DFFITS
DFBETAS:
DFBETAS_{j(i)} = (β̂j − β̂j(i)) / √(σ̂(i)² [(XᵀX)⁻¹]jj)
|DFBETAS_{j(i)}| > 2/√n is considered suspect.
DFFITS:
DFFITS_i = (ŷi − ŷ(i),i) / √(σ̂(i)² hii) = ti √(hii / (1 − hii))
|DFFITS_i| > 2 √(p/n) is considered suspect.
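Both diagnostics are built into R; a sketch using the cutoffs above:

  fit <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  dfb <- dfbetas(fit)   # one column per coefficient
  dff <- dffits(fit)

  n <- nrow(carmpg)
  p <- length(coef(fit)) - 1
  which(apply(abs(dfb), 1, max) > 2 / sqrt(n))   # suspect DFBETAS
  which(abs(dff) > 2 * sqrt(p / n))              # suspect DFFITS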
30 / 42
Multicollinearity
When the predictors are nearly linearly dependent, several issues arise:
1. Interpretation is difficult.
2. The estimates have large variances:
var(β̂j) = σ² / ((1 − Qj²) SSXj)
where Qj² is the R² obtained when regressing Xj on the other predictors Xk, k ≠ j.
3. The fit is numerically unstable since XT X is almost singular.
31 / 42
Correlation between predictors
To detect multicollinearity, we look for large correlations between predictors.
For that we compute the correlation matrix of the predictors, with coefficients cor(Xj , Xk ).
To [R].
32 / 42
Variance Inflation Factors
Alternatively, one can look at the variance inflation factors:
VIFj = 1 / (1 − Qj²)
If the predictors are standardized to unit norm, XᵀX is a correlation matrix and we have:
VIF = diag((XᵀX)⁻¹)
VIFj > 10 (same as Q2j > 90%) is considered suspect.
To [R].
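A sketch computing the VIFs from the definition; the vif() function in the car package gives the same numbers for this model:

  predictors <- c("vol", "hp", "sp", "wt")
  vifs <- sapply(predictors, function(v) {
    # regress predictor v on the remaining predictors and record R^2
    Q2 <- summary(lm(reformulate(setdiff(predictors, v), response = v),
                     data = carmpg))$r.squared
    1 / (1 - Q2)
  })
  vifs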
33 / 42
Condition Indices
Another option is to examine the condition indices of XᵀX:
κj = λmax / λj
where the λj's are the eigenvalues of XᵀX.
κj > 1000 is considered suspect.
λmax/λmin is called the condition number of the matrix XᵀX.
To [R].
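A sketch computing the condition indices from the eigenvalues of XᵀX; whether to use the raw or the standardized design matrix is a choice, and the raw matrix is used here:

  X <- model.matrix(~ vol + hp + sp + wt, data = carmpg)
  ev <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values
  max(ev) / ev       # condition indices kappa_j
  max(ev) / min(ev)  # condition number of X'X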
34 / 42
Dealing with questionable assumptions
If some of the assumptions are doubtful, it is important to correct the model, for otherwise we cannot rely on our results (estimates and tests).
35 / 42
Dealing with curvature
When the model accuracy is questionable, we can:
1. Transform the variables and/or the response, which often involves guessing. This is particularly useful to
spread the variables.
2. Augment the model by adding other terms, perhaps with interactions. This is often used in conjunction
with model selection so as to avoid overfitting.
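For instance, since mpg looked roughly polynomial in hp and sp in the scatterplots, one might try adding quadratic terms; the augmented model below is only one possibility, not necessarily the one pursued in the slides:

  fit  <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
  fit2 <- lm(mpg ~ vol + poly(hp, 2) + poly(sp, 2) + wt, data = carmpg)
  anova(fit, fit2)              # does the augmentation help?
  summary(fit2)$adj.r.squared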
36 / 42
Dealing with heteroscedasticity
When homoscedasticity is questionable, we can:
1. Apply a variance stabilizing transformation to the response.
2. Fit the model by weighted least squares.
Both involve guessing at how the variance depends on the variables.
37 / 42
Variance stabilizing transformations
Here are a few common transformations of the response used to stabilize the variance:
σ² ∝ E(yi) (Poisson):              change to √y
σ² ∝ E(yi)(1 − E(yi)) (binomial):  change to sin⁻¹(√y)
σ² ∝ E(yi)²:                       change to log(y)
σ² ∝ E(yi)³:                       change to 1/√y
σ² ∝ E(yi)⁴:                       change to 1/y
In general, we want a transformation ψ such that
ψ′(E yi)² var(yi) ∝ 1
38 / 42
Weighted least squares
Suppose
yi = β0 + βᵀxi + εi,   εi ∼ N(0, σi²)
Maximum likelihood estimation of β0, β corresponds to the weighted least squares solution that minimizes
SSE(β0, β) = Σ_{i=1}^n wi (yi − β0 − βᵀxi)²,   with weights wi = 1/σi²
To determine appropriate weights, different approaches are used:
– Guess how σi² varies with xi, i.e. the shape of var(yi | xi).
– Assume a parametric model for the weights and use maximum likelihood estimation, solved by iteratively reweighted least squares.
Using weights is helpful in situations where some observations are more reliable than others, mostly because they are less variable.
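In R, weighted least squares is the weights argument of lm(); in the sketch below the error standard deviation is guessed, purely for illustration, to be proportional to wt, so wi ∝ 1/wt²:

  # Guess: sd(y_i | x_i) proportional to wt, hence weights 1 / wt^2
  fit_wls <- lm(mpg ~ vol + hp + sp + wt, data = carmpg, weights = 1 / wt^2)
  summary(fit_wls)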
39 / 42
Generalized least squares
Weighted least squares assumes that the errors are uncorrelated.
Generalized least squares assumes a more general form for the covariance of the errors, namely σ²V, where V is usually known.
Assuming both the design matrix X and the covariance matrix V are full rank, the maximum likelihood estimate is
β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹ y
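A sketch of this formula computed directly with matrices, for an illustrative V (an AR(1)-type correlation matrix chosen here only as an example); the gls() function in the nlme package fits such models more conveniently:

  X <- model.matrix(~ vol + hp + sp + wt, data = carmpg)
  y <- carmpg$mpg
  n <- nrow(X)

  rho <- 0.5
  V <- rho^abs(outer(1:n, 1:n, "-"))     # example covariance structure
  Vinv <- solve(V)
  beta_gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
  beta_gls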
40 / 42
Dealing with outliers
We have identified an outlier. Was it recorded properly?
– No: correct it or remove it from the fit.
– Yes: decide whether to include it in the fit.
Genuine outliers carry information, so simply discarding them amounts to losing that information. However,
strong outliers can compromise the quality of the overall fit.
A possible option is to model outliers and non-outliers separately.
41 / 42
Dealing with Multicollinearity
If interpretation is the main goal, drop a few redundant predictors.
If prediction is the main goal, use a model selection procedure.
42 / 42