Multiple Regression

Objectives
y = β0 + β1X1 + ⋯ + βpXp
• Maximize the predictive power of the independent variables as
represented in the variate.
• Compare two or more sets of independent variables to ascertain the
predictive power of each variate
Explanation
• The most direct interpretation of the regression variate is a
determination of the relative importance of each independent
variable in the prediction of the dependent measure.
• Assess the nature of the relationships between the independent
variables and the dependent variable.
• Provide insight into the relationships among independent variables.
[Figure: the regression variate, with independent variables X1, X2 and X3 combining to predict Y′.]
Sample Problem (Leslie Salt Property): Finding the Fair Price of Land
Variable Name   Description
PRICE           Sale price in $000 per acre
COUNTY          San Mateo = 0, Santa Clara = 1
SIZE            Size of the property in acres
ELEVATION       Average elevation in feet above sea level
SEWER           Distance (in feet) to nearest sewer connection
DATE            Date of sale counting backward from current time (in months)
FLOOD           Subject to flooding by tidal action = 1; otherwise = 0
DISTANCE        Distance in miles from Leslie property
The data (31 observations):

 PRICE  COUNTY    SIZE  ELEVATION  SEWER  DATE  FLOOD  DISTANCE
   4.5       1   138.4         10   3000  -103      0       0.3
  10.6       1      52          4      0  -103      0       2.5
   1.7       0    16.1          0   2640   -98      1      10.3
     5       0  1695.2          1   3500   -93      0        14
     5       0     845          1   1000   -92      1        14
   3.3       1     6.9          2  10000   -86      0         0
   5.7       1   105.9          4      0   -68      0         0
   6.2       1    56.6          4      0   -64      0         0
  19.4       1    51.4         20   1300   -63      0       1.2
   3.2       1    22.1          0   6000   -62      0         0
   4.7       1    22.1          0   6000   -61      0         0
   6.9       1    27.7          3   4500   -60      0         0
   8.1       1    18.6          5   5000   -59      0       0.5
  11.6       1    69.9          8      0   -59      0       4.4
  19.3       1   145.7         10      0   -59      0       4.2
  11.7       1    77.2          9      0   -59      0       4.5
  13.3       1    26.2          8      0   -59      0       4.7
  15.1       1   102.3          6      0   -59      0       4.9
  12.4       1    49.5         11      0   -59      0       4.6
  15.3       1    12.2          8      0   -59      0         5
  12.2       0   320.6          0   4000   -54      0      16.5
  18.1       1     9.9          5      0   -54      0       5.2
  16.8       1    15.3          2      0   -53      0       5.5
   5.9       0    55.2          0   1320   -49      1      11.9
     4       0   116.2          2    900   -45      1       5.5
  37.2       0      15          5      0   -39      0       7.2
  18.2       0    23.4          5   4420   -39      0       5.5
  15.1       0   132.8          2   2640   -35      0      10.2
  22.9       0      12          5   3400   -16      0       5.5
  15.2       0      67          2    900    -5      1       5.5
  21.9       0    30.8          2    900    -4      0       5.5

[Figure: histogram of PRICE (leslie_salt[, 1]), frequency scale, and histogram of Y = log(PRICE), density scale.]
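A short R sketch that loads the data and draws the two histograms; the file name leslie_salt.csv is only an assumption for illustration:

# load the Leslie Salt data (file name assumed for illustration)
leslie_salt <- read.csv("leslie_salt.csv")

# histogram of the raw sale price (column 1 = PRICE), frequency scale
hist(leslie_salt[, 1], main = "Histogram of leslie_salt[, 1]", xlab = "leslie_salt[, 1]")

# histogram of the log price Y, density scale
Y <- log(leslie_salt[, 1])
hist(Y, freq = FALSE, main = "Histogram of Y", xlab = "Y")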
Correlation matrix:

              PRICE   COUNTY     SIZE ELEVATION    SEWER     DATE    FLOOD DISTANCE
PRICE       100.00%  -18.22%  -23.97%   35.18%  -39.12%   59.47%  -32.31%    9.33%
COUNTY      -18.22%  100.00%  -33.94%   47.52%   -5.00%  -36.98%  -55.18%  -74.22%
SIZE        -23.97%  -33.94%  100.00%  -20.95%    5.34%  -34.95%   10.89%   55.69%
ELEVATION    35.18%   47.52%  -20.95%  100.00%  -35.94%   -5.65%  -37.31%  -36.25%
SEWER       -39.12%   -5.00%    5.34%  -35.94%  100.00%  -15.15%  -11.31%  -15.87%
DATE         59.47%  -36.98%  -34.95%   -5.65%  -15.15%  100.00%    1.54%    4.44%
FLOOD       -32.31%  -55.18%   10.89%  -37.31%  -11.31%    1.54%  100.00%   42.33%
DISTANCE      9.33%  -74.22%   55.69%  -36.25%  -15.87%    4.44%   42.33%  100.00%
[Figure: scatterplot matrix (pairs plot) of PRICE, COUNTY, SIZE, ELEVATION, SEWER, DATE, FLOOD and DISTANCE.]
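A sketch of how the correlation matrix (in percent) and the scatterplot matrix above can be produced, assuming leslie_salt holds the eight variables in the order shown:

# correlation matrix, expressed in percent and rounded to two decimals
round(100 * cor(leslie_salt), 2)

# scatterplot matrix of all pairs of variables
pairs(leslie_salt)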
log(PRICE) = b0 + b1 ELEVATION + b2 SEWER + b3 DATE + b4 FLOOD + ε
summary(model)
Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[,
5] + leslie_salt[, 6])
Residuals:
Min 1Q Median 3Q Max
-9.6076 -3.2506 -0.0281 2.8770 20.2776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.2787636 2.9203157 7.286 7.75e-08 ***
leslie_salt[, 4] 0.5614588 0.2515472 2.232 0.034107 *
leslie_salt[, 5] -0.0005871 0.0004460 -1.316 0.199129
leslie_salt[, 6] 0.1836824 0.0421712 4.356 0.000172 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.559 on 27 degrees of freedom
Multiple R-squared: 0.5327, Adjusted R-squared: 0.4807
F-statistic: 10.26 on 3 and 27 DF, p-value: 0.000111
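The summary above comes from a model object built along these lines (column 1 = PRICE, 4 = ELEVATION, 5 = SEWER, 6 = DATE):

model <- lm(leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[, 5] + leslie_salt[, 6])
summary(model)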
Assumptions
• Linearity of the dependent variable in terms of independent
variables.
• y = β0 + β1X1² + ⋯
• y = β0 + β1e^X1 + ⋯
• y = β0 + β1sin(X1) + ⋯
[Figure: scatter plot of y against X1 showing a nonlinear pattern.]
Linearity (cts.)
A higher-order term of an independent variable may need to be included.
In that case, define a new variable by taking the square (in this case) of that independent variable and use the squared values in the regression.
Use: visual inspection.
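For example, a quadratic term for SIZE could be added as follows; SIZE is chosen purely for illustration:

# define the squared variable and include it alongside the original
leslie_salt$SIZE2 <- leslie_salt$SIZE^2
model_sq <- lm(PRICE ~ SIZE + SIZE2, data = leslie_salt)
summary(model_sq)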
More troublesome is the MODERATOR effect
• If an independent-dependent variable relationship is affected by another independent variable, this situation is termed a moderator effect.
• The most common moderator effect in multiple regression is the bilinear moderator, in which the slope of the relationship of one independent variable (X1) changes across values of the moderator variable (X2).
Example
Family income (X2) can be a positive moderator of the relationship between family size (X1) and credit card usage (Y). Then the expected change in credit card usage per unit of family size (β1) might be lower for families with low incomes and higher for families with high incomes.
Without the moderator effect we are assuming that family size has a constant effect on credit card usage.
Adding the Moderator Effect
The idea comes from observing a self-moderator effect: if a variable has a moderator effect on itself, then we would assume a nonlinear (second-degree) relationship with the dependent variable.
Thus, if there is a moderator effect, add X1X2 as an independent variable to the regression equation. But we will return to this later!
Assumption:
Homoscedasticity
Constant variance of the error terms.
• Heteroscedasticity (cts.)
  – in residuals
  – within variables
[Figure: example plots of heteroscedastic patterns in residuals and within variables.]
Heteroscedasticity (cts.)
• Use: Levene test.
Levene test: tests the equality of variances.
Levene's test works by testing the null hypothesis that the variances of the groups are equal. If the reported p-value is smaller than a selected significance level (usually 5%), the hypothesis of equal variances is rejected and the differences are considered too great to usefully apply parametric tests that assume equal variances.
In SPSS it is reported.
In R: in the «lawstat» library use the levene.test() function.
Use an F test for more than 2 groups…
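A minimal sketch, checking (purely for illustration) whether the residual variance differs between the two COUNTY groups:

library(lawstat)
# H0: the residual variance is the same in the two COUNTY groups
levene.test(residuals(model), group = factor(leslie_salt$COUNTY))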
Residuals vs Fitted
[Figure: residuals plotted against fitted values for lm(leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[, 5] + leslie_salt[, 6]).]
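The plot above is R's first built-in diagnostic plot for an lm object:

plot(model, which = 1)   # residuals vs fitted values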
Assumptions
• Independence of the error terms.
Check the coordinates!!!
Independence of Error Terms
• Use: Durbin-Watson
The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are uncorrelated if the Durbin-Watson statistic is approximately 2. A value close to 0 indicates strong positive correlation, while a value close to 4 indicates strong negative correlation.
• In SPSS Durbin Watson is reported.
• In R under «lmtest» library use dwtest()
dwtest(formula, order.by = NULL, alternative = c("greater", "two.sided",
"less"), iterations = 15, exact = NULL, tol = 1e-10, data = list())
For our regression model.
> dwtest(model)
Durbin-Watson test

data:  model
DW = 2.3762, p-value = 0.7783
alternative hypothesis: true autocorrelation is greater than 0
Assumptions
• Normality of the error term distribution.
[Figure: Normal Q-Q plot of standardized residuals against theoretical quantiles, and qqPlot(model) from the «car» library plotting studentized residuals against t quantiles.]
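A sketch of how both plots, plus an optional numeric check, can be produced:

plot(model, which = 2)          # normal Q-Q plot of standardized residuals

library(car)
qqPlot(model)                   # studentized residuals against t quantiles

shapiro.test(residuals(model))  # optional formal test of normality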
Diagnostics
Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 4] + leslie_salt[,
5] + leslie_salt[, 6])
Residuals:
Min 1Q Median 3Q Max
-9.6076 -3.2506 -0.0281 2.8770 20.2776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.2787636 2.9203157 7.286 7.75e-08 ***
leslie_salt[, 4] 0.5614588 0.2515472 2.232 0.034107 *
leslie_salt[, 5] -0.0005871 0.0004460 -1.316 0.199129
leslie_salt[, 6] 0.1836824 0.0421712 4.356 0.000172 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.559 on 27 degrees of freedom
Multiple R-squared: 0.5327, Adjusted R-squared: 0.4807
F-statistic: 10.26 on 3 and 27 DF, p-value: 0.000111
Identifying Influential Observations
• observations that lie outside the general patterns of the data set
• observations that strongly influence regression results
Types of Influential Observations
1. Outliers – observations that have large residuals (based on
dependent variables)
2. Leverage points – observations that are distinct from the remaining
observations based on their independent variable values.
3. Influential observations – including all observations that have a
disproportionate effect on the regression results.
Outliers
• Typical boxplot test.
• In «car» library
οƒ˜ outlierTest(model)
   rstudent unadjusted p-value Bonferonni p
2  3.704906          0.0010527     0.032634
Leverage
An observation with an extreme value on a predictor variable is called a
point with high leverage. Leverage is a measure of how far an IV deviates
from its mean. These leverage points can have an unusually large effect on
the estimate of regression coefficients. We hope to see very few (if any)
points in the plot representing high values of leverage. High leverage can also
point toward outliers, which are defined as observations with large residuals
in regression. You should say something about the number of cases that
appear to represent high leverage.
Leverage:
Cut-off point: 2(p + 1) / n
p: # of independent variables
n: # of observations
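A sketch applying this cut-off to the fitted model:

h <- hatvalues(model)            # leverage (hat value) of each observation
p <- length(coef(model)) - 1     # number of independent variables
n <- nrow(leslie_salt)           # number of observations
which(h > 2 * (p + 1) / n)       # observations with high leverage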
Leverage Plots
[Figure: added-variable (leverage) plots of leslie_salt[, 1] against leslie_salt[, 4], leslie_salt[, 6], leslie_salt[, 7] and leslie_salt[, 8], each adjusted for the other predictors.]
Cook's Distance:
Cut-off point: 4 / n
p: # of independent variables
n: # of observations
[Figure: cooks.distance(model) plotted against observation index.]
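A short sketch that plots Cook's distance and flags observations above the 4/n cut-off:

cd <- cooks.distance(model)
plot(cd, type = "h", xlab = "Index", ylab = "cooks.distance(model)")
abline(h = 4 / nrow(leslie_salt), lty = 2)   # 4/n cut-off
which(cd > 4 / nrow(leslie_salt))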
Influence Plot
R-Code
# Influential Observations
library(car)

# added variable plots
avPlots(model)

# Cook's D plot
# identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(leslie_salt)-length(model$coefficients)-2))
plot(model, which=4, cook.levels=cutoff)

# Influence Plot
influencePlot(model, id.method="identify",
              main="Influence Plot",
              sub="Circle size is proportional to Cook's Distance")
[Figure: influence plot of studentized residuals against hat-values, with circle size proportional to Cook's distance, and the Residuals vs Leverage plot with Cook's distance contours for the fitted model.]
Assessing Multicollinearity*****
A key issue in interpreting the regression variate is the correlation
among the independent variables.
Our task in a regression analysis includes the following:
1. Assess the degree of multicollinearity
2. Determine its impact on results
3. Apply the necessary remedies if needed
Assess the degree of multicollinearity
• The simplest and most obvious way: identify collinearity in the correlation matrix. Check for correlations above 90%.
• A direct measure of multicollinearity is tolerance (1/VIF):
• the amount of variability of the selected independent variable not explained by the other independent variables. Computation:
• Take each independent variable, treat it as the dependent variable, and regress it on the other independent variables; compute R².
• Tolerance is then 1 − R².
• For example, if the other variables explain 25% of an independent variable, then the tolerance of this variable is 75%. Tolerance should be more than 10%.
> 1/vif(model)
leslie_salt[, 4] leslie_salt[, 6] leslie_salt[, 7] leslie_salt[, 8]
       0.8081325        0.9959058        0.7650806        0.7715437
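The same tolerances can be computed by hand. A sketch for ELEVATION (column 4), assuming the model behind this output uses columns 4, 6, 7 and 8 as predictors:

# regress ELEVATION on the other three predictors and take 1 - R^2
aux <- lm(leslie_salt[, 4] ~ leslie_salt[, 6] + leslie_salt[, 7] + leslie_salt[, 8])
1 - summary(aux)$r.squared   # tolerance of ELEVATION, i.e. 1/VIF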
Further…
• See http://www.statmethods.net/stats/rdiagnostics.html for diagnostic tests with R.
Partial Correlation
• A partial correlation coefficient is a way of expressing the unique
relationship between the criterion and a predictor. Partial correlation
represents the correlation between the criterion and a predictor after
common variance with other predictors has been removed from both
the criterion and the predictor of interest.
# partial correlations from t statistics: r = t / sqrt(t^2 + residual df)
t.values <- model$coeff / sqrt(diag(vcov(model)))
partcorr <- sqrt((t.values^2) / ((t.values^2) + model$df.residual))
partcorr
leslie_salt[, 4] leslie_salt[, 6] leslie_salt[, 7] leslie_salt[, 8]
       0.6562662        0.8043296        0.6043579        0.5740840
Part (Semi-partial) Correlation
• A semipartial correlation coefficient represents the correlation between
the criterion and a predictor that has been residualized with respect to all
other predictors in the equation. Note that the criterion remains unaltered
in the semipartial. Only the predictor is residualized. After removing
variance that the predictor has in common with other predictors, the
semipartial expresses the correlation between the residualized predictor
and the unaltered criterion. An important advantage of the semipartial is
that the denominator of the coefficient (the total variance of the criterion,
Y) remains the same no matter which predictor is being examined. This
makes the semipartial very interpretable. The square of the semipartial can
be interpreted as the proportion of the criterion variance associated
uniquely with the predictor. It is also possible to use the semipartial to fully
deconstruct the variance components in a regression analysis.
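A sketch of the idea for a single predictor; ELEVATION, SEWER and DATE are used purely for illustration:

# residualize the predictor with respect to the other predictors
elev_res <- residuals(lm(leslie_salt[, 4] ~ leslie_salt[, 5] + leslie_salt[, 6]))

# semipartial correlation: unaltered criterion vs residualized predictor
sr <- cor(leslie_salt[, 1], elev_res)
sr^2   # proportion of criterion variance uniquely associated with ELEVATION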
Project (Step1):
Go to web page:
http://luna.cas.usf.edu/~mbrannic/files/regression/Partial.html
Replicate the results there using a dataset of your own. Be creative in
problem formulation. Data may be imaginary. Use at least 5
independent variables.
Comparing Regression Models
• In multiple regression the hardest problem is deciding which variables to enter into the equation, even after checking assumptions such as multicollinearity.
• Adjusted 𝑅2 is not a proper way of model comparison.
• Next we learn a better way.
Stepwise Regression
• Start with the most basic model. Pick your favourite independent
variable and construct the model. Test it.
Remember correlation matrix (price in logs)
              PRICE   COUNTY     SIZE ELEVATION    SEWER     DATE    FLOOD DISTANCE
PRICE       100.00%  -18.22%  -23.97%   35.18%  -39.12%   59.47%  -32.31%    9.33%
COUNTY      -18.22%  100.00%  -33.94%   47.52%   -5.00%  -36.98%  -55.18%  -74.22%
SIZE        -23.97%  -33.94%  100.00%  -20.95%    5.34%  -34.95%   10.89%   55.69%
ELEVATION    35.18%   47.52%  -20.95%  100.00%  -35.94%   -5.65%  -37.31%  -36.25%
SEWER       -39.12%   -5.00%    5.34%  -35.94%  100.00%  -15.15%  -11.31%  -15.87%
DATE         59.47%  -36.98%  -34.95%   -5.65%  -15.15%  100.00%    1.54%    4.44%
FLOOD       -32.31%  -55.18%   10.89%  -37.31%  -11.31%    1.54%  100.00%   42.33%
DISTANCE      9.33%  -74.22%   55.69%  -36.25%  -15.87%    4.44%   42.33%  100.00%
π‘π‘Ÿπ‘–π‘π‘’ = 𝛽0 + 𝛽1 π‘‘π‘Žπ‘‘π‘’
π‘π‘Ÿπ‘–π‘π‘’ = 𝛽0 + 𝛽1 π‘‘π‘Žπ‘‘π‘’
Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 6])
Residuals:
     Min       1Q   Median       3Q      Max
-1.12046 -0.34364  0.04853  0.39719  1.00081
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.322336 0.269975 12.306 4.9e-13 ***
leslie_salt[, 6] 0.018124 0.004257 4.257 0.000198 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5719 on 29 degrees of
freedom
Multiple R-squared: 0.3846, Adjusted R-squared: 0.3634
F-statistic: 18.12 on 1 and 29 DF, p-value: 0.0001982
Our focus is the improvement in RSS, so we need the residual sum of squares. It is not given in the summary report directly (SPSS reports it), but anova() provides it.
> anova(m1)
Analysis of Variance Table
Response: leslie_salt[, 1]
Df Sum Sq Mean Sq F value Pr(>F)
leslie_salt[, 6] 1 5.9282 5.9282 18.124 0.0001982 ***
Residuals        29 9.4858  0.3271
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• Now let's add another variable, say SEWER, and assume we have done all the testing.
Call:
lm(formula = leslie_salt[, 1] ~ leslie_salt[, 6] + leslie_salt[,
5])
Residuals:
     Min       1Q   Median       3Q      Max
-1.21681 -0.21980  0.08597  0.29875  0.81520
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.442e+00  2.442e-01  14.093 3.07e-14 ***
leslie_salt[, 6]  1.643e-02  3.841e-03   4.278 0.000199 ***
leslie_salt[, 5] -1.105e-04  3.797e-05  -2.910 0.007013 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.51 on 28 degrees of freedom
Multiple R-squared: 0.5275, Adjusted R-squared: 0.4937
F-statistic: 15.63 on 2 and 28 DF, p-value: 2.766e-05
Analysis of Variance Table
Response: leslie_salt[, 1]
Df Sum Sq Mean Sq F value Pr(>F)
leslie_salt[, 6] 1 5.9282 5.9282 22.7903 5.146e-05 ***
leslie_salt[, 5] 1 2.2024 2.2024 8.4671 0.007013 **
Residuals        28 7.2833  0.2601
---
How much improvement do we have?
Our aim is to check whether the improvement in RSS is statistically significant or not.
Define

F_improvement = [ (RSS_old − RSS_new) / Δ in # of independent variables ] / [ RSS_old / d.f._old ]

The numerator measures the average improvement in RSS as we add a new variable (we may add a bunch of new variables at once); the denominator scales the improvement with respect to the original model.
The degrees of freedom of the statistic are (1, degrees of freedom of the old model).
In our case
RSS_old = 9.4858
RSS_new = 7.2833
df_old = 29

F = (2.2025 / 1) / (9.4858 / 29) = 6.733487
F_crit(1, 29) = 4.182964
Since F > F_crit, the new model is superior: the improvement is statistically significant.
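The same computation in R, using the residual rows of the two ANOVA tables (m1 is the DATE-only model; m2 is an assumed name for the DATE + SEWER model):

rss_old <- anova(m1)["Residuals", "Sum Sq"]   # 9.4858
rss_new <- anova(m2)["Residuals", "Sum Sq"]   # 7.2833
df_old  <- anova(m1)["Residuals", "Df"]       # 29

F_improvement <- ((rss_old - rss_new) / 1) / (rss_old / df_old)   # about 6.73
qf(0.95, 1, df_old)                                               # critical value, about 4.18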
Back to the moderator effect.
To test the moderator effect we use
y = β0 + β1X1
as the simple model and
y = β0 + β1X1 + β2X1X2
as the extended model, and then decide accordingly.
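A hypothetical sketch with placeholder variables y, x1 and x2 in a data frame dat (none of these names come from the Leslie Salt data):

simple   <- lm(y ~ x1, data = dat)
extended <- lm(y ~ x1 + I(x1 * x2), data = dat)

# a significant F for the added interaction term supports a moderator effect
anova(simple, extended)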
Mediation
Project (Steps 2, 3 and 4):
• Find the best regression equation for your Project.
• Test moderator effects
• Test mediation effects.