PADM 692 | Data Analysis II
Session II
Linear Regression Diagnostics
March 17, 2012
University of La Verne
Soomi Lee, PhD
Copyright © by Soomi Lee
Do not copy or distribute without permission
Overview
1. Recap: multiple regression
2. Assumptions of the Classical Linear Regression Model (CLRM)
3. Most common problems in regression analysis
   1. Multicollinearity
   2. Omitted variable bias
   3. Heteroskedasticity
   4. Outliers
Recap: Multiple Regression
• Summary statistics
• Eyeball the relationship between your main IV(s) and DV
  – You cannot plot two IVs at once; show one IV against the DV at a time.
• Interpretation
  1. Individual coefficients (significance, magnitude)
     – Statistically significant?
     – How big is the effect?
     – Do not forget: “holding other variables constant”
  2. Overall model performance
     – Adjusted R2: how much of the variation in Y does your model explain?
     – F statistic (statistical significance of the model as a whole): are your independent variables jointly important?
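For those replicating this workflow outside SPSS, a minimal sketch in Python with statsmodels (the data file and column names here are hypothetical):

import pandas as pd
import statsmodels.api as sm

# Hypothetical data: one row per election, with growth, inflation, vote_share
df = pd.read_csv("elections.csv")

X = sm.add_constant(df[["growth", "inflation"]])  # add the intercept term
model = sm.OLS(df["vote_share"], X).fit()

print(model.summary())               # coefficients, standard errors, p-values
print(model.rsquared_adj)            # share of variation in Y the model explains
print(model.fvalue, model.f_pvalue)  # joint significance of the IVs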
Presidential Election Example
DV: Incumbent president’s vote share (%)

GROWTH        .705*** (.217)
INFLATION     -.478 (.339)
Constant      53.365*** (1.919)
N             23
Adjusted R2   .527
F             13.25***

Note: *** p<0.01; standard errors are in parentheses.

1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?
Abortion Rates Example
DV: abortion rate (per 1,000 women aged 15-44)

Religion      .0004 (.083)
Price         -.045** (.022)
Income        .002*** (.000)
Picket        -.109*** (.040)
Constant      -5.869 (9.182)
N             50
Adjusted R2   .498
F             13.162***

Note: ** p<0.05; *** p<0.01; standard errors are in parentheses.

1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?
British Crime Rates Example
DV: Crime rate (per 1,000 people)

Unemployment      5.352* (2.703)
Cars              -.052 (.036)
Police            -4.204 (7.546)
Young population  7.941*** (2.176)
Constant          .309 (36.312)
N                 42
Adjusted R2       .589
F                 15.705***

Note: * p<0.1; ** p<0.05; *** p<0.01; standard errors are in parentheses.

1. How many IVs?
2. Interpretation of each coefficient?
3. Overall model performance?
Concerns
• No mindless data crunching. Theory should guide you.
• Limitations of our dataset
  – Number of observations > number of IVs
  – External validity: generalization
  – Internal validity: are you measuring what you want to measure?
Break
Assumptions of Classical Linear Regression Model (CLRM)
1. Zero average of population errors
2. Equal variance (homoskedasticity)
3. No autocorrelation
4. No correlation between X and errors
5. No measurement errors
6. No model specification error
7. Normal distribution of the population error term
Break
Collinearity
• Collinearity: two independent variables are linearly related.
• Multicollinearity: more than two independent variables are a linear combination of one another.
High Multicollinearity
• High multicollinearity: one of the IVs is highly correlated with one or more of the other IVs.
• Why is it a problem?
  – Redundant information: we cannot separate the effects of X1 and X2 on the DV, so we are unable to estimate each marginal effect.
  – Why? It inflates the standard errors.
  – It is not a signal of a problem in our model specification.
  – OLS is still BLUE in the presence of multicollinearity.
High Multicollinearity
• How to detect
  1. Look at the significance of the individual independent variables and the overall model performance:
     • Adjusted R2 is high, but individual variables are not statistically significant.
  2. Examine correlations among the X variables by creating a correlation matrix (see the sketch below).
• Solutions
  1. Increase the sample size (not always feasible).
  2. Drop one of the variables.
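A correlation matrix takes one line with pandas; a minimal sketch, assuming a DataFrame whose IV column names (hypothetical) follow the British crime model:

import pandas as pd

df = pd.read_csv("crime.csv")  # hypothetical data file
ivs = ["cars", "police", "young_population", "unemployment"]

corr = df[ivs].corr()          # pairwise correlations among the IVs
print(corr.round(3))
# Flag pairs above the 0.8 rule of thumb (excluding the diagonal)
print((corr.abs() > 0.8) & (corr.abs() < 1.0))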
High Multicollinearity: Example
• British Crime Rates Model

Unemployment      5.352* (2.703)
CARS              -.052 (.036)
Police            -4.204 (7.546)
Young population  7.941*** (2.176)
Constant          .309 (36.312)
N                 42
Adjusted R2       .589
F                 15.705***

Note: * p<0.1; ** p<0.05; *** p<0.01; standard errors are in parentheses.

• CARS is not statistically significant. Possible multicollinearity?
• CARS may be highly correlated with other variables.
• What to do: create a correlation matrix.
Detecting High Multicollinearity
                  CARS    Police  Young pop.  Unemployment
CARS              1       -       -           -
Police            -0.639  1       -           -
Young population  -0.519  0.620   1           -
Unemployment      -0.575  0.810   0.598       1

• Cut-off (rule of thumb): 0.8
• CARS does not exceed a correlation coefficient of 0.8 with any other IV.
• It may still be correlated with a combination of variables.
• We also have a high correlation coefficient (0.810) between Police and Unemployment.
Multicollinearity Diagnostics in SPSS
• Request the display of Collinearity Statistics:
  Analyze → Regression → Linear → Statistics → Collinearity diagnostics
• SPSS will show “tolerance” and “VIF” along with the regression results.
Multicollinearity Diagnostics in SPSS
• Tolerance (between 0 and 1)
  – The proportion of variance in a predictor that cannot be accounted for by the other predictors: tolerance = 1 − Ri2, where Ri2 is the R2 from regressing predictor i on the other predictors.
  – Low tolerance → high multicollinearity
  – At 0.3 → pay attention
  – Less than 0.1 → indication of multicollinearity; likely to be a problem
• VIF: Variance Inflation Factor
  – VIF = 1/tolerance
  – High VIF → high multicollinearity
  – Greater than 10 → pay attention
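The same statistics outside SPSS, as a sketch using statsmodels’ variance_inflation_factor (data file and column names again hypothetical):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("crime.csv")  # hypothetical
X = sm.add_constant(df[["cars", "police", "young_population", "unemployment"]])

# VIF for each predictor (skip the constant); tolerance is its reciprocal
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
# Rule of thumb: VIF > 10 (tolerance < 0.1) is likely a problem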
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending.
H0: b1=0; H1: b1>0
1. Estimate the following model:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
2. Examine whether there is a problem of multicollinearity.
   1) Look at the overall model performance and the individual coefficients.
   2) Create a correlation matrix. Any high correlations?
   3) Compute tolerance and VIF. What do these values tell us about multicollinearity in this model? What do we do about it?
Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1=0; H1: b1>0
Hypothesis 2: H0: b2=0; H1: b2>0
1. Estimate the following model:
   Wage = a + b1(edu) + b2(experience) + e
2. Examine whether there is a problem of multicollinearity.
   1) Look at the overall model performance and the individual coefficients.
   2) Create a correlation matrix. Any high correlations?
   3) Compute tolerance and VIF. What do these values tell us about multicollinearity in this model? What do we do about it?
Break
Omitted Variable Bias
• Omitted variable bias: the exclusion of a relevant variable. (Including an irrelevant variable is also a specification error, but it costs efficiency rather than causing bias.)
• A model specification error
• If you have omitted variable bias, you violate the assumption that the regressors are uncorrelated with the errors.
Omitted Variable Bias
• Suppose our true model is:
  Yi = α + β1X1i + β2X2i + ui
• Our misspecified model excludes X2 (out of ignorance, or because X2 is unavailable):
  Yi = c + d1X1i + vi
• Now the error term of the misspecified model contains the effect of X2: vi = β2X2i + ui
• If X1 is correlated with X2, our estimate of d1 is biased (as the simulation below illustrates)!
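A small simulation sketch (all numbers invented for illustration; statsmodels assumed available):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=1)
n = 5000
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)                  # X1 correlated with X2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # true model

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
short = sm.OLS(y, sm.add_constant(x1)).fit()        # omits X2

print(full.params[1])   # close to the true beta1 = 2.0
print(short.params[1])  # biased upward: d1 absorbs part of X2's effect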
Solution?
• Think about possible measurement errors.
• Re-specify your model: add one or more variables, or exclude possibly irrelevant variables.
• See if an additional variable changes anything (the significance and size of the predictors, R2).
• Go back to theory.
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending.
H0: b1=0; H1: b1>0
1. Estimate the following models:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
   Medicaid = a + c2(age65) + c3(income) + v
2. Compare the two models.
   1) Compare the Adjusted R2 and the F statistic.
   2) See if there are any changes in the statistical significance or magnitude of each regression coefficient.
   3) Do you consider the second model misspecified?
Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1=0; H1: b1>0
Hypothesis 2: H0: b2=0; H1: b2>0
1. Estimate the following models:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
   Wage = a + c2(experience) + c3(female) + v
2. Compare the two models.
   1) Compare the Adjusted R2 and the F statistic.
   2) See if there are any changes in the statistical significance or magnitude of each regression coefficient.
   3) Do you consider the second model misspecified?
Break
Heteroskedasticity
• If the error variance is constant throughout the regression line, the errors are homoskedastic (equal variance).
• If the error variance is not constant throughout the regression line, the equal-variance assumption is violated: the errors are heteroskedastic.
[Figure: four scatter plots, one homoskedastic and three heteroskedastic error patterns]
Heteroskedasticity: Causes
• Measurement errors
• Omitted variables
  – Suppose the true model is: y = a + b1x1 + b2x2 + u
  – The model we estimate fails to include x2: y = a + c1x1 + v
  – Then the error term of our model captures the effect of x2, so it is correlated with x2.
• Non-linearity
  – True model: y = a + b1x1² + u
  – Our model: y = a + c1x1 + v
  – Then the residual captures the non-linearity, and the error variance changes accordingly.
Heteroskedasticity: Consequences
• Heteroskedasticity by itself does not cause OLS to be biased or inconsistent.
• OLS is still unbiased, but it is no longer the best estimator: you lose the minimum-variance property, and therefore precision.
• Heteroskedasticity can also be a symptom of omitted variables or measurement errors, and OLS estimators will be biased and inconsistent if you have omitted variables or measurement errors.
Heteroskedasticity: Detection
• Graphical methods
  – Plot the standardized residuals (Y axis) against the standardized predicted values (X axis):
    • Analyze → Regression → Linear → Plots → select ZRESID (standardized residuals) as the Y variable and ZPRED (standardized predicted values) as the X variable. Click “Continue.”
  – If you see a pattern (a funnel shape or a curve), this indicates heteroskedasticity. A sketch of the same plot outside SPSS follows the figure below.
[Figure: a satisfactory residual plot with constant spread vs. one showing non-constant variance]
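A minimal sketch of the same residual plot with statsmodels and matplotlib (data file and column names hypothetical):

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv("medicaid.csv")  # hypothetical
X = sm.add_constant(df[["poverty", "age65", "income"]])
model = sm.OLS(df["medicaid"], X).fit()

z_resid = model.get_influence().resid_studentized_internal  # ZRESID
fitted = model.fittedvalues
z_pred = (fitted - fitted.mean()) / fitted.std()            # ZPRED

plt.scatter(z_pred, z_resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted values")
plt.ylabel("Standardized residuals")
plt.show()  # a funnel shape or a curve suggests heteroskedasticity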
Heteroskedasticity: Solution
• Re-specify your model. You may have omitted one or more important variables.
• Consider using different measurements.
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending.
H0: b1=0; H1: b1>0
1. Estimate the following models:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
   Medicaid = a + c2(age65) + c3(income) + v
2. Create and compare residual plots for both models.
Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1=0; H1: b1>0
Hypothesis 2: H0: b2=0; H1: b2>0
1. Estimate the following models:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
   Wage = a + c2(experience) + c3(female) + v
2. Create and compare residual plots for both models.
Break
Outlier
• Outliers: cases with extreme values
  – They are influential: removing them substantially changes the coefficient estimates.
• Consequence
  – Estimates are biased, especially when the sample size is small.
• Causes of outliers
  – Errors in coding or data entry
  – Highly unusual cases
  – Important real variation
[Figure: an extreme case pulls the regression line up; with the extreme case removed from the sample, the line shifts down]
Detecting Outliers
1. Scatter plots
– Detect if there are any outliers (eyeballing)
Detecting Outliers
2. Compute Cook’s D
   – Identifies cases that strongly influence the regression line.
   – Higher value → potential outlier
   – Rule of thumb: cut-off = 4/N
   – If Cook’s D > 4/N → pay attention
   – Example: N=50 → cut-off = 4/50 = 0.08
Detecting Outliers
3. Compute DfBeta
   – DfBeta: the change in a regression coefficient that results from the deletion of the ith case.
   – A DfBeta value is calculated for each case, for each regression coefficient.
   – Higher value → potential outlier
   – Rule of thumb: pay attention if |DfBeta| > cut-off = 2/sqrt(N)
   – Ex. N=50 → cut-off = 2/sqrt(50) = 0.28
Detecting Outliers
4. Compute DfFit
   – The change in the predicted value when the ith case is deleted.
   – Higher value → potential outlier
   – Rule of thumb: pay attention if |DfFit| > cut-off = 2×sqrt(K/N), where K = number of independent variables
   – Ex. N=50, K=5 → cut-off = 2×sqrt(5/50) ≈ 0.63
Example
1. Scatter plots
   Analyze → Regression → Linear → Plots → Produce all partial plots
2. Compute Cook’s D, DfBeta, DfFit
   Analyze → Regression → Linear → Save → Cook’s, DfBeta(s), DfFit
   (Cook’s D, DfBeta, and DfFit values will be saved in your data file. Go back to Data View to see them.)
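All three measures are also available from statsmodels’ influence object; a sketch, reusing the hypothetical Medicaid model from the earlier sketches:

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("medicaid.csv")  # hypothetical, as above
X = sm.add_constant(df[["poverty", "age65", "income"]])
model = sm.OLS(df["medicaid"], X).fit()

influence = model.get_influence()
n, k = int(model.nobs), int(model.df_model)  # N and number of IVs

cooks_d = influence.cooks_distance[0]  # one value per case
dfbetas = influence.dfbetas            # one value per case per coefficient
dffits = influence.dffits[0]           # one value per case

print(np.where(cooks_d > 4 / n))                      # Cook's D > 4/N
print(np.where(np.abs(dfbetas) > 2 / np.sqrt(n)))     # |DfBeta| > 2/sqrt(N)
print(np.where(np.abs(dffits) > 2 * np.sqrt(k / n)))  # |DfFit| > 2*sqrt(K/N)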
Solution?
• In the presence of outliers…
  – Fit the model with and without the outliers (see the sketch below).
  – Remove influential observations from the regression analysis.
    • Recall: “They are influential: removing them substantially changes the coefficient estimates.”
  – You must justify why. Do not destroy your data without justification.
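A sketch of refitting without the flagged cases, continuing the hypothetical Medicaid example (any removal still needs a substantive justification):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("medicaid.csv")  # hypothetical, as in the sketches above
X = sm.add_constant(df[["poverty", "age65", "income"]])
model = sm.OLS(df["medicaid"], X).fit()

# Drop cases whose Cook's D exceeds the 4/N rule of thumb, then refit
cooks_d = model.get_influence().cooks_distance[0]
keep = cooks_d <= 4 / int(model.nobs)
model_trimmed = sm.OLS(df.loc[keep, "medicaid"], X[keep]).fit()

print(model.params)          # full sample
print(model_trimmed.params)  # influential cases removed; large shifts mean
                             # the dropped cases were driving the estimates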
Example: Medicaid
We want to estimate the effect of poverty on the share of state Medicaid spending.
H0: b1=0; H1: b1>0
1. Estimate the following model:
   Medicaid = a + b1(poverty) + b2(age65) + b3(income) + e
2. Examine outliers.
   1) Create partial scatter plots. Eyeball each scatter plot for outliers.
   2) Compute Cook’s D, DfBeta, and DfFit. Calculate the cut-offs for each. Do we have any outliers?
Group Work
We want to estimate the marginal effect of education on wage.
Hypothesis 1: H0: b1=0; H1: b1>0
Hypothesis 2: H0: b2=0; H1: b2>0
1. Estimate the following model:
   Wage = a + b1(edu) + b2(experience) + b3(female) + e
2. Examine outliers.
   1) Create partial scatter plots. Eyeball each scatter plot for outliers.
   2) Compute Cook’s D, DfBeta, and DfFit. Calculate the cut-offs for each. Do we have any outliers?
Go Home.