Class 5 Lecture: Multiple Regression Diagnostics

Multiple Regression Diagnostics 2
Sociology 8811
Copyright © 2007 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• None
Multiple Regression Assumptions
• 3. d. Predictors (Xis) are uncorrelated with error
– This most often happens when we leave out an
important variable that is correlated with another Xi
– Example: Predicting job prestige with family wealth,
but not including education
– Omission of education will affect error term. Those
with lots of education will have large positive errors.
• Since wealth is correlated with education, it will be
correlated with that error!
– Result: coefficient for family wealth will be biased.
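The bias above can be seen in a small simulation (hypothetical numbers, not from the lecture): education drives prestige, wealth is correlated with education, and regressing prestige on wealth alone inflates the wealth coefficient well past its true value.

```python
import random

random.seed(42)
n = 5000

# Hypothetical data: education drives prestige; wealth is correlated with education
educ = [random.gauss(0, 1) for _ in range(n)]
wealth = [0.6 * e + random.gauss(0, 0.8) for e in educ]
# True wealth effect is 0.5; education's effect (2.0) is much larger
prestige = [2.0 * e + 0.5 * w + random.gauss(0, 1)
            for e, w in zip(educ, wealth)]

def slope(x, y):
    """OLS slope from a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Education omitted: wealth picks up education's effect via their correlation,
# so the estimate lands far above the true 0.5
b_wealth = slope(wealth, prestige)
```

Because wealth is correlated with the omitted education term sitting in the error, the naive slope is biased upward.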
Multiple Regression Assumptions
• 4. In systems of equations, error terms of
equations are uncorrelated
• Knoke, p. 256
– This is not a concern for us in this class
• Worry about that later!
Multiple Regression Assumptions
• 5. Sample is independent, errors are random
– Not only should errors not increase with X
(heteroskedasticity), there should be no pattern at all!
• Cases that are non-independent often have
correlated error
• Things that cause patterns in error
(autocorrelation):
– Measuring data over long periods of time (e.g., every
year). Error from nearby years may be correlated.
• Called: “Serial correlation”.
Multiple Regression Assumptions
• More things that cause patterns in error
(autocorrelation):
– Measuring data in families. All members are similar,
will have correlated error
– Measuring data in geographic space.
• Example: data on 50 US states. States in a similar region
have correlated error
• Called “spatial autocorrelation”
• There are variations of regression models to
address each kind of correlated error.
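One quick check for serial correlation (a sketch, not from the lecture) is the Durbin–Watson statistic: it is near 2 when successive errors are uncorrelated and falls toward 0 as positive serial correlation grows (roughly 2(1 − ρ) for lag-1 autocorrelation ρ).

```python
import random

random.seed(0)
n = 500

# Simulate AR(1) errors: each error carries over 0.8 of the previous one
errors = [random.gauss(0, 1)]
for _ in range(n - 1):
    errors.append(0.8 * errors[-1] + random.gauss(0, 0.6))

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

dw = durbin_watson(errors)  # well below 2 here, signaling serial correlation
```

With ρ = 0.8 the statistic comes out near 2(1 − 0.8) = 0.4, far from the no-correlation benchmark of 2.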
Regression: Outliers
• Note: Even if regression assumptions are met,
slope estimates can have problems
• Example: Outliers -- cases with extreme values
that differ greatly from the rest of your sample
• More formally: “influential cases”
• Outliers can result from:
• Errors in coding or data entry
• Highly unusual cases
• Or, sometimes they reflect important “real” variation
• Even a few outliers can dramatically change
estimates of the slope, especially if N is small.
Regression: Outliers
• Outlier Example: an extreme case pulls the
regression line up
[Scatterplot, axes roughly -4 to 4: the fitted line with the full sample
vs. the regression line with the extreme case removed from the sample]
Regression: Outliers
• Strategy for identifying outliers:
• 1. Look at regression partial plots (avplots) for
extreme values
• 2. Compute outlier diagnostic statistics
– High values indicate potential outliers
• “Leverage”
• Cook’s D
• DFFIT
• DFBETA
• Residuals, standardized residuals, studentized residuals
Scatterplots
• Example: Study time and student achievement.
– X variable: Average # hours spent studying per day
– Y variable: Score on reading test
Case    X      Y
1       2.6    28
2       1.4    13
3        .65   17
4       4.1    31
5        .25    8
6       1.9    16
7       3.5     6

[Scatterplot: test score (Y axis, 0–30) against hours of study (X axis, 0–4)]
Outliers
• Results with outlier:
Model Summary(b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .466(a)  .217       .060                9.1618

a. Predictors: (Constant), HRSTUDY
b. Dependent Variable: TESTSCOR

Coefficients(a)

Model           Unstandardized B   Std. Error   Standardized Beta   t       Sig.
1  (Constant)   10.662             6.402                            1.665   .157
   HRSTUDY       3.081             2.617        .466                1.177   .292

a. Dependent Variable: TESTSCOR
Outlier Diagnostics
• Residuals: The numerical value of the error
• Error = distance that a point falls from the line
• Cases with unusually large error may be outliers
• Note: residuals have many other uses!
• Standardized residuals
– Z-score of residuals… converts to a neutral unit
– Often, standardized residuals larger than 3 are
considered worthy of scrutiny
• But, it isn’t the best outlier diagnostic
– Studentized residuals: recalculates the standardized
residual after removing the case from the analysis.
Outlier Diagnostics
• Cook’s D: Identifies cases that are strongly
influencing the regression line
– SPSS calculates a value for each case
• Go to “Save” menu, click on Cook’s D
• How large of a Cook’s D is a problem?
– Rule of thumb: Values greater than: 4 / (n – k – 1)
– Example: N=7, K = 1: Cut-off = 4/5 = .80
– Cases with higher values should be examined.
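These diagnostics can be computed by hand for simple regression. A pure-Python sketch using the hours/score data from this lecture (SPSS and Stata compute the same quantities for you):

```python
# Study-time data from the lecture
x = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9, 3.5]   # hours studied
y = [28, 13, 17, 31, 8, 16, 6]              # test score
n, k = len(x), 1                            # 7 cases, 1 predictor

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx  # slope, ~3.081
a = my - b * mx                                               # intercept, ~10.662

resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in resid) / (n - k - 1)   # mean squared error
std_resid = [e / s2 ** 0.5 for e in resid]      # residual in SD units

# Leverage h_i, then Cook's D for each case
lev = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
cooks = [(e ** 2 / ((k + 1) * s2)) * (h / (1 - h) ** 2)
         for e, h in zip(resid, lev)]

cutoff = 4 / (n - k - 1)   # rule of thumb from the lecture: 4/5 = .80
flagged = [i + 1 for i, d in enumerate(cooks) if d > cutoff]
```

Only case 7 (3.5 hours, score 6) exceeds the cutoff, matching the Cook’s D of .941 reported on the next slide.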
Outlier Diagnostics
• Example: Outlier/Influential Case Statistics
Hours   Score   Resid    Std Resid   Cook’s D
2.60    28        9.32     1.01       .124
1.40    13       -1.97    -.215       .006
 .65    17        4.33     .473       .070
4.10    31        7.70     .841       .640
 .25     8       -3.43    -.374       .082
1.90    16        -.515    -.056      .0003
3.50     6      -15.4    -1.68        .941
Outliers
• Results with outlier removed:
Model Summary(b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .903(a)  .816       .770                4.2587

a. Predictors: (Constant), HRSTUDY
b. Dependent Variable: TESTSCOR

Coefficients(a)

Model           Unstandardized B   Std. Error   Standardized Beta   t       Sig.
1  (Constant)   8.428              3.019                            2.791   .049
   HRSTUDY      5.728              1.359        .903                4.215   .014

a. Dependent Variable: TESTSCOR
Regression: Outliers
• Question: What should you do if you find
outliers? Drop outlier cases from the analysis?
Or leave them in?
– Obviously, you should drop cases that are incorrectly
coded or erroneous
– But, generally speaking, you should be cautious about
throwing out cases
• If you throw out enough cases, you can produce any result
that you want! So, be judicious when destroying data.
Regression: Outliers
• Circumstances where it can be good to drop
outlier cases:
• 1. Coding errors
• 2. Single extreme outliers that radically change
results
– Your results should reflect the dataset, not one case!
• 3. If there is a theoretical reason to drop cases
– Example: In analysis of economic activity,
communist countries may be outliers
• If the study is about “capitalism”, they should be dropped.
Regression: Outliers
• Circumstances when it is good to keep outliers
• 1. If they form meaningful cluster
– Often suggests an important subgroup in your data
• Example: Asian-Americans in a dataset on education
• In such a case, consider adding a dummy variable for them
– Unless, of course, research design is not interested in
that sub-group… then drop them!
• 2. If there are many
– Maybe they reflect a “real” pattern in your data.
Regression: Outliers
• When in doubt: Present results both with and
without outliers
– Or present one set of results, but mention how results
differ depending on how outliers were handled
• For final projects: Check for outliers!
• At least with scatterplots
– In the text: Mention if there were outliers, how you
handled them, and the effect it had on results.
Multicollinearity
• Another common regression problem:
Multicollinearity
• Definition: collinear = highly correlated
• Multicollinearity = inclusion of highly correlated
independent variables in a single regression model
• Recall: High correlation of X variables causes
problems for estimation of slopes (b’s)
• Recall: as X correlations grow, the denominators in the slope
formulas approach zero, so coefficients may be wrong or too large.
Multicollinearity
• Multicollinearity symptoms:
• Unusually large standard errors and betas
• Compared to a model that includes only one of the collinear variables
• Betas often exceed 1.0
• Two variables have the same large effect when
included separately… but…
– When put together the effects of both variables shrink
– Or, one remains positive and the other flips sign
• Note: Not all “sign flips” are due to multicollinearity!
Multicollinearity
• What does multicollinearity do to models?
• Note: It does not violate regression assumptions
• But, it can mess things up anyway
• 1. Multicollinearity can inflate standard error
estimates
• Large standard errors = small t-values = no rejected null
hypotheses
• Note: Only the collinear variables are affected. The rest of the
model results are OK.
Multicollinearity
• What does multicollinearity do?
• 2. It leads to instability of coefficient estimates
• Variable coefficients may fluctuate wildly when a collinear
variable is added
• These fluctuations may not be “real”, but may just reflect
amplification of “noise” and “error”
– One variable may only be slightly better at predicting Y… but
SPSS will give it a MUCH higher coefficient.
Multicollinearity
• Diagnosing multicollinearity:
• 1. Look at correlations of all independent vars
• Correlation > .8 is a concern
• But, problems aren’t always bivariate… and may not
show up in bivariate correlations
– Ex: If you forget to omit a dummy variable
• 2. Watch out for the “symptoms”
• 3. Compute diagnostic statistics
• Tolerance, VIF (Variance Inflation Factor).
Multicollinearity
• Multicollinearity diagnostic statistics:
• “Tolerance”: Easily computed in SPSS
• Low values indicate possible multicollinearity
– Start to pay attention at .3; Below .1 is very likely to be a problem
• Tolerance is computed for each independent variable by
regressing it on other independent variables
– VIF = inverse of tolerance
Multicollinearity
• If you have 3 independent variables: X1, X2, X3…
– Tolerance is based on doing a regression: X1 is
dependent; X2 and X3 are independent.
• Tolerance for X1 is simply 1 minus regression R-square.
• If a variable (X1) is highly correlated with all the
others (X2, X3) then they will do a good job of
predicting it in a regression
• Result: Regression R-square will be high… 1 minus R-square will be low… indicating a problem.
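A sketch with two hypothetical predictors: with only two independent variables, the R-square from regressing one on the other is just their squared correlation, so tolerance = 1 − r² and VIF = 1/tolerance.

```python
import random

random.seed(1)
n = 1000

# Two deliberately collinear predictors (hypothetical data)
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [0.95 * v + random.gauss(0, 0.3) for v in x2]  # x3 tracks x2 closely

def r(a, b):
    """Pearson correlation of two lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

# With two predictors, R-square from regressing x2 on x3 is r^2,
# so tolerance = 1 - r^2 and VIF is its reciprocal
tolerance = 1 - r(x2, x3) ** 2
vif = 1 / tolerance
```

Here tolerance lands near .1 (the “very likely a problem” zone from the slide) and VIF is correspondingly large.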
Multicollinearity
• Variance Inflation Factor (VIF) is the reciprocal
of tolerance: 1/tolerance
• High VIF indicates multicollinearity
– Gives an indication of how much the Standard Error
of a variable grows due to presence of other variables.
Multicollinearity
• Solutions to multicollinearity
– It can be difficult if a fully specified model requires
several collinear variables
• 1. Drop unnecessary variables
• 2. If two collinear variables are really measuring
the same thing, drop one or make an index
– Example: Attitudes toward recycling; attitude toward
pollution. Perhaps they reflect “environmental views”
• 3. Advanced techniques: e.g., Ridge regression
• Uses a more efficient estimator (but not BLUE – may
introduce bias).
“Robust” Standard Errors
• Robust / Huber / White / Sandwich standard error
– An alternative method of estimating regression SEs
• More accurate under conditions of heteroskedasticity
• Potentially more accurate under conditions of non-independence (clustered data)
• Potentially more accurate when model is underspecified
• Stata: regress y x1 x2, robust
– Increasingly common… Some people now use them
routinely…
• But, Freedman (2006) criticizes use for underspecification:
• What use are SEs if model is underspecified and therefore
slopes are biased?
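For simple regression the sandwich formula reduces to a single sum, so the idea can be sketched in a few lines (simulated heteroskedastic data; the n/(n − 2) factor mimics Stata's HC1 small-sample adjustment; this is an illustration, not Stata's exact implementation):

```python
import random

random.seed(7)
n = 400
x = [random.uniform(0, 4) for _ in range(n)]
# Error spread is largest at extreme x values (heteroskedasticity)
y = [1.0 + 2.0 * xi + random.gauss(0, 0.3 + abs(xi - 2.0)) for xi in x]

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Conventional OLS slope SE: assumes constant error variance
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_ols = (s2 / sxx) ** 0.5

# White/HC0 sandwich SE: weights each squared residual by its x-deviation
se_robust = (sum(((xi - mx) ** 2) * e ** 2
                 for xi, e in zip(x, resid)) / sxx ** 2) ** 0.5
# HC1 (roughly what Stata's `, robust` reports) adds a small-sample factor
se_hc1 = se_robust * (n / (n - 2)) ** 0.5
```

With error variance concentrated at high-leverage x values, the robust SE comes out larger than the conventional one, as expected.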
Models and “Causality”
• Issue: People often use statistics to support
theories or claims regarding causality
– They hope to “explain” some phenomena
• What factors make kids drop out of school
• Whether or not discrimination leads to wage differences
• What factors make corporations earn higher profits
• Statistics provide information about association
• Always remember: Association (e.g., correlation)
is not causation!
• The old aphorism is absolutely right
• Association can always be spurious
Models and “Causality”
• How do we determine causality?
• The randomized experiment is held up as the
ideal way to determine causality
• Example: Does drug X cure cancer?
• We could look for association between receiving
drug X and cancer survival in a sample of people
• But: Association does not demonstrate causation; Effect
could be spurious
• Example: Perhaps rich people have better access to drug X;
and rich people have more skilled doctors!
• Can you think of other possible spurious processes?
Models and “Causality”
• In a randomized experiment, people are assigned
randomly to take drug X (or not)
• Thus, taking drug X is totally random and totally
uncorrelated with any other factor (such as wealth, gender,
access to high quality doctors, etc)
• As a result, the association between drug X and
cancer survival cannot be affected by any
spurious factor
• Nor can “reverse causality” be a problem
• SO: We can make strong inferences about causality!
Models and “Causality”
• Unfortunately, randomized experiments are
impractical (or unethical) in many cases
• Example: Consequences of high-school dropout, national
democracy, or impact of homelessness
• Plan B: Try to “control” for spurious effects:
• Option 1: Create homogenous sub-groups
– Effects of Drug X: If there is a spurious relationship
with wealth, compare people with comparable wealth
• Ex: Look at effect of drug X on cancer survivors among
people of constant wealth… eliminating spurious effect.
Models and “Causality”
• Option 2: Use multivariate model to “control”
for spurious effects
• Examine effect of key variable “net” of other relationships
– Ex: Look at effect of Drug X, while also including a
variable for wealth
• Result: Coefficients for Drug X represent effect net of
wealth, avoiding spuriousness.
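A simulation sketch of Option 2 (hypothetical numbers: the drug truly does nothing, while wealth drives both drug access and survival):

```python
import random

random.seed(3)
n = 5000

# Hypothetical setup: wealth drives both drug access and survival;
# the drug itself has NO true effect
wealth = [random.gauss(0, 1) for _ in range(n)]
drug = [0.7 * w + random.gauss(0, 0.7) for w in wealth]      # rich get more drug
survival = [1.0 * w + random.gauss(0, 0.5) for w in wealth]  # drug absent

def cov(a, b):
    """Covariance of two lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Naive simple regression of survival on drug alone: spurious "effect"
b_naive = cov(survival, drug) / cov(drug, drug)

# Two-predictor OLS slope for drug, controlling for wealth:
# b_drug = (cov(y,d)*var(w) - cov(y,w)*cov(d,w)) / (var(d)*var(w) - cov(d,w)^2)
syd, syw = cov(survival, drug), cov(survival, wealth)
sdd, sww, sdw = cov(drug, drug), cov(wealth, wealth), cov(drug, wealth)
b_controlled = (syd * sww - syw * sdw) / (sdd * sww - sdw ** 2)
```

The naive slope is strongly positive purely through the wealth pathway; adding the wealth control drives the drug coefficient back toward its true value of zero.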
Models and “Causality”
• Limitations of “controls” to address spuriousness
• 1. The “homogeneous sub-groups” approach reduces N
• To control for many possible spurious effects, you’ll throw
away lots of data
• 2. You have to control for all possible spurious
effects
• If you overlook any important variable, your results could
be biased… leading to incorrect conclusions about causality
• First: It is hard to measure and control for everything
• Second: Someone can always think up another thing you
should have controlled for, undermining your causal claims.
Models and “Causality”
• Under what conditions can a multivariate model
support statements about causality?
• In theory: A multivariate model can support claims
about causality… IF:
• The sample is unbiased
• The measurement is accurate
• The model includes controls for every major possible
spurious effect
• The possibility of reverse causality can be ruled out
• And, the model is executed well: assumptions, outliers,
multicollinearity, etc. are all OK.
Models and “Causality”
• In Practice: Scholars commonly make tentative
assertions about causality… IF:
• The data set is of high quality; sample is either random or
arguably not seriously biased
• Measures are high quality by the standards of the literature
• The model includes controls for major possible spurious
effects discussed in the prior literature
• The possibility of reverse causality is arguably unlikely
• And, the model is executed well: assumptions, outliers,
multicollinearity, etc. are all acceptable… (OR, the author
uses variants of regression necessary to address problems).
Models and “Causality”
• In sum: Multivariate analysis is not the ideal tool
to determine causality
• If you can run an experiment, do it
• But: Multivariate models are usually the best tool that we
have!
• Advice: Multivariate models are a terrific way to
explore your data
• Don’t forget: “correlation is not causation”
• The models aren’t magic; they simply sort out correlation
• But, if used thoughtfully, they can provide hints about likely
causal processes!