Linear Regression Checklist

“Linear Regression. Part I OLS.”
_Dataset can be found on the Poli 496 Course Website and helping1.sav at:
http://wps.ablongman.com/ under “jump to” datasets.
Extended Linear Regression Procedure:
_In the first lecture we produced the results for a simple linear regression (by hand).
Today we will do the same with multiple independent variables, hence the term
multiple regression analysis.
Basics:
_Opening the file: if it is an Excel file, indicate “read variable names,” then save under a new name.
(1). Getting your own data: Collect and input interval level cross-sectional data in rows
(variable values in columns).
_Exceptions: If DV is dichotomous, use logistic regression. If data is longitudinal, use
time series regression.
_Data Editor: spreadsheet: rows = cases; columns = variables (recall names).
_Look at the dataset we are using: n=81, 20 variables, on how people help each other.
_The model of interest: we want to know how people’s feelings towards other people
affect whether they help them.
_This is the dataset provided with the text.
_Variable Definitions:
_Presumably acquired through a survey of 81 persons facing people in need.
zhelp: Z-scores (normalized) of the amount of time spent helping a friend on a -3 to +3
scale.
Sympathy: Sympathy felt by helper in response to a friend’s need on a little (1) to much
(7) scale.
Anger: Anger felt by helper in response to friend’s need; 7-point scale above.
Efficacy: Self-efficacy [ability] of helper in relation to friend’s need; 7-point scale above.
Dsex: Gender: dummy variable: 1 = female and 0 = male.
(2). Measuring Skewness and Kurtosis:
_The independent variables should have approximately normal distributions, otherwise
the findings will be biased (not relevant for dichotomous dummy variables).
_Skewness measures the extent to which the left or right bell curve is drawn out.
_Kurtosis measures the extent to which the bell curve rises or flattens.
_Procedure: >> Analyze >> Descriptive Statistics >> Descriptives: INPUT anger,
sympathy, efficacy.
_Options: tick Kurtosis and Skewness >> Continue >> Paste [OK] >> Run.
_Saving Syntax: I need all your assignments and your final paper to be accompanied by
the pasted syntax: File >> Save As >> Name. The file can be opened in Word.
_Save the pasted syntax in a text file for later collation (this will permit me to see
whether you followed the correct procedures).
_Result Indicator: Skewness and Kurtosis are normal if they are within +/-1, and at
tolerable limits if at +/-2 (beyond which the distribution of the independent variable will
bias the results of the regression).
_Results: Efficacy and Sympathy are within required parameters; anger is within
tolerable parameters.
_Solution: Increase sample size to achieve an approximately normal distribution.
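Outside SPSS, the two shape statistics and the cut-off rule above can be sketched in Python. This is a minimal moment-based version (SPSS applies small-sample corrections, so its values differ slightly); the function names are mine:

```python
import statistics

def skewness(xs):
    """Simple moment-based sample skewness (0 for a symmetric sample)."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - mean) ** 3 for x in xs) / (len(xs) * sd ** 3)

def excess_kurtosis(xs):
    """Kurtosis minus 3, so a normal-shaped sample scores near 0 (as in SPSS)."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - mean) ** 4 for x in xs) / (len(xs) * sd ** 4) - 3

def distribution_check(xs):
    """Apply the +/-1 (normal) and +/-2 (tolerable) rules from the text."""
    worst = max(abs(skewness(xs)), abs(excess_kurtosis(xs)))
    if worst <= 1:
        return "normal"
    if worst <= 2:
        return "tolerable"
    return "problematic"
```

Running `distribution_check` on each IV column reproduces the decision rule applied to anger, sympathy, and efficacy above.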
(2b). Run Pearson’s r between each pair of IVs to check for multicollinearity. If r > 0.75,
then two IVs are too closely related and measure the same thing. Including both will bias
the results.
_Procedure: >> Analyze >> Correlate >> Bivariate: INPUT anger, sympathy, efficacy,
dsex (Pearson): Paste [OK] >> Run.
_Results: anger, efficacy, sympathy and dsex are all uncorrelated.
Interpretation:
_Each IV is correlated with the others: the table provides the Pearson correlation (r),
the t-test significance of the correlation, and n, the number of observations.
_If two IVs correlate strongly, they explain essentially the same thing and are either two
different aspects of the same underlying concept, or are themselves caused by an
antecedent variable.
_The t significance test, or t-test, measures the probability that the correlation you are
examining is the product of chance rather than a genuine correlation. Thus the lower its
value the better.
_T-tests are commonly reported at three levels: significance at the 10% level, at the 5%
level (most common), and at the 1% level (the most stringent).
_If any IV correlates at greater than (r=) 0.75, then one of the two IVs should be
excluded (or combined through factor analysis) or it will bias the regression results.
_Solutions: (i) exclude one of the correlated IVs, or (ii) run factor analysis to combine the related IVs.
(3). Scatterplot each individual IV against the DV to confirm linearity. If an IV does not
have a linear association with the DV, then the IV must be (Solution): transformed.
Otherwise the regression results will be biased. Avoid transforming the DV into a linear
representation (this is sometimes done, but it requires re-plotting all the IVs). Proceed
through each individual IV (not relevant for dichotomous (dummy) IVs).
Procedure:
>> Graph >> Legacy Dialogs>> Scatter/Dot >> Simple Scatter >> Define >> Y Axis
(put the DV: zhelp) >> X Axis (put the IV: each of anger, efficacy, and sympathy in
different scatterplots) >> Paste syntax >> Run.
_Examine the scatterplot for each of the IVs to ensure they are linear.
_Results:
Sympathy: zhelp and sympathy: diffuse but linear.
Anger: zhelp and anger: skewed to the left (low values of the IV), but linear.
Efficacy: zhelp and efficacy: slight skewness to the right, but linear.
_Assessment: there is no transformation required in these data.
_Solution: if the IV-DV line is not linear, then the IV must be transformed.
_Let us examine a non-linear relationship that is in need of transformation:
_Get anxiety.sav dataset from the Poli 644 website. N=74
_Variable Definitions:
_Exam (DV): The score on a 100-point exam.
_Anxiety (IV): A measure of pre-exam anxiety measured on a low(1) to high(10) scale.
_First step is to diagnose the linearity of the relationship.
_Procedure:
>> Graph >> Legacy Dialogs>> Scatter/Dot >> Simple Scatter >>Define >> Y Axis
(put the DV: exam) >> X Axis (put the IV) anxiety >> Paste syntax >> Run.
_Examine the scatterplot for each of the IVs to ensure they are linear.
_Results: evident “inverted U” curvilinear relationship.
_Theory: Some anxiety is beneficial, but too much hampers performance.
_Solution: the IV anxiety must be transformed to render the relationship linear.
_The next step is to determine which mathematical transformation fits best:
Conversions of Data for Linearity:
(a). If the data form an “F curve,” take the log of the X-axis (the independent variable);
in SPSS, Compute with LN(X) or LG10(X).
(b). If the data form an “L curve,” then transform the IV on the X-axis into its reciprocal
(divide 1 by it).
(c). If the data form a “bell curve,” then use the quadratic (multiply the IV by itself) or
the cubic (raise the IV to the third power).
(d). If the data form a “soft Z curve,” take the log of both the DV (Y-axis) and the IV
(X-axis).
_This procedure is very time consuming and benefits from extensive trial-and-error, and
the solutions are rarely perfect.
_Note: Logs will not work with 0 or negative values.
_All transformed IVs must be regressed against the DV in a scatterplot above to confirm
the success of the linear transformation.
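The four conversions above can be collected into one helper. This is a sketch (the function name and error handling are mine), and, per the note above, logs fail for zero or negative values:

```python
import math

def transform(x, kind):
    """Transform one IV value to linearize an IV-DV relationship.
    'log' (F curve), 'reciprocal' (L curve), 'quadratic' / 'cubic'
    (bell curve), as in the conversion list above."""
    if kind == "log":
        if x <= 0:
            raise ValueError("log undefined for zero or negative values")
        return math.log10(x)  # SPSS LG10(X)
    if kind == "reciprocal":
        return 1 / x
    if kind == "quadratic":
        return x * x
    if kind == "cubic":
        return x ** 3
    raise ValueError(f"unknown transformation: {kind}")
```

The “soft Z curve” case is just `transform` applied with `'log'` to both the DV and the IV columns.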
_Assessment: for the cubic and quadratic cases, transform the IV by that power.
_The next step is to quadratically transform anxiety.
_Procedure: >> Transform >> Compute Variable >> Target Variable >> [IV new
name] INPUT (qanxiety) >> Functions INPUT >> anxiety*anxiety >> OK.
_Procedure:
>> Graph >> Legacy Dialogs>> Scatter/Dot >> Simple Scatter >> Define >> Y Axis
(put the DV: exam) >> X Axis (put the IV) qanxiety >> Paste syntax >> Run.
_Assessment: the transformation had little impact in this case.
_Return to original dataset.
(4). We must identify and potentially exclude data points that lie significantly farther
from the mean than two standard deviations (termed outliers). Failing to do so will skew
the generalizability of our regression results.
_Normally the variables would be normalized, but looking for an approximate bell curve
is sufficient.
_EG: We would exclude Superman if we wanted to make generalizations about human
strength.
_Procedure: >> Graphs >> Legacy Dialogs >>Histogram >> INPUT: Display Normal
Curve >> INPUT: zhelp >> Paste syntax >> Run.
_Repeat for each variable: zhelp, sympathy, anger, and efficacy.
_Interpretation:
zhelp: Some data points at 3 SD (which is expected, as zhelp is normalized).
Sympathy: Seems approximately a normal distribution.
Anger: Clearly a non-normal distribution, but kurtosis and skewness were acceptable. No
major outliers.
Efficacy: Has a nearly non-normal distribution, but tolerable, and without major outliers.
_Procedure: If there are outliers, they must be removed temporarily to determine if their
exclusion would significantly affect the regression findings.
_Solution: Increase the sample size to decrease the proportion of extreme data points.
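The two-standard-deviation rule can be sketched directly (a minimal version using the sample standard deviation; the function name is mine):

```python
import statistics

def flag_outliers(xs, cutoff=2.0):
    """Flag values more than `cutoff` standard deviations from the mean,
    per the outlier rule in the text."""
    mean = statistics.fmean(xs)
    sd = statistics.stdev(xs)
    return [x for x in xs if abs(x - mean) > cutoff * sd]
```

Anything flagged would be removed temporarily to see whether its exclusion changes the regression findings, as described above.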
(5). Dummy Variables:
_Dummy variables are included in the regression (equation) when there is a dichotomous
independent variable such as sex, in which there are two values: 1 for female and 0 for
male.
_Procedure: create a dummy variable as you would any variable.
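The same recoding can be sketched in Python (a minimal helper of my own, mirroring the 1 = female, 0 = male coding of dsex):

```python
def make_dummy(values, one_category):
    """Recode a two-category variable into 1/0, e.g. 1 = female, 0 = male."""
    categories = set(values)
    if len(categories) != 2:
        raise ValueError("dummy coding here assumes exactly two categories")
    return [1 if v == one_category else 0 for v in values]
```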
(6). After linear transformations, re-run Pearson’s r between each IV to check for
multicollinearity (see the Pearson’s r step above).
(7). Run the Multiple Linear Regression:
_Run model, examining the strength of the relationship (R2), and the slope/standard error
for the t-test of significance for each of the IVs.
_Procedure: Analyze >> Regression >> Linear >> INPUT: Dependent: zhelp >>
INPUT: Independent: anger, efficacy, sympathy, dsex >> INPUT: Save: Predicted:
unstandardized, standardized >> Save: Residuals: unstandardized, Studentized >>
Continue >> INPUT: Method: Enter >> INPUT: Statistics: Durbin-Watson >>
Continue >> Paste [OK] >> Run.
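For readers who want to see the mechanics outside SPSS, the same kind of model can be sketched with ordinary least squares in Python using NumPy. All data values here are made up; a real analysis would use the helping1.sav columns:

```python
import numpy as np

# Toy stand-in data (hypothetical values, not the course dataset).
rng = np.random.default_rng(0)
n = 81
anger = rng.normal(4, 1.5, n)
sympathy = rng.normal(4, 1.5, n)
zhelp = 0.3 * anger + 0.5 * sympathy + rng.normal(0, 1, n)

# Design matrix with a constant column, then OLS via least squares.
X = np.column_stack([np.ones(n), anger, sympathy])
b, *_ = np.linalg.lstsq(X, zhelp, rcond=None)

fitted = X @ b           # unstandardized predicted values
residuals = zhelp - fitted
ss_res = float(residuals @ residuals)
ss_tot = float(((zhelp - zhelp.mean()) ** 2).sum())
r_squared = 1 - ss_res / ss_tot  # proportion of DV variance explained
```

The saved `fitted` and `residuals` arrays correspond to the predicted values and residuals that SPSS saves via the Save dialog above.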
_Interpretation of the Results: Paste Output.
R: 0.641
This is the multiple correlation between the IVs and the DV.
R-Square: 0.411
This identifies the proportion of variance in the DV accounted for by the IVs. Thus, 41%
of the variance in zhelp is explained by the four IVs.
Adjusted R-Square: 0.381 (moderate value)
Adjusted R-Square compensates for higher likelihood of correlations in the sample, and
therefore provides a more accurate estimate for the population.
Strong value: 0.6 to 0.8
Moderate value: 0.3 to 0.5
Weak value: 0.1 to 0.2
ANOVA:
This provides a general estimate of the significance of the model’s findings.
df: degrees of freedom: for the regression, the number of independent variables.
_For the residual, it is n (the number of cases) minus the number of independent
variables, minus 1 for the constant.
_With fewer cases and more variables, degrees of freedom fall and the likelihood of
obtaining unbiased regression estimates decreases.
F-statistic: The mean square regression divided by the mean square residual. Used to test
the overall significance of the model.
Significance of F-statistic: Likelihood that the finding could occur by chance (0.000).
Co-Efficients:
B (Constant) = -4.272 (standard error and significance test not relevant).
This fits into the equation that facilitates prediction.
Anger:
Unstandardized B Co-Efficient: 0.300.
Standard Error: 0.081
_This is the expected standard deviation of the estimated coefficient across repeated
samples; the smaller it is relative to B, the more precise the estimate.
_Confidence intervals:
_Approximately 68% of all cases in a normal distribution fall within B +/- 1*(S.E.).
_Approximately 95% of all cases in a normal distribution fall within B +/- 2*(S.E.).
_EG: For anger, roughly 68% of estimates fall within 0.3 +/- 0.081, or 0.219 to 0.381;
roughly 95% fall within 0.138 to 0.462.
t-Test: 3.681 (B divided by S.E.) KEY MEASURE
Significance: 0.001
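The arithmetic for anger’s coefficient can be verified directly (the output’s 3.681 was computed from unrounded values, so the rounded figures below give 3.704 instead):

```python
# Anger's coefficient and standard error from the output above.
b, se = 0.300, 0.081

t_stat = b / se  # the t-test: B divided by its S.E.
ci68 = (round(b - se, 3), round(b + se, 3))          # ~68%: B +/- 1*S.E.
ci95 = (round(b - 2 * se, 3), round(b + 2 * se, 3))  # ~95%: B +/- 2*S.E.
```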
_Significance Tests: Significance tests help us determine whether the statistical
correlations and relationships we observe in our samples are the product of chance or
whether they are genuine and can be used to generalize to the larger population.
_Specifically: Statistical significance tests tell us the probability that the relationship we
observe in our sample could occur if there were no such relationship in the larger
population or universe of cases.
_Strength of the significance test (0.05, or sometimes 0.1, is usually the minimum
significance accepted in the scientific community):
 t-test sig < 0.01 = 1% chance of finding a false positive (false association).
 t-test sig < 0.05 = 5% chance of finding a false positive (false association).
 t-test sig < 0.1 = 10% chance of finding a false positive (false association).
_Note: we can never be certain that the association isn’t just wild luck, but we can be
pretty sure.
_Solution: If the result falls short of significance, the sample size should be increased.
_There are dozens of different significance tests.
_Proceed to determining the value of t:
_The t-distribution is an approximately normal distribution.
_Use a one-tailed test for directional hypotheses (SPSS output reports two-tailed significance).
_Step 1: determine the degrees of freedom: n (# of cases) minus the number of IVs, minus 1.
_Step 2: determine in what column (significance level) the t-statistic value is greater than
the chart number.
_General metric: a t-statistic of about 2.00 is usually good enough.
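As a rough cross-check, the residual degrees of freedom and an approximate two-tailed p-value (the form SPSS reports) can be computed in Python; at df this large the t-distribution is close to normal, so a normal approximation is used here rather than an exact t table:

```python
from statistics import NormalDist

# df for the helping model: n = 81 cases, k = 4 IVs, minus 1 for the constant.
n, k = 81, 4
df = n - k - 1  # 76

def approx_two_tailed_p(t_stat):
    """Rough two-tailed p-value using the normal approximation to t."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

p = approx_two_tailed_p(3.681)  # well below the 0.01 threshold
```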
BETA: (Anger): 0.328
_Beta = Standardized (z-score) regression coefficients.
_Beta is a measure of the relative contribution of the IV if the other IVs were controlled
for. It is also known as the “partial r.” It is the most accurate measure of the relative
impact of the IV on explaining the outcome in the DV.
Sympathy:
Unstandardized B Co-Efficient: 0.499
Standard Error: 0.098
t-Test: 5.082
Significance: 0.000
BETA: 0.455
Efficacy:
Unstandardized B Co-Efficient: 0.435
Standard Error: 0.129
t-Test: 3.359
Significance: 0.001
BETA: 0.300
Dsex:
Unstandardized B Co-Efficient: -0.455 (negative means inverse relationship)
Standard Error: 0.224
t-Test: -2.030
Significance: 0.046
BETA: -0.180
_Regression Estimate: Overall the four IVs explain a moderate amount of variance in
the DV. Sympathy explains the most variance, followed by anger, efficacy, and then
dsex.
(8). Identifying significant variables:
_Exclude insignificant IVs and re-run model.
(9). Method of Testing the Linearity Assumption: since the sum of the residuals (the
unexplained variance) should equal zero, plotting the residuals against the predicted
values should show no pattern.
_Procedure:
>> Graph >> Legacy Dialogs >> Scatter/Dot >> Simple Scatter >> Define >> Y Axis
(pre_1) >> X Axis (res_1) >> Paste syntax >> Run.
_Result: Output shows random distribution with no patterns, which is good: linear
association.
_Solution: if a pattern appears, use weighted least squares.
(10). Heteroskedasticity:
_Heteroskedasticity occurs when there is a pattern in the residuals, specifically when the
standard deviations of the residuals are uneven. Homoskedasticity occurs when there is
no pattern among the residuals.
_Method of Testing the Equality of Variance Assumption: The error term (residuals) has
a zero mean and is normally distributed, so plotting the Studentized residuals against the
standardized predicted values should show no pattern. The Studentized residuals should
also be plotted against each independent variable.
_Procedure:
>> Graph >> Legacy Dialogs >> Scatter/Dot >> Simple Scatter >> Define >> Y Axis
(sdr_1) >> X Axis (zpr_1) >> Paste syntax >> Run.
>> Graph >> Legacy Dialogs >> Scatter/Dot >> Simple Scatter >> Define >> Y Axis
(sdr_1) >> X Axis (anger) >> Paste syntax >> Run.
>> Graph >> Legacy Dialogs >> Scatter/Dot >> Simple Scatter >> Define >> Y Axis
(sdr_1) >> X Axis (sympathy) >> Paste syntax >> Run.
>> Graph >> Legacy Dialogs >> Scatter/Dot >> Simple Scatter >> Define >> Y Axis
(sdr_1) >> X Axis (efficacy) >> Paste syntax >> Run.
>> Graph >> Legacy Dialogs >> Scatter/Dot >> Simple Scatter >> Define >> Y Axis
(sdr_1) >> X Axis (dsex) >> Paste syntax >> Run.
_Result: Output shows random distribution with no patterns, which is good: the error
term has a zero mean.
_Solution: To achieve homoskedasticity, use weighted least squares, or add whatever
omitted variable caused the pattern.
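A crude numeric companion to the plots above (my own simplification of a Goldfeld-Quandt-style check, not an SPSS procedure): split the residuals by low versus high fitted values and compare variances; a ratio near 1 is consistent with homoskedasticity.

```python
import statistics

def variance_ratio(residuals, fitted):
    """Variance of residuals in the high-fitted half divided by the
    variance in the low-fitted half. Values far from 1 suggest
    heteroskedasticity."""
    paired = sorted(zip(fitted, residuals))
    half = len(paired) // 2
    low = [r for _, r in paired[:half]]
    high = [r for _, r in paired[half:]]
    return statistics.pvariance(high) / statistics.pvariance(low)
```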
(11). Autocorrelation (First Order Serial Correlation):
_Method of Testing the Independence of Error Assumption: This is a test of serial
correlation and indicates whether there is a sequential relationship between residuals.
Serial correlation violates the assumption of the independence of the residuals.
_Procedure: The Durbin-Watson statistic was provided by the regression results = 2.133.
_Durbin-Watson statistics vary between 0 to 4.
_Consult Table B-4: significance of 0.05.
_Step 1: Determine the number of cases: n = 74.
_Step 2: Determine the number of independent variables: IV = 3 (= k).
_Step 3: Locate 2.133 in relation to (dL=1.53) and (dU=1.70).
_Result: 2.133 is greater than dU of 1.70, so there is no first order autocorrelation.
_General Indicator: a Durbin-Watson close to 2 usually indicates no autocorrelation.
_Solution: weighted least squares.
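The Durbin-Watson statistic itself is simple to compute from the saved residuals (a sketch; SPSS already reports it, as above):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences of
    the residuals over their sum of squares. Ranges from 0 to 4; values
    near 2 suggest no first-order autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den
```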
To calculate the predicted outcome:
Outcome = constant + IV1*(B coefficient) + IV2*(B coefficient) + IV3*(B coefficient) + ...
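Plugging the reported coefficients into that equation, with hypothetical IV values (mid-scale feelings of 4 on each 7-point scale, female):

```python
# Coefficients from the regression output above.
constant = -4.272
b = {"anger": 0.300, "sympathy": 0.499, "efficacy": 0.435, "dsex": -0.455}

def predict_zhelp(anger, sympathy, efficacy, dsex):
    """Plug IV values into the fitted equation to predict the DV."""
    return (constant + b["anger"] * anger + b["sympathy"] * sympathy
            + b["efficacy"] * efficacy + b["dsex"] * dsex)

example = predict_zhelp(4, 4, 4, 1)  # hypothetical respondent
```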
_Slope Interpretation:
_The slope is 0.77.
_Therefore, for every one-unit increase in the independent variable, there is a 0.77 unit
increase in the dependent variable.
Data Assumptions of Linear Regression:
To get BLUE = “Best Linear Unbiased Estimate.”
(1). Interval level data (not nominal/categorical, or ordinal).1
Solution: convert categorical data to dummy variables or combine it into an index/scale (see the footnote).
(2). Data has a normal distribution (as per the CLT).
Solution: seek larger n.
(3). There are no omitted variables (creates omitted variable bias).
Solution: include the omitted variables.
(4). The data is linearly distributed:
 It does not deviate or curve or skew in any direction.
 It does not have significant bulges.
 It does not significantly increase its dispersal.
Solution: transform the data.
(5). No multicollinearity:
No IV is a perfect linear function of other explanatory variables (flagged at r > 0.75).
Solution: exclusion of the replicated variable.
(6). There is no (serial) autocorrelation:
 The error terms are uncorrelated with each other (correlated error terms imply
omitted variable bias).
 The individual residuals do not covary with any IV (covariance with an IV implies an
omitted variable).
Solution: Weighted Least Squares.
(7). The error term is homoskedastic (does not suffer from heteroskedasticity).
 The error term has constant variance (if not, significance of findings drop):
Solution: Weighted Least Squares.
(8). The error term has a zero mean and is normally distributed:
 It does not deviate or curve or skew in any direction (if it does, this implies omitted
variable bias).
Solution: Weighted Least Squares.
1 Exception (thanks to Prof Perella): 1- Categorical data can be converted to dummy variables, which are compatible with regression; 2- Categorical data can be combined into an index/scale, which qualifies as continuous and is thus appropriate for regression.