Psychology 522/622
Winter 2008
Lab Lecture #1
Multiple Regression: Dummy Coding and Regression Diagnostics
The goals of this lab are to:
• familiarize the students with multiple regression for categorical predictors
• introduce some diagnostic procedures to find strange data points
Oneway ANOVA and Multiple Regression
DATAFILE: HEALTH.SAV
The following data are from an experiment that explored the effect of health intervention
type on health status. There were three health intervention types, a control, moderate,
and maximum intervention. Outcomes were assessed on a health status scale where
higher numbers indicate better health status. The data appear in the data set “health.sav.”
We first analyze these data using ANOVA. Here is some SPSS syntax. (All the SPSS
syntax in this lab lecture is included in the syntax file “dummycodes.sps”.)
ONEWAY
health BY treat
/STATISTICS DESCRIPTIVES HOMOGENEITY
/PLOT MEANS
/MISSING ANALYSIS
/POSTHOC = TUKEY ALPHA(.05).
Alternatively, here are the drop down menu commands that give you the same
information:
Analyze Compare Means  Oneway ANOVA
Move health to the DV box, move treat to the Factor box
Click Post Hoc: select Tukey; Click Continue
Click Options: select descriptives, homogeneity of variance test, and means plot; Click
Continue
Click OK
Oneway

Descriptives: health
(95% CI = 95% Confidence Interval for Mean)

                        N    Mean     Std. Deviation   Std. Error   95% CI Lower   95% CI Upper   Minimum   Maximum
Control                 20   4.2000   1.47256          .32927       3.5108         4.8892         1.00      7.00
Moderate Intervention   20   6.1000   2.48998          .55678       4.9347         7.2653         1.00      9.00
Maximum Intervention    20   7.3000   2.07998          .46510       6.3265         8.2735         2.00      10.00
Total                   60   5.8667   2.39680          .30943       5.2475         6.4858         1.00      10.00
Descriptively by inspection of the group means, it looks like health status improved with
the degree of intervention.
Test of Homogeneity of Variances: health

Levene Statistic   df1   df2   Sig.
2.686              2     57    .077
There is not strong evidence that the assumption of variance homogeneity is violated.
This is good.
ANOVA: health

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   97.733           2    48.867        11.548   .000
Within Groups    241.200          57   4.232
Total            338.933          59
We reject the null hypothesis that all three population health status means are equal to
one another.
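(Where does the F come from? F = MS between / MS within = 48.867 / 4.232 = 11.548,
evaluated on 2 and 57 degrees of freedom.)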
Post Hoc Tests

Multiple Comparisons
Dependent Variable: health
Tukey HSD

(I) treat               (J) treat               Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower   95% CI Upper
Control                 Moderate Intervention   -1.90000*                .65051       .014   -3.4654        -.3346
Control                 Maximum Intervention    -3.10000*                .65051       .000   -4.6654        -1.5346
Moderate Intervention   Control                  1.90000*                .65051       .014    .3346         3.4654
Moderate Intervention   Maximum Intervention    -1.20000                 .65051       .164   -2.7654        .3654
Maximum Intervention    Control                  3.10000*                .65051       .000    1.5346        4.6654
Maximum Intervention    Moderate Intervention    1.20000                 .65051       .164   -.3654         2.7654

*. The mean difference is significant at the .05 level.

Homogeneous Subsets

health
Tukey HSD(a)

treat                   N    Subset for alpha = .05
                             1        2
Control                 20   4.2000
Moderate Intervention   20            6.1000
Maximum Intervention    20            7.3000
Sig.                         1.000    .164

Means for groups in homogeneous subsets are displayed.
a. Uses Harmonic Mean Sample Size = 20.000.
So what’s the story here? It is better to receive either the moderate or maximum
intervention rather than the control intervention; however, there does not appear to be a
statistically significant difference between the moderate and maximum interventions.
Means Plots

[Means plot: Mean of health (y axis, roughly 4.00 to 7.00) plotted against Treatment
(x axis: Control, Moderate Intervention, Maximum Intervention).]
This graph provides a picture of the means as a function of intervention type.
We now analyze these data using multiple regression. Yay!
We first need to create dummy codes for our treatment levels. As there are three levels of
treatment, we will need only two dummy codes. We’ll set the control group to be the
“base category.”
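Concretely, the coding scheme assigns each group the following values on the two new
variables:

treat                   moderate   maximum
Control                 0          0
Moderate Intervention   1          0
Maximum Intervention    0          1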
One can create these dummy codes using the SPSS menus. Another way is to use syntax.
Here is what the SPSS syntax looks like for the creation of the dummy codes.
RECODE
treat
(2=1) (ELSE=0) INTO moderate .
EXECUTE .
RECODE
treat
(3=1) (ELSE=0) INTO maximum .
EXECUTE .
Running this syntax creates two new variables that indicate, respectively, the moderate
and maximum intervention groups. The new data set with the dummy codes is saved as
“health1.sav”.
Alternatively, here are the drop down commands for creating the dummy variables:
TransformRecode into Different Variables
Move treat into the Input VariableOutput Variable box
*In the Output Variable section, type in Moderate [Maximum] in the name box
Click Change
Click Old and New Values
*On the left side of the window in the Old Value section, type in 2 [3] in the Value box
(this is because in the treat variable, moderate = 2)
On the right side of the window in the New Value section, type 1 in the Value box
Click Add
On the left side of the window, click All Other Values
On the right side of the window, type 0 in the Value box
Click Add, Click Continue, Click OK.
*Repeat these steps to create the dummy variable for maximum, but when you see an
asterisk use the information provided in the brackets.
DATAFILE: HEALTH1.SAV
We now run our regression of health status on intervention type. We specify health as
the outcome variable and moderate and maximum as the predictor variables. Note that
we do not include the control group explicitly as a predictor variable. We see that things
work out fine without it.
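(Why does it work without a control dummy? With an intercept in the model, a third
dummy would be redundant: for every case, control = 1 − moderate − maximum. The
intercept ends up carrying the control group’s mean, as we will see in the coefficients
table.)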
Here is some syntax.
REGRESSION
/DESCRIPTIVES CORR
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT health
/METHOD=ENTER moderate maximum .
Alternatively, here are the drop down commands:
AnalyzeRegressionLinear
Move health to the DV box
Move Moderate and Maximum to the IV box
Click Statistics; select Descriptives (estimates of effect size and model fit should
already be checked)
Click Continue, Click OK.
Regression

Correlations (Pearson)

           health   moderate   maximum
health     1.000    .069       .426
moderate   .069     1.000      -.500
maximum    .426     -.500      1.000
We see that the indicator for the moderate intervention correlates weakly with the health
status variable, but that the maximum intervention correlates more substantially with the
health status variable. Note that there is a negative correlation (-.50) between the
moderate and maximum intervention variables. This must be the case because if you are
in one of the groups you cannot be in the other group.
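In fact, with equal group sizes the −.50 is exact: each dummy equals 1 for a third of the
cases, so each has mean 1/3 and variance (1/3)(2/3) = 2/9. Because the two dummies are
never 1 at the same time, their cross-product is always 0, which gives covariance
0 − (1/3)(1/3) = −1/9. The correlation is therefore (−1/9) / (2/9) = −.50.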
Variables Entered/Removed(b)

Model   Variables Entered      Variables Removed   Method
1       maximum, moderate(a)   .                   Enter

a. All requested variables entered.
b. Dependent Variable: health
Model Summary(b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .537(a)  .288       .263                2.05708

a. Predictors: (Constant), maximum, moderate
b. Dependent Variable: health
28.8% of the variance in health status is attributable to the three intervention groups.
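You can verify this from the ANOVA table that follows: R Square = SS(Regression) /
SS(Total) = 97.733 / 338.933 = .288.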
ANOVA(b)

Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   97.733           2    48.867        11.548   .000(a)
Residual     241.200          57   4.232
Total        338.933          59

a. Predictors: (Constant), maximum, moderate
b. Dependent Variable: health
Note that although the terminology has changed, the numbers in this ANOVA summary
table are identical to those for the ANOVA analysis. The take home message from both
analyses is that intervention type is related to health status.
In the ANOVA, we needed to run post-hoc tests to look for group mean differences. In
regression, because we have dummy coded our predictors, two of these 1df contrasts
appear automatically as regression slopes.
Coefficients(a)

             Unstandardized Coefficients   Standardized Coefficients
Model        B        Std. Error           Beta                        t       Sig.
(Constant)   4.200    .460                                             9.131   .000
moderate     1.900    .651                 .377                        2.921   .005
maximum      3.100    .651                 .615                        4.766   .000

a. Dependent Variable: health
Note the unstandardized coefficients.
The constant (4.2) is the mean of the control intervention group (check this with the
ANOVA results above).
The regression slope for moderate intervention is the difference between the means for
the moderate and control interventions (check this with the ANOVA results above).
This difference is statistically significant.
The regression slope for maximum intervention is the difference between the means for
the maximum and control interventions (check this with the ANOVA results above).
This difference, too, is statistically significant.
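As a quick check against the group means from the ANOVA descriptives: 6.1000 − 4.2000
= 1.9000 (the moderate slope) and 7.3000 − 4.2000 = 3.1000 (the maximum slope).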
Note that we do not get a test of whether the moderate and maximum intervention means
differ. To get this using multiple regression, we would need to change one of these
groups to the “base category” which is simple enough to do.
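For instance, here is a minimal sketch of that reanalysis (this syntax is ours, not part of
“dummycodes.sps”; it assumes the control group is coded 1 in treat, consistent with
moderate = 2 and maximum = 3 above, and the new variable name control is arbitrary).
With moderate as the base category, the slope for maximum estimates the maximum −
moderate mean difference (7.3 − 6.1 = 1.2) and provides a direct test of it:

* Dummy-code control; maximum already exists from the recodes above.
RECODE
treat
(1=1) (ELSE=0) INTO control .
EXECUTE .
REGRESSION
/DEPENDENT health
/METHOD=ENTER control maximum .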
Finding Strange Data Points
Here we consider ways to find strange data points in the simple case where we are
interested in bivariate associations (e.g., correlations and simple regression slopes).
Although these methods may seem unnecessary in this case because we can find such
data points simply from looking at a scatter plot, these methods are more useful when we
consider multivariate associations. Thus, we seek to gain understanding of these tools in
the simple case.
DATAFILE: LUMBERSHORT.SAV
This data set contains the results of two tests of lumber stiffness on 28 planks. The data
are contained in the SPSS file “lumbershort.sav”. Of interest is how the results of these
two tests of lumber stiffness are related. The SPSS syntax for this section of the lab
lecture is contained in the SPSS syntax file “lumber.sps”.
First, we compute the correlation between test results. Can you remember how to do this
via the drop down menus? The syntax is below.
CORRELATIONS
/VARIABLES=static1 static2
/PRINT=TWOTAIL NOSIG
/MISSING=PAIRWISE .
Correlations

                                static1   static2
static1   Pearson Correlation   1         .739**
          Sig. (2-tailed)       .         .000
          N                     28        28
static2   Pearson Correlation   .739**    1
          Sig. (2-tailed)       .000      .
          N                     28        28

**. Correlation is significant at the 0.01 level (2-tailed).
The two test scores are positively correlated and this correlation is statistically
significant.
Next, we explore this data by looking at a simple scatter plot. (Hopefully, you would do
such exploration BEFORE you conduct the preceding analysis!)
GRAPH
/SCATTERPLOT(BIVAR)=static2 WITH static1
/MISSING=LISTWISE .
Alternatively, the drop down commands:
Graphs → Scatter (Note: if you’re using SPSS 14 or higher, you need to go
Graphs → Legacy Dialogs → Scatter)
Select Simple Scatter
Move static2 to the X axis and static1 to the Y axis
Click OK
[Scatter plot: Lumber stiffness static1 test (y axis, 1000.00 to 2200.00) against Lumber
stiffness static2 test (x axis, 1000.00 to 2250.00). One point, circled in red, sits far from
the rest.]
Clearly, there is one data point that is much different from the rest. (Note: SPSS does not
circle it in red for you; that’s a favor from me!) Its score is low on one of the tests and
high on the other test. Inspection of the data set reveals that this is Plank #13. In newer
versions of SPSS, if you use the Chart Editor you can actually right-click on that data
point and, if you select “Go to Case,” SPSS will take you to it in the dataset. At this point
we should decide what to do with this data point.
Although this point was easy to find with bivariate data, we now examine techniques that
are especially useful with multivariate data where bivariate plots are not sufficient. We’ll
be interested in seeing whether these more complicated techniques identify Plank #13 as
problematic. We do these analyses within SPSS Regression. (Note that these techniques
are useful even if we don’t plan on conducting a regression analysis, e.g., when preparing
for a factor analysis.)
In this case, the variable that serves as the outcome variable (DV) is arbitrary. Thus, we
will regress Test 1 on Test 2 with some additional syntax. So, test 1 = DV and test 2 = IV.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT static1
/METHOD=ENTER static2
/CASEWISE PLOT(ZRESID) ALL
/SAVE MAHAL COOK LEVER .
Alternatively, here are the drop down commands:
Analyze Regression  Linear
Move static1 to the DV box, move static2 to the IV box
Click Statistics; Click Casewise diagnostics, Click All Cases; Click Continue
Click Save; Select Mahalanobis, Cook’s, Leverage values; Click Continue; Click OK
Regression
Variables Entered/Removed(b)

Model   Variables Entered   Variables Removed   Method
1       static2(a)          .                   Enter

a. All requested variables entered.
b. Dependent Variable: static1
Model Summary(b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .739(a)  .547       .529                181.76823

a. Predictors: (Constant), static2
b. Dependent Variable: static1
ANOVA(b)

Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   1035385          1    1035384.767   31.338   .000(a)
Residual     859031.9         26   33039.690
Total        1894417          27

a. Predictors: (Constant), static2
b. Dependent Variable: static1
Coefficients(a)

             Unstandardized Coefficients   Standardized Coefficients
Model        B         Std. Error          Beta                        t       Sig.
(Constant)   353.045   208.312                                         1.695   .102
static2      .678      .121                 .739                       5.598   .000

a. Dependent Variable: static1
Casewise Diagnostics(a)

Case Number   Std. Residual   static1   Predicted Value   Residual
1             .014            1561.00   1558.4177         2.58225
2             1.345           2087.00   1842.4734         244.52656
3             -.244           1815.00   1859.4219         -44.42188
4             -1.553          1110.00   1392.3231         -282.323
5             -.086           1614.00   1629.6012         -15.60115
6             .208            1439.00   1401.1363         37.86369
7             -1.182          1271.00   1485.8785         -214.878
8             .514            1717.00   1623.4997         93.50028
9             .047            1384.00   1375.3747         8.62531
10            .192            1518.00   1483.1667         34.83329
11            -.070           1627.00   1639.7702         -12.77021
12            .339            1595.00   1533.3341         61.66593
13            3.686           1901.00   1230.9741         670.02594
14            -.693           1389.00   1515.0298         -126.030
15            -.816           1002.00   1150.2995         -148.300
16            .067            1252.00   1239.7872         12.21275
17            .326            1602.00   1542.8252         59.17481
18            -.858           1313.00   1468.9300         -155.930
19            -1.305          1547.00   1784.1708         -237.171
20            .372            1422.00   1354.3586         67.64136
21            -.500           1290.00   1380.7982         -90.79819
22            -.484           1646.00   1734.0035         -88.00349
23            -.200           1356.00   1392.3231         -36.32312
24            -.610           1238.00   1348.9351         -110.935
25            .576            1701.00   1596.3822         104.61777
26            -.119           1414.00   1435.7111         -21.71111
27            1.086           2065.00   1867.5571         197.44288
28            -.052           1214.00   1223.5168         -9.51675

a. Dependent Variable: static1
Note that Plank #13 has the largest residual (standardized and raw). This is consistent
with the scatter plot results. The value of that observation is very different from its
predicted value on the regression line.
Residuals Statistics(a)

                                Minimum     Maximum     Mean        Std. Deviation   N
Predicted Value                 1150.2996   1867.5571   1503.2143   195.82539        28
Std. Predicted Value            -1.802      1.861       .000        1.000            28
Std. Error of Predicted Value   34.416      73.593      46.792      13.297           28
Adjusted Predicted Value        1150.4421   1867.7739   1501.0418   195.68737        28
Residual                        -282.323    670.02594   .00000      178.37040        28
Std. Residual                   -1.553      3.686       .000        .981             28
Stud. Residual                  -1.592      3.901       .006        1.034            28
Deleted Residual                -296.431    750.55786   2.17248     198.10760        28
Stud. Deleted Residual          -1.643      5.942       .077        1.348            28
Mahal. Distance                 .004        3.462       .964        1.151            28
Cook's Distance                 .000        .915        .058        .174             28
Centered Leverage Value         .000        .128        .036        .043             28

a. Dependent Variable: static1
For our purposes here we focus on the last three rows of this table.
Recall that Mahalanobis’ distance is a measure of the distance of each data point from the
multivariate centroid (the vector of variable means) that takes into account the variances
and covariances of the variables. Note that the maximum value here is 3.46. This value
means nothing in and of itself, but comparing it to the mean value of .964 and the
minimum value of .004, we see that the maximum is much farther above the mean than
the minimum is below the mean. Think about what this would look like if you put these
three points on a distribution. This suggests an asymmetric distribution, and that this
maximum may indicate one or more problematic data points.
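(As an aside, in SPSS regression output Mahalanobis’ distance and centered leverage are
two versions of the same information: Mahalanobis’ distance = (n − 1) × centered
leverage. You can check this against the table above: 27 × .128 ≈ 3.46 for the maxima
and 27 × .036 ≈ .96 for the means, give or take rounding in the displayed values.)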
Cook’s distance is another distance measure; it indexes how much the fitted regression
coefficients would change if a given case were deleted. The maximum value of this
distance measure is much farther above the mean value than the minimum value is below
the mean value. Again this suggests an asymmetric distribution and that this maximum
may indicate one or more problematic data points.
Centered leverage is an index of the potential influence of each data point on the
regression slope. Larger values indicate a greater degree of potential influence on this
slope. The maximum leverage value is much farther above the mean value than the
minimum value is below the mean value. Again this suggests an asymmetric distribution
and that this maximum may indicate one or more problematic data points.
However, from the above info, we don’t know which data point (or points) is potentially
problematic. You’ll notice in the SPSS data file, after you run the syntax, that three new
variables have been added (MAH_1, COO_1, LEV_1). We can now use boxplots of these
values to look for the offending case(s). Remember that although histograms will give us
an idea of what the distributions of the Mahalanobis, Cook’s, and leverage values look
like, they don’t give us a way to identify specific cases. Boxplots, on the other hand, give
us an idea of the distribution and help us to identify outliers.
EXAMINE
 VARIABLES=MAH_1
 /COMPARE VARIABLE
 /PLOT=BOXPLOT
 /STATISTICS=NONE
 /NOTOTAL
 /ID=plank
 /MISSING=LISTWISE .
Alternatively, the drop down commands:
GraphsBoxplots
Select Simple, Summaries of Separate Variables; Click Define
Move Mahalanobis to the “Boxes Represent” box
Move Plank number to the Label Cases by box; Click OK
[Boxplot: Mahalanobis Distance (y axis from 0 to 4).]
The median (heavy black bar) is less than .5, but we know the mean = .964. So what? If
the median is smaller than the mean, it indicates asymmetry (more specifically, positive
skew). Imagine this plot tipped over onto its right side. We can see that the majority of
the values cluster between 0 and 2, while the rest slide out into the right tail.
Now let’s create a boxplot for Cook’s distance and Leverage values. We’ll put these two
variables in the same graph to save time/space.
EXAMINE
 VARIABLES=COO_1 LEV_1
 /COMPARE VARIABLE
 /PLOT=BOXPLOT
 /STATISTICS=NONE
 /NOTOTAL
 /ID=plank
 /MISSING=LISTWISE .
Drop down commands:
GraphsBoxplotsSimple, Summaries of Separate Variables Define
Move Cook’s and Leverage values to the “Boxes Represent” box
Move Planks to the “Label Cases by” box; Click OK.
[Boxplots: Cook's Distance and Centered Leverage Value (y axis from 0.0 to 1.0). Plank
13 is labeled as an extreme point near the top of the Cook's Distance boxplot; planks 2,
27, 15, 19, and 4 are labeled as milder outliers.]
These boxplots illustrate the positive skew in these distributions. Cook’s distance in this
case clearly identifies one observation that is much different from the others: Plank #13.
In this case, Mahalanobis’ distance and the leverage values appear less sensitive to the
type of discrepancy that Plank #13 exhibits.
Thus, in summary, it is a good idea to inspect all four types of statistics (i.e., residuals,
Mahalanobis’ distance, Cook’s distance, and leverage), as each can pick up on a different
type of discrepancy.
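Finally, if you want a quick numeric screen to complement the boxplots, one common
rule of thumb (not part of this lab’s output; the cutoff and the variable name cook_flag
are ours) flags cases whose Cook’s distance exceeds 4/n, here 4/28 ≈ .14:

* Flag cases whose Cook's distance exceeds the 4/n rule of thumb.
COMPUTE cook_flag = (COO_1 > 4/28) .
EXECUTE .

By this screen, Plank #13 (Cook’s distance = .915, the maximum in the table above) is
flagged.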