Handout - University of Iowa

Statistics Outreach Center
Short Course
SPSS ANOVA/Regression
Wednesday, February 19, 2014
6:00 – 8:00 pm
N106 LC
Topics Covered:

Analysis of Variance
o One-Way ANOVA
o Two-way ANOVA
o ANCOVA
o MANOVA

Regression Analysis
o Regression
o Logistic Regression

Data Management Syntax

Syntax for Common Analyses

Helpful Links
Overview
This course is designed for users with some SPSS experience. The first sections introduce
users to ANOVA and Regression analyses. The remaining sections describe some data
management issues, commonly used inferential statistics syntax, and other related topics.
Throughout this tutorial, a sample dataset, Employee data.sav, is used for all examples. This
example dataset can be downloaded from the short course webpage of the Statistics
Outreach Center (http://www.education.uiowa.edu/centers/soc/shortcourses.aspx).
Getting Started

To open SPSS, go to the Start icon on your Windows computer. You should find
SPSS under the Programs menu item. SPSS is not actually installed on these computers;
we are accessing it through the Virtual Desktop (for more info go to
http://helpdesk.its.uiowa.edu/virtualdesktop).

If SPSS isn’t listed under Programs, you may need to access it through the Virtual
Desktop website
(https://virtualdesktop.uiowa.edu/Citrix/VirtualDesktop/auth/login.aspx).

When using the Virtual Desktop to access SPSS, you can only open and save files
from your University of Iowa personal drive (the H: drive) or from a data source
(e.g., flash drive) you have connected prior to opening SPSS.

When using the Virtual Desktop, a dialog box may appear asking for read/write
access. If you want to use and save files, you need to agree to give CITRIX full
access.
When SPSS opens, it will present you with a “What would you like to do?” dialog box.
For now, click the Cancel button.
Section 1: Analysis of Variance
1.1. One-Way ANOVA
This section covers comparing groups on one or more dependent variables in SPSS. If you
have one categorical independent variable and one interval dependent variable, the
One-Way ANOVA procedure is appropriate.
Analyze > Compare Means > One-Way ANOVA...
One-Way ANOVA: Used to test if the population means of two or more groups are equal.
H0: μ1 = μ2 = μ3 = … = μk
H1: At least one μi ≠ μj
Example: Does the population mean for current salary differ by employment category?
To conduct the one-way ANOVA, first select the independent and dependent variables to
produce the following dialog box:
The above options will produce the following output (some output is omitted):
Test of Homogeneity of Variances
Current Salary
Levene Statistic    df1    df2    Sig.
59.733              2      471    .000
ANOVA
Current Salary
                  Sum of Squares    df     Mean Square    F          Sig.
Between Groups    8.9E+010          2      4.472E+010     434.481    .000
Within Groups     4.8E+010          471    102925714.5
Total             1.4E+011          473
In the above example, the null hypothesis is that the three categories of employment do
not differ in their mean salaries. The F statistic has a value of 434.481 with an associated
significance level of .000 (technically, the p-value is less than 0.001). Because this
p-value is below .05, the hypothesis of no difference among the three groups is rejected at
the .05 significance level. Accordingly, we conclude that the three employment categories
(Clerical, Custodial, and Manager) differ in their mean salaries.
To find out which pairs of means differ significantly, we need to request
follow-up tests using the Post Hoc option. Click on the Post Hoc option, then select the
post hoc test or tests of interest. We will use Fisher’s Least Significant Difference
(LSD) test.
The results indicate that people in job category 3 (Manager) are paid significantly more
than people in job categories 1 and 2 (Clerical and Custodial), and that there is no
significant difference between categories 1 and 2.
Multiple Comparisons
Dependent Variable: salary
LSD
                          Mean Difference                           95% Confidence Interval
(I) jobcat  (J) jobcat    (I-J)             Std. Error    Sig.     Lower Bound    Upper Bound
1           2             $-3,100.349       $2,023.760    .126     $-7,077.06     $876.37
            3             $-36,139.258*     $1,228.352    .000     $-38,552.99    $-33,725.53
2           1             $3,100.349        $2,023.760    .126     $-876.37       $7,077.06
            3             $-33,038.909*     $2,244.409    .000     $-37,449.20    $-28,628.62
3           1             $36,139.258*      $1,228.352    .000     $33,725.53     $38,552.99
            2             $33,038.909*      $2,244.409    .000     $28,628.62     $37,449.20
*. The mean difference is significant at the 0.05 level.
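The menu steps in this section can also be run from a syntax window. The following is a minimal sketch, assuming the variable names salary and jobcat from Employee data.sav. It approximates what the One-Way ANOVA dialog pastes and is not necessarily the exact syntax used to produce the output shown here.

```spss
* One-way ANOVA of current salary by employment category.
* HOMOGENEITY requests the Levene test and POSTHOC=LSD the pairwise follow-up tests.
ONEWAY salary BY jobcat
  /STATISTICS DESCRIPTIVES HOMOGENEITY
  /MISSING ANALYSIS
  /POSTHOC=LSD ALPHA(0.05).
```

Replacing LSD with TUKEY on the /POSTHOC subcommand requests Tukey's HSD test instead, which adjusts for the number of comparisons.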
1.2. Two-Way ANOVA
When there is more than one independent variable, the analysis is done by selecting
General Linear Model (GLM) procedures in the Analyze menu. If the analysis involves
independent groups and one dependent variable, choose
Analyze > General Linear Model > Univariate...
Example: Is current salary dependent on minority and employment category?
For this example, the Dependent Variable is Current Salary (salary) and the Fixed
Factors are Minority (minority) and Employment Category (jobcat).
You can plot the means to get a visual understanding of the results. Click Plots, then
add jobcat to the Horizontal Axis box and minority to the Separate Lines box.
Between-Subjects Factors
                                Value Label    N
Employment Category        1    Clerical       363
                           2    Custodial      27
                           3    Manager        84
Minority Classification    0    No             370
                           1    Yes            104
Tests of Between-Subjects Effects
Dependent Variable: Current Salary
                     Type III Sum
Source               of Squares      df     Mean Square    F           Sig.
Corrected Model      9.034E+010a     5      1.807E+010     177.742     .000
Intercept            1.537E+011      1      1.537E+011     1511.773    .000
jobcat               2.596E+010      2      1.298E+010     127.699     .000
minority             237964814       1      237964814.4    2.341       .127
jobcat * minority    788578413       2      394289206.5    3.879       .021
Error                4.757E+010      468    101655279.9
Total                6.995E+011      474
Corrected Total      1.379E+011      473
a. R Squared = .655 (Adjusted R Squared = .651)
[Figure: Estimated Marginal Means of Current Salary. Horizontal axis: Employment Category
(Clerical, Custodial, Manager). Vertical axis: Estimated Marginal Means ($20,000 to
$80,000). Separate lines: Minority Classification (No, Yes).]
Assuming alpha = .05, the jobcat main effect and the jobcat by minority interaction are
significant; the minority main effect is not (p = .127). The change in the simple main
effect of one variable over levels of the other is most easily seen in the graph of the
interaction. If the lines describing the simple main effects are not parallel, an
interaction may be present. Here the significant interaction term in the summary table
(p = .021) confirms it.
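The two-way analysis in this section can be sketched in syntax as follows. This assumes the variable names salary, jobcat, and minority from Employee data.sav and mirrors the dialog choices described in this section; treat it as an approximation rather than the exact pasted syntax.

```spss
* Two-way ANOVA of salary by employment category and minority classification.
* PROFILE(jobcat*minority) puts jobcat on the horizontal axis, one line per minority level.
UNIANOVA salary BY jobcat minority
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /PLOT=PROFILE(jobcat*minority)
  /CRITERIA=ALPHA(.05)
  /DESIGN=jobcat minority jobcat*minority.
```

The /DESIGN subcommand lists both main effects and their interaction, matching the rows of the Tests of Between-Subjects Effects table.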
1.3. ANCOVA
ANCOVA (analysis of covariance) is an extension of ANOVA. It examines whether group
means (categorical independent variable) differ on a dependent variable after statistically
controlling for one or more continuous variables (covariates). The analysis is done by selecting
General Linear Model (GLM) procedures in the Analyze menu. If the analysis involves
independent groups, choose
Analyze > General Linear Model > Univariate...
Example: Does salary differ for males and females after controlling for previous
experience?
For this example, the Dependent Variable is Current Salary (salary) and the Fixed
Factor is Gender (gender) and the covariate is Previous Experience (prevexp).
Under Options we can display adjusted means for the grouping variable, which in this case is gender.
Note that if you have more than two groups, you can compare them by selecting Contrasts
rather than post hoc analyses.
The output shows there is a significant difference in salary between males and
females after controlling for previous experience, F(1, 471) = 137.020, p < .001. The
second table gives the adjusted mean salary for each group, evaluated at the mean of the
covariate. If we compare these to the descriptive statistics, we see the means have changed
slightly but are still significantly different.
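A syntax sketch of this ANCOVA, assuming the variable names salary, gender, and prevexp from Employee data.sav. The /EMMEANS line is an assumed way to request the adjusted means described in this section.

```spss
* ANCOVA: gender differences in salary, adjusting for previous experience.
* WITH names the covariate and EMMEANS reports means adjusted to the covariate mean.
UNIANOVA salary BY gender WITH prevexp
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /EMMEANS=TABLES(gender) WITH(prevexp=MEAN)
  /PRINT=DESCRIPTIVE
  /CRITERIA=ALPHA(.05)
  /DESIGN=prevexp gender.
```

Requesting /PRINT=DESCRIPTIVE alongside /EMMEANS gives both the unadjusted and the adjusted group means, so they can be compared directly.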
1.4. MANOVA
MANOVA (multivariate analysis of variance) is an extension of ANOVA in which there
are two or more dependent variables, here with one categorical independent variable. The
analysis is done by selecting General Linear Model (GLM) procedures in the Analyze
menu. If the analysis involves independent groups, choose
Analyze > General Linear Model > Multivariate...
Example: We want to know whether groups differ on a set of variables. In this case,
do the three job categories differ on job characteristics (salary, beginning salary,
and jobtime)? These three variables are our dependent variables and jobcat is the fixed
factor.
Results: Wilks’ Lambda indicates there is a significant difference in job characteristics
based on job category, F(6, 938) = 117.402, p < .001, Wilks’ Λ = .326.
The table labeled Tests of Between-Subjects Effects contains univariate ANOVAs, so
an alpha correction (such as Bonferroni) needs to be made. We can see from this table
that job category has a significant effect on salary and beginning salary but not on
jobtime.
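The MANOVA in this section can be sketched in syntax as follows, assuming the variable names salary, salbegin, jobtime, and jobcat from Employee data.sav. Listing several dependent variables before BY is what makes GLM run the multivariate tests.

```spss
* MANOVA: three job characteristics by employment category.
* Output includes the multivariate tests (Wilks' Lambda) and the univariate ANOVAs.
GLM salary salbegin jobtime BY jobcat
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /CRITERIA=ALPHA(.05)
  /DESIGN=jobcat.
```

With three dependent variables, a Bonferroni correction would compare each univariate p-value to .05 / 3, or about .017, rather than .05.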
Section 2: Regression Analysis
2.1. Regression
Regression models can be used to predict or explain values on a (dependent) variable
based on information from other (independent) variables.
Overall Model Fit (F-Test): Used to test if the regression model is “better” than using
only the mean of the dependent variable.
H0: Y = β0
H1: Y = β0 + β1X1 + … + βkXk
Test for a Single βk: Used to test if βk differs from zero.
H0: βk = 0
H1: βk ≠ 0
Example: What is the regression model for using “educational level” and “years
experience” to predict salary?
Everything we need to create a linear regression model is located in the following menu:
Analyze > Regression > Linear…
The variable we are trying to “predict” is the Current Salary variable, which goes in the
Dependent box. Educational Level and Previous Experience go in the Independent(s)
box.
There are many options available in the linear regression dialog box. We’ll just look at
one: plotting the predicted value against the standardized residual allows you to examine
whether the errors seem to be random and whether homogeneity of variances appears to be
a reasonable assumption.
Model Summary
                                                   Std. Error of the
Model    R        R Square    Adjusted R Square    Estimate
1        .664a    .441        .439                 $12,788.694
a. Predictors: (Constant), Previous Experience (months), Educational Level (years)
ANOVAb
Model            Sum of Squares    df     Mean Square    F          Sig.
1  Regression    6.088E10          2      3.044E10       186.132    .000a
   Residual      7.703E10          471    1.636E8
   Total         1.379E11          473
a. Predictors: (Constant), Previous Experience (months), Educational Level (years)
b. Dependent Variable: Current Salary
Coefficientsa
                                                                   Standardized
                                   Unstandardized Coefficients     Coefficients
Model                              B             Std. Error        Beta            t         Sig.
1  (Constant)                      -20978.304    3087.258                          -6.795    .000
   Educational Level (years)       4020.343      210.650           .679            19.085    .000
   Previous Experience (months)    12.071        5.810             .074            2.078     .038
a. Dependent Variable: Current Salary
So, what does the output tell us?

R = .664 and R² = .441, which means that 44% of the variance in salary can be “accounted
for” by information about educational level and previous experience.

An F-statistic of 186.132 (p-value < 0.001) indicates that a regression model
containing educational level and previous experience is “better” than a model
without any predictor variables (using the mean salary as an estimate for everyone).

The regression equation is:
Salary = -20978 + 4020 * Educational Level (years) + 12 * Previous Experience (months)

The t-statistic for each βi is large enough in magnitude to reject the null
hypothesis that βi = 0 when α = .05.

The plot doesn’t look very random, so we may need to reconsider our analysis. This
is common for a variable like salary and indicates that we may want to transform the
variable or choose a different analysis.
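The regression in this section can be sketched in syntax as follows, assuming the variable names salary, educ, and prevexp from Employee data.sav. The /SCATTERPLOT line is an assumed way to request the residual plot discussed in this section.

```spss
* Linear regression of current salary on education and previous experience.
* SCATTERPLOT puts standardized residuals on the vertical axis and
* standardized predicted values on the horizontal axis.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT salary
  /METHOD=ENTER educ prevexp
  /SCATTERPLOT=(*ZRESID ,*ZPRED).
```

A patternless cloud of points in that plot would support the assumptions of random errors and constant variance. A funnel or curve suggests otherwise.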
2.2. Logistic Regression
Logistic regression predicts a binary outcome variable from one or more predictor variables.
Analyze > Regression > Binary Logistic…
Example: Can we predict an individual’s gender based on salary, jobtime, and previous
experience?
Here gender (either male or female) is our binary dependent variable and salary, jobtime,
and prevexp are the covariates.
What the output tells us:
The test of the overall model is significant, χ² = 180.206, p < .001.
Both salary and previous experience are significant predictors of gender; jobtime is not.
Based on the model in which we predict gender from salary, jobtime, and prevexp, we
correctly classify about 75% of individuals. Practically speaking, if we wanted to create
an efficient prediction model, we would not include variables that aren’t significant
predictors. Let’s see what happens if we remove jobtime from the model.
Our correct classification percentage is still about 75% with just salary and prevexp in
the model because, as we saw previously, jobtime is not a significant predictor of gender.
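Both logistic models in this section can be sketched in syntax as follows. This assumes the variable names gender, salary, jobtime, and prevexp from Employee data.sav, and that gender is stored as a two-valued variable the procedure accepts as the dependent variable. The /CRITERIA values are the usual dialog defaults, stated here as an assumption.

```spss
* Full model: predict gender from salary, jobtime, and previous experience.
LOGISTIC REGRESSION VARIABLES gender
  /METHOD=ENTER salary jobtime prevexp
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

* Reduced model: drop jobtime and compare the classification tables.
LOGISTIC REGRESSION VARIABLES gender
  /METHOD=ENTER salary prevexp
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
```

CUT(.5) sets the predicted-probability cutoff used to build the classification table from which the percent correct is read.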
Section 3: Data Management Syntax
Compute missing =nmiss(salary).
Execute.
Listwise deletion excludes all cases that have missing values for any variables in the
analyses. Pairwise deletion uses all cases that have valid responses for the variables in
each particular statistic being calculated. The default depends on the procedure:
Correlations, for example, uses pairwise deletion by default, while Regression uses
listwise deletion. Add the /missing = listwise subcommand to an analysis to request
listwise deletion.
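As a short illustration of the two deletion methods, the following sketch runs the same correlations both ways. It assumes the variable names salary, educ, and prevexp from Employee data.sav. If those variables have no missing values, the two runs give identical results.

```spss
* Pairwise: each correlation uses every case valid on that pair of variables.
CORRELATIONS
  /VARIABLES=salary educ prevexp
  /MISSING=PAIRWISE.

* Listwise: only cases valid on all three variables are used.
CORRELATIONS
  /VARIABLES=salary educ prevexp
  /MISSING=LISTWISE.
```

Comparing the N reported for each correlation in the two outputs shows how many cases listwise deletion drops.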
Add value labels gender 'm' 'Male' 'f' 'Female'/
minority 0 'No' 1 'Yes'/
item1 to item20 1 'Not at all true of me' 7 'Very true of me'.
Execute.
Sort cases by prevexp (A).
Compute salchange = salary-salbegin.
Execute.
* total is a placeholder name for the new variable; sum.3 requires 3 valid values.
Compute total = sum.3(i3, i6, i8).
Execute.
Recode item1 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) into Ritem1.
Execute.
Sort cases by gender.
Temporary.
Split file by gender.
Corr var = salary educ.
Reliability var = item1 to item20
/scale(SC) = all
/statistics descriptive scale corr cov
/summary = total.
Section 4: Syntax for Common Analyses
Frequencies
Freq var = educ
/statistics
/percentiles = 25 75
/format = notable.
Descriptives
Desc var = salary
/statistics = mean stddev
/sort = mean (D)
/save.
Chi-square
Crosstabs tables = jobcat by gender
/statistics = chisq phi
/cells = count sresid expected.
T-test
t-test
/groups = gender ('m' 'f')
/variables = salary.
Correlation
corr var = educ salary
/missing = listwise.
Regression
reg
/dependent salary
/method = enter educ jobtime.
ANOVA
glm salary by jobcat
/posthoc = jobcat (tukey)
/emmeans = tables (jobcat).
Graphs
Graph
/bar = jobcat
/title = 'Frequencies of Different Job Categories'.
Graph
/bar(grouped) = jobcat by gender
/title = 'Gender Differences in Job Categories'.
Graph
/line = educ by jobcat
/title = 'Distribution of educ by job category'.
Graph
/scatterplot = salary with educ.
Graph
/histogram(normal) = salary.
Section 5: Helpful Links
PowerPoint of various statistical analyses in SPSS:
What statistical analysis should I use?
Website of annotated output and code/syntax for various analyses in SPSS, Stata, SAS,
and Mplus:
http://www.ats.ucla.edu/stat/AnnotatedOutput/