Statistics Outreach Center Short Course
SPSS ANOVA/Regression
Wednesday, February 19, 2014, 6:00 – 8:00 pm, N106 LC

Topics Covered:
Analysis of Variance
 o One-Way ANOVA
 o Two-Way ANOVA
 o ANCOVA
 o MANOVA
Regression Analysis
 o Regression
 o Logistic Regression
Data Management Syntax
Syntax for Common Analyses
Helpful Links

Overview
This course is designed for users with some SPSS experience. The first sections introduce ANOVA and regression analyses. The remaining sections describe some data management issues, commonly used inferential statistics syntax, and other related topics. Throughout this tutorial, a sample dataset, Employee data.sav, is used for all examples. The dataset can be downloaded from the Statistics Outreach Center short course webpage (http://www.education.uiowa.edu/centers/soc/shortcourses.aspx).

Getting Started
To open SPSS, go to the Start icon on your Windows computer. You should find SPSS under the Programs menu item. SPSS is not actually installed on these computers; we access it through the Virtual Desktop (for more information, go to http://helpdesk.its.uiowa.edu/virtualdesktop). If SPSS isn't listed under Programs, you may need to access it through the Virtual Desktop website (https://virtualdesktop.uiowa.edu/Citrix/VirtualDesktop/auth/login.aspx).

When using the Virtual Desktop to access SPSS, you can only open and save files from your University of Iowa personal drive (the H: drive) or from a data source (e.g., a flash drive) you connected before opening SPSS. When using the Virtual Desktop, a dialog box may appear asking for read/write access; if you want to use and save files, you need to give CITRIX full access. When SPSS opens, it will present a "What would you like to do?" dialog box. For now, click the Cancel button.

Section 1: Analysis of Variance

1.1. One-Way ANOVA
This section covers comparing group differences for one or more independent and dependent variables in SPSS. If you have one categorical independent variable and an interval dependent variable, the One-Way ANOVA procedure is appropriate.

Analyze > Compare Means > One-Way ANOVA...

One-Way ANOVA: Used to test whether the population means of two or more groups are equal.
H0: μ1 = μ2 = μ3 = … = μk
H1: At least one μi ≠ μj

Example: Does the population mean for current salary differ by employment category?

To conduct the one-way ANOVA, select Current Salary (salary) as the dependent variable and Employment Category (jobcat) as the factor in the dialog box. These options produce the following output (some output is omitted):

Test of Homogeneity of Variances (Current Salary)
Levene Statistic = 59.733, df1 = 2, df2 = 471, Sig. = .000

ANOVA (Current Salary)
                  Sum of Squares   df    Mean Square    F         Sig.
Between Groups    8.9E+010           2   4.472E+010     434.481   .000
Within Groups     4.8E+010         471   102925714.5
Total             1.4E+011         473

In this example, the null hypothesis is that the three employment categories do not differ in mean salary. The F statistic is 434.481 with a reported significance level of .000 (technically, the p-value is less than 0.001). The hypothesis of no difference among the three groups is therefore rejected at the .05 significance level, and we conclude that the three employment categories (Clerical, Custodial, and Manager) differ in mean salary. To find out which pairs of means differ significantly, we need to request follow-up tests using the Post Hoc option.
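The same analysis can also be run from a syntax window. A minimal sketch, assuming the variable names salary and jobcat from Employee data.sav (the post hoc request corresponds to the option discussed next):

ONEWAY salary BY jobcat
 /STATISTICS=HOMOGENEITY
 /POSTHOC=LSD ALPHA(0.05).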
Click on the Post Hoc option and select the post hoc test or tests of interest. Here we use Fisher's Least Significant Difference (LSD) test. The results indicate that people in job category 3 (Manager) are paid significantly more than people in job categories 1 and 2 (Clerical and Custodial), and that there is no significant difference between categories 1 and 2.

Multiple Comparisons (Dependent Variable: Current Salary), LSD
(I) jobcat  (J) jobcat  Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   Upper Bound
1           2           $-3,100.349             $2,023.760   .126   $-7,077.06           $876.37
1           3           $-36,139.258*           $1,228.352   .000   $-38,552.99          $-33,725.53
2           1           $3,100.349              $2,023.760   .126   $-876.37             $7,077.06
2           3           $-33,038.909*           $2,244.409   .000   $-37,449.20          $-28,628.62
3           1           $36,139.258*            $1,228.352   .000   $33,725.53           $38,552.99
3           2           $33,038.909*            $2,244.409   .000   $28,628.62           $37,449.20
*. The mean difference is significant at the 0.05 level.

1.2. Two-Way ANOVA
When there is more than one independent variable, the analysis is done through the General Linear Model (GLM) procedures in the Analyze menu. If the analysis involves independent groups and one dependent variable, choose:

Analyze > General Linear Model > Univariate...

Example: Does current salary depend on minority status and employment category?

For this example, the Dependent Variable is Current Salary (salary) and the Fixed Factors are Minority (minority) and Employment Category (jobcat). You can plot the means to get a visual sense of the results: select Plots, add jobcat to the Horizontal Axis box, and add minority to the Separate Lines box.

Between-Subjects Factors
                                  Value Label   N
Employment Category       1       Clerical      363
                          2       Custodial      27
                          3       Manager        84
Minority Classification   0       No            370
                          1       Yes           104

Tests of Between-Subjects Effects (Dependent Variable: Current Salary)
Source              Type III Sum of Squares   df    Mean Square    F          Sig.
Corrected Model     9.034E+010 (a)              5   1.807E+010     177.742    .000
Intercept           1.537E+011                  1   1.537E+011     1511.773   .000
jobcat              2.596E+010                  2   1.298E+010     127.699    .000
minority            237964814                   1   237964814.4    2.341      .127
jobcat * minority   788578413                   2   394289206.5    3.879      .021
Error               4.757E+010                468   101655279.9
Total               6.995E+011                474
Corrected Total     1.379E+011                473
a. R Squared = .655 (Adjusted R Squared = .651)

[Profile plot omitted: Estimated Marginal Means of Current Salary by Employment Category (Clerical, Custodial, Manager), with separate lines for Minority Classification (No, Yes).]

Assuming alpha = .05, the jobcat main effect and the jobcat by minority interaction are significant. The change in the simple main effect of one variable over levels of the other is most easily seen in the interaction plot: if the lines describing the simple main effects are not parallel, an interaction may be present. Here, the interaction is confirmed by the significant jobcat * minority term in the summary table.
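A minimal syntax sketch of this two-way ANOVA, assuming the variable names salary, jobcat, and minority (the /PLOT subcommand requests the profile plot described above):

UNIANOVA salary BY jobcat minority
 /PLOT=PROFILE(jobcat*minority)
 /DESIGN=jobcat minority jobcat*minority.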
1.3. ANCOVA
ANCOVA (analysis of covariance) is an extension of ANOVA. It examines whether group means (on a categorical independent variable) differ on a dependent variable after statistically controlling for another continuous variable (the covariate). The analysis is done through the General Linear Model (GLM) procedures in the Analyze menu. If the analysis involves independent groups, choose:

Analyze > General Linear Model > Univariate...

Example: Does salary differ for males and females after controlling for previous experience?

For this example, the Dependent Variable is Current Salary (salary), the Fixed Factor is Gender (gender), and the covariate is Previous Experience (prevexp). Under Options we can display the adjusted (estimated marginal) means for each group, in this case for gender. Note that if you have more than two groups, you can compare them by selecting Contrasts rather than post hoc analyses.

The resulting Tests of Between-Subjects Effects table shows a significant difference in salary between males and females after controlling for previous experience, F(1, 471) = 137.020, p < .001. The second table gives the adjusted mean salary for each group based on the covariate. Compared with the raw descriptive statistics, the means change slightly but remain significantly different.

1.4. MANOVA
MANOVA (multivariate analysis of variance) is an extension of ANOVA in which there are two or more dependent variables and one categorical independent variable. The analysis is done through the General Linear Model (GLM) procedures in the Analyze menu. If the analysis involves independent groups, choose:

Analyze > General Linear Model > Multivariate...

Example: We want to know whether groups differ on a set of variables: do the three job categories differ on job characteristics (salary, beginning salary, and jobtime)? These three variables are the dependent variables and jobcat is the fixed factor.

Results: Wilks' Lambda indicates a significant difference in job characteristics across job categories, F(6, 938) = 117.402, p < .001, Wilks' Λ = .326. The table labeled Tests of Between-Subjects Effects contains the follow-up univariate ANOVAs, so an alpha correction (such as Bonferroni) should be applied. From this table we see that job category has a significant effect on salary and beginning salary but not on jobtime.

Section 2: Regression Analysis

2.1. Regression
Regression models can be used to predict or explain values of a (dependent) variable based on information from other (independent) variables.

Overall Model Fit (F-Test): Used to test whether the regression model is "better" than using only the mean of the dependent variable.
H0: Y = β0
H1: Y = β0 + β1X1 + … + βkXk

Test for a Single βk: Used to test whether βk differs from zero.
H0: βk = 0
H1: βk ≠ 0

Example: What is the regression model for using educational level and previous experience to predict salary?

Everything we need to create a linear regression model is located in the following menu:

Analyze > Regression > Linear…

The variable we are trying to "predict" is Current Salary, which goes in the Dependent box. Educational Level and Previous Experience go in the Independent(s) box. There are many options available in the linear regression dialog box; we will look at just one. Plotting the standardized predicted values against the standardized residuals lets you examine whether the errors appear random and whether homogeneity of variance is a reasonable assumption.
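A minimal syntax sketch of this regression, assuming the variable names salary, educ, and prevexp (the /SCATTERPLOT subcommand requests the residual-by-predicted plot just described):

REGRESSION
 /DEPENDENT salary
 /METHOD=ENTER educ prevexp
 /SCATTERPLOT=(*ZRESID, *ZPRED).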
Running the analysis produces the following output:

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .664a   .441       .439                $12,788.694
a. Predictors: (Constant), Previous Experience (months), Educational Level (years)

ANOVA(b)
Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   6.088E10           2   3.044E10      186.132   .000(a)
Residual     7.703E10         471   1.636E8
Total        1.379E11         473
a. Predictors: (Constant), Previous Experience (months), Educational Level (years)
b. Dependent Variable: Current Salary

Coefficients(a)
Model 1                        Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)                     -20978.304         3087.258                         -6.795   .000
Educational Level (years)      4020.343           210.650      .679                19.085   .000
Previous Experience (months)   12.071             5.810        .074                2.078    .038
a. Dependent Variable: Current Salary

So, what does the output tell us?
R Square = .441 means that about 44% of the variance in salary can be "accounted for" by information about educational level and previous experience.
An F statistic of 186.132 (p-value < 0.001) indicates that a regression model containing educational level and previous experience is "better" than a model without any predictor variables (i.e., using the mean salary as the estimate for everyone).
The regression equation is: Salary = -20978 + 4020 * Educational level + 12 * Previous experience.
The t statistic for each βi is large enough in magnitude to reject the null hypothesis that βi = 0 at α = .05.
The residual plot does not look very random, so we may need to reconsider our analysis. This is common for a variable like salary and suggests we may want to transform the variable or choose a different analysis.

2.2. Logistic Regression
Logistic regression predicts a binary outcome variable from one or more predictor variables.

Analyze > Regression > Binary Logistic…

Example: Can we predict an individual's gender from salary, jobtime, and previous experience? Here gender (male or female) is the binary dependent variable, and salary, jobtime, and prevexp are the covariates.

What the output tells us:
The test of the overall model is significant, χ² = 180.206, p < .001.
Both salary and previous experience are significant predictors of gender.
Based on the model predicting gender from salary, jobtime, and prevexp, we correctly classify 75% of individuals.
Practically speaking, if we wanted an efficient prediction model we would not include variables that are not significant predictors. If we remove jobtime from the model, the correct classification percentage is still about 75% with just salary and prevexp, because, as we saw, jobtime is not a significant predictor of gender.
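A minimal syntax sketch of the two logistic regression runs above (the full model, then the reduced model without jobtime), assuming the variable names gender, salary, jobtime, and prevexp:

* Note: depending on your setup, a string gender variable (f/m) may first need to be recoded to a numeric 0/1 variable, for example with RECODE or AUTORECODE.
LOGISTIC REGRESSION VARIABLES gender
 /METHOD=ENTER salary jobtime prevexp.

LOGISTIC REGRESSION VARIABLES gender
 /METHOD=ENTER salary prevexp.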
Section 3: Data Management Syntax

Count missing values:
Compute missing = nmiss(salary).
Execute.

Listwise deletion excludes all cases that have missing values on any of the variables in the analysis. Pairwise deletion uses all cases that have valid responses on the variables involved in each particular statistic being calculated. In many SPSS procedures (e.g., correlations) the default is pairwise deletion; add the /missing = listwise subcommand to request listwise deletion.

Value labels:
Add value labels gender "m" "Male" "f" "Female"
 / minority 0 "No" 1 "Yes"
 / item1 to item20 1 "Not at all true of me" 7 "Very true of me".
Execute.

Sorting and computing new variables:
Sort cases by prevexp (A).
Compute salchange = salary - salbegin.
Compute score = sum.3(i3, i6, i8).
Execute.

Reverse-coding an item:
Recode item1 (1=7) (2=6) (3=5) (4=4) (5=3) (6=2) (7=1) into Ritem1.
Execute.

Analyses by group (Temporary limits the split to the next procedure):
Sort cases by gender.
Temporary.
Split file by gender.
Corr var = salary educ.

Scale reliability:
Reliability var = item1 to item20
 /scale(SC) = all
 /statistics descriptive scale corr cov
 /summary = total.

Section 4: Syntax for Common Analyses

Frequencies
Freq var = educ
 /statistics
 /percentiles = 25 75
 /format = notable.

Descriptives
Desc var = salary
 /statistics = mean stddev
 /sort = mean (D)
 /save.

Chi-square
Crosstabs tables = salary by gender
 /statistics = chisq phi
 /cells = count sresid expected.

T-test
t-test /groups = gender('m' 'f')
 /variables = salary.

Correlation
corr var = educ salary
 /missing = listwise.

Regression
reg /dependent salary
 /method = enter educ jobtime.

ANOVA
glm salary by jobcat
 /posthoc = jobcat (tukey)
 /emmeans = tables (jobcat).

Graphs
Graph /bar(simple) = count by jobcat
 /title = 'Frequencies of Different Job Categories'.
Graph /bar(grouped) = count by jobcat by gender
 /title = 'Gender Differences in Job Categories'.
Graph /line(multiple) = count by educ by jobcat
 /title = 'Distribution of educ by job category'.
Graph /scatterplot = salary with educ.
Graph /histogram(normal) = salary.

Section 5: Helpful Links

PowerPoint of various statistical analyses in SPSS: What statistical analysis should I use?
Website with annotated output and code/syntax for various analyses in SPSS, Stata, SAS, and Mplus: http://www.ats.ucla.edu/stat/AnnotatedOutput/