Graphical analysis of multivariate data

Choose the columns containing marks on the final exam (final), marks on the assignment (assignment) and marks on the midterm exam (midterm). Then make the following choices:

Graphs -> Legacy Dialogs -> Scatter/Dot

Choose Matrix Scatter and press the button labelled Define. Choose as Matrix variables the three variables final, assignment and midterm and press Ok.

The result will be the following array of scatter-plots:

[Scatter-plot matrix of Mark on final exam, Mark on assignment and Mark on midterm test]

We see that there is a linear pattern between Mark on midterm test and Mark on final exam, but not between Mark on assignment and Mark on final exam. This suggests that only the mark on the midterm test will be significant in predicting the mark on the final exam. The absence of any pattern in the plot of Mark on midterm test against Mark on assignment suggests that these two variables are essentially uncorrelated, so we expect no problem with multicollinearity in the statistical analyses that follow.

Statistical analyses

Choose the columns containing marks on the final exam (final), marks on the assignment (assignment) and marks on the midterm exam (midterm). Then make the following choices:

Analyze -> Regression -> Linear

Choose as Dependent variable Mark on final exam, and as Independent variables Mark on assignment and Mark on midterm test. Press the button labelled Statistics and check the following boxes:

  Regression coefficients
  Confidence intervals
  Covariance matrix
  Model fit
  Descriptives

Press the button labelled Continue to leave the Statistics window. To analyse the data as specified by the commands given above, press Ok and SPSS will produce the following outputs. You should pay attention to the highlighted numbers below.

Descriptive Statistics

                        Mean      Std. Deviation   N
Mark on final exam      35,4333   7,43562          30
Mark on assignment      14,7000   3,49532          30
Mark on midterm test    17,6000   5,74516          30

Correlations

                                             Mark on      Mark on      Mark on
                                             final exam   assignment   midterm test
Pearson Correlation   Mark on final exam     1,000        ,180         ,869
                      Mark on assignment     ,180         1,000        ,104
                      Mark on midterm test   ,869         ,104         1,000
Sig. (1-tailed)       Mark on final exam     .            ,170         ,000
                      Mark on assignment     ,170         .            ,293
                      Mark on midterm test   ,000         ,293         .
N                     Mark on final exam     30           30           30
                      Mark on assignment     30           30           30
                      Mark on midterm test   30           30           30

We see that the correlation between Mark on final exam and Mark on midterm test is high (0.869) and significant (p-value reported as 0.000, i.e. below 0.0005), just as we saw in the plot-matrix above. The correlation between Mark on final exam and Mark on assignment is weak (0.180) and not significant (p-value = 0.170), suggesting that the assignment mark will not be significant in predicting the mark on the final exam. Furthermore, there seems to be no linear relationship between the two "x-variables" Mark on midterm test and Mark on assignment, since their correlation is weak (0.104, p-value = 0.293), just as we saw in the plot-matrix.

Model Summary(b)

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       ,873(a)   ,763       ,745                3,75245

a Predictors: (Constant), Mark on midterm test, Mark on assignment
b Dependent Variable: Mark on final exam

The coefficient of determination is quite high (0.763), indicating that the multiple linear regression model

Yi = b0 + b1 X1i + b2 X2i + ei

fits these data well. Here Yi denotes the mark person number i received on the final exam, X1i the mark person number i received on the assignment and X2i the mark person number i received on the midterm exam. The standard error of the estimate (3.75) is also low compared with the standard deviation of the final exam marks (7.44), indicating that the overall differences between the observed marks on the final exam and those predicted by the linear model are small.

ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   1223,184         2    611,592       43,434   ,000(a)
Residual     380,183          27   14,081
Total        1603,367         29

a Predictors: (Constant), Mark on midterm test, Mark on assignment
b Dependent Variable: Mark on final exam

The regression sum of squares (1223) is a considerable part of the total sum of squares (1603). More specifically, it constitutes 76.3 percent of the total variation in the marks on the final exam; this is the coefficient of determination. The value of the F-statistic is high (43.434) and significant (p-value reported as 0.000), indicating that we can reject the null hypothesis that b1 = b2 = 0. Consequently one, or perhaps both, of the slope coefficients are different from zero. In order to determine which of them has a significant effect on predicting the final mark, we make individual t-tests, using the confidence intervals computed for each coefficient.

Coefficients(a)

                       Unstandardized          Standardized                     95% Confidence
                       Coefficients            Coefficients                     Interval for B
Model 1                B        Std. Error     Beta           t       Sig.     Lower     Upper
(Constant)             13,009   3,528                         3,688   ,001     5,771     20,248
Mark on assignment     ,194     ,200           ,091           ,968    ,342     -,217     ,605
Mark on midterm test   1,112    ,122           ,859           9,120   ,000     ,862      1,362

a Dependent Variable: Mark on final exam

The confidence interval for the coefficient b1 (Mark on assignment) contains the number zero, indicating that we cannot reject the hypothesis b1 = 0 at the 0.05 level of significance. The mark a person receives on the assignment is therefore not (statistically) significant in predicting the mark received on the final exam. This is consistent with our previous observations in the scatterplot-matrix, where we did not detect any pattern between marks on the assignment and marks on the final exam. On the other hand, the mark on the midterm exam is significant, since the confidence interval for b2 does not contain zero, so we can reject the hypothesis that b2 = 0 at the 0.05 level of significance.

We conclude by stating that the following model, with the estimates from the Coefficients table rounded, seems to fit the data well:

Mark on final exam ≈ 13 + 1.1 × Mark on midterm exam.
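
The numbers in the Model Summary, ANOVA and Coefficients tables are linked by simple arithmetic, and it can be instructive to recompute them by hand. The following Python sketch does this from the sums of squares and coefficient estimates reported by SPSS above. It is only an illustration, not part of the SPSS procedure: the function names are hypothetical, and the critical value 2.052 (the 0.975 quantile of the t-distribution with 27 degrees of freedom) is taken from a standard t-table rather than from the output.

```python
import math

# Values copied from the SPSS output above (decimal commas written as points).
N = 30                     # number of students
K = 2                      # number of predictors (assignment, midterm)
SS_REGRESSION = 1223.184
SS_RESIDUAL = 380.183
SS_TOTAL = 1603.367

# Assumption: 0.975 quantile of Student's t with N - K - 1 = 27 degrees of
# freedom, read from a standard t-table (it is not shown in the SPSS output).
T_CRIT_27 = 2.052

# (estimate B, standard error) for each row of the Coefficients table.
COEFFICIENTS = {
    "(Constant)": (13.009, 3.528),
    "Mark on assignment": (0.194, 0.200),
    "Mark on midterm test": (1.112, 0.122),
}


def model_fit(ss_reg, ss_res, ss_tot, n, k):
    """Recompute R^2, adjusted R^2, std. error of the estimate and F."""
    df_res = n - k - 1                      # residual degrees of freedom
    r2 = ss_reg / ss_tot                    # share of variation explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_res
    ms_reg = ss_reg / k                     # regression mean square
    ms_res = ss_res / df_res                # residual mean square
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "std_error": math.sqrt(ms_res),
        "f": ms_reg / ms_res,
    }


def t_and_interval(b, se, t_crit):
    """Return the t-statistic and 95% confidence interval for one coefficient."""
    return b / se, (b - t_crit * se, b + t_crit * se)


if __name__ == "__main__":
    fit = model_fit(SS_REGRESSION, SS_RESIDUAL, SS_TOTAL, N, K)
    print(f"R Square          {fit['r2']:.3f}")
    print(f"Adjusted R Square {fit['adj_r2']:.3f}")
    print(f"Std. Error        {fit['std_error']:.5f}")
    print(f"F                 {fit['f']:.3f}")
    for name, (b, se) in COEFFICIENTS.items():
        t, (low, high) = t_and_interval(b, se, T_CRIT_27)
        verdict = "contains 0" if low <= 0 <= high else "excludes 0"
        print(f"{name:22s} t = {t:6.3f}  CI = ({low:.3f}, {high:.3f})  {verdict}")
```

Because the inputs are the rounded figures printed by SPSS, the recomputed t-statistics and interval bounds can differ from the table in the last digit. The conclusions are unchanged: the interval for Mark on assignment contains zero, and the interval for Mark on midterm test does not.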