Name

Spring 2007 Multiple Regression Exam

General Directions

Please read all questions carefully and answer all parts of each question. In answering the questions on this exam, be specific about the statistical evidence that you use to answer each question. The point value for each question is listed in square brackets after the question number.

To facilitate grading the exam, you are asked to copy and paste sections of output from SPSS that support your answer to the question. The specific questions where you need to include SPSS output end with the statement “Include the statistical output that substantiates your answer.”

When you have completed this exam, you will need to print a copy of your answer to hand in. You will be graded based on the contents of your printed answers. As a backup, please email a copy to me at jimSchwab@mail.utexas.edu.

At the end of this answer sheet, insert copies of the syntax notes for the regression analyses you conducted for this exam.

The Research Question

We are exploring the question whether the quality of life discrepancy between rich and poor nations is likely to increase or not. We think that it will increase if poorer nations have higher birth rates, leading to increasing populations attempting to sustain themselves with the same meager resources. Obviously we can think of many other things that could affect the relationship between wealth and birth rate, but rather than complicate the analysis, we will assume that these other factors can be represented by taking geographic region (country group) into account.

The data set for this exam, poverty.sav, contains information on selected demographic characteristics for 97 different nations in the world. Using this data set, conduct a standard multiple regression to analyze the relationship between "live birth rate per 1,000 of population" [brthrate], "gross national product per capita in U.S. dollars" [gnp], and "country group" [group].
Treat "live birth rate per 1,000 of population" [brthrate] and "gross national product per capita in U.S. dollars" [gnp] as metric variables. Treat "country group" [group] as a non-metric variable and dummy-code it using "3=Western Europe, North America, Japan, Australia, New Zealand" (highly industrialized nations) as the reference category.

Use .05 for alpha for interpreting the statistical relationships and .01 as alpha for diagnostic tests.

Question 1 [10 points]

a) Based on the research question, which variable should be the dependent variable: "live birth rate per 1,000 of population" [brthrate] or "gross national product per capita in U.S. dollars" [gnp]? Explain your choice.

The dependent variable is "live birth rate per 1,000 of population" [brthrate], because the research question asks us to explain differences in birth rates by wealth and other demographic factors.

b) State the research hypothesis that you will test with the regression analysis of the three variables. [5 points]

Region of the world and gross national product account for differences in birth rates.

Question 2 [5 points]

State the level of measurement requirements for the analysis. Evaluate the level of measurement requirements for each of the variables to be included in your analysis.

Standard multiple regression requires that the dependent variable and the metric independent variables be interval level, and that the non-metric independent variables be dummy-coded if they are not dichotomous.

o The metric dependent variable "live birth rate per 1,000 of population" [brthrate] was interval level, satisfying the requirement for dependent variables.
o The metric independent variable "Gross National Product per capita in U.S. dollars" [gnp] was interval level, satisfying the requirement for independent variables.
o The non-metric independent variable "country group" [group] was nominal level, but will satisfy the requirement for independent variables when dummy-coded.
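The dummy coding of "country group" [group] can be sketched in Python. This is only an illustrative sketch, not the SPSS script used for the exam. It assumes deviation (effects) coding, which Question 8 says the script uses, with group 3 as the reference category. The function name deviation_code is hypothetical; the column names group_1, group_2, group_4, group_5, and group_6 follow the variable names shown in the regression output.

```python
# Sketch of deviation (effects) coding for the "country group" variable.
# Assumption: group 3 (highly industrialized nations) is the reference
# category and receives -1 on every coded column.
REFERENCE_GROUP = 3
CODED_GROUPS = [1, 2, 4, 5, 6]


def deviation_code(group, reference=REFERENCE_GROUP, coded=CODED_GROUPS):
    """Return the coded columns for one country's group value.

    Each coded column is 1 for its own group, -1 for the reference
    group, and 0 for every other group.
    """
    if group != reference and group not in coded:
        raise ValueError(f"unknown group code: {group}")
    row = {}
    for g in coded:
        if group == g:
            row[f"group_{g}"] = 1
        elif group == reference:
            row[f"group_{g}"] = -1
        else:
            row[f"group_{g}"] = 0
    return row


if __name__ == "__main__":
    for code in (1, 3, 6):
        print(code, deviation_code(code))
```

Under indicator coding, the only change would be that the reference group receives 0 rather than -1 on every coded column.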
Question 3 [20 points]

a) State and define the assumptions of multiple regression. What are the consequences of violating each of the assumptions?

b) For the variables in this analysis, evaluate each of the regression assumptions. Do we need to transform variables or remove extreme outliers to satisfy the assumptions? Include the statistical output that substantiates your answer.

The linear regression of "live birth rate per 1,000 of population" [brthrate] by "Gross National Product per capita in U.S. dollars" [gnp], "countries which were in Eastern Europe" [group_1], "countries which were in South America and Mexico" [group_2], "countries which were in Middle East" [group_4], "countries which were in Asia" [group_5] and "countries which were in Africa" [group_6] satisfied all of the regression assumptions (independence of variables, linearity, homogeneity of error variance, normality of the residuals, and independence of errors):

The tolerance values for all of the independent variables are larger than 0.10: "Gross National Product per capita in U.S. dollars" [gnp] (0.261), "countries which were in Eastern Europe" [group_1] (0.392), "countries which were in South America and Mexico" [group_2] (0.423), "countries which were in Middle East" [group_4] (0.435), "countries which were in Asia" [group_5] (0.458) and "countries which were in Africa" [group_6] (0.431). Multicollinearity is not a problem in this regression analysis.

Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient; Zero-order, Partial, and Part are correlations; Tolerance and VIF are collinearity statistics.)

Model 1                                            B        Std. Error  Beta   t        Sig.  Zero-order  Partial  Part   Tolerance  VIF
(Constant)                                         30.327   1.080              28.075   .000
Country Group=Eastern Europe                       -14.169  1.855       -.565  -7.639   .000  .277        -.640    -.354  .392       2.554
Country Group=South America and Mexico             -.284    1.678       -.012  -.169    .866  .435        -.018    -.008  .423       2.365
Country Group=Middle East                          7.631    1.721       .311   4.435    .000  .526        .436     .205   .435       2.296
Country Group=Asia                                 -.059    1.555       -.003  -.038    .970  .417        -.004    -.002  .458       2.184
Country Group=Africa                               14.642   1.363       .758   10.740   .000  .826        .761     .498   .431       2.323
Gross National Product per capita in U.S. dollars  -.001    .000        -.307  -3.382   .001  -.629       -.346    -.157  .261       3.834

a. Dependent Variable: Live birth rate per 1,000 of population

In the lack of fit test, the probability of the F test statistic (F = 1.52) was p = .345, greater than the alpha level of significance of .01. The null hypothesis that "a linear regression model is appropriate" is not rejected. The research hypothesis that "a linear regression model is not appropriate" is not supported by this test. The assumption of linearity is satisfied.

Lack of Fit Tests
Dependent Variable: Live birth rate per 1,000 of population

Source       Sum of Squares  df  Mean Square  F      Sig.
Lack of Fit  2922.706        79  36.996       1.517  .345
Pure Error   121.935         5   24.387

The homogeneity of error variance is tested with the Breusch-Pagan test. For this analysis, the Breusch-Pagan statistic was 8.672. The probability of the statistic was p = .193, which was greater than the alpha level for diagnostic tests (alpha = .01). The null hypothesis that "the variance of the residuals is the same for all values of the independent variable" is not rejected. The research hypothesis that "the variance of the residuals is different for some values of the independent variable" is not supported. The assumption of homogeneity of error variance is satisfied.

Homoscedasticity Test Statistics

Test           Statistic  df  Sig.
Breusch-Pagan  8.6715     6   .1929
Koenker        8.9282     6   .1777

Regression analysis assumes that the errors or residuals are normally distributed. The Shapiro-Wilk test of studentized residuals yielded a statistical value of 0.987, which had a probability of p = .481, which was greater than the alpha level for diagnostic tests (alpha = .01).
The null hypothesis that "the distribution of the residuals is normally distributed" is not rejected. The research hypothesis that "the distribution of the residuals is not normally distributed" is not supported. The assumption of normality of errors is satisfied.

Tests of Normality

                      Kolmogorov-Smirnov(a)      Shapiro-Wilk
                      Statistic  df  Sig.        Statistic  df  Sig.
Studentized Residual  .065       91  .200*       .987       91  .481

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Regression analysis assumes that the errors (residuals) are independent and there is no serial correlation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case. The Durbin-Watson statistic tests for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50. The Durbin-Watson statistic for this problem is 2.11, which falls within the acceptable range from 1.50 to 2.50. The analysis satisfies the assumption of independence of errors.

Model Summary(b)

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      .905(a)  .820      .807               6.020                       2.110

a. Predictors: (Constant), Gross National Product per capita in U.S. dollars, Country Group=Middle East, Country Group=Asia, Country Group=Africa, Country Group=South America and Mexico, Country Group=Eastern Europe
b. Dependent Variable: Live birth rate per 1,000 of population

The model satisfied the assumptions of multiple regression without transforming any variables or excluding any countries.

Question 4 [5 points]

Do we have sufficient data to meet the sample size requirements for the proposed analysis? Include the statistical output that substantiates your answer.
The analysis included 6 independent variables: 1 for the covariate ("Gross National Product per capita in U.S. dollars" [gnp]) plus 5 dummy-coded variables for the factor "country group" [group]. The number of cases available for the analysis was 91, which does not satisfy the requirement for 111 cases based on the rule of thumb that the required number of cases should be the larger of the number of independent variables x 8 + 50 (6 x 8 + 50 = 98) or the number of independent variables + 105 (6 + 105 = 111). We should consider mentioning the sample size issue as a limitation of the analysis.

Descriptive Statistics

                                                   Mean      Std. Deviation  N
Live birth rate per 1,000 of population            29.46     13.699          91
Country Group=Eastern Europe                       -.109890  ********        91
Country Group=South America and Mexico             -.076923  ********        91
Country Group=Middle East                          -.098901  ********        91
Country Group=Asia                                 -.054945  ********        91
Country Group=Africa                               .0879121  ********        91
Gross National Product per capita in U.S. dollars  5741.25   8093.680        91

Question 5 [10 points]

a. Interpret the overall relationship between the dependent variable and the independent variables. Include the statistical output that substantiates your answer.

The relationship between "live birth rate per 1,000 of population" and the combination of "Gross National Product per capita in U.S. dollars" and "country group" was statistically significant (F(6, 84) = 63.66, p < .001). The null hypothesis that "all of the partial slopes (b coefficients) = 0" is rejected, supporting the research hypothesis that "at least one of the partial slopes (b coefficients) is not equal to 0". Applying Cohen's criteria for effect size (less than .10 = trivial; .10 up to .30 = weak; .30 up to .50 = moderately strong; .50 or greater = strong), the relationship was characterized as strong (Multiple R = .905).

Model Summary(b)

Model  R        R Square  Adjusted R Square  Std. Error of the Estimate  Durbin-Watson
1      .905(a)  .820      .807               6.020                       2.110

a. Predictors: (Constant), Gross National Product per capita in U.S. dollars, Country Group=Middle East, Country Group=Asia, Country Group=Africa, Country Group=South America and Mexico, Country Group=Eastern Europe
b. Dependent Variable: Live birth rate per 1,000 of population

ANOVA(b)

Model 1     Sum of Squares  df  Mean Square  F       Sig.
Regression  13845.30        6   2307.549     63.664  .000(a)
Residual    3044.641        84  36.246
Total       16889.94        90

a. Predictors: (Constant), Gross National Product per capita in U.S. dollars, Country Group=Middle East, Country Group=Asia, Country Group=Africa, Country Group=South America and Mexico, Country Group=Eastern Europe
b. Dependent Variable: Live birth rate per 1,000 of population

Question 6 [25 points]

a. List the independent variables that had a statistically significant relationship to the dependent variable and interpret each relationship. Include the statistical output that substantiates your answer.

Countries that had a higher gross national product per capita had a lower live birth rate. The individual relationship between the independent variable "Gross National Product per capita in U.S. dollars" [gnp] and the dependent variable "live birth rate per 1,000 of population" [brthrate] was statistically significant, β = -.307, t(84) = -3.38, p = .001. We reject the null hypothesis that the partial slope (b coefficient) for the variable "Gross National Product per capita in U.S. dollars" = 0 and conclude that the partial slope (b coefficient) for the variable "Gross National Product per capita in U.S. dollars" is not equal to 0. The negative sign of the b coefficient (-0.001) means that higher values of "Gross National Product per capita in U.S. dollars" were associated with lower values of "live birth rate per 1,000 of population".

The statement that "countries which were in Eastern Europe had a lower live birth rate compared to the average for all countries" is correct.
The individual relationship between the independent variable "countries which were in Eastern Europe" [group_1] and the dependent variable "live birth rate per 1,000 of population" [brthrate] was statistically significant, β = -.565, t(84) = -7.64, p < .001. We reject the null hypothesis that the partial slope (b coefficient) for the variable "countries which were in Eastern Europe" = 0 and conclude that the partial slope (b coefficient) for the variable "countries which were in Eastern Europe" is not equal to 0. The negative sign of the b coefficient (-14.169) means that "countries which were in Eastern Europe" had a lower live birth rate compared to the average for all countries.

The statement that "countries which were in Middle East had a higher live birth rate compared to the average for all countries" is correct. The individual relationship between the independent variable "countries which were in Middle East" [group_4] and the dependent variable "live birth rate per 1,000 of population" [brthrate] was statistically significant, β = .311, t(84) = 4.43, p < .001. We reject the null hypothesis that the partial slope (b coefficient) for the variable "countries which were in Middle East" = 0 and conclude that the partial slope (b coefficient) for the variable "countries which were in Middle East" is not equal to 0. The positive sign of the b coefficient (7.631) means that "countries which were in Middle East" had a higher live birth rate compared to the average for all countries.

The statement that "countries which were in Africa had a higher live birth rate compared to the average for all countries" is correct. The individual relationship between the independent variable "countries which were in Africa" [group_6] and the dependent variable "live birth rate per 1,000 of population" [brthrate] was statistically significant, β = .758, t(84) = 10.74, p < .001.
We reject the null hypothesis that the partial slope (b coefficient) for the variable "countries which were in Africa" = 0 and conclude that the partial slope (b coefficient) for the variable "countries which were in Africa" is not equal to 0. The positive sign of the b coefficient (14.642) means that "countries which were in Africa" had a higher live birth rate compared to the average for all countries.

Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient; Zero-order, Partial, and Part are correlations; Tolerance and VIF are collinearity statistics.)

Model 1                                            B        Std. Error  Beta   t        Sig.  Zero-order  Partial  Part   Tolerance  VIF
(Constant)                                         30.327   1.080              28.075   .000
Country Group=Eastern Europe                       -14.169  1.855       -.565  -7.639   .000  .277        -.640    -.354  .392       2.554
Country Group=South America and Mexico             -.284    1.678       -.012  -.169    .866  .435        -.018    -.008  .423       2.365
Country Group=Middle East                          7.631    1.721       .311   4.435    .000  .526        .436     .205   .435       2.296
Country Group=Asia                                 -.059    1.555       -.003  -.038    .970  .417        -.004    -.002  .458       2.184
Country Group=Africa                               14.642   1.363       .758   10.740   .000  .826        .761     .498   .431       2.323
Gross National Product per capita in U.S. dollars  -.001    .000        -.307  -3.382   .001  -.629       -.346    -.157  .261       3.834

a. Dependent Variable: Live birth rate per 1,000 of population

b. List the independent variables that did not have a statistically significant relationship to the dependent variable and state the statistical criteria upon which you based your decision. Include the statistical output that substantiates your answer.

The statement that "countries which were in South America and Mexico had a lower live birth rate compared to the average for all countries" is not correct. The individual relationship between the independent variable "countries which were in South America and Mexico" [group_2] and the dependent variable "live birth rate per 1,000 of population" [brthrate] was not statistically significant, β = -.012, t(84) = -.17, p = .866.
We are not able to reject the null hypothesis that the partial slope (b coefficient) for the variable "countries which were in South America and Mexico" = 0.

The statement that "countries which were in Asia had a lower live birth rate compared to the average for all countries" is not correct. The individual relationship between the independent variable "countries which were in Asia" [group_5] and the dependent variable "live birth rate per 1,000 of population" [brthrate] was not statistically significant, β = -.003, t(84) = -.04, p = .970. We are not able to reject the null hypothesis that the partial slope (b coefficient) for the variable "countries which were in Asia" = 0.

Coefficients(a)
(B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient; Zero-order, Partial, and Part are correlations; Tolerance and VIF are collinearity statistics.)

Model 1                                            B        Std. Error  Beta   t        Sig.  Zero-order  Partial  Part   Tolerance  VIF
(Constant)                                         30.327   1.080              28.075   .000
Country Group=Eastern Europe                       -14.169  1.855       -.565  -7.639   .000  .277        -.640    -.354  .392       2.554
Country Group=South America and Mexico             -.284    1.678       -.012  -.169    .866  .435        -.018    -.008  .423       2.365
Country Group=Middle East                          7.631    1.721       .311   4.435    .000  .526        .436     .205   .435       2.296
Country Group=Asia                                 -.059    1.555       -.003  -.038    .970  .417        -.004    -.002  .458       2.184
Country Group=Africa                               14.642   1.363       .758   10.740   .000  .826        .761     .498   .431       2.323
Gross National Product per capita in U.S. dollars  -.001    .000        -.307  -3.382   .001  -.629       -.346    -.157  .261       3.834

a. Dependent Variable: Live birth rate per 1,000 of population

Question 7 [15 points]

Note: it is not necessary to do an analysis of covariance to answer this question, but you may do so if it is helpful in formulating your answer.

a) The relationship between two metric variables and a non-metric variable could also be tested with an analysis of covariance. Would an analysis of covariance have answered the same research question that we answered with standard multiple regression?
In analysis of covariance, the metric covariate is treated as a control variable, so we would have answered the question of what the differences in birth rates among geographic regions were, controlling for differences in gross national product. This is a different question than the one we answered with standard multiple regression.

b) How do the results of an analysis of covariance differ from the results of a standard multiple regression? [Hint: what does an analysis of covariance always test for that is not usually tested in standard multiple regression?]

An analysis of covariance tests for an interaction between the covariate and the factor.

c) How might this difference between analysis of covariance and standard multiple regression affect our interpretation?

If the interaction effect is significant, we do not interpret the main effects by themselves; the relationships must be interpreted in light of the interaction.

Question 8 [10 points]

a) The script uses deviation (or effects) coding to create dummy-coded variables. If we used indicator coding to create dummy-coded variables instead, how would the interpretation of individual relationships differ?

The comparison for deviation coding is a comparison of the group mean to the mean across all groups. The comparison for indicator coding is a comparison of the mean of the dummy-coded group to the mean of the reference category.

b) Would we be likely to find that the same relationships were statistically significant? Why or why not?

It is not likely that the same relationships would be significant, because we are not comparing the same combinations of means.
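The difference between the two coding schemes can be illustrated with a small numerical sketch. The data below are hypothetical and are not taken from poverty.sav, and the function name coding_contrasts is made up for this illustration. The sketch assumes a model with a single factor and no covariate, in which case the coefficient for a group equals that group's mean minus the unweighted mean of the group means under deviation coding, and that group's mean minus the reference group's mean under indicator coding. With a covariate such as [gnp] in the model, the same comparisons would apply to adjusted means instead.

```python
from statistics import mean

# Hypothetical birth rates for three made-up groups; "REF" plays the
# role of the reference category.
rates = {
    "A": [10.0, 12.0],
    "B": [20.0, 22.0, 24.0],
    "REF": [14.0, 16.0],
}


def coding_contrasts(rates, reference):
    """Return the comparisons implied by deviation and indicator coding."""
    group_means = {g: mean(values) for g, values in rates.items()}
    # Deviation coding compares each group to the unweighted mean of
    # the group means (not the mean weighted by group size).
    mean_of_groups = mean(list(group_means.values()))
    deviation = {
        g: m - mean_of_groups
        for g, m in group_means.items()
        if g != reference
    }
    # Indicator coding compares each group to the reference group.
    indicator = {
        g: m - group_means[reference]
        for g, m in group_means.items()
        if g != reference
    }
    return deviation, indicator


deviation, indicator = coding_contrasts(rates, "REF")
print("deviation coding:", deviation)
print("indicator coding:", indicator)
```

Because each scheme subtracts a different baseline, the same group can show a different size of difference under each, which is why the significance tests need not agree.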
Syntax notes:

Notes

Output Created: 16-MAR-2007 13:01:38
Comments:
Input
  Data: C:\2007_Spring_SW388R7\RegressionExam\poverty.sav
  Active Dataset: DataSet1
  Filter: <none>
  Weight: <none>
  Split File: <none>
  N of Rows in Working Data File: 97
Missing Value Handling
  Definition of Missing: User-defined missing values are treated as missing.
  Cases Used: Statistics are based on cases with no missing values for any variable used.
Syntax
  REGRESSION
    /DESCRIPTIVES MEAN STDDEV CORR SIG N
    /MISSING LISTWISE
    /STATISTICS COEFF OUTS R ANOVA COLLIN TOL ZPP
    /CRITERIA=PIN(.05) POUT(.10)
    /NOORIGIN
    /DEPENDENT brthrate
    /METHOD=ENTER gnp group_1 group_2 group_4 group_5 group_6
    /RESIDUALS DURBIN .
Resources
  Elapsed Time: 0:00:00.02
  Memory Required: 3252 bytes
  Additional Memory Required for Residual Plots: 0 bytes
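Several of the summary figures reported in this answer sheet can be cross-checked by hand from the ANOVA table and the test statistics. The following is a minimal sketch of that arithmetic, not part of the SPSS output. The function names required_cases and chi2_sf_even_df are hypothetical, and the chi-square p-value uses the closed-form survival function that holds only for even degrees of freedom, which is assumed here because the Breusch-Pagan and Koenker tests both have df = 6.

```python
import math

# Values copied from the ANOVA table in the exam output.
ss_regression, df_regression = 13845.30, 6
ss_residual, df_residual = 3044.641, 84
ss_total, df_total = 16889.94, 90

r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
ms_residual = ss_residual / df_residual
f_ratio = (ss_regression / df_regression) / ms_residual
adj_r_square = 1 - ms_residual / (ss_total / df_total)
std_error_estimate = math.sqrt(ms_residual)


def required_cases(n_predictors):
    """Rule of thumb from Question 4: larger of 8k + 50 and k + 105."""
    return max(8 * n_predictors + 50, n_predictors + 105)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


breusch_pagan_p = chi2_sf_even_df(8.6715, 6)
koenker_p = chi2_sf_even_df(8.9282, 6)

print(f"R Square           {r_square:.3f}")
print(f"Multiple R         {multiple_r:.3f}")
print(f"Adjusted R Square  {adj_r_square:.3f}")
print(f"Std. Error of Est. {std_error_estimate:.3f}")
print(f"F(6, 84)           {f_ratio:.2f}")
print(f"Required cases     {required_cases(6)}")
print(f"Breusch-Pagan p    {breusch_pagan_p:.4f}")
print(f"Koenker p          {koenker_p:.4f}")
```

The same approach extends to the collinearity columns, where each VIF is the reciprocal of the corresponding tolerance (for example, 1 / .261 is approximately 3.83 for [gnp]).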