Exam 1 answers

Data is available as listed below on several financial and demographic variables for households sampled from a large city in the Western United States.

INCOME1    Principal wage earner income ($)
INCOME2    Secondary wage earner income ($)
FAMLSIZE   Family size
OWNORENT   1 = own home, 0 = rent home
TOTLDEBT   Total debt (excluding home mortgage) ($)
HPAYRENT   Home mortgage payment or rent ($)
LOCATION   Location of residence (NE=1, NW=2, SW=3, SE=4)
INCOME     Total family income (sum of INCOME1 and INCOME2)

For each of the following short scenarios, describe briefly and specifically how you would analyze the data:

1. a. You wish to test whether population mean total debt is less than $20,000. Set up appropriate hypotheses. Also explain any assumption that you would need to make and how you would check.
   H0: µ ≥ 20,000 vs. H1: µ < 20,000
   The test assumes that total debt is normally distributed in the population, which can be checked with a histogram or normal probability plot of the sample. If the sample size is at least 30, the data can be non-normal and the results will still be approximately accurate.

   b. You wish to test whether more than 60% of households own their home. Set up appropriate hypotheses.
   H0: π ≤ 0.6 vs. H1: π > 0.6

2. You feel that total family income of a household depends to a great degree on the location of residence. State the specific hypotheses for this type of problem.
   H0: Mean total family income is equal for the 4 locations vs. H1: Not all means are equal

3. You wish to know how much family size influences total debt. Specifically you wish to be 95% confident about how much total debt changes as family size increases. What would you use to address this concern?
   A 95% confidence interval for the population slope, given in the regression output as the Lower 95% and Upper 95% values.

4. In using family size to predict total debt, you have a direct relationship. You notice that prediction errors were much larger for large families than for small families. What type of problem does this cause and how would you confirm by using a graph that this type of problem really exists?
   Prediction errors that grow with family size indicate non-constant variance (heteroscedasticity). Plot residuals against predicted values to check for constant variability; a spread that widens as the predicted value increases confirms the problem.

Refer to Appendix A. The analysis refers to the amount of total debt a household has depending upon whether the head of household owns or rents the home.

5. Check the assumptions for this test by using Appendix A.
   Normality is not OK for owners (skewness 1.38), but a sample size larger than 30 makes up for that. Data is normal for renters. The equality of variance test (p-value .3696) allows us to accept the assumption of equal variability.

6. Given your check of assumptions, what test should you use?
   t-test, pooled variance

7. Given the results of checking the assumptions, set up appropriate hypotheses. Use a level of significance of .05. Write up a conclusion stating your confidence.
   H0: µ1 = µ2 vs. H1: µ1 ≠ µ2
   We can be 98.58% confident of a difference between the population means (p-value .0142) and so can conclude that they are different. Sample data shows owners have higher total debt ($18,620 vs. $15,509).

For this last set of questions, refer to Appendix B. The dependent variable is Total Debt ($), and the independent variables are the income of the household ($) and a dummy variable for whether the family owns (1) or rents (0).

8. Use the regression stats table to interpret clearly the meaning of the measures Adjusted R² and Std. Error.
   The prediction equation using income and Own indicates a linear relationship that explains 79.1 percent of the variability in Total Debt after adjusting for the number of variables in the equation. This relationship can predict Total Debt with a typical error of $2,218.27.

9. Interpret the number 2,167.337 in the coefficients column.
   Total Debt is $2,167.337 higher on average for families that own their home than for those who rent, holding income constant.

10. Interpret clearly the meaning of the 95% lower and upper limits across from the number .252.
   For each additional dollar of income, holding home ownership constant, total debt increases by $.252 on average (equivalently, a hundred extra dollars of income is associated with a $25.20 increase in total debt). We can be 95% confident that this increase is somewhere between $.20 and $.30 per dollar of income.

11. Set up specific hypotheses to test for a relationship between total debt and income and write up a sentence detailing what you conclude and how confident you can be. You may use a significance level of .05. What specifically is meant by a relationship in this case?
   H0: β1 = 0 (Total Debt and income are not related)
   H1: β1 ≠ 0 (Total Debt and income are related)
   We can be almost 100% confident that a relationship exists (p-value 1.13E-10), so we can conclude a relationship at the 95% level. A relationship in this case means that increased income is associated with increased total debt.

12. Although none of the Cook's D values give any real concern, the largest value was observation 23. Discuss why that might have the largest value by looking at leverage and studentized residuals.
   Cook's D is high when high leverage and a large residual occur together. In this case both obs 23 and obs 24 have large studentized residuals (2.103 and -2.121), but obs 23 has higher leverage than obs 24 (0.146 vs. 0.061).

13. Interpret the meaning of the 95% Prediction Interval for when Income = $69,483 and the family owns their home.
   We can be 95% confident that, for a household with family income of $69,483 that owns its home, total debt will fall between $14,453 and $23,778.

14. Use the information given in Appendix B to address any potential problems with the assumptions for using a simple linear regression. For each assumption:
   A. State the assumption
   B. How did you check the assumption? (What chart did you use?)
   C. What do you conclude?
   A. Constant variability
   B. Residuals vs. predicted plot
   C. This assumption looks OK since there is no evidence of increasing variability as the predicted value increases.

   A. Linearity
   B. Income residual plot
   C. Linearity looks OK since no rainbow or smile patterns are seen.

   A. Normality
   B. Last plot (normal probability plot of residuals)
   C. Seeing no real departure from a straight line, it looks like normality is OK.

15. Do you need to check the assumption of independence for this problem? Why?
   Independence: not time series, so irrelevant.

Appendix A

Descriptive statistics

                                     Own        Rent
count                                 38          23
mean                           18,620.29   15,509.09
sample standard deviation       4,356.16    5,133.12
skewness                            1.38       -0.04
kurtosis                           -0.90       -0.83

Hypothesis Test: Independent Groups (t-test, pooled variance)

         Own        Rent
   18,620.29   15,509.09   mean
    4,356.16    5,133.12   std. dev.
          38          23   n

       2.527   t
       .0142   p-value (two-tailed)

F-test for equality of variance

  26,348,878   variance: Rent
  18,976,156   variance: Own
        1.39   F
       .3696   p-value

Hypothesis Test: Independent Groups (t-test, unequal variance)

         Own        Rent
   18,620.29   15,509.09   mean
    4,356.16    5,133.12   std. dev.
          38          23   n

       2.426   t
       .0199   p-value (two-tailed)

Wilcoxon - Mann/Whitney Test

     n   sum of ranks
    38           1334   Own
    23            557   Rent
    61           1891   total

    1178.000   expected value
      67.199   standard deviation
       2.321   z
       .0203   p-value (two-tailed)

Appendix B

Regression Analysis

R²            0.805
Adjusted R²   0.791      n           30
R             0.897      k           2
Std. Error    2218.273   Dep. Var.   TOTLDEBT

ANOVA table

Source                     SS   df                 MS       F    p-value
Regression   550,181,195.9025    2   275,090,597.9512   55.90   2.52E-10
Residual     132,859,807.5642   27     4,920,733.6135
Total        683,041,003.4667   29

Regression output                                                   confidence interval

variables   coefficients   std. error   t (df=27)    p-value   95% lower    95% upper
Intercept      -559.7520
INCOME            0.2520       0.0249      10.108   1.13E-10      0.2008       0.3031
Own           2,167.3373     885.6143       2.447      .0212    350.2069   3,984.4678

Obs.  TOTLDEBT  Predicted  Residual  Leverage  Studentized  Studentized Deleted  Cook's D
                                                  Residual             Residual
   1  19,795.0   20,843.9  -1,048.9     0.062       -0.488               -0.481     0.005
   2  19,539.0   20,167.3    -628.3     0.056       -0.291               -0.286     0.002
   3  14,045.0   13,973.7      71.3     0.083        0.034                0.033     0.000
   4  12,943.0   14,618.0  -1,675.0     0.073       -0.784               -0.778     0.016
   5  23,658.0   21,592.2   2,065.8     0.071        0.966                0.965     0.024
   6  27,532.0   24,010.7   3,521.3     0.115        1.688                1.751     0.124
   7  16,056.0   19,313.5  -3,257.5     0.141       -1.585               -1.633     0.138
   8  16,537.0   15,075.8   1,461.2     0.067        0.682                0.675     0.011
   9  20,321.0   22,789.9  -2,468.9     0.090       -1.167               -1.175     0.045
  10  21,492.0   23,638.0  -2,146.0     0.245       -1.114               -1.119     0.134
  11   8,416.0    9,923.3  -1,507.3     0.171       -0.746               -0.740     0.038
  12  22,661.0   20,945.7   1,715.3     0.063        0.799                0.793     0.014
  13  10,616.0    9,483.3   1,132.7     0.181        0.564                0.557     0.024
  14  20,938.0   19,627.2   1,310.8     0.146        0.639                0.632     0.023
  15  18,933.0   18,274.9     658.1     0.048        0.304                0.299     0.002
  16  14,331.0   12,963.2   1,367.8     0.102        0.650                0.643     0.016
  17  15,726.0   15,877.6    -151.6     0.058       -0.070               -0.069     0.000
  18  22,097.0   22,173.3     -76.3     0.080       -0.036               -0.035     0.000
  19  26,116.0   24,568.6   1,547.4     0.129        0.747                0.741     0.028
  20  17,548.0   17,218.1     329.9     0.049        0.153                0.150     0.000
  21  21,306.0   18,212.7   3,093.3     0.048        1.429                1.458     0.034
  22  10,195.0   12,187.9  -1,992.9     0.119       -0.957               -0.955     0.041
  23  15,572.0   11,259.3   4,312.7     0.146        2.103                2.257     0.251
  24  10,988.0   15,547.3  -4,559.3     0.061       -2.121               -2.280     0.098
  25  13,923.0   17,576.0  -3,653.0     0.048       -1.688               -1.751     0.048
  26  19,876.0   19,864.7      11.3     0.053        0.005                0.005     0.000
  27  10,594.0   10,917.6    -323.6     0.152       -0.158               -0.155     0.001
  28  18,246.0   15,917.6   2,328.4     0.112        1.114                1.119     0.052
  29  16,904.0   18,754.1  -1,850.1     0.133       -0.896               -0.892     0.041
  30  13,480.0   13,068.6     411.4     0.099        0.195                0.192     0.001

Predicted values for: TOTLDEBT

                              95% Confidence Intervals   95% Prediction Intervals
INCOME   Own    Predicted        lower        upper        lower        upper   Leverage
69,483     1   19,115.790   18,104.146   20,127.434   14,453.199   23,778.381      0.049
42,329     1   12,273.573   10,718.389   13,828.756    7,463.695   17,083.450      0.117

Appendix C

[Plot: Residuals by Predicted. Vertical axis: Residual (gridlines = std. error); horizontal axis: Predicted.]
[Plot: Residuals by INCOME. Vertical axis: Residual (gridlines = std. error); horizontal axis: INCOME.]
[Plot: Normal Probability Plot of Residuals. Vertical axis: Residual; horizontal axis: Normal Score.]
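
The test statistics in Appendix A and the fit measures in Appendix B can be cross-checked by hand from the printed summary figures. The following is a minimal Python sketch of that arithmetic and is not part of the original exam or its software output. The function and variable names are illustrative assumptions, and only the standard library is used, so p-values are not computed. Because the printed coefficients are rounded, the point prediction should agree with the Appendix B value only to within a couple of dollars.

```python
import math

# Summary statistics copied from Appendix A (total debt by home ownership).
OWN = {"n": 38, "mean": 18620.29, "sd": 4356.16}
RENT = {"n": 23, "mean": 15509.09, "sd": 5133.12}

# Sums of squares, sample size, and number of predictors from Appendix B.
SS_REG, SS_RES = 550_181_195.9025, 132_859_807.5642
N, K = 30, 2

# Rounded coefficients as printed in Appendix B.
COEF = {"intercept": -559.7520, "income": 0.2520, "own": 2167.3373}


def pooled_t(a, b):
    """t statistic and df for two independent groups, equal variances assumed."""
    df = a["n"] + b["n"] - 2
    pooled_var = ((a["n"] - 1) * a["sd"] ** 2 + (b["n"] - 1) * b["sd"] ** 2) / df
    se = math.sqrt(pooled_var * (1 / a["n"] + 1 / b["n"]))
    return (a["mean"] - b["mean"]) / se, df


def welch_t(a, b):
    """t statistic for two independent groups without assuming equal variances."""
    se = math.sqrt(a["sd"] ** 2 / a["n"] + b["sd"] ** 2 / b["n"])
    return (a["mean"] - b["mean"]) / se


def variance_ratio(a, b):
    """F statistic with the larger sample variance in the numerator."""
    var_a, var_b = a["sd"] ** 2, b["sd"] ** 2
    return max(var_a, var_b) / min(var_a, var_b)


def fit_summary(ss_reg, ss_res, n, k):
    """Return R-squared, adjusted R-squared, standard error, and overall F."""
    ss_total = ss_reg + ss_res
    resid_df = n - k - 1
    r2 = ss_reg / ss_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / resid_df
    std_err = math.sqrt(ss_res / resid_df)
    f_stat = (ss_reg / k) / (ss_res / resid_df)
    return r2, adj_r2, std_err, f_stat


def predict_debt(income, own):
    """Point prediction of total debt from the rounded Appendix B coefficients."""
    return COEF["intercept"] + COEF["income"] * income + COEF["own"] * own


t_pooled, df_pooled = pooled_t(OWN, RENT)
print(f"pooled t = {t_pooled:.3f} on {df_pooled} df")
print(f"unequal-variance t = {welch_t(OWN, RENT):.3f}")
print(f"F = {variance_ratio(OWN, RENT):.2f}")
print("R2, adj R2, std error, F =", fit_summary(SS_REG, SS_RES, N, K))
print(f"predicted debt = {predict_debt(69_483, 1):.1f}")
```

The pooled and unequal-variance t statistics differ only in how the standard error of the difference is formed, which is why the F-test on the variances (question 5) decides which one to report (question 6).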