Chapter 8: Regression and Correlation

1. False. The size of the slope is unrelated to the size of the correlation. A linear regression can produce a large slope whose value is nevertheless not statistically significant.

2. False. The size of the slope is not related to the size of the correlation.

3. True.

4. False. A correlation of zero means that there is no linear relationship, but there could still be a nonlinear relationship between the two variables.

5. False. The Runs test is only appropriate for time-ordered residuals.

6. μ = 2(10)(15)/25 + 1 = 13; σ = 2.345; z = (10 – 13 + 0.5)/2.345 = –1.066; p(z ≤ –1.066) = 0.1432.

7. a. 10
b. 37.83
c. 0.3806
d. 38.06%
e. 0.6169
f. 0.057
g. 5.138

8. a. Select the rows for the first region and delete the data.

b. The scatterplot appears as:

[Figure: Mortality vs. Temperature; x-axis: Mean Annual Temperature, y-axis: Mortality Index; trend line y = 3.0152x - 52.618, R² = 0.8534]

c. The regression statistics are:

Regression Statistics
Multiple R           0.924
R Square             0.853
Adjusted R Square    0.842
Standard Error       5.933
Observations         15

ANOVA
             df    SS          MS          F         Significance F
Regression   1     2664.336    2664.336    75.701    8.816E-07
Residual     13    457.541     35.195
Total        14    3121.877

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     -52.62         15.82            -3.33    0.01      -86.80      -18.43
Temperature   3.02           0.35             8.70     0.00      2.27        3.76

d. The residual plots appear as follows:

[Figure: Temperature Residual Plot; residuals vs. Temperature]

[Figure: Residuals vs. Predicted Values]

The P-plot is:

[Figure: Normal probability plot of the residuals]

e. The correlation statistics are:

            Correlation   p-value
Pearson     0.924         0.000
Spearman    0.900         0.000

f.
The slope of the regression line is 3.015, meaning that the mortality index rises 3.015 points for every one-degree increase in mean annual temperature. There is no reason to doubt the validity of the regression based on the residual plots.

9. a. The scatterplot appears as:

[Figure: Calories vs. Total; trend line y = 4.0269x + 3.7639, R² = 0.9475]

b. r = 0.830. p-value = 0.003.

c. The regression statistics are:

Regression Statistics
Multiple R           0.830
R Square             0.689
Adjusted R Square    0.650
Standard Error       17.507
Observations         10

ANOVA
             df    SS             MS          F           Significance F
Regression   1     5438.105128    5438.105    17.74335    0.0029
Residual     8     2451.894872    306.4869
Total        9     7890

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     41.18          14.80            2.78     0.0238    7.05        75.31
Serving oz    49.29          11.70            4.21     0.0029    22.31       76.27

d. The plot of residuals vs. serving oz appears as follows:

[Figure: Serving oz Residual Plot; residuals vs. Serving oz]

The Normal P-plot of the residuals is:

[Figure: Normal probability plot of the residuals]

There is no reason to doubt the regression assumptions based on the residual plots.

e. The residual plot appears as follows:

[Figure: Residuals vs. Predicted Values, with points labeled by food type]

All of the breads have negative residual values.

f. The Total values are:

Brand        Food           Total
Anderson     Pretzel        27
Uncle B      Bagel          30.5
Bays         Eng Muffin     32
Thomas       Eng Muffin     30
Quaker       OHs Cereal     27
Nabisco      Grah Cracker   13
Wheaties     Cereal         27
Wonder       Bread          18
Brownberry   Bread          14
Pepperidge   Bread          18

g. The plot of Calories vs. Total is:

[Figure: Calories vs. Total; trend line y = 4.0269x + 3.7639, R² = 0.9475]

The regression statistics are:

Regression Statistics
Multiple R           0.973
R Square             0.948
Adjusted R Square    0.941
Standard Error       7.194
Observations         10

ANOVA
             df    SS             MS          F           Significance F
Regression   1     7475.933518    7475.934    144.4393    2.119E-06
Residual     8     414.0664823    51.75831
Total        9     7890

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   3.764          8.244            0.457    0.660     -15.248     22.775
Total       4.027          0.335            12.018   0.000     3.254       4.800

The plot of residuals vs. predicted values is:

[Figure: Residuals vs. Predicted Values, with points labeled by food type]

The R² value rises from 0.689 for the Calories vs. Serving oz regression to 0.948 for the Calories vs. Total regression, so the second regression fits much better. Moreover, the breads in the residual plot are evenly distributed, unlike in the first regression.

10. a. The correlation matrix appears as follows:

Pearson Correlations
             Calories   Carbo    Fat      Protein   Serving oz
Calories     1.000      0.961    0.226    0.645     0.830
Carbo                   1.000    0.059    0.617     0.723
Fat                              1.000    -0.248    0.086
Protein                                   1.000     0.837
Serving oz                                          1.000

Pearson Probabilities
             Calories   Carbo    Fat      Protein   Serving oz
Calories     -          0.000    0.530    0.044     0.003
Carbo                   -        0.872    0.057     0.018
Fat                              -        0.489     0.814
Protein                                   -         0.003
Serving oz                                          -

The scatterplot matrix is:

[Figure: scatterplot matrix of Calories, Carbo, Fat, Protein, and Serving oz]

b. There is very little fat in any of these foods, so it is difficult to see the effect of fat. Also, calories can come from other sources, such as starches.

c.
If foods with higher fat content were included, a larger proportion of the calories would come from fat, and there would be a stronger relationship.

11. a. The correlations and p-values are:

Correlations
Pearson     -0.455    p-value 0.017
Spearman    -0.703    p-value 0.000

b. The scatterplot appears as:

[Figure: Price vs. Age; x-axis: Age, y-axis: Price]

The plot shows that the price can go up for old Mustangs. The old ones are perceived as classic antiques, and therefore are worth more. This means there is not a linear relationship between Price and Age, which violates one of the assumptions of the Pearson correlation coefficient.

c. The correlations for the younger Mustangs are:

Correlations (Age < 10)
Pearson     -0.895    p-value 0.000
Spearman    -0.924    p-value 0.000

The correlation between Price and Age is much stronger for the younger cars.

d. The regression statistics for the younger Mustangs are:

Regression Statistics
Multiple R           0.895
R Square             0.801
Adjusted R Square    0.790
Standard Error       1704.422
Observations         20

ANOVA
             df    SS             MS         F           Significance F
Regression   1     210100905.2    2.1E+08    72.32256    1.01522E-07
Residual     18    52290963.76    2905054
Total        19    262391869

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   14380.62       861.67           16.69    0.00      12570.31    16190.92
Age         -1383.60       162.70           -8.50    0.00      -1725.41    -1041.79

Using the Regression command, we can calculate the regression equation of Price on Age as PRICE = $14,381 – $1383.60(AGE), which means that the price drops by about $1,384 per year.

e.
The diagnostic plots are:

[Figure: Age Residual Plot; residuals vs. Age]

[Figure: Normal P-Plot of the residuals]

There is one observation whose residual is markedly higher than the others. It belongs to a 2-year-old Mustang that was sold for $16,000, or $4,386.59 more than would be expected based on the regression equation. It would be interesting to see the effect on the regression equation if this one car were removed from the data set.

12. a. Calculus = 56.999 + 1.192(Algebra Placement)
The 95% confidence interval for the slope is (0.715, 1.668).

b. According to the regression equation, when the placement score increases by one point, the final Calculus grade increases by 1.19 points.

c. The residual plot is:

[Figure: Alg Place Residual Plot; residuals vs. Alg Place]

The vertical spread decreases as we move to the right, although the trend is mild. Nevertheless, it does appear that the assumption of constant variance is not perfectly satisfied. However, there is likely to be very little effect on the coefficient from this problem, because there is only a mild trend in variance as the Algebra Placement score increases.

13. a. The regression plot appears as follows:

[Figure: Net Income vs. Total Assets; trend line y = 0.7328x + 5.673, R² = 0.9053]

b.
The regression statistics are:

Regression Statistics
Multiple R           0.951
R Square             0.905
Adjusted R Square    0.903
Standard Error       24.578
Observations         45

ANOVA
             df    SS           MS           F           Significance F
Regression   1     248242.30    248242.30    410.9317    1.25188E-23
Residual     43    25976.14     604.10
Total        44    274218.44

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     5.673          4.717            1.203    0.236     -3.839      15.185
Total Asset   0.733          0.036            20.271   0.000     0.660       0.806

The residual plot is:

[Figure: Residuals vs. Predicted Values]

c. The regression statistics are:

Regression Statistics
Multiple R           0.893
R Square             0.798
Adjusted R Square    0.793
Standard Error       0.169
Observations         45

ANOVA
             df    SS          MS          F           Significance F
Regression   1     4.869708    4.869708    169.6205    1.6E-16
Residual     43    1.234506    0.028709
Total        44    6.104213

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   0.070          0.123            0.573    0.570     -0.177      0.318
LogTA       0.906          0.070            13.024   0.000     0.765       1.046

The plot of log(Net Income) vs. log(Total Assets) and the plot of the standardized residuals vs. log(Total Assets) appear as follows:

[Figure: LogNI vs. LogTA; trend line y = 0.9057x + 0.0703, R² = 0.7978]

[Figure: Residuals vs. Predicted Values (Log)]

The residual plot for the transformed data is much better, revealing no difficulties with the regression assumptions.

14. a. The scatterplot appears as:

[Figure: Mass vs. Volume; x-axis: Volume, y-axis: Mass]

b.
After removing the outlier and the constant term from the model, the slope = 2.693.

c. The 95% confidence interval is (2.629, 2.757), which includes the accepted value, 2.699.

15. a. The correlation values are:

Correlations
Pearson's r     0.406    p-value 0.003
Spearman's s    0.295    p-value 0.037

b. The scatterplot appears as:

[Figure: Pulmon vs. Cardio, with points labeled by state abbreviation]

Alaska, Hawaii, and Utah are isolated in the lower left.

c. The correlations are:

Correlations
Pearson's r     0.259    p-value 0.072
Spearman's s    0.251    p-value 0.082

d. The Pearson correlation is dramatically reduced by the omission of these outliers, while the Spearman correlation is moderately reduced. The original correlation with all 50 states exaggerates the relationship between the two variables. When the outliers are included, the nonparametric correlation partly alleviates the problem, but not entirely. An important lesson to learn from this is that a correlation statistic by itself can be misleading, and a plot is useful to see if there are outliers.

e. Alaska and Hawaii are thousands of miles from the continental U.S., with different climates and racial compositions. It is quite possible that they will not represent the U.S. population of the lower 48 states. There are many reasons why Alaska may be low: different eating and exercise habits, less smoking, different physical characteristics of a large portion of the population, and more deaths due to other circumstances.

16. a. r = 0.313. The p-value for r is 0.076.

b. r = 0.515. The p-value for r is 0.002.

c. The following scatterplots are generated:

[Figure: CAPRET91 vs. CAPRET90, with points labeled by market sector]

[Figure: INC91 vs. INC90, with points labeled by market sector]

d. You should expect a strong correlation in income from one year to the next because income refers to interest on bonds and preferred stocks, dividend payments, etc., and these are fairly stable from year to year.

e. r = 0.182. The p-value for r is 0.309. The scatterplot appears as:

[Figure: NAV91 vs. NAV90, with points labeled by market sector]

f. Without the Biotech stock the value of r is –0.019 with a p-value of 0.920.

g.
With Biotech: s = –0.019 (p-value = 0.916)
Without Biotech: s = –0.118 (p-value = 0.521)

h. Previously there was a slight positive (but not statistically significant) correlation between the gains in net asset value in the two years, but without Biotech the correlation is slightly negative. The lesson is that past performance is not necessarily a good guide to future performance when it comes to picking market sectors. Some investment advisors feel that you should diversify your investments across sectors because sector performance is so difficult to predict.

17. a. r = –0.226. p-value < 0.001. Since the correlation is negative, persons with later birth dates are more likely to have lower draft numbers.

b. The scatterplot appears as follows:

[Figure: Number vs. Day; x-axis: Number, y-axis: Day]

There is no apparent trend in the scatterplot.

c. The trend line appears as:

[Figure: Number vs. Day with trend line y = -0.2261x + 225.01, R² = 0.0511]

The regression equation is
Draft Number = 225.01 – 0.2261(Birth Date Number)
The regression explains only 5.11% of the variation in draft numbers.

d. r = –0.867. p-value < 0.001. The correlation between the average monthly draft number and the birth month is much stronger than the correlation between individual draft numbers and birth date numbers.

e. The scatterplot is:

[Figure: Average Number vs. Month; x-axis: Month, y-axis: Avg. Number; trend line y = -7.0644x + 229.47, R² = 0.7523]

The regression equation is:
Average Draft Number = 229.47 – 7.0644(Birth Month Number)
This regression explains 75.23% of the variation in the average monthly draft numbers.

f. There is too much variability in the individual draft numbers to present an effective display of the problem with the draft lottery.
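The contrast in problem 17 between a weak correlation for individual days and a strong one for monthly averages can be reproduced with a small sketch. The data here are synthetic and purely hypothetical (they are not the lottery data): a mild downward trend by month plus a large day-to-day swing that cancels out within each month. The function `pearson_r` and all the constants are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: 12 months x 30 days per month.
months, numbers = [], []
for month in range(1, 13):
    for day in range(30):
        # Illustrative day-to-day swing; it sums to zero within each month.
        swing = 100 if day % 2 == 0 else -100
        months.append(month)
        numbers.append(230 - 7 * month + swing)

# Correlation using every individual day.
r_individual = pearson_r(months, numbers)

# Average the numbers within each month, then correlate the 12 averages.
month_ids = list(range(1, 13))
averages = [
    sum(n for m, n in zip(months, numbers) if m == k) / 30
    for k in month_ids
]
r_monthly = pearson_r(month_ids, averages)

print(f"individual r = {r_individual:.3f}, monthly-average r = {r_monthly:.3f}")
```

In this stylized setup the individual-day correlation is weak (about –0.23) while the monthly averages fall exactly on a line. Real data would not average out so cleanly, but the direction of the effect is the same: averaging removes within-month variability and leaves the trend.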
While the p-value is highly significant, the problem with the lottery is not apparent from the scatterplot. By averaging the draft numbers over each month, some of the day-to-day variability in the draft numbers is removed and a clearer picture of the problem with the draft lottery emerges.

18. a. Emerald: Price = –16,377 + 8.34(Year)
Urban: Price = –10,727 + 5.46(Year)
Medical: Price = –23,718 + 12.00(Year)
Emerald's prices increase at the rate of 8.34 points per year. The Urban CPI increases at the rate of 5.46 points per year, and the medical CPI increases at the rate of 12 points per year.

b. Emerald: (5.86, 10.81)
Urban: (4.97, 5.94)
Medical: (10.51, 13.50)
Emerald's confidence interval overlaps those of the other indexes, so there do not appear to be significant differences in the rate of increase for Emerald as compared to the other indexes.

c. Emerald's rate of increase is less than that of the general medical CPI, but since the confidence intervals overlap, the difference does not appear to be statistically significant.

19. a. The scatterplot appears as:

[Figure: scatterplot of spending per pupil against teacher salary; trend line y = 0.2107x - 1434.3, R² = 0.6968]

b. The regression statistics are:
Regression Statistics
Multiple R           0.835
R Square             0.697
Adjusted R Square    0.691
Standard Error       586.704
Observations         51

ANOVA
             df    SS              MS              F          Significance F
Regression   1     38759153.937    38759153.937    112.600    2.707E-14
Residual     49    16866844.220    344221.311
Total        50    55625998.157

                 Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept        -1434.312      490.464          -2.924   0.005     -2419.935   -448.690
Teacher Salary   0.211          0.020            10.611   0.000     0.171       0.251

The residual plots are:

[Figure: Teacher Salary Residual Plot; residuals vs. Teacher Salary]

[Figure: Normal probability plot of the residuals]

While there is no evidence of a failure in the model assumptions, there is one state (Alaska) that has a markedly higher teacher salary than the other states. However, its residual is not much larger than those of many other states in the model.

c. The chart broken down by category appears as follows:

[Figure: Spending per Pupil by region, with linear trend lines. North: y = 0.1613x - 38.462, R² = 0.5685. South: y = 0.1844x - 946.58, R² = 0.7494. West: y = 0.2729x - 3218.5, R² = 0.803]

The slopes are very similar for the North and South regions, while the slope for the West region is higher. Much of this appears to be due to the influence of the value for Alaska.

d. The regression statistics for the Northern region are:
Regression Statistics
Multiple R           0.754
R Square             0.568
Adjusted R Square    0.546
Standard Error       537.075
Observations         21

ANOVA
             df    SS               MS              F        Significance F
Regression   1     7,220,389.90     7,220,389.90    25.03    7.89392E-05
Residual     19    5,480,547.06     288,449.85
Total        20    12,700,936.95

                 Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept        -38.462        795.993          -0.048   0.962     -1704.494   1627.570
Teacher Salary   0.161          0.032            5.003    0.000     0.094       0.229

For the Southern region:

Regression Statistics
Multiple R           0.866
R Square             0.749
Adjusted R Square    0.733
Standard Error       391.358
Observations         17

ANOVA
             df    SS               MS               F         Significance F
Regression   1     6,869,183.524    6,869,183.524    44.849    7.14247E-06
Residual     15    2,297,418.594    153,161.240
Total        16    9,166,602.118

               Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept      -946.578       637.391          -1.485   0.158     -2305.146   411.989
X Variable 1   0.184          0.028            6.697    0.000     0.126       0.243

For the Western region:

Regression Statistics
Multiple R           0.896
R Square             0.803
Adjusted R Square    0.785
Standard Error       723.318
Observations         13

ANOVA
             df    SS                MS                F         Significance F
Regression   1     23,455,267.145    23,455,267.145    44.831    3.396E-05
Residual     11    5,755,070.547     523,188.232
Total        12    29,210,337.692

               Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept      -3218.537      1084.735         -2.967   0.013     -5606.024   -831.050
X Variable 1   0.273          0.041            6.696    0.000     0.183       0.363

e. The three region equations are:

North: Spending = –38.46 + 0.161(Salary); 95% CI for slope (0.094, 0.229)
South: Spending = –946.58 + 0.184(Salary); 95% CI for slope (0.126, 0.243)
West: Spending = –3218.54 + 0.273(Salary); 95% CI for slope (0.183, 0.363)

It would appear that the rate at which spending per pupil increases relative to the average teacher's salary is higher in the Western states than in the Northern and Southern states.
This could be due, in part, to the influence of the large value for the state of Alaska, and another analysis without the Alaskan value is probably warranted.

20. a. The scatterplot is:

[Figure: Highway Fatalities; x-axis: Year, y-axis: Fatality Rate; linear trend lines. New Mexico: y = -0.2362x + 13.154, R² = 0.9066. United States: y = -0.1584x + 8.7369, R² = 0.8839]

About 90% of the variation is explained by the regression on the New Mexico data and 88.4% of the variation is explained by the regression on the United States data. The slopes of the two trend lines appear to be different. One problem with these trend lines (if extended out into the future) is that they will eventually cross the x-axis, indicating a negative fatality rate, which is an impossible result.

b. The regression statistics for the New Mexico data are:

Regression Statistics
Multiple R           0.952173153
R Square             0.906633714
Adjusted R Square    0.904176706
Standard Error       0.897559381
Observations         40

ANOVA
             df    SS             MS             F           Significance F
Regression   1     297.270462     297.270462     368.9992    3.65574E-21
Residual     38    30.61328799    0.805612842
Total        39    327.88375

            Coefficients    Standard Error   t Stat          P-value     Lower 95%       Upper 95%
Intercept   13.15384615     0.28924003       45.47726724     9.6E-35     12.5683103      13.73938
Year        -0.236163227    0.012294181      -19.20935085    3.66E-21    -0.261051495    -0.21127

The regression statistics for the United States data are:

Regression Statistics
Multiple R           0.940149084
R Square             0.883880299
Adjusted R Square    0.880824518
Standard Error       0.67990177
Observations         40

ANOVA
             df    SS             MS             F           Significance F
Regression   1     133.7098762    133.7098762    289.2485    2.33234E-19
Residual     38    17.56612383    0.462266417
Total        39    151.276

            Coefficients    Standard Error   t Stat         P-value     Lower 95%      Upper 95%
Intercept   8.736923077     0.219099497      39.87650913    1.28E-32    8.293379319    9.180466835
Year        -0.158386492    0.009312849      -17.0073078    2.33E-19    -0.17723937    -0.139533613

The residual plot for the New Mexico data is:

[Figure: Year Residual Plot (New Mexico); residuals vs. Year]

The residual plot for the United States data is:

[Figure: Year Residual Plot (United States); residuals vs. Year]

There is some indication in these plots that the regression assumption that residuals should be independent has been violated. There appears to be some time factor at work in the residual plots.

c. New Mexico: Durbin-Watson = 0.719. Runs = 11. Runs p-value = 0.001
United States: Durbin-Watson = 0.270. Runs = 5. Runs p-value < 0.0001

The Durbin-Watson statistic is close to 0 for the United States data, and for both sets of data the p-value of the runs test is significant, indicating fewer runs than would be expected. This would cause us to doubt that the assumption of independence of residuals has been met.

21. a. The scatterplot is:

[Figure: Tax vs. Price; trend line y = 0.7028x + 36.344, R² = 0.7668]

About 77% of the variation in tax is explained by the home price.

b. The regression statistics are:

Regression Statistics
Multiple R           0.875664739
R Square             0.766788735
Adjusted R Square    0.764567675
Standard Error       149.5332887
Observations         107

ANOVA
             df     SS             MS             F           Significance F
Regression   1      7719537.265    7719537.265    345.2355    5.67E-35
Residual     105    2347821.464    22360.20442
Total        106    10067358.73

            Coefficients   Standard Error   t Stat         P-value     Lower 95%   Upper 95%
Intercept   36.34435129    43.23740728      0.840576565    0.402495    -49.3875    122.0762
Price       0.702784226    0.037823721      18.58051508    5.67E-35    0.627787    0.777782

The residual plot is:

[Figure: Price Residual Plot; residuals vs. Price]

The Normal plot is:

[Figure: Normal probability plot of the residuals]

c. The residual plot seems to indicate a possible violation of the assumption of constant variance. The variation of the residual values seems to increase as the home price increases. There is also some indication in the Normal P-plot that the residuals do not follow the Normal distribution.

d. The scatterplot appears as:

[Figure: Log(Tax) vs. Log(Price); trend line y = 1.047x - 0.2824, R² = 0.741]

The regression statistics are:

Regression Statistics
Multiple R           0.860809
R Square             0.740992
Adjusted R Square    0.738526
Standard Error       0.085644
Observations         107

ANOVA
             df     SS          MS          F           Significance F
Regression   1      2.203361    2.203361    300.3933    1.42E-32
Residual     105    0.770167    0.007335
Total        106    2.973527

             Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept    -0.28245       0.181979         -1.55209    0.123649    -0.64328    0.078383
Log(Price)   1.047043       0.060411         17.33186    1.42E-32    0.927258    1.166828

The residual plots appear as:

[Figure: Log(Price) Residual Plot; residuals vs. Log(Price)]

[Figure: Normal probability plot of the residuals]

Using the logarithmic transformation appears to have removed the problem with nonconstant variance. However, there is still a problem with the apparent lack of Normality in the distribution of the residuals.
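As a follow-up to the log-log fit in problem 21d, the sketch below puts the fitted equation back on the original scale. It assumes the logarithms are base 10, which the solution does not state (the axis range of about 2.5 to 3.4 for Log(Price) is consistent with it). The function names are hypothetical; the coefficients are the ones reported in the regression output.

```python
import math

# Coefficients reported for the Log(Tax) vs. Log(Price) regression.
INTERCEPT = -0.28245
SLOPE = 1.047043

def predict_tax(price):
    """Back-transform the log-log fit: tax = 10**INTERCEPT * price**SLOPE.

    Assumes base-10 logarithms (an assumption, not stated in the solution).
    """
    return 10 ** (INTERCEPT + SLOPE * math.log10(price))

def pct_change_in_tax(pct_change_in_price):
    """Percent change in predicted tax for a given percent change in price."""
    ratio = (1 + pct_change_in_price / 100) ** SLOPE
    return (ratio - 1) * 100

# Multiplicative constant implied by the intercept.
scale = 10 ** INTERCEPT

print(f"tax = {scale:.3f} * price^{SLOPE:.3f}")
print(f"predicted tax at price 1000: {predict_tax(1000):.1f}")
print(f"a 1% higher price implies about {pct_change_in_tax(1):.2f}% higher tax")
```

Under these assumptions the slope of 1.047 reads as an elasticity: a 1% higher price goes with roughly a 1.05% higher tax. Because the confidence interval for the slope (0.927, 1.167) contains 1, the fit is consistent with tax being proportional to price.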