1) (a) How does correlation analysis differ from regression analysis?
Correlation analysis is used to study whether two variables are related and, if they are, the direction and magnitude of that relationship. Regression analysis goes a step further: it fits an equation that describes how the dependent variable changes with the independent variable, so it can be used for estimation and prediction.

(b) What does a correlation coefficient reveal?
The sign of the correlation coefficient reveals the direction, or nature, of the relationship. If the coefficient is positive, the two variables move in the same direction: an increase in x is accompanied by an increase in y, and a decrease in x is accompanied by a decrease in y. If the coefficient is negative, the two variables move in opposite directions: an increase in x is accompanied by a decrease in y, and vice versa. The value of the correlation coefficient r always lies between -1 and +1, so its magnitude lies between 0 and 1. A magnitude near 1 indicates very high correlation; a magnitude near 0 indicates very low or no correlation.

(c) State the quick rule for a significant correlation and explain its limitations.
When r = +1 there is perfect positive correlation, when r = -1 there is perfect negative correlation, and when r = 0 there is no correlation; a value of r near ±1 suggests high correlation and a value near 0 suggests low correlation. A common quick rule is that r is significant at roughly α = .05 when |r| > 2/√n. Its limitations are that it is only an approximation to the formal t test, it is unreliable for very small samples, and a statistically significant r may still describe a weak, practically unimportant relationship, especially when n is large.

(d) What sums are needed to calculate a correlation coefficient?
Let x and y be the variables. The following quantities are needed to calculate Pearson's product moment correlation coefficient:
n = number of pairs of observations (x, y)
Σx = sum of the values of the variable x
Σy = sum of the values of the variable y
Σx² = sum of the squares of the values of the variable x
Σy² = sum of the squares of the values of the variable y
Σxy = sum of the products of the values of the variables x and y

(e) What are the two ways of testing a correlation coefficient for significance?
The two ways are the t test, using the statistic t = r√(n−2) / √(1−r²), which follows Student's t distribution with n−2 degrees of freedom, and the F test, using the statistic F = (n−2)r² / (1−r²), which follows the F distribution with 1 and (n−2) degrees of freedom.
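As an illustration of (d) and (e), the sketch below computes Pearson's r from those six sums and then forms both test statistics. The small data set is hypothetical, included only so the code runs end to end.

```python
# Minimal sketch: Pearson's r from the sums listed in (d), plus the t and F
# statistics from (e). The data values here are hypothetical, not from the text.
import math

x = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical x values
y = [3.1, 5.9, 6.8, 9.2, 11.5]   # hypothetical y values

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2, sum_y2 = sum(v * v for v in x), sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Computational form of Pearson's product moment correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# The two equivalent significance tests from (e)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t with n-2 df
f_stat = (n - 2) * r ** 2 / (1 - r ** 2)                 # F with 1 and n-2 df

print(f"r = {r:.4f}, t = {t_stat:.4f}, F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
```

Note that the printout confirms F = t², the same identity used in problems 2(e) and 3(e) below.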
2) In the following regression, X = weekly pay, Y = income tax withheld, and n = 35 McDonald's employees.

(a) Write the fitted regression equation.
Y = 30.7963 + 0.0343 X

(b) State the degrees of freedom for a two-tailed test for zero slope, and use Appendix D to find the critical value at α = .05.
Degrees of freedom = n − 2 = 35 − 2 = 33; critical value = t(0.05/2, 33) = 2.0345.

(c) What is your conclusion about the slope?
The t value is 2.889, and |t| = 2.889 > 2.0345 (the critical value), so the null hypothesis is rejected: the slope is significantly different from 0.

(d) Interpret the 95 percent confidence limits for the slope.
The 95% confidence interval for the slope is (0.0101, 0.0584). This is the interval obtained from this particular sample. If the sampling experiment were repeated independently and a 95% confidence interval computed each time, then 95% of such intervals would include the true value of the slope.

(e) Verify that F = t² for the slope.
F = 8.35 and t = 2.889, so t² = 2.889² = 8.3463 ≈ 8.35 and F = t². (The small difference is rounding error in the reported t and F; theoretically they are exactly equal.)

(f) In your own words, describe the fit of this regression.
The p-value is 0.0068 < 0.01, so F is significant and the regression is statistically meaningful. However, since R² = 0.202, only 20.2% of the total variation in Y is explained by X, so the fit is weak. It could be improved by considering a nonlinear model and by including other variables that may influence Y.

R² = 0.202, Std. Error = 6.816, n = 35

Regression output (with 95% confidence interval)
Variables    coefficients   std. error   t (df = 33)   p-value   95% lower   95% upper
Intercept    30.7963        6.4078       4.806         .0000     17.7595     43.8331
Slope         0.0343        0.0119       2.889         .0068      0.0101      0.0584

ANOVA table
Source       SS           df   MS          F      p-value
Regression     387.6959    1   387.6959    8.35   .0068
Residual     1,533.0614   33    46.4564
Total        1,920.7573   34

3) In the following regression, X = total assets ($ billions), Y = total revenue ($ billions), and n = 64 large banks.

(a) Write the fitted regression equation.
Y = 6.5763 + 0.0452 X

(b) State the degrees of freedom for a two-tailed test for zero slope, and use Appendix D to find the critical value at α = .05.
Degrees of freedom = n − 2 = 64 − 2 = 62; critical value = t(0.05/2, 62) = 1.9990.

(c) What is your conclusion about the slope?
|t| = 8.183 > 1.9990 (the critical value), so the slope is significantly different from 0.

(d) Interpret the 95 percent confidence limits for the slope.
The 95% confidence interval for the slope is (0.0342, 0.0563). This is the interval obtained from this particular sample. If the sampling experiment were repeated independently and a 95% confidence interval computed each time, then 95% of such intervals would include the true value of the slope.

(e) Verify that F = t² for the slope.
F = 66.97 and t = 8.183, so t² = 8.183² = 66.9615 ≈ 66.97 and F = t². (The small difference is rounding error in the reported t and F; theoretically they are exactly equal. A computational sketch follows the output tables below.)

(f) In your own words, describe the fit of this regression.
The p-value is 1.90E-11 < 0.01, so F is significant and the regression is statistically meaningful. However, since R² = 0.519, only 51.9% of the total variation in Y is explained by X. The fit might be improved by considering a nonlinear model and by including other variables that may influence Y.

R² = 0.519, Std. Error = 6.977, n = 64

ANOVA table
Source       SS           df   MS           F       p-value
Regression   3,260.0981    1   3,260.0981   66.97   1.90E-11
Residual     3,018.3339   62      48.6828
Total        6,278.4320   63

Regression output (with 95% confidence interval)
Variables    coefficients   std. error   t (df = 62)   p-value    95% lower   95% upper
Intercept    6.5763         1.9254       3.416         .0011      2.7275      10.4252
X1           0.0452         0.0055       8.183         1.90E-11   0.0342       0.0563
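The slope test and confidence interval in parts (b)–(e) of problem 3 can be reproduced from the summary output alone. The sketch below is a minimal illustration, assuming SciPy is available for the t critical value; the slope and standard error are the rounded values from the regression output above, so the results differ slightly from the printed t and F.

```python
# Minimal sketch of parts (b)-(e) of problem 3, using only the summary values
# reported in the regression output (slope 0.0452, std. error 0.0055, n = 64).
from scipy import stats

n = 64
df = n - 2                      # 62 degrees of freedom for the slope test
b1, se_b1 = 0.0452, 0.0055      # slope estimate and its standard error

t_crit = stats.t.ppf(1 - 0.05 / 2, df)    # two-tailed critical value at alpha = .05
t_stat = b1 / se_b1                       # test statistic for H0: slope = 0

lower = b1 - t_crit * se_b1               # 95% confidence limits for the slope
upper = b1 + t_crit * se_b1

print(f"critical t = {t_crit:.4f}, t = {t_stat:.3f}, reject H0: {abs(t_stat) > t_crit}")
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
print(f"t^2 = {t_stat ** 2:.2f}  (differences from F = 66.97 are due to rounding)")
```

The same steps apply to problem 2 with b1 = 0.0343, se_b1 = 0.0119, and df = 33.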
4) A researcher used stepwise regression to create regression models to predict BirthRate (births per 1,000) using five predictors: LifeExp (life expectancy in years), InfMort (infant mortality rate), Density (population density per square kilometer), GDPCap (Gross Domestic Product per capita), and Literate (literacy percent). Interpret these results.

Regression Analysis—Stepwise Selection (best model of each size)
153 observations; BirthRate is the dependent variable
Candidate predictors: LifeExp, InfMort, Density, GDPCap, Literate
p-values for the coefficients of the variables included in each model:

Nvar   p-values                                 s       Adj R²   R²
1      .0000                                    6.318   .722     .724
2      .0000  .0000                             5.334   .802     .805
3      .0000  .0242  .0000                      5.261   .807     .811
4      .5764  .0000  .0311  .0000               5.273   .806     .812
5      .5937  .0000  .6289  .0440  .0000        5.287   .805     .812

(In the five-variable model the p-values correspond, in order, to LifeExp, InfMort, Density, GDPCap, and Literate.)

5) An expert witness in a case of alleged racial discrimination in a state university school of nursing introduced a regression of the determinants of the Salary of each professor for each year during an 8-year period (n = 423), with the following results. The dependent variable is Salary; the predictors are Year (the year in which the salary was observed), YearHire (the year when the individual was hired), Race (1 if the individual is black, 0 otherwise), and Rank (1 if the individual is an assistant professor, 0 otherwise). Interpret these results.

Variable     Coefficient    t        p
Intercept    -3,816,521     -29.4    .000
Year              1,948      29.8    .000
YearHire           -826      -5.5    .000
Race             -2,093      -4.3    .000
Rank             -6,438     -22.3    .000

R² = 0.811, R² adj = 0.809, s = 3,318

6) (a) Plot the data on U.S. general aviation shipments.
(b) Describe the pattern and discuss possible causes.
(c) Would a fitted trend be helpful? Explain.
(d) Make a similar graph for 1992–2003 only. Would a fitted trend be helpful in making a prediction for 2004?
(e) Fit a trend model of your choice to the 1992–2003 data (see the sketch after the table below).
(f) Make a forecast for 2004, using either the fitted trend model or a judgment forecast. Why is it best to ignore earlier years in this data set?

U.S. Manufactured General Aviation Shipments, 1966–2003
Year   Planes    Year   Planes    Year   Planes    Year   Planes
1966   15,587    1976   15,451    1986    1,495    1996    1,053
1967   13,484    1977   16,904    1987    1,085    1997    1,482
1968   13,556    1978   17,811    1988    1,143    1998    2,115
1969   12,407    1979   17,048    1989    1,535    1999    2,421
1970    7,277    1980   11,877    1990    1,134    2000    2,714
1971    7,346    1981    9,457    1991    1,021    2001    2,538
1972    9,774    1982    4,266    1992      856    2002    2,169
1973   13,646    1983    2,691    1993      870    2003    2,090
1974   14,166    1984    2,431    1994      881
1975   14,056    1985    2,029    1995    1,028
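For parts (e) and (f) of problem 6, one simple choice is a linear trend fitted by least squares to the 1992–2003 values from the table above. The sketch below assumes NumPy is available; a linear trend is only one of several reasonable trend models for these data.

```python
# Minimal sketch for parts (e)-(f): fit a linear trend to the 1992-2003
# shipments (values taken from the table above) and extrapolate to 2004.
import numpy as np

years = np.arange(1992, 2004)
planes = np.array([856, 870, 881, 1028, 1053, 1482,
                   2115, 2421, 2714, 2538, 2169, 2090])

t = years - 1991                               # time index 1..12 for 1992..2003
slope, intercept = np.polyfit(t, planes, 1)    # least-squares linear trend

forecast_2004 = intercept + slope * 13         # t = 13 corresponds to 2004
print(f"Fitted trend: planes = {intercept:.1f} + {slope:.1f} * t")
print(f"Trend forecast for 2004: about {forecast_2004:.0f} planes")
```

A judgment forecast could then adjust this trend value in light of the dip in shipments after 2000 that is visible in the table.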