1) (a) How does correlation analysis differ from regression analysis?
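Correlation analysis studies whether two variables are related and, if they are, the direction and
magnitude of that relationship. Regression analysis goes further: it fits an equation that estimates
the dependent variable from the independent variable, so it can be used for prediction.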
(b) What does a correlation coefficient reveal?
The sign of the correlation coefficient reveals the direction or nature of the relationship. If the
correlation coefficient is positive, the two variables move in the same direction: an increase in x is
accompanied by an increase in y, and a decrease in x is accompanied by a decrease in y.
If the correlation coefficient is negative, the two variables move in opposite directions. An increase
in x is accompanied by a decrease in y and vice versa.
The value of the correlation coefficient r always lies between -1 and +1, so its magnitude |r| lies
between 0 and 1. If the magnitude is near 1, the correlation is very high. If the magnitude is near 0,
there is very low correlation or no correlation.
(c) State the quick rule for a significant correlation and explain its limitations.
When r = +1, there is perfect positive correlation.
When r = -1, there is perfect negative correlation.
When r = 0, there is no correlation.
When r is near ±1, high correlation can be inferred. When r is near 0, low correlation can be
inferred.
(d) What sums are needed to calculate a correlation coefficient?
Let x and y be the two variables. The following quantities are needed to calculate Pearson's
product-moment correlation coefficient:
$n$ = number of pairs of observations (x, y)
$\sum x$ = sum of the values of the variable x
$\sum y$ = sum of the values of the variable y
$\sum x^2$ = sum of the squares of the values of the variable x
$\sum y^2$ = sum of the squares of the values of the variable y
$\sum xy$ = sum of the products of the values of the variables x and y
(e) What are the two ways of testing a correlation coefficient for significance?
The two ways are the t test and the F test. The t test uses the statistic
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$
which follows Student's t distribution with n − 2 degrees of freedom. The F test uses the statistic
$$F = \frac{(n-2)\,r^2}{1-r^2},$$
which follows the F distribution with 1 and n − 2 degrees of freedom.
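The two test statistics can be computed side by side to see that they carry the same information. This is a minimal sketch: the function name and the values r = 0.5, n = 27 are assumptions chosen for illustration, and the critical values would still have to be looked up in a t or F table.

```python
import math

def correlation_test_statistics(r, n):
    """Return the t and F statistics for testing a correlation coefficient."""
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # df = n - 2
    f_stat = (n - 2) * r ** 2 / (1 - r ** 2)                # df = (1, n - 2)
    return t_stat, f_stat

# Hypothetical sample correlation and sample size.
t_stat, f_stat = correlation_test_statistics(r=0.5, n=27)
print(f"t = {t_stat:.4f}, F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
```

Squaring the t statistic reproduces the F statistic, which is why the two tests always lead to the same conclusion for a two-tailed test.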
2) In the following regression, X = weekly pay, Y = income tax withheld, and n = 35 McDonald’s
employees.
(a) Write the fitted regression equation.
Y = 30.7963 + 0.0343 X
(b) State the degrees of freedom for a two-tailed test for zero slope, and use Appendix D to find the
critical value at α = .05.
Degrees of freedom = n-2 = 35-2 = 33
Critical value = t(0.05/2 , 33) = 2.0345
(c) What is your conclusion about the slope?
The t value is 2.889, and |t| = 2.889 > 2.0345 (the critical value), so the null hypothesis of zero
slope is rejected. The slope is significantly different from 0.
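The decision above can be retraced from the slope and its standard error in the regression output. The sketch below is only illustrative: the function name is an assumption, and because the printed coefficient and standard error are rounded, the recomputed t differs slightly from the reported 2.889.

```python
def slope_t_test(slope, std_error, t_critical):
    """Two-tailed test of H0: slope = 0 against H1: slope != 0."""
    t_stat = slope / std_error
    reject_null = abs(t_stat) > t_critical
    return t_stat, reject_null

# Values taken from the regression output for the weekly pay example.
t_stat, reject_null = slope_t_test(slope=0.0343, std_error=0.0119, t_critical=2.0345)
print(f"t = {t_stat:.3f}, reject H0: {reject_null}")
```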
(d) Interpret the 95 percent confidence limits for the slope.
The 95% confidence interval for the slope is (0.0101, 0.0584).
This is the interval we obtained for our particular sample. When the sampling experiment is
repeated independently, and each time we find the 95% confidence interval, then 95% of such
intervals will include the true value of the slope.
(e) Verify that $F = t^2$ for the slope.
F = 8.35
t = 2.889
$t^2 = 2.889^2 = 8.3463 \approx 8.35$, so $F = t^2$.
(The small difference seen here is due to rounding in t and F. Theoretically, they are exactly
equal.)
(f) In your own words, describe the fit of this regression.
The p-value is 0.0068 < 0.01, so the F statistic is significant and the regression as a whole is
statistically significant. However, $R^2 = 0.202$, so only 20.2% of the total variation in Y is explained
by X, which is a weak fit. The fit might be improved by considering a nonlinear model and including
other variables that may influence Y.
Regression output (confidence interval)
Variables    Coefficients   Std. error   t (df = 33)   p-value   95% lower   95% upper
Intercept    30.7963        6.4078       4.806         .0000     17.7595     43.8331
Slope        0.0343         0.0119       2.889         .0068     0.0101      0.0584

ANOVA table
Source       SS           df   MS         F      p-value
Regression   387.6959     1    387.6959   8.35   .0068
Residual     1,533.0614   33   46.4564
Total        1,920.7573   34

R² = 0.202
Std. Error = 6.816
n = 35
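The summary figures in this output are all tied together by the sums of squares. As a check, the following sketch recomputes the mean squares, F, R² and the standard error from the two sums of squares and n; the variable names are my own, and small differences from the printed values are due to rounding.

```python
import math

# Sums of squares and sample size from the ANOVA table above.
ss_regression = 387.6959
ss_residual = 1533.0614
n = 35

df_regression = 1          # one predictor
df_residual = n - 2        # simple regression

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual      # F = MSR / MSE
r_squared = ss_regression / ss_total      # share of variation explained
std_error = math.sqrt(ms_residual)        # standard error of the estimate

print(f"SS total = {ss_total:.4f}")
print(f"F = {f_stat:.2f}, R^2 = {r_squared:.3f}, s = {std_error:.3f}")
```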
3) In the following regression, X = total assets ($ billions), Y = total revenue ($ billions), and n = 64
large banks.
(a) Write the fitted regression equation.
Y = 6.5763 + 0.0452 X
(b) State the degrees of freedom for a two-tailed test for zero slope, and use Appendix D to find the
critical value at α = .05.
Degrees of freedom = n-2 = 64-2 = 62
Critical value = t(0.05/2 , 62) = 1.9990
(c) What is your conclusion about the slope?
|t| = 8.183 > 1.9990 (the critical value), so the null hypothesis of zero slope is rejected.
The slope is significantly different from 0.
(d) Interpret the 95 percent confidence limits for the slope.
The 95% confidence interval for the slope is (0.0342, 0.0563).
This is the interval we obtained for our particular sample. When the sampling experiment is
repeated independently, and each time we find the 95% confidence interval, then 95% of such
intervals will include the true value of the slope.
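The interval itself is the slope estimate plus or minus the critical t value times the standard error of the slope. A minimal sketch using the bank regression figures follows; the function name is an assumption, and since the inputs are rounded the limits may differ from the printed ones in the last digit.

```python
def slope_confidence_interval(slope, std_error, t_critical):
    """Confidence interval for the slope: estimate +/- t * standard error."""
    margin = t_critical * std_error
    return slope - margin, slope + margin

# Slope, standard error and critical value (df = 62) from the bank regression.
lower, upper = slope_confidence_interval(slope=0.0452, std_error=0.0055, t_critical=1.9990)
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

Because the whole interval lies above 0, it agrees with the t test in rejecting a zero slope.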
(e) Verify that $F = t^2$ for the slope.
F = 66.97
t = 8.183
$t^2 = 8.183^2 = 66.9615 \approx 66.97$, so $F = t^2$.
(The small difference seen here is due to rounding in t and F. Theoretically, they are exactly
equal.)
(f) In your own words, describe the fit of this regression.
The p-value is 1.90E-11 < 0.01, so the F statistic is significant and the regression as a whole is
statistically significant. $R^2 = 0.519$, so 51.9% of the total variation in Y is explained by X, which
leaves nearly half of the variation unexplained. The fit might be improved by considering a nonlinear
model and including other variables that may influence Y.
R² = 0.519
Std. Error = 6.977
n = 64

ANOVA table
Source       SS           df   MS           F       p-value
Regression   3,260.0981   1    3,260.0981   66.97   1.90E-11
Residual     3,018.3339   62   48.6828
Total        6,278.4320   63

Regression output (confidence interval)
Variables    Coefficients   Std. error   t (df = 62)   p-value    95% lower   95% upper
Intercept    6.5763         1.9254       3.416         .0011      2.7275      10.4252
X1           0.0452         0.0055       8.183         1.90E-11   0.0342      0.0563
4) A researcher used stepwise regression to create regression models to predict BirthRate (births per
1,000) using five predictors: LifeExp (life expectancy in years), InfMort (infant mortality rate), Density
(population density per square kilometer), GDPCap (Gross Domestic Product per capita), and Literate
(literacy percent). Interpret these results.

Regression Analysis: Stepwise Selection (best model of each size)
153 observations
BirthRate is the dependent variable
p-values for the coefficients
Predictor columns, in order: LifeExp, InfMort, Density, GDPCap, Literate

Nvar   p-values of included predictors (in column order)   s       Adj R²   R²
1      .0000                                               6.318   .722     .724
2      .0000  .0000                                        5.334   .802     .805
3      .0000  .0242  .0000                                 5.261   .807     .811
4      .5764  .0000  .0311  .0000                          5.273   .806     .812
5      .5937  .0000  .6289  .0440  .0000                   5.287   .805     .812
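One way to read this output is to compare R² with adjusted R², which penalizes each extra predictor. The sketch below recomputes adjusted R² from the reported R² values with the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1). The function and dictionary names are assumptions, and the results can differ from the printed figures in the third decimal because the R² inputs are rounded.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

n = 153  # observations in the stepwise output

# Number of predictors -> (reported R^2, reported adjusted R^2).
reported = {
    1: (0.724, 0.722),
    2: (0.805, 0.802),
    3: (0.811, 0.807),
    4: (0.812, 0.806),
    5: (0.812, 0.805),
}

computed = {k: adjusted_r_squared(r2, n, k) for k, (r2, _) in reported.items()}
best_size = max(computed, key=computed.get)

for k, value in computed.items():
    print(f"{k} predictors: adjusted R^2 = {value:.4f} (reported {reported[k][1]:.3f})")
print(f"Largest adjusted R^2 at {best_size} predictors")
```

R² barely moves beyond three predictors while adjusted R² starts to fall, which matches the large p-values that appear in the four- and five-variable models.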
5) An expert witness in a case of alleged racial discrimination in a state university school of nursing
introduced a regression of the determinants of Salary of each professor for each year during an
8-year period (n = 423) with the following results, with dependent variable Salary and predictors
Year (year in which the salary was observed), YearHire (year when the individual was hired), Race
(1 if individual is black, 0 otherwise), and Rank (1 if individual is an assistant professor, 0 otherwise).
Interpret these results.

Variable    Coefficient   t       p
Intercept   -3,816,521    -29.4   .000
Year        1,948         29.8    .000
YearHire    -826          -5.5    .000
Race        -2,093        -4.3    .000
Rank        -6,438        -22.3   .000

R² = 0.811   R² adj = 0.809   s = 3,318

6) (a) Plot the data on U.S. general aviation shipments.
(b) Describe the pattern and discuss possible causes.
(c) Would a fitted trend be helpful? Explain.
(d) Make a similar graph for 1992–2003 only. Would a fitted trend be helpful in making a prediction
for 2004?
(e) Fit a trend model of your choice to the 1992–2003 data.
(f) Make a forecast for 2004, using either the fitted trend model or a judgment forecast. Why is it
best to ignore earlier years in this data set?

U.S. Manufactured General Aviation Shipments, 1966–2003
Year   Planes    Year   Planes    Year   Planes    Year   Planes
1966   15,587    1976   15,451    1986   1,495     1996   1,053
1967   13,484    1977   16,904    1987   1,085     1997   1,482
1968   13,556    1978   17,811    1988   1,143     1998   2,115
1969   12,407    1979   17,048    1989   1,535     1999   2,421
1970   7,277     1980   11,877    1990   1,134     2000   2,714
1971   7,346     1981   9,457     1991   1,021     2001   2,538
1972   9,774     1982   4,266     1992   856       2002   2,169
1973   13,646    1983   2,691     1993   870       2003   2,090
1974   14,166    1984   2,431     1994   881
1975   14,056    1985   2,029     1995   1,028
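For part (e), one possible starting point is an ordinary least-squares linear trend on the 1992–2003 shipments. The sketch below is an assumption about which trend model to try, not a worked answer from the source; the function and variable names are my own. Shipments fall after 2000 in the data, so a straight line may overstate 2004 and a judgment forecast could reasonably differ.

```python
# Shipments for 1992-2003, copied from the table above.
years = list(range(1992, 2004))
planes = [856, 870, 881, 1028, 1053, 1482, 2115, 2421, 2714, 2538, 2169, 2090]

def fit_linear_trend(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_linear_trend(years, planes)
forecast_2004 = intercept + slope * 2004   # extrapolate one year ahead

print(f"Estimated change per year: {slope:.1f} planes")
print(f"Linear trend forecast for 2004: {forecast_2004:.0f} planes")
```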