Regression and Correlation

Chapter 8: Regression and Correlation
1.
False. The size of the slope is unrelated to the size of the correlation. A linear regression can result in a
large value for the slope, but the value itself might be nonsignificant.
2.
False. The size of the slope is not related to the size of the correlation.
3.
True
4.
False. A correlation of zero means that there is no linear relationship, but there could be a nonlinear
relationship between the two variables.
5.
False. The Runs test is only appropriate for time-ordered residuals.
6.
μ = 2(10)(15)/(25)+1 = 13; σ = 2.345; z = (10–13+0.5)/2.345 = –1.066; p(z <= –1.066) = 0.1432.
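The arithmetic in answer 6 can be checked with a short script. This is a minimal sketch, assuming the counts implied by the formula above (10 and 15 observations of each sign, 10 observed runs) and the usual normal approximation with a 0.5 continuity correction. The function name runs_test_z is illustrative and does not come from the text.

```python
import math

def runs_test_z(n1, n2, runs):
    """Normal approximation to the runs test (lower tail, continuity-corrected)."""
    n = n1 + n2
    mu = 2 * n1 * n2 / n + 1                                   # expected number of runs
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))  # variance of the run count
    sigma = math.sqrt(var)
    z = (runs - mu + 0.5) / sigma
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))                 # P(Z <= z)
    return mu, sigma, z, p

mu, sigma, z, p = runs_test_z(10, 15, 10)
print(round(mu, 3), round(sigma, 3), round(z, 3), round(p, 4))
```

The one-sided lower-tail probability is used here because the answer above reports p(z <= –1.066).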
7.
a.
10
b.
37.83
c.
0.3806
d.
38.06%
e.
0.6169
f.
0.057
g.
5.138
8.
a.
Select the rows for the first region and delete the data.
b.
The scatterplot appears as:
[Figure: Mortality vs. Temperature (Mortality Index vs. Mean Annual Temperature) with trend line y = 3.0152x - 52.618, R2 = 0.8534]
c.
The regression statistics are:
Regression Statistics
Multiple R           0.924
R Square             0.853
Adjusted R Square    0.842
Standard Error       5.933
Observations         15
ANOVA
              df    SS          MS          F         Significance F
Regression    1     2664.336    2664.336    75.701    8.816E-07
Residual      13    457.541     35.195
Total         14    3121.877
               Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept      -52.62          15.82             -3.33     0.01       -86.80       -18.43
Temperature    3.02            0.35              8.70      0.00       2.27         3.76
d.
The residual plots appear as follows:
[Figure: Temperature Residual Plot (Residuals vs. Temperature)]
[Figure: Residuals vs. Predicted Values]
The P-plot is:
[Figure: Normal P-plot of the residuals]
e.
The correlation statistics are:
            Correlation    p-value
Pearson     0.924          0.000
Spearman    0.900          0.000
f.
The slope of the regression line is 3.015, meaning that the mortality index rises 3.015 points for
every degree increase in mean annual temperature. There is no reason to doubt the validity of the
regression based on the residual plots.
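As a cross-check on the regression output for problem 8, the summary statistics can be recomputed from the ANOVA sums of squares. The sketch below uses only the SS and df values reported in the output; the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table for problem 8.
ss_regression, ss_residual = 2664.336, 457.541
df_regression, df_residual = 1, 13

ss_total = ss_regression + ss_residual
r_square = ss_regression / ss_total                 # share of variation explained
ms_residual = ss_residual / df_residual             # mean squared error
f_stat = (ss_regression / df_regression) / ms_residual
std_error = math.sqrt(ms_residual)                  # standard error of the regression
n = df_regression + df_residual + 1                 # number of observations
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_residual

print(round(r_square, 3), round(adj_r_square, 3), round(std_error, 3), round(f_stat, 1))
```

The recomputed values agree with the reported R Square, Adjusted R Square, Standard Error, and F to the precision shown in the output.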
9.
a.
The scatterplot appears as:
[Figure: Calories vs. Total with trend line y = 4.0269x + 3.7639, R2 = 0.9475]
b.
r = 0.830. p-value = 0.003.
c.
The regression statistics are:
Regression Statistics
Multiple R           0.830
R Square             0.689
Adjusted R Square    0.650
Standard Error       17.507
Observations         10
ANOVA
              df    SS             MS          F           Significance F
Regression    1     5438.105128    5438.105    17.74335    0.0029
Residual      8     2451.894872    306.4869
Total         9     7890
              Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept     41.18           14.80             2.78      0.0238     7.05         75.31
Serving oz    49.29           11.70             4.21      0.0029     22.31        76.27
d.
The plot of residuals vs. serving oz appears as follows:
[Figure: Serving oz Residual Plot (Residuals vs. Serving oz)]
The Normal P-plot of the residuals is:
[Figure: Normal P-plot of the residuals]
There is no reason to doubt the regression assumptions based on the residual plots.
e.
The residual plot appears as follows:
[Figure: Residuals vs. Predicted Values, with points labeled by food type]
All of the breads have negative residual values.
f.
The Total values are:
Brand         Food            Total
Anderson      Pretzel         27
Uncle B       Bagel           30.5
Bays          Eng Muffin      32
Thomas        Eng Muffin      30
Quaker        OHs Cereal      27
Nabisco       Grah Cracker    13
Wheaties      Cereal          27
Wonder        Bread           18
Brownberry    Bread           14
Pepperidge    Bread           18
g.
The plot of Calories vs. Total is:
[Figure: Calories vs. Total with trend line y = 4.0269x + 3.7639, R2 = 0.9475]
The regression statistics are:
Regression Statistics
Multiple R           0.973
R Square             0.948
Adjusted R Square    0.941
Standard Error       7.194
Observations         10
ANOVA
              df    SS             MS          F           Significance F
Regression    1     7475.933518    7475.934    144.4393    2.119E-06
Residual      8     414.0664823    51.75831
Total         9     7890
             Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept    3.764           8.244             0.457     0.660      -15.248      22.775
Total        4.027           0.335             12.018    0.000      3.254        4.800
The plot of residuals vs. predicted values is:
[Figure: Residuals vs. Predicted Values 2, with points labeled by food type]
The R2 value changes from 0.689 for the Calories vs. Serving oz. regression to 0.948 for the
Calories vs. Total regression, so the second regression fits much better. Moreover,
the breads in the residual plot are evenly distributed, unlike in the first regression.
10.
a.
The correlation matrix appears as follows:
Pearson Correlations
              Calories    Carbo    Fat      Protein    Serving oz
Calories      1.000       0.961    0.226    0.645      0.830
Carbo                     1.000    0.059    0.617      0.723
Fat                                1.000    -0.248     0.086
Protein                                     1.000      0.837
Serving oz                                             1.000
Pearson Probabilities
              Calories    Carbo    Fat      Protein    Serving oz
Calories      -           0.000    0.530    0.044      0.003
Carbo                     -        0.872    0.057      0.018
Fat                                -        0.489      0.814
Protein                                     -          0.003
Serving oz                                             -
The scatterplot matrix is:
[Figure: scatterplot matrix of Calories, Carbo, Fat, Protein, and Serving oz]
b.
There is very little fat in any of these foods, so it is difficult to see the effect of fat. Calories
can also come from other sources, such as starches.
c.
If foods with higher fat content were included, a larger proportion of the calories would be from fat,
and there would be a stronger relationship.
11.
a.
The correlations and p-values are:
            Correlation    p-value
Pearson     -0.455         0.017
Spearman    -0.703         0.000
b.
The scatterplot appears as:
[Figure: Price vs. Age]
The plot shows that the price can go up for old Mustangs. The old ones are perceived as classic
antiques, and therefore are worth more. This means there is not a linear relationship between Price
and Age, which invalidates one of the assumptions of the Pearson correlation coefficient.
c.
The correlations for the younger Mustangs are:
Correlations <10
            Correlation    p-value
Pearson     -0.895         0.000
Spearman    -0.924         0.000
The correlation between Price and Age is much stronger for the younger cars.
d.
The regression statistics for the younger Mustangs are:
Regression Statistics
Multiple R           0.895
R Square             0.801
Adjusted R Square    0.790
Standard Error       1704.422
Observations         20
ANOVA
              df    SS             MS         F           Significance F
Regression    1     210100905.2    2.1E+08    72.32256    1.01522E-07
Residual      18    52290963.76    2905054
Total         19    262391869
             Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept    14380.62        861.67            16.69     0.00       12570.31     16190.92
Age          -1383.60        162.70            -8.50     0.00       -1725.41     -1041.79
Using the Regression command, we can calculate the regression equation of Price on Age as
PRICE = $14,381 – $1383.60(AGE), which means that there is a drop of about $1384 per year in
price.
e.
The diagnostic plots are:
[Figure: Age Residual Plot (Residuals vs. Age)]
[Figure: Normal P-Plot of the residuals]
There is one observation whose residual seems markedly higher than the others. This belongs to a
2-year-old Mustang that was sold for $16,000, or $4,386.59 more than would be expected based
on the regression equation. It would be interesting to see the effect on the regression equation if
this one car were removed from the data set.
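The size of that residual follows directly from the fitted equation. The sketch below plugs the reported coefficients into the line for the younger Mustangs; the function name predicted_price is illustrative, and because the coefficients are rounded to two decimals the result can differ from the reported residual by a cent or so.

```python
# Coefficients from the regression output for Mustangs under 10 years old.
INTERCEPT = 14380.62
SLOPE = -1383.60  # dollars per year of age

def predicted_price(age):
    """Predicted price from the fitted line PRICE = INTERCEPT + SLOPE * AGE."""
    return INTERCEPT + SLOPE * age

observed = 16000.0             # sale price of the 2-year-old car
fitted = predicted_price(2)
residual = observed - fitted   # positive: sold for more than the line predicts
print(f"fitted = {fitted:.2f}, residual = {residual:.2f}")
```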
12.
a.
Calculus = 56.999 + 1.192(Algebra Placement)
The 95% confidence interval for the slope = (0.715, 1.668)
b.
When the placement score increases by one point, according to the regression equation there is a
1.19 point increase in the final Calculus grade.
c.
The residual plot is:
[Figure: Alg Place Residual Plot (Residuals vs. Alg Place)]
The vertical spread decreases as we move to the right, although it is a mild trend. Nevertheless, it
does appear that the assumption of constant variance is not perfectly satisfied. However, there is
likely to be very little effect on the coefficient from this problem, because there is only a mild
trend in variance as the Algebra Placement score increases.
13.
a.
The regression plot appears as follows:
[Figure: Net Income vs. Total Assets with trend line y = 0.7328x + 5.673, R2 = 0.9053]
b.
The regression statistics are:
Regression Statistics
Multiple R           0.951
R Square             0.905
Adjusted R Square    0.903
Standard Error       24.578
Observations         45
ANOVA
              df    SS           MS           F           Significance F
Regression    1     248242.30    248242.30    410.9317    1.25188E-23
Residual      43    25976.14     604.10
Total         44    274218.44
               Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept      5.673           4.717             1.203     0.236      -3.839       15.185
Total Asset    0.733           0.036             20.271    0.000      0.660        0.806
The residual plot is:
[Figure: Residuals vs. Predicted Values]
c.
The regression statistics are:
Regression Statistics
Multiple R           0.893
R Square             0.798
Adjusted R Square    0.793
Standard Error       0.169
Observations         45
ANOVA
              df    SS          MS          F           Significance F
Regression    1     4.869708    4.869708    169.6205    1.6E-16
Residual      43    1.234506    0.028709
Total         44    6.104213
             Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept    0.070           0.123             0.573     0.570      -0.177       0.318
LogTA        0.906           0.070             13.024    0.000      0.765        1.046
The plot of log(Net Income) vs. log(Total Assets) and the plot of the residuals vs.
predicted values appear as follows:
[Figure: LogNI vs. LogTA with trend line y = 0.9057x + 0.0703, R2 = 0.7978]
[Figure: Residuals vs. Predicted Values Log]
The residual plot for the transformed data is much better, revealing no difficulties with the
regression assumptions.
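To read the transformed model on the original scale, the fitted line for problem 13 can be back-transformed into a power law. This is a sketch under one stated assumption: that the logs are base 10, which is inferred from the plotted range of log(Total Assets) and is not stated in the text. The function name predicted_net_income is illustrative.

```python
import math

# Coefficients of the fitted line log(NI) = B0 + B1 * log(TA) from part c.
B0 = 0.0703
B1 = 0.9057

def predicted_net_income(total_assets):
    """Back-transform the log-log fit, ASSUMING base-10 logarithms."""
    log_ni = B0 + B1 * math.log10(total_assets)
    return 10 ** log_ni

# Equivalent power-law form: NI = 10**B0 * TA**B1
for ta in (10, 100, 500):
    print(ta, round(predicted_net_income(ta), 1))
```

Because the slope is below 1, doubling total assets raises the predicted net income by a little less than a factor of two under this model.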
14.
a.
The scatterplot appears as:
[Figure: Mass vs. Volume]
b.
After removing the outlier and the constant term from the model, the slope = 2.693.
c.
The 95% confidence interval is (2.629, 2.757), which includes the accepted value, 2.699.
15.
a.
The correlation values are:
                Correlation    p-value
Pearson's r     0.406          0.003
Spearman's s    0.295          0.037
b.
The scatterplot appears as:
[Figure: Pulmon vs. Cardio, with points labeled by state abbreviation]
Alaska, Hawaii, and Utah are isolated in the lower left.
c.
The correlations are:
                Correlation    p-value
Pearson's r     0.259          0.072
Spearman's s    0.251          0.082
d.
The Pearson correlation is dramatically reduced by the omission of these outliers, while the
Spearman correlation is moderately reduced. The original correlation with all 50 states exaggerates
the relationship between the two variables. The nonparametric correlation partly alleviates the
problem, but not entirely, when the outliers are included. An important lesson to learn from this is
that a correlation statistic by itself can be misleading, and a plot is useful to see if there are outliers.
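The effect described in part d can be reproduced on a toy data set. The sketch below is illustrative only: the five points are made up (they are not the state data), and pearson, average_ranks, and spearman are hypothetical helper names. Four unrelated points are combined with one isolated point in the lower left, mimicking the position of the outlying states.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up data: four unrelated points plus one isolated point in the lower left.
core_x, core_y = [1, 2, 3, 4], [1, -1, -1, 1]
all_x, all_y = core_x + [-20], core_y + [-20]

print(pearson(core_x, core_y), pearson(all_x, all_y), spearman(all_x, all_y))
```

In this toy case the single outlier pushes the Pearson correlation from zero to nearly 1, while the rank-based Spearman correlation rises much less, which is the same pattern as in the answer above.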
e.
Alaska and Hawaii are thousands of miles from the continental U.S., with different climates and
racial compositions. It is quite possible that they will not represent the U.S. population of the lower
48 states. There are many reasons why Alaska may be low: different eating and exercise habits, less
smoking, different physical characteristics of a large portion of the population, and more deaths
due to other circumstances.
16.
a.
r = 0.313. The p-value for r is 0.076.
b.
r = 0.515. The p-value for r is 0.002.
c.
The following scatterplots are generated:
[Figure: CAPRET91 vs. CAPRET90, with points labeled by market sector]
[Figure: INC91 vs. INC90, with points labeled by market sector]
d.
You should expect a strong correlation in income from one year to the next because income refers
to interest on bonds and preferred stocks, dividend payments, etc., and these are fairly stable from
year to year.
e.
r = 0.182. The p-value for r is 0.309. The scatterplot appears as:
[Figure: NAV91 vs. NAV90, with points labeled by market sector]
f.
Without the Biotech stock the value of r is –0.019 with a p-value of 0.920.
g.
With Biotech: s = –0.019 (p-value = 0.916)
Without Biotech: s = –0.118 (p-value = 0.521).
h.
Previously there was a slight positive (but not statistically significant) correlation between the gains
in net asset value in the two years; but without Biotech the correlation is slightly negative. The
lesson is that past performance is not necessarily a good guide to future performance when it comes
to picking market sectors. Some investment advisors feel that you should diversify your investments
across sectors because sector performance is so difficult to predict.
17.
a.
r = –0.226. p-value < 0.001. Since the correlation is negative, persons with later birth dates are
more likely to have lower draft numbers.
b.
The scatterplot appears as follows:
[Figure: Number vs. Day]
There is no apparent trend in the scatterplot.
c.
The trend line appears as:
[Figure: Number vs. Day with trend line y = -0.2261x + 225.01, R2 = 0.0511]
The regression equation is Draft Number = 225.01 – 0.2261(Birth Date Number).
The regression explains only 5.11% of the variation in draft numbers.
d.
r = –0.867. p-value < 0.001.
The correlation between the average monthly draft number and the birth month is much higher than
the correlation between individual draft numbers and birth date numbers.
e.
The scatterplot is:
[Figure: Average Number (Avg. Number vs. Month) with trend line y = -7.0644x + 229.47, R2 = 0.7523]
The regression equation is: Average Draft Number = 229.47 – 7.0644 (Birth Month Number)
This regression explains 75.23% of the variation in the average monthly draft numbers.
f.
There is too much variability in the individual draft numbers to present an effective display of the
problem with the draft lottery. While the p-value is highly significant, the problem with the lottery
is not apparent from the scatterplot. By averaging the draft numbers over each month, some of the
day-to-day variability in the draft numbers is taken out of the problem and a clearer picture of the
problem with the draft lottery emerges.
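The two "variation explained" figures in problem 17 tie back to the reported correlations, because in a simple linear regression R squared is the square of Pearson's r. A short check, using only the r values quoted in parts a and d (the variable names are illustrative):

```python
# Correlations reported in parts a and d of problem 17.
r_daily = -0.226    # individual draft numbers vs. birth date numbers
r_monthly = -0.867  # monthly average draft numbers vs. birth month

for label, r in (("daily", r_daily), ("monthly", r_monthly)):
    # In simple linear regression, R-squared is the square of Pearson's r.
    print(f"{label}: r = {r:.3f}, r squared = {r * r:.4f}")
```

The squares agree with the reported 5.11% and 75.23% up to the rounding of r to three decimals.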
18.
a.
Emerald: Price = –16,377 + 8.34 (Year)
Urban: Price = –10,727 + 5.46 (Year)
Medical: Price = –23,718 + 12.00 (Year)
Emerald's prices increase at the rate of 8.34 points per year. The Urban CPI increases at the rate of
5.46 points per year, and the medical CPI increases at the rate of 12 points per year.
b.
Emerald: (5.86 , 10.81)
Urban: (4.97 , 5.94)
Medical: (10.51 , 13.50)
Emerald's confidence interval overlaps each of the other two, so there do not appear to be
significant differences in the rate of increase for Emerald as compared to the other indexes.
c.
Emerald's rate of increase is less than the general medical CPI, but since the confidence intervals
overlap, the differences do not appear to be statistically significant.
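The overlap comparisons in problem 18 can be written out explicitly. The sketch below only checks whether the reported intervals share any values; it is the same informal screen used in the answers above, not a formal test of equal slopes. The helper name overlaps is illustrative.

```python
# 95% confidence intervals for the slopes, copied from part b.
intervals = {
    "Emerald": (5.86, 10.81),
    "Urban": (4.97, 5.94),
    "Medical": (10.51, 13.50),
}

def overlaps(a, b):
    """True when two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

for other in ("Urban", "Medical"):
    print("Emerald vs", other, overlaps(intervals["Emerald"], intervals[other]))
```

Both overlaps with Emerald are narrow (5.86 to 5.94 and 10.51 to 10.81), while the Urban and Medical intervals do not overlap each other at all.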
19.
a.
The scatterplot appears as:
[Figure: scatterplot with trend line y = 0.2107x - 1434.3, R2 = 0.6968]
b.
The regression statistics are:
Regression Statistics
Multiple R           0.835
R Square             0.697
Adjusted R Square    0.691
Standard Error       586.704
Observations         51
ANOVA
              df    SS              MS              F          Significance F
Regression    1     38759153.937    38759153.937    112.600    2.707E-14
Residual      49    16866844.220    344221.311
Total         50    55625998.157
                  Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept         -1434.312       490.464           -2.924    0.005      -2419.935    -448.690
Teacher Salary    0.211           0.020             10.611    0.000      0.171        0.251
The residual plots are:
[Figure: Teacher Salary Residual Plot (Residuals vs. Teacher Salary)]
[Figure: Normal P-plot of the residuals]
While there is no evidence of a failure in the model assumptions, there is one state (Alaska) that
has a markedly higher teacher salary than other states. However, its residual value is not much
higher than many other states in the model.
c.
The chart broken down by category appears as follows:
[Figure: Spending per Pupil by region, with linear trend lines. North: y = 0.1613x - 38.462, R2 = 0.5685. South: y = 0.1844x - 946.58, R2 = 0.7494. West: y = 0.2729x - 3218.5, R2 = 0.803]
The slopes are very similar for the North and South regions, while the slope for the West region is
higher. Much of this appears to be due to the influence of the value for Alaska.
d.
The regression statistics for the Northern region are:
Regression Statistics
Multiple R           0.754
R Square             0.568
Adjusted R Square    0.546
Standard Error       537.075
Observations         21
ANOVA
              df    SS               MS              F        Significance F
Regression    1     7,220,389.90     7,220,389.90    25.03    7.89392E-05
Residual      19    5,480,547.06     288,449.85
Total         20    12,700,936.95
                  Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept         -38.462         795.993           -0.048    0.962      -1704.494    1627.570
Teacher Salary    0.161           0.032             5.003     0.000      0.094        0.229
For the Southern region:
Regression Statistics
Multiple R           0.866
R Square             0.749
Adjusted R Square    0.733
Standard Error       391.358
Observations         17
ANOVA
              df    SS               MS               F         Significance F
Regression    1     6,869,183.524    6,869,183.524    44.849    7.14247E-06
Residual      15    2,297,418.594    153,161.240
Total         16    9,166,602.118
                Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept       -946.578        637.391           -1.485    0.158      -2305.146    411.989
X Variable 1    0.184           0.028             6.697     0.000      0.126        0.243
For the Western region:
Regression Statistics
Multiple R           0.896
R Square             0.803
Adjusted R Square    0.785
Standard Error       723.318
Observations         13
ANOVA
              df    SS                MS                F         Significance F
Regression    1     23,455,267.145    23,455,267.145    44.831    3.396E-05
Residual      11    5,755,070.547     523,188.232
Total         12    29,210,337.692
                Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept       -3218.537       1084.735          -2.967    0.013      -5606.024    -831.050
X Variable 1    0.273           0.041             6.696     0.000      0.183        0.363
e.
The three region equations are:
North: Spending = –38.46 + 0.161 (Salary)
95% CI for slope (0.094 , 0.229)
South: Spending = –946.58 + 0.184 (Salary)
95% CI for slope (0.126 , 0.243)
West: Spending = –3218.54 + 0.273 (Salary)
95% CI for slope (0.183 , 0.363)
It would appear that the rate at which spending per pupil increases relative to the average teacher's
salary is higher in the Western states than in the Northern and Southern states. This could be due, in
part, to the influence of the large value for the state of Alaska, and another analysis is probably
warranted without the inclusion of the Alaskan value.
20.
a.
The scatterplot is:
[Figure: Highway Fatalities (Fatality Rate vs. Year) with linear trend lines. New Mexico: y = -0.2362x + 13.154, R2 = 0.9066. United States: y = -0.1584x + 8.7369, R2 = 0.8839]
About 90.7% of the variation is explained by the regression on the New Mexico data and 88.4% of
the variation is explained by the regression on the United States data. The slopes of the two trend
lines appear to be different. One problem with these trend lines (if extended out into the future) is
that they will eventually cross the x-axis, indicating a negative fatality rate, which is an impossible result.
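The point at which each trend line reaches zero can be found by solving intercept + slope * x = 0. The sketch below uses the trend-line coefficients from the chart; x is in whatever Year units the chart uses, and the function name zero_crossing is illustrative.

```python
# Trend lines from the chart: fatality rate = intercept + slope * x.
trend_lines = {
    "New Mexico": (13.154, -0.2362),
    "United States": (8.7369, -0.1584),
}

def zero_crossing(intercept, slope):
    """x-value at which the fitted line reaches a rate of zero."""
    return -intercept / slope

for name, (intercept, slope) in trend_lines.items():
    print(name, round(zero_crossing(intercept, slope), 1))
```

Both lines reach zero a little past x = 55, beyond the plotted range that ends at 45, which is why extrapolating the linear trend cannot be trusted.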
b.
The regression statistics for the New Mexico data are:
Regression Statistics
Multiple R           0.952173153
R Square             0.906633714
Adjusted R Square    0.904176706
Standard Error       0.897559381
Observations         40
ANOVA
              df    SS             MS             F           Significance F
Regression    1     297.270462     297.270462     368.9992    3.65574E-21
Residual      38    30.61328799    0.805612842
Total         39    327.88375
             Coefficients    Standard Error    t Stat          P-value     Lower 95%       Upper 95%
Intercept    13.15384615     0.28924003        45.47726724     9.6E-35     12.5683103      13.73938
Year         -0.236163227    0.012294181       -19.20935085    3.66E-21    -0.261051495    -0.21127
The regression statistics for the United States data are:
Regression Statistics
Multiple R           0.940149084
R Square             0.883880299
Adjusted R Square    0.880824518
Standard Error       0.67990177
Observations         40
ANOVA
              df    SS             MS             F           Significance F
Regression    1     133.7098762    133.7098762    289.2485    2.33234E-19
Residual      38    17.56612383    0.462266417
Total         39    151.276
             Coefficients    Standard Error    t Stat         P-value     Lower 95%      Upper 95%
Intercept    8.736923077     0.219099497       39.87650913    1.28E-32    8.293379319    9.180466835
Year         -0.158386492    0.009312849       -17.0073078    2.33E-19    -0.17723937    -0.139533613
The residual plot for the New Mexico data is:
[Figure: Year Residual Plot (Residuals vs. Year), New Mexico]
The residual plot for the United States data is:
[Figure: Year Residual Plot (Residuals vs. Year), United States]
There is some indication in these plots that the regression assumption that residuals should be
independent has been violated. There appears to be a time-related pattern in the residual plots.
c.
New Mexico: Durbin-Watson = 0.719. Runs = 11. Runs p-value = 0.001
United States: Durbin-Watson = 0.270. Runs = 5. Runs p-value < 0.0001
The Durbin-Watson statistic is close to 0 for the United States data, and for both sets of data the
p-value of the runs test is significant, indicating fewer runs than would be expected. This would
cause us to doubt that the assumption of independence of residuals has been met.
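For readers who want to see how these two diagnostics behave, the following sketch computes the Durbin-Watson statistic and a run count on two made-up residual sequences. The data are illustrative only (they are not the fatality residuals), and the helper names durbin_watson and count_runs are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

def count_runs(residuals):
    """Number of runs of consecutive residuals with the same sign."""
    signs = [e > 0 for e in residuals if e != 0]
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Made-up residuals that drift slowly, mimicking a time trend.
drifting = [1.0, 0.8, 0.6, 0.2, -0.2, -0.6, -0.8, -1.0]
# Made-up residuals that flip sign at every step.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(durbin_watson(drifting), count_runs(drifting))
print(durbin_watson(alternating), count_runs(alternating))
```

Slowly drifting residuals give a statistic near 0 with very few runs, the pattern reported above; independent residuals would be expected to give a value near 2.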
21.
a.
The scatterplot is:
[Figure: Tax vs. Price with trend line y = 0.7028x + 36.344, R2 = 0.7668]
About 77% of the variation in tax is explained by the home price.
b.
The regression statistics are:
Regression Statistics
Multiple R           0.875664739
R Square             0.766788735
Adjusted R Square    0.764567675
Standard Error       149.5332887
Observations         107
ANOVA
              df     SS             MS             F           Significance F
Regression    1      7719537.265    7719537.265    345.2355    5.67E-35
Residual      105    2347821.464    22360.20442
Total         106    10067358.73
             Coefficients    Standard Error    t Stat         P-value     Lower 95%    Upper 95%
Intercept    36.34435129     43.23740728       0.840576565    0.402495    -49.3875     122.0762
Price        0.702784226     0.037823721       18.58051508    5.67E-35    0.627787     0.777782
c.
The residual plot is:
[Figure: Price Residual Plot (Residuals vs. Price)]
The Normal plot is:
[Figure: Normal P-plot of the residuals]
The residual plot seems to indicate a possible violation of the assumption of constant variance.
The variation of the residual values seems to increase as the home price increases. There is also
some indication in the Normal P-plot that the residuals do not follow the Normal distribution.
d.
The scatterplot appears as:
[Figure: Log(Tax) vs Log(Price) with trend line y = 1.047x - 0.2824, R2 = 0.741]
The regression statistics are:
Regression Statistics
Multiple R           0.860809
R Square             0.740992
Adjusted R Square    0.738526
Standard Error       0.085644
Observations         107
ANOVA
              df     SS          MS          F           Significance F
Regression    1      2.203361    2.203361    300.3933    1.42E-32
Residual      105    0.770167    0.007335
Total         106    2.973527
              Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept     -0.28245        0.181979          -1.55209    0.123649    -0.64328     0.078383
Log(Price)    1.047043        0.060411          17.33186    1.42E-32    0.927258     1.166828
The residual plots appear as:
[Figure: Log(Price) Residual Plot (Residuals vs. Log(Price))]
[Figure: Normal P-plot of the residuals]
Using the logarithmic transformation appears to have removed the problem with nonconstant
variance. However, there is still a problem with the apparent lack of Normality in the distribution
of the residuals.