Chapter 6
Exercise 1
X=c(5,8,9,7,14)
Y=c(3,1,6,7,19)
R function ols(X,Y) returns: (Intercept) -8.477876, x (slope) 1.823009
mean(X) = 8.6, mean(Y) = 7.2
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$

$$b_1 = \frac{(5-8.6)(3-7.2) + (8-8.6)(1-7.2) + (9-8.6)(6-7.2) + (7-8.6)(7-7.2) + (14-8.6)(19-7.2)}{(5-8.6)^2 + (8-8.6)^2 + (9-8.6)^2 + (7-8.6)^2 + (14-8.6)^2} = \frac{82.4}{45.2} = 1.82$$

$$b_0 = \bar{Y} - b_1\bar{X} = 7.2 - 1.82 \times 8.6 = -8.452$$
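As a sanity check, the same computation can be done in base R; lm() is used here in place of the book's ols(), which I assume is a wrapper around it.

X <- c(5, 8, 9, 7, 14)
Y <- c(3, 1, 6, 7, 19)
# Slope and intercept from the definitions above
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)
c(b0, b1)         # -8.477876  1.823009
coef(lm(Y ~ X))   # matches ols(X,Y)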
Exercise 2
X=c(5,8,9,7,14)
Y=c(3,1,6,7,19)
$$\hat{Y}_i = b_0 + b_1 X_i = -8.45 + 1.82 X_i, \qquad r_i = Y_i - \hat{Y}_i$$

$$\sum r_i^2 = (3 - (1.82 \times 5 - 8.45))^2 + (1 - (1.82 \times 8 - 8.45))^2 + (6 - (1.82 \times 9 - 8.45))^2 + (7 - (1.82 \times 7 - 8.45))^2 + (19 - (1.82 \times 14 - 8.45))^2 \approx 47$$
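A minimal sketch of the same computation in R, using the rounded coefficients from Exercise 1:

X <- c(5, 8, 9, 7, 14)
Y <- c(3, 1, 6, 7, 19)
r <- Y - (-8.45 + 1.82 * X)   # residuals from the rounded fit
sum(r^2)                      # about 46.6, i.e. roughly 47
sum(resid(lm(Y ~ X))^2)       # slightly smaller with the unrounded LSR fit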
Exercise 3
X=c(5,8,9,7,14)
Y=c(3,1,6,7,19)
The sum of squared residuals will be larger for this line than for the LSR line, because the LSR line is, by construction, the line that minimizes the sum of squared residuals.
$$\hat{Y}_i = b_0 + b_1 X_i = -9 + 2X_i, \qquad r_i = Y_i - \hat{Y}_i$$

$$\sum r_i^2 = (3 - (2 \times 5 - 9))^2 + (1 - (2 \times 8 - 9))^2 + (6 - (2 \times 9 - 9))^2 + (7 - (2 \times 7 - 9))^2 + (19 - (2 \times 14 - 9))^2 = 53$$
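The comparison is quick to verify in R; the LSR line's sum of squared residuals is necessarily the smaller of the two:

X <- c(5, 8, 9, 7, 14)
Y <- c(3, 1, 6, 7, 19)
sum((Y - (-9 + 2 * X))^2)   # 53 for the line Yhat = -9 + 2X
sum(resid(lm(Y ~ X))^2)     # about 46.6 for the least squares line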
Exercise 4

$$b_1 = r\frac{s_y}{s_x} = 0.6 \times \frac{\sqrt{25}}{\sqrt{12}} = 0.866$$
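A one-line check, assuming the exercise supplies r = 0.6 with variances s_y^2 = 25 and s_x^2 = 12 (the values consistent with the result above):

r <- 0.6
b1 <- r * sqrt(25) / sqrt(12)   # slope = r * sy / sx, with assumed variances
b1                              # 0.8660254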
Exercise 5
a=c(3,104,50,9,68,29,74,11,18,39,0,56,54,77,14,32,34,13,96,84,5,4,18,76,34,14,9,28,7,11,21,30,26,2,11,12,6,3,3,47,19,2,25,37,11,14,0)
b=c(0,5,0,0,0,6,0,1,1,2,17,0,3,6,4,2,4,2,0,0,13,9,1,4,2,0,4,0,4,6,4,4,1,6,6,13,3,1,0,3,1,6,1,0,2,11,3)
The R function ols(a,b) returns:
(Intercept) 4.58061839 x (slope) -0.04051423
Exercise 6
c=c(300,280,305,340,348,357,380,397,453,456,510,535,275,270,335,342,354,394,383,450,446,513,520,520)
d=c(32.75,28,30.75,29,27,31.20,27,27,23.50,21,21.5,22.8,30.75,27.25,31,26.50,23.50,22.70,25.80,27.80,21.50,22.50,20.60,21)
ols(c,d) yields a regression line with a negative slope:
Higher levels of solar radiation predict lower rates of cancer.
[Scatterplot of d versus c with the LSR line; x-axis from 300 to 500.]
Exercise 7 a=c(500,530,590,660,610,700,570,640) b=c(2.3,3.1,2.6,3.0,2.4,3.3,2.6,3.5)
R function ols(a,b) returns
(Intercept) 0.484615385
X (slope) 0.003942308
$$\hat{Y} = 0.0039X + 0.4846$$
Exercise 8
R function ols(a,b) returns
$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.484615385 1.289275061 0.3758821 0.7199360
x 0.003942308 0.002137246 1.8445735 0.1146492
$Ftest.p.value
value
0.1146492
$R.squared
[1] 0.3618685
This means that SAT accounts for about 36% of the variance in GPA, which gives an indication of the strength of the association.
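The same quantity is available from base R; for simple regression it equals the squared Pearson correlation (lm() assumed equivalent to the book's ols()):

a <- c(500, 530, 590, 660, 610, 700, 570, 640)   # SAT, from Exercise 7
b <- c(2.3, 3.1, 2.6, 3.0, 2.4, 3.3, 2.6, 3.5)   # GPA, from Exercise 7
summary(lm(b ~ a))$r.squared   # 0.3618685
cor(a, b)^2                    # identical for simple regression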
Exercise 9 x=c(40,41,42,43,44,45,46) y=c(1.62,1.63,1.90,2.64,2.05,2.13,1.94) ols(x,y)
$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.25321429 2.73157319 -0.4587885 0.6656396
x 0.07535714 0.06345636 1.1875429 0.2883482
$Ftest.p.value
value
0.2883482
$R.squared
[1] 0.2200002
$$\hat{Y} = 0.075X - 1.253$$
Exercise 10
c=c(300,280,305,340,348,357,380,397,453,456,510,535,275,270,335,342,354,394,383,450,446,513,520,520)
d=c(32.75,28,30.75,29,27,31.20,27,27,23.50,21,21.5,22.8,30.75,27.25,31,26.50,23.50,22.70,25.80,27.80,21.50,22.50,20.60,21)
ols(c,d) yields:
(Intercept) 39.99094634
x (slope) -0.03565283

$$\hat{Y} = -0.0357X + 39.99$$

$$\hat{Y} = -0.0357 \times 600 + 39.99 \approx 18.6$$
600 exceeds the range of observed X values, so the prediction is based on extrapolation. The relationship between the variables may change at extreme values.
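A sketch of the prediction in base R (lm() and predict() in place of ols(); the point estimate is close to the hand computation above since the coefficients are not rounded):

# uses c (solar radiation) and d (cancer rate) as defined above
fit <- lm(d ~ c)
coef(fit)                                     # 39.99 and -0.0357
predict(fit, newdata = data.frame(c = 600))   # about 18.6
range(c)                                      # 270 to 535: X = 600 extrapolates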
Exercise 11 mou=c(63.3,60.1,53.6,58.8,67.5,62.5) time=c(241.5,249.8,246.1,232.4,237.2,238.4)
R function cor.test(mou,time) returns:
Pearson's product-moment correlation
t = -0.7872, df = 4, p-value = 0.4752
sample estimates: cor -0.3662634
There is insufficient evidence to conclude that the correlation differs from 0.

$$T = r\sqrt{\frac{n-2}{1-r^2}} = -0.366\sqrt{\frac{6-2}{1-(-0.366)^2}} \approx -0.787$$
> qt(0.975,4)
[1] 2.776445
> pt(-0.7872,4)
[1] 0.2375939

For a two-tailed test, 0.2375939 × 2 ≈ 0.475.
p > 0.05, and T = -0.787 does not exceed the critical value of ±2.776, so we fail to reject.
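The hand computation can be reproduced in a few lines of base R:

mou  <- c(63.3, 60.1, 53.6, 58.8, 67.5, 62.5)
time <- c(241.5, 249.8, 246.1, 232.4, 237.2, 238.4)
r <- cor(mou, time)                      # -0.3662634
n <- length(mou)
tstat <- r * sqrt((n - 2) / (1 - r^2))   # -0.7872, matching cor.test
2 * pt(-abs(tstat), df = n - 2)          # two-sided p-value, 0.4752
qt(0.975, df = n - 2)                    # critical value, 2.776445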
Exercise 12
x=c(1,2,3,4,5,6)
y=c(1,4,7,7,4,1)
ols(x,y) returns: (Intercept) 4.000000e+00, x (slope) -5.838669e-16 (reasonably close to 0)
The data are consistent with an inverted-U shape rather than with the linear model, so there may be an association here that the LSR line does not detect.
[Scatterplot of y versus x showing the inverted-U pattern.]
Exercise 13
x=c(1,2,3,4,5,6)
y=c(4,5,6,7,8,2)

[Scatterplot of y versus x: a rising sequence of points plus one outlier.]

The LSR slope is still 0, even though there is a clear linear trend in the data, which is masked by a single outlier.
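Both of these pathologies are easy to confirm; in each case the fitted slope is (numerically) zero despite obvious structure in the data:

x <- c(1, 2, 3, 4, 5, 6)
y1 <- c(1, 4, 7, 7, 4, 1)   # Exercise 12: inverted U
y2 <- c(4, 5, 6, 7, 8, 2)   # Exercise 13: trend plus one outlier
coef(lm(y1 ~ x))            # slope about -5.8e-16, zero up to rounding
coef(lm(y2 ~ x))            # slope exactly 0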
Exercise 14
The nature of the relationship between two variables can vary with the predictor value. In other words, the association between Y and X can change as a function of X values.
Extrapolating beyond the data range, therefore, can be problematic, even when the association appears to be linear. In non-linear associations, the LSR line can be misleading.
Exercise 15
age=c(5.2,8.8,10.5,10.6,10.4,1.8,12.7,15.6,5.8,1.9,2.2,4.8,7.9,5.2,0.9,11.8,7.9,1.5,10.6,8.5,11.1,12.8,11.3,1,14.5,11.9,8.1,13.8,15.5,9.8,11.0,14.4,11.1,5.1,4.8,4.2,6.9,13.2,9.9,12.5,13.2,8.9,10.8)
cpep=c(4.8,4.1,5.2,5.5,5,3.4,3.4,4.9,5.6,3.7,3.9,4.5,4.8,4.9,3.0,4.6,4.8,5.5,4.5,5.3,4.7,6.6,5.1,3.9,5.7,5.1,5.2,3.7,4.9,4.8,4.4,5.2,5.1,4.6,3.9,5.1,5.1,6.0,4.9,4.1,4.6,4.9,5.1)
R function cor(age,cpep) returns:
[1] 0.3906776
R function hc4test(age,cpep) returns:
$test
[1] 4.705966
$p.value
[1] 0.03005811

Thus r = 0.39, and hc4test rejects at the 0.05 level.
Exercise 16
age=c(5.2,8.8,10.5,10.6,10.4,1.8,12.7,15.6,5.8,1.9,2.2,4.8,7.9,5.2,0.9,11.8,7.9,1.5,10.6,8.5,11.1,12.8,11.3,1,14.5,11.9,8.1,13.8,15.5,9.8,11.0,14.4,11.1,5.1,4.8,4.2,6.9,13.2,9.9,12.5,13.2,8.9,10.8)
cpep=c(4.8,4.1,5.2,5.5,5,3.4,3.4,4.9,5.6,3.7,3.9,4.5,4.8,4.9,3.0,4.6,4.8,5.5,4.5,5.3,4.7,6.6,5.1,3.9,5.7,5.1,5.2,3.7,4.9,4.8,4.4,5.2,5.1,4.6,3.9,5.1,5.1,6.0,4.9,4.1,4.6,4.9,5.1)
ols(age[age<7],cpep[age<7])
$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.5148814 0.37014633 9.495924 6.244186e-07
x           0.2474008 0.08924835 2.772049 1.689761e-02
C-peptide concentrations increase up to about age 7.

ols(age[age>7],cpep[age>7])
$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.7535568 0.64125948 7.4128445 5.654828e-08
x           0.0132083 0.05550626 0.2379606 8.137083e-01

The regression line plateaus beyond that age. Using a single line or a single correlation to describe the relationship is misleading.
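A sketch of the two fits in base R, using lm() with the subset argument in place of the book's ols():

# age and cpep as defined above
young <- lm(cpep ~ age, subset = age < 7)
old   <- lm(cpep ~ age, subset = age > 7)
coef(young)   # slope about 0.25: concentration rises with age
coef(old)     # slope about 0.01: essentially flat past age 7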
Exercise 17
size=c(2359,3397,1232,2608,4870,4225,1390,2028,3700,2949,688,3147,4000,4180,3883,1937,2565,2722,4231,1488,4261,1613,2746,1550,3000,1743,2388,4522)
price=c(510,690,365,592,1125,850,363,559,860,695,182,860,1050,675,859,435,555,525,805,369,930,375,670,290,715,365,610,1290)
R function ols(size,price) returns:
(Intercept) 38.1921217
X (Slope) 0.2153008
$$\hat{Y} = 0.215X + 38.192$$

$$\hat{Y} = 0.215 \times 0 + 38.192 = 38.192$$
The conclusion here is that a home of size 0 costs 38.192, which makes no sense. This illustrates how non-linear relationships can make the regression line misleading: extrapolation beyond the range of the data can be problematic.
Exercise 18
lot=c(18200,12900,10060,14500,76670,22800,10880,10880,23090,10875,3498,42689,17790,38330,18460,17000,15710,14180,19840,9150,40511,9060,15038,5807,16000,3173,24000,16600)
price=c(510,690,365,592,1125,850,363,559,860,695,182,860,1050,675,859,435,555,525,805,369,930,375,670,290,715,365,610,1290)
R function ols(lot,price) returns
Estimate Std. Error t value Pr(>|t|)
(Intercept) 436.83367567 66.609568133 6.558122 5.927679e-07
x (slope)     0.01104288  0.002754693 4.008752 4.569549e-04
Exercise 19
This would generally be the case when the relationship is linear and homoscedastic.
Exercise 20 x=c(18,20,35,16,12) y=c(36,29,48,64,18)
R function ols(x,y) returns:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.3283679 23.774217 1.0653713 0.3648449
x 0.6768135 1.096856 0.6170485 0.5808715
$Ftest.p.value: 0.5808715
R function cor.test(x,y) returns: t = 0.617, df = 3, p-value = 0.5809
sample estimates cor: 0.3355929
Both analyses agree: neither is significant. X and Y could still be dependent in some nonlinear way, and power is a concern with such a small sample.
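The agreement is no coincidence: in simple regression, the t test on the slope and the test of a zero Pearson correlation are the same test, as a quick check confirms:

x <- c(18, 20, 35, 16, 12)
y <- c(36, 29, 48, 64, 18)
summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]   # 0.5808715
cor.test(x, y)$p.value                             # identical p-value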
Exercise 21 x=c(12.2,41,5.4,13,22.6,35.9,7.2,5.2,55,2.4,6.8,29.6,58.7) y=c(1.8,7.8,0.9,2.6,4.1,6.4,1.3,0.9,9.1,0.7,1.5,4.7,8.2)
R function ols(x,y) returns
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3269323 0.248122843  1.317623 2.144131e-01
x           0.1550843 0.008413901 18.431919 1.280856e-09
The estimate of the slope is 0.155 with a SE of 0.0084. With n = 13, the 0.975 quantile of T with n - 2 = 11 df is:
> qt(0.975,11)
[1] 2.200985

$$CI = 0.155 \pm 2.20 \times 0.0084 = (0.137, 0.174)$$
The scatterplot suggests that X and Y increase together, but the same confidence interval can arise in situations where that is not the case.
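A sketch of the interval in base R; confint() applies exactly the estimate ± t × SE recipe used above:

# x and y as defined in Exercise 21
fit <- lm(y ~ x)
b1 <- coef(summary(fit))["x", "Estimate"]
se <- coef(summary(fit))["x", "Std. Error"]
b1 + c(-1, 1) * qt(0.975, df = 11) * se   # manual 0.95 CI
confint(fit, "x", level = 0.95)           # same interval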
Exercise 22
x=c(34,49,49,44,66,48,49,39,54,57,39,65,43,43,44,42,71,40,41,38,42,77,40,38,43,42,36,55,57,57,41,66,69,38,49,51,45,141,133,76,44,40,56,50,75,44,181,45,61,15,23,42,61,146,144,89,71,83,49,43,68,57,60,56,63,136,49,57,64,43,71,38,74,84,75,64,48)
y=c(129,107,91,110,104,101,105,125,82,92,104,134,105,95,101,104,105,122,98,104,95,93,105,132,98,112,95,102,72,103,102,102,80,125,93,105,79,125,102,91,58,104,58,129,58,90,108,95,85,84,77,85,82,82,111,58,99,77,102,82,95,95,82,72,93,114,108,95,72,95,68,119,84,75,75,122,127)
R function ols(x,y) returns
$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 97.95728197 4.73432147 20.6908809 9.985891e-33
x (slope)   -0.02136595 0.07096758 -0.3010664 7.641969e-01

> qt(0.975, df = 75)
[1] 1.99
$$CI = b_1 \pm t\sqrt{\frac{\hat{\sigma}^2}{\sum (X_i - \bar{X})^2}} = -0.02 \pm 1.99 \times 0.07 = (-0.16, 0.12)$$
Exercise 23 khomreg(size,price)
$test
[1,] 6.115014
$p.value
[1,] 0.01340384
khomreg(lot,price)
$test
[1,] 0.1683221
$p.value
[1,] 0.6816073
We actually do reject for house size but not for lot size. This test may not have sufficient power to detect heteroscedasticity, so when we fail to reject, it is difficult to draw firm conclusions.
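khomreg comes with the book's software. As a rough cross-check, a Breusch-Pagan test from the lmtest package (my substitution, not the book's method, and a different test statistic) addresses the same homoscedasticity question:

library(lmtest)            # assumed installed
# size, lot, price as defined in Exercises 17 and 18
bptest(lm(price ~ size))   # tests constant residual variance in the size model
bptest(lm(price ~ lot))    # and in the lot model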
Exercise 24 ols(x,y)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.46175413 18.4508380 3.5479014 0.000673844
x (slope) -0.05649584 0.1876524 -0.3010664 0.764196940
(The slope is close to 0 with p = 0.764, so do not reject with OLS.)
$Ftest.p.value
value
0.7641969 (the book has a typo here)

regplot(y,x,regfun=rqfit)
ols(y,x)
rqfit(x,y)
$coef
(Intercept)           x
 95.2000000  -0.4333333
$ci
            lower bd    upper bd
(Intercept) 64.4610733 105.972735
x           -0.5505706  -0.1450298

The CI for the slope does not contain 0, so reject with rqfit.
As is evident in the OLS scatterplot, there are several outliers between X values of 100 and 130. To minimize the squared distances, these outliers pull the regression line upward in a manner that makes it nearly horizontal. rqfit is based on the median of Y rather than the mean, so it is insensitive to outliers, making its regression line (in blue) pass through the middle (the 0.5 quantile of Y given X) of the bulk of the observations.
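rqfit appears to be the book's median (0.5-quantile) regression routine; a widely available analogue (my assumption, not the book's code) is rq() from the quantreg package:

library(quantreg)                    # assumed installed
# x and y as in Exercise 22
median.fit <- rq(y ~ x, tau = 0.5)   # median regression, resistant to outliers
coef(median.fit)                     # compare with coef(lm(y ~ x))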
The data can be accessed via library(MASS).
Exercise 25
X=c(2300,750,4300,2600,6000,10500,10000,17000,5400,7000,9400,32000,35000,100000,100000,52000,100000,4400,3000,4000,1500,9000,5300,10000,19000,27000,28000,31000,26000,21000,79000,100000,100000)
Y=c(65,156,100,134,16,108,121,4,39,143,56,26,22,1,1,5,65,56,65,17,7,16,22,3,4,2,3,8,4,3,30,4,43)
ols(X,Y)
$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.8899623928 1.027986e+01  5.242286 1.072131e-05
x           -0.0004461206 2.296306e-04 -1.942775 6.117379e-02
$Ftest.p.value
value
0.06117379
olshc4 rejects at the 0.05 level; it has a smaller standard error for the slope.

olshc4(X,Y)
$ci
            Coef.     Estimates      ci.lower      ci.upper      p-value    Std.Error
(Intercept)     0 53.8899623928 30.5619402421  7.721798e+01 4.902827e-05 1.143803e+01
Slope           1 -0.0004461206 -0.0008776261 -1.461508e-05 4.315956e-02 2.115728e-04
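olshc4 is from the book's software; the same HC4 correction can be obtained with the sandwich and lmtest packages (my substitution, assuming they are installed):

library(sandwich)   # vcovHC() supports type = "HC4"
library(lmtest)     # coeftest() runs Wald tests with a supplied covariance
fit <- lm(Y ~ X)    # X and Y as defined above
coeftest(fit, vcov = vcovHC(fit, type = "HC4"))   # slope test with HC4 SEs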