r
2
(a) r
2
: r
2 i n
1
( i n
1
(
ˆ i y i
y )
2
y )
2
SSR
SST , is the ratio of regression sum of square and total sum of square
(corrected). r
2 is also the ratio of (the distance between the straight line and the horizontal line) and (the distance between data and the horizontal line). Since the total sum of square i n
1
( y i
y )
2 is sum of the regression sum of square i n
1
( i
y )
2 and the residual sum of square i n
1
( y i
i
)
2 , large r
2 implies the proportion of the total sum of square contributed by the regression sum of square is large. For example, if r
2 =0.9, then 90% of total sum of square comes from the regression sum of square.
Heuristically, that indicates 90% of y i
.can be explained by ˆ i
. That is, the straight line can fit the data well. In addition, large r
2 also implies the regression sum of square is large relative to the residual sum of square.
In the above example, the regression sum of square is 9 times larger than the residual sum of square since the residual sum of square contributes
10% of total sum of square (corrected). That is, the distance between the straight line and the horizontal line is large relative to the variation of the data. As we explain in the previous section, this might imply the slope in the regression is significant. Thus, the straight line might be sensible. r
2 is usually recommended as a “useful first thing to look at” in a regression printout.
Motivating Example (continue):
We have the following:
1
b
1
5 , s
XY
2840 , SSR
b
1 s
XY
5
2840
14200 ,
SST
10 i
1
y i
y
2 i
10
1 y i
2
10 y 2
184730
10
130 2
15730
Thus, r 2
SSR
SST
14200
15730
0 .
9027
.
For this pizza example, we can conclude that 90.27% of the total sum of squares can be explained by using the estimated regression equation y
ˆ
60
5 x to predict quarterly sales. In other words, 90.27% of the variation in quarterly sales can be explained by the linear relationship between the size of the student population and quarterly sales.
(b) Correlation:
The correlation coefficient between the variables x and y is r
XY
i n
1
( x i
x )( y i
y ) i n
1
( x i
x )
2 i n
1
( y i
y )
2
s
XY s
1 / 2
XX s
1 /
YY
2
,
1
r
XY
1
As y i
ax i
b , then r
XY
1 or r
XY
1 . That is, r
XY
1 implies a significant linear relationship between x and y . The correlation coefficient is also associated with the regression coefficient b
1
. b
1
s
XY s
XX
1 / s
YY
2 s 1 /
XX
2
s
XY s 1 / 2
XX s 1 /
YY
2
s
YY s
XX
1 / 2 r
XY
.
As b
1
0
r
XY
0
a positively linear relation.
As b
1
0
r
XY
0
a negatively linear relation
As b
1
0
r
XY
0
there is no significantly linear relation between x and y.
2
Note: r
XY
measures linear association between x and y, while b
1
measures the size of the change in y due to a unit change in x. r
XY
is unit-free and scale-free. Scale change in the data will affect b
1
but not r
XY
Note: the value of a correlation r
XY
shows only the extent to which x and y are linearly associated. It does not by itself imply that any sort of casual relationship exists between X and Y. Such a false assumption has lead to erroneous conclusions on many occasions.
Note that r is also associated with
XY
R
2 since r 2 i n
1
ˆ i
y
2 i n
1
y i
y
2
b
1 s
XY s
YY
s
XY s
XX s
YY s
XY
s
2
XY s
XX s
YY and r
XY
s
XY s
1 / 2
XX s
1 /
YY
2
( sign .
of .
b
1
)
1 / 2
( sign .
of .
b
1
) r
2
, where r
XY
has the same sign as b
1
.
The above equation indicates that large between the variables x and y . r
2 implies strong correlation
Note: r
XY
( sign .
of .
b
1
) r
2
, only holds for the simple linear regression Y
0
1
X
.
3
The correlation between y and the fitted value y
ˆ is r
Y Y
ˆ
i n
1
( y i
y )( y
ˆ i
y
ˆ
) i n
1
( y i
y )
2 i n
1
( y
ˆ i
y
ˆ
)
2
r
2
, where y
ˆ n
i
1 n i
.
Motivating Example (continue):
Since b
1
5 , r 2
0 .
9027
r
XY
sign of b
1
r 2
0 .
9027
0 .
9501 indicates x and y are highly correlated and a strong positive linear association between x and y .
Example 2 (continue):
Suppose the model is y i
0
1 x i
i
, i
1 , , 20 ,
i
~ N
, and i
20
1 x i
1330 , i
20
1 y i
1862 .
8 , i
20
1 x i
2
90662 ,
20 i
1 y i
2
173554 .
26 , i
20
1 x i y i
124206 .
9
(d) Determine
2 r .
[solution:] r
2
SSR
SST
49 .
220
53 .
068
0 .
9275
Exercise 14.3.1
4