$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}, \qquad 0 \le R^2 \le 1,$$

is the ratio of the regression sum of squares to the total sum of squares (corrected).
$R^2$ is also the ratio of (the distance between model 2 and model 1) to (the distance between the data (model 0) and model 1). The total sum of squares $\sum_{i=1}^{n}(y_i-\bar{y})^2$ is the sum of the regression sum of squares $\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$ and the residual sum of squares $\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$, so a large $R^2$ implies that the proportion of the total sum of squares contributed by the regression sum of squares is large. For example, if $R^2 = 0.9$, then 90% of the total sum of squares comes from the regression sum of squares. Heuristically, that indicates 90% of the variation in $Y_i$ can be explained by $\hat{Y}_i$. That is, model 2 fits the data well. In addition, a large $R^2$ also implies that the regression sum of squares is large relative to the residual sum of squares. In the above example, the regression sum of squares is 9 times as large as the residual sum of squares, since the residual sum of squares contributes 10% of the total sum of squares (corrected). That is, the distance between model 2 and model 1 is large relative to the variation of the data. As explained in the previous section, this might imply that the slope in the regression is significant. Thus, model 2 might be sensible.
$R^2$ is usually recommended as a “useful first thing to look at” in a regression printout.
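The decomposition of the total sum of squares can be checked numerically. The following is a minimal sketch in plain Python, assuming a simple linear regression fitted by least squares; the data values and the helper names `fit_simple_ols` and `sums_of_squares` are made up for illustration and do not come from the notes. It also uses the identity that the ratio of the regression to the residual sum of squares equals $R^2/(1-R^2)$, which is 9 when $R^2 = 0.9$.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept b0 and slope b1 for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sums_of_squares(x, y):
    """Return (total, regression, residual) sums of squares for the fitted line."""
    b0, b1 = fit_simple_ols(x, y)
    y_bar = sum(y) / len(y)
    y_hat = [b0 + b1 * xi for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return ss_total, ss_regression, ss_residual


# Hypothetical data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ss_total, ss_regression, ss_residual = sums_of_squares(x, y)
r_squared = ss_regression / ss_total

print(f"total = {ss_total:.4f}")
print(f"regression + residual = {ss_regression + ss_residual:.4f}")
print(f"R^2 = {r_squared:.4f}")
print(f"regression / residual = {ss_regression / ss_residual:.2f}")
```

The first two printed values should agree up to rounding, and the last value should match $R^2/(1-R^2)$ for the same data.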
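The correlation coefficient between the covariate $X$ and the response $Y$ is

$$r_{XY} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\left[\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2\right]^{1/2}} = \frac{s_{XY}}{s_{XX}^{1/2}\, s_{YY}^{1/2}}, \qquad -1 \le r_{XY} \le 1.$$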
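If $Y_i = aX_i + b$ for every $i$ (with $a \ne 0$), then $r_{XY} = 1$ or $r_{XY} = -1$, according to the sign of $a$. That is, $|r_{XY}|$ equal or close to 1 indicates a strong linear relationship between $X$ and $Y$. The correlation coefficient is also associated with the regression coefficient $b_1$:

$$b_1 = \frac{s_{XY}}{s_{XX}} = \frac{s_{YY}^{1/2}}{s_{XX}^{1/2}} \cdot \frac{s_{XY}}{s_{XX}^{1/2}\, s_{YY}^{1/2}} = \left(\frac{s_{YY}}{s_{XX}}\right)^{1/2} r_{XY}.$$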
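Since $b_1$ and $r_{XY}$ differ only by the positive factor $(s_{YY}/s_{XX})^{1/2}$, they always have the same sign:

If $b_1 > 0$, equivalently $r_{XY} > 0$, there is a positive linear relation.
If $b_1 < 0$, equivalently $r_{XY} < 0$, there is a negative linear relation.
If $b_1 = 0$, equivalently $r_{XY} = 0$, there is no linear relation between $X$ and $Y$.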
Note: $r_{XY}$ measures the linear association between $X$ and $Y$, while $b_1$ measures the size of the change in $Y$ due to a unit change in $X$. $r_{XY}$ is unit-free and scale-free. A scale change in the data will affect $b_1$ but not $r_{XY}$.
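The effect of a scale change can be illustrated with a short sketch. It assumes hypothetical height and weight values (made up here, not taken from the notes) and re-expresses the covariate in centimetres instead of metres; the helper names `centered_sums` and `slope_and_correlation` are also illustrative.

```python
import math


def centered_sums(x, y):
    """Return (s_XX, s_YY, s_XY), the centered sums of squares and products."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xx, s_yy, s_xy


def slope_and_correlation(x, y):
    """Return the least-squares slope b1 and the correlation r_XY."""
    s_xx, s_yy, s_xy = centered_sums(x, y)
    b1 = s_xy / s_xx
    r_xy = s_xy / math.sqrt(s_xx * s_yy)
    return b1, r_xy


# Hypothetical data: heights in metres and weights in kilograms.
x_m = [1.5, 1.6, 1.7, 1.8, 1.9]
y_kg = [50.0, 58.0, 63.0, 72.0, 80.0]
x_cm = [100.0 * v for v in x_m]  # same heights, different unit

b1_m, r_m = slope_and_correlation(x_m, y_kg)
b1_cm, r_cm = slope_and_correlation(x_cm, y_kg)

print(f"metres:      b1 = {b1_m:.4f}, r = {r_m:.4f}")
print(f"centimetres: b1 = {b1_cm:.4f}, r = {r_cm:.4f}")
```

Under these assumptions the slope shrinks by the factor 100 when the unit changes, while the correlation coefficient stays the same.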
Note: the value of a correlation $r_{XY}$ shows only the extent to which $X$ and $Y$ are linearly associated. It does not by itself imply that any sort of causal relationship exists between $X$ and $Y$. Such a false assumption has led to erroneous conclusions on many occasions.
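Note: $r_{XY}$ is also associated with $R^2$, since

$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} = \frac{b_1 s_{XY}}{s_{YY}} = \frac{s_{XY}}{s_{XX}} \cdot \frac{s_{XY}}{s_{YY}} = \frac{s_{XY}^2}{s_{XX}\, s_{YY}}$$

and

$$r_{XY} = \frac{s_{XY}}{s_{XX}^{1/2}\, s_{YY}^{1/2}} = (\text{sign of } b_1)\,(R^2)^{1/2} = (\text{sign of } b_1)\, R,$$

where $R = (R^2)^{1/2}$ and $r_{XY}$ has the same sign as $b_1$.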
The above equation indicates that a large $R^2$ implies a strong correlation between the response and the covariate.
Note: $r_{XY} = (\text{sign of } b_1)\, R$ only holds for the simple linear regression $Y = \beta_0 + \beta_1 X + \varepsilon$.
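The relation between $r_{XY}$ and $R$ can be checked on a small data set with a negative slope. This is only a sketch: the data are hypothetical, and $R^2$ is computed from the fitted values rather than from $s_{XY}^2/(s_{XX}\, s_{YY})$, so that the comparison is not circular.

```python
import math

# Hypothetical data with a decreasing trend, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.8, 8.1, 6.2, 3.9, 2.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares fit of the simple linear regression.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# R^2 from its definition: regression sum of squares over total sum of squares.
r_squared = sum((yh - y_bar) ** 2 for yh in y_hat) / s_yy

# Correlation coefficient from its definition.
r_xy = s_xy / math.sqrt(s_xx * s_yy)

sign_b1 = math.copysign(1.0, b1)
print(f"b1 = {b1:.4f}")
print(f"r_XY = {r_xy:.4f}")
print(f"(sign of b1) * R = {sign_b1 * math.sqrt(r_squared):.4f}")
```

For these data the slope is negative, so $r_{XY}$ should equal $-R$ up to rounding.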
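The correlation between the response $Y$ and the fitted value $\hat{Y}$ is

$$r_{Y\hat{Y}} = \frac{\sum_{i=1}^{n}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\left[\sum_{i=1}^{n}(y_i-\bar{y})^2 \, \sum_{i=1}^{n}(\hat{y}_i-\bar{\hat{y}})^2\right]^{1/2}} = R, \qquad \text{where } \bar{\hat{y}} = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i.$$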
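The derivation of $r_{Y\hat{Y}}$:

Since

$$\bar{\hat{y}} = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i = \frac{1}{n}\sum_{i=1}^{n}(b_0 + b_1 x_i) = b_0 + b_1\bar{x},$$

we have

$$\sum_{i=1}^{n}(\hat{y}_i-\bar{\hat{y}})^2 = \sum_{i=1}^{n}\left[b_0 + b_1 x_i - (b_0 + b_1\bar{x})\right]^2 = \sum_{i=1}^{n} b_1^2 (x_i-\bar{x})^2 = b_1^2 s_{XX},$$

and

$$\sum_{i=1}^{n}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}}) = \sum_{i=1}^{n}(y_i-\bar{y})\left[b_0 + b_1 x_i - (b_0 + b_1\bar{x})\right] = \sum_{i=1}^{n}(y_i-\bar{y})\, b_1 (x_i-\bar{x}) = b_1 \sum_{i=1}^{n}(y_i-\bar{y})(x_i-\bar{x}) = b_1 s_{XY},$$

thus

$$r_{Y\hat{Y}} = \frac{b_1 s_{XY}}{s_{YY}^{1/2}\,(b_1^2 s_{XX})^{1/2}} = \frac{b_1}{(b_1^2)^{1/2}} \cdot \frac{s_{XY}}{s_{YY}^{1/2}\, s_{XX}^{1/2}} = (\text{sign of } b_1)\, r_{XY} = (\text{sign of } b_1)^2 R = R.$$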
The equation $r_{Y\hat{Y}} = R$ implies that a large value of $R^2$ also indicates a strong positive linear relation between the observations $y_i$ and the predicted values $\hat{y}_i$. In other words, the prediction $\hat{y}_i$ is not unrelated to $y_i$.
Note: $r_{Y\hat{Y}} = R$ holds not only for the simple linear regression $Y = \beta_0 + \beta_1 X + \varepsilon$, but also for the multiple linear regression!
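The multiple-regression case can be checked numerically as well. The sketch below assumes two covariates and an intercept, fits the model by solving the normal equations with a small hand-written Gaussian elimination, and compares $r_{Y\hat{Y}}$ with $(R^2)^{1/2}$. The data and the helper names (`solve_linear_system`, `fit_ols`, `correlation`) are made up for illustration; in practice a numerical library would normally be used for the fit.

```python
import math


def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta


def fit_ols(rows, y):
    """Least-squares coefficients and fitted values via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    p = len(design[0])
    xtx = [[sum(d[j] * d[k] for d in design) for k in range(p)] for j in range(p)]
    xty = [sum(d[j] * yi for d, yi in zip(design, y)) for j in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(bj * dj for bj, dj in zip(beta, d)) for d in design]
    return beta, fitted


def correlation(u, v):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Hypothetical data with two covariates, made up for illustration only.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
y = [3.1, 2.9, 7.2, 6.8, 11.1, 10.7]

beta, fitted = fit_ols(rows, y)
y_bar = sum(y) / len(y)
r_squared = sum((f - y_bar) ** 2 for f in fitted) / sum((yi - y_bar) ** 2 for yi in y)
r_y_fitted = correlation(y, fitted)

print(f"R = {math.sqrt(r_squared):.6f}")
print(f"r between y and fitted values = {r_y_fitted:.6f}")
```

The two printed values should agree up to rounding, and the correlation between the observations and the fitted values should be non-negative.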
Note: Common misunderstandings of $R^2$
Misunderstanding 1: A high coefficient of determination indicates that useful predictions can be made.
Misunderstanding 2: A high coefficient of determination indicates that the estimated regression line is a good fit.
Misunderstanding 3: A coefficient of determination near 0 indicates that X and Y are not related.
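Why the third statement is a misunderstanding can be seen from a small constructed example. The sketch below uses made-up data in which $Y$ is an exact function of $X$, namely $Y = X^2$ on covariate values symmetric about zero, so the relation is perfect but not linear. The helper name `r_squared_simple` is illustrative.

```python
def r_squared_simple(x, y):
    """R^2 of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ss_regression = sum((f - y_bar) ** 2 for f in fitted)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return ss_regression / ss_total


# Constructed data: y is completely determined by x, but not linearly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]

r_squared = r_squared_simple(x, y)
print(f"R^2 = {r_squared:.4f}")  # prints "R^2 = 0.0000"
```

Here the fitted line is horizontal, so $R^2$ is 0 even though $X$ and $Y$ are strongly related. A value of $R^2$ near 0 only rules out a linear relation.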