2.9 Descriptive measures of linear association between X and Y


(a) $R^2$: Coefficient of Determination

$$0 \le R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \le 1$$

is the ratio of the regression sum of squares to the total sum of squares (corrected).

$R^2$ is also the ratio of (the distance between model 2 and model 1) to (the distance between the data (model 0) and model 1). The total sum of squares $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the sum of the regression sum of squares $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ and the residual sum of squares $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, so a large $R^2$ means that the regression sum of squares contributes a large proportion of the total sum of squares. For example, if $R^2 = 0.9$, then 90% of the total sum of squares comes from the regression sum of squares. Heuristically, that indicates 90% of the variation in $Y$ can be explained by $\hat{Y}_i$. That is, model 2 fits the data well. In addition, a large $R^2$ also implies that the regression sum of squares is large relative to the residual sum of squares. In the above example, the regression sum of squares is 9 times as large as the residual sum of squares, since the residual sum of squares contributes 10% of the total sum of squares (corrected). That is, the distance between model 2 and model 1 is large relative to the variation of the data. As explained in the previous section, this might imply that the slope in the regression is significant. Thus, model 2 might be sensible.
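The decomposition of the total sum of squares can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the data values and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of y = b0 + b1 * x.
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = s_xy / s_xx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares (corrected)
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # residual sum of squares
r_squared = ssr / sst

print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")
print(f"R^2 = {r_squared:.4f}, SSR / SSE = {ssr / sse:.2f}")
```

The ratio SSR / SSE equals $R^2 / (1 - R^2)$, which is where the factor of 9 for $R^2 = 0.9$ comes from.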

$R^2$ is usually recommended as a “useful first thing to look at” in a regression printout.

(b) Correlation:

The correlation coefficient between the covariate $X$ and the response $Y$ is

$$r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{s_{XY}}{s_{XX}^{1/2} s_{YY}^{1/2}}, \qquad -1 \le r_{XY} \le 1.$$


If $Y_i = aX_i + b$ with $a \ne 0$, then $r_{XY} = 1$ (when $a > 0$) or $r_{XY} = -1$ (when $a < 0$). That is, $|r_{XY}| = 1$ corresponds to an exact linear relationship between $X$ and $Y$. The correlation coefficient is also associated with the regression coefficient $b_1$:

$$b_1 = \frac{s_{XY}}{s_{XX}} = \frac{s_{YY}^{1/2}}{s_{XX}^{1/2}} \cdot \frac{s_{XY}}{s_{XX}^{1/2} s_{YY}^{1/2}} = \left( \frac{s_{YY}}{s_{XX}} \right)^{1/2} r_{XY}.$$

As $b_1 > 0 \Leftrightarrow r_{XY} > 0$: a positive linear relation.

As $b_1 < 0 \Leftrightarrow r_{XY} < 0$: a negative linear relation.

As $b_1 = 0 \Leftrightarrow r_{XY} = 0$: there is no significant linear relation between $X$ and $Y$.

Note: $r_{XY}$ measures linear association between $X$ and $Y$, while $b_1$ measures the size of the change in $Y$ due to a unit change in $X$. $r_{XY}$ is unit-free and scale-free. A scale change in the data will affect $b_1$ but not $r_{XY}$.
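To illustrate the effect of a scale change, here is a minimal sketch, assuming NumPy; the helper `slope_and_correlation` and the height and weight values are hypothetical. The covariate is expressed first in metres and then in centimetres.

```python
import numpy as np

def slope_and_correlation(x, y):
    """Return (b1, r_xy) computed from s_XY, s_XX and s_YY."""
    dx, dy = x - x.mean(), y - y.mean()
    s_xx, s_yy, s_xy = np.sum(dx ** 2), np.sum(dy ** 2), np.sum(dx * dy)
    return s_xy / s_xx, s_xy / np.sqrt(s_xx * s_yy)

# Hypothetical heights (metres) and weights (kg), for illustration only.
x_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
y = np.array([52.0, 58.0, 69.0, 74.0, 85.0])
x_cm = 100.0 * x_m  # same heights, rescaled to centimetres

b1_m, r_m = slope_and_correlation(x_m, y)
b1_cm, r_cm = slope_and_correlation(x_cm, y)

print(f"slope per metre      = {b1_m:.4f}")
print(f"slope per centimetre = {b1_cm:.4f}")
print(f"correlation (m, cm)  = {r_m:.6f}, {r_cm:.6f}")
```

The slope shrinks by the scale factor of 100, while the correlation is the same in both units.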

Note: the value of a correlation $r_{XY}$ shows only the extent to which $X$ and $Y$ are linearly associated. It does not by itself imply that any sort of causal relationship exists between $X$ and $Y$. Such a false assumption has led to erroneous conclusions on many occasions.

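Note: $r_{XY}$ is also associated with $R^2$, since

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{b_1 s_{XY}}{s_{YY}} = \frac{s_{XY}}{s_{XX}} \cdot \frac{s_{XY}}{s_{YY}} = \frac{s_{XY}^2}{s_{XX} s_{YY}}$$

and

$$r_{XY} = \frac{s_{XY}}{s_{XX}^{1/2} s_{YY}^{1/2}} = (\text{sign of } b_1) \, (R^2)^{1/2} = (\text{sign of } b_1) \, R,$$

where $R = (R^2)^{1/2}$ and $r_{XY}$ has the same sign as $b_1$.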


The above equation indicates that a large $R^2$ implies strong correlation between the response and the covariate.
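The relation between $r_{XY}$, $b_1$ and $R$ can be checked on data with a negative slope, where the sign matters. This is a minimal sketch assuming NumPy; the data values are hypothetical.

```python
import numpy as np

# Hypothetical decreasing data, so that b1 < 0.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.8, 8.1, 6.9, 4.2, 3.1, 0.9])

dx, dy = x - x.mean(), y - y.mean()
s_xx, s_yy, s_xy = np.sum(dx ** 2), np.sum(dy ** 2), np.sum(dx * dy)

b1 = s_xy / s_xx
y_hat = y.mean() + b1 * dx  # fitted values, using b0 = y_bar - b1 * x_bar

r_squared = np.sum((y_hat - y.mean()) ** 2) / s_yy
r_xy = s_xy / np.sqrt(s_xx * s_yy)
R = np.sqrt(r_squared)

print(f"b1 = {b1:.4f}, r_XY = {r_xy:.4f}, sign(b1) * R = {np.sign(b1) * R:.4f}")
```

Here $R$ itself is non-negative, so without the sign of $b_1$ it would not recover the negative correlation.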

Note: $r_{XY} = (\text{sign of } b_1) \, R$ only holds for the simple linear regression $Y = \beta_0 + \beta_1 X + \varepsilon$.

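The correlation between the response $Y$ and the fitted value $\hat{Y}$ is

$$r_{Y\hat{Y}} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} = R, \qquad \text{where } \bar{\hat{y}} = \frac{\sum_{i=1}^{n} \hat{y}_i}{n}.$$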

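The derivation of $r_{Y\hat{Y}}$:

Since

$$\bar{\hat{y}} = \frac{\sum_{i=1}^{n} \hat{y}_i}{n} = \frac{\sum_{i=1}^{n} (b_0 + b_1 x_i)}{n} = b_0 + b_1 \bar{x},$$

we have

$$\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 = \sum_{i=1}^{n} \left[ b_0 + b_1 x_i - (b_0 + b_1 \bar{x}) \right]^2 = \sum_{i=1}^{n} b_1^2 (x_i - \bar{x})^2 = b_1^2 s_{XX},$$

and

$$\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}}) = \sum_{i=1}^{n} (y_i - \bar{y}) \left[ b_0 + b_1 x_i - (b_0 + b_1 \bar{x}) \right] = \sum_{i=1}^{n} (y_i - \bar{y}) \, b_1 (x_i - \bar{x}) = b_1 \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = b_1 s_{XY},$$

thus

$$r_{Y\hat{Y}} = \frac{b_1 s_{XY}}{s_{YY}^{1/2} (b_1^2 s_{XX})^{1/2}} = \frac{b_1}{(b_1^2)^{1/2}} \cdot \frac{s_{XY}}{s_{YY}^{1/2} s_{XX}^{1/2}} = (\text{sign of } b_1) \, r_{XY} = (\text{sign of } b_1)^2 R = R.$$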

The equation $r_{Y\hat{Y}} = R$ implies that a large value of $R^2$ also indicates a strong positive linear relation between the observations $y_i$ and the predicted values $\hat{y}_i$. In other words, the prediction of $y_i$ is not irrelevant to $y_i$.


Note: $r_{Y\hat{Y}} = R$ holds not only for the simple linear regression $Y = \beta_0 + \beta_1 X + \varepsilon$, but also for the multiple linear regression!
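The multiple regression case can be checked numerically. The sketch below assumes NumPy and uses simulated data with two covariates; the coefficients, sample size and random seed are arbitrary choices for illustration, and the model includes an intercept.

```python
import numpy as np

# Simulated data with two covariates (all values are illustrative).
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Least-squares fit of y on an intercept, x1 and x2.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# R^2 as regression sum of squares over total sum of squares.
r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# Sample correlation between the response and the fitted values.
r_y_yhat = np.corrcoef(y, y_hat)[0, 1]

print(f"R = {np.sqrt(r_squared):.6f}, r(Y, Y_hat) = {r_y_yhat:.6f}")
```

The two printed values agree up to floating-point error, even though no single $r_{XY}$ describes this model.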

Note: Common misunderstandings of $R^2$:

Misunderstanding 1: A high coefficient of determination indicates that useful predictions can be made.

Misunderstanding 2: A high coefficient of determination indicates that the estimated regression line is a good fit.

Misunderstanding 3: A coefficient of determination near 0 indicates that X and Y are not related.
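As one illustration of Misunderstanding 3, the sketch below (assuming NumPy, with hypothetical data) fits a straight line to points that satisfy $Y = X^2$ exactly on a symmetric range of $X$. The relation is exact but not linear, and the fitted slope and $R^2$ are both zero.

```python
import numpy as np

# Y is an exact function of X, but the relation is quadratic, not linear.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

dx, dy = x - x.mean(), y - y.mean()
s_xx, s_yy, s_xy = np.sum(dx ** 2), np.sum(dy ** 2), np.sum(dx * dy)

b1 = s_xy / s_xx
y_hat = y.mean() + b1 * dx  # fitted straight line
r_squared = np.sum((y_hat - y.mean()) ** 2) / s_yy

print(f"b1 = {b1:.4f}, R^2 = {r_squared:.4f}")
```

A value of $R^2$ near 0 therefore rules out a linear association only, not every kind of relation between $X$ and $Y$.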
