
Chapter 3 Regression and Correlation

Simple linear regression (SLR): a method to find the "best fitting line" to a set of n points (x, y). SLR minimizes the sum of squared vertical distances from the points to the fitted line; that is why the procedure is also called least squares regression.

Correlation coefficient (r):

A number that measures the strength and direction of the linear association between X and Y (both quantitative variables).

 – 1 

r

+ 1 always .

Correlation (r) = + 1 when there is a perfectly linear increasing relationship between X and Y.

Correlation (r) = – 1 when there is a perfectly linear decreasing relationship between X and Y.

No units: correlation is a unit-less quantity.

R² = (r)² is called the coefficient of determination.

R² measures the percent of variability in the response (Y) explained by the changes in X [or by the regression on X]. What does R² = 0.81 (= 81%) mean?

How do you find r when you are given R²? For example, what is r if R² = 0.81 = 81%?
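For instance, a one-line check in Python (r is ±√R²; the sign must come from the slope or the scatterplot, since R² alone cannot tell us the direction):

```python
import math

r_squared = 0.81
r = math.sqrt(r_squared)   # magnitude of r: 0.9

# R-squared alone does not give the sign; take it from the slope of
# the fitted line (or the direction of the trend in the scatterplot).
print(f"r = +{r:.1f} if the line slopes upward, -{r:.1f} if downward")
```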


Example: Suppose your friend claims that she can guess a person's age correctly (well, almost). To see if this claim is justifiable, you select a random sample of 10 people, ask your friend to guess their ages, and then ask each person his/her true age. The following are observed:

ID            1   2   3   4   5   6   7   8   9  10
Guessed age  18  52  65  90  28  58  13  66  44  35
True age     20  45  70  85  25  50  15  60  40  35

The very first step in regression analysis is to identify the independent (explanatory) and the dependent

(response) variables. Since the true age determines your friend's guesses (and your friend's guess has no effect on a person's true age), we have

X = Independent Variable = True age

Y = Response = Dependent Variable = Guessed Age.

The next step is to draw a scatter diagram of the data and interpret what you see (to get some ideas about the relation between two variables).


[Figure: Scatterplot of Guessed Age vs. True Age; True Age on the horizontal axis (10 to 90), Guessed Age on the vertical axis (10 to 100).]

1. What do you see?

2. Verify the following summary statistics using your calculator:

x̄ = 44.5, s_X = 22.42, ȳ = 46.9, s_Y = 24.02

3. Compute the slope and intercept of the least squares regression line, given that r = 0.9844.

Slope = b = r·(s_Y / s_X) = 0.9844 × 24.02 / 22.42 = 1.054651561

Intercept = a = ȳ − b·x̄ = 46.9 − 1.054651561 × 44.5 = −0.0319944692
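These hand calculations can be verified with a short Python sketch using the data above (plain Python, no external libraries):

```python
import math

true_age    = [20, 45, 70, 85, 25, 50, 15, 60, 40, 35]   # X
guessed_age = [18, 52, 65, 90, 28, 58, 13, 66, 44, 35]   # Y
n = len(true_age)

x_bar = sum(true_age) / n       # 44.5
y_bar = sum(guessed_age) / n    # 46.9
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in true_age) / (n - 1))     # 22.42
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in guessed_age) / (n - 1))  # 24.02

# Correlation: sum of cross-products of deviations over (n - 1) * s_x * s_y
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(true_age, guessed_age)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x        # slope, about 1.0547
a = y_bar - b * x_bar    # intercept, about -0.0320
print(f"r = {r:.4f}, slope b = {b:.4f}, intercept a = {a:.4f}")
```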


Hence the prediction equation is ŷ = −0.03 + 1.05x.

Are these results consistent with what you have observed in the scatterplot?

4. Interpret the numerical results:

Correlation = r = 0.9844: there is a strong, increasing, linear relationship between the true and guessed ages.

Slope = 1.05: for every one-year increase in the true age, the guessed age increases by about 1.05 years, on average.

Intercept = −0.03: DO NOT INTERPRET the intercept in this case, because a) zero is not within the range of observed values of the independent variable (X), and b) zero and −0.03 are not meaningful in this context.

5. Compute R² (coefficient of determination) and interpret it.

R² = (r)² = (0.9844)² = 0.969 = 0.969 × 100% = 96.9%

Interpretation:

96.9% of the variation in guessed ages (Y) is explained by the true age (X).

96.9% of variability in guessed ages is explained by linear regression on true ages.

6. Plot the estimated regression line on the scatter diagram.


For this we choose two values of X (as far apart as meaningful) and predict the value of Y at each of them, using the prediction equation ŷ = −0.03 + 1.05x.

For x = 15: ŷ = −0.03 + 1.05(15) = 15.72.

For x = 90: ŷ = −0.03 + 1.05(90) = 94.47.

This gives us two points, (15, 15.72) and (90, 94.47). Mark these points on the scatter diagram and join them with a ruler.

[Figure: Scatterplot of Guessed Age vs. True Age with the fitted line through (15, 15.72) and (90, 94.47); True Age on the horizontal axis (10 to 90), Guessed Age on the vertical axis (10 to 100).]


Chapter 11 Inferences for SLR

In Chapter 3, a linear relation between two quantitative variables (denoted by X and Y) was shown by ŷ = a + bX.

In this equation (called the prediction equation):

Y is called the response (dependent) variable,

X is called the explanatory variable,

ŷ is called the predicted value of Y,

a = α̂ = estimate of the intercept (α), and

b = β̂ = estimate of the slope (β).

Hence, ŷ = a + bX is an estimate for a simple linear regression model.

Also, ŷ is called the estimate of the true (but unknown) regression line µ_Y = α + βX, so we also write ŷ = µ̂_Y.

Regression Model: A mathematical (or theoretical) equation that shows the linear relation between the explanatory variable and the response. The simple linear regression model we will use is

Y = α + βX + ε,

where ε is called the error term.
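To see the role of the error term concretely, here is a minimal simulation sketch; the parameter values are illustrative, chosen near the fitted values from the age example:

```python
import random

alpha, beta, sigma = -0.03, 1.05, 4.5   # illustrative parameter values
random.seed(1)                          # reproducible errors

# Each response is its mean alpha + beta*x (a point on the true line)
# plus a normal error term eps ~ N(0, sigma).
for x in [20, 45, 70, 85, 25, 50, 15, 60, 40, 35]:
    eps = random.gauss(0, sigma)
    y = alpha + beta * x + eps
    print(f"X = {x:2d}  mean of Y = {alpha + beta * x:6.2f}  Y = {y:6.2f}")
```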


Let's see the error terms graphically:

Total error = y − ȳ = (y − ŷ) + (ŷ − ȳ) = Random error + Regression error

In this relation the total error is divided into two parts: the random error, y − ŷ, and the regression error, ŷ − ȳ, which is the error due to using the regression model instead of the sample mean, ȳ = 46.9.

[Figure: Scatterplot of Guessed Age vs. True Age illustrating the error decomposition; True Age on the horizontal axis (10 to 100), Guessed Age on the vertical axis (10 to 90).]


The unbiased estimators of the parameters (α and β) of the regression line (those given in Chapter 3) are found by the method of least squares estimation (LSE) as

β̂ = b = r·(S_Y / S_X) and α̂ = a = Ȳ − bX̄.

Hence:

β is the (true and unknown) slope of the regression line; it is estimated by β̂ = b = r·(S_Y / S_X).

α is the (true and unknown) y-intercept (or simply intercept) of the regression line; it is estimated by α̂ = a = Ȳ − bX̄.

What do the slope and intercept of the regression line tell us?

Slope is the average amount of change in Y for one unit of increase in X.

Note: slope ≠ rise / run. Why?

Intercept is the value of Y when X = 0.

Important Note: We DO NOT use the above interpretation when a) X = 0 is not meaningful, or b) zero is not within the range of, or near, the observed values of X.


Assumptions of Simple Linear Regression:

1. A random sample of n pairs of observations, (X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ).

2. The population of Y's has a normal distribution with mean µ_Y = α + βX, which changes with the value of X, and standard deviation σ, which is the same at every value of the independent variable X. The relation between X and Y may also be formulated as Y = α + βX + ε.

3. As a result of the above assumptions, the error terms ε are iid (independently and identically distributed) random variables that have a normal distribution with mean zero and standard deviation σ, i.e., ε ~ N(0, σ).

4. These mean that both Y and ε are random variables [we may choose any value for X, hence X is assumed to be a non-random variable (even when it is random)].

Are these assumptions satisfied in Example 1?


Assumption 1 (random sample) is satisfied.

To check assumptions 2 and 3 we look at the residuals, where

Residual = Observed value of Y − Predicted value of Y = y − ŷ.

If these residuals do not have any extreme value, we say assumptions 2 and 3 are justifiable, since we do not have any reason to suspect otherwise (more later).

So, let's calculate the residuals using the prediction equation ŷ = −0.03 + 1.05x found in Chapter 3. Then we will plot the residuals.

True age (x)   Observed y   Predicted (ŷ)   Residual (y − ŷ)
     20            18           21.06            −3.06
     45            52           47.43             4.57
     70            65           73.79            −8.79
     85            90           89.61             0.39
     25            28           26.33             1.67
     50            58           52.70             5.30
     15            13           15.79            −2.79
     60            66           63.25             2.75
     40            44           42.15             1.85
     35            35           36.88            −1.88
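The whole residual column can be reproduced with a few lines of Python (a sketch using the rounded intercept and the full-precision slope, so the numbers match the table up to rounding):

```python
a, b = -0.03, 1.05462   # fitted intercept and slope

true_age    = [20, 45, 70, 85, 25, 50, 15, 60, 40, 35]
guessed_age = [18, 52, 65, 90, 28, 58, 13, 66, 44, 35]

for x, y in zip(true_age, guessed_age):
    y_hat = a + b * x      # predicted guessed age at this true age
    resid = y - y_hat      # residual = observed - predicted
    print(f"x = {x:2d}  y = {y:2d}  y_hat = {y_hat:6.2f}  residual = {resid:6.2f}")
```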


[Figure: Histogram of the residuals (response is Guessed Age); Residual on the horizontal axis (−5.0 to 10.0), frequency on the vertical axis (0 to 3).]

Do you think the assumption of normality is satisfied? Why or why not?


Inferences about parameters of SLR

The parameters of the regression model are α, β and σ. These parameters are estimated by a, b and S, respectively.

Chapter 11 deals with inferences about the true Simple Linear Regression (SLR) model¹, i.e., a regression model with one explanatory variable (X).

When making inferences about the parameters of the regression model, we will determine

if X is a "good predictor" of Y,

if the regression line is useful for making predictions about Y,

if the slope is different from zero.

In this chapter we will also see how to find

a prediction interval for an individual response Y at X = x, and

confidence intervals for the mean of Y, that is, µ_Y = the mean response, at X = x.

We carry out these using ANOVA.

¹ In Chapter 12 we will see how to make inferences about the parameters of a multiple regression model, i.e., a regression model with several (k ≥ 2) explanatory variables, X₁, X₂, …, X_k.


ANOVA FOR SLR

Is X a good predictor of Y? This is equivalent to asking: is the slope of the line significantly different from zero? [If not, we might as well use ȳ as the predictor.] We can answer these questions using an ANOVA table:

ANOVA for SLR

Source               df      SS      MS                  F
Regression (Model)   1       SSReg   MSReg = SSReg / 1   F = MSReg / MSE
Residuals (Error)    n − 2   SSE     MSE = SSE / (n − 2)
Total                n − 1   SST

Total SS = Model SS + Error SS, i.e., SST = SSReg + SSE:

$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

df: (n − 1) = 1 + (n − 2)

The df for regression = 1 because there is only 1 independent variable.

The df for residuals = n − 2 because we estimate 2 parameters (α and β).


Assumptions for ANOVA:

Random sample

Normal distribution (of ε and hence Y)

Constant variance (of ε and Y)

The hypothesis of interest is Ho: β = 0 vs. Ha: β ≠ 0.

Test statistic = F = MSReg / MSE

To find the p-value, first find the tabulated F-value from the F-tables with df₁ = 1 and df₂ = n − 2; then compare that value with the F in the ANOVA table. The following is the output obtained from Minitab:

Regression Analysis: Guessed Age vs. True Age

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  5030.0  5030.0  250.08  0.000
Residual Error   8   160.9    20.1
Total            9  5190.9

To test Ho: β = 0 vs. Ha: β ≠ 0, the test statistic is F = 250.08 from the ANOVA table.

The F-value is extremely large!!! What does it mean?

The p-value = 0.000. What does it mean?

Decision?

Conclusion?


Decision: Reject Ho, since the p-value (< 0.0005) is less than any reasonable level of significance.

Conclusion: The observed data indicate that the slope is significantly different from zero.

Using the t-test

We may also use the t-test for testing the above hypotheses, as explained in Chapter 8. For this we use the first block of the Minitab output:

The regression equation is

Guessed Age = – 0.03 + 1.05 True Age

Predictor Coef SE Coef T P

Constant – 0.030 3.289 – 0.01 0.993

True Age 1.05462 0.06669 15.81 0.000

S = 4.48483 R-Sq = 96.9% R-Sq(adj) = 96.5%

In this case the parameter is β.

Estimate = b = 1.05462, SE(Estimate) = 0.06669

Significance test: Ho: β = 0 vs. Ha: β ≠ 0

Test statistic: T = (b − 0) / SE(b)

Calculated value of the test statistic:

T_cal = (1.05462 − 0) / 0.06669 = 15.81


To find the p-value, go back and look at Ha. We have a two-sided alternative and hence

p-value = 2 · P(T ≥ |T_cal|) = 2 · P(T ≥ 15.81) ≈ 0.

This p-value gives us the same decision and conclusion as the one we got from the ANOVA table.

In general, to find the p-value we would look up T_cal in the t-table on the line with df = n − 2. [This is the df for error in the ANOVA table.]

Compare the T_cal above with the F_cal in the ANOVA table. We have the following general relation between T_cal and F_cal in SLR:

(T_cal)² = F_cal, and equivalently |T_cal| = √F_cal.

Here, (15.81)² ≈ 250.08, up to rounding.

So the p-value for the t-test is the same as the p-value for the F-test. Hence in SLR, the two significance tests for the slope give the same results, whether we use the F-test (from the ANOVA table) or the t-test.

Observe that the above conclusion does not tell us in what way β differs from zero. We could use the t-test for testing one-sided alternatives about β; however, these should be decided before looking at the data.
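The t-test can be reproduced from the two numbers in the output (a sketch; scipy is assumed to be available for the t-distribution):

```python
from scipy import stats

b, se_b, n = 1.05462, 0.06669, 10   # slope, SE(slope), sample size
df = n - 2                          # df for error in the ANOVA table

t_cal = (b - 0) / se_b                     # test statistic for Ho: beta = 0
p_value = 2 * stats.t.sf(abs(t_cal), df)   # two-sided p-value

print(f"T = {t_cal:.2f}, T^2 = {t_cal ** 2:.2f} (compare with F = 250.08)")
print(f"p-value = {p_value:.6f}")
```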


Confidence Interval for the Slope:

Remember the general formula for the confidence interval:

CI = Estimate ± ME = Estimate ± t* × SE(Estimate)

This is used in finding a CI for β, where the estimate is b and SE(estimate) is given in the Minitab output. All we need to do is find t* from the t-tables with df = (n − 2) = df_error in the ANOVA table.

For the above example we had the following results from Minitab:

Predictor   Coef      SE Coef   T       P
Constant    -0.030    3.289     -0.01   0.993
True Age    1.05462   0.06669   15.81   0.000

That is, b = slope = 1.05462 and SE(slope) = 0.06669. Also, since df_error = 8 in the ANOVA table, we read the t-value for a 95% CI from the t-table on the row with df = 8 as t* = 2.306, which gives

ME = t* × SE(Estimate) = 2.306 × 0.06669 = 0.153787.

Hence a 95% CI for β is

CI = 1.05462 ± 0.15379 = (0.90083, 1.20841) ≈ (0.9, 1.2).
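A sketch of the same computation in Python (scipy assumed available for the t quantile):

```python
from scipy import stats

b, se_b, n = 1.05462, 0.06669, 10
t_star = stats.t.ppf(0.975, n - 2)   # 2.306 for a 95% CI with df = 8

me = t_star * se_b                   # margin of error
print(f"t* = {t_star:.3f}")
print(f"95% CI for the slope: ({b - me:.5f}, {b + me:.5f})")
```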


As in Chapters 7 through 9, we can use the CI to make a decision in a significance test: when zero is not in the CI, we reject Ho and conclude that the observed data give strong evidence that the slope of the regression line is different from zero.

Actually, we can say more: since the CI for β in this example is (0.9, 1.2), both ends of the CI are positive, so we can conclude with 95% confidence that the slope of the true regression line is some number between 0.9 and 1.2.

Alternatively we interpret the CI as follows:

We are 95% confident that on the average, as the true age increases by one year, the guessed age increases by somewhere between 0.9 and 1.2 years.


Confidence Interval for Mean Response and Prediction Interval

General formula for CIs:

CI = Estimator ± t* × SE(Estimator)

Additional symbols:

µ_{Y|x*} = α + βx* = the mean response for the population of ALL Y's that have X = x*; this is the point on the true regression line that corresponds to X = x*.

ŷ|x* = µ̂_{Y|x*} = a + bx* = the estimator of the mean response at X = x*.

SE(ŷ|x*) = SE(estimator of the mean response) = $S\sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

Hence the CI for the mean response is

CI(Mean Response) = $\hat{y} \pm t^* S\sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$


ŷ|x* = a + bx* is also the predicted value for one new response at X = x*.

SE(one new response) = $S\sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

Hence the prediction interval (PI) for one new response is

PI(One New Response) = $\hat{y} \pm t^* S\sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$


Compare the formulas for the CI and the PI to see the difference between them:

CI(Mean Response) = $\hat{y} \pm t^* S\sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

PI(One New Response) = $\hat{y} \pm t^* S\sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$

In both of the above formulas:

S = the standard deviation of the points around the regression line = √MSE, with df = df_error;

x* = a particular value of X for which we are making a prediction.

Both the CI and the PI are centered around ŷ|x* = a + bx* = the prediction at X = x*.

The PI for a new response is always wider than the CI for the mean response at the same value X = x*. (Why?)

The SE's, and hence the intervals, are narrower when x* is close to X̄ (the mean of the sample of X's) and wider when x* is far from X̄. (Why?)


CI and PI for Age Prediction problem

[Figure: Fitted line plot of Guessed Age = −0.030 + 1.055 True Age, with 95% CI and 95% PI bands around the regression line; True Age on the horizontal axis (0 to 90), Guessed Age on the vertical axis (20 to 100); S = 4.48483, R-Sq = 96.9%, R-Sq(adj) = 96.5%.]

Age prediction example (continued): a) Suppose you want to know, with 95% confidence, the range of your friend's guesses for a 65-year-old person.

Here we have one value of X, x* = 65, hence you want a 95% prediction interval at this value. Using the prediction equation we have found, we get the predicted value of Y at X = 65 as

ŷ = −0.03 + 1.05 × 65 = 68.22.


Calculations for the SE's are long and tedious. However, we can use statistical software to get what we want easily. For example, using Minitab we got the prediction interval at X = 65 as PI = (57.22, 79.82).

Observe that the center of this interval is 68.52. This is the predicted value of Y (ŷ) that Minitab calculated using X = 65. [It is slightly different from what we found because Minitab carries more digits after the decimal point in its calculations.]

b) Suppose you want to know, with 95% confidence, the average of your friend's guesses for all people aged 65.

Since we are now looking for the mean of all guessed ages at X = 65, this is a problem of a CI for the mean response. Minitab gives this as CI = (63.98, 73.06).

Observe that both the CI and the PI are centered on the same point, i.e., around ŷ = 68.52.
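The "long and tedious" SE calculations are only a few lines in Python; here is a sketch that reproduces both Minitab intervals at x* = 65, using S from the output and the true-age data:

```python
import math
from scipy import stats

x = [20, 45, 70, 85, 25, 50, 15, 60, 40, 35]   # true ages
n = len(x)
a, b = -0.030, 1.05462                         # fitted coefficients
s = 4.48483                                    # S = sqrt(MSE) from Minitab
x_star = 65

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)       # sum of (X_i - X-bar)^2
y_hat = a + b * x_star                         # 68.52, center of both intervals
t_star = stats.t.ppf(0.975, n - 2)             # 2.306

se_mean = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)
se_pred = s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)

print(f"95% CI for mean response: ({y_hat - t_star * se_mean:.2f}, "
      f"{y_hat + t_star * se_mean:.2f})")   # (63.98, 73.06)
print(f"95% PI for one new guess: ({y_hat - t_star * se_pred:.2f}, "
      f"{y_hat + t_star * se_pred:.2f})")   # (57.22, 79.82)
```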


Finally, observe the difference in the lengths of the intervals we got from Minitab:

95% CI at X = 65 is (63.98, 73.06).

Length of CI = 73.06 – 63.98 = 9.08

95% PI at X = 65 is (57.22, 79.82).

Length of PI = 79.82 – 57.22 = 22.6

As mentioned before, the PI is ALWAYS wider than the CI at the same level of confidence and the same value of X.


More on R²:

We have seen that R² = (r)². It can also be defined and calculated from the following relation:

R² = SSReg / SST = (variation in Y explained by the regression) / (total variation in Y)

This leads to alternative interpretations of R²:

R² is the proportion of variability in Y that is explained by the regression on X; or, equivalently,

R² is the proportional reduction in prediction error, that is, the percentage reduction in prediction error we will see when the prediction equation is used, instead of ȳ (the sample mean of Y), as the predicted value of Y.


Example: In the ANOVA table for the analysis of guessed ages we had the following output:

S = 4.48483   R-Sq = 96.9%   R-Sq(adj) = 96.5%

Source          DF      SS      MS       F      P
Regression       1  5030.0  5030.0  250.08  0.000
Residual Error   8   160.9    20.1
Total            9  5190.9

Then,

R² = SSReg / SST = 5030.0 / 5190.9 = 0.969 = 96.9%.

This is the same result we had from Minitab, as it should be. We may now interpret this as follows:

The regression model yields a predicted value for Y that has 96.9% less error than we would have if we used the sample mean of Y’s as a predicted value.
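A quick check of both forms of the formula, using the SS values from the ANOVA table:

```python
ss_reg, sse, sst = 5030.0, 160.9, 5190.9   # from the ANOVA table

print(f"R^2 = SSReg/SST = {ss_reg / sst:.3f}")    # 0.969
# Equivalent view: proportional reduction in prediction error
# relative to predicting every y by the sample mean y-bar.
print(f"1 - SSE/SST   = {1 - sse / sst:.3f}")     # 0.969
```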


More on Residuals:

Residual = the vertical distance from an observed point to the predicted value at the same X
         = Observed y − Predicted y
         = y − ŷ, where ŷ = −0.03 + 1.05x.

True age (x)   Observed y   Predicted (ŷ)   Residual (y − ŷ)
     20            18           21.06            −3.06
     45            52           47.43             4.57
     70            65           73.79            −8.79
     85            90           89.61             0.39
     25            28           26.33             1.67
     50            58           52.70             5.30
     15            13           15.79            −2.79
     60            66           63.25             2.75
     40            44           42.15             1.85
     35            35           36.88            −1.88

Hence, for someone whose actual age is 35, the predicted value of his/her age is 36.88. This means the prediction was 1.88 years higher than the true age.


Positive residuals: observations above the regression line.

Negative residuals: observations below the regression line.

Sum of residuals = 0, ALWAYS.

We (or computers) can make residual plots to see if there are any problems with the assumptions.

Computers also report "standardized residuals", a z-score for each observation. Any point whose z-score is bigger than 3 in absolute value, i.e., |z| > 3, is called an outlier.

More on Correlation:

If the distance (in absolute value) between a given value of X, say x*, and X̄ is k standard deviations, i.e., |x* − X̄| = k·S_X, then the distance (in absolute value) between the predicted value ŷ at x* and Ȳ is r·k standard deviations of Y, i.e., |ŷ − Ȳ| = r·k·S_Y.


Example: Suppose Y = heights of children and X = heights of their fathers, and the correlation between the two variables is r = 0.5. Then:

If a father's height is 2 standard deviations above the mean height of all fathers, then the predicted height of his child will be 0.5 × 2 = 1 standard deviation above the mean height of all children.

If a father's height is 1.5 standard deviations below the mean height of all fathers, then his child's predicted height will be 0.5 × 1.5 = 0.75 standard deviations below the mean height of all children.
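The same relation can be checked numerically with the age data, using the summary statistics computed earlier (a small sketch):

```python
x_bar, s_x = 44.5, 22.42     # true ages: mean and SD
y_bar, s_y = 46.9, 24.02     # guessed ages: mean and SD
r, b = 0.9844, 1.05462

k = 2                        # a true age 2 SDs above its mean
x_star = x_bar + k * s_x
y_hat = (y_bar - b * x_bar) + b * x_star    # prediction at x*

# The prediction expressed in SD units of Y should equal r * k.
print(f"(y_hat - y_bar)/s_y = {(y_hat - y_bar) / s_y:.3f}")  # about 1.969
print(f"r * k               = {r * k:.3f}")                  # about 1.969
```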


Some more on correlation

Correlation is very much affected by outliers and influential points.

Outliers weaken the correlation.

Influential points (points far from the rest of the observations in the x-direction that do not follow the trend) may change the sign and the value of the slope.


Residual Plots

Residuals are the estimators of the error term (ε) in the regression model. Thus, the assumption of normality of ε can be checked by looking at a histogram of the residuals.

A histogram of residuals that is (almost) bell-shaped (symmetric) supports the assumption of normality of the residuals.

A histogram or a dot plot that shows outliers is indicative of a violation of the assumption of normality.

A normal probability plot or normal quantile plot can also be used to check the normality assumption: points falling close to a straight line support the assumption of normality.


Plots of residuals against the explanatory variable (X) magnify any problems with the assumptions (a plotting sketch follows this list):

If the residuals are randomly scattered around the line residuals = 0, this is good: it means nothing else is left over after using X to predict Y.

If the residual plot shows a curved pattern, this indicates that a curvilinear fit (quadratic?) will give better results.

If the residual plot is funnel-shaped, the assumption of constant variance is violated.

If the residual plot shows an outlier, this may indicate a violation of normality and/or constant variance, or an influential point.
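A plotting sketch (matplotlib assumed available); with the age data it should show residuals scattered randomly around zero:

```python
import matplotlib.pyplot as plt

a, b = -0.03, 1.05462
x = [20, 45, 70, 85, 25, 50, 15, 60, 40, 35]
y = [18, 52, 65, 90, 28, 58, 13, 66, 44, 35]
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")   # reference line: residuals = 0
plt.xlabel("True Age (X)")
plt.ylabel("Residual")
plt.title("Residuals vs. the explanatory variable")
plt.show()   # look for curvature, funnel shapes, or outliers
```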


11.5 Exponential regression

This is one of the nonlinear regression models, of the form

Y = α·β^X·ε, or equivalently, µ_Y = α·β^X.

The model is called "exponential" because the independent variable X appears as the exponent of the coefficient β.

Observe that when we take the logarithm of the model we obtain

log(µ_Y) = log α + (log β)·X,

hence the logarithm of the mean of Y is a linear function of X with coefficients log(α) and log(β).

Note that when X = 0, β^X = β⁰ = 1. Thus α gives us the mean of Y at X = 0, since µ_Y = α·β⁰ = α(1) = α.

The parameter β represents the multiplicative effect of X on Y (as opposed to the additive effect in the simple linear regression we have seen so far). So if, for example, β = 1.5, increasing X by one unit will increase Y by 50% of its previous value, i.e., we multiply the value of Y at the previous X by 1.5 to obtain the current value.
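A sketch of fitting this model by the log transformation; the (x, y) values here are hypothetical, made up only to illustrate the mechanics:

```python
import math

x = [0, 1, 2, 3, 4, 5]                 # hypothetical data
y = [2.1, 3.0, 4.6, 6.5, 10.2, 14.9]   # grows roughly exponentially

# Regress log(y) on x: log(mu_Y) = log(alpha) + log(beta) * x
log_y = [math.log(v) for v in y]
n = len(x)
x_bar = sum(x) / n
ly_bar = sum(log_y) / n

slope = (sum((xi - x_bar) * (ly - ly_bar) for xi, ly in zip(x, log_y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = ly_bar - slope * x_bar

alpha = math.exp(intercept)   # estimated mean of Y at X = 0
beta = math.exp(slope)        # estimated multiplicative effect per unit of X
print(f"fitted model: mu_Y = {alpha:.3f} * {beta:.3f}^X")
```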


Summary of SLR

Model: y = α + βx + ε

Assumptions: a) random sample, b) normal distribution, c) constant variance, d) ε ~ N(0, σ).

Parameters and Estimators:

Intercept = α, estimated by a = Ȳ − bX̄

Slope = β, estimated by b = r·(S_Y / S_X)

Standard deviation = σ, estimated by S = √MSE

Interpretation of: slope, intercept, R², r

Testing if the model is good: ANOVA, the t-test for the slope, CI for the slope, PI and CI for the response, residual plots and their interpretations.

