Correlation and Regression

Correlation

A quantitative relationship between two interval or ratio level variables

Explanatory (independent) variable x     Response (dependent) variable y
Hours of training                        Number of accidents
Shoe size                                Height
Cigarettes smoked per day                Lung capacity
Score on SAT                             Grade point average
Height                                   IQ

What type of relationship exists between the two variables and is the correlation significant?

Correlation

• measures and describes the strength and direction of the relationship

• requires two scores from each individual (one for the independent variable and one for the dependent variable)

• is denoted by the correlation coefficient r

Scatter Plots and Types of Correlation

x = hours of training, y = number of accidents

[Scatter plot: number of accidents (0 to 60) against hours of training (0 to 20)]

Negative correlation: as x increases, y decreases.

Scatter Plots and Types of Correlation

x = SAT score, y = GPA

[Scatter plot: GPA (1.50 to 4.00) against Math SAT score (300 to 800)]

Positive correlation: as x increases, y increases.

Scatter Plots and Types of Correlation

x = height, y = IQ

[Scatter plot: IQ (80 to 160) against height (60 to 80)]

No linear correlation.

Scatter Plots and Types of Correlation

A strong, negative relationship, but non-linear!

Pearson's correlation coefficient is not appropriate here, because it measures only linear relationships.

Correlation Coefficient

A measure of the strength and direction of a linear relationship between two variables

The range of r is from –1 to 1.

If r is close to –1, there is a strong negative correlation.

If r is close to 0, there is no linear correlation.

If r is close to 1, there is a strong positive correlation.

Outliers are dangerous

Here we have a spurious correlation of r = 0.68.

Without IBM, r = 0.48.

Without IBM and GE, r = 0.21.
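The effect of a single outlier on r can be reproduced with a small sketch. The data below are hypothetical, chosen only for illustration (they are not the IBM and GE figures from the slide), and `pearson_r` is a helper name introduced here, using the standard computational formula for Pearson's r.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired observations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical cluster with no linear trend at all.
cluster_x = [1, 2, 3, 4, 5]
cluster_y = [2, 1, 3, 1, 2]

# The same cluster plus one extreme point far from the rest.
outlier_x = cluster_x + [20]
outlier_y = cluster_y + [20]

r_without = pearson_r(cluster_x, cluster_y)
r_with = pearson_r(outlier_x, outlier_y)

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

In this made-up example one extreme point is enough to turn an uncorrelated cluster into an apparently strong positive correlation, which is why a scatter plot should be inspected before r is trusted.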

Application

[Scatter plot: final grade (40 to 95) against absences (0 to 16)]

Absences (x)    Final grade (y)
8               78
2               92
5               90
12              58
15              43
9               74
6               81

Computation of r

         x      y      xy     x²     y²
1        8     78     624     64   6084
2        2     92     184      4   8464
3        5     90     450     25   8100
4       12     58     696    144   3364
5       15     43     645    225   1849
6        9     74     666     81   5476
7        6     81     486     36   6561
Sum     57    516    3751    579  39898
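The column sums in the table can be turned into r with a few lines of Python. This is a minimal sketch that assumes the usual computational formula, r = (nΣxy − ΣxΣy) / (√(nΣx² − (Σx)²) · √(nΣy² − (Σy)²)); the variable names are illustrative.

```python
from math import sqrt

# Absences (x) and final grades (y) from the table.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

n = len(absences)
sum_x = sum(absences)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(absences, grades))
sum_x2 = sum(x * x for x in absences)
sum_y2 = sum(y * y for y in grades)

# Computational formula for the sample correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 3))
```

The sums reproduce the bottom row of the table, and r rounds to –0.975, the value used in the significance test.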

Hypothesis Test for Significance

r is the correlation coefficient for the sample. The correlation coefficient for the population is $\rho$ (rho).

For a two-tailed test for significance:

$H_0: \rho = 0$ (The correlation is not significant)

$H_a: \rho \neq 0$ (The correlation is significant)

The sampling distribution for r is a t-distribution with n – 2 degrees of freedom.

Standardized test statistic:

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

Test of Significance

You found that the correlation between the number of times absent and the final grade is r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use $\alpha = 0.01$.

1. Write the null and alternative hypotheses.

$H_0: \rho = 0$ (The correlation is not significant)

$H_a: \rho \neq 0$ (The correlation is significant)

2. State the level of significance.

$\alpha = 0.01$

3. Identify the sampling distribution.

A t-distribution with 5 degrees of freedom

4. Find the critical value.

For a two-tailed test with $\alpha = 0.01$, each tail has area 0.005. Reading the t-table at 5 degrees of freedom gives critical values of ±4.032.

df\p   0.40      0.25      0.10      0.05      0.025     0.01      0.005     0.0005
1      0.324920  1.000000  3.077684  6.313752  12.70620  31.82052  63.65674  636.6192
2      0.288675  0.816497  1.885618  2.919986  4.30265   6.96456   9.92484   31.5991
3      0.276671  0.764892  1.637744  2.353363  3.18245   4.54070   5.84091   12.9240
4      0.270722  0.740697  1.533206  2.131847  2.77645   3.74695   4.60409   8.6103
5      0.267181  0.726687  1.475884  2.015048  2.57058   3.36493   4.03214   6.8688

5. Find the rejection region.

Rejection regions: t < –4.032 or t > 4.032.

6. Find the test statistic.

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.975}{\sqrt{\dfrac{1 - (-0.975)^2}{7 - 2}}} \approx -9.811$$

7. Make your decision.

t = –9.811 falls in the rejection region. Reject the null hypothesis.

8. Interpret your decision.

There is a significant negative correlation between the number of times absent and final grades.
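Steps 4 through 7 can be checked numerically. The sketch below assumes the standardized test statistic t = r / √((1 − r²)/(n − 2)) and takes the critical value 4.032 from the t-table rather than computing it; the variable names are illustrative.

```python
from math import sqrt

r = -0.975              # sample correlation coefficient
n = 7                   # number of data pairs
critical_value = 4.032  # t-table, 5 d.f., two-tailed alpha = 0.01

degrees_of_freedom = n - 2

# Standardized test statistic for H0: rho = 0.
t_statistic = r / sqrt((1 - r ** 2) / degrees_of_freedom)

# Two-tailed test: reject H0 when t falls in either rejection region.
reject_null = abs(t_statistic) > critical_value

print(f"t = {t_statistic:.2f} with {degrees_of_freedom} d.f.")
print("Reject H0" if reject_null else "Fail to reject H0")
```

The statistic comes out near –9.81, well beyond –4.032, so the sketch reaches the same decision as the worked example.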

The Line of Regression

Regression indicates the degree to which the variation in the response variable y is related to, or can be explained by, the variation in the explanatory variable x.

Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.

The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.

The line of regression is:

$$\hat{y} = mx + b$$

The slope m is:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

The y-intercept is:

$$b = \bar{y} - m\bar{x}$$

[Figure: scatter plot with the best fitting straight line; x-axis: Ad $ (1.5 to 3.0), y-axis ticks from 180 to 260]

$(x_i, y_i)$ = a data point

$(x_i, \hat{y}_i)$ = a point on the line with the same x-value

$y_i - \hat{y}_i$ = a residual

For the absences (x) and final grade (y) data, n = 7 and the column sums are Σx = 57, Σy = 516, Σxy = 3751, Σx² = 579, Σy² = 39898.

Write the equation of the line of regression with x = number of absences and y = final grade.

Calculate m and b:

$$m = \frac{7(3751) - (57)(516)}{7(579) - 57^2} = \frac{-3155}{804} \approx -3.924$$

$$b = \bar{y} - m\bar{x} \approx 73.714 - (-3.924)(8.143) \approx 105.667$$

The line of regression is:

$$\hat{y} = -3.924x + 105.667$$

[Scatter plot: final grade (40 to 95) against absences (0 to 16), with the line of regression drawn through the data]

Note that the point $(\bar{x}, \bar{y})$ = (8.143, 73.714) is on the line.
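The slope and intercept can be recomputed directly from the data. This sketch assumes the least squares formulas m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b = ȳ − m·x̄; the variable names are illustrative.

```python
# Absences (x) and final grades (y) from the worked example.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

n = len(absences)
sum_x = sum(absences)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(absences, grades))
sum_x2 = sum(x * x for x in absences)

x_bar = sum_x / n
y_bar = sum_y / n

# Least squares slope and intercept.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = y_bar - slope * x_bar

print(f"slope m     = {slope:.3f}")
print(f"intercept b = {intercept:.2f}")
print(f"mean point  = ({x_bar:.3f}, {y_bar:.3f})")
```

Because b is defined as ȳ − m·x̄, the mean point always lies on the fitted line. Working from unrounded values gives an intercept that can differ from 105.667 in the last decimal place, since the slide rounds m and the means before computing b.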

Predicting y Values

The regression line can be used to predict values of y for values of x falling within the range of the data.

The regression equation for number of times absent and final grade is:

$$\hat{y} = -3.924x + 105.667$$

Use this equation to predict the expected grade for a student with

(a) 3 absences (b) 12 absences

(a) $\hat{y} = -3.924(3) + 105.667 = 93.895$

(b) $\hat{y} = -3.924(12) + 105.667 = 58.579$
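The two predictions can be wrapped in a small function that also enforces the "within the range of the data" condition. This is a sketch only: `predict_grade` is a hypothetical helper name, and the bounds 2 and 15 are taken from the smallest and largest absence counts in the sample.

```python
SLOPE = -3.924
INTERCEPT = 105.667

# Smallest and largest x-values observed in the sample.
MIN_ABSENCES = 2
MAX_ABSENCES = 15


def predict_grade(absences):
    """Predict a final grade from the regression line, refusing to extrapolate."""
    if not MIN_ABSENCES <= absences <= MAX_ABSENCES:
        raise ValueError(
            f"{absences} absences is outside the observed range "
            f"{MIN_ABSENCES} to {MAX_ABSENCES}"
        )
    return SLOPE * absences + INTERCEPT


print(f"3 absences:  {predict_grade(3):.3f}")
print(f"12 absences: {predict_grade(12):.3f}")
```

Raising an error outside the observed range is one possible design choice; it makes the restriction on extrapolation explicit rather than silently returning a value the data cannot support.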

Strength of the Association

The coefficient of determination, r², measures the strength of the association and is the ratio of explained variation in y to the total variation in y.

The correlation coefficient of number of times absent and final grade is r = –0.975. The coefficient of determination is r² = (–0.975)² = 0.9506.

Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
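The "ratio of explained variation to total variation" can be computed directly from the data instead of squaring r. The sketch below fits the least squares line and then splits the variation in y into explained and unexplained parts; the variable names are illustrative.

```python
# Absences (x) and final grades (y) from the worked example.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n

# Least squares line, written in terms of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))
s_xx = sum((x - x_bar) ** 2 for x in absences)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
predicted = [slope * x + intercept for x in absences]

# Decompose the variation in y.
explained = sum((p - y_bar) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(grades, predicted))
total = sum((y - y_bar) ** 2 for y in grades)

r_squared = explained / total

print(f"explained / total = {r_squared:.3f}")
print(f"unexplained share = {unexplained / total:.3f}")
```

With unrounded values the ratio is about 0.950, slightly below the 0.9506 obtained by squaring the rounded r = –0.975. Either way, about 95% of the variation in final grades is explained.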
