Correlation: a quantitative relationship between two interval- or ratio-level variables
Explanatory (Independent) Variable x | Response (Dependent) Variable y
Hours of Training | Number of Accidents
Shoe Size | Height
Cigarettes smoked per day | Lung Capacity
Score on SAT | Grade Point Average
Height | IQ
What type of relationship exists between the two variables and is the correlation significant?
Correlation measures and describes the strength and direction of the relationship between two variables.
It requires two scores from the same individuals (one for the independent variable and one for the dependent variable).
It is denoted by the correlation coefficient r.
Scatter Plots and Types of Correlation
x = hours of training, y = number of accidents
[Scatter plot: number of accidents (y) against hours of training (x)]
Negative correlation: as x increases, y decreases.
Scatter Plots and Types of Correlation
x = SAT score, y = GPA
[Scatter plot: GPA (y) against Math SAT score (x)]
Positive correlation: as x increases, y increases.
Scatter Plots and Types of Correlation
x = height, y = IQ
[Scatter plot: IQ (y) against height (x)]
No linear correlation.
Scatter Plots and Types of Correlation
Strong negative relationship, but non-linear!
Pearson's correlation coefficient is not appropriate here.
The correlation coefficient r is a measure of the strength and direction of a linear relationship between two variables.
The range of r is from –1 to 1.
If r is close to –1, there is a strong negative correlation.
If r is close to 0, there is no linear correlation.
If r is close to 1, there is a strong positive correlation.
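As a rough illustration of these three cases, the sketch below computes r with the usual computational formula, r = (nΣxy – ΣxΣy) / (√(nΣx² – (Σx)²) · √(nΣy² – (Σy)²)). The function name `pearson_r` and the tiny data sets are made up for illustration and do not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (hypothetical helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Illustrative data only (not from the slides).
r_up = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])       # perfectly increasing line
r_down = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])     # perfectly decreasing line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x^2: related, but not linearly

print(round(r_up, 3), round(r_down, 3), round(r_curve, 3))  # 1.0 -1.0 0.0
```

The last case shows why r near 0 only means "no linear correlation": y is completely determined by x, yet r is 0.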
Outliers are dangerous
Here we have a spurious correlation of r = 0.68.
Without IBM, r = 0.48.
Without IBM and GE, r = 0.21.
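The data behind the IBM/GE figures are not given here, so the following sketch uses a small invented data set to show the same effect: one extreme point can dominate r. The helper `pearson_r` and all values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / (
        math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    )

# Invented data: six points with a weak negative trend ...
xs = [1, 2, 3, 4, 5, 6]
ys = [5, 3, 6, 2, 4, 3]
# ... plus one invented outlier far from the rest.
outlier = (20, 20)

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [outlier[0]], ys + [outlier[1]])

print(f"without outlier: r = {r_without:.2f}")  # about -0.40
print(f"with outlier:    r = {r_with:.2f}")     # about 0.92
```

In this made-up example a single point turns a weak negative correlation into a strong positive one, which is why a scatter plot should be checked before trusting r.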
[Scatter plot: final grade (y) against absences (x)]
Absences (x) | Final Grade (y)
8 | 78
2 | 92
5 | 90
12 | 58
15 | 43
9 | 74
6 | 81
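Computation of r:
Row | x | y | xy | x² | y²
1 | 8 | 78 | 624 | 64 | 6084
2 | 2 | 92 | 184 | 4 | 8464
3 | 5 | 90 | 450 | 25 | 8100
4 | 12 | 58 | 696 | 144 | 3364
5 | 15 | 43 | 645 | 225 | 1849
6 | 9 | 74 | 666 | 81 | 5476
7 | 6 | 81 | 486 | 36 | 6561
Sum | 57 | 516 | 3751 | 579 | 39898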
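The column sums can be plugged into the computational formula for r. The sketch below recomputes the sums from the seven (absences, grade) pairs and then evaluates r; the variable names are illustrative choices, not from the slides.

```python
import math

# Absences (x) and final grades (y) from the table above.
xs = [8, 2, 5, 12, 15, 9, 6]
ys = [78, 92, 90, 58, 43, 74, 81]

n = len(xs)                                   # 7 pairs
sum_x = sum(xs)                               # 57
sum_y = sum(ys)                               # 516
sum_xy = sum(x * y for x, y in zip(xs, ys))   # 3751
sum_x2 = sum(x * x for x in xs)               # 579
sum_y2 = sum(y * y for y in ys)               # 39898

# r = (n*Sxy - Sx*Sy) / (sqrt(n*Sx2 - Sx^2) * sqrt(n*Sy2 - Sy^2))
numerator = n * sum_xy - sum_x * sum_y        # -3155
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 3))  # -0.975
```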
r is the correlation coefficient for the sample. The correlation coefficient for the population is ρ (rho).
For a two-tailed test for significance:
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
The sampling distribution for r is a t-distribution with n – 2 d.f.
Standardized test statistic: $t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$
You found the correlation between the number of times absent and a final grade to be r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use α = 0.01.
1. Write the null and alternative hypotheses.
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
2. State the level of significance.
α = 0.01
3. Identify the sampling distribution.
A t-distribution with 5 degrees of freedom
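4. Find the critical value.
Use the t-table row for df = 5 and the column for a one-tail probability of p = 0.005 (α = 0.01 split between the two tails):
df\p | 0.40 | 0.25 | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.0005
1 | 0.324920 | 1.000000 | 3.077684 | 6.313752 | 12.70620 | 31.82052 | 63.65674 | 636.6192
2 | 0.288675 | 0.816497 | 1.885618 | 2.919986 | 4.30265 | 6.96456 | 9.92484 | 31.5991
3 | 0.276671 | 0.764892 | 1.637744 | 2.353363 | 3.18245 | 4.54070 | 5.84091 | 12.9240
4 | 0.270722 | 0.740697 | 1.533206 | 2.131847 | 2.77645 | 3.74695 | 4.60409 | 8.6103
5 | 0.267181 | 0.726687 | 1.475884 | 2.015048 | 2.57058 | 3.36493 | 4.03214 | 6.8688
Critical values: $\pm t_0 = \pm 4.032$
5. Find the rejection region.
[Figure: t-distribution with rejection regions in both tails, beyond –4.032 and 4.032]
Reject the null hypothesis if t < –4.032 or t > 4.032.
6. Find the test statistic.
$t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \dfrac{-0.975}{\sqrt{\dfrac{1 - (-0.975)^2}{7 - 2}}} \approx -9.811$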
7. Make your decision.
t = –9.811 falls in the rejection region. Reject the null hypothesis.
8. Interpret your decision.
There is a significant negative correlation between the number of times absent and final grades.
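The eight steps can be condensed into a few lines of code. This is a minimal sketch, assuming the critical value 4.032 is read from a t-table (df = 5, two-tailed α = 0.01) rather than computed; the function name `t_statistic` is an illustrative choice.

```python
import math

def t_statistic(r, n):
    """Standardized test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r = -0.975          # sample correlation (absences vs. final grade)
n = 7               # number of data pairs
t_critical = 4.032  # from a t-table: df = 5, alpha = 0.01, two-tailed

t = t_statistic(r, n)
reject_null = abs(t) > t_critical  # two-tailed test: reject in either tail

print(f"t = {t:.2f}, reject H0: {reject_null}")  # t = -9.81, reject H0: True
```

The unrounded statistic agrees with the value above to two decimal places; the third decimal depends on how r is rounded before the calculation.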
Regression indicates the degree to which the variation in one variable, y, is related to or can be explained by the variation in another variable, x.
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
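The line of regression is: $\hat{y} = mx + b$
The slope m is: $m = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$
The y-intercept is: $b = \bar{y} - m\bar{x} = \dfrac{\sum y}{n} - m\,\dfrac{\sum x}{n}$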
[Figure: scatter plot with the best-fitting straight line; x-axis: Ad $]
$(x_i, y_i)$ = a data point
$(x_i, \hat{y}_i)$ = a point on the line with the same x-value
$d_i = y_i - \hat{y}_i$ = a residual
Row | x | y | xy | x² | y²
1 | 8 | 78 | 624 | 64 | 6084
2 | 2 | 92 | 184 | 4 | 8464
3 | 5 | 90 | 450 | 25 | 8100
4 | 12 | 58 | 696 | 144 | 3364
5 | 15 | 43 | 645 | 225 | 1849
6 | 9 | 74 | 666 | 81 | 5476
7 | 6 | 81 | 486 | 36 | 6561
Sum | 57 | 516 | 3751 | 579 | 39898
Write the equation of the line of regression with x = number of absences and y = final grade.
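Calculate m and b.
$m = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \dfrac{7(3751) - (57)(516)}{7(579) - 57^2} = \dfrac{-3155}{804} \approx -3.924$
$b = \bar{y} - m\bar{x} \approx 73.714 - (-3.924)(8.143) \approx 105.667$
m = –3.924 and b = 105.667
The line of regression is: $\hat{y} = -3.924x + 105.667$
[Scatter plot with the line of regression: final grade (y) against absences (x)]
Note that the point $(\bar{x}, \bar{y}) = (8.143, 73.714)$ is on the line.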
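The slope and intercept can be checked numerically. The sketch below applies the least-squares formulas to the seven data pairs; the function name `least_squares_line` is hypothetical. Working with the unrounded slope gives b ≈ 105.668, while rounding m to –3.924 first gives the 105.667 quoted here.

```python
def least_squares_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * (sum_x / n)   # b = y_bar - m * x_bar
    return m, b

xs = [8, 2, 5, 12, 15, 9, 6]          # absences
ys = [78, 92, 90, 58, 43, 74, 81]     # final grades

m, b = least_squares_line(xs, ys)
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)

print(f"m = {m:.3f}, b = {b:.3f}")
# The point (x_bar, y_bar) always lies on the least-squares line.
print(abs((m * x_bar + b) - y_bar) < 1e-9)  # True
```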
The regression line can be used to predict values of y for values of x falling within the range of the data.
The regression equation for number of times absent and final grade is: $\hat{y} = -3.924x + 105.667$
Use this equation to predict the expected grade for a student with
(a) 3 absences (b) 12 absences
(a) $\hat{y} = -3.924(3) + 105.667 = 93.895$
(b) $\hat{y} = -3.924(12) + 105.667 = 58.579$
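A prediction helper can enforce the rule that x must fall within the range of the data (2 to 15 absences in this sample). This is only a sketch: the function name `predict_grade` and the choice to raise an error on extrapolation are assumptions made for illustration.

```python
def predict_grade(absences, m=-3.924, b=105.667, x_min=2, x_max=15):
    """Predict the final grade from the regression line y = m*x + b.

    x_min and x_max are the smallest and largest number of absences in the
    sample; predictions outside that range are refused (an illustrative choice).
    """
    if not (x_min <= absences <= x_max):
        raise ValueError(
            f"{absences} absences is outside the data range [{x_min}, {x_max}]"
        )
    return m * absences + b

grade_a = predict_grade(3)    # (a) 3 absences
grade_b = predict_grade(12)   # (b) 12 absences

print(round(grade_a, 3), round(grade_b, 3))  # 93.895 58.579
```

Calling `predict_grade(20)` raises a `ValueError`, since 20 absences lies beyond the largest value observed in the sample.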
The coefficient of determination, $r^2$, measures the strength of the association and is the ratio of explained variation in y to the total variation in y.
The correlation coefficient of number of times absent and final grade is r = –0.975. The coefficient of determination is $r^2 = (-0.975)^2 = 0.9506$.
Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
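The "explained over total variation" definition can be verified directly on the absences data. The sketch below fits the least-squares line, then splits the total variation in y into explained and unexplained parts; the variable names are illustrative. With the unrounded r the ratio comes out near 0.950, slightly below the 0.9506 obtained from the rounded r = –0.975.

```python
xs = [8, 2, 5, 12, 15, 9, 6]          # absences
ys = [78, 92, 90, 58, 43, 74, 81]     # final grades

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept (deviation form of the same formulas).
m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
b = y_bar - m * x_bar
predicted = [m * x + b for x in xs]

total_variation = sum((y - y_bar) ** 2 for y in ys)
explained_variation = sum((p - y_bar) ** 2 for p in predicted)
unexplained_variation = sum((y - p) ** 2 for y, p in zip(ys, predicted))

r_squared = explained_variation / total_variation

print(f"r^2 = {r_squared:.3f}")  # r^2 = 0.950
```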