Linear regression and correlation - Sun Yat

Medical Statistics
(full English class)
Ji-Qian Fang
School of Public Health
Sun Yat-Sen University
Chapter 12
Linear Correlation
and
Linear Regression
Vocabulary for Chapter 12-1
association -- 联系
function -- 函数
exponential function -- 指数函数
logarithmic function -- 对数函数
sine function -- 正弦函数
linear relationship -- 线性关系
linear correlation -- 线性相关
linear regression -- 线性回归
correlation coefficient -- 相关系数
Pearson's correlation coefficient -- Pearson 相关系数
sample correlation coefficient -- 样本相关系数
population correlation coefficient -- 总体相关系数
direction -- 方向
strength -- 强度
rank correlation -- 秩相关
Spearman's correlation coefficient -- Spearman 秩相关系数
chest circumference -- 胸围
vital capacity -- 肺活量
caution -- 小心、谨慎
anniversary -- 周年纪念日
Up to now, the statistical methods we have learnt
concern a single variable only, such as:
• Estimating the average height of high school students
• Comparing the average height of high school students
between city and countryside
Often the relationship between two variables is of interest.
Example: For high school students,
Height and age -- linear relationship?
Height and vital capacity -- linear relationship?
In this chapter, we are going to study
the linear relationship between two variables.
Two types of questions:
Is there a linear relationship?
-- Linear correlation
How can one variable be predicted from another variable?
-- Linear regression
Example: Chest circumference and vital capacity of 15 high school female students

No.     Chest circumference   Vital capacity   X²       Y²          XY
        X (cm)                Y
(1)     (2)                   (3)              (4)      (5)         (6)
1       72                    2400             5184     5760000     172800
2       68                    2200             4624     4840000     149600
3       78                    2750             6084     7562500     214500
4       66                    1800             4356     3240000     118800
…       …                     …                …        …           …
14      75                    2500             5625     6250000     187500
15      69                    2350             4761     5522500     162150
Total   1036                  35150            71858    83567500    2441450
• Is there a linear relationship between Vital Capacity Y
and Chest circumference X? -- Linear correlation
• If Chest circumference X is known, can we predict her
Vital Capacity Y? -- Linear regression
12.1 Linear correlation
Scatter Diagram
[Figure: vital capacity (vertical axis, 1700 to 2900) plotted against chest circumference in cm (horizontal axis, 55 to 85)]
Functional relationship between Y and X
Examples: exponential function, logarithmic function, sine function
-- There is a fixed value of Y corresponding to any given value of X
Fig 12-2 Linear correlation
1. Correlation Coefficient and Calculation
The correlation coefficient is a measure of linear relationship:
1) Is there a correlation?
If the correlation coefficient is 0 or not large enough
-- no correlation
2) If the correlation coefficient is large enough:
The direction of correlation?
-- positive (+) or negative (-)
The strength of correlation? High or not?
-- +1 or -1 means complete correlation
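Sample correlation coefficient:

$$ r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \, \sum (Y-\bar{Y})^2}} = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} $$

$$ l_{XY} = \sum (X-\bar{X})(Y-\bar{Y}) = \sum XY - \frac{1}{n}\Big(\sum X\Big)\Big(\sum Y\Big) $$

$$ l_{XX} = \sum (X-\bar{X})^2 = \sum X^2 - \frac{1}{n}\Big(\sum X\Big)^2 $$

$$ l_{YY} = \sum (Y-\bar{Y})^2 = \sum Y^2 - \frac{1}{n}\Big(\sum Y\Big)^2 $$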
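The shortcut forms on the right-hand side follow from expanding the products and substituting the means. A brief check for l_XY, using only the definitions of the two means (the same steps with Y replaced by X give l_XX, and with X replaced by Y give l_YY):

$$
\begin{aligned}
\sum (X-\bar{X})(Y-\bar{Y})
&= \sum XY - \bar{X}\sum Y - \bar{Y}\sum X + n\bar{X}\bar{Y} \\
&= \sum XY - \frac{1}{n}\Big(\sum X\Big)\Big(\sum Y\Big) - \frac{1}{n}\Big(\sum X\Big)\Big(\sum Y\Big) + \frac{1}{n}\Big(\sum X\Big)\Big(\sum Y\Big) \\
&= \sum XY - \frac{1}{n}\Big(\sum X\Big)\Big(\sum Y\Big)
\end{aligned}
$$

This is why the column totals of X, Y, X², Y² and XY are all that is needed to compute r.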
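With n = 15 and the totals from the table:

$$ l_{XY} = \sum XY - \frac{1}{n}\Big(\sum X\Big)\Big(\sum Y\Big) = 2441450 - \frac{1}{15}(1036)(35150) = 13756.667 $$

$$ l_{XX} = \sum X^2 - \frac{1}{n}\Big(\sum X\Big)^2 = 71858 - \frac{1}{15}(1036)^2 = 304.9333 $$

$$ l_{YY} = \sum Y^2 - \frac{1}{n}\Big(\sum Y\Big)^2 = 83567500 - \frac{1}{15}(35150)^2 = 1199333.33 $$

$$ r = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} = \frac{13756.667}{\sqrt{304.9333 \times 1199333.33}} = 0.7194 $$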
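The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r_from_sums` and its parameter names are made up for this sketch, and the inputs are the column totals from the example table.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r computed from column totals (hypothetical helper)."""
    l_xy = sum_xy - sum_x * sum_y / n   # corrected sum of cross-products
    l_xx = sum_x2 - sum_x ** 2 / n      # corrected sum of squares of X
    l_yy = sum_y2 - sum_y ** 2 / n      # corrected sum of squares of Y
    return l_xy / math.sqrt(l_xx * l_yy)

# Totals for the 15 students: sum X, sum Y, sum X^2, sum Y^2, sum XY
r = pearson_r_from_sums(15, 1036, 35150, 71858, 83567500, 2441450)
print(round(r, 4))  # 0.7194
```

Working from totals mirrors the hand calculation; with the raw data available, the deviation form of the formula gives the same value.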
2. Hypothesis test
• r is the sample correlation coefficient; it changes from
sample to sample
• There is a population correlation coefficient,
denoted by ρ
• Question: Is ρ = 0 or not?
• H0: ρ = 0, H1: ρ ≠ 0, α = 0.05
(1) Checking a special table (Table 12-3)
Two-sided 0.05 and 0.01:
ν = 15 - 2 = 13
r_0.05(13) = 0.514, r_0.01(13) = 0.641
r = 0.7194 > r_0.01(13), so P < 0.01
H0 is rejected.
Conclusion: There is a linear correlation between vital
capacity and chest circumference.
Question: Since P < 0.01 is very small, can we say the
correlation is very strong?
Table 12-3 Critical values for r

Degrees of          Probability, P
freedom ν   One-side:   0.05    0.025   0.01    0.005
            Two-side:   0.10    0.05    0.02    0.01
1                       0.988   0.997   1.00    1.00
2                       0.900   0.950   0.980   0.990
3                       0.805   0.878   0.934   0.959
4                       0.729   0.811   0.882   0.917
5                       0.669   0.755   0.833   0.875
…                       …       …       …       …
12                      0.457   0.532   0.612   0.661
13                      0.441   0.514   0.592   0.641
14                      0.426   0.497   0.574   0.623
15                      0.412   0.482   0.558   0.606
…                       …       …       …       …
Question:
If r = 0.90, can you claim that the two variables are correlated
with each other?
Does a small P value mean that the correlation is strong?
(2) t test
(Assume normal distribution)
H0: ρ = 0, H1: ρ ≠ 0

$$ t = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad \nu = n - 2 $$

If P-value < α, then reject H0 and conclude that
the population correlation coefficient is
significantly different from 0.

Example with r = 0.6945 and n = 10:

$$ t = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.6945}{\sqrt{\dfrac{1 - 0.6945^2}{10 - 2}}} = 2.73 $$

ν = 10 - 2 = 8, P < 0.05. The population correlation coefficient
might not be 0.
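A minimal Python sketch of the t statistic, together with the link between the t test and the table method. The function names are invented for this sketch, and the critical value 2.160 (two-sided 0.05, 13 degrees of freedom) is taken from a standard t table rather than from these slides. Solving the t formula for r gives the critical r as t / sqrt(t² + ν).

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def critical_r(t_crit, df):
    """Critical |r| implied by a critical t value (t formula solved for r)."""
    return t_crit / math.sqrt(t_crit ** 2 + df)

print(round(correlation_t(0.6945, 10), 2))  # 2.73
# 2.160 is an assumed input from a standard t table (two-sided 0.05, 13 d.f.)
print(round(critical_r(2.160, 13), 3))      # 0.514
```

The second line reproduces the two-sided 0.05 entry for 13 degrees of freedom in Table 12-3, which suggests that the special table and the t test are two presentations of the same test.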
12.2 Rank correlation
1. Spearman rank correlation coefficient
• It is useful for:
Ranked data
Measurement data that
- do not follow a normal distribution;
- or are not precisely measured
Table 12-2 The frequency of the mother's eclamptic convulsions and the new-born's Apgar score

No.     Frequency of mother's   Apgar score    Rank   Rank   Difference      d²
        eclamptic convulsion    of new-born    of X   of Y   between ranks
        X                       Y                            d
(1)     (2)                     (3)            (4)    (5)    (6)             (7)
1       1                       9              2      8.5    -6.5            42.25
2       4                       1              9      1      8               64
3       2                       6              5      5      0               0
4       2                       8              5      7      -2              4
5       3                       5              7.5    4      3.5             12.25
6       1                       10             2      10     -8              64
7       3                       4              7.5    3      4.5             20.25
8       1                       9              2      8.5    -6.5            42.25
9       5                       2              10     2      8               64
10      2                       7              5      6      -1              1
Total   --                      --             --     --     --              314

$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 314}{10(10^2 - 1)} = -0.903 $$
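The ranking and the d² formula can be sketched in Python as follows. The helper names `mean_ranks` and `spearman_from_d2` are invented for this sketch; the data are columns (2) and (3) of Table 12-2, and tied values receive the mean of the ranks they occupy, as in the table.

```python
def mean_ranks(values):
    """Rank from smallest to largest; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_from_d2(x, y):
    """Spearman's r_s by the d-squared formula; also returns sum of d^2."""
    rank_x, rank_y = mean_ranks(x), mean_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

x = [1, 4, 2, 2, 3, 1, 3, 1, 5, 2]    # frequency of convulsion
y = [9, 1, 6, 8, 5, 10, 4, 9, 2, 7]   # Apgar score
rs, sum_d2 = spearman_from_d2(x, y)
print(sum_d2, round(rs, 3))  # 314.0 -0.903
```

The negative sign says that more frequent convulsions go with lower Apgar scores.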
2. Hypothesis test for rs
(1) Checking a special table (Table 12-4)
rs(10, 0.05) = 0.648, rs(10, 0.01) = 0.794
|rs| = 0.903 > 0.794, so P < 0.01 and the correlation is significant
(2) t test: same as the t test for Pearson's correlation
coefficient
H0: ρ = 0, H1: ρ ≠ 0

$$ t = \frac{r_s - 0}{\sqrt{\dfrac{1 - r_s^2}{n - 2}}}, \qquad \nu = n - 2 $$
What if there are many ties?
(1) Rank the values of X and Y separately
-- Tied values receive the mean of their ranks
(2) Calculate Spearman's rs
with the formula for Pearson's correlation:
-- Put the ranks (columns 4 and 5) into the
formula of Pearson's correlation coefficient r
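As a sketch of step (2), the rank columns (4) and (5) of Table 12-2 can be put directly into Pearson's formula. The function name `pearson_r` is chosen for this sketch only.

```python
import math

def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    l_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    l_xx = sum((a - mean_x) ** 2 for a in x)
    l_yy = sum((b - mean_y) ** 2 for b in y)
    return l_xy / math.sqrt(l_xx * l_yy)

# Columns (4) and (5) of Table 12-2: mean ranks of X and Y
rank_x = [2, 9, 5, 5, 7.5, 2, 7.5, 2, 10, 5]
rank_y = [8.5, 1, 5, 7, 4, 10, 3, 8.5, 2, 6]
rs_ties = pearson_r(rank_x, rank_y)
print(round(rs_ties, 3))  # -0.963
```

For these data the result differs from the -0.903 given by the d² formula, because the d² formula is exact only when there are no ties; with this many ties the difference is noticeable.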
Caution for correlation
Story 1: Correlation between the height of a son and the height of a tree.

Time                     1    2    3    4    …    11   12
Height of son (cm) X     50   54   59   65   …    75   81
Height of tree (cm) Y    35   42   50   57   …    60   66

A correlation coefficient was calculated at the first
anniversary:
Cor(X, Y) = 0.97, P < 0.05
Conclusion: Did the tree make his son grow quickly, or
did his son make the tree grow quickly?!
Story 2: Correlation between swimming and ice cream.

Day                                  1    2    …    6     7     8     …    11   12
Number of people swimming X          20   14   …    129   235   198   …    45   31
Number of people buying ice cream Y  15   12   …    120   237   203   …    40   36

A correlation coefficient was calculated at the end of the
year:
Cor(X, Y) = 0.92, P < 0.05
Conclusion: People who swim must like ice cream, or
people who buy ice cream must go swimming?!
1) Don't put just any two variables together for
correlation
-- They must be related in terms of subject
matter
2) Simple correlation
= Direct association + indirect association
Simple correlation does not necessarily mean a
direct association
[Diagram: Son -- ? -- Tree, both linked to Time]
[Diagram: Swimming -- ? -- Ice cream, both linked to Temperature]
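The indirect association in Story 1 can be illustrated with a small Python sketch. It uses only the six rows printed in the Story 1 table (times 1 to 4, 11 and 12); the elided rows are not guessed, so the values will not reproduce the reported 0.97. The function name `pearson_r` is chosen for this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    l_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    l_xx = sum((a - mean_x) ** 2 for a in x)
    l_yy = sum((b - mean_y) ** 2 for b in y)
    return l_xy / math.sqrt(l_xx * l_yy)

# Only the rows shown in Story 1; the "…" rows are left out, not filled in.
time = [1, 2, 3, 4, 11, 12]
son = [50, 54, 59, 65, 75, 81]    # height of son (cm)
tree = [35, 42, 50, 57, 60, 66]   # height of tree (cm)

print("son vs tree:", round(pearson_r(son, tree), 2))
print("time vs son:", round(pearson_r(time, son), 2))
print("time vs tree:", round(pearson_r(time, tree), 2))
```

Both heights are strongly correlated with time, which is enough to produce a strong correlation between them without any direct link.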
Summary
• Concept of linear correlation
-- The scatter diagram shows a linear tendency
Correlation ≠ direct association
Correlation ≠ causation
• After calculating a sample correlation coefficient, it is
necessary to test
H0: ρ = 0, H1: ρ ≠ 0
Rejecting H0 just means ρ ≠ 0.
• Two correlation coefficients are commonly used:
Pearson's product-moment correlation coefficient r
Spearman's rank correlation coefficient rs
-- The formulas do not have to be remembered
Next lecture
Question:
If there is a linear correlation between X and Y,
then given a value of X, can we predict the value
of Y? How?
-- Linear regression