Medical Statistics (full English class)
Ji-Qian Fang
School of Public Health, Sun Yat-Sen University

Chapter 12 Linear Correlation and Linear Regression

Vocabulary for Chapter 12-1
association                          联系
function                             函数
exponential function                 指数函数
logarithmic function                 对数函数
sine function                        正弦函数
linear relationship                  线性关系
linear correlation                   线性相关
linear regression                    线性回归
correlation coefficient              相关系数
Pearson's correlation coefficient    Pearson 相关系数
sample correlation coefficient       样本相关系数
population correlation coefficient   总体相关系数
direction                            方向
strength                             强度
rank correlation                     秩相关
Spearman's correlation coefficient   Spearman 秩相关系数
chest circumference                  胸围
vital capacity                       肺活量
caution                              小心、谨慎
anniversary                          周年纪念日

Up to now, the statistical methods we have learnt have concerned a single variable only, such as:
• Estimating the average height of high school students
• Comparing the average height of high school students between city and countryside

Often, however, we are interested in the relationship between two variables.
Example: For high school students,
Height and age -- linear relation?
Height and vital capacity -- linear relation?

In this chapter, we study the linear relationship between two variables.
Two types of questions:
Is there a linear relationship? -- Linear correlation
How can one variable be predicted from another? -- Linear regression

Example: Chest circumference and vital capacity of 15 high school female students

No.     Chest circumference X (cm)   Vital capacity Y   X²      Y²         XY
(1)     (2)                          (3)                (4)     (5)        (6)
1       72                           2400               5184    5760000    172800
2       68                           2200               4624    4840000    149600
3       78                           2750               6084    7562500    214500
4       66                           1800               4356    3240000    118800
…       …                            …                  …       …          …
14      75                           2500               5625    6250000    187500
15      69                           2350               4761    5522500    162150
Total   1036                         35150              71858   83567500   2441450

• Is there a linear relationship between vital capacity Y and chest circumference X? -- Linear correlation
• If a student's chest circumference X is known, can we predict her vital capacity Y?
-- Linear regression

12.1 Linear correlation

[Figure: scatter diagram of vital capacity (Y) against chest circumference (X, cm) for the 15 students]

Function between Y and X
Exponential function, logarithmic function, sine function:
there is a fixed value of Y corresponding to any given value of X.

Fig 12-2 Linear correlation

1. Correlation coefficient and its calculation
The correlation coefficient is a measure of linear relationship:
1) Is there a correlation?
   If the correlation coefficient is 0, or not far enough from 0 -- no correlation
2) If the correlation coefficient is big enough:
   The direction of correlation? -- positive (+) or negative (−)
   The strength of correlation? High or not? -- +1 or −1 means complete correlation

Sample correlation coefficient:

$$ r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \sum (Y-\bar{Y})^2}} = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} $$

$$ l_{XY} = \sum (X-\bar{X})(Y-\bar{Y}) = \sum XY - \frac{1}{n}\left(\sum X\right)\left(\sum Y\right) $$

$$ l_{XX} = \sum (X-\bar{X})^2 = \sum X^2 - \frac{1}{n}\left(\sum X\right)^2 $$

$$ l_{YY} = \sum (Y-\bar{Y})^2 = \sum Y^2 - \frac{1}{n}\left(\sum Y\right)^2 $$

From the column totals of the example table (n = 15): ΣX = 1036, ΣY = 35150, ΣX² = 71858, ΣY² = 83567500, ΣXY = 2441450.

$$ l_{XY} = 2441450 - \frac{1}{15}(1036)(35150) = 13756.667 $$

$$ l_{XX} = 71858 - \frac{1}{15}(1036)^2 = 304.9333 $$

$$ l_{YY} = 83567500 - \frac{1}{15}(35150)^2 = 1199333.33 $$

$$ r = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} = \frac{13756.667}{\sqrt{304.9333 \times 1199333.33}} = 0.7194 $$

2. Hypothesis test
r is the sample correlation coefficient; it changes from sample to sample.
There is a population correlation coefficient, denoted by ρ.
Question: Is ρ = 0 or not?
H0: ρ = 0, H1: ρ ≠ 0, α = 0.05

(1) Checking a special table (Table 12-3)
Two-sided 0.05 and 0.01, with ν = 15 − 2 = 13:
$r_{0.05(13)} = 0.514$, $r_{0.01(13)} = 0.641$
$r = 0.7194 > r_{0.01(13)}$, so P < 0.01. H0 is rejected.
Conclusion: There is a linear correlation between vital capacity and chest circumference.
Question: Since P < 0.01 is very small, can we say the correlation is very strong?

Table 12-3 Critical values for r

                           Probability, P
Degrees of    One-side:    0.05     0.025    0.01     0.005
freedom       Two-side:    0.10     0.05     0.02     0.01
1                          0.988    0.997    1.00     1.00
2                          0.900    0.950    0.980    0.990
3                          0.805    0.878    0.934    0.959
4                          0.729    0.811    0.882    0.917
5                          0.669    0.755    0.833    0.875
…                          …        …        …        …
12                         0.457    0.532    0.612    0.661
13                         0.441    0.514    0.592    0.641
14                         0.426    0.497    0.574    0.623
15                         0.412    0.482    0.558    0.606
…                          …        …        …        …

Question: If r = 0.90, can you claim that the two variables are correlated with each other?
Does a small P value mean that the correlation is strong?

(2) t test (assuming a normal distribution)
H0: ρ = 0, H1: ρ ≠ 0

$$ t = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad \nu = n - 2 $$

If the P value < α, reject H0 and conclude that the population correlation coefficient is significantly different from 0.

Example (r = 0.6945, n = 10):

$$ t = \frac{0.6945 - 0}{\sqrt{\dfrac{1 - 0.6945^2}{10 - 2}}} = 2.73 $$

ν = 10 − 2 = 8, P < 0.05. The population correlation coefficient might not be 0.

12.2 Rank correlation

1. Spearman rank correlation coefficient
It is useful for:
Ranked data
Measurement data that do not follow a normal distribution, or are not precisely measured

Table 12-2 Frequency of the mother's eclamptic convulsions and the newborn's Apgar score

No.     Frequency of mother's     Apgar score of   Rank of X   Rank of Y   Difference between   d²
        eclamptic convulsion X    newborn Y                                ranks d
(1)     (2)                       (3)              (4)         (5)         (6)                  (7)
1       1                         9                2           8.5         -6.5                 42.25
2       4                         1                9           1           8                    64
3       2                         6                5           5           0                    0
4       2                         8                5           7           -2                   4
5       3                         5                7.5         4           3.5                  12.25
6       1                         10               2           10          -8                   64
7       3                         4                7.5         3           4.5                  20.25
8       1                         9                2           8.5         -6.5                 42.25
9       5                         2                10          2           8                    64
10      2                         7                5           6           -1                   1
Total   --                        --               --          --          --                   314

$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 314}{10(10^2 - 1)} = -0.903 $$

2. Hypothesis test for r_s

(1) Checking a special table (Table 12-4)
$r_{s(10,\,0.05)} = 0.648$, $r_{s(10,\,0.01)} = 0.794$
$|r_s| = 0.903 > 0.794$, so P < 0.01 and the correlation is significant.

(2) t test
Same as the t test for Pearson's correlation coefficient
H0: ρ = 0, H1: ρ ≠ 0

$$ t = \frac{r_s - 0}{\sqrt{\dfrac{1 - r_s^2}{n - 2}}}, \qquad \nu = n - 2 $$

What if there are many ties?
(1) Rank the values of X and Y separately -- give tied values their mean rank
(2) Calculate Spearman's r_s with the formula for Pearson's correlation -- put the ranks (columns 4 and 5) into the formula for Pearson's correlation coefficient r

Caution for correlation

Story 1: Correlation between the height of a son and the height of a tree.

Time                       1     2     3     4     …    11    12
Height of son (cm) X       50    54    59    65    …    75    81
Height of tree (cm) Y      35    42    50    57    …    60    66

A correlation coefficient was calculated at the first anniversary:
Cor(X, Y) = 0.97, P < 0.05
Conclusion: Did the tree make his son grow up quickly, or did his son make the tree grow up quickly?!

Story 2: Correlation between swimming and ice cream.

Day                                    1     2     …    6      7      8      …    11    12
Number of people swimming X            20    14    …    129    235    198    …    45    31
Number of people buying ice cream Y    15    12    …    120    237    203    …    40    36

They calculated a correlation coefficient at the end of the year:
Cor(X, Y) = 0.92, P < 0.05
Conclusion: Must people who swim like ice cream, or must people who buy ice cream go swimming?!

1) Don't put just any two variables together for correlation -- they must have some relation in subject matter
2) Simple correlation = direct association + indirect association
Simple correlation does not necessarily mean a direct association:
Son --?-- Tree (both change with Time)
Swimming --?-- Ice cream (both change with Temperature)

Summary
Concept of linear correlation -- the scatter diagram shows a linear tendency
Correlation ≠ direct association
Correlation ≠ causation
After calculating a sample correlation coefficient, it is necessary to test H0: ρ = 0 against H1: ρ ≠ 0
Rejecting H0 just means ρ ≠ 0; it does not mean that the correlation is strong.
There are two commonly used correlation coefficients:
Pearson's product-moment correlation coefficient r
Spearman's rank correlation coefficient r_s
-- The formulas do not have to be remembered

Next lecture
Question: If there is a linear correlation between X and Y, then given a value of X, can we predict the value of Y? How? -- Linear regression
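
As a recap of this chapter's worked calculations, the following is a minimal Python sketch of Pearson's r computed from column totals, the t statistic for H0: ρ = 0, and Spearman's r_s computed from rank differences. The function names (`pearson_from_sums`, `t_statistic`, `mean_ranks`, `spearman_from_d2`) are illustrative assumptions and are not part of the lecture. Only the standard library is used.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from column totals, via l_XY, l_XX and l_YY."""
    l_xy = sum_xy - sum_x * sum_y / n
    l_xx = sum_x2 - sum_x ** 2 / n
    l_yy = sum_y2 - sum_y ** 2 / n
    return l_xy / math.sqrt(l_xx * l_yy)


def t_statistic(r, n):
    """t = (r - 0) / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


def mean_ranks(values):
    """Rank values from 1 to n; tied values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_from_d2(x, y):
    """Spearman's r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank(x) - rank(y)."""
    rank_x, rank_y = mean_ranks(x), mean_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Column totals of the chest circumference / vital capacity example (n = 15).
r = pearson_from_sums(15, 1036, 35150, 71858, 83567500, 2441450)

# t test example from the lecture: r = 0.6945, n = 10.
t = t_statistic(0.6945, 10)

# Table 12-2: frequency of eclamptic convulsions (X) and Apgar score (Y).
convulsions = [1, 4, 2, 2, 3, 1, 3, 1, 5, 2]
apgar = [9, 1, 6, 8, 5, 10, 4, 9, 2, 7]
r_s = spearman_from_d2(convulsions, apgar)

print(round(r, 4), round(t, 2), round(r_s, 3))
```

With the data above, the three values agree with the worked examples: r ≈ 0.7194, t ≈ 2.73 and r_s ≈ −0.903. When there are many ties, the lecture's advice is to apply the Pearson formula to the two rank columns instead of the d² shortcut; `mean_ranks` produces those columns.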