4.2 Numerical Measures of Association:

There are several numerical measures of association. We first introduce the covariance of two variables.

(I) Covariance:

Suppose we have two populations, population 1: $y_1, y_2, \ldots, y_N$ and population 2: $w_1, w_2, \ldots, w_N$. Also, let sample 1: $x_1, x_2, \ldots, x_n$ and sample 2: $z_1, z_2, \ldots, z_n$ be drawn from population 1 and population 2, respectively. Let $\mu_y$ and $\mu_w$ be the population means of populations 1 and 2, respectively. Let

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad \bar{z} = \frac{\sum_{i=1}^{n} z_i}{n}$$

be the sample means of samples 1 and 2, respectively. Then, the population covariance is

$$\sigma_{yw} = \frac{\sum_{i=1}^{N} (y_i - \mu_y)(w_i - \mu_w)}{N},$$

while the sample covariance is

$$s_{xz} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n - 1} = \frac{\sum_{i=1}^{n} x_i z_i - n\,\bar{x}\,\bar{z}}{n - 1}.$$

Intuitively, $s_{xz}$ would be very large (positive) when the observations in the two samples tend to be larger than their sample means, or smaller than their sample means, simultaneously. That is, the observations are positively correlated. On the other hand, $s_{xz}$ would be very small (negative) when the observations in one sample are larger than their sample mean while the corresponding ones in the other sample are smaller than their sample mean. In that case, the observations are negatively correlated. Finally, $s_{xz}$ would be close to 0 when the observations in one sample are larger than their sample mean while the corresponding ones in the other sample are sometimes larger and sometimes smaller than their sample mean, i.e., the observations in the two samples are not correlated.

Example:

Let $x_i$ be the total money spent on advertisement for some product and $z_i$ be the sales volume (1 unit = 100 packs).

$x_i$                                   2    5    1    3    4    1    5    3    4    2
$z_i$                                  50   57   41   54   54   38   63   48   59   46
$(x_i - \bar{x})(z_i - \bar{z})$        1   12   20    0    3   26   24    0    8    5

Here $\bar{x} = 3$ and $\bar{z} = 51$, so

$$s_{xz} = \frac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{10 - 1} = \frac{99}{9} = 11.$$

Note: $s_{xz}$ is not scale-invariant. For example, if the sales volume in the above example is measured with 1 unit = 1 pack, then the $z_i$ would be 5000, 5700, 4100, 5400, 5400, 3800, 6300, 4800, 5900, 4600. Thus, $s_{xz}$ would be 1100, which is 100 times larger than the original value.
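The covariance calculation for the advertisement example, including the effect of changing the sales unit, can be checked with a short sketch. This is only an illustration in plain Python; the function names `sample_covariance` and `sample_covariance_shortcut` and the variable `z_packs` are hypothetical and not part of the notes.

```python
def sample_covariance(x, z):
    """Sample covariance: sum of (x_i - x_bar)(z_i - z_bar), divided by n - 1."""
    n = len(x)
    if n != len(z) or n < 2:
        raise ValueError("x and z must have the same length, at least 2")
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    return sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / (n - 1)


def sample_covariance_shortcut(x, z):
    """Equivalent form: (sum of x_i * z_i - n * x_bar * z_bar) / (n - 1)."""
    n = len(x)
    if n != len(z) or n < 2:
        raise ValueError("x and z must have the same length, at least 2")
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    return (sum(xi * zi for xi, zi in zip(x, z)) - n * x_bar * z_bar) / (n - 1)


# Data from the advertisement example (sales in units of 100 packs).
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
z = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

# Same sales figures expressed in single packs (1 unit = 1 pack).
z_packs = [100 * zi for zi in z]

print(sample_covariance(x, z))           # 11.0
print(sample_covariance_shortcut(x, z))  # 11.0
print(sample_covariance(x, z_packs))     # 1100.0
```

Both forms of the formula give the same value, and rescaling the sales volume by a factor of 100 rescales the covariance by the same factor.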
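The scale dependence of $s_{xz}$ is undesirable, since the association between the total money spent on advertisement and the sales volume should not change when the measurement unit changes. The quantity introduced next is scale-invariant and can be used to measure the correlation of two populations.

(II) Correlation Coefficient:

Let

$\sigma_y$: population standard deviation for $y_1, y_2, \ldots, y_N$
$\sigma_w$: population standard deviation for $w_1, w_2, \ldots, w_N$
$s_x$: sample standard deviation for $x_1, x_2, \ldots, x_n$
$s_z$: sample standard deviation for $z_1, z_2, \ldots, z_n$.

Then, the population correlation coefficient is

$$\rho_{yw} = \frac{\sigma_{yw}}{\sigma_y \sigma_w},$$

while the sample correlation coefficient is

$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (z_i - \bar{z})^2}}.$$

Note: $|\rho_{yw}| \le 1$ and $|r_{xz}| \le 1$.

Example (continued):

$$s_x^2 = \frac{\sum_{i=1}^{10} (x_i - \bar{x})^2}{10 - 1} = 1.4907^2 \quad \text{and} \quad s_z^2 = \frac{\sum_{i=1}^{10} (z_i - \bar{z})^2}{10 - 1} = 7.9303^2.$$

Then,

$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{11}{1.4907 \times 7.9303} \approx 0.93.$$

Note: $r_{xz}$ is scale-invariant. For example, even if the sales volume is measured with 1 unit = 1 pack, the value of $r_{xz}$ is still the same, 0.93.

Example:

Let $z_i = 2x_i$, $i = 1, 2, 3, 4, 5$.

$x_i$    1    2    3    4    5
$z_i$    2    4    6    8   10

Then,

$$\bar{x} = 3, \quad \bar{z} = 6, \quad s_x = \sqrt{\frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1}} = \sqrt{\frac{5}{2}}, \quad s_z = \sqrt{\frac{\sum_{i=1}^{5} (z_i - \bar{z})^2}{5 - 1}} = \sqrt{10},$$

$$s_{xz} = \frac{\sum_{i=1}^{5} (x_i - \bar{x})(z_i - \bar{z})}{5 - 1} = 5.$$

Thus,

$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{5}{\sqrt{5/2}\,\sqrt{10}} = 1.$$

Note: when there is a perfect positive linear relationship between the variables $x$ and $z$, then $r_{xz} = 1$. A value of $r_{xz}$ close to 1 might indicate a positive linear relationship.

Online Exercise: Exercise 4.2.1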
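The two correlation examples can be reproduced with a small sketch. It is an illustration only, written in plain Python with the standard `math` module; the function name `sample_correlation` and the variable names are hypothetical and not part of the notes.

```python
import math


def sample_correlation(x, z):
    """Sample correlation r_xz = s_xz / (s_x * s_z), using n - 1 denominators."""
    n = len(x)
    if n != len(z) or n < 2:
        raise ValueError("x and z must have the same length, at least 2")
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    s_xz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_z = math.sqrt(sum((zi - z_bar) ** 2 for zi in z) / (n - 1))
    return s_xz / (s_x * s_z)


# Advertisement example: sales in units of 100 packs, then in single packs.
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
z = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
z_packs = [100 * zi for zi in z]

r_units = sample_correlation(x, z)
r_packs = sample_correlation(x, z_packs)
print(round(r_units, 2))  # 0.93
print(round(r_packs, 2))  # 0.93, unchanged by the change of unit

# Perfect positive linear relationship: z_i = 2 * x_i.
x_line = [1, 2, 3, 4, 5]
z_line = [2 * xi for xi in x_line]
r_line = sample_correlation(x_line, z_line)
print(round(r_line, 6))  # 1.0
```

Results are rounded before printing because floating-point arithmetic may leave the last digits slightly off the exact values. In Python 3.10 and later, the standard library functions `statistics.covariance` and `statistics.correlation` could be used instead of the hand-written formula.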