Chapter 4 Association Between Two Variables

In this chapter, we introduce several methods for measuring the association between two variables.

4.1 Crosstabulations and Scatter Diagrams:

A crosstabulation (a table) and a scatter diagram (a graph) can help us understand the relationship between two variables. Suppose we have the following scatter diagrams for the weights and heights of students:

[Figure: Scatter Diagram of Weight vs. Height. Three panels (left, middle, right). Axes: weight, height.]

The left scatter diagram indicates a positive relationship between weight and height, while the right scatter diagram indicates a negative relationship between the two variables. The middle scatter diagram shows no apparent relationship between weight and height.

4.2 Numerical Measures of Association:

There are several numerical measures of association. We first introduce the covariance of two variables.

(I) Covariance:

Suppose we have two populations, population 1: $y_1, y_2, \ldots, y_N$ and population 2: $w_1, w_2, \ldots, w_N$. Also, let sample 1, $x_1, x_2, \ldots, x_n$, and sample 2, $z_1, z_2, \ldots, z_n$, be drawn from population 1 and population 2, respectively. Let $\mu_y$ and $\mu_w$ be the population means of populations 1 and 2, respectively. Let

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad \bar{z} = \frac{\sum_{i=1}^{n} z_i}{n}$$

be the sample means of samples 1 and 2, respectively. Then, the population covariance is

$$\sigma_{yw} = \frac{\sum_{i=1}^{N} (y_i - \mu_y)(w_i - \mu_w)}{N},$$

while the sample covariance is

$$s_{xz} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n - 1} = \frac{\sum_{i=1}^{n} x_i z_i - n\,\bar{x}\,\bar{z}}{n - 1}.$$

Intuitively, $s_{xz}$ is large and positive when the observations in the two samples tend to lie above their sample means, or below them, at the same time. That is, the observations are positively correlated. On the other hand, $s_{xz}$ is large in magnitude and negative when the observations in one sample tend to lie above their sample mean while the corresponding observations in the other sample lie below theirs.
In that case, the observations are negatively correlated. Finally, $s_{xz}$ is close to 0 when observations above the sample mean in one sample are paired with observations that are sometimes above and sometimes below the sample mean in the other sample, i.e., the observations in the two samples are not correlated.

Example:

Let $x_i$ be the total money spent on advertisement for some product and $z_i$ be the sales volume (1 unit = 100 packs).

x_i                        2    5    1    3    4    1    5    3    4    2
z_i                       50   57   41   54   54   38   63   48   59   46
(x_i - x̄)(z_i - z̄)        1   12   20    0    3   26   24    0    8    5

Here $\bar{x} = 3$ and $\bar{z} = 51$, so

$$s_{xz} = \frac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{10 - 1} = \frac{99}{9} = 11.$$

Note: $s_{xz}$ is not scale-invariant. For example, if the sales volume in the above example is measured with 1 unit = 1 pack, then the $z_i$ become 5000, 5700, 4100, 5400, 5400, 3800, 6300, 4800, 5900, 4600. Thus, $s_{xz}$ becomes 1100, which is 100 times larger than the original value. This is undesirable, since the strength of the association between the money spent on advertisement and the sales volume should not change when the measurement unit changes. The quantity introduced next is scale-invariant and can be used to measure the correlation of two populations.

(II) Correlation Coefficient:

Let

$\sigma_y$: population standard deviation for $y_1, y_2, \ldots, y_N$
$\sigma_w$: population standard deviation for $w_1, w_2, \ldots, w_N$
$s_x$: sample standard deviation for $x_1, x_2, \ldots, x_n$
$s_z$: sample standard deviation for $z_1, z_2, \ldots, z_n$.

Then, the population correlation coefficient is

$$\rho_{yw} = \frac{\sigma_{yw}}{\sigma_y \sigma_w},$$

while the sample correlation coefficient is

$$r_{xz} = \frac{s_{xz}}{s_x s_z}.$$

Note: $|\rho_{yw}| \le 1$ and $|r_{xz}| \le 1$.

Example (continued):

$$s_x = \sqrt{\frac{\sum_{i=1}^{10} (x_i - \bar{x})^2}{10 - 1}} = 1.4907 \quad \text{and} \quad s_z = \sqrt{\frac{\sum_{i=1}^{10} (z_i - \bar{z})^2}{10 - 1}} = 7.9303.$$

Then,

$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{\sqrt{\sum_{i=1}^{10} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{10} (z_i - \bar{z})^2}} = 0.93.$$

Note: $r_{xz}$ is scale-invariant. For example, even if the sales volume is measured with 1 unit = 1 pack, the value of $r_{xz}$ is still the same, 0.93.

Example:

Let $z_i = 2x_i$, $i = 1, 2, 3, 4, 5$.

x_i    1    2    3    4    5
z_i    2    4    6    8   10

Then, $\bar{x} = 3$, $\bar{z} = 6$,

$$s_{xz} = \frac{\sum_{i=1}^{5} (x_i - \bar{x})(z_i - \bar{z})}{5 - 1} = 5, \qquad s_x = \sqrt{\frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1}} = \sqrt{\frac{5}{2}},$$
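$$s_z = \sqrt{\frac{\sum_{i=1}^{5} (z_i - \bar{z})^2}{5 - 1}} = \sqrt{10}.$$

Thus,

$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{5}{\sqrt{5/2}\,\sqrt{10}} = 1.$$

Note: when there is a perfect positive linear relationship between the variables x and z, $r_{xz} = 1$. A value of $r_{xz}$ close to 1 might indicate a positive linear relationship.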
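
The worked examples in this chapter can be checked numerically. The following is a minimal sketch in Python, using only the standard library. The function names `sample_covariance` and `sample_correlation` are hypothetical helpers introduced here for illustration; they are not part of the original notes. The data are the advertisement and sales values from the examples, together with the sales values re-expressed in single packs as listed in the note on scale.

```python
from math import sqrt


def sample_covariance(x, z):
    """Sample covariance s_xz, using the n - 1 divisor."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    return sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / (n - 1)


def sample_correlation(x, z):
    """Sample correlation r_xz = s_xz / (s_x * s_z)."""
    s_x = sqrt(sample_covariance(x, x))  # sample standard deviation of x
    s_z = sqrt(sample_covariance(z, z))  # sample standard deviation of z
    return sample_covariance(x, z) / (s_x * s_z)


# Advertisement example (hypothetical variable names)
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
z = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
# Same sales, measured with 1 unit = 1 pack (values from the note on scale)
z_packs = [5000, 5700, 4100, 5400, 5400, 3800, 6300, 4800, 5900, 4600]

print(sample_covariance(x, z))                  # 11.0
print(sample_covariance(x, z_packs))            # 1100.0
print(round(sample_correlation(x, z), 2))       # 0.93
print(round(sample_correlation(x, z_packs), 2)) # 0.93

# Perfect linear relationship z_i = 2 * x_i
x_lin = [1, 2, 3, 4, 5]
z_lin = [2 * xi for xi in x_lin]
print(round(sample_correlation(x_lin, z_lin), 6))  # 1.0
```

The covariance changes with the measurement unit while the correlation does not, which is the point of the two notes on scale. Computing each standard deviation as the square root of a variable's covariance with itself keeps the sketch short; a library routine would normally be preferred in practice.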