Chapter 4 Association Between Two Variables

In this chapter, we introduce several methods for measuring the association between two variables.
4.1 Crosstabulations and Scatter Diagrams:
The crosstabulation (table) and the scatter diagram (graph) can help us understand the
relationship between two variables.
Suppose we have the following scatter diagrams for the weights and heights of the
students:
[Figure: Scatter Diagram of Weight vs. Height. Three panels (left, middle, right), each with axes labeled weight and height.]
The left scatter diagram indicates a positive relationship between weight and height,
while the right scatter diagram implies a negative relationship between the two
variables. The middle scatter diagram shows no apparent relationship
between weight and height.
4.2 Numerical Measures of Association:
There are several numerical measures of association. We first introduce the covariance
of two variables.
(I) Covariance:
Suppose we have two populations,
population 1: $y_1, y_2, \ldots, y_N$
and population 2: $w_1, w_2, \ldots, w_N$.
Also, let
sample 1: $x_1, x_2, \ldots, x_n$
and sample 2: $z_1, z_2, \ldots, z_n$
be drawn from population 1 and population 2, respectively.
Let $\mu_y$ and $\mu_w$ be the population means of populations 1 and 2, respectively. Let
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad \bar{z} = \frac{\sum_{i=1}^{n} z_i}{n}$$
be the sample means of samples 1 and 2, respectively.
Then, the population covariance is
$$\sigma_{yw} = \frac{\sum_{i=1}^{N} (y_i - \mu_y)(w_i - \mu_w)}{N},$$
while the sample covariance is
$$s_{xz} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n-1} = \frac{\sum_{i=1}^{n} x_i z_i - n\bar{x}\bar{z}}{n-1}.$$
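The second form of the sample covariance is a computational shortcut. As a brief sketch of why the two forms agree, expand the product and use $\sum x_i = n\bar{x}$ and $\sum z_i = n\bar{z}$:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})
  &= \sum_{i=1}^{n} x_i z_i - \bar{z}\sum_{i=1}^{n} x_i - \bar{x}\sum_{i=1}^{n} z_i + n\bar{x}\bar{z} \\
  &= \sum_{i=1}^{n} x_i z_i - n\bar{x}\bar{z} - n\bar{x}\bar{z} + n\bar{x}\bar{z} \\
  &= \sum_{i=1}^{n} x_i z_i - n\bar{x}\bar{z}.
\end{aligned}
```

Dividing both sides by $n-1$ gives the identity used in the definition.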
Intuitively, $s_{xz}$ is large and positive when the observations in the two samples tend to lie
above their sample means, or below them, simultaneously. In that case, the observations
are positively correlated. On the other hand, $s_{xz}$ is large in magnitude and negative when
the observations in one sample tend to be larger than their sample mean while the corresponding
observations in the other sample are smaller than theirs. In that case, the observations are
negatively correlated. Finally, $s_{xz}$ is close to 0 when the observations in one
sample are larger than their sample mean while the corresponding observations in the other
sample are sometimes larger and sometimes smaller than theirs, i.e., the
observations in the two samples are not correlated.
Example:
Let $x_i$ be the total money spent on advertising some product and $z_i$ be the
sales volume (1 unit = 1000 packs).
x_i                         2    5    1    3    4    1    5    3    4    2
z_i                        50   57   41   54   54   38   63   48   59   46
(x_i - x̄)(z_i - z̄)          1   12   20    0    3   26   24    0    8    5
With $\bar{x} = 3$ and $\bar{z} = 51$,
$$s_{xz} = \frac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{10 - 1} = \frac{99}{10 - 1} = 11.$$
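The calculation in this example can be checked with a short Python sketch. The function names `sample_covariance` and `sample_covariance_shortcut` are illustrative choices, not part of the original notes; the data are the ten observations from the example.

```python
def sample_covariance(x, z):
    """Sample covariance from the definition, using the n - 1 divisor."""
    if len(x) != len(z) or len(x) < 2:
        raise ValueError("need two samples of equal length with n >= 2")
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    cross = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    return cross / (n - 1)


def sample_covariance_shortcut(x, z):
    """Same quantity via (sum of x_i * z_i - n * x_bar * z_bar) / (n - 1)."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    return (sum(xi * zi for xi, zi in zip(x, z)) - n * x_bar * z_bar) / (n - 1)


# Data from the example: advertising spend x_i and sales volume z_i.
advertising = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

s_xz = sample_covariance(advertising, sales)
s_xz_shortcut = sample_covariance_shortcut(advertising, sales)
print(s_xz, s_xz_shortcut)  # 11.0 11.0
```

Both forms give the same value, matching the hand calculation 99 / 9 = 11.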
Note: $s_{xz}$ is not scale-invariant. For example, suppose the sales volume in the above
example is measured with 1 unit = 1 pack. Then the $z_i$ would be 50000, 57000, 41000, 54000,
54000, 38000, 63000, 48000, 59000, 46000, and $s_{xz}$ would be 11000, which is 1000 times larger
than the original value. This is undesirable, since the measured association between the total
money spent on advertising and the sales volume should not change when the measurement unit changes.
The quantity introduced next is scale-invariant and can be used to measure the
correlation of two populations.
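The scale dependence of the sample covariance can be seen directly from its definition. As a short sketch, let $c > 0$ be a unit-conversion factor (a symbol introduced here for illustration) and replace each $z_i$ by $c z_i$, so that the sample mean becomes $c\bar{z}$:

```latex
s_{x,cz}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(c z_i - c\bar{z})}{n-1}
  = c \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n-1}
  = c \, s_{xz}.
```

Changing the measurement unit of one variable by a factor $c$ therefore changes the covariance by the same factor.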
(II) Correlation Coefficient:
Let
$\sigma_y$: population standard deviation for $y_1, y_2, \ldots, y_N$
$\sigma_w$: population standard deviation for $w_1, w_2, \ldots, w_N$
$s_x$: sample standard deviation for $x_1, x_2, \ldots, x_n$
$s_z$: sample standard deviation for $z_1, z_2, \ldots, z_n$.
Then, the population correlation coefficient is
$$\rho_{yw} = \frac{\sigma_{yw}}{\sigma_y \sigma_w},$$
while the sample correlation coefficient is
$$r_{xz} = \frac{s_{xz}}{s_x s_z}.$$
Note: $|\rho_{yw}| \le 1$ and $|r_{xz}| \le 1$.
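One way to see why the sample correlation coefficient cannot exceed 1 in absolute value is the Cauchy-Schwarz inequality. This argument is a supplementary sketch, not part of the original notes. Apply the inequality to the deviations $a_i = x_i - \bar{x}$ and $b_i = z_i - \bar{z}$:

```latex
\left( \sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z}) \right)^{2}
  \le \sum_{i=1}^{n} (x_i - \bar{x})^{2} \, \sum_{i=1}^{n} (z_i - \bar{z})^{2}
\quad\Longrightarrow\quad
s_{xz}^{2} \le s_x^{2} \, s_z^{2}
\quad\Longrightarrow\quad
|r_{xz}| = \frac{|s_{xz}|}{s_x s_z} \le 1.
```

The same argument with the divisor $N$ and the population means gives the bound for the population correlation coefficient.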
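Example (continued):
$$s_x = \sqrt{\frac{\sum_{i=1}^{10} (x_i - \bar{x})^2}{10 - 1}} = 1.4907 \quad \text{and} \quad s_z = \sqrt{\frac{\sum_{i=1}^{10} (z_i - \bar{z})^2}{10 - 1}} = 7.9303.$$
Then,
$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{\sqrt{\sum_{i=1}^{10} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{10} (z_i - \bar{z})^2}} = 0.93.$$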
Note: $r_{xz}$ is scale-invariant. For example, even if the sales volume is measured with
1 unit = 1 pack, the value of $r_{xz}$ is still the same, 0.93.
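The scale invariance noted above can be checked numerically. The following is a minimal sketch; the function name `sample_correlation` is an illustrative choice, and the conversion factor 1000 assumes the unit stated in the example (1 unit = 1000 packs).

```python
import math


def sample_correlation(x, z):
    """Sample correlation coefficient r_xz = s_xz / (s_x * s_z)."""
    if len(x) != len(z) or len(x) < 2:
        raise ValueError("need two samples of equal length with n >= 2")
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    sum_zz = sum((zi - z_bar) ** 2 for zi in z)
    sum_xz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    # The (n - 1) divisors in s_xz, s_x and s_z cancel in the ratio.
    return sum_xz / math.sqrt(sum_xx * sum_zz)


# Data from the advertising example.
advertising = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales_units = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

# Assumed conversion: 1 unit = 1000 packs, as stated in the example.
sales_packs = [1000 * z for z in sales_units]

r_units = sample_correlation(advertising, sales_units)
r_packs = sample_correlation(advertising, sales_packs)
print(round(r_units, 2), round(r_packs, 2))  # 0.93 0.93
```

Rescaling the sales volume leaves the correlation coefficient unchanged up to floating-point rounding.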
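Example:
Let $z_i = 2x_i$, $i = 1, 2, 3, 4, 5$.
x_i     1    2    3    4    5
z_i     2    4    6    8   10
Then,
$$\bar{x} = 3, \quad \bar{z} = 6, \quad s_x = \sqrt{\frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1}} = \sqrt{\frac{5}{2}}, \quad s_z = \sqrt{\frac{\sum_{i=1}^{5} (z_i - \bar{z})^2}{5 - 1}} = \sqrt{10},$$
$$s_{xz} = \frac{\sum_{i=1}^{5} (x_i - \bar{x})(z_i - \bar{z})}{5 - 1} = 5.$$
Thus,
$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{5}{\sqrt{\frac{5}{2}} \, \sqrt{10}} = 1.$$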
Note: when there is a perfect positive linear relationship between the variables x and z,
$r_{xz} = 1$. A value of $r_{xz}$ close to 1 might indicate a positive linear relationship.
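The result of the last example holds for any exact linear relationship, not only $z_i = 2x_i$. As a sketch, suppose $z_i = a + b x_i$ with constants $a$ and $b \neq 0$ (symbols introduced here for illustration), and assume the $x_i$ are not all equal so that $s_x > 0$. Then $\bar{z} = a + b\bar{x}$ and $z_i - \bar{z} = b(x_i - \bar{x})$, so

```latex
s_{xz} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) \, b (x_i - \bar{x})}{n-1} = b \, s_x^{2},
\qquad
s_z = \sqrt{\frac{\sum_{i=1}^{n} b^{2} (x_i - \bar{x})^{2}}{n-1}} = |b| \, s_x,
\qquad
r_{xz} = \frac{b \, s_x^{2}}{s_x \cdot |b| \, s_x} = \frac{b}{|b|}.
```

Hence $r_{xz} = 1$ when $b > 0$ and $r_{xz} = -1$ when $b < 0$; the example above is the case $a = 0$, $b = 2$.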