Nov. 13, 2007 LEC #13 ECON 140A/240A1 L. Phillips
Expected Vs. Observed Frequencies, Contingency Tables & Chi Square
I. Introduction
The Chi Square Distribution can be used to compare expected and observed distributions. There are a number of applications. One example is throwing a die. In this case we have an expected theoretical distribution for each face if the die is fair. We could conduct an experiment and roll a die six hundred times and record which face comes up for each trial and calculate the experimental frequencies. Alternatively, we could simulate such an experiment. Once we have the experimental frequencies for each face we can compare them to the expected frequency of 100. Unless this experiment is extraordinary, the experimental frequencies will differ somewhat from the expected frequencies. The issue is do they differ significantly?
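The simulated version of this experiment can be sketched in a few lines. This is only an illustration: the function name `simulate_die_rolls` and the seed value are assumptions introduced here, not part of the lecture, and the counts it produces will differ from the data file used later in these notes.

```python
import random
from collections import Counter

def simulate_die_rolls(n_rolls, seed=None):
    """Roll a fair six-sided die n_rolls times and tally how often each face appears."""
    rng = random.Random(seed)  # the seed is an assumption, used only for reproducibility
    tally = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    # Report faces 1..6 in order, with 0 for any face that never came up
    return [tally.get(face, 0) for face in range(1, 7)]

observed = simulate_die_rolls(600, seed=140)
expected = 600 / 6  # 100 per face if the die is fair
for face, count in enumerate(observed, start=1):
    print(f"face {face}: observed {count:3d}, expected {expected:.0f}")
```

Each run with a different seed gives a different set of experimental frequencies scattered around 100.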
The Chi Square test is based on squaring the difference between the expected frequency and the observed frequency for each face, dividing this square by the expected frequency, and summing over all six faces. Under the null hypothesis that the die is fair, this number is approximately distributed as Chi Square with 5 degrees of freedom. If the observed frequencies equaled the expected frequencies exactly, the statistic would be zero; sampling variation makes it positive. Only if the statistic is significantly large would we reject the null hypothesis in favor of the alternative hypothesis that the observed distribution differs from the expected. We test this at the 5% level.
Another example is searching for a probability model that will fit the observed frequency of the number of men on base when home runs were hit for a particular year in the National League. One possibility is to use the binomial, but a better fit is obtained from the Poisson distribution.
Another application is contingency table analysis, which can be used to test for association or interdependence between variables. An example is a simple two by two table. For example, is there a connection between consumer information and purchasing behavior? This example looks at two kinds of refrigerators purchased, frost-free and not frost-free, and how that varies by whether the consumer knew that frost-free refrigerators consume more electricity. We could look at each of the marginal distributions, for example what fraction purchased frost-free refrigerators and the remaining fraction that did not. We could examine which consumers were informed about electricity use and which were not. The expected cell frequency in the two by two table would be the product of these marginal frequencies if the purchase were independent of the consumer information. We could calculate the four expected cell frequencies using the product of the marginal distributions under the null hypothesis of independence and then compare these to the observed cell frequencies. If refrigerator choice is independent of consumer information, then we should get an insignificant Chi Square statistic.
II. The Multinomial Distribution
The Bernoulli event with only two classes, such as yes or no, or heads versus tails, can be extended to accommodate more classes. The resulting distribution is called the multinomial. Rolling a die is an example, with six possible elementary outcomes for one toss, {1, 2, 3, 4, 5, 6}. If the die is fair, we know the probability of each outcome, P(j), is one sixth.
Consider two tosses of the die, as illustrated partially in Figure 1. We could obtain 36 elementary events {1, 1}, {1, 2}, {1, 3}, ..., {6, 5}, {6, 6}. Using $n_1$ to count the number of ones, etc., if the elementary outcome is {1, 1}, then $n_1 = 2$, $n_2 = 0$, etc., where $\sum_{j=1}^{6} n_j = n$. If the elementary event were {1, 2}, we would have $n_1 = 1$ and $n_2 = 1$, and the rest of the $n_j = 0$. But we could obtain one one and one two in two ways, {1, 2} and {2, 1}, so we have to count the combinations as well.
--------------------------------------------------------------------
Figure 1: Two Throws of a Die, Partially Illustrated.
----------------------------------------------------------------------------
The probability of one one and one two is:
$$P(n_1=1, n_2=1, n_3=0, n_4=0, n_5=0, n_6=0) = \frac{n!}{\prod_{j=1}^{6} n_j!} \prod_{j=1}^{6} p_j^{n_j}$$
$$= \frac{2!}{1!\,1!\,0!\,0!\,0!\,0!} (1/6)^1 (1/6)^1 (1/6)^0 (1/6)^0 (1/6)^0 (1/6)^0 = 2 \cdot (1/36) = 2/36$$
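As a check on this calculation, the multinomial formula can be evaluated directly and compared with a brute-force count over all 36 elementary events. This is a minimal sketch; the function name `multinomial_pmf` is an assumption introduced here for illustration.

```python
from itertools import product
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given category counts in n = sum(counts) trials."""
    n = sum(counts)
    arrangements = factorial(n)
    for c in counts:
        arrangements //= factorial(c)  # n! / (n_1! n_2! ... n_6!)
    return arrangements * prod(p ** c for p, c in zip(probs, counts))

fair_die = [1 / 6] * 6
p_formula = multinomial_pmf([1, 1, 0, 0, 0, 0], fair_die)

# Brute force: enumerate the 36 ordered outcomes of two throws and
# count those showing exactly one 1 and one 2, i.e. (1, 2) and (2, 1).
hits = sum(1 for roll in product(range(1, 7), repeat=2) if sorted(roll) == [1, 2])
p_enumerated = hits / 36

print(p_formula, p_enumerated)
```

Both routes give 2/36, since the factorial term counts the two orderings that the enumeration finds explicitly.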
III. Expected Versus Observed Frequencies: The Die
Our expectations for the probabilities of each face are listed in Table 1. If we were to simulate this to obtain experimental frequencies for rolling a die 600 times, we might obtain the following empirical (simulated) distribution, from data file XR15-09, as listed in Table 2. The Chi-Square statistic is calculated from this comparison of observed and expected frequencies, squaring the difference, dividing by the expected frequency, and summing these values: $\sum_{j=1}^{6} (O_j - E_j)^2 / E_j$, which is distributed as Chi-Square with 5 degrees of freedom, one degree being lost since the probabilities sum to one, and hence only five are independent.
Table 1: Expected Frequencies For Each Face of the Die in 600 Throws.
Face    Probability    Expected Frequency
1       1/6            100
2       1/6            100
3       1/6            100
4       1/6            100
5       1/6            100
6       1/6            100
---------------------------------------------------------------------------------------
Table 2: Observed Versus Expected Frequencies, for Die Faces
Face    Probability    Expected    Observed
1       1/6            100         114
2       1/6            100         92
3       1/6            100         84
4       1/6            100         101
5       1/6            100         107
6       1/6            100         107
The difference between observed and expected frequencies is reported in Table 3, along with each cell’s contribution to the Chi-Square statistic.
Table 3: Simulated Frequencies Compared to Theoretical
Face    Observed, O_j    Expected, E_j    O_j - E_j    (O_j - E_j)^2 / E_j
1       114              100               14          196/100 = 1.96
2       92               100               -8          64/100 = 0.64
3       84               100              -16          256/100 = 2.56
4       101              100                1          1/100 = 0.01
5       107              100                7          49/100 = 0.49
6       107              100                7          49/100 = 0.49
$\chi^2 = 1.96 + 0.64 + 2.56 + 0.01 + 0.49 + 0.49 = 6.15$
------------------------------------------------------------------------------------
The calculated statistic, 6.15, is below the critical value of 11.07 for five degrees of freedom at a level of significance of 5%, so there is no significant difference between the theoretical distribution and the simulated distribution. The Chi-Square distribution for five degrees of freedom is illustrated in Figure 2.
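The arithmetic behind this test is short enough to reproduce directly. The sketch below uses the observed counts from Table 2 and takes the critical value of 11.07 from the text rather than computing it; the function name `chi_square_statistic` is an assumption introduced for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [114, 92, 84, 101, 107, 107]  # Table 2, faces 1 through 6
expected = [100] * 6                     # fair die, expected frequency per face
critical_value = 11.07                   # 5% level, 5 degrees of freedom (from the text)

statistic = chi_square_statistic(observed, expected)
reject_fair_die = statistic > critical_value
print(f"chi-square = {statistic:.2f}, reject fair die: {reject_fair_die}")
```

The statistic comes out at 6.15, as in Table 3, so the hypothesis of a fair die is not rejected.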
Figure 2: Chi-Square Density for 5 Degrees of Freedom (horizontal axis: Chi Square Variable; the upper 5% tail begins at 11.07).
IV. Searching For an Appropriate Probability Model
The number of men on base when a home run is hit during a season in the
National League is given in Table 4. This example is from Yvonne Bishop, Stephen
Fienberg, and Paul Holland, Discrete Multivariate Analysis, Theory and Practice.
-----------------------------------------------------------------------------
Table 4: Number of Men on Base
Men On Base          0        1        2        3        Sum
Observed #           421      227      96       21       765
Observed Fraction    0.550    0.298    0.125    0.027    1
In the search for a probability model that could explain the observed frequencies, the binomial comes to mind. We can check how well the binomial fits the observed data.
First, we need an estimate of the parameter p for the probability of getting on base. We use the average number of successes, np, which we estimate by the average number of men on base, weighting each category (the number of men on base) by its observed fraction:
Average # = 0*0.550 + 1*0.298 + 2*0.125 + 3*0.027 = 0.63 = np.
With n = 3 being the maximum number of men on base, we have p = np/n = 0.63/3 = 0.21.
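The estimate of p can be reproduced from the raw counts in Table 4. This sketch works from the counts rather than the rounded fractions, so intermediate values may differ in the third decimal place; the variable names are assumptions introduced for illustration.

```python
men_on_base = [0, 1, 2, 3]
observed_counts = [421, 227, 96, 21]  # Table 4

total = sum(observed_counts)
observed_fractions = [count / total for count in observed_counts]

# Weighted average of the category values, weights = observed fractions
mean_on_base = sum(k * f for k, f in zip(men_on_base, observed_fractions))

n_bases = 3                       # at most three men can be on base
p_hat = mean_on_base / n_bases    # since the binomial mean is n * p

print(f"mean = {mean_on_base:.2f}, p = {p_hat:.2f}")
```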
Using the binomial distribution, where the random variable k is the number of men on base and the number of trials, n, is the number of bases that can be occupied, i.e. n = 3:
$P(k=0) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k} = \frac{3!}{0!\,3!} (0.21)^0 (0.79)^3 = 0.493$
$P(k=1) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k} = \frac{3!}{1!\,2!} (0.21)^1 (0.79)^2 = 0.393$
$P(k=2) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k} = \frac{3!}{2!\,1!} (0.21)^2 (0.79)^1 = 0.105$
$P(k=3) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k} = \frac{3!}{3!\,0!} (0.21)^3 (0.79)^0 = 0.009$
These binomial fractions are compared to the observed fractions in Table 5. Multiplying the binomial fractions by the total number of observations, 765, gives binomial frequencies that can be compared to the observed frequencies.
Table 5: Men on Base, Observed Vs. Binomial Frequencies
Men on Base            0        1        2        3        Sum
Observed Fraction      0.550    0.298    0.125    0.027    1
Binomial Fraction      0.493    0.393    0.105    0.009    1
Observed Frequency     421      227      96       21       765
Binomial Frequency     377.1    300.6    80.3     6.9      764.9
(O_j - E_j)            43.9     -73.6    15.7     14.1
(O_j - E_j)^2 / E_j    5.1      18.0     2.6      28.8     54.5
The statistic is $\chi^2_3 = 54.5$, with critical value $\chi^2_3(0.05) = 7.81$, where there are three degrees of freedom since the fractions for the four categories (0, 1, 2, and 3 men on base) add to one. The Chi-Square distribution for three degrees of freedom is illustrated in Figure 3. Since 54.5 far exceeds 7.81, we conclude that the binomial distribution does not fit the data well, i.e. does not explain the observed distribution of the number of men on base when a home run is hit.
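The binomial fit can be re-run as a short script. This is a sketch under the notes' own assumptions (n = 3, p = 0.21, critical value 7.81 taken from the text); the function name `binomial_pmf` is introduced here for illustration. Because it carries unrounded probabilities through the calculation, its statistic may differ slightly from the 54.5 in Table 5, which is built from rounded entries, but the conclusion is the same.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

observed = [421, 227, 96, 21]   # Table 4: home runs with 0, 1, 2, 3 men on base
n_trials, p = 3, 0.21           # estimates used in the notes
critical_value = 7.81           # 5% level, 3 degrees of freedom (from the text)

total = sum(observed)
expected = [total * binomial_pmf(k, n_trials, p) for k in range(n_trials + 1)]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"chi-square = {statistic:.1f}, critical value = {critical_value}")
print("binomial rejected" if statistic > critical_value else "binomial not rejected")
```

The largest contributions come from the categories with one and three men on base, where the binomial over-predicts and under-predicts respectively.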
-------------------------------------------------------------------------------------------
Figure 3: Chi-Square Density for Three Degrees of Freedom (horizontal axis: Chi-Square Variable; the upper 5% tail begins at 7.81).
----------------------------------------------------------------------------------------
An alternative probability model is the Poisson, introduced in Lecture Eleven, with only one parameter, $\lambda$, instead of two, n and p, for the binomial. Since $\lambda$ is the average number of successes, we calculate the probabilities of zero, one, and two successes for $\lambda$ equal to 0.63, using the estimated average for np from above:
$P(k=0) = e^{-\lambda} \lambda^k / k! = e^{-0.63} (0.63)^0 / 0! = e^{-0.63} = 0.5326$
$P(k=1) = e^{-\lambda} \lambda^k / k! = e^{-0.63} (0.63)^1 / 1! = e^{-0.63} (0.63) = 0.3355$
$P(k=2) = e^{-\lambda} \lambda^k / k! = e^{-0.63} (0.63)^2 / 2! = e^{-0.63} (0.63)^2 / 2 = 0.1057$
Since the probabilities for the four categories of number of men on base should add to one, we approximate the probability of three men on base as:
$P(k=3) = 1 - P(k=0) - P(k=1) - P(k=2) = 1 - 0.5326 - 0.3355 - 0.1057 = 0.0262$
Table 6 lists the observed fraction and the Poisson fraction and the observed frequencies and the Poisson frequencies.
-------------------------------------------------------------------------
Table 6: Men On Base, Observed Versus Poisson Frequencies
Men On Base            0         1         2         3         Sum
Observed Fraction      0.550     0.298     0.125     0.027     1
Poisson Fraction       0.5326    0.3355    0.1057    0.0262    1
Observed Frequency     421       227       96        21        765
Poisson Frequency      407.4     256.7     80.9      20.0      765
(O_j - E_j)^2 / E_j    0.454     3.44      2.82      0.05      6.76
$\chi^2_3 = 6.76$, with $\chi^2_3(0.05) = 7.81$.
Since the statistic of 6.76 is below the critical value of 7.81, we accept the Poisson as a satisfactory model for the number of men on base when a home run is hit.
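The Poisson comparison can be reproduced the same way. The sketch below follows the notes in assigning the remaining probability to the three-men-on-base category and in taking 7.81 as the critical value; the function name `poisson_pmf` is an assumption introduced for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Poisson probability of k events when the mean number of events is lam."""
    return exp(-lam) * lam ** k / factorial(k)

observed = [421, 227, 96, 21]   # Table 4
lam = 0.63                      # estimated mean number of men on base
critical_value = 7.81           # 5% level, 3 degrees of freedom (from the text)

probs = [poisson_pmf(k, lam) for k in range(3)]
probs.append(1 - sum(probs))    # remainder assigned to three men on base, as in the notes

total = sum(observed)
expected = [total * p for p in probs]
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"chi-square = {statistic:.2f}, critical value = {critical_value}")
```

The result agrees with Table 6 to within rounding, and falls below the critical value used in the text.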
V. Two By Two (2x2) Contingency Tables
The cross-classification of purchases of frost-free and not frost-free refrigerators by consumer information, i.e. aware that frost-free refrigerators use more electricity, or not aware, is illustrated in Table 7. This example, modified, is taken from Richard L.
Mills, Statistics for Applied Economics and Business (1977), McGraw-Hill.
----------------------------------------------------------------------------
Table 7: Refrigerator Purchase Versus Consumer Information
                           Consumer Knew Frost-Free    Consumer    Totals
                           Uses More Electricity       Unaware
Purchase Frost-Free        314                         118         432
Purchase Not Frost-Free    226                         62          288
Totals                     540                         180         720
The marginal fractional distributions are calculated from the row sums as a fraction of the grand total, and from the column sums as a fraction of the grand total and are shown in Table 8.
Table 8: Refrigerator Purchase Vs. Consumer Information, Marginal Probabilities
                           Consumer Knew Frost-Free    Consumer    Totals
                           Uses More Electricity       Unaware
Purchase Frost-Free                                                0.6
Purchase Not Frost-Free                                            0.4
Totals                     0.75                        0.25        1.0
Under the assumption of independence, each cell fraction, $P_{ij}$, is calculated as the product of the corresponding row fraction, $P_i$, and the corresponding column fraction, $P_j$, i.e.
$P_{ij} = P_i * P_j$.
For example, for the cell in row one and column two, with row fraction $P_1 = 0.6$ and column fraction $P_2 = 0.25$,
$P_{12} = P_1 * P_2 = 0.6 * 0.25 = 0.15$.
These calculated cell probabilities are shown in Table 9:
---------------------------------------------------------------------------------------------
Table 9: Refrigerator Purchase Vs. Consumer Information, Cell Probabilities
                           Consumer Knew Frost-Free    Consumer    Totals
                           Uses More Electricity       Unaware
Purchase Frost-Free        0.45                        0.15        0.6
Purchase Not Frost-Free    0.3                         0.1         0.4
Totals                     0.75                        0.25        1.0
The expected cell counts, under the null hypothesis of independence between consumer purchase and consumer information, are calculated as the product of the cell probabilities in Table 9 and the grand total of 720 purchases. Note that the four cell probabilities add to one. The expected cell counts are listed in Table 10. Note, for example, that the expected cell count in row one and column one, $E_{11}$, is:
$E_{11} = 0.45 * 720 = 324$.
This expected cell count for row one and column one is greater than the observed cell count of 314, indicating that informed consumers bought somewhat fewer frost-free refrigerators than independence would predict. The remaining question is whether that difference is significant.
Table 10: Refrigerator Purchase Vs. Consumer Information, Expected Cell Counts
                           Consumer Knew Frost-Free    Consumer    Totals
                           Uses More Electricity       Unaware
Purchase Frost-Free        324                         108         432
Purchase Not Frost-Free    216                         72          288
Totals                     540                         180         720
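The expected counts in Table 10 follow mechanically from the row and column totals of the observed table. A minimal sketch, using the observed counts from Table 7 (the variable names are assumptions introduced for illustration):

```python
observed = [
    [314, 118],  # purchased frost-free: informed, unaware
    [226, 62],   # purchased not frost-free: informed, unaware
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Under independence, expected count = (row fraction) * (column fraction) * grand total,
# which simplifies to row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
```

Working from totals rather than rounded fractions avoids any rounding in the cell probabilities of Table 9.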
Lastly, in Table 11, for each cell, the contribution to Chi-Square, $(O_{ij} - E_{ij})^2 / E_{ij}$, is calculated from the observed cell counts in Table 7 and the expected cell counts under independence in Table 10.
---------------------------------------------------------------------------------------------------
Table 11: Refrigerator Purchase Vs. Consumer Information, Contribution to $\chi^2$
                           Consumer Knew Frost-Free    Consumer
                           Uses More Electricity       Unaware
Purchase Frost-Free        0.31                        0.93
Purchase Not Frost-Free    0.46                        1.39
Total: $\chi^2 = 0.31 + 0.93 + 0.46 + 1.39 = 3.09$.
Only one of the two column probabilities is independent, since they sum to one, and the same is true of the two row probabilities, so under the null hypothesis of independence, only one of the four cell probabilities is independent, the one that is the product of the two independent marginal probabilities. There is therefore one degree of freedom. At the 5% level of significance, the critical value of Chi-Square with one degree of freedom is 3.84. Since 3.09 is below this value, we would accept the hypothesis of no association between consumer purchase behavior and consumer product information. The Chi-Square distribution for one degree of freedom is shown in Figure 4.
-------------------------------------------------------------------------------
Figure 4: Chi-Square Density, One Degree of Freedom (vertical axis: Density; horizontal axis: Chi-Square Variable; the upper 5% tail is marked).
------------------------------------------------------------------------------------------------
The same procedure is applied to two-way tables with I rows and J columns, where the number of degrees of freedom is (I-1)*(J-1).
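The general procedure for an I by J table can be sketched as a single function that returns both the statistic and the degrees of freedom. The function name `contingency_chi_square` is an assumption introduced for illustration; the statistic still has to be compared with a tabulated critical value for the returned degrees of freedom.

```python
def contingency_chi_square(table):
    """Chi-square statistic and degrees of freedom for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# The refrigerator example from Table 7
statistic, dof = contingency_chi_square([[314, 118], [226, 62]])
print(f"chi-square = {statistic:.2f}, degrees of freedom = {dof}")
```

For the refrigerator table this reproduces the statistic of 3.09 with one degree of freedom.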