Lecture 13 - UCSB Economics

Nov. 13, 2007 LEC #13 ECON 140A/240A1 L. Phillips

Expected Vs. Observed Frequencies, Contingency Tables & Chi Square

I.

Introduction

The Chi Square distribution can be used to compare expected and observed distributions, and it has a number of applications. One example is throwing a die. If the die is fair, we have an expected theoretical frequency for each face. We could conduct an experiment, rolling a die six hundred times, recording which face comes up on each trial, and calculating the experimental frequencies. Alternatively, we could simulate such an experiment. Once we have the experimental frequency for each face, we can compare it to the expected frequency of 100. Unless the experiment is extraordinary, the experimental frequencies will differ somewhat from the expected frequencies. The issue is whether they differ significantly.

The Chi Square test is based on squaring the difference between the expected frequency and the observed frequency for each face, dividing this square by the expected frequency, and summing over all six faces. Under the null hypothesis, this number is distributed as Chi Square with 5 degrees of freedom. The null hypothesis is that the observed frequencies differ from the expected frequencies only by chance, in which case the statistic will tend to be small; it is exactly zero only if every observed frequency equals its expected frequency. Only if the statistic is significantly large would we accept the alternative hypothesis that the observed distribution differs from the expected. We test this at the 5% level.
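
The simulated version of this experiment can be sketched in a few lines of Python. This is only an illustration, not part of the original lecture: the function names and the seed value are arbitrary assumptions, and the counts it produces will not match the data file used later in these notes.

```python
import random


def simulate_die_counts(n_rolls=600, seed=13):
    """Roll a fair six-sided die n_rolls times and count each face."""
    rng = random.Random(seed)  # the seed is an arbitrary choice, for repeatability only
    counts = [0] * 6
    for _ in range(n_rolls):
        face = rng.randint(1, 6)  # integer from 1 to 6, inclusive
        counts[face - 1] += 1
    return counts


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = simulate_die_counts()
expected = [600 / 6] * 6  # 100 per face for a fair die
statistic = chi_square_statistic(observed, expected)
print(observed, round(statistic, 2))
```

Changing the seed gives a different set of simulated frequencies, and hence a different value of the statistic.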

Another example is searching for a probability model that will fit the observed frequency of the number of men on base when home runs were hit for a particular year in the National League. One possibility is to use the binomial, but a better fit is obtained from the Poisson distribution.

Another application is contingency table analysis, which can be used to test for association or interdependence between variables. An example is a simple two by two table. For example, is there a connection between consumer information and purchasing behavior? This example looks at two kinds of refrigerators purchased, frost-free and not frost-free, and how the choice varies by whether the consumer knew that frost-free refrigerators consume more electricity. We could look at each of the marginal distributions, for example the fraction that purchased frost-free refrigerators and the remaining fraction that did not. Similarly, we could examine which consumers were informed about electricity use and which were not. If the purchase were independent of the consumer information, the expected cell fraction in the two by two table would be the product of the corresponding marginal fractions. We could calculate the four expected cell frequencies using the product of the marginal distributions under the null hypothesis of independence and then compare these to the observed cell frequencies. If refrigerator choice is independent of consumer information, then we should get an insignificant Chi Square statistic.

II.

The Multinomial Distribution

The Bernoulli event with only two classes, such as yes or no, or heads versus tails, can be extended to accommodate more classes. The resulting distribution is called the multinomial. Rolling a fair die is an example, with six possible elementary outcomes for one toss, {1, 2, 3, 4, 5, 6}. If the die is fair, we know the probability of each outcome, P(j), is one sixth.

Consider two tosses of the die, as illustrated partially in Figure 1. We could obtain 36 elementary events {1, 1}, {1, 2}, {1, 3}, ..., {6, 5}, {6, 6}. Using $n_1$ to count the number of ones, etc., if the elementary outcome is {1, 1}, then $n_1 = 2$, $n_2 = 0$, etc., where $\sum_{j=1}^{6} n_j = n$. If the elementary event were {1, 2}, we would have $n_1 = 1$ and $n_2 = 1$, and the rest of the $n_j = 0$. But we could obtain one one and one two in two ways, {1, 2} and {2, 1}, so we have to count the combinations as well.
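
The counting argument above can be checked by brute force. The sketch below, which is an illustration with hypothetical variable names rather than anything from the lecture, enumerates all 36 ordered outcomes of two tosses and picks out those containing exactly one one and one two.

```python
from itertools import product
from math import factorial

# All ordered outcomes of two tosses of a six-sided die.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes with exactly one 1 and one 2, in either order.
favorable = [pair for pair in outcomes if sorted(pair) == [1, 2]]

# With a fair die every ordered outcome has probability 1/36.
probability = len(favorable) / len(outcomes)

# Number of orderings of one 1 and one 2 in two tosses: 2!/(1! 1!).
orderings = factorial(2) // (factorial(1) * factorial(1))

print(len(outcomes), favorable, probability, orderings)
```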

--------------------------------------------------------------------

Figure 1: Two Throws of a Die, Partially Illustrated.

----------------------------------------------------------------------------

The probability of one one and one two is:

$P(n_1 = 1, n_2 = 1, n_3 = 0, n_4 = 0, n_5 = 0, n_6 = 0) = \left[ n! / \prod_{j=1}^{6} n_j! \right] \prod_{j=1}^{6} p_j^{n_j}$

$= \frac{2!}{1!\,1!\,0!\,0!\,0!\,0!} (1/6)^1 (1/6)^1 (1/6)^0 (1/6)^0 (1/6)^0 (1/6)^0 = 2 \times (1/36) = 2/36$

III.

Expected Versus Observed Frequencies: The Die

Our expectations for the probabilities of each face are listed in Table 1. If we were to simulate rolling a die 600 times to obtain experimental frequencies, we might obtain the empirical (simulated) distribution from data file XR15-09, as listed in Table 2. The Chi-Square statistic is calculated from this comparison of observed and expected frequencies, squaring each difference, dividing by the expected frequency, and summing these values: $\sum_{j=1}^{6} (O_j - E_j)^2 / E_j$, which is distributed as Chi-Square with 5 degrees of freedom, one degree being lost since the probabilities sum to one, and hence only five are independent.

Table 1: Expected Frequencies For Each Face of the Die in 600 Throws.

Face    Probability    Expected Frequency
1       1/6            100
2       1/6            100
3       1/6            100
4       1/6            100
5       1/6            100
6       1/6            100

---------------------------------------------------------------------------------------

Table 2: Observed Versus Expected Frequencies, for Die Faces

Face    Probability    Expected    Observed
1       1/6            100         114
2       1/6            100         92
3       1/6            100         84
4       1/6            100         101
5       1/6            100         107
6       1/6            100         107



The difference between observed and expected frequencies is reported in Table 3, along with each cell’s contribution to the Chi-Square statistic.

Table 3: Simulated Frequencies Compared to Theoretical

Face    Observed, $O_j$    Expected, $E_j$    $O_j - E_j$    $(O_j - E_j)^2 / E_j$
1       114                100                14             196/100 = 1.96
2       92                 100                -8             64/100 = 0.64
3       84                 100                -16            256/100 = 2.56
4       101                100                1              1/100 = 0.01
5       107                100                7              49/100 = 0.49
6       107                100                7              49/100 = 0.49

$\chi^2$ = 1.96 + 0.64 + 2.56 + 0.01 + 0.49 + 0.49 = 6.15

------------------------------------------------------------------------------------

The critical value for five degrees of freedom, at a level of significance of 5%, is 11.07. Since the calculated value of 6.15 is less than 11.07, there is no significant difference between the theoretical distribution and the simulated distribution. The Chi-Square distribution for five degrees of freedom is illustrated in Figure 2.



Figure 2: Chi-Square Density for 5 Degrees of Freedom (horizontal axis: Chi Square variable; the upper 5% tail begins at 11.07).
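
The arithmetic of Table 3 can be reproduced with a short Python sketch. The observed counts and the critical value of 11.07 are taken from the text above; the variable names are illustrative assumptions.

```python
# Observed counts per face as listed in Table 2, and 100 expected per face.
observed = [114, 92, 84, 101, 107, 107]
expected = [100] * 6

# Each face's contribution, (O - E)^2 / E, and their sum.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

critical_value = 11.07  # 5% critical value for 5 degrees of freedom, from the text
reject_fair_die = chi_square > critical_value

print([round(c, 2) for c in contributions], round(chi_square, 2), reject_fair_die)
```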

IV.

Searching For an Appropriate Probability Model

The number of men on base when a home run is hit during a season in the National League is given in Table 4. This example is from Yvonne Bishop, Stephen Fienberg, and Paul Holland, Discrete Multivariate Analysis, Theory and Practice.

-----------------------------------------------------------------------------

Table 4: Number of Men on Base

Men On Base          0        1        2        3        Sum
Observed #           421      227      96       21       765
Observed Fraction    0.550    0.298    0.125    0.027    1

In the search for a probability model that could explain the observed frequencies, the binomial comes to mind. We can check how well the binomial fits the observed data.

First, we need an estimate of the parameter p for the probability of getting on base. We use the average number of successes, np, estimating the average number of men on base by weighting the observed fractions by the corresponding integer for the category, the number of men on base:

Average # = 0*0.550 + 1*0.298 + 2*0.125 + 3*0.027 = 0.63 = np,

and with n = 3 being the maximum number on base, we have p = np/n = 0.63/3 = 0.21.
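
As a quick check of this estimate, the sketch below computes the average directly from the observed counts in Table 4 rather than from the rounded fractions, so the unrounded figures differ slightly in the third decimal place. Variable names are illustrative.

```python
# Observed number of home runs with 0, 1, 2 and 3 men on base (Table 4).
men_on_base = [0, 1, 2, 3]
observed_counts = [421, 227, 96, 21]

total = sum(observed_counts)

# Weighted average number of men on base, the estimate of np.
mean_on_base = sum(k * c for k, c in zip(men_on_base, observed_counts)) / total

n = 3  # maximum number of men on base
p_hat = mean_on_base / n

print(total, round(mean_on_base, 2), round(p_hat, 2))
```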

Using the binomial distribution, where the random variable k is the number of men on base and the number of trials, n, is the number of bases, i.e. n = 3,

$P(k=0) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} = \frac{3!}{0!\,3!} (0.21)^0 (0.79)^3 = 0.493$

$P(k=1) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} = \frac{3!}{1!\,2!} (0.21)^1 (0.79)^2 = 0.393$

$P(k=2) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} = \frac{3!}{2!\,1!} (0.21)^2 (0.79)^1 = 0.105$

$P(k=3) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} = \frac{3!}{3!\,0!} (0.21)^3 (0.79)^0 = 0.009$
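
These four probabilities can be reproduced with the following sketch, which assumes the values n = 3 and p = 0.21 from the text; the helper name `binomial_pmf` is hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 3, 0.21  # values used in the text
binomial_fractions = [binomial_pmf(k, n, p) for k in range(n + 1)]

print([round(f, 3) for f in binomial_fractions])
```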

These binomial fractions are compared to the observed fractions in Table 5. Multiplying each binomial fraction by the total number of observations, 765, gives the binomial frequencies, which can be compared to the observed frequencies.

Table 5: Men on Base, Observed Vs. Binomial Frequencies

Men on Base              0        1        2        3        Sum
Observed Fraction        0.550    0.298    0.125    0.027    1
Binomial Fraction        0.493    0.393    0.105    0.009    1
Observed Frequency       421      227      96       21       765
Binomial Frequency       377.1    300.6    80.3     6.9      764.9
$(O_j - E_j)$            43.9     -73.6    15.7     14.1
$(O_j - E_j)^2 / E_j$    5.1      18.0     2.6      28.8     54.5

$\chi^2_3 = 54.5$, with $\chi^2_3(0.05) = 7.81$, where there are three degrees of freedom since the fractions for the four categories (0, 1, 2, and 3 men on base) add to one. The Chi-Square distribution for three degrees of freedom is illustrated in Figure 3. Since 54.5 is far above 7.81, we conclude that the binomial distribution does not fit the data well, i.e. does not explain the observed distribution of the number of men on base when a home run is hit.

-------------------------------------------------------------------------------------------

Figure 3: Chi-Square Density for Three Degrees of Freedom (horizontal axis: Chi-Square variable; the upper 5% tail begins at 7.81).

----------------------------------------------------------------------------------------
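
The binomial goodness-of-fit comparison can be sketched as follows, assuming p = 0.21, n = 3, and the critical value 7.81 from the text. Because the sketch works with unrounded binomial probabilities, its statistic lands within about a unit of the tabulated 54.5 rather than matching it exactly; the conclusion is the same. Variable names are illustrative.

```python
from math import comb

observed = [421, 227, 96, 21]  # home runs with 0, 1, 2, 3 men on base
n, p = 3, 0.21                 # binomial parameters used in the text
total = sum(observed)

# Expected (binomial) frequencies: total times the binomial probability.
expected = [total * comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 7.81  # 5% critical value for 3 degrees of freedom, from the text
binomial_rejected = chi_square > critical_value

print([round(e, 1) for e in expected], round(chi_square, 1), binomial_rejected)
```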

An alternative probability model is the Poisson, introduced in Lecture Eleven, with only one parameter, $\lambda$, instead of two, n and p, for the binomial. Since $\lambda$ is the average number of successes, we calculate the probabilities of zero, one, and two successes for $\lambda$ equal to 0.63, using the estimated average for np from above:

$P(k=0) = e^{-\lambda} \lambda^k / k! = e^{-0.63} (0.63)^0 / 0! = e^{-0.63} = 0.5326$

$P(k=1) = e^{-\lambda} \lambda^k / k! = e^{-0.63} (0.63)^1 / 1! = e^{-0.63} (0.63) = 0.3355$

$P(k=2) = e^{-\lambda} \lambda^k / k! = e^{-0.63} (0.63)^2 / 2! = e^{-0.63} (0.63)^2 / 2 = 0.1057$

Since the probabilities for the four categories of number of men on base should add to one, we approximate the probability of three men on base as:

$P(k=3) = 1 - P(k=0) - P(k=1) - P(k=2) = 1 - 0.5326 - 0.3355 - 0.1057 = 0.0262$
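
A short sketch of the Poisson calculation, assuming the mean of 0.63 from the text; the helper name `poisson_pmf` is hypothetical. As in the text, the last category is assigned whatever probability remains after zero, one, and two men on base.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Poisson probability of exactly k successes when the mean is lam."""
    return exp(-lam) * lam ** k / factorial(k)


lam = 0.63  # estimated average number of men on base, from the text

# Probabilities of 0, 1 and 2 men on base.
poisson_fractions = [poisson_pmf(k, lam) for k in range(3)]

# Probability assigned to 3 men on base: the remainder, so the four sum to one.
poisson_fractions.append(1 - sum(poisson_fractions))

print([round(f, 4) for f in poisson_fractions])
```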

Table 6 lists the observed and Poisson fractions alongside the observed and Poisson frequencies.

-------------------------------------------------------------------------

Table 6: Men On Base, Observed Versus Poisson Frequencies

Men On Base    Observed Fraction    Poisson Fraction    Observed Frequency    Poisson Frequency    $(O_j - E_j)^2 / E_j$
0              0.550                0.5326              421                   407.4                0.454
1              0.298                0.3355              227                   256.7                3.44
2              0.125                0.1057              96                    80.9                 2.82
3              0.027                0.0262              21                    20.0                 0.05
Sum            1                    1                   765                   765                  6.76

$\chi^2_3 = 6.76$, with $\chi^2_3(0.05) = 7.81$.

Since 6.76 is less than 7.81, we accept the Poisson as a satisfactory model for the number of men on base when a home run is hit.
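
The Poisson fit can be checked in the same way as the binomial fit. The sketch below assumes the mean of 0.63 and the critical value 7.81 from the text, and uses unrounded probabilities, so the statistic agrees with the tabulated 6.76 only to within rounding. Variable names are illustrative.

```python
from math import exp, factorial

observed = [421, 227, 96, 21]  # home runs with 0, 1, 2, 3 men on base
lam = 0.63                     # Poisson mean used in the text
total = sum(observed)

# Poisson probabilities for 0, 1, 2 men on base; the remainder goes to 3.
fractions = [exp(-lam) * lam ** k / factorial(k) for k in range(3)]
fractions.append(1 - sum(fractions))

expected = [total * f for f in fractions]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 7.81  # 5% critical value for 3 degrees of freedom, from the text
poisson_rejected = chi_square > critical_value

print([round(e, 1) for e in expected], round(chi_square, 2), poisson_rejected)
```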



V.

Two By Two (2x2) Contingency Tables

The cross-classification of purchases of frost-free and not frost-free refrigerators by consumer information, i.e. aware that frost-free refrigerators use more electricity, or not aware, is illustrated in Table 7. This example, modified, is taken from Richard L. Mills, Statistics for Applied Economics and Business (1977), McGraw-Hill.

----------------------------------------------------------------------------

Table 7: Refrigerator Purchase Versus Consumer Information

                           Consumer Knew Frost-Free    Consumer    Totals
                           Uses More Electricity       Unaware
Purchase Frost-Free        314                         118         432
Purchase Not Frost-Free    226                         62          288
Totals                     540                         180         720

The marginal fractional distributions are calculated from the row sums as a fraction of the grand total, and from the column sums as a fraction of the grand total and are shown in Table 8.

Table 8: Refrigerator Purchase Vs. Consumer Information, Marginal Probabilities

                           Consumer Knew Frost-Free    Consumer    Totals
                           Uses More Electricity       Unaware
Purchase Frost-Free                                                0.6
Purchase Not Frost-Free                                            0.4
Totals                     0.75                        0.25        1.0



Under the assumption of independence, each cell fraction, $P_{ij}$, is calculated as the product of the corresponding row fraction, $P_i$, and the corresponding column fraction, $P_j$, i.e.

$P_{ij} = P_i * P_j$.

For example, for the cell in row one and column two, with row fraction 0.6 and column fraction 0.25,

$P_{12} = P_1 * P_2 = 0.6 * 0.25 = 0.15$.

These calculated cell probabilities are shown in Table 9:

---------------------------------------------------------------------------------------------

Table 9: Refrigerator Purchase Vs. Consumer Information, Cell Probabilities

                           Consumer Knew Frost-Free    Consumer    Totals
                           Uses More Electricity       Unaware
Purchase Frost-Free        0.45                        0.15        0.6
Purchase Not Frost-Free    0.3                         0.1         0.4
Totals                     0.75                        0.25        1.0
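
The marginal fractions and the cell probabilities under independence can be computed from the observed counts in Table 7. The sketch below is illustrative only; the nested-list layout and variable names are assumptions, with rows for the purchase and columns for consumer information.

```python
# Observed counts from Table 7.
# Rows: purchased frost-free, purchased not frost-free.
# Columns: consumer knew about electricity use, consumer unaware.
observed = [[314, 118],
            [226, 62]]

grand_total = sum(sum(row) for row in observed)

# Marginal fractions: row sums and column sums over the grand total.
row_fractions = [sum(row) / grand_total for row in observed]
col_fractions = [sum(col) / grand_total for col in zip(*observed)]

# Under independence each cell probability is row fraction times column fraction.
cell_probabilities = [[r * c for c in col_fractions] for r in row_fractions]

print(row_fractions, col_fractions)
print([[round(p, 2) for p in row] for row in cell_probabilities])
```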

The expected cell counts, under the null hypothesis of independence between consumer purchase and consumer information, are calculated as the product of the cell probabilities in Table 9 and the grand total of 720 purchases. Note that the four cell probabilities add to one. The expected cell counts are listed in Table 10. Note, for example, that the expected cell count in row one and column one, $E_{11}$, is:

$E_{11} = 0.45 * 720 = 324$.

This expected cell count for row one and column one is greater than the observed cell count of 314, showing that informed consumers purchased fewer frost-free refrigerators than independence would predict, i.e. they tended toward the not frost-free refrigerator.

The remaining question is whether that difference is significant.



Table 10: Refrigerator Purchase Vs. Consumer Information, Expected Cell Counts

                           Consumer Knew Frost-Free    Consumer    Totals
                           Uses More Electricity       Unaware
Purchase Frost-Free        324                         108         432
Purchase Not Frost-Free    216                         72          288
Totals                     540                         180         720

Lastly, in Table 11, for each cell, the contribution to Chi-Square, $(O_{ij} - E_{ij})^2 / E_{ij}$, is calculated from the observed cell counts in Table 7 and the expected cell counts under independence in Table 10.

---------------------------------------------------------------------------------------------------

Table 11: Refrigerator Purchase Vs. Consumer Information, Contribution to $\chi^2$

                           Consumer Knew Frost-Free    Consumer
                           Uses More Electricity       Unaware
Purchase Frost-Free        0.31                        0.93
Purchase Not Frost-Free    0.46                        1.39

$\chi^2$ = 0.31 + 0.93 + 0.46 + 1.39 = 3.09.
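
Tables 10 and 11 can be reproduced from the observed counts alone. In the sketch below (illustrative variable names), each expected count is computed as row total times column total divided by the grand total, which is the same as multiplying the cell probability by 720.

```python
# Observed counts from Table 7 (rows: frost-free, not frost-free;
# columns: consumer knew, consumer unaware).
observed = [[314, 118],
            [226, 62]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Contribution of each cell to the Chi-Square statistic.
contributions = [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
                 for obs_row, exp_row in zip(observed, expected)]

chi_square = sum(sum(row) for row in contributions)

print(expected)
print([[round(c, 2) for c in row] for row in contributions], round(chi_square, 2))
```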

Only one of the two column probabilities is independent, since they sum to one, and the same is true of the two row probabilities, so under the null hypothesis of independence, only one of the four cell probabilities is independent, the one that is the product of the two independent marginal probabilities. There is therefore one degree of freedom. At the 5% level of significance, the critical value of Chi-Square with one degree of freedom is 3.84. The calculated value of 3.09 is smaller, so we would accept the hypothesis of no association between consumer purchase behavior and consumer product information. The Chi-Square distribution for one degree of freedom is shown in Figure 4.

-------------------------------------------------------------------------------

Figure 4: Chi-Square Density, One Degree of Freedom (vertical axis: density; horizontal axis: Chi-Square variable).

------------------------------------------------------------------------------------------------

The same procedure is applied to two-way tables with I rows and J columns, where the number of degrees of freedom is (I-1)*(J-1).
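
A minimal sketch of the general procedure for an I by J table follows. The function name `chi_square_independence` is hypothetical, and it is applied here to the refrigerator data only because that is the table available in these notes; the resulting statistic still has to be compared with a critical value for the reported degrees of freedom.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected_count = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected_count) ** 2 / expected_count

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# The 2x2 refrigerator table from Table 7.
statistic, df = chi_square_independence([[314, 118], [226, 62]])
print(round(statistic, 2), df)
```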
