Chi-square

Chi-Square Sample Proportion versus Population Proportion
Suppose that you were interested in how the typical mutual fund investor makes buy/sell
decisions in response to the performance of the funds that he or she owns. In the
following example, we’ll explore a useful statistic for comparing a set of sample
proportions with the corresponding population proportions. In our example it’s
reasonable to assume that in the population of mutual fund investors a third sell
funds that have gained, a third buy more of those that have gained, and a third hold
onto their investment and neither sell nor buy more of the funds. Thus, with $f_e$
representing the frequency expected in the equation below, we would expect a proportion
of .333 in each of our three situations.
Now suppose that we sample the population and find that actually 50% sell their
winners, 25% buy more of their winners, and 25% just hold onto their winners, neither
buying nor selling. The statistical test that can be used to determine if our sample
proportions are statistically different from our population proportions is $\chi^2$,
Chi-square. To compute the $\chi^2$ we basically compare $f_e$, the frequency expected,
with $f_o$, the frequency observed, using the following formula:
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
In our problem above:
We’ll defer the interpretation of the $\chi^2$ until later in this section. To do this
problem in Excel please see this clip.
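The calculation can also be sketched in a few lines of Python. The example gives only percentages, so the sample size of 120 investors used here is an assumption made purely for illustration, as are the function and variable names. Because the formula works on frequencies (counts), the percentages are converted to counts first.

```python
# Chi-square for sample proportions versus population proportions.
# ASSUMPTION: n = 120 is a hypothetical sample size; the text gives only percentages.

def chi_square(observed, expected):
    """Return the sum of (f_o - f_e)^2 / f_e over all categories."""
    return sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))

n = 120                                      # hypothetical sample size
sample_props = [0.50, 0.25, 0.25]            # sell, buy more, hold (from the example)
population_props = [1 / 3, 1 / 3, 1 / 3]     # a third in each situation

observed = [p * n for p in sample_props]      # f_o as counts
expected = [p * n for p in population_props]  # f_e as counts

chi2 = chi_square(observed, expected)
print(f"chi-square = {chi2:.2f}")
```

Note that for the same percentages the statistic grows in proportion to the sample size, which is why the counts, and not the percentages alone, are needed.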
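When there are only two groups, the formula above is modified by subtracting .5 (the
correction for continuity) from the absolute difference between the observed and
expected frequencies in each of the two categories, as illustrated below.
$$\chi^2 = \sum \frac{(|f_o - f_e| - .5)^2}{f_e}$$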
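A minimal sketch of the two-group case follows. The counts are hypothetical and are not taken from the text; the function name is likewise only illustrative. It applies the correction by subtracting .5 from the absolute difference in each category before squaring.

```python
# Chi-square with the correction for continuity, for exactly two categories.
# ASSUMPTION: the counts below are hypothetical, chosen only to show the arithmetic.

def chi_square_corrected(observed, expected):
    """Return the sum of (|f_o - f_e| - 0.5)^2 / f_e over the two categories."""
    if len(observed) != 2 or len(expected) != 2:
        raise ValueError("this sketch handles exactly two categories")
    return sum((abs(f_o - f_e) - 0.5) ** 2 / f_e
               for f_o, f_e in zip(observed, expected))

observed = [60, 40]   # hypothetical observed frequencies
expected = [50, 50]   # hypothetical expected frequencies (an even split)

corrected = chi_square_corrected(observed, expected)
uncorrected = sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))
print(f"corrected = {corrected:.2f}, uncorrected = {uncorrected:.2f}")
```

The corrected value is always somewhat smaller than the uncorrected one, which makes the test slightly more conservative.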
Chi-Square Test of Independence
Many times there are situations in which it’s necessary to determine if there’s a
relationship between two (or more) categorical variables. Examples of these kinds of
problems include:

Is there a relationship between whether a sports team wins or loses and whether
the game is played at home or away?

Is there a relationship between gender (male or female) and level in the
managerial hierarchy (supervisory, mid-level, or executive)?

Is there a relationship between type of pay plan (incentive, longevity, or a
combination of the two) and employee job satisfaction level (high, medium, or
low)?

Is there a relationship between the dosage of a cholesterol medication (5 mg., 10
mg., 15 mg., or 20 mg.) and cholesterol (normal or high)?

Is there a relationship between the size of the discount on a product (10%, 20%,
or 30%) and whether or not the product is purchased?
In problems such as the above, the data are sometimes already summarized in a table
whose cells contain the number of observations for each combination of categories. The
following table, a segment of 2009 data from the Bureau of Labor Statistics website
(http://www.bls.gov/), contains, in thousands, the union membership status for workers
in six different industries.
                                Union Membership
Industry                        Yes         No
Mining                           57        692
Construction                    958       6613
Manufacturing                  1470      13454
Transportation/Utilities       1144       5162
Finance Activities              150       8236
Health and Social Services     1161      15454
Tables such as the above are referred to as contingency tables and can be informative
without conducting any statistical analysis. Normally, however, we want to know
whether there’s a relationship between the two variables, and to answer such a
question the Chi-Square Test of Independence is performed. In our example above
there are two categorical variables: industry, with six levels (mining, construction,
manufacturing, transportation/utilities, financial activities, and health/social
services), and union membership, with two levels (yes, no). The table would be
referred to as a 6 X 2 Chi-Square or a 2 X 6 Chi-Square. If we had two categorical
variables with two levels each it would be a 2 X 2 Chi-Square. Of course we could have
a 2 X 3 Chi-Square, a 3 X 3 Chi-Square, etc. Additionally, we are not restricted to
only two categorical variables; we could have a 2 X 2 X 3 Chi-Square or any other
combination, but Excel is limited to two categorical variables. Further, there are
other, better techniques for analyzing relationships among more than two categorical
variables, so we’ll limit our analyses to situations in which we have only two
categorical variables.
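To see how the union membership table can be informative before any test is run, the following sketch computes the share of workers who are union members in each industry. It assumes each row of the table reads industry, Yes, No (in thousands); the variable names are illustrative only.

```python
# Union membership by industry, in thousands (2009 BLS figures from the table).
# ASSUMPTION: each tuple is (Yes, No), i.e. (union members, non-members).
union_table = {
    "Mining": (57, 692),
    "Construction": (958, 6613),
    "Manufacturing": (1470, 13454),
    "Transportation/Utilities": (1144, 5162),
    "Finance Activities": (150, 8236),
    "Health and Social Services": (1161, 15454),
}

# Share of workers in each industry who are union members.
rates = {industry: yes / (yes + no) for industry, (yes, no) in union_table.items()}

# Print the industries from the highest union share to the lowest.
for industry, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{industry:<28}{rate:6.1%}")

# Dimensions of the contingency table: levels of industry X levels of membership.
n_rows, n_cols = len(union_table), 2
print(f"{n_rows} X {n_cols} table")
```

The shares differ noticeably across industries, which is the kind of pattern the Chi-Square Test of Independence evaluates formally.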
We’ll begin with a somewhat smaller table to illustrate the Chi-Square Test of
Independence. The following table, based on a study of legal and white collar
professionals performed by WorldOne Research, contains the number of individuals, by
generational group, who have used the listed programs during work hours. The baby
boomers were born before 1965, the generation Xers were born between 1965 and
1979, and the generation Yers were born after 1979.
                                Generation
Type of Program    Baby Boomer    Generation X    Generation Y
Music Playing               96             165              87
Video Playing               69             135              76
Gaming                      38              77              59
The initial step, illustrated below, is to compute the row totals, column totals, and the
grand total in the table.
                                Generation
Type of Program    Baby Boomer    Generation X    Generation Y    Row Totals
Music Playing               96             165              87           348
Video Playing               69             135              76           280
Gaming                      38              77              59           174
Column Totals              203             377             222           802
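The totals can be computed with a short Python sketch such as the one below. The nested-list layout (one inner list per type of program, one entry per generation) and the variable names are choices made for this illustration.

```python
# Observed frequencies: rows are types of program, columns are generations
# (Baby Boomer, Generation X, Generation Y).
observed = [
    [96, 165, 87],   # Music Playing
    [69, 135, 76],   # Video Playing
    [38, 77, 59],    # Gaming
]

row_totals = [sum(row) for row in observed]                 # one total per program
column_totals = [sum(column) for column in zip(*observed)]  # one total per generation
grand_total = sum(row_totals)

print("Row totals:   ", row_totals)
print("Column totals:", column_totals)
print("Grand total:  ", grand_total)
```

A quick check is that the row totals and the column totals both add up to the same grand total.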
The next step is to compute the $f_e$ for each cell, the frequency that would be
expected in each cell if the null hypothesis were true. In other words, this step
involves calculating the expected frequency for each cell assuming that there were no
relationship whatsoever between the two categorical variables (that they were
completely independent of one another). This is done with the following equation that
prorates the values. For example, 348/802 or .4339, a little over 43%, of the 203 Baby
Boomers would use music playing software at work if the two categorical variables were
independent of one another. Thus we’d expect 88.08 of the 203 Baby Boomers to play
music at work. So, each expected cell value is
$$f_e = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$
Our table of expected cell frequencies, following the above equation, would be
                                Generation
Type of Program    Baby Boomer    Generation X    Generation Y
Music Playing      88.08478803     163.5860349     96.32917706
Video Playing      70.87281796     131.6209476     77.50623441
Gaming             44.04239401     81.79301746     48.16458853
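The expected frequencies can be reproduced with the sketch below, which multiplies each row total by each column total and divides by the grand total. The variable names are illustrative.

```python
# Observed frequencies: rows are types of program, columns are generations.
observed = [
    [96, 165, 87],   # Music Playing
    [69, 135, 76],   # Video Playing
    [38, 77, 59],    # Gaming
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell: (row total)(column total) / (grand total).
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

for row in expected:
    print("  ".join(f"{f_e:10.4f}" for f_e in row))
```

The expected frequencies keep the same row and column totals as the observed table; only the way the counts are spread across the cells changes.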
The next step is to compute the Chi-Square with the following equation:
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
This equation is the same one that we used for the sample proportion versus the
population proportion analysis earlier in this module. Please note that the df
(degrees of freedom) for a Chi-square are the number of rows in the table minus one
times the number of columns in the table minus one: (# of rows – 1)(# of columns – 1).
To see how to perform the Chi-Square Test of Independence using Excel please see this
video clip.
The results of our analysis would be as follows.
$\chi^2 = 5.34$, df = 4, p = .254
Because the p-value is greater than α (.05), we fail to reject Ho and conclude that
there is no evidence of a relationship between the type of software used at work and
generation.
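As a check on these results, here is a minimal sketch of the whole test in plain Python. The function names are hypothetical. The p-value helper is an assumption of this sketch: it uses a closed form for the upper tail of the Chi-square distribution that holds only when df is even (as it is here, with df = 4); for other tables a statistics package would normally be used.

```python
import math

def chi_square_independence(observed):
    """Return (chi-square, df) for a two-way table of observed frequencies."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * column_totals[j] / grand_total
            chi2 += (f_o - f_e) ** 2 / f_e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

def p_value_even_df(chi2, df):
    """Upper-tail probability; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles only positive, even df")
    half = chi2 / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

observed = [
    [96, 165, 87],   # Music Playing
    [69, 135, 76],   # Video Playing
    [38, 77, 59],    # Gaming
]

chi2, df = chi_square_independence(observed)
p = p_value_even_df(chi2, df)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.3f}")
```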
The above examples have used data that is already in summarized tables. Frequently,
however, the data is in a “raw” format such as below.
Type of Program    Generation
Music Playing      Baby Boomer
Video Playing      Generation X
Gaming             Generation Y
Music Playing      Generation X
Music Playing      Generation Y
Gaming             Generation X
Gaming             Generation X
Gaming             Baby Boomer
Video Playing      Baby Boomer
Music Playing      Baby Boomer
Video Playing      Generation X
Gaming             Generation X
Music Playing      Generation X
Gaming             Generation X
Music Playing      Baby Boomer
Gaming             Generation X
Video Playing      Baby Boomer
Video Playing      Generation X
Gaming             Generation Y
Music Playing      Baby Boomer
Gaming             Baby Boomer
Video Playing      Baby Boomer
Music Playing      Baby Boomer
Video Playing      Generation X
In cases such as the above, in which the data is not already in a summarized table,
the data must be sorted and the various combinations counted in order to construct
the contingency table for analysis. Instead of manually counting the combinations, a
better way to create the table is to use a PivotTable in Excel.
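Outside of Excel, the same counting can be sketched with Python's standard library. The handful of records below is hypothetical and serves only to show the mechanics; each record pairs a type of program with a generation, and the variable names are illustrative.

```python
from collections import Counter

# ASSUMPTION: a small hypothetical set of raw records, one (program, generation)
# pair per individual.
records = [
    ("Music Playing", "Baby Boomer"),
    ("Video Playing", "Generation X"),
    ("Gaming", "Generation Y"),
    ("Music Playing", "Generation X"),
    ("Gaming", "Generation X"),
    ("Gaming", "Generation X"),
    ("Music Playing", "Baby Boomer"),
    ("Video Playing", "Baby Boomer"),
]

programs = ["Music Playing", "Video Playing", "Gaming"]
generations = ["Baby Boomer", "Generation X", "Generation Y"]

# Count how often each combination occurs; absent combinations count as 0.
counts = Counter(records)
table = [[counts[(program, generation)] for generation in generations]
         for program in programs]

for program, row in zip(programs, table):
    print(f"{program:<15}", row)
```

The resulting nested list has the same layout as the summarized tables used earlier, so it can be analyzed in exactly the same way.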