Chi-Square Sample Proportion versus Population Proportion

Suppose that you were interested in how the typical mutual fund investor makes buy/sell decisions based on how the funds he or she owns have performed. In the following example, we'll explore a useful statistic for comparing a sample proportion with a population proportion. In our example it's reasonable to assume that, in the population of mutual fund investors, a third sell funds that have gained, a third buy more of those that have gained, and a third hold onto their investment and neither sell nor buy more of the funds. Thus, with $f_e$ representing the frequency expected in the equation below, we would expect .333 for each of our three situations. Now suppose that we sample the population and find that actually 50% sell their winners, 25% buy more of their winners, and 25% just hold onto their winners, neither buying nor selling.

The statistical test that can be used to determine whether our sample proportions are statistically different from our population proportions is $\chi^2$, Chi-square. To compute the $\chi^2$ we basically compare $f_e$, the frequency expected, with $f_o$, the frequency observed, using the following formula:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

In our problem above:

$$\chi^2 = \frac{(.50 - .333)^2}{.333} + \frac{(.25 - .333)^2}{.333} + \frac{(.25 - .333)^2}{.333} \approx .125$$

Here the proportions stand in for frequencies; with actual counts from a sample of n investors, each term (and so the statistic) would be n times larger.

We'll defer the interpretation of the $\chi^2$ until later in this section. To do this problem in Excel please see this clip.

When there are only two groups, the formula above is modified by subtracting .5 (the correction for continuity) in each of the two categories, as illustrated below.

$$\chi^2 = \sum \frac{(|f_o - f_e| - .5)^2}{f_e}$$

Chi-Square Test of Independence

Many times it's necessary to determine whether there's a relationship between two (or more) categorical variables. Examples of these kinds of problems include:
Is there a relationship between whether a sports team wins or loses and whether the game is played at home or away?
Is there a relationship between gender (male or female) and level in the managerial hierarchy (supervisory, mid-level, or executive)?
Is there a relationship between type of pay plan (incentive, longevity, or a combination of the two) and employee job satisfaction level (high, medium, or low)?
Is there a relationship between the dosage of a cholesterol medication (5 mg, 10 mg, 15 mg, or 20 mg) and cholesterol level (normal or high)?
Is there a relationship between the size of the discount on a product (10%, 20%, or 30%) and whether or not the product is purchased?

In problems such as these, the data are often summarized in a table whose cells contain the count for each combination of categories. The following table, a segment of the 2009 data from the Bureau of Labor Statistics website (http://www.bls.gov/), contains the union membership status, in thousands of workers, for six different industries.

Industry                        Union Membership (thousands)
                                Yes         No
Mining                          57          692
Construction                    958         6613
Manufacturing                   1470        13454
Transportation/Utilities        1144        5162
Financial Activities            150         8236
Health and Social Services      1161        15454

Tables such as the above are referred to as contingency tables and can be informative without conducting any statistical analysis. Normally, however, we want to know whether there's a relationship between the two variables, and to answer that question the Chi-Square Test of Independence is performed.

In our example above there are two categorical variables: industry, with six levels (mining, construction, manufacturing, transportation/utilities, financial activities, and health/social services), and union membership, with two levels (yes, no). The table would be referred to as a 6 X 2 Chi-Square or a 2 X 6 Chi-Square. If we had two categorical variables with two levels each, it would be a 2 X 2 Chi-Square. Of course we could have a 2 X 3 Chi-Square, a 3 X 3 Chi-Square, etc. Additionally, we are not restricted to only two categorical variables; we could have a 2 X 2 X 3 Chi-Square or any other combination, but Excel can handle only two categorical variables.
Further, there are other, better techniques for analyzing relationships among more than two categorical variables, so we'll limit our analyses to situations in which we have only two categorical variables.

We'll begin with a somewhat smaller table to illustrate the Chi-Square Test of Independence. The following table, based on a study of legal and white collar professionals performed by WorldOne Research, contains the number of individuals, by generational group, who have used the listed programs during work hours. The baby boomers were born before 1965, the generation Xers were born between 1965 and 1979, and the generation Yers were born after 1979.

Type of Program      Generation
                     Baby Boomer    Generation X    Generation Y
Music Playing        96             165             87
Video Playing        69             135             76
Gaming               38             77              59

The initial step, illustrated below, is to compute the row totals, column totals, and the grand total in the table.

Type of Program      Generation
                     Baby Boomer    Generation X    Generation Y    Row Totals
Music Playing        96             165             87              348
Video Playing        69             135             76              280
Gaming               38             77              59              174
Column Totals        203            377             222             802

The next step is to compute the $f_e$ for each cell, the frequency that would be expected in each cell if the null hypothesis were true. In other words, this step involves calculating the expected frequency for each cell assuming that there were no relationship whatsoever between the two categorical variables (that they were completely independent of one another). This is done with the following equation, which prorates the values.

$$f_e = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$

For example, 348/802 or .4339, a little over 43%, of the 203 Baby Boomers would use music playing software at work if the two categorical variables were independent of one another. Thus we'd expect (348)(203)/802 = 88.08 of the 203 Baby Boomers to play music at work.
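The proration step can also be checked outside Excel. The following is a minimal Python sketch, not the module's own method: it uses the observed counts from the table above, and the variable names (`observed`, `row_totals`, `col_totals`, `expected`) are illustrative choices rather than anything defined in this module.

```python
# Observed counts from the program-by-generation table.
# Column order assumed: Baby Boomer, Generation X, Generation Y.
observed = {
    "Music Playing": [96, 165, 87],
    "Video Playing": [69, 135, 76],
    "Gaming": [38, 77, 59],
}

# Row totals, column totals, and the grand total.
row_totals = {program: sum(counts) for program, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected frequency for each cell: (row total * column total) / grand total.
expected = {
    program: [row_totals[program] * col_total / grand_total
              for col_total in col_totals]
    for program in observed
}

for program, values in expected.items():
    print(program, [round(value, 2) for value in values])
# First line printed: Music Playing [88.08, 163.59, 96.33]
```

The first expected value, 88.08, matches the Baby Boomer and Music Playing figure worked out by hand.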
In general, each expected cell value is

$$f_e = \frac{(\text{row total})(\text{column total})}{\text{grand total}}$$

Our table of expected cell frequencies, following the above equation, would be

Type of Program      Generation
                     Baby Boomer     Generation X    Generation Y
Music Playing        88.08478803     163.5860349     96.32917706
Video Playing        70.87281796     131.6209476     77.50623441
Gaming               44.04239401     81.79301746     48.16458853

The next step is to compute the Chi-Square with the following equation:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

This is the same equation that we used for the sample proportion versus the population proportion analysis earlier in this module. Please note that the df (degrees of freedom) for a Chi-square are the number of rows in the table minus one, times the number of columns in the table minus one: df = (# of rows - 1)(# of columns - 1). To see how to perform the Chi-Square Test of Independence using Excel please see this video clip.

The results of our analysis would be as follows.

$\chi^2 = 5.34$, df = 4, p = .254

Because the p-value is greater than α (.05), we fail to reject H₀ and conclude that the data show no statistically significant relationship between the type of software used at work and generation.

The above examples have used data that are already in summarized tables. Frequently, however, the data are in a "raw" format such as below.

Type of Program      Generation
Music Playing        Baby Boomer
Video Playing        Generation X
Gaming               Generation Y
Music Playing        Generation X
Music Playing        Generation Y
Gaming               Generation X
Gaming               Generation X
Gaming               Baby Boomer
Video Playing        Baby Boomer
Music Playing        Baby Boomer
Video Playing        Generation X
Gaming               Generation X
Music Playing        Generation X
Gaming               Generation X
Music Playing        Baby Boomer
Gaming               Generation X
Video Playing        Baby Boomer
Video Playing        Generation X
Gaming               Generation Y
Music Playing        Baby Boomer
Gaming               Baby Boomer
Video Playing        Baby Boomer
Music Playing        Baby Boomer
Video Playing        Generation X

In cases such as the above, in which the data are not already in a summarized table, the data must be sorted and the various combinations counted in order to construct the contingency table for analysis.
Instead of manually counting the combinations, a better way to create the table is to use a Pivot Table in Excel.
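For readers working outside Excel, the same counting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the module's procedure: `raw_rows` holds only the first six records of the raw listing (assuming the program and generation columns line up in order), and the helper names `crosstab` and `chi_square_independence` are hypothetical. The second function applies the Chi-Square formula and the df rule from this section to the full summarized survey table.

```python
from collections import Counter

# First six (program, generation) records from the raw listing (assumed pairing).
raw_rows = [
    ("Music Playing", "Baby Boomer"),
    ("Video Playing", "Generation X"),
    ("Gaming", "Generation Y"),
    ("Music Playing", "Generation X"),
    ("Music Playing", "Generation Y"),
    ("Gaming", "Generation X"),
]

programs = ["Music Playing", "Video Playing", "Gaming"]
generations = ["Baby Boomer", "Generation X", "Generation Y"]


def crosstab(rows, row_labels, col_labels):
    """Count each (row, column) combination, much as a pivot table would."""
    counts = Counter(rows)  # combinations never seen count as 0
    return [[counts[(r, c)] for c in col_labels] for r in row_labels]


def chi_square_independence(observed):
    """Return (chi-square, df) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (f_o - f_e) ** 2 / f_e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


small_table = crosstab(raw_rows, programs, generations)
print(small_table)  # [[1, 1, 1], [0, 1, 0], [0, 1, 1]]

# Full summarized survey table (rows: programs, columns: generations).
survey_table = [[96, 165, 87], [69, 135, 76], [38, 77, 59]]
chi2, df = chi_square_independence(survey_table)
print(round(chi2, 2), df)  # 5.34 4
```

The six-record table is far too sparse to test; it only shows the counting step. The statistic for the full survey table agrees with the result reported earlier (df = 4, p = .254).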