Author: Brenda Gunderson, Ph.D., 2012

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License: http://creativecommons.org/licenses/by-nc-sa/3.0/

The University of Michigan Open.Michigan initiative has reviewed this material in accordance with U.S. Copyright Law and has tried to maximize your ability to use, share, and adapt it. The attribution key provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to attribute these materials visit: http://open.umich.edu/education/about/terms-of-use.

Some materials are used with permission from the copyright holders. You may need to obtain new permission to use those materials for other uses. This includes all content from:
Mind on Statistics, Utts/Heckard, 4th Edition, Cengage L, 2012
Text Only: ISBN 9781285135984
Bundled version: ISBN 9780538733489

SPSS and its associated programs are trademarks of SPSS Inc. for its proprietary computer software. Other product names mentioned in this resource are used for identification purposes only and may be trademarks of their respective companies.

Attribution Key
For more information see: http://open.umich.edu/wiki/AttributionPolicy

Content the copyright holder, author, or law permits you to use, share and adapt:
Creative Commons Attribution-NonCommercial-Share Alike License
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.

Make Your Own Assessment

Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright:
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC §102(b)) *laws in your jurisdiction may differ.
Content Open.Michigan has used under a Fair Use determination:
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act (17 USC § 107) *laws in your jurisdiction may differ.
Our determination DOES NOT mean that all uses of this third-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair. To use this content you should conduct your own independent analysis to determine whether or not your use will be Fair.

Module 10: Chi-Square Tests

Objective: In this module, you will learn how to perform three Chi-square tests (the test of goodness of fit, the test of independence, and the test of homogeneity) that are used to analyze categorical responses.

Overview: There are three Chi-Square tests presented in this module: the tests of goodness of fit, independence, and homogeneity. For all three tests, the data are generally presented in the form of a contingency table (a rectangular array of numbers in cells). All three tests are based on the Chi-Square statistic:

$X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$,

where $O_i$ is the observed count and $E_i$ is the expected count under the corresponding null hypothesis.

Goodness of Fit Test: This test answers the question, “Do the data fit well compared to a specified distribution?” It considers one categorical variable, and assesses whether the proportion of sampled observations falling into each category matches well enough to the null distribution for the given problem. The null hypothesis for the goodness of fit test specifies this null distribution, which describes the population proportion of observations in each category.

Test of Homogeneity: This test answers the question, “Do two or more populations have the same distribution for one categorical variable?” It considers one categorical variable, and assesses whether this variable is distributed the same in two (or more) different populations.
The null hypothesis for the test of homogeneity is that the distribution of the categorical variable is the same for the two (or more) populations.

Test of Independence: This test answers the question, “Are two factors (or variables) independent for a population under study?” It considers two categorical variables, and assesses whether there is a relationship between these two variables for a single population. The null hypothesis for the test of independence is that the two categorical variables are independent (that is, they are not related) for the population of interest.

There are a few properties of the Chi-square distribution that you might find useful. The expected value of a Chi-square distribution is its degrees of freedom (mean = df), and its variance is 2 times its degrees of freedom. Thus, its standard deviation is the square root of 2 times the degrees of freedom ($\sigma^2 = 2 \cdot df$, so $\sigma = \sqrt{2 \cdot df}$). This frame of reference can help us assess whether our observed statistic is unusual under the null hypothesis or somewhat consistent with the null hypothesis.

For details on how to enter cross-tabulated data and perform these tests in SPSS, please refer to the How To Do It in SPSS section for Module 10 online in CTOOLS. Output will be provided for all activities.

Formula Card:

Activity 1: Is There a Different Pattern in the Distribution of Accidental Deaths in a Certain Region Compared to the Pattern in the Entire United States?

Background: According to the records of the National Safety Council, accidental deaths in the United States during 2002 had the following distribution according to the principal types of accidents.
Motor Vehicle   Falls   Drowning   Fire   Poison   Other
45%             15%     4%         3%     16%      17%

Suppose that an accidental death data set from a particular geographical region yielded the following frequency distribution for the principal types of accidents:

Motor Vehicle   Falls   Drowning   Fire   Poison   Other
442             161     42         33     162      150

Do these data show a significantly different pattern in the distribution of accidental deaths in the particular region compared to the pattern in the entire United States? Use a 5% significance level. (Source: National Safety Council Website, 2005)

Task: Perform a Chi-square goodness of fit test to assess whether the data fit well with the model specified in the null hypothesis.

Recall: Write out the Five Steps for conducting a test of hypotheses (reference page 53).
1.
2.
3.
4.
5.

Hypothesis Test: You can now implement the Five Steps for conducting a test of hypotheses.

1. State the hypotheses:
a. Explain in one sentence why the test for this scenario is a goodness of fit test.
b. State the null hypothesis H0: ________________________________ , where __________ represents:
Remember: Your hypotheses and parameter definition should always be a statement about the population(s) under study.

2. Checking the Assumptions and Computing the Test Statistic
a. Find the expected counts for the different categories of accidental deaths, and fill them in the table that has the null hypothesis proportions and the observed counts.

                Null %   Observed   Expected
Motor Vehicle   45%      442        ________
Falls           15%      161        ________
Drowning        4%       42         ________
Fire            3%       33         ________
Poison          16%      162        ________
Other           17%      150        ________
Total           100%     990

b. Do all cells have expected counts greater than 5?   Yes   No
c. Complete the calculation of the test statistic based on your table above.

$X^2 = \frac{(442 - 445.5)^2}{445.5} + \frac{(161 - 148.5)^2}{148.5} + \frac{(42 - 39.6)^2}{39.6} + \frac{(33 - 29.7)^2}{29.7} + \frac{(162 - 158.4)^2}{158.4} + \frac{(\_\_\_\_ - \_\_\_\_)^2}{\_\_\_\_} = 3.66$

3. Calculate the p-Value: Based on the SPSS output, answer the following:

Test Statistics
              Category
Chi-Square    3.663
df            5
Asymp. Sig.   .599

i. The p-value is _____________ .
ii. The expected value of the test statistic assuming the null hypothesis is true is ____________ .
iii. The large p-value is consistent with the fact that our observed test statistic value is (circle one: greater than / less than) the expected test statistic value (under the null hypothesis).

4. Decision: What is your decision at a 5% significance level?   Reject H0   Fail to Reject H0

Remember:
Reject H0           Results statistically significant
Fail to Reject H0   Results not statistically significant

5. Conclusion: What is your conclusion in the context of the problem?

Activity 2: Are Angry People More Likely to Have Heart Disease?

Background: “People who get angry easily tend to be more likely to have heart disease.” That is the conclusion of a study that followed a random sample of 12,986 people from three locations over about four years. All subjects were free of heart disease at the beginning of the study. The subjects took the Spielberger Trait Anger Scale, which measures how prone a person is to sudden anger. The 8,474 people in the sample who had normal blood pressure were classified according to whether they had coronary heart disease (CHD) or not and whether they had low anger, moderate anger, or high anger according to the Anger Scale. The classification summary is given.

         Low Anger   Moderate Anger   High Anger
CHD      53          110              27
No CHD   3057        4621             606
(Source: Moore, 2001, page 476)

Task: Perform a Chi-square test of independence at the 10% level to assess if Anger classification and Heart Disease Status are related.

Hypothesis Test: You can now implement the Five Steps for conducting a test of hypotheses.

1. State the hypotheses:
a. Explain in one sentence why the test for this scenario is a test of independence.
b. State the null hypothesis H0: ______________________________________________________
c. Intuition about what the conclusion may be can be derived from looking at some basic probabilities.
What proportion of sampled subjects had CHD?
What proportion of High anger subjects had CHD?
What proportion of Moderate anger subjects had CHD?
What proportion of Low anger subjects had CHD?
If the null hypothesis were true, we would expect the above four numbers to be _________.

2. Checking the Assumptions and Computing the Test Statistic

CHD * TEMPER Crosstabulation
                                      TEMPER
                          Low anger   Moderate anger   High anger   Total
CHD      Count            53          110              27           190
         Expected Count   69.7        106.1            14.2         190.0
No CHD   Count            3057        4621             606          8284
         Expected Count   3040.3      4624.9           618.8        8284.0
Total    Count            3110        4731             633          8474
         Expected Count   3110.0      4731.0           633.0        8474.0

Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             16.077a   2    .000
Likelihood Ratio               13.999    2    .001
Linear-by-Linear Association   13.184    1    .000
N of Valid Cases               8474
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 14.19.

a. Briefly, explain what the “expected counts” are.
b. Based on the table, do the assumptions appear to be met to perform the test? (Are all expected counts greater than 5?)   Yes   No
c. The expected value of the test statistic assuming the null hypothesis is true is _________.
d. What is the value of the test statistic? ____ = ____________

3. Calculate the p-Value: Based on the SPSS output, answer the following:
i. The p-value is _____________ .
ii. The small p-value is consistent with the fact that our observed test statistic value is (circle one: greater than / less than) the expected test statistic value (under the null hypothesis).

4. Decision: What is your decision at a 5% significance level?   Reject H0   Fail to Reject H0

Remember:
Reject H0           Results statistically significant
Fail to Reject H0   Results not statistically significant

5. Conclusion: What is your conclusion in the context of the problem?

Activity 3: Comparison of the Distribution of Academic Degrees: Males Versus Females

Background: How do women and men compare in the pursuit of academic degrees?
The table below presents counts (in thousands) from the Statistical Abstract of degrees earned in 1996, categorized by the level of the degree and the sex of the recipient.

               Female   Male
Bachelor       642      522
Master         227      179
Professional   32       45
Doctorate      18       27

Task: Perform a Chi-square test of homogeneity. Use a 1% significance level.

Hypothesis Test: You can now implement the Five Steps for conducting a test of hypotheses.

1. State the hypotheses:
State the null hypothesis H0: __________________________________________________________
__________________________________________________________

2. Checking the Assumptions and Computing the Test Statistic
a. Show how the expected count 531.8 (first cell for males) was computed.
b. Based on the table, do the assumptions appear to be met to perform the test? (Are all expected counts greater than 5?)   Yes   No
c. The expected value of the test statistic assuming the null hypothesis is true is _________.
d. What is the value of the test statistic? ____ = ____________

Chi-Square Tests
                               Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square             9.514a   3    .023
Likelihood Ratio               9.485    3    .023
Linear-by-Linear Association   5.099    1    .024
N of Valid Cases               1692
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.56.

3. Calculate the p-Value: Based on the SPSS output, the p-value is _____________ .

4. Decision: What is your decision at a 1% significance level?   Reject H0   Fail to Reject H0

Remember:
Reject H0           Results statistically significant
Fail to Reject H0   Results not statistically significant

5. Conclusion: What is your conclusion at a 1% significance level in the context of the problem?

Would your decision and conclusion change if the significance level were:
5% instead of 1%?
3% instead of 1%?
2.3% instead of 1%?
2% instead of 1%?

Based on your answers above, you can see that the p-value represents the _________________ significance level at which the results would be statistically significant.
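As a way to check the SPSS output for Activity 3 by hand, the following minimal Python sketch recomputes the expected counts, the test statistic, and the p-value from the degree counts. It is an illustration only and not part of the original workbook or of SPSS: the helper names `chi2_sf` and `chi2_table` are hypothetical, and the p-value relies on a standard closed-form expression for the upper tail of a Chi-square distribution with whole-number degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a Chi-square variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i - 1/2) / Gamma(i + 1/2)
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        total += half ** (i - 0.5) / math.gamma(i + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def chi2_table(observed):
    """Expected counts, X^2 statistic, and df for a two-way table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count = (row total * column total) / overall total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, stat, df

# Rows: Female, Male; columns: Bachelor, Master, Professional, Doctorate (counts in thousands)
degrees = [[642, 227, 32, 18],
           [522, 179, 45, 27]]

expected, stat, df = chi2_table(degrees)
p_value = chi2_sf(stat, df)

print(f"Expected count, males with a Bachelor degree: {expected[1][0]:.1f}")
print(f"X^2 = {stat:.3f} with df = {df}")
print(f"Under H0: mean = {df}, standard deviation = {math.sqrt(2 * df):.2f}")
print(f"p-value = {p_value:.3f}")
```

With the counts above, the sketch should agree with the 531.8 expected count and with the Pearson Chi-Square row of the SPSS table, up to rounding. Comparing the statistic with the mean and standard deviation printed on the third line gives the frame of reference described in the module overview.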
Check Your Understanding: Fill in the blank with the most appropriate Chi-square test to address the research question.

A researcher wants to determine if scoring high or low on an artistic ability test depends on being right or left-handed.
___________________________________________

A national organization wants to compare the distribution of level of highest education completed (high school, college, masters, doctoral) for Republicans versus Democrats.
___________________________________________

A preservation society has the percentages of five main types of fish in the river from 10 years ago. After noticing an imbalance recently, they add some fish from hatcheries to the river. How can they determine if they restored the ecosystem from a new sample of fish?
___________________________________________

Example Exam Question on Chi-Square Tests

A study was performed to examine the attendance pattern and exam performance of students in an intro statistics course. A sample of 96 students enrolled in an intro stat course was selected (these can be considered a random sample of such students). These 96 students were classified by their attendance status (regularly attend lecture or not) and their performance on the midterm exam, classified as low (below 50%), middle (50% to 80%), and high (above 80%). The data and SPSS output are provided.

Regularly Attend? * Exam Performance Crosstabulation
Count
                          Exam Performance
                          Low   Middle   High   Total
Regularly Attend?   Yes   12    20       28     60
                    No    16    12       8      36
Total                     28    32       36     96

Chi-Square Tests
                               Value    df   Asymp. Sig.
Pearson Chi-Square             8.195a   2    .017
Likelihood Ratio               8.298    2    .016
Linear-by-Linear Association   8.067    1    .005
N of Valid Cases               96
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 10.50.

a. Give the name of the chi-square test for assessing if there is a relationship between attendance status and exam performance.
_______________________________________________________________

b.
Based on the above data, what proportion of regular attendees performed high on the exam?
Final answer: ________________________

c. Assuming there is no relationship between attendance status and exam performance, how many regular attendees would you expect to perform high on the exam? Show all work.
Final answer: ________________________

d. Assuming there is no relationship between attendance status and exam performance, what is the distribution of the test statistic? (Include all relevant information.)
Final answer: ___________________________________

e. Use a level of 0.05 to assess if there is a significant relationship between attendance status and exam performance.
Test Statistic Value: ________________     p-value: ______________________
Thus, there (circle your answer):   does   does not   appear to be an association between attendance status and exam performance.
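After attempting the exam question by hand, the arithmetic can be checked with a short Python sketch like the one below. It is only an illustration under stated assumptions, not part of the original exam or of SPSS: the variable names are hypothetical, and the p-value line uses the fact that for 2 degrees of freedom the upper-tail probability of a Chi-square distribution is exp(-x/2), so it does not apply to tables of other sizes.

```python
import math

# Observed counts from the crosstabulation; columns are Low, Middle, High
observed = {
    "Yes": [12, 20, 28],   # regularly attend
    "No":  [16, 12, 8],    # do not regularly attend
}

row_totals = {status: sum(counts) for status, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Observed proportion of regular attendees who performed high
prop_yes_high = observed["Yes"][2] / row_totals["Yes"]

# Expected count under "no relationship": (row total * column total) / overall total
expected = {status: [row_totals[status] * c / n for c in col_totals]
            for status in observed}
expected_yes_high = expected["Yes"][2]

# Chi-square statistic summed over all six cells
stat = sum((o - e) ** 2 / e
           for status in observed
           for o, e in zip(observed[status], expected[status]))

# Degrees of freedom = (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(col_totals) - 1)

# Upper-tail probability; this closed form is valid only because df = 2 here
p_value = math.exp(-stat / 2)

print(f"Proportion of regular attendees scoring high: {prop_yes_high:.3f}")
print(f"Expected regular attendees scoring high: {expected_yes_high:.1f}")
print(f"X^2 = {stat:.3f} with df = {df}, p-value = {p_value:.3f}")
```

The printed statistic and p-value should match the Pearson Chi-Square row of the SPSS output up to rounding, and the smallest expected count should match the 10.50 reported in the table footnote.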