Chi-Square Goodness of Fit Test

DEFINITIONS

Qualitative variables are those which classify the units into categories. The categories may or may not have a natural ordering to them. Qualitative variables are also called categorical variables.

Quantitative variables have numerical values that are measurements (length, weight, and so on) or counts (of how many). Arithmetic operations on such numerical values do have meaning.

Analysis of Count Data

Three tests

If we have qualitative data on just one variable, a test of goodness-of-fit is used to assess whether the qualitative data “fit” or are consistent with a particular discrete model for the percentages in each category. The null hypothesis would state the hypothesized discrete model.

A test of homogeneity is used to assess whether two or more populations are homogeneous or alike with respect to the distribution for some categorical variable. The null hypothesis is that the distributions are the same across the two or more populations.

A test of independence determines whether two qualitative variables are related or not for a given population. The null hypothesis is that the two variables are independent, that is, that there is no association.

Big Idea for Chi-Square Tests

1. The data consist of observed counts, that is, how many of the items or subjects fall into each category.
2. We will compute expected counts under H0, that is, the counts that we would expect to see for each category if the corresponding null hypothesis were true.
3. We will compare the observed and expected counts to each other via a test statistic that will be a measure of how close the observed counts are to the expected counts under H0. So if this “distance” is large, we have some support for rejecting H0.

The test statistic that is computed for all three tests is called a chi-square test statistic.
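The three steps above can be sketched in a few lines of Python. This is only an illustration: the die-rolling data are hypothetical (they do not come from these notes), the function names `expected_counts` and `chi_square_statistic` are made up for this sketch, and the "distance" used is the usual chi-square sum of (observed - expected)^2 / expected over all categories.

```python
def expected_counts(proportions, n):
    """Step 2: expected count for each category under H0 (proportion * n)."""
    return [p * n for p in proportions]


def chi_square_statistic(observed, expected):
    """Step 3: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Step 1: observed counts.
# Hypothetical data: a six-sided die rolled 60 times; H0 says the die is fair.
observed = [8, 12, 9, 11, 6, 14]

# Under H0 each face has probability 1/6, so each expected count is 10.
expected = expected_counts([1 / 6] * 6, n=60)

x2 = chi_square_statistic(observed, expected)
print("expected counts:", [round(e, 2) for e in expected])
print("chi-square statistic:", round(x2, 2))
```

For these made-up counts the statistic comes out to about 4.2. Whether that is "large" depends on the chi-square distribution with the appropriate degrees of freedom.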
THE CHI-SQUARE STATISTIC

DEFINITIONS

The observed counts are the data, the number of observations that fall into each category or cell.

The expected counts are the number of observations that would be expected to fall into each category or cell if the null hypothesis being tested were true.

The chi-square test statistic measures the distance between the observed and expected counts across all cells and is computed as:

$$X^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$

Think About It

Could you get an $X^2$ statistic that is negative?

The Chi-Square Distribution

[Figure: Various chi-square distributions (df = 1, df = 4, df = 10); horizontal axis $X^2$, from 0 to 20.]

Properties of the Chi-Square Distribution $\chi^2(df)$

The distribution is not symmetric and is skewed to the right.
The values are non-negative.
There is a different chi-square distribution for different degrees of freedom.
The mean of the chi-square distribution is equal to its degrees of freedom and is located to the right of the mode.
The variance of the chi-square distribution is 2(df).

The 95th percentile of a chi-square distribution with three degrees of freedom is 7.81 and is denoted by $\chi^2_{0.95}(3) = 7.81$.

[Figure: Chi-square distribution with 3 degrees of freedom; the area to the left of 7.81 is 0.95 and the area to the right is 0.05.]

Example: Working with the Chi-Square Distribution

In one study, researchers wanted to assess whether having a pet increased the length of survival for coronary heart disease patients. Approximately 94.3% of the patients with a pet survived for one year, while only 71.8% of those without a pet survived for one year. From a descriptive standpoint there seemed to be an advantage to having a pet. Is this difference of 22.5 percentage points significant? Is there a significant relationship between pet status and survival status?

(a) State the appropriate hypotheses in words.

H0: There is no association between having a pet and survival for coronary heart disease patients.
H1: Having a pet increases the survival for coronary heart disease patients.
(b) Suppose that the observed chi-square test statistic value is 8.85. We want to measure the chance of getting a value of 8.85 or larger under the null hypothesis. The distribution for $X^2$ under H0 is a chi-square distribution with one degree of freedom. Using your calculator, find the corresponding p-value.

[Figure: Chi-square distribution with 1 degree of freedom; the p-value is the area to the right of 8.85.]

(c) Using a 5% significance level, what is the decision? State the conclusion in the context of the problem.

Since the p-value for the test of no association is so small, we would reject H0. The data support that there is a statistically significant relationship between pet status and survival status.

Let's Do It! Youth and Sports

The Trends & Tudes Newsletter produced by Harris Interactive presented the model given below for the responses to the question: “Have you ever participated in organized youth sports outside of school?”

Response                            Percent
1. Yes, currently participate       29%
2. Yes, participated in the past    39%
3. No, have never participated      32%

Suppose a survey of n = 200 young people aged 8 to 18 attending schools in Ann Arbor, Michigan gave the following responses to the sport participation question:

Response    Observed Counts    Expected Counts
1           82                 ____
2           64                 ____
3           54                 ____
Total       200                200

Do the data indicate that youth from Ann Arbor have a different distribution of sport participation as compared to the national model?

(a) State the null hypothesis. (Hint: p1 is the proportion of youth in Ann Arbor stating that they are currently participating in organized youth sports outside of school, so based on the model being tested p1 = 0.29.)

H0: p1 = 0.29, p2 = ____, p3 = ____.

(b) Compute the expected counts and enter them in the previous table.

(c) Compute the observed test statistic.

$$X^2_{OBS} = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$

(d) Find the p-value.

(e) State your decision and conclusion using a 5% significance level.

Let's Do It!
According to USA Today (May 7, 1991), here is how sports team members say athletes do as role models for children:

Response:   Excellent   Good   Fair   Poor
Percent:    16%         38%    41%    5%

A poll of 350 adults within a community was taken and the following data were obtained:

Response:             Excellent   Good   Fair   Poor
Observed Responses:   44          145    133    28
Expected Number:      56          ___    ___    ___

We wish to determine if the data support the conjecture that the community adults have the same idea about athletes as role models as do sports team members.

a. What is the appropriate null hypothesis of interest here?

H0: ______________________________________________

b. Carry out the appropriate test at the 5% significance level.
   i. Compute the remaining expected counts and write them in the above table.
   ii. Compute the appropriate test statistic and report (bounds for) the p-value.
   iii. State your decision.

Chi-Square as a Test of Independence

Example: High Blood Pressure

Many studies have suggested that there is a link between high blood pressure and heart attacks. In one study, white male subjects aged 35 to 64 were classified according to whether their systolic blood pressure was low (less than 140 millimeters of mercury) or high (140 or higher) and then followed for five years to determine whether or not they suffered a heart attack during the five years. The data are summarized in the following table, with space in parentheses for the expected counts:

                      Heart Attack?
Blood Pressure    Yes           No
Low               21 (    )     2655 (    )
High              55 (    )     3283 (    )

a) We wish to assess if these data support the hypothesis that heart attack status is dependent on blood pressure level. State the hypotheses to be tested (in the context of this scenario).

H0:
H1:

b) Compute the expected counts under H0.

Proportion of low blood pressure subjects who had a heart attack: 21/2676 = 0.0078 or 0.78%.
Proportion of high blood pressure subjects who had a heart attack: 55/3338 = 0.0165 or 1.65%.
Overall, 76 of the 6014 subjects, or 1.26%, had a heart attack.
Expected number of low blood pressure subjects with a heart attack:

$$\frac{76 \times 2676}{6014} = 33.82 \approx 0.0126 \times 2676$$

In general:

$$\text{Expected count} = \frac{\text{column total} \times \text{row total}}{\text{overall total}}$$

c) The distribution of $X^2$ under H0 is approximately a chi-square distribution with df = (r - 1)(c - 1), where r is the number of rows and c is the number of columns. What are the degrees of freedom for our example?

d) Compute the test statistic $X^2$.

$$X^2_{OBS} = \frac{(21 - 33.82)^2}{33.82} + \frac{(2655 - 2642.18)^2}{2642.18} + \frac{(55 - 42.18)^2}{42.18} + \frac{(3283 - 3295.82)^2}{3295.82} = 8.87$$

e) Give the decision and conclusion using a 5% significance level.

[Figure: Chi-square distribution with df = 1; p-value = area to the right of 8.87 = 0.0029.]

Let's Do It! Hodgkin's Disease and Tonsillectomies

A study investigated whether any relationship exists between Hodgkin's disease and tonsillectomies. The counts below are based on a random sample of 85 patients suffering from Hodgkin's disease, each of whom had a sibling of the same sex who was free of the disease and whose age was within 5 years of the patient's age.

                                  Sibling
Patient             No Tonsillectomy    Tonsillectomy
No Tonsillectomy    37 (    )           7 (    )
Tonsillectomy       15 (    )           26 (    )

(a) What is the appropriate hypothesis of interest?

(b) Find the expected counts for each category.

(c) Carry out a test at the 5% significance level, and state your conclusion.

Homework will be posted on my website.
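As a check on the blood pressure example worked earlier in this section, the sketch below recomputes the expected counts, the test statistic, the degrees of freedom, and the p-value from the observed 2 x 2 table. It uses only the Python standard library. The function names are made up for this sketch, and the p-value helper relies on an assumption stated in its comment: it is valid only when df = 1, where the chi-square upper-tail area equals erfc(sqrt(x / 2)).

```python
from math import erfc, sqrt

# Observed counts from the blood pressure example.
# Rows: low, high blood pressure. Columns: heart attack yes, no.
observed = [[21, 2655],
            [55, 3283]]


def expected_table(table):
    """Expected count for each cell: row total * column total / overall total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square_statistic(obs, exp):
    """Sum of (O - E)^2 / E over all cells of the table."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(obs, exp)
               for o, e in zip(obs_row, exp_row))


def p_value_df1(x2):
    """Upper-tail area of the chi-square distribution. Valid ONLY for df = 1."""
    return erfc(sqrt(x2 / 2))


expected = expected_table(observed)
x2 = chi_square_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
p_value = p_value_df1(x2)

print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print("X^2 =", round(x2, 2), " df =", df, " p-value =", round(p_value, 4))
```

The statistic printed here should be very close to the 8.87 reported in the notes. Any small difference comes from the notes rounding the expected counts to two decimals before computing the sum. For tables with more than one degree of freedom, a chi-square table or a statistics package would be needed for the p-value instead of `p_value_df1`.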