Chi-Square Tests Tutorial

Question 1

Among students taking a statistics course, 52 came from families with 3 children. For these three-children families, the numbers with 0, 1, 2 and 3 female children are 5, 17, 24 and 6 respectively. Test the hypothesis that the number of female children in a three-children family has a Binomial(3, 0.6) distribution. Use a significance level of α = 0.05. State your null and alternative hypotheses clearly.

Let X = number of female children in a 3-child family. The hypotheses to be tested are:

H0 : X ~ Binomial(n = 3, p = 0.6)
H1 : X does not have a Binomial(n = 3, p = 0.6) distribution

Under H0, the cell probabilities are

$$p_{i0} = \binom{3}{x} (0.6)^x (0.4)^{3 - x}, \quad x = 0, 1, 2, 3,$$

and the expected number of families in each cell is 52 × $p_{i0}$.

X = x    Observed number of families    Cell probability under H0    Expected number of families
-----    ---------------------------    -------------------------    ---------------------------
X = 0     5                             0.064                        52 × 0.064 =  3.328
X = 1    17                             0.288                        52 × 0.288 = 14.976
X = 2    24                             0.432                        52 × 0.432 = 22.464
X = 3     6                             0.216                        52 × 0.216 = 11.232
Total    n = 52                         1                            52

Note that the cell for (X = 0) has an expected count of 3.328, which is less than 5. The chi-squared approximation is suitable when the expected count for each cell is at least 5. So merge the cells for (X = 0) and (X = 1) to ensure that all cells have an expected count of at least 5.
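The expected counts and the merging step can be checked numerically. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the tutorial.

```python
from math import comb

n_trials, p = 3, 0.6          # Binomial(3, 0.6) under H0
observed = [5, 17, 24, 6]     # families with 0, 1, 2, 3 female children
n = sum(observed)             # 52 families in total

# Cell probabilities and expected counts under H0
probs = [comb(n_trials, x) * p**x * (1 - p)**(n_trials - x)
         for x in range(n_trials + 1)]
expected = [n * q for q in probs]

# The (X = 0) cell has an expected count below 5, so merge it with (X = 1)
merged_observed = [observed[0] + observed[1]] + observed[2:]
merged_expected = [expected[0] + expected[1]] + expected[2:]

for o, e in zip(merged_observed, merged_expected):
    print(f"observed = {o:2d}, expected = {e:.3f}")
```

Merging adjacent cells keeps the total observed and expected counts at 52 while reducing the number of cells from 4 to 3.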
After merging, the cell probability for (X = 0 or X = 1) under H0 is

$$\binom{3}{0}(0.6)^0(0.4)^3 + \binom{3}{1}(0.6)^1(0.4)^2 = 0.352.$$

Cell             Observed number of families    Cell probability under H0    Expected number of families
-------------    ---------------------------    -------------------------    ---------------------------
X = 0 or X = 1   5 + 17 = 22                    0.352                        52 × 0.352 = 18.304
X = 2            24                             0.432                        52 × 0.432 = 22.464
X = 3             6                             0.216                        52 × 0.216 = 11.232
Total            n = 52                         1                            52

The test statistic is

$$\chi^2_{\text{obs}} = \sum_{\text{all cells}} \frac{(\text{Expected} - \text{Observed})^2}{\text{Expected}} = \frac{(18.304 - 22)^2}{18.304} + \frac{(22.464 - 24)^2}{22.464} + \frac{(11.232 - 6)^2}{11.232} = 0.7463 + 0.1050 + 2.437 = 3.288$$

Degrees of freedom of the chi-squared statistic = k − m − 1, where
k = number of cells = 3 (the table now has 3 cells, since we merged the cells for (X = 0) and (X = 1)), and
m = number of parameters estimated from sample data = 0.
So: degrees of freedom = 3 − 0 − 1 = 2.

The test is: reject H0 if $\chi^2_{\text{obs}} \ge \chi^2_{\alpha,(2)}$, with α = 0.05.
From tables, $\chi^2_{\alpha,(2)} = \chi^2_{0.05,(2)} = 5.991$.
$\chi^2_{\text{obs}} = 3.288 < 5.991$, so do not reject H0 at the α = 0.05 level of significance. The sample data do not provide evidence against the hypothesis that the number of female children in a three-children family follows a Binomial(n = 3, p = 0.6) distribution.

Question 2

An insurance company has collected data on claim frequency over a period of 365 days.

Number of claims per day    0     1      2
Observed number of days     50    290    25

Apply a suitable test with α = 0.001 to evaluate the hypothesis that the number of claims in one day follows a Poisson distribution. State your null and alternative hypotheses clearly.

We must estimate λ, the parameter of the Poisson distribution, from the sample data. λ is the population mean number of claims in one day, so we estimate it by the mean number of claims per day in the sample:

$$\hat{\lambda} = \frac{\text{Total number of claims}}{\text{Total number of days}} = \frac{(0 \times 50) + (1 \times 290) + (2 \times 25)}{50 + 290 + 25} = \frac{340}{365} = 0.9315$$

Let X = number of claims per day.
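As a rough check of the estimate of λ, the sketch below recomputes the sample mean and the Poisson probabilities it implies for 0, 1 and 2 claims. It uses only the Python standard library; `poisson_pmf` is a helper written for this illustration, not a library function.

```python
from math import exp, factorial

claims = [0, 1, 2]            # number of claims per day
days = [50, 290, 25]          # observed number of days

total_days = sum(days)                                   # 365
total_claims = sum(x * d for x, d in zip(claims, days))  # 340
lam_hat = total_claims / total_days                      # sample mean claims per day


def poisson_pmf(x, lam):
    """Return P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)


probs = [poisson_pmf(x, lam_hat) for x in claims]
remaining = 1 - sum(probs)    # probability left over for 3 or more claims

print(f"lambda_hat = {lam_hat:.4f}")
for x, q in zip(claims, probs):
    print(f"P(X = {x}) = {q:.4f}")
print(f"P(X >= 3) = {remaining:.5f}")
```

The three probabilities do not sum to 1, because a Poisson variable can take any non-negative integer value; the remainder belongs to the values 3 and above, none of which were observed.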
The hypotheses to be tested are:

H0 : X ~ Poisson(λ = 0.9315)
H1 : X does not have a Poisson(λ = 0.9315) distribution

Note that we must add a cell for X ≥ 3 to the table of observed and expected counts, to make the cell probabilities sum to 1. Under H0, the cell probabilities are

$$p_{i0} = \frac{e^{-\lambda} \lambda^x}{x!} \ \text{for } x = 0, 1, 2, \qquad P(X \ge 3) = 1 - \sum_{x=0}^{2} \frac{e^{-\lambda} \lambda^x}{x!},$$

with λ = 0.9315, and the expected number of days in each cell is 365 × $p_{i0}$. The expected counts below are computed from the unrounded probabilities, so they differ slightly from the products of the rounded values shown.

X = x    Observed number of days    Cell probability under H0    Expected number of days      (Expected − Observed)² / Expected
-----    -----------------------    -------------------------    -------------------------    ---------------------------------
X = 0     50                        0.3940                       365 × 0.3940  = 143.795      (143.795 − 50)² / 143.795  =  61.181
X = 1    290                        0.3670                       365 × 0.3670  = 133.946      (133.946 − 290)² / 133.946 = 181.810
X = 2     25                        0.1709                       365 × 0.1709  =  62.386      (62.386 − 25)² / 62.386    =  22.404
X ≥ 3      0                        0.06814                      365 × 0.06814 =  24.873      (24.873 − 0)² / 24.873     =  24.873
Total    n = 365                    1                            365                          290.268

$$\chi^2_{\text{obs}} = \sum_{\text{all cells}} \frac{(\text{Expected} - \text{Observed})^2}{\text{Expected}} = 61.181 + 181.810 + 22.404 + 24.873 = 290.268$$

Degrees of freedom of the chi-squared statistic = k − m − 1, where
k = number of cells = 4, and
m = number of parameters estimated from sample data = 1.
So: degrees of freedom = 4 − 1 − 1 = 2.

The test is: reject H0 if $\chi^2_{\text{obs}} \ge \chi^2_{\alpha,(2)}$, with α = 0.001.
From tables, $\chi^2_{\alpha,(2)} = \chi^2_{0.001,(2)} = 13.815$.
$\chi^2_{\text{obs}} = 290.268 \ge 13.815$, so reject H0 at the α = 0.001 level of significance. The sample data provide evidence that the number of claims per day does not follow a Poisson(λ = 0.9315) distribution.

Question 3

Sixty captured specimens of sharks are classified according to species (Great White, Tiger or Hammerhead) and the presence or absence of bacterial skin infections. The table below shows the number of sharks observed in each category.

                Great White Sharks    Tiger Sharks    Hammerhead Sharks
No Infection    14                    9               17
Infection        6                    6                8

A marine biologist wants to investigate if there is a relationship between the attributes of species and presence of infection.

a) What is the appropriate test for this situation (ANOVA, goodness-of-fit, test of independence or test of homogeneity)? Explain briefly.

The appropriate test for this situation is a test of independence.
The marine biologist wants to investigate if there is a relationship between the attributes of species and presence of infection, i.e. if these attributes are dependent or independent.

One way of recognizing whether a situation requires a test of independence or a test of homogeneity is to look at the sampling method. If h samples of pre-determined sizes were taken from h populations, and each individual in each of the h samples is classified into exactly one of k categories, then this usually requires a test of homogeneity. If, however, one sample is taken from one population, and each individual in the sample is classified according to two attributes, with h categories of the first attribute and k categories of the second attribute, then this usually requires a test of independence.

b) Write down the hypotheses being tested by the test statistic.

H0 : The attributes of species and presence of infection are independent.
H1 : The attributes of species and presence of infection are not independent.

OR

H0 : The rows and columns are independent.
H1 : The rows and columns are not independent.

OR

H0 : $p_{ij} = p_{i\cdot} \times p_{\cdot j}$ ; i = 1 … h, j = 1 … k
H1 : At least one $p_{ij} \ne p_{i\cdot} \times p_{\cdot j}$

where
$p_{ij}$ = population proportion of individuals in the (i, j)th cell, i.e. in the ith category of the first attribute and the jth category of the second attribute,
$p_{i\cdot}$ = population proportion of individuals in the ith row, i.e. in the ith category of the first attribute,
$p_{\cdot j}$ = population proportion of individuals in the jth column, i.e. in the jth category of the second attribute.

c) Derive an expression for the expected count $e_{ij}$ in cell (i, j) under the assumption of the null hypothesis.

Using the last set of hypotheses above:

H0 : $p_{ij} = p_{i\cdot} \times p_{\cdot j}$ ; i = 1 … h, j = 1 … k
H1 : At least one $p_{ij} \ne p_{i\cdot} \times p_{\cdot j}$

Let the total number of observations be n.

Estimate $p_{i\cdot}$, the population proportion of individuals in the ith row, by the sample proportion of individuals in the ith row:

$$\hat{p}_{i\cdot} = \frac{\text{Number of individuals in the } i\text{th row}}{\text{Total number of individuals in the sample}} = \frac{n_{i\cdot}}{n}$$

Estimate $p_{\cdot j}$, the population proportion of individuals in the jth column, by the sample proportion of individuals in the jth column:

$$\hat{p}_{\cdot j} = \frac{\text{Number of individuals in the } j\text{th column}}{\text{Total number of individuals in the sample}} = \frac{n_{\cdot j}}{n}$$

Under the assumption of the null hypothesis, the expected count in cell (i, j) is $n p_{ij} = n \times p_{i\cdot} \times p_{\cdot j}$. So an estimate of the expected count in cell (i, j) is

$$e_{ij} = n \times \hat{p}_{i\cdot} \times \hat{p}_{\cdot j} = n \times \frac{n_{i\cdot}}{n} \times \frac{n_{\cdot j}}{n} = \frac{n_{i\cdot} \, n_{\cdot j}}{n}$$

d) Find the value of the test statistic and the number of degrees of freedom.

Observed counts, $n_{ij}$:

                  Great White Sharks    Tiger Sharks    Hammerhead Sharks    Row totals
No Infection      14                     9              17                   40
Infection          6                     6               8                   20
Column totals     20                    15              25                   n = 60

Expected counts, $e_{ij} = n_{i\cdot} n_{\cdot j} / n$:

                          Great White Sharks    Tiger Sharks    Hammerhead Sharks    Row totals (check)
No Infection              13.333                10              16.667               40
Infection                  6.667                 5               8.333               20
Column totals (check)     20                    15              25                   n = 60

(Expected − Observed)² / Expected:

                Great White Sharks    Tiger Sharks    Hammerhead Sharks
No Infection    0.0333                0.1             0.00667
Infection       0.0667                0.2             0.0133

$$\chi^2_{\text{obs}} = \sum_{\text{all cells}} \frac{(\text{Expected} - \text{Observed})^2}{\text{Expected}} = 0.42$$

Degrees of freedom = (h − 1)(k − 1) = (3 − 1)(2 − 1) = 2 × 1 = 2

e) Find the critical value(s) for this test, using a significance level of α = 0.05.

From tables, the critical value is $\chi^2_{\alpha,(2)} = \chi^2_{0.05,(2)} = 5.991$.

f) What do you conclude from this test?

The test is: reject H0 if $\chi^2_{\text{obs}} \ge \chi^2_{\alpha,(2)}$.
Since $\chi^2_{\text{obs}} = 0.42 < 5.991 = \chi^2_{0.05,(2)}$, do not reject H0. The sample data do not provide evidence of a relationship between the attributes of species and presence of infection.

Question 4

A researcher wishes to test whether the proportion of university students who own cars is the same in three different faculties of the University of the West Indies, St. Augustine. She randomly selects 150 students from each of the three faculties and records the number that own cars.
The results are shown below:

                          Own a car    Don't own a car
Science and Technology    36           114
Engineering               23           127
Social Sciences           18           132

For this data, $\sum \frac{(O - E)^2}{E} = 8.116$. Note that since you are given this value, you do not have to manually calculate expected counts or the chi-squared test statistic.

a) What is the appropriate test for this situation (ANOVA, goodness-of-fit, test of independence or test of homogeneity)? Explain briefly.

The appropriate test for this situation is a test of homogeneity. The researcher wants to investigate if the proportion of university students who own cars is the same in three different faculties, i.e. if these three populations of students are homogeneous with respect to car ownership.

Notice that h = 3 samples of pre-determined sizes were taken from h = 3 populations, and each individual in each of the h = 3 samples is classified into exactly one of k = 2 categories (own a car / don't own a car). This type of sampling usually requires a test of homogeneity.

b) What is the distribution of the test statistic? How many degrees of freedom does it have?

The test statistic has a chi-squared distribution.
Number of rows in the contingency table, h = 3.
Number of columns in the contingency table, k = 2.
Degrees of freedom = (h − 1) × (k − 1) = 2 × 1 = 2

c) Write down the hypotheses being tested by the test statistic.

H0 : The proportion of university students who own cars is the same in all three faculties.
H1 : The proportion of university students who own cars in at least one faculty is different from the other faculties.

OR

H0 : The distribution of car ownership among students is the same in all three faculties.
H1 : The distribution of car ownership among students in at least one faculty is different from the other faculties.

d) Find the critical value(s) for this test, using a significance level of α = 0.10.

From tables, the critical value is $\chi^2_{\alpha,(2)} = \chi^2_{0.10,(2)} = 4.605$.

e) What do you conclude from this test?
The chi-squared statistic is $\chi^2_{\text{obs}} = \sum \frac{(O - E)^2}{E} = 8.116$.
Since $\chi^2_{\text{obs}} = 8.116 \ge 4.605 = \chi^2_{0.10,(2)}$, reject H0. The sample data provide evidence that the proportion of university students who own cars in at least one faculty is different from the other faculties.
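Although the statistic is given in the question, it can be reproduced from the table as a check. The sketch below uses only the Python standard library, and the variable names are illustrative. It relies on one shortcut that holds only for 2 degrees of freedom: in that case P(χ² ≥ x) = exp(−x/2), so the critical value is −2 ln(α). For other degrees of freedom, tables or a statistics library would be needed.

```python
from math import log

# Rows: faculties; columns: (own a car, don't own a car)
observed = {
    "Science and Technology": (36, 114),
    "Engineering": (23, 127),
    "Social Sciences": (18, 132),
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]            # 150 per faculty
col_totals = [sum(c) for c in zip(*rows)]      # totals for each ownership category
n = sum(row_totals)

# Expected count in each cell is (row total * column total) / n
chi2_obs = sum(
    (o - rt * ct / n) ** 2 / (rt * ct / n)
    for r, rt in zip(rows, row_totals)
    for o, ct in zip(r, col_totals)
)
df = (len(rows) - 1) * (len(col_totals) - 1)

# Valid for df = 2 only: the upper-tail probability is exp(-x / 2)
alpha = 0.10
critical = -2 * log(alpha)
reject = chi2_obs >= critical

print(f"chi2_obs = {chi2_obs:.3f}, df = {df}, critical = {critical:.3f}, reject H0: {reject}")
```

The same expected-count formula, (row total × column total) / n, is used for both the test of homogeneity and the test of independence; only the sampling scheme and the wording of the hypotheses differ.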