The Chi-Square Distribution

χ² has 3 main uses:
1. Goodness of fit tests
2. Independence tests
3. One-sample variance tests

The chi-square distribution is an asymmetric (right-skewed) distribution with one parameter, v = degrees of freedom (df). If you square a standard Normal Z, you get a χ² with 1 degree of freedom. A chi-square variable is never negative, and the distribution is continuous. Most chi-square curves look like right-skewed Normal curves. The mean of a chi-square is v and the standard deviation is √(2v).

There is a χ²cdf on your calculator. It takes 3 inputs (min, max, df). The min should always be 0, which is the minimum of any χ² distribution. You can then find the right-tail probability (usually the p-value) by subtracting from 1.

Ex. If X ~ χ² with df = 9, then P(X ≤ 10) = 0.6495, so P(X > 10) = 0.3505.

Ex. If X ~ χ² with df = 8, what is the probability that X is within 1 standard deviation of its mean?

Goodness of fit:

Ex. You pick 20 cards from a standard deck with replacement and count the number of each suit. You count 8 spades, 6 hearts, 4 clubs and 2 diamonds. Do you think that the selection was random from a standard deck? What would you expect? 5 of each suit. How different is this? Is there evidence to conclude that this would be very unlikely to happen by chance?

Ho: The suits are uniformly distributed (each probability = ¼)
Ha: The suits are not uniformly distributed (at least one probability ≠ ¼)

Note that α is not given, so assume that α = 0.05.

The test statistic:

$$TS = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where n is the number of categories, O is the observed frequency in each cell, and E is the expected frequency in each cell (if Ho is true). If Ho is true, then TS ~ χ² with df = n − 1. The p-value = P(χ² > TS).

In the above example: (Show table on board)
TS = (8 − 5)²/5 + (6 − 5)²/5 + (4 − 5)²/5 + (2 − 5)²/5 = 4, n = 4 so df = 3, and the p-value = 0.261.
Do not reject Ho. There is not sufficient evidence to support the claim that the selection was not random from a standard deck.

Exercise 11.11.2 A 6-sided die is rolled 120 times.
Fill in the expected frequency column. Then, conduct a hypothesis test to determine if the die is fair. The data below are the result of the 120 rolls.

Face Value   Frequency   Expected Frequency
1            15
2            29
3            16
4            15
5            30
6            15

Exercise 11.11.6 The City of South Lake Tahoe, CA, has an Asian population of 1419 people, out of a total population of 23,609 (Source: U.S. Census Bureau, Census 2000). Suppose that a survey of 1419 self-reported Asians in the Manhattan, NY, area yielded the data in the table below. Conduct a goodness of fit test to determine if the self-reported sub-groups of Asians in the Manhattan area fit that of the Lake Tahoe area.

Race           Lake Tahoe Frequency   Manhattan Frequency
Asian Indian   131                    174
Chinese        118                    557
Filipino       1045                   518
Japanese       80                     54
Korean         12                     29
Vietnamese     9                      21
Other          24                     66

Testing to determine if two categorical variables are independent

You have 2 categorical variables and you want to test if they are associated (NOT independent). Make a contingency table. You will have a row variable and a column variable, and the frequencies go in the table.

Ho: The row variable and column variable are independent
Ha: The row variable and column variable are not independent

The test statistic:

$$TS = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

The O's are your observed frequencies. The E's are your expected frequencies, assuming that Ho is true:

$$E_{ij} = \frac{(i\text{th row total}) \times (j\text{th column total})}{\text{Grand total}}$$

Degrees of freedom = (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
The p-value = P(χ² > TS).
The more dependent your variables are, the bigger the differences between O and E, and as a result the bigger your TS will be.
The chi-square approximation will be accurate when all the E's ≥ 5.

Ex. A study was conducted at a large university to see if age and drink preference were associated. Some abridged results are shown below. Is there evidence at the 5% significance level to support the claim that age and drink preference are associated?
Observed   Under 19   19 and older   Total
Soda       40         20             60
Coffee     15         25             40
Total      55         45             100

Expected   Under 19   19 and older   Total
Soda       33         27             60
Coffee     22         18             40
Total      55         45             100

TS = 8.249
p-value = 0.004
Reject Ho. There is sufficient evidence at the 5% significance level to support the claim that age and drink preference are associated.
Degrees of freedom? (2 − 1)(2 − 1) = 1.

NOTE: Calculator instructions follow.
TI-83+ and TI-84 calculator:
Press the MATRX key and arrow over to EDIT. Press 1:[A].
Press 2 ENTER 2 ENTER (the table has 2 rows and 2 columns).
Enter the table values by row. Press ENTER after each.
Press STAT and arrow over to TESTS. Arrow down to C:χ²-Test. Press ENTER.
You should see Observed:[A] and Expected:[B]. Arrow down to Calculate. Press ENTER.

Exercise 11.11.10 Car manufacturers are interested in whether there is a relationship between the size of car an individual drives and the number of people in the driver's family (that is, whether car size and family size are independent). To test this, suppose that 800 car owners were randomly surveyed with the following results. Conduct a test for independence.

Family Size     1    2    3-4   5+
Sub & Compact   20   20   20    20
Mid-size        35   50   50    30
Full-size       40   70   100   70
Van & Truck     35   80   90    70

Statistical Inference for a Population Variance

The quantity (n − 1)s²/σ² has a chi-square distribution when the population from which the sample is taken is approximately Normally distributed. The chi-square distribution is a non-symmetric continuous distribution. It has one parameter, v, the degrees of freedom; here v = n − 1. A χ² variable is never negative. There are tables, but they are not very useful. There is a test on your calculator labeled χ²-Test, but it is not the right test here (it is the test of independence). You do have a χ²cdf on your calculator, which will allow you to find p-values for hypothesis tests (HTs). There are 3 forms for the HTs for σ²:
Hypotheses                      p-value
Ho: σ² ≥ σ₀²   Ha: σ² < σ₀²     P(χ² < TS)
Ho: σ² ≤ σ₀²   Ha: σ² > σ₀²     P(χ² > TS)
Ho: σ² = σ₀²   Ha: σ² ≠ σ₀²     2 · min{P(χ² < TS), P(χ² > TS)}
(for the two-sided test, the p-value is twice the smaller tail probability)

TS = (n − 1)s²/σ₀².

To find a (1 − α)100% CI for σ²:

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$

where the subscript is the area to the right of the critical value.

Recall, to get the p-values on the TI-83/84, under DISTR, go to χ²cdf, which takes 3 inputs: min, max, df. The min will usually be 0.
Ex. P(χ² < 30.1435) = 0.95 for χ² with 19 df, so P(χ² > 30.1435) = 0.05.

Ex. Test that the population variance is different from 225 (standard deviation is different from 15) at the 5% significance level, when you have a random sample of 7 with mean 123 and standard deviation 9 from a Normal population.

Ho: σ² = 225
Ha: σ² ≠ 225
n = 7 so v = 6
TS = 6 × 81 / 225 = 2.16
P(χ² > 2.16) = 0.9044, so the smaller tail is P(χ² < 2.16) = 1 − 0.9044 = 0.0956.
p-value = 2(0.0956) = 0.1912
Do not reject Ho. There is not sufficient evidence to support the claim that the variance differs from 225 or that the standard deviation differs from 15.

95% CI for σ²: χ²(α/2) = χ²(0.025) = 14.449 and χ²(1 − α/2) = χ²(0.975) = 1.237 with 6 df, so 486/14.449 < σ² < 486/1.237, or about 33.6 < σ² < 392.9.

Chapter 11 The Chi-Square Distribution

Recap: Z² = χ² with 1 df.
P(|Z| > 1.96) = P(Z < −1.96) + P(Z > +1.96) = 0.05, so P(χ² > 3.8416) = 0.05.

1. Goodness of fit tests
Ho: Good fit
Ha: Not a good fit
The test statistic:

$$TS = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Degrees of freedom = n − 1, where n is the number of categories.
The p-value = P(χ² > TS).
The chi-square approximation will be accurate when all the E's ≥ 5.

2. Independence tests
Ho: The row variable and column variable are independent
Ha: The row variable and column variable are not independent
The test statistic:

$$TS = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Degrees of freedom = (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
The p-value = P(χ² > TS).
The chi-square approximation will be accurate when all the E's ≥ 5.

3. One-sample variance tests
Ho: σ² = σ₀²
Ha: σ² ≠ σ₀²
The test statistic:

$$TS = \frac{(n-1)s^2}{\sigma_0^2}$$

Degrees of freedom = n − 1, where n is the sample size.
The p-value = 2 · min{P(χ² < TS), P(χ² > TS)} (twice the smaller tail probability).
The chi-square distribution applies when you have a random sample from a Normal population.

100(1 − α)% Confidence Interval:

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$

Examples:

1. FDA regulations require that the standard deviation of a 16-ounce soda can should be less than 0.1 ounce. A random sample of 10 cans is taken from a large Normal population of 16-ounce cans. The sample has a mean of 15.8 ounces and a standard deviation of 0.08 ounces. Does this provide evidence that the variability is as small as desired? Justify using a hypothesis test at the 5% significance level.

2. JAMA published a study on alcohol consumption in patients suffering from myocardial infarction. The data are given below. Does this provide evidence that congestive heart failure (CHF) depends on alcohol consumption? Justify!

Alc. Cons.   CHF Yes   CHF No
Abstain      146       750
Less7        106       590
7More        29        292

3. An article in Chance showed the results of an Ancient Greek excavation. There were 837 pieces of pottery found. Test whether or not one type of pottery is more likely to be found than the others.

Pot Category   Burnished   Monochrome   Painted   Other   Total
Number found   133         460          183       61      837

Chapter 11 Homework: 1, 3, 5, 9, 11, 13, 15, 17, 21, 23
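
Checking the worked examples without a calculator: the short Python sketch below recomputes the numbers from the worked examples in these notes (the df = 9 tail probability, the card goodness of fit test, the drink-preference independence test, and the tail probability used in the variance test). It is only a sketch under stated assumptions. The helper names (chi2_sf, chi2_cdf, chi2_stat, expected_counts) are made up for illustration and are not part of the notes or of any library. The tail-probability function uses the finite-sum formulas for a chi-square with whole-number df, so it will not work for fractional df. chi2_cdf takes the same three inputs as the calculator's χ²cdf (min, max, df).

```python
import math


def chi2_sf(x, df):
    """Right-tail probability P(X > x) for a chi-square with whole-number df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^r / r!.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided Normal tail plus a finite correction sum.
    z = math.sqrt(x)
    normal_tails = math.erfc(z / math.sqrt(2))       # P(|Z| > z)
    normal_pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = z, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)                   # z^(2r-1) / (1*3*...*(2r-1))
        total += term
    return normal_tails + 2 * normal_pdf * total


def chi2_cdf(low, high, df):
    """P(low < X < high), same inputs as the calculator's (min, max, df)."""
    return chi2_sf(low, df) - chi2_sf(high, df)


def chi2_stat(observed, expected):
    """Sum of (O - E)^2 / E over matching cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


# Tail probability example: X ~ chi-square with df = 9.
p_tail = 1 - chi2_cdf(0, 10, 9)                       # notes give 0.3505

# Goodness of fit: 20 cards, 5 expected per suit, df = 4 - 1 = 3.
cards = [8, 6, 4, 2]
ts_cards = chi2_stat(cards, [5] * 4)                  # notes give TS = 4
p_cards = chi2_sf(ts_cards, 3)                        # notes give 0.261

# Independence: rows are Soda, Coffee; columns are Under 19, 19 and older.
drinks = [[40, 20], [15, 25]]
drinks_expected = expected_counts(drinks)
ts_drinks = sum(chi2_stat(o, e) for o, e in zip(drinks, drinks_expected))
p_drinks = chi2_sf(ts_drinks, 1)                      # df = (2 - 1)(2 - 1) = 1

# Variance test: n = 7, s = 9, hypothesized variance 225, df = 6.
ts_var = (7 - 1) * 9 ** 2 / 225
upper_tail = chi2_sf(ts_var, 6)                       # notes give 0.9044
lower_tail = 1 - upper_tail

print(f"P(X > 10), df = 9:   {p_tail:.4f}")
print(f"Cards:  TS = {ts_cards:.3f}, p = {p_cards:.3f}")
print(f"Drinks: TS = {ts_drinks:.3f}, p = {p_drinks:.3f}")
print(f"Variance: TS = {ts_var:.2f}, upper tail = {upper_tail:.4f}, lower tail = {lower_tail:.4f}")
```

The same helpers can be reused on the exercises by swapping in the observed counts; for a goodness of fit exercise the expected counts are the sample size times the hypothesized proportions.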