Chapter 10. Discrete Data Analysis
NIPRL

10.1 Inferences on a Population Proportion
10.2 Comparing Two Population Proportions
10.3 Goodness of Fit Tests for One-Way Contingency Tables
10.4 Testing for Independence in Two-Way Contingency Tables
10.5 Supplementary Problems

10.1 Inferences on a Population Proportion

· A proportion $p$ of the population has the characteristic of interest. In a random sample of size $n$, the number of observations with the characteristic is $X \sim B(n, p)$, and the sample proportion is $\hat{p} = x/n$.

                      With characteristic    Without characteristic
  Cell probability    p                      1 - p
  Cell frequency      x                      n - x

· $E(\hat{p}) = p$ and $\mathrm{Var}(\hat{p}) = \dfrac{p(1-p)}{n}$.
· For large $n$,
$$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \quad\text{and}\quad \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim N(0, 1).$$

10.1.1 Confidence Intervals for Population Proportions

· Since $\dfrac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim N(0, 1)$ for large $n$, a two-sided $1-\alpha$ confidence interval is
$$p \in \left(\hat{p} - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right),$$
where $\hat{p} = x/n$ and $\mathrm{s.e.}(\hat{p}) = \sqrt{p(1-p)/n}$.
· Because $p$ is unknown, it is customary to use the estimate
$$\mathrm{s.e.}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \frac{1}{n}\sqrt{\frac{x(n-x)}{n}}.$$
· Two-Sided Confidence Intervals for a Population Proportion
If $X \sim B(n, p)$, an approximate two-sided $1-\alpha$ confidence interval for $p$ is
$$p \in \left(\hat{p} - \frac{z_{\alpha/2}}{n}\sqrt{\frac{x(n-x)}{n}},\ \hat{p} + \frac{z_{\alpha/2}}{n}\sqrt{\frac{x(n-x)}{n}}\right).$$
The approximation is reasonable as long as both $x$ and $n - x$ are larger than 5.

Example 55 : Building Tile Cracks
A random sample of $n = 1250$ tiles from a certain group of downtown buildings is examined for cracking, and $x = 98$ of them are found to be cracked.
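The two-sided interval formula above can be evaluated directly for the tile counts just stated. The following is a minimal Python sketch using only the standard library; the function name `proportion_ci` and its `conf` parameter are illustrative choices, not part of the slides.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, conf=0.99):
    """Approximate two-sided confidence interval for a binomial proportion.

    Hypothetical helper: x successes out of n trials, confidence level conf.
    """
    if min(x, n - x) <= 5:
        # The slides require both x and n - x to be larger than 5.
        raise ValueError("normal approximation needs x and n - x larger than 5")
    alpha = 1.0 - conf
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical point z_{alpha/2}
    p_hat = x / n
    half_width = (z / n) * sqrt(x * (n - x) / n)
    return p_hat - half_width, p_hat + half_width


# Tile data from Example 55: 98 cracked tiles in a sample of 1250.
lower, upper = proportion_ci(98, 1250, conf=0.99)
print(f"99% CI for p: ({lower:.4f}, {upper:.4f})")
```

The critical point is taken from `NormalDist().inv_cdf` rather than from a table, so it differs from the tabulated 2.576 only beyond the third decimal place.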
$\hat{p} = \dfrac{x}{n} = \dfrac{98}{1250} = 0.0784$ and $z_{0.005} = 2.576$.

The 99% two-sided confidence interval is
$$p \in \left(\hat{p} - \frac{z_{\alpha/2}}{n}\sqrt{\frac{x(n-x)}{n}},\ \hat{p} + \frac{z_{\alpha/2}}{n}\sqrt{\frac{x(n-x)}{n}}\right)$$
$$= \left(0.0784 - \frac{2.576}{1250}\sqrt{\frac{98 \times (1250-98)}{1250}},\ 0.0784 + \frac{2.576}{1250}\sqrt{\frac{98 \times (1250-98)}{1250}}\right)$$
$$= (0.0588, 0.0980) \approx (0.06, 0.10).$$
With 5,000,000 tiles in total, the total number of cracked tiles is between $0.06 \times 5{,}000{,}000 = 300{,}000$ and $0.10 \times 5{,}000{,}000 = 500{,}000$.

· Since $\dfrac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim N(0, 1)$ for large $n$, a one-sided $1-\alpha$ confidence interval giving a lower bound is
$$p \in \left(\hat{p} - z_{\alpha}\sqrt{\frac{p(1-p)}{n}},\ 1\right).$$
· One-Sided Confidence Intervals for a Population Proportion
If $X \sim B(n, p)$, approximate one-sided $1-\alpha$ confidence intervals for $p$ are
$$p \in \left(\hat{p} - \frac{z_{\alpha}}{n}\sqrt{\frac{x(n-x)}{n}},\ 1\right) \quad\text{and}\quad p \in \left(0,\ \hat{p} + \frac{z_{\alpha}}{n}\sqrt{\frac{x(n-x)}{n}}\right).$$
The approximation is reasonable as long as both $x$ and $n - x$ are larger than 5.

10.1.2 Hypothesis Tests on a Population Proportion

· Two-Sided Hypothesis Tests
For $H_0 : p = p_0$ versus $H_A : p \neq p_0$,
  p-value $= 2 \times P(X \geq x)$ if $\hat{p} = x/n > p_0$,
  p-value $= 2 \times P(X \leq x)$ if $\hat{p} = x/n < p_0$,
where $X \sim B(n, p_0)$.
· When $np_0$ and $n(1-p_0)$ are both larger than 5, a normal approximation can be used to give the p-value
$$\text{p-value} = 2 \times \Phi(-|z|),$$
where $\Phi(\cdot)$ is the standard normal c.d.f. and
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{x - np_0}{\sqrt{np_0(1-p_0)}}.$$
· Continuity Correction
To improve the normal approximation, the numerator of the $z$-statistic may be replaced by
  $x - np_0 - 0.5$ when $x - np_0 > 0.5$,
  $x - np_0 + 0.5$ when $x - np_0 < -0.5$.
· A size $\alpha$ hypothesis test accepts $H_0$ when $|z| \leq z_{\alpha/2}$ (equivalently, p-value $\geq \alpha$) and rejects $H_0$ when $|z| > z_{\alpha/2}$ (equivalently, p-value $< \alpha$).

· One-Sided Hypothesis Tests for a Population Proportion
For $H_0 : p \geq p_0$ versus $H_A : p < p_0$,
  p-value $= P(X \leq x)$, where $X \sim B(n, p_0)$.
- The normal approximation to this is p-value $= \Phi(z)$ with
$$z = \frac{x - np_0 + 0.5}{\sqrt{np_0(1-p_0)}}.$$
- Accept $H_0$ if $z \geq -z_{\alpha}$ and reject $H_0$ if $z < -z_{\alpha}$.
· For $H_0 : p \leq p_0$ versus $H_A : p > p_0$,
  p-value $= P(X \geq x)$, where $X \sim B(n, p_0)$.
- The normal approximation to this is p-value $= 1 - \Phi(z)$ with
$$z = \frac{x - np_0 - 0.5}{\sqrt{np_0(1-p_0)}}.$$
- Accept $H_0$ if $z \leq z_{\alpha}$ and reject $H_0$ if $z > z_{\alpha}$.

Example 55 : Building Tile Cracks
Are 10% or more of the building tiles cracked?
$H_0 : p \geq 0.1$ versus $H_A : p < 0.1$, with $n = 1250$, $x = 98$, $p_0 = 0.1$.
$$z = \frac{x - np_0 + 0.5}{\sqrt{np_0(1-p_0)}} = \frac{98 - 125 + 0.5}{\sqrt{1250 \times 0.1 \times 0.9}} = -2.50$$
p-value $= \Phi(-2.50) = 0.0062$.
[Figure: standard normal density with $z = -2.50$ marked]

10.1.3 Sample Size Calculations

· The most convenient way to assess the precision afforded by a sample size $n$ is to consider the length $L$ of the two-sided confidence interval for $p$:
$$L = 2 z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \quad\Rightarrow\quad n = \frac{4 z_{\alpha/2}^2\, \hat{p}(1-\hat{p})}{L^2}.$$
· Since $\hat{p}(1-\hat{p})$ is unknown before sampling, we can take its largest possible value, $1/4$ at $\hat{p} = 1/2$, which gives
$$n = \frac{z_{\alpha/2}^2}{L^2}.$$
· However, if $\hat{p}$ is expected to be far from 0.5, we can take $\hat{p} = p^*$ from prior information on $p$ (with either $p^* < 0.5$ or $p^* > 0.5$), and
$$n \simeq \frac{4 z_{\alpha/2}^2\, p^*(1-p^*)}{L^2}.$$

Example 59 : Political Polling
The goal is to determine the proportion $p$ of people who agree with the statement "The city mayor is doing a good job." to within 3% accuracy (from -3% to +3%). How many people need to be polled?
Construct a 99% confidence interval for $p$ with a length no larger than $L = 6\%$, that is, $(\hat{p} - 3\%, \hat{p} + 3\%)$:
$$n = \frac{z_{\alpha/2}^2}{L^2} = \frac{2.576^2}{0.06^2} = 1843.3 \quad (\alpha = 0.01)$$
$\Rightarrow$ a representative sample of at least 1844 people is needed.
If a poll reports "1000 respondents, ±3% sampling error", then
$$z_{\alpha/2} = \sqrt{n}\, L = \sqrt{1000} \times 0.06 = 1.90, \quad \Phi(-1.90) = 0.0287$$
$\Rightarrow \alpha \simeq 0.057$, which corresponds to roughly a 95% confidence interval.

Example 55 : Building Tile Cracks
Recall that a 99% confidence interval for $p$ is $(0.0588, 0.0980)$.
We need to know $p$ to within 1% with 99% confidence ($L = 2\%$).
Based on the above interval, let $p^* = 0.1 \leq 0.5$ (roughly the upper end of the interval).
$$\Rightarrow\quad n \simeq \frac{4 z_{\alpha/2}^2\, p^*(1-p^*)}{L^2} = \frac{4 \times 2.576^2 \times 0.1 \times 0.9}{0.02^2} = 5972.2, \text{ or about } 6000.$$
$\Rightarrow$ an additional representative sample of 4750 tiles is needed (the initial sample had 1250 tiles).
After the secondary sampling, $n = 6000$ and $x = 98 + 308 = 406$, so $\hat{p} = \dfrac{406}{6000} = 0.0677$ and
$$p \in \left(\hat{p} - \frac{z_{\alpha/2}}{n}\sqrt{\frac{x(n-x)}{n}},\ \hat{p} + \frac{z_{\alpha/2}}{n}\sqrt{\frac{x(n-x)}{n}}\right) = (0.0593, 0.0761).$$

10.2 Comparing Two Population Proportions

· Comparing two population proportions
Assume $X \sim B(n, p_A)$ and $Y \sim B(m, p_B)$, with $X$ and $Y$ independent. Then
$$\hat{p}_A = \frac{x}{n} \ \text{ and } \ \hat{p}_B = \frac{y}{m} \quad\Rightarrow\quad \hat{p}_A - \hat{p}_B = \frac{x}{n} - \frac{y}{m},$$
$$\mathrm{Var}(\hat{p}_A - \hat{p}_B) = \mathrm{Var}(\hat{p}_A) + \mathrm{Var}(\hat{p}_B) = \frac{p_A(1-p_A)}{n} + \frac{p_B(1-p_B)}{m},$$
$$z = \frac{(\hat{p}_A - \hat{p}_B) - (p_A - p_B)}{\sqrt{\dfrac{\hat{p}_A(1-\hat{p}_A)}{n} + \dfrac{\hat{p}_B(1-\hat{p}_B)}{m}}} = \frac{(\hat{p}_A - \hat{p}_B) - (p_A - p_B)}{\sqrt{\dfrac{x(n-x)}{n^3} + \dfrac{y(m-y)}{m^3}}}.$$
· If $H_0 : p_A = p_B$, the pooled estimate is
$$\hat{p} = \frac{x+y}{n+m} \quad\text{and}\quad z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n} + \dfrac{1}{m}\right)}}.$$

10.2.1 Confidence Intervals for the Difference Between Two Population Proportions

· Assume $X \sim B(n, p_A)$ and $Y \sim B(m, p_B)$, with $X$ and $Y$ independent. An approximate two-sided $1-\alpha$ confidence interval is
$$p_A - p_B \in \left(\hat{p}_A - \hat{p}_B - z_{\alpha/2}\sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n} + \frac{\hat{p}_B(1-\hat{p}_B)}{m}},\ \hat{p}_A - \hat{p}_B + z_{\alpha/2}\sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n} + \frac{\hat{p}_B(1-\hat{p}_B)}{m}}\right)$$
$$= \left(\hat{p}_A - \hat{p}_B - z_{\alpha/2}\sqrt{\frac{x(n-x)}{n^3} + \frac{y(m-y)}{m^3}},\ \hat{p}_A - \hat{p}_B + z_{\alpha/2}\sqrt{\frac{x(n-x)}{n^3} + \frac{y(m-y)}{m^3}}\right).$$
· Approximate one-sided $1-\alpha$ confidence intervals are
$$p_A - p_B \in \left(\hat{p}_A - \hat{p}_B - z_{\alpha}\sqrt{\frac{x(n-x)}{n^3} + \frac{y(m-y)}{m^3}},\ 1\right)$$
and
$$p_A - p_B \in \left(-1,\ \hat{p}_A - \hat{p}_B + z_{\alpha}\sqrt{\frac{x(n-x)}{n^3} + \frac{y(m-y)}{m^3}}\right).$$
· The approximations are reasonable as long as $x$, $n - x$, $y$ and $m - y$ are all larger than 5.

Example 55 : Building Tile Cracks
Building A : 406 cracked tiles out of $n = 6000$.
Building B : 83 cracked tiles out of $m = 2000$.
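The two-sample interval and the pooled $z$-statistic from Section 10.2 can be evaluated for these building counts. This is a minimal Python sketch under the same normal approximation; the function name `two_proportion_summary` is hypothetical, and applying the pooled two-sided test to the building data is an illustration rather than a step taken in the slides at this point.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_summary(x, n, y, m, conf=0.99):
    """Approximate CI for pA - pB plus the pooled z test of H0: pA = pB.

    Hypothetical helper: x of n have the characteristic in sample A,
    y of m in sample B.
    """
    diff = x / n - y / m
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)  # z_{alpha/2}
    # Unpooled standard error, used for the confidence interval.
    se = sqrt(x * (n - x) / n**3 + y * (m - y) / m**3)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Pooled estimate, used for the test statistic under H0: pA = pB.
    pooled = (x + y) / (n + m)
    z = diff / sqrt(pooled * (1.0 - pooled) * (1.0 / n + 1.0 / m))
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return ci, z, p_value


# Building A: 406 cracked of 6000; Building B: 83 cracked of 2000.
ci, z, p_value = two_proportion_summary(406, 6000, 83, 2000, conf=0.99)
print(f"99% CI for pA - pB: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"pooled z = {z:.2f}, two-sided p-value = {p_value:.2g}")
```

The interval uses the unpooled standard error while the test uses the pooled one, which mirrors the two formulas given in Section 10.2.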
$$\hat{p}_A = \frac{x}{n} = \frac{406}{6000} = 0.0677 \quad\text{and}\quad \hat{p}_B = \frac{y}{m} = \frac{83}{2000} = 0.0415.$$
$$p_A - p_B \in \left(\hat{p}_A - \hat{p}_B - z_{\alpha/2}\sqrt{\frac{x(n-x)}{n^3} + \frac{y(m-y)}{m^3}},\ \hat{p}_A - \hat{p}_B + z_{\alpha/2}\sqrt{\frac{x(n-x)}{n^3} + \frac{y(m-y)}{m^3}}\right) = (0.0120, 0.0404),$$
where $\alpha = 0.01$.
The fact that this confidence interval contains only positive values indicates that $p_A > p_B$.

10.2.2 Hypothesis Tests on the Difference Between Two Population Proportions

· Hypothesis Tests of the Equality of Two Population Proportions
Assume $X \sim B(n, p_A)$ and $Y \sim B(m, p_B)$, with $X$ and $Y$ independent.
· $H_0 : p_A = p_B$ versus $H_A : p_A \neq p_B$ has p-value $= 2 \times \Phi(-|z|)$, where
$$z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n} + \dfrac{1}{m}\right)}} \quad\text{and}\quad \hat{p} = \frac{x+y}{n+m}.$$
- Accept $H_0$ if $|z| \leq z_{\alpha/2}$ and reject $H_0$ if $|z| > z_{\alpha/2}$.
· $H_0 : p_A \geq p_B$ versus $H_A : p_A < p_B$ has p-value $= \Phi(z)$.
- Accept $H_0$ if $z \geq -z_{\alpha}$ and reject $H_0$ if $z < -z_{\alpha}$.
· $H_0 : p_A \leq p_B$ versus $H_A : p_A > p_B$ has p-value $= 1 - \Phi(z)$.
- Accept $H_0$ if $z \leq z_{\alpha}$ and reject $H_0$ if $z > z_{\alpha}$.

Example 59 : Political Polling
Statement: "The city mayor is doing a good job."

                               Agree      Disagree       Random sample size
  Population A (age 18-39)     x = 627    n - x = 325    n = 952
  Population B (age >= 40)     y = 421    m - y = 622    m = 1043

$$\hat{p}_A = \frac{627}{952} = 0.659 \quad\text{and}\quad \hat{p}_B = \frac{421}{1043} = 0.404$$
$H_0 : p_A = p_B$ versus $H_A : p_A \neq p_B$
Pooled estimate: $\hat{p} = \dfrac{627 + 421}{952 + 1043} = 0.525$
$z = 11.39$, p-value $= 2 \times \Phi(-11.39) \simeq 0$
- The two-sided 99% confidence interval is $p_A - p_B \in (0.199, 0.311)$.

Summary problems
(1) Why do we assume large sample sizes for statistical inferences concerning proportions?
So that the normal approximation is a reasonable approach.
(2) Can you find an exact size test concerning proportions?
No, in general.

10.3 Goodness of Fit Tests for One-Way Contingency Tables

10.3.1 One-Way Classifications

· Each of $n$ observations is classified into one of $k$ categories or cells.
  cell frequencies : $x_1, x_2, \ldots, x_k$ with $x_1 + x_2 + \cdots + x_k = n$
  cell probabilities : $p_1, p_2, \ldots, p_k$ with $p_1 + p_2 + \cdots + p_k = 1$
  $x_i \sim B(n, p_i) \ \Rightarrow\ \hat{p}_i = \dfrac{x_i}{n}$
· Hypothesis Test
$H_0 : p_i = p_i^*$, $1 \leq i \leq k$, versus $H_A : H_0$ is false.
The null hypothesis of homogeneity is that $p_i^* = \dfrac{1}{k}$ for $1 \leq i \leq k$.

Example 1 : Machine Breakdowns
$n = 46$ machine breakdowns:
  $x_1 = 9$ : electrical problems
  $x_2 = 24$ : mechanical problems
  $x_3 = 13$ : operator misuse
It is suggested that the probabilities of these three kinds of breakdown are $p_1^* = 0.2$, $p_2^* = 0.5$, $p_3^* = 0.3$.

· Goodness of Fit Tests for One-Way Contingency Tables
Consider a multinomial distribution with $k$ cells, a set of unknown cell probabilities $p_1, p_2, \ldots, p_k$, and a set of observed cell frequencies $x_1, x_2, \ldots, x_k$ with $x_1 + x_2 + \cdots + x_k = n$.
· For $H_0 : p_i = p_i^*$, $1 \leq i \leq k$,
$$\text{p-value} = P(\chi^2_{k-1} \geq X^2) \quad\text{or}\quad \text{p-value} = P(\chi^2_{k-1} \geq G^2),$$
where
$$X^2 = \sum_{i=1}^{k} \frac{(x_i - e_i)^2}{e_i} \quad\text{and}\quad G^2 = 2\sum_{i=1}^{k} x_i \ln\left(\frac{x_i}{e_i}\right), \quad\text{with } e_i = np_i^*,\ 1 \leq i \leq k.$$
The p-values are appropriate as long as $e_i \geq 5$, $1 \leq i \leq k$.
At size $\alpha$, accept $H_0$ if $X^2 \leq \chi^2_{\alpha,k-1}$ (or if $G^2 \leq \chi^2_{\alpha,k-1}$) and reject $H_0$ if $X^2 > \chi^2_{\alpha,k-1}$ (or if $G^2 > \chi^2_{\alpha,k-1}$).

Example 1 : Machine Breakdowns
$H_0 : p_1 = 0.2,\ p_2 = 0.5,\ p_3 = 0.3$

                         Electrical           Mechanical            Operator misuse       Total
  Observed cell freq.    x1 = 9               x2 = 24               x3 = 13               n = 46
  Expected cell freq.    e1 = 46*0.2 = 9.2    e2 = 46*0.5 = 23.0    e3 = 46*0.3 = 13.8    n = 46

$$X^2 = \frac{(9.0 - 9.2)^2}{9.2} + \frac{(24.0 - 23.0)^2}{23.0} + \frac{(13.0 - 13.8)^2}{13.8} = 0.0942$$
$$G^2 = 2 \times \left(9.0 \times \ln\left(\frac{9.0}{9.2}\right) + 24.0 \times \ln\left(\frac{24.0}{23.0}\right) + 13.0 \times \ln\left(\frac{13.0}{13.8}\right)\right) = 0.0945$$
p-value $= P(\chi^2_2 \geq 0.094) \simeq 0.95$ ($k - 1 = 3 - 1 = 2$)

· $H_0$ is plausible, since the p-value $= P(\chi^2_2 \geq 0.094) \simeq 0.95$ is large.
Consider a 95% confidence interval for $p_2$.
$$p_2 \in \left(\frac{24}{46} - \frac{1.96}{46}\sqrt{\frac{24 \times (46-24)}{46}},\ \frac{24}{46} + \frac{1.96}{46}\sqrt{\frac{24 \times (46-24)}{46}}\right) = (0.378, 0.666)$$
· Check for homogeneity.
Let $p_1 = p_2 = p_3 = \frac{1}{3}$. Then $e_1 = e_2 = e_3 = \frac{46}{3} = 15.33$, and
$$X^2 = \frac{(9.0 - 15.33)^2}{15.33} + \frac{(24.0 - 15.33)^2}{15.33} + \frac{(13.0 - 15.33)^2}{15.33} = 7.87$$
p-value $= P(\chi^2_2 \geq 7.87) \simeq 0.02$
$\Rightarrow$ Homogeneity is not plausible. Also notice that $\frac{1}{3} \notin (0.378, 0.666)$.

10.3.2 Testing Distributional Assumptions

Example 3 : Software Errors
Because some of the expected values are smaller than 5, it is appropriate to group the cells.
Test whether the data are from a Poisson distribution with mean 3.
$$X^2 = \frac{(17.00 - 16.93)^2}{16.93} + \frac{(20.00 - 19.04)^2}{19.04} + \frac{(25.00 - 19.04)^2}{19.04} + \frac{(14.00 - 14.28)^2}{14.28} + \frac{(6.00 - 8.57)^2}{8.57} + \frac{(3.00 - 7.14)^2}{7.14} = 5.12$$
p-value $= P(\chi^2_5 \geq 5.12) = 0.40$
$\Rightarrow$ It is plausible that the software errors have a Poisson distribution with mean $\lambda = 3$.
· If $H_0$ were only that the software errors have a Poisson distribution (mean unspecified), the $e_i$ would be calculated using $\lambda = \bar{x} = 2.76$, and the p-value would be calculated from a chi-square distribution with $k - 1 - 1 = 4$ degrees of freedom.

10.4 Testing for Independence in Two-Way Contingency Tables

10.4.1 Two-Way Classifications

A two-way ($r \times c$) contingency table:

                                     Second Categorization
  First Categorization    Level 1   Level 2   ...   Level j   ...   Level c   Row marginal frequencies
  Level 1                 x11       x12       ...             ...   x1c       x1.
  Level 2                 x21       x22       ...             ...   x2c       x2.
  ...
  Level i                                           xij                       xi.
  ...
  Level r                 xr1       xr2       ...             ...   xrc       xr.
  Column marginal
  frequencies             x.1       x.2       ...   x.j       ...   x.c       n = x..

Example 55 : Building Tile Cracks

                            Location
  Tile Condition    Building A     Building B     Total
  Undamaged         x11 = 5594     x12 = 1917     x1. = 7511
  Cracked           x21 = 406      x22 = 83       x2. = 489
  Total             x.1 = 6000     x.2 = 2000     n = x.. = 8000

Notice that the column marginal frequencies are fixed ($x_{\cdot 1} = 6000$, $x_{\cdot 2} = 2000$).

10.4.2 Testing for Independence

· Testing for Independence in a Two-Way Contingency Table
$H_0$ : The two categorizations are independent.
Test statistics:
$$X^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(x_{ij} - e_{ij})^2}{e_{ij}} \quad\text{or}\quad G^2 = 2\sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij} \ln\left(\frac{x_{ij}}{e_{ij}}\right), \quad\text{where } e_{ij} = \frac{x_{i\cdot}\, x_{\cdot j}}{n}.$$
p-value $= P(\chi^2_{\nu} \geq X^2)$ or p-value $= P(\chi^2_{\nu} \geq G^2)$, where $\nu = (r-1) \times (c-1)$.
A size $\alpha$ hypothesis test accepts $H_0$ if $\chi^2_{\alpha,\nu} > X^2$ (or $G^2$) and rejects $H_0$ if $\chi^2_{\alpha,\nu} < X^2$ (or $G^2$).

Example 55 : Building Tile Cracks

                 Building A        Building B        Total
  Undamaged      x11 = 5594        x12 = 1917        x1. = 7511
                 e11 = 5633.25     e12 = 1877.75
  Cracked        x21 = 406         x22 = 83          x2. = 489
                 e21 = 366.75      e22 = 122.25
  Total          x.1 = 6000        x.2 = 2000        n = x.. = 8000

$$e_{11} = \frac{7511 \times 6000}{8000} = 5633.25, \quad e_{12} = \frac{7511 \times 2000}{8000} = 1877.75,$$
$$e_{21} = \frac{489 \times 6000}{8000} = 366.75, \quad e_{22} = \frac{489 \times 2000}{8000} = 122.25.$$
$$X^2 = \frac{(5594.00 - 5633.25)^2}{5633.25} + \frac{(1917.00 - 1877.75)^2}{1877.75} + \frac{(406.00 - 366.75)^2}{366.75} + \frac{(83.00 - 122.25)^2}{122.25} = 17.896$$
p-value $= P(\chi^2_1 \geq 17.896) \simeq 0$
Consequently, it is not plausible that $p_A$ (the probability that a tile on Building A becomes cracked) and $p_B$ (the probability that a tile on Building B becomes cracked) are equal. This is consistent with the fact that the confidence interval for $p_A - p_B$ does not contain zero:
$$p_A - p_B \in (0.0120, 0.0404).$$
A formula for the Pearson chi-square statistic for a $2 \times 2$ table is
$$X^2 = \frac{n(x_{11}x_{22} - x_{12}x_{21})^2}{x_{1\cdot}\, x_{\cdot 1}\, x_{2\cdot}\, x_{\cdot 2}}.$$

Summary problems
1. Construct a goodness-of-fit test for testing a distributional assumption of a normal distribution by applying the one-way classification method.
2. In testing $H_0 : p_{ij} = p_{ij}^*$ vs $H_1$ : not $H_0$, what is the df of the Pearson chi-square statistic, when $i = 1, \ldots, r$, $j = 1, \ldots, c$?
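As a numerical check on the independence test of Section 10.4.2, the following Python sketch computes the Pearson statistic for the building tile table from the expected counts $e_{ij} = x_{i\cdot} x_{\cdot j} / n$ and compares it with the $2 \times 2$ shortcut formula. The function names are hypothetical, and the tail probability uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so this p-value helper is valid only for 1 degree of freedom.

```python
from math import erfc, sqrt


def pearson_independence(table):
    """Pearson X^2 and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, x_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n  # expected cell count
            x2 += (x_ij - e_ij) ** 2 / e_ij
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df


def chi2_upper_tail_df1(x):
    """P(chi-square with 1 df >= x), valid for 1 degree of freedom only."""
    return erfc(sqrt(x / 2.0))


# Rows: undamaged, cracked. Columns: Building A, Building B.
tiles = [[5594, 1917], [406, 83]]
x2, df = pearson_independence(tiles)

# Shortcut formula for a 2 x 2 table, using the same counts.
(x11, x12), (x21, x22) = tiles
n = x11 + x12 + x21 + x22
shortcut = n * (x11 * x22 - x12 * x21) ** 2 / (
    (x11 + x12) * (x11 + x21) * (x21 + x22) * (x12 + x22)
)

p_value = chi2_upper_tail_df1(x2)
print(f"X^2 = {x2:.3f} (shortcut {shortcut:.3f}), df = {df}, p-value = {p_value:.1e}")
```

For tables larger than $2 \times 2$ the same `pearson_independence` function applies, but the p-value would need a general chi-square tail function, which the standard library does not provide.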