Chapter 10. Discrete Data Analysis 10.1 10.2 10.3 10.4 10.5 NIPRL Inferences on a Population Proportion Comparing Two Population Proportions Goodness of Fit Tests for One-Way Contingency Tables Testing for Independence in Two-Way Contingency Tables Supplementary Problems 10.1 Inferences on a Population Proportion Population Proportion p with characteristic · Sample proportion µ p X ~ B ( n, p ) x µ p= n Random sample of size n With characteristic Without characteristic Cell probability p Cell probability 1-p Cell frequency x Cell frequency n-x NIPRL p (1- p ) µ µ E ( p) = p and Var ( p ) = n · For large n, æ p (1- p ) ÷ ö µ ç p ~ N ç p, ÷ çè ø n ÷ µ p- p ~ N (0,1) p (1- p ) n 2 10.1.1 Confidence Intervals for Population Proportions µ p- p · Since ~ N (0,1) for large n, two-sided 1- a confidence interval is p (1- p) n æ ö p (1- p) µ p(1- p) ÷ ç µ ÷ p Î çç p - za / 2 , p + za / 2 ÷ ÷ çè n n ø x p (1- p) where µ p = , s.e.( µ p) = . n n µ p (1- µ p) 1 µ But it is customary that s.e.( p) = = n n x(n - x) . n · Two-Sided Confidence Intervals for a Population Proportion X ~ B(n, p) Approximate two-sided 1- a confidence interval for p is æ ö za / 2 x(n - x) µ za / 2 x(n - x) ÷ ç µ ÷ p Î çç p , p+ . ÷ ÷ çè n n n n ø The approximation is reasonable as long as both x and n - x are larger than 5. NIPRL 3 10.1.1 Confidence Intervals for Population Proportions Example 55 : Building Tile Cracks Random sample n = 1250 of tiles in a certain group of downtown building for cracking. x = 98 are found to be cracked. x 98 µ p= = = 0.0784, z0.005 = 2.576 n 1250 99% two-sided confidence interval æ za / 2 x ( n - x ) µ z a / 2 x ( n - x ) ö ÷ ç µ ÷ p Î çç p , p+ ÷ ÷ çè n n n n ø æ 2.575 98´ (1250 - 98) 2.575 98´ (1250 - 98) ö ÷ ç ÷ = çç0.0784 , 0.0784 + ÷ ÷ çè 1250 1250 1250 1250 ø = (0.0588, 0.0980) » (0.06, 0.10) Total number of cracked tiles is between 0.06´ 5, 000, 000 = 300, 000 and 0.10´ 5, 000, 000 = 500, 000 NIPRL 4 10.1.1 Confidence Intervals for Population Proportions µ p- p · Since ~ N (0,1) for large n, one-sided 1- a confidence interval p (1- p ) n æ ö p (1- p ) ÷ ç µ ÷ for a lower bound is p Î çç p - za ,1÷. ÷ çè n ø · One-Sided Confidence Intervals for a Population Proportion X ~ B( n, p) Approximate two-sided 1- a confidence interval for p is æ æ za x(n - x) ö÷ za x(n - x) ö÷ ç ç µ µ ÷ p Î çç p ,1÷ and p Î çç0, p . ÷ ÷ ÷ ÷ çè ç n n n n ø è ø The approximation is reasonable as long as both x and n - x are larger than 5. NIPRL 5 10.1.2 Hypothesis Tests on a Population Proportion · Two-Sided Hypothesis Tests H 0 : p = p0 versus H A : p ¹ p0 p-value = 2´ P( X ³ x) if µ p = x / n > p0 p-value = 2´ P( X £ x) if µ p= x/n< p 0 where X ~ B(n, p0 ). · When np0 and n(1- p0 ) are both larger than 5, a normal approximation can be used to give thep-value of p -value = 2´ F (- | z |) where F (×) is thestandard normal c.d.f. and µ p - p0 x - np0 z = = . p0 (1- p0 ) np0 (1- p0 ) n NIPRL 6 10.1.2 Hypothesis Tests on a Population Proportion · Continuity Correction In order to improve the normal approximation the value in the numerator of the z -statistic x - np0 - 0.5 may be used when x - np0 > 0.5 x - np0 + 0.5 may be used when x - np0 < 0.5. · A size a hypothesis test accepts H 0 when | z | £ za / 2 or p - value³ a and rejects H 0 when | z | > za / 2 or p - value< a . NIPRL 7 10.1.2 Hypothesis Tests on a Population Proportion · One-Sided Hypothesis Tests for a Population Proportion H 0 : p ³ p0 versus H A : p < p0 p -value = P( X £ x) where X ~ B(n, p0 ). -The normal approximation to this is p -value = F ( z ) and z = x - np0 + 0.5 np0 (1- p0 ) . - Accept H 0 if z ³ - za and reject H 0 if z < - za . · For H 0 : p < p0 versus H A : p ³ p0 , p -value = P( X ³ x) where X ~ B(n, p0 ). -The normal approximation to this is p -value = 1- F ( z ) and z = x - np0 - 0.5 np0 (1- p0 ) . - Accept H 0 if z £ - za and reject H 0 if z > - za . NIPRL 8 10.1.2 Hypothesis Tests on a Population Proportion Example 55 : Building Tile Cracks 10% or more of the building tiles are cracked ? H 0 : p ³ 0.1 versus H A : p < 0.1 n = 1250, x = 98, p0 = 0.1. z= x - np0 + 0.5 = - 2.50. np0 (1- p0 ) 0 z = -2.50 p-value = F (- 2.50) = 0.0062. NIPRL 9 10.1.3 Sample Size Calculations · The most convenient way to assess the amount of precision afforded by a sample size n is to consider L of two-sided confidence interval for p. µ 4 za2 / 2 µ p (1- µ p) p (1- µ p) L = 2 za / 2 Þ n= . 2 n L Since µ p (1- µ p) is unknown, if we take µ p(1- µ p) the largest, za2 / 2 1 n = 2 and µ p= . L 2 However, if µ p is far from 0.5, we can take µ p = p * from prior information for p 4 za2 / 2 p *(1- p*) and n ; .(either p* < 0.5 or p* > 0.5) 2 L NIPRL 10 10.1.3 Sample Size Calculations Example 59 : Political Polling To determine the proportion p of people who agree with the statement “The city mayor is doing a good job.” within 3% accuracy. (-3% ~ +3%), how many people do they need to poll? Construct a 99% confidence interval for p with a length no large than L = 6% ( µ p - 3% ~ µ p + 3%) za2 / 2 2.5762 Þ n= 2 = = 1843.3 (a = 0.01) L 0.062 Þ representative sample : at least 1844 If "1000 respondents, ± 3% sampling error", Þ za / 2 = nL = 1000 ´ 0.06 = 1.90,F (- 1.90) = 0.0287. Þ a ; 0.057 Þ 95% confidence interval NIPRL 11 10.1.3 Sample Size Calculations Example 55 : Building Tile Cracks Recall a 99% confidence interval for p is (0.0588, 0.0980). We need to know p within 1% with 99% confidence.( L = 2%) Based on the above interval, let p* = 0.1 = 0.5. 4 za2 / 2 p *(1- p*) Þ n; = 5972.2 or about 6000. 2 L Þ representative sample : 4750 (initial sample = 1250) After the secondary sampling, n = 6000 and x = 98 + 308 = 406. 406 µ p= = 0.0677 and 6000 æ z x(n - x) µ za / 2 p Î ççç µ p - a /2 , p+ n n n èç x(n - x) ö ÷ ÷ ÷ ÷ n ø = (0.0593, 0.0761) NIPRL 12 10.2 Comparing Two Population Proportions · Comparing two population proportions Assume X ~ B(n, p A ), Y ~ B(m, pB ) and X ^ Y . x y x y and µ pB = Þ µ pA - µ pB = . n m n m p (1- p A ) pB (1- pB ) Var ( µ pA - µ p B ) = Var ( µ p A ) + Var ( µ pB ) = A + . n m (µ pA - µ p B ) - ( p A - pB ) (µ pA - µ p B ) - ( p A - pB ) z = = . µ x(n - x) y (m - y ) p A (1- µ pA) µ p B (1- µ pB ) + + 3 n m3 n m · If H 0 : p A = pB , the pooled estimate is µ pA - µ pB x+ y µ p = and z = . n+ m æ1 1 ö µ p (1- µ p) çç + ÷ ÷ çè n m ø÷ µ pA = NIPRL 13 10.2.1 Confidence Intervals for the Difference Between Two Population Proportions · Confidence Intervals for the Difference Between Two Population Proportions Assume X ~ B(n, p A ), Y ~ B(m, pB ) and X ^ Y . Approximate two-sided 1- a confidence level confidence interval is æ ö µ µ çç µ µ p A (1- µ pA) µ p B (1- µ pB ) µ µ p A (1- µ pA) µ p B (1- µ pB ) ÷ ÷ ÷ p A - p B Î ç p A - p B - za / 2 + , p A - p B - za / 2 + ÷ çç ÷ n m n m ÷ è ø æ x(n - x) y (m - y ) µ µ x(n - x) y (m - y ) ö÷ ç µ µ ÷ = çç p A - p B - za / 2 + , p A - p B - za / 2 + ÷ 3 3 3 3 çè n m n m ø÷ · Approximate one-sided 1- a confidence level confidence interval is æ x(n - x) y (m - y ) ö÷ ç µ µ p A - pB Î çç p A - p B - za + , 1÷ and ÷ 3 3 ÷ çè n m ø æ p A - pB Î ççç- 1, µ pA - µ p B - za çè x(n - x) y (m - y ) ö÷ ÷ + ÷. n3 m3 ø÷ · The approximations are reasonable as long as x, n - x, y and m - y are all larger than 5. NIPRL 14 10.2.1 Confidence Intervals for the Difference Between Two Population Proportions Example 55 : Building Tile Cracks Building A : 406 cracked tiles out of n = 6000. Building B : 83 cracked tiles out of m = 2000. x 406 y 83 µ pA = = = 0.0677 and µ pB = = = 0.0415. n 6000 m 2000 æ ö x(n - x) y (m - y ) µ µ x(n - x) y (m - y ) ÷ ç µ µ ÷ p A - pB Î çç p A - p B - za / 2 + , p A - p B - za / 2 + ÷ 3 3 3 3 ÷ çè n m n m ø = (0.0120, 0.0404) where a = 0.01. The fact this confidence interval contains only positive values indicates that p A > pB . NIPRL 15 10.2.2 Hypothesis Tests on the Difference Between Two Population Proportions · Hypothesis Tests of the Equality of Two Population Proportions Assume X ~ B(n, p A ), Y ~ B(m, pB ) and X ^ Y . · H 0 : p A = pB versus H A : p A ¹ pB has p - value = 2´ F (- | z |) ¶ pA - ¶ pB x+ y where z = and µ p= n+ m æ1 1 ö µ µ ÷ ç p (1- p) ç + ÷ çè n m ø÷ -Accept H 0 if | z | £ za / 2 and Reject H 0 if | z | > za / 2 . · H 0 : p A ³ pB versus H A : p A < pB has p - value = F ( z ) Accept H 0 if z ³ - za and Reject H 0 if z < - za . · H 0 : p A £ pB versus H A : p A > pB has p - value = 1- F ( z ) Accept H 0 if z £ - za and Reject H 0 if z > - za . NIPRL 16 10.2.2 Hypothesis Tests on the Difference Between Two Population Proportions Example 59 : Political Polling 627 421 µ pA = = 0.652 and µ pB = = 0.404 952 1043 H 0 : p A = pB versus H A : p A ¹ pB 627 + 421 Pooled estimate µ p= = 0.525 952 + 1043 z = 11.39, p - value = 2´ F (- 11.39) ; 0 Population age 18-39 A Random sample n=952 age >= 40 B Random sample m=1043 “The city mayor is doing a good job.” -Two-sided 99% confidence interval is p A - pB Î (0.199, 0.311) NIPRL Agree : x = 627 Agree : y = 421 Disagree : n-x = 325 Disagree : m-y = 622 17 Summary problems (1) Why do we assume large sample sizes for statistical inferences concerning proportions? So that the Normal approximation is a reasonable approach. (2) Can you find an exact size NIPRL test concerning proportions? No, in general. 18 10.3 Goodness of Fit Tests for One-Way Contingency Tables 10.3.1 One-Way Classifications · Each of n observations is classified into one of k categories or cells. cell frequencies : x1 , x2 ,¼ , xk with x1 + x2 + L + xk = n cell probabilities : p1 , p2 ,¼ , pk with p1 + p2 + L + pk = 1 xi µ xi ~ B(n, pi ) Þ p i = n · Hypothesis Test H 0 : pi = p*i 1 £ i £ k versus H A : H 0 is false The null hypothesis of homogeneity is that 1 pi = k * NIPRL for 1 £ i £ k 19 10.3.1 One-Way Classifications Example 1 : Machine Breakdowns n = 46 machine breakdowns. x1 = 9 : electrical problems x2 = 24 : mechanical problems x3 = 13 : operator misuse It is suggested that the probabilities of these three kinds are p*1 = 0.2, p*2 = 0.5, p*3 = 0.3. NIPRL 20 10.3.1 One-Way Classifications · Goodness of Fit Tests for One-Way Contingency Tables Consider a multinomial distribution with k-cells and a set of unknown cell probabilities p1 , p2 ,¼ , pk and a set of observed cell frequencies x1 , x2 ,¼ , xk with x1 + x2 + L + xk = n · H 0 : pi = p*i 1 £ i £ k p - value = P (c k2- 1 ³ X 2 ) or p - value = P (c k2- 1 ³ G 2 ) k where X 2 = å i= 1 with ei = np*i , k æxi ö ( xi - ei ) 2 2 ÷ and G = 2å xi ln ççç ÷ ÷ ÷ ei è ei ø i= 1 1£ i £ k The p - value are appropriate as long as ei ³ 5, 1 £ i £ k. At size a , accept H 0 if X 2 £ c a2 ,k - 1 (or if G 2 £ c a2 ,k - 1 ) and reject H 0 if X 2 > c a2 ,k - 1 (or if G 2 > c a2 ,k - 1 ). NIPRL 21 10.3.1 One-Way Classifications Example 1 : Machine Breakdowns H0 : p1 = 0.2, p2 = 0.5, p3 = 0.3 Observed cell freq. Expected cell freq. Electrical Mechanical Operator misuse x1 = 9 x2 = 24 x3 = 13 e1 = 46*0.2 e2 = 46*0.5 e3 = 46*0.3 =23.0 =13.8 = 9.2 n = 46 n = 46 (9.0 - 9.2) 2 (24.0 - 23.0) 2 (13.0 - 13.8) 2 X = + + = 0.0942 9.2 23.0 13.8 æ ö æ9.0 ö ö÷ æ13.0 ö÷÷ ÷+ 24.0´ ln æ çç 24.0 ÷ çç G 2 = 2´ çç9.0´ ln çç ÷ + 13.0 ´ ln = 0.0945 ÷ ÷ ÷ ÷÷ çè9.2 ø çè 23.0 ø çè13.8 ø÷ çè ø 2 p - value = P(c 22 ³ 0.094) ; 0.095 (k - 1 = 3 - 1 = 2) NIPRL 22 10.3.1 One-Way Classifications · H 0 is plausible from the p - value = P(c 22 ³ 0.094) ; 0.095. Consider 95% confidence interval for p2 . æ24 1.96 24´ (46 - 24) 24 1.96 24´ (46 - 24) ö÷ ÷ , + p2 Î ççç ÷ çè 46 46 46 46 46 46 ø÷ = (0.378, 0.666) · Check for the homogeneity. Let p1 = p2 = p3 = 13 . Then e1 = e2 = e3 = 46 3 = 15.33. (9.0 - 15.33) 2 (24.0 - 15.33) 2 (13.0 - 15.33) 2 = 7.87 + + X = 15.33 15.33 15.33 p - value = P(c 22 ³ 7.87) ; 0.02 2 Þ Not plausible. Also notice 13 Ï (0.378, 0.666). NIPRL 23 10.3.2 Testing Distributional Assumptions Example 3 : Software Errors For some of expected values are smaller than 5, it is appropriate to group the cells. Test if the data are from a Poisson distribution with mean=3. NIPRL 24 10.3.2 Testing Distributional Assumptions (17.00 - 16.93) 2 (20.0 - 19.04) 2 (25.00 - 19.04) 2 X = + + 16.93 19.04 19.04 (14.00 - 14.28) 2 (6.00 - 8.57) 2 (3.00 - 7.14) 2 + + + 14.28 8.57 7.14 = 5.12 2 p - value = P(c 52 ³ 5.12) = 0.40 Þ Plausible that the software errors have a Poisson distribution with mean l = 3. · If H 0 : the software errors have a Poisson distribution, ei would be calculated using l = x = 2.76 and p - value would be calculated from a chi-square distribution with k - 1- 1 = 4 degrees of freedom. NIPRL 25 10.4 Testing for Independence in Two-Way Contingency Tables 10.4.1 Two-Way Classifications A two-way (r x c) contingency table. Second Categorization Level 1 Level 2 Level j Level c x11 x12 x1c x1. Level 2 x21 x22 x2c x2. Level i Level r xij xr1 xr2 x.1 x.2 xi. x.j Column marginal frequencies NIPRL 26 xrc xr. x.c n = x.. Row marginal frequencies First Categorization Level 1 10.4.1 Two-Way Classifications Example 55 : Building Tile Cracks Location Tile Condition Building A Building B Undamaged x11 = 5594 x12 = 1917 x1. = 7511 Cracked x21 = 406 x22 = 83 x2. = 489 x.1 = 6000 x.2 = 2000 n = x.. = 8000 Notice that the column marginal frequencies are fixed. ( x.1 = 6000, x.2 = 2000) NIPRL 27 10.4.2 Testing for Independence · Testing for Independence in a Two-Way Contingency Table H 0 : Two categorizations are independent. Test statistics r 2 X = ( xij - eij ) 2 c å å i= 1 j = 1 where eij = eij æxij ÷ ö ç or G = 2å å xij ln çç ÷ ÷ ÷ e ç i= 1 j = 1 è ij ø r c 2 xi. x. j n p - value = P (c n2 ³ X 2 ) or p - value = P (c n2 ³ G 2 ) where n = (r - 1)´ (c - 1) A size a hypothesis test accepts H 0 if c a2 , n > X 2 (orG 2 ) rejects H 0 if c a2 , n < X 2 (orG 2 ) NIPRL 28 10.4.2 Testing for Independence Example 55 : Building Tile Cracks Building A Building B Undamaged x11 = 5594 e11 = 5633.25 x12 = 1917 e12 = 1877.75 x1. = 7511 Cracked x21 = 406 e21 = 366.75 x22 = 83 e22 = 122.25 x2. = 489 x.1 = 6000 x.2 = 2000 n = x.. = 8000 7511´ 6000 7511´ 2000 = 5633.25, e12 = = 1877.75 8000 8000 489´ 6000 489´ 2000 e21 = = 366.75, e22 = = 122.25 8000 8000 e11 = NIPRL 29 10.4.2 Testing for Independence (5594.00 - 5633.25) 2 (1917.00 - 1877.75) 2 X = + 5633.25 1877.75 (406.00 - 366.75) 2 (83.00 - 122.25) 2 + + 366.75 122.25 = 17.896 2 p - value = P (c 12 ³ 17.896) ; 0 Consequently, it is not plausible that p A (the probability thatatileonBuildingA becomescracked)and pB (the probability that a tileonBuildingBbecomescracked) are equal.This is consistent that the confidence interval of p A and pB does not contain zero. p A - pB Î (0.0120, 0.0404) A formula for the Pearson chi-square statistic for 2´ 2 tableis n( x11 x22 - x12 x21 ) 2 X = x1gxg1 x2 gxg2 2 NIPRL 30 Summary problems 1. Construct a goodness-of-fit test for testing a distributional assumption of a normal distribution by applying the one-way classification method. * 2. In testingH 0 : pij pij vsH1 :notH 0 ,what is the df of the Pearson chi-square statistic,wheni 1,..., r , j 1,..., c? NIPRL 31
0
You can add this document to your study collection(s)
Sign in Available only to authorized usersYou can add this document to your saved list
Sign in Available only to authorized users(For complaints, use another form )