Chapter 8: Inference Concerning Proportions

Inference for a Single Proportion (p)
• Goal: Estimate the proportion of individuals in a population with a certain characteristic (p). This is equivalent to estimating a binomial probability.
• Sample: Take an SRS of n individuals from the population and observe X that have the characteristic. The sample proportion is X/n and has the following sampling properties:

Sample proportion: $\hat{p} = \dfrac{X}{n}$
Mean and standard deviation of the sampling distribution: $\mu_{\hat{p}} = p$, $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$
Estimated standard error: $SE_{\hat{p}} = \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
Shape: approximately normal for large samples (rule of thumb: $X \ge 15$ and $n - X \ge 15$)

Large-Sample Confidence Interval for p
• Take an SRS of size n from a population where p is the true (unknown) proportion of successes.
– Observe X successes
– Set confidence level C and choose z* such that $P(-z^* \le Z \le z^*) = C$ (C = 90%: z* = 1.645; C = 95%: z* = 1.96; C = 99%: z* = 2.576)

Point estimate: $\hat{p} = \dfrac{X}{n}$
Estimated standard error: $SE_{\hat{p}} = \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
Margin of error: $m = z^* \, SE_{\hat{p}}$
C% confidence interval for p: $\hat{p} \pm m$

Example - Ginkgo and Acetazolamide for AMS
• Study Goal: Measure the effect of Ginkgo and Acetazolamide (G&A) on the occurrence of Acute Mountain Sickness (AMS) in Himalayan trekkers
• Parameter: p = true proportion of all trekkers receiving G&A who would suffer from AMS
• Sample Data: n = 126 trekkers received G&A, X = 18 suffered from AMS

$\hat{p} = \dfrac{18}{126} = .143 \qquad SE_{\hat{p}} = \sqrt{\dfrac{(.14)(.86)}{126}} = .031$
Margin of error (C = 95%): $m = 1.96(.031) = .061$
95% CI for p: $.143 \pm .061 \equiv (.082, .204)$

Wilson's "Plus 4" Method
• For moderate to small sample sizes, large-sample methods may not work well with respect to coverage probabilities
• Simple approach that works well in practice ($n \ge 10$):
– Pretend you have 4 extra individuals: 2 successes and 2 failures
– Compute the estimated sample proportion and its standard error in light of the new "data":

Point estimate: $\tilde{p} = \dfrac{X + 2}{n + 4}$
Estimated standard error: $SE_{\tilde{p}} = \sqrt{\dfrac{\tilde{p}(1-\tilde{p})}{n + 4}}$
Margin of error: $m = z^* \, SE_{\tilde{p}}$
C% confidence interval for p: $\tilde{p} \pm m$

Example: Lister's Tests with Antiseptic
• Experiments with antiseptic in patients with upper limb amputations (Joseph Lister, circa 1870)
• n = 12 patients received antiseptic, X = 1 died

$\tilde{p} = \dfrac{1 + 2}{12 + 4} = \dfrac{3}{16} = .1875 \qquad SE_{\tilde{p}} = \sqrt{\dfrac{.1875(.8125)}{16}} = .0976$
Margin of error (C = 95%): $m = 1.96(.0976) = .1913$
95% CI for p: $.1875 \pm .1913 \equiv (-.0038, .3788) \rightarrow (0, .38)$
(The lower limit is negative, so it is truncated at 0.)

Significance Test for a Proportion
• Goal: Test whether a proportion (p) equals some null value $p_0$
• $H_0: p = p_0$

Test statistic: $z_{obs} = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$

• $H_a: p > p_0$: P-value $= P(Z \ge z_{obs})$
• $H_a: p < p_0$: P-value $= P(Z \le z_{obs})$
• $H_a: p \ne p_0$: P-value $= 2P(Z \ge |z_{obs}|)$

The large-sample test works well when $np_0$ and $n(1-p_0)$ are both > 10.

Ginkgo and Acetaz for AMS
• Can we claim that the incidence rate of AMS is less than 25% for trekkers receiving G&A?
• $H_0: p = 0.25$ vs $H_a: p < 0.25$

$n = 126 \qquad X = 18 \qquad \hat{p} = \dfrac{18}{126} = 0.143 \qquad p_0 = 0.25$
Test statistic: $z_{obs} = \dfrac{.143 - .25}{\sqrt{\dfrac{.25(.75)}{126}}} = \dfrac{-.107}{.039} \approx -2.75$
P-value $= P(Z \le -2.75) = .0030$

Strong evidence that the incidence rate is below 25% (p < 0.25)

Comparing Two Population Proportions
• Goal: Compare two populations/treatments with respect to a nominal (binary) outcome
• Sampling Design: Independent vs Dependent Samples
• Methods based on large vs small samples
• Contingency tables used to summarize data
• Measures of Association: Absolute Risk, Relative Risk, Odds Ratio

Contingency Tables
• Tables representing all combinations of levels of the explanatory and response variables
• Numbers in the table represent counts of the number of cases in each cell
• Row and column totals are called marginal counts

2x2 Tables - Notation

|               | Outcome Present | Outcome Absent        | Group Total |
|---------------|-----------------|-----------------------|-------------|
| Group 1       | X1              | n1 - X1               | n1          |
| Group 2       | X2              | n2 - X2               | n2          |
| Outcome Total | X1 + X2         | (n1 + n2) - (X1 + X2) | n1 + n2     |

Example - Firm Type/Product Quality

|                       | High Quality | Low Quality | Group Total |
|-----------------------|--------------|-------------|-------------|
| Not Integrated        | 33           | 55          | 88          |
| Vertically Integrated | 5            | 79          | 84          |
| Outcome Total         | 38           | 134         | 172         |

• Groups: Not Integrated (Weave only) vs Vertically Integrated (Spin and Weave) Cotton Textile Producers
• Outcomes: High Quality (High Count) vs Low Quality (Low Count)
Source: Temin (1988)

Notation
• Proportion in Population 1 with the characteristic of interest: $p_1$
• Sample size from Population 1: $n_1$
• Number of individuals in Sample 1 with the characteristic of interest: $X_1$
• Sample proportion from Sample 1 with the characteristic of interest: $\hat{p}_1 = \dfrac{X_1}{n_1}$
• Similar notation for Population/Sample 2

Example - Cotton Textile Producers
• $p_1$ - True proportion of all non-integrated firms that would produce High quality
• $p_2$ - True proportion of all vertically integrated firms that would produce High quality

$n_1 = 88 \qquad X_1 = 33 \qquad \hat{p}_1 = \dfrac{X_1}{n_1} = \dfrac{33}{88} = 0.375$
$n_2 = 84 \qquad X_2 = 5 \qquad \hat{p}_2 = \dfrac{X_2}{n_2} = \dfrac{5}{84} = 0.060$

Notation (Continued)
• Parameter of Primary Interest: $p_1 - p_2$, the difference in the 2 population proportions with the characteristic (2 other measures given below)
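• Estimator: $\hat{D} = \hat{p}_1 - \hat{p}_2$
• Standard Error (and its estimate):

$\sigma_{\hat{D}} = \sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}} \qquad SE_{\hat{D}} = \sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

• Pooled Estimated Standard Error when $p_1 = p_2 = p$:

$SE_{D_p} = \sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} \qquad \hat{p} = \dfrac{X_1 + X_2}{n_1 + n_2}$

Cotton Textile Producers (Continued)
• Parameter of Primary Interest: $p_1 - p_2$, the difference in the 2 population proportions that produce High quality output
• Estimator: $\hat{D} = \hat{p}_1 - \hat{p}_2 = 0.375 - 0.060 = 0.315$
• Standard Error (and its estimate):

$SE_{\hat{D}} = \sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = \sqrt{\dfrac{0.375(0.625)}{88} + \dfrac{0.060(0.94)}{84}} = \sqrt{.003335} = .0577$

• Pooled Estimated Standard Error when $p_1 = p_2 = p$:

$\hat{p} = \dfrac{33 + 5}{88 + 84} = 0.221 \qquad SE_{D_p} = \sqrt{0.221(0.779)\left(\dfrac{1}{88} + \dfrac{1}{84}\right)} = .0633$

Confidence Interval for p1-p2 (Wilson's Estimate)
• Method adds a success and a failure to each group to improve the coverage rate under certain conditions:

$\tilde{p}_1 = \dfrac{X_1 + 1}{n_1 + 2} \qquad \tilde{p}_2 = \dfrac{X_2 + 1}{n_2 + 2} \qquad \tilde{D} = \tilde{p}_1 - \tilde{p}_2$
$SE_{\tilde{D}} = \sqrt{\dfrac{\tilde{p}_1(1-\tilde{p}_1)}{n_1 + 2} + \dfrac{\tilde{p}_2(1-\tilde{p}_2)}{n_2 + 2}}$

• The confidence interval is of the form: $(\tilde{p}_1 - \tilde{p}_2) \pm z^* \, SE_{\tilde{D}}$

Example - Cotton Textile Production

$\tilde{p}_1 = \dfrac{X_1 + 1}{n_1 + 2} = \dfrac{33 + 1}{88 + 2} = \dfrac{34}{90} = 0.378 \qquad \tilde{p}_2 = \dfrac{X_2 + 1}{n_2 + 2} = \dfrac{5 + 1}{84 + 2} = \dfrac{6}{86} = 0.070$
$\tilde{D} = \tilde{p}_1 - \tilde{p}_2 = 0.378 - 0.070 = 0.308$
$SE_{\tilde{D}} = \sqrt{\dfrac{0.378(0.622)}{90} + \dfrac{0.070(0.930)}{86}} = \sqrt{.00261 + .00076} = .0581$

95% Confidence Interval for $p_1 - p_2$: $0.308 \pm 1.96(0.0581) \equiv 0.308 \pm 0.114 \equiv (0.194, 0.422)$

This provides evidence that non-integrated producers are more likely to produce high quality output ($p_1 - p_2 > 0$).

Significance Tests for p1-p2
• Deciding whether $p_1 = p_2$ can be done by interpreting "plausible values" of $p_1 - p_2$ from the confidence interval:
– If the entire interval is positive, conclude $p_1 > p_2$ ($p_1 - p_2 > 0$)
– If the entire interval is negative, conclude $p_1 < p_2$ ($p_1 - p_2 < 0$)
– If the interval contains 0, do not conclude that $p_1 \ne p_2$
• Alternatively, we can conduct a significance test:
– $H_0: p_1 = p_2$ vs $H_a: p_1 \ne p_2$ (2-sided) or $H_a: p_1 > p_2$ (1-sided)
– Test Statistic: $z_{obs} = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
– P-value: $2P(Z \ge |z_{obs}|)$ (2-sided) or $P(Z \ge z_{obs})$ (1-sided)

Example - Cotton Textile Production

$H_0: p_1 = p_2 \ (p_1 - p_2 = 0) \qquad H_A: p_1 \ne p_2 \ (p_1 - p_2 \ne 0)$
Test statistic: $z_{obs} = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{0.375 - 0.060}{\sqrt{0.221(0.779)\left(\dfrac{1}{88} + \dfrac{1}{84}\right)}} = \dfrac{0.315}{0.0633} = 4.98$
Rejection region: $|z_{obs}| \ge z_{.025} = 1.96$
P-value $= 2P(Z \ge 4.98) \approx 0$

Again, there is strong evidence that non-integrated firms are more likely to produce high quality output than integrated firms.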
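
Computing the Examples in Software
The worked examples in this chapter (G&A trekkers, Lister's antiseptic data, cotton textile producers) can be checked with a short script. What follows is a minimal sketch in Python using only the standard library, not a definitive implementation. The function names (`large_sample_ci`, `plus_four_ci`, `one_prop_z_test`, `two_prop_wilson_ci`, `two_prop_z_test`) and the `alternative` argument are illustrative assumptions that do not come from the notes. Clipping the plus-four interval to [0, 1] mirrors the truncation of the negative lower limit in the Lister example.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def z_star(conf):
    """Critical value z* with P(-z* <= Z <= z*) = conf."""
    return Z.inv_cdf(0.5 + conf / 2)


def p_value(z, alternative):
    """P-value for 'less', 'greater', or 'two-sided' alternatives (names assumed)."""
    if alternative == "less":
        return Z.cdf(z)
    if alternative == "greater":
        return 1 - Z.cdf(z)
    return 2 * (1 - Z.cdf(abs(z)))


def large_sample_ci(x, n, conf=0.95):
    """Large-sample interval: p_hat +/- z* sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    m = z_star(conf) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - m, p_hat + m


def plus_four_ci(x, n, conf=0.95):
    """Plus-four interval: add 2 successes and 2 failures, then clip to [0, 1]."""
    p_tilde = (x + 2) / (n + 4)
    m = z_star(conf) * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return max(0.0, p_tilde - m), min(1.0, p_tilde + m)


def one_prop_z_test(x, n, p0, alternative="two-sided"):
    """z test of H0: p = p0; the standard error uses the null value p0."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    return z, p_value(z, alternative)


def two_prop_wilson_ci(x1, n1, x2, n2, conf=0.95):
    """Interval for p1 - p2 after adding 1 success and 1 failure to each group."""
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    m = z_star(conf) * se
    return (p1 - p2) - m, (p1 - p2) + m


def two_prop_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """z test of H0: p1 = p2 using the pooled standard error."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return z, p_value(z, alternative)


if __name__ == "__main__":
    print(large_sample_ci(18, 126))                # G&A trekkers
    print(plus_four_ci(1, 12))                     # Lister's antiseptic data
    print(one_prop_z_test(18, 126, 0.25, "less"))  # H0: p = 0.25 vs Ha: p < 0.25
    print(two_prop_wilson_ci(33, 88, 5, 84))       # cotton textile producers
    print(two_prop_z_test(33, 88, 5, 84))
```

Because the script carries full precision instead of rounding intermediate values, its results can differ from the hand calculations in the second or third decimal place. For example, the z statistic for the G&A test comes out near -2.78 rather than -2.75; the conclusion is unchanged.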