Chapter 8 Inference Concerning Proportions

Inference for a Single Proportion (p)
• Goal: Estimate proportion of individuals in a population with a
certain characteristic (p). This is equivalent to estimating a
binomial probability
• Sample: Take a SRS of n individuals from the population and
observe X that have the characteristic. The sample proportion is
X/n and has the following sampling properties:
Sample proportion: \hat{p} = X / n
Mean and Std. Dev. of sampling distribution: \mu_{\hat{p}} = p,  \sigma_{\hat{p}} = \sqrt{p(1-p)/n}
Estimated Standard Error: SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}
Shape: approximately normal for large samples (rule of thumb: X ≥ 15 and n − X ≥ 15)
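As a quick illustration (not part of the original slides), the Python sketch below computes the sample proportion, its estimated standard error, and the large-sample rule of thumb; the values n = 200 and X = 58 are made up for the example.

from math import sqrt

# Illustrative values (assumed, not from the slides): n individuals, X with the characteristic
n, X = 200, 58

p_hat = X / n                                # sample proportion X/n
se_hat = sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error of p-hat
rule_of_thumb_ok = X >= 15 and n - X >= 15   # large-sample normality check

print(p_hat, se_hat, rule_of_thumb_ok)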
Large-Sample Confidence Interval for p
• Take SRS of size n from population where p is true
(unknown) proportion of successes.
– Observe X successes
– Set confidence level C and choose z* such that P(−z* ≤ Z ≤ z*) = C
(C = 90% ⇒ z* = 1.645,  C = 95% ⇒ z* = 1.96,  C = 99% ⇒ z* = 2.576)

Point Estimate: \hat{p} = X / n
Estimated Standard Error: SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}
Margin of error: m = z^* SE_{\hat{p}}
C% confidence interval for p: \hat{p} \pm m
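A minimal Python sketch of this interval, assuming the formulas above; the function name large_sample_ci is my own, and z* is obtained from statistics.NormalDist rather than from a table.

from math import sqrt
from statistics import NormalDist

def large_sample_ci(X, n, conf=0.95):
    """Large-sample confidence interval for a single proportion p."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z* for confidence level C
    p_hat = X / n                                  # point estimate
    se = sqrt(p_hat * (1 - p_hat) / n)             # estimated standard error
    m = z * se                                     # margin of error
    return p_hat, m, (p_hat - m, p_hat + m)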
Example - Ginkgo and Acetazolamide for AMS
• Study Goal: Measure effect of Ginkgo and
Acetazolamide on occurrence of Acute Mountain
Sickness (AMS) in Himalayan Trekkers
• Parameter: p = True proportion of all trekkers receiving
Ginkgo&Acetaz who would suffer from AMS.
• Sample Data: n=126 trekkers received G&A, X=18
suffered from AMS
\hat{p} = 18/126 = 0.143
SE_{\hat{p}} = \sqrt{(0.14)(0.86)/126} = 0.031
Margin of error (C = 95%): m = 1.96(0.031) = 0.061
95% CI for p: 0.143 \pm 0.061 = (0.082, 0.204)
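A short check of the arithmetic above (not from the slides); small differences are rounding.

from math import sqrt
from statistics import NormalDist

p_hat = 18 / 126                                  # 0.143
se = sqrt(p_hat * (1 - p_hat) / 126)              # about 0.031
m = NormalDist().inv_cdf(0.975) * se              # about 0.061
print(round(p_hat - m, 3), round(p_hat + m, 3))   # about (0.082, 0.204)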
Wilson’s “Plus 4” Method
• For moderate to small sample sizes, large-sample
methods may not work well wrt coverage probabilities
• Simple approach that works well in practice (n ≥ 10):
– Pretend you have 4 extra individuals, 2 successes, 2 failures
– Compute the estimated sample proportion in light of new
“data” as well as standard error:
Point Estimate: \tilde{p} = (X + 2) / (n + 4)
Estimated Standard Error: SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})/(n+4)}
Margin of error: m = z^* SE_{\tilde{p}}
C% confidence interval for p: \tilde{p} \pm m
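A minimal Python sketch of the plus-4 interval, assuming the formulas above; the function name plus4_ci is my own, and the returned interval is truncated to [0, 1], as done in the Lister example that follows.

from math import sqrt
from statistics import NormalDist

def plus4_ci(X, n, conf=0.95):
    """Wilson 'plus 4' interval: add 2 successes and 2 failures before estimating."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_tilde = (X + 2) / (n + 4)                    # adjusted point estimate
    se = sqrt(p_tilde * (1 - p_tilde) / (n + 4))   # adjusted standard error
    m = z * se
    return p_tilde, (max(0.0, p_tilde - m), min(1.0, p_tilde + m))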
Example: Lister’s Tests with Antiseptic
• Experiments with antiseptic in patients with upper
limb amputations (Joseph Lister, circa 1870)
• n=12 patients received antiseptic, X=1 died
\tilde{p} = (1 + 2) / (12 + 4) = 3/16 = 0.1875
SE_{\tilde{p}} = \sqrt{(0.1875)(0.8125)/16} = 0.0976
Margin of error (C = 95%): m = 1.96(0.0976) = 0.1913
95% CI for p: 0.1875 \pm 0.1913 = (−0.0038, 0.3788) → (0, 0.38)
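A quick check of this calculation in Python (not from the slides); the negative lower limit is what gets truncated to 0 above.

from math import sqrt

p_tilde = (1 + 2) / (12 + 4)                 # 0.1875
se = sqrt(p_tilde * (1 - p_tilde) / 16)      # about 0.0976
m = 1.96 * se                                # about 0.1913
print(p_tilde - m, p_tilde + m)              # about -0.0038 and 0.3788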
Significance Test for a Proportion
• Goal: Test whether a proportion (p) equals some null
value p0: H0: p = p0
Test Statistic: z_{obs} = (\hat{p} - p_0) / \sqrt{p_0(1-p_0)/n}
H_a: p > p_0:  P-value = P(Z ≥ z_{obs})
H_a: p < p_0:  P-value = P(Z ≤ z_{obs})
H_a: p ≠ p_0:  P-value = 2 P(Z ≥ |z_{obs}|)
Large-sample test works well when np0 and n(1-p0) > 10
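A minimal Python sketch of this test, assuming the statistic above; the function name one_prop_ztest and the alternative labels are my own.

from math import sqrt
from statistics import NormalDist

def one_prop_ztest(X, n, p0, alternative="two-sided"):
    """Large-sample z test of H0: p = p0."""
    z = (X / n - p0) / sqrt(p0 * (1 - p0) / n)
    N = NormalDist()
    if alternative == "greater":          # Ha: p > p0
        p_value = 1 - N.cdf(z)
    elif alternative == "less":           # Ha: p < p0
        p_value = N.cdf(z)
    else:                                 # Ha: p != p0
        p_value = 2 * (1 - N.cdf(abs(z)))
    return z, p_value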
Ginkgo and Acetaz for AMS
• Can we claim that the incidence rate of AMS is less
than 25% for trekkers receiving G&A?
• H0: p=0.25 Ha: p < 0.25
n = 126,  X = 18,  \hat{p} = 18/126 = 0.143,  p_0 = 0.25
Test Statistic: z_{obs} = (0.143 − 0.25) / \sqrt{(0.25)(0.75)/126} = −0.107 / 0.039 = −2.75
P-value = P(Z ≤ −2.75) = 0.0030
Strong evidence that incidence rate is below 25% (p<0.25)
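A quick check in Python (not from the slides); carrying full precision gives z ≈ −2.78 rather than the slide's rounded −2.75, but the conclusion is unchanged.

from math import sqrt
from statistics import NormalDist

z = (18 / 126 - 0.25) / sqrt(0.25 * 0.75 / 126)   # about -2.78 (the slide rounds to -2.75)
print(z, NormalDist().cdf(z))                     # one-sided P-value, about 0.003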
Comparing Two Population Proportions
• Goal: Compare two populations/treatments wrt
a nominal (binary) outcome
• Sampling Design: Independent vs Dependent
Samples
• Methods based on large vs small samples
• Contingency tables used to summarize data
• Measures of Association: Absolute Risk,
Relative Risk, Odds Ratio
Contingency Tables
• Tables representing all combinations of
levels of explanatory and response variables
• Numbers in table represent Counts of the
number of cases in each cell
• Row and column totals are called Marginal
counts
2x2 Tables - Notation

                  Outcome Present   Outcome Absent          Group Total
Group 1           X1                n1 − X1                 n1
Group 2           X2                n2 − X2                 n2
Outcome Total     X1 + X2           (n1 + n2) − (X1 + X2)   n1 + n2
Example - Firm Type/Product Quality

                        High Quality   Low Quality   Group Total
Not Integrated          33             55            88
Vertically Integrated   5              79            84
Outcome Total           38             134           172

• Groups: Not Integrated (Weave only) vs Vertically Integrated
(Spin and Weave) Cotton Textile Producers
• Outcomes: High Quality (High Count) vs Low Quality (Low Count)
Source: Temin (1988)
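One way (not from the slides) to hold this table in Python and recover the marginal counts:

# The 2x2 table above: rows = firm type, columns = product quality
table = {
    "Not Integrated":        {"High Quality": 33, "Low Quality": 55},
    "Vertically Integrated": {"High Quality": 5,  "Low Quality": 79},
}

row_totals = {g: sum(cells.values()) for g, cells in table.items()}        # 88 and 84
col_totals = {c: sum(cells[c] for cells in table.values())
              for c in ("High Quality", "Low Quality")}                    # 38 and 134
grand_total = sum(row_totals.values())                                     # 172

print(row_totals, col_totals, grand_total)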
Notation
• Proportion in Population 1 with the characteristic
of interest: p1
• Sample size from Population 1: n1
• Number of individuals in Sample 1 with the
characteristic of interest: X1
• Sample proportion from Sample 1 with the
characteristic of interest: \hat{p}_1 = X_1 / n_1
• Similar notation for Population/Sample 2
Example - Cotton Textile Producers
• p1 - True proportion of all Non-integrated
firms that would produce High quality
• p2 - True proportion of all Vertically integrated
firms that would produce High quality
n_1 = 88,  X_1 = 33,  \hat{p}_1 = X_1 / n_1 = 33/88 = 0.375
n_2 = 84,  X_2 = 5,   \hat{p}_2 = X_2 / n_2 = 5/84 = 0.060
Notation (Continued)
• Parameter of Primary Interest: p1-p2, the difference
in the 2 population proportions with the
characteristic (2 other measures given below)
• Estimator: \hat{D} = \hat{p}_1 - \hat{p}_2
• Standard Error (and its estimate):
\sigma_{\hat{D}} = \sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}
SE_{\hat{D}} = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}
• Pooled Estimated Standard Error when p1 = p2 = p:
SE_{\hat{D}P} = \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}   where   \hat{p} = (X_1 + X_2)/(n_1 + n_2)
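A minimal Python sketch of the two standard errors, assuming the formulas above; the function names se_diff and se_diff_pooled are my own.

from math import sqrt

def se_diff(X1, n1, X2, n2):
    """Unpooled estimated standard error of p1-hat minus p2-hat."""
    p1, p2 = X1 / n1, X2 / n2
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_diff_pooled(X1, n1, X2, n2):
    """Pooled estimated standard error, appropriate under H0: p1 = p2."""
    p = (X1 + X2) / (n1 + n2)
    return sqrt(p * (1 - p) * (1 / n1 + 1 / n2))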
Cotton Textile Producers (Continued)
• Parameter of Primary Interest: p1-p2, the difference
in the 2 population proportions that produce High
quality output
• Estimator: \hat{D} = \hat{p}_1 - \hat{p}_2 = 0.375 − 0.060 = 0.315
• Standard Error (and its estimate):
SE_{\hat{D}} = \sqrt{(0.375)(0.625)/88 + (0.060)(0.940)/84} = \sqrt{0.003335} = 0.0577
• Pooled Estimated Standard Error when p1 = p2 = p:
\hat{p} = (33 + 5)/(88 + 84) = 38/172 = 0.221
SE_{\hat{D}P} = \sqrt{(0.221)(0.779)(1/88 + 1/84)} = 0.0633
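A quick check of these two standard errors in Python (not from the slides):

from math import sqrt

se = sqrt(0.375 * 0.625 / 88 + 0.060 * 0.940 / 84)      # about 0.0577
se_pooled = sqrt(0.221 * 0.779 * (1 / 88 + 1 / 84))     # about 0.0633
print(se, se_pooled)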
Confidence Interval for p1-p2 (Wilson’s Estimate)
• Method adds a success and a failure to each group to
improve the coverage rate under certain conditions:
\tilde{p}_1 = (X_1 + 1)/(n_1 + 2)        \tilde{p}_2 = (X_2 + 1)/(n_2 + 2)
\tilde{D} = \tilde{p}_1 - \tilde{p}_2
SE_{\tilde{D}} = \sqrt{\tilde{p}_1(1-\tilde{p}_1)/(n_1+2) + \tilde{p}_2(1-\tilde{p}_2)/(n_2+2)}
• The confidence interval is of the form:
(\tilde{p}_1 - \tilde{p}_2) \pm z^* SE_{\tilde{D}}
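A minimal Python sketch of this interval, assuming the formulas above; the function name wilson_diff_ci is my own.

from math import sqrt
from statistics import NormalDist

def wilson_diff_ci(X1, n1, X2, n2, conf=0.95):
    """CI for p1 - p2 after adding one success and one failure to each group."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p1 = (X1 + 1) / (n1 + 2)
    p2 = (X2 + 1) / (n2 + 2)
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    d = p1 - p2
    return d, (d - z * se, d + z * se)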
Example - Cotton Textile Production
\tilde{p}_1 = (X_1 + 1)/(n_1 + 2) = (33 + 1)/(88 + 2) = 34/90 = 0.378
\tilde{p}_2 = (X_2 + 1)/(n_2 + 2) = (5 + 1)/(84 + 2) = 6/86 = 0.070
\tilde{D} = \tilde{p}_1 - \tilde{p}_2 = 0.378 − 0.070 = 0.308
SE_{\tilde{D}} = \sqrt{(0.378)(0.622)/90 + (0.070)(0.930)/86} = \sqrt{0.00261 + 0.00076} = 0.0581
95% Confidence Interval for p1 - p2:
0.308 \pm 1.96(0.0581) = 0.308 \pm 0.114 = (0.194, 0.422)
Providing evidence that non-integrated producers are more likely
to provide high quality output (p1-p2 > 0)
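A quick check of this interval in Python (not from the slides); small differences are rounding.

from math import sqrt

p1, p2 = 34 / 90, 6 / 86                                   # 0.378 and 0.070
se = sqrt(p1 * (1 - p1) / 90 + p2 * (1 - p2) / 86)         # about 0.058
d = p1 - p2                                                # about 0.308
print(d - 1.96 * se, d + 1.96 * se)                        # about 0.194 and 0.422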
Significance Tests for p1-p2
• Deciding whether p1=p2 can be done by interpreting
“plausible values” of p1-p2 from the confidence interval:
– If entire interval is positive, conclude p1 > p2 (p1-p2 > 0)
– If entire interval is negative, conclude p1 < p2 (p1-p2 < 0)
– If interval contains 0, do not conclude that p1 ≠ p2
• Alternatively, we can conduct a significance test:
– H0: p1 = p2      Ha: p1 ≠ p2 (2-sided)      Ha: p1 > p2 (1-sided)
– Test Statistic:
z_{obs} = (\hat{p}_1 - \hat{p}_2) / \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}
– P-value: 2P(Z ≥ |z_{obs}|) (2-sided)      P(Z ≥ z_{obs}) (1-sided)
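A minimal Python sketch of this test, assuming the statistic above with the pooled estimate \hat{p} = (X1 + X2)/(n1 + n2); the function name two_prop_ztest is my own.

from math import sqrt
from statistics import NormalDist

def two_prop_ztest(X1, n1, X2, n2, alternative="two-sided"):
    """Large-sample z test of H0: p1 = p2 using the pooled standard error."""
    p1, p2 = X1 / n1, X2 / n2
    p = (X1 + X2) / (n1 + n2)                      # pooled estimate of the common p
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    N = NormalDist()
    if alternative == "greater":                   # Ha: p1 > p2
        p_value = 1 - N.cdf(z)
    else:                                          # Ha: p1 != p2
        p_value = 2 * (1 - N.cdf(abs(z)))
    return z, p_value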
Example - Cotton Textile Production
H_0: p_1 = p_2  (p_1 − p_2 = 0)      H_A: p_1 ≠ p_2  (p_1 − p_2 ≠ 0)
TS: z_{obs} = (\hat{p}_1 - \hat{p}_2) / \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}
  = (0.375 − 0.060) / \sqrt{(0.221)(0.779)(1/88 + 1/84)} = 0.315 / 0.0633 = 4.98
RR: |z_{obs}| ≥ z_{.025} = 1.96
P-value = 2 P(Z ≥ 4.98) ≈ 0
Again, there is strong evidence that non-integrated firms are
more likely to produce high quality output than integrated firms
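A quick check in Python (not from the slides); the two-sided P-value is on the order of 10^-7, effectively 0 as reported above.

from math import sqrt
from statistics import NormalDist

p = 38 / 172                                                     # pooled estimate, about 0.221
z = (33 / 88 - 5 / 84) / sqrt(p * (1 - p) * (1 / 88 + 1 / 84))   # about 4.99
print(z, 2 * (1 - NormalDist().cdf(z)))                          # two-sided P-value, essentially 0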