Section 3

advertisement
STAT 405 - BIOSTATISTICS
Handout 3 – Methods for a Single Categorical Variable, Part II
In the previous handout, we discussed the binomial distribution. To learn more about
this distribution, refer to Section 4.8 of your text.
In this handout, we will continue to discuss the use of the binomial distribution for
testing hypotheses concerning a single binomial proportion. For additional information,
see Section 7.10 of your text.
THE Z-TEST FOR A SINGLE PROPORTION
In your introductory course, you more than likely discussed normal theory methods (e.g.,
the z-test) for testing hypotheses concerning a single proportion. Let’s review these
methods for the following example.
EXAMPLE 1: Cancer (Example 7.49 of your text, pg. 248)
The safety of people who work at or live by nuclear-power plants has been the subject of
widely publicized debate in recent years. One possible health hazard from radiation
exposure is an excess of cancer deaths among those exposed. One problem with studying
this question is that the number of deaths attributable to either cancer in general or
specific types of cancer is small, and reaching statistically significant conclusions is
difficult, except after long periods of follow-up. An alternative approach is to perform a
proportional-mortality study, whereby the proportion of deaths attributed to a specific
cause in an exposed group is compared with the corresponding proportion in a large
population.
Suppose, for example, that 13 deaths have occurred among 55-64 year-old male workers
in a nuclear-power plant and that in 5 of them the cause of death was cancer. Assume,
based on vital-statistics reports, that approximately 20% of all deaths can be attributed
to some form of cancer. Is there evidence that the proportion of deaths from cancer in
nuclear-power plant workers is greater than the proportion of deaths from cancer in men
of comparable age in the general population?

Set up your null and alternative hypotheses:
1

Find the test statistic:

Find the p-value:

Write a conclusion in the context of the original problem:
EXACT BINOMIAL TEST FOR A SINGLE PROPORTION
Note that the z-test for a single proportion is valid only when the normal approximation
to the binomial distribution is valid. When this criterion is not satisfied, the hypothesis
test should instead be based on exact binomial probabilities.
The exact binomial test for the Cancer example can be carried out as follows:

Check assumptions. For this test, you must check whether the binomial
distribution is appropriate for the problem.
2

Set up your null and alternative hypotheses.

Use the binomial distribution to find the exact p-value.
The following graphic shows the binomial distribution for the number of deaths
attributable to cancer (n = 13 and p = .20):
Recall that the p-value is the probability of observing a sample AT LEAST AS
EXTREME as our data, assuming the null hypothesis is true. For this example,
“at least as extreme” implies observing 5 or more deaths due to cancer.
We can use SAS to find the probability of 5 or more deaths due to cancer:
data BinomialProbabilities;
prob = 1-cdf('Binomial', 4, .20, 13);
proc print data=BinomialProbabilities;
run;
3
The interpretation of this p-value is as follows: Even if the true proportion of
deaths from cancer for nuclear-power plant workers is the same as that of the
general population (p = 20%), there is still about a 10% probability of observing
a sample proportion of 5/13 = 38.5% just by random chance alone.
Note that we reject the null hypothesis for small p-values. This is because a
small p-value indicates that observing a difference at least as extreme as the one
obtained in the sample is NOT LIKELY to happen by random chance alone.
Therefore, when the p-value is small, the observed difference between the
sample proportion and the hypothesized value is statistically significant since it
is not attributable to chance.

Write a conclusion in the context of the original problem:
“We do NOT have evidence to conclude the proportion of deaths from cancer is
different for nuclear-power plant workers than for men of comparable age in the
general population (p-value = .099).”
WHY DO WE DEFINE P-VALUE WITH “AT LEAST AS EXTREME”?
Consider the previous example. One question that often arises is why we used the
probability of 5 or more deaths to find the p-value, rather than the probability of exactly
5 deaths (what we actually observed in the study). One intuitive answer is that if the
number of overall deaths studied was very large, then the probability of any specific
occurrence would be small.
For example, suppose that out of 1,000 deaths in some study population, exactly 200
were attributable to cancer. If we assume the proportion of deaths in the study
population due to cancer is the same as that of the general population for this
age/gender group (20%), then the probability of observing exactly 200 deaths in the
study population is calculated as follows:
data BinomialProbabilities;
prob = pdf('Binomial', 200, .20, 1000);
proc print data=BinomialProbabilities;
run;
Questions:
1. If this were the p-value, what would our conclusion be?
2. Does this conclusion make sense? Explain.
4
The better approach is to find the probability of obtaining a result at least extreme as the
observed data. In this case, the p-value should have been calculated as follows:
data BinomialProbabilities;
prob = 1-cdf('Binomial', 199, .20, 1000);
proc print data=BinomialProbabilities;
run;
Questions:
1. Now, what is our conclusion?
2. Does this conclusion make sense? Explain.
EXACT CONFIDENCE INTERVALS FOR A BINOMIAL PROPORTION
Once again, in your introductory course, you more than likely constructed confidence
intervals for a single proportion using normal-theory methods. Consider the data from
the previous example. Let’s construct a 95% confidence interval for p = the true
proportion of deaths from cancer for nuclear-power plant workers using normal theory
methods:
5
Just like for hypothesis tests, this method is valid only when the normal approximation
to the binomial distribution is valid. When this criterion is not satisfied, the confidence
interval should be based on exact binomial probabilities. For additional information, see
Section 6.8 of your text.
EXAMPLE 1 (cont’d): Consider the data from the previous example. Let’s construct a
95% confidence interval for p = the true proportion of deaths from cancer for nuclearpower plant workers using the binomial distribution.
The basic idea for constructing a confidence interval for a binomial proportion is as
follows: the confidence interval includes all values p* such that the data obtained in the
sample would NOT lead us to reject Ho: p = p*.
One-sided Interval:

Let p* = .15. P(5 or more deaths out of 13 if p = .15) =
data BinomialProbabilities;
prob = 1-cdf('Binomial', 4, .15, 13);
proc print data=BinomialProbabilities;
run;

Let p* = .16. P(5 or more deaths out of 13 if p = .16) =
data BinomialProbabilities;
prob = 1-cdf('Binomial', 4, .15, 13);
proc print data=BinomialProbabilities;
run;

Let p* = .17. P(5 or more deaths out of 13 if p = .17) =
data BinomialProbabilities;
prob = 1-cdf('Binomial', 4, .15, 13);
proc print data=BinomialProbabilities;
run;
6
Two-sided Interval:
Section 6.8 of your text discusses two-sided binomial exact confidence intervals:
An exact (1-α)100% confidence interval for the binomial proportion is given by (p1, p2),
where p1 and p2 satisfy these equations:
P(X ≥ x | p = p1) = α/2
P(X ≤ x | p = p2) = α/2
To calculate these endpoints in SAS, you can use the following program:
data BinomialExactCIs;
n = 13; *Input this--sample size;
x = 5; *Input this--number of "successes" in your sample;
do i = 0 to 1 by .01;
lower = 1-cdf('Binomial', x-1, i, n);
upper = cdf('Binomial', x, i, n);
output;
end;
proc print data=BinomialExactCIs; run;
Output:
7
Usually with exact confidence intervals, we cannot exactly satisfy α/2 in each tail.
Instead, we use a more conservative approach:

Lower endpoint: Find the largest value of p1 so that P(X ≥ x | p = p1) ≤ α/2

Upper endpoint: Find the smallest value of p2 so that P(X ≤ x | p = p2) ≤ α/2
Using these guidelines and the above SAS output, write the two-sided 95% confidence
interval for p = the true proportion of deaths from cancer for nuclear-power plant
workers.
8
One-sided Intervals:
To obtain one-sided confidence intervals use the same process but use  instead of 

One-sided lower bound 100(1- CI for p

Lower endpoint: Find the largest value of p1 so that P(X ≥ x | p = p1) ≤ α
100(1 - )% CI for p is then given by (p1 , 1)
One-sided upper bound 100(1- CI for p

Upper endpoint: Find the smallest value of p2 so that P(X ≤ x | p = p2) ≤ α
100(1 - )% CI for p is then given by (0, p2)
Using these guidelines and the above SAS output, write the one-sided lower bound 95%
confidence interval for p = the true proportion of deaths from cancer for nuclear-power
plant workers.
USING SAS PROC FREQ FOR TESTS AND CONFIDENCE INTERVALS
Once again, consider the example dealing with cancer deaths and nuclear-power plant
workers. Recall that 5 out of 13 deaths of workers were cancer-related, while the rate in
the general population was only 20%.
You can use the following commands to carry out the test in SAS PROC FREQ:
data Cancer;
input cancer_death$ count;
datalines;
yes 5
no 8
;
proc freq order=data;
tables cancer_death / binomial(p=.20);
weight count;
exact binomial;
run;
9
Note: SAS uses an alternative method based on the
F-distribution. The results are slightly different than
those obtained using the binomial distribution.
EXAMPLE 2: Obstetrics (Exercise 7.22 of your text, page 260)
The probability of bearing twins in the U.S. is about 1 in 90. This proportion is thought
to be affected by a number of factors, including age, race, and parity. To study the effect
of age, hospital records are abstracted. Of 538 deliveries for women under 20, 2 resulted
in twins. Is there evidence that the probability of bearing twins is smaller for younger
women (under 20) than for the general population?
Use the binomial exact test to investigate this research question.

Check assumptions. For this test, you must check whether the binomial
distribution is appropriate for the problem.

Set up your null and alternative hypotheses.
Ho:
Ha:
10

Use the binomial distribution to find the exact p-value.
Recall that the p-value is the probability of observing a sample AT LEAST AS
EXTREME as our data, assuming the null hypothesis is true. For this example,
“at least as extreme” implies observing 2 or fewer twin births.
We can use SAS to find the probability of 2 or fewer:
data BinomialProbabilities;
prob = cdf('Binomial', 2, .011, 538);
proc print data=BinomialProbabilities;
run;
The interpretation of this p-value is as follows: If there is in reality no
difference between the proportion of twin births in women under 20 compared
to the proportion of twin births in the general population, there is only a 6.5%
probability of observing results at least as extreme as those obtained in this
sample just by random chance alone.
In this example, the p-value is technically not small enough for us to rule out
observing the difference by random chance alone; that is, we do not have
enough evidence to indicate the results are statistically significant.
However, since this p-value is close to falling below 5%, some may claim we
have marginal evidence that the study population differs from the general
population.

Write a conclusion in the context of the original problem:
“Using a significance level of 5%, we do NOT have evidence to conclude the
probability of bearing twins is smaller for younger women (under 20) than for the
general population (p-value = .065).”
Alternatively, some may say, “We have marginal evidence that the probability of
bearing twins is smaller for younger women (under 20) than for the general
population (p-value = .065).”

To calculate an exact two-sided confidence interval, you need to modify the SAS
program somewhat:
data BinomialExactCIs;
n = 538; *Input this--the sample size;
x = 2; *Input this-- number of "successes" in your sample;
do i = 0 to 1 by .0001;
lower = 1-cdf('Binomial', x-1, i, n);
upper = cdf('Binomial', x, i, n);
output;
end;
11
proc print data=BinomialExactCIs; run;
Finally, let’s verify the p-value and confidence interval using PROC FREQ:
data TwinBirths;
input twins$ count;
datalines;
yes 2
no 536
;
proc freq order=data;
tables twins / binomial(p=.011);
weight count;
exact binomial;
run;
12
Binomial Exact CI’s in R
The package exactci available on CRAN has functions to conduct binomial exact tests
for p and construct exact confidence intervals for p.
Description
Performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli
experiment. This function will also return an binomial exact CI for p.
Usage
binom.test(x, n, p = 0.5,
alternative = c("two.sided", "less", "greater"),
conf.level = 0.95)
Arguments
x
number of successes, or a vector of length 2 giving the numbers of successes and
failures, respectively.
n
number of trials; ignored if x has length 2.
p
hypothesized probability of success.
alternative indicates the alternative hypothesis and must be one of "two.sided", "greater" or
"less". You can specify just the initial letter.
conf.level confidence level for the returned confidence interval.
Examples above done in R
Example 1: Nuclear power plant workers example
> binom.test(5,13,p=.20,"greater")
Exact binomial test
data: 5 and 13
number of successes = 5, number of trials = 13, p-value = 0.09913
alternative hypothesis: true probability of success is greater than 0.2
95 percent confidence interval:
0.1656594 1.0000000
sample estimates:
probability of success
0.3846154
13
To obtain a two-sided confidence interval instead, specify a two-tailed alternative in the
binomial test.
> binom.test(5,13,p=.20,"two")
Exact binomial test
data: 5 and 13
number of successes = 5, number of trials = 13, p-value = 0.1541
alternative hypothesis: true probability of success is not equal to 0.2
95 percent confidence interval:
0.1385793 0.6842224
sample estimates:
probability of success
0.3846154
Example 2: Obstetrics example
> binom.test(2,538,p=1/90,"less")
Exact binomial test
data: 2 and 538
number of successes = 2, number of trials = 538, p-value = 0.06197
alternative hypothesis: true probability of success is less than 0.01111111
95 percent confidence interval:
0.00000000 0.01165559
sample estimates:
probability of success
0.003717472
Two-sided CI is given below although one-sided upper is appropriate here.
> binom.test(2,538,p=1/90,"two")
Exact binomial test
data: 2 and 538
number of successes = 2, number of trials = 538, p-value = 0.1432
alternative hypothesis: true probability of success is not equal to 0.01111111
95 percent confidence interval:
0.0004505206 0.0133637398
sample estimates:
probability of success
0.003717472
14
Binomial Exact Tests in JMP
In JMP you can use the Binomial Table in JMP to find p-values for the exact test.
Remember if you are doing a two-sided test to double the one-sided p-value.
For Binomial Exact CI’s for p you can use the Binomial Exact CI’s in JMP file to
find the exact LCL and UCL.
Example 1: Nuclear power plant workers
15
Conducting the test and CI using Analyze > Distribution in JMP
16
Download