Section 2

advertisement
STAT 405 - BIOSTATISTICS
Handout 2 – Methods for a Single Categorical Variable, Part I
EXAMPLE: Infectious Disease (modified from Example 4.30 of your text, page 87)
Suppose a group of 100 men aged 55-74 years received a new flu vaccine in 2005.
Five of them died within the next year. Researchers are interested in the following
question:
Is there evidence that the mortality rate of this group of men who received the new
flu vaccine is higher than the mortality rate of the general population of 55-74 year
old men?
Question: What type of study design describes this experiment?
DESCRIPTIVE METHODS:
Recall from your introductory statistics course that these data can be summarized using
a frequency distribution, a relative frequency distribution, and/or a bar chart.
Creating Summaries with SAS: You can use the following programming statements.
To create the data set of interest:
data Vaccine;
input Died$ count;
datalines;
Yes 5
No 95
;
To obtain summaries:
proc freq;
table Died;
weight count;
run;
Alternatively, you could use a Graphical User Interface in SAS to obtain summaries.
Select Solutions > Analysis > Interactive Data Analysis.
6
Go to the WORK library, and double click on the data set of interest (Vaccine).
Next, select Analyze > Distribution, and move the variable of interest to the Y box
and the count variable to the Freq box.
Click OK, and SAS displays the following:
7
Questions:
1. What is the population of interest in this study?
2. What is the sample?
3. What is the variable of interest?
4. Consider that in 2005 the U.S. annual mortality rate in 55-74 year old men was
about 17 deaths per 1,000 people. (Source: http://wonder.cdc.gov/)
Is the data obtained in our sample far enough from what we would expect to see
in the general population to give us reason to believe that the mortality rate of
those who received the new flu vaccine is higher than that of the general
population? How confident are you in your conclusion?
A SIMULATION STUDY:
8
To statistically determine whether the mortality rate of those receiving the vaccination is
higher than that of the general population, we can carry out a simulation study.

First, we simulate the general population of 1,000 men aged 55-74. To model the
2005 mortality rate, we will create a hypothetical population in which 17 die and
983 survive the year.

Next, we take samples of size 100 from this population in order to observe how
many deaths appear by random chance alone. For example, consider the results
of 10 random samples:
Sample
1 2 3 4 5 6 7 8 9 10
Number of deaths 1 4 1 0 2 4 1 1 2 2
In R: > rbinom(10,size=100,prob=.017)
[1] 1 4 1 0 2 4 1 1 2 2
Does it appear likely that we would, by random chance alone, observe as high as 5
deaths out of 100 if the true number of deaths in the population is 17 out of 1,000?
Explain.

To get a better understanding of this distribution of sample statistics, we should
take many more than 10 random samples. The next graphic shows what the
distribution looks like under repeated sampling:
Now, does it appear likely that we would, by random chance alone, observe as
high as 5 deaths in a sample of 100 if the true number of deaths in the population
is actually 17 out of 1,000? Do you think we have evidence that the mortality rate
of those receiving the vaccination is higher than that of the general population?
How confident are you in your answer? Explain.
9
In R:
> x = rbinom(10000,size=100,prob=.017)
> table(x)/10000
x
0
1
2
3
4
5
6
7
8
0.1751 0.3206 0.2659 0.1489 0.0637 0.0181 0.0058 0.0016 0.0003
USING THE BINOMIAL PROBABILITY MODEL TO MAKE DECISIONS
In the previous section, the following graphic was obtained through repeated sampling
from the population. This also shows the probabilities from what is known as a
binomial distribution.
The binomial distribution can be used whenever the following assumptions are met:
1.
2.
3.
4.
There are a fixed number of trials.
There are only two possible outcomes on each trial, “success” and “failure.”
The probability of a “success” remains constant from trial to trial.
The n trials are independent.
Check these assumptions for our vaccination example:
1. There are a fixed number of trials.
2. There are only two possible outcomes on each trial, “success” and “failure.”
3. The probability of a “success” remains constant from trial to trial.
4. The n trials are independent.
The Binomial Formula:
In general, the probability of obtaining x successes out of n trials is given as follows:
10
n!
p x q n  x for x = 0, 1, 2, ….n,
x!(n - x)!
where p = probability of success on a single trial, q = 1 – p the probability of failure on a
single trial, n = number of trials, and x = number of successes in n trials.
Using SAS to Calculate Binomial Probabilities:
Instead of using the above formula, you can use SAS to calculate binomial probabilities.
data BinomialProbabilities;
prob = pdf('Binomial', m, p, n);
proc print data=BinomialProbabilities;
run;
(m = number of “successes,” p = probability of “success”, and n = number of trials.)

For example, assume that the mortality rate in the vaccinated men aged 55-74 is
the same as that of the general population (17 deaths in 1,000 people). Find the
probability that a random sample of 100 men from this population will yield
exactly five deaths:
data BinomialProbabilities;
prob = pdf('Binomial', 5, .017, 100);
proc print; run;

Now, assume once again that the mortality rate in the vaccinated men aged 55-74
is the same as that of the general population (17 deaths in 1,000 people). Find
the probability that a random sample of 100 men from this population will yield
at least five deaths. To do this, we first find the probability of four or fewer:
data BinomialProbabilities;
do i = 0 to 4 by 1;
prob = pdf('Binomial', i, .017, 100);
output;
end;
proc print; run;
11

We can find this same cumulative probability using the ‘cdf’ function instead of
the ‘pdf’ function:
data BinomialProbabilities;
prob = cdf('Binomial', 4, .017, 100);
proc print; run;
Note that in general, to find the probability of observing AT LEAST m
“successes”, i.e. P(X > m), you can use the following programming
statements:
data BinomialProbabilities;
prop = 1 - cdf('Binomial', m-1, p, n);
proc print; run;
To find the probability of observing AT MOST m “successes,”
i.e. P(X < m), you can use the following programming statements:
data BinomialProbabilities;
prob = cdf('Binomial', m, p, n);
proc print; run;
To find the probability of observing EXACTLY m “successes,”
i.e. P(X = m), you can use the following programming statements:
data BinomialProbabilities;
prob = pdf('Binomial', m, p, n);
proc print data=BinomialProbabilities;
run;
Using R to Calculate Binomial Probabilities:
12
P(X < 4)
----------------------------------------------> pbinom(4,size=100,prob=.017,lower.tail=T)
[1] 0.9716334
P(X > 5) = 1 – P(X < 4)
----------------------------------------------> 1 - pbinom(4,size=100,prob=.017,lower.tail=T)
[1] 0.02836664
P(X > 5)
----------------------------------------------> pbinom(4,size=100,prob=.017,lower.tail=F)
[1] 0.02836664
P(X = 5)
----------------------------------------------> dbinom(5,size=100,prob=.017)
[1] 0.02096775
EXAMPLE: Environmental Health, Obstetrics (Exercise 4.34 of your text, page 114)
Suppose the rate of major congenital malformations in the general population is 2.5
malformations per 100 deliveries. A study is set up to investigate if offspring of
Vietnam-veteran fathers are at special risk for congenital malformations.
If 100 infants are identified in a birth registry as offspring of Vietnam-veteran fathers
and 4 have a major congenital malformation, is there an excess risk of malformations
in this group?
To answer this question, we find the probability of observing 4 or more congenital
malformations:
data BinomialProbabilities;
prob = 1-cdf('Binomial', 3, .025, 100);
proc print data=BinomialProbabilities;
run;
What is our conclusion? Explain.
Basic Form of the Binomial Exact Test
13
𝐻𝑜 : 𝑝 ≤ 𝑝𝑜
𝐻𝑎 : 𝑝 > 𝑝𝑜
𝐻𝑜 : 𝑝 ≥ 𝑝𝑜
𝐻𝑎 : 𝑝 < 𝑝𝑜
𝐻𝑜 : 𝑝 = 𝑝𝑜
𝐻𝑎 : 𝑝 ≠ 𝑝𝑜
14
Download