Stat 653 HW3 Divya Nair Exercise 1

Stat 653 HW3
Divya Nair
Exercise 1 (2.1). An article in the New York Times (February 17, 1999) about the PSA blood test for
detecting prostate cancer stated that, of men who had this disease, the test fails to detect prostate cancer
in 1 in 4 (so-called false-negative results), and of men who did not have it, as many as two-thirds receive
false-positive results. Let C (C̄) denote the event of having (not having) prostate cancer and let + (−) denote
a positive (negative) test result.
a. Which is true: P(− | C) = 1/4 or P(C | −) = 1/4? P(C̄ | +) = 2/3 or P(+ | C̄) = 2/3?
Solution. The statement "... of the men who had this disease, the test fails to detect prostate cancer in
1 in 4 ..." conditions on having the disease, so it precisely means that P(− | C) = P(− ∩ C) / P(C) = 1/4.
Similarly, "... of men who did not have it (disease), as many as two-thirds receive false-positive results"
conditions on not having the disease, so it precisely means that P(+ | C̄) = P(+ ∩ C̄) / P(C̄) = 2/3.
Hence, P(− | C) = 1/4 and P(+ | C̄) = 2/3 are true.
b. What is the sensitivity of this test?
Solution. Sensitivity is the probability that the diagnostic test is positive given that a subject has the
disease. Using the complement rule for conditional probability and the known probability from part (a),
P(+ | C) = 1 − P(− | C) = 1 − 1/4 = 3/4.
c. Of men who take the PSA test, suppose P(C) = 0.01. Find the cell probabilities in the 2 × 2 table for
the joint distribution that cross classifies Y = diagnosis (+, −) with X = true disease status (C, C̄).
Solution. The 2 × 2 table with all the cell probabilities is given below.

                 True Disease Status
Diagnosis        C          C̄          Total
+                0.0075     0.66       0.6675
−                0.0025     0.33       0.3325
Total            0.01       0.99       1

The values in this table are filled in the following way:
Since P(C) = 0.01, its complement is P(C̄) = 1 − P(C) = 0.99. This fills in the Total row.
Next, applying the definition of conditional probability to P(− | C) and using the known probability
from part (a), we have P(− ∩ C) = P(− | C) · P(C) = 1/4 × 0.01 = 0.0025. Consequently,
P(+ ∩ C) = 0.01 − 0.0025 = 0.0075. These calculations fill in the C column.
Using the complement rule for conditional probability, we have P(− | C̄) = 1 − P(+ | C̄) = 1 − 2/3 = 1/3.
Also, applying the definition of conditional probability to P(− | C̄) gives
P(− ∩ C̄) = P(− | C̄) · P(C̄) = 1/3 × 0.99 = 0.33. Hence, P(+ ∩ C̄) = 0.99 − 0.33 = 0.66,
P(+) = 0.0075 + 0.66 = 0.6675, and P(−) = 0.0025 + 0.33 = 0.3325. This completes the table.
d. Using (c), find the marginal distribution for the diagnosis.
Solution. As computed in part (c), P(+) = 0.6675 and P(−) = 0.3325.
e. Using (c) and (d), find P(C | +), and interpret.
Solution. As computed in parts (c) and (d), P(C | +) = P(C ∩ +) / P(+) = 0.0075 / 0.6675 = 0.01124. This means
that the probability that a man actually has prostate cancer, given that he tested positive for it, is
only 0.01124.
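The arithmetic in Exercise 1 (c) through (e) can be checked with a short script. This is only a sketch: the
variable names are illustrative and not part of the original solution, and the inputs are the three
probabilities stated in the exercise.

```python
# Inputs stated in Exercise 1 (variable names are illustrative assumptions).
p_c = 0.01                 # P(C), prevalence assumed in part (c)
p_neg_given_c = 1 / 4      # P(- | C), false-negative rate
p_pos_given_not_c = 2 / 3  # P(+ | not C), false-positive rate

# Joint probabilities from P(A and B) = P(A | B) * P(B).
joint = {
    ("+", "C"): (1 - p_neg_given_c) * p_c,
    ("-", "C"): p_neg_given_c * p_c,
    ("+", "notC"): p_pos_given_not_c * (1 - p_c),
    ("-", "notC"): (1 - p_pos_given_not_c) * (1 - p_c),
}

# Marginal distribution of the diagnosis (part d).
p_pos = joint[("+", "C")] + joint[("+", "notC")]
p_neg = joint[("-", "C")] + joint[("-", "notC")]

# Conditional probability of disease given a positive test (part e).
p_c_given_pos = joint[("+", "C")] / p_pos

print(round(p_pos, 4), round(p_neg, 4), round(p_c_given_pos, 5))
```

Because the false-positive cell (0.66) dwarfs the true-positive cell (0.0075), the conditional probability
of disease given a positive test stays close to the 1% prevalence.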
Exercise 2 (2.2). For diagnostic testing, let X = true status (1 = disease, 2 = no disease) and
Y = diagnosis (1 = positive, 2 = negative). Let πi = P(Y = 1 | X = i), i = 1, 2.
a. Explain why sensitivity = π1 and specificity = 1 − π2.
Solution. Sensitivity is the probability that the diagnostic test is positive given that a subject has the
disease, that is, P(Y = 1 | X = 1). Said differently, sensitivity is the probability of success for the
subjects in row 1 of the contingency table, and so it is given by π1.
Specificity is the probability that the test is negative given that the subject does not have the disease,
that is, P(Y = 2 | X = 2). In other words, specificity is the probability of failure for the subjects in
row 2 of the contingency table. Hence, it is given by 1 − P(Y = 1 | X = 2) = 1 − π2.
b. Let γ denote the probability that a subject has the disease. Given that the diagnosis is positive, use
Bayes' theorem to show that the probability a subject truly has the disease is
π1 γ / [π1 γ + π2 (1 − γ)].
Solution. Recall that Bayes's Theorem is P(A | B) = P(B | A) · P(A) / P(B). The probability that the
diagnosis is positive is
P(Y = 1) = P(Y = 1 ∩ X = 1) + P(Y = 1 ∩ X = 2)
= P(Y = 1 | X = 1) · P(X = 1) + P(Y = 1 | X = 2) · P(X = 2).
Thus, the probability that a subject truly has the disease, P(X = 1 | Y = 1), is given by
P(X = 1 | Y = 1) = P(Y = 1 | X = 1) · P(X = 1) / P(Y = 1)
= P(Y = 1 | X = 1) · P(X = 1) / [P(Y = 1 | X = 1) · P(X = 1) + P(Y = 1 | X = 2) · P(X = 2)]
= π1 γ / [π1 γ + π2 (1 − γ)].
c. For mammograms for detecting breast cancer, suppose γ = 0.01, sensitivity = 0.86, and specificity = 0.88.
Given a positive test result, find the probability that the woman truly has breast cancer.
Solution. From part (b), the probability that the woman truly has breast cancer is given by
π1 γ / [π1 γ + π2 (1 − γ)].
Here π1 = sensitivity = 0.86, and since specificity = 1 − π2 = 0.88, we get that π2 = 1 − 0.88 = 0.12. Thus,
π1 γ / [π1 γ + π2 (1 − γ)] = (0.86 × 0.01) / [0.86 × 0.01 + 0.12 (1 − 0.01)] = 0.0675.
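The Bayes' theorem formula from part (b) is easy to wrap in a small function. The sketch below is one
possible way to do it; the function name positive_predictive_value and its argument names are illustrative
choices, not notation from the exercise.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(X = 1 | Y = 1) = pi1*gamma / (pi1*gamma + pi2*(1 - gamma))."""
    pi1 = sensitivity        # P(Y = 1 | X = 1)
    pi2 = 1.0 - specificity  # P(Y = 1 | X = 2)
    true_pos = pi1 * prevalence
    false_pos = pi2 * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Mammogram values from part (c): gamma = 0.01, sensitivity = 0.86, specificity = 0.88.
ppv = positive_predictive_value(0.86, 0.88, 0.01)
print(round(ppv, 4))
```

The same function reproduces Exercise 1 (e) when called with sensitivity 3/4, specificity 1/3, and
prevalence 0.01.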
d. To better understand the answer in (c), find the joint probabilities for the 2 × 2 cross classification of
X and Y. Discuss their relative sizes in the two cells that refer to a positive test result.
Solution. The 2 × 2 table with joint probabilities is given below.

                 True Status
Diagnosis        X = 1      X = 2      Total
Y = 1            0.0086     0.1188     0.1274
Y = 2            0.0014     0.8712     0.8726
Total            0.01       0.99       1
The values in the table are found in the following way:
P(Y = 1 ∩ X = 1) = P(Y = 1 | X = 1) · P(X = 1) = 0.86 × 0.01 = 0.0086
P(Y = 2 ∩ X = 1) = 0.01 − 0.0086 = 0.0014
P(X = 2) = 1 − 0.01 = 0.99
P(Y = 1 ∩ X = 2) = P(Y = 1 | X = 2) · P(X = 2) = 0.12 × 0.99 = 0.1188
P(Y = 2 ∩ X = 2) = 0.99 − 0.1188 = 0.8712.
The probability that a woman has breast cancer and tests positive (0.0086) is much lower than the probability
that a woman does not have breast cancer but tests positive (0.1188). Most positive results are therefore
false positives, which is why the probability found in (c) is so small.
Exercise 3 (2.3). According to recent UN figures, the annual gun homicide rate is 62.4 per one million
residents in the United States and 1.3 per one million residents in the UK.
a. Compare the proportion of residents killed annually by guns using (i) difference of proportions, (ii)
relative risk.
Solution. Let p1 and p2 denote the proportion of residents killed annually by guns in the United States and
in the UK, respectively.
(i) Difference of proportions: ∆̂ = p1 − p2 = 62.4 per one million − 1.3 per one million = 61.1 per one
million, that is, 0.0000611.
(ii) Relative risk is given by π1/π2, which is estimated by p1/p2 = 62.4 / 1.3 = 48.
We see the difference of proportions is a very small number compared to the relative risk.
b. When both proportions are very close to 0, as here, which measure is more useful for describing the
strength of association? Why?
Solution. The relative risk is the more useful measure of the strength of association here. The difference
of proportions (0.0000611) is so small that it misleads one into thinking that the difference in the annual
gun homicide rate between the two countries is negligible, whereas the relative risk shows that the rate in
the United States is 48 times the rate in the UK.
Exercise 4 (2.4). A newspaper article preceding the 1994 World Cup semifinal match between Italy and
Bulgaria stated that "Italy is favored 10-11 to beat Bulgaria, which is rated at 10-3 to reach the final."
Suppose this means that the odds that Italy wins are 11/10 and the odds that Bulgaria wins are 3/10. Find the
probability that each team wins, and comment.
Solution. The probability of success is given by odds / (odds + 1). The probability that Italy wins is
(11/10) / (11/10 + 1) = 0.5238, and the probability that Bulgaria wins is (3/10) / (3/10 + 1) = 0.2308.
These two probabilities sum to 0.7546 rather than 1, even though one of the two teams must win the match, so
the quoted odds do not form a coherent probability distribution over the two outcomes.
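The conversion from odds to probability used in Exercise 4 can be sketched in a few lines. The helper name
odds_to_probability is an illustrative choice, not something defined in the exercise.

```python
def odds_to_probability(odds):
    """Convert odds of success into a probability: odds / (odds + 1)."""
    return odds / (odds + 1)


p_italy = odds_to_probability(11 / 10)    # odds that Italy wins
p_bulgaria = odds_to_probability(3 / 10)  # odds that Bulgaria wins

print(round(p_italy, 4), round(p_bulgaria, 4), round(p_italy + p_bulgaria, 4))
```

Printing the sum alongside the two probabilities makes it easy to see whether a set of quoted odds adds up
to 1.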
Exercise 5 (2.5). Consider the following two studies reported in the New York Times :
a. A British study reported (December 3, 1998) that, of smokers who get lung cancer, "women are 1.7
times more vulnerable than men to get small-cell lung cancer." Is 1.7 an odds ratio, or a relative risk?
Solution. The number 1.7 is a relative risk, since the proportion of women who get small-cell lung
cancer is being compared to the proportion of men who get small-cell lung cancer.
b. A National Cancer Institute study about tamoxifen and breast cancer reported (April 7, 1998) that
the women taking the drug were
45% less likely to experience invasive breast cancer compared with the
women taking placebo. Find the relative risk for (i) those taking the drug compared to those taking
placebo, (ii) those taking placebo compared to those taking the drug.
Solution. The relative risk for those taking the drug compared to those taking placebo is
π1/π2 = 1 − 0.45 = 0.55. On the other hand, the relative risk for those taking placebo compared to those
taking the drug is π2/π1 = 1/0.55 = 1.8182.
Exercise 6 (2.6). In the United States, the estimated annual probability that a woman over the age of 35
dies of lung cancer equals 0.001304 for current smokers and 0.000121 for nonsmokers [M. Pagano and K.
Gauvreau, Principles of Biostatistics, Belmont, CA: Duxbury Press (1993), p. 134].
a. Calculate and interpret the difference of proportions and the relative risk. Which is more informative
for this data? Why?
Solution. The difference of proportions is ∆̂ = p1 − p2 = 0.001304 − 0.000121 = 0.001183. The relative
risk is π1/π2 = 0.001304 / 0.000121 = 10.7769. The relative risk is more informative here since the
difference of proportions is so small.
b. Calculate and interpret the odds ratio. Explain why the relative risk and odds ratio take similar values.
Solution. The odds ratio is given by
[π1 / (1 − π1)] / [π2 / (1 − π2)] = [0.001304 / (1 − 0.001304)] / [0.000121 / (1 − 0.000121)] = 10.7896.
Since the odds ratio is greater than 1, we conclude that women who smoke and are over the age of 35 are more
likely to die of lung cancer than women who do not smoke and are over the age of 35.
The relative risk and odds ratio take similar values because both π1 and π2 are close to zero. Consequently,
1 − π1 is approximately the same as 1 − π2. They then cancel each other in the odds ratio formula, leaving
the formula for the relative risk.
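The three comparison measures in Exercise 6 can be computed together. The following is a minimal sketch;
the function name compare_proportions is an illustrative choice and not part of the original solution.

```python
def compare_proportions(p1, p2):
    """Return (difference of proportions, relative risk, odds ratio)."""
    diff = p1 - p2
    relative_risk = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return diff, relative_risk, odds_ratio


# Annual probabilities of death from lung cancer: smokers vs. nonsmokers.
diff, rr, odds_ratio = compare_proportions(0.001304, 0.000121)
print(round(diff, 6), round(rr, 4), round(odds_ratio, 4))
```

Algebraically, the odds ratio equals the relative risk times (1 − π2) / (1 − π1). When both probabilities
are near zero that factor is close to 1, which is the cancellation argument given in part (b).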
Exercise 7 (2.7). For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender
(female, male) and survival (yes, no) was 11.4. (For data, see R. Dawson, J. Statist. Educ. 3, no. 3, 1995.)
a. What is wrong with the interpretation, "The probability of survival for females was 11.4 times that for
males."? Give the correct interpretation.
Solution. The odds ratio is the ratio of the odds of an event occurring in one group to the odds of that
event occurring in another group. The given interpretation treats 11.4 as a ratio of probabilities (a relative
risk), which it is not. The correct interpretation is "The odds of survival for females were 11.4 times the
odds of survival for males."
b. The odds of survival for females equaled 2.9. For each gender, find the proportion who survived.
Solution. The odds ratio is θ = odds_F / odds_M. It is given in the problem that θ = 11.4 and odds_F = 2.9,
which gives odds_M = 2.9 / 11.4 = 0.2544. The probability of survival for males is given by
πM = odds_M / (odds_M + 1) = 0.2544 / (0.2544 + 1) = 0.2028, and the probability of survival for females is
given by πF = odds_F / (odds_F + 1) = 2.9 / (2.9 + 1) = 0.7436.
c. Find the value of R in the interpretation, "The probability of survival for females was R times that for
males."
Solution. For the given interpretation to be sensible, R here has to be the relative risk, which is given
by πF / πM = 0.7436 / 0.2028 = 3.6667.
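The chain of calculations in Exercise 7 (b) and (c), from the odds ratio to the two survival proportions and
then to the relative risk, can be sketched as follows. The variable names are illustrative; only the values
11.4 and 2.9 come from the exercise.

```python
odds_ratio = 11.4   # odds ratio between gender (female, male) and survival
odds_female = 2.9   # odds of survival for females, given in part (b)

# Odds of survival for males follow from theta = odds_female / odds_male.
odds_male = odds_female / odds_ratio

# Convert each odds value to a probability: odds / (odds + 1).
p_female = odds_female / (odds_female + 1)
p_male = odds_male / (odds_male + 1)

# Relative risk R: ratio of survival probabilities, females to males.
relative_risk = p_female / p_male

print(round(odds_male, 4), round(p_male, 4), round(p_female, 4), round(relative_risk, 4))
```

The relative risk comes out far below the odds ratio of 11.4 because the survival probabilities here are not
close to zero.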
Exercise 8 (2.8). A research study estimated that under a certain condition, the probability a subject
would be referred for heart catheterization was 0.906 for whites and 0.847 for blacks.
a. A press release about the study stated that the odds of referral for cardiac catheterization for blacks
are 60% of the odds for whites. Explain how they obtained 60% (more accurately, 57%).
Solution. The odds ratio is θ = odds_B / odds_W = [πB / (1 − πB)] / [πW / (1 − πW)]
= (0.847 / 0.153) / (0.906 / 0.094) = 0.5744, that is, about 57%, which the press release rounded to 60%.
b. An Associated Press story that described the study stated "Doctors were only 60% as likely to order
cardiac catheterization for blacks as for whites." What is wrong with this interpretation? Give the
correct percentage for this interpretation. (In stating results to the general public, it is better to use
the relative risk than the odds ratio. It is simpler to understand and less likely to be misinterpreted.
For details, see New Engl. J. Med., 341: 279-283, 1999.)
Solution. The given interpretation compares the probability of cardiac catheterization for blacks with
the probability of cardiac catheterization for whites, but 60% describes the odds ratio instead. The
interpretation can be corrected by using the relative risk, which is πB / πW = 0.847 / 0.906 = 0.9349 ≈ 93%.
Exercise 9 (2.9). An estimated odds ratio for adult females between the presence of squamous cell carcinoma
(yes, no) and smoking behavior (smoker, nonsmoker) equals 11.7 when the smoker category consists of
subjects whose smoking level s is 0 < s < 20 cigarettes per day; it is 26.1 for smokers with s ≥ 20 cigarettes
per day (R. Brownson et al., Epidemiology, 3: 61-64, 1992). Show that the estimated odds ratio between
carcinoma and smoking levels (s ≥ 20, 0 < s < 20) equals 26.1/11.7 = 2.2.
Solution. Let odds_c, odds_s, and odds_ss denote the estimated odds of squamous cell carcinoma (Y) for
nonsmokers, for smokers with 0 < s < 20 cigarettes per day, and for smokers with s ≥ 20 cigarettes per day,
respectively. In a 2 × 2 table, the estimated odds ratio between carcinoma and a smoking level (X) of
0 < s < 20 cigarettes per day is odds_s / odds_c = 11.7. Similarly, the estimated odds ratio between
carcinoma and a smoking level of s ≥ 20 cigarettes per day is odds_ss / odds_c = 26.1. Then the estimated
odds ratio between carcinoma and smoking levels (s ≥ 20, 0 < s < 20) is
odds_ss / odds_s = (26.1 × odds_c) / (11.7 × odds_c) = 2.2.
Exercise 10 (2.10). Data posted at the FBI website (www.fbi.gov) stated that of all blacks slain in 2005,
91% were slain by blacks, and of all whites slain in 2005, 83% were slain by whites. Let Y denote race of
victim and X denote race of murderer.
a. What conditional distribution do these statistics refer to, Y given X, or X given Y?
Solution. These statistics condition on the race of the victim, so they refer to X given Y.
b. Calculate and interpret the odds ratio between X and Y.
Solution. The given information is filled in the following 2 × 2 contingency table, where b stands for
black and w stands for white. Each row gives the conditional distribution of X given Y.

          X
Y         b        w
b         0.91     0.09
w         0.17     0.83

The odds ratio between X and Y is then
[π1 / (1 − π1)] / [π2 / (1 − π2)] = (0.91 / 0.09) / (0.17 / 0.83) = 49.37,
where π1 = 0.91 and π2 = 0.17 are the probabilities that the murderer was black given a black victim and
given a white victim, respectively. The estimated odds that the murderer was black are 49.37 times as high
when the victim was black as when the victim was white.
c. Given that a murderer was white, can you estimate the probability that the victim was white? What
additional information would you need to do this? (Hint: How could you use Bayes's Theorem?)
Solution. By Bayes's Theorem,
P(Y = w | X = w) = P(X = w | Y = w) · P(Y = w) / P(X = w),
where w stands for white. We know P(X = w | Y = w) = 0.83, so to estimate this probability we additionally
need P(Y = w) and P(X = w).
Exercise 11 (2.12). A statistical analysis that combines information from several studies is called a meta
analysis. A meta analysis compared aspirin with placebo on incidence of heart attack and of stroke, separately
for men and for women (J. Am. Med. Assoc., 295: 306-313, 2006). For the Women's Health Study, heart
attacks were reported for 198 of 19,934 taking aspirin and for 193 of 19,942 taking placebo.
a. Construct a 2 × 2 table that cross classifies the treatment (aspirin, placebo) with whether a heart attack
was reported (yes, no).
Solution. The given information is recorded in the 2 × 2 table below, where A = aspirin, P = placebo,
Y = heart attack reported, and N = no heart attack reported.

                 Heart Attack
Treatment        Y        N          Total
A                198      19,736     19,934
P                193      19,749     19,942
b. Estimate the odds ratio and interpret.
Solution. The odds ratio is θ̂ = (n11 / n12) / (n21 / n22) = (198 / 19,736) / (193 / 19,749) = 1.0266. Since the
odds ratio is greater than 1, the estimated odds of a heart attack are slightly higher (by about 2.7%) for
women who take aspirin than for women who take placebo.
c. Find a 95% confidence interval for the population odds ratio for women. Interpret. (As of 2006, results
suggested that for women, aspirin was helpful for reducing risk of stroke but not necessarily risk of
heart attack.)
Solution. The confidence interval for log θ is given by log θ̂ ± Zα/2 · σ(log θ̂), where log is the natural
logarithm and σ(log θ̂) = √(1/n11 + 1/n12 + 1/n21 + 1/n22).
The calculations needed to compute the 95% confidence interval are shown below.
log θ̂ = log 1.0266 = 0.0262
σ(log θ̂) = √(1/198 + 1/19736 + 1/193 + 1/19749) = 0.1017.
Thus, log θ̂ ± Zα/2 · σ(log θ̂) becomes 0.0262 ± 1.96 × 0.1017 = (−0.1731, 0.2255), and so the 95%
confidence interval for the odds ratio is (e^−0.1731, e^0.2255) = (0.841, 1.253). Since the interval contains
θ = 1, it is plausible that the true odds of heart attack are the same for both treatments.
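The confidence interval for the odds ratio in Exercise 11 (c) can be computed directly from the four cell
counts. The sketch below assumes the standard Wald interval on the natural-log scale; the function name
odds_ratio_wald_ci is illustrative, and the normal quantile is taken from the Python standard library
rather than rounded to 1.96, so the last digit may differ slightly from a hand calculation.

```python
import math
from statistics import NormalDist


def odds_ratio_wald_ci(n11, n12, n21, n22, level=0.95):
    """Return (odds ratio, lower, upper) using a Wald interval for log(theta)."""
    theta_hat = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    log_theta = math.log(theta_hat)  # natural logarithm
    lower = math.exp(log_theta - z * se_log)
    upper = math.exp(log_theta + z * se_log)
    return theta_hat, lower, upper


# Aspirin row (198 yes, 19736 no) and placebo row (193 yes, 19749 no).
theta_hat, lower, upper = odds_ratio_wald_ci(198, 19736, 193, 19749)
print(round(theta_hat, 4), round(lower, 3), round(upper, 3))
```

The same function can be reused for the 90% interval in Exercise 12 (b) by passing the counts
509, 116, 398, 104 and level=0.90.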
Exercise 12 (2.13). Refer to Table 2.1 about belief in an afterlife.

                 Belief in After Life
Gender           Y        N        Total
F                509      116      625
M                398      104      502
Total            907      220      1127

a. Construct a 90% confidence interval for the difference of proportions, and interpret.
Solution. The confidence interval for the difference of proportions is given by
(p1 − p2) ± Zα/2 · √(p1 (1 − p1) / n1 + p2 (1 − p2) / n2).
Here, p1 = 509/625 = 0.8144 and p2 = 398/502 = 0.7928. Thus, the 90% confidence interval is
(0.8144 − 0.7928) ± 1.645 × √(0.8144 × 0.1856 / 625 + 0.7928 × 0.2072 / 502) = 0.0216 ± 0.0392
= (−0.0176, 0.0608).
Since this interval contains 0, it is plausible that π1 = π2. At the 90% confidence level, the data do not
show a difference between the proportions of females and males who believe in an afterlife.
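The interval in Exercise 12 (a) can be reproduced from the raw counts. This is a sketch of the Wald
interval for a difference of two independent proportions; the function name diff_proportions_wald_ci is an
illustrative choice, and the exact normal quantile is used in place of 1.645.

```python
import math
from statistics import NormalDist


def diff_proportions_wald_ci(x1, n1, x2, n2, level=0.90):
    """Return (p1 - p2, lower, upper) for two independent binomial samples."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


# Females: 509 of 625 believe; males: 398 of 502 believe.
diff, lower, upper = diff_proportions_wald_ci(509, 625, 398, 502)
print(round(diff, 4), round(lower, 4), round(upper, 4))
```

An interval that straddles 0 is consistent with equal proportions in the two groups; it does not by itself
indicate which group has the larger proportion.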
b. Construct a 90% confidence interval for the odds ratio, and interpret.
Solution. The confidence interval for log θ is given by log θ̂ ± Zα/2 · σ(log θ̂), where log is the natural
logarithm. All the calculations are shown below.
log θ̂ = log[(509/116) / (398/104)] = log 1.1466 = 0.1368
σ(log θ̂) = √(1/509 + 1/116 + 1/398 + 1/104) = 0.1507
log θ̂ ± Zα/2 · σ(log θ̂) = 0.1368 ± 1.645 × 0.1507 = (−0.1111, 0.3847)
The 90% confidence interval for the odds ratio is (e^−0.1111, e^0.3847) = (0.895, 1.469). Since this interval
contains θ = 1, it is plausible that the true odds of belief in an afterlife are the same for males and
females.
c. Conduct a test of statistical independence. Report the p-value and interpret.
Solution. The null hypothesis H0 is that the two variables are independent, that is, πij = πi+ · π+j
for all i and j. The alternative hypothesis is that the two variables are dependent on each other. The
Pearson chi-squared statistic for testing H0 is given by X² = Σ (nij − µ̂ij)² / µ̂ij, where the sum is over
all cells. An estimate of the expected frequency is given by µ̂ij = ni+ · n+j / n. A calculation of the
estimated expected frequencies for each cell is given below.
µ̂11 = 625 × 907 / 1127 = 502.9947
µ̂12 = 625 × 220 / 1127 = 122.0053
µ̂21 = 502 × 907 / 1127 = 404.0053
µ̂22 = 502 × 220 / 1127 = 97.9947
The Pearson chi-squared statistic is
X² = (509 − 502.9947)² / 502.9947 + (116 − 122.0053)² / 122.0053 + (398 − 404.0053)² / 404.0053
+ (104 − 97.9947)² / 97.9947 = 0.8246.
The degrees of freedom are (I − 1)(J − 1) = (2 − 1)(2 − 1) = 1. The p-value is 0.3638. We fail to reject
the null hypothesis: the data do not provide evidence that belief in an afterlife and gender are dependent.
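The Pearson chi-squared test in Exercise 12 (c) can be carried out with the standard library alone. The
sketch below relies on the fact that, for 1 degree of freedom, the upper-tail chi-squared probability
equals erfc(√(x/2)); this shortcut applies only to 2 × 2 tables, and the variable names are illustrative.

```python
import math

# Observed counts: rows = gender (F, M), columns = belief (Y, N).
observed = [[509, 116], [398, 104]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequencies under independence: (row total * column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic summed over all four cells.
x2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# With df = 1, P(chi-squared > x2) = erfc(sqrt(x2 / 2)).
p_value = math.erfc(math.sqrt(x2 / 2))

print(round(x2, 4), round(p_value, 4))
```

For tables larger than 2 × 2 the degrees of freedom exceed 1, and a general chi-squared distribution
function would be needed instead of the erfc shortcut.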