Stat 653 HW3 Divya Nair

Exercise 1 (2.1). An article in the New York Times (February 17, 1999) about the PSA blood test for detecting prostate cancer stated that, of men who had this disease, the test fails to detect prostate cancer in 1 in 4 (so-called false-negative results), and of men who did not have it, as many as two-thirds receive false-positive results. Let $C$ ($\bar{C}$) denote the event of having (not having) prostate cancer and let $+$ ($-$) denote a positive (negative) test result.

a. Which is true: $P(- \mid C) = \frac{1}{4}$ or $P(C \mid -) = \frac{1}{4}$? $P(\bar{C} \mid +) = \frac{2}{3}$ or $P(+ \mid \bar{C}) = \frac{2}{3}$?

Solution. The statement "... of men who had this disease, the test fails to detect prostate cancer in 1 in 4 ..." conditions on having the disease, so it means precisely that
\[ P(- \mid C) = \frac{P(- \cap C)}{P(C)} = \frac{1}{4}. \]
Similarly, "... of men who did not have it, as many as two-thirds receive false-positive results" conditions on not having the disease, so it means precisely that
\[ P(+ \mid \bar{C}) = \frac{P(+ \cap \bar{C})}{P(\bar{C})} = \frac{2}{3}. \]
Hence, $P(- \mid C) = \frac{1}{4}$ and $P(+ \mid \bar{C}) = \frac{2}{3}$ are true.

b. What is the sensitivity of this test?

Solution. Sensitivity is the probability that the diagnostic test is positive given that a subject has the disease. Using the complement rule for conditional probability and the known probability from part (a),
\[ P(+ \mid C) = 1 - P(- \mid C) = 1 - \frac{1}{4} = \frac{3}{4}. \]

c. Of men who take the PSA test, suppose $P(C) = 0.01$. Find the cell probabilities in the $2 \times 2$ table for the joint distribution that cross classifies $Y$ = diagnosis $(+, -)$ with $X$ = true disease status $(C, \bar{C})$.

Solution. The $2 \times 2$ table with all the cell probabilities is given below.

                              Diagnosis
True Disease Status     +          −          Total
C                       0.0075     0.0025     0.01
C̄                       0.66       0.33       0.99
Total                   0.6675     0.3325     1

The values in this table are filled in as follows. Since $P(C) = 0.01$, its complement is $P(\bar{C}) = 1 - P(C) = 0.99$. This fills in the third column. Next, applying the definition of conditional probability to $P(- \mid C)$ and using the known probability from part (a), we have
\[ P(- \cap C) = P(- \mid C) \cdot P(C) = \frac{1}{4} \times 0.01 = 0.0025. \]
Consequently, $P(+ \cap C) = 0.01 - 0.0025 = 0.0075$. These calculations fill in the first row.
Using the complement rule for conditional probability, we have
\[ P(- \mid \bar{C}) = 1 - P(+ \mid \bar{C}) = 1 - \frac{2}{3} = \frac{1}{3}. \]
Applying the definition of conditional probability to $P(- \mid \bar{C})$ gives
\[ P(- \cap \bar{C}) = P(- \mid \bar{C}) \cdot P(\bar{C}) = \frac{1}{3} \times 0.99 = 0.33. \]
Hence, $P(+ \cap \bar{C}) = 0.99 - 0.33 = 0.66$, $P(+) = 0.0075 + 0.66 = 0.6675$, and $P(-) = 0.0025 + 0.33 = 0.3325$. This completes the table.

d. Using (c), find the marginal distribution for the diagnosis.

Solution. As computed in part (c), $P(+) = 0.6675$ and $P(-) = 0.3325$.

e. Using (c) and (d), find $P(C \mid +)$, and interpret.

Solution. Using the values computed in parts (c) and (d),
\[ P(C \mid +) = \frac{P(C \cap +)}{P(+)} = \frac{0.0075}{0.6675} = 0.01124. \]
This means that a man who tests positive has a probability of only about 0.011 of actually having prostate cancer.

Exercise 2 (2.2). For diagnostic testing, let $X$ = true status (1 = disease, 2 = no disease) and $Y$ = diagnosis (1 = positive, 2 = negative). Let $\pi_i = P(Y = 1 \mid X = i)$, $i = 1, 2$.

a. Explain why sensitivity $= \pi_1$ and specificity $= 1 - \pi_2$.

Solution. Sensitivity is the probability that the diagnostic test is positive given that a subject has the disease, that is, $P(Y = 1 \mid X = 1)$. Said differently, sensitivity is the probability of success for the subjects in row 1 of the contingency table, and so it is given by $\pi_1$. Specificity is the probability that the test is negative given that the subject does not have the disease, that is, $P(Y = 2 \mid X = 2)$. In other words, specificity is the probability of failure for the subjects in row 2 of the contingency table. Hence, it is given by $1 - \pi_2$.

b. Let $\gamma$ denote the probability that a subject has the disease. Given that the diagnosis is positive, use Bayes' theorem to show that the probability a subject truly has the disease is
\[ \frac{\pi_1 \gamma}{\pi_1 \gamma + \pi_2 (1 - \gamma)}. \]

Solution.
Recall that Bayes's Theorem is
\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}. \]
The probability that the diagnosis is positive is
\[ P(Y = 1) = P(Y = 1 \cap X = 1) + P(Y = 1 \cap X = 2) = P(Y = 1 \mid X = 1) \cdot P(X = 1) + P(Y = 1 \mid X = 2) \cdot P(X = 2). \]
Thus, the probability that a subject truly has the disease given a positive diagnosis is
\[ P(X = 1 \mid Y = 1) = \frac{P(Y = 1 \mid X = 1) \cdot P(X = 1)}{P(Y = 1)} = \frac{P(Y = 1 \mid X = 1) \cdot P(X = 1)}{P(Y = 1 \mid X = 1) \cdot P(X = 1) + P(Y = 1 \mid X = 2) \cdot P(X = 2)} = \frac{\pi_1 \gamma}{\pi_1 \gamma + \pi_2 (1 - \gamma)}. \]

c. For mammograms for detecting breast cancer, suppose $\gamma = 0.01$, sensitivity $= 0.86$, and specificity $= 0.88$. Given a positive test result, find the probability that the woman truly has breast cancer.

Solution. From part (b), the probability that the woman truly has breast cancer is given by $\frac{\pi_1 \gamma}{\pi_1 \gamma + \pi_2 (1 - \gamma)}$. Here $\pi_1 = 0.86$ is the sensitivity, and since specificity $= 1 - \pi_2 = 0.88$, we get $\pi_2 = 1 - 0.88 = 0.12$. Thus,
\[ \frac{\pi_1 \gamma}{\pi_1 \gamma + \pi_2 (1 - \gamma)} = \frac{0.86 \times 0.01}{0.86 \times 0.01 + 0.12 (1 - 0.01)} = 0.0675. \]

d. To better understand the answer in (c), find the joint probabilities for the $2 \times 2$ cross classification of $X$ and $Y$. Discuss their relative sizes in the two cells that refer to a positive test result.

Solution. The $2 \times 2$ table with joint probabilities is given below.

                      Diagnosis
True Status     Y = 1      Y = 2      Total
X = 1           0.0086     0.0014     0.01
X = 2           0.1188     0.8712     0.99
Total           0.1274     0.8726     1

The values in the table are found as follows:
\[ P(Y = 1 \cap X = 1) = P(Y = 1 \mid X = 1) \cdot P(X = 1) = 0.86 \times 0.01 = 0.0086 \]
\[ P(Y = 2 \cap X = 1) = 0.01 - 0.0086 = 0.0014 \]
\[ P(X = 2) = 1 - 0.01 = 0.99 \]
\[ P(Y = 1 \cap X = 2) = P(Y = 1 \mid X = 2) \cdot P(X = 2) = 0.12 \times 0.99 = 0.1188 \]
\[ P(Y = 2 \cap X = 2) = 0.99 - 0.1188 = 0.8712. \]
The probability that a woman has breast cancer and tests positive (0.0086) is much lower than the probability that a woman does not have breast cancer but tests positive (0.1188). Most positive results therefore come from women without the disease, which is why the probability in (c) is so small.

Exercise 3 (2.3).
According to recent UN figures, the annual gun homicide rate is 62.4 per one million residents in the United States and 1.3 per one million residents in the UK.

a. Compare the proportion of residents killed annually by guns using the (i) difference of proportions, (ii) relative risk.

Solution. Let $p_1$ and $p_2$ denote the proportions of residents killed annually by guns in the United States and in the UK, respectively.
(i) Difference of proportions: $\hat{\Delta} = p_1 - p_2 = 62.4$ per one million $- 1.3$ per one million $= 61.1$ per one million, that is, 0.0000611.
(ii) Relative risk: the relative risk $\pi_1 / \pi_2$ is estimated by
\[ \frac{p_1}{p_2} = \frac{62.4 \text{ per one million}}{1.3 \text{ per one million}} = 48. \]
The difference of proportions is a very small number, whereas the relative risk shows that the rate in the United States is 48 times the rate in the UK.

b. When both proportions are very close to 0, as here, which measure is more useful for describing the strength of association? Why?

Solution. The relative risk is the more useful measure for describing the strength of association, because the difference of proportions is so small that it misleads one into thinking that the difference in the annual gun homicide rate between the two countries is negligible.

Exercise 4 (2.4). A newspaper article preceding the 1994 World Cup semifinal match between Italy and Bulgaria stated that "Italy is favored 10-11 to beat Bulgaria, which is rated at 10-3 to reach the final." Suppose this means that the odds that Italy wins are $\frac{11}{10}$ and the odds that Bulgaria wins are $\frac{3}{10}$. Find the probability that each team wins, and comment.

Solution. The probability of success is given by $\frac{\text{odds}}{\text{odds} + 1}$. The probability that Italy wins is
\[ \frac{11/10}{11/10 + 1} = 0.5238, \]
and the probability that Bulgaria wins is
\[ \frac{3/10}{3/10 + 1} = 0.2308. \]
Note that these two probabilities sum to 0.7546 rather than 1, so the quoted odds do not behave like probabilities of complementary outcomes.

Exercise 5 (2.5). Consider the following two studies reported in the New York Times:

a. A British study reported (December 3, 1998) that, of smokers who get lung cancer, "women are 1.7 times more vulnerable than men to get small-cell lung cancer." Is 1.7 an odds ratio, or a relative risk?

Solution. The number 1.7
is a relative risk, since the proportion of women who get small-cell lung cancer is being compared to the proportion of men who get small-cell lung cancer.

b. A National Cancer Institute study about tamoxifen and breast cancer reported (April 7, 1998) that the women taking the drug were 45% less likely to experience invasive breast cancer compared with the women taking placebo. Find the relative risk for (i) those taking the drug compared to those taking placebo, (ii) those taking placebo compared to those taking the drug.

Solution. The relative risk for those taking the drug compared to those taking placebo is $\frac{\pi_1}{\pi_2} = 1 - 0.45 = 0.55$. On the other hand, the relative risk for those taking placebo compared to those taking the drug is $\frac{\pi_2}{\pi_1} = \frac{1}{0.55} = 1.8182$.

Exercise 6 (2.6). In the United States, the estimated annual probability that a woman over the age of 35 dies of lung cancer equals 0.001304 for current smokers and 0.000121 for nonsmokers [M. Pagano and K. Gauvreau, Principles of Biostatistics, Belmont, CA: Duxbury Press (1993), p. 134].

a. Calculate and interpret the difference of proportions and the relative risk. Which is more informative for these data? Why?

Solution. The difference of proportions is $\hat{\Delta} = p_1 - p_2 = 0.001304 - 0.000121 = 0.001183$. The relative risk is $\frac{\pi_1}{\pi_2} = \frac{0.001304}{0.000121} = 10.7769$, so the estimated annual probability of dying of lung cancer is about 10.8 times as high for smokers as for nonsmokers. The relative risk is more informative here, since the difference of proportions is so small that it hides this large relative difference.

b. Calculate and interpret the odds ratio. Explain why the relative risk and odds ratio take similar values.

Solution. The odds ratio is given by
\[ \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)} = \frac{0.001304 / (1 - 0.001304)}{0.000121 / (1 - 0.000121)} = 10.7896. \]
Since the odds ratio is greater than 1, we conclude that women who smoke and are over the age of 35 are more likely to die of lung cancer than women who do not smoke and are over the age of 35. The relative risk and odds ratio take similar values because both $\pi_1$ and $\pi_2$ are close to zero.
Consequently, $1 - \pi_1$ is approximately the same as $1 - \pi_2$. These two factors then nearly cancel each other in the odds ratio formula, leaving the formula for the relative risk.

Exercise 7 (2.7). For adults who sailed on the Titanic on its fateful voyage, the odds ratio between gender (female, male) and survival (yes, no) was 11.4. (For data, see R. Dawson, J. Statist. Educ. 3, no. 3, 1995.)

a. What is wrong with the interpretation, "The probability of survival for females was 11.4 times that for males."? Give the correct interpretation.

Solution. The odds ratio is the ratio of the odds of an event occurring in one group to the odds of that event occurring in another group, so it compares odds, not probabilities. The correct interpretation is "The odds of survival for females were 11.4 times the odds of survival for males."

b. The odds of survival for females equaled 2.9. For each gender, find the proportion who survived.

Solution. It is given in the problem that $\theta = 11.4$ and $\text{odds}_F = 2.9$. The odds ratio is $\theta = \frac{\text{odds}_F}{\text{odds}_M}$, which gives $\text{odds}_M = \frac{2.9}{11.4} = 0.2544$. The probability of survival for males is given by
\[ \pi_M = \frac{\text{odds}_M}{\text{odds}_M + 1} = \frac{0.2544}{0.2544 + 1} = 0.2028, \]
and the probability of survival for females is given by
\[ \pi_F = \frac{\text{odds}_F}{\text{odds}_F + 1} = \frac{2.9}{2.9 + 1} = 0.7436. \]

c. Find the value of $R$ in the interpretation, "The probability of survival for females was $R$ times that for males."

Solution. For the given interpretation to be sensible, $R$ has to be the relative risk, which is given by
\[ R = \frac{\pi_F}{\pi_M} = \frac{0.7436}{0.2028} = 3.6667. \]

Exercise 8 (2.8). A research study estimated that under a certain condition, the probability a subject would be referred for heart catheterization was 0.906 for whites and 0.847 for blacks.

a. A press release about the study stated that the odds of referral for cardiac catheterization for blacks are 60% of the odds for whites. Explain how they obtained 60% (more accurately, 57%).

Solution. The odds ratio is
\[ \theta = \frac{\text{odds}_B}{\text{odds}_W} = \frac{\pi_B / (1 - \pi_B)}{\pi_W / (1 - \pi_W)} = \frac{0.847 / 0.153}{0.906 / 0.094} = 0.5744, \]
which is equivalent to 57%.

b.
An Associated Press story that described the study stated "Doctors were only 60% as likely to order cardiac catheterization for blacks as for whites." What is wrong with this interpretation? Give the correct percentage for this interpretation. (In stating results to the general public, it is better to use the relative risk than the odds ratio. It is simpler to understand and less likely to be misinterpreted. For details, see New Engl. J. Med., 341: 279-283, 1999.)

Solution. The given interpretation compares the probability of cardiac catheterization for blacks with the probability of cardiac catheterization for whites, but 60% describes the odds ratio instead. The interpretation can be corrected by using the relative risk, which is
\[ \frac{\pi_B}{\pi_W} = \frac{0.847}{0.906} = 0.9349 \approx 93\%. \]

Exercise 9 (2.9). An estimated odds ratio for adult females between the presence of squamous cell carcinoma (yes, no) and smoking behavior (smoker, nonsmoker) equals 11.7 when the smoker category consists of subjects whose smoking level $s$ is $0 < s < 20$ cigarettes per day; it is 26.1 for smokers with $s \ge 20$ cigarettes per day (R. Brownson et al., Epidemiology, 3: 61-64, 1992). Show that the estimated odds ratio between carcinoma and smoking levels ($s \ge 20$, $0 < s < 20$) equals $\frac{26.1}{11.7} = 2.2$.

Solution. Let $\text{odds}_c$ denote the estimated odds of carcinoma for nonsmokers, $\text{odds}_s$ the estimated odds for smokers with $0 < s < 20$, and $\text{odds}_{ss}$ the estimated odds for smokers with $s \ge 20$. In a $2 \times 2$ table, the estimated odds ratio between the presence of squamous cell carcinoma ($Y$) and a smoking level $s$ ($X$) of $0 < s < 20$ cigarettes per day is given by $\frac{\text{odds}_s}{\text{odds}_c} = 11.7$. Similarly, the estimated odds ratio between the presence of squamous cell carcinoma and a smoking level of $s \ge 20$ cigarettes per day is given by $\frac{\text{odds}_{ss}}{\text{odds}_c} = 26.1$. Then the estimated odds ratio between carcinoma and smoking levels ($s \ge 20$, $0 < s < 20$) is
\[ \frac{\text{odds}_{ss}}{\text{odds}_s} = \frac{26.1 \times \text{odds}_c}{11.7 \times \text{odds}_c} = 2.2. \]

Exercise 10 (2.10).
Data posted at the FBI website (www.fbi.gov) stated that of all blacks slain in 2005, 91% were slain by blacks, and of all whites slain in 2005, 83% were slain by whites. Let $Y$ denote race of victim and $X$ denote race of murderer.

a. What conditional distribution do these statistics refer to, $Y$ given $X$, or $X$ given $Y$?

Solution. These statistics condition on the race of the victim, so they refer to $X$ given $Y$.

b. Calculate and interpret the odds ratio between $X$ and $Y$.

Solution. The given information is filled in the following $2 \times 2$ contingency table of conditional proportions, where b stands for black and w stands for white.

                  X (murderer)
Y (victim)      b        w
b               0.91     0.09
w               0.17     0.83

The odds ratio between $X$ and $Y$ is then
\[ \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)} = \frac{0.91 / 0.09}{0.17 / 0.83} = 49.37. \]
The estimated odds that the murderer was black are 49.37 times as high when the victim was black as when the victim was white.

c. Given that a murderer was white, can you estimate the probability that the victim was white? What additional information would you need to do this? (Hint: How could you use Bayes's Theorem?)

Solution. By Bayes's Theorem,
\[ P(Y = w \mid X = w) = \frac{P(X = w \mid Y = w) \cdot P(Y = w)}{P(X = w)}, \]
where w stands for white. To estimate this probability we need $P(Y = w)$ and $P(X = w)$.

Exercise 11 (2.12). A statistical analysis that combines information from several studies is called a meta analysis. A meta analysis compared aspirin with placebo on incidence of heart attack and of stroke, separately for men and for women (J. Am. Med. Assoc., 295: 306-313, 2006). For the Women's Health Study, heart attacks were reported for 198 of 19,934 taking aspirin and for 193 of 19,942 taking placebo.

a. Construct a $2 \times 2$ table that cross classifies the treatment (aspirin, placebo) with whether a heart attack was reported (yes, no).

Solution. The given information is recorded in the $2 \times 2$ table below, where A stands for aspirin and P for placebo.

                 Heart Attack
Treatment     Y        N          Total
A             198      19,736     19,934
P             193      19,749     19,942

b. Estimate the odds ratio and interpret.

Solution. The odds ratio is
\[ \hat{\theta} = \frac{n_{11} / n_{12}}{n_{21} / n_{22}} = \frac{198 / 19{,}736}{193 / 19{,}749} = 1.0266. \]
Since the estimated odds ratio is slightly greater than 1, the estimated odds of a heart attack are about 3% higher for women taking aspirin than for women taking placebo.

c. Find a 95% confidence interval for the population odds ratio for women. Interpret. (As of 2006, results suggested that for women, aspirin was helpful for reducing risk of stroke but not necessarily risk of heart attack.)

Solution. The confidence interval for the log odds ratio is given by $\log \hat{\theta} \pm Z_{\alpha/2} \cdot \sigma_{\log \hat{\theta}}$, where $\log$ is the natural logarithm and
\[ \sigma_{\log \hat{\theta}} = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}. \]
The calculations needed to compute the 95% confidence interval are shown below.
\[ \log \hat{\theta} = \log 1.0266 = 0.0262 \]
\[ \sigma_{\log \hat{\theta}} = \sqrt{\frac{1}{198} + \frac{1}{19736} + \frac{1}{193} + \frac{1}{19749}} = 0.1017. \]
Thus, $\log \hat{\theta} \pm Z_{\alpha/2} \cdot \sigma_{\log \hat{\theta}}$ becomes $0.0262 \pm 1.96 \times 0.1017 = (-0.1731, 0.2255)$, and so the 95% confidence interval for the odds ratio is $(e^{-0.1731}, e^{0.2255}) = (0.841, 1.253)$. Since the interval contains $\theta = 1$, it is plausible that the true odds of heart attack are the same for both treatments.

Exercise 12 (2.13). Refer to Table 2.1 about belief in an afterlife.

            Belief in Afterlife
Gender      Y       N       Total
F           509     116     625
M           398     104     502
Total       907     220     1127

a. Construct a 90% confidence interval for the difference of proportions, and interpret.

Solution. The confidence interval for the difference of proportions is given by
\[ (p_1 - p_2) \pm Z_{\alpha/2} \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}}. \]
Here, $p_1 = \frac{509}{625} = 0.8144$ and $p_2 = \frac{398}{502} = 0.7928$. Thus, the 90% confidence interval is
\[ (0.8144 - 0.7928) \pm 1.645 \times \sqrt{\frac{0.8144 \times 0.1856}{625} + \frac{0.7928 \times 0.2072}{502}} = 0.0216 \pm 0.0392 = (-0.0176, 0.0608). \]
Since this interval contains 0, it is plausible that $\pi_1 = \pi_2$. The data do not provide evidence that the proportion who believe in an afterlife differs between females and males; if there is a difference, it is small.

b. Construct a 90% confidence interval for the odds ratio, and interpret.

Solution. The confidence interval for the log odds ratio is given by $\log \hat{\theta} \pm Z_{\alpha/2} \cdot \sigma_{\log \hat{\theta}}$, and exponentiating its endpoints gives the interval for the odds ratio. All the calculations are shown below.
\[ \log \hat{\theta} = \log \frac{509 / 116}{398 / 104} = \log 1.1466 = 0.1368 \quad \text{(natural logarithm)} \]
\[ \sigma_{\log \hat{\theta}} = \sqrt{\frac{1}{509} + \frac{1}{116} + \frac{1}{398} + \frac{1}{104}} = 0.15071 \]
\[ \log \hat{\theta} \pm Z_{\alpha/2} \cdot \sigma_{\log \hat{\theta}} = 0.1368 \pm 1.645 \times 0.15071 = (-0.1111, 0.3847) \]
The 90% confidence interval is $(e^{-0.1111}, e^{0.3847}) = (0.895, 1.469)$. Since this interval contains $\theta = 1$, it is plausible that the true odds of belief in an afterlife are the same for males and females.

c. Conduct a test of statistical independence. Report the p-value and interpret.

Solution. The null hypothesis is that the two response variables are independent, that is, $\pi_{ij} = \pi_{i+} \pi_{+j}$ for all $i$ and $j$. The alternative hypothesis is that the two response variables are dependent. The Pearson chi-squared statistic for testing $H_0$ is given by
\[ X^2 = \sum_{i,j} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}, \]
where the estimated expected frequency is $\hat{\mu}_{ij} = \frac{n_{i+} n_{+j}}{n}$. The estimated expected frequencies for each cell are calculated below.
\[ \hat{\mu}_{11} = \frac{625 \times 907}{1127} = 502.9947 \qquad \hat{\mu}_{12} = \frac{625 \times 220}{1127} = 122.0053 \]
\[ \hat{\mu}_{21} = \frac{502 \times 907}{1127} = 404.0053 \qquad \hat{\mu}_{22} = \frac{502 \times 220}{1127} = 97.9947 \]
The Pearson chi-squared statistic is
\[ X^2 = \frac{(509 - 502.9947)^2}{502.9947} + \frac{(116 - 122.0053)^2}{122.0053} + \frac{(398 - 404.0053)^2}{404.0053} + \frac{(104 - 97.9947)^2}{97.9947} = 0.8246. \]
The degrees of freedom are $(I - 1)(J - 1) = (2 - 1)(2 - 1) = 1$. The p-value is 0.3638. We fail to reject the null hypothesis: the data provide no evidence that belief in an afterlife and gender are associated.
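
The three calculations in Exercise 12 can be checked numerically. The following is a minimal Python sketch using only the standard library; the function names are illustrative and not part of the original solution. It assumes the cell counts are entered as (females yes, females no, males yes, males no), uses natural logarithms for the odds ratio interval, and relies on the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, which holds only for one degree of freedom.

```python
import math
from statistics import NormalDist


def z_value(level):
    """Two-sided standard normal critical value (about 1.645 for level=0.90)."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def diff_of_proportions_ci(n11, n12, n21, n22, level=0.90):
    """Wald interval for p1 - p2, where p_i is the proportion in column 1 of row i."""
    n1, n2 = n11 + n12, n21 + n22
    p1, p2 = n11 / n1, n21 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = z_value(level)
    return (p1 - p2) - z * se, (p1 - p2) + z * se


def odds_ratio_ci(n11, n12, n21, n22, level=0.90):
    """Sample odds ratio and its Wald interval, built on the natural-log scale."""
    theta = (n11 * n22) / (n12 * n21)
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    z = z_value(level)
    log_theta = math.log(theta)  # math.log is the natural logarithm
    return theta, math.exp(log_theta - z * se), math.exp(log_theta + z * se)


def pearson_chi2_2x2(n11, n12, n21, n22):
    """Pearson X^2 for a 2x2 table and its p-value on 1 degree of freedom."""
    observed = [[n11, n12], [n21, n22]]
    row_totals = [n11 + n12, n21 + n22]
    col_totals = [n11 + n21, n12 + n22]
    n = sum(row_totals)
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed[i][j] - expected) ** 2 / expected
    # For df = 1, the upper tail of chi-squared equals erfc(sqrt(x2 / 2)).
    p_value = math.erfc(math.sqrt(x2 / 2))
    return x2, p_value


# Table 2.1 counts: females (yes, no), males (yes, no).
table = (509, 116, 398, 104)
print("difference of proportions CI:", diff_of_proportions_ci(*table))
print("odds ratio and CI:", odds_ratio_ci(*table))
print("Pearson X^2 and p-value:", pearson_chi2_2x2(*table))
```

With the Table 2.1 counts, the Pearson statistic comes out near 0.82 with a p-value near 0.36. The same functions apply to the aspirin table of Exercise 11 by passing `level=0.95` and the counts (198, 19736, 193, 19749).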