Stat 565 Assignment 2 Solutions, Fall 2005

1) Exact randomization test for a 2x2 table of counts: The following data were taken from Table 3.11 on page 73 of the book Categorical Data Analysis, by Alan Agresti.

                            Surgery   Radiation Therapy
  Cancer controlled           21             15
  Cancer not controlled        2              3

The 41 larynx cancer patients were randomly assigned to the two treatments. Use Fisher's exact test to test the null hypothesis that the two treatments are equally effective in controlling cancer against the alternative that the treatments are not equally effective.

Ho: the two treatments are equally effective in controlling cancer
Ha: not Ho

Writing C(a, b) for the binomial coefficient "a choose b", the p-value for the two-sided alternative is computed as

  p-value = [C(36,21)C(5,2) + C(36,22)C(5,1) + C(36,23)C(5,0) + C(36,19)C(5,4) + C(36,18)C(5,5)] / C(41,23)
          = 1 - C(36,20)C(5,3) / C(41,23)
          = 0.6384.

There is not enough evidence to reject the null hypothesis that the two treatments are equally effective in controlling cancer.

Note that there are more patients in the surgery group than in the radiation group, so you must examine proportions to determine which tables are less consistent with the null hypothesis than the observed table. For the observed data the overall proportion of patients with controlled cancer is 36/41 = 0.878, and the proportion of surgery patients for which cancer was controlled is 21/23 = 0.913. The difference is 0.913 - 0.878 = 0.035. Clearly 22/23 and 23/23 are farther above 0.878. In the other direction, 19/23 = 0.826 and 18/23 = 0.783 are farther from 0.878, but 20/23 = 0.870 is closer to 0.878 than 21/23 = 0.913, so the table with 20 controlled surgery patients is excluded from the sum.

SAS CODE AND OUTPUT:

data set1;
  input row col count;
  cards;
1 1 21
1 2 2
2 1 15
2 2 3
run;

proc freq data=set1;
  tables row*col / exact chisq;
  weight count;
run;

                 The FREQ Procedure
          Statistics for Table of row by col

               Fisher's Exact Test
          ----------------------------------
          Cell (1,1) Frequency (F)        21
          Left-sided Pr <= F          0.8947
          Right-sided Pr >= F         0.3808

          Table Probability (P)       0.2755
          Two-sided Pr <= P           0.6384

                  Sample Size = 41

R CODE AND OUTPUT:

> fisher.test(matrix(c(21,2,15,3), nrow=2, byrow=T))

        Fisher's Exact Test for Count Data

data:  matrix(c(21, 2, 15, 3), nrow = 2, byrow = T)
p-value = 0.6384
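As a check, the two-sided p-value can also be computed directly from hypergeometric probabilities in R. This is a small sketch, not part of the original output: dhyper(x, m, n, k) gives the probability that x of the k = 23 surgery patients are among the m = 36 controlled cases, and the two-sided criterion sums the probabilities of all tables no more probable than the observed one, which is the rule fisher.test() uses.

# two-sided Fisher p-value by direct hypergeometric summation;
# x = number of controlled cases among the 23 surgery patients (18 <= x <= 23)
p.x   <- dhyper(18:23, m = 36, n = 5, k = 23)
p.obs <- dhyper(21, m = 36, n = 5, k = 23)   # observed table probability, 0.2755
sum(p.x[p.x <= p.obs])                       # 0.6384, matching fisher.test()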
2) Exact randomization test for a 2x3 table of counts: Perform an "exact" randomization test of the null hypothesis Ho: the IFN-B treatment produces the same results as the treatment given to the controls, against the one-sided alternative Ha: the IFN-B treatment gives better results. The data are

  Result of treatment   Treated with IFN-B   Controls
  Improved                      5                1
  Unchanged                     4                4
  Worsened                      1                5
  Totals                       10               10

Using the criterion that better results for the IFN-B treatment must yield at least as many improved cases as in the observed data, and that the total number of improved and unchanged cases for the IFN-B treatment must be at least as large as the total for the observed data, the following tables of counts are at least as inconsistent with Ho as the observed table of counts (entries are IFN-B count / control count):

  Result       Table 1 (observed)   Table 2    Table 3    Table 4
  Improved           5 / 1            6 / 0      6 / 0      5 / 1
  Unchanged          4 / 4            4 / 4      3 / 5      5 / 3
  Worsened           1 / 5            0 / 6      1 / 5      0 / 6
  Totals            10 / 10          10 / 10    10 / 10    10 / 10

The p-value is

  p-value = [C(6,5)C(8,4)C(6,1) + C(6,6)C(8,4)C(6,0) + C(6,6)C(8,3)C(6,1) + C(6,5)C(8,5)C(6,0)] / C(20,10) = 0.0176557.

This p-value provides enough evidence to reject the null hypothesis and conclude that the IFN-B treatment was more effective than the control treatment.

Another possibility is to use the more stringent criterion that the IFN-B treatment must yield at least as many improved cases and at least as many unchanged cases as in the observed data. The tables at least as inconsistent with Ho under this criterion are Tables 1, 2, and 4 above, and the p-value is

  p-value = [C(6,5)C(8,4)C(6,1) + C(6,6)C(8,4)C(6,0) + C(6,5)C(8,5)C(6,0)] / C(20,10) = 0.0158371.

This p-value also provides enough evidence to reject the null hypothesis and conclude that the IFN-B treatment was more effective than the control treatment.

Under the null hypothesis, the IFN-B and control treatments would be equivalent, and a patient would respond the same way regardless of the treatment that patient was given. This fixes the row totals (6 improved, 8 unchanged, 6 worsened). The column totals are fixed at 10 by the decision to put 10 patients into each treatment group. Consequently, if the null hypothesis is true, there are 43 possible tables of counts that could have been obtained from the random assignment of 10 patients to each treatment group. Summing the probabilities of those tables with values of the Pearson statistic at least as large as 5.33, the value for the observed table, yields a total probability of 0.0642. This is the p-value produced by the exact option in PROC FREQ in SAS and also by the fisher.test( ) function in R and S-Plus. Since we are testing against a one-sided alternative and there are the same number of patients in each group, a more appropriate p-value is obtained by dividing by 2, giving p-value = 0.0321. This p-value also leads to rejection of the null hypothesis at the 0.05 level of significance.
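As a quick check, the two one-sided p-values computed above can be evaluated directly in R from the binomial coefficients already listed (a sketch):

# p-values for the two sets of qualifying tables, by direct enumeration
denom <- choose(20, 10)
# first criterion: Tables 1-4
(choose(6,5)*choose(8,4)*choose(6,1) + choose(6,6)*choose(8,4)*choose(6,0) +
 choose(6,6)*choose(8,3)*choose(6,1) + choose(6,5)*choose(8,5)*choose(6,0)) / denom  # 0.0176557
# more stringent criterion: Tables 1, 2, and 4
(choose(6,5)*choose(8,4)*choose(6,1) + choose(6,6)*choose(8,4)*choose(6,0) +
 choose(6,5)*choose(8,5)*choose(6,0)) / denom                                        # 0.0158371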
The Pearson-based procedure is not quite satisfactory, however, because it includes a number of tables that are not more extreme than the observed table in the sense that they contain fewer improved and unchanged patients under the IFN-B treatment than the observed table. For example, the table with 5 improved and 0 unchanged patients in the IFN-B group is considered "more extreme" by the Pearson chi-square criterion. Some students failed to clearly state their test criteria.

3) Randomization, adherence and intention to treat.

A. Since the subjects will come from two different populations (California and Iowa), and these two populations may differ with respect to climate and environment and other factors such as racial composition and consumption of soy, separate randomized assignments of women to the three treatments should be done in California and Iowa, with equal numbers of women assigned to each of the three treatment groups in Iowa and equal numbers of women assigned to each of the three treatment groups in California. This balance avoids confounding treatment effects with location differences.

To minimize the number of medical personnel needed to perform the initial screening tests and to control initial screening and follow-up costs, a few women will be admitted to the study each week over a two-year period. Therefore it is not possible to assign all participants to the treatment groups at one time. Instead, we can randomize women to treatment groups as they enter the study. One way to do this is to use a uniform random number generator to generate a list of random numbers for each location. Each woman admitted to the study in Iowa would be assigned the next available random number on the Iowa list. Starting with the first number on the list, divide each list into 40 sets of three consecutive numbers. Within any set of three numbers, the woman with the smallest number is assigned the placebo, the woman with the middle number is assigned the lower level of isoflavones, and the woman with the largest number is assigned the higher level of isoflavones. (A small R sketch of this scheme is given at the end of this part.) Repeat this process in California. In this way, women can be randomized to a treatment group as soon as they are admitted to the study, and by using consecutive blocks of three at each location, the study will be nearly balanced with respect to treatment groups, location, and season (or time of entry into the study).

The study should be double blind, so neither the participating women nor the people administering the treatments and taking measurements should have any knowledge of the treatment assignments. Therefore the pills should be identical in shape and color and come in identical bottles (each bottle would carry only the name of the patient). One person at each location would label the bottles and keep the list of treatment assignments secret until the end of the study.

Strengths of this randomization procedure are that it maintains equal sample sizes for the three treatments at each location, it nearly maintains balance with respect to season or time of entry into the study, and women can be randomized to treatment groups as soon as they are admitted to the study. One weakness of this randomization scheme is that it does not block with respect to other possible confounding factors.
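Here is the promised sketch of the blocked assignment, written in R. The function name and seed are ours; the 40 blocks of three per site follow the description above.

# one site: 40 blocks of 3 consecutive uniform random numbers; within a block,
# the smallest number gets the placebo, the middle number the lower isoflavone
# dose, and the largest number the higher dose
set.seed(565)
assign.site <- function(nblocks = 40) {
  u <- runif(3 * nblocks)            # one number per woman, in order of entry
  block <- rep(1:nblocks, each = 3)
  trt <- c("placebo", "low", "high")
  unlist(tapply(u, block, function(x) trt[rank(x)]))
}
iowa <- assign.site()
california <- assign.site()
table(iowa)                          # 40 women per treatment at each site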
If there really are no seasonal differences, blocking with respect to season would result in a slight loss of power to detect treatment differences, but this potential loss would be quite small relative to the potential gain in power if there are substantial seasonal differences. Randomization within blocks of three consecutive women at each site is complicated, and personnel must be well trained to carry it out correctly.

B. If your protocol indicated intention to treat, you would not be allowed to exclude these 34 women from the analysis for lack of adherence.

Intention to treat preserves randomization: the validity of a randomized controlled trial depends greatly on the process of randomization. Randomization ensures that both measurable and unmeasurable factors balance out "on average". If a factor other than the treatment itself could possibly influence bone loss measurements, then randomization ensures that patients with different levels of that factor are equally likely to receive either of the two treatments or the placebo. This prevents many types of bias that can occur in a non-randomized trial. An analysis that excludes noncompliant patients is no longer randomized and might be seriously biased. Suppose the noncompliant women have something in common that affects the treatment outcome. For example, suppose they are all heavy users of alcohol, most were assigned to the high-level isoflavone treatment group, and heavy use of alcohol causes bone density to decline. If we exclude these patients from the analysis, we are eliminating a higher percentage of poorly performing patients from one treatment group, but not from the control group or the other treatment group. This could seriously bias comparisons of outcome means.

The intention to treat principle maintains unbiased estimates of the treatment differences for the patients used in the study by keeping all patients who were randomized to treatment in the study. Note that this does not exactly coincide with unbiased estimation of treatment differences for a theoretical population of fully compliant patients. Intention to treat analysis is more realistic. There are many factors that influence whether or not a patient complies with a treatment, and some of the factors that influence compliance might also influence the outcome measure. Noncompliant patients, for example, may tend to have worse outcomes than compliant patients, even in a placebo group. Perhaps patients who forget to follow a prescribed treatment also neglect other things important for their health. Thus an analysis that excludes noncompliant patients may produce a study sample that is healthier than the patients in the overall population, and this could bias estimates of treatment effects, although it may produce less biased estimates for the subpopulation that would be compliant. Intention to treat analysis is especially important for medications that are difficult to tolerate. If you exclude noncompliant patients, you are ignoring the influence of poor tolerability on the efficacy of a treatment.

Consequences of intention to treat:

1. Power of tests for treatment effects is generally reduced, because the existence of noncompliant subjects tends to increase variation of observed outcomes within treatment groups and also tends to diminish observed treatment differences. Standardized estimates of differences in treatment means tend to be pulled toward zero.
2. Conversely, in an equivalence trial (attempting to prove that two treatments do not differ by more than a certain amount), using an intention to treat analysis will tend to favor equality of treatments.

3. While estimates of treatment effects would be unbiased for this mixture of compliant and noncompliant women, estimates of treatment effects may be distorted relative to what they would be for compliant women.

C. Most biostatisticians would refer to the intention to treat principle to argue that the 22 ineligible women should be kept in the study. Arguments would follow those given in part B.

Potential advantages: By not deleting any subjects from the study, the largest possible sample sizes are maintained, and this could provide more power for tests of treatment effects. Also, one might be able to make inferences to a wider population, but this is limited because no real effort would have been made to sample subjects from ineligible parts of the wider population.

Potential disadvantages: The ineligible women have different characteristics and may respond to the treatments or placebo in a somewhat different manner than eligible women. Generally, enrollment criteria are imposed to reduce variability by providing more homogeneous subjects, although this restricts the population to which inferences can be made. Including ineligible women may increase variation in responses within treatment groups and decrease the power of tests of significance, in spite of the larger sample sizes. It may also distort estimates of treatment effects relative to what they would be for the population of "eligible" women.

D. One could argue that since these ineligible women had to be removed from the study for safety reasons, they should be treated as dropouts, and any data collected after they were removed (just to monitor safety) should not be included in the analysis. Many biostatisticians would appeal to the intention to treat principle to argue that all data collected on those women should be included in the analysis. I believe that is an incorrect use of the intention to treat principle, but this debate is certainly unresolved at the present time. Since these ineligible women have been forced to stop using any treatment, their follow-up data will generally not conform to the follow-up data for women remaining in the study, and this will tend to dilute estimates of treatment effects and increase variability within treatment groups. One strategy is to run several different analyses: an analysis of the data for all women randomized to treatment, another analysis that excludes the ineligible women who were removed from the study for safety reasons, and a third analysis that further excludes data on all noncompliant women. Hopefully, all three analyses will produce the same inferences. If not, careful explanations of possible reasons for the differences must be given.

4) Use simulation to explore randomization tests.
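One way such a simulation might be organized is sketched below. The unit responses here are hypothetical stand-ins (generated once from a normal distribution) for the experimental units supplied with the assignment, so the exact percentiles will differ from the table that follows.

# randomization distribution of the F-ratio for 4 treatments randomized within
# each of b blocks of 4 experimental units
sim.F95 <- function(b, nperm = 10000) {
  units <- matrix(rnorm(4 * b), nrow = b)        # fixed responses, 4 units per block
  blk <- factor(rep(1:b, each = 4))
  trt <- factor(rep(1:4, times = b))
  Fval <- replicate(nperm, {
    y <- as.vector(apply(units, 1, sample))      # re-randomize treatments within blocks
    anova(lm(y ~ blk + trt))["trt", "F value"]
  })
  c(simulated.95 = unname(quantile(Fval, 0.95)), # randomization 95th percentile
    F.table.95 = qf(0.95, 3, 3 * (b - 1)))       # central F 95th percentile
}
sim.F95(6)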
Results will vary, depending on the permutations selected by your simulations, but you should see the following pattern:

  Number      95th percentile of   95th percentile of a central       Ratio:          Proportion of simulated
  of blocks   simulated F-values   F distribution, (3, 3(b-1)) df     col 2 / col 3   F-values exceeding col 3
      2             48.375                    9.277                       5.215             0.247
      4              2.932                    3.863                       0.759             0.0204
      6              3.029                    3.287                       0.922             0.0414
      8              2.875                    3.072                       0.936             0.0405
     10              2.904                    2.960                       0.981             0.0464

As the number of blocks increases, the simulated 95th percentile of the randomization distribution of the ratio of mean squares (the F-ratio) approaches the 95th percentile of the central F distribution, and the actual Type I error level of the randomization test approaches the nominal 0.05 value. As the number of blocks increases, the number of possible permutations increases. For these types of experimental units, as few as 6 blocks are enough for the central F distribution to provide a reasonable approximation to the exact randomization distribution of the F-ratio, at least near the upper 0.05 percentile of the distribution. The quality of this approximation will vary across experiments, depending on how the experimental units differ from each other. The approximation will be better for larger sample sizes and when variation among units more nearly follows a normal distribution. A theoretical argument can be based on a central limit theorem. The approximation is not very good for two blocks because there are only eight experimental units and the distribution of their responses is quite discrete and nonnormal; the central limit theorem is a large-sample result.

5) Sample Size Determination

A. Using n1 = n2 = n and α = 0.05, find the value of n that provides power 0.8 for rejecting the null hypothesis Ho: μ2 − μ1 = 0 when the true difference in the means is given by the alternative δ = μ2 − μ1 = 5 and the standard deviation is approximately 2δ.

In this case α = 0.05 and β = 1 − power = 1 − 0.8 = 0.2, and an initial estimate of the sample size is

  ñ1 = ñ2 = 2(z[.05] + z[.20])² s² / δ² = 2(1.645 + 0.84)² (2δ)² / δ² ≈ 50.

Then a more accurate value for the sample size, based on the central t distribution with 50 + 50 − 2 = 98 degrees of freedom, is

  n1 = n2 = 2(t[.05, 98] + t[.20, 98])² s² / δ² = 2(1.661 + 0.8453)² (2δ)² / δ² = 51.

The sample size calculation is rounded up to the next largest integer. It makes no sense to request a sample size of 50.6, for example; you must use whole subjects.

B. Test the null hypothesis Ho: μ1 = μ2 against the two-sided alternative Ha: μ1 ≠ μ2. Using n1 = n2 = n and α = 0.05, find the value of n that provides power 0.8 for rejecting the null hypothesis when the alternative is δ = 5 and the standard deviation is approximately 2δ.

In this case α = 0.05 and β = 1 − power = 1 − 0.8 = 0.2, and an initial estimate of the sample size is

  ñ1 = ñ2 = 2(z[.025] + z[.20])² s² / δ² = 2(1.96 + 0.84)² (2δ)² / δ² ≈ 63.

Then a more accurate value, based on the central t distribution with 63 + 63 − 2 = 124 degrees of freedom, is

  n1 = n2 = 2(t[.025, 124] + t[.20, 124])² (2δ)² / δ² = 64.

Again the sample size is rounded up to the next largest integer. (A small R sketch of this two-step calculation follows.)
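The sketch below scripts the two-step calculation of parts A and B (normal approximation first, then iteration with t quantiles until n stabilizes); the function name is ours, and s = 2δ is assumed as in the problem.

# sample size for a two-sample t test: start from the normal approximation,
# then iterate the same formula with t quantiles until n stops changing
nsize <- function(alpha, beta = 0.2, sd.over.delta = 2) {
  n <- 2 * (qnorm(1 - alpha) + qnorm(1 - beta))^2 * sd.over.delta^2
  repeat {
    df <- 2 * ceiling(n) - 2
    n.new <- 2 * (qt(1 - alpha, df) + qt(1 - beta, df))^2 * sd.over.delta^2
    if (ceiling(n.new) == ceiling(n)) break
    n <- n.new
  }
  ceiling(n.new)
}
nsize(0.05)    # one-sided test of part A: 51
nsize(0.025)   # two-sided test of part B: 64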
C. For an equivalence test of Ho: |μ1 − μ2| ≥ θ versus Ha: |μ1 − μ2| < θ, the rejection region lies inside the interval of possible alternatives (−θ, θ). We reject the null hypothesis if x̄1 − x̄2 is enough standard errors above −θ, and enough standard errors below θ, to convince us that this outcome would be sufficiently unlikely to occur just because of sampling variability when the null hypothesis is actually true. Consequently, reject the null hypothesis of non-equivalent treatments if

  x̄1 − x̄2 > −θ + t[α, n1+n2−2] SE   and   x̄1 − x̄2 < θ − t[α, n1+n2−2] SE,   where SE = sqrt( s²(1/n1 + 1/n2) ).

Let n1 = λ n2 and assume that the actual difference in response means is δ = μ1 − μ2, where 0 ≤ δ < θ. Then the power of the test is

  P( −θ + t[α, n1+n2−2] SE < x̄1 − x̄2 < θ − t[α, n1+n2−2] SE | μ1 − μ2 = δ ).

D. Calculate the sample sizes when α = 0.05, β = 1 − power = 1 − 0.8 = 0.2, δ = θ/2, and s = θ.

Subtracting δ and dividing by SE inside the probability statement,

  1 − β = P( t[α, n1+n2−2] − (θ + δ)/SE < t(n1+n2−2) < −t[α, n1+n2−2] + (θ − δ)/SE ),

where t(n1+n2−2) denotes a random variable with a central t distribution with n1 + n2 − 2 degrees of freedom and t[α, n1+n2−2] denotes the upper α percentile of that central t distribution. We get an approximate solution by assuming that the sample sizes are large enough for a standard normal random variable Z to be a good approximation to a random variable with a central t distribution. Then

  1 − β ≈ P( Z[α] − (θ + δ)/SE < Z < −Z[α] + (θ − δ)/SE )
        = P( Z < −Z[α] + (θ − δ)/SE ) − P( Z < Z[α] − (θ + δ)/SE )
        ≈ P( Z < −Z[α] + (θ − δ)/SE ),

since the second probability will be close to zero when the sample sizes are large enough. Consequently, set

  Z[β] = −Z[α] + (θ − δ) / sqrt( s²(1/n1 + 1/n2) ).

Solving for n2, with n1 = λ n2, we obtain

  n2 ≈ (Z[α] + Z[β])² s² (1 + 1/λ) / (θ − δ)².

Substituting λ = 1, δ = μ1 − μ2 = θ/2, α = 0.05, β = 0.2, and s = θ, we obtain

  n1 = n2 ≈ 8(Z[.05] + Z[.20])² = 8(1.64485 + 0.84162)² ≈ 50.

Then make an adjustment for the t distribution:

  n1 = n2 ≈ 8(t[.05, 98] + t[.20, 98])² = 8(1.66055 + 0.84540)² = 51.

Alternatively, we numerically find the smallest n1 = n2 = n that satisfies

  1 − β = P( t[α, 2(n−1)] − (θ + δ)/sqrt(2s²/n) < t(2(n−1)) < −t[α, 2(n−1)] + (θ − δ)/sqrt(2s²/n) ).

With δ = θ/2 and s = θ this becomes

  0.8 = P( t[.05, 2(n−1)] − 3√n/(2√2) < t(2(n−1)) < −t[.05, 2(n−1)] + √n/(2√2) ).

The following R function yields n = 51.

fsize <- function(alpha) {
  power <- 0.8
  n <- 1
  repeat {
    n <- n + 1
    df <- 2 * n - 2
    upper <- -qt(1 - alpha, df) + sqrt(n) / (2 * sqrt(2))
    lower <- qt(1 - alpha, df) - 3 * sqrt(n) / (2 * sqrt(2))
    pow <- pt(upper, df) - pt(lower, df)   # power at this n
    if (pow > power) break
  }
  return(n)
}

E. Find the sample size needed so that the half-length of a 95% confidence interval for the difference in the means will be about 0.5 when the standard deviation is 5.0.
A large-sample approximation to the 95% confidence interval for the difference in means is

  (x̄1 − x̄2) ± z[α/2] sqrt( s²(1/n1 + 1/n2) ).

Given that the half-length of the interval should be about 0.5 when the standard deviation is 5.0, and assuming n1 = n2 = n, we can solve for n:

  n ≥ ( s z[α/2] √2 / 0.5 )² = ( (5)(1.96)√2 / 0.5 )² = 768.32.

Hence n should be rounded up to 769.

6) Inferences for relative risk

A. Use the information from the pilot study to estimate π2/π1, the relative risk of relapse for the standard therapy relative to treatment with the new compound. Also construct an approximate 95% confidence interval for the relative risk of relapse.

  Treatment           Relapse   No relapse
  New compound          24          56
  Standard therapy      30          50

Here p2 = P(relapse | standard therapy) = 30/80 and p1 = P(relapse | new compound) = 24/80. Hence the estimated relative risk is

  RR^ = p2/p1 = 30/24 = 1.25.

From the previous assignment, we know that the variance of the large-sample normal approximation to the distribution of log(p2/p1) is

  var(log(p2/p1)) = (1 − π2)/(n π2) + (1 − π1)/(m π1),

where n and m are the numbers of patients on the standard therapy and the new compound. We can estimate this by

  (1 − p2)/(n p2) + (1 − p1)/(m p1) = (1 − 30/80)/30 + (1 − 24/80)/24 = 0.05,

so the standard error is approximately SE(log(p2/p1)) = √0.05. Then an approximate 95% CI for log(RR) is

  log(p2/p1) ± 1.96 SE(log(p2/p1)) = log(1.25) ± 1.96 √0.05 = (−0.2151, 0.6614).

Finally, an approximate 95% CI for RR is (e^(−0.2151), e^(0.6614)) = (0.806, 1.937).

B. Consider a new study in which the researchers want to test Ho: π1 = π2 against the alternative Ha: π1 < π2. Determine the sample size needed in order to achieve power of at least 0.80 when the relative risk π2/π1 is 1.20. If you can, present a formula in addition to a numerical value for the sample size. Assume n1 = n2 = n.

The sample size formula for a test involving the difference between two proportions is

  n = ( z[α] sqrt(2 π̄(1 − π̄)) + z[1−β] sqrt(π1(1 − π1) + π2(1 − π2)) )² / (π1 − π2)²,

where π2 = P(relapse | standard therapy), π1 = P(relapse | new compound), and π̄ = (π1 + π2)/2. Using the information given in the problem, specify π2 = 30/80 = 0.375. In a real situation you may have a good deal of information on the standard compound from previous studies and may be able to specify an accurate estimate of π2. Since RR = π2/π1 = 1.2, the value for π1 must be π1 = π2/1.2 = 0.375/1.2 = 0.3125, and it follows that π̄ = (0.375 + 0.3125)/2 = 0.34375. Using α = 0.05 and β = 1 − power = 1 − 0.8 = 0.2, we obtain n1 = n2 = n = 714 observations. If you initially select π1 = 24/80 = 0.30 from the pilot study, then RR = π2/π1 = 1.2 implies that π2 = 1.2 π1 = 0.36, and these values require sample sizes of n1 = n2 = n = 759 observations.

You could also do this test in terms of the relative risk: Ho: log(π2/π1) = 0 versus Ha: log(π2/π1) > 0. The large-sample normal approximation to the distribution of the estimated log relative risk is

  log(p2/p1) ~ N( log(π2/π1), (1 − π2)/(n π2) + (1 − π1)/(n π1) ),

so the null hypothesis is rejected at the α = 0.05 level if

  log(p2/p1) > 0 + Z[α] sqrt( (1 − π2)/(n π2) + (1 − π1)/(n π1) ).

The power of the test is

  power = P( [log(p2/p1) − log(1.2)] / SE > Z[α] − log(1.2)/SE ),   where SE = sqrt( (1 − π2)/(n π2) + (1 − π1)/(n π1) ).

Solving for n, we obtain

  n = (Z[α] + Z[β])² [ (1 − π2)/π2 + (1 − π1)/π1 ] / (log(1.2))²
    = (1.645 + 0.84)² [ (1 − 0.375)/0.375 + (1 − 0.3)/0.3 ] / (log(1.2))²
    = 744.

The values used for π1 and π2 in evaluating (1 − π2)/π2 + (1 − π1)/π1 make a substantial difference. Using π1 = 0.5 and π2 = 0.5 instead of π1 = 0.30 and π2 = 0.375 in the previous formula, for example, changes the estimated sample size to 372. The variance of log(p2/p1) becomes smaller as π1 and π2 become larger. It would be wise to make these calculations for several sets of values for π1 and π2.
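The two-proportion formula at the start of part B is easy to script; a sketch (the helper name n.two.prop is ours):

# sample size from the two-proportion formula above
n.two.prop <- function(p1, p2, alpha = 0.05, power = 0.8) {
  pbar <- (p1 + p2) / 2
  num <- qnorm(1 - alpha) * sqrt(2 * pbar * (1 - pbar)) +
         qnorm(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
  ceiling((num / (p1 - p2))^2)
}
n.two.prop(0.3125, 0.375)   # pi2 = 0.375, RR = 1.2: n = 714
n.two.prop(0.30, 0.36)      # pi1 taken from the pilot study: n = 759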
C. Suppose that the new compound and the standard therapy are considered to be equivalent if the absolute value of the logarithm of the relative risk does not exceed 0.1. Determine the sample size needed in order to achieve power of at least 0.8 for establishing equivalence when log(π2/π1) = 0.05. Assume the relapse rate for the standard therapy is equal to the observed rate in the pilot study (30/80 = 0.375), and assume n1 = n2 = n.

Assuming n1 = n2 = n, the large-sample normal approximation to the natural logarithm of the estimated relative risk is

  log(p2/p1) ~ N( log(π2/π1), (1 − π2)/(n π2) + (1 − π1)/(n π1) ).

Using a derivation similar to that used in part C of problem 5, the null hypothesis of non-equivalence is rejected if

  log(p2/p1) > −θ + Z[α] sqrt( (1 − π2)/(n π2) + (1 − π1)/(n π1) )   and
  log(p2/p1) < θ − Z[α] sqrt( (1 − π2)/(n π2) + (1 − π1)/(n π1) ),

where θ = 0.1 is the boundary value of log(π2/π1) that determines equivalence in this case. When the alternative δ = log(π2/π1) = 0.05 is true, the power of the test is

  P( (−θ − δ + Z[α] σθ/√n) / (σδ/√n) < (log(p2/p1) − δ) / (σδ/√n) < (θ − δ − Z[α] σθ/√n) / (σδ/√n) ),

where

  σθ/√n = sqrt( [ (1 − π2)/π2 + (1 − π2 e^(−θ))/(π2 e^(−θ)) ] / n )

is the large-sample standard deviation of log(p2/p1) when log(π2/π1) = θ (so that π1 = π2 e^(−θ)), and

  σδ/√n = sqrt( [ (1 − π2)/π2 + (1 − π2 e^(−δ))/(π2 e^(−δ)) ] / n )

is the large-sample standard deviation of log(p2/p1) when log(π2/π1) = δ.

A sample size is determined by stepping through possible values of n until the smallest n is found for which this probability reaches power = 1 − β = 0.8. This is done by the following R code, which yields n = 8816.

alpha <- 0.05
power <- 0.8
p2 <- 0.375
theta <- 0.1
delta <- 0.05
p1 <- p2 * exp(-theta)                          # pi1 at the equivalence boundary
stheta <- sqrt((1 - p1) / p1 + (1 - p2) / p2)
p1a <- p2 * exp(-delta)                         # pi1 under the alternative
sdelta <- sqrt((1 - p1a) / p1a + (1 - p2) / p2)
n <- 1
repeat {
  n <- n + 1
  upper <- (theta - delta) / (sdelta / sqrt(n)) - qnorm(1 - alpha) * stheta / sdelta
  lower <- (-theta - delta) / (sdelta / sqrt(n)) + qnorm(1 - alpha) * stheta / sdelta
  pZ <- pnorm(upper) - pnorm(lower)
  if (pZ > power) break
}
n

An approximate sample size formula is

  n = [ (Z[α] σθ + Z[β] σδ) / (θ − δ) ]².

For θ = 0.10, δ = 0.05, π2 = 30/80 = 0.375, power = 1 − β = 0.8, and α = 0.05, this formula also yields n = 8816 subjects for both the standard and new treatment groups. R code is given below.

alpha <- 0.05
power <- 0.8
p2 <- 0.375
theta <- 0.1
delta <- 0.05
p1 <- p2 * exp(-theta)
stheta <- sqrt((1 - p1) / p1 + (1 - p2) / p2)
p1a <- p2 * exp(-delta)
sdelta <- sqrt((1 - p1a) / p1a + (1 - p2) / p2)
n <- ((qnorm(power) * sdelta + qnorm(1 - alpha) * stheta) / (theta - abs(delta)))^2
n <- ceiling(n)
n
D. Determine the sample sizes needed to construct a 95% confidence interval for the relative risk of relapse for the new therapy versus the standard treatment such that the length of the confidence interval does not exceed 5% of the actual value of the relative risk. Assume n1 = n2 = n.

From part A, an approximate 95% CI for log(RR) is

  log(p2/p1) ± 1.96 sqrt( (1 − p2)/(n2 p2) + (1 − p1)/(n1 p1) ),

so an approximate 95% CI for RR is

  ( (p2/p1) e^(−1.96 SE), (p2/p1) e^(1.96 SE) ),   where SE = sqrt( (1 − p2)/(n2 p2) + (1 − p1)/(n1 p1) ).

The length of this interval should not exceed 5% of the actual value of the relative risk. Consequently, we need

  (p2/p1) [ e^(1.96 SE) − e^(−1.96 SE) ] ≤ 0.05 (p2/p1),   i.e.,   e^(1.96 SE) − e^(−1.96 SE) ≤ 0.05.

Using n1 = n2 = n and the pilot study values p1 = 24/80 and p2 = 30/80, we have (1 − p2)/p2 + (1 − p1)/p1 = 4, so SE = 2/√n and the requirement becomes

  e^(3.92/√n) − e^(−3.92/√n) ≤ 0.05.

We can solve for n by stepping through a sequence of possibilities. The following R function, evaluated at alpha = 0.05, yields n = 24591.

Nsize <- function(alpha) {
  n <- 0
  qn <- qnorm(1 - alpha / 2)
  repeat {
    n <- n + 1
    diff <- exp(qn * 2 * sqrt(1 / n)) - exp(-qn * 2 * sqrt(1 / n))
    if (diff < 0.05) break
  }
  return(n)
}