Design and Analysis Considerations in Clinical Trials With a Sensitive Subpopulation

Yan D. ZHAO, Alex DMITRIENKO, and Roy TAMURA

This article deals with clinical trials with a sensitive subpopulation of patients, that is, a subgroup that is more likely to benefit from the treatment than the overall population. Given a sensitive subgroup defined by a prespecified classifier, for example, a clinical marker or pharmacogenomic marker, the trial’s outcome is declared positive if the treatment effect is established in the overall population or in the subgroup. We provide a summary of key considerations in clinical trials with a sensitive subgroup, including multiplicity and enrichment adjustments as well as optimality considerations in the analysis strategy. The methodology proposed in this article is illustrated using a neuroscience clinical trial and its operating characteristics are assessed via a simulation study.

Key Words: Clinical trials; Enriched clinical trials; Sensitive population; Type I error rate control; Weighted power.

1. Introduction

The development of genomic technologies, such as microarrays and single nucleotide polymorphism genotyping, has made it possible to identify a subpopulation of potentially sensitive patients who are likely to benefit from a targeted agent (Shipp et al. 2002; Rosenwald et al. 2002). For example, Herceptin (Baselga 2001), as part of a treatment regimen, is indicated for the adjuvant treatment of HER2-overexpressing breast cancer. Since many targeted agents may affect only sensitive patients, methods have been proposed in the literature that enable the trial’s sponsor to pursue two objectives in the trial:

1. Assessment of the overall effect of the experimental treatment compared to the control using all patients.
2. Assessment of the treatment effect in the selected subpopulation of sensitive patients.
The trial’s outcome is declared positive if the treatment effect is established in the overall population or in the sensitive population. The sensitive population can be defined based on a classifier available at the beginning of the trial (fixed classifier) or a classifier developed during the trial based on the data available at an interim look (adaptively defined classifier). The signature design (Freidlin and Simon 2005) serves as an example of an adaptive design approach. A signature design trial is conducted in two stages. A classifier that identifies sensitive patients is developed using Stage 1 data only. The classifier is not used to restrict the entry of patients during Stage 2, but it is prospectively applied to Stage 2 patients to select a subgroup of sensitive patients. At the final analysis, the treatment effect is tested in the overall population as well as in the selected subgroup of sensitive patients accrued during Stage 2. Wang et al. (2007) proposed another adaptive design that enables the trial’s sponsor to restrict the enrollment of nonsensitive patients in Stage 2.

© American Statistical Association, Statistics in Biopharmaceutical Research 2010, Vol. 2, No. 1, DOI: 10.1198/sbr.2010.08039

In this article we consider confirmatory clinical trials where the sensitive population is identified based on a fixed classifier, for example, a classifier derived from the data collected earlier in the development program. This setting arises commonly in tailored-therapy programs. We focus on the following key considerations in the design and analysis of clinical trials in this class:

• Multiplicity considerations. We propose a new multiplicity adjustment method that controls the overall Type I error rate in trials with a sensitive population.
This method is based on a family of flexible multiple testing procedures and enables the trial’s sponsor to tailor the multiplicity adjustment to the trial’s objectives.
• Enrichment considerations. We consider a patient selection procedure that enables the sponsor to have a greater representation of sensitive patients compared to the general population to better characterize the treatment effect in the sensitive population. The overall test is stratified based on whether a patient is in the sensitive or nonsensitive subgroup and, to make sure that the conclusions based on the enriched trial can be extended to the general population, the overall stratified test is adjusted using the method due to Horvitz and Thompson (1952).
• Optimality considerations. We define a method for optimally balancing the power of the overall and sensitive population tests in the trial. An optimal balance is achieved by maximizing a criterion that reflects the trial’s objectives under the restriction that the overall Type I error rate is protected.

The article is organized as follows. Section 2 presents the multiplicity adjustment method. Section 3 discusses enrichment considerations and introduces an enrichment-adjusted stratified test for the overall population analysis. Optimality considerations are discussed in Section 4. The methodology is illustrated using a clinical trial example and its operating characteristics are examined in Section 5. Some discussion can be found in Section 6.

2. Multiplicity Considerations

Consider a confirmatory parallel-group clinical trial designed to compare an experimental treatment to a control treatment with n0/2 patients enrolled in each of the two arms. A classifier, available at the beginning of the trial, is used to define a subgroup of sensitive patients. Let n+ and n− be the size of the sensitive and nonsensitive subgroups, respectively, with n0 = n+ + n−.
The treatment effect test in this trial has two components, the overall test and the sensitive subgroup test. Let H0 and H+ denote the null hypotheses of no treatment effect in the overall and sensitive populations, respectively. To be able to make claims about the significance of the treatment effect in either population, the trial’s sponsor needs to control the familywise error rate (FWER) associated with H0 and H+ in the strong sense (Hochberg and Tamhane 1987; Dmitrienko et al. 2009). The two null hypotheses are tested using a multiple testing procedure which will be called the feedback procedure. This procedure serves as an example of a parametric multiple testing procedure, that is, its decision rule takes into account the joint distribution of the test statistics associated with the hypotheses H0 and H+ . An important feature of the feedback procedure is the relationship between the overall and sensitive subgroup tests. The overall test provides feedback for the sensitive subgroup test and the critical value for the subgroup test is selected as a function of the overall p-value. Further, the term feedback emphasizes the fact that the feedback procedure is closely related to and, in fact, serves as an extension of the fallback procedures (Wiens 2003; Wiens and Dmitrienko 2005; Huque and Alosh 2008). The test statistics for the overall and sensitive subgroup tests are denoted by Z0 and Z+ , respectively, and the associated one-sided p-values are denoted by p0 and p+ . Further, the one-sided FWER selected for this trial is denoted by α , for example, one-sided α = 0.025. The feedback procedure includes a parameter that determines the probability of rejecting each of the two hypotheses. This parameter is denoted by α0 (0 ≤ α0 ≤ α ) and the choice of this parameter will be discussed in Section 4. The rejection rules for H0 and H+ are given by • Reject H0 if (1) p0 ≤ α0 , or (2) α0 < p0 ≤ α and p+ ≤ α f (p0 ). 
• Reject H+ if (1) p0 ≤ α0 and p+ ≤ α, or (2) α0 < p0 and p+ ≤ α f(p0).

The function f(x) determines the α level for testing the null hypothesis H+ and will be termed the α-spending function. The α-spending function has the following properties:

• Property 1: f(x) = 1 if 0 ≤ x < α0 and 0 ≤ f(x) ≤ 1 if α0 ≤ x ≤ 1.
• Property 2: P(α0 < p0, p+ ≤ α f(p0)) = α − α0 under the global null hypothesis H0 ∩ H+.

Several parametric multiple testing procedures proposed in the literature are special cases of the general feedback procedure defined above. These include the parametric fallback procedure (Huque and Alosh 2008) and its extension developed in Alosh and Huque (2009), the 4A procedure (Li and Mehrotra 2008), and the procedure introduced in Song and Chi (2007). The α-spending functions for these special cases are defined below.

• Constant function: f(x) = c, α0 ≤ x ≤ 1, where c = c(α0) is a free parameter chosen to satisfy Property 2. The parametric fallback and Song-Chi procedures are special cases of the feedback procedure with a constant α-spending function.
• 4A function: f(x) = min(c/x², α0), α0 ≤ x ≤ 1. Here c = c(α0) is selected to satisfy Property 2. The feedback procedure with this α-spending function simplifies to the 4A procedure.

Figure 1. Rejection regions of the feedback procedure with a constant α-spending function (one-sided α = 0.025, α0 = 0.0135). Panels: Reject H0 and Reject H+, each drawn in the (p0, p+) plane.

As an illustration, consider the feedback procedure with a constant α-spending function in a one-sided testing problem with α = 0.025 and α0 = 0.0135. Assume that Z0 and Z+ follow a standard bivariate normal distribution with ρ = Corr(Z0, Z+) = 0.5. The decision rules for the feedback procedure are given by:

• Reject H0 if (1) p0 ≤ 0.0135, or (2) 0.0135 < p0 ≤ 0.025 and p+ ≤ 0.0135.
• Reject H+ if (1) p0 ≤ 0.0135 and p+ ≤ 0.025, or (2) 0.0135 < p0 and p+ ≤ 0.0135.

The rejection regions for H0 and H+ are displayed in Figure 1.

It is instructive to compare the feedback procedure defined above to a more basic parametric multiple testing procedure based on the distribution of the maximum test statistic or, equivalently, the minimum p-value. This procedure rejects H0 or H+ if p0 ≤ α∗ or p+ ≤ α∗, respectively, where α∗ is computed from P(min(p0, p+) ≤ α∗) = α under the global null hypothesis. It is easy to show that in this particular case α∗ = 0.0135 and thus this procedure is uniformly less powerful than the feedback procedure in the sense that the latter always rejects as many as and potentially more hypotheses than the former.

Further, Alosh and Huque (2009) introduced the consistency restriction, according to which a significant treatment effect in the sensitive population is accepted only if it is consistent with the treatment effect in the overall population, for example, if p0 is not too large. Under the consistency restriction, the definition of an α-spending function needs to be modified as follows:

f(x) = 1 if 0 ≤ x < α0, 0 ≤ f(x) ≤ 1 if α0 ≤ x ≤ γ, f(x) = 0 if γ < x ≤ 1,

where α < γ ≤ 1 is the consistency parameter. Note that selecting a smaller value of γ restricts opportunities to detect a significant treatment effect in the sensitive population. For example, with γ = 0.5, the treatment effect in the sensitive population is not declared statistically significant even if p+ approaches 0 unless the observed treatment difference in the overall population is positive, that is, unless p0 ≤ 0.5. On the other hand, when γ is set to 1, the feedback procedure with a consistency restriction simplifies to the regular feedback procedure.

To show that the feedback procedure controls the FWER in the strong sense, one can use the closure principle (Marcus, Peritz, and Gabriel 1976).
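Before turning to the closure argument, the rejection rules stated above can be sketched in code. This is a minimal illustration rather than the authors' implementation: the function names `constant_spending` and `feedback_decisions` are hypothetical, and the value c = 0.54 is simply the ratio 0.0135/0.025 implied by the illustration with α = 0.025, α0 = 0.0135 and ρ = 0.5. The optional `gamma` argument applies the consistency restriction without recalibrating c, which is an assumption made here for brevity.

```python
def constant_spending(c, alpha0, gamma=1.0):
    """Constant alpha-spending function with an optional consistency restriction.

    f(x) = 1 for x < alpha0, c for alpha0 <= x <= gamma, and 0 for x > gamma.
    """
    def f(x):
        if x < alpha0:
            return 1.0
        return c if x <= gamma else 0.0
    return f


def feedback_decisions(p0, p_plus, alpha, alpha0, f):
    """Return (reject_H0, reject_Hplus) under the feedback procedure."""
    level_plus = alpha * f(p0)  # level at which H+ is tested when p0 > alpha0
    reject_h0 = p0 <= alpha0 or (alpha0 < p0 <= alpha and p_plus <= level_plus)
    reject_h_plus = (p0 <= alpha0 and p_plus <= alpha) or (
        p0 > alpha0 and p_plus <= level_plus
    )
    return reject_h0, reject_h_plus


if __name__ == "__main__":
    # Illustration from the text: alpha = 0.025, alpha0 = 0.0135, alpha * c = 0.0135.
    f = constant_spending(c=0.54, alpha0=0.0135)
    for p0, p_plus in [(0.010, 0.020), (0.020, 0.010), (0.020, 0.020), (0.400, 0.001)]:
        print(p0, p_plus, feedback_decisions(p0, p_plus, 0.025, 0.0135, f))
```

With `gamma=0.3`, for instance, the pair (p0, p+) = (0.400, 0.001) no longer leads to rejection of H+, which mirrors the effect of the consistency restriction described above.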
First, consider the closed testing procedure displayed in Table 1. This procedure is based on three local tests for the intersection hypotheses H1 = H0 ∩ H+, H2 = H0, and H3 = H+.

Table 1. Closed testing representation of the feedback procedure. The closed testing procedure rejects a null hypothesis if all intersection hypotheses including this null hypothesis are rejected by their local tests.

Intersection hypothesis    Rejection rule (local test)
H1 = H0 ∩ H+               p0 ≤ α0 or p+ ≤ α f(p0)
H2 = H0                    p0 ≤ α
H3 = H+                    p+ ≤ α

The closed testing procedure rejects a null hypothesis if all intersection hypotheses including this particular null hypothesis are rejected by their local tests. For example, the procedure rejects H0 if both H1 and H2 are rejected by their local tests. Each local test in Table 1 is, in fact, an α-level test. In particular, under the global null hypothesis H0 ∩ H+,

P(p0 ≤ α0 or p+ ≤ α f(p0)) = P(p0 ≤ α0) + P(α0 < p0, p+ ≤ α f(p0)) = α0 + (α − α0) = α.

By the closure principle, the closed testing procedure controls the FWER in the strong sense. Further, it is easy to show that the feedback procedure is equivalent to the procedure in Table 1 in the sense that the former rejects a null hypothesis if and only if it is rejected by the latter. This immediately implies that the feedback procedure strongly controls the FWER at α.

The closed representation of the feedback procedure highlights one of its key properties. As was shown above, each of the three intersection hypotheses in the closed family is tested at a full α level and thus the feedback procedure is α-exhaustive (Grechanovsky and Hochberg 1999). An important implication of this property is that the feedback procedure is uniformly more powerful than its non-α-exhaustive special cases, including the parametric fallback and 4A procedures.

3. Enrichment Considerations

Enrichment strategies are employed when the prevalence rate of sensitive patients in the general population is low. In this case, a large overall sample size will be required to ensure that a sufficiently large number of sensitive patients are included in the trial. To overcome this issue, the trial’s sponsor can enrich the trial by enrolling a higher proportion of patients in the sensitive subgroup.

To introduce an enrichment strategy, let π+ be the assumed prevalence rate of sensitive patients in the general population and r+ = n+/n0 be the prevalence rate of sensitive patients in the clinical trial. Let s+ and s− be the selection probabilities of sensitive and nonsensitive patients in the clinical trial, respectively. Then, the selection probabilities can be determined through the following relationship:

\[ \frac{s_-}{s_+} = \frac{\pi_+/(1-\pi_+)}{r_+/(1-r_+)}. \tag{1} \]

For a given pair of π+ and r+, the ratio s−/s+ is fixed; however, s+ and s− are not uniquely determined individually. Because higher selection probabilities are preferred in practice, we will choose maximum values of s+ and s− that satisfy Equation (1). Since π+ < r+, we have s+ > s− and the maximum values of s+ and s− that satisfy Equation (1) are given by

\[ s_+ = 1 \quad\text{and}\quad s_- = \frac{\pi_+/(1-\pi_+)}{r_+/(1-r_+)}. \]

For example, if the prevalence rate of sensitive patients in the general population is π+ = 0.1 and the prevalence rate of sensitive patients in the trial is set at r+ = 0.3, it is easy to show that s+ = 1 and s− = 0.26.

Once the selection probabilities s+ and s− have been chosen, an enriched clinical trial can be conducted in the following way.
All sensitive patients who meet the entry criteria are enrolled in the trial (i.e., s+ = 1) and, for nonsensitive patients, one of the following two options can be considered:

• Option 1: Enroll all nonsensitive patients until a certain total number is reached (e.g., 70% of the total), then stop enrolling nonsensitive patients and keep enrolling sensitive patients until the enrollment goal is met. This option may be attractive from an operational perspective.
• Option 2: Accept nonsensitive patients with probability s−. This option is preferable from a general scientific perspective because the enrollment periods for sensitive and nonsensitive patients coincide (note that, in Option 1, the enrollment period for nonsensitive patients will be shorter than that for sensitive patients). This is particularly important when the standard of care or patient characteristics are expected to change over time. Also, note that it is important to carefully choose the selection timing so that it is not too early or too late. A reasonable choice would be to select nonsensitive patients before a patient enters the trial (i.e., before a patient signs the informed consent document).

Because sensitive patients are over-represented in the enriched clinical trial, the regular overall test statistic Z0 will be driven by the sensitive subgroup. Beginning with the case of continuous endpoints, note that the standard stratified test statistic is given by

\[ Z_0 = \frac{r_+\hat\theta_+ + r_-\hat\theta_-}{2\sqrt{r_+^2/n_+ + r_-^2/n_-}}, \tag{2} \]

where r− = 1 − r+ and θ̂+ and θ̂− are the estimated effect sizes in the sensitive and nonsensitive subgroups, respectively. Here the true effect size is defined as θ = (µ1 − µ2)/σ, where µ1, µ2 are the mean responses in the two treatment groups and σ is a common standard deviation. Under the null hypothesis of no treatment effect, E(θ̂+) = E(θ̂−) = 0, var(θ̂+) = 4/n+, and var(θ̂−) = 4/n−.
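The selection probabilities in Equation (1) and the standard stratified statistic in Equation (2) can be checked numerically with a short sketch. The function names `selection_probabilities` and `stratified_z` are hypothetical, and the code assumes the continuous-endpoint setting described above, with null variances 4/n+ and 4/n− for the estimated effect sizes.

```python
from math import sqrt


def selection_probabilities(pi_plus, r_plus):
    """Maximal (s_plus, s_minus) satisfying Equation (1), assuming pi_plus <= r_plus."""
    if not 0.0 < pi_plus <= r_plus < 1.0:
        raise ValueError("expected 0 < pi_plus <= r_plus < 1")
    s_minus = (pi_plus / (1.0 - pi_plus)) / (r_plus / (1.0 - r_plus))
    return 1.0, s_minus


def stratified_z(theta_plus_hat, theta_minus_hat, n_plus, n_minus):
    """Standard stratified statistic of Equation (2) with r+ = n+/n0 and r- = n-/n0."""
    n0 = n_plus + n_minus
    r_plus, r_minus = n_plus / n0, n_minus / n0
    numerator = r_plus * theta_plus_hat + r_minus * theta_minus_hat
    return numerator / (2.0 * sqrt(r_plus**2 / n_plus + r_minus**2 / n_minus))


if __name__ == "__main__":
    s_plus, s_minus = selection_probabilities(pi_plus=0.1, r_plus=0.3)
    print(s_plus, round(s_minus, 2))  # 1.0 0.26, matching the example in the text
```

When the stratum weights equal the observed proportions, as in this sketch, the stratified statistic reduces to the pooled effect size multiplied by sqrt(n0)/2 whenever the two estimated effect sizes coincide.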
These null moments are easily shown for the continuous endpoint case; the Appendix gives details for the binary and time-to-event cases when the effect sizes are defined as follows:

• Binary endpoints: θ = (p1 − p2)/√(p̄(1 − p̄)), where p1, p2 are the response rates in the two treatment groups and p̄ = (p1 + p2)/2.
• Time-to-event endpoints: θ = log λ, where λ is the hazard ratio between the two treatment groups.

To be able to draw conclusions about the general population from an enriched trial, we propose the following enrichment adjustment for the stratified test statistic in the overall population analysis. The numerator of the adjusted test statistic estimates the overall effect as if the trial had not been enriched. An enrichment-adjusted test statistic of this kind can be obtained using the Horvitz-Thompson method (Horvitz and Thompson 1952), that is,

\[ \tilde Z_0 = \frac{(r_+/s_+)\hat\theta_+ + (r_-/s_-)\hat\theta_-}{2\sqrt{(r_+/s_+)^2/n_+ + (r_-/s_-)^2/n_-}} = \frac{\tilde r_+\hat\theta_+ + \tilde r_-\hat\theta_-}{2\sqrt{\tilde r_+^2/n_+ + \tilde r_-^2/n_-}}, \tag{3} \]

where the enrichment-adjusted rates r̃+ and r̃− are defined as follows

\[ \tilde r_+ = \frac{s r_+}{s r_+ + r_-} \quad\text{and}\quad \tilde r_- = \frac{r_-}{s r_+ + r_-} \]

and s is the enrichment ratio, that is, s = s−/s+. It can be shown that, in this particular case, the enrichment-adjusted rates are equal to the population prevalence rates, that is, r̃+ = π+ and r̃− = π−, where π− = 1 − π+. As an illustration, assume again that π+ = 0.1 and r+ = 0.3, then s = 0.26, r̃+ = 0.1, and r̃− = 0.9. Further, when an enrichment strategy is not used, s = 1 and the enrichment-adjusted test simplifies to the regular test.

4. Optimization Considerations

The multiplicity adjustment strategy based on the feedback procedure includes several design parameters, including the α allocation parameter (α0), enrichment ratio (s), and parameters of the α-spending function.
These parameters influence the power of the overall and sensitive population tests and their choice is driven by a number of factors, including the expected magnitude of the treatment effect in the sensitive population relative to that in the overall population, the size of the sensitive population, and the importance of obtaining regulatory claims in the sensitive population.

To select the design parameters in a clinical trial with a sensitive population, one can either examine the power functions of the overall and sensitive population tests separately or perform a joint evaluation of the two functions. In this section we describe a formal method, termed the maximum weighted power (MWP) method, for achieving an optimal balance between the two power functions by maximizing a suitably defined weighted power function. The MWP method can be readily applied to continuous and binary endpoints. A slight modification is needed for time-to-event endpoints where the log-rank and stratified log-rank tests are used. In this case, we show in the Appendix that the sample sizes n0, n+, and n− will be replaced by corresponding event counts.

In order to achieve an optimal balance between the power functions of the overall and sensitive population tests, it is important to formulate an optimization criterion that reflects the clinical objectives of a trial. The two primary outcomes in a trial with a sensitive population are listed below:

• Outcome 1: A significant treatment effect is established in the overall population (H0 is rejected).
• Outcome 2: A significant treatment effect is established in the sensitive population but not in the overall population (H0 is accepted and H+ is rejected).

Note that detecting a significant effect in the sensitive population after overall efficacy is demonstrated (both H0 and H+ are rejected) is a clinically relevant finding; however, this finding is generally of secondary interest and is not viewed as a key outcome in the trial.
Having defined the two primary outcomes, let w0 and w+ denote the weights, specified at the beginning of the clinical trial, representing the relative importance of these outcomes (w0 > 0, w+ > 0 and w0 + w+ = 1). Using these weights, the optimization problem is stated as follows: Find the values of the design parameters that maximize the weighted power function defined by

ψ = w0 ψ0 + w+ ψ+,

where ψ0 and ψ+ are the marginal power functions defined in terms of probabilities of Outcomes 1 and 2, that is,

ψ0 = P(H0 is rejected) = P(p0 ≤ α0) + P(α0 < p0 ≤ α, p+ ≤ α f(p0)),

and

ψ+ = P(H+ is rejected, H0 is accepted) = P(α < p0, p+ ≤ α f(p0)).

Note that both probabilities are evaluated when the true effect sizes are set to θ− = Θ− and θ+ = Θ+, where Θ− and Θ+ are the values of θ− and θ+ under the alternative hypothesis.

It is worth pointing out that, with equal weights (w0 = w+ = 0.5), weighted power is closely related to disjunctive power (Senn and Bretz 2007) defined as the probability of rejecting at least one null hypothesis. Indeed,

ψ = 0.5(ψ0 + ψ+) = 0.5 P(H0 is rejected or H+ is rejected).

In addition, restricted optimization can be considered. In this case the weighted power function ψ is maximized over a subset of α0 values, for example, a region defined by the following two conditions:

ψ0 ≥ b0, ψ+ ≥ b+,

where b0 and b+ are prespecified lower bounds. The resulting algorithm helps the trial’s sponsor avoid “lopsided” results when the weighted power function is dominated by one of its components. This approach ensures that the success probabilities ψ0 and ψ+ are sufficiently high in both populations.
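The weighted criterion and its restricted version can be illustrated with a small selection routine. This is only a sketch: the helper names `weighted_power` and `select_alpha0` are hypothetical, and the two candidate rows are the (α0, ψ0, ψ+) triples reported for Scenario 2 in the clinical trial example of Section 5, used here purely as inputs.

```python
def weighted_power(psi0, psi_plus, w0, w_plus):
    """Weighted power psi = w0 * psi0 + w_plus * psi_plus."""
    return w0 * psi0 + w_plus * psi_plus


def select_alpha0(candidates, w0, w_plus, b0=0.0, b_plus=0.0):
    """Pick the (alpha0, psi0, psi_plus) row maximizing weighted power.

    Rows violating the lower bounds psi0 >= b0 or psi_plus >= b_plus are
    excluded; None is returned when no row satisfies both bounds.
    """
    feasible = [row for row in candidates if row[1] >= b0 and row[2] >= b_plus]
    if not feasible:
        return None
    return max(feasible, key=lambda row: weighted_power(row[1], row[2], w0, w_plus))


if __name__ == "__main__":
    # (alpha0, psi0, psi_plus) as reported in Section 5 for Scenario 2.
    candidates = [(0.0210, 0.641, 0.136), (0.0126, 0.618, 0.194)]
    print(select_alpha0(candidates, w0=0.8, w_plus=0.2))
    print(select_alpha0(candidates, w0=0.6, w_plus=0.4))
    print(select_alpha0(candidates, w0=0.8, w_plus=0.2, b_plus=0.15))
```

The third call shows the effect of a lower bound on ψ+: a candidate that would win on weighted power alone is excluded when its probability of Outcome 2 falls below the bound.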
The optimal values of the design parameters in the MWP method are found using the following algorithm:

• Step 1: Form a grid of parameter values and, for each configuration of parameters on the grid, select the α-spending function to ensure that Property 2 defined in Section 2 is satisfied, that is, the feedback procedure controls the FWER.
• Step 2: Choose the parameter configuration that maximizes the weighted power function ψ.

Step 1 of the algorithm uses the fact that the overall and sensitive population test statistics asymptotically follow a bivariate normal distribution defined below. Recall that the enrichment-adjusted stratified test statistic for the overall population and test statistic for the sensitive population are given by

\[ \tilde Z_0 = \frac{\tilde r_+\hat\theta_+ + \tilde r_-\hat\theta_-}{2\sqrt{\tilde r_+^2/n_+ + \tilde r_-^2/n_-}} \quad\text{and}\quad Z_+ = \frac{\hat\theta_+}{2\sqrt{1/n_+}}. \]

The marginal distributions of Z̃0 and Z+ are given by

\[ N\!\left( \frac{\theta_-}{s}\sqrt{\frac{n_-^2}{4(n_+ + n_-/s^2)}} + \theta_+\sqrt{\frac{n_+^2}{4(n_+ + n_-/s^2)}},\ 1 \right) \quad\text{and}\quad N\!\left( \theta_+\sqrt{\frac{n_+}{4}},\ 1 \right), \]

respectively, and

\[ \operatorname{corr}(\tilde Z_0, Z_+) = \sqrt{\frac{n_+}{n_+ + n_-/s^2}}. \tag{4} \]

Sample size calculations for the MWP method can be performed using a similar grid search. One begins with a range of possible values for the total sample size. Then, for each sample size value on the grid, find the optimal design parameter values using the MWP algorithm described above. Finally, determine the optimal weighted power and select the total sample size n0 corresponding to a prespecified power level.

5. Clinical Trial Example

To illustrate the methodology introduced in Sections 2–4, consider a clinical development program for a neuroscience compound. A genomic classifier was developed prior to the initiation of confirmatory trials to identify patients who were believed to be more responsive to the treatment compared to the general population of patients with the condition of interest.
The sponsor was interested in designing a Phase III study to evaluate the efficacy profile of the treatment in the overall population as well as the subgroup of classifier-positive patients (sensitive population). The primary endpoint in the trial was based on a clinician-rated scale and was normally distributed. The total sample size in the trial was set at 800 patients. Based on published data, the prevalence rate of sensitive patients in the general population was estimated at 20%, that is, π+ = 0.2. Since the population prevalence rate of sensitive patients was relatively low, an enrichment strategy was introduced in order to increase the proportion of sensitive patients in the trial to 40% (r+ = 0.4). The resulting sizes of the sensitive and nonsensitive subgroups were n+ = 320 and n− = 480. The enrichment ratio was set to

\[ s = \frac{0.2/(1-0.2)}{0.4/(1-0.4)} = 0.375. \]

The associated enrichment-adjusted rates for the sensitive and nonsensitive subgroups were given by r̃+ = 0.2 and r̃− = 0.8. As a result, the enrichment-adjusted test statistic Z̃0 assigned a much smaller weight to the sensitive population test compared to the standard test statistic Z0 with r+ = 0.4 and r− = 0.6.

To apply the feedback procedure to testing the null hypotheses of no treatment effect in the overall and sensitive populations, we will focus on the most basic type of an α-spending function, that is, a constant α-spending function, without the consistency restriction (γ = 1). In other words, f(x) = 1 if 0 ≤ x < α0 and f(x) = c if α0 ≤ x ≤ 1.

Let p0 and p+ denote the one-sided p-values associated with Z̃0 and Z+. The constant c = c(α0) is computed from

P(α0 < p0, p+ ≤ αc) = α − α0

under the global null hypothesis. This probability can be evaluated directly by noting that Z̃0 and Z+ follow a bivariate normal distribution with the correlation coefficient

\[ \sqrt{\frac{200}{200 + 300/0.375^2}} = 0.29. \]

Specifically, it is easy to show that c is found from

P(Z̃0 < z(α0), Z+ < z(αc)) = 1 − α,

where z(x) is the 100(1 − x)th percentile of the standard normal distribution, that is, z(x) = Φ−1(1 − x) and Φ(x) is the cumulative distribution function of the standard normal distribution. Further, given the value of c computed above, the probabilities of Outcomes 1 and 2 defined in Section 4 are given by

ψ0 = P(p0 ≤ α0) + P(α0 < p0 ≤ α, p+ ≤ αc)

and

ψ+ = P(α < p0, p+ ≤ αc).

Four scenarios for the effect sizes in the sensitive and nonsensitive populations will be considered to assess the performance of the feedback procedure in the trial and illustrate the process of selecting the α0 parameter. The assumptions in these scenarios are similar to those made by Wang et al. (2007).

• Scenario 1: θ+ = θ− = Θ = 0.3 (common effect size between the sensitive and nonsensitive populations).
• Scenario 2: θ+ = Θ = 0.3 and θ− = Θ/2 = 0.15 (the effect size in the sensitive population is twice as large as that in the nonsensitive population).
• Scenario 3: θ+ = Θ = 0.3 and θ− = 0 (there is no treatment effect in the nonsensitive population and thus the overall effect is completely driven by the sensitive subgroup).
• Scenario 4: θ+ = Θ = 0.3 and θ− = −r+Θ/r− = −0.2 (note that θ0 = r+θ+ + r−θ− = 0 and thus there is no treatment effect in the overall population).

Scenario 1 helps evaluate the sensitivity of analysis methods for clinical trials with a sensitive subpopulation. Scenario 2 is an optimistic scenario which is likely to be used in the design of tailored-therapy trials. Scenarios 3 and 4 are based on an extreme set of assumptions and can be used to assess the robustness of candidate trial designs.

The probabilities of Outcomes 1 and 2 in the four scenarios as a function of the α0 parameter are plotted in Figure 2. The performance of the feedback procedure in these scenarios is consistent with general principles of subgroup analysis.
First of all, when there is no differential effect between the nonsensitive and sensitive populations (Scenario 1), the feedback procedure maximizes power of the overall population test. The probability of establishing the treatment effect only in the sensitive population (Outcome 2) is less than 1%. As the effect size in the sensitive population relative to that in the overall population increases (Scenarios 2 and 3), the two power curves begin to converge as the feedback procedure shifts the power balance in favor of the sensitive population test. For example, when no treatment effect is expected in the nonsensitive population (Scenario 3), the probability of Outcome 1 is only slightly above 10%. In the extreme case captured in Scenario 4, there is no treatment effect in the overall population and the probability of Outcome 1 drops to less than 0.1%. The process of selecting an optimal value of α0 is quite straightforward in Scenarios 1, 3, and 4. Beginning with Scenario 1, since the probability of Outcome 2 is very low for any value of α0 , the choice of this parameter is driven by the probability of Outcome 1. This probability is maximized when α0 = 0.025. With this value of α0 , the treatment effect in the sensitive population is not tested and the overall population test is carried out at the full 0.025 level. In Scenarios 3 and 4, the probability of Outcome 1 is virtually independent of α0 and thus, to improve the probability of Outcome 2, the trial’s sponsor needs to set α0 to 0. In this case the feedback procedure simplifies to the fixed-sequence procedure beginning with the sensitive population test: • Step 1: A significant treatment effect is established only in the sensitive population (H+ is rejected) if p+ ≤ 0.025. • Step 2: A significant treatment effect is established in the overall population (H0 is rejected) if p0 ≤ 0.025 and H+ is rejected in Step 1. 
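The outcome probabilities discussed above, together with the two-step MWP grid search over α0, can be approximated with a short numerical sketch. It assumes a constant α-spending function without the consistency restriction, takes the bivariate normal approximation of Equation (4) at face value, and evaluates bivariate normal probabilities by one-dimensional Simpson integration. All function names are hypothetical, and the grid and step counts are arbitrary choices rather than settings from the article.

```python
from math import erfc, exp, inf, pi, sqrt
from statistics import NormalDist


def z(x):
    """Upper critical value z(x) = Phi^{-1}(1 - x)."""
    return -NormalDist().inv_cdf(x)


def upper_tail(t):
    """P(N(0, 1) > t)."""
    return 0.5 * erfc(t / sqrt(2.0))


def band_and_upper(lo, hi, t, mu0, mu_plus, rho, steps=400):
    """P(lo < Z0 <= hi, Z+ > t) for a unit-variance bivariate normal pair
    with means (mu0, mu_plus) and correlation rho, via Simpson's rule."""
    a, b = max(lo - mu0, -9.0), min(hi - mu0, 9.0)
    if b <= a:
        return 0.0
    h, scale, total = (b - a) / steps, sqrt(1.0 - rho * rho), 0.0
    for i in range(steps + 1):
        x = a + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        density = exp(-0.5 * x * x) / sqrt(2.0 * pi)
        # Given Z0 = mu0 + x, Z+ is normal with mean mu_plus + rho * x and sd `scale`.
        total += weight * density * upper_tail((t - mu_plus - rho * x) / scale)
    return total * h / 3.0


def design_moments(n_plus, n_minus, s, theta_plus, theta_minus):
    """Means of the adjusted overall and subgroup statistics and their correlation."""
    root = sqrt(n_plus + n_minus / s**2)
    mu0 = (n_plus * theta_plus + (n_minus / s) * theta_minus) / (2.0 * root)
    return mu0, theta_plus * sqrt(n_plus / 4.0), sqrt(n_plus) / root


def calibrate_c(alpha, alpha0, rho):
    """Step 1: bisection for c with P(p0 > alpha0, p+ <= alpha * c) = alpha - alpha0
    under the global null hypothesis (Property 2)."""
    lo, hi = 1e-9, 1.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if band_and_upper(-inf, z(alpha0), z(alpha * mid), 0.0, 0.0, rho) < alpha - alpha0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def outcome_probs(alpha, alpha0, c, mu0, mu_plus, rho):
    """Return (psi0, psi_plus), the probabilities of Outcomes 1 and 2."""
    t = z(alpha * c)
    psi0 = upper_tail(z(alpha0) - mu0) + band_and_upper(
        z(alpha), z(alpha0), t, mu0, mu_plus, rho)
    psi_plus = band_and_upper(-inf, z(alpha), t, mu0, mu_plus, rho)
    return psi0, psi_plus


def mwp_grid_search(alpha, grid, w0, w_plus, moments):
    """Step 2: return (psi, alpha0, c, psi0, psi_plus) maximizing weighted power."""
    mu0, mu_plus, rho = moments
    best = None
    for alpha0 in grid:
        c = calibrate_c(alpha, alpha0, rho)
        psi0, psi_plus = outcome_probs(alpha, alpha0, c, mu0, mu_plus, rho)
        psi = w0 * psi0 + w_plus * psi_plus
        if best is None or psi > best[0]:
            best = (psi, alpha0, c, psi0, psi_plus)
    return best


if __name__ == "__main__":
    # Scenario 2 of the example: n+ = 320, n- = 480, s = 0.375, theta+ = 0.3, theta- = 0.15.
    moments = design_moments(320, 480, 0.375, 0.3, 0.15)
    grid = [k / 1000 for k in range(1, 25)]  # interior alpha0 values 0.001, ..., 0.024
    print(mwp_grid_search(0.025, grid, 0.8, 0.2, moments))
```

The grid deliberately excludes the endpoints α0 = 0 and α0 = 0.025, where the critical values become infinite; those boundary cases correspond to the fixed-sequence procedure described above and to testing only the overall population at the full level.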
Figure 2. Probability of Outcome 1 (solid curve) and Probability of Outcome 2 (dashed curve) for the feedback procedure in the neuroscience clinical trial example for Scenario 1 (θ+ = 0.3, θ− = 0.3), Scenario 2 (θ+ = 0.3, θ− = 0.15), Scenario 3 (θ+ = 0.3, θ− = 0) and Scenario 4 (θ+ = 0.3, θ− = −0.2). Four panels, one per scenario; horizontal axis: α0 (0 to 0.025); vertical axis: outcome probabilities (0 to 1).

In Scenario 2, the value of α0 which provides an optimal balance between the probabilities of Outcomes 1 and 2 is likely to be closer to the middle of the interval [0, 0.025]. To determine this optimal value, the MWP method will be applied to maximize the weighted power functions with the following importance weightings:

• Scenario 2A: w0 = 0.8 and w+ = 0.2.
• Scenario 2B: w0 = 0.6 and w+ = 0.4.

In the first scenario the importance weightings are equal to the population prevalence rates of nonsensitive and sensitive patients and the second scenario emphasizes a significant outcome in the sensitive population by increasing the value of w+ to 0.4. The weighted power functions for the two scenarios are displayed in Figure 3. Weighted power is maximized in Scenario 2A when α0 = 0.0210, which indicates that the α0-optimized feedback procedure puts more emphasis on the overall population analysis. The corresponding probabilities of Outcomes 1 and 2 are given by 64.1% and 13.6%, respectively. Increasing the weight for Outcome 2 shifts the weighted power function to the left and improves the probability of detecting a significant treatment effect in the sensitive population.
The optimal value of α0 in Scenario 2B is 0.0126. The smaller value of α0 decreases the probability of Outcome 1 (61.8%) and increases the probability of Outcome 2 (19.4%). Note that the resulting change in the success probabilities is relatively small and thus the MWP method and feedback procedure are fairly robust with respect to the prespecified importance weightings. Also, note that the absolute values of weighted power should not be compared between the two scenarios since the vertical scale is determined by the weights selected; for example, the weighted power function cannot exceed 0.8 in Scenario 2A, and the corresponding maximum value of the weighted power function in Scenario 2B is only 0.6.

Figure 3. Weighted power function for the feedback procedure with a constant α-spending function in the neuroscience clinical trial example for Scenario 2A (θ+ = 0.3, θ− = 0.15, w0 = 0.8, w+ = 0.2) and Scenario 2B (θ+ = 0.3, θ− = 0.15, w0 = 0.6, w+ = 0.4). Each panel plots weighted power (vertical axis) against α0 (horizontal axis, 0 to 0.025).

Another important consideration in tailored-therapy clinical trials is the choice of the enrichment ratio (s), which determines the proportion of sensitive patients in the trial (r+). We have assumed so far that this proportion is set at 40%. Figure 4 plots the probabilities of Outcomes 1 and 2 as a function of the α0 parameter for four different enrichment scenarios:

• Scenario 2C: r+ = 0.2 (no enrichment).
• Scenario 2D: r+ = 0.3.
• Scenario 2E: r+ = 0.4.
• Scenario 2F: r+ = 0.5.

Figure 4 shows that increasing the proportion of sensitive patients in the trial clearly increases the probability of establishing a significant treatment effect in the sensitive population for smaller values of α0.
This is a direct consequence of the fact that the size of the sensitive population is proportional to r+. In addition, a greater value of r+ reduces the impact of α0 on the power of the overall population test. As a result, the trial’s sponsor will be more inclined to select a smaller value of the α0 parameter in the feedback procedure.

Figure 4. Probability of Outcome 1 (solid curve) and probability of Outcome 2 (dashed curve) for the feedback procedure in the neuroscience clinical trial example for Scenario 2C (θ+ = 0.3, θ− = 0.15, r+ = 0.2), Scenario 2D (θ+ = 0.3, θ− = 0.15, r+ = 0.3), Scenario 2E (θ+ = 0.3, θ− = 0.15, r+ = 0.4) and Scenario 2F (θ+ = 0.3, θ− = 0.15, r+ = 0.5). Each panel plots the outcome probabilities (vertical axis, 0 to 1) against α0 (horizontal axis, 0 to 0.025).

6. Discussion

In this article we reviewed key considerations in clinical trials with a fixed classifier which defines a population of sensitive patients, including a novel multiplicity adjustment (feedback procedure), enrichment-adjusted overall population analysis (Horvitz-Thompson method) and optimal selection of the analysis strategy (maximum weighted power or MWP method). In what follows we summarize additional considerations that will be relevant in the design or analysis of tailored-therapy clinical trials.

Beginning with design considerations, we generally recommend that an enrichment strategy be considered when the prevalence rate of sensitive patients in the general population is low. Enrichment helps the trial’s sponsor increase the proportion of sensitive patients enrolled in the trial and improve the power of the sensitive population test. If the trial is enriched, it is important to use the enrichment-adjusted overall test defined in Section 3 in place of the usual unadjusted test.

The degree of multiplicity adjustment depends on the consistency restriction defined in Section 2 and the value of the consistency parameter γ. It is worth noting that the FWER is reduced when the consistency restriction is applied and α is reclaimed by adjusting the free parameters of the α-spending function to bring the FWER to its nominal level. As a result, the consistency restriction is binding in the sense that it can no longer be removed in a post-hoc analysis because this will result in FWER inflation. This is analogous to the use of binding futility stopping boundaries in group-sequential trials; see, for example, Proschan, Wittes, and Lan (2006, chap. 5). If a binding consistency restriction is undesirable, this restriction can be applied in a nonbinding manner by computing the free parameters of the α-spending function with γ = 1 and then modifying the decision rules for the two hypotheses of interest:

• Reject H0 if (1) p0 ≤ α0, or (2) α0 < p0 ≤ α and p+ ≤ α f(p0).
• Reject H+ if (1) p0 ≤ α0 and p+ ≤ α, or (2) α0 < p0 ≤ γ and p+ ≤ α f(p0).

Another important consideration in clinical trials with a sensitive population is the impact of the sensitive population test on the outcome of the overall test. The null hypothesis of no treatment effect in the overall population is rejected if (1) p0 ≤ α0 or (2) α0 < p0 ≤ α and p+ ≤ α f(p0).
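The nonbinding decision rules listed above can be sketched as follows. The function and argument names are hypothetical. In particular, the threshold written as α f(p0) in the text is supplied by the caller as a function of p0, since the α-spending function itself is defined earlier in the article. The constant threshold and the value of α0 used in the usage lines are placeholders and have not been calibrated to control the FWER.

```python
from typing import Callable, Tuple


def feedback_decisions(
    p0: float,
    p_plus: float,
    alpha0: float,
    alpha: float,
    gamma: float,
    threshold: Callable[[float], float],
) -> Tuple[bool, bool]:
    """Apply the two decision rules as stated in the text.

    `threshold(p0)` stands for the quantity written as alpha f(p0).
    Returns (reject_h0, reject_h_plus).
    """
    t = threshold(p0)
    # H0: (1) p0 <= alpha0, or (2) alpha0 < p0 <= alpha and p+ <= threshold.
    reject_h0 = p0 <= alpha0 or (alpha0 < p0 <= alpha and p_plus <= t)
    # H+: (1) p0 <= alpha0 and p+ <= alpha,
    #     or (2) alpha0 < p0 <= gamma and p+ <= threshold.
    reject_h_plus = (p0 <= alpha0 and p_plus <= alpha) or (
        alpha0 < p0 <= gamma and p_plus <= t
    )
    return reject_h0, reject_h_plus


def constant_threshold(p0: float) -> float:
    """Placeholder constant threshold; NOT a calibrated value."""
    return 0.01


if __name__ == "__main__":
    settings = dict(alpha0=0.02, alpha=0.025, gamma=1.0, threshold=constant_threshold)
    print(feedback_decisions(p0=0.010, p_plus=0.020, **settings))
    print(feedback_decisions(p0=0.023, p_plus=0.005, **settings))
    print(feedback_decisions(p0=0.300, p_plus=0.005, **settings))
```

The second call shows the case discussed next in the text: p0 lies between α0 and α, and H0 is rejected only because p+ falls below the threshold.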
The second condition (α0 < p0 ≤ α and p+ ≤ α f(p0)) will be met when the p-value for the overall population test (p0) is marginally significant whereas the sensitive population p-value (p+) is highly significant. In other words, this condition helps improve the power of the overall population test by “borrowing” power from the sensitive population test. If the trial’s sponsor would like to avoid cases when the overall conclusion is driven mostly by a significant treatment effect in the sensitive population, the second condition can be dropped.

Note also that, if the overall and sensitive population tests are both significant, the main outcome of the trial will be formulated in terms of the overall population analysis. In addition, since the FWER is controlled in this problem, the results of the sensitive subgroup analysis could also be included in the product label to provide information about the beneficial effect of the treatment for sensitive patients. Further, the secondary analyses are generally restricted to the population used in the primary analysis; for example, they are restricted to the sensitive population if the treatment effect for the primary endpoint is demonstrated only in that subgroup. If control of the FWER for the primary and secondary objectives is mandated, the sponsor can use gatekeeping procedures that account for logical relationships among the multiple objectives in a trial with a sensitive population (Dmitrienko et al. 2007, 2008).

The performance of the MWP method depends on a number of parameters. The most important of them is the weight assigned to the overall population analysis (w0). The value of w0 must be chosen a priori and its choice is driven by the relative importance of a significant outcome in the overall population compared to the subpopulation defined by a fixed classifier.
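To illustrate how the preferred α0 can depend on w0, the sketch below compares the two optimized values reported for Scenario 2 in the neuroscience example (α0 = 0.0210 and α0 = 0.0126) across a small grid of weights. It assumes that weighted power is the weighted sum w0 × P(Outcome 1) + w+ × P(Outcome 2) with w+ = 1 − w0. This form and all function names are assumptions made for illustration; only the outcome probabilities are taken from the text.

```python
# Outcome probabilities reported in the text for Scenario 2 at the two
# optimized alpha0 values: alpha0 -> (P(Outcome 1), P(Outcome 2)).
REPORTED = {
    0.0210: (0.641, 0.136),
    0.0126: (0.618, 0.194),
}


def weighted_power(w0: float, p_outcome1: float, p_outcome2: float) -> float:
    """Assumed form of weighted power, with w_plus = 1 - w0."""
    return w0 * p_outcome1 + (1.0 - w0) * p_outcome2


def preferred_alpha0(w0: float) -> float:
    """Return the reported alpha0 with the larger assumed weighted power."""
    return max(REPORTED, key=lambda a0: weighted_power(w0, *REPORTED[a0]))


if __name__ == "__main__":
    for w0 in (0.5, 0.6, 0.7, 0.8, 0.9):
        values = {a0: round(weighted_power(w0, *p), 4) for a0, p in REPORTED.items()}
        print(w0, values, "preferred:", preferred_alpha0(w0))
```

Under this assumed form, the comparison reproduces the ordering reported for Scenarios 2A and 2B: α0 = 0.0210 scores higher at w0 = 0.8 and α0 = 0.0126 scores higher at w0 = 0.6. Only two candidate values are compared here; a full sensitivity analysis would re-optimize α0 over [0, 0.025] for each weight.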
We advise that a number of weights be evaluated as part of sensitivity analysis before the final analysis strategy is selected.

Appendix: Stratified Test Statistics for Binary and Time-to-Event Endpoints

Binary Endpoints

Cochran (1954) suggested an approach to combine the results from several 2×2 contingency tables. Assuming a 1:1 randomization, the Cochran-Mantel-Haenszel statistic is given by

\[
Y = \frac{n_+ d_+ + n_- d_-}{2\sqrt{n_+ P_+ Q_+ + n_- P_- Q_-}},
\]

where, in the sensitive subgroup, n+1, n+2 = sample sizes in the two treatment groups, p+1, p+2 = observed response rates in the two treatment groups, P+ = (n+1 p+1 + n+2 p+2)/(n+1 + n+2), Q+ = 1 − P+, d+ = p+1 − p+2, and similar notation is used in the subgroup of nonsensitive patients (with + replaced by −). When P+ and P− are in the range between 0.2 and 0.8, P+Q+ is approximately equal to P−Q−. Therefore,

\[
Y \approx \frac{n_+ d_+/\sqrt{P_+ Q_+} + n_- d_-/\sqrt{P_- Q_-}}{2\sqrt{n_+ + n_-}}
  = \frac{n_+ \theta_+ + n_- \theta_-}{2\sqrt{n_+ + n_-}}
  = \frac{p_+ \theta_+ + p_- \theta_-}{2\sqrt{p_+^2/n_+ + p_-^2/n_-}},
\tag{A.1}
\]

where the effect sizes are defined as θ+ = d+/√(P+Q+) and θ− = d−/√(P−Q−). Also, p+ = n+/n0 and p− = n−/n0, with n0 = n+ + n−. Note that the stratified test statistic in Equation (A.1) is in the format of Equation (2).

Time-to-Event Endpoints

First we review the standard log-rank test for trials with two treatment groups in one stratum. Let

\[
U = \sum_{j=1}^{r} \left( d_{1j} - \frac{d_j n_{1j}}{n_j} \right)
\quad \text{and} \quad
V = \sum_{j=1}^{r} \frac{n_{1j} n_{2j} d_j (n_j - d_j)}{n_j^2 (n_j - 1)},
\tag{A.2}
\]

where r is the number of distinct event times, d1j and d2j are the number of events at time tj in Group I and Group II, respectively, n1j and n2j are the number of individuals at risk in Group I and Group II, respectively, and dj = d1j + d2j, nj = n1j + n2j. The log-rank statistic is WL = U/√V, which has asymptotically a standard normal distribution. Let θ be the log hazard ratio. Then, by Sellke and Siegmund (1983), for small values of θ,

\[
U \sim N(\theta V, V).
\tag{A.3}
\]

In addition, by a result due to Collett (2003, p. 304),

\[
V \approx d/4,
\tag{A.4}
\]

where d is the total number of events. Now consider a stratified log-rank test with two strata. Let U+, V+, and U−, V− be defined as in Equation (A.2) for Stratum 1 and 2, respectively. Then a stratified log-rank test is defined as

\[
W_s = \frac{U_+ + U_-}{\sqrt{V_+ + V_-}} \sim N(0, 1).
\tag{A.5}
\]

Now, we will rewrite Ws in a more familiar form. First, define

\[
K_+ = U_+/V_+ \sim N(\theta_+, 1/V_+)
\quad \text{and} \quad
K_- = U_-/V_- \sim N(\theta_-, 1/V_-).
\tag{A.6}
\]

Combined with Equations (A.4) and (A.6), Equation (A.5) becomes

\[
W_s \approx \frac{(d_+/4) K_+ + (d_-/4) K_-}{\sqrt{d_+/4 + d_-/4}}
  = \frac{d_+ K_+ + d_- K_-}{2\sqrt{d_+ + d_-}}
  = \frac{p_+ K_+ + p_- K_-}{2\sqrt{p_+^2/d_+ + p_-^2/d_-}},
\tag{A.7}
\]

where d+ and d− are the numbers of events in stratum 1 and stratum 2, respectively, d = d+ + d−, p+ = d+/d, and p− = d−/d. Note that the stratified test statistic in Equation (A.7) is in the format of Equation (2).

Acknowledgments

Alex Dmitrienko thanks Dr. Gene Pennello (U.S. Food and Drug Administration) for a helpful discussion of statistical issues arising in subgroup analysis.

[Received September 2008. Revised August 2009.]

REFERENCES

Alosh, M., and Huque, M. (2009), “A Flexible Strategy for Testing Subgroups and Overall Population,” Statistics in Medicine, 28, 3–23.

Baselga, J. (2001), “Herceptin Alone or in Combination With Chemotherapy in the Treatment of HER2-positive Metastatic Breast Cancer: Pivotal Trials,” Oncology, 61, 14–21.

Collett, D. (2003), Modelling Survival Data in Medical Research (2nd ed.), London: Chapman and Hall/CRC.

Dmitrienko, A., Wiens, B.L., Tamhane, A.C., and Wang, X. (2007), “Tree-Structured Gatekeeping Tests in Clinical Trials with Hierarchically Ordered Multiple Objectives,” Statistics in Medicine, 26, 2465–2478.

Dmitrienko, A., Tamhane, A.C., Liu, L., and Wiens, B.L. (2008), “A Note on Tree Gatekeeping Procedures in Clinical Trials,” Statistics in Medicine, 27, 3446–3451.

Dmitrienko, A., Bretz, F., Westfall, P.H., Troendle, J., Wiens, B.L., Tamhane, A.C., and Hsu, J.C. (2009), “Multiple Testing Methodology,” in Multiple Testing Problems in Pharmaceutical Statistics, eds. A. Dmitrienko, A.C. Tamhane, and F. Bretz, New York: Chapman and Hall/CRC Press.

Freidlin, B., and Simon, R. (2005), “Adaptive Signature Design: An Adaptive Clinical Trial Design for Generating and Prospectively Testing a Gene Expression Signature for Sensitive Patients,” Clinical Cancer Research, 11, 7872–7878.

Grechanovsky, E., and Hochberg, Y. (1999), “Closed Procedures are Better and Often Admit a Shortcut,” Journal of Statistical Planning and Inference, 76, 79–91.

Hochberg, Y., and Tamhane, A.C. (1987), Multiple Comparison Procedures, New York: Wiley.

Horvitz, D.G., and Thompson, D.J. (1952), “A Generalization of Sampling Without Replacement from a Finite Universe,” Journal of the American Statistical Association, 47, 663–685.

Huque, M.F., and Alosh, M. (2008), “A Flexible Fixed-Sequence Testing Method for Hierarchically Ordered Correlated Multiple Endpoints in Clinical Trials,” Journal of Statistical Planning and Inference, 138, 321–335.

Li, D., and Mehrotra, D.V. (2008), “An Efficient Method for Accommodating Potentially Underpowered Primary Endpoints,” Statistics in Medicine, 27, 5377–5391.

Marcus, R., Peritz, E., and Gabriel, K.R. (1976), “On Closed Testing Procedures with Special Reference to Ordered Analysis of Variance,” Biometrika, 63, 655–660.

Proschan, M.A., Wittes, J.T., and Lan, K.K.G. (2006), Statistical Monitoring of Clinical Trials: A Unified Approach, New York: Springer.

Rosenwald, A., Wright, G., Chan, W.C., et al. (2002), “The Use of Molecular Profiling to Predict Survival After Chemotherapy for Diffuse Large B-cell Lymphoma,” New England Journal of Medicine, 346, 1937–1947.

Sellke, T., and Siegmund, D. (1983), “Sequential Analysis of the Proportional Hazards Model,” Biometrika, 70, 315–326.

Senn, S., and Bretz, F. (2007), “Power and Sample Size when Multiple Endpoints are Considered,” Pharmaceutical Statistics, 6, 161–170.

Shipp, M.A., Ross, K.N., Tamayo, P., et al. (2002), “Diffuse Large B-cell Lymphoma Outcome Prediction by Gene Expression Profiling and Supervised Machine Learning,” Nature Medicine, 8, 68–74.

Song, Y., and Chi, G.Y. (2007), “A Method for Testing a Prespecified Subgroup in Clinical Trials,” Statistics in Medicine, 26, 3535–3549.

Wang, S.J., O’Neill, R.T., and Hung, H.M.J. (2007), “Approaches to Evaluation of Treatment Effect in Randomized Clinical Trials with Genomic Subset,” Pharmaceutical Statistics, 6, 227–244.

Wiens, B. (2003), “A Fixed-Sequence Bonferroni Procedure for Testing Multiple Endpoints,” Pharmaceutical Statistics, 2, 211–215.

Wiens, B., and Dmitrienko, A. (2005), “The Fallback Procedure for Evaluating a Single Family of Hypotheses,” Journal of Biopharmaceutical Statistics, 15, 929–942.

About the Authors

Yan D. Zhao is Senior Research Scientist, Alex Dmitrienko is Research Advisor, and Roy Tamura is Research Fellow, Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN 46285 (Email for correspondence: yzhao@lilly.com).