Use of the False Discovery Rate for Evaluating Clinical Safety Data

Joseph F. Heyse
Devan V. Mehrotra
Clinical Biostatistics – Vaccines
Merck Research Laboratories
Blue Bell, PA

Third International Conference on Multiple Comparisons
Bethesda, MD
August 6, 2002

Acknowledgment
This research was in collaboration with the late Professor John Tukey (Princeton University).

Heyse/MCP2002 bl 2

Outline
Motivating example
Multiplicity issues
FWER and FDR
Proposal for flagging AEs
Summary of three examples
Concluding remarks

Introduction
Evaluation of safety is an important part of clinical trials of pharmaceutical and biological products.
Adverse experiences (AEs) can be categorized as three types:
  – Tier 1: Associated with specific hypotheses
  – Tier 2: Set encountered as part of trial safety evaluation
  – Tier 3: Rare spontaneous reports of serious events that require clinical evaluation
Our interest is primarily Tier 2.

ICH Recommendations
ICH-E9 recommends descriptive statistical methods supplemented by confidence intervals.
p-values are useful to evaluate a specific difference of interest.
If hypothesis tests are used, statistical adjustments for multiplicity to quantitate the Type I error are appropriate, but the Type II error is usually of more concern.
p-values are sometimes useful as a “flagging” device applied to a large number of safety variables to highlight differences worthy of further attention.

Illustration: Multiplicity in Safety Assessment
A clinical trial compared the safety and immunogenicity of the combination vaccine COMVAX™* to its monovalent components.
1 of 92 safety comparisons revealed a higher rate of unusual high-pitched crying (UHPC) following the second of a three-dose series (6.7% vs. 2.3%, p=0.016).
No medical rationale for this finding was discovered and a larger hypothesis-driven study was designed.
Comparable rates were observed following vaccination in this larger trial.
*COMVAX™ is a combination of HIB and HB vaccine

Motivating Example (MMRV* Vaccine)
Safety and immunogenicity vaccine trial.
Study population: healthy toddlers, 12-18 months of age
Group 1 = MMRV + PedvaxHIB on Day 0
Group 2 = MMR + PedvaxHIB on Day 0, followed by (optional) varicella vaccine on Day 42
*MMRV is a combination measles, mumps, rubella, varicella vaccine

Motivating Example (cont’d)
Safety follow-up (local and systemic reactions)
  Group 1: Day 0-42 (N=148)
  Group 2: Day 0-42 (N=148) and Day 42-84 (N=132)
Question: Is the safety profile different if the varicella component is given as part of a combination vaccine on Day 0 compared with giving it 6 weeks later as a monovalent vaccine?
AEs: Group 1 (Day 0-42) vs. Group 2 (Day 42-84)

Clinical AE Counts (“Tier 2” AEs)
X1 = count in Grp 1 (N1=148); X2 = count in Grp 2 (N2=132); BS = body system

 #   BS  ADVERSE EXPERIENCE              X1   X2  DIFF (%)  p-value
 1   01  ASTHENIA / FATIGUE              57   40     8.2    .1673
 2   01  FEVER                           34   26     3.3    .5606
 3   01  INFECTION, FUNGAL                2    0     1.4    .4998
 4   01  INFECTION, VIRAL                 3    1     1.3    .6248
 5   01  MALAISE                         27   20     3.1    .5248
 6   03  ANOREXIA                         7    2     3.2    .1791
 7   03  CANDIDIASIS, ORAL                2    0     1.4    .4998
 8   03  CONSTIPATION                     2    0     1.4    .4998
 9   03  DIARRHEA                        24   10     8.6    .0289*
10   03  GASTROENTERITIS, INFECTIOUS      3    1     1.3    .6248
11   03  NAUSEA                           2    7    -4.0    .0889
12   03  VOMITING                        19   19    -1.6    .7295
13   05  LYMPHADENOPATHY                  3    2     0.5   1.0000
14   06  DEHYDRATION                      0    2    -1.5    .2214
15   08  CRYING                           2    0     1.4    .4998
16   08  INSOMNIA                         2    2    -0.2   1.0000
17   08  IRRITABILITY                    75   43    18.1    .0025*
18   09  BRONCHITIS                       4    1     1.9    .3746
19   09  CONGESTION, NASAL                4    2     1.2    .6872
20   09  CONGESTION, RESPIRATORY          1    2    -0.8    .6033
21   09  COUGH                           13    8     2.7    .4969
22   09  INFECTION, RESPIRATORY, UPPER   28   20     3.8    .4308
23   09  LARYNGOTRACHEOBRONCHITIS         2    1     0.6   1.0000
24   09  PHARYNGITIS                     13    8     2.7    .4969
25   09  RHINORRHEA                      15   14    -0.5   1.0000
26   09  SINUSITIS                        3    1     1.3    .6248
27   09  TONSILLITIS                      2    1     0.6   1.0000
28   09  WHEEZING                         3    1     1.3    .6248
29   10  BITE/STING, NON-VENOMOUS         4    0     2.7    .1248
30   10  ECZEMA                           2    0     1.4    .4998
31   10  PRURITUS                         2    1     0.6   1.0000
32   10  RASH                            13    3     6.5    .0209*
33   10  RASH, DIAPER                     6    2     2.5    .2885
34   10  RASH, MEASLES/RUBELLA-LIKE       8    1     4.6    .0388*
35   10  RASH, VARICELLA-LIKE             4    2     1.2    .6872
36   10  URTICARIA                        0    2    -1.5    .2214
37   10  VIRAL EXANTHEMA                  1    2    -0.8    .6033
38   11  CONJUNCTIVITIS                   0    2    -1.5    .2214
39   11  OTITIS MEDIA                    18   14     1.6    .7109
40   11  OTORRHEA                         2    1     0.6   1.0000

* unadjusted p < .05

Multiplicity Issues - The Problem
Potential for too many false positive safety findings if the multiplicity problem is ignored (for “Tier 2” AEs).
This can muddy the interpretation of the safety profile of the vaccine/drug.

Multiplicity Issues - The Challenge
To develop a procedure for tackling multiplicity that:
  – Provides a proper balance between “no adjustment” and “too much adjustment”.
  – Is easy to automate/implement.

Familywise Error Rate (FWER)
Let $F = \{H_1, H_2, \ldots, H_m\}$ denote a family of m hypotheses.
$\mathrm{FWER} = \Pr(\text{any true } H_i \in F \text{ is rejected})$.
We usually seek methods for which $\mathrm{FWER} \le \alpha$.
Benjamini & Hochberg (1995) argue that, in certain settings, requiring control of the FWER is often too conservative. They suggest controlling the “false discovery rate” instead, as a more powerful alternative.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, B, 57, 289-300.
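A rough worked calculation may help show the two extremes that the "Challenge" slide refers to. It assumes, purely for illustration and not as a claim from the slides, that the m = 40 AE comparisons are independent, that every null hypothesis is true, and that the p-values are continuous; real AE data are correlated and Fisher's exact test is discrete, so the actual error rate will differ.

```latex
% Assumptions (illustrative only): m independent tests, all nulls true, each tested at level \alpha.
\[
  \text{No adjustment:}\quad
  \mathrm{FWER} = 1 - (1-\alpha)^m = 1 - 0.95^{40} \approx 0.87
\]
\[
  \text{Bonferroni:}\quad
  \text{test each } H_i \text{ at } \frac{\alpha}{m} = \frac{0.05}{40} = 0.00125
  \;\Rightarrow\; \mathrm{FWER} \le \alpha = 0.05
\]
```

Under these assumptions, unadjusted testing almost always yields at least one false positive. The Bonferroni threshold controls the FWER but demands very small p-values, which is the kind of conservatism Benjamini & Hochberg argue against.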
False Discovery Rate (FDR)
(Benjamini & Hochberg)

                  Declared        Declared
                  Insignificant   Significant   Total
# of true Hi      U               V             m0
# of false Hi     T               S             m - m0
Total             m - R           R             m

$\mathrm{FDR} = E(V/R)$ = expected proportion of rejected null hypotheses which are incorrectly rejected.
Define 0/0 as 0.

False Discovery Rate (FDR) (cont’d)
(Benjamini & Hochberg)
Order the p-values so that $p_1 \le p_2 \le \cdots \le p_m$. Reject $H_1, H_2, \ldots, H_j$ for the largest $j$ with $p_j \le \frac{j}{m}\alpha$. {This controls FDR at $\frac{m_0}{m}\alpha$}
Adjusted p-values: $\tilde{p}_m = p_m$, $\tilde{p}_j = \min\left(\tilde{p}_{j+1}, \frac{m}{j}\, p_j\right)$, $j \le m-1$.

Example
Unadjusted p-values      .0193  .0280  .2038  .4941
FDR-adjusted p-values    .0560  .0560  .2718  .4941

$\mathrm{FDR} \le \mathrm{FWER}$ {equality holds if $m = m_0$}.
Effect of correlations on FDR is an area of research.

Proposal for Flagging AEs
We routinely summarize AEs by body system (BS).
$s$ body systems ($i = 1, 2, \ldots, s$)
$k_i$ AEs associated with body system $i$
$p_{ij}$ = between-group p-value for the $j$th AE within the $i$th BS (e.g., based on two-tailed Fisher’s exact test).

Proposal for Flagging AEs (cont’d)
Step 1: Ignore AEs for which the total incidence is so low that a rejection even at the unadjusted 0.05 level is impossible.
Step 2: Among the remaining AEs, flag those for which the p-value achieves statistical significance after adjusting for multiplicity using a “Double FDR” approach.

Double FDR Approach
Define $p_i^* = \min(p_{i1}, p_{i2}, \ldots, p_{ik_i})$. This represents the strongest safety “signal” for body system $i$.
1st level FDR adjustment
  – Apply FDR adjustment to $p_1^*, p_2^*, \ldots, p_s^*$
  – Let $\tilde{p}_i^*$ = FDR-adjusted $p_i^*$
2nd level FDR adjustment
  – Within body system $i$, apply FDR adjustment to $p_{i1}, p_{i2}, \ldots, p_{ik_i}$, $1 \le i \le s$
  – Let $\tilde{p}_{ij}$ = FDR-adjusted $p_{ij}$

Double FDR Approach (cont’d)
Proposed Flagging Rule
Flag AE(i,j) if $\tilde{p}_i^* \le \alpha_1$ and $\tilde{p}_{ij} \le \alpha_2$.
What values of $\alpha_1$ and $\alpha_2$ should we use?

Choosing $\alpha_1$ and $\alpha_2$
Set $\alpha_2 = \alpha$ and use either (a) or (b) below for $\alpha_1$.
(a) Use resampling (non-parametric bootstrap) to determine the largest data-dependent $\alpha_1$ ($\le \alpha_2$) that ensures $\mathrm{FDR} \le \alpha$.
OR
(b) Choose $\alpha_1$ ($\le \alpha_2$) independent of the data. For example, let $\alpha_1 = \alpha_2$ or $\alpha_2/2$, and estimate the resulting FDR using resampling.

Resampling Procedure
Purpose
  – To estimate the false discovery rates of the following:
      NOADJ          No multiplicity adjustment; flag AE if unadjusted p < .05
      FULLFDR(α)     Full FDR adjustment (ignore BS grouping)
      DFDR(α1, α2)   Double FDR adjustment for selected (α1, α2)
  – To determine the largest α1 (≤ α2) that guarantees FDR ≤ α when using DFDR(α1, α2).

Resampling Procedure (cont’d)
Details
1. Pool data from both treatment groups into a common population. Sample with replacement from this common population, to simulate many repetitions of the original trial. This procedure:
   a) simulates a true null situation (Group 1 = Group 2).
   b) preserves the correlation structure of the original data.
2. Implement our proposal for flagging AEs using the NOADJ, FULLFDR(α), and DFDR(α1, α2) approaches, and calculate the corresponding FDRs.

MMRV Example - Resampling Results
Y = # of incorrectly flagged AEs*

                    Distribution of Y (%)
Method              0      1      2     ≥3    FDR (%)
NOADJ             48.8   33.0   12.9   5.3    51.2
FULLFDR(.10)      95.2    4.0    0.6   0.2     4.8
DFDR(.02, .05)    97.0    2.5    0.4   0.1     3.0
DFDR(.05, .05)    91.2    7.3    1.1   0.4     8.8
DFDR(.05, .10)    90.9    6.4    1.9   0.8     9.1
DFDR(.10, .10)    79.8   13.0    5.2   2.0    20.2

* out of 40; 2000 simulations

MMRV Example - Resampling Results
DFDR(α1, α2): Estimated FDR (%)

 α1     α2=0.05   α2=0.10   α2=0.15
0.01      1.45      1.45      1.45
0.02      3.00      3.00      3.00
0.03      4.70      4.70      4.70
0.04      7.10      7.15      7.15
0.05      8.80      9.15      9.15
0.06       -       11.70     11.70
0.07       -       13.65     13.70
0.08       -       16.35     16.50
0.09       -       18.85     19.25
0.10       -       20.25     21.30
0.11       -         -       24.25
0.12       -         -       25.60
0.13       -         -       27.75
0.14       -         -       29.90
0.15       -         -       31.25

Max. Acceptable FDR (α)     5%          10%         15%
(α1, α2 = α)             (.03,.05)   (.05,.10)   (.07,.15)

First Level FDR Adjustment

                              Unadjusted   Number of   FDR Adjusted
Body System                   p-value      AE Types    p-value
Nervous system                0.0025        3          0.0200
Skin                          0.0209        9          0.0771
Digestive system              0.0289        7          0.0771
Body site unspecified         0.1673        5          0.2952
Special senses                0.2214        3          0.2952
Metabolic / immune            0.2214        1          0.2952
Respiratory                   0.3746       11          0.4281
Hematologic and lymphatic     1.0000        1          1.0000

Second Level FDR Adjustment
Body System 08: Nervous System and Psychiatric

Adverse Experience    Unadjusted p-value    FDR Adjusted p-value
Irritability          0.0025                0.0075
Crying                0.4998                0.7497
Insomnia              1.0000                1.0000

Summary of Three Examples
Flagged AEs by trial (# of subjects, # of AEs), under no multiplicity adjustment and under DFDR adjustment with maximum FDR of 15%, 10%, and 5%.

PedvaxHIB (N=681), 15 AEs
  No multiplicity adjustment: FDR ~ 43%
  DFDR settings (15%, 10%, 5%): α1=.07, α=.15; α1=.05, α=.10; α1=.02, α=.05
  Flagged AEs, as listed across the columns: Irritability, Irritability; Upper Resp. Inf., Upper Resp. Inf.; Rash

MMRV (N=280), 40 AEs
  No multiplicity adjustment: FDR ~ 51%
  DFDR settings (15%, 10%, 5%): α1=.07, α=.15; α1=.05, α=.10; [garbled]
  Flagged AEs, as listed across the columns: Irritability, Irritability, Irritability, Irritability; Rash, Rash; M/R-like rash, M/R-like rash; Diarrhea, Diarrhea

COMVAX (N=811), 58 AEs
  No multiplicity adjustment: FDR ~ 87%
  DFDR settings (15%, 10%, 5%): [garbled]
  Flagged AEs, as listed across the columns: Erythema; Rash; Rhinorrhea

Concluding Remarks
Current approach of flagging AEs based on unadjusted p-values (or C.I.s) can result in excessive false positive safety findings. These can cause undue concern for approval/labeling, and can affect post-marketing commitments.
Under our proposal, the unadjusted p-values (or C.I.s) would still be reported. The Double FDR multiplicity adjustment is a method to facilitate the interpretation of the unadjusted p-values.

Concluding Remarks (cont’d)
Our proposal for tackling multiplicity will:
  – substantially reduce the percentage of incorrectly flagged AEs.
  – be better accepted if described a priori in the protocol/DAP rather than on a post-hoc basis.
  – facilitate comparable interpretation of safety results across studies, with respect to Type I error.
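The FDR-adjusted p-values defined on the False Discovery Rate (cont’d) slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `bh_adjust` is hypothetical, and the inputs are the rounded p-values printed on the slides.

```python
from typing import List


def bh_adjust(pvals: List[float]) -> List[float]:
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(pvals)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda k: pvals[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Step down from the largest p-value:
    # p~_(m) = p_(m), p~_(j) = min(p~_(j+1), (m / j) * p_(j)).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Four-value example from the FDR slide.
example = bh_adjust([0.0193, 0.0280, 0.2038, 0.4941])
print([round(p, 4) for p in example])

# Second-level adjustment within body system 08 (irritability, crying, insomnia).
nervous = bh_adjust([0.0025, 0.4998, 1.0000])
print([round(p, 4) for p in nervous])
```

Because the inputs are rounded to four decimals, the third adjusted value in the first example comes out as .2717 rather than the .2718 shown on the slide; the slide value was presumably computed from unrounded p-values.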
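The Double FDR flagging rule can be sketched as follows, using the unadjusted p-values from the Clinical AE Counts table grouped by body system code. Several things here are assumptions rather than details confirmed by the slides: the names `bh_adjust`, `double_fdr`, and `MMRV` are hypothetical, the comparisons use "≤", and Step 1 (dropping very low-incidence AEs) is omitted because the first-level table on the slides counts all 40 AE types.

```python
from typing import Dict, List, Tuple


def bh_adjust(pvals: List[float]) -> List[float]:
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # step down from the largest p-value
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def double_fdr(
    pvals_by_bs: Dict[str, Dict[str, float]], alpha1: float, alpha2: float
) -> Tuple[Dict[str, float], List[Tuple[str, str]]]:
    """Return first-level adjusted p*_i per body system and the flagged (BS, AE) pairs."""
    systems = list(pvals_by_bs)
    # First level: FDR-adjust the smallest p-value of each body system.
    p_star = [min(pvals_by_bs[bs].values()) for bs in systems]
    p_star_adj = dict(zip(systems, bh_adjust(p_star)))
    flagged = []
    for bs in systems:
        names = list(pvals_by_bs[bs])
        # Second level: FDR-adjust the p-values within this body system.
        within_adj = bh_adjust([pvals_by_bs[bs][name] for name in names])
        for name, p_adj in zip(names, within_adj):
            # Assumed form of the flagging rule: both adjusted p-values must pass.
            if p_star_adj[bs] <= alpha1 and p_adj <= alpha2:
                flagged.append((bs, name))
    return p_star_adj, flagged


# Unadjusted p-values from the Clinical AE Counts table, keyed by body system code.
MMRV = {
    "01": {"Asthenia/fatigue": 0.1673, "Fever": 0.5606, "Infection, fungal": 0.4998,
           "Infection, viral": 0.6248, "Malaise": 0.5248},
    "03": {"Anorexia": 0.1791, "Candidiasis, oral": 0.4998, "Constipation": 0.4998,
           "Diarrhea": 0.0289, "Gastroenteritis, infectious": 0.6248,
           "Nausea": 0.0889, "Vomiting": 0.7295},
    "05": {"Lymphadenopathy": 1.0},
    "06": {"Dehydration": 0.2214},
    "08": {"Crying": 0.4998, "Insomnia": 1.0, "Irritability": 0.0025},
    "09": {"Bronchitis": 0.3746, "Congestion, nasal": 0.6872,
           "Congestion, respiratory": 0.6033, "Cough": 0.4969,
           "Infection, respiratory, upper": 0.4308, "Laryngotracheobronchitis": 1.0,
           "Pharyngitis": 0.4969, "Rhinorrhea": 1.0, "Sinusitis": 0.6248,
           "Tonsillitis": 1.0, "Wheezing": 0.6248},
    "10": {"Bite/sting, non-venomous": 0.1248, "Eczema": 0.4998, "Pruritus": 1.0,
           "Rash": 0.0209, "Rash, diaper": 0.2885, "Rash, measles/rubella-like": 0.0388,
           "Rash, varicella-like": 0.6872, "Urticaria": 0.2214, "Viral exanthema": 0.6033},
    "11": {"Conjunctivitis": 0.2214, "Otitis media": 0.7109, "Otorrhea": 1.0},
}

first_level, flagged = double_fdr(MMRV, alpha1=0.05, alpha2=0.10)
print({bs: round(p, 4) for bs, p in first_level.items()})
print(flagged)
```

Run on these p-values, the first-level adjusted values should agree with the First Level FDR Adjustment table, and with (α1, α2) = (.05, .10) the sketch flags only irritability in body system 08. Choosing α1 by resampling, as the slides propose, would additionally require subject-level data and is not attempted here.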