Study Design and Hypothesis Testing in Clinical Research
Jonathan J. Shuster, Ph.D. (jshuster@biostat.ufl.edu)
Research Professor of Biostatistics
Univ. of Florida, College of Medicine
[Slide 1]

Take-home Messages
• Rely on Evidence-Based Medicine. Conventional wisdom can easily lead us astray.
• The objective of Statistics is to make informed inferences about a population, based on a sample. It is imperative to quantify the uncertainty.
• The P-value is a quantity that allows us to infer something about whether a scientific hypothesis is false.
• Non-significant results are inconclusive.
• Randomization and intent-to-treat are vital components in sound clinical research.
[Slide 2]

[Slide 3]

Topics
1. Motivating Evidence-Based Clinical Studies
2. Objective of Statistics
3. Hypothesis testing and P-values
4. Real Examples and their lessons
[Slide 4]

[Slide 5]

1. Motivating Evidence-Based Medicine
• A coin is “loaded”, with a 70% chance of landing heads. One player picks a three-outcome sequence (e.g. HTH), then the other picks a different sequence. The coin is flipped repeatedly, and the player whose sequence comes up first wins.
• Do you want to choose first, and if so, what sequence do you select?
[Slide 6]

Evidence-Based Medicine
• So you decided to go first and pick HHH, right?
• OK, I pick THH.
• HHH can only occur before THH if it is on the first three flips. (If the first time HHH occurs is flips 6, 7, 8, then flip 5 is T, so flips 5, 6, 7 are THH and I win. I make your first two my last two, so I tend to stay ahead.)
• Your chance of winning = 0.7^3 = 0.343 (34.3%)
[Slide 7]

Evidence-Based Medicine
• Lesson from this example.
• Things are not always what they seem. You need to be a healthy skeptic.
• Reference: Shuster, J. A two-player coin game paradox in the classroom. American Statistician, 2006 (Feb), vol 60, pp 68-70.
[Slide 8]

[Slide 9]

2. Objective of Statistics
• To make an inference about a defined target population from a representative sample.
• That is, for us, to start from a medical hypothesis about a medical condition, help design a study that can collect data to test the hypothesis, and draw conclusions. Quantifying the uncertainty about the inference is a key part.
[Slide 10]

2. Comment on This
• Should we compare treatment groups statistically in a randomized study with respect to baseline parameters (e.g. age, gender, ethnicity, blood pressure)?
[Slide 11]

2. Provenzano: Clin J Am Soc Nephrol 4, 386-93, 2009
• “Baseline characteristics were similar except for more men in the oral iron group compared with the ferumoxytol group (62.9% versus 50.0%, P = 0.04). Mean baseline laboratory measures were similar between the two treatment groups.”
[Slide 12]

2. Comment on This
• For hypothesis-driven research, should we test for normality before using a t-test, and if we reject, try to transform the data?
[Slide 13]

Nissen Article
• JAMA. 2008;299(13):1561-1573. Comparison of Pioglitazone vs Glimepiride on Progression of Coronary Atherosclerosis in Patients With Type 2 Diabetes
• ‘For continuous variables with a normal distribution, the mean and 95% confidence intervals (CIs) are reported. For variables not normally distributed, median and interquartile ranges are reported and 95% CIs around median changes were computed using bootstrap resampling.’ (N=273 vs 270 in groups)
[Slide 14]

2. Testing Assumptions
[Diagram: Diagnostic Test, with branches Passes and Fails]
[Slide 15]

[Slide 16]

3. Testing a Hypothesis (P-Value)
• Put a statement on trial: the “Null Hypothesis”
• ISIS #2 (Second International Study of Infarct Survival): The five-week mortality rates for Streptokinase and Placebo are equivalent in patients with recent MIs
• Results: Strep (791/8592 = 9.2%) vs. Plac (1029/8595 = 12.0%)
[Slide 17]

3. P-Value
• P = 3.8 × 10^-9
• If you replicated the experiment in a population where the null hypothesis was true, there is a 3.8 in a billion chance of seeing a difference at least as extreme in either direction (2-sided)
[Slide 18]

3. ISIS #2 Reference
• ISIS #2 Collaborative Group.
  (1988) Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of acute myocardial infarction: ISIS-2. Lancet 2: 349-360.
[Slide 19]

3. P-Value and Proof by Contradiction
• What is the probability that, if you replicated your experiment in a target population where your null hypothesis is true, you would see differences at least as extreme as what you actually observed? If this value (the P-value) is small, it is evidence against the null hypothesis.
• The analogy is “beyond a reasonable doubt”. Science arbitrarily uses 5% as “reasonable” doubt in most cases.
[Slide 20]

3. Was this overkill in terms of sample size?
• Suppose the results were 79/859 vs. 103/860 (same percentages of 9.2% vs. 12.0% but with one tenth the sample size).
• Now P = 0.071 (7.1%), which would not be statistically significant. Would we be using this clot buster today? It was the biostatistician Sir Richard Peto who determined this sample size.
[Slide 21]

3. ISIS #2:
• Any other questions about the study?
[Slide 22]

3. ISIS #2 Issues
• Who was watching the store? Accrual took 3.5 years and outcome was known for each patient within five weeks.
• Always report a sample size justification in your papers (Provenzano, slide 12, did not).
[Slide 23]

4. Real Example
• Coronary Drug Project
[Slide 24]

The Coronary Drug Project Research Group (1980)
• Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. NEJM 303: 1038-1041.
• Double-blind randomized study of Clofibrate vs. Placebo in men who had prior MI.
[Slide 25]

Compliers vs. Not on Drug
[Figure: Coronary Drug Project bar chart of 5-year mortality (%), compliers on drug (C_Drug) vs. non-compliers on drug (NC_Drug)]
[Slide 26]

Compliers vs. Not
[Slide 27]

Drug vs. Placebo
[Slide 28]

Coronary Drug Project Take-home Message
What can this study teach us about Clinical Studies?
[Slide 29]

Intent-to-Treat
• The gold standard for analyzing randomized clinical trials is Intent-to-treat. Patients are analyzed in the groups they were assigned to, irrespective of what they actually received.
[Slide 30]

[Slide 31]

4. Real UF Example:
• Effectiveness of Nesiritide on Dialysis or All-Cause Mortality in Patients Undergoing Cardiothoracic Surgery. Clinical Cardiology. 2006 Jan;29(1):18-24. With T. Beaver et al.
• Motivation: The impression at Shands was that it was harmful and costly.
[Slide 32]

4. Nesiritide Example
• Study Null Hypothesis: The 20-day death/dialysis rate in patients getting nesiritide within two days of surgery is the same as in “similar” patients not getting it.
• Design Suggestions?
[Slide 33]

4. Possible Designs (+/-)
• Observational: Historical Control. Compare the period before the drug to the period after the drug started to be given to a sizable fraction of patients (leaving a gap while use was ramping up). Must include all comers and use electronic chart review.
• Observational: Compare those getting the drug to those not getting it.
• Randomized controlled prospective trial
[Slide 34]

4. Sources of Variation
• Within treatments, why might we not get the same result for every patient?
• Historical Control?
• Comparing concurrent nesiritide vs. not?
• Randomized prospective trial?
[Slide 35]

4. Sources of Bias (Confounders)
• Why might we see differences that might be totally unrelated to the treatment (nesiritide vs. not)?
• Historical Control?
• Comparing concurrent nesiritide vs. not?
• Randomized prospective trial?
[Slide 36]

4. Nesiritide: Propensity Scoring
• Actual Design: Compared Nesiritide vs. Not by Propensity Score Matching.
• Using 12 key covariates, we estimated the probability that a patient would get Nesiritide given these covariates. Then we matched the nesiritide patients to non-nesiritide patients on the propensity, and did a matched analysis.
[Slide 37]

4. Conclusions
• Nesiritide showed no significant difference (inconclusive) within CABG patients.
• Nesiritide showed promise in aneurysm subjects with elevated baseline SCR, but was inconclusive in other such patients.
• Run a future randomized double-blind trial in aneurysms with elevated SCR. (Just completed and close to being in press with an inconclusive result.)
[Slide 38]

4. Conclusion (continued)
• Note that the Shands study data were very important in designing the randomized follow-up study, in terms of the number of subjects needed (power analysis).
[Slide 39]

Take-home Messages
• Rely on Evidence-Based Medicine. Conventional wisdom can easily lead us astray.
• The objective of Statistics is to make informed inferences about a population, based on a sample. It is imperative to quantify the uncertainty.
• The P-value is a quantity that allows us to infer something about whether a scientific hypothesis is false.
• Non-significant results are inconclusive.
• Randomization and intent-to-treat are vital components in sound clinical research.
[Slide 40]

Design One Together
• Medical Question: Does Caffeine Withdrawal cause Headaches?
[Slide 41]

Eligibility
[Slide 42]

Design
• What are the sources of variation besides caffeine consumption?
• How do we control caffeine consumption?
• Should we use deception (hide the purpose of the study)? Is this ethical?
[Slide 43]

Design
• Pre-Post?
• Double Blind Parallel Study?
• Double Blind Crossover Study?
[Slide 44]

Forensics for Irregularity
Phenylephrine
[Slide 45]

Phenylephrine Crossover Studies
[Slide 46]

Phenylephrine (Baseline NAR)
Study (10 mg vs Placebo)   Std Dev   CV = 100·SD/Mean
1 (N=16) (EB)              2.0       15.3%
2 (N=10) (EB)              0.9        6.7%
3 (N=16)                   7.8       36.3%
4 (N=15)                   9.5       35.6%
5 (N=16)                   6.2       29.3%
6 (N=16)                   9.8       40.4%
7 (N=14)                   9.4       35.3%
[Slide 47]

How do we test for Data Irregularities?
• Background: Baseline NAR (Nasal Airway Resistance) measures are typically xx.x (e.g. 20.2), and are always based on the mean of 10 observations (5 from each nostril).
• What null hypothesis can we test to find potential irregularities? What P-value might we use to declare significance?
[Slide 48]

Baseline Last Digit (3rd significant digit)
Last digit   Study 1   Study 2
0             2         5
1             4         2
2             2         1
3             6         9
4             2         4
5            23         7
6             8         5
7             9        10
8             3         3
9             5         4
[Slide 49]

• Thank You!!
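A possible way to work the last-digit question from the data-irregularity slides: one natural candidate null hypothesis (not stated in the slides) is that the final digit of a baseline NAR value is equally likely to be 0 through 9. The sketch below runs a chi-square goodness-of-fit test on that assumption. The counts are a reading of the flattened last-digit table (Study 1 and Study 2 columns), and the function names `chi_square_sf` and `last_digit_test` are hypothetical helpers written for this illustration.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df."""
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1  # exact tail for df = 1
    else:
        q, k = math.exp(-half), 2             # exact tail for df = 2
    # Step up two degrees of freedom at a time:
    # Q(k + 2) = Q(k) + half**(k/2) * exp(-half) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def last_digit_test(counts):
    """Chi-square test of the assumed null: all last digits equally likely."""
    n = sum(counts)
    expected = n / len(counts)
    stat = sum((obs - expected) ** 2 / expected for obs in counts)
    return stat, chi_square_sf(stat, len(counts) - 1)


# Counts of last digits 0..9, as read from the flattened table (an assumption).
study1 = [2, 4, 2, 6, 2, 23, 8, 9, 3, 5]
study2 = [5, 2, 1, 9, 4, 7, 5, 10, 3, 4]

stat1, p1 = last_digit_test(study1)
stat2, p2 = last_digit_test(study2)
print(f"Study 1: chi-square = {stat1:.2f}, df = 9, P = {p1:.3g}")
print(f"Study 2: chi-square = {stat2:.2f}, df = 9, P = {p2:.3g}")
```

Under this reading, Study 1 (with 23 of 64 values ending in 5) departs sharply from uniform digits, while Study 2 does not reach the 5% level. Which P-value threshold is appropriate for a forensic screen is the open question posed on the slide; the code does not settle it.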
[Slide 50]

Coronary Drug Project Data
Five-Year Mortality (Clofibrate)
• Compliers: 15.0% (15.7%) (N=708)
• Non-Compliers: 24.6% (22.5%) (N=357)
• Compliers took >80% of their meds to death or to 5 years, whichever was first.
• In () is 5-year mortality, adjusted for prognostic factors.
[Slide 51]

Coronary Drug Project
Five-Year Mortality (Placebo)
• Compliers: 15.1% (16.4%) (N=1813)
• Non-Compliers: 28.2% (25.8%) (N=882)
• Compliers took >80% of their meds to death or to 5 years, whichever was first.
• In () is 5-year mortality, adjusted for prognostic factors.
[Slide 52]

Coronary Drug Project
Five-Year Mortality (As Randomized)
• Clofibrate: 20.0% (N=1103)
• Placebo: 20.9% (N=2789)
• NB: Compliance could not be assessed in a small number of patients.
[Slide 53]
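The Coronary Drug Project figures above can be compared with a simple two-sided, two-proportion z-test. This is a minimal sketch, not the analysis used in the NEJM paper: it works from the rounded unadjusted percentages and group sizes on these slides, so the results are approximate, and the helper name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for equal proportions, using the pooled variance."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# (proportion, N) pairs taken from the slides; percentages are rounded there.
comparisons = {
    "Clofibrate: compliers vs non-compliers": (0.150, 708, 0.246, 357),
    "Placebo: compliers vs non-compliers": (0.151, 1813, 0.282, 882),
    "As randomized: clofibrate vs placebo": (0.200, 1103, 0.209, 2789),
}

results = {name: two_proportion_z(*args) for name, args in comparisons.items()}
for name, (z, p) in results.items():
    print(f"{name}: z = {z:.2f}, two-sided P = {p:.3g}")
```

On these inputs, compliers and non-compliers differ strongly in both arms, including placebo, so good compliance marks a healthier patient rather than a drug effect. The as-randomized (intent-to-treat) comparison is far from significant, which by the take-home messages is inconclusive rather than proof of no effect.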