An Introduction to Educational Research Statistics
Graham McMahon MD MMSc
gmcmahon@partners.org

Course Overview
Last week:
• Stages of a trial from design to completion
• Generating hypotheses
• Working with the IRB
• Considering the funding required
• Trial designs
Today:
• Choosing an outcome variable
• Powering your study
• Establishing inter-rater reliability
• Determining if there is a difference between two groups
• Test development
• Qualitative approaches

Stages of an Educational Interventional Trial
Stage                    Activities
1  Initial Design        Hypothesis, size
2  Protocol Design       Define methods, collaborations, IRB
3  Recruitment           Subject acquisition, monitoring
4  Followup              Collect outcome data
5  Analysis              Prepare “Clean + Locked” database; perform analysis
6  Reporting             Write and submit manuscript
7  Additional analyses   Further explorations of trial data

Population & Sampling
Must balance:
• Variability [the smaller or more diverse the population, the more variable; variability creates error]
• Generalizability [population can’t be too specific]
• Access [you can only study those you have access to]
• Cost [larger studies are much more expensive]
Consider:
• Participation rate
• Multiple sites
• Online projects
• Lower reimbursement

Outcome
What is really important? What would colleagues care about?
• ‘Hard’ outcomes: death, attendance
• ‘Soft’ outcomes: satisfaction, self-confidence

Outcomes / Endpoints
• Primary Outcome: what you power your study on
• Secondary Outcome: other related outcomes that may be interesting to test
• Exploratory Outcomes: association studies and subgroups that may be interesting, but are likely to be underpowered; may serve as pilot data for future studies
• Surrogate Endpoint: in the causal pathway and affected by the intervention

Group Activity
Medical errors and patient safety continue to be an important concern for patients and physicians. Numerous reports have suggested that fatigue and sleepiness contribute to medical errors.
You are the program director in an internal medicine residency that has 40 residents and want to make a contribution in this area. List a hypothesis that could be generated based on this reflection.

How would you measure sleepiness?

You review the available sleepiness scales and must choose one. Which one is best?

Scale                              A Awake index   B Sleepy score   C Doze Index    D Snory scale   E Yawn score
Scale Size                         8               100              20              60              12
Mean Rating for Residents          6               72               15              30              5
Standard Deviation for Residents   5               20               4               9               3
Distribution                       Mean > Median   Mean = Median    Mean = Median   Mean < Median   Mean = Median
Expected Score Difference          3               14               5               10              4

Power and Error
• α is the probability of making a Type I error
• Power is the likelihood of avoiding a Type II error
• Use trial type, α and power to calculate sample size

Sample Size Calculations

Calculating Sample Size
Effect size:
• A 0.3 SD difference between groups with power of 0.8 requires 300-400 subjects in total
• A 1 SD difference between groups with power of 0.8 requires 30-40 subjects in total

Simple Calculation
N (per group) = 15.8 / (effect size)^2, for power of 80% and α = 0.05
Remember to increase enrollment so that the number completing ≥ expected sample size

You review the available sleepiness scales and must choose one. Which one is best?
Scale                              A Awake index   B Sleepy score   C Doze Index    D Snory scale   E Yawn score
Scale Size                         8               100              20              60              12
Mean Rating for Residents          6               72               15              30              7
Standard Deviation for Residents   5               20               4               9               3
Distribution                       Mean > Median   Mean = Median    Mean = Median   Mean < Median   Mean = Median
Expected Score Difference          3               14               5               10              4

Effect size = score difference / standard deviation

Power and Sample Sizes
Scale                              A Awake index   B Sleepy score   C Doze Index    D Snory scale   E Yawn score
Scale Size                         8               100              20              60              12
Mean Rating for Residents          6               72               15              30              5
Standard Deviation for Residents   5               20               4               9               3
Distribution                       Mean > Median   Mean = Median    Mean = Median   Mean < Median   Mean = Median
Expected Score Difference          3               14               5               10              4
N per group                        29              33               26              20              10
Power (N=15 per grp)               0.35            0.45             0.91            0.84            0.94

Calculating Sample Size using Software
• Choose test
• Difference between groups
• Standard deviation
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize

Two faculty offer to measure the sleepiness of residents using your scale. How can you find out if they are good raters?

Interrater Reliability
Interrater reliability is the extent to which two or more individuals (coders or raters) agree. Training, education and monitoring skills can enhance interrater reliability.
Goal is generally reliability > 0.8
• Categorical: percent agreement
• Ordinal: Spearman rho
• Continuous: Pearson r

Rater 1   1  2  3  4  5  6  7  8
Rater 2   2  1  3  4  6  8  7  5
Pearson r = 0.81

Rater 1   3  3  5  5  7  7  9  9
Rater 2   5  4  3  6  5  3  8  7
Pearson r = 0.56

Analyzing your Data
• Plan your analysis
• Consider consulting a specialist
• Test for normality
• Choose the right test
• Avoid statistical explorations with the data

You start your study and find that among the interns the M:F ratio was 12:5 in one group and 8:9 in the other, and wonder if the groups are statistically unbalanced.

Categorical Counts
Chi-square statistic: no cell in the table should have an expected frequency of <1, and no more than 20% of the cells should have an expected frequency of <5.
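The expected-frequency rule can be checked directly. Below is a minimal Python sketch applied to the 12:5 and 8:9 male-to-female counts mentioned above. The function names and the optional continuity (Yates) correction are illustrative assumptions, not part of the slides.

```python
# 2x2 table of counts: rows = men, women; columns = group 1, group 2
observed = [[12, 8],
            [5, 9]]


def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def expected_counts_ok(table):
    """Rule of thumb: no expected count < 1, and at most 20% of cells < 5."""
    cells = [e for row in expected_counts(table) for e in row]
    small = sum(1 for e in cells if e < 5)
    return min(cells) >= 1 and small / len(cells) <= 0.20


def chi_square(table, yates=False):
    """Chi-square statistic; yates=True applies the continuity correction (an assumption here)."""
    stat = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    return stat


print(expected_counts(observed))
print(expected_counts_ok(observed))
print(round(chi_square(observed), 2), round(chi_square(observed, yates=True), 2))
```

For this table every expected count is 7 or more, so the chi-square approximation is reasonable by the rule above. The statistic is about 1.94 uncorrected and about 1.09 with the continuity correction.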
Use Fisher’s exact test when numbers are small

         Group 1   Group 2
Men      12        8
Women    5         9
Chi-square = 1.1; Fisher exact, p = 0.29

You collect your baseline observations and find the following sleepiness in each group. Are they different?
Grp 1: 8, 6, 5, 2, 3, 9, 11, 6, 11
Grp 2: 3, 5, 5, 2, 7, 4, 8, 10, 2

Summary of Tests
Type of Data   Two Paired Groups   Two Independent Groups   Many Independent Groups   Correlation
Categories     McNemar             Chi-square               Chi-square
Continuous     Paired t-test       t-test                   ANOVA                     Pearson r
Rank           Wilcoxon            Wilcoxon                 Kruskal-Wallis            Spearman r
Test for Normality!

t-test
• Comparing two means
• Check if paired or unpaired
• The more standard errors (SEs) the difference is away from zero, the less likely it is that the difference occurred by chance

                     Had No Elective   Elective
Number of students   145               48
Mean Score           76%               64%
SD                   12                11

Testing difference between two groups over time
• t-test on between-group difference at end
• t-test on change over time
[Figure: Time 1, Time 2]

Statistical Tests for Skewed or Rank Data
• These data don’t follow normal rules
• Non-parametric tests are less powerful
• Two groups: Wilcoxon rank sum (= Mann-Whitney U)
• Three or more groups: Kruskal-Wallis

Wilcoxon Rank Sum
• Rank all observations in increasing order of magnitude, ignoring which group they come from.
• Add up the ranks in the smaller of the two groups.
• Look up the critical value of the sum of ranks for that size group.

Summary
• Careful choice of your population will improve your chances of finding an effect
• Choose your outcome measure thoughtfully
• Estimate your power and sample size in advance
• Ensure internal consistency is good
• Determine normality and analyze your dataset accordingly
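As a worked illustration of the baseline sleepiness question and the Wilcoxon rank sum steps, here is a minimal Python sketch. It computes an unpaired t statistic and the rank sums for the two groups of baseline scores. The function names, the pooled-variance form of the t-test, and the use of average ranks for tied values are assumptions made for this example. The slides do not specify them.

```python
from math import sqrt
from statistics import mean, variance

# Baseline sleepiness scores for the two groups, as listed in the slides.
group1 = [8, 6, 5, 2, 3, 9, 11, 6, 11]
group2 = [3, 5, 5, 2, 7, 4, 8, 10, 2]


def pooled_t(a, b):
    """Unpaired t statistic using a pooled variance estimate (assumed form)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    standard_error = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error


def rank_sums(a, b):
    """Rank all observations together, ignoring group; tied values share the average rank."""
    combined = sorted(a + b)
    average_rank = {}
    for value in set(combined):
        first_position = combined.index(value) + 1  # 1-based rank of first occurrence
        count = combined.count(value)
        average_rank[value] = first_position + (count - 1) / 2
    return sum(average_rank[x] for x in a), sum(average_rank[x] for x in b)


t_stat = pooled_t(group1, group2)
w1, w2 = rank_sums(group1, group2)
print(f"t = {t_stat:.2f} on {len(group1) + len(group2) - 2} degrees of freedom")
print(f"rank sums: group 1 = {w1}, group 2 = {w2}")
```

Both groups have nine observations here, so either rank sum can be taken to the table of critical values. The t statistic of about 1.18 on 16 degrees of freedom is below the two-sided 5% critical value of about 2.12, so this sketch would not call the baseline difference significant.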