Educational Experimentation

An Introduction to
Educational Research Statistics
Graham McMahon MD MMSc
gmcmahon@partners.org
Course Overview

Last week:
• Stages of a trial from design to completion
• Generating hypotheses
• Working with the IRB
• Considering the funding required
• Trial designs

Today:
• Choosing an outcome variable
• Powering your study
• Establishing inter-rater reliability
• Determining if there is a difference between two groups
• Test development
• Qualitative approaches

Stages of an Educational Interventional Trial

Stage                     Activities
1  Initial Design         Hypothesis, size
2  Protocol Design        Define methods, collaborations, IRB
3  Recruitment            Subject acquisition, monitoring
4  Follow-up              Collect outcome data
5  Analysis               Prepare “clean + locked” database; perform analysis
6  Reporting              Write and submit manuscript
7  Additional analyses    Further explorations of trial data

Population & Sampling

Must balance:
• Variability [the smaller or more diverse the population, the more variable; variability creates error]
• Generalizability [population can’t be too specific]
• Access [you can only study those you have access to]
• Cost [larger studies are much more expensive]

Consider:
• Participation rate
• Multiple sites
• Online projects
• Lower reimbursement

Outcome
• What is really important?
• What would colleagues care about?
• ‘Hard’ outcomes: death, attendance
• ‘Soft’ outcomes: satisfaction, self-confidence

Outcomes / Endpoints
• Primary Outcome
  - What you power your study on
• Secondary Outcome
  - Other related outcomes that may be interesting to test
• Exploratory Outcomes
  - Association studies, subgroups that may be interesting, but likely to be underpowered
  - May serve as pilot data for future studies
• Surrogate Endpoint
  - In the causal pathway and affected by the intervention

Group Activity

Medical errors and patient safety continue to be an important concern for patients and physicians. Numerous reports have suggested that fatigue and sleepiness contribute to medical errors. You are the program director of an internal medicine residency that has 40 residents, and you want to make a contribution in this area.
• List a hypothesis that could be generated based on this reflection.
• How would you measure sleepiness?

You review the available sleepiness scales and must choose one. Which one is best?

                                  A             B              C             D             E
                                  Awake index   Sleepy score   Doze Index    Snory scale   Yawn score
Scale size                        8             100            20            60            12
Mean rating for residents         6             72             15            30            5
Standard deviation for residents  5             20             4             9             3
Distribution                      Mean>Median   Mean=Median    Mean=Median   Mean<Median   Mean=Median
Expected score difference         3             14             5             10            4

Power and Error
• α is the probability of making a Type I error (finding a difference that is not really there)
• Power (1 − β) is the likelihood of avoiding a Type II error (missing a difference that really is there)
• Use trial type, α and power to calculate sample size

Sample Size Calculations
Calculating Sample Size
Effect Size
• A 0.3 SD difference between groups with power of 0.8 requires 300-400 subjects
• A 1 SD difference between groups with power of 0.8 requires 30-40 subjects

Simple Calculation
• N (per group) = 15.8 / (effect size)² for power of 80% and α = 0.05
• Remember to increase enrollment so that the number completing ≥ expected sample size

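The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the 85% completion rate is an assumed figure, not one from the course.

```python
import math


def n_per_group(effect_size: float) -> int:
    """Rule-of-thumb N per group for 80% power and alpha = 0.05."""
    if effect_size <= 0:
        raise ValueError("effect size must be positive")
    # 15.8 is the constant quoted on the slide.
    return math.ceil(15.8 / effect_size ** 2)


def n_to_enroll(n_required: int, completion_rate: float) -> int:
    """Inflate enrollment so the expected number completing meets n_required."""
    if not 0 < completion_rate <= 1:
        raise ValueError("completion rate must be in (0, 1]")
    return math.ceil(n_required / completion_rate)


if __name__ == "__main__":
    for es in (0.3, 1.0):
        n = n_per_group(es)
        # 0.85 is an assumed completion rate, chosen only for illustration.
        print(f"effect size {es}: {n} per group, {2 * n} in total, "
              f"enroll {n_to_enroll(n, 0.85)} per group")
```

For effect sizes of 0.3 and 1 SD the totals fall inside the 300-400 and 30-40 subject ranges quoted earlier.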
You review the available sleepiness scales and must choose one. Which one is best?

                                  A             B              C             D             E
                                  Awake index   Sleepy score   Doze Index    Snory scale   Yawn score
Scale size                        8             100            20            60            12
Mean rating for residents         6             72             15            30            7
Standard deviation for residents  5             20             4             9             3
Distribution                      Mean>Median   Mean=Median    Mean=Median   Mean<Median   Mean=Median
Expected score difference         3             14             5             10            4

Effect size = score difference / standard deviation

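As a worked example, the effect size for each candidate scale can be computed from the expected score difference and standard deviation in the table. The dictionary layout and function name below are illustrative choices only.

```python
def effect_size(score_difference: float, standard_deviation: float) -> float:
    """Standardized effect size: expected difference divided by the SD."""
    if standard_deviation <= 0:
        raise ValueError("standard deviation must be positive")
    return score_difference / standard_deviation


# (expected score difference, standard deviation) taken from the table.
SCALES = {
    "A Awake index": (3, 5),
    "B Sleepy score": (14, 20),
    "C Doze Index": (5, 4),
    "D Snory scale": (10, 9),
    "E Yawn score": (4, 3),
}

if __name__ == "__main__":
    for name, (diff, sd) in SCALES.items():
        print(f"{name}: effect size = {effect_size(diff, sd):.2f}")
```

A larger effect size means fewer subjects are needed, which is one input to the choice of scale alongside the shape of its distribution.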
Power and Sample Sizes

                                  A             B              C             D             E
                                  Awake index   Sleepy score   Doze Index    Snory scale   Yawn score
Scale size                        8             100            20            60            12
Mean rating for residents         6             72             15            30            5
Standard deviation for residents  5             20             4             9             3
Distribution                      Mean>Median   Mean=Median    Mean=Median   Mean<Median   Mean=Median
Expected score difference         3             14             5             10            4
N per group                       29            33             26            20            10
Power (N=15 per group)            0.35          0.45           0.91          0.84          0.94

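Power for a fixed group size can be approximated without special software. The sketch below uses a normal approximation to the two-sample comparison, which is an assumption made here for simplicity; dedicated power software typically uses the t distribution, so its figures will differ somewhat, especially at small N. The function name is illustrative.

```python
import math
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation: the test statistic is treated as normal with
    mean effect_size * sqrt(n / 2) and unit variance.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Effect sizes = expected score difference / SD for scales A to E.
    for label, es in [("A", 3 / 5), ("B", 14 / 20), ("C", 5 / 4),
                      ("D", 10 / 9), ("E", 4 / 3)]:
        print(f"Scale {label}: approximate power with 15 per group = "
              f"{approx_power(es, 15):.2f}")
```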
Calculating Sample Size using Software
• Choose test
• Difference between groups
• Standard deviation
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize

Two faculty offer to measure the sleepiness of residents using your scale. How can you find out if they are good raters?

Interrater Reliability
• Interrater reliability is the extent to which two or more individuals (coders or raters) agree.
• Training, education and monitoring skills can enhance interrater reliability.
• Goal is generally reliability > 0.8
  - Categorical: percent agreement
  - Ordinal: Spearman's rho
  - Continuous: Pearson's r

Example 1 (Pearson r = 0.81)
Rater 1: 1, 2, 3, 4, 5, 6, 7, 8
Rater 2: 2, 1, 3, 4, 6, 8, 7, 5

Example 2 (Pearson r = 0.56)
Rater 1: 3, 3, 5, 5, 7, 7, 9, 9
Rater 2: 5, 4, 3, 6, 5, 3, 8, 7

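The two Pearson coefficients can be checked by hand. The sketch below implements the standard Pearson formula in plain Python; the function name is illustrative, and the ratings are the two rater pairs shown on the slide.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 ratings")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    pair_1 = ([1, 2, 3, 4, 5, 6, 7, 8], [2, 1, 3, 4, 6, 8, 7, 5])
    pair_2 = ([3, 3, 5, 5, 7, 7, 9, 9], [5, 4, 3, 6, 5, 3, 8, 7])
    print(f"Example 1: r = {pearson_r(*pair_1):.2f}")
    print(f"Example 2: r = {pearson_r(*pair_2):.2f}")
```

Only the first pair clears the reliability goal of 0.8.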
Analyzing your Data
• Plan your analysis
• Consider consulting a specialist
• Test for normality
• Choose the right test
• Avoid unplanned statistical explorations of the data

You start your study and find that the M:F ratio among the interns was 12:5 in one group and 8:9 in the other, and wonder if the groups are statistically unbalanced.

Categorical Counts
• Chi-square statistic: no cell in the table should have an expected frequency of <1, and no more than 20% of the cells should have an expected frequency of <5.
• Use Fisher’s exact test when numbers are small

         Group 1   Group 2
Men      12        8
Women    5         9

Chi-square = 1.1
Fisher exact, p=0.29
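Both statistics for a 2x2 table can be sketched with the standard library alone. Two assumptions are made here: the chi-square uses the Yates continuity correction (with it, the statistic for this table comes out at about 1.1), and the Fisher p-value is two-sided, summing every table as likely as or less likely than the observed one. Function names are illustrative.

```python
from math import comb


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                # Continuity correction (an assumption, see lead-in).
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_observed = prob(a)
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


if __name__ == "__main__":
    # Men: 12 vs 8; women: 5 vs 9.
    print(f"chi-square (Yates) = {chi_square_2x2(12, 8, 5, 9):.2f}")
    print(f"Fisher exact p = {fisher_exact_two_sided(12, 8, 5, 9):.3f}")
```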

You collect your baseline observations and find the following sleepiness in each group. Are they different?

Grp 1: 8, 6, 5, 2, 3, 9, 11, 6, 11
Grp 2: 3, 5, 5, 2, 7, 4, 8, 10, 2
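One way to approach the question, assuming the scores are roughly normal and the groups are independent, is an unpaired t-test with pooled variance. The sketch below computes the t statistic by hand; the function name is illustrative, and the critical value 2.120 is the standard two-sided 5% value for 16 degrees of freedom from t tables.

```python
import math
from statistics import mean, variance


def pooled_t(group_1, group_2):
    """Unpaired t statistic (pooled variance) and its degrees of freedom."""
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group_1)
                  + (n2 - 1) * variance(group_2)) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group_1) - mean(group_2)) / standard_error, df


if __name__ == "__main__":
    grp_1 = [8, 6, 5, 2, 3, 9, 11, 6, 11]
    grp_2 = [3, 5, 5, 2, 7, 4, 8, 10, 2]
    t, df = pooled_t(grp_1, grp_2)
    T_CRIT_DF16 = 2.120  # two-sided 5% critical value for 16 df (t tables)
    print(f"t = {t:.2f} on {df} df")
    print("significant at 5%" if abs(t) > T_CRIT_DF16 else "not significant at 5%")
```

If the data turn out not to be normal, a rank-based test is the alternative, as the summary of tests indicates.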

Summary of Tests

Type of Data   Two Paired Groups   Two Independent Groups   Many Independent Groups   Correlation
Categories     McNemar             Chi-square               Chi-square                -
Continuous     Paired t-test       t-test                   ANOVA                     Pearson r
Rank           Wilcoxon            Wilcoxon                 Kruskal-Wallis            Spearman r

Test for normality!

t-test
• Comparing two means
• Check if paired or unpaired
• The more SEs you are away from zero, the less likely that the difference occurred by chance

                     Had elective   No elective
Number of students   145            48
Mean score           76%            64%
SD                   12             11
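The idea of counting standard errors away from zero can be illustrated with the elective example. The sketch below uses the unpooled (Welch-style) standard error of the difference between two independent means, which is a choice made for this illustration; the function name is hypothetical.

```python
import math


def standard_errors_from_zero(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Difference between two independent means, in units of its standard error."""
    # Unpooled standard error of the difference in means.
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error


if __name__ == "__main__":
    # Had elective: n = 145, mean 76, SD 12. No elective: n = 48, mean 64, SD 11.
    ratio = standard_errors_from_zero(76, 12, 145, 64, 11, 48)
    print(f"The difference is {ratio:.1f} standard errors away from zero")
```

A difference this many standard errors from zero is very unlikely to have occurred by chance.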

Testing difference between two groups over time
• t-test on between-group difference at end
• t-test on change over time
[Figure: axis labels Time 1, Time 2]

Statistical Tests for Skewed or Rank Data
• Skewed or rank data don’t follow a normal distribution
• Non-parametric tests are less powerful than their parametric counterparts
• Two groups
  - Wilcoxon rank sum (= Mann-Whitney U)
• Three or more groups
  - Kruskal-Wallis

Wilcoxon Rank Sum
• Rank all observations in increasing order of magnitude, ignoring which group they come from.
• Add up the ranks in the smaller of the two groups.
• Look up the critical value of the sum of ranks for that size group.
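The ranking steps above can be sketched as follows, using the baseline sleepiness scores from earlier as example data. Tied observations are given the average of the ranks they span, which is the usual convention but an assumption here, since the slide does not say how to treat ties. Function names are illustrative, and the critical-value lookup is left to a published table.

```python
def midranks(values):
    """Ranks in increasing order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sums(group_1, group_2):
    """Sum of pooled ranks for each group, ignoring group when ranking."""
    ranks = midranks(list(group_1) + list(group_2))
    return sum(ranks[:len(group_1)]), sum(ranks[len(group_1):])


if __name__ == "__main__":
    grp_1 = [8, 6, 5, 2, 3, 9, 11, 6, 11]
    grp_2 = [3, 5, 5, 2, 7, 4, 8, 10, 2]
    w1, w2 = rank_sums(grp_1, grp_2)
    # Both groups have 9 observations here, so either sum can be looked up.
    print(f"Rank sum group 1 = {w1}, group 2 = {w2}")
```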

Summary of Tests

Type of Data   Two Paired Groups   Two Independent Groups   Many Independent Groups   Correlation
Categories     McNemar             Chi-square               Chi-square                -
Continuous     Paired t-test       t-test                   ANOVA                     Pearson r
Rank           Wilcoxon            Wilcoxon                 Kruskal-Wallis            Spearman r

Summary
• Careful choice of your population will improve your chances of finding an effect
• Choose your outcome measure thoughtfully
• Estimate your power and sample size in advance
• Ensure internal consistency is good
• Determine normality and analyze your dataset accordingly

Graham McMahon
gmcmahon@partners.org