Statistical Inference I:
Hypothesis testing; sample size
Statistics Primer







Statistical Inference
Hypothesis testing
P-values
Type I error
Type II error
Statistical power
Sample size calculations
What is a statistic?


A statistic is any value that can be
calculated from the sample data.
Sample statistics are calculated to give
us an idea about the larger population.
Examples of statistics:

mean: The average cost of a gallon of gas in the US is $2.65.

difference in means: The difference in the average gas price in Los Angeles ($2.96) compared with Des Moines, Iowa ($2.32) is 64 cents.

proportion: 67% of high school students in the U.S. exercise regularly.

difference in proportions: The difference in the proportion of Democrats who approve of Obama (83%) versus Republicans who do (14%) is 69%.
What is a statistic?

Sample statistics are estimates of population parameters.

[Figure: A sample of 5 subjects (the observation) is drawn from a population of 100,000 people whose true mean IQ = 100 (the truth, not observable). The sample IQs are 110, 105, 96, 124, and 115, so the sample statistic is (110 + 105 + 96 + 124 + 115)/5 = 110. We use the sample statistic to make guesses about the whole population.]
What is sampling variation?




Statistics vary from sample to sample due to
random chance.
Example:
A population of 100,000 people has an
average IQ of 100 (If you actually could
measure them all!)
If you sample 5 random people from this
population, what will you get?
Sampling Variation
120  160  180  95  95
90  85  95  92  88  130

90
5
100  105  86  104  95
110  105 596  124  115  98
5
5 (not
Truth
observable)
Mean
IQ=100
 110
Sampling Variation and
Sample Size




Do you expect more or less sampling
variability in samples of 10 people?
Of 50 people?
Of 1000 people?
Of 100,000 people?
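To make the answer concrete, here is a minimal simulation sketch (my own illustration, not from the lecture; Python with numpy, and an assumed normal IQ population with mean 100 and SD 15). It draws 1000 repeated samples of each size and reports how much the sample means vary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 100,000 people, mean IQ 100, SD 15 (assumed)
population = rng.normal(loc=100, scale=15, size=100_000)

for n in [5, 10, 50, 1000]:
    # 1000 repeated samples of size n; the SD of the sample means
    # estimates the sampling variability at that sample size
    means = [rng.choice(population, size=n, replace=False).mean()
             for _ in range(1000)]
    print(f"n = {n:>4}: SD of sample means = {np.std(means):5.2f}")
```

The variability shrinks as n grows; at n = 100,000 (the entire population) it would vanish entirely.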
Sampling Distributions
Most experiments are one-shot deals. So, how do we know if
an observed effect from a single experiment is real or is just an
artifact of sampling variability (chance variation)?
This requires a priori knowledge about how sampling variability
works…
Question: Why have I made you learn about probability
distributions and about how to calculate and
manipulate expected value and variance?
Answer: Because they form the basis of describing the
distribution of a sample statistic.
Standard error


Standard Error is a measure of sampling
variability.
Standard error is the standard deviation of a
sample statistic.



It’s a theoretical quantity! What would the distribution of my statistic be if I
could repeat my experiment many times (with fixed sample size)? How
much chance variation is there?
Standard error decreases with increasing
sample size and increases with increasing
variability of the outcome (e.g., IQ).
Standard errors can be predicted by
computer simulation or mathematical theory
(formulas).

The formula for standard error is different for every type of
statistic (e.g., mean, difference in means, odds ratio).
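For the sample mean, both routes are easy to demonstrate. A small sketch (my own, assuming a normal population with SD 15) compares a simulated standard error with the textbook formula for the mean, SD/√n:

```python
import numpy as np

rng = np.random.default_rng(2)

# 10,000 simulated experiments, each sampling n = 5 IQs from N(100, 15)
means = rng.normal(loc=100, scale=15, size=(10_000, 5)).mean(axis=1)

print(f"simulated SE: {means.std():.2f}")      # ~6.7
print(f"formula SE:   {15 / np.sqrt(5):.2f}")  # 15/sqrt(5) ~ 6.7
```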
What is statistical inference?

The field of statistics provides guidance
on how to make conclusions in the face
of chance variation (sampling
variability).
Example 1: Difference in
proportions

Research Question: Are antidepressants
a risk factor for suicide attempts in
children and adolescents?
Example modified from: “Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults”; Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.

Example 1



Design: Case-control study
Methods: Researchers used Medicaid records
to compare prescription histories between
263 children and teenagers (6-18 years) who
had attempted suicide and 1241 controls who
had never attempted suicide (all subjects
suffered from depression).
Statistical question: Is a history of use of
antidepressants more common among cases
than controls?
Example 1

Statistical question: Is a history of use of
particular antidepressants more common
among suicide-attempt cases than controls?
What will we actually compare?
 Proportion of cases who used antidepressants
in the past vs. proportion of controls who did
Results

                               No. (%) of cases   No. (%) of controls
                               (n=263)            (n=1241)
Any antidepressant drug ever   120 (46%)          448 (36%)

Difference = 46% − 36% = 10%
What does a 10% difference
mean?


Before we perform any formal statistical
analysis on these data, we already have
a lot of information.
Look at the basic numbers first; THEN
consider statistical significance as a
secondary guide.
Is the association statistically
significant?


This 10% difference could reflect a true
association or it could be a fluke in this
particular sample.
The question: is 10% bigger or smaller
than the expected sampling variability?
What is hypothesis testing?

Statisticians try to answer this question
with a formal hypothesis test
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association
between antidepressant use and suicide
attempts in the target population (= the
difference is 0%)
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null
hypothesis is true—math theory (formula):
The standard error of the difference in two proportions (using the pooled proportion $\hat{p} = 568/1504$) is:

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1} + \frac{\hat{p}(1-\hat{p})}{n_2}} = \sqrt{\frac{\frac{568}{1504}\left(1-\frac{568}{1504}\right)}{263} + \frac{\frac{568}{1504}\left(1-\frac{568}{1504}\right)}{1241}} \approx .033$$
Thus, we expect to see differences between the groups as big
as about 6.6% (2 standard errors) just by chance…
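As a quick arithmetic check (my own sketch, not from the lecture), the pooled proportion is (120 + 448)/(263 + 1241) = 568/1504:

```python
from math import sqrt

# Pooled proportion under the null: (120 + 448) exposed out of (263 + 1241) total
p = 568 / 1504
n_cases, n_controls = 263, 1241

se = sqrt(p * (1 - p) / n_cases + p * (1 - p) / n_controls)
print(f"standard error = {se:.3f}")  # ~0.033, i.e., 3.3%
```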
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null
hypothesis is true—computer simulation:


In computer simulation, you simulate taking
repeated samples of the same size from the
same population and observe the sampling
variability.
I used computer simulation to take 1000 samples of 263 cases and 1241 controls, assuming the null hypothesis is true (i.e., no difference in antidepressant use between the groups).
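A minimal sketch of such a simulation (my own Python illustration; the lecture's actual simulation code isn't shown) might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Under the null, both groups share the pooled exposure probability
p_null = 568 / 1504
n_cases, n_controls = 263, 1241

diffs = []
for _ in range(1000):
    p_case = rng.binomial(n_cases, p_null) / n_cases        # simulated case proportion
    p_ctrl = rng.binomial(n_controls, p_null) / n_controls  # simulated control proportion
    diffs.append(p_case - p_ctrl)

diffs = np.array(diffs)
print(f"standard error ~ {diffs.std():.3f}")                   # ~0.033
print(f"simulated two-sided p ~ {(np.abs(diffs) >= 0.10).mean():.3f}")
```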
Computer Simulation Results

[Figure: Histogram of the differences in proportions across the 1000 simulated samples. The standard error (a measure of variability of the sample statistic) is about 3.3%.]
Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 10% between
cases and controls.
Hypothesis Testing
Step 4: Calculate a p-value
P-value=the probability of your data or
something more extreme under the null
hypothesis.
Hypothesis Testing
Step 4: Calculate a p-value—mathematical theory:
The difference in proportions follows a normal distribution, so we compare the observed difference between the groups to the standard error:

$$Z = \frac{.10}{.033} = 3.0$$

A Z-value of 3.0 corresponds to a p-value of .003.
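The same calculation in code (a sketch, assuming scipy is available):

```python
from scipy.stats import norm

z = 0.10 / 0.033          # observed difference / standard error, ~3.0
p = 2 * norm.sf(z)        # two-tailed area in the normal tails
print(f"Z = {z:.2f}, p = {p:.4f}")  # p ~ .002-.003 (the slide rounds Z to 3.0, giving .003)
```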
The p-value from computer
simulation…
When we ran this study 1000 times, we got 1 result as big or bigger than +10%, and 2 results as small or smaller than −10%.
P-value
P-value=the probability of
your data or something
more extreme under the null
hypothesis.
From our simulation, we
estimate the p-value to be:
3/1000 or .003
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null.
Alternative hypothesis: There is an association
between antidepressant use and suicide in the
target population.
What does a 10% difference
mean?



Is it “statistically significant”? YES
Is it clinically significant?
Is this a causal association?
What does a 10% difference
mean?



Is it “statistically significant”? YES
Is it clinically significant? MAYBE
Is this a causal association? MAYBE
Statistical significance does not necessarily
imply clinical significance.
Statistical significance does not necessarily
imply a cause-and-effect relationship.
What would a lack of
statistical significance mean?

If this study had sampled only 50 cases
and 50 controls, the sampling variability
would have been much higher—as
shown in this computer simulation…
[Figure: Two simulated null distributions side by side. With 263 cases and 1241 controls, the standard error is about 3.3%; with 50 cases and 50 controls, it is about 10%.]
With only 50 cases and 50 controls…

[Figure: Null distribution with standard error of about 10%. If we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time). Two-tailed p-value = 17% × 2 = 34%.]
What does a 10% difference
mean (50 cases/50 controls)?



Is it “statistically significant”? NO
Is it clinically significant? MAYBE
Is this a causal association? MAYBE
No evidence of an effect ≠ evidence of no effect.
Example 2: Difference in means

Example: Rosenthal, R. and Jacobson, L. (1966). Teachers’ expectancies: Determinants of pupils’ IQ gains. Psychological Reports, 19, 115-118.
The Experiment
(note: exact numbers have been altered)




Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).
Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as “academic bloomers” (n=18).
BUT: the children on the teachers’ lists had actually been randomly assigned to the list.
At the end of the year, the same IQ test was re-administered.
Example 2

Statistical question: Do students in the
treatment group have more improvement
in IQ than students in the control group?
What will we actually compare?
 One-year change in IQ score in the treatment
group vs. one-year change in IQ score in the
control group.
Results:

                      “Academic bloomers”   Controls
                      (n=18)                (n=72)
Change in IQ score:   12.2 (2.0) points     8.2 (2.0) points

Difference = 4 points
(The standard deviation of change scores was 2.0 in both groups. This affects statistical significance…)
What does a 4-point
difference mean?


Before we perform any formal statistical
analysis on these data, we already have
a lot of information.
Look at the basic numbers first; THEN
consider statistical significance as a
secondary guide.
Is the association statistically
significant?


This 4-point difference could reflect a
true effect or it could be a fluke.
The question: is a 4-point difference
bigger or smaller than the expected
sampling variability?
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between
“academic bloomers” and normal students (=
the difference is 0 points)
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null
hypothesis is true—math theory:
The standard error of the difference in two means is:

$$\sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}} = \sqrt{\frac{4}{18} + \frac{4}{72}} \approx 0.52$$
We expect to see differences between the groups as
big as about 1.0 (2 standard errors) just by chance…
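Checking the arithmetic (my own sketch): s = 2.0 in both groups, so s² = 4.

```python
from math import sqrt

s2 = 2.0 ** 2                  # variance of change scores in each group
n_treated, n_controls = 18, 72

se = sqrt(s2 / n_treated + s2 / n_controls)
print(f"standard error = {se:.2f}")  # ~0.52-0.53
```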
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null
hypothesis is true—computer simulation:


In computer simulation, you simulate taking
repeated samples of the same size from the
same population and observe the sampling
variability.
I used computer simulation to take 1000
samples of 18 treated and 72 controls,
assuming the null hypothesis (that the
treatment doesn’t affect IQ).
Computer Simulation Results

[Figure: Histogram of the differences in means across the 1000 simulated samples. The standard error (a measure of variability of the sample statistic) is about 0.52.]
Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 4 points between
treated and controls.
Hypothesis Testing
Step 4: Calculate a p-value
P-value=the probability of your data or
something more extreme under the null
hypothesis.
Hypothesis Testing
Step 4: Calculate a p-value—mathematical theory:
The difference in means follows a T distribution (which is very similar to a normal distribution, except with very small samples). We compare the observed difference between the groups to the standard error:

$$t_{88} = \frac{4}{.52} \approx 8$$

A t-value of 8.0 with 88 degrees of freedom corresponds to a p-value of <.0001. (A t-curve with 88 df has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).)
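In code (a sketch assuming scipy), with df = 18 + 72 − 2 = 88:

```python
from scipy.stats import t

t_stat = 4 / 0.52                  # observed difference / standard error, ~7.7
p = 2 * t.sf(t_stat, df=88)        # two-tailed
print(f"t = {t_stat:.1f}, p = {p:.1e}")  # p far below .0001
```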
Getting the P-value from
computer simulation…
If we ran this study 1000 times, we wouldn’t expect to get even 1 result as big as a difference of 4 (under the null hypothesis).
P-value
P-value=the probability of
your data or something
more extreme under the null
hypothesis.
Here, p-value<.0001
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null.
Alternative hypothesis: There is an association
between being labeled as gifted and subsequent
academic achievement.
What does a 4-point
difference mean?



Is it “statistically significant”? YES
Is it clinically significant?
Is this a causal association?
What does a 4-point
difference mean?



Is it “statistically significant”? YES
Is it clinically significant? MAYBE
Is this a causal association? MAYBE
Statistical significance does not necessarily
imply clinical significance.
Statistical significance does not necessarily
imply a cause-and-effect relationship.
What if our standard deviation
had been higher?

The standard deviation for change
scores in both treatment and control
was 2.0. What if change scores had
been much more variable—say a
standard deviation of 10.0?
[Figure: Two simulated null distributions side by side. With a standard deviation of 2.0 in change scores, the standard error is 0.52; with a standard deviation of 10.0, the standard error is about 2.6.]
With a std. dev. of 10.0…

[Figure: Null distribution with standard error of about 2.6. If we ran this study 1000 times, we would expect to get +4.0 or −4.0 about 12% of the time. P-value = .12]
What would a 4.0 difference
mean (std. dev=10)?



Is it “statistically significant”? NO
Is it clinically significant? MAYBE
Is this a causal association? MAYBE
No evidence of an effect ≠ evidence of no effect.
Hypothesis testing summary

Null hypothesis: the hypothesis of no effect (usually the opposite of what you hope to prove); the straw man you are trying to shoot down.
  Example: antidepressants have no effect on suicide risk.

P-value: the probability of your observed data (or data more extreme) if the null hypothesis is true.
  Example: the probability that the study would have found 10% more suicide attempts in the antidepressant group (compared with control) if antidepressants had no effect (i.e., just by chance).

If the p-value is low enough (i.e., if our data are very unlikely given the null hypothesis), this is evidence that the null hypothesis is wrong.
If the p-value is low enough (typically <.05), we reject the null hypothesis and conclude that antidepressants do have an effect.
Summary: The Underlying
Logic of hypothesis tests…
Follows this logic:
Assume A.
If A, then B.
Not B.
Therefore, Not A.
But throw in a bit of uncertainty…If A, then probably B…
Error and power

Type I error rate (or significance level): the
probability of finding an effect that isn’t real (false
positive).



If we require p-value<.05 for statistical significance, this means
that 1/20 times we will find a positive result just by chance.
Type II error rate: the probability of missing an effect
(false negative).
Statistical power: the probability of finding an effect if
it is there (the probability of not making a type II
error).

When we design studies, we typically aim for a power of 80%
(allowing a false negative rate, or type II error rate, of 20%).
Type I and Type II Error in a box

Your statistical decision          True state of the null hypothesis
                                   H0 True                    H0 False
                                   (antidepressants do not    (antidepressants do
                                   increase suicide risk)     increase suicide risk)

Reject H0                          Type I error (α)           Correct
(conclude antidepressants
increase suicide risk)

Do not reject H0                   Correct                    Type II error (β)
(conclude there is insufficient
evidence that antidepressants
increase suicide risk)
Reminds me of… Pascal’s Wager

Your decision        The TRUTH
                     God exists            God doesn’t exist

Accept God           Correct: big payoff   MINOR MISTAKE

Reject God           BIG MISTAKE           Correct
Review Question 1

If we have a p-value of 0.03 and so decide that our effect is statistically significant, what is the probability that we’re wrong (i.e., that the hypothesis test gave us a false positive)?

a. .03
b. .06
c. Cannot tell
d. 1.96
e. 95%
Review Question 1: Answer

c. Cannot tell. (The p-value is the probability of the data given the null hypothesis, not the probability that the null hypothesis is true given the data.)
Review Question 2

Standard error is:

a. For a given variable, its standard deviation divided by the square root of n.
b. A measure of the variability of a sample statistic.
c. The inverse of sample size.
d. A measure of the variability of a characteristic.
e. All of the above.
Review Question 2: Answer

b. A measure of the variability of a sample statistic. (Choice a is the standard error of a mean specifically, not standard error in general.)
Review Question 3

A randomized trial of two treatments for depression failed to show a statistically significant difference in improvement from depressive symptoms (p-value = .50). It follows that:

a. The treatments are equally effective.
b. Neither treatment is effective.
c. The study lacked sufficient power to detect a difference.
d. The null hypothesis should be rejected.
e. There is not enough evidence to reject the null hypothesis.
Review Question 3: Answer

e. There is not enough evidence to reject the null hypothesis. (No evidence of an effect is not evidence of no effect, and the result alone does not tell us whether the study was underpowered.)
Review Question 4

Following the introduction of a new treatment regime in a rehab facility, alcoholism “cure” rates increased. The proportion of successful outcomes in the two years following the change was significantly higher than in the preceding two years (p-value <.005). It follows that:

a. The improvement in treatment outcome is clinically important.
b. The new regime cannot be worse than the old treatment.
c. Assuming that there are no biases in the study method, the new treatment should be recommended in preference to the old.
d. All of the above.
e. None of the above.
Review Question 4: Answer

e. None of the above. (Statistical significance does not establish clinical importance, superiority, or a causal, unbiased comparison.)
Statistical Power

Statistical power is the probability of
finding an effect if it’s real.
Can we quantify how much
power we have for given
sample sizes?
study 1: 263 cases, 1241 controls

[Figure: Null distribution (difference = 0) with standard error 3.3%. For a 5% significance level, the one-tail area is 2.5% (Zα/2 = 1.96), so the rejection region is any value ≥ 6.5% (0 + 3.3 × 1.96). The clinically relevant alternative is a difference of 10%. Power = the chance of being in the rejection region if the alternative is true = the area of the alternative distribution to the right of this cutoff (shown in yellow).]
study 1: 263 cases, 1241 controls

[Figure: The alternative distribution overlaid on the null. The area beyond the rejection cutoff of 6.5% (0 + 3.3 × 1.96), in yellow, is the power: here, >80%.]
study 1: 50 cases, 50 controls

[Figure: With a standard error of 10%, the critical value is 0 + 10 × 1.96 ≈ 20 (Zα/2 = 1.96; one-tail area 2.5%). Against the same 10% alternative, power is closer to 20% now.]
Study 2: 18 treated, 72 controls, std dev = 2

[Figure: Critical value = 0 + 0.52 × 1.96 ≈ 1. Against the clinically relevant alternative of a 4-point difference, power is nearly 100%!]
Study 2: 18 treated, 72 controls, std dev = 10

[Figure: Critical value = 0 + 2.59 × 1.96 ≈ 5. Against the 4-point alternative, power is only about 40%.]
Study 2: 18 treated, 72 controls, effect size = 1.0

[Figure: Critical value = 0 + 0.52 × 1.96 ≈ 1. Against a clinically relevant alternative of only a 1-point difference, power is about 50%.]
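All five pictured power values can be reproduced from normal theory. This sketch (my own; scipy assumed) takes the standard error and the clinically relevant alternative, finds the critical value, and reports the area of the alternative distribution beyond it (ignoring the tiny far rejection tail, as the figures do):

```python
from scipy.stats import norm

def power(se, alternative, alpha=0.05):
    """Approximate power: area of the alternative distribution
    beyond the upper critical value of the null distribution."""
    critical = norm.ppf(1 - alpha / 2) * se   # e.g., 1.96 * SE
    return norm.sf((critical - alternative) / se)

print(f"{power(se=0.033, alternative=0.10):.2f}")  # study 1, 263/1241: ~0.86 (>80%)
print(f"{power(se=0.10,  alternative=0.10):.2f}")  # study 1, 50/50:    ~0.17 (~20%)
print(f"{power(se=0.52,  alternative=4.0):.2f}")   # study 2, SD=2:     ~1.00
print(f"{power(se=2.59,  alternative=4.0):.2f}")   # study 2, SD=10:    ~0.35 (roughly 40%)
print(f"{power(se=0.52,  alternative=1.0):.2f}")   # study 2, effect=1: ~0.49 (~50%)
```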
Factors Affecting Power

1. Size of the effect
2. Standard deviation of the characteristic
3. Sample size
4. Significance level desired
1. Bigger difference from the null mean
[Figure: Null and clinically relevant alternative distributions of average weight from samples of 100, drawn farther apart: more power.]

2. Bigger standard deviation
[Figure: The same two distributions of average weight from samples of 100, but wider: less power.]

3. Bigger sample size
[Figure: The same two distributions, but narrower: more power.]

4. Higher significance level
[Figure: The same two distributions with a larger rejection region: more power.]
Sample size calculations

Based on these elements, you can write
a formal mathematical equation that
relates power, sample size, effect size,
standard deviation, and significance
level…
Simple formula for difference in proportions

$$n = \frac{2\,\bar{p}(1-\bar{p})\,(Z_{\beta} + Z_{\alpha/2})^2}{(p_1 - p_2)^2}$$

n = sample size in each group (assumes equal-sized groups)
$\bar{p}(1-\bar{p})$ = a measure of variability (similar to standard deviation)
$Z_{\beta}$ = represents the desired power (typically .84 for 80% power)
$Z_{\alpha/2}$ = represents the desired level of statistical significance (typically 1.96)
$p_1 - p_2$ = effect size (the difference in proportions)
Simple formula for difference in means

$$n = \frac{2\,\sigma^2\,(Z_{\beta} + Z_{\alpha/2})^2}{(\text{difference})^2}$$

n = sample size in each group (assumes equal-sized groups)
$\sigma$ = standard deviation of the outcome variable
$Z_{\beta}$ = represents the desired power (typically .84 for 80% power)
$Z_{\alpha/2}$ = represents the desired level of statistical significance (typically 1.96)
difference = effect size (the difference in means)
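Both formulas translate directly into code. A minimal sketch (my own, not from the lecture), assuming 80% power (Z_β = .84), a two-sided 5% significance level (Z_α/2 = 1.96), and taking p̄ as the average of the two proportions (the slide leaves p unspecified):

```python
from math import ceil

Z_BETA = 0.84     # 80% power
Z_ALPHA_2 = 1.96  # two-sided alpha = .05

def n_per_group_proportions(p1, p2):
    """Sample size per group to detect a difference in proportions."""
    p_bar = (p1 + p2) / 2  # average proportion as the variability measure (assumption)
    return ceil(2 * p_bar * (1 - p_bar) * (Z_BETA + Z_ALPHA_2) ** 2 / (p1 - p2) ** 2)

def n_per_group_means(sd, difference):
    """Sample size per group to detect a difference in means."""
    return ceil(2 * sd ** 2 * (Z_BETA + Z_ALPHA_2) ** 2 / difference ** 2)

# To detect 46% vs. 36% exposure (the antidepressant example):
print(n_per_group_proportions(0.46, 0.36))  # ~380 per group

# To detect a 4-point IQ difference with SD = 2.0:
print(n_per_group_means(2.0, 4.0))          # ~4 per group (power was enormous)
```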
Sample size calculators on the web…

http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
http://calculators.stat.ucla.edu
http://hedwig.mgh.harvard.edu/sample_size/size.html
These sample size calculations are idealized

•They do not account for losses to follow-up (prospective studies)
•They do not account for non-compliance (for intervention trials or RCTs)
•They assume that individuals are independent observations (not true in clustered designs)
•Consult a statistician!
Review Question 5

Which of the following elements does not increase statistical power?

a. Increased sample size
b. Measuring the outcome variable more precisely
c. A significance level of .01 rather than .05
d. A larger effect size
Review Question 5: Answer

c. A significance level of .01 rather than .05. (A more stringent significance level shrinks the rejection region, which decreases power.)
Review Question 6

Most sample size calculators ask you to input a value for σ. What are they asking for?

a. The standard error
b. The standard deviation
c. The standard error of the difference
d. The coefficient of deviation
e. The variance
Review Question 6: Answer

b. The standard deviation
Review Question 7

For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is 10 in your sample size formula?

a. Standard deviation
b. Mean change
c. Effect size
d. Standard error
e. Significance level
Review Question 7: Answer

c. Effect size
Homework




Problem Set 3
Reading: continue reading textbook
Reading: p-value article
Journal article/article review sheet