How to get the most out of null results using Bayes
Zoltán Dienes
The problem:
Does a non-significant result count as evidence for the null hypothesis or as no evidence either way?
[Figure: Geoff Cumming's "dance of the p values" simulation (http://www.latrobe.edu.au/psy/esci/index.html): the p values from 25 successive replications of the same experiment on a difference in verbal ability, plotted with H0 at 0 on an axis running from -20 to 40. The p values lurch unpredictably between <.001 and .817.]
The solutions:
1. Power
2. Interval estimates
3. Bayes Factors
Problems with Power
I) Power depends on specifying the minimal effect of interest (which may be poorly specified by the theory)
II) Power cannot make use of your actual data to determine the sensitivity of those data
Confidence intervals solve the second problem
Bayes Factors can solve both problems
By using the full range of the theory's predictions, a Bayes Factor makes maximal use of the data in assessing how well those data distinguish your theory from the null.
A Bayes Factor can show strong evidence for the null hypothesis over your theory even when power or confidence intervals allow no conclusion at all.
The four principles of inference by intervals:

[Figure: the difference-between-means axis with a null region around 0, bounded by the minimal interesting value.]

1. If the 95% confidence/credibility/likelihood interval is completely contained in the null region, conclude there is good evidence that the population value lies in the null region: accept the null region hypothesis.
2. If the interval is completely outside this region, conclude there is good evidence that the population value lies outside the null region: reject the null region hypothesis.
3. If the upper limit of the interval is below the minimal interesting value, conclude there is evidence against a theory postulating a positive difference.
4. If the interval includes both null and theoretically interesting values, the data are insensitive.
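As an illustration, here is a minimal Python sketch (my own, not from the talk) that applies the four principles to a 95% interval. It assumes a null region symmetric around 0, bounded by the minimal interesting value m; the placement of the region is a choice left to the researcher.

```python
def interval_inference(lower, upper, m):
    """Classify a 95% interval (lower, upper) against a null region (-m, m),
    where m is the minimal interesting value. The symmetric region is an
    assumption of this sketch."""
    if -m <= lower and upper <= m:
        return "accept the null region hypothesis"            # principle 1
    if lower >= m or upper <= -m:
        return "reject the null region hypothesis"            # principle 2
    if upper < m:
        return "evidence against a theory of a positive difference"  # principle 3
    return "data are insensitive"                             # principle 4

# An interval spanning both null and interesting values is uninformative:
print(interval_inference(lower=-3.0, upper=12.0, m=5.0))  # -> data are insensitive
```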
The Bayes Factor

Devil hypothesis: if it's a devil in the box, you will lose a finger 9/10 of the time.
Cat hypothesis: if it's a cat, you lose a finger only 1/10 of the time.

Evidence supports the theory that most strongly predicted it.

John puts his hand in the box and loses a finger. Which hypothesis is most strongly supported, the cat hypothesis or the devil hypothesis?

The cat hypothesis predicts this result with probability = 1/10
The devil hypothesis predicts this result with probability = 9/10

Strength of evidence for the devil over the cat hypothesis = (9/10) / (1/10) = 9

The evidence is nine times as strong for the devil as for the cat hypothesis, OR: Bayes Factor (B) = 9
Consider: John does not lose a finger.

Now the evidence strongly supports the cat over the devil hypothesis (B = 9 for cat over devil, or 1/9 for devil over cat).

Now suppose instead:
Probability of losing a finger given a cat = 4/10
Probability of losing a finger given a devil = 6/10
If John loses a finger, the strength of evidence for devil over cat = 6/4 = 1.5
Not very strong.
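These likelihood ratios are simple enough to check directly; a minimal Python sketch of the three cases above:

```python
# Bayes factor for two discrete hypotheses: the ratio of the probabilities
# each hypothesis assigned to the outcome that actually occurred.
def bf(p_data_given_h1, p_data_given_h2):
    return p_data_given_h1 / p_data_given_h2

print(bf(9/10, 1/10))  # finger lost: devil (9/10) vs cat (1/10)  -> 9.0
print(bf(9/10, 1/10))  # finger kept: cat (9/10) vs devil (1/10)  -> 9.0
print(bf(6/10, 4/10))  # finger lost: devil (6/10) vs cat (4/10)  -> 1.5
```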
We can distinguish:
Evidence for cat hypothesis over devil
Evidence for devil hypothesis over cat
Not much evidence either way.
The Bayes factor tells you how strongly the data are predicted by the different theories (e.g. your pet theory versus the null hypothesis):

B = P(data | your pet theory) / P(data | null hypothesis)
If B is greater than 1, the data support your theory over the null.
If B is less than 1, the data support the null over your theory.
If B is about 1, the experiment was not sensitive.
(You automatically get a notion of sensitivity; contrast just relying on p values in significance testing.)
Jeffreys, 1961: Bayes factors greater than 3 or less than 1/3 count as substantial evidence.
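Put as a tiny Python sketch (my illustration, using Jeffreys' conventional cut-offs):

```python
# Jeffreys' (1961) conventional cut-offs for a Bayes factor B comparing
# your theory (numerator) against the null (denominator).
def interpret(B):
    if B > 3:
        return "substantial evidence for the theory over the null"
    if B < 1/3:
        return "substantial evidence for the null over the theory"
    return "data insensitive: not much evidence either way"

print(interpret(9))    # the devil/cat example above
print(interpret(1.5))  # weak evidence -> insensitive
```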
To know which theory the data support, you need to know what the theories predict.
The null is normally the prediction of, e.g., no difference.

[Figure: plausibility against the population difference between conditions (axis -2 to 4): on the null hypothesis, only the value 0 is plausible.]

You also need to decide what difference, or range of differences, is consistent with your theory.
Difficult, but it forces you to think clearly about your theory.

To calculate a Bayes factor you must decide what range of differences the theory predicts. Three common choices:
1) Uniform distribution
2) Half-normal
3) Normal
Example: The theory predicts a difference in one direction. Subjects give 0-8 ratings in two conditions.

[Figure: a uniform plausibility distribution from 0 to 8 over the population difference in means between conditions; 8 is the maximum difference allowed.]
It seems more plausible to think that larger effects are less likely than smaller ones:

[Figure: plausibility dropping away smoothly from 0 over the population difference in means between conditions.]

But how to scale the rate of drop?

[Figure: a half-normal over the population difference in means, with the rate of drop set by an SD of 4.]

This implies: smaller effects are more likely than bigger ones, and effects bigger than 8 are very unlikely.
Similar sorts of effects to those predicted have in the past been on the order of a 5% difference between conditions:

[Figure: a half-normal over the population difference between conditions with SD = 5, the axis marked at 0, 5, and 10.]

This implies: smaller effects are more likely than bigger ones, and effects bigger than 10% are very unlikely.
To calculate a Bayes factor in a t-test situation you need the same information from the data as for a t-test:
Mean difference, Mdiff
SE of difference, SEdiff

Note: t = Mdiff / SEdiff
=> SEdiff = Mdiff / t
(Handy when a paper reports the mean difference and t but not the SE: if, say, Mdiff = 6 and t = 0.86, then SEdiff = 6/0.86 ≈ 7.)
To calculate a Bayes factor:
1) Google “Zoltan Dienes”
2) First site to come up is the right one:
http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/
3) Click on link to book
4) Click on link to Chapter Four
5) Scroll down and click on “Click here to calculate your Bayes
factor!”
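To see the sort of computation such a calculator performs, here is a minimal Python sketch (my own illustration under stated assumptions, not the code behind the website). It treats the likelihood of the observed mean difference as normal with SD = SEdiff, represents the theory's predictions as a uniform or a half-normal as above, and integrates numerically:

```python
import math

def bayes_factor(mdiff, sediff, prior, sd=None, lower=0.0, upper=None, steps=10_000):
    """B = P(data | theory) / P(data | null), by numerical integration.

    prior: "uniform" over [lower, upper], or "half-normal" with scale sd.
    Assumes a normal likelihood for the observed mean difference.
    """
    def likelihood(theta):
        # Plausibility of the observed Mdiff if the population difference were theta
        return math.exp(-0.5 * ((mdiff - theta) / sediff) ** 2) / (sediff * math.sqrt(2 * math.pi))

    # On the null hypothesis, only a difference of exactly 0 is plausible
    p_data_given_null = likelihood(0.0)

    # On the theory, average the likelihood over the predicted differences
    if prior == "uniform":
        lo, hi = lower, upper
        def density(theta):
            return 1.0 / (hi - lo)
    else:  # half-normal: smaller effects more likely, effects beyond ~2*sd unlikely
        lo, hi = 0.0, 5.0 * sd  # integrate far enough into the tail
        def density(theta):
            return 2.0 * math.exp(-0.5 * (theta / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

    step = (hi - lo) / steps
    p_data_given_theory = step * sum(
        density(lo + (i + 0.5) * step) * likelihood(lo + (i + 0.5) * step)
        for i in range(steps)
    )
    return p_data_given_theory / p_data_given_null
```

The online calculator offers more (a full normal representation of the predictions, for instance), but the logic is the same: how strongly did the theory, versus the null, predict the data you actually got?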
[Figure: "The tai chi of the Bayes factors" alongside "the dance of the p values" (Geoff Cumming's simulation: http://www.latrobe.edu.au/psy/esci/index.html). Across the same 25 successive experiments on a difference in verbal ability (H0 at 0), the p values lurch between <.001 and .817 while the corresponding Bayes factors range from 0.46 to 1024.6.]
A Bayes Factor requires establishing predicted effect sizes. How?

Do digit-colour synesthetes show a Stroop effect on digits?

You display: 3 … 4 … 5 … 6
What they see: 3 … 4 … 5 … 6 (each digit in its synesthetic colour)

You get a null effect (incongruent minus congruent RTs) . . . What size effect would be predicted if there were one?

Run normals on a condition in which the digits are coloured in the way the synesthetes say they are. The Stroop effect there is presumably the maximum one could expect synesthetes to show.
Use a uniform:

[Figure: a uniform plausibility distribution over possible population Stroop effects, from 0 up to the effect normals show with really coloured digits.]
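With the sketch from earlier, and purely invented numbers (say normals show a 60 ms Stroop effect, and the synesthetes' null result is 10 ms with SE 18 ms), that prior would be used like so:

```python
# Hypothetical numbers for illustration only: normals' Stroop effect (60 ms)
# sets the top of the uniform; synesthetes showed 10 ms with SE 18 ms.
B = bayes_factor(mdiff=10, sediff=18, prior="uniform", lower=0, upper=60)
print(round(B, 2))  # compare with Jeffreys' cut-offs: B < 1/3 substantially favours the null
```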
Another condition in your experiment might help settle expectations:

Jiang et al. 2012 obtained a significant amount of unconscious knowledge (5%).
Conscious knowledge was 6% with an SE of 7% (non-significant).
To assess the meaning of this non-significant result, they used a half-normal with SD = 5% (the size of the unconscious effect).
B = 1.25
Nothing follows about whether subjects had conscious knowledge or not.
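Feeding these numbers into the sketch above gives about 1.27, close to the slide's 1.25 (the online calculator differs in its numerical details):

```python
# Jiang et al. 2012: conscious knowledge 6% (SE 7%); predictions represented
# as a half-normal with SD = 5% (from the unconscious effect).
B = bayes_factor(mdiff=6, sediff=7, prior="half-normal", sd=5)
print(round(B, 2))  # ~1.27 here; the slide reports B = 1.25 -> data insensitive
```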
If you have a manipulation meant to reduce an effect, the effect of the manipulation is unlikely to be larger than the basic effect.

e.g. Dienes, Baddeley & Jansari (2012) predicted sad mood would reduce learning compared to neutral mood.

So if, on a 2-alternative forced choice test, people in the neutral condition get 70% correct, the sad condition is expected to be somewhere between 50% (chance) and 70%.

So the effect of mood must lie between 0 and 20 percentage points: a uniform from 0 to 20 represents that prediction (see the usage sketch below).
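Again with invented observed values (the 0-20 prediction range is from the slide; suppose the sad condition came out 3 percentage points below neutral with SE 6):

```python
# The theory's prediction: mood effect between 0 and 20 percentage points.
# The observed values below are hypothetical, for illustration only.
B = bayes_factor(mdiff=3, sediff=6, prior="uniform", lower=0, upper=20)
print(round(B, 2))
```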
My typical practice:
If I can think of a way of determining an approximate expected size of effect
=> use a half-normal with SD equal to that typical size.
If I can think of a way of determining an approximate upper limit on the effect
=> use a uniform from 0 to that limit.
Moral and inferential paradoxes of orthodoxy:
1. On the orthodox approach, standardly you should plan in advance how many subjects you will run.
If you just miss out on a significant result you are not allowed to just run 10 more subjects
and test again.
You are not allowed to run until you get a significant result.
Bayes: It does not matter when you decide to stop running subjects. You can always run more
subjects if you think it will help.
Moral paradox:
If p = .07 after running the planned number of subjects:
i) If you run more and report significance at 5%, you have cheated
ii) If you don't run more and bin the results, you have wasted taxpayers' money, your own time, and relevant data
You are morally damned either way
Inferential paradox
Two people with the same data and theories could draw opposite conclusions
Moral and inferential paradoxes of orthodoxy:
2. On the orthodox approach, it matters whether you formulated your hypothesis before
or after looking at the data.
Post hoc vs planned comparisons
Predictions made in advance of looking at the data are treated differently from those made afterwards
Bayesian inference: It does not matter what day of the week you thought of your theory
The evidence for your theory is just as strong regardless of its timing
Moral and inferential paradoxes of orthodoxy:
3. On the orthodox approach, you must correct for how many tests you conduct in total.
For example, if you ran 100 correlations and 4 were just significant, researchers would not try
to interpret those significant results.
On Bayes, it does not matter how many other statistical hypotheses you investigated (or your RA investigated without telling you). All that matters is the data relevant to each hypothesis under investigation.
For orthodoxy but not Bayes:
Different people with the same data and theories can come to different
conclusions
You can thus be tempted to make false (albeit inferentially irrelevant) claims, for example about when you thought of your theory
What is the aim of statistics?
1) Control the proportion of errors you make in the long run in accepting and rejecting hypotheses
(conventional statistics)
2) Indicate how strong the evidence is for one hypothesis rather than another / how much
you should change your confidence in one hypothesis rather than another
(Bayesian statistics)
Dienes 2011 Perspectives on Psychological Science