Statistical Foundations: Hypothesis Testing

Psychology 790
Lecture #9
9/19/2006
Today’s Class
• Hypothesis Testing.
  – General terms and philosophy.
  – Specific examples.
Hypothesis Testing
Rules of the NHST Game
• Recall our discussion about Null Hypothesis Significance Testing (NHST) from the last lecture:
• This probability value is often called a p-value or p.
  – When p < .05, a result is said to be “statistically significant.”
• In short, when a result is statistically significant (p < .05), we conclude that the difference we observed was unlikely to be due to sampling error alone. We “reject the null hypothesis.”
• If the statistic is not statistically significant (p > .05), we conclude that sampling error is a plausible interpretation of the results. We “fail to reject the null hypothesis.”
Hypothesis Testing Notes
• It is important to keep in mind that NHSTs were developed for the purpose of making yes/no decisions about the null hypothesis.
  – As a consequence, the null is either accepted or rejected on the basis of the p-value.
• For logical reasons, some people are uneasy “accepting the null hypothesis” when p > .05, and prefer to say that they “failed to reject the null hypothesis” instead.
Hypothesis Testing Items of Interest
• Very important points about significance testing:
1. The term “significant” does not mean important, substantial, or worthwhile.
Points, continued
2. The null and alternative hypotheses are often constructed to be mutually exclusive. If one is true, the other must be false.
• As a consequence,
  – When you reject the null hypothesis, you accept the alternative.
  – When you fail to reject the null hypothesis, you reject the alternative.
• This may seem tricky because NHSTs do not test the research hypothesis per se.
  – Formally, only the null hypothesis is tested.
Points, continued
3. Because NHSTs are often used to make a yes/no decision about whether the null hypothesis is a viable explanation, mistakes can be made.
Errors in
Hypothesis Testing
Errors in Inference using NHST
• NHST can lead to decisions which are not correct:
• Type I error: Your test is significant (p < .05), so you reject the null hypothesis, but the null hypothesis is actually true.
• Type II error: Your test is not significant (p > .05), so you don’t reject the null hypothesis, but you should have because it is false.
Errors in Inference using NHST
• The probability of making a Type I error is determined by the experimenter. Often called the alpha value. Usually set to 5%.
• The probability of making a Type II error is determined by the experimenter. Often called the beta value. Usually ignored by social science researchers.
Errors in Inference using NHST
• The converse of a Type II error is called power:
  – The probability of rejecting the null hypothesis when it is false (a correct decision).
  – Power = 1 − beta.
More on Power
• Power is strongly influenced by sample size.
  – With larger N, we are more likely to reject the null if it is false.
  – Power analyses are conducted to determine the size of the sample needed to reject a null hypothesis, as in the sketch below.
Inferential Errors and NHST

                              Real World
Conclusion of the test     Null is true          Null is false
Null is true               Correct decision      Type II error
Null is false              Type I error          Correct decision
Points of Interest
• The example we explored previously was an example of what is called a z-test of a sample mean (see the sketch below).
• Significance tests have been developed for a number of statistics:
  – difference between two group means: t-test
  – difference between two or more group means: ANOVA
  – differences between proportions: chi-square
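As a concrete illustration of the z-test of a sample mean referred to above, here is a minimal sketch; the sample mean, hypothesized population mean, population SD, and sample size are all invented for illustration.

```python
# Two-tailed z-test of a single sample mean, assuming the population SD is known.
# All of the numbers below are made up for illustration.
import math
from scipy import stats

sample_mean = 104.0   # observed sample mean (hypothetical)
mu0 = 100.0           # population mean under the null hypothesis (hypothetical)
sigma = 15.0          # known population standard deviation (assumed)
n = 64                # sample size (assumed)

se = sigma / math.sqrt(n)          # standard error of the mean
z = (sample_mean - mu0) / se       # z statistic
p = 2 * stats.norm.sf(abs(z))      # two-tailed p-value

print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.13, p = 0.033 -> reject the null at alpha = .05
```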
How do we control Type I errors?
• The Type I error rate is typically controlled by the researcher.
• It is called the alpha rate, and corresponds to the probability cut-off that one uses in a significance test.
• By convention, researchers often use an alpha rate of .05.
  – In other words, they will only reject the null hypothesis when a statistic is likely to occur 5% of the time or less when the null hypothesis is true.
• In principle, any probability value could be chosen for making the accept/reject decision.
  – 5% is used by convention.
Type I errors
• What does 5% mean in this context?
• It means that we will only make a decision error 5% of
the time if the null hypothesis is true.
• If the null hypothesis is false, the Type I error rate is
undefined.
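A small simulation (an illustration, not part of the lecture) makes the 5% figure concrete: when the null hypothesis really is true, about 5% of tests come out “significant” at alpha = .05 purely because of sampling error. The one-sample t-test and normal data below are assumptions chosen for simplicity.

```python
# When the null hypothesis is true, about alpha (5%) of tests come out "significant"
# purely because of sampling error. Illustrative simulation with a one-sample t-test
# on normal data (both are assumptions chosen for simplicity).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, alpha = 10_000, 30, 0.05

false_positives = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)     # the null is true: mu really is 0
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    false_positives += p < alpha

print(f"Proportion of 'significant' results: {false_positives / n_sims:.3f}")   # ~ 0.05
```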
How do we control Type II errors?
• Type II errors can also be controlled by the experimenter.
• The Type II error rate is sometimes called beta.
• How can the beta rate be controlled? The easiest way to control Type II errors is by increasing the statistical power of a test.
Statistical Power
• Statistical power is defined as the probability of rejecting the null hypothesis when it is false, a correct decision (1 − beta).
• Power is strongly influenced by sample size. With a larger N, we are more likely to reject the null hypothesis if it is truly false.
  – (As N increases, the standard error shrinks. Sampling error becomes less problematic, and true differences are easier to detect.)
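The parenthetical point can be made concrete with the formula for the standard error of a mean; the value σ = 15 below is assumed purely for illustration.

$$ SE_{\bar{X}} = \frac{\sigma}{\sqrt{N}}; \qquad \sigma = 15:\; N = 25 \Rightarrow SE = 3.0,\quad N = 100 \Rightarrow SE = 1.5,\quad N = 400 \Rightarrow SE = 0.75. $$

Quadrupling N halves the standard error, which is why power rises with sample size but with diminishing returns.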
Power and correlation
[Figure: power of the significance test of a correlation (population r = .30), plotted against sample size from N = 50 to N = 200.]
• This graph shows how the power of the significance test for a correlation varies as a function of sample size.
• Notice that when N = 80, there is about an 80% chance of correctly rejecting the null hypothesis (beta = .20).
• When N = 45, we only have a 50% chance of making the correct decision, a coin toss (beta = .50).
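A simulation sketch (an illustration, not from the lecture) that approximately reproduces the two points read off the graph, assuming bivariate normal data with a population correlation of .30:

```python
# Estimate, by simulation, the power of the significance test of a correlation when
# the population correlation is .30, at the two sample sizes highlighted on the slide.
# Illustrative sketch only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
rho, alpha, n_sims = 0.30, 0.05, 5000
cov = [[1.0, rho], [rho, 1.0]]   # correlation matrix of the bivariate normal population

def estimate_power(n):
    rejections = 0
    for _ in range(n_sims):
        data = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        _, p = stats.pearsonr(data[:, 0], data[:, 1])
        rejections += p < alpha
    return rejections / n_sims

print(f"N = 45: estimated power ~ {estimate_power(45):.2f}")   # roughly .50 (a coin toss)
print(f"N = 80: estimated power ~ {estimate_power(80):.2f}")   # roughly .78, close to the 80% on the slide
```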
Power and correlation
• Power also varies as a function of the size of the correlation.
• When the population correlation is large (e.g., .80), it requires fewer subjects to correctly reject the null hypothesis that the population correlation is 0.
• When the population correlation is smallish (e.g., .20), it requires a large number of subjects to correctly reject the null hypothesis.
• When the population correlation is 0, the probability of rejecting the null is constant at 5% (alpha). Here “power” is technically undefined because the null hypothesis is true.
[Figure: power curves for population correlations of .00, .20, .40, .60, and .80, plotted against sample size from N = 50 to N = 200.]
Low Power Studies
• Because correlations in the .2 to .4 range are typically observed in non-experimental research, one would be wise not to trust research based on sample sizes less than 60ish.
• Why? Because such research only stands a 50% chance of yielding the correct decision, if the null is false. It would be more efficient (and, importantly, just as accurate) to flip a coin to make the decision rather than collecting data and using a significance test.
[Figure: the same power curves (r = .00 to .80 versus sample size from N = 50 to N = 200), highlighting the low power of small samples for correlations in the .2 to .4 range.]
A Sad Fact
• In 1962, Jacob Cohen surveyed all articles in the Journal of Abnormal and Social Psychology and determined that the typical power of research conducted in this area was 53%.
• An even sadder fact: In 1989, Sedlmeier and Gigerenzer surveyed studies in the same journal (now called the Journal of Abnormal Psychology) and found that the power had decreased slightly.
• Researchers, unfortunately, pay little attention to power. As a consequence, the Type II error rate of research in psychology is likely to be dangerously high, maybe as high as 50%.
Power in Research Design
• Power is important to consider, and should be used to design research projects.
  – Given an educated guess about what the population parameter might be (e.g., a correlation of .30, a mean difference of .5 SD), one can determine the number of subjects needed for a desired level of power (see the sketch below).
  – Cohen and others recommend that researchers try to obtain a power level of about 80%.
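One common way to do that calculation for a correlation uses an approximation based on Fisher’s r-to-z transformation. The sketch below uses the r = .30 and 80% power values mentioned above; the helper function is ours, not a library routine.

```python
# Approximate N needed to detect an assumed population correlation with a
# two-tailed test, via Fisher's r-to-z transformation. Illustrative sketch;
# the function name is ours, not a library routine.
import math
from scipy import stats

def n_for_correlation(r, alpha=0.05, power=0.80):
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # 1.96 for alpha = .05
    z_beta = stats.norm.ppf(power)            # 0.84 for power = .80
    c = 0.5 * math.log((1 + r) / (1 - r))     # Fisher z of the assumed correlation
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

print(n_for_correlation(0.30))   # about 85 subjects for r = .30
print(n_for_correlation(0.50))   # about 30 subjects for r = .50
```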
Power in Research Design
• Thus, if one used an alpha level of 5% and collected enough subjects to ensure a power of 80% for an assumed effect, one would know, before the study was done, what the theoretical error rates are for the statistical test.
• Although these error rates correspond to long-run outcomes, one could get a sense of whether the research design is a credible one: whether it is likely to minimize the two kinds of errors that are possible in NHST and, correspondingly, maximize the likelihood of making a correct decision.
Misconceptions About
Hypothesis Testing
Three Common Misinterpretations of
Significance Tests and p-values
1. The p-value indicates the probability that the results are due to sampling error or “chance.”
2. A statistically significant result is a “reliable” result.
3. A statistically significant result is a powerful, important result.
Misinterpretation # 1
• The p-value is a conditional probability: the probability of observing a specific range of sample statistics GIVEN (i.e., conditional upon) that the null hypothesis is true, P(D|H0).
• This is not equivalent to the probability of the null hypothesis being true, given the data.
• P(H0|D) ≠ P(D|H0)
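A hypothetical numerical illustration of the difference (the prior probabilities and the likelihood under the alternative below are invented for illustration): suppose the null and alternative hypotheses are equally plausible a priori, and the data D have probability .05 under H0 but .50 under H1. Bayes’ theorem then gives

$$ P(H_0 \mid D) = \frac{P(D \mid H_0)\,P(H_0)}{P(D \mid H_0)\,P(H_0) + P(D \mid H_1)\,P(H_1)} = \frac{.05 \times .5}{.05 \times .5 + .50 \times .5} \approx .09, $$

so even in this favorable case a result with p = .05 does not mean the null hypothesis has only a 5% chance of being true.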
Misinterpretation # 2
• Is a significant result a “reliable,” easily replicated result?
• Not necessarily. The p-value is a poor indicator of the replicability of a finding.
• Replicability (assuming a real effect exists, that is, that the null hypothesis is false) is primarily a function of statistical power.
Misinterpretation # 2
• If a study had statistical power of 80%, what is the probability of obtaining a “significant” result twice?
• The probability of two independent events both occurring is the simple product of the probability of each of them occurring.
  – .80 × .80 = .64
• If power = 50%? .50 × .50 = .25
• Bottom line: The likelihood of replicating a result is determined by statistical power, not the p-value derived from a significance test. When the power of the test is low, the likelihood of a long-run series of replications is even lower.
Misinterpretation # 3
• Is a significant result a powerful, important result?
• Not necessarily.
• The importance of the result, of course, depends on the issue at hand, the theoretical context of the finding, etc.
Misinterpretation # 3
• We can measure the practical or theoretical significance
of an effect using an index of effect size.
• An effect size is a quantitative index of the strength of the
relationship between two variables.
• Some common measures of effect size are correlations, regression weights, Cohen’s d, and R-squared.
Misinterpretation # 3
• Importantly, the same effect size can have different p-values, depending on the sample size of the study.
• For example, a correlation of .30 would not be statistically significant with a sample size of 30, but would be statistically significant with a sample size of 130.
• Bottom line: The p-value is a poor way to evaluate the practical “significance” of a research result.
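The N = 30 versus N = 130 claim can be checked directly with the standard t transformation of a correlation, t = r√(n − 2)/√(1 − r²). A sketch follows; the helper function is ours, not a library routine.

```python
# p-values for an observed correlation of r = .30 at two sample sizes, using the
# standard t transformation of a correlation. Illustrative sketch; the helper is ours.
import math
from scipy import stats

def p_for_r(r, n):
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t statistic with df = n - 2
    return 2 * stats.t.sf(abs(t), df=n - 2)            # two-tailed p-value

print(f"r = .30, N =  30: p = {p_for_r(0.30, 30):.3f}")    # ~ .107, not significant
print(f"r = .30, N = 130: p = {p_for_r(0.30, 130):.4f}")   # ~ .0005, significant
```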
Wrapping Up
• Today was another fun lecture about the philosophy of hypothesis testing.
• We do hypothesis testing all the time.
  – That doesn’t make it something without error, though.
Next Time
• Office hours today (1pm-4pm, 449 Fraser).
• Lab tonight (examples of hypothesis tests).
• Hypothesis testing example.
• Confidence Intervals (Ch 6.8 – 6.11).