Hypothesis Testing Slides

Hypothesis testing
Classical hypothesis testing is a statistical method that appeared in the first
third of the 20th Century, alongside the “modern” conception of a “scientific”
theory. The philosopher of science Karl Popper (1902-1994) held that a theory
that is not falsifiable is not scientific. He asserted that it was the
scientist’s responsibility to construct statements that would be either consistent
or inconsistent with a scientific theory. In his view, science progresses by
repeatedly “testing” aspects of theories against observation; falsified theories
are those that, after much experimentation and observation, have failed to
survive these tests.
Three individuals are responsible for developing the statistical analog to
Popper’s falsificationism: Ronald Fisher, Jerzy Neyman, and Egon
Pearson.
Ronald Fisher
(1890 – 1962)
Jerzy Neyman
(1894 – 1981)
Egon Pearson
(1895 – 1980)
Statistical hypothesis testing, the statistical methodology analogous to
falsificationism, proceeds by first summarizing the scientists’
observations in numeric form: i.e. calculating statistics. In theory, the
statistics chosen for calculation would aid the scientist by creating
hypotheses in numeric form that correspond to scientific hypotheses about
what should happen if a statement or theory were true, and what should
happen if a statement or theory were false.
Some set of values of the statistic(s) would be regarded as enough
information to falsify a statement or theory, and a different set of values
would be regarded as not enough (yet) information to falsify a statement.
An example: The Iron Butterfly Theory
Magnetic fields affect the orientation of a wide
variety of animals.
* Homing pigeon (Columba livia domestica)
* European robin (Erithacus rubecula)
* Indigo bunting (Passerina cyanea)
One well-known migratory animal is the monarch butterfly (Danaus
plexippus).
These creatures migrate over tremendous distances, and it is not known how
they locate their “winter homes.” One possibility: they use the Earth’s
magnetic field.
If monarchs have magnetic material in their bodies, this would lend support
to the possibility of their use of the Earth’s magnetic field in navigation. The
existence of magnetic material can be measured by a magnetometer.
Unfortunately, the magnetometer itself contains magnetic material, at a
level of 200 pico-emu.
To demonstrate that monarchs have magnetic material in their bodies, it
must be shown that the magnetic intensity of the butterflies exceeds the
background 200 pico-emu.
For purposes of the statistical inference, we start from the theory that can
be expressed as a population parameter equal to some constant. If
monarchs have no magnetic material in their bodies, we expect the
magnetometer to record a mean of 200 pico-emu.
What counts as evidence against the no-magnetic-material theory?
Our alternative theory is that the monarchs do have magnetic
material (and thus may navigate using the Earth’s magnetic field).
Evidence in favor of the magnetic navigation theory would be a
higher mean magnetic intensity.
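The two competing theories can be written as a pair of statistical hypotheses. The following is a sketch in standard notation; the symbols H_0, H_a, and μ (the population mean magnetometer reading) are labels assumed here and do not appear in the slides.

```latex
\begin{align*}
H_0 &: \mu = 200 \ \text{pico-emu} && \text{(no magnetic material in the butterflies)} \\
H_a &: \mu > 200 \ \text{pico-emu} && \text{(magnetic material present)}
\end{align*}
```

The alternative is one-sided because only a mean above the 200 pico-emu background would count as evidence of magnetic material.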
Of course, the actual population mean can’t be observed, so we must rely
on a random sample of these butterflies. If a sample mean, x̄, is larger
than 200 pico-emu, this might be due to one of two possible reasons: (i)
the magnetometer is correctly detecting magnetic material in the
butterflies, or (ii) the slings and arrows of outrageous sampling.
The mathematics of hypothesis testing is designed to quantify the
probability of getting a larger-than-200 sample mean if chance alone is
operating.
The mechanism for this quantification requires a consideration of all the
possible sample means that might result from random sampling, and the
probabilities of these means appearing.
The probability distribution of a sample statistic is known as its sampling
distribution. The sampling distribution of a statistic is a mathematical model
of the possible results of taking a sample, and like all mathematical models
has certain driving assumptions.
In the case of the sample mean, the assumptions are:
1. A simple random sample of size n is taken from a population of size N.
2. It is reasonable to assume that (a) the population is normal, or
(b) the sample size is “large enough” for the Central Limit Theorem to
work its magic.
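The idea of a sampling distribution can be made concrete by simulation. The sketch below draws many random samples from a normal population and collects their means. The population standard deviation (30 pico-emu), the sample size (16), and the function name are all hypothetical choices made for illustration, not values from the butterfly study.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` random samples of size n from a normal population
    and return the list of their sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]

# Hypothetical setup: population mean 200 pico-emu (no magnetic material),
# assumed population SD of 30, samples of 16 butterflies.
means = simulate_sample_means(mu=200.0, sigma=30.0, n=16, reps=10_000)

print("mean of the sample means:", round(statistics.fmean(means), 2))
print("SD of the sample means:  ", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):         ", 30.0 / 16 ** 0.5)
print("share of means above 210:", sum(m > 210 for m in means) / len(means))
```

Under these assumptions the simulated means should cluster around 200 with a spread close to sigma / sqrt(n), and the last line estimates how often chance alone produces a sample mean above 210.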
The specification of what counts as sufficient evidence against our theory is
done by considering the probabilities associated with possible sample means.
Recall our idea that a large sample mean might be due to one of two possible
reasons: (i) the magnetometer is correctly detecting magnetic material in the
butterflies, or (ii) chance.
Our statistical hypothesis testing procedure is all about considering chance as
a reason for acquiring a large sample mean.
If we can to our satisfaction eliminate chance as the reason, we are left with
evidence that the difference is “real” and that our “null hypothesis” is
incorrect.
For easy interpretation of results, sample statistics are “standardized” to get
“test statistics.” Which test statistic is calculated depends on the sample statistic,
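and in the case of the sample mean, the test statistic is a “t” statistic:
    t = (x̄ − μ0) / (s / √n)
where x̄ is the sample mean, μ0 is the hypothesized population mean (200
pico-emu here), s is the sample standard deviation, and n is the sample size.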
How large the sample mean must be to provide sufficient evidence is thus
translated to the equivalent question, how large must the test statistic (t in
this case) be to provide sufficient evidence against the hypothesis?
The answer to this question is provided by considering probabilities
associated with the distribution of the test statistic.
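As a minimal sketch of the standardization step, the code below computes a one-sample t statistic, (sample mean − hypothesized mean) / (s / √n), for a small set of magnetometer readings. The readings are made up for illustration and are not data from any real experiment; the function name is likewise hypothetical.

```python
import math
import statistics

def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample SD, with n - 1 in the denominator
    return (xbar - mu0) / (s / math.sqrt(n))

# Made-up readings in pico-emu (illustrative only, not real measurements).
readings = [212.0, 198.0, 225.0, 207.0, 190.0, 231.0, 204.0, 219.0]

t = t_statistic(readings, mu0=200.0)
print(f"sample mean = {statistics.fmean(readings):.2f}")
print(f"t = {t:.3f} on {len(readings) - 1} degrees of freedom")
```

A positive t means the sample mean lies above 200 pico-emu; its size says how many estimated standard errors above.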
If the test statistic is so extreme that the probability it would occur “by chance” is
less than a predetermined value (the “level of significance”), the null hypothesis will
be rejected, signifying sufficient evidence that the null hypothesis is discredited.
Traditionally the level of significance (α) is set at 5%. The probability, computed
assuming the null hypothesis is true, of a test statistic at least as extreme as the
one actually observed is the “p-value.”
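The decision rule can be sketched by estimating the p-value through simulation rather than from a t table. The code below assumes a normal population and a one-sided alternative (mean greater than 200 pico-emu). The readings are made up for illustration, and the function names are hypothetical.

```python
import math
import random

def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # sample variance
    return (xbar - mu0) / math.sqrt(s2 / n)

def simulated_p_value(t_obs, n, reps=20_000, seed=1):
    """Estimate P(T >= t_obs) when the null hypothesis is true and the
    population is normal, by simulating many samples of size n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Under the null, the t statistic does not depend on the population
        # mean or SD, so a standard normal population is sufficient.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if t_statistic(sample, 0.0) >= t_obs:
            hits += 1
    return hits / reps

# Made-up readings in pico-emu (illustrative only, not real measurements).
readings = [212.0, 198.0, 225.0, 207.0, 190.0, 231.0, 204.0, 219.0]
alpha = 0.05  # level of significance

t_obs = t_statistic(readings, 200.0)
p = simulated_p_value(t_obs, len(readings))
decision = "reject" if p < alpha else "do not reject"
print(f"t = {t_obs:.3f}, estimated p-value = {p:.4f}: {decision} the null hypothesis")
```

The simulated p-value is only an estimate and varies slightly with the seed and the number of repetitions; it also inherits the normality assumption built into the simulation.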
Unfortunately, in many situations the assumptions of the mathematical model
used to construct the sampling distribution are less than credible.
1. We may not actually be sampling from any known or suspected population.
2. (a) We have no reason to believe the population is normal, and/or
(b) we have no assurance that the size of the sample is large enough
for the Central Limit Theorem to work its magic.
But worry not! An alternative approach, not dependent on these assumptions,
is available: randomization tests.