Hypothesis testing

Classical hypothesis testing is a statistical method that appeared in the first third of the 20th century, alongside the “modern” conception of a “scientific” theory. The philosopher of science Karl Popper (1902-1994) asserted that it was the scientist’s responsibility to construct statements that would be either consistent or inconsistent with a scientific theory; in Popper’s view, a theory that is not falsifiable is not scientific. Science progresses by repeatedly “testing” aspects of theories against observation; falsified theories are those that, after much experimentation and observation, have not survived these tests.

Three individuals are responsible for developing the statistical analog to Popper’s falsificationism:

* Ronald Fisher (1890-1962)
* Jerzy Neyman (1894-1981)
* Egon Pearson (1895-1980)

Statistical hypothesis testing, the statistical methodology analogous to falsificationism, proceeds by first summarizing the scientist’s observations in numeric form, i.e., calculating statistics. Ideally, the statistics chosen for calculation correspond to scientific hypotheses about what should happen if a statement or theory were true, and what should happen if it were false. Some set of values of the statistic(s) is regarded as enough information to falsify a statement or theory, and a different set of values is regarded as not (yet) enough information to falsify it.

An example: The Iron Butterfly Theory

Magnetic fields affect the orientation of a wide variety of animals.
* Homing pigeon (Columba livia domestica)
* European robin (Erithacus rubecula)
* Indigo bunting (Passerina cyanea)

One well-known migratory animal is the monarch butterfly (Danaus plexippus). These creatures migrate over tremendous distances, and it is not known how they locate their “winter homes.” One possibility: they use the Earth’s magnetic field.

If monarchs have magnetic material in their bodies, this would lend support to the possibility that they use the Earth’s magnetic field in navigation. The existence of magnetic material can be measured by a magnetometer. Unfortunately, the magnetometer itself contains magnetic material at a level of 200 pico-emu. To demonstrate that monarchs have magnetic material in their bodies, it must be shown that the magnetic intensity of the butterflies exceeds this 200 pico-emu background.

For purposes of the statistical inference, we pick the theory that can be expressed as a population parameter equal to some constant: if monarchs have no magnetic material in their bodies, we expect the magnetometer to record a mean of 200 pico-emu (the null hypothesis, H0: μ = 200).

What counts as evidence against the no-magnetic-material theory? Our alternative theory is that the monarchs do have magnetic material (and thus navigate using the magnetic field of the Earth). Evidence in favor of the magnetic navigation theory would be a higher mean magnetic intensity (the alternative hypothesis, Ha: μ > 200).

Of course, the actual population mean cannot be observed, so we must rely on a random sample of these butterflies. If a sample mean, x̄, is larger than 200 pico-emu, this might be due to one of two possible reasons: (i) the magnetometer is correctly detecting magnetic material in the butterflies, or (ii) the slings and arrows of outrageous sampling. The mathematics of hypothesis testing is designed to quantify the probability of getting a larger-than-200 sample mean if chance alone is operating.
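To make reason (ii) concrete, a short simulation can show how often chance alone produces a sample mean above 200 even when no magnetic material is present. The noise level and sample size below are made-up, illustrative values, not real instrument specifications:

```python
import random
import statistics

random.seed(1)

# Hypothetical scenario: no magnetic material, so each reading is pure
# instrument noise centered at the 200 pico-emu background.
TRUE_MEAN = 200.0   # background level of the magnetometer
NOISE_SD = 15.0     # assumed measurement noise (illustrative value)
N = 16              # butterflies per sample

def sample_mean():
    """Mean reading from one random sample of N butterflies."""
    return statistics.mean(random.gauss(TRUE_MEAN, NOISE_SD) for _ in range(N))

# Repeat the sampling many times and count how often the sample mean
# exceeds 200 by chance alone.
means = [sample_mean() for _ in range(10_000)]
above = sum(m > TRUE_MEAN for m in means)
print(f"fraction of sample means above 200: {above / len(means):.3f}")
```

Roughly half the sample means land above 200 by chance, which is why a sample mean merely exceeding 200 is not, by itself, evidence of anything; the question is how far above 200 it must be.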
The mechanism for this quantification requires a consideration of all the possible sample means that might result from random sampling, and the probabilities of these means appearing. The probability distribution of a sample statistic is known as its sampling distribution. The sampling distribution of a statistic is a mathematical model of the possible results of taking a sample, and like all mathematical models it has certain driving assumptions. In the case of the sample mean, the assumptions are:

1. A simple random sample of size n is taken from a population of size N.
2. It is reasonable to assume that (a) the population is normal, or (b) the sample size is “large enough” for the Central Limit Theorem to work its magic.

The specification of what counts as sufficient evidence against our theory is done by considering the probabilities associated with possible sample means. Recall our idea that a large sample mean might be due to one of two possible reasons: (i) the magnetometer is correctly detecting magnetic material in the butterflies, or (ii) chance. Our statistical hypothesis testing procedure is all about considering chance as a reason for obtaining a large sample mean. If we can, to our satisfaction, eliminate chance as the reason, we are left with evidence that the difference is “real” and that our “null hypothesis” is incorrect.

For easy interpretation of results, sample statistics are “standardized” to get “test statistics.” Which test statistic is calculated depends on the sample statistic; in the case of the sample mean, the test statistic is a “t” statistic:

    t = (x̄ - μ₀) / (s / √n)

where μ₀ is the hypothesized mean (here, 200 pico-emu), s is the sample standard deviation, and n is the sample size. The question of how large the sample mean must be to provide sufficient evidence thus translates to the equivalent question: how large must the test statistic (t in this case) be to provide sufficient evidence against the hypothesis? The answer is provided by considering probabilities associated with the distribution of the test statistic; when the assumptions hold and the null hypothesis is true, t follows a t distribution with n - 1 degrees of freedom.
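As a sketch, the t statistic can be computed directly from its definition. The magnetometer readings below are made-up numbers for illustration, not real data:

```python
import math
import statistics

# Hypothetical magnetometer readings (pico-emu) from a sample of monarchs.
readings = [221, 196, 235, 208, 245, 189, 217, 230, 204, 226]

MU_0 = 200.0                       # hypothesized mean: background only
n = len(readings)
xbar = statistics.mean(readings)   # sample mean
s = statistics.stdev(readings)     # sample standard deviation (n - 1 divisor)

# Standardize: how many estimated standard errors does xbar sit above 200?
t = (xbar - MU_0) / (s / math.sqrt(n))
print(f"n = {n}, xbar = {xbar:.1f}, s = {s:.1f}, t = {t:.2f}")
```

For these invented readings, x̄ = 217.1 and t is a bit above 3, which is well beyond the usual t critical value for 9 degrees of freedom at the 5% level, so these data would count as sufficient evidence against the null hypothesis.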
If the test statistic is so extreme that the probability it would occur “by chance” is less than a predetermined value (the “level of significance”), the null hypothesis is rejected, signifying sufficient evidence that the null hypothesis is discredited. Traditionally the level of significance (α) is set at 5%. The probability associated with the actual sample statistic is the “p-value.”

Unfortunately, in many situations the assumptions of the mathematical model used to construct the sampling distribution are less than credible:

1. We may not actually be sampling from any known or suspected population.
2. (a) We have no reason to believe the population is normal, and/or (b) we have no assurance that the size of the sample is large enough for the Central Limit Theorem to work its magic.

But worry not! An alternative approach, not dependent on these assumptions, is available: randomization tests.
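One common randomization approach for a one-sample problem like this is a sign-flip test: under the null hypothesis, readings are just as likely to fall below 200 as above it, so each deviation from 200 can have its sign flipped at random to generate the chance-alone distribution. A sketch, using the same made-up readings as illustration:

```python
import random
import statistics

random.seed(1)

# Hypothetical magnetometer readings (pico-emu); 200 is the background.
readings = [221, 196, 235, 208, 245, 189, 217, 230, 204, 226]
deviations = [x - 200.0 for x in readings]

observed = statistics.mean(deviations)   # observed mean excess over 200

# Under the null hypothesis the deviations are symmetric about zero, so
# each sign is equally likely.  Re-randomize the signs many times and see
# how often chance alone produces a mean excess at least this large.
REPS = 10_000
count = 0
for _ in range(REPS):
    flipped = [d * random.choice((-1.0, 1.0)) for d in deviations]
    if statistics.mean(flipped) >= observed:
        count += 1

p_value = count / REPS
print(f"observed mean excess = {observed:.1f}, "
      f"randomization p-value = {p_value:.4f}")
```

No normal population and no Central Limit Theorem are invoked here; the p-value comes entirely from re-randomizing the observed data, which is the subject of what follows.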