MATH 2441
Probability and Statistics for Biological Sciences

A Non-Statistical Example of Hypothesis Testing

The formalism and rationale behind the hypothesis testing procedure introduced in the previous document can be difficult to understand at first because the procedure seems artificially complicated. Many writers have pointed out the parallel between the statistical hypothesis testing procedure and the trial procedures followed in our criminal justice system. Looking at the parallel features of the two may help you understand how the pieces of the statistical hypothesis testing procedure fit together, and why they have been set up the way they are.

(One of the best-known descriptions of this parallel is given by James M. Kenney in an article on page 55 of the January 1988 issue of Quality Progress, a monthly publication of the American Society for Quality. Kenney's article, called "Hypothesis Testing: Guilty or Innocent", is somewhat wider ranging than the discussion below, raising a few statistical concepts we haven't looked into yet. However, you might find it interesting to read through one of the paper copies of the article available in class.)

We will use the "generic" symbols introduced at the end of the previous document to denote the elements of statistical hypothesis tests. In particular,

$\theta$ stands for a population parameter (e.g., $\mu$, $\sigma$, $\pi$, $\mu_1 - \mu_2$, $\pi_1 - \pi_2$, etc.)
$\theta_0$ stands for a specific hypothesized numerical value of $\theta$
$f$ is the symbol for the standardized test statistic (e.g., $z$, $t$, $\chi^2$, etc.)

The symbol $f_\alpha$ stands for the value of $f$ which cuts off a right-hand tail of area $\alpha$. Then, the symbol $f_{1-\alpha}$ stands for the value of $f$ that cuts off a left-hand tail of area $\alpha$.

In the table to follow, the commentary refers to a two-tailed statistical hypothesis test, but with appropriate minor modifications, the points raised could be illustrated with reference to single-tailed tests as well.
Concept: null hypothesis
  Statistical hypothesis testing: $H_0: \theta = \theta_0$
  Criminal justice system: "the defendant is innocent"

Concept: alternative hypothesis
  Statistical hypothesis testing: $H_A: \theta \ne \theta_0$
  Criminal justice system: "the defendant is guilty"

Concept: nature of experiment
  Statistical hypothesis testing: A random sample of the population is collected. Various summary quantities of that sample are used as estimators of corresponding population summaries and/or estimates of parameters in the sampling distribution. With an adequate sampling plan, the data obtained for the random sample is believed to reflect the properties of the entire population. A standardized test statistic, $f$, is computed, a measure of the degree to which the data is inconsistent with the null hypothesis.
  Criminal justice system: "Data" in the form of observations (visual, verbal, etc.) is collected, organized, summarized. The case that the prosecution presents is intended to demonstrate that these observations are inconsistent with the defendant's claim of innocence.

Concept: rejection region
  Statistical hypothesis testing: The null hypothesis is rejected (and thereby support for the alternative hypothesis is inferred) if either $f > f_{\alpha/2}$ or $f < f_{1-\alpha/2}$. If this rejection criterion is not met, the alternative hypothesis is assumed to be unsupported.
  If the data from the sample is ambiguous, the result of the hypothesis test is "no conclusion": we cannot say that $H_A$ is supported.
  The null hypothesis is rejected and the alternative hypothesis is supported only by the presentation of observations contradicting the null hypothesis at some pre-determined level of significance. The burden of proof is on the experiment to provide clear evidence in support of $H_A$ before a conclusion can be declared.
  Criminal justice system: The prosecution must present enough evidence in contradiction to the null hypothesis that the judge or jury is convinced of the guilt of the defendant beyond "reasonable doubt."
  The default outcome is "the defendant is innocent," the null hypothesis. Only if overwhelming evidence to the contrary is presented will the null hypothesis be rejected by the judge or jury, and a finding of "guilty" occur.
  If the prosecution is unable to present evidence of the defendant's guilt, the defendant is not required to present any defense -- the verdict is automatically "innocent" unless clear evidence of guilt is presented. We say that the burden of proof is on the prosecution to make a case for guilt. The role of the defense is not to prove innocence, but to challenge the evidence presented for guilt.

Concept: type 1 error
  Statistical hypothesis testing: The null hypothesis is rejected even though the alternative hypothesis is not correct. This means that rejection of the null hypothesis is not absolute proof that the alternative hypothesis is correct -- there is a chance that an error has been made. Just by coincidence, the random sample may misrepresent the population. However, the rejection criterion has been set up to control the probability of such an error occurring.
  Criminal justice system: An innocent defendant is convicted. This means that conviction of the defendant does not establish absolutely that the defendant is truly guilty. Errors can occur if inadvertently misleading evidence is presented and not effectively challenged. However, the principles and procedures followed in the investigation and trial are designed to control the probability of such an error occurring.

Concept: type 2 error
  Statistical hypothesis testing: The null hypothesis is not rejected even though the alternative hypothesis is correct. Thus, failing to reject the null hypothesis does not mean that $H_A$ can be said to be absolutely certainly untrue. It just means that there has not been adequate evidence presented to support the conclusion $H_A$.
  Criminal justice system: A guilty defendant is acquitted. Acquittal doesn't necessarily mean that the defendant is innocent -- it may mean that the prosecution was simply unable to find adequate evidence of the defendant's guilt.

© David W. Sabo (2000)
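The claim that the rejection criterion "controls" the probability of a type 1 error can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it takes the test statistic to be a z statistic with a known population standard deviation, and the values 65, 5, and 25 for the hypothesized mean, standard deviation, and sample size are hypothetical numbers chosen for the demonstration, not data from this course. The function names are likewise made up for this example.

```python
import random
from statistics import NormalDist


def critical_values(alpha):
    """Return (f_{1-alpha/2}, f_{alpha/2}) for a standard normal test statistic."""
    z = NormalDist()
    lower = z.inv_cdf(alpha / 2)        # cuts off a left-hand tail of area alpha/2
    upper = z.inv_cdf(1 - alpha / 2)    # cuts off a right-hand tail of area alpha/2
    return lower, upper


def rejects_h0(sample, mu0, sigma, alpha):
    """Two-tailed z test of H0: mu = mu0, assuming sigma is known."""
    n = len(sample)
    sample_mean = sum(sample) / n
    f = (sample_mean - mu0) / (sigma / n ** 0.5)   # standardized test statistic
    lower, upper = critical_values(alpha)
    return f > upper or f < lower


def type1_error_rate(mu0=65.0, sigma=5.0, n=25, alpha=0.05, trials=20000, seed=1):
    """Fraction of samples drawn with H0 TRUE for which H0 is (wrongly) rejected."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        if rejects_h0(sample, mu0, sigma, alpha):
            rejections += 1
    return rejections / trials


rate = type1_error_rate()
print(f"observed type 1 error rate: {rate:.3f}")
```

Because every simulated sample comes from a population in which the null hypothesis really is true, each rejection is a type 1 error, and the observed fraction of rejections should come out close to the chosen level of significance.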
The principles and procedures of our criminal justice system have evolved over centuries to better ensure that verdicts are based on a logical and reliable decision process. The more abstract formalism we've described for statistical hypothesis testing illuminates some of the less immediately obvious motivations for certain aspects of the criminal justice system. At the same time, the more intuitive features of the criminal justice system (at least as portrayed so familiarly by Ben Matlock or Perry Mason or their more recent incarnations) give insight into the elements of the statistical hypothesis testing procedure. Note, for example, the following few observations:

"Burden of Proof"

The notion of "burden of proof" is a useful one in interpreting the result of a hypothesis test. In a criminal trial structured along the lines described above, it is the responsibility of the prosecutor to provide evidence of the defendant's guilt. It is not necessary for defendants to provide evidence of their innocence. (Of course, you could organize your "justice" system so that the accused is assumed guilty until they can provide evidence of their innocence, and apparently this is the approach in some societies. You can see that this would lead to a quite different legal system.) While it may sometimes be in the defendant's interest to provide evidence of their innocence as a way to counter the prosecution's case, the prosecution has the "burden" of making a case in the first place. If the prosecution cannot provide strong evidence of guilt, the defendant is acquitted, regardless of whether they committed the crime and regardless of whether they present any explicit defense at the trial.

Similarly in statistical hypothesis testing, it is up to the statistician to make the case for the alternative hypothesis by providing evidence that is clearly inconsistent with the null hypothesis in the appropriate way. If you state a conclusive result, it must be because the corresponding alternative hypothesis has been supported through rejection of a null hypothesis. This means that conclusive results of hypothesis tests are always supported by direct evidence. They are not the result of lack of evidence for some other possibilities.

For example, suppose the hypotheses

$H_0: \mu = 65$ ppm vs $H_A: \mu < 65$ ppm

are tested, and the data does not allow the rejection of $H_0$. What you can say is that "our evidence does not allow us to conclude that the population mean is less than 65 ppm." What you cannot say in this instance is, "since we could not reject $H_0$, we cannot say that the mean is less than 65 ppm [correct so far] so this must indicate that the mean is greater than 65 ppm [not correct at all!]." The fact that you cannot prove that the mean is less than 65 ppm is not evidence that the mean is greater than 65 ppm. To be able to declare that the mean is greater than 65 ppm in the instance just described, you would have to set up the hypothesis test:

$H_0: \mu = 65$ ppm vs $H_A: \mu > 65$ ppm

and obtain data that allowed you to reject $H_0$. This is what we mean by saying conclusive results must be directly supported by evidence rather than based on the lack of evidence for some contrary result.

At the risk of belaboring the point, you need to realize that this opens the door for some flexibility in the way a hypothesis test is set up, and the potential for stating results in a misleading manner. To be a bit more concrete, we'll refer to the SalmonCa0 question used to illustrate the hypothesis testing procedure in the preceding introductory document. At issue was the relationship of the mean calcium content of unsanitized salmon fillets to the value 65 ppm.
The points here can be made by considering just one-tailed tests. Each question below is answered for two sets of hypotheses: $H_0: \mu = 65$ ppm vs $H_A: \mu > 65$ ppm, and $H_0: \mu = 65$ ppm vs $H_A: \mu < 65$ ppm.

1. What sort of evidence (experimental data) will result in rejection of $H_0$?
  With $H_A: \mu > 65$ ppm: The sample mean would have to be a statistically significant amount greater than 65 ppm.
  With $H_A: \mu < 65$ ppm: The sample mean would have to be a statistically significant amount less than 65 ppm.

2. For what kind of populations will $H_0$ most likely be rejected?
  With $H_A: \mu > 65$ ppm: The true mean value of those populations tends to be markedly larger than 65 ppm.
  With $H_A: \mu < 65$ ppm: The true mean value of those populations tends to be markedly smaller than 65 ppm.

3. How can you state the conclusion when the data allows rejection of $H_0$?
  With $H_A: \mu > 65$ ppm: The data supports the conclusion that the population mean is greater than 65 ppm at a level of significance of … .
  With $H_A: \mu < 65$ ppm: The data supports the conclusion that the population mean is less than 65 ppm at a level of significance of … .

4. What sort of evidence (experimental data) will result in $H_0$ not being rejected?
  With $H_A: \mu > 65$ ppm: The sample mean may be larger than 65 ppm, but not large enough to allow rejection of $H_0$ at an acceptable level of significance.
  With $H_A: \mu < 65$ ppm: The sample mean may be smaller than 65 ppm, but not small enough to allow rejection of $H_0$ at an acceptable level of significance.

5. For what kind of populations will $H_0$ most likely not be rejected?
  With $H_A: \mu > 65$ ppm: The true population mean is either less than 65 ppm, or only marginally greater than 65 ppm.
  With $H_A: \mu < 65$ ppm: The true population mean is either greater than 65 ppm, or only marginally less than 65 ppm.

6. How can you state the conclusion when the data does not allow rejection of $H_0$?
  With $H_A: \mu > 65$ ppm: "The data is not adequate to support a conclusion that the true population mean is greater than 65 ppm." This is not the same thing as saying that the data indicates the true population mean is less than 65 ppm.
  With $H_A: \mu < 65$ ppm: "The data is not adequate to support the conclusion that the true population mean is less than 65 ppm." This is not the same thing as saying that the data indicates the true population mean is greater than 65 ppm.

7. Suppose that it was in the interests of the researcher to find that the mean calcium concentration in the salmon fillets was greater than 65 ppm. How would the researcher state the result of the study if the null hypothesis could be rejected?
  With $H_A: \mu > 65$ ppm: "The data definitely supports the claim that the mean calcium concentration in the salmon fillets is greater than 65 ppm."
  With $H_A: \mu < 65$ ppm: "aw shucks!"

8. Suppose that it was in the interests of the researcher to find that the mean calcium concentration in the salmon fillets was greater than 65 ppm. How would the researcher state the result of the study if the null hypothesis could not be rejected?
  With $H_A: \mu > 65$ ppm: "Our data is inconclusive on the issue of whether the mean calcium concentration is greater than 65 ppm."
  With $H_A: \mu < 65$ ppm: "The data does not indicate that the mean calcium concentration in the fillets is definitely less than 65 ppm." or "We do not have definite evidence that the mean calcium concentration is less than 65 ppm."

By the way, if for some reason a two-tailed hypothesis test was done,

$H_0: \mu = 65$ ppm vs $H_A: \mu \ne 65$ ppm

the answers to questions 1, 2, 4, and 5 would be the combination of the two answers to each question above. This is because the two-tailed rejection region consists of two (smaller) one-tailed regions, and the standardized test statistic must fall in one tail or the other if you are to be able to reject $H_0$. The only difference is that rejecting $H_0$ in a two-tailed test requires data (and so underlying populations) considerably more inconsistent with $H_0$ than a one-tailed test does, because each of the two "single tails" making up the rejection region is smaller than the single tail of a one-tailed test using the same value of $\alpha$.
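The effect of the choice of alternative hypothesis on the outcome can be sketched in a few lines of code. The example below is a hedged illustration, not part of the SalmonCa0 analysis: it assumes the standardized test statistic is a z statistic, the value 1.80 is a hypothetical test statistic chosen for the demonstration, and the function name `decide` and the labels "greater", "less", and "two-sided" are made up here.

```python
from statistics import NormalDist


def decide(f, alpha, alternative):
    """Apply the rejection criterion for a z test statistic f.

    alternative is "greater" (HA: mu > mu0), "less" (HA: mu < mu0),
    or "two-sided" (HA: mu != mu0).
    """
    z = NormalDist()
    if alternative == "greater":
        reject = f > z.inv_cdf(1 - alpha)          # single right-hand tail of area alpha
    elif alternative == "less":
        reject = f < z.inv_cdf(alpha)              # single left-hand tail of area alpha
    elif alternative == "two-sided":
        reject = (f > z.inv_cdf(1 - alpha / 2)     # two tails, each of area alpha/2
                  or f < z.inv_cdf(alpha / 2))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return "reject H0" if reject else "no conclusion"


f = 1.80  # hypothetical standardized test statistic
results = {alt: decide(f, 0.05, alt) for alt in ("greater", "less", "two-sided")}
for alt, outcome in results.items():
    print(f"{alt:>9}: {outcome}")
```

With this particular statistic, the same data supports the upper-tailed alternative but not the two-tailed one at the same level of significance, because each tail of the two-tailed region is smaller. The "no conclusion" outcome for the lower-tailed test is exactly that: it is not a finding that the mean is greater than the hypothesized value.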
In answer to question 3 for the two-tailed test, we could say "The data supports the conclusion that the population mean is different from 65 ppm." In answer to question 6, we could say, "The data is consistent with the population mean being equal to 65 ppm."

Relationship Between Type 1 Error Probability and Type 2 Error Probability

Recall the observation we made previously that for a given amount of data (i.e., sample size), any modification of the rejection region to decrease the probability of making a type 2 error inevitably resulted in an increase in the probability of making a type 1 error. Without increasing the sample size, you cannot adjust the rejection region to simultaneously decrease the values of both $\alpha$ and $\beta$.

The corresponding feature of the criminal justice system is this. If you create trial rules that require the prosecution to present very strong evidence of guilt before a defendant can be convicted, then you will have a system in which it is very improbable that an innocent defendant will be falsely convicted. In such a system, the probability of a type 1 error (convicting an innocent defendant) will be very small. However, in such a system, it would also be more difficult to convict a guilty defendant, and so it would be more probable that a guilty defendant will be acquitted (a type 2 error) due to lack of adequate evidence. Setting very high standards of evidence for guilt will hinder the mistaken conviction of innocent defendants, but at the cost of increasing the frequency of acquittal of guilty defendants. A lenient system is less likely to convict an innocent defendant, but it is also less likely to convict a guilty defendant.
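The trade-off between the two error probabilities can be made concrete with a short calculation. The sketch below is an illustration under assumptions that are not from this document: an upper-tailed z test with known population standard deviation, and hypothetical values (hypothesized mean 65, true mean 67, standard deviation 5, sample size 25). For such a test, $H_0$ is not rejected when $z \le z_\alpha$, so $\beta = \Phi\left(z_\alpha - (\mu_{true} - \mu_0)/(\sigma/\sqrt{n})\right)$, where $\Phi$ is the standard normal cumulative distribution function.

```python
from statistics import NormalDist


def type2_error_probability(alpha, mu0, mu_true, sigma, n):
    """beta for an upper-tailed z test of H0: mu = mu0 vs HA: mu > mu0.

    Assumes sigma is known and the true population mean is mu_true > mu0.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)                 # critical value f_alpha
    shift = (mu_true - mu0) / (sigma / n ** 0.5)   # how far the true mean sits from mu0
    return z.cdf(z_alpha - shift)                  # P(test statistic <= f_alpha)


# Hypothetical setting: fixed sample size, progressively stricter alpha.
alphas = (0.10, 0.05, 0.01)
betas = [type2_error_probability(a, mu0=65, mu_true=67, sigma=5, n=25) for a in alphas]
for a, b in zip(alphas, betas):
    print(f"alpha = {a:.2f}  ->  beta = {b:.3f}")

# Only a larger sample lets both error probabilities be small at once.
beta_large_n = type2_error_probability(0.01, mu0=65, mu_true=67, sigma=5, n=100)
print(f"alpha = 0.01 with n = 100  ->  beta = {beta_large_n:.3f}")
```

With the sample size held fixed, each reduction in the type 1 error probability raises the type 2 error probability. Increasing the sample size is the one change in this sketch that lowers $\beta$ without loosening $\alpha$.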
Similarly, if you relax the rules of evidence to make it easier for the prosecution to obtain a verdict of guilty, then while you'll increase the likelihood of conviction of truly guilty defendants (i.e., the probability of a type 2 error will decrease), you will also be making it easier for innocent defendants to be convicted (the probability of making type 1 errors will increase). A severe system is more likely to convict guilty defendants, but it is also more likely to convict innocent defendants.

The Worst Possible Error

Recall the principle stated earlier that one way to decide on an appropriate hypothesis test is to choose those hypotheses for which the type 1 error is the most serious error possible in the situation. We can see how this principle is realized in the way our criminal justice system is organized. In Canada and the United States, criminal trials amount to testing the hypotheses

$H_0$: the defendant is innocent
$H_A$: the defendant is guilty          (CRIM - 1)

This is the "hypothesis testing" version of the principle: the defendant is presumed innocent until "proven" guilty. People say this system is based on "the presumption of innocence." For these hypotheses, the type 1 error amounts to erroneously convicting an innocent defendant. The type 2 error would be erroneously acquitting a guilty defendant.

In principle, one could create a system organized in the opposite way:

$H_0$: the defendant is guilty
$H_A$: the defendant is innocent          (CRIM - 2)

Now, the simple act of charging a person with a crime is taken to be sufficient to presume their guilt. It is up to the defendant to provide adequate evidence of their innocence. We could call this fundamental principle "the presumption of guilt." In such a system, the type 1 error would be to acquit a guilty defendant, and the type 2 error would be to convict an innocent defendant.
You can see that the criminal justice system based on (CRIM - 1) views the conviction of an innocent defendant as a more serious error than the acquittal of a guilty defendant. That is the view held by most in our society. In our society, the debate is usually not so much over the presumption of innocence (i.e., the choice of the basic hypotheses to test) as it is over the standard of evidence required for conviction (which is the analog of the value of $\alpha$, the probability of making a type 1 error). When people say, "the courts are too soft," they are not usually advocating a switch from (CRIM - 1) to (CRIM - 2), in which we would automatically presume defendants to be guilty until proven otherwise. Rather, they are saying that the degree of evidence required to obtain a conviction may be too strict. They are criticizing the analog of "$\alpha$" used in the trial, not the structure of the "hypothesis test" itself.