Hypothesis Testing Overview

1. Model and assumptions, e.g., Y ~ N(μ, σ).
   H0: μ = μ0
   HA: μ < μ0, μ > μ0, or μ ≠ μ0
2. Design: significance level α (e.g., 0.01, 0.05, or 0.10), sample size n, and 1 − β ≡ Power.
3. Perform the survey, experiment, or observational study.
4. Compute the estimator, its standard error, a confidence interval, the test criterion, and the P-value.
5. Decision: if P < α, reject H0; if P ≥ α, do NOT reject H0.
6. Characterize the result: statistically significant, or NOT statistically significant.

Effect: δ = μ − μ0. The hypotheses, in terms of the effect, are

   H0: δ = 0
   HA: δ < 0, δ > 0, or δ ≠ 0

Estimated effect: δ̂ = Ȳ − μ0.

P-value = the probability that the estimated effect would be as great as or greater than that observed, in the direction specified by HA, if H0 were true.

Conclusion
Method. Assuming [Y ~ N(μ, σ), verbally], we tested [H0 vs. HA, verbally], using the z-test ([cite reference]) with significance level [α] and sample size [n].
Results. There is significant statistical evidence that [HA, verbally] ([P-value]). / There is NOT significant statistical evidence that [HA, verbally] ([P-value]).

Truth Table

                            True State of Nature
   Decision                 H0 true                   HA true
   Reject H0                Type 1 error, α           Correct decision, Power (1 − β)
   NOT Reject H0            Correct decision          Type 2 error, β (O.C.)

P      P-value. The probability that the distance between the estimator and the hypothetical value of the parameter, in the direction specified by HA, would be as great as or greater than that observed, if H0 were true.
α      Significance level. The Type 1 error rate; set by the investigator in the Design step.
β      Operating characteristic. The Type 2 error rate, β = f(δ; α, n, σ), δ = μ − μ0, i.e., β is a function of (i) the effect, i.e., the difference between the true and the hypothetical (null) value of the parameter of interest, (ii) the significance level, (iii) the sample size, and (iv) the underlying variability.
1 − β  Power. P{Reject H0 | δ; α, n, σ} = 1 − f(δ; α, n, σ).

Guiding objectives of Hypothesis Testing

   Quantity (as it increases)                     β (Type 2 error rate)   Power (1 − β)
   Significance level, α (Type 1 error rate)*     decreases               increases
   Sample size, n*                                decreases               increases
   Underlying variability, σ                      increases               decreases
   Effect, δ = (μ − μ0), in magnitude             decreases               increases
   * controlled by the investigator

α = significance level = Type 1 error rate = probability of rejecting a true null hypothesis.
β = operating characteristic = Type 2 error rate = probability of not rejecting a false null hypothesis.
1 − β = power = sensitivity = probability of rejecting a false null hypothesis.
δ = (μ − μ0) = effect = unknown true value of the parameter, μ, minus the null hypothetical value, μ0.

[Figure: Power curve. Power = P{Reject Null} plotted against the effect (0 to 10); the curve for smaller n (larger SE) rises more slowly than the curve for larger n (smaller SE).]
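To make the relation 1 − β = 1 − f(δ; α, n, σ) concrete, the following minimal Python sketch (not part of the original handout) computes the power of the two-sided one-sample z-test and traces a power curve like the one in the figure above. The function name ztest_power and the illustrative values of α, n, σ, and the effect grid are assumptions chosen only for the illustration.

```python
from scipy.stats import norm

def ztest_power(delta, alpha, n, sigma):
    """Power of the two-sided one-sample z-test of H0: mu = mu0.

    delta : effect, mu - mu0 (true value minus null value)
    alpha : significance level (Type 1 error rate)
    n     : sample size
    sigma : underlying standard deviation
    """
    se = sigma / n ** 0.5              # standard error of the sample mean
    z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value
    shift = delta / se                 # effect measured in standard errors
    # P{reject H0} = P{Z > z_crit - shift} + P{Z < -z_crit - shift}
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

# Power rises with the effect and with n; at delta = 0 it equals alpha.
for n in (25, 100):                       # larger n means smaller SE
    for delta in (0, 2, 4, 6, 8, 10):     # effect grid matching the figure's axis
        print(f"n={n:3d}  delta={delta:2d}  power={ztest_power(delta, 0.05, n, sigma=10):.3f}")
```

Run over a fine grid of effects, the same calculation reproduces the S-shaped power curves sketched above: each curve starts at α when δ = 0 and climbs toward 1 faster for the larger sample size (smaller SE).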
Statistical Significance versus Practical (clinical, biological, economic, etc.) Importance

Statistical significance. A test of significance/hypothesis is a test of the null hypothesis H0: δ = 0 versus an alternative HA such as δ < 0, δ > 0, or δ ≠ 0, where the effect δ is the actual difference between the true value of the population parameter and the hypothetical value from the null hypothesis. The result of the test, i.e., the decision and the conclusion, depends on the observed effect δ̂, which is the difference between the sample estimator of the parameter and the hypothetical value. A non-zero observed effect does not necessarily indicate that the actual effect is non-zero and the null hypothesis is false. Even if the null hypothesis is true and the actual effect is zero, the observed effect will most likely be non-zero due to natural underlying variation, i.e., due to chance.

In these terms, the question we seek to answer by performing the test of significance is the following: Is the observed effect the result of an actually non-zero effect, or is the observed effect due to chance? Rejecting the null hypothesis is saying that the observed effect (in the sample) is due to an actual effect (in the population). Not rejecting is saying that the observed effect may simply be due to chance, in which case there may be no actual effect in the population.

We therefore reject the null hypothesis only if the P-value is small. The P-value is the probability of so large an observed effect arising from natural underlying variation when the null hypothesis is true and the actual effect is zero, i.e., the P-value is the probability of so large an effect due to chance. If the P-value is small, we reject the null hypothesis and declare that the results are statistically significant and that there is statistically significant evidence in favor of the alternative. Thus, statistical significance means that the observed effect is attributed to an actual non-zero effect in the population, and not merely to the chance variation of a random sample.

On the other hand, if the sample size is extremely large, so that the power of the test is more than enough to detect an important difference (effect, relationship, etc., whatever is specified by the alternative hypothesis), then the results of the study can be statistically significant but not biologically (clinically, economically, etc.) important. That is, if the sample size is large, then statistical significance does not necessarily imply practical importance. If the sample size is so small that the power of the test is insufficient to detect an important difference, then the results of the study can be of practical importance, yet not achieve statistical significance. That is, if the sample size is small, then a failure to achieve statistical significance does not necessarily imply a lack of practical importance.

Regardless of the result of a test of significance, it is always a good idea to estimate the parameter(s) of interest with a confidence interval. Doing so refocuses attention on the magnitude of the parameter of interest, which is, of course, the subject of the study. [Note: In many studies, rather than reporting confidence intervals, point estimates and standard errors are reported. Point estimates and standard errors are sufficient, as everyone knows that approximate 95% confidence limits = (point estimate) ± (2)(standard error).]

Example 1-A. Mendel's theory of genetics implies that the proportion of females among humans ought to be 50%. To test this research hypothesis, we will observe the sex (female, male) of newborn humans, i.e., the random variable of interest is whether or not the sex is female for the i-th randomly sampled newborn. The parameter of interest is the proportion, say π, of human births that are female. The null and alternative hypotheses are, respectively,

   H0: π = 0.50  vs.  HA: π ≠ 0.50.

We will observe the sample proportion p̂ and use the test criterion

   Z = (p̂ − 0.50) / √(0.50 × 0.50 / n).

We plan to use a small significance level of α = 0.01, because only strong evidence against a theory that has stood for over a century would be convincing, and a sample of size n = 10,000, because birth records are easy and inexpensive to obtain.
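Before looking at the data, it may help to see the test criterion and the two-tailed P-value written out as code. This is a minimal Python sketch, not from the original handout; the function name prop_ztest and its return format are assumptions made for the illustration.

```python
from scipy.stats import norm

def prop_ztest(x, n, pi0=0.50):
    """Two-sided one-sample z-test for a proportion, as set up in Example 1-A.

    x   : number of female births observed
    n   : number of birth records (sample size)
    pi0 : hypothetical null value of the proportion
    """
    p_hat = x / n                        # sample proportion
    se = (pi0 * (1 - pi0) / n) ** 0.5    # null standard error, sqrt(pi0(1 - pi0)/n)
    z = (p_hat - pi0) / se               # test criterion
    p_value = 2 * norm.sf(abs(z))        # two-tailed P-value
    return p_hat, se, z, p_value
```

With α = 0.01, the decision rule is simply: reject H0 when the returned P-value is below 0.01.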
Performing the study, we find that out of 10,172 births (why throw away the 172 extra records we found?), 5,202 are female. Thus, we have observed a sample proportion of

   p̂ = 5,202 / 10,172 = 0.5114 = 51.14%,

and a test criterion of

   Z = (0.5114 − 0.5000) / √(0.50 × 0.50 / 10,172) = 0.01140 / 0.004958 = 2.300.

Notice that the observed effect is 0.01140, hardly more than 1%, and rather a small departure from what is predicted by the null hypothesis. On the other hand, this small observed effect amounts to 2.3 standard errors because of the large sample size, and the P-value for the two-tailed test is

   P = 2 P{Z ≥ 2.300} = 2 × 0.01072 = 0.02145 ≈ 0.02.

Now, because P = 0.02 is not as small as the significance level of α = 0.01, the decision is to not reject the null hypothesis, and we conclude that

   There is no significant statistical evidence that the proportion of female births is different from the 50% predicted by Mendel's theory (P = 0.02).

We might also report an estimate:

   We are 99% confident that the proportion of females among human births is between 49.9% and 52.4%.

or simply that

   The observed proportion of females among human births was 51.1% (n = 10,172, SE = 0.496%).

and let readers calculate confidence limits themselves, taking for granted that it is universally known that n represents the sample size and SE represents the standard error.

Concluding that the results are not statistically significant means that the observed effect might be due merely to chance. Revealing that the estimated effect is only δ̂ = p̂ − π0 = 0.011 gives the further information that the observed effect is not biologically important.

Example 1-B. Now suppose that we repeat the study but use a sample size of n = 100,000 birth records, and observe 51,142 females. We then have observed a sample proportion of

   p̂ = 51,142 / 100,000 = 0.5114 = 51.14%,

a test criterion of

   Z = (0.5114 − 0.5000) / √(0.50 × 0.50 / 100,000) = 0.01140 / 0.001581 = 7.210,

and P < 0.0001 (in fact, P < 0.000000000001).

We report:

   There is highly significant statistical evidence that the proportion of female births is different from the 50% predicted by Mendel's theory (P < 0.0001). The observed proportion of females among human births was 51.1% (n = 100,000, SE = 0.158%), i.e., we are 99% confident that the proportion of human births that are female is between 50.7% and 51.5%.

Concluding that the results are highly statistically significant means that the observed effect cannot be explained by chance and must reflect an actual departure from the 0.50 of the null hypothesis. Revealing that the estimated effect is only 0.011 gives the further information that the observed effect, albeit statistically significant, is not biologically important, because there are many reasonable explanations for the small discrepancy (differential survival of the zygotes, etc.).

                                Example 1-A                           Example 1-B
   sample size, n               10,172                                100,000
   observed effect, δ̂           0.011                                 0.011
   standard error, SE           0.00496                               0.00158
   test criterion, Z            2.3                                   7.2
   P-value                      0.02                                  < 0.0001
   Statistical significance     slightly statistically significant    very highly statistically significant
   Practical importance         little or no biological importance    little or no biological importance

The significance test of Example 1-B, with its huge sample size and small standard error, has a very small Type 2 error rate and very high power, and is very sensitive to departures from the null hypothesis. A very small observed effect, so small that it is of no practical importance, is very highly statistically significant.
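The comparison in the table can be reproduced with a few lines of Python. This is a self-contained sketch, not part of the original handout; the function name summarize and the output format are assumptions, and the confidence interval uses √(p̂(1 − p̂)/n) for its standard error, as is conventional.

```python
from scipy.stats import norm

def summarize(label, x, n, pi0=0.50, conf=0.99):
    """Print the z-test and confidence interval for x successes out of n trials."""
    p_hat = x / n
    se_null = (pi0 * (1 - pi0) / n) ** 0.5            # SE under H0, used in the test criterion
    z = (p_hat - pi0) / se_null
    p_value = 2 * norm.sf(abs(z))                     # two-tailed P-value
    half = norm.ppf(1 - (1 - conf) / 2) * (p_hat * (1 - p_hat) / n) ** 0.5
    print(f"{label}: p_hat={p_hat:.4f}  SE={se_null:.5f}  Z={z:.2f}  "
          f"P={p_value:.2g}  {conf:.0%} CI = ({p_hat - half:.3f}, {p_hat + half:.3f})")

summarize("Example 1-A", x=5_202,  n=10_172)    # Z = 2.30, P = 0.021, 99% CI (0.499, 0.524)
summarize("Example 1-B", x=51_142, n=100_000)   # Z = 7.2,  P < 0.0001, 99% CI (0.507, 0.515)
```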
The figure below is a graph of power curves for sample sizes of n = 10,000 and n = 100,000. The effect is δ = (π − π0), where π is the true probability of a female and π0 = 0.50 is the hypothetical null value. An effect of 0.01 therefore corresponds to π = 0.51. You can see that both of these tests are quite sensitive.

[Figure: Power = P{Reject H0} plotted against the effect (0 to 0.03) for sample sizes n = 10,000 and n = 100,000; the larger sample size reaches high power at smaller effects.]

Golde Holtzman, 2/5/2016
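For readers who want to redraw these curves, here is a rough reconstruction in Python. It is not from the original handout: the function name prop_power is an assumption, α = 0.01 is carried over from the examples, and the calculation uses the null standard error √(π0(1 − π0)/n) in both tails, a common simplification.

```python
from scipy.stats import norm

def prop_power(effect, n, alpha=0.01, pi0=0.50):
    """Approximate power of the two-sided z-test of H0: pi = pi0 when pi = pi0 + effect."""
    se = (pi0 * (1 - pi0) / n) ** 0.5   # null standard error
    z_crit = norm.ppf(1 - alpha / 2)    # two-sided critical value
    shift = effect / se                 # effect measured in standard errors
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

# Power at the effects marked on the figure's horizontal axis.
for n in (10_000, 100_000):
    row = "  ".join(f"{prop_power(d, n):.2f}" for d in (0.0, 0.01, 0.02, 0.03))
    print(f"n = {n:>7,}: power at effects 0, .01, .02, .03 -> {row}")
```

Under these assumptions, at an effect of 0.01 (π = 0.51) the n = 100,000 test has power essentially 1, while the n = 10,000 test does not reach comparable power until the effect is nearer 0.02 to 0.03.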