Appendix H

FREQUENTLY ASKED QUESTIONS

During the thirty-plus years that I have been teaching research and statistics, I have noticed that certain questions tend to get asked almost every semester by students in connection with areas of statistics that they have already studied. In this appendix I'll identify those questions and answer them. You may recognize some of these questions, or their answers, from earlier sections of this text. However, based on my experience teaching this material, the frequency with which these questions get asked by students who have already studied the relevant material merits revisiting them in this appendix. Let's get right to it.

Question 1: Why square the differences between case values and the mean when calculating the variance or standard deviation?

Answer: Because if we didn't square them, the plus differences and the minus differences would cancel each other out, and the variance and standard deviation therefore would always be zero. By squaring, we turn the minuses into pluses, thus avoiding that problem.
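To see this concretely, here is a minimal sketch in Python, using a handful of made-up scores chosen only for illustration:

```python
# Why the deviations from the mean must be squared: a sketch with made-up scores.
scores = [4, 6, 7, 9, 14]
mean = sum(scores) / len(scores)                  # 8.0

deviations = [score - mean for score in scores]   # [-4.0, -2.0, -1.0, 1.0, 6.0]
print(sum(deviations))                            # 0.0: the pluses and minuses cancel

squared = [d ** 2 for d in deviations]            # [16.0, 4.0, 1.0, 1.0, 36.0]
variance = sum(squared) / len(scores)             # 11.6 (dividing by N here;
std_dev = variance ** 0.5                         #  dividing by N - 1 gives the
print(variance, std_dev)                          #  sample estimate); SD is about 3.41
```

No matter what scores we use, the unsquared deviations always sum to zero; only after squaring do we get a quantity that reflects how spread out the scores are.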
Question 2: You say that we need to conduct a statistical power analysis to determine how large our sample size should be. But we read somewhere or heard from another professor that 30 is sort of a magic number for sample size – that we need an N of 30 to do various analyses. Who's right?

Answer: I am, of course! You are understandably – like many students – confusing the recommended minimum number of cases generally needed to obtain a distribution that approximates normality with the number needed to obtain adequate statistical power. Suppose 15 recipients of Intervention A and 15 recipients of Intervention B combine to form a normal distribution with regard to an outcome variable. Having an N of 30 does not ensure a high likelihood that the difference in outcome between the groups will be statistically significant. It only gives you a good chance of having a normal distribution for your sample of 30 cases. The same holds even if each group has an N of 30 and each is normally distributed. If the population of recipients of Intervention A has a mean score of 50 and a standard deviation of 10, and the population of recipients of Intervention B has a mean score of 51 and a standard deviation of 10, the likelihood that you'll get a statistically significant difference between samples of 30 drawn from each of those populations is virtually nil even if your samples are distributed normally.
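To put a rough number on that likelihood, here is a minimal simulation sketch in Python (using numpy and scipy, and assuming the difference would be tested with an independent-samples t-test at the .05 level):

```python
# Rough power estimate for the Question 2 scenario: two normal populations
# with means 50 and 51 and standard deviations of 10, samples of 30 each,
# compared with an independent-samples t-test at alpha = .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trials = 10_000
significant = 0

for _ in range(trials):
    group_a = rng.normal(loc=50, scale=10, size=30)
    group_b = rng.normal(loc=51, scale=10, size=30)
    if stats.ttest_ind(group_a, group_b).pvalue <= .05:
        significant += 1

# The proportion of simulated studies that reach significance is the
# estimated statistical power; it comes out at roughly .06 in runs like this.
print(significant / trials)
```

In other words, only about 6 or 7 percent of such studies would come out significant, so even perfectly normal samples of 30 per group would miss a difference that small more than nine times out of ten.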
Question 3: Sometimes I see published studies that report using parametric statistics (like a t-test) to analyze data that don't meet all the assumptions of parametric tests. What gives?

Answer: Some statisticians believe that parametric tests have more statistical power than nonparametric ones and that their greater power outweighs the statistical impact of violating certain assumptions. Other statisticians disagree. Feasibility constraints in practice research often make it impossible to meet all the assumptions of any significance test, parametric or nonparametric. The main implication of not meeting all of the assumptions of a significance test is that the resulting p value will not be precise. If we remember that there is no mathematical basis for choosing any particular significance level, however, we may not mind accepting this imprecision. For example, suppose we get a significant p value of .049 because we violated an assumption of the significance test we used, and that had we used a different test we would have obtained a non-significant p value of .051. Although some statisticians might prefer simply reporting the latter finding as not significant, others might prefer reporting the actual value of .051 in light of the potential importance of a Type II error. The latter statisticians might also recognize that the .049 and .051 values are almost identical in practical terms. In either case, there is about a .95 probability that the finding is not attributable to sampling error. Had we chosen .052 as our significance level – which has no less of a basis in mathematics than choosing .050 – both findings would have been significant. Regardless of which group of statisticians you side with, remember that ultimately we need replications of any finding.

Question 4: Does the above answer relate to why some studies merely report NS (not significant) for findings that fall short of the critical region (the significance level) – without reporting the p value – while other studies report the p value for such findings?

Answer: Yes. Those who care a lot about Type II errors – especially when statistical power is low – and who are mindful of the somewhat arbitrary nature of any significance level are more likely to report p values for non-significant findings than are those who feel that doing so is playing fast and loose with the concept of significance. The latter folks believe that a finding is either significant or not significant, period. Chances are they worry more about Type I errors than Type II errors.

Question 5: I'm getting confused. Sometimes you talk about a significance level (or alpha) of .05, and sometimes you talk about a p value of .05. Please clarify.

Answer: You probably have a lot of company; your question gets asked a lot. The significance level (alpha) is a probability that we select in advance, before conducting any statistical test. That is, we pick a probability that is low enough for us to rule out sampling error as a plausible explanation for our findings. Then we apply a statistical significance test to calculate the actual p value, which is the probability that our findings are attributable to sampling error. If that p value is at or below our preselected alpha value, we rule out sampling error. It can be confusing because some reports simply label a finding as significant at the .05 level without distinguishing its actual p value from the .05 significance level.

Question 6: Let's return to that tricky concept of statistical power. I see many published studies – particularly those that report evaluations of clinical interventions – that have small samples and, I presume, low statistical power. Why did they get published? Moreover, why did the researchers actually implement the study with such low power?

Answer: Great questions – ones that require more than one answer. First, why did they get published? One possibility is that the journals' editorial reviewers had limited statistical expertise, especially regarding statistical power analysis. You might be surprised at the degree to which political and ideological factors can influence the process of selecting reviewers of manuscripts submitted for publication to some (not all) journals. For example, a dean might be well connected with a journal's chief editor and lobby to have one of her faculty members appointed to the editorial board. Or a journal editor might select reviewers who share his paradigmatic views, with little consideration of their expertise in research and statistics. Another possibility is that the findings were statistically (and perhaps clinically) significant. Once we get significant findings, the problem of low statistical power and Type II error risk is no longer relevant. That is, we don't have to worry about our low probability of getting statistically significant findings once we've got them. A third possibility is that findings may have value despite their lack of statistical significance. For example, suppose a new intervention was being evaluated for a serious problem for which there are no known effective interventions. Suppose a relatively large effect size was found, but one that fell short of statistical significance, perhaps due to the small sample size. That finding could merit publication – especially if the authors interpret it properly and cautiously – so that it spurs others to replicate the study with a larger sample size.

Now on to your second question: why did the researchers actually implement the study with such low power? Perhaps it was impossible to get a larger sample in the agency that allowed them to do the research. They might have decided to pursue the study anyway, with two ideas in mind. One might be their hope that the effect size would be so large that they'd get significance even with a small sample. The other might be that if they obtained a large effect size, the finding would have value to the field even if it fell short of significance, for the reasons stated above about spurring replications.

Question 7: I still confuse the concepts of effect size and statistical power. Can you clarify the distinction?

Answer: The distinction between these two concepts confuses many students. To clarify it, I'll borrow an analogy from Brewer (1978). Let's think of it in terms of the old saying "It's like trying to find a needle in a haystack." Imagine the difference in size between a needle and a very large pumpkin. It would be a lot easier to find the pumpkin in the haystack than the needle. We can think of the needle and the pumpkin as representing two effect sizes, and we can think of statistical power as the probability of finding the needle or the pumpkin. With the larger effect size (the pumpkin) we have more power; that is, we have a greater probability of finding it. With the smaller effect size (the needle), the opposite is true. The effect size is not a probability statistic – it is a size. Statistical power, on the other hand, is a probability statistic. Assuming there really is a pumpkin that large in the haystack, and assuming that we have ample time to search for it, our probability of finding the pumpkin would be very high – maybe close to 1.0. Likewise, our probability of not finding it, and thus committing a Type II error, would be very low, perhaps close to 0. Assuming there really is a needle in the haystack, our probability of finding it would be much lower, and our Type II error probability would therefore be much higher. We can also think of the impact of sample size on power as analogous to how much time we have to search for the object. The more time we have, the greater the chance of finding any object in the haystack – even the needle. But if we have only a few seconds to look, we might not even find the pumpkin.
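To attach rough numbers to the analogy, here is a small simulation sketch in Python (again using numpy and scipy, with invented population values and an independent-samples t-test at the .05 level):

```python
# A rough "needle versus pumpkin" sketch: estimating power by simulation for
# a large effect, a small effect, and a small effect with a much larger sample.
# The population values (base mean 50, SD 10) are invented for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(mean_diff, n_per_group, sd=10, alpha=.05, trials=5_000):
    """Share of simulated two-group studies whose t-test reaches significance."""
    hits = 0
    for _ in range(trials):
        group_a = rng.normal(50, sd, n_per_group)
        group_b = rng.normal(50 + mean_diff, sd, n_per_group)
        if stats.ttest_ind(group_a, group_b).pvalue <= alpha:
            hits += 1
    return hits / trials

print(estimated_power(mean_diff=10, n_per_group=30))    # the pumpkin: about .97
print(estimated_power(mean_diff=1, n_per_group=30))     # the needle: about .06
print(estimated_power(mean_diff=1, n_per_group=1600))   # far more "search time": about .80
```

With 30 cases per group, the pumpkin (a ten-point difference) is found almost every time, while the needle (a one-point difference) is found only rarely; in this sketch it takes on the order of 1,600 cases per group, the equivalent of far more search time, before the needle is found about 80 percent of the time.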
Question 8: Since the purpose of significance testing is to rule out sampling error, does that mean that if we have data on an entire population, all relationships are true and significance testing is no longer relevant because with a population there is no sampling error?

Answer: Another great question! Many statisticians would agree that there is no point in testing the significance of relationships found with population data. They might reason, for example, that if the population of adult New York residents weighs on average 150 pounds, and the population of adult Californians weighs on average 150 pounds three ounces, then Californians weigh on average three ounces more than New Yorkers, and we don't need a significance test to know that. Well, at least that's how some statisticians would see it. Others would disagree. They would argue that the concept of a population is mainly theoretical because any population is always in flux. Because some residents of the two states are constantly aging, and because some die every day, the population at any given moment is different from the population a moment ago, a day ago, a week ago, and so on. During the time between the data collection and the point when we are discussing the finding, more obese New Yorkers than Californians may have reached adulthood, and more obese Californians than New Yorkers may have died. Likewise, more very thin Californians may have reached adulthood, and more very thin New Yorkers may have died. Moreover, there may be nothing about living in either state that influences such changes. It may just be due to sampling error – or random fluctuations. Consequently, these statisticians would reason, it does make sense to test for statistical significance even when our data are on an entire "population." In contrast, suppose we found a larger mean weight difference between the two states and found that the larger difference was statistically significant. That would rule out random fluctuations in each population as a plausible explanation for the difference, which in turn would imply that perhaps there really is some variable, related to which of the states a person resides in, that influences their weight.

Any more questions? If so, you can email me at arubin@mail.utexas.edu. I'll answer as best I can, and perhaps even include your question in the next edition of this text!

Reference:

Brewer, J. K. (1978). Everything You Always Wanted to Know about Statistics, but Didn't Know How to Ask. Dubuque, IA: Kendall/Hunt Publishing Company.