Appendix H FREQUENTLY ASKED QUESTIONS

During the thirty-plus years that I have been teaching research and statistics, I
have noticed that certain questions tend to get asked almost every semester by students in
connection with areas of statistics that they have already studied. In this appendix I’ll
identify those questions and answer them. You may recognize some of these questions, or
their answers, from earlier sections of this text. However, based on my experience
teaching this material, the frequency with which these questions get asked by students
who have already studied the relevant material merits revisiting them in this appendix.
Let’s get right to it.
Question 1: Why square the differences between case values and the mean when
calculating the variance or standard deviation?
Answer: Because if we didn’t square them, the plus differences and the minus
differences would cancel each other out, and the variance and standard deviation
therefore would always be zero. By squaring, we turn the minuses into plusses, thus
avoiding that problem.
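If you’d like to see this with actual numbers, here is a minimal sketch in Python (the scores are made up purely for illustration) showing that the raw deviations always sum to zero while the squared deviations do not:

```python
# Illustrative scores; any small data set will do.
scores = [2, 4, 6, 8, 10]
mean = sum(scores) / len(scores)                       # mean = 6.0

raw_deviations = [x - mean for x in scores]
print(sum(raw_deviations))                             # 0.0 -- pluses and minuses cancel

squared_deviations = [(x - mean) ** 2 for x in scores]
variance = sum(squared_deviations) / len(scores)       # population variance = 8.0
std_dev = variance ** 0.5                              # standard deviation, about 2.83
print(variance, std_dev)
```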
Question 2: You say that we need to conduct a statistical power analysis to determine
how large our sample size should be. But we read somewhere or heard from another
professor that 30 is sort of a magic number for sample size – that we need an N of 30 to
do various analyses. Who’s right?
Answer: I am, of course!  You are understandably – like many students – confusing
the recommended minimum number of cases generally needed to obtain a distribution
that approximates normality with the number needed to obtain adequate statistical power.
Suppose 15 recipients of Intervention A and 15 recipients of Intervention B combine to
form a normal distribution with regard to an outcome variable. Having an N of 30 does
not assure a high likelihood that the difference in outcome between the groups will be
statistically significant. It only gives you a good chance of having a normal distribution
for your sample of 30 cases. The same holds even if each group has an N of 30 and each
is normally distributed. If the population of Intervention A recipients has a
mean score of 50 and a standard deviation of 10, and the population of Intervention B
recipients has a mean score of 51 and a standard deviation of 10, the likelihood that
you’ll get a significant difference between samples of 30 drawn from each of those
populations is virtually nil, even if your samples are distributed normally.
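To put numbers to that example, here is a sketch of the power calculation, assuming Python with the statsmodels library (the function calls and the rounded results in the comments are my illustration, not part of the original example):

```python
from statsmodels.stats.power import TTestIndPower

effect_size = (51 - 50) / 10        # Cohen's d = 0.1, a very small effect
analysis = TTestIndPower()

# Power of a two-sided independent-samples t-test with 30 cases per group
power = analysis.power(effect_size=effect_size, nobs1=30, alpha=0.05)
print(f"Power with 30 cases per group: {power:.2f}")            # roughly .07

# Cases per group needed to reach the conventional .80 power
needed_n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
print(f"Cases per group needed for .80 power: {needed_n:.0f}")  # roughly 1,570
```

In other words, with an effect that small you would need on the order of 1,500 cases per group, not 30, to have a reasonable chance of obtaining statistical significance.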
Question 3: Sometimes I see published studies that report using parametric statistics (like
a t-test) to analyze data that don’t meet all the assumptions of parametric tests. What
gives?
Answer: Some statisticians believe that parametric tests have more statistical power than
nonparametric ones and that their greater power outweighs the statistical impact of
violating certain assumptions. Other statisticians disagree. Feasibility constraints in
practice research often make it impossible to meet all the assumptions of any significance
test, parametric or nonparametric. The main implication of not meeting all of the
assumptions of a significance test is that the resulting p value will not be precise. If we
remember that there is no mathematical basis for choosing any significance level,
however, we may not mind accepting this imprecision. For example, suppose we get a
significant p value of .049 because we violated an assumption of the significance test we
used, and that had we used a different test we would have obtained a non-significant p
value of .051. Although some statisticians might prefer simply reporting the latter finding
as not significant, others might prefer reporting the actual value of .051 in light of the
potential importance of a Type II error. The latter statisticians might also recognize that
the .049 and .051 values are almost identical in practical terms. In either case, there is
about a .95 probability that the finding is not attributable to sampling error. Had we
chosen .052 as our significance level – which has no less of a basis in mathematics than
choosing .050 – both findings would have been significant. Regardless of which group of
statisticians you side with, remember that ultimately we need replications of any finding.
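To see how this can play out in practice, here is a hypothetical sketch, assuming Python with scipy, that runs a parametric t-test and a nonparametric alternative (the Mann-Whitney U test) on the same made-up outcome scores. The point is simply that the two p values can differ and can even straddle the .05 level:

```python
from scipy import stats

# Hypothetical outcome scores for two intervention groups
group_a = [52, 55, 61, 63, 58, 70, 66, 59, 64, 68]
group_b = [48, 50, 57, 53, 60, 55, 49, 62, 51, 54]

t_stat, p_parametric = stats.ttest_ind(group_a, group_b)
u_stat, p_nonparametric = stats.mannwhitneyu(group_a, group_b,
                                             alternative='two-sided')

print(f"t-test p value:       {p_parametric:.3f}")
print(f"Mann-Whitney p value: {p_nonparametric:.3f}")
# The two p values will usually differ somewhat; when they straddle .05,
# whether the finding is labeled "significant" depends on which test
# (and which alpha) was chosen in advance.
```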
Question 4: Does the above answer relate to why some studies merely report NS (not
significant) for findings that fall short of the critical region (the significance level) –
without reporting the p value – while other studies report the p value for such findings?
Answer: Yes. Those who care a lot about Type II errors – especially when statistical
power is low – and who are mindful of the somewhat arbitrary nature of any significance
level – are more likely to report p values for non-significant findings than are those who
feel that doing so is playing fast and loose with the concept of significance. The latter
folks believe that a finding is either significant or not significant, period. Chances are
they worry more about Type I errors than Type II errors.
Question 5: I’m getting confused. Sometimes you talk about a significance level (or
alpha) of .05, and sometimes you talk about a p value of .05. Please clarify.
Answer: You probably have a lot of company. Your question gets asked a lot. The
significance level (alpha) is a probability that we select in advance, before conducting
any statistical test. That is, we pick a probability that is low enough for us to rule out
sampling error as a plausible explanation for our findings. Then we apply a statistical
significance test to calculate the actual p value, which is the actual probability that our
findings are attributable to sampling error. If that p value is at or below our pre-selected
alpha value, we rule out sampling error. It can be confusing because some reports label a
finding with the word “significant” at the .05 level without distinguishing the finding’s
actual p value from the .05 significance level.
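Here is a small illustrative sketch of that two-step logic, assuming Python with scipy: the alpha is chosen first, the p value is computed second, and only then are the two compared (the data are hypothetical):

```python
from scipy import stats

ALPHA = 0.05   # significance level chosen in advance, before any analysis

# Hypothetical outcome scores
treatment = [14, 18, 21, 17, 25, 19, 22, 16, 20, 23]
control   = [12, 15, 13, 16, 11, 17, 14, 10, 15, 13]

t_stat, p_value = stats.ttest_ind(treatment, control)   # p value computed by the test

print(f"p value = {p_value:.4f}, alpha = {ALPHA}")
if p_value <= ALPHA:
    print("Significant: sampling error is ruled out as a plausible explanation.")
else:
    print("Not significant: sampling error remains a plausible explanation.")
```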
Question 6: Let’s return to that tricky concept of statistical power. For example, I see
many published studies – particularly those that report evaluations of clinical
interventions – that have small samples and I presume low statistical power. Why did
they get published? Moreover, why did the researchers actually implement the study with
such low power?
Answer: Great questions – ones that require a multi-part answer. First, why did
they get published? One possibility is that the journals’ editorial reviewers had limited
statistical expertise, especially regarding statistical power analysis. You might be
surprised at the degree to which political and ideological factors can influence the process
of selecting reviewers of manuscripts submitted for publication to some (not all) journals.
For example, a dean might be well connected with a journal’s chief editor and lobby to
have one of her faculty members appointed to the editorial board. Or a journal editor
might select reviewers who share his paradigmatic views, with little consideration of their
expertise in research and statistics.
Another possibility is that the findings were statistically (and perhaps clinically)
significant. Once we get significant findings, the problem of low statistical power and
Type II error risk is no longer relevant. That is, we don’t have to worry about our low
probability of getting statistically significant findings once we’ve got them.
A third possibility is that findings may have value despite their lack of statistical
significance. For example, suppose a new intervention was being evaluated for a serious
problem for which there are no known effective interventions. Suppose a relatively large
effect size was found, but one that fell short of statistical significance, perhaps due to the
small sample size. That finding could merit publication – especially if the authors
interpret it properly and cautiously – so that it spurs others to replicate the study with a
larger sample size.
Now on to your second question: Why did the researchers actually implement the
study with such low power? Perhaps it was impossible to get a larger sample in the
agency that allowed them to do the research. They might have decided to pursue the
study anyway, with two ideas in mind. One might be their hope that the effect size would
be so large that they’d get significance even with a small sample. Another might be that if
they obtained a large effect size it would have value to the field even if it fell short of
significance, for reasons stated above regarding spurring replications.
Question 7: I still confuse the concepts of effect size and power. Can you clarify their
distinction?
Answer: The distinction between these two concepts confuses many students. To clarify
the distinction, I’ll borrow an analogy from Brewer (1978). Let’s think of it in terms of
the old saying “It’s like trying to find a needle in a haystack.” Imagine the difference in
size between a needle and a very large pumpkin. It would be a lot easier to find the
pumpkin in the haystack than the needle. We can think of the needle and the pumpkin as
representing two effect sizes. We can think of statistical power as the probability of
finding the needle or the pumpkin. With the larger effect size (i.e., the pumpkin) we have
more power. That is, we have a greater probability of finding it. With the smaller effect
size (i.e., the needle), the opposite is true. The effect size is not a probability statistic – it
is a size. Statistical power, on the other hand, is a probability statistic. Assuming there
really is a pumpkin that large in the haystack, and assuming that we have ample time to
search for it, our probability of finding the pumpkin would be very high – maybe close to
1.0. Likewise, our probability of not finding it, and thus committing a Type II error,
would be very low, perhaps close to 0. Assuming there really is a needle in the haystack,
our probability of finding it would be much lower, and our Type II error probability
would therefore be much higher. We can also think of the impact of sample size on
power as analogous to how much time we have to search for the object. The more time
we have, the greater the chance of finding any object in the haystack – even the needle.
But if we have only a few seconds to look, we might not even find the pumpkin.
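To attach numbers to the analogy, here is a sketch, assuming Python with the statsmodels library, that computes power for a “needle-sized” and a “pumpkin-sized” effect with the same sample size (the specific effect sizes are my illustrative choices):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = 30
alpha = 0.05

effects = {"needle (small effect)": 0.1, "pumpkin (large effect)": 0.8}

for label, d in effects.items():
    power = analysis.power(effect_size=d, nobs1=n_per_group, alpha=alpha)
    print(f"{label}: d = {d}, power = {power:.2f}, "
          f"Type II error risk = {1 - power:.2f}")
# With 30 cases per group, power is roughly .07 for d = 0.1
# but roughly .86 for d = 0.8.
```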
Question 8: Since the purpose of significance testing is to rule out sampling error, does
that mean that if we have data on an entire population, all relationships we find are real and
significance testing is no longer relevant because with a population there is no sampling
error?
Answer: Another great question! Many statisticians would agree that there is no point in
testing the significance of relationships found with population data. They might reason, for
example, that if the population of adult New York residents weighs on average 150
pounds, and the population of adult Californians weighs on average 150 pounds three
ounces, then Californians weigh on average three ounces more than New Yorkers, and
we don’t need a significance test to know that.
Well, at least that’s how some statisticians would see it. Others would disagree.
They would argue that the concept of a population is mainly theoretical because any
population is always in flux. Because some residents of the two states are constantly
aging, and because some die every day, the population at any given moment is different
from the population a moment ago, a day ago, a week ago, and so on. During the time
between the data collection and the point when we are discussing the finding, more obese
New Yorkers than Californians may have reached adulthood, and more obese
Californians than New Yorkers may have died. Likewise, more very thin Californians
may have reached adulthood, and more very thin New Yorkers may have died. Moreover,
there may be nothing about living in either state that influences such changes. It may just
be due to sampling error – or random fluctuations. Consequently, these statisticians
would reason, it does make sense to test for statistical significance even when our data
are on an entire “population.” In contrast, suppose we found a larger mean weight
difference between the two states and found that the larger difference was statistically
significant. That would rule out random fluctuations in each population as a plausible
explanation for the difference, which in turn would imply that perhaps there really is
some variable, related to the state in which a person resides, that influences weight.
Any more questions? If so, you can email me at arubin@mail.utexas.edu. I’ll
answer as best I can, and perhaps even include your question in the next edition of this
text!
Reference:
Brewer, James K. (1978). Everything You Always Wanted to Know about Statistics, but
Didn’t Know How to Ask. Dubuque, Iowa: Kendall/Hunt Publishing Company.