Topic 10: Power

Lecture 11 - Power
Power: The probability of rejecting the null hypothesis in those situations in which the null is false.
In terms of the 2 x 2 table of outcomes: the combination of two states of the population with two decisions leads to four possible outcomes.

                           Situation that exists in the populations
  Experiment Outcome       Null is true                        Null is false

  Retain Null              Correct Retention                   Incorrect Retention
                                                               (Probability: Type II error rate)

  Reject Null              Incorrect Rejection                 Correct Rejection
                           (Probability: Significance level)   (Probability: Power)
Note that power is not an issue for the left side of the 2x2 table. If we're on the left side, the null hypothesis is true, and the only thing that affects the probability of Rejection (or the probability of Retention) is the Significance Level of the statistical test. The significance level is set in advance (usually at .05) and does not depend on the outcome of the research.
Power is an issue only for the right side of the 2x2 table.
If we're on the right side of the table, the null is false: there is some difference between the population means. In that case, a whole collection of factors affects the probability of Rejection, the probability of our making the correct decision. Those are the factors we're considering here.
Obviously, if the null is false, then you want to do whatever you can to put yourself in the lower right cell.
To recap
If the null is true, power is not an issue. There is no difference between population means. Or the
population correlation coefficient is 0. Oh woe is me!
If the null is false, there IS a difference in the population means. Or the population correlation coefficient is
different from zero. So you want your research project to be able to detect that falseness. So you want the
most powerful design you can afford. You want to reject the null.
Factors that affect Power in order of importance.
1. The effect size – the actual size of the effect in the population. When comparing means: how big the difference between population means actually is. When doing correlational research: how strong the correlation actually is in the population.
Definitions
  When comparing two population means:    effect size is d = (μE - μC) / σ
  When investigating a relationship:      effect size is the population r
If the population means are equal or the population r = 0, then effect size is 0, the null is true, and power is
not an issue.
The larger the effect size, the more likely we are to detect it.
Analogy: The brightness of a distant star. The brighter it is, the easier it will be to detect.
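This can also be seen numerically. Below is a minimal sketch, assuming Python with the statsmodels package (neither is part of the original handout), of how the power of a two-group t-test grows with the population effect size when everything else is held constant:

    # Hypothetical illustration: power of a two-sample t-test at n = 30 per
    # group and alpha = .05, for Cohen's small, medium, and large effect sizes.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for d in (0.2, 0.5, 0.8):
        p = analysis.power(effect_size=d, nobs1=30, alpha=0.05,
                           alternative='two-sided')
        print(f"d = {d}: power = {p:.2f}")  # power climbs as d grows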
What are small, medium, and big effect sizes?
From Lance, C. E., & Vandenberg, R. J. (2008). Statistical and Methodological Myths and Urban Legends. Routledge.
For a recent update on correlation effect sizes, see
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2014, October 13).
Correlational Effect Size Benchmarks. Journal of Applied Psychology. Advance online
publication. http://dx.doi.org/10.1037/a0038047
Characterizations of effect sizes in terms of what Cohen considered small, medium, and large will be
presented below.
2. The sample size. The only factor we really have control over. The larger the sample size, the greater the power. Sample size is the primary method of manipulating power.
Analogy: The size of our telescope. The larger the telescope, the greater the chance of detecting a star.
Note that increasing the sample size has no effect on probabilities computed in those situations in which the
null is true – the left side of the 2x2 table above. If the null is true, the probability of incorrectly rejecting it
depends only on the significance level. The significance level is set before the research is conducted.
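Both claims can be checked by simulation. Below is a minimal sketch, assuming Python with numpy and scipy (not part of the original handout): when the population difference is zero, the rejection rate hovers near the significance level at every sample size; when there is a real difference (here d = 0.5), the rejection rate, which is the power, climbs with n.

    # Monte Carlo check: rejection rates of the two-sample t-test under a
    # true null (d = 0) and a false null (d = 0.5) at several sample sizes.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, reps = 0.05, 2000

    for n in (20, 80, 320):
        for true_d in (0.0, 0.5):
            rejections = 0
            for _ in range(reps):
                g1 = rng.normal(0.0, 1.0, n)
                g2 = rng.normal(true_d, 1.0, n)
                if stats.ttest_ind(g1, g2).pvalue < alpha:
                    rejections += 1
            print(f"n = {n:3d}, d = {true_d}: "
                  f"rejection rate = {rejections / reps:.3f}")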
3. The particular test chosen. For example, in the comparison of two groups, if the assumptions of the t-test are met, the t-test is the most powerful way to compare means. The Mann-Whitney U-test is less powerful as a test to compare means than the t-test when those assumptions are met.
4. The significance level. The larger the significance level, the larger the power.
But you can't have your cake and eat it too. Unfortunately, increasing the significance level also increases the probability of a Type I error, since the significance level is that probability.
5. The variability of scores within each population.
Recall that for two populations, the effect size is the difference in population means divided by the population standard deviation, (μ1 - μ2) / σ. Proper conduct of the experiment may affect the value of σ.
The smaller the value of σ, the larger the power. Manipulating σ does not affect the probability of a Type I error.
Telescope analogy: Get rid of random atmospheric distortion.
6. Direction of alternative hypothesis. All other things being equal, a one-tailed alternative hypothesis is
more powerful than a two-tailed alternative if you've specified the direction correctly.
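A minimal sketch of the one-tailed advantage, again assuming Python with statsmodels: at the same effect size, sample size, and significance level, the correctly directed one-tailed test has the higher power.

    # One-tailed vs. two-tailed power for the same d, n, and alpha.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    two = analysis.power(effect_size=0.5, nobs1=50, alpha=0.05,
                         alternative='two-sided')
    one = analysis.power(effect_size=0.5, nobs1=50, alpha=0.05,
                         alternative='larger')   # direction specified correctly
    print(f"two-tailed: {two:.2f}  one-tailed: {one:.2f}")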
Analogy: detecting a true difference is like detecting a distant star.

  Factor affecting power     Telescope analogy
  Effect size                Brightness of the star
  Sample size                Diameter of the telescope
  Test chosen                Type of telescope – refractive vs. reflective
  Significance level         Willingness to call a spot of light "the star"
  Variability of scores      Cleanliness of the lenses
  One-tailed H1              Knowing where to look
Summarizing

  Manipulation                      Effect if there is no difference    Effect if there is a difference
                                    in population means                 in population means
                                    (left side of 2x2 table)            (right side of 2x2 table)

  Increase the effect size          No effect at all                    Increases power
  Increase sample size              No effect at all                    Increases power
  Choose a more powerful test       No effect at all                    Increases power
  Make significance level larger    Increases Type I error rate         Increases power
  Decrease variability of scores    No effect at all                    Increases power
    within groups
  Choose appropriate one-tailed     No effect at all                    Increases power
    alternative hypothesis
How big is an effect size?

Measures of Effect Size for Common Statistical Tests

One Population t
  Population value:  d = (Actual pop mean - Hypothesized pop mean) / Pop SD
  Sample estimate:   (Sample mean - Hypothesized pop mean) / Sample SD
  Small = .2, Medium = .5, Large = .8

Two Independent Samples t
  Population value:  d = (Pop mean 1 - Pop mean 2) / Pop SD
  Sample estimate:   (Sample mean 1 - Sample mean 2) / square root of pooled variance, S²p
  Small = .2, Medium = .5, Large = .8

Two Correlated Samples t
  Population value:  d = (Pop mean 1 - Pop mean 2) / Pop SD
  Sample estimate:   (Sample mean 1 - Sample mean 2) / square root of (S²1 + S²2)/2
  Small = .2, Medium = .5, Large = .8
  (But the correlation of the paired scores, r, influences the actual effect size.)

One Way independent samples ANOVA
  Population value:  f = SD of population means / Pop SD
  Sample estimate:   Sample SD of sample means / square root of MS Within
  Small = .1, Medium = .25, Large = .4
  Related measure:   η² (eta squared) = f² / (1 + f²), printed by SPSS in some procedures
  Small = .010, Medium = .059, Large = .138

Pearson r between two variables
  Population value:  population r
  Sample estimate:   sample r
  Small = .10, Medium = .30, Large = .50
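The sample estimates in the table above can be computed directly from raw scores. Here is a minimal sketch, assuming Python with numpy; the function names are illustrative, not from the handout:

    # Pooled-SD Cohen's d for two independent samples, and eta squared
    # (SS-between / SS-total) for a one-way independent-samples design.
    import numpy as np

    def cohens_d(x1, x2):
        n1, n2 = len(x1), len(x2)
        pooled_var = ((n1 - 1) * np.var(x1, ddof=1) +
                      (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
        return (np.mean(x1) - np.mean(x2)) / np.sqrt(pooled_var)

    def eta_squared(*groups):
        scores = np.concatenate(groups)
        grand = scores.mean()
        ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
        ss_total = ((scores - grand) ** 2).sum()
        return ss_between / ss_total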
Determining Sample Size for upcoming research
It is important to take power into account when planning the sample size for research.
Following is an illustration of what must be considered when comparing two groups, a common situation.
I. Determine how big the population effect size is that you’re trying to detect. That is, determine how
big of a difference you’ll be trying to discover in your research.
Commonly asked question: How can we know what the difference will be in the population before we've conducted the experiment to discover whether there is a difference? A Catch-22 situation.
From Lance, C. E., & Vandenberg, R. J. (2008). Statistical and Methodological Myths and Urban Legends. Routledge.
In the table from that source, the nonredundant correlations are shown in red. The mean of the red correlations is -.11, an estimate of the effect size for inconsistency as a predictor of GPA.
II. Determine the desired power – the probability of detecting the difference we think our manipulation
will make. Typically, we want that probability to be as large as possible (1 would be great) but, realistically,
we usually settle for the value .8. That value is to power analysis and sample size determination what .05
is to significance levels.
III. We then consult sample size tables or a computer program such as SamplePower 3 or G*Power to determine the sample size required to detect the estimated effect with the desired power. A collection of power tables is available at www.utc.edu/Michael-Biderman -> Psychology 2010 -> Power Tables.
Choose a sample large enough to yield the power identified in II to detect the effect size identified in I.
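Step III can also be carried out in code instead of with tables or SamplePower. A minimal sketch, assuming Python with statsmodels: feed in the effect size from step I and the power from step II, then solve for the n per group.

    # Solve for the per-group sample size of a two-group t-test.
    from statsmodels.stats.power import tt_ind_solve_power

    n_per_group = tt_ind_solve_power(effect_size=0.5, alpha=0.05,
                                     power=0.80, alternative='two-sided')
    print(round(n_per_group))  # about 64 per group for d = .5 at 80% power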
Example 1 – Two Groups Research.
You plan to investigate a new method of teaching statistics. Prior to the research, you wish to determine
how many participants will be required.
I. Population Effect size: Hmm. If your new method will only yield a small effect, then it probably
wouldn’t be worth your efforts to pursue it. So you’re only interested in the new method if it yields a
medium effect size, d=0.5. So plan the statistical analysis so that it will be likely to detect a medium or
larger effect size. If the effect size is smaller than medium, your analysis might not detect it, but that’s OK,
since a small effect size would mean that the method wasn’t that effective.
II. Power. We’d like at least a 90% chance of detecting a medium effect size. There’s no point in doing the
research and the analysis if we can’t be quite sure that we’ll detect a useful difference.
III. SamplePower Output
SamplePower indicates that we'll need 90 + 90, or 180, participants in order to have power of 92% to detect a difference of 0.5 standard deviations, a medium effect.
Biderman's Power tables: www.utc.edu/Michael-Biderman -> Psychology 2010 -> Power Tables
So, we’ll use 90 persons per group and have .92 probability of detecting a difference of .5 SDs.
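The same number can be checked outside SamplePower. A minimal sketch, assuming Python with statsmodels (not part of the original handout):

    # Power of a two-group t-test with 90 participants per group, d = 0.5.
    from statsmodels.stats.power import TTestIndPower

    power = TTestIndPower().power(effect_size=0.5, nobs1=90, alpha=0.05,
                                  alternative='two-sided')
    print(f"{power:.2f}")  # approximately 0.92, matching the SamplePower result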
Example 2 – Correlational Research.
You are investigating a new test for predicting performance of students in a statistics curriculum. How big
of a sample should you use?
1. Effect Size: Hmm. Cognitive ability correlates about .5 with performance, but Conscientiousness correlates only about .2 with performance in academia. You decide that you are not interested in doing any more work on your test unless it correlates more highly with performance than Conscientiousness does. You decide that a medium effect size correlation coefficient, r = .3, is the effect size you are most interested in.
2. Power: Let’s choose a sample that will have a 90% chance of detecting a correlation of .3.
3. Sample Power Output
So the SamplePower output suggests that you'll need 110 participants in order to have a probability of .90 of detecting a correlation of .3.
Biderman’s Power Table Output
Argh. Biderman didn’t prepare a power table for Correlations. Somebody get him to do that.
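In the meantime, an approximate check is easy to code by hand using the Fisher z transformation. A minimal sketch, assuming Python with numpy and scipy; this is the large-sample approximation, not SamplePower's exact method:

    # Approximate power to detect a population r of .3 with n = 110,
    # using the Fisher z transformation of the sample correlation.
    import numpy as np
    from scipy.stats import norm

    r, n, alpha = 0.30, 110, 0.05
    z_r = np.arctanh(r)              # Fisher z of the population correlation
    se = 1 / np.sqrt(n - 3)          # approximate SE of the transformed r
    z_crit = norm.ppf(1 - alpha / 2)
    power = (1 - norm.cdf(z_crit - z_r / se)) + norm.cdf(-z_crit - z_r / se)
    print(f"{power:.2f}")            # approximately .89, close to the .90 target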
Why be concerned about Power?
1. Assuming we create treatments to make a difference, it only makes sense to conduct research that has the
greatest probability of detecting the difference we set out to make.
2. To provide insight into reasons for failure to reject the null (failure to find differences).
If we fail to reject the null, it will be due to one of at least two reasons.
a. The manipulation we implemented had no effect - that is, the actual effect size was zero. Our
treatment did not make a difference. We’ve learned something – although it may not be what we wanted to
know.
b. The manipulation had an effect, but the statistical test had insufficient power to detect the
effect of our manipulation. Our treatment made a difference but we were too lazy or poor or ignorant to use
enough participants, and we didn’t detect it.
FOR Study example.
We performed a study investigating the effect of Frame Of Reference (FOR) instructions on the
validity of Conscientiousness as a predictor of GPA. Our original sample had 150 students. The FOR effect
was not significant. For this and other reasons, the study was not accepted at a conference.
We followed up by adding 150 more participants, on the assumption that the population FOR effect
size was small, e.g., r=.1 or .2. For the 300 participant sample, the difference in validities between the
nonFOR and the FOR condition was .07, quite small, but statistically significant.
If you fail to reject, you should estimate the actual effect size in your data. If the estimated effect size is
small, then this indicates that your manipulation was not as powerful as you might have expected.
But if the sample estimate of effect size was large while your statistical test was not significant, that suggests
that your sample size was too small.
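A minimal sketch of that post hoc check, assuming Python with numpy and scipy; the data here are simulated purely for illustration:

    # After a nonsignificant test, inspect the sample effect size.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    g1 = rng.normal(0.0, 1.0, 15)     # deliberately small, underpowered groups
    g2 = rng.normal(0.6, 1.0, 15)

    p = stats.ttest_ind(g1, g2).pvalue
    pooled_var = (np.var(g1, ddof=1) + np.var(g2, ddof=1)) / 2  # equal ns
    d = (np.mean(g2) - np.mean(g1)) / np.sqrt(pooled_var)
    print(f"p = {p:.3f}, sample d = {d:.2f}")
    # A nonsignificant p next to a sizable sample d points to reason (b):
    # the test lacked the power to detect the effect.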
When we don’t want high power
In general, high power is good. If the null hypothesis is false, we want to be able to correctly reject it.
There are instances, however, when we may not want to detect a difference even if it is there.
Examples
1) We're not interested in the difference.
E.g., We're interested in the effect of Type of Training. A Gender difference is found. We're not interested
in gender differences. Nuts! Now we have to deal with them.
2) We're overwhelmed by differences already and don't have time to deal with any others.
We've conducted research evaluating Type of Training, Sex, Type of Job, Age of Employee. A Gender
difference is found. Rats! We don't have time to deal with the gender effect at the present time.
3) The difference is incredibly small.
Suppose the average statistical test performance of the population of I/O students is 84.3 while the average statistical test performance of the population of Research Methods students is 84.31. With large enough samples of students from each population, even this difference would be statistically significant. Oh wow! I really care!
This issue is what is referred to as the issue of statistical vs. practical significance.
Any difference, however small or inconsequential, can be made statistically significant by increasing power (usually through larger samples). But whether a statistically significant difference is worth our dealing with
is another question. Many times, statistically significant differences are not worth dealing with.
For this reason, it has become common practice to report not only the statistical significance of a difference,
but also a measure of sample effect size - the estimated size of the difference, measured in a standardized
fashion. That way, small differences which were detected by extremely powerful statistical procedures can
be recognized for what they are: small differences. The GLM procedure in SPSS can print such sample
effect sizes.
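A minimal sketch of the point, assuming Python with statsmodels: solve for the sample size at which even a trivial standardized difference reaches conventional power.

    # Per-group n needed for 80% power to detect a trivial effect, d = 0.01.
    from statsmodels.stats.power import tt_ind_solve_power

    n = tt_ind_solve_power(effect_size=0.01, alpha=0.05, power=0.80)
    print(round(n))  # roughly 157,000 per group; feasible, but the
                     # difference is no less trivial for being "significant"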
Using SamplePower to obtain power and sample sizes
Sample Power 3 is an add-on module available with the SPSS suite of programs.
It can be used to compute power. More often than not, however, it's used to compute the sample size required to achieve a prespecified power for proposed research. That's what will be illustrated here.
SamplePower opens with a blank screen, except for a randomly chosen tip.
Pull down File and choose New.
Independent Groups t-test
1) Specify the effect size. Do that by changing one of the population means to the desired effect size.
Either the mean of population 1 or the mean of population 2 can be changed.
2) Adjust the N per Group until the desired power appears below.
To get the exact sample size for Power = 80%, pull down the Tools menu and choose “Sample size for 80%
Power.”
Population Correlation Coefficient
1) Set the Population Correlation to the desired value.
2) Pull down Tools and choose “Sample Size for 80% power.”
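Without SamplePower, the same sample size can be approximated in code with the Fisher z transformation. A minimal sketch, assuming Python with numpy and scipy; again an approximation, not SamplePower's exact method:

    # Approximate n for 80% power to detect a population correlation.
    import numpy as np
    from scipy.stats import norm

    r, alpha, power = 0.30, 0.05, 0.80
    z_r = np.arctanh(r)
    n = ((norm.ppf(1 - alpha / 2) + norm.ppf(power)) / z_r) ** 2 + 3
    print(round(n))  # roughly 85 for r = .30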
One way Analysis of Variance
1) Initial Screen. Click on the “Number of Levels” field or the “Effect Size” field.
2) Enter the Number of categories in the field on the right.
3) Click on the appropriate effect size.
Pull down the Tools menu and choose “Sample Size for 80% Power”.
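The corresponding computation in code, as a minimal sketch assuming Python with statsmodels; the effect size f is Cohen's f from the table earlier:

    # Total sample size for a one-way ANOVA: 3 groups, medium effect f = .25,
    # alpha = .05, 80% power.
    from statsmodels.stats.power import FTestAnovaPower

    n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05,
                                            power=0.80, k_groups=3)
    print(round(n_total))  # roughly 160 in total, a bit over 50 per group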
Two Way chi-square
Chi-square is unusual in that the power, and thus the required sample size, for a given difference between proportions depends on the specific values those proportions take on.
Sample size required to detect a .05 difference - .40 vs. .45.
1) Choose the two population proportions whose difference you’ll want to detect.
2) Pull down the Tools menu and choose “Sample size for 80% Power.”
Sample size required to detect a .05 difference - .05 vs. .10.
Note that the sample size required to detect the difference between .05 and .10 is much smaller than that
required to detect the difference between .40 and .45.
The bottom line is that when testing hypotheses about population proportions, you must specify not only the
difference in proportions, but also the two specific proportions whose difference you wish to detect.
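A minimal sketch of the same asymmetry in code, assuming Python with statsmodels: convert each pair of proportions to Cohen's h, then solve for the per-group n at 80% power.

    # The same .05 difference in proportions needs very different sample
    # sizes depending on where the two proportions sit.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    solver = NormalIndPower()
    for p1, p2 in ((0.40, 0.45), (0.05, 0.10)):
        h = proportion_effectsize(p1, p2)   # Cohen's h for the pair
        n = solver.solve_power(effect_size=abs(h), alpha=0.05, power=0.80)
        print(f"{p1:.2f} vs {p2:.2f}: about {round(n)} per group")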