Psych 5500/6500
The t Test for a Single Group Mean (Part 3): Effect Size
Fall, 2008

Effect Size

In the t test for a single group mean, the ‘effect size’ is the difference between the actual value of the population mean (for the population from which the sample was drawn) and the value of the population mean proposed by H0. For example, math scores in some class have traditionally had a mean of 50, and a new teaching program is tested to see if it changes the math scores. H0 can be written as $\mu_{H_0} = 50$ (no effect due to the new program). But say that the new program actually leads to an improvement in math scores such that now $\mu_Y = 55$. The effect of the new program was to raise the scores by 5:

$$\mu_Y - \mu_{H_0} = 5$$

p Values

Until fairly recently, reports of the statistical analysis of experiments in psychology focused primarily on the ‘p values’ that were obtained. In the example of the math scores and the new teaching program, the results of the analysis would be presented as t(23)=2.45, p=0.02. From this it can be concluded that the effect of the new teaching program was statistically significant (H0 is rejected). With p values the focus is on whether or not the effect is statistically significant, with a p value of .051 (don’t reject H0) being treated as fundamentally different from a p value of .049 (reject H0).

p Values and Effect Size

p values simply tell us whether the effect was statistically significant (i.e. unlikely to have occurred due to chance alone). p values are a poor indication of the size of the effect in an experiment, as the value of p is influenced by a variety of things, including the effect size, the size of the sample, and the variance of the populations. The trend in psychology is to report both the size of the effect and its p value.
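To make the point about sample size concrete, here is a minimal sketch that holds the raw effect from the math-score example fixed at 5 and varies only N. The function name `t_obtained` and the value est.σ = 10 are assumptions made up for illustration; they do not come from the example itself.

```python
import math


def t_obtained(sample_mean, mu_h0, est_sigma, n):
    """t for a single group mean: (mean - mu_H0) / (est.sigma / sqrt(N))."""
    standard_error = est_sigma / math.sqrt(n)
    return (sample_mean - mu_h0) / standard_error


# Raw effect fixed at 55 - 50 = 5 (the math-score example).
# est_sigma = 10 is a hypothetical value chosen only for illustration.
for n in (9, 25, 100):
    t = t_obtained(55, 50, 10, n)
    print(f"N={n:3d}  df={n - 1:2d}  raw effect=5  t={t:.2f}")
```

The raw effect never changes, yet t grows as N grows, so the p value shrinks. This is why a p value on its own says little about how large the effect is.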
“Until quite recently in the history of experimental psychology, when researchers spoke of ‘the results of a study,’ they almost invariably were referring to whether they had been able to ‘reject the null hypothesis,’ that is, to whether the p values of their tests of significance were .05 or less. Spurred on by a spirited debate over the failings and limitations of the ‘accept/reject’ rhetoric of the paradigm of null hypothesis testing, the American Psychological Association (APA) created a task force to review journal practices and to propose guidelines for the reporting of statistical results. Among the ensuing recommendations were that effect sizes...be reported.” (Rosnow & Rosenthal, 2003, p. 221)

“It is no longer considered sufficient to ask of an effect or relationship: ‘Is it there?’ It is increasingly considered essential to also ask ‘How much is there?’ and sometimes even ‘Is it enough to care?’” (McGrath & Meyer, 2006, p. 386)

Measures of Effect Size

There are many ways of measuring and reporting effect size, and various authors provide various ways of clumping these approaches into categories. We will consider three categories:
1. Simply reporting ‘raw’ effect size
2. Standardized effect size
3. Strength of association

1) Simply Reporting ‘Raw’ Effect Size

If the measures are easily comprehensible then you can simply state the effect size. In our math example you can report that the expected value of the μ of math scores given H0 was 50 while the estimate of μ given your sample was 55, or you could simply state that the mean math score in the sample was 5 greater than what was predicted by H0. Belying the concept that if it isn’t complicated it can’t be good, this is actually the approach favored by the APA.

2) Standardized Effect Size

If the measure is something that is hard to grasp (e.g.
inverse reaction times, where an effect of ‘0.2’ would be hard to understand intuitively) or if you want to do a meta-analysis (comparing results across several studies), then a standardized effect size may be more useful. In a standardized effect size you are turning the effect size into something that is similar to a standard score.

Cohen’s d (population)

$$d = \frac{\mu_Y - \mu_{H_0}}{\sigma_Y}$$

This formula is for computing the actual effect size as it occurs in the population from which we sampled. The difference between the mean proposed by H0 and the actual mean of the population is divided by the standard deviation of the population from which the sample was drawn.

Example

Say that in our example the actual mean score of the population of students taught using the new method was 56 (slightly higher than the sample mean we happened to get) with a population standard deviation of 7.3. The formula turns the difference of 6 between H0 and the actual population mean into a standard score of 0.82. Is that a big effect or a small effect? We will cover that in a minute.

$$d = \frac{\mu_Y - \mu_{H_0}}{\sigma_Y} = \frac{56 - 50}{7.3} = 0.82$$

Cohen’s d (sample)

$$d = \frac{\bar{Y} - \mu_{H_0}}{S_Y} \quad \text{where} \quad S_Y = \sqrt{S_Y^2} = \sqrt{\frac{SS}{N}}$$

This formula is for computing the effect size as it occurs in the sample. The difference between the mean proposed by H0 and the mean of the sample is divided by the standard deviation of the sample.

Cohen’s d (sample): Alternative Formula

$$d = \frac{t_{obt.}}{\sqrt{df}}$$

This is an easy way to compute d if you have the t obtained value and df from the t test for a single group mean. It has the disadvantage of not making it clear what d is actually measuring (i.e. the standardized effect size).

Hedges’s g (estimate of the effect size in the population)

$$g = \frac{\text{est.}\mu_Y - \mu_{H_0}}{\text{est.}\sigma_Y} = \frac{\bar{Y} - \mu_{H_0}}{\text{est.}\sigma_Y} \quad \text{where} \quad \text{est.}\sigma_Y = \sqrt{\text{est.}\sigma_Y^2} = \sqrt{\frac{SS}{N - 1}}$$

This formula uses the data from the sample to estimate the effect size in the population.
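The three sample-based formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `effect_sizes` and the eight scores are hypothetical, and g here is the simple ratio defined in these notes (sample mean minus the H0 mean, divided by est.σ).

```python
import math


def effect_sizes(scores, mu_h0):
    """Return Cohen's d (sample), Hedges's g, and d recovered from t."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((y - mean) ** 2 for y in scores)  # sum of squares, SS

    s = math.sqrt(ss / n)                # S_Y: standard deviation of the sample
    est_sigma = math.sqrt(ss / (n - 1))  # est. sigma_Y: estimate for the population

    d = (mean - mu_h0) / s
    g = (mean - mu_h0) / est_sigma

    # Alternative route: compute t for a single group mean, then d = t / sqrt(df).
    t_obt = (mean - mu_h0) / (est_sigma / math.sqrt(n))
    d_from_t = t_obt / math.sqrt(n - 1)

    return {"mean": mean, "d": d, "g": g, "t": t_obt, "d_from_t": d_from_t}


# Hypothetical math scores, invented for illustration; H0 proposes a mean of 50.
scores = [52, 61, 48, 57, 55, 63, 50, 54]
result = effect_sizes(scores, 50)
for name, value in result.items():
    print(f"{name:9s} {value:.4f}")
```

Both routes to d give the same value, and g comes out slightly smaller than d because est.σ divides SS by N - 1 rather than N.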
Hedges’s g and Cohen’s d

$$g = \frac{\bar{Y} - \mu_{H_0}}{\text{est.}\sigma_Y} \qquad d = \frac{\bar{Y} - \mu_{H_0}}{S_Y}$$

For large samples the difference between est.σY and SY (the estimate of the population standard deviation and the standard deviation of the sample) will be quite small, and thus the values of Hedges’s g and Cohen’s d will be quite close.

Interpreting ‘d’

Cohen proposed a simple way to evaluate the size of an effect based upon the value of ‘d’ (and as there is only a small difference between ‘d’ and ‘g’ it could apply to Hedges’s g as well). Note: take the absolute value of d; whether it is negative or positive is irrelevant to the strength of the effect.

|d| = .2   a ‘small’ effect
|d| = .5   a ‘medium’ effect
|d| = .8   a ‘large’ effect

Where did this come from? According to Cohen an effect size of d=.5 (a ‘medium’ effect size) is usually noticeable to someone looking at graphs of the data. Subsequent surveys of the literature have found that the average size of effects reported in various fields is approximately equal to a d of .5. A small effect (.2) is smaller than that but still not too trivial, and a large effect (.8) is the same distance above a medium effect as a small effect is below it.

Cohen offered these criteria with some misgivings. His goal was to make the value of ‘d’ more meaningful, but he was worried that people would take them too seriously (he was right). These criteria are fairly arbitrary and are based upon what might be considered the size of the effect viewed purely through the lens of statistics. A ‘small’ effect might still be of great theoretical interest, and a ‘small’ effect in the field of medicine might lead to saving tens of thousands of lives (giving it great social or pragmatic interest). A ‘large’ effect might be of little theoretical or practical significance.

The real value of Cohen’s effect size values (small, medium, and large) will be seen when we discuss ‘power’.
When computing the possible power of an experiment that you are designing, you need to guess what the effect size will be. Cohen’s criteria provide one way to help you guess. If you anticipate that the effect you will be looking at will be small, then plug in a value of .2 for d, and so on. We will take a look at this later.

More on Standardized Effect Sizes

1) These formulas are for the context of testing a single mean versus what is predicted by H0. Different forms of the formula are necessary for other experimental designs (we will cover these later).

2) Beware that there is a bewildering lack of consistency in the literature on how to compute Cohen’s d. Often the formula for finding the effect size in the population will be given, followed by an example where the mean and standard deviation of the sample are plugged into the formula (presumably on the assumption that, by not generalizing to a larger population, we are treating the sample as our population of interest). One of the reasons I like the way I use the symbols in this class is that it makes it easy to discriminate exactly what is being accomplished by the various formulas.

3) What does SPSS provide? None of these. SPSS will provide the mean of Y and the ‘standard deviation’ of Y (which is actually est.σY), making it a simple process to calculate Hedges’s g. If you want to calculate Cohen’s d (for the sample) you can either translate est.σY into S using the formula given earlier this semester (provided again below), or you can use the formula for computing ‘d’ from ‘t’ (as SPSS will give you both t obtained and df).

$$S = \text{est.}\sigma \sqrt{\frac{N - 1}{N}}, \quad \text{then} \quad d = \frac{\bar{Y} - \mu_{H_0}}{S}; \qquad \text{or,} \quad d = \frac{t_{obt.}}{\sqrt{df}}$$

4) Advantages of using standardized effect sizes:
a) If the effect size involves some metric that is hard to conceptualize (e.g.
an effect size of -0.2 in a measure of inverse reaction times) then turning it into a standard score will help. Cohen’s criteria for what constitutes a small, medium, and large effect size can give the standardized effect size some level of meaning.
b) Standardizing the effect size makes it easier to do meta-analysis (where you compare the effect sizes of several different studies), particularly when the studies are examining the same topic but with different measures. By translating the effect sizes found in all of the studies into standardized effect sizes you turn them into essentially the same metric so that they can be directly compared.

Example

Say one study used inches to measure the variable ‘length’ and found an effect size of 24 inches. Say another study measured exactly the same subjects but used feet to measure length and found an effect size of 2 feet. The standard deviation of scores in the first study was 15, which would make the standard deviation in the second study 1.25 (i.e. 15/12....trust me). While the first study had an effect size of 24 (inches) and the second study had an effect size of 2 (feet), we can see by computing the d’s that they found the same effect.

$$d = \frac{\bar{Y} - \mu_{H_0}}{S} = \frac{60 - 36}{15} = \frac{24}{15} = 1.6 \qquad d = \frac{5 - 3}{1.25} = \frac{2}{1.25} = 1.6$$

Example (cont.)

While it is obvious in the example that the effect size should be the same, let’s apply the idea to a more realistic scenario. In this scenario ‘Study A’ measures intelligence using one IQ test and finds an effect size of 4. ‘Study B’ measures intelligence using a different IQ test and finds an effect size of 6. The two IQ tests have different means and different variances, and it is hard to know how the two effect sizes really compare, but if we change each effect size to a standardized difference we can compare them directly.
$$d_A = \frac{\bar{Y} - \mu_{H_0}}{S} = \frac{112 - 116}{11} = \frac{-4}{11} = -.36 \qquad d_B = \frac{103 - 109}{29} = \frac{-6}{29} = -.21$$

5) Disadvantages of using standardized effect sizes:
a) The problem with standard scores is that they take you away from the units of measure that you used in the study. It might be more useful to know that the fertilizer increased growth rate by 24 inches a year than to know that d=0.3.
b) Standardized effect sizes bring the standard deviation of the scores into the expression of effect size, which in some cases can obscure the effect itself. Say that the math teaching method raised the scores of students on average by 5, and these students were a pretty varied lot (they differed a lot in terms of math ability). Now say that in another study the teaching method again raised scores by 5, but in a class where the students were similar in math ability. Even though the teaching method had the same effect in both classes, the values of d would differ in the two studies (as the denominators would differ in the d formulas).

$$d_A = \frac{\bar{Y} - \mu_{H_0}}{S} = \frac{55 - 50}{23} = 0.22 \qquad d_B = \frac{\bar{Y} - \mu_{H_0}}{S} = \frac{55 - 50}{14} = 0.36$$

6) Which to use, ‘g’ or ‘d’? ‘d’ gets more press, ‘g’ seems to be of more interest (to me), you can use either, and with any kind of large N they will be very close in value. If you want to compare your study to other similar studies, see which one most of them use so you can more easily compare.

3) Strength of Association

This category of effect size measures is also called ‘correlation’ or ‘amount of variance accounted for’. Everything we do next semester will automatically crank these out, and in that context they will be quite understandable. In the context of what we are doing this semester (ANOVA), standardized measures (such as ‘d’) are often used; consequently we will hold off discussion of ‘Strength of Association’ measures of effect size until next semester.
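Looking back at the standardized effect size examples, the arithmetic can be checked with a short script. This is a minimal sketch: the function name `standardized_effect` is made up for illustration, and the numbers are the ones used in the length example and the two-classroom example.

```python
def standardized_effect(mean, mu_h0, sd):
    """Standardized effect size: (mean - mu_H0) / standard deviation."""
    return (mean - mu_h0) / sd


# Same effect measured in inches and in feet (values from the length example).
d_inches = standardized_effect(60, 36, 15)
d_feet = standardized_effect(60 / 12, 36 / 12, 15 / 12)

# Same raw effect of 5 in a varied class (S = 23) and a more similar class (S = 14).
d_varied = standardized_effect(55, 50, 23)
d_similar = standardized_effect(55, 50, 14)

print(f"inches: d = {d_inches:.2f}   feet: d = {d_feet:.2f}")
print(f"varied class: d = {d_varied:.2f}   similar class: d = {d_similar:.2f}")
```

The first pair shows the advantage: changing the unit of measure leaves d untouched. The second pair shows the disadvantage: an identical raw gain of 5 produces different values of d when the spread of scores differs.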