Effect Size Issues
Rob Coe
WRI Workshop, 18 March 2013
© 2003 Robert Coe, University of Durham
© 2013 Robert Coe, University of Durham

Four parts
I. What is Effect Size?
II. The case for using effect size (5 reasons)
III. Problems in using effect size (6 problems)
IV. Recommendations (13 recommendations)

What is Effect Size?

Sources
• Coe, R. (2002) It's the effect size, stupid: what effect size is and why it is important. Paper presented at the Annual Conference of the British Educational Research Association, University of Exeter, England, 12-14 September 2002.
• Coe, R.J. (2012) ‘Effect Size’ in J. Arthur, M. Waring, R. Coe and L.V. Hedges (Eds) Research Methods and Methodologies in Education. London: Sage.

[Figure: normal distributions with a high standard deviation (spread out) and a low standard deviation (tightly grouped)]

Effect size is the difference between the two groups, relative to the standard deviation:

$$\text{Effect Size} = \frac{\text{Mean of experimental group} - \text{Mean of control group}}{\text{Standard deviation}}$$

Examples of effect sizes

ES = 0.2
• 58% of the control group are below the mean of the experimental group
• “Equivalent to the difference in heights between 15 and 16 year old girls”
• Probability you could guess which group a person was in = 0.54
• Change in the proportion above a given threshold: from 50% to 58%, or from 75% to 81%

ES = 0.5
• 69% of the control group are below the mean of the experimental group
• “Equivalent to the difference in heights between 14 and 18 year old girls”
• Probability you could guess which group a person was in = 0.60
• Change in the proportion above a given threshold: from 50% to 69%, or from 75% to 88%

ES = 0.8
• 79% of the control group are below the mean of the experimental group
• “Equivalent to the difference in heights between 13 and 18 year old girls”
• Probability you could guess which group a person was in = 0.66
• Change in the proportion above a given threshold: from 50% to 79%, or from 75% to 93%

[Figure: Effect sizes from the EEF Toolkit]

The case for using effect size measures
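The percentages and probabilities quoted in the ES = 0.2, 0.5 and 0.8 examples of Part I follow from normal-distribution arithmetic. This is a minimal Python sketch, assuming normal outcomes with equal standard deviations in both groups and, for the guessing probability, equal group sizes with a cut-off halfway between the two means. `interpret_effect_size` is a hypothetical helper built on the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def interpret_effect_size(d, baseline_above=0.5):
    """Hypothetical helper: translate an effect size d into the slide metrics.

    Assumes both groups are normal with equal SDs, so the experimental
    distribution is the control distribution shifted up by d SDs.
    """
    # Share of the control group scoring below the experimental-group mean
    control_below_exp_mean = STD_NORMAL.cdf(d)
    # Chance of guessing group membership correctly from one score, using a
    # cut-off halfway between the two means (equal group sizes assumed)
    prob_correct_guess = STD_NORMAL.cdf(d / 2)
    # Threshold that `baseline_above` of the control group exceeds, and the
    # share of the experimental group above that same threshold
    threshold = STD_NORMAL.inv_cdf(1 - baseline_above)
    experimental_above = 1 - STD_NORMAL.cdf(threshold - d)
    return control_below_exp_mean, prob_correct_guess, experimental_above


for d in (0.2, 0.5, 0.8):
    below, guess, above_from_50 = interpret_effect_size(d, baseline_above=0.5)
    _, _, above_from_75 = interpret_effect_size(d, baseline_above=0.75)
    print(f"ES = {d}: {below:.0%} of control below experimental mean; "
          f"P(correct guess) = {guess:.2f}; "
          f"50% -> {above_from_50:.0%}, 75% -> {above_from_75:.0%}")
```

Run as written, it prints the rounded figures given in the three examples. With skewed or heavy-tailed distributions these translations would no longer hold.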
Source
Coe, R. (2004) ‘Issues arising from the use of effect sizes in analysing and reporting research’ in I. Schagen and K. Elliot (Eds) But what does it mean? The use of effect sizes in educational research. Slough, UK: National Foundation for Educational Research. http://www.nfer.ac.uk/nfer/publications/SEF01/SEF01.pdf

1. Effect size enables uncalibrated measures to be interpreted
From a questionnaire on teachers’ perceptions of their training needs (7-item scale, each item coded 1-4):

By age group    Mean   SD     n
Age 20-40       2.98   0.87   389
Age 41-65       2.09   0.95   345
“Younger teachers expressed stronger needs than their older colleagues”: effect size = 0.98

By gender       Mean   SD     n
Female          2.64   1.05   451
Male            2.44   1.11   283
“Female teachers appeared to have higher training needs than males”: effect size = 0.19

The two statements sound equally strong, but on this uncalibrated scale only the effect sizes reveal that the age difference is about five times as large as the gender difference.

2. Effect size emphasises amounts, not just statistical significance
• The dichotomous “significant/not” decision is almost never appropriate
• The size of a difference is almost always important
• “Significance” has many meanings, but is inevitably related to the size of the difference

Nonsensical dichotomies
(a) Experimental group: 52 69 83 66 58 69 68 44 62 51 31 70 44 63 70 55 86 74 68 74 (mean 63.0, SD 13.7)
    Control group: 51 60 57 80 45 37 56 47 55 63 51 45 69 62 63 46 73 39 49 52 (mean 54.9, SD 11.2)
    t-test gives p = 0.049: statistically significant difference. THE TREATMENT WORKED!
(b) Experimental group: as in (a) (mean 63.0, SD 13.7)
    Control group: 51 60 57 80 45 37 56 47 55 63 51 45 70 62 63 46 73 39 49 52 (mean 55.0, SD 11.3), identical to (a) except that one score of 69 is now 70
    t-test gives p = 0.052: difference not significant. IT DIDN’T WORK!
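The effect sizes for the training-needs questionnaire can be recovered from its summary table, and the same arithmetic makes the dichotomy example concrete. This is a minimal Python sketch, assuming the pooled standard deviation as the denominator (which reproduces the 0.98 and 0.19 quoted) and Student's equal-variance t-test. The helper names are hypothetical, and the critical value 2.024 for 38 degrees of freedom is taken from standard t tables. Because it works from rounded means and SDs, the t statistics are approximate.

```python
from math import sqrt


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def effect_size(m1, s1, n1, m2, s2, n2):
    """Standardised mean difference (group 1 minus group 2), pooled-SD denominator."""
    return (m1 - m2) / pooled_sd(s1, n1, s2, n2)


def t_statistic(m1, s1, n1, m2, s2, n2):
    """Student's two-sample t statistic, assuming equal population variances."""
    return (m1 - m2) / (pooled_sd(s1, n1, s2, n2) * sqrt(1 / n1 + 1 / n2))


# Training-needs questionnaire: summary statistics from the slide
d_age = effect_size(2.98, 0.87, 389, 2.09, 0.95, 345)     # age 20-40 vs 41-65
d_gender = effect_size(2.64, 1.05, 451, 2.44, 1.11, 283)  # female vs male

# Dichotomy example: the two control groups differ by a single score
T_CRIT_38 = 2.024  # two-tailed 5% critical value of t with 20 + 20 - 2 = 38 df
d_p049 = effect_size(63.0, 13.7, 20, 54.9, 11.2, 20)
d_p052 = effect_size(63.0, 13.7, 20, 55.0, 11.3, 20)
t_p049 = t_statistic(63.0, 13.7, 20, 54.9, 11.2, 20)
t_p052 = t_statistic(63.0, 13.7, 20, 55.0, 11.3, 20)

print(f"age d = {d_age:.2f}, gender d = {d_gender:.2f}")
for label, d, t in (("p = 0.049", d_p049, t_p049), ("p = 0.052", d_p052, t_p052)):
    print(f"{label} case: d = {d:.2f}, t = {t:.2f}, beyond 5% cut-off: {abs(t) > T_CRIT_38}")
```

The p = 0.049 case clears the critical value and the p = 0.052 case does not, yet the two effect sizes differ by only about 0.01. For exact p-values from summary statistics, SciPy's `scipy.stats.ttest_ind_from_stats` could replace the tabulated critical value.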
Don’t ignore amounts
[Figure: three effects plotted on a scale from -1 to 3, labelled (a) “not significant”, (b) “significant” and (c) “significant”]

Types of significance
• Statistical significance
  o Probability of obtaining a difference at least this large by chance alone, if there were no real difference
• Practical significance
  o Theoretical or applied importance
• Clinical significance
  o ‘extent to which the intervention makes a real difference to the quality of life’ (Kazdin, 1999)
• Economic significance
  o Benefit in relation to cost (Leech and Onwuegbuzie, 2003)

3. Effect size draws attention to the margin of error
• Statistical power is important, but often overlooked
• Much apparent disagreement is actually just sampling error

4. Effect size may help reduce reporting bias
• The “file-drawer” problem is alive and well
• Within-study reporting bias can also be a problem

5. Effect size allows the accumulation of knowledge
• Meta-analysis can combine results from different studies
• Small studies are worth doing

Problems in using effect size measures

1. Which effect size?
• Proportion of variance accounted for
  o Universal measure, but
  o Non-directional
  o Sensitive to violations of assumptions
  o Large standard errors
  o Interpretation counter-intuitive
  o ‘Effect’ should mean effect
• Non-parametric effect size measures
• Odds ratio
• Un-standardised (raw) difference

2. Which standard deviation?
• Pooled or control group?
  o Control group is conceptually purer
  o Pooled is statistically better (provided the two groups’ SDs are compatible)
  o Sometimes there isn’t a ‘control’ group
• Residual standard deviation
  o Residual gain becomes ‘progress’
  o Effect sizes are substantially inflated, and depend on the correlation
  o Important to report clearly
• Restricted range
  o Effect size is higher if the range is limited

3. Measurement reliability
• Standardised mean difference is spuriously affected by the reliability of the outcome measure
  o Part of the variance in measured scores is due to measurement error
  o More error → higher S.D. → lower E.S.
  o Reliability should be reported

4. Non-normal distributions
[Figures: two distributions, each plotted on a scale from -4 to 4]
An effect size of 1:
• Normal distribution: median person raised to the 84th percentile
• Contaminated-normal distribution: median person raised to the 97th percentile

5. Interpreting effects
• Cohen’s labels (0.2 = small, 0.5 = medium, 0.8 = large) are a bit simplistic
• Interpretation depends on
  o Translation into a familiar metric
  o Comparison with known effects
  o Costs
  o Feasibility
  o Benefits (and their value)
  o Availability of alternatives

6. Incommensurability
• Commensurable outcomes
  o Construct
  o Operationalisation
  o Reliability
• Commensurable treatments
  o Well defined?
  o Fidelity of delivery
  o Intensity / duration
  o Control group treatment
• Commensurable populations
  o Range

Recommendations
1. Calculate and report standardised effect size, with confidence interval / standard error, for all comparisons
2. Show these graphically
3. Report all relevant comparisons regardless of whether confidence intervals include zero
4. Interpret effect sizes by comparison with known effects and in relation to familiar metrics
5. Report un-standardised raw differences whenever the outcome is measured on a familiar scale
6. Interpret the significance of an effect with regard to issues such as its
  o effect size
  o theoretical importance
  o associated benefits
  o associated costs
  o policy relevance
  o feasibility
  o comparison with available alternatives
7. Don’t use the word ‘effect’ (with or without ‘size’) unless a causal claim is intended and can be justified
9. Be cautious about the calculation and interpretation of standardised effect sizes whenever
  o Sample has restricted range
  o Population is not known to be normal
  o Outcome measure has low or unknown reliability
  o Outcomes have been statistically adjusted (residuals)
10. Always report reliability of measures, extent of restriction, and correlations (or R²) in these cases
11. Small studies with low power and statistically non-significant effects should still be conducted, reported and published, provided they are free from bias
12. Synthesise the results of compatible studies using meta-analysis
13. Beware of combining or comparing effect sizes from studies with incommensurable outcomes, treatments or populations
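Recommendations 1, 11 and 12 can be made concrete with a short calculation. This is a minimal Python sketch, assuming the common large-sample approximation for the standard error of a standardised mean difference, a normal-theory 95% interval, and a fixed-effect (inverse-variance) pooling model. The helper names are hypothetical, and the three small studies are invented numbers for illustration only, not real results; the gender figures come from the training-needs questionnaire in Part II.

```python
from math import sqrt

Z_95 = 1.96  # normal-theory multiplier for a 95% confidence interval


def se_of_d(d, n1, n2):
    """Large-sample approximate standard error of a standardised mean difference."""
    return sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))


def ci_95(d, se):
    """Approximate 95% confidence interval (lower, upper)."""
    return d - Z_95 * se, d + Z_95 * se


def fixed_effect_pool(effects):
    """Inverse-variance weighted mean of (d, se) pairs; returns (pooled d, pooled se)."""
    weights = [1 / se**2 for _, se in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    return pooled, 1 / sqrt(sum(weights))


# Recommendation 1: interval for the gender difference in training needs
se_gender = se_of_d(0.19, 451, 283)
gender_lo, gender_hi = ci_95(0.19, se_gender)
print(f"gender: d = 0.19, 95% CI ({gender_lo:.2f}, {gender_hi:.2f})")

# Recommendations 11-12: HYPOTHETICAL small studies (d, n1, n2), invented for illustration
studies = [(0.25, 40, 40), (0.35, 30, 30), (0.30, 50, 50)]
effects = [(d, se_of_d(d, n1, n2)) for d, n1, n2 in studies]
for d, se in effects:
    lo, hi = ci_95(d, se)
    print(f"study: d = {d:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")  # each interval includes 0

pooled_d, pooled_se = fixed_effect_pool(effects)
pooled_lo, pooled_hi = ci_95(pooled_d, pooled_se)
print(f"pooled: d = {pooled_d:.2f}, 95% CI ({pooled_lo:.2f}, {pooled_hi:.2f})")
```

Each hypothetical study is “non-significant” on its own, but the pooled interval excludes zero, which is the reason small unbiased studies are worth reporting. Where studies differ by more than sampling error would explain, a random-effects model is the usual alternative, and recommendation 13 still applies.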