Size matters – also in scientometric studies and research evaluation

Jesper W. Schneider
Danish Centre for Studies in Research & Research Policy, Department of Political Science & Government, Aarhus University, Denmark
jws@cfa.au.dk

Last year …
• ... I complained about the rote use of statistical significance tests
• ... one of my claims was that such tests are used more or less mindlessly to mechanically decide whether a result is important or not
• Similar to last year, I will share with you some further concerns of mine; it basically continues where I left off ...

... what bothers me this year?
• We rarely reflect on the numbers we produce ... and we produce a lot!
• We mainly pose dichotomous or ordinal hypotheses ("weak hypotheses")
• ... it is the easy way, but it is not particularly informative or, for that matter, scientific
• Compare this to researchers in the natural sciences
• They do reflect upon their numbers, and they formulate "strong" hypotheses or theories and test their predictive ability
• ... researchers in the natural sciences are concerned with questions such as "how much" or "to what degree", questions concerning the size of effects and the impact of such effects ... to them size matters

Consider this
• Have you ever considered what the numerical differences between JIFs imply in practice?
• ... up to, on average, 1 citation per paper in difference

Consider this
• Difference from #101 to #125 is 0.5 percentage points
• Management at my university like (really like) rankings; their focus, though, is solely on the rank position – in this case #117

Consider this
• Possible gender inequalities in publication output from Spanish psychology researchers (we keep it anonymous)
• One research question:
  – "Is there a difference between the proportion of female authors depending on the gender of the first author?"
• Statistical hypothesis tested: "there is no difference"

Some suggested solutions
• Example, three research groups:

          Group 1   Group 2   Group 3
  MNCS    1.42      1.64      1.56

• Some colleagues have recently argued that statistical significance tests should be used in order to detect "significant differences"
• Kruskal-Wallis (K-W) tests are suggested
• H0: Group 1 = Group 2 = Group 3, i.e. no difference in medians between groups
• K-W is "significant" (p = .014, α = 5%)
• Pairwise comparisons suggest that "significant differences" exist between groups 1 and 2 (p = .008), as well as groups 1 and 3 (p = .013)!
• Finito! ... not so: statistical significance cannot tell us to what degree the results are important!
• … but effect sizes can help us interpret the data
• r = Z / √N, where Z is the statistic from the Mann-Whitney tests used for the pairwise comparisons, and N is the total number of cases compared

            Group 1   Group 2   Group 3
  Group 1   -         0.13      0.13
  Group 2             -         0.02
  Group 3                       -
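The slide above compresses several steps: an omnibus Kruskal-Wallis test, pairwise Mann-Whitney follow-ups, and the conversion of each Mann-Whitney Z statistic into the effect size r = Z / √N. A minimal sketch of that pipeline follows; the citation scores are simulated stand-ins (the slide only reports the groups' MNCS values, not the underlying data), so the printed numbers will not match the slide.

```python
# Sketch of the analysis described on the slide: Kruskal-Wallis across three
# groups, pairwise Mann-Whitney tests, and the effect size r = Z / sqrt(N).
# The group data are simulated, skewed citation-like scores (an assumption).
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(42)
groups = {
    "Group 1": rng.lognormal(mean=0.10, sigma=0.8, size=200),
    "Group 2": rng.lognormal(mean=0.30, sigma=0.8, size=180),
    "Group 3": rng.lognormal(mean=0.25, sigma=0.8, size=190),
}

h_stat, p = kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p:.4f}")

def mann_whitney_r(x, y):
    """r = |Z| / sqrt(N), with Z recovered from the normal approximation
    of the Mann-Whitney U statistic (tie correction ignored)."""
    u, _ = mannwhitneyu(x, y, alternative="two-sided")
    n1, n2 = len(x), len(y)
    z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return abs(z) / np.sqrt(n1 + n2)

names = list(groups)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = mann_whitney_r(groups[names[i]], groups[names[j]])
        print(f"{names[i]} vs {names[j]}: r = {r:.2f}")
```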
Effect sizes
• Judgments about the relative importance of results must be determined by theory, former findings, practical implications, cost-benefit considerations … whatever else informs such judgments
• Effect sizes should be the starting point for such judgments – as size matters!
• Effect sizes are measures of the strength of a phenomenon, and they come in standardized (scale-free) and non-standardized forms
• Effect sizes are comparable across studies

Interpretation
• Theory, former findings, practical implications etc. that can help us interpret results are not plentiful in our field
• Example – r = 0.47: is that a big or small effect, in the scheme of things?
• If you haven't a clue, you're not alone. Most people don't know how to interpret the magnitude of a correlation, or the magnitude of any other effect statistic. But people can understand trivial, small, moderate, and large, so qualitative terms like these can be used to discuss results.

A scale of magnitudes for effect statistics – a beginning
• Cohen (1988) reluctantly suggested a benchmark scale for effect sizes based upon important findings in the behavioral and social sciences

  Standardized effect sizes    Small   Medium   Large
  Correlation                  0.1     0.3      0.5
  Cohen's h (proportions)      0.2     0.5      0.8

• The main justification for this scale of correlations comes from the interpretation of the correlation coefficient as the slope of the line between two variables when their standard deviations are the same. For example, if the correlation between height (X variable) and weight (Y variable) is 0.7, then individuals who differ in height by one standard deviation will on average differ in weight by only 0.7 of a standard deviation. So, for a correlation of 0.1, the change in Y is only one-tenth of the change in X.

A scale of magnitudes for effect statistics – a beginning

                  Correlation   Difference in means
  Trivial         0             0
  Small           0.1           0.2
  Moderate        0.3           0.6
  Large           0.5           1.2
  Very large      0.7           2
  Nearly perfect  0.9           4
  Perfect         1             infinite

• In principle, we are skeptical of this categorization, as it also encourages "mechanical thinking" similar to statistical significance tests. A "one size fits all" yardstick is problematic, as Cohen himself argued. However, for the sake of argument we use these benchmarks to illustrate what we need to focus upon: the importance of findings!

Returning to the group comparison

          Group 1   Group 2   Group 3
  MNCS    1.42      1.64      1.56

• r = Z / √N, where Z is the statistic from the Mann-Whitney tests used for the pairwise comparisons, and N is the total number of cases compared

            Group 1   Group 2   Group 3
  Group 1   -         0.13      0.13
  Group 2             -         0.02
  Group 3                       -

• Group 1 vs. groups 2 and 3: very small, close to trivial differences
• Group 2 vs. group 3: extremely trivial difference

University rankings: Stanford University vs. UCLA

  Leiden Ranking
  Institution           pp_top 10%   No. papers
  Stanford University   22.90%       26917
  UCLA                  19.40%       30865

• H0: p1 = p2; HA: p1 ≠ p2
• z-test = "significant difference" (p < .0001)
• Difference = 3.5 percentage points
• Does this mean that we have a "difference that makes a difference" between the rankings of these two universities?
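To make the contrast on this slide concrete, here is a minimal sketch using only the figures quoted above (the two proportions and paper counts): the pooled two-proportion z-test and Cohen's h. Both formulas are standard, but the code itself is an illustration, not the author's.

```python
# Two calculations for Stanford vs. UCLA: a pooled two-proportion z-test
# (which is "significant" at these sample sizes) and Cohen's h (which is
# below even the "small" benchmark of 0.2).
import math
from scipy.stats import norm

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

p1, n1 = 0.229, 26917  # Stanford University: pp_top 10% and no. papers
p2, n2 = 0.194, 30865  # UCLA

pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.1f}, p = {p_value:.1e}")      # "significant": p << .0001
print(f"Cohen's h = {cohens_h(p1, p2):.2f}")  # ~0.09: a trivial effect
```

With n in the tens of thousands, even a trivial difference is virtually guaranteed to be "statistically significant"; the effect size, not the p-value, carries the interpretive weight.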
Effect sizes for Stanford University (Leiden)
• Differences in effect sizes (Cohen's h) between Stanford University and each of the other 499 institutions in the Leiden Ranking
• Stanford vs. UCLA: Cohen's h = 0.09
• [Figure: distribution and cumulative distribution of Cohen's h – ≈ 20% "trivial effects", ≈ 76% "small effects", ≈ 4% "medium effects"]

  Leiden Ranking – institutions closest to Stanford ("trivial"/"small" effects)

  Institution                  Cohen's h   pp_top 10%   Pubs
  Stanford Univ                0.00        22.9         26917
  Univ Calif Santa Barbara     0.01        22.6         9250
  Univ Calif Berkeley          0.01        22.3         22213
  Harvard Univ                 0.02        23.6         61462
  Caltech                      0.02        22.0         13379
  Univ Calif San Francisco     0.03        21.8         20527
  Princeton Univ               0.03        24.0         10832
  Rice Univ                    0.03        21.7         4646
  Univ Chicago                 0.04        21.1         14103
  Univ Calif Santa Cruz        0.05        20.9         4667
  Univ Washington - Seattle    0.06        20.5         29273
  Columbia Univ                0.06        20.2         –
  Yale Univ                    0.07        20.2         –
  MIT                          0.07        26.0         –
  Duke Univ                    0.07        19.8         –
  Univ Penn                    0.08        19.8         –
  Carnegie Mellon Univ         0.08        19.7         –
  Ecole Polytecn Fed Lausanne  0.08        19.6         –
  Univ Calif San Diego         0.08        19.4         –
  Washington Univ - St Louis   0.09        19.4         –
  Univ Calif Los Angeles       0.09        19.4         –

  Institutions at "medium" effects from Stanford

  Institution                  Cohen's h   pp_top 10%   Pubs
  Hacettepe Univ Ankara        0.54        5.2          5773
  Nihon Univ                   0.54        5.2          4423
  Gifu Univ                    0.54        5.0          3357
  Istanbul Univ                0.56        4.7          5573
  Gazi Univ Ankara             0.57        4.5          4420
  Belgrade Univ                0.58        4.4          6936
  Ankara Univ                  0.59        4.2          4830
  Adam Mickiewicz Univ Poznan  0.59        4.0          3363
  Lomonosov Moscow State Univ  0.61        3.8          14205
  St Petersburg State Univ     0.64        3.1          4826

• [Figure: relation between "pp_top 10%" values and Cohen's h for Stanford University (Stanford: 22.9%); h ranges from 0.00 to about 0.70]
• As we said, we use these benchmarks only for the sake of argument and illustration – a "trivial effect size" can in fact be important!
• But the distribution of effect sizes can certainly inform us

Explanatory research
• "The data also show a statistically significant difference in the proportion of female authors depending on the gender of the first author (t = 2.707, df = 473, p = 0.007). Thus, when the first author was female, the average proportion of female co-authors per paper was 0.49 (SD 0.38, CI 0.53–0.45), whereas when a man was the first author, the average proportion of women dropped to 0.39 (SD 0.39, CI 0.43–0.35)."
• [Figure: 95% CIs for the two proportions – some uncertainty, each interval is 8 percentage points wide]
• Cohen's d = 0.25; r = 0.12: a "small effect", perhaps not that "significant"?

Possible empirical calibration of effect sizes
• Cohen always stated that the guidelines he was providing were simply first attempts to provide some useful general guidelines for interpreting social science research
• His suggestion was that specific areas of research would need to refine these guidelines
• Strictly empirical, based on the actual distribution of all effect sizes between the institutions in the Leiden Ranking
• Based on quartiles:
  – Small effects = difference between the 3rd and 2nd quartiles
  – Medium effects = difference between the upper and lower half
  – Large effects = top quartile versus bottom quartile
• Very instrumental, and cannot initially be applied outside this dataset (see the backup sketch at the end)

Summary
• Size matters
  – Discuss the theoretical and/or practical importance of results (numbers)
  – Always report effect sizes
  – For want of something better, use an ordinal benchmarking like Cohen's
  – Calibrate benchmarks according to the findings and their degree of theoretical and practical importance
  – Effect sizes are comparable across studies
• We need a focus on this; otherwise our studies continue to be mainly instrumental and a-theoretical – there is little understanding or advance in that!

Summary
• [Figure: ©Lortie, Aarsen, Budden & Leimu (2012)]

Thank you for your attention
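Backup: one possible reading of the quartile-based calibration sketched on the "Possible empirical calibration" slide, as runnable code. The distribution of "pp_top 10%" values is simulated here (the actual Leiden Ranking data are not reproduced), and the mapping from the slide's three definitions to concrete quantile groups is my own interpretation, not the author's method.

```python
# Backup sketch: derive empirical "small/medium/large" benchmarks for
# Cohen's h from the quartiles of a distribution of pp_top 10% values.
# The data are simulated (an assumption); plug in real ranking data instead.
import math
import numpy as np

def cohens_h(p1, p2):
    """Absolute difference between arcsine-transformed proportions."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

rng = np.random.default_rng(0)
pp_top10 = rng.beta(a=4, b=28, size=500)  # stand-in for ~500 institutions

q1, q2, q3 = np.quantile(pp_top10, [0.25, 0.50, 0.75])
second_q = np.median(pp_top10[(pp_top10 > q1) & (pp_top10 <= q2)])
third_q  = np.median(pp_top10[(pp_top10 > q2) & (pp_top10 <= q3)])
low_half = np.median(pp_top10[pp_top10 <= q2])
up_half  = np.median(pp_top10[pp_top10 > q2])
bottom_q = np.median(pp_top10[pp_top10 <= q1])
top_q    = np.median(pp_top10[pp_top10 > q3])

print(f"small  ~ h(3rd vs 2nd quartile group) = {cohens_h(third_q, second_q):.2f}")
print(f"medium ~ h(upper vs lower half)       = {cohens_h(up_half, low_half):.2f}")
print(f"large  ~ h(top vs bottom quartile)    = {cohens_h(top_q, bottom_q):.2f}")
```

As the slide says, such benchmarks are very instrumental: they describe this dataset only and cannot initially be applied outside it.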