Size matters – also in scientometric
studies and research evaluation
Jesper W. Schneider
Danish Centre for Studies in Research & Research Policy,
Department of Political Science & Government,
Aarhus University, Denmark
jws@cfa.au.dk
Last year …
• ... I complained about the rote use of statistical significance tests
• ... one of my claims was that such tests are used more or less
mindlessly to mechanically decide whether a result is important or
not
• Similar to last year, I will share with you some further concerns of
mine; it basically continues where I left off ...
... what bothers me this year?
• We rarely reflect on the numbers we produce ... and we produce a
lot!
• We mainly pose dichotomous or ordinal hypotheses (“weak
hypotheses”)
• ... it is the easy way, but it is not particularly informative or for that
matter scientific
• Compare this to researchers in the natural sciences
• They do reflect upon their numbers and they formulate “strong”
hypotheses or theories and test their predictive ability
• ... researchers in the natural sciences are concerned with questions
such as “how much” or “to what degree”, questions concerning the
size of effects and the impact of such effects ... to them, size matters
Consider this
Have you ever considered what the numerical
differences between JIFs imply in practice?
[Figure annotation: up to, on average, 1 citation
per paper in difference]
Consider this
[Figure annotation: the difference from
#101 to #125 is 0.5 percentage points]
Management at my university like (really like) rankings; their focus, though, is
solely on the rank position – in this case #117
Consider this
• Possible gender inequalities in publication output from
Spanish psychology researchers
• One research question:
– “Is there a difference between the proportion of female
authors depending on the gender of the first author?”
• Statistical hypothesis tested:
– “there is no difference”
(we keep the study anonymous)
Some suggested solutions
• Example, three research groups:

            Group 1   Group 2   Group 3
  MNCS        1.42      1.64      1.56

• Some colleagues have recently argued that statistical significance tests
should be used in order to detect “significant differences”
• Kruskal-Wallis (K-W) tests are suggested
• H0: Group 1 = Group 2 = Group 3 → no difference in medians between
groups
• K-W is “significant” (p = .014, α = 5%)
• Pairwise comparisons suggest that “significant differences” exist between
groups 1 and 2 (p = .008), as well as groups 1 and 3 (p = .013)!
• Finito!
... not so, statistical significance cannot tell us to what
degree the results are important!
• … but effect sizes can help us interpret the data
• r = Z / √N … where Z is the statistic from the Mann-Whitney tests used for the
pairwise comparisons, and N is the total number of cases compared

  Effect sizes (r):
            Group 1   Group 2   Group 3
  Group 1      –        0.13      0.13
  Group 2                –        0.02
  Group 3                          –
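As a concrete illustration, the whole chain – omnibus Kruskal-Wallis test,
pairwise Mann-Whitney tests, and conversion of each Z into r = Z / √N – could
look like the Python sketch below. The three samples are hypothetical
placeholders (the real MNCS data behind the slide are not shown), and SciPy
is assumed to be available:

```python
# Minimal sketch: Kruskal-Wallis, pairwise Mann-Whitney, and r = Z/sqrt(N).
# The three samples below are hypothetical placeholders, not the real data.
import math
from itertools import combinations
from scipy import stats

groups = {
    "Group 1": [0.8, 1.1, 1.3, 1.5, 1.6, 1.7, 1.9, 2.1],
    "Group 2": [1.2, 1.4, 1.6, 1.7, 1.8, 2.0, 2.2, 2.4],
    "Group 3": [1.1, 1.3, 1.5, 1.6, 1.8, 1.9, 2.1, 2.2],
}

# Omnibus Kruskal-Wallis test across the three groups
h_stat, p_omnibus = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_omnibus:.3f}")

# Pairwise Mann-Whitney tests; Z is recovered from the normal
# approximation of U, and r = Z / sqrt(N) is the effect size.
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    n1, n2 = len(a), len(b)
    mu = n1 * n2 / 2                                 # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # SD of U (no tie correction)
    z = (u - mu) / sigma
    r = abs(z) / math.sqrt(n1 + n2)
    print(f"{name_a} vs {name_b}: p = {p:.3f}, r = {r:.2f}")
```

The point is not the particular numbers but that the effect sizes r, not the
p-values, carry the information about how big the differences actually are.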
Effect sizes
• Judgments about the relative importance of results must be
determined by theory, former findings, practical implications,
cost-benefits … whatever informs them
• Effect sizes should be the starting point for such judgments –
as size matters!
• Effect sizes are measures of the strength of a phenomenon,
and they come in standardized (scale-free) and non-standardized forms
• Effect sizes are comparable across studies
Interpretation
• Theory, former findings, practical implications etc. that can help us
interpret results are not plentiful in our field
• Example
– r = 0.47, is that a big or small effect, in the scheme of things?
• If you haven't a clue, you're not alone. Most people don't know
how to interpret the magnitude of a correlation, or the magnitude
of any other effect statistic. But people can understand trivial,
small, moderate, and large, so qualitative terms like these can be
used to discuss results.
A scale of magnitudes for effect statistics – a beginning
• Cohen (1988) reluctantly suggested a benchmark-scale for effect sizes
based upon important findings in the behavioral and social sciences
  Standardized effect size      Small   Medium   Large
  Correlation                    0.1     0.3      0.5
  Cohen’s h (proportions)        0.2     0.5      0.8
• The main justification for this scale of correlations comes from the
interpretation of the correlation coefficient as the slope of the line
between two variables when their standard deviations are the same. For
example, if the correlation between height (X variable) and weight (Y
variable) is 0.7, then individuals who differ in height by one standard
deviation will on average differ in weight by only 0.7 of a standard
deviation. So, for a correlation of 0.1, the change in Y is only one-tenth of
the change in X.
A scale of magnitudes for effect statistics – a beginning

  Magnitude        Correlation   Difference in means
  Trivial             0                0
  Small               0.1              0.2
  Moderate            0.3              0.6
  Large               0.5              1.2
  Very large          0.7              2
  Nearly perfect      0.9              4
  Perfect             1                infinite
In principle, we are skeptical of this categorization as it also encourages
"mechanical thinking" similar to statistical significance tests. A "one size fits
all" yardstick is problematic, as Cohen himself argued. However, for the sake
of argument we use these benchmarks to illustrate what we need to focus
upon: the importance of findings! (A small helper mapping effect sizes onto
these labels is sketched below.)
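Purely as an illustration of using the scale, a hypothetical Python helper
(the function name and the treatment of values exactly on a threshold are
our own choices, not anything from the slides):

```python
# Hypothetical helper: map a correlation-type effect size onto the
# qualitative magnitude scale from the table above. The thresholds come
# from the table; taking each threshold as the lower bound of its band
# is our own assumption.
def magnitude_label(r: float) -> str:
    r = abs(r)
    if r >= 1.0:
        return "perfect"
    for threshold, label in [(0.9, "nearly perfect"), (0.7, "very large"),
                             (0.5, "large"), (0.3, "moderate"),
                             (0.1, "small")]:
        if r >= threshold:
            return label
    return "trivial"

print(magnitude_label(0.47))  # the earlier example: -> "moderate"
```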
Returning to the group comparison

            Group 1   Group 2   Group 3
  MNCS        1.42      1.64      1.56

• r = Z / √N … where Z is the statistic from the Mann-Whitney tests used for the
pairwise comparisons, and N is the total number of cases compared

  Effect sizes (r):
            Group 1   Group 2   Group 3
  Group 1      –        0.13      0.13
  Group 2                –        0.02
  Group 3                          –

• The differences between group 1 and groups 2 and 3 (r = 0.13) are very
small, close to trivial; the difference between groups 2 and 3 (r = 0.02) is
extremely trivial
University rankings: Stanford University vs. UCLA

Leiden Ranking:

  Institution            pp_top 10%   No. papers
  Stanford University      22.9%        26917
  UCLA                     19.4%        30865

H0: p1 = p2; HA: p1 ≠ p2
z-test = “significant difference” (p < .0001)
Difference = 3.5 percentage points

Does this mean that we have a “difference that makes a difference”
between the rankings of these two universities? (A sketch of the test and the
corresponding effect size follows below.)
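To make the contrast concrete, here is a minimal sketch of both the z-test
and Cohen's h, assuming only the figures in the table above (the top-10%
counts are approximated from the reported shares, so this reproduces the
slide's numbers only roughly):

```python
# Sketch: two-proportion z-test and Cohen's h for Stanford vs. UCLA,
# using the pp_top 10% shares and paper counts from the Leiden Ranking
# slide. Top-10% counts are approximated from the reported shares.
import math

p1, n1 = 0.229, 26917   # Stanford University
p2, n2 = 0.194, 30865   # UCLA

# Pooled two-proportion z-test
x1, x2 = p1 * n1, p2 * n2            # approximate counts of top-10% papers
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(f"z = {z:.1f}")                # very large z -> p far below .0001

# Cohen's h: difference between arcsine-transformed proportions
h = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
print(f"Cohen's h = {h:.2f}")        # ~0.09, i.e. a trivial-to-small effect
```

With samples of roughly 27,000 and 31,000 papers, even a tiny difference
yields an enormous z; Cohen's h is unaffected by sample size and shows that
the difference itself is small.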
Differences in effect sizes between Stanford
University and one of the other 499 institutions in
the Leiden ranking

[Figure: distribution and cumulative distribution of Cohen’s h for Stanford
University versus each of the other 499 institutions. Roughly 20% of the
differences are “trivial effects”, roughly 76% are “small effects”, and roughly
4% are “medium effects”. Stanford vs. UCLA: Cohen’s h = 0.09.]
Leiden Ranking: effect sizes (Cohen’s h) for Stanford University versus
the other institutions

  Institution                  Cohen's h   pp_top 10%   Pubs
  Stanford Univ                  0.00        22.9       26917
  Univ Calif Santa Barbara       0.01        22.6        9250
  Univ Calif Berkeley            0.01        22.3       22213
  Harvard Univ                   0.02        23.6       61462
  Caltech                        0.02        22.0       13379
  Univ Calif San Francisco       0.03        21.8       20527
  Princeton Univ                 0.03        24.0       10832
  Rice Univ                      0.03        21.7        4646
  Univ Chicago                   0.04        21.1       14103
  Univ Calif Santa Cruz          0.05        20.9        4667

Further small differences: Univ Washington - Seattle (0.06), Columbia Univ
(0.06), Yale Univ (0.07), MIT (0.07), Duke Univ (0.07), Univ Penn (0.08),
Carnegie Mellon Univ (0.08), Ecole Polytecn Fed Lausanne (0.08), Univ Calif
San Diego (0.08), Washington Univ - St Louis (0.09), Univ Calif Los Angeles
(0.09)

[Figure: relation between “pp_top 10%” values and Cohen’s h for Stanford
University (pp_top 10% = 22.9). The largest differences – Cohen’s h ≈
0.54–0.64, at pp_top 10% values of roughly 3–5% – are found for institutions
such as Hacettepe Univ Ankara, Nihon Univ, Gifu Univ, Istanbul Univ, Gazi
Univ, Ankara Univ, Belgrade Univ, Adam Mickiewicz Univ Poznan, Lomonosov
Moscow State Univ and St Petersburg State Univ.]

As we said, we use these benchmarks only for the sake of argument and
illustration; a “trivial effect size” can in fact be important! But the
distribution of effect sizes can certainly inform us (a sketch of how such a
distribution could be computed follows below).
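How such a distribution could be produced is sketched below with randomly
generated placeholder shares (not the actual Leiden data); only the
computation of Cohen's h and the binning against Cohen's benchmarks mirror
the figure:

```python
# Sketch: distribution of Cohen's h between a focal institution and all
# others. The pp_top 10% shares below are randomly generated placeholders,
# not the actual Leiden Ranking data.
import math
import random

random.seed(1)
shares = [random.uniform(0.03, 0.26) for _ in range(499)]  # other institutions
focal = 0.229                                              # e.g. Stanford

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

hs = [cohens_h(focal, p) for p in shares]

# Bin against Cohen's benchmarks for h (0.2 small, 0.5 medium, 0.8 large)
bins = {"trivial (<0.2)": 0, "small (0.2-0.5)": 0,
        "medium (0.5-0.8)": 0, "large (>=0.8)": 0}
for h in hs:
    if h < 0.2:
        bins["trivial (<0.2)"] += 1
    elif h < 0.5:
        bins["small (0.2-0.5)"] += 1
    elif h < 0.8:
        bins["medium (0.5-0.8)"] += 1
    else:
        bins["large (>=0.8)"] += 1

for label, count in bins.items():
    print(f"{label}: {count / len(hs):.0%}")
```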
Explanatory research
“The data also show a statistically significant difference in the proportion of
female authors depending on the gender of the first author (t = 2.707, df = 473,
p = 0.007). Thus, when the first author was female, the average proportion of
female co-authors per paper was 0.49 (SD 0.38, CI 0.53–0.45), whereas when a
man was the first author, the average proportion of women dropped to 0.39 (SD
0.39, CI 0.43–0.35).”

[Figure: 95% CIs for the two proportions. Some uncertainty – each interval
is 0.08 wide]

Cohen’s d = 0.25 → r = 0.12: a “small effect”,
perhaps not that “significant”? (The conversion is sketched below.)
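For reference, the slide's effect sizes can be reproduced approximately from
the reported means and SDs (the group n's are not given, so a crude averaged
SD stands in for the pooled SD, and the standard d-to-r conversion assumes
equal group sizes):

```python
# Sketch: Cohen's d from the reported means/SDs, then converted to r.
# Group sizes are not given, so the SDs are simply averaged (a pooled SD
# would need the n's); this reproduces the slide's numbers approximately.
import math

mean_f, sd_f = 0.49, 0.38   # first author female
mean_m, sd_m = 0.39, 0.39   # first author male

sd = (sd_f + sd_m) / 2                 # crude stand-in for the pooled SD
d = (mean_f - mean_m) / sd
r = d / math.sqrt(d ** 2 + 4)          # standard d -> r conversion (equal n's)

print(f"d = {d:.2f}, r = {r:.2f}")     # -> d = 0.26, r = 0.13 (slide: 0.25, 0.12)
```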
Possible empirical calibration of Effect Sizes
• Cohen always stated that the guidelines he was providing were simply first
attempts to provide some useful general guidelines for interpreting social
science research
• His suggestion was that specific areas of research would need to refine
these guidelines
• One possibility is strictly empirical: base the benchmarks on the actual
distribution of all effect sizes between the institutions in the Leiden Ranking
• Based on quartiles:
– Small effects = difference between 3rd and 2nd quartiles
– Medium effects = difference between upper and lower half
– Large effects = top quartile versus bottom quartile
• Very instrumental, and it cannot initially be applied outside this dataset
(a sketch follows below)
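A hypothetical sketch of such a calibration, based on our own reading of the
three quartile rules above and applied to placeholder effect sizes rather
than the real Leiden distribution:

```python
# Hypothetical sketch of an empirical calibration from a distribution of
# pairwise effect sizes. The three rules are our own reading of the bullets
# above; `effect_sizes` is placeholder data, not the real Leiden distribution.
import statistics

effect_sizes = sorted([0.01, 0.02, 0.05, 0.08, 0.09, 0.13,
                       0.18, 0.25, 0.33, 0.47, 0.54, 0.64])
n = len(effect_sizes)

q1, q2, q3 = statistics.quantiles(effect_sizes, n=4)   # quartile cut points

# small: difference between the 3rd and 2nd quartile cut points
small = q3 - q2
# medium: mean of the upper half vs. mean of the lower half
medium = (statistics.mean(effect_sizes[n // 2:])
          - statistics.mean(effect_sizes[:n // 2]))
# large: mean of the top quartile vs. mean of the bottom quartile
large = (statistics.mean(effect_sizes[3 * n // 4:])
         - statistics.mean(effect_sizes[:n // 4]))

print(f"small ≈ {small:.2f}, medium ≈ {medium:.2f}, large ≈ {large:.2f}")
```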
Summary
• Size matters
– Discuss the theoretical and/or practical importance of results
(numbers)
– Always report effect sizes
– For want of something better use an ordinal benchmarking like
Cohen’s
– Calibrate benchmarks according to the findings and their degree
of theoretical and practical importance
– Effect sizes are comparable across studies
• We need a focus on this; otherwise our studies will continue to be
mainly instrumental and atheoretical – there is little understanding
or advance in that!
Summary
[Image: ©Lortie, Aarsen, Budden & Leimu (2012)]
Thank you for your attention