Funded through the ESRC’s Researcher Development Initiative

Session 3.3: Inter-rater reliability
Prof. Herb Marsh, Ms. Alison O’Mara, Dr. Lars-Erik Malmberg
Department of Education, University of Oxford

Stages of the meta-analysis process:
- Establish research question
- Define relevant studies
- Develop code materials
- Locate and collate studies
- Pilot coding; coding
- Data entry and effect size calculation
- Main analyses
- Supplementary analyses

Interrater reliability
- Aim of the co-judge procedure, to discern:
  - Consistency within coder
  - Consistency between coders
- Take care when making inferences based on little information
- Phenomena that are impossible to code become missing values

Interrater reliability
- Percent agreement: common but not recommended
- Cohen’s kappa coefficient: kappa is the proportion of the optimum improvement over chance attained by the coders; 1 = perfect agreement, 0 = agreement no better than that expected by chance, -1 = perfect disagreement. Kappas over .40 are considered a moderate level of agreement (but there is no clear basis for this “guideline”)
- Correlation between different raters
- Intraclass correlation: agreement among multiple raters, corrected for the number of raters using the Spearman-Brown formula (r)

Interrater reliability of categorical IV (1)

Percent exact agreement = (number of observations agreed on) / (total number of observations)

  Study    1  2  3  4  5  6  7  8  9 10 11 12
  Rater 1  0  1  2  1  1  2  1  1  0  2  1  1
  Rater 2  0  1  1  1  1  2  1  1  0  1  0  1

Categorical IV with 3 discrete scale-steps; 9 ratings are the same, so
% exact agreement = 9/12 = .75

Interrater reliability of categorical IV (2): unweighted Kappa

              Rater 2
  Rater 1     0    1    2   Sum
     0        2    0    0     2
     1        1    6    0     7
     2        0    2    1     3
    Sum       3    8    1    12

  P_O = (2 + 6 + 1) / 12 = .750
  P_E = [(2)(3) + (7)(8) + (3)(1)] / 12^2 = .451
  K = (P_O - P_E) / (1 - P_E) = (.750 - .451) / (1 - .451) = .544

- Positive values indicate how much the raters agree over and above chance alone
- Negative values indicate disagreement
- If the agreement matrix is irregular, Kappa will not be calculated, or will be misleading

Interrater reliability of categorical IV (3): unweighted Kappa in SPSS

  CROSSTABS
    /TABLES=rater1 BY rater2
    /FORMAT=AVALUE TABLES
    /STATISTIC=KAPPA
    /CELLS=COUNT
    /COUNT ROUND CELL.

Symmetric Measures
                               Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
  Measure of Agreement Kappa    .544                   .220          2.719           .007
  N of Valid Cases                12
  a. Not assuming the null hypothesis.
  b. Using the asymptotic standard error assuming the null hypothesis.
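As a cross-check on the hand calculation and the SPSS output above, here is a minimal Python sketch (not part of the original course materials) that reproduces the percent exact agreement and unweighted Cohen’s kappa for the 12-study example; the variable names are illustrative only.

  # Percent exact agreement and unweighted Cohen's kappa for the
  # 12-study example above (data taken from the slide's table).
  from collections import Counter

  rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
  rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]
  n = len(rater1)

  # Percent exact agreement: proportion of studies coded identically.
  p_o = sum(a == b for a, b in zip(rater1, rater2)) / n          # 9/12 = .75

  # Chance agreement from each rater's marginal distribution.
  m1, m2 = Counter(rater1), Counter(rater2)
  categories = set(rater1) | set(rater2)
  p_e = sum(m1[c] * m2[c] for c in categories) / n ** 2          # 65/144 = .451

  kappa = (p_o - p_e) / (1 - p_e)                                # = .544
  print(f"exact agreement = {p_o:.3f}, kappa = {kappa:.3f}")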
Interrater reliability of categorical IV (4): Kappas in irregular matrices

If rater 2 is systematically “above” rater 1 when coding an ordinal scale, Kappa will be misleading; it is possible to “fill up” the matrix with zeros.

              Rater 1
  Rater 2     1    2    3   Sum
     2        4    1    0     5
     3        3    6    1    10
     4        0    3    7    10
    Sum       7   10    8    25        K = .51

Filled up with zeros:

              Rater 1
  Rater 2     1    2    3    4   Sum
     1        0    0    0    0     0
     2        4    1    0    0     5
     3        3    6    1    0    10
     4        0    3    7    0    10
    Sum       7   10    8    0    25   K = -.16

Interrater reliability of categorical IV (5): Kappas in irregular matrices

If there are no observations in some row or column, Kappa will not be calculated; it is possible to “fill up” the matrix with zeros.

              Rater 1
  Rater 2     1    3    4   Sum
     1        4    0    0     4
     2        2    1    0     3
     3        1    3    2     6
     4        0    1    4     5
    Sum       7    5    6    18        K not possible to estimate

Filled up with zeros:

              Rater 1
  Rater 2     1    2    3    4   Sum
     1        4    0    0    0     4
     2        2    0    1    0     3
     3        1    0    3    2     6
     4        0    0    1    4     5
    Sum       7    0    5    6    18   K = .47

Interrater reliability of categorical IV (6): weighted Kappa using a SAS macro

  PROC FREQ DATA = int.interrater1;
    TABLES rater1 * rater2 / AGREE;
    TEST KAPPA;
  RUN;

  K_W = 1 - (Σ_i w_i p_oi) / (Σ_i w_i p_ei)

Papers and macros are available for estimating Kappa when there are unequal or misaligned rows and columns, or multiple raters: <http://www.stataxis.com/about_me.htm>

Interrater reliability of continuous IV (1)

  Study    Rater 1   Rater 2   Rater 3
    1         5         6         5
    2         2         1         2
    3         3         4         4
    4         4         4         4
    5         5         5         5
    6         3         3         4
    7         4         4         4
    8         4         3         3
    9         3         3         2
   10         2         2         1
   11         1         2         1
   12         3         3         3

SPSS Pearson correlations (N = 12; ** = significant at the 0.01 level, 2-tailed):
  rater1 with rater2: r = .873**
  rater1 with rater3: r = .879**
  rater2 with rater3: r = .866**

Average correlation r = (.873 + .879 + .866) / 3 = .873
Coders code in the same direction!

Interrater reliability of continuous IV (2)

Estimates of covariance parameters (a. Dependent variable: rating):
  Residual (within-study) variance:                 .222222
  Intercept [subject = study] (between) variance:  1.544613

  ICC = σ²_B / (σ²_B + σ²_W) = 1.544 / (1.544 + 0.222) = 1.544 / 1.767 = 0.874

(A short sketch reproducing these variance components is given at the end of this section.)

Interrater reliability of continuous IV (3)
- Design 1: one-way random effects model, when each study is rated by a different pair of coders
- Design 2: two-way random effects model, when a random pair of coders rates all studies
- Design 3: two-way mixed effects model, when ONE pair of coders rates all studies

Comparison of methods (from Orwin, p. 153; in Cooper & Hedges, 1994):
Low Kappa but good AR (agreement rate) can occur when there is little variability across items and the coders agree.

Interrater reliability in meta-analysis and primary study
- Interrater reliability in meta-analysis vs. in other contexts
- Meta-analysis: coding of independent variables
- How many co-judges?
- How many objects to co-judge? (a sub-sample of studies versus a sub-sample of codings)
- Use of a “gold standard” (i.e., one “master-coder”)
- Coder drift (cf. observer drift): are coders consistent over time?
- Your qualitative analysis is only as good as the quality of your categorisation of qualitative data
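A minimal Python sketch of the ICC calculation for the continuous-IV example above (not part of the original materials). It assumes the one-way random-effects (ANOVA) estimator of the variance components; because the design is balanced, this reproduces the SPSS MIXED estimates shown on the slide (between-study variance ≈ 1.544, residual ≈ 0.222) and hence the ICC of about .874. All names are illustrative.

  # One-way random-effects variance components and ICC for the
  # 12-study x 3-rater example above.
  ratings = [
      [5, 6, 5], [2, 1, 2], [3, 4, 4], [4, 4, 4], [5, 5, 5], [3, 3, 4],
      [4, 4, 4], [4, 3, 3], [3, 3, 2], [2, 2, 1], [1, 2, 1], [3, 3, 3],
  ]
  n_studies, k_raters = len(ratings), len(ratings[0])

  grand_mean = sum(sum(row) for row in ratings) / (n_studies * k_raters)
  study_means = [sum(row) / k_raters for row in ratings]

  # One-way ANOVA mean squares: studies are the random "subjects".
  ms_between = k_raters * sum((m - grand_mean) ** 2 for m in study_means) / (n_studies - 1)
  ms_within = sum(
      (x - m) ** 2 for row, m in zip(ratings, study_means) for x in row
  ) / (n_studies * (k_raters - 1))

  # Variance components (balanced design): these match the SPSS MIXED
  # estimates on the slide (sigma2_between ~ 1.544, sigma2_within ~ 0.222).
  sigma2_within = ms_within
  sigma2_between = (ms_between - ms_within) / k_raters

  icc = sigma2_between / (sigma2_between + sigma2_within)   # ~ 0.874
  print(f"sigma2_B = {sigma2_between:.3f}, sigma2_W = {sigma2_within:.3f}, ICC = {icc:.3f}")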