Inter-expert and intra-expert reliability in sleep spindle scoring

Sabrina L. Wendt, Peter Welinder, Helge B.D. Sorensen, Paul E. Peppard, Poul Jennum, Pietro Perona, Emmanuel Mignot, Simon C. Warby

Supplementary Data

Figure S1: Example of how pooling spindles with different confidence scores leads to improved performance.
Table S1: Intra-expert pairwise F1-score agreement (event-by-event and epoch-by-epoch).
Table S2: Inter-expert pairwise F1-score agreement (event-by-event and epoch-by-epoch).
Table S3: Intra-expert pairwise κ reliability (sample-by-sample and epoch-by-epoch).
Table S4: Inter-expert pairwise κ reliability (sample-by-sample and epoch-by-epoch).
Table S5: Intra-expert pairwise spindle characteristics reliability of matched events.
Table S6: Inter-expert pairwise spindle characteristics reliability of matched events.
Table S7: Pairwise intra-expert average overlap score.
Table S8: Pairwise inter-expert average overlap score.

Figure S1: Example of how pooling spindles with different confidence scores can lead to improved performance. In this example, scorer 1 is quite conservative and very confident in his/her detected spindles, whereas scorer 2 is more uncertain and captures more spindles that may be doubtful. The performance (F1-score) measures the overall agreement between the two scorers, and it therefore depends on which sets of spindles are included in the comparison (i.e., which categories of confidence scores are allowed). Note that confidence scores are not used directly in the performance comparison - the confidence score is only used to determine which sets of spindle detections are being compared. Once the categories to compare have been chosen, spindles with those particular levels of confidence are converted to binary (yes/no) data in order to calculate the performance. This example explains why performance improves when spindles with varying confidence levels are pooled together.
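The pooling effect described in this caption can be illustrated with a minimal Python sketch. The event times, the any-overlap matching rule, and the names `f1_score`, `scorer_1` and `scorer_2` are assumptions made purely for illustration; they are not the matching procedure or the data used in the study.

```python
from typing import List, Set, Tuple

# Hypothetical annotation format: (start_s, end_s, confidence), confidence in {"H", "M", "L"}.
Event = Tuple[float, float, str]


def select(events: List[Event], allowed: Set[str]) -> List[Tuple[float, float]]:
    """Keep only events whose confidence is allowed; the confidence label is then dropped."""
    return sorted((start, end) for start, end, conf in events if conf in allowed)


def count_matches(a: List[Tuple[float, float]], b: List[Tuple[float, float]]) -> int:
    """One-to-one matching of sorted intervals; any temporal overlap counts (assumed rule)."""
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if a[i][1] <= b[j][0]:      # a[i] ends before b[j] starts
            i += 1
        elif b[j][1] <= a[i][0]:    # b[j] ends before a[i] starts
            j += 1
        else:                       # the two events overlap: count one matched pair
            matches += 1
            i += 1
            j += 1
    return matches


def f1_score(events_1: List[Event], events_2: List[Event], allowed: Set[str]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), which equals 2*TP / (n_1 + n_2) for event sets."""
    a, b = select(events_1, allowed), select(events_2, allowed)
    if not a and not b:
        return 0.0  # F1 is undefined without events; 0.0 is returned by convention here
    return 2 * count_matches(a, b) / (len(a) + len(b))


# Made-up example: scorer 1 is conservative and confident, scorer 2 is less certain.
scorer_1 = [(1.0, 1.8, "H"), (5.0, 5.6, "H"), (9.0, 9.9, "H")]
scorer_2 = [(1.1, 1.9, "H"), (5.1, 5.7, "M"), (9.0, 9.8, "L"), (12.0, 12.5, "L")]

for allowed in ({"H"}, {"H", "M"}, {"H", "M", "L"}):
    print(sorted(allowed), round(f1_score(scorer_1, scorer_2, allowed), 3))
```

With these made-up annotations the F1-score is 0.5 for H only, 0.8 for H+M, and 6/7 (about 0.857) for H+M+L. Both scorers marked the same three spindles, but scorer 2 rated two of them below H, so those two only count as agreement once the lower confidence categories are allowed.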
Experts often find the same spindles but do not assign them equal confidence scores; allowing spindles of all confidence levels therefore increases the overall agreement between the experts. This is due to the vague definitions of how to categorize the spindles, but also because the certainty about a spindle is highly individual, as is spindle scoring itself. In other words, agreement between scorers is better when the scoring system allows them some flexibility: they can indicate that some spindles are not perfect, but are still spindles. When the confidence scores are pooled, the performance goes up because the two experts do not need to agree on the confidence level of a spindle, only on the presence of a spindle.

Table S1: Intra-expert pairwise F1-score agreement (event-by-event and epoch-by-epoch).

Experts   Event L     Event M    Event H    Event H+M   Event H+M+L   Epoch H+M+L   Number of Epochs
F1_F2     30.6        54.2       66.7       74.4        81.2          87.0          250
F1_F3     7.0         39.5       58.8       59.4        64.9          84.6          250
F2_F3     15.2        45.8       57.1       65.0        70.9          83.7          250
I1_I2     26.3        35.8       73.9       72.6        72.8          87.5          56
mean±SD   19.8±10.7   43.8±8.1   64.1±7.7   67.9±6.9    72.4±6.7      85.7±1.8      202±97

The intra-expert F1-score agreement for two experts (expert F and expert I) after repeated observations of the indicated number of epochs. Agreement is reported on an event-by-event or epoch-by-epoch basis. Spindles are divided into groups based on their assigned confidence scores: H (high = ‘definitely’), M (medium = ‘probably’) and L (low = ‘guessing’). See further description in Table 1.

Table S2: Inter-expert pairwise F1-score agreement (event-by-event and epoch-by-epoch).

Experts   Event L   Event M   Event H    Event H+M  Event H+M+L  Epoch H+M+L  Number of Epochs
A_B       0.0       4.9       53.2       61.1       65.0         74.9         834
A_C       0.0       4.0       55.4       61.3       57.8         76.1         1180
A_D       0.0       2.8       39.7       54.9       59.1         71.1         726
A_E       0.0       2.3       31.4       51.2       59.4         71.8         727
A_F       15.7      16.4      42.8       58.7       62.1         73.5         971
A_G       0.0       6.8       66.9       66.9       65.4         79.3         485
A_H       0.0       5.5       63.9       58.0       54.7         67.2         550
A_I       14.9      21.0      57.0       57.6       58.9         75.2         1054
A_J       22.5      19.7      59.3       63.8       68.3         78.3         374
B_C       0.0       0.0       43.9       45.7       45.7         63.7         493
B_D       0.0       0.6       44.8       59.0       60.5         80.6         305
B_E       9.9       12.3      41.6       58.3       62.1         73.6         325
B_F       8.9       21.4      50.2       53.8       48.6         65.6         411
B_I       16.2      28.3      57.8       72.0       73.5         88.0         462
C_D       21.5      24.5      50.8       63.7       68.6         80.7         450
C_E       0.0       5.7       37.9       50.2       55.7         65.7         444
C_F       15.1      25.3      37.7       53.0       62.0         76.5         598
C_H       12.6      11.6      62.9       62.5       65.5         78.4         343
C_I       5.8       18.5      55.8       61.0       65.1         80.8         634
D_F       14.6      27.6      43.8       56.0       64.6         75.7         750
E_F       0.0       0.0       36.1       47.7       61.3         70.3         500
E_I       0.0       0.0       36.2       49.5       54.8         71.1         496
G_I       9.6       11.3      30.0       57.7       66.2         77.7         500
H_I       20.4      14.0      32.2       55.3       67.7         79.5         501
mean±SD   7.8±8.2   11.9±9.6  47.1±11.0  57.5±6.2   61.4±6.4     74.8±5.8     588±232

The inter-expert F1-score agreement for 24 expert pairs (experts A-J) for the indicated number of epochs. Agreement is reported on an event-by-event or epoch-by-epoch basis. Spindles are divided into groups based on their assigned confidence scores: H (high = ‘definitely’), M (medium = ‘probably’) and L (low = ‘guessing’). See further description in Table 1.

Table S3: Intra-expert pairwise κ reliability (sample-by-sample and epoch-by-epoch).

Experts   Sample L    Sample M    Sample H    Sample H+M  Sample H+M+L  Epoch H+M+L   Number of Epochs
F1_F2     0.28        0.49        0.62        0.70        0.74          0.81          250
F1_F3     0.05        0.34        0.54        0.55        0.58          0.76          250
F2_F3     0.13        0.38        0.53        0.59        0.63          0.74          250
I1_I2     0.21        0.30        0.69        0.67        0.67          0.56          56
mean±SD   0.17±0.10   0.38±0.08   0.60±0.07   0.63±0.07   0.66±0.07     0.72±0.11     202±97

The intra-expert κ agreement for two experts (expert F and expert I) after repeated observations of the indicated number of epochs. Agreement is reported on a sample-by-sample or epoch-by-epoch basis.
Spindles are divided into groups based on their assigned confidence scores: H (high = ‘definitely’), M (medium = ‘probably’) and L (low = ‘guessing’). See further description in Table 2.

Table S4: Inter-expert pairwise κ reliability (sample-by-sample and epoch-by-epoch).

Experts   Sample L   Sample M   Sample H   Sample H+M  Sample H+M+L  Epoch H+M+L  Number of Epochs
A_B       0.00       0.01       0.29       0.41        0.47          0.48         834
A_C       0.00       0.04       0.48       0.51        0.52          0.55         1180
A_D       0.00       0.04       0.50       0.44        0.41          0.40         726
A_E       0.00       0.05       0.57       0.57        0.56          0.57         727
A_F       0.00       0.02       0.38       0.48        0.49          0.53         971
A_G       0.00       0.05       0.33       0.42        0.46          0.48         485
A_H       0.00       0.00       0.36       0.38        0.38          0.30         550
A_I       0.00       0.04       0.46       0.46        0.43          0.46         1054
A_J       0.00       0.00       0.34       0.42        0.52          0.53         374
B_C       0.19       0.21       0.47       0.58        0.62          0.63         493
B_D       0.15       0.10       0.30       0.48        0.59          0.56         305
B_E       0.06       0.06       0.31       0.51        0.56          0.50         325
B_F       0.11       0.22       0.39       0.48        0.56          0.56         411
B_I       0.11       0.20       0.35       0.47        0.53          0.41         462
C_D       0.10       0.10       0.59       0.59        0.61          0.57         450
C_E       0.04       0.16       0.53       0.56        0.58          0.61         444
C_F       0.19       0.18       0.55       0.59        0.61          0.63         598
C_H       0.00       0.00       0.35       0.45        0.48          0.43         343
C_I       0.12       0.18       0.52       0.51        0.52          0.42         634
D_F       0.13       0.14       0.40       0.52        0.53          0.50         750
E_F       0.07       0.10       0.44       0.57        0.57          0.55         500
E_I       0.10       0.18       0.51       0.56        0.54          0.66         496
G_I       0.06       0.17       0.44       0.45        0.40          0.41         500
H_I       0.00       0.00       0.40       0.47        0.47          0.48         501
mean±SD   0.06±0.07  0.09±0.08  0.43±0.09  0.50±0.06   0.52±0.07     0.51±0.09    588±232

The inter-expert κ agreement for 24 expert pairs (experts A-J) for the indicated number of epochs. Agreement is reported on a sample-by-sample or epoch-by-epoch basis. Spindles are divided into groups based on their assigned confidence scores: H (high = ‘definitely’), M (medium = ‘probably’) and L (low = ‘guessing’). See further description in Table 2.

Table S5: Intra-expert pairwise spindle characteristics reliability of matched events.

Experts   Duration   Amplitude  Frequency  Number of Epochs
F1_F2     0.72       0.96       0.89       250
F1_F3     0.60       0.98       0.90       250
F2_F3     0.53       0.91       0.87       250
I1_I2     0.85       0.97       0.89       56
mean±SD   0.68±0.14  0.95±0.03  0.89±0.03  202±97

The intra-expert pairwise agreement for two experts (expert F and expert I) after repeated observations of the indicated number of epochs. Spindle duration (seconds) is estimated directly by the expert. Spindle maximum peak-to-peak amplitude (µV) and oscillation frequency (Hz) are calculated from the event detected by the expert. See further description in Table 3.

Table S6: Inter-expert pairwise spindle characteristics reliability of matched events.

Experts   Duration   Amplitude  Frequency  Number of Epochs
A_B       0.37       0.92       0.89       834
A_C       0.38       0.92       0.86       1180
A_D       0.29       0.89       0.87       726
A_E       0.52       0.82       0.84       727
A_F       0.48       0.92       0.87       971
A_G       0.55       0.94       0.89       485
A_H       0.53       0.93       0.86       550
A_I       0.23       0.87       0.83       1054
A_J       0.51       0.94       0.88       374
B_C       0.47       0.92       0.90       493
B_D       0.42       0.94       0.91       305
B_E       0.39       0.94       0.86       325
B_F       0.47       0.93       0.86       411
B_I       0.54       0.92       0.90       462
C_D       0.74       0.95       0.96       450
C_E       0.36       0.91       0.86       444
C_F       0.58       0.97       0.94       598
C_H       0.23       0.87       0.89       343
C_I       0.52       0.90       0.91       634
D_F       0.47       0.88       0.91       750
E_F       0.53       0.92       0.85       500
E_I       0.01       0.82       0.77       496
G_I       0.44       0.98       0.92       500
H_I       0.16       0.94       0.88       501
mean±SD   0.43±0.16  0.91±0.04  0.88±0.04  588±232

The inter-expert pairwise agreement for 24 expert pairs (experts A-J) for the indicated number of epochs. Spindle duration (seconds) is estimated directly by the expert. Spindle maximum peak-to-peak amplitude (µV) and oscillation frequency (Hz) are calculated from the event detected by the expert. See further description in Table 3.

Table S7: Pairwise intra-expert average overlap score.

Experts   Mean  SD    Number of Epochs
F1_F2     0.82  0.12  250
F1_F3     0.80  0.13  250
F2_F3     0.79  0.15  250
I1_I2     0.82  0.09  56
mean±SD   0.81  0.12  202±97

The average (mean and SD) intra-expert overlap score for events detected by two experts (expert F and expert I) after repeated observations of the indicated number of epochs. The overlap score measures the amount of overlap between matched spindle events and is defined as the intersecting duration divided by the united duration (defined in Supplementary Figure 1A). See further description in Table 4.

Table S8: Pairwise inter-expert average overlap score.

Experts   Mean  SD    Number of Epochs
A_B       0.73  0.15  834
A_C       0.72  0.14  1180
A_D       0.70  0.15  726
A_E       0.74  0.14  727
A_F       0.75  0.13  971
A_G       0.74  0.13  485
A_H       0.73  0.14  550
A_I       0.69  0.14  1054
A_J       0.75  0.13  374
B_C       0.82  0.14  493
B_D       0.80  0.17  305
B_E       0.74  0.16  325
B_F       0.77  0.15  411
B_I       0.78  0.14  462
C_D       0.86  0.12  450
C_E       0.73  0.15  444
C_F       0.79  0.13  598
C_H       0.75  0.14  343
C_I       0.80  0.12  634
D_F       0.78  0.15  750
E_F       0.76  0.14  500
E_I       0.63  0.14  496
G_I       0.76  0.13  500
H_I       0.70  0.15  501
mean±SD   0.75  0.14  588±232

The average (mean and SD) inter-expert overlap score for events detected by 24 expert pairs (experts A-J) for the indicated number of epochs. The overlap score measures the amount of overlap between matched spindle events and is defined as the intersecting duration divided by the united duration (defined in Supplementary Figure 1A). See further description in Table 4.
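
The overlap score defined in the captions of Tables S7 and S8 (intersecting duration divided by united duration) can be sketched in a few lines of Python. The function name `overlap_score`, the `(start, end)` tuple format in seconds, and the example event times are assumptions for illustration, not the study's code or data.

```python
from statistics import mean
from typing import Tuple

Interval = Tuple[float, float]  # hypothetical format: (start_s, end_s) of one spindle event


def overlap_score(event_a: Interval, event_b: Interval) -> float:
    """Intersecting duration divided by united duration of two events."""
    start_a, end_a = event_a
    start_b, end_b = event_b
    # Duration shared by both events (0 if they do not overlap at all).
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    # Duration covered by at least one of the two events.
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up matched event pairs from two scorings of the same recording.
matched_pairs = [
    ((10.0, 11.0), (10.25, 11.25)),  # shifted by 0.25 s: 0.75 / 1.25
    ((20.0, 21.0), (20.0, 21.0)),    # identical boundaries: 1.0 / 1.0
    ((30.0, 31.0), (30.5, 31.0)),    # second event is half as long: 0.5 / 1.0
]

scores = [overlap_score(a, b) for a, b in matched_pairs]
average_overlap = mean(scores)
print(scores, average_overlap)
```

The score is 1 only when both experts mark identical start and end times, and it drops as the boundaries diverge, so the values in Tables S7 and S8 describe how closely the matched events line up in time rather than whether an event was detected at all.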