Inter-expert and intra-expert reliability in sleep spindle scoring
Sabrina L. Wendt, Peter Welinder, Helge B.D. Sorensen, Paul E. Peppard, Poul Jennum, Pietro
Perona, Emmanuel Mignot, Simon C. Warby
Supplementary Data
Figure S1: Example of how pooling spindles with different confidence scores leads to improved
performance.
Table S1: Intra-expert pairwise F1-score agreement (event-by-event and epoch-by-epoch).
Table S2: Inter-expert pairwise F1-score agreement (event-by-event and epoch-by-epoch).
Table S3: Intra-expert pairwise κ reliability (sample-by-sample and epoch-by-epoch).
Table S4: Inter-expert pairwise κ reliability (sample-by-sample and epoch-by-epoch).
Table S5: Intra-expert pairwise spindle characteristics reliability of matched events.
Table S6: Inter-expert pairwise spindle characteristics reliability of matched events.
Table S7: Pairwise intra-expert average overlap score.
Table S8: Pairwise inter-expert average overlap score.
Figure S1: Example of how pooling spindles with different confidence scores can lead to
improved performance.
In this example scorer 1 is quite conservative and very confident in his/her detected spindles
whereas scorer 2 is more uncertain and captures more spindles that may be doubtful. The
performance (F1-score) is the amount of reliability/agreement between the two scorers overall,
and the performance therefore depends on which sets of spindles are included in the
comparison (i.e. which categories of confidence scores are allowed). Note that confidence
scores are not used directly in the performance comparison - the confidence score is only used
to determine which sets of spindle detections are being compared. Depending on which
categories to compare, spindles with those particular levels of confidence are converted to
binary (yes/no) data in order to calculate the performance. This example explains why
performance improves when spindles with varying confidence levels are pooled together.
Experts often find the same spindles but do not assign them equal confidence scores, so
allowing spindles with all levels of confidence scores increases the overall agreement between
the experts. This is partly due to the vague definitions of how to categorize the spindles and
partly because certainty about a spindle is highly individual, as is spindle scoring. In other words, the
agreement is better between scorers when you allow the scorers some flexibility in the scoring
system. They can indicate that some spindles are not perfect, but are still spindles. When the
confidence scores are pooled, the performance goes up because the two experts do not need to
agree on the confidence level of a spindle, only the presence of a spindle.
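The pooling effect described above can be illustrated with a minimal sketch. It assumes that spindle events have already been matched between the two scorers and share an identifier; the toy data, the identifiers and the function name `pooled_f1` are hypothetical and are not taken from the study.

```python
# Hypothetical toy data: spindle IDs (already matched between scorers)
# mapped to the confidence category each scorer assigned.
# Scorer 1 is conservative and confident; scorer 2 is more uncertain.
scorer1 = {1: "H", 2: "H", 3: "H"}
scorer2 = {1: "H", 2: "M", 3: "L", 4: "L", 5: "M"}


def pooled_f1(a, b, allowed):
    """F1-score between two scorers, keeping only the allowed confidence categories.

    Confidence is used only to select events; the comparison itself is binary.
    """
    set_a = {sid for sid, conf in a.items() if conf in allowed}
    set_b = {sid for sid, conf in b.items() if conf in allowed}
    if not set_a and not set_b:
        return 0.0  # nothing to compare; reported as 0.0 in this sketch
    true_positives = len(set_a & set_b)
    return 2 * true_positives / (len(set_a) + len(set_b))


for allowed in (("H",), ("H", "M"), ("H", "M", "L")):
    print("+".join(allowed), round(pooled_f1(scorer1, scorer2, allowed), 3))
```

In this toy case the F1-score rises from 0.5 (H only) to 0.667 (H+M) and 0.75 (H+M+L), because the scorers agree on the presence of spindles 1 to 3 but not on their confidence level.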
Table S1: Intra-expert pairwise F1-score agreement (event-by-event and epoch-by-epoch).
          Event                                                  Epoch      Number of
Experts   L          M          H          H+M        H+M+L      H+M+L      Epochs
F1_F2     30.6       54.2       66.7       74.4       81.2       87.0       250
F1_F3     7.0        39.5       58.8       59.4       64.9       84.6       250
F2_F3     15.2       45.8       57.1       65.0       70.9       83.7       250
I1_I2     26.3       35.8       73.9       72.6       72.8       87.5       56
mean±SD   19.8±10.7  43.8±8.1   64.1±7.7   67.9±6.9   72.4±6.7   85.7±1.8   202±97
The intra-expert F1-score agreement for two experts (expert F and expert I) after repeated
observations of the indicated number of epochs. Agreement is reported on an event-by-event or
epoch-by-epoch basis. Spindles are divided in groups based on their assigned confidence
scores: H (high = ‘definitely’), M (medium = ‘probably’) and L (low = ‘guessing’). See further
description in Table 1.
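The mean±SD row of Table S1 can be checked with a short sketch. It uses the four pairwise values from the L column and the epoch counts; the use of the sample standard deviation (n - 1 in the denominator) is an assumption, and the helper name `mean_sd` is hypothetical.

```python
from statistics import mean, stdev

# Values copied from Table S1: event-by-event F1 for category L, and epoch counts.
low_confidence_f1 = [30.6, 7.0, 15.2, 26.3]
epochs = [250, 250, 250, 56]


def mean_sd(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return mean(values), stdev(values)


m, s = mean_sd(low_confidence_f1)
print(f"{m:.1f}±{s:.1f}")

m_epochs, s_epochs = mean_sd(epochs)
print(f"{round(m_epochs)}±{round(s_epochs)}")
```

Under this assumption the rounded results are 19.8±10.7 and 202±97, which agree with the values reported in the table.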
Table S2: Inter-expert pairwise F1-score agreement (event-by-event and epoch-by-epoch).
          Event                                                  Epoch      Number of
Experts   L          M          H          H+M        H+M+L      H+M+L      Epochs
A_B       0.0        4.9        53.2       61.1       65.0       74.9       834
A_C       0.0        4.0        55.4       61.3       57.8       76.1       1180
A_D       0.0        2.8        39.7       54.9       59.1       71.1       726
A_E       0.0        2.3        31.4       51.2       59.4       71.8       727
A_F       15.7       16.4       42.8       58.7       62.1       73.5       971
A_G       0.0        6.8        66.9       66.9       65.4       79.3       485
A_H       0.0        5.5        63.9       58.0       54.7       67.2       550
A_I       14.9       21.0       57.0       57.6       58.9       75.2       1054
A_J       22.5       19.7       59.3       63.8       68.3       78.3       374
B_C       0.0        0.0        43.9       45.7       45.7       63.7       493
B_D       0.0        0.6        44.8       59.0       60.5       80.6       305
B_E       9.9        12.3       41.6       58.3       62.1       73.6       325
B_F       8.9        21.4       50.2       53.8       48.6       65.6       411
B_I       16.2       28.3       57.8       72.0       73.5       88.0       462
C_D       21.5       24.5       50.8       63.7       68.6       80.7       450
C_E       0.0        5.7        37.9       50.2       55.7       65.7       444
C_F       15.1       25.3       37.7       53.0       62.0       76.5       598
C_H       12.6       11.6       62.9       62.5       65.5       78.4       343
C_I       5.8        18.5       55.8       61.0       65.1       80.8       634
D_F       14.6       27.6       43.8       56.0       64.6       75.7       750
E_F       0.0        0.0        36.1       47.7       61.3       70.3       500
E_I       0.0        0.0        36.2       49.5       54.8       71.1       496
G_I       9.6        11.3       30.0       57.7       66.2       77.7       500
H_I       20.4       14.0       32.2       55.3       67.7       79.5       501
mean±SD   7.8±8.2    11.9±9.6   47.1±11.0  57.5±6.2   61.4±6.4   74.8±5.8   588±232
The inter-expert F1-score agreement for 24 expert pairs (experts A-I) for the indicated number of
epochs. Agreement is reported on an event-by-event or epoch-by-epoch basis. Spindles are
divided in groups based on their assigned confidence scores: H (high = ‘definitely’), M (medium
= ‘probably’) and L (low = ‘guessing’). See further description in Table 1.
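Epoch-by-epoch agreement can be sketched as follows: each scorer's events are reduced to the set of epochs that contain at least part of a spindle, and the F1-score is computed on those sets. This is only an illustration; the epoch length, the event times and the function names are hypothetical, and the handling of events that cross an epoch boundary is an assumption rather than the rule used in the study.

```python
import math


def spindle_epochs(events, epoch_len):
    """Return the set of epoch indices touched by any (start, end) event in seconds."""
    epochs = set()
    for start, end in events:
        first = int(start // epoch_len)
        # An event ending exactly on a boundary does not count for the next epoch.
        last = max(first, math.ceil(end / epoch_len) - 1)
        epochs.update(range(first, last + 1))
    return epochs


def f1_score(set_a, set_b):
    """F1-score between two sets of positive epochs."""
    if not set_a and not set_b:
        return 0.0  # undefined case; reported as 0.0 in this sketch
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical events (seconds) from two scorers and a hypothetical epoch length.
events_a = [(3.0, 4.0), (27.0, 28.0), (49.5, 50.5)]
events_b = [(3.2, 4.1), (60.0, 61.0)]
epoch_len = 25.0

epochs_a = spindle_epochs(events_a, epoch_len)
epochs_b = spindle_epochs(events_b, epoch_len)
print(f1_score(epochs_a, epochs_b))
```

Because an epoch only needs to contain a spindle from each scorer, not the same spindle, epoch-level agreement is less strict than event-level agreement, which is consistent with the higher epoch-by-epoch values in the table.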
Table S3: Intra-expert pairwise κ reliability (sample-by-sample and epoch-by-epoch).
          Sample                                                 Epoch      Number of
Experts   L          M          H          H+M        H+M+L      H+M+L      Epochs
F1_F2     0.28       0.49       0.62       0.70       0.74       0.81       250
F1_F3     0.05       0.34       0.54       0.55       0.58       0.76       250
F2_F3     0.13       0.38       0.53       0.59       0.63       0.74       250
I1_I2     0.21       0.30       0.69       0.67       0.67       0.56       56
mean±SD   0.17±0.10  0.38±0.08  0.60±0.07  0.63±0.07  0.66±0.07  0.72±0.11  202±97
The intra-expert κ agreement for two experts (expert F and expert I) after repeated observations
of the indicated number of epochs. Agreement is reported on a sample-by-sample or epoch-by-epoch
basis. Spindles are divided in groups based on their assigned confidence scores: H (high
= ‘definitely’), M (medium = ‘probably’) and L (low = ‘guessing’). See further description in Table 2.
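A sample-by-sample κ can be sketched as below, assuming that κ denotes Cohen's kappa and that each scorer's annotations have been converted to a binary sequence (1 = sample inside a spindle, 0 = outside). The function name and the toy sequences are hypothetical.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length binary sequences (assumed definition)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of samples labelled identically.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each scorer's marginal rate of positive samples.
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical binary masks for two scoring passes over the same ten samples.
pass_1 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
pass_2 = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(round(cohen_kappa(pass_1, pass_2), 2))
```

Here the two passes agree on 8 of 10 samples, but chance agreement is 0.58, so κ is about 0.52. Unlike the F1-score, κ corrects for agreement on the many samples that neither pass marks as a spindle.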
Table S4: Inter-expert pairwise κ reliability (sample-by-sample and epoch-by-epoch).
          Event                                                  Epoch      Number of
Experts   L          M          H          H+M        H+M+L      H+M+L      Epochs
A_B       0.00       0.01       0.29       0.41       0.47       0.48       834
A_C       0.00       0.04       0.48       0.51       0.52       0.55       1180
A_D       0.00       0.04       0.50       0.44       0.41       0.40       726
A_E       0.00       0.05       0.57       0.57       0.56       0.57       727
A_F       0.00       0.02       0.38       0.48       0.49       0.53       971
A_G       0.00       0.05       0.33       0.42       0.46       0.48       485
A_H       0.00       0.00       0.36       0.38       0.38       0.30       550
A_I       0.00       0.04       0.46       0.46       0.43       0.46       1054
A_J       0.00       0.00       0.34       0.42       0.52       0.53       374
B_C       0.19       0.21       0.47       0.58       0.62       0.63       493
B_D       0.15       0.10       0.30       0.48       0.59       0.56       305
B_E       0.06       0.06       0.31       0.51       0.56       0.50       325
B_F       0.11       0.22       0.39       0.48       0.56       0.56       411
B_I       0.11       0.20       0.35       0.47       0.53       0.41       462
C_D       0.10       0.10       0.59       0.59       0.61       0.57       450
C_E       0.04       0.16       0.53       0.56       0.58       0.61       444
C_F       0.19       0.18       0.55       0.59       0.61       0.63       598
C_H       0.00       0.00       0.35       0.45       0.48       0.43       343
C_I       0.12       0.18       0.52       0.51       0.52       0.42       634
D_F       0.13       0.14       0.40       0.52       0.53       0.50       750
E_F       0.07       0.10       0.44       0.57       0.57       0.55       500
E_I       0.10       0.18       0.51       0.56       0.54       0.66       496
G_I       0.06       0.17       0.44       0.45       0.40       0.41       500
H_I       0.00       0.00       0.40       0.47       0.47       0.48       501
mean±SD   0.06±0.07  0.09±0.08  0.43±0.09  0.50±0.06  0.52±0.07  0.51±0.09  588±232
The inter-expert κ agreement for 24 expert pairs (experts A-I) for the indicated number of
epochs. Agreement is reported on a sample-by-sample or epoch-by-epoch basis. Spindles are
divided in groups based on their assigned confidence scores: H (high = ‘definitely’), M (medium
= ‘probably’) and L (low = ‘guessing’). See further description in Table 2.
Table S5: Intra-expert pairwise spindle characteristics reliability of matched events.
                                            Number of
Experts   Duration   Amplitude  Frequency   Epochs
F1_F2     0.72       0.96       0.89        250
F1_F3     0.60       0.98       0.90        250
F2_F3     0.53       0.91       0.87        250
I1_I2     0.85       0.97       0.89        56
mean±SD   0.68±0.14  0.95±0.03  0.89±0.03   202±97
The intra-expert pairwise agreement for two experts (expert F and expert I) after repeated
observations of the indicated number of epochs. Spindle duration (seconds) is estimated
directly by the expert. Spindle maximum peak-to-peak amplitude (µV) and oscillation frequency
(Hz) are calculated from the event detected by the expert. See further description in Table 3.
Table S6: Inter-expert pairwise spindle characteristics reliability of matched events.
                                            Number of
Experts   Duration   Amplitude  Frequency   Epochs
A_B       0.37       0.92       0.89        834
A_C       0.38       0.92       0.86        1180
A_D       0.29       0.89       0.87        726
A_E       0.52       0.82       0.84        727
A_F       0.48       0.92       0.87        971
A_G       0.55       0.94       0.89        485
A_H       0.53       0.93       0.86        550
A_I       0.23       0.87       0.83        1054
A_J       0.51       0.94       0.88        374
B_C       0.47       0.92       0.90        493
B_D       0.42       0.94       0.91        305
B_E       0.39       0.94       0.86        325
B_F       0.47       0.93       0.86        411
B_I       0.54       0.92       0.90        462
C_D       0.74       0.95       0.96        450
C_E       0.36       0.91       0.86        444
C_F       0.58       0.97       0.94        598
C_H       0.23       0.87       0.89        343
C_I       0.52       0.90       0.91        634
D_F       0.47       0.88       0.91        750
E_F       0.53       0.92       0.85        500
E_I       0.01       0.82       0.77        496
G_I       0.44       0.98       0.92        500
H_I       0.16       0.94       0.88        501
mean±SD   0.43±0.16  0.91±0.04  0.88±0.04   588±232
The inter-expert κ agreement for 24 expert pairs (experts A-I) for the indicated number of
epochs. Spindle duration (seconds) is estimated directly by the expert. Spindle maximum
peak-to-peak amplitude (µV) and oscillation frequency (Hz) are calculated from the event detected by
the expert. See further description in Table 3.
Table S7: Pairwise intra-expert average overlap score.
                                 Number of
Experts   Mean       SD          Epochs
F1_F2     0.82       0.12        250
F1_F3     0.80       0.13        250
F2_F3     0.79       0.15        250
I1_I2     0.82       0.09        56
mean±SD   0.81       0.12        202±97
The average (mean and SD) intra-expert overlap score for events detected by two experts
(expert F and expert I) after repeated observations of the indicated number of epochs. Overlap
score measures the amount of overlap between matched spindle events and is defined as
intersecting duration divided by the united duration (defined in Supplementary Figure 1A). See
further description in Table 4.
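The overlap score defined above (intersecting duration divided by united duration) can be written as a short sketch. It assumes each matched event is given as a (start, end) pair in seconds; the function name and the example times are hypothetical.

```python
def overlap_score(event_a, event_b):
    """Intersecting duration divided by united duration of two (start, end) events."""
    start_a, end_a = event_a
    start_b, end_b = event_b
    # Length of the time range covered by both events (0 if they do not overlap).
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    # Length of the time range covered by at least one of the events.
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical matched pair: the second marking starts 0.2 s later.
print(round(overlap_score((10.0, 11.0), (10.2, 11.0)), 2))
```

A score of 1 means the two markings coincide exactly, while a score of 0 means they do not overlap at all; in the example the 0.2 s difference in onset gives a score of 0.8.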
Table S8: Pairwise inter-expert average overlap score.
                                 Number of
Experts   Mean       SD          Epochs
A_B       0.73       0.15        834
A_C       0.72       0.14        1180
A_D       0.70       0.15        726
A_E       0.74       0.14        727
A_F       0.75       0.13        971
A_G       0.74       0.13        485
A_H       0.73       0.14        550
A_I       0.69       0.14        1054
A_J       0.75       0.13        374
B_C       0.82       0.14        493
B_D       0.80       0.17        305
B_E       0.74       0.16        325
B_F       0.77       0.15        411
B_I       0.78       0.14        462
C_D       0.86       0.12        450
C_E       0.73       0.15        444
C_F       0.79       0.13        598
C_H       0.75       0.14        343
C_I       0.80       0.12        634
D_F       0.78       0.15        750
E_F       0.76       0.14        500
E_I       0.63       0.14        496
G_I       0.76       0.13        500
H_I       0.70       0.15        501
mean±SD   0.75       0.14        588±232
The average (mean and SD) inter-expert overlap score for events detected by each of the 24
expert pairs for the indicated number of epochs. Overlap score measures the amount
of overlap between matched spindle events and is defined as intersecting duration divided by
the united duration (defined in Supplementary Figure 1A). See further description in Table 4.