AAAI Technical Report SS-12-05
Self-Tracking and Collective Intelligence for Personal Wellness

Frequency-Based Sleep Stage Detections by Single EEG Derivation in Healthy Human Subjects

Nobuhide Hirai 1, 2, Seiji Nishino 2
1. Department of Psychiatry, Jichi Medical University School of Medicine, 3311-1 Yakushiji, Shimotsuke, Tochigi 329-0498, JAPAN
2. Sleep and Circadian Neurology Laboratory, Center for Narcolepsy, Stanford University School of Medicine, 3165 Porter Drive, RM1195, Palo Alto, CA 94304, USA
nobu@nobu.com
Abstract

A need for sleep monitoring is increasing in modern society. However, sleep stage scoring is time consuming, and large inconsistencies may exist among scorers. The settings for the recordings are also complicated and usually need to be prepared by professionals. If simple, small equipment could record human EEG and detect sleep stages, it would bring significant benefits to a large population. We thus developed a simple frequency-based sleep stage classifier using a single EEG derivation, and evaluated the performance of the classifier. It showed the potential to work as well as other known automated classifiers. The classifier is not based on specific frequency bands or EEG patterns, so it can perform comparably on poor quality signals and can easily be adapted to score other biological signals.
Introduction
In clinical settings, sleep stages are visually determined by human scorers based on polysomnograms (multiple electroencephalograms (EEGs) and other physiological signals) recorded during sleep. The scoring is time consuming, and large inconsistencies may exist among scorers.
The settings for the recording are also complicated and
usually need to be professionally prepared. The recorder
has many electrodes and is usually too large to carry
around. Moreover, people’s behaviors can be disturbed so
easily in such settings that it is very hard to obtain natural
sleep recordings. Thus, if small equipment with a small
computer could record human EEG and detect sleep stages
automatically, we could evaluate human sleep in various environments and for extended periods without placing large stress on subjects. Such a simple method could also extend EEG applications to various fields.

However, in such cases, electrode conditions can easily be disturbed by environmental influences, and signal levels may be uneven among recordings. The detected signals can sometimes be so noisy that human scorers can hardly score them. Thus, an automated scoring method that works on low quality EEG signals is much needed.

There have been an increasing number of applications that determine human sleep stages by automated computer systems. However, most applications aim to imitate human scorers' decisions, for example by EEG pattern recognition (Martin et al. 1972) or by focusing on certain EEG frequency bands (Gross et al. 2009); they can work poorly on low quality signals when specific noise patterns or frequencies are disturbed.

We developed a simple frequency-based sleep stage classifier using a single EEG derivation of polysomnograms, and evaluated its performance. Our simple classifier is not based on specific frequency bands or on any specific EEG patterns, so it can work better on low quality signals. Furthermore, our method can be applied to any other biological signal of unknown nature.
Subjects and Method
To evaluate the classifiers by single EEG derivation, we used single derivations taken from polysomnogram signals, because human scorers usually determine sleep stages from full polysomnograms. Whole-night polysomnograms with eight EEG recordings (F3, F4, C3, C4, O1, O2, A1 and A2) were recorded from ten healthy human subjects (6 males and 4 females; 20-26 years old; average 22.2 years old), and nine derivations (F3-A2, F4-A1, C3-A2, C4-A1, O1-A2, O2-A1, F3-F4, C3-C4 and O1-O2) were chosen for analysis.
The polysomnograms were rated by two human scorers for every 30-second epoch. The scorers were clinical professionals at the same institute and were regularly trained to keep inter-rater consistency. The original human scorers' ratings had six stages: stages 1-4, REM and Wake. However, we reduced them to four stages: stage L (light sleep: stages 1 and 2), stage D (deep sleep: stages 3 and 4), REM and Wake. Inter-rater agreement rates were 94.0% on average (Table 1).
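In code, this reduction is a simple lookup; a minimal sketch (the exact label strings used in the scored data are our assumptions, since the file format is not given):

```python
# Original six ratings reduced to the four stages used in this study.
# The key strings are assumptions about how the ratings are encoded.
REDUCE_STAGE = {"1": "L", "2": "L",        # light sleep
                "3": "D", "4": "D",        # deep sleep
                "REM": "REM", "Wake": "Wake"}
```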
S*   Age  Sex  BMI   TS   W   R   L   D   IRA
1    20   M    19.5  401  16  14  60  9   93.6
2    20   M    23.8  416  4   18  53  23  92.9
3    21   M    22.5  423  10  19  64  6   94.5
4    21   F    22.1  419  6   27  53  13  99.1
5    22   M    22.5  398  12  12  65  9   91.1
6    22   F    24.0  396  14  18  52  15  93.0
7    22   F    31.2  422  7   24  49  18  95.0
8    24   M    20.0  441  22  8   62  8   94.4
9    24   F    22.2  534  8   16  61  14  91.6
10   26   M    24.2  396  12  26  51  9   94.5
Avg  22.2      23.2  425  11  18  57  12  94.0

Table 1. Subject profiles and inter-rater agreement rates.
S*: Subject. BMI: Body mass index. TS: Total sleep period (minutes).
W, R, L, D: Ratio of Wake, REM, stage L and stage D (%).
IRA: Inter-rater agreement rate (%). Avg: Average.
Obtaining standardized frequency power
A Discrete Fourier Transform (DFT) was performed on two-second signals clipped every second (overlapping by one second). Before the DFT, signals were de-trended by subtracting the regression line derived by the least-squares method, and the Hamming window function was applied. For each frequency bin in 0.5-45 Hz, the average and the standard deviation of the logarithmic power were calculated over the recording span. Each DFT logarithmic power was standardized by these values (the average was subtracted and the result divided by the standard deviation) before the following analysis.
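For concreteness, this feature extraction can be sketched as follows. This is a minimal sketch, not the authors' implementation: the sampling rate `fs`, the array layout, the small guard constant, and all names are our assumptions. It also returns the retained bin frequencies, which the band-based comparison below reuses.

```python
import numpy as np

def standardized_log_power(signal, fs, f_lo=0.5, f_hi=45.0):
    """Standardized logarithmic DFT power, one feature row per second."""
    win = 2 * fs                          # two-second window
    n_win = len(signal) // fs - 1         # clipped every second (1 s overlap)
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    t = np.arange(win)
    rows = []
    for i in range(n_win):
        seg = np.asarray(signal[i * fs : i * fs + win], dtype=float)
        slope, intercept = np.polyfit(t, seg, 1)          # least-squares line
        seg = (seg - (slope * t + intercept)) * np.hamming(win)  # de-trend, window
        power = np.abs(np.fft.rfft(seg)) ** 2
        rows.append(np.log(power[band] + 1e-12))          # guarded log power
    feats = np.asarray(rows)              # shape: (seconds, frequency bins)
    # Standardize every frequency bin over the whole recording span.
    return (feats - feats.mean(axis=0)) / feats.std(axis=0), freqs[band]
```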
Training classifiers
For training the classifiers, only epochs on which both scorers agreed were used. The first and the last epochs of each stage were also excluded, because transitional epochs may contain multiple stages. We did not exclude epochs with apparent artifacts as long as at least one human scorer could rate them. For every DFT frequency, probability distribution functions of each stage were calculated on each derivation of each subject. To obtain these functions, we fitted a t-distribution. The set of distribution functions over 0.5-45 Hz was used as a classifier. Each classifier was trained on a single derivation of a single subject; thus, 90 classifiers in total (10 for each of the 9 derivations) were obtained.
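Training then reduces to fitting one t-distribution per (stage, frequency bin) pair on the standardized values. A minimal sketch under the same assumptions, using scipy's three-parameter Student's t; `stage_labels` is assumed to hold one stage per second, expanded from the agreed, non-transitional 30-second epochs:

```python
import numpy as np
from scipy import stats

def train_classifier(feats, stage_labels, stages=("Wake", "REM", "L", "D")):
    """Fit one Student's t-distribution (df, loc, scale) per stage and
    frequency bin of the standardized log-power features."""
    labels = np.asarray(stage_labels)
    return {s: [stats.t.fit(feats[labels == s][:, j])
                for j in range(feats.shape[1])]
            for s in stages}
```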
Applying classifiers
When applying this frequency-based classifier (CF), the standardized logarithmic power was obtained for every frequency bin in the same way. Subsequently, the product of the probabilities over all frequency bins was calculated using the distribution functions of each stage. The most probable stage was chosen as the estimate for each second, and the most frequent estimate within every 30-second epoch was selected and compared with the human scorers' ratings to calculate the sensitivity, specificity and total agreement rate of each classifier. Each classifier was applied to its own subject (self-evaluation) and to the other nine subjects (cross-evaluation).
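A sketch of this application step follows. Computing the product of per-bin probabilities as a sum of log densities is our addition for numerical safety; the per-second maximum and the 30-second majority vote follow the text:

```python
import numpy as np
from scipy import stats

def classify(feats, model, epoch_len=30):
    """Per-second maximum-likelihood stage, then the most frequent
    estimate within every 30-second epoch."""
    stages = list(model)
    # Product of per-bin probabilities == sum of per-bin log densities.
    logp = np.column_stack([
        np.sum([stats.t.logpdf(feats[:, j], *model[s][j])
                for j in range(feats.shape[1])], axis=0)
        for s in stages])
    per_second = [stages[k] for k in logp.argmax(axis=1)]
    return [max(set(per_second[i:i + epoch_len]),
                key=per_second[i:i + epoch_len].count)   # majority vote
            for i in range(0, len(per_second) - epoch_len + 1, epoch_len)]
```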
For comparison, we prepared a frequency-band-based classifier (CB) for each subject, in which the delta (0.5-5 Hz), theta (6.5-9 Hz), alpha (9-14 Hz) and beta (14-29 Hz) bands were subjected to the same procedure. These bands are often used for automated sleep stage detection (Gross et al. 2009) and are compatible with the frequency bands that clinical professionals focus on when examining EEG (Hirshkowitz 2011).
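The CB classifier can be obtained by first collapsing the standardized bins into these four bands and then reusing the same training and application code. How bins are aggregated within a band is not specified in the text, so the mean below is an assumption:

```python
import numpy as np

# Band edges from the text; within-band aggregation (mean) is assumed.
BANDS = {"delta": (0.5, 5.0), "theta": (6.5, 9.0),
         "alpha": (9.0, 14.0), "beta": (14.0, 29.0)}

def band_features(feats, bin_freqs):
    """Collapse per-bin standardized log powers into four band features;
    `bin_freqs` holds the frequency of each column of `feats`."""
    return np.column_stack(
        [feats[:, (bin_freqs >= lo) & (bin_freqs <= hi)].mean(axis=1)
         for lo, hi in BANDS.values()])
```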
Results
The average logarithmic powers of each stage on one derivation of one subject are shown in Figure 1. The power values in this figure are not standardized as described above. After the standardization, the power distributions were relativized; the center values of the distribution functions of the same derivation are shown in Figure 2.
Figure 1. Average logarithmic powers (subject 9, F3-A2). X-axis: Hz.
H*  T*   F3  F4  C3  C4  O1  O2  F   C   O   Average
1   1    84  84  86  85  84  82  84  80  77  83
1   2    84  83  84  87  84  80  85  83  84  84
1   3    82  85  85  85  80  86  85  87  87  85
1   4    86  87  89  90  87  88  87  85  87  87
1   5    75  79  79  83  76  80  76  79  79  78
1   6    80  77  82  83  81  78  81  78  80  80
1   7    88  84  89  85  87  85  88  88  90  87
1   8    86  86  85  84  83  84  86  87  83  85
1   9    84  82  80  79  82  82  83  75  84  81
1   10   90  87  88  86  90  88  90  90  90  89
2   1    83  83  86  85  83  81  84  78  75  82
2   2    83  82  83  85  83  80  83  82  83  83
2   3    83  86  85  85  81  87  86  87  88  85
2   4    86  87  89  90  87  88  87  85  87  87
2   5    72  76  77  81  73  77  75  78  77  76
2   6    79  76  81  82  79  75  79  78  79  79
2   7    88  84  89  85  87  86  89  89  93  88
2   8    85  85  84  83  83  84  85  86  83  84
2   9    83  81  79  79  80  81  83  75  83  80
2   10   90  88  89  88  90  89  90  91  91  90
Avg      83  83  84  84  83  83  84  83  84  84

Table 2. Self-evaluated averaged total agreement rates (%).
H*: Human scorer. T*: Trainer subject for classifier.
H*  T*   F3  F4  C3  C4  O1  O2  F   C   O   Average
1   1    70  58  67  59  66  49  71  66  61  63
1   2    77  77  75  77  74  73  78  73  76  75
1   3    55  63  61  66  60  66  62  69  66  63
1   4    79  78  79  79  80  79  78  73  80  78
1   5    69  69  63  68  67  70  70  66  71  68
1   6    75  76  76  76  75  74  75  68  76  75
1   7    73  71  72  68  72  70  74  72  75  72
1   8    60  59  59  59  58  57  62  64  61  60
1   9    72  73  72  75  71  71  76  72  73  73
1   10   74  71  75  72  75  73  72  71  73  73
2   1    69  58  67  58  65  49  71  65  61  63
2   2    78  77  75  77  74  73  78  73  77  76
2   3    54  62  61  65  59  65  61  68  65  62
2   4    79  79  79  79  80  79  78  73  81  79
2   5    69  69  63  67  67  70  70  66  71  68
2   6    75  77  76  76  75  74  75  68  76  75
2   7    73  72  72  68  72  70  75  72  76  72
2   8    60  58  58  59  58  56  62  64  61  60
2   9    72  73  72  74  71  71  76  72  73  73
2   10   74  70  75  72  74  72  71  71  73  72
Avg      70  70  70  70  70  68  72  69  71  70

Table 3. Cross-evaluated averaged total agreement rates (%).
H*: Human scorer. T*: Trainer subject for classifier.
Figure 2. Average of distributions (subject 9, F3-A2). X-axis: Hz.
Self- and cross-evaluated averaged total agreement rates of the CF classifiers are shown in Tables 2 and 3, respectively. Figure 3 shows scatter plots of these self- and cross-evaluated values together with the results of the CB classifiers.
H*  Human     Classifier CF (self-evaluation)
              Wake    REM     Stage L  Stage D
1   Wake      8702    260     209      0
    REM       1310    12002   1340     0
    Stage L   3931    2967    37628    1059
    Stage D   219     3       1445     8161
2   Wake      8406    167     184      0
    REM       1291    11950   1276     0
    Stage L   4268    3114    37451    959
    Stage D   188     1       1711     8261

Table 5. Humans' and classifiers' ratings (epochs).
H*: Human scorer.
Figure 3. Cross- and self-evaluation of classifiers.
X-axis: Total agreement rate (%) of self-evaluation.
Y-axis: Total agreement rate (%) of cross-evaluation.
There appears to be a positive correlation between the self- and cross-evaluated values in Figure 3. Total agreement rates of the self- and cross-evaluations of CF and CB are summarized in Table 4; the total agreement rate of CF is higher than that of CB in minimum, maximum and average.

      Self-evaluation        Cross-evaluation
      Min    Max    Avg      Min    Max    Avg
CF    72%    93%    84%      49%    81%    70%
CB    49%    89%    78%      42%    78%    65%

Table 4. Total agreement rates of classifiers.
Min: Minimum. Max: Maximum. Avg: Average.
H*  Human     Classifier CF (cross-evaluation)
              Wake    REM     Stage L  Stage D
1   Wake      71697   5785    5051     6
    REM       29966   63771   38126    5
    Stage L   45535   26106   329185   9439
    Stage D   4464    238     48886    34864
2   Wake      69808   4577    4423     5
    REM       29942   63190   37517    4
    Stage L   47753   27907   328758   7710
    Stage D   4078    226     50550    36595

Table 6. Humans' and classifiers' ratings (epochs).
H*: Human scorer.
Averaged sensitivities and specificities of the two types of classifiers are shown in Table 7. Although the specificities of stage L and stage D exceeded 80%, and the sensitivities of stage L and Wake exceeded or were close to 80%, the sensitivities of REM and stage D and the specificities of Wake and REM remained in a lower range.
                    Self-evaluation      Cross-evaluation
Classifier          W    R    L    D     W    R    L    D
CF   Sensitivity    95   81   82   82    88   49   80   41
     Specificity    60   78   92   90    53   69   79   85
CB   Sensitivity    91   79   74   86    76   62   67   48
     Specificity    56   68   93   88    46   62   81   84

Table 7. Average sensitivities and specificities of classifiers (%).
W, R, L and D: Wake, REM, stage L and stage D.
Tables 5 and 6 are the matrices of the humans' and classifiers' ratings for self- and cross-evaluation, respectively. In these matrices, agreed ratings lie in the diagonal cells. Disagreement occurred prominently in the adjacent cells (Wake-REM, REM-stage L and stage L-stage D). However, the column "Wake" (classifiers) also has high numbers in the rows "Stage L" and "Stage D" (human scorers) that cannot be dismissed.
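For reference, the agreement metrics used above can be computed directly from such a matrix. The paper does not define the formulas explicitly; the sketch below assumes sensitivity is the row-wise diagonal fraction and specificity the column-wise one, a reading that appears consistent with the reported values:

```python
import numpy as np

def agreement_metrics(confusion):
    """confusion[i][j]: epochs rated stage i by the human scorer and
    stage j by the classifier (stage order: Wake, REM, L, D)."""
    c = np.asarray(confusion, dtype=float)
    sensitivity = np.diag(c) / c.sum(axis=1)   # per human-rated stage (assumed)
    specificity = np.diag(c) / c.sum(axis=0)   # per classifier-rated stage (assumed)
    total_agreement = np.trace(c) / c.sum()
    return sensitivity, specificity, total_agreement

# Scorer 1's self-evaluation matrix from Table 5:
m = [[8702, 260, 209, 0],
     [1310, 12002, 1340, 0],
     [3931, 2967, 37628, 1059],
     [219, 3, 1445, 8161]]
sens, spec, total = agreement_metrics(m)   # sens ~ (0.95, 0.82, 0.83, 0.83), total ~ 0.84
```

Note that these pooled values land close to, though not exactly at, the averages in Table 7, which are averaged over the individual classifiers.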
Discussion
The performance of our classifiers could be expected to be relatively low because they are based on only a single EEG derivation of a single subject. Indeed, their total agreement rate is, on average, not excellent. However, some classifiers exceeded 80% in total agreement rate on cross-evaluation, which is comparable with other known classifiers (Shambroom, Fabregas, and Johnstone 2011; Berthomier et al. 2007; Flexer, Gruber, and Dorffner 2005). This suggests that the classifier could be reliable if trained with proper data. The performance on cross-evaluation correlated positively with that on self-evaluation; thus, if a classifier performs well on its trainer subject, it can be expected to achieve high performance on other subjects.
The total agreement rate of the CF classifier was higher than that of the CB classifier. This may suggest that summing individual frequency bins into a band loses some important information, or that higher frequencies (the gamma band) contribute to better detection. Even though the sensitivities for REM and stage D were higher with CB, the specificities were lower, which suggests that CB included too many incorrect epochs in these stages and thereby lowered the total agreement rates.
Disagreement in the CF classifier is largely due to over-inclusion of epochs in Wake, which caused the low specificity of Wake and the low sensitivity of stage D. This error could stem from training data that contained a relatively small number of Wake epochs. If the classifier were trained with data containing more Wake epochs, it might show better performance.
Although the performance of the classifier differed among trainer subjects, the differences among derivations were not prominent. This may suggest that our classifier can perform well even when the electrode position is inaccurate. Our standardization method may help compensate for uneven electrode conditions. This feature can be helpful when analyzing low quality signals.
This simple classifier based on statistical data does not simulate the complex decisions of human scorers, but it can eliminate arbitrary errors. In particular, it could be expected to work on low quality signals containing artifacts, since the method does not depend on specific frequencies or waveforms. It could also be applied to other biological signals whose frequency features differ among stages.
Acknowledgements
We would like to thank Shintaro Chiba, M.D., Ph.D., and Ms. Tomoko Yagi for providing the scored polysomnograms.

References
Berthomier, C., Drouot, X., Herman-Stoica, M., Berthomier, P., Prado, J., Bokar-Thire, D., Benoit, O., Mattout, J., and d'Ortho, M. P. 2007. Automatic Analysis of Single-Channel Sleep EEG: Validation in Healthy Individuals. Sleep 30(11): 1587-1595.
Flexer, A., Gruber, G., and Dorffner, G. 2005. A Reliable Probabilistic Sleep Stager Based on a Single EEG Signal. Artificial Intelligence in Medicine 33(3): 199-207.
Gross, B. A., Walsh, C. M., Turakhia, A. A., Booth, V., Mashour, G. A., and Poe, G. R. 2009. Open-Source Logic-Based Automated Sleep Scoring Software Using Electrophysiological Recordings in Rats. Journal of Neuroscience Methods 184(1): 10-18.
Hirshkowitz, M. 2011. Monitoring and Staging Human Sleep. In Kryger, M. H., Roth, T., and Dement, W. C., eds., Principles and Practice of Sleep Medicine. St. Louis: Elsevier. 1602-1609.
Martin, W. B., Johnson, L. C., Viglione, S. S., Naitoh, P., Joseph, R. D., and Moses, J. D. 1972. Pattern Recognition of EEG-EOG as a Technique for All-Night Sleep Stage Scoring. Electroencephalography and Clinical Neurophysiology 32(4): 417-427.
Shambroom, J. R., Fabregas, S. E., and Johnstone, J. 2011. Validation of an Automated Wireless System to Monitor Sleep in Healthy Adults. Journal of Sleep Research.