AAAI Technical Report SS-12-05
Self-Tracking and Collective Intelligence for Personal Wellness

Frequency-Based Sleep Stage Detections by Single EEG Derivation in Healthy Human Subjects

Nobuhide Hirai 1, 2 and Seiji Nishino 2

1. Department of Psychiatry, Jichi Medical University School of Medicine, 3311-1 Yakushiji, Shimotsuke, Tochigi 329-0498, JAPAN
2. Sleep and Circadian Neurology Laboratory, Center for Narcolepsy, Stanford University School of Medicine, 3165 Porter Drive, RM1195, Palo Alto, CA 94304, USA
nobu@nobu.com

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

The need for sleep monitoring is increasing in modern society. However, sleep stage scoring is time consuming, and large inconsistencies may exist among scorers. The settings for the recordings are also complicated and usually need to be professionally prepared. If simple, small equipment could record human EEG and detect sleep stages, it would bring significant benefits to a large population. We therefore developed a simple frequency-based sleep stage classifier using a single EEG derivation and evaluated its performance. It showed the potential to work as well as other known automated classifiers. The classifier is not based on specific frequency bands or EEG patterns; it can perform comparably on poor quality signals and could easily be adapted to score other biological signals.

Introduction

In clinical settings, sleep stages are visually determined by human scorers based on polysomnograms (multiple electroencephalograms (EEGs) and other physiological signals) recorded during sleep. The scoring is time consuming, and large inconsistencies may exist among scorers. The settings for the recording are also complicated and usually need to be professionally prepared. The recorder has many electrodes and is usually too large to carry around. Moreover, people's behaviors are so easily disturbed in such settings that it is very hard to obtain natural sleep recordings. Thus, if small equipment with a small computer could record human EEG and detect sleep stages automatically, we could evaluate human sleep in various environments and for extended periods without placing large stress on subjects. Such a simple method could also extend EEG applications to various fields. In such cases, however, the electrode condition can easily be disturbed by environmental influences, and signal levels may be uneven among recordings. Sometimes the detected signals are so noisy that human scorers can hardly score them. Thus, an automated scoring method that works on low quality EEG signals is much needed.

There have been an increasing number of applications that determine human sleep stages by automated computer systems. However, most applications aim to imitate human scorers' decisions, for example by EEG pattern recognition (Martin et al. 1972) or by focusing on certain EEG frequency bands (Gross et al. 2009); they may perform poorly on low quality signals when a specific noise pattern is present or a specific frequency band is disturbed. We developed a simple frequency-based sleep stage classifier using a single EEG derivation of polysomnograms and evaluated its performance. Our simple classifier is not based on specific frequency bands or any specific EEG patterns, so it can work better on low quality signals. Furthermore, our method can be applied to other biological signals of unknown nature.

Subjects and Method

To evaluate classifiers based on a single EEG derivation, we used single derivations derived from polysomnogram signals, because human scorers usually determine sleep stages from polysomnograms.
Polysomnograms with eight EEG recordings (F3, F4, C3, C4, O1, O2, A1 and A2) spanning a whole night were recorded from ten healthy human subjects (6 males and 4 females; 20-26 years old; average 22.2 years old), and nine derivations (F3-A2, F4-A1, C3-A2, C4-A1, O1-A2, O2-A1, F3-F4, C3-C4 and O1-O2) were chosen for analysis. The polysomnograms were rated by two human scorers for every 30-second epoch. The scorers were clinical professionals at the same institute and regularly trained to maintain inter-rater consistency. The original human scorers' ratings had six stages: stages 1-4, REM and Wake. However, we reduced them to four stages: stage L (light sleep: stages 1 and 2), stage D (deep sleep: stages 3 and 4), REM and Wake. Inter-rater agreement rates were 94.0% on average (Table 1).

S*    Age   Sex   BMI    TS    W    R    L    D    IRA
1     20    M     19.5   401   16   14   60    9   93.6
2     20    M     23.8   416    4   18   53   23   92.9
3     21    M     22.5   423   10   19   64    6   94.5
4     21    F     22.1   419    6   27   53   13   99.1
5     22    M     22.5   398   12   12   65    9   91.1
6     22    F     24.0   396   14   18   52   15   93.0
7     22    F     31.2   422    7   24   49   18   95.0
8     24    M     20.0   441   22    8   62    8   94.4
9     24    F     22.2   534    8   16   61   14   91.6
10    26    M     24.2   396   12   26   51    9   94.5
Avg   22.2        23.2   425   11   18   57   12   94.0

Table 1. Subject profiles and inter-rater agreement rates. S*: subject. BMI: body mass index. TS: total sleep period (minutes). W, R, L, D: percentage of Wake, REM, stage L and stage D (%). IRA: inter-rater agreement rate (%). Avg: average.

Obtaining standardized frequency power

A Discrete Fourier Transform (DFT) was performed on two-second signals clipped every second (overlapping by one second). Before the DFT, signals were de-trended by subtracting the regression line derived by the least-squares method, and a Hamming window function was applied. For each frequency bin in 0.5-45 Hz, the average and the standard deviation of the logarithmic power were calculated over the recording span. Each DFT logarithmic power was standardized by these values (the average was subtracted and the result divided by the standard deviation) before the following analyses.
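For readers who wish to reproduce this step, the following is a minimal sketch, assuming the recording is available as a one-dimensional NumPy array. The sampling rate (100 Hz here) and the function name are illustrative assumptions; the paper does not report the sampling rate. Note that a 2-second window gives a 0.5 Hz bin resolution, so the 0.5-45 Hz range comprises 90 bins.

```python
import numpy as np

def standardized_log_power(signal, fs=100):
    """Per-bin standardized log DFT power, as described above.

    `signal`: one-dimensional EEG trace; `fs`: sampling rate in Hz
    (an assumed value). Returns an array of shape (n_windows, n_bins)
    over the 0.5-45 Hz bins, one row per 2-second window clipped
    every second (1-second overlap).
    """
    win = 2 * fs                  # 2-second analysis window
    step = fs                     # clipped every second
    hamming = np.hamming(win)
    t = np.arange(win)

    rows = []
    for start in range(0, len(signal) - win + 1, step):
        seg = signal[start:start + win].astype(float)
        # De-trend: subtract the least-squares regression line.
        slope, intercept = np.polyfit(t, seg, 1)
        seg -= slope * t + intercept
        # Hamming window, then DFT power spectrum.
        spec = np.abs(np.fft.rfft(seg * hamming)) ** 2
        rows.append(np.log(spec + 1e-12))  # log power; epsilon avoids log(0)

    logp = np.array(rows)
    # Keep the 0.5-45 Hz bins (0.5 Hz resolution for a 2 s window).
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    band = (freqs >= 0.5) & (freqs <= 45.0)
    logp = logp[:, band]
    # Standardize each bin over the whole recording span.
    return (logp - logp.mean(axis=0)) / logp.std(axis=0)
```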
Training classifiers

For training the classifiers, only epochs on which the two scorers agreed were used. The first and the last epochs of each stage were also excluded, because transitional epochs may contain multiple stages. We did not exclude epochs with apparent artifacts as long as at least one human scorer could rate them. For every DFT frequency bin, probability distribution functions of each stage were calculated on each derivation of each subject. To obtain these functions, we estimated t-distributions. The set of distribution functions over 0.5-45 Hz was used as a classifier. Each classifier was trained on a single derivation of a single subject; thus, a total of 90 classifiers (10 for each of the 9 derivations) were obtained.

Applying classifiers

When applying this frequency-based classifier (CF), the standardized logarithmic power was obtained for every frequency bin in the same way. Subsequently, the product of probabilities over all frequency bins was calculated using the distribution functions of each stage. The most probable stage was chosen as the estimate for each second, and the most frequent estimate within every 30-second epoch was selected and compared with the human scorers' ratings to calculate the sensitivity, specificity and total agreement rate of each classifier. Each classifier was applied to its own subject (self-evaluation) and to the other nine subjects (cross-evaluation).

For comparison, we prepared a frequency-band-based classifier (CB) for each subject, in which the delta (0.5-5 Hz), theta (6.5-9 Hz), alpha (9-14 Hz) and beta (14-29 Hz) bands were treated in the same way. These bands are often used for automated sleep stage detection (Gross et al. 2009) and are compatible with the frequency bands on which clinical professionals focus when examining EEG (Hirshkowitz 2011).
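The training and application steps above amount to a naive-Bayes-style model with one t-distribution per frequency bin and stage. The sketch below illustrates this under stated assumptions: `features` is the (seconds x bins) output of the standardization step, `labels` holds per-second stage names drawn from the agreed, non-transitional training epochs, and the product of per-bin probabilities is computed as a sum of log densities for numerical stability (the paper itself simply says "product of probabilities"). All function and variable names are illustrative.

```python
import numpy as np
from scipy import stats

STAGES = ["Wake", "REM", "L", "D"]

def train_classifier(features, labels):
    """Fit one t-distribution per frequency bin per stage.

    `features`: (n_seconds, n_bins) standardized log powers;
    `labels`: per-second stage names from agreed epochs only.
    """
    model = {}
    for stage in STAGES:
        x = features[np.asarray(labels) == stage]
        # (df, loc, scale) per bin, estimated by maximum likelihood.
        model[stage] = [stats.t.fit(x[:, j]) for j in range(x.shape[1])]
    return model

def classify(features, model, epoch_len=30):
    """Per-second argmax of the product of per-bin probabilities
    (computed as a sum of log densities), then a majority vote
    within each 30-second epoch."""
    n = features.shape[0]
    loglik = np.zeros((n, len(STAGES)))
    for s, stage in enumerate(STAGES):
        for j, (df, loc, scale) in enumerate(model[stage]):
            loglik[:, s] += stats.t.logpdf(features[:, j], df, loc, scale)
    per_second = loglik.argmax(axis=1)

    epochs = []
    for start in range(0, n, epoch_len):
        votes = np.bincount(per_second[start:start + epoch_len],
                            minlength=len(STAGES))
        epochs.append(STAGES[votes.argmax()])
    return epochs
```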
Results

The average logarithmic powers of each stage on one derivation of one subject are shown in Figure 1. The power values in this figure are not standardized as described above. After standardization, the power distributions are relative rather than absolute; the center values of the distribution functions for the same derivation are shown in Figure 2.

Figure 1. Average logarithmic powers (subject 9, F3-A2). (Horizontal axis: frequency, Hz.)

Figure 2. Average of distributions (subject 9, F3-A2). (Horizontal axis: frequency, Hz.)

Self- and cross-evaluated averaged total agreement rates of the CF classifiers are shown in Tables 2 and 3, respectively. Figure 3 shows scatter plots of these self- and cross-evaluated values together with the results of the CB classifiers.

H*   T*    F3   F4   C3   C4   O1   O2   F    C    O    Average
1     1    84   84   86   85   84   82   84   80   77   83
      2    84   83   84   87   84   80   85   83   84   84
      3    82   85   85   85   80   86   85   87   87   85
      4    86   87   89   90   87   88   87   85   87   87
      5    75   79   79   83   76   80   76   79   79   78
      6    80   77   82   83   81   78   81   78   80   80
      7    88   84   89   85   87   85   88   88   90   87
      8    86   86   85   84   83   84   86   87   83   85
      9    84   82   80   79   82   82   83   75   84   81
     10    90   87   88   86   90   88   90   90   90   89
2     1    83   83   86   85   83   81   84   78   75   82
      2    83   82   83   85   83   80   83   82   83   83
      3    83   86   85   85   81   87   86   87   88   85
      4    86   87   89   90   87   88   87   85   87   87
      5    72   76   77   81   73   77   75   78   77   76
      6    79   76   81   82   79   75   79   78   79   79
      7    88   84   89   85   87   86   89   89   93   88
      8    85   85   84   83   83   84   85   86   83   84
      9    83   81   79   79   80   81   83   75   83   80
     10    90   88   89   88   90   89   90   91   91   90
Average    83   83   84   84   83   83   84   83   84   84

Table 2. Self-evaluated averaged total agreement rates (%). H*: human scorer. T*: trainer subject for the classifier.

H*   T*    F3   F4   C3   C4   O1   O2   F    C    O    Average
1     1    70   58   67   59   66   49   71   66   61   63
      2    77   77   75   77   74   73   78   73   76   75
      3    55   63   61   66   60   66   62   69   66   63
      4    79   78   79   79   80   79   78   73   80   78
      5    69   69   63   68   67   70   70   66   71   68
      6    75   76   76   76   75   74   75   68   76   75
      7    73   71   72   68   72   70   74   72   75   72
      8    60   59   59   59   58   57   62   64   61   60
      9    72   73   72   75   71   71   76   72   73   73
     10    74   71   75   72   75   73   72   71   73   73
2     1    69   58   67   58   65   49   71   65   61   63
      2    78   77   75   77   74   73   78   73   77   76
      3    54   62   61   65   59   65   61   68   65   62
      4    79   79   79   79   80   79   78   73   81   79
      5    69   69   63   67   67   70   70   66   71   68
      6    75   77   76   76   75   74   75   68   76   75
      7    73   72   72   68   72   70   75   72   76   72
      8    60   58   58   59   58   56   62   64   61   60
      9    72   73   72   74   71   71   76   72   73   73
     10    74   70   75   72   74   72   71   71   73   72
Average    70   70   70   70   70   68   72   69   71   70

Table 3. Cross-evaluated averaged total agreement rates (%). H*: human scorer. T*: trainer subject for the classifier.

Figure 3. Cross- and self-evaluation of classifiers. X-axis: total agreement rate (%) of self-evaluation. Y-axis: total agreement rate (%) of cross-evaluation.

There appears to be a positive correlation between the self- and cross-evaluated values in Figure 3. Total agreement rates of the self- and cross-evaluations of CF and CB are summarized in Table 4. The total agreement rate of CF is higher than that of CB in minimum, maximum and average.

            Self-evaluation       Cross-evaluation
            Min    Max    Avg     Min    Max    Avg
CF          72%    93%    84%     49%    89%    70%
CB          49%    81%    78%     42%    78%    65%

Table 4. Total agreement rates of the classifiers. Min: minimum. Max: maximum. Avg: average.

Tables 5 and 6 are the matrices of the humans' and classifiers' ratings for self- and cross-evaluation, respectively. In these matrices, agreed ratings lie in the diagonal cells. Disagreement occurred prominently in the adjacent cells (Wake-REM, REM-stage L and stage L-stage D). However, the column "Wake" (classifiers) also has high numbers in the rows "Stage L" and "Stage D" (human scorers) that cannot be dismissed.

                     Classifier CF (self-evaluation)
H*   Human      Wake     REM      Stage L   Stage D
1    Wake       8702     260      209       0
     REM        1310     12002    1340      0
     Stage L    3931     2967     37628     1059
     Stage D    219      3        1445      8161
2    Wake       8406     167      184       0
     REM        1291     11950    1276      0
     Stage L    4268     3114     37451     959
     Stage D    188      1        1711      8261

Table 5. Humans' and classifiers' ratings (epochs). H*: human scorer.

                     Classifier CF (cross-evaluation)
H*   Human      Wake     REM      Stage L   Stage D
1    Wake       71697    5785     5051      6
     REM        29966    63771    38126     5
     Stage L    45535    26106    329185    9439
     Stage D    4464     238      48886     34864
2    Wake       69808    4577     4423      5
     REM        29942    63190    37517     4
     Stage L    47753    27907    328758    7710
     Stage D    4078     226      50550     36595

Table 6. Humans' and classifiers' ratings (epochs). H*: human scorer.

Averaged sensitivities and specificities of the two types of classifiers are shown in Table 7. Although the specificities of stage L and stage D exceeded 80%, and the sensitivities of stage L and Wake exceeded or were close to 80%, the sensitivities of REM and stage D and the specificities of Wake and REM remained in a lower range.

                         Self-evaluation     Cross-evaluation
Classifier               W    R    L    D    W    R    L    D
CF    Sensitivity        95   81   82   82   88   49   80   41
      Specificity        60   78   92   90   53   69   79   85
CB    Sensitivity        91   79   74   86   76   62   67   48
      Specificity        56   68   93   88   46   62   81   84

Table 7. Average sensitivities and specificities of the classifiers (%). W, R, L and D: Wake, REM, stage L and stage D.
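For concreteness, the rates in Tables 4 and 7 can be approximately reproduced from these matrices. The sketch below takes the H* = 1 block of Table 5 and computes per-stage sensitivity, per-stage predictive value, and the total agreement rate. The paper does not spell out its definition of "specificity"; the Table 7 values (e.g. 60 for Wake, self-evaluation) track the predictive value computed here (61), so that is what the code reports, as an assumption. Small differences from Table 7 are expected, since Table 7 averages over all classifiers.

```python
import numpy as np

# Human-scorer rows x classifier columns, order Wake, REM, L, D,
# taken from Table 5 (self-evaluation, human scorer 1).
conf = np.array([
    [8702,   260,   209,    0],   # human Wake
    [1310, 12002,  1340,    0],   # human REM
    [3931,  2967, 37628, 1059],   # human Stage L
    [ 219,     3,  1445, 8161],   # human Stage D
])

diag = np.diag(conf)
sensitivity = diag / conf.sum(axis=1)   # per-stage recall
precision   = diag / conf.sum(axis=0)   # per-stage predictive value
agreement   = diag.sum() / conf.sum()   # total agreement rate

print(np.round(100 * sensitivity))  # -> [95. 82. 83. 83.]
print(np.round(100 * precision))    # -> [61. 79. 93. 89.]
print(round(100 * agreement))       # -> 84
```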
Although the performance of the classifier differed among trainer subjects, the differences among derivations were not prominent. This may suggest that our classifier could perform well even when the electrode position is inaccurate. Our standardization method could contribute to compensate uneven electrode conditions. This feature can be helpful when analyzing low quality signals. This simple classifier based on statistical data does not simulate complex decisions by human scorers, but it can eliminate arbitrary errors. Especially, it could be expected to work on low-quality signals containing artifacts since this method does not depend on specific frequencies or waveforms. It could also be applied to other biological signals where frequency features differ among stages. Berthomier, C., Drouot, X., Herman-Stoica, M., Berthomier, P., Prado, J., Bokar-Thire, D., Benoit, O., Mattout, J., and d'Ortho, M. P. 2007. Automatic Analysis of Single-Channel Sleep Eeg: Validation in Healthy Individuals. Sleep 30(11): 1587-95. Flexer, A., Gruber, G., and Dorffner, G. 2005. A Reliable Probabilistic Sleep Stager Based on a Single Eeg Signal. Artificial Intelligence in Medicine 33(3): 199-207. Gross, B. A., Walsh, C. M., Turakhia, A. A., Booth, V., Mashour, G. A., and Poe, G. R. 2009. Open-source logic-based automated sleep scoring software using electrophysiological recordings in rats. Journal of Neuroscience Methods 184(1): 10-8. Hirshkowitz, M. Monitoring and Staging Human Sleep. In: Kryger, M. H., Roth, T., and Dement, W. C. eds. Principles and Practice of Sleep Medicine. 2011. St. Louis, Elsevier: 1602-1609. Martin, W. B., Johnson, L. C., Viglione, S. S., Naitoh, P., Joseph, R. D., and Moses, J. D. 1972. Pattern recognition of EEG-EOG as a technique for all-night sleep stage scoring. Electroencephalography and Clinical Neurophysiology 32(4): 417-27. Shambroom, J. R., Fabregas, S. E., and Johnstone, J. 2011. Validation of an Automated Wireless System to Monitor Sleep in Healthy Adults. Journal of Sleep Research. Acknowledgements We would like to thank Shintaro Chiba, M.D., Ph.D. and Ms. Tomoko Yagi for providing the scored polysomnograms. 23